IaC (Infrastructure-as-Code) fundamentals for data engineers

Create data infrastructure with Terraform. Understand how Terraform uses configs, state, and variable files.

ARCHITECTURE
BEST PRACTICES
HANDS-ON PROJECT
FUNDAMENTALS
Published

May 18, 2026

Keywords

IaC data engineering, IaC Terraform fundamentals

Introduction

IaC can seem intimidating to people without DevOps experience. If you

  • feel overwhelmed by multiple Terraform files with hundreds of lines in each file,
  • feel stuck when you need to deploy, migrate, or reconfigure because AI wrote your Terraform files, or
  • are struggling to deploy data infrastructure,

then this post is for you.

IaC does for infrastructure what git does for code: changes are written down in files, reviewed as a diff, and only then applied.

By the end of this post, you will know what Terraform is, how it works, and how you can use it starting today.

Tip: Setup Link

Follow along with the code: setup instructions

We use Terraform to explain IaC concepts, but you can use any tool you prefer.

Infrastructure-as-Code (IaC) makes infrastructure management easy

Data pipelines involve multiple infrastructure components: S3, Spark, DBs, Airflow, etc.

While infrastructure can be set up with tools like boto3, the AWS CLI, etc., managing it that way is time-consuming and error-prone.

With IaC tools, we can manage infrastructure using config files. All we have to do is modify the config files, and IaC tools take care of changing our infrastructure accordingly.

Define infrastructure in Terraform config (.tf) files

Terraform files (.tf) are written in HCL (HashiCorp Configuration Language). The main components of a Terraform file are:

  1. Provider: The systems/vendors we want to work with (e.g., AWS, GCP, the local file system). Think of a provider as the library Terraform needs to talk to that system.
  2. Resource: The infrastructure we want, e.g., S3, EC2, EMR, etc.
  3. Data: Reads information from a provider, which we then use to set up a resource.
  4. Output: Prints information about what was created, e.g., an EC2 instance ID.
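
To make the four components concrete before moving to AWS, here is a minimal sketch that uses the hashicorp/local provider, so it can be tried without a cloud account. The file name, resource names, and output name are all illustrative and not part of the tutorial's code.

```hcl
# Sketch only: every name below is made up for illustration.
terraform {
  required_providers {
    local = {
      source = "hashicorp/local"
    }
  }
}

# 1. Provider: the plugin Terraform uses to talk to a system (here, the local disk)
provider "local" {}

# 2. Resource: something Terraform creates and manages
resource "local_file" "greeting" {
  filename = "${path.module}/greeting.txt"
  content  = "hello from terraform\n"
}

# 3. Data: reads information from the provider (here, the file created above)
data "local_file" "greeting" {
  filename = local_file.greeting.filename
}

# 4. Output: a value printed after apply
output "greeting_content" {
  value = data.local_file.greeting.content
}
```

The data block refers to the resource's filename, which is how Terraform knows to create the file before reading it.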

Let’s see how it works.

Terraform File

Let’s look at our Terraform file.

terraform/main.tf
terraform {
  required_providers { 
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.92"
    }
  }

  required_version = ">= 1.2"
}

provider "aws" {
  region = "us-east-1"
}

# ----------------------------------------
# S3 Bucket
# ----------------------------------------

resource "aws_s3_bucket" "input_bucket" {
  bucket        = "sde-iac-tutorial-bucket"
  force_destroy = true
}

# ----------------------------------------
# AMI
# ----------------------------------------

data "aws_ami" "debian" {
  most_recent = true
  owners      = ["136693071363"]

  filter {
    name   = "name"
    values = ["debian-12-amd64-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# ----------------------------------------
# IAM Role (EC2 -> S3 access)
# ----------------------------------------

resource "aws_iam_role" "ec2" {
  name = "ec2-s3-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "s3_access" {
  role = aws_iam_role.ec2.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ]
      Resource = [
        aws_s3_bucket.input_bucket.arn,
        "${aws_s3_bucket.input_bucket.arn}/*"
      ]
    }]
  })
}

resource "aws_iam_instance_profile" "ec2" {
  name = "ec2-s3-profile"
  role = aws_iam_role.ec2.name
}

# ----------------------------------------
# EC2 Instance
# ----------------------------------------

resource "aws_instance" "this" {
  ami                  = data.aws_ami.debian.id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.ec2.name

  user_data = <<-EOF
    #!/bin/bash
    apt-get update -y
    apt-get install -y python3 python3-pip git

    pip3 install boto3 --break-system-packages
  EOF

}

  1. AWS provider to work with AWS services
  2. Define which AWS region to use
  3. Create an S3 bucket
  4. Get the AMI ID of the official Debian 12 image
  5. Create an IAM instance profile -> role -> policy to connect to S3
  6. Allow GET, PUT, DELETE, and LIST permissions on the S3 bucket in the policy
  7. EC2 instance with the S3-enabled instance profile
  8. Install the libraries the pipeline script needs to write data to S3

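In the Terraform file, we

  1. Defined our provider: AWS
  2. Created resources: S3 bucket, IAM role and instance profile, EC2 instance
  3. Gave the EC2 instance read/write access to the S3 bucket through the IAM instance profile

Note

S3 bucket names are globally unique, so change the bucket name (line 21 of main.tf) before running the commands below.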

Let’s set up our infrastructure.


terraform -chdir=terraform init
terraform -chdir=terraform validate
terraform -chdir=terraform fmt
terraform -chdir=terraform plan
terraform -chdir=terraform apply

  1. init: downloads the required provider plugins
  2. validate: checks the .tf files for errors
  3. fmt: formats the .tf files
  4. plan: shows the changes Terraform would make to the infrastructure
  5. apply: shows the plan again and asks for approval before making the changes

The config files are usually stored in a terraform folder; the -chdir flag tells the Terraform CLI to run from that folder.

The Terraform CLI loads every .tf file in that working directory (subfolders are not included).

Once complete, verify that the S3 bucket and EC2 instance have been created.

# check infrastructure
aws s3 ls

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

  1. List all your S3 buckets.
  2. List your running EC2 instances, formatted as a table.
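
The component list earlier mentions outputs, but main.tf does not define any. As a sketch of another way to verify what was created, output blocks like the ones below could be added to terraform/main.tf. The output names and descriptions are illustrative; the resource references come from main.tf.

```hcl
# Sketch: optional additions to terraform/main.tf. Output names are illustrative.
output "input_bucket_name" {
  description = "Name of the S3 bucket defined in main.tf"
  value       = aws_s3_bucket.input_bucket.bucket
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance defined in main.tf"
  value       = aws_instance.this.id
}
```

After the next apply, running terraform -chdir=terraform output should print both values.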

State files have details about active infrastructure

Terraform keeps track of the state of our infrastructure with a .tfstate file.

Let’s take a look at some information in our state file.

jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate
aws_ami,debian
aws_iam_instance_profile,ec2
aws_iam_role,ec2
aws_iam_role_policy,s3_access
aws_instance,this
aws_s3_bucket,input_bucket

We can also use the Terraform CLI to check the state of our infrastructure.

terraform -chdir=terraform state list
Caution

Terraform only tracks infrastructure set up with it.

Infrastructure created by other tools will not be managed by Terraform.

Terraform uses .tf & .tfstate files to determine infrastructure changes

The plan and apply commands compare what we have specified in our .tf files with what Terraform recorded in the .tfstate file, after refreshing that record against the real infrastructure.

Based on the difference, Terraform shows us a plan of the changes it intends to make.

Let’s look at an example of changing the EC2 OS.

Terraform change diff
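
As a rough sketch of what such a change could look like in terraform/main.tf, the Debian AMI lookup is swapped for an Ubuntu one. The choice of Ubuntu 22.04 and the data source name "ubuntu" are assumptions for illustration; the exact edit used in this post may differ.

```hcl
# Hypothetical change: look up an Ubuntu AMI instead of Debian.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "this" {
  ami                  = data.aws_ami.ubuntu.id # was data.aws_ami.debian.id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.ec2.name

  # user_data is omitted in this sketch; keep it in the real file
  # and adjust it if the new OS needs different setup commands
}
```

Because an instance's AMI cannot be changed in place, Terraform has to replace the instance, which is what the plan reports.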

Now check the plan.

terraform -chdir=terraform plan 
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_instance.this must be replaced
-/+ resource "aws_instance" "this" {
      ~ ami = "ami-05b5db63304a51103" -> "ami-0866c8e84ef39b25d" # forces replacement

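In the plan, look for

  • +: create a new resource
  • ~: update in place (the existing resource is kept)
  • -/+: destroy and recreate (check for data loss and downstream cascades)
  • -: destroy

The "# forces replacement" comment marks the attribute that makes an in-place update impossible.

Terraform builds a dependency graph from the references between resources (e.g., the instance profile refers to aws_iam_role.ec2.name), so it works out the cascading changes and applies them in the right order.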

Use .tfvars to store variables

So far, we have hardcoded the bucket name and AWS region directly in main.tf.

However, as we will see in the CI/CD setup (coming soon), we need to be able to define different settings depending on the environment.

For this, we first define a variables.tf file that lists our allowed variables.

terraform/variables.tf
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "input_bucket" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

  1. aws_region: defaults to us-east-1
  2. input_bucket: has no default, so it must be supplied at run time (otherwise Terraform prompts for it or fails)
  3. instance_type: defaults to a small, low-cost instance size
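
A variable declaration can also restrict which values are accepted. The sketch below is a stricter version of the instance_type variable using a validation block; the list of allowed sizes is an assumption for illustration, not something this post's code defines.

```hcl
# Sketch: replaces the instance_type variable in variables.tf.
# The allowed sizes are illustrative.
variable "instance_type" {
  type    = string
  default = "t3.micro"

  validation {
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "The instance_type must be one of t3.micro, t3.small, or t3.medium."
  }
}
```

With this in place, a plan that passes any other instance type should fail early with the error message, before anything is created.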

Let’s move our bucket name, region, and EC2 instance type to the variable file shown below.

terraform/envs/dev.tfvars
input_bucket  = "sde-iac-tutorial-bucket"
instance_type = "t3.micro"
aws_region    = "us-east-1"

Update main.tf and replace hardcoded values with variables.

Use tfvars
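
A sketch of the parts of terraform/main.tf that change is shown below. Each hardcoded value becomes a var.<name> reference to the variables declared in variables.tf; everything not shown stays as it was.

```hcl
# Sketch: only the blocks of main.tf that reference variables.
provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "input_bucket" {
  bucket        = var.input_bucket
  force_destroy = true
}

resource "aws_instance" "this" {
  ami                  = data.aws_ami.debian.id
  instance_type        = var.instance_type
  iam_instance_profile = aws_iam_instance_profile.ec2.name

  # user_data is unchanged and omitted in this sketch; keep it in the real file
}
```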

Now we need to specify the variable file to use with Terraform.

terraform -chdir=terraform plan -var-file=envs/dev.tfvars
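
To illustrate environment-specific settings, a second variable file could sit next to dev.tfvars. The file below is hypothetical: the file name, bucket name, and instance size are placeholders, not part of this post's repository.

```hcl
# terraform/envs/prod.tfvars (hypothetical file; all values are placeholders)
input_bucket  = "sde-iac-tutorial-bucket-prod"
instance_type = "t3.small"
aws_region    = "us-east-1"
```

It would be passed the same way, with -var-file=envs/prod.tfvars. Note that a different variable file alone does not create a second copy of the infrastructure: applied against the same state, Terraform would change the existing resources to match the new values.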

Store state file on the cloud to collaborate with the team

In a real project, we store state files in a shared location that the whole team can access, so that every change is recorded in a single place.

This location is called a backend. Terraform supports several backends (e.g., S3, GCS, Azure Blob Storage), which differ in features such as state locking and encryption.

For our use case, let’s use an S3 backend.

Before we switch backends, let’s tear down the existing infrastructure that is tracked by the local state file.

terraform -chdir=terraform destroy -var-file=envs/dev.tfvars

Add backend state

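# add inside the terraform { ... } block in main.tf, right after required_providers
# (backend settings must be literal values; variables are not allowed here)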
backend "s3" {
  bucket  = "your-backend-state-bucket"
  key     = "dev/terraform.tfstate"
  region  = "us-east-1"
  encrypt = true
}
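
For orientation, here is a sketch of how the full terraform block in main.tf could look once the backend is added. The provider and version settings are copied from main.tf above, and "your-backend-state-bucket" remains a placeholder for your own bucket name.

```hcl
# Sketch: the terraform block in main.tf after adding the backend.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.92"
    }
  }

  # State is stored in S3 instead of a local terraform.tfstate file
  backend "s3" {
    bucket  = "your-backend-state-bucket" # placeholder
    key     = "dev/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }

  required_version = ">= 1.2"
}
```

The key is the path of the state file inside the bucket, so a prefix such as dev/ keeps the state files of different environments apart.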

Now you will need to create an S3 bucket for your backend. Create this with the bootstrap script.


# replace your-backend-state-bucket with your bucket name
./terraform/bootstrap.sh -b your-backend-state-bucket
terraform -chdir=terraform init -var-file=envs/dev.tfvars
terraform -chdir=terraform apply -var-file=envs/dev.tfvars

  1. Run init again so that Terraform sets up the S3 backend

Let’s list the backend bucket to check that our state file shows up.

aws s3 ls s3://your-backend-state-bucket --recursive --human-readable
# 2026-05-25 12:18:18   13.2 KiB dev/terraform.tfstate

tf files are stored within a /terraform folder

A common way to organize Terraform files is shown below.

.
├── README.md
└── terraform
    ├── bootstrap.sh
    ├── envs
    │   └── dev.tfvars
    ├── main.tf
    └── variables.tf

  1. bootstrap.sh: a script that creates the backend S3 bucket
  2. envs/dev.tfvars: environment-specific variables
  3. main.tf: the key infrastructure
  4. variables.tf: defines the variables (and their defaults) that can be set at run time

Tear down infrastructure with destroy

Do not forget to tear down your infrastructure with the destroy command.

terraform -chdir=terraform destroy -var-file=envs/dev.tfvars

Conclusion

To recap, we saw

  1. How IaC makes managing data infrastructure easy
  2. How to define infrastructure in Terraform (.tf) files
  3. How Terraform stores the current state of the infrastructure in .tfstate files
  4. How to use variables with .tfvars files
  5. How to use a cloud backend to store shared state files

The next time you are setting up infrastructure, do not blindly prompt LLMs to write your Terraform files. Only you have the necessary context to figure out what needs to be built.

Use the concepts in this post to set up your infrastructure.

Read these

  1. Run CI tests
  2. Data-ops with dbt and Snowflake