Build Data Engineering Projects, with Free Template
- 1. Introduction
- 2. Data project template
- 3. Set up data infrastructure
- 4. Set up development workflow
- 5. Putting it all together with a Makefile
- 6. Data projects using other tools and services
- 7. Future work
- 8. Conclusion
- 9. Further reading
1. Introduction
Setting up data infrastructure is one of the most complex parts of starting a data engineering project. If you are overwhelmed by

- Setting up data infrastructure such as Airflow, Redshift, Snowflake, etc.
- Trying to set up your infrastructure with code
- Not knowing how to deploy new features/columns to an existing data pipeline
- DevOps practices such as CI/CD for data pipelines
Then this post is for you. This post will cover the critical concepts of setting up data infrastructure and a development workflow, along with a few sample data projects that follow this pattern. We will also use a data project template that runs Airflow, Postgres, & Metabase to demonstrate how each concept works.
By the end of this post, you will understand how to set up data infrastructure with code and how developers work together on new features for a data pipeline, & you will have a GitHub template that you can use for your data projects.
2. Data project template
2.1. Prerequisites
To use the template, please install the following.
- git
- GitHub account
- Terraform
- AWS account
- AWS CLI installed and configured
- Docker with at least 4GB of RAM and Docker Compose v1.27.0 or later
2.2. Setup infra
You can create your GitHub repository based on this template by clicking the Use this template button in the data_engineering_project_template repository. Clone your repository and replace content in the following files:

- `CODEOWNERS`: Change the user id from `@josephmachado` to your GitHub user id.
- `cd.yml`: Change the `data_engineering_project_template` part of the `TARGET` parameter to your repository name.
- `variable.tf`: Change the default values of the `alert_email_id` and `repo_url` variables to your email and GitHub repository URL, respectively.
Run the following commands (via the terminal) in your project directory. If you are using Windows, please use WSL to set up Ubuntu and run the following commands via that terminal.
# Local run & test
make up # start the docker containers on your computer & run migrations under ./migrations
make ci # Runs auto-formatting, lint checks, & all the test files under ./tests
# Create AWS services with Terraform
make tf-init # Only needed on your first terraform run (or if you add new providers)
make infra-up # type in yes after verifying the changes TF will make
# Wait until the EC2 instance initialization is complete; you can check this via your AWS UI
# See "Status Check" on the EC2 console; it should be "2/2 checks passed" before proceeding
# Wait another 5 mins; Airflow takes a while to start up
make cloud-airflow # this command will forward the Airflow port from EC2 to your machine and open it in the browser
# the username and password are both airflow
make cloud-metabase # this command will forward the Metabase port from EC2 to your machine and open it in the browser
# use the env file at https://github.com/josephmachado/data_engineering_project_template/blob/main/env to connect to the warehouse from Metabase
[Figure: Data infrastructure]
[Figure: Project structure]
Create database migrations as shown below.
make db-migration # enter a description, e.g., create some schema
# make your changes to the newly created file under ./migrations
make warehouse-migration # to run the new migration on your warehouse
2.3. Tear down infra
Make sure to destroy your cloud infrastructure when not in use.
make down # Stop docker containers on your computer
make infra-down # type in yes after verifying the changes TF will make
3. Set up data infrastructure
3.1. Run data infra on your laptop with containers
We will run the required data components on our local machine using Docker containers. Within the `containers` folder, we define the individual data components, each with its own `Dockerfile` and Python requirements file as needed.
We manage multiple containers using `docker-compose`. We also use `env` files to store credentials, which the docker-compose command can read with the `--env-file` flag.
Check out this part of the Makefile to see how we use docker-compose.
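For reference, here is a minimal sketch of how docker-compose can be invoked with an env file; the file and service names below are illustrative rather than the Makefile's exact commands.

```shell
# Build & start every container defined in docker-compose.yml,
# injecting credentials from the env file.
docker-compose --env-file env up --build -d

# Check container status, and tail a service's logs while debugging.
docker-compose --env-file env ps
docker-compose --env-file env logs -f webserver

# Stop and remove the containers when you are done.
docker-compose --env-file env down
```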
3.2. Manage cloud infrastructure with code
While we can create & manage cloud services with custom code, the UI, etc., having the infrastructure defined as code is beneficial. Infrastructure as code (IaC) allows you to store the configuration in version control and to easily spin up or bring down services, and it lets us recover quickly from accidental deletions. We will use AWS for our projects.
We will use Terraform as our IaC framework (others include Pulumi, etc.). Our `terraform` folder has 3 files:

- `main.tf`: Defines all the services we need. In our `main.tf`, we create an EC2 instance, its security group (where we configure access), and a cost alert. Note that on the EC2 instance we run a script with `user_data`; this is a script run at EC2 initialization time (cloud-init). We can also define our services in multiple files, such as s3.tf, emr.tf, etc., if we choose to.
- `variable.tf`: Defines the variables that can be provided as inputs when spinning up the infrastructure. The `main.tf` file uses the variables defined in `variable.tf` via `var.variable_name`.
- `output.tf`: Defines all the configurations we want to print out when we use the `terraform output` command. We have defined the EC2 public DNS and the private and public keys in our `output.tf`. We will later use the private key to connect to our EC2 instance.
Check out this part of the Makefile to see the Terraform commands. We also forward the Airflow and Metabase ports from EC2 to our local machine using these commands.
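As a rough sketch (assuming you run from the project root and that the output names match output.tf), the Makefile targets wrap commands along these lines:

```shell
# Initialize providers (first run only), then create and inspect the infrastructure.
terraform -chdir=./terraform init
terraform -chdir=./terraform apply    # review the plan, then type yes
terraform -chdir=./terraform output   # print the values defined in output.tf

# Forward Airflow's port (8080) from the EC2 instance to your machine.
# The key file name is illustrative; private_key and ec2_public_dns are the
# outputs referenced later in this post.
terraform -chdir=./terraform output -raw private_key > private_key.pem
chmod 600 private_key.pem
ssh -i private_key.pem -L 8080:localhost:8080 \
    ubuntu@$(terraform -chdir=./terraform output -raw ec2_public_dns)
```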
4. Set up development workflow
We set up the development flow to make new feature releases easy and quick. We will use git for version control, GitHub for hosting our repository, GitHub flow for developing new features, and GitHub Actions for CI/CD.
4.1. CI: Automated tests & checks before the merge with GitHub Actions
Continuous integration (CI) in our repository refers to the automated testing of code before it is merged into the main branch (which runs on the production server). In our template, we have defined formatting (isort, black), type checking (mypy), lint/style checking (flake8), & Python testing (pytest) as part of our CI.
We use GitHub Actions to run these checks automatically when someone creates a pull request. The CI workflow is defined in the ci.yml file.
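To get a feel for what the workflow runs, here is a sketch of the same checks run locally; the exact flags and paths in the template's ci.yml and Makefile may differ.

```shell
# Formatting, lint, type, and test checks -- roughly what `make ci` automates.
isort .          # sort imports
black .          # auto-format code
flake8 .         # lint / style checks
mypy .           # static type checks
pytest ./tests   # run the test suite
```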
4.2. CD: Deploy to production servers with GitHub Actions
Continuous delivery (CD) in our repository means deploying our code to the production server. We use an EC2 instance running Docker containers as our production server; after merging into the main branch, our code is copied to the EC2 server using cd.yml.
Note that for our CD to work, we will first need to set up the infrastructure with Terraform & define the following repository secrets. You can set up the repository secrets by going to Settings > Secrets > Actions > New repository secret.
- `SERVER_SSH_KEY`: Get this by running `terraform -chdir=./terraform output -raw private_key` in the project directory and paste the entire content into a new Action secret called `SERVER_SSH_KEY`.
- `REMOTE_HOST`: Get this by running `terraform -chdir=./terraform output -raw ec2_public_dns` in the project directory.
- `REMOTE_USER`: The value for this is `ubuntu`.
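If you prefer the terminal over the GitHub UI, the GitHub CLI (`gh`, not part of this template's prerequisites) can set the same secrets; a sketch, assuming you are authenticated and inside your cloned repository:

```shell
# Set the three repository secrets that cd.yml expects.
gh secret set SERVER_SSH_KEY --body "$(terraform -chdir=./terraform output -raw private_key)"
gh secret set REMOTE_HOST --body "$(terraform -chdir=./terraform output -raw ec2_public_dns)"
gh secret set REMOTE_USER --body "ubuntu"
```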
4.3. Database migrations
While code changes are usually straightforward due to the absence of state, changing any object within a DB is different, since DB changes involve modifying table structures and data. Use a DB migration tool to change or add DB objects (schemas, tables, views, columns, etc.).
We will use the lightweight `yoyo-migrations` library to

- Create migrations: Creates a new file where we can define multiple steps. Each step has an "apply" part and a "rollback" part; the apply part modifies an object, while the rollback part reverts that modification. Look at this migration, where the apply part creates a schema and the rollback part drops it.
- Apply migrations: Runs the "apply" parts of the steps. Yoyo only applies the migrations that have not been run already, keeping track of them in a `yoyo_*` table.
- Rollback migrations: Runs the "rollback" parts of the steps for migrations that have already been applied, again tracked via the `yoyo_*` table.
Check out this part of the Makefile to see the DB migration commands. Note that we run the migration commands within the webserver container, since we install yoyo-migrations in that container for easier maintenance.
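Under the hood, the Makefile targets wrap yoyo's command-line interface. Here is a minimal sketch of those commands as run inside the webserver container; the connection string is illustrative, not the template's actual credentials.

```shell
# Create a new migration file under ./migrations; you then fill in its apply/rollback steps.
yoyo new ./migrations -m "create some schema"

# Run the pending "apply" steps against the warehouse; yoyo records what it has
# already run in a yoyo_* table, so re-running is safe.
yoyo apply --database postgresql://user:password@warehouse:5432/db ./migrations

# Run the "rollback" steps of previously applied migrations.
yoyo rollback --database postgresql://user:password@warehouse:5432/db ./migrations
```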
5. Putting it all together with a Makefile
We use a `Makefile` to define aliases for all the infra commands. We can also have make commands that accept user input when run (e.g., `db-migration`).
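As an illustration, a make target that accepts user input might look roughly like this; it is a sketch of the pattern (the container name and recipe follow this post's examples), not the template's actual Makefile.

```makefile
# Hypothetical sketch of a db-migration target: prompt for a description,
# then create a new migration inside the webserver container.
# Note: recipe lines must be indented with a tab.
db-migration:
	@read -p "Enter migration description: " desc; \
	docker exec webserver yoyo new ./migrations -m "$$desc"
```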
6. Data projects using other tools and services
While an Airflow + Postgres warehouse setup might be sufficient for most practice projects, here are a few projects that use different tools with managed services.
| Component | Beginner DE project | Project to impress HM | End-to-end DE project |
| --- | --- | --- | --- |
| Scheduler | Airflow | cron | Dagster |
| Executor | LocalExecutor, EMR | Python process | Dagster, Postgres |
| Orchestrator | Airflow | - | Dagster, dbt |
| Source | Postgres, CSV, S3 | API | Postgres, API, S3 |
| Destination | Redshift | Postgres warehouse | Postgres warehouse |
| Visualization/BI tool | Metabase | Metabase | Metabase |
| Data quality checks | - | - | dbt tests |
| Monitoring & Alerting | - | - | - |
All of the above projects use the same tools for the data infrastructure setup:

- Local development: Docker & Docker Compose
- DB migrations: yoyo-migrations
- IaC: Terraform
- CI/CD: GitHub Actions
- Testing: pytest
- Formatting: isort & black
- Lint checks: flake8
- Type checks: mypy
7. Future work
- Add pip-compile for package updates
- Add data quality testing framework (e.g., great-expectations)
- Monitoring (e.g., Prometheus, etc.)
- Add VPC
If you would like to add one of the above or another component, please open an issue here.
8. Conclusion
To recap, we saw how to set up data infrastructure with code, how to set up a development workflow (CI/CD & database migrations), how to tie everything together with a Makefile, and a few sample data projects that follow this pattern.
The next time you start a new data project or join an existing data team, look for these components to make developing data pipelines quick and easy.
This article helps you understand how to set up data infrastructure with code, how developers work together on new features for a data pipeline, & how to use the GitHub template for your own data projects.
If you have any questions or comments, please leave them in the comment section below.
9. Further reading
- Creating local end-to-end tests
- Data testing
- Beginner DE project
- Project to impress HM
- End-to-end DE project