Automating data testing with CI pipelines, using GitHub Actions

1. Introduction

Automated testing is crucial for catching bugs early and avoiding regressions. If you are wondering:

How can data tests be integrated into a CI (Continuous Integration) pipeline?

How does a typical CI system work?

Then this post is for you. By the end of this post, you will understand what CI is, why it is important, and how to create a CI pipeline that runs tests automatically whenever a pull request is opened.

2. CI

CI stands for continuous integration. It is a DevOps best practice of continually testing a development branch to make sure it is ready to be merged into the main branch.

When developers create a new pull request, the CI platform typically does the following:

  1. Spins up new container(s)
  2. Clones the code into the container
  3. Runs tests (including checks)
  4. If all tests pass, a check passed ✅ is displayed on your pull request.
  5. If any test fails, a check failed ❌ is displayed on your pull request. You can prevent failing branches from being merged by setting up protected branches.
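
The steps above can be sketched in a few lines of Python. This only illustrates the control flow (run each step in order, fail the check on the first non-zero exit code); it is not how any particular CI platform is implemented. The function name `run_ci_steps` and the example commands are hypothetical.

```python
import subprocess
import sys


def run_ci_steps(steps):
    """Run each (name, command) step in order and stop at the first failure.

    Returns (passed, results), where results maps every step that ran
    to its exit code.
    """
    results = {}
    for name, command in steps:
        completed = subprocess.run(command)
        results[name] = completed.returncode
        if completed.returncode != 0:
            # A non-zero exit code fails the whole check, like a failing test on CI.
            return False, results
    return True, results


if __name__ == "__main__":
    # Hypothetical steps standing in for "run tests" and "run checks".
    steps = [
        ("tests", [sys.executable, "-c", "assert 1 + 1 == 2"]),
        ("checks", [sys.executable, "-c", "print('lint ok')"]),
    ]
    passed, _ = run_ci_steps(steps)
    print("check passed" if passed else "check failed")
```

Stopping at the first failure is a design choice made here for brevity; real platforms often let you configure whether later steps still run.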

Figure: CI process

The key components of a CI pipeline are:

  1. CI platform: These platforms handle spinning up the VMs, alerting on pass/fail, etc. There are many CI/CD platforms available, e.g., Jenkins, GitHub Actions, CircleCI, AWS CodePipeline.
  2. Virtual machines: You can configure the types of virtual machines that you want to run your tests on. These VMs can run on any service that provides compute, e.g., AWS EC2 (or, for container-based runners, AWS ECS).

GitHub Actions provides standard hosted VMs that we can use without setting up any infrastructure.

Running tests as part of CI on every pull request helps catch bugs and regressions before they reach the main branch.

3. Sample project: Data testing with GitHub Actions

3.1. Prerequisites

  1. git
  2. GitHub account
  3. Docker and Docker Compose v1.27.0 or later

Set up your repository as shown below.

git clone https://github.com/josephmachado/data_test_ci.git # clone sample code
cd data_test_ci
rm -rf .git
git init
git add .
git commit -m 'Sample project for data tests on CI'
# Create a new repository on github.com
git remote add origin https://github.com/your-github-user-id/your-repo-name.git # replace your-github-user-id with your id and your-repo-name with the repo you created
git branch -M main
git push -u origin main

3.2. Project overview

Figure: CI project

This data pipeline pulls data from a table (user), enriches it, and loads it into another table (enriched_data).
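
To make "data test" concrete, here is a minimal sketch of what a test for a pipeline of this shape could look like. It is not the code from the sample repository: the `enrich` and `run_pipeline` functions, the column names, and the use of an in-memory SQLite database in place of Postgres are all assumptions made for illustration.

```python
import sqlite3


def enrich(row):
    """Hypothetical enrichment: add a full_name value to an (id, first, last) row."""
    user_id, first_name, last_name = row
    return (user_id, first_name, last_name, f"{first_name} {last_name}")


def run_pipeline(conn):
    """Read from user, enrich each row, and load the result into enriched_data."""
    rows = conn.execute('SELECT id, first_name, last_name FROM "user"').fetchall()
    conn.executemany(
        "INSERT INTO enriched_data VALUES (?, ?, ?, ?)",
        [enrich(row) for row in rows],
    )
    conn.commit()


def test_pipeline_enriches_every_user():
    # In-memory SQLite stands in for the Postgres container (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "user" (id INTEGER, first_name TEXT, last_name TEXT)')
    conn.execute(
        "CREATE TABLE enriched_data "
        "(id INTEGER, first_name TEXT, last_name TEXT, full_name TEXT)"
    )
    conn.execute('INSERT INTO "user" VALUES (?, ?, ?)', (1, "Ada", "Lovelace"))

    run_pipeline(conn)

    result = conn.execute("SELECT id, full_name FROM enriched_data").fetchall()
    assert result == [(1, "Ada Lovelace")]
```

A test runner such as pytest (assumed here; the runner used by the sample project may differ) would collect and run `test_pipeline_enriches_every_user` automatically, which is what lets a CI job execute it on every pull request.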

The Python process that enriches the data and the database are set up as Docker containers. Use the commands below to start them and run the checks.

make up # spins up the Postgres and python containers
make ci # formats the code, checks typing, checks the formatting, and runs the python test suite

The Makefile contains common commands such as formatting the code, running type & lint checks, and running our test suite.

3.3. Automating data tests with GitHub Actions

GitHub Actions looks for workflow files in the .github/workflows/ directory of the repository. Our workflow file, named ci.yml, is shown below:

name: ci
on: [pull_request]
jobs:
  run-ci-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4
      - name: Spin up containers
        run: make up
      - name: Run CI tests
        run: make ci

The on field specifies the events (pull_request in our case) that trigger this workflow. Our workflow has one job, run-ci-tests, which involves:

  1. Creating a virtual machine running Ubuntu.
  2. Checkout repo: Cloning our repository to the virtual machine. The virtual machine has Docker installed.
  3. Spin up containers: Running make up, which is the command to spin up our Postgres and Python containers.
  4. Run CI tests: Running make ci, which is the command to format the code, check typing, check the formatting, and run the Python tests.
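
As an aside on the on field: GitHub Actions accepts a single event name, a list of event names, or a mapping keyed by event name. The sketch below models that matching in plain Python to show why `on: [pull_request]` fires for pull requests but not for pushes. `triggers_on` is a hypothetical helper, not part of any GitHub tooling, and it ignores filters such as branches or paths.

```python
def triggers_on(on_field, event):
    """Return True if a workflow's `on` value names the given event."""
    if isinstance(on_field, str):
        # Form: on: pull_request
        return on_field == event
    # Forms: on: [pull_request, push]  or  on: {pull_request: {...}}
    # (membership on a dict checks its keys, i.e. the event names)
    return event in on_field


workflow_on = ["pull_request"]  # the value used in ci.yml
print(triggers_on(workflow_on, "pull_request"))  # True
print(triggers_on(workflow_on, "push"))  # False
```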

When you create a pull request, the job defined in our workflow file will run. Use the commands below to put up a PR.

git checkout -b sde-20220227-sample-ci-test-branch
echo '' >> src/data_test_ci/data_pipeline.py
git add .
git commit -m 'Fake commit to trigger CI'
git push origin sde-20220227-sample-ci-test-branch

Go to your repository on GitHub, click on Pull requests, click on Compare & pull request, and then click on the Create pull request button. This will trigger the workflow.

Clicking on the Details button in the run-ci-tests job section of the GitHub UI shows the steps that were run. The Set up job and Complete job steps are always run before and after our defined steps.

Figures: the Details link on the CI check; the CI run details with our defined steps

4. Conclusion

To recap, in this article we saw:

  1. What a CI pipeline is
  2. Why it is important
  3. How to automate data tests with GitHub Actions

I hope this article gives you a good idea of what happens in a CI pipeline, the different platforms available for setting one up, and how you can easily set one up using GitHub Actions.

The next time you build a data pipeline, automate your data tests as part of a CI pipeline. Your teammates and future self will thank you.

If you have any questions, comments, or suggestions please leave them in the comment section below.

5. Further reading

  1. Trying to figure out what data tests to create? Read this article.
  2. Struggling with setting up different components of your data pipeline? Check out this article.
  3. Trying to set up a CI/CD pipeline for dbt? Read this article.
  4. Wondering how to run unit tests on dbt? Check out this article.
