Designing a Data Project to Impress Hiring Managers

Introduction

Building a data project for your portfolio is hard. Getting hiring managers to read through your Github code is even harder. If you are building data projects and are

disappointed that no one looks at your Github projects

frustrated that recruiters don’t take you seriously as you don’t have a lot of work experience

Then this post is for you. In this post, we go over one way to design a data project that impresses a hiring manager and showcases your expertise. The main theme of this endeavor is show, don't tell, since you only get a few minutes (if not seconds) of the hiring manager's time.

Objective

When starting a project, it’s a good idea to work backward from your end goal. In our case, the main goal is to impress the hiring manager. This can be done by

  1. Linking your dashboard URL in your resume and LinkedIn.
  2. Hosting a live dashboard that is fed by near real-time data.
  3. Encouraging the hiring manager to look at your Github repository, with:
     - A concise and succinct README.md.
     - An architecture diagram.
     - Clear project organization.
     - Coding best practices: tests, linting, type checks, and formatting.

You want to showcase your expertise to the hiring manager, without expecting them to read through your codebase. In the following sections, we will build out a simple dashboard that is populated by near real-time bitcoin exchange data. You can use this as a reference to build your dashboards.

Project

For our project, we will pull bitcoin exchange data from CoinCap API. We will pull this data every 5 minutes and load it into our warehouse.
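The pull step can be sketched as follows. This is a hypothetical sketch, not code from the repo: the endpoint URL and the top-level 'data' key are assumptions based on CoinCap's public v2 API.

```python
import json
from urllib.request import urlopen

# Assumed CoinCap v2 endpoint; the actual repo may use a different URL or client
COINCAP_EXCHANGES_URL = 'https://api.coincap.io/v2/exchanges'


def parse_exchange_response(payload: str) -> list:
    # CoinCap wraps the list of exchanges in a top-level 'data' key
    return json.loads(payload)['data']


def get_exchange_data(url: str = COINCAP_EXCHANGES_URL) -> list:
    # fetch the raw JSON payload and return the list of exchange records
    with urlopen(url) as resp:
        return parse_exchange_response(resp.read().decode('utf-8'))
```

Splitting the parsing out of the network call makes the parsing logic easy to unit test without hitting the live API.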

Architecture

To follow along, you need:

  1. Docker and Docker Compose v1.27.0 or later.
  2. An AWS account.
  3. The AWS CLI installed and configured.
  4. git.

Clone the code as shown below.

git clone https://github.com/josephmachado/bitcoinMonitor.git
cd bitcoinMonitor

1. ETL Code

The code to pull data from the CoinCap API and load it into our warehouse is in exchange_data_etl.py. In this script, we

  1. Pull data from the CoinCap API using the get_exchange_data function.
  2. Use the get_utc_from_unix_time function to get a UTC-based datetime from unix time (in ms).
  3. Load the data into our warehouse using the insert query from _get_exchange_insert_query.
# `p` refers to psycopg2's extras module (assumed import: from psycopg2 import extras as p)
def run() -> None:
    data = get_exchange_data()
    for d in data:
        d['update_dt'] = get_utc_from_unix_time(d.get('updated'))
    with WarehouseConnection(**get_warehouse_creds()).managed_cursor() as curr:
        p.execute_batch(curr, _get_exchange_insert_query(), data)
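The get_utc_from_unix_time helper is not shown above; a minimal version might look like the following. This is a sketch under the assumption (stated in the steps above) that the 'updated' field is unix time in milliseconds.

```python
from datetime import datetime, timezone
from typing import Optional


def get_utc_from_unix_time(unix_ms: Optional[int]) -> Optional[datetime]:
    # 'updated' from CoinCap is unix time in milliseconds; divide by 1000 for seconds
    if unix_ms is None:
        return None
    return datetime.fromtimestamp(int(unix_ms) / 1000, tz=timezone.utc)
```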

Ref: API data pull best practices

There are a few things going on in the line "with WarehouseConnection(**get_warehouse_creds()).managed_cursor() as curr:".

  1. We use the get_warehouse_creds utility function to get the warehouse connection credentials.
  2. The credentials are stored as environment variables in our docker-compose definition, which reads the hardcoded values from the env file.
  3. The credentials are passed as **kwargs to the WarehouseConnection class.
  4. The WarehouseConnection class uses the contextmanager decorator to make opening and closing DB connections easier, letting us use a DB connection without writing boilerplate each time.
import os
from typing import Dict, Optional, Union


def get_warehouse_creds() -> Dict[str, Optional[Union[str, int]]]:
    return {
        'user': os.getenv('WAREHOUSE_USER'),
        'password': os.getenv('WAREHOUSE_PASSWORD'),
        'db': os.getenv('WAREHOUSE_DB'),
        'host': os.getenv('WAREHOUSE_HOST'),
        'port': int(os.getenv('WAREHOUSE_PORT', 5432)),
    }
from contextlib import contextmanager

import psycopg2


class WarehouseConnection:
    def __init__(
        self, db: str, user: str, password: str, host: str, port: int
    ):
        self.conn_url = f'postgresql://{user}:{password}@{host}:{port}/{db}'

    @contextmanager
    def managed_cursor(self, cursor_factory=None):
        self.conn = psycopg2.connect(self.conn_url)
        self.conn.autocommit = True
        self.curr = self.conn.cursor(cursor_factory=cursor_factory)
        try:
            yield self.curr
        finally:
            self.curr.close()
            self.conn.close()
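The managed-cursor pattern can be tried out without a running Postgres instance. The sketch below mirrors the same pattern with stdlib sqlite3; SqliteConnection is a hypothetical stand-in, not part of the repo.

```python
import sqlite3
from contextlib import contextmanager


class SqliteConnection:
    """Stand-in that mirrors WarehouseConnection's managed_cursor pattern."""

    def __init__(self, db: str = ':memory:'):
        self.db = db

    @contextmanager
    def managed_cursor(self):
        # open a connection and cursor, hand the cursor to the caller,
        # and guarantee cleanup even if the caller's code raises
        conn = sqlite3.connect(self.db)
        curr = conn.cursor()
        try:
            yield curr
            conn.commit()
        finally:
            curr.close()
            conn.close()


with SqliteConnection().managed_cursor() as curr:
    curr.execute('SELECT 1')
    print(curr.fetchone()[0])  # prints 1
```

The caller gets a ready-to-use cursor and never has to remember to close anything, which is exactly the boilerplate the contextmanager removes.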

2. Test

Tests are crucial if you want to be confident when refactoring code, adding new features, and verifying code correctness. In this example, we add two major types of tests.

  1. Unit test: To test if individual functions are working as expected. We test get_utc_from_unix_time with the test_get_utc_from_unix_time function.
  2. Integration test: To test if multiple systems work together as expected.
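A unit test for the time conversion might look like the following sketch; the inline helper mirrors what get_utc_from_unix_time is assumed to do (the repo's actual implementation may differ).

```python
from datetime import datetime, timezone


def get_utc_from_unix_time(unix_ms: int) -> datetime:
    # assumed behavior: convert unix time in milliseconds to a UTC datetime
    return datetime.fromtimestamp(int(unix_ms) / 1000, tz=timezone.utc)


def test_get_utc_from_unix_time():
    # 1609459200000 ms is 2021-01-01 00:00:00 UTC
    assert get_utc_from_unix_time(1609459200000) == datetime(
        2021, 1, 1, tzinfo=timezone.utc
    )
```

pytest discovers any function whose name starts with test_, so running make pytest picks this up automatically.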

For the integration test we

  1. Mock the CoinCap API call using the mocker functionality of the pytest-mock library. We use the fixture data at test/fixtures/sample_raw_exchange_data.csv as the result of the mocked API call, to enable deterministic testing.
  2. Assert that the data we store in the warehouse is the same as we expected.
  3. Finally, the teardown_method truncates the local warehouse table. It is automatically called by pytest after the test_covid_stats_etl_run test function runs.
class TestBitcoinMonitor:
    def teardown_method(self, test_covid_stats_etl_run):
        with WarehouseConnection(
            **get_warehouse_creds()
        ).managed_cursor() as curr:
            curr.execute("TRUNCATE TABLE bitcoin.exchange;")

    def get_exchange_data(self):
        with WarehouseConnection(**get_warehouse_creds()).managed_cursor(
            cursor_factory=psycopg2.extras.DictCursor
        ) as curr:
            curr.execute(
                '''SELECT id,
                        name,
                        rank,
                        percenttotalvolume,
                        volumeusd,
                        tradingpairs,
                        socket,
                        exchangeurl,
                        updated_unix_millis,
                        updated_utc
                        FROM bitcoin.exchange;'''
            )
            table_data = [dict(r) for r in curr.fetchall()]
        return table_data

    def test_covid_stats_etl_run(self, mocker):
        mocker.patch(
            'bitcoinmonitor.exchange_data_etl.get_exchange_data',
            return_value=[
                r
                for r in csv.DictReader(
                    open('test/fixtures/sample_raw_exchange_data.csv')
                )
            ],
        )
        run()
        expected_result = [
          {"see github repo for full data"}
        ]
        result = self.get_exchange_data()
        assert expected_result == result

See the How to add tests to your data pipeline article to add more tests to this pipeline. You can run the tests using

make up # to start all your containers 
make pytest

3. Scheduler

Now that we have the ETL script and tests set up, we need to schedule the ETL script to run every 5 minutes. Since this is a simple script, we will go with cron instead of setting up a framework like Airflow or Dagster. The cron job is defined at scheduler/pull_bitcoin_exchange_info.

SHELL=/bin/bash
HOME=/
*/5 * * * * WAREHOUSE_USER=sdeuser WAREHOUSE_PASSWORD=sdepassword1234 WAREHOUSE_DB=finance WAREHOUSE_HOST=warehouse WAREHOUSE_PORT=5432  PYTHONPATH=/code/src /usr/local/bin/python /code/src/bitcoinmonitor/exchange_data_etl.py


This file is placed at the crontab location inside the pipelinerunner docker container. You may notice that the environment variables are hardcoded here; moving them out of this file is listed as future work.

4. Presentation

Now that we have the code and scheduler set up, we can add checks and formatting automation to ensure that we follow best practices. This is what a hiring manager will see when they look at your code, so ensuring that the presentation is clear, concise, and consistent is crucial.

4.1. Formatting, Linting, and Type checks

Formatting keeps the code style consistent. We use black and isort to automate formatting. black's -S (--skip-string-normalization) flag tells it to leave string quotes as they are, so our single-quoted strings are preserved.

Linting analyzes the code for potential errors and ensures that the code formatting is consistent. We use flake8 to lint check our code.

Type checking enables us to catch type errors where type hints are defined. We use mypy for this.

All of these are run within the docker container. We use a Makefile to store shortcuts to run these commands.

4.2. Architecture Diagram

Instead of a long block of text, it is usually easier to understand the data flow with an architecture diagram. It does not have to be beautiful, but it must be clear and understandable. Our architecture diagram is shown below.

Architecture

4.3. README.md

The readme should be clear and concise. It’s a good idea to have sections for

  1. Description of the problem
  2. Architecture diagram
  3. Setup instructions

You can automatically format and test your code with

make ci # this command will format your code, run lint and type checks and run all your tests 

After which, you can push it to your Github repository.

5. Live Dashboard

We can spin up a dashboard locally using the following commands.

cd bitcoinMonitor
make up

Then visit http://localhost:3000 to log into your local Metabase instance. From the Metabase UI, you can add charts and dashboards. To tear down the local containers you can use

make down

6. Deploy to Production

Now that you have your code working locally, it’s time to deploy to production. We will run our data pipeline and dashboards as containers on an EC2 instance. The first step is to start an EC2 ubuntu instance.

From your AWS UI, select EC2 and create an instance. Make sure to select an Ubuntu x86 instance, as shown below. A t2.micro instance should suffice (if you plan to keep your ETL job running indefinitely, you might need a bigger machine).

EC2

Accept the default values until you reach the security group section. There, add a TCP rule on port 3000 that accepts inbound connections from 0.0.0.0/0 (any address).

EC2 SG

If you do not have a pem file, you will be prompted to create and download one. Keep this file safe, as it will be used to connect to your EC2 instance. Wait for the EC2 instance to be in a ready state with all checks passed, then note down its public DNS and public IPv4 address.

Now that you have a running EC2 instance, you can

  1. Deploy your code to EC2, as shown below.
cd bitcoinMonitor
chmod 755 ./deploy_helpers/send_code_to_prod.sh
chmod 400 your-pem-file-full-location
./deploy_helpers/send_code_to_prod.sh your-pem-file-full-location your-EC2-Public-DNS
# this will open up a bash to your remote EC2 instance
# If you are having trouble connecting use method 2 from https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-fix-permission-denied-errors/
# you will be logged into your EC2 instance
  2. Install docker and start your ETL and dashboard containers on EC2.
chmod 755 install_docker.sh
./install_docker.sh
# verify that docker and docker compose installed
docker --version
docker-compose --version

# start the containers
unzip bitcoinmonitor.gzip && cd bitcoinmonitor/
docker-compose --env-file env up --build -d

You can log into your remote Metabase instance by using http://your-public-ipv4-address:3000. From Metabase, you can set up a connection to the Postgres warehouse with the following credentials

Host: warehouse
Database name: finance

The remaining configs are available in the env file. Create a dashboard with a markdown element containing a link to your Github repository as shown here.

6.1 Adding Dashboard to your Profile

Refer to Metabase documentation on how to create a dashboard. Once you create a dashboard, get its public link following the steps here. Create a hyperlink to this dashboard from your resume or LinkedIn page. You can also embed the dashboard as an iframe on any website.

A sample dashboard using bitcoin exchange data is shown below.

Dash

Depending on the EC2 instance type you choose, you may incur some cost. Use the AWS cost calculator to estimate it.

Future Work

Although this provides a good starting point, there is a lot of work to be done. Some future work may include

  1. CI/CD with Github actions
  2. Data quality testing
  3. Better scheduler and workflow manager to handle backfills, reruns, and parallelism
  4. Better failure handling
  5. Streaming data from APIs vs mini-batches
  6. Add system env variable to crontab
  7. Data cleanup job to remove old data, since our Postgres is running on a small EC2 instance
  8. API rate limiting

Conclusion

Building data projects is hard. Getting hiring managers to read through your Github code is even harder. By focusing on the right things, you can achieve your objective: showing your data skills to a hiring manager. We do this by making it extremely easy for the hiring manager to see the end product, the code, and the architecture.

We hope this article gave you a good idea of how to design a data project to impress a hiring manager. If you have any questions or comments, please leave them in the comment section below.

Further Reading

  1. Airflow scheduling
  2. Beginner DE project: batch
  3. Adding data tests
  4. API data pull using lambda
  5. dbt, getting started

References

  1. AWS EC2 connection issues
  2. Ubuntu docker install
  3. Crontab env variables
  4. Metabase documentation
  5. Coincap API