Designing a Data Project to Impress Hiring Managers

Introduction

Building a data project for your portfolio is hard. Getting hiring managers to read through your Github code is even harder. If you are building data projects and are

disappointed that no one looks at your Github projects

frustrated that recruiters don’t take you seriously as you don’t have a lot of work experience

Then this post is for you. In this post, we go over one way to design a data project that impresses a hiring manager and showcases your expertise. The main theme of this endeavor is show, don't tell, since you only get a few minutes (if not seconds) of the hiring manager's time.

Objective

When starting a project, it’s a good idea to work backward from your end goal. In our case, the main goal is to impress the hiring manager. This can be done by

  1. Linking your dashboard URL in your resume and LinkedIn.
  2. Hosting a live dashboard that is fed by near real-time data.
  3. Encouraging the hiring manager to look at your Github repository, with:
     - A concise and succinct README.md.
     - An architecture diagram.
     - Clear project organization.
     - Coding best practices: tests, linting, type checks, and formatting.

You want to showcase your expertise to the hiring manager, without expecting them to read through your codebase. In the following sections, we will build out a simple dashboard that is populated by near real-time bitcoin exchange data. You can use this as a reference to build your dashboards.

Project

For our project, we will pull bitcoin exchange data from CoinCap API. We will pull this data every 5 minutes and load it into our warehouse.
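The pull step can be sketched as follows. This is a hypothetical sketch, not code from the repo: the endpoint URL and the top-level 'data' key are assumptions based on CoinCap's public v2 API.

```python
import json
from urllib.request import urlopen

# Assumed CoinCap v2 endpoint; the actual repo may use a different URL or client
COINCAP_EXCHANGES_URL = 'https://api.coincap.io/v2/exchanges'


def parse_exchange_response(payload: str) -> list:
    # CoinCap wraps the list of exchanges in a top-level 'data' key
    return json.loads(payload)['data']


def get_exchange_data(url: str = COINCAP_EXCHANGES_URL) -> list:
    # fetch the raw JSON payload and return the list of exchange records
    with urlopen(url) as resp:
        return parse_exchange_response(resp.read().decode('utf-8'))
```

Splitting the parsing out of the network call makes the parsing logic easy to unit test without hitting the live API.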

Architecture

To follow along, you need:

  1. Docker and Docker Compose v1.27.0 or later.
  2. An AWS account.
  3. The AWS CLI installed and configured.
  4. git.

Clone the code as shown below.

git clone https://github.com/josephmachado/bitcoinMonitor.git
cd bitcoinMonitor

1. ETL Code

The code to pull data from the CoinCap API and load it into our warehouse is in exchange_data_etl.py. In this script, we

  1. Pull data from the CoinCap API using the get_exchange_data function.
  2. Use the get_utc_from_unix_time function to get a UTC-based datetime from unix time (in ms).
  3. Load the data into our warehouse using the insert query from _get_exchange_insert_query.
# `p` refers to psycopg2's extras module (assumed import: from psycopg2 import extras as p)
def run() -> None:
    data = get_exchange_data()
    for d in data:
        d['update_dt'] = get_utc_from_unix_time(d.get('updated'))
    with WarehouseConnection(**get_warehouse_creds()).managed_cursor() as curr:
        p.execute_batch(curr, _get_exchange_insert_query(), data)
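The get_utc_from_unix_time helper is not shown above; a minimal version might look like the following. This is a sketch under the assumption (stated in the steps above) that the 'updated' field is unix time in milliseconds.

```python
from datetime import datetime, timezone
from typing import Optional


def get_utc_from_unix_time(unix_ms: Optional[int]) -> Optional[datetime]:
    # 'updated' from CoinCap is unix time in milliseconds; divide by 1000 for seconds
    if unix_ms is None:
        return None
    return datetime.fromtimestamp(int(unix_ms) / 1000, tz=timezone.utc)
```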

Ref: API data pull best practices

There are a few things going on in the line "with WarehouseConnection(**get_warehouse_creds()).managed_cursor() as curr:".

  1. We use the get_warehouse_creds utility function to get the warehouse connection credentials.
  2. The credentials are stored as environment variables in our docker-compose definition, which reads the hardcoded values from the env file.
  3. The credentials are passed as **kwargs to the WarehouseConnection class.
  4. The WarehouseConnection class uses the contextmanager decorator to make opening and closing DB connections easier, letting us use a DB connection without writing boilerplate each time.
import os
from typing import Dict, Optional, Union


def get_warehouse_creds() -> Dict[str, Optional[Union[str, int]]]:
    return {
        'user': os.getenv('WAREHOUSE_USER'),
        'password': os.getenv('WAREHOUSE_PASSWORD'),
        'db': os.getenv('WAREHOUSE_DB'),
        'host': os.getenv('WAREHOUSE_HOST'),
        'port': int(os.getenv('WAREHOUSE_PORT', 5432)),
    }
from contextlib import contextmanager

import psycopg2


class WarehouseConnection:
    def __init__(
        self, db: str, user: str, password: str, host: str, port: int
    ):
        self.conn_url = f'postgresql://{user}:{password}@{host}:{port}/{db}'

    @contextmanager
    def managed_cursor(self, cursor_factory=None):
        self.conn = psycopg2.connect(self.conn_url)
        self.conn.autocommit = True
        self.curr = self.conn.cursor(cursor_factory=cursor_factory)
        try:
            yield self.curr
        finally:
            self.curr.close()
            self.conn.close()
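The managed-cursor pattern can be tried out without a running Postgres instance. The sketch below mirrors the same pattern with stdlib sqlite3; SqliteConnection is a hypothetical stand-in, not part of the repo.

```python
import sqlite3
from contextlib import contextmanager


class SqliteConnection:
    """Stand-in that mirrors WarehouseConnection's managed_cursor pattern."""

    def __init__(self, db: str = ':memory:'):
        self.db = db

    @contextmanager
    def managed_cursor(self):
        # open a connection and cursor, hand the cursor to the caller,
        # and guarantee cleanup even if the caller's code raises
        conn = sqlite3.connect(self.db)
        curr = conn.cursor()
        try:
            yield curr
            conn.commit()
        finally:
            curr.close()
            conn.close()


with SqliteConnection().managed_cursor() as curr:
    curr.execute('SELECT 1')
    print(curr.fetchone()[0])  # prints 1
```

The caller gets a ready-to-use cursor and never has to remember to close anything, which is exactly the boilerplate the contextmanager removes.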

2. Test

Tests are crucial if you want to be confident when refactoring code, adding new features, and verifying code correctness. In this example, we add two major types of tests.

  1. Unit test: To test if individual functions are working as expected. We test get_utc_from_unix_time with the test_get_utc_from_unix_time function.
  2. Integration test: To test if multiple systems work together as expected.
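A unit test for the time conversion might look like the following sketch; the inline helper mirrors what get_utc_from_unix_time is assumed to do (the repo's actual implementation may differ).

```python
from datetime import datetime, timezone


def get_utc_from_unix_time(unix_ms: int) -> datetime:
    # assumed behavior: convert unix time in milliseconds to a UTC datetime
    return datetime.fromtimestamp(int(unix_ms) / 1000, tz=timezone.utc)


def test_get_utc_from_unix_time():
    # 1609459200000 ms is 2021-01-01 00:00:00 UTC
    assert get_utc_from_unix_time(1609459200000) == datetime(
        2021, 1, 1, tzinfo=timezone.utc
    )
```

pytest discovers any function whose name starts with test_, so running make pytest picks this up automatically.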

For the integration test we

  1. Mock the CoinCap API call using the mocker functionality of the pytest-mock library. We use the fixture data at test/fixtures/sample_raw_exchange_data.csv as the result of the mocked API call, to enable deterministic testing.
  2. Assert that the data we store in the warehouse is the same as we expected.
  3. Finally, the teardown_method truncates the local warehouse table. It is automatically called by pytest after the test_covid_stats_etl_run test function runs.
class TestBitcoinMonitor:
    def teardown_method(self, test_covid_stats_etl_run):
        with WarehouseConnection(
            **get_warehouse_creds()
        ).managed_cursor() as curr:
            curr.execute("TRUNCATE TABLE bitcoin.exchange;")

    def get_exchange_data(self):
        with WarehouseConnection(**get_warehouse_creds()).managed_cursor(
            cursor_factory=psycopg2.extras.DictCursor
        ) as curr:
            curr.execute(
                '''SELECT id,
                        name,
                        rank,
                        percenttotalvolume,
                        volumeusd,
                        tradingpairs,
                        socket,
                        exchangeurl,
                        updated_unix_millis,
                        updated_utc
                        FROM bitcoin.exchange;'''
            )
            table_data = [dict(r) for r in curr.fetchall()]
        return table_data

    def test_covid_stats_etl_run(self, mocker):
        mocker.patch(
            'bitcoinmonitor.exchange_data_etl.get_exchange_data',
            return_value=[
                r
                for r in csv.DictReader(
                    open('test/fixtures/sample_raw_exchange_data.csv')
                )
            ],
        )
        run()
        expected_result = [
          {"see github repo for full data"}
        ]
        result = self.get_exchange_data()
        assert expected_result == result

See the How to add tests to your data pipeline article to add more tests to this pipeline. You can run the tests using

make up # to start all your containers 
make pytest

3. Scheduler

Now that we have the ETL script and tests set up, we need to schedule the ETL script to run every 5 minutes. Since this is a simple script, we will go with cron instead of setting up a framework like Airflow or Dagster. The cron job is defined at scheduler/pull_bitcoin_exchange_info.

SHELL=/bin/bash
HOME=/
*/5 * * * * WAREHOUSE_USER=sdeuser WAREHOUSE_PASSWORD=sdepassword1234 WAREHOUSE_DB=finance WAREHOUSE_HOST=warehouse WAREHOUSE_PORT=5432  PYTHONPATH=/code/src /usr/local/bin/python /code/src/bitcoinmonitor/exchange_data_etl.py


This file is placed at the crontab location inside the pipelinerunner docker container. You may notice that the environment variables are hardcoded here; moving them out of this file is listed as future work.

4. Presentation

Now that we have the code and scheduler set up, we can add checks and formatting automation to ensure that we follow best practices. This is what a hiring manager will see when they look at your code, so ensuring that the presentation is clear, concise, and consistent is crucial.

4.1. Formatting, Linting, and Type checks

Formatting keeps the code style consistent. We use black and isort to automate formatting. black's -S (--skip-string-normalization) flag tells it to leave string quotes as they are, so our single-quoted strings are preserved.

Linting analyzes the code for potential errors and ensures that the code formatting is consistent. We use flake8 to lint check our code.

Type checking enables us to catch type errors where type hints are defined. We use mypy for this.

All of these are run within the docker container. We use a Makefile to store shortcuts to run these commands.

4.2. Architecture Diagram

Instead of a long block of text, it is usually easier to understand the data flow with an architecture diagram. It does not have to be beautiful, but it must be clear and understandable. Our architecture diagram is shown below.

Architecture

4.3. README.md

The readme should be clear and concise. It’s a good idea to have sections for

  1. Description of the problem
  2. Architecture diagram
  3. Setup instructions

You can automatically format and test your code with

make ci # this command will format your code, run lint and type checks and run all your tests 

After which, you can push it to your Github repository.

5. Live Dashboard

We can spin up a dashboard locally using the following commands.

cd bitcoinMonitor
make up

Then visit http://localhost:3000 to log into your local Metabase instance. From the Metabase UI, you can add charts and dashboards. To tear down the local containers you can use

make down

6. Deploy to Production

Now that you have your code working locally, it’s time to deploy to production. We will run our data pipeline and dashboards as containers on an EC2 instance. The first step is to start an EC2 ubuntu instance.

From your AWS UI, select EC2 and create an instance. Make sure to select an Ubuntu x86 instance, as shown below. A t2.micro instance should suffice (if you plan to keep your ETL job running indefinitely, you might need a bigger machine).

EC2

Accept the default values until you reach the security group section. There, add a TCP rule on port 3000 that accepts inbound connections from 0.0.0.0/0 (any address).

EC2 SG

If you do not have a pem file, you will be prompted to create and download one. Keep this file safe, as it will be used to connect to your EC2 instance. Wait for the EC2 instance to be in a ready state with all checks passed, then note down its public DNS and public IPv4 address.

Now that you have a running EC2 instance, you can

  1. Deploy your code to EC2, as shown below.
cd bitcoinMonitor
chmod 755 ./deploy_helpers/send_code_to_prod.sh
chmod 400 your-pem-file-full-location
./deploy_helpers/send_code_to_prod.sh your-pem-file-full-location your-EC2-Public-DNS
# this will open up a bash to your remote EC2 instance
# If you are having trouble connecting use method 2 from https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-fix-permission-denied-errors/
# you will be logged into your EC2 instance
  2. Install docker and start your ETL and dashboard containers on EC2.
chmod 755 install_docker.sh
./install_docker.sh
# verify that docker and docker compose installed
docker --version
docker-compose --version

# start the containers
unzip bitcoinmonitor.gzip && cd bitcoinmonitor/
docker-compose --env-file env up --build -d

You can log into your remote Metabase instance by using http://your-public-ipv4-address:3000. From Metabase, you can set up a connection to the Postgres warehouse with the following credentials

Host: warehouse
Database name: finance

The remaining configs are available in the env file. Create a dashboard with a markdown element containing a link to your Github repository as shown here.

6.1 Adding Dashboard to your Profile

Refer to Metabase documentation on how to create a dashboard. Once you create a dashboard, get its public link following the steps here. Create a hyperlink to this dashboard from your resume or LinkedIn page. You can also embed the dashboard as an iframe on any website.

A sample dashboard using bitcoin exchange data is shown below.

Dash

Depending on the EC2 instance type you choose, you may incur some cost. Use the AWS cost calculator to estimate it.

Future Work

Although this provides a good starting point, there is a lot of work to be done. Some future work may include

  1. CI/CD with Github actions
  2. Data quality testing
  3. Better scheduler and workflow manager to handle backfills, reruns, and parallelism
  4. Better failure handling
  5. Streaming data from APIs vs mini-batches
  6. Add system env variable to crontab
  7. Data cleanup job to remove old data, since our Postgres is running on a small EC2 instance
  8. API rate limiting

Conclusion

Building data projects is hard. Getting hiring managers to read through your Github code is even harder. By focusing on the right things, you can achieve your objective: showing your data skills to a hiring manager. We do this by making it extremely easy for the hiring manager to see the end product, the code, and the architecture.

We hope this article gave you a good idea of how to design a data project to impress a hiring manager. If you have any questions or comments, please leave them in the comment section below.

Further Reading

  1. Airflow scheduling
  2. Beginner DE project: batch
  3. Adding data tests
  4. API data pull using lambda
  5. dbt, getting started

References

  1. AWS EC2 connection issues
  2. Ubuntu docker install
  3. Crontab env variables
  4. Metabase documentation
  5. Coincap API