How to Transition from Data Analyst to Data Engineer

Start building data projects today

Are you a data analyst/scientist looking to transition into data engineering? Are you stuck in tutorial hell, where you learn so much stuff but it rarely translates to practical work? How to move from “just learn” to a “deliver value” phase?

Are you uncertain about the skills you need to build to be a valuable data engineer in an AI-heavy future?

This post will guide you.

Data engineering is about enabling companies to make data-driven decisions (not just tools).

Imagine, if you can, start doing that today. Build a body of real work that can impress data engineering teams. That is what you will learn to do in this post.

By the end of this post, you will have well-defined, real-world data projects you can start today to showcase your skills.

This is part 1 of a multi-part series for people looking to transition from data analytics/science to data engineering.

Build tables that people rely on

Every data engineering team’s objective is to enable data-driven decision-making. All the DE tools/systems/frameworks are designed to enable the creation of easy-to-use, accurate, and on-time data.

Create tables that people rely on; this is your value to the business. Let’s see how you can do this.

Defining the Problem

In your day-to-day work, what query do you run regularly to answer an ad hoc question you’re asked or to check information yourself? Can you create a table to answer this question?

Some examples are: Quarterly reports, Monday morning checks, Has the data arrived checks, etc.

Now let’s evaluate how this information is used. Answer the following questions.

Who asks this question?
What does the requester do with this information?
How often is this information needed?

The 3 questions above help you define the following.

Stakeholder
Outcome of creating this table &
How frequently the table needs to be refreshed

Is this Project worth your time?

Now let’s decide if this is a viable project. Answer the following questions.

How many people use this data?
If the user did not have this data, would anything change? Would they get this information in another way?

If more than 2 people use this data and it is absolutely essential to their work, then you have a viable project on your hands.

Note: The users can be your team as well. We have used a query based on your day-to-day work, but it can also be your stakeholders running a saved query or a jupyter notebook to get the data.

It’s time to create a STAR point for your project.

You may be wondering why you should define the STAR before doing the project. This is because we want to avoid unnecessary work and pick the project with the maximum impact.

Expect to do this process for 4-5 projects and choose the one you can start building today.

Next up is implementation, and no, you don’t need scalable, performant, jargon-filled tools.

Use SQL for pipeline logic & schedule with cron

Create the entire pipeline with SQL. We assume that the data is already in the database and the output table is created in the same database.

Here is what SQL pipeline logic can look like

summary_table.sql


DROP TABLE IF EXISTS summary_table;

CREATE TABLE summary_table AS
WITH fact_data AS (
    SELECT
        f.dim1_key,
        f.dim2_key,
        f.dim3_key,
        f.measure_1,
        f.measure_2,
        f.event_date
    FROM fact AS f
    WHERE f.event_date >= ADD_MONTHS(CURRENT_DATE(), -24)
),

dim1 AS (
    SELECT
        d.dim1_key,
        d.dim1_attribute_1,
        d.dim1_attribute_2
    FROM scd2_dim1 AS d
    WHERE d.is_current = TRUE
),

dim2 AS (
    SELECT
        d.dim2_key,
        d.dim2_attribute_1,
        d.dim2_attribute_2
    FROM scd2_dim2 AS d
    WHERE d.is_current = TRUE
)

SELECT
    d1.dim1_attribute_1,
    d1.dim1_attribute_2,
    d2.dim2_attribute_1,
    d2.dim2_attribute_2,
    d3.dim3_attribute_1,
    d3.dim3_attribute_2,
    SUM(fd.measure_1)              AS total_measure_1,
    SUM(fd.measure_2)              AS total_measure_2,
    COUNT(*)                       AS record_count
FROM fact_data AS fd
LEFT JOIN dim1 AS d1
    ON fd.dim1_key = d1.dim1_key
LEFT JOIN dim2 AS d2
    ON fd.dim2_key = d2.dim2_key
LEFT JOIN dim3 AS d3
    ON fd.dim3_key = d3.dim3_key
GROUP BY
    d1.dim1_attribute_1,
    d1.dim1_attribute_2,
    d2.dim2_attribute_1,
    d2.dim2_attribute_2,
    d3.dim3_attribute_1,
    d3.dim3_attribute_2;

1: Recreate the output table
2: Use the adhoc query to create the table
3: Create the dataset for past 2 years (change as needed)
4: Example SCD2 tables

Replace the fact and dimension tables with your table.

Note

If your company has a standard scheduler (e.g. Airflow, dbt cloud, Dagster, etc), use that.

If you do not have a scheduler use crontab if you are on Linux or Task Scheduler if you are on Windows.

Use a cli to run the sql file on your database.


psql -h your_host \
-p 5432 \
-U your_user \
-d your_database \
-f summary_table.sql

1: psql is a cli to run scripts on Postgres
2: Use your connection details
3: Use your sql script name

Schedule it to run during work hours, so you can check that it ran

Critical: Showcase your work with a demo

Finally, create a presentation explaining the following points for the data you created

Stakeholder
Outcome of creating this table &
How frequently the table needs to be refreshed

Internal transfer to a data engineering team has a high probability of success

Data teams want people who know the domain. Interviewing is expensive, and hiring is risky. Teams only get a few hours to determine if someone is a good fit.

But teams can de-risk by hiring someone who has already demonstrated the ability to deliver value (hint: it’s you). This is one of the reasons that internal transfers and referrals work much better than hiring via applications.

Reach out to the DE team

Draft a message to a senior data engineer in your company’s data team that you want to join.

Start by stating your interest in the type of work their team is doing. Then show them what you have built and how it delivered outcomes for your stakeholders (use the slides).

Ask them for pointers on how to be eligible for an interview if a position opens on their team. End by thanking them for their time.

Conclusion

To recap, we saw

In the next posts in this series, we will cover advanced design patterns and how to know just enough about the plethora of data tools to speak knowledgeably about them.

In the meantime, think about what you can build now and the outcome you can drive at your current job. Build the data engineering skill set so that when the opportunity arises, you are ready.

Get my feedback

Tell me your STAR points in the comments below, and I will give you feedback on them from an hiring manager’s perspective.