Start building data projects today
Are you a data analyst/scientist looking to transition into data engineering? Are you stuck in tutorial hell, where you learn so much stuff but it rarely translates to practical work? How to move from “just learn” to a “deliver value” phase?
Are you uncertain about the skills you need to build to be a valuable data engineer in an AI-heavy future?
This post will guide you.
Data engineering is about enabling companies to make data-driven decisions (not just tools).
Imagine, if you can, start doing that today. Build a body of real work that can impress data engineering teams. That is what you will learn to do in this post.
By the end of this post, you will have well-defined, real-world data projects you can start today to showcase your skills.
This is part 1 of a multi-part series for people looking to transition from data analytics/science to data engineering.
Build tables that people rely on
Every data engineering team’s objective is to enable data-driven decision-making. All the DE tools/systems/frameworks are designed to enable the creation of easy-to-use, accurate, and on-time data.
Create tables that people rely on; this is your value to the business. Let’s see how you can do this.
In your day-to-day work, what query do you run regularly to answer an ad hoc question you’re asked or to check information yourself? Can you create a table to answer this question?
Some examples are: Quarterly reports, Monday morning checks, Has the data arrived checks, etc.
Now let’s evaluate how this information is used. Answer the following questions.
- Who asks this question?
- What does the requester do with this information?
- How often is this information needed?
The 3 questions above help you define the following.
- Stakeholder
- Outcome of creating this table &
- How frequently the table needs to be refreshed
Now let’s decide if this is a viable project. Answer the following questions.
- How many people use this data?
- If the user did not have this data, would anything change? Would they get this information in another way?
If more than 2 people use this data and it is absolutely essential to their work, then you have a viable project on your hands.
Note: The users can be your team as well. We have used a query based on your day-to-day work, but it can also be your stakeholders running a saved query or a jupyter notebook to get the data.
It’s time to create a STAR point for your project.
You may be wondering why you should define the STAR before doing the project. This is because we want to avoid unnecessary work and pick the project with the maximum impact.
Expect to do this process for 4-5 projects and choose the one you can start building today.
Next up is implementation, and no, you don’t need scalable, performant, jargon-filled tools.
Use SQL for pipeline logic & schedule with cron
Create the entire pipeline with SQL. We assume that the data is already in the database and the output table is created in the same database.
Here is what SQL pipeline logic can look like
summary_table.sql
DROP TABLE IF EXISTS summary_table;
CREATE TABLE summary_table AS
WITH fact_data AS (
SELECT
f.dim1_key,
f.dim2_key,
f.dim3_key,
f.measure_1,
f.measure_2,
f.event_date
FROM fact AS f
WHERE f.event_date >= ADD_MONTHS(CURRENT_DATE(), -24)
),
dim1 AS (
SELECT
d.dim1_key,
d.dim1_attribute_1,
d.dim1_attribute_2
FROM scd2_dim1 AS d
WHERE d.is_current = TRUE
),
dim2 AS (
SELECT
d.dim2_key,
d.dim2_attribute_1,
d.dim2_attribute_2
FROM scd2_dim2 AS d
WHERE d.is_current = TRUE
)
SELECT
d1.dim1_attribute_1,
d1.dim1_attribute_2,
d2.dim2_attribute_1,
d2.dim2_attribute_2,
d3.dim3_attribute_1,
d3.dim3_attribute_2,
SUM(fd.measure_1) AS total_measure_1,
SUM(fd.measure_2) AS total_measure_2,
COUNT(*) AS record_count
FROM fact_data AS fd
LEFT JOIN dim1 AS d1
ON fd.dim1_key = d1.dim1_key
LEFT JOIN dim2 AS d2
ON fd.dim2_key = d2.dim2_key
LEFT JOIN dim3 AS d3
ON fd.dim3_key = d3.dim3_key
GROUP BY
d1.dim1_attribute_1,
d1.dim1_attribute_2,
d2.dim2_attribute_1,
d2.dim2_attribute_2,
d3.dim3_attribute_1,
d3.dim3_attribute_2;- 1
- Recreate the output table
- 2
- Use the adhoc query to create the table
- 3
- Create the dataset for past 2 years (change as needed)
- 4
- Example SCD2 tables
Replace the fact and dimension tables with your table.
If your company has a standard scheduler (e.g. Airflow, dbt cloud, Dagster, etc), use that.
If you do not have a scheduler use crontab if you are on Linux or Task Scheduler if you are on Windows.
Use a cli to run the sql file on your database.
- 1
-
psqlis a cli to run scripts on Postgres - 2
- Use your connection details
- 3
- Use your sql script name
Schedule it to run during work hours, so you can check that it ran
Finally, create a presentation explaining the following points for the data you created
- Stakeholder
- Outcome of creating this table &
- How frequently the table needs to be refreshed
Internal transfer to a data engineering team has a high probability of success
Data teams want people who know the domain. Interviewing is expensive, and hiring is risky. Teams only get a few hours to determine if someone is a good fit.
But teams can de-risk by hiring someone who has already demonstrated the ability to deliver value (hint: it’s you). This is one of the reasons that internal transfers and referrals work much better than hiring via applications.
Draft a message to a senior data engineer in your company’s data team that you want to join.
Start by stating your interest in the type of work their team is doing. Then show them what you have built and how it delivered outcomes for your stakeholders (use the slides).
Ask them for pointers on how to be eligible for an interview if a position opens on their team. End by thanking them for their time.
Conclusion
To recap, we saw
- How to deliver value by materializing an ad hoc query
- Implementation with SQL
- How to lay the foundations for an internal transfer
In the next posts in this series, we will cover advanced design patterns and how to know just enough about the plethora of data tools to speak knowledgeably about them.
In the meantime, think about what you can build now and the outcome you can drive at your current job. Build the data engineering skill set so that when the opportunity arises, you are ready.
Tell me your STAR points in the comments below, and I will give you feedback on them from an employee’s perspective.