10 Skills to Ace Your Data Engineering Interviews

Introduction

Are you a student, analyst, engineer, or someone preparing for a data engineering interview and overwhelmed by all the tools and concepts? Then this post is for you. In this post, we go over the most common tools and concepts you need to know to ace any data engineering interview.

Skills

The following 10 skills are ordered based on how frequently they are asked in interviews. If you are short on time go to this section.

1. SQL

SQL is crucial for any data engineering interview. These questions generally involve the interviewer giving you an ERD and asking you to write queries to answer analytical questions. Things to know include

2. Python

Python is the most common language used in building data engineering pipelines. Concepts to know include

3. Leetcode: data structures and algorithms

This is a standard section for most software job interviews. Use the links below to practice your leetcode skills.

4. Data modeling

You will be tested on data modeling concepts (e.g. star schema, facts, and dimensions). Read the concepts below, understand what they are and why they are needed.

4.1 Data warehousing

4.2 OLTP

5. Data pipelines

Data engineering interviews usually involve a data pipeline design question. Upon designing the data pipeline, you will be asked about testing, backfilling, scaling, handling bad data, resolving dependencies, etc.

6. Distributed system fundamentals

Depending on the job, you might be asked about distributed system fundamentals. It is a good idea to read up on how distributed systems work. Use the links below to understand the basics of distributed systems.

7. Event streaming

Understand what event streams are, why, and how to use them.

8. System design

This is a standard question for most software jobs. Specifically for a data engineering position, you will be asked about building data pipelines where you will have to understand the source system, ETL schedule requirements, data usage pattern, and design a data pipeline to pull data into a warehouse. Some common questions in this section are, "How do you design a clickstream data storage system?", and, "How would you use CDC pattern to replicate OLTP tables into a data warehouse?". Some companies may choose to ask a standard system design question like designing Twitter, Netflix, etc.

9. Business questions

Sometimes interviews may involve questions about business metrics. You can be asked to design metrics to track the health of the business, how to calculate it, why you chose it, etc.

You may be asked about your impact. You will need to explain what KPI/SLA (requirements) you were trying to optimize and how you achieved it. See the STAR method for how to organize your response to such questions.

10. Cloud computing

Be familiar with what cloud computing is. You do not need to know everything in detail but should have a general idea of what they are and what benefits they bring.

You can find corresponding components for GCP or Azure as well.

11. Probabilistic data structures (optional)

Although not frequently asked, these are asked sometimes. You may be asked a question like, “What is an efficient way to count the number of distinct ids in a distributed system where some probability of error is fine?". The most common questions of this type can usually be answered using the concepts shown below.

Optional: Knowledge of a JVM language (e.g. Scala or Java) is helpful.

Interview prep, the TL;DR version

If you have only a few days to prepare for your interview, make sure to understand the topics shown below.

  1. SQL: Basics, joins, union, CTEs, window functions and leetcode medium/hard problems
  2. Python: Basic data structures, Easy and medium (represented by green and yellow respectively) problems from the “blind 75” list.
  3. Answering business questions with SQL: You will be given a set of tables and asked a business question. Usually, the tables have the same name as the business process/entity.
  4. Data pipeline design: You will be asked to design a data pipeline. Make sure to understand the objective of the data pipeline. Some common questions in this section are, "How do you design a clickstream data storage system?", and, "How would you use CDC pattern to replicate OLTP tables into a data warehouse?". The interviewer will want to understand if you think about data lineage, schedule (batch or streaming), data duplication, scaling, loading data, testing, and end-user access patterns.

Conclusion

Hope this article gives you a good list of skills to be better prepared for a data engineering interview. The next time you are interviewing for a data engineering job, use the list above to prepare.

As always, if you have any questions or comments, please feel free to leave them in the comments section below.

Please consider sharing, it helps out a lot!