10 Skills to Ace Your Data Engineering Interview
Introduction
If you are preparing for a data engineering interview and are overwhelmed by all the tools and concepts you need to learn, this post is for you. In it, we go over the most common tools and concepts you need to know to ace any data engineering interview.
If you are working on landing a data engineering interview, read a proven approach to landing a DE job.
Skills
The skills are ordered by how frequently they come up in interviews. Each point links to posts/docs that explain the concept in detail.
1. SQL
SQL is crucial for any data engineering interview. These questions generally involve the interviewer giving you an ERD and asking you to write queries to answer analytical questions. Read about the concepts below and practice LeetCode problems to ace this section.
- Basics: SELECT, FROM, WHERE, LIKE
- Joins: left outer, right outer, inner, full outer, anti join, and when to use each join type
- Window functions
- Table relationships: one to many, many to many, one to one
- What are primary keys and foreign keys
- Subqueries, derived tables, CTEs
- What is an index and why use it
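Window functions in particular come up constantly, and you can practice them with nothing but Python's built-in sqlite3 module. A minimal sketch: the `orders` table and its columns below are made up for practice, not from any specific interview.

```python
import sqlite3

# Build a throwaway in-memory table to practice on.
# Note: window functions require SQLite >= 3.25, which recent
# Python builds bundle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10), ("a", 20), ("b", 5)],
)

# Running total per customer via a window function:
# SUM over a partition, ordered within the partition.
rows = conn.execute(
    """
    SELECT customer,
           amount,
           SUM(amount) OVER (
               PARTITION BY customer ORDER BY amount
           ) AS running_total
    FROM orders
    ORDER BY customer, amount
    """
).fetchall()
print(rows)  # [('a', 10, 10), ('a', 20, 30), ('b', 5, 5)]
```

The same setup works for practicing RANK, LAG/LEAD, and other window functions without installing a database server.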
2. Python
Python is the most common language used for building data engineering pipelines. Know its built-in data structures and the language basics.
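The kind of fluency interviewers expect is small and practical, for example grouping records by key with a dict. A minimal sketch, with made-up event data:

```python
from collections import defaultdict

# Hypothetical (user, action) event records to group by user.
events = [("user1", "click"), ("user2", "view"), ("user1", "view")]

# defaultdict(list) avoids the "check if key exists" boilerplate.
by_user = defaultdict(list)
for user, action in events:
    by_user[user].append(action)

print(dict(by_user))  # {'user1': ['click', 'view'], 'user2': ['view']}
```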
3. Data structures and algorithms
This is a standard section for any software job. Read the patterns and practice the commonly asked problems.
- 14 patterns for coding interviews
- Sliding window problem
- Merge intervals problem
- Graph BFS
- Graph DFS
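As an illustration of the sliding window pattern from the list above, here is a classic problem: the maximum sum of any contiguous subarray of length k, solved in O(n) by sliding the window instead of recomputing each sum.

```python
def max_window_sum(nums, k):
    """Sliding window: maximum sum of any contiguous
    subarray of length k, in O(n) time."""
    window = sum(nums[:k])  # sum of the first window
    best = window
    for i in range(k, len(nums)):
        # Slide right: add the entering element, drop the leaving one.
        window += nums[i] - nums[i - k]
        best = max(best, window)
    return best

print(max_window_sum([2, 1, 5, 1, 3, 2], 3))  # 9  (from [5, 1, 3])
```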
4. Data Modeling
You will be tested on data modeling concepts, such as the star schema, and will be asked why a certain methodology may be better in a given scenario. For the concepts below, understand what they are, why they are needed, when to use them, and when not to use them.
4.1 Data Warehousing
- What is a data warehouse
- What is Star schema, fact and dimension tables
- What are slowly changing dimensions, especially the SCD1, SCD2, and SCD3 types?
- Separation of compute and storage
- External tables
- Data partitioning
- Columnar storage formats, such as Parquet, ORC
- OLTP vs OLAP
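SCD2 is the variant most often probed in interviews, so here is a minimal sketch of the idea: instead of overwriting a changed attribute, close out the current row and insert a new version with validity dates. The dimension table and field names are illustrative, not a standard schema.

```python
from datetime import date

# A tiny in-memory "dimension table": one row per version of a customer.
dim_customer = [
    {"customer_id": 1, "city": "Boston",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """SCD2 update: close the currently active row and append a new
    version, preserving full history."""
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["city"] == new_city:
                return  # attribute unchanged, nothing to do
            row["valid_to"] = change_date  # close out the old version
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None})

apply_scd2(dim_customer, 1, "Seattle", date(2021, 6, 1))
# History is preserved: the Boston row is closed, a Seattle row is active.
```

Contrast with SCD1, which would simply overwrite `city` in place and lose the history.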
4.2 OLTP
- What OLTP databases are (e.g., MySQL, PostgreSQL) and why they are modeled differently from a warehouse
5. Data Pipelines
This will be part of the system design question: you will be asked to design a data pipeline, and the interviewer will stress-test it with potential source system failures, incorrect dependencies, and backfilling, and will check for efficient data modeling for data access by users.
- Data pipeline dependencies, idempotency, time-based splits, late arriving events, backfilling
- Basics of an orchestration tool like Airflow and a transformation tool like dbt
- Difference between ETL and ELT and when to use one over the other
- CDC pattern
- EL tools such as Stitch, Fivetran
- Data Quality
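Idempotency is the property interviewers probe hardest when they bring up backfills, and it can be sketched in a few lines: if a task overwrites its own partition, re-running it never duplicates data. The in-memory `warehouse` dict below stands in for a real partitioned table.

```python
# Partition key (run date) -> rows. A stand-in for a partitioned table.
warehouse = {}

def load_partition(run_date, rows):
    """Idempotent load: delete-then-insert by partition, so the task
    is safe to re-run for backfills or after failures."""
    warehouse[run_date] = list(rows)

load_partition("2024-01-01", [{"id": 1}])
load_partition("2024-01-01", [{"id": 1}])  # re-run: no duplicates
print(warehouse)  # {'2024-01-01': [{'id': 1}]}
```

A naive append-based load would double the rows on the second call; that difference is exactly what the backfilling questions are testing.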
6. Distributed system fundamentals
- Distributed data storage and processing
- Differences between batch and stream processing
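The batch-vs-stream distinction can be captured in a toy sketch: batch computes over a complete, bounded dataset, while streaming updates an aggregate incrementally as unbounded events arrive.

```python
def batch_total(events):
    # Batch: the whole dataset is available up front.
    return sum(events)

def stream_totals(events):
    # Stream: events arrive one at a time; emit an updated
    # result after each one instead of waiting for the end.
    total = 0
    for e in events:
        total += e
        yield total

print(batch_total([1, 2, 3]))          # 6
print(list(stream_totals([1, 2, 3])))  # [1, 3, 6]
```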
7. Distributed Queue
8. System Design
This is a standard question for most software jobs. For a data engineering position specifically, you will be asked about building data pipelines: you will have to understand the source system, ETL schedule requirements, and data usage patterns, then design a data pipeline to pull data into a warehouse. Some common questions in this section are designing a clickstream data storage system, log storage, and using the CDC pattern to replicate OLTP tables into a data warehouse. Some companies may choose to ask a standard system design question, like designing Twitter, Netflix, etc.
- Batch ETL
- Streaming ETL
- Change data capture pattern
- Standard system design concepts such as caches, databases, load balancers, data modeling, etc., and common questions like designing Twitter, Netflix, and WhatsApp
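The CDC pattern mentioned above can be sketched in a few lines: the OLTP database emits a log of row-level changes, and the warehouse side replays that log to keep a replica table in sync. The event shape below is illustrative, not any specific tool's format.

```python
# A hypothetical change log emitted by the source OLTP database.
change_log = [
    {"op": "insert", "id": 1, "row": {"name": "ada"}},
    {"op": "update", "id": 1, "row": {"name": "ada lovelace"}},
    {"op": "insert", "id": 2, "row": {"name": "alan"}},
    {"op": "delete", "id": 2, "row": None},
]

def apply_changes(replica, log):
    """Replay row-level change events against a replica table
    (a dict keyed by primary key)."""
    for event in log:
        if event["op"] == "delete":
            replica.pop(event["id"], None)
        else:
            # Insert and update both upsert the latest row image.
            replica[event["id"]] = event["row"]
    return replica

print(apply_changes({}, change_log))  # {1: {'name': 'ada lovelace'}}
```

The same replay logic is why CDC-based replication handles updates and deletes that a naive "re-extract the whole table" approach would have to rescan for.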
9. Business Value
This is a key point for experienced candidates. You will need to justify how your work impacted money, KPIs, OKRs, or other metrics that drove business value. See the STAR method for how to organize your responses to such questions.
10. Cloud computing
Be familiar with what cloud computing is. The interviewer will check whether you know some commonly used components, such as:
- Cloud storage, such as AWS S3
- Cloud compute, such as AWS EC2
- Cloud database, such as AWS RDS
- Cloud managed HDFS, Spark, Flink, such as AWS EMR
- Cloud serverless functions, such as AWS Lambda
- Cloud managed event streaming, such as AWS Kinesis (or AWS MSK for managed Kafka)
You can find corresponding components for GCP or Azure as well.
11. Probabilistic Data structures (Optional)
Although not frequently asked, these do come up sometimes. For example, you may be asked for an efficient way to count the number of distinct ids in a distributed system when some probability of error is acceptable.
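The standard answer to that distinct-count question is a sketch like HyperLogLog. A toy version of the core idea, for intuition only: hash each id, track the maximum number of leading zero bits seen, and estimate the cardinality as 2 to that maximum. Real HyperLogLog uses many registers plus bias correction; this single-register version only illustrates the principle and has very high error.

```python
import hashlib

def leading_zeros(h, bits=32):
    # Number of leading zero bits in a `bits`-wide value.
    return bits - h.bit_length()

def estimate_distinct(ids, bits=32):
    """Single-register cardinality sketch: rarer hash prefixes
    (more leading zeros) imply more distinct values were seen."""
    max_zeros = 0
    for item in ids:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        h &= (1 << bits) - 1  # keep the low `bits` bits
        if h == 0:
            continue
        max_zeros = max(max_zeros, leading_zeros(h, bits))
    return 2 ** max_zeros

estimate = estimate_distinct(range(10_000))  # a rough power-of-two estimate
```

The key property to mention in an interview: the sketch uses constant memory and its registers can be merged across machines, which is what makes it suitable for distributed counting.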
Optional: depending on the specific role, expertise in a JVM language such as Java or Scala.
Interview Preparation Tips
When you are preparing for any interview, there are 4 main aspects:
- Coding skills: Points 1, 2, 3. You will need to practice these problems and be able to write code in your interview.
- Core concepts: Points 4 - 9. You will need to know enough to get through an interview. For example, for columnar storage formats you will need to know what they are, why, when to use them, what their benefits are, and when not to use them, but not necessarily the exact implementation details, like the metadata format of a columnar format such as Apache Parquet. Those details are essential to know for the job, but given that an interview is about 1h, the probability of getting into such low-level details is very low.
- System design skills: This will require you to use the concepts you learn from points 4 - 9 and come up with a good solution for the question asked. When faced with a design question, make sure to clear up any ambiguous requirements before starting to come up with a solution.
- Communication skills: When you are talking about your previous experiences, use the STAR method to keep your responses focused and clear.
If you are pressed for time, use the list below to concentrate on high-priority skills:
- Points 1 - 4, 7: Absolutely necessary, almost all levels of Data engineering interviews need this.
- Points 5 - 10: If you are interviewing for a position that involves designing ETL systems and modeling data warehouses, you will be tested on this.
- Point 11 + JVM: These are good to know; some companies check whether you know these topics.
Conclusion
Hope this list helps you ace your data engineering interview. Good luck! If you have any questions or comments, leave them in the comment section below.