Data Engineering Interview Preparation Series #1: Data Structures and Algorithms

Preparing for data engineering interviews can be stressful. There are so many things to learn. In this ‘Data Engineering Interview Series’, you will learn how to crack each section of the data engineering interview. If you have felt > That you need to practice 100s of Leetcode questions to crack the data engineering interview > That you have no idea where/how to start preparing for the data structures and algorithms interview > That you are not good enough to crack the data structures and algorithms interview. Then this post is for you!
Author

Joseph Machado

Published

August 13, 2024

Keywords

beginner, interview

1. Introduction

Preparing for data engineering interviews can be stressful. There are so many things to learn. In this Data Engineering Interview Series, you will learn how to crack each section of the data engineering interview.

If you have felt

That you need to practice 100s of Leetcode questions to crack the data engineering interview

That you have no idea where/how to start preparing for the data structures and algorithms interview

That you are not good enough to crack the data structures and algorithms interview.

Then this post is for you! In this post, we will cover the following:

  1. Data structures & algorithms to know
  2. Common patterns of questions asked
  3. How to do industry-specific research

By the end of this post, you will be able to pass the DSA part of your data engineering interview.

Practice content in this post with code using this repository.

This post is part 1 of my data engineering interview series:

  1. Data Engineering Interview Preparation Series #1: Data Structures and Algorithms
  2. Data Engineering Interview Preparation Series #2: System Design
  3. Data Engineering Interview Preparation Series #3: SQL

2. Data structures and algorithms to know

Understanding basic data structures is essential to implement answers to DSA questions. You should know the topics in this section very well.

2.1. List

Description: A list is a dynamic array that allows you to store a sequence of elements. It supports indexing, slicing, and various methods for manipulating the data.

Time Complexity: 1. Read: O(1) - Accessing an element by an index is fast and takes constant time. 2. Write: O(1) - Appending an element is usually O(1), but inserting or removing can be O(n) due to shifting elements.

my_list = [1, 2, 3, 4, 5]
# Accessing an element
element = my_list[2]  # element = 3

# Adding an element
my_list.append(6)  # my_list = [1, 2, 3, 4, 5, 6]

# Removing an element
my_list.remove(3)  # my_list = [1, 2, 4, 5, 6]
print(my_list)

# sort list in place
my_list.sort()


# Iterate through a list

for idx, elt in enumerate(my_list):
    print(f'Use enumerate to get index: {idx} and element: {elt}')

for elt in my_list:
    print(f'Use loop directly to loop through elements: {elt}')

2.2. Dictionary

Description: A dictionary is an unordered collection of key-value pairs. It allows for fast lookups, insertions, and deletions based on keys.

Time Complexity: 1. Read: O(1) - Accessing a value by key is fast due to the underlying hash table. 2. Write: O(1) - Inserting or deleting a key-value pair is also fast, thanks to the hash table.

my_dict = {'a': 1, 'b': 2, 'c': 3}
# Accessing a value
value = my_dict['b']  # value = 2

# Adding a key-value pair
my_dict['d'] = 4  # my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Removing a key-value pair
del my_dict['a']  # my_dict = {'b': 2, 'c': 3, 'd': 4}
print(my_dict)

2.3. Queue

Description: A queue is a linear data structure that follows the First-In-First-Out (FIFO) principle. Elements are added at the rear and removed from the front.

Time Complexity: 1. Read: O(1) - Accessing the front element is O(1) if using deque. 2. Write: O(1) - Enqueuing (adding) and dequeuing (removing) operations are O(1) with deque.

from collections import deque

queue = deque([1, 2, 3])
# Adding an element
queue.append(4)  # queue = deque([1, 2, 3, 4])

# Removing an element
element = queue.popleft()  # element = 1, queue = deque([2, 3, 4])
print(f'The popped element is {element} and the remaining queue is {queue}')

2.4. Stack

Description: A stack is a linear data structure that follows the Last-In-First-Out (LIFO) principle. Elements are added and removed from the top.

Time Complexity: 1. Read: O(1) - Accessing the top element is O(1). 2. Write: O(1) - Pushing (adding) and popping (removing) elements are O(1).

stack = [1, 2, 3]
# Adding an element (push)
stack.append(4)  # stack = [1, 2, 3, 4]

# Removing an element (pop)
element = stack.pop()  # element = 4, stack = [1, 2, 3]
print(f'The popped element is {element} and the remaining stack is {stack}')

2.5. Set

Description: A set is an unordered collection of unique elements. Sets are helpful for membership testing, removing duplicates from a sequence, and performing mathematical set operations like union, intersection, and difference.

Time Complexity: 1. Read: O(1) - Checking if an element is in the set (e.g., 4 in my_set) is O(1) on average. 2. Write: * Adding/Removing: O(1) - Adding or removing an element is O(1) on average. * Set Operations: 1. Union: O(len(set1) + len(set2)) - Combining two sets into one. 2. Intersection: O(min(len(set1), len(set2))) - Finding common elements between two sets. 3. Difference: O(len(set1)) - Finding elements in one set but not another.

Sets are highly efficient for operations involving unique elements and membership checks, with an average O(1) time complexity for most read-and-write operations.

# Creating a set
my_set = {1, 2, 3, 4, 5}

# Adding an element
my_set.add(6)  # my_set = {1, 2, 3, 4, 5, 6}

# Removing an element
my_set.remove(3)  # my_set = {1, 2, 4, 5, 6}

# Checking membership
is_in_set = 4 in my_set  # is_in_set = True

# Set operations: union, intersection, difference
other_set = {4, 5, 6, 7, 8}
union_set = my_set.union(other_set)  # union_set = {1, 2, 4, 5, 6, 7, 8}
intersection_set = my_set.intersection(other_set)  # intersection_set = {4, 5, 6}
difference_set = my_set.difference(other_set)  # difference_set = {1, 2}

print(f' Here is my_set {my_set} and other_set {other_set}')
print(f' Here is union_set {union_set}')
print(f' Here is intersection_set {intersection_set}')
print(f' Here is difference_set {difference_set}')

2.6. Counter (from collections module)

Description: Counter is a subclass of the dictionary in Python that helps count hashable objects. It is beneficial for counting the frequency of elements in an iterable.

from collections import Counter

# Creating a Counter from a list
my_counter = Counter(['apple,' 'banana,' 'apple,' 'orange,' 'banana,' 'apple'])

print(my_counter)

2.7. Heap

Description: A heap is a specialized binary tree-based data structure that satisfies the heap property. In a max-heap, for any given node i, the value of i is greater than or equal to the values of its children. In a min-heap, the value of i is less than or equal to the values of its children. Heaps are commonly used to implement priority queues.

Time Complexity: 1. Read: O(1) - Accessing the maximum or minimum element (the root) is fast because it is always at the top of the heap. 2. Write: 1. Insertion: O(log n) - Inserting a new element requires maintaining the heap property, which involves comparing and possibly swapping elements up the tree. 2. Deletion: O(log n) - Deleting the maximum or minimum element requires rebalancing the heap, typically involving a series of swaps down the tree.

import heapq

# by default, this is a min-heap. To get the max heap, multiply the number by -1 before inserting
a = []
heapq.heappush(a, 10)

heapq.heappop(a)

3. Common DSA questions asked during DE interviews

These problems are taken from Blind 75 and based on common questions during data engineering interviews. The topic segmentation is based on Neetcode.

Do the exercises in the order shown below!

You should be able to complete the problems in this section in under 15 minutes per problem.

Practice these problems multiple times until you can do them in under 15 minutes without looking at the solutions!!

3.1. Intervals

  1. Insert interval
  2. Merge intervals
  3. Nonoverlapping intervals

3.2. Graph

  1. Number of Islands
  2. Number of Connected Components
  3. Course Schedule

3.3. Two Pointers

  1. Container with most water
  2. 3Sum
  3. Valid Palindrome

3.4. Array & Hashing

  1. Contains Duplicate
  2. Valid Anagram
  3. Two Sum
  4. Top K Frequent Elements
  5. Longest Consecutive Sequence

3.5. Stacks

  1. Validate parenthesis

3.6. Sliding Window

  1. Best time to buy and sell stock

3.7. Linked List

  1. Reverse Linked List
  2. Merge 2 Sorted List
  3. Linked List Cycle
  4. Merge K sorted List

3.8. Tree

  1. Invert a binary tree
  2. Depth of a binary tree
  3. Subtree of a binary tree
  4. LCA in BST
  5. Serialize and deserialize Binary Tree

3.9. Heap

  1. Find median in a datastream

3.10. Dynamic Programming

  1. Coin change
  2. House robber
  3. Decode Ways

3.11. Construct data structures

  1. LRU Cache
  2. Circular queue

4. Company specific research

Search for company-specific problems on

  1. Leetcode
  2. TeamBlind
  3. Glassdoor

Sort by the latest and do as many of them as possible. Like before, you should be able to complete these problems in under 15 minutes.

5. Conclusion

To recap, we saw:

  1. Data structures & algorithms to know
  2. Common patterns of questions asked
  3. How to do industry-specific research

If you have an interview coming up, make sure you can complete the typical patterns section in order.

Please let me know in the comment section below if you have any questions or comments.

6. Further reading

  1. Steps to land a high-paying data engineering job
  2. How to become a valuable data engineer
Back to top

Land your dream Data Engineering job with my free book!

Build data engineering proficiency with my free book!

Are you looking to enter the field of data engineering? And are you

> Overwhelmed by all the concepts/jargon/frameworks of data engineering?

> Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

Sign up for my free Data Engineering 101 Course. You will get

✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

✅ Executable code to practice and exercises to test yourself.

✅ Weekly email for 4 weeks with the exercise solutions.

Join now and get started on your data engineering journey!

    Testimonials:

    I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
    I have learned a lot from the course which is much more practical.
    This course helped me build a project and actually land a data engineering job! Thank you.

    When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.