10 Key skills to help you become a data engineer

Key skills required for a data engineer. If you are getting into data engineering, this post gives you an ordered list of topics you can start learning.
Author

Joseph Machado

Published

March 20, 2020

Keywords

getting started

This article gives you an overview of the 10 key skills you need to become a better data engineer. If you are struggling to get started on what to learn, start with the first topic and proceed through the list.

1. Linux

Most applications are built on Linux systems, so it is crucial to understand how to work with them. The key concepts to know are

  1. File system commands, such as ls, cd, pwd, mkdir, rmdir

  2. Commands to get metadata about your data, such as head, tail, wc, grep, ls -lh

  3. Data processing commands, such as awk, sed

  4. Bash scripting concepts, such as control flow, looping, passing input parameters

2. SQL

SQL is crucial for accessing your data, whether it is for running analysis or for use by your application. The key concepts to know are

  1. Basic querying, such as select, where, joins (all types of joins), group by, having, and window functions

  2. SQL internals, such as indexes (the different types and how they work) and transaction concepts such as locks and race conditions

  3. Data modeling: OLTP data modeling using normalization, and OLAP data modeling schemas like star and snowflake
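The querying concepts above can be tried out without installing anything, using Python's built-in sqlite3 module. The `orders` table and its data below are made up for illustration; the group by and window function syntax is standard SQL.

```python
import sqlite3

# In-memory SQLite database with a hypothetical orders table,
# used only to illustrate GROUP BY and window functions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'alice', 10.0), (2, 'alice', 20.0), (3, 'bob', 5.0);
    """
)

# GROUP BY collapses rows: one total per customer
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# A window function aggregates without collapsing rows:
# a running total per customer, one output row per order
running = conn.execute(
    """
    SELECT order_id, customer,
           SUM(amount) OVER (PARTITION BY customer ORDER BY order_id) AS running_total
    FROM orders
    ORDER BY order_id
    """
).fetchall()
```

Comparing the two results side by side is a quick way to internalize the difference between grouped aggregation and windowed aggregation.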

3. Scripting

Knowledge of a scripting language such as bash or Python is very helpful for automating the multiple steps required to process data. The key concepts to know are

  1. Basic data structures and concepts, such as lists, dictionaries, map, filter, and reduce

  2. Control flow and looping concepts, such as if statements, for loops, and list comprehensions (Python)

  3. A popular data processing library, such as pandas or Dask in Python
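As a quick sketch of the map/filter/reduce concepts above in Python (the file sizes and the 50 MB threshold are made-up illustrations):

```python
from functools import reduce

# Hypothetical daily file sizes, in MB
sizes_mb = [120, 45, 300, 80, 15]

# filter: keep only files above an assumed 50 MB threshold
large = list(filter(lambda s: s >= 50, sizes_mb))

# map: convert MB to GB
large_gb = list(map(lambda s: s / 1024, large))

# reduce: fold the whole list down to a single total
total_mb = reduce(lambda acc, s: acc + s, sizes_mb, 0)

# The same filter + map expressed as a list comprehension,
# which is often the more idiomatic form in Python
large_gb_comp = [s / 1024 for s in sizes_mb if s >= 50]
```

The list comprehension and the filter/map pipeline produce the same result; knowing both forms helps when reading other people's code.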

4. Distributed Data Storage

Knowledge of how a distributed data store such as HDFS or AWS S3 works, including concepts like data replication, serialization, partitioned data storage, and file chunking.
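Partitioned data storage can be sketched on a local filesystem: the hive-style `key=value` directory layout below mirrors how data is commonly laid out on HDFS or S3. The records and field names are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative records to be partitioned by date
records = [
    {"date": "2020-03-20", "user": "a", "amount": 10},
    {"date": "2020-03-20", "user": "b", "amount": 20},
    {"date": "2020-03-21", "user": "a", "amount": 5},
]

# Write each date's rows under its own directory,
# e.g. base/date=2020-03-20/part-0.csv
base = Path(tempfile.mkdtemp())
for date in {r["date"] for r in records}:
    part_dir = base / f"date={date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "user", "amount"])
        writer.writeheader()
        writer.writerows([r for r in records if r["date"] == date])

# A query filtering on date can now read only the matching directory
partitions = sorted(p.name for p in base.iterdir())
```

This is the idea behind partition pruning: because the partition key is encoded in the path, an engine can skip entire directories instead of scanning every file.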

5. Distributed Data processing

Knowledge of how data is processed in a distributed fashion. The key concepts to know are

  1. Distributed data processing concepts, such as MapReduce, and in-memory data processing engines such as Apache Spark

  2. Different types of joins across data sets, such as map side and reduce side joins

  3. Common techniques and patterns for data processing, such as partitioning, reducing data shuffles, and handling data skew when partitioning

  4. Optimizing data processing code to take advantage of all the cores and memory available in the cluster
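The MapReduce model mentioned above can be sketched on a single machine: map emits key/value pairs, a shuffle groups them by key (the expensive network step on a real cluster), and reduce aggregates each group. The documents below are made up; a word count is the classic example.

```python
from collections import defaultdict
from itertools import chain

# Illustrative input "documents"
docs = ["spark is fast", "hadoop and spark", "spark is popular"]

# Map phase: each document emits (word, 1) pairs
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle phase: group values by key; on a cluster this is the
# step that moves data between machines
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values
counts = {word: sum(vals) for word, vals in grouped.items()}
```

Seeing the shuffle as its own phase makes it clearer why the optimization patterns above (partitioning, reducing shuffles, handling skew) matter: they all target this grouping step.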

6. Building data pipelines

Knowledge of how to connect different data systems to build a data pipeline. The key concepts to know are

  1. A data orchestration tool, such as Apache Airflow

  2. Common pitfalls and how to avoid them, such as adding data quality checks after processing

  3. Building idempotent data pipelines
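Idempotency is worth a concrete sketch: an idempotent load step overwrites the output for its run date rather than appending, so re-running a failed or backfilled day produces the same result. The `warehouse` dict and function names below are illustrative stand-ins for a real table and pipeline.

```python
# Pretend warehouse table, keyed by partition (run date)
warehouse = {}


def run_pipeline(run_date, source_rows):
    """One pipeline run for a given date; safe to re-run."""
    # Transform step (deliberately trivial here)
    transformed = [r * 2 for r in source_rows]
    # Idempotent write: replace the whole partition, never append.
    # An appending write would duplicate rows on every re-run.
    warehouse[run_date] = transformed


run_pipeline("2020-03-20", [1, 2, 3])
run_pipeline("2020-03-20", [1, 2, 3])  # re-run: no duplicates
```

With this pattern, an orchestrator like Airflow can safely retry a failed task without any manual cleanup.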

7. OLAP database

Knowledge of how OLAP databases operate and when to use them. The key concepts to know are

  1. What a column store is and why it is better suited for most types of aggregation queries

  2. Data modeling concepts such as partitioning, facts and dimensions, and data skew

  3. Figuring out client query patterns and designing your database accordingly
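A toy illustration of why column stores favor aggregations: a row store must walk every tuple and pick out one field, while a column store keeps each column contiguous and scans only the one being aggregated. The data below is made up; real columnar engines add compression and vectorized execution on top of this layout.

```python
# The same three records, in two layouts.

# Row store: one tuple per record
row_store = [
    ("2020-03-20", "US", 10.0),
    ("2020-03-20", "CA", 20.0),
    ("2020-03-21", "US", 5.0),
]

# Column store: one list per column
col_store = {
    "date": ["2020-03-20", "2020-03-20", "2020-03-21"],
    "country": ["US", "CA", "US"],
    "amount": [10.0, 20.0, 5.0],
}

# SUM(amount): the column store scans one contiguous list...
total_columnar = sum(col_store["amount"])

# ...while the row store touches every row and extracts one field
total_row = sum(r[2] for r in row_store)
```

Both layouts give the same answer; the difference is how much data must be read to get it, which is what dominates analytical query performance.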

8. Queuing systems

Knowledge of queuing systems and when and how to use them. The key concepts to know are

  1. What data producers and consumers are

  2. Knowledge of offsets and log compaction
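The producer/consumer and offset concepts can be modeled with a plain Python list standing in for a Kafka-style append-only log (all names below are illustrative): the producer appends records, and the consumer tracks an offset so it can resume where it left off.

```python
# Toy model of one partition of a Kafka-style log
log = []                # append-only log of records
committed_offset = 0    # the consumer's last processed position


def produce(record):
    """Producer: append a record to the end of the log."""
    log.append(record)


def consume(max_records):
    """Consumer: read from the committed offset and advance it."""
    global committed_offset
    batch = log[committed_offset:committed_offset + max_records]
    committed_offset += len(batch)
    return batch


produce("event-1")
produce("event-2")
produce("event-3")
first = consume(2)   # reads the first two events
second = consume(2)  # resumes at the committed offset
```

Because the offset, not the log, records consumer progress, multiple consumers can read the same log independently, each at their own pace.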

9. Stream processing

Knowledge of what stream processing is and how to use it. The key concepts to know are

  1. What is stream processing and how is it different from batch processing

  2. Different types of stream processing, such as event-based processing and micro-batching
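The two styles above can be sketched with generators: event-based processing handles each record as it arrives, while micro-batching groups the same unbounded stream into small batches (the approach behind Spark's original streaming engine). The event stream and batch size below are made up for illustration.

```python
from itertools import islice


def event_stream():
    """Illustrative unbounded-ish source of events."""
    for i in range(7):
        yield {"event_id": i}


def micro_batches(stream, batch_size):
    """Group an iterator of events into fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Event-based: process one record at a time as it arrives
processed = [e["event_id"] for e in event_stream()]

# Micro-batching: the same stream, processed in batches of 3
batch_sizes = [len(b) for b in micro_batches(event_stream(), 3)]
```

Micro-batching trades a little latency for throughput and simpler fault tolerance, which is why both styles persist in real systems.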

10. JVM language

Knowledge of a JVM-based language such as Java or Scala will be extremely useful, since most open source data processing tools (e.g., Apache Spark, Apache Flink) are written in JVM languages.


Land your dream Data Engineering job with my free book!


Are you looking to enter the field of data engineering? And are you

> Overwhelmed by all the concepts/jargon/frameworks of data engineering?

> Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

Sign up for my free Data Engineering 101 Course. You will get

✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

✅ Executable code to practice and exercises to test yourself.

✅ Weekly email for 4 weeks with the exercise solutions.

Join now and get started on your data engineering journey!

    Testimonials:

    I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
    I have learned a lot from the course which is much more practical.
    This course helped me build a project and actually land a data engineering job! Thank you.

    When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.