6 Data Engineering Skills To Progress in the Age of AI

With AI usage skyrocketing, there is uncertainty about our future as data engineers. How do we adapt? What skills do we need to continue growing in our careers?

The answer (as it always has been) is to focus all our efforts on enabling our users to make data-driven decisions.

As data engineers, we are responsible for delivering easy-to-use and reliable data to our users.

AI has made code generation cheap. But we still need to understand what to build, why to build it, and how to fix what we build.

In this post, we go over 6 data engineering concepts that will keep you in demand.

As you read, go over the linked resources, take notes, and digest the information;¹ you will see how each new tool/framework/system is a means of delivering reliable data to users.

6 Data Engineering Concepts To Always Be in Demand

This image covers everything you need to focus on to always be in demand as a data engineer.

Transformations in SQL; Python runs them in order

Most data is tabular, and SQL provides the right abstraction for working with it. In data processing and analytics, the main patterns are:

Reading data: select, where, db functions
Enriching data: joins
Identifying trends with metrics: window functions, group by
Storing data: merge into, insert, overwrite
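The four patterns above can be sketched in one small script. This is a minimal illustration, not a production setup: it assumes an in-memory SQLite database (with window function support, SQLite 3.25+) as a stand-in for a warehouse, and the tables `customers`, `orders`, and `country_revenue` are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a warehouse; all table names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        amount REAL
    );
    CREATE TABLE country_revenue (country TEXT PRIMARY KEY, revenue REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES
        (10, 1, '2024-01-01', 50.0),
        (11, 1, '2024-01-02', 70.0),
        (12, 2, '2024-01-01', 20.0);
""")

# Reading data: select + where
big_orders = conn.execute(
    "SELECT order_id FROM orders WHERE amount > 40 ORDER BY order_id"
).fetchall()

# Enriching data (join) and identifying trends (window function):
# running order amount per customer, with the customer's country attached.
running = conn.execute("""
    SELECT o.order_id,
           c.country,
           SUM(o.amount) OVER (
               PARTITION BY o.customer_id ORDER BY o.order_date
           ) AS running_amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# Storing data: aggregate with group by and insert into a target table.
conn.execute("""
    INSERT INTO country_revenue (country, revenue)
    SELECT c.country, SUM(o.amount)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.country
""")
country_revenue = conn.execute(
    "SELECT country, revenue FROM country_revenue ORDER BY country"
).fetchall()
```

SQLite has no `MERGE INTO`, so the storing step here uses a plain insert; a warehouse that supports merge would typically use it to update existing rows instead.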

With SQL, we define how to process data. But we still need to run it; this is where Python comes in.

Most data pipelines involve interacting with multiple systems. Python has client libraries for almost all of them, which is why it is used as the glue code in data pipelines.

For example:

[Extract] Pull data from an API → [Load] Dump it into S3 → [Transform] Run a SQL query that reads data from S3 and transforms it.

In this pipeline, extraction and loading are performed in Python, and data transformation is performed in SQL.
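A rough sketch of that pipeline shape is below. To keep it self-contained, it makes several assumptions: `extract` returns hard-coded rows instead of calling a real API, a temporary local folder stands in for S3, and SQLite stands in for the engine that would query the files. All function and table names are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def extract() -> list:
    """Stand-in for an API call; returns hard-coded rows."""
    return [
        {"user_id": 1, "amount": 30.0},
        {"user_id": 1, "amount": 12.5},
        {"user_id": 2, "amount": 7.0},
    ]


def load(rows: list, landing_dir: Path) -> Path:
    """Dump raw rows as newline-delimited JSON; a local folder stands in for S3."""
    path = landing_dir / "payments.jsonl"
    path.write_text("\n".join(json.dumps(row) for row in rows))
    return path


def transform(raw_path: Path) -> list:
    """Expose the raw file as a table and let SQL do the transformation."""
    rows = [json.loads(line) for line in raw_path.read_text().splitlines()]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_payments (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO raw_payments VALUES (:user_id, :amount)", rows)
    return conn.execute("""
        SELECT user_id, SUM(amount) AS total_amount
        FROM raw_payments
        GROUP BY user_id
        ORDER BY user_id
    """).fetchall()


def run_pipeline() -> list:
    """Python is the glue: it calls each step in order and passes results along."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = load(extract(), Path(tmp))
        return transform(raw_path)


result = run_pipeline()
```

The Python functions only move data and sequence the steps; the business logic (the aggregation) lives in the SQL statement.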

Learning resources

Important

AI can write the code for you, but organizing it and ensuring it does what it’s supposed to is still a DE’s responsibility.

Data model & its storage format impact usability

A good data model enables users to quickly answer their questions. For example, with proper fact and dimension modeling, a user can answer most questions about the business (assuming the relevant data is collected).

Different users require different levels of sophistication. A data engineer may work with facts and dimensions, whereas a non-technical user may work with a summary table.

For this purpose, data is typically transformed in stages. The terminology varies, but fundamentally the flow is: data as is from the source → data type conversions → modeled as Kimball facts and dimensions → summary tables.
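As a minimal sketch of that flow (source data as is, type conversions, Kimball-style modeling, summary tables), the script below builds each stage as a table. It assumes an in-memory SQLite database with window function support; the tables `raw_sales`, `typed_sales`, `dim_store`, `fact_sales`, and `daily_store_sales` are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Data as is from the source: everything arrives as text.
    CREATE TABLE raw_sales (sale_id TEXT, store_name TEXT, sold_at TEXT, amount TEXT);
    INSERT INTO raw_sales VALUES
        ('1', 'Downtown', '2024-03-01', '19.50'),
        ('2', 'Downtown', '2024-03-01', '5.50'),
        ('3', 'Airport',  '2024-03-02', '40.00');

    -- Data type conversions.
    CREATE TABLE typed_sales AS
    SELECT CAST(sale_id AS INTEGER) AS sale_id,
           store_name,
           DATE(sold_at) AS sold_date,
           CAST(amount AS REAL) AS amount
    FROM raw_sales;

    -- Kimball-style model: one dimension and one fact table.
    CREATE TABLE dim_store AS
    SELECT ROW_NUMBER() OVER (ORDER BY store_name) AS store_key, store_name
    FROM (SELECT DISTINCT store_name FROM typed_sales);

    CREATE TABLE fact_sales AS
    SELECT t.sale_id, d.store_key, t.sold_date, t.amount
    FROM typed_sales t
    JOIN dim_store d ON d.store_name = t.store_name;

    -- Summary table for non-technical users.
    CREATE TABLE daily_store_sales AS
    SELECT d.store_name, f.sold_date, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_store d ON d.store_key = f.store_key
    GROUP BY d.store_name, f.sold_date;
""")

summary = conn.execute("""
    SELECT store_name, sold_date, total_amount
    FROM daily_store_sales
    ORDER BY store_name
""").fetchall()
```

A data engineer would typically query `fact_sales` and `dim_store` directly, while a non-technical user would only need `daily_store_sales`.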

We also need to ensure that the data is physically stored for efficient processing. Most cloud providers charge based on the amount of data processed, and the more data to process, the slower the query.

In a data warehouse, data is read more often than it is written, so storing data optimized for reads is critical.
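One common way to optimize storage for reads is partitioning, so that a query only touches the files it needs. The sketch below illustrates the idea with plain CSV files in a temporary folder; real warehouses use columnar formats and their own partitioning features, and the `sale_date=<date>` folder layout and function names here are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows: list, root: Path) -> None:
    """Write rows to root/sale_date=<date>/part.csv, one folder per date."""
    by_date = {}
    for row in rows:
        by_date.setdefault(row["sale_date"], []).append(row)
    for sale_date, part_rows in by_date.items():
        part_dir = root / f"sale_date={sale_date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["sale_date", "amount"])
            writer.writeheader()
            writer.writerows(part_rows)


def total_for_date(root: Path, sale_date: str) -> tuple:
    """Read only the partition for sale_date; return (total, files_read)."""
    files = list((root / f"sale_date={sale_date}").glob("*.csv"))
    total = 0.0
    for path in files:
        with open(path, newline="") as f:
            total += sum(float(r["amount"]) for r in csv.DictReader(f))
    return total, len(files)


rows = [
    {"sale_date": "2024-03-01", "amount": 10.0},
    {"sale_date": "2024-03-01", "amount": 15.0},
    {"sale_date": "2024-03-02", "amount": 40.0},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(rows, root)
    partitions_on_disk = len(list(root.iterdir()))
    total, files_read = total_for_date(root, "2024-03-01")
```

The query for a single date opens one file even though two partitions exist on disk; with billing based on data processed, skipping partitions this way reduces both cost and query time.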

Learning resources

Important

AI can speed up table creation (DDLs), query pattern analysis, and other grunt work.

But we still need to think through which tradeoffs to make and what to build.

Ensure you get the correct data to users on time

Data quality is extremely important, as decisions made based on incorrect data are very hard to reverse.

You need to know the types of DQ checks to run, how to run them, and how to fix issues.
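To make this concrete, here is a minimal sketch of three common check types (not-null, uniqueness, and value range) written as plain Python functions. The `CheckResult` class, the function names, and the sample `orders` rows are all hypothetical; in practice these checks are usually run by a data quality library or as SQL against the warehouse.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    failing_rows: int


def check_not_null(rows: list, column: str) -> CheckResult:
    """Fail if any row has a missing value in the column."""
    bad = sum(1 for r in rows if r.get(column) is None)
    return CheckResult(f"not_null({column})", bad == 0, bad)


def check_unique(rows: list, column: str) -> CheckResult:
    """Fail if the column contains duplicate values."""
    values = [r.get(column) for r in rows]
    dupes = len(values) - len(set(values))
    return CheckResult(f"unique({column})", dupes == 0, dupes)


def check_range(rows: list, column: str, low: float, high: float) -> CheckResult:
    """Fail if any non-null value falls outside [low, high]."""
    bad = sum(
        1 for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    )
    return CheckResult(f"range({column})", bad == 0, bad)


orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},   # missing amount
    {"order_id": 2, "amount": -5.0},   # duplicate id and negative amount
]

results = [
    check_not_null(orders, "order_id"),
    check_not_null(orders, "amount"),
    check_unique(orders, "order_id"),
    check_range(orders, "amount", 0, 10_000),
]
failed = [r.name for r in results if not r.passed]
```

What to do with `failed` (block the pipeline, alert, or quarantine rows) depends on the business context and tolerance thresholds.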

You need an orchestrator to run your pipelines on a schedule. Learn Airflow.
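At its core, an orchestrator runs tasks in dependency order. The sketch below shows only that core idea using Python's standard-library `graphlib`; it is not Airflow code, and it leaves out scheduling, retries, and alerting, which a real orchestrator provides. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

run_log = []


def extract():
    run_log.append("extract")


def load():
    run_log.append("load")


def transform():
    run_log.append("transform")


def dq_check():
    run_log.append("dq_check")


tasks = {
    "extract": extract,
    "load": load,
    "transform": transform,
    "dq_check": dq_check,
}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "load": {"extract"},
    "transform": {"load"},
    "dq_check": {"transform"},
}

# Run every task once, upstream tasks first.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()
```

Declaring dependencies as data, rather than hard-coding the call order, is the same idea an Airflow DAG expresses.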

Learning resources

Important

Data quality requires understanding the business context, tolerance thresholds, and stakeholder communication.

A data engineer needs to think carefully about these; they cannot be easily automated by AI.

Design patterns help you build maintainable systems

Your pipelines should be easy to maintain. Best practices and design principles exist to ensure this.

Learning resources

Important

Best practices are about tradeoffs and knowing when to break them.

Human design + AI code generation will take you far.

Clearly define requirements before building

Data products are complex systems. Before you build any data set, ensure you have a high-level understanding of your business using the Bus Matrix.

Then make sure to gather and agree on the requirements with your end user. This will ensure your work does not go to waste.

Learning resources

Important

Talking to stakeholders cannot be replaced by AI.

Use AI to assist with notes and other grunt work.

LLMs can generate code; you need to understand it

Use LLMs to speed up development. However, you need to understand what you are building. LLMs are broadly used for:

Development & debugging
Enabling users with RAGs (this requires metadata and semantic information)
Documenting your systems and data model
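To show why metadata matters for RAG, here is a small sketch that picks the tables relevant to a user question and builds the schema context an LLM would receive. It makes strong simplifying assumptions: retrieval is naive keyword overlap rather than embeddings, no LLM is called, and `TABLE_METADATA`, the table names, and the function names are all hypothetical.

```python
import re

# Hypothetical table metadata; without descriptions like these,
# there is nothing to match a user question against.
TABLE_METADATA = {
    "fact_orders": {
        "description": "One row per customer order with order amount and order date",
        "columns": {
            "order_id": "unique order identifier",
            "customer_key": "link to dim_customer",
            "amount": "order value in USD",
        },
    },
    "dim_customer": {
        "description": "One row per customer with the country they signed up from",
        "columns": {
            "customer_key": "surrogate key",
            "country": "customer country code",
        },
    },
    "fact_page_views": {
        "description": "One row per page view on the website",
        "columns": {
            "view_id": "unique page view identifier",
            "url": "page address",
        },
    },
}

STOPWORDS = {"what", "is", "the", "per", "a", "of", "on", "one", "row", "with"}


def tokenize(text: str) -> set:
    """Lowercase words, minus a small stopword list."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve_tables(question: str, metadata: dict, top_k: int = 2) -> list:
    """Rank tables by keyword overlap between the question and table metadata."""
    question_tokens = tokenize(question)
    scored = []
    for table, meta in metadata.items():
        doc = " ".join([table, meta["description"], *meta["columns"], *meta["columns"].values()])
        scored.append((len(question_tokens & tokenize(doc)), table))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [table for score, table in scored[:top_k] if score > 0]


def build_prompt(question: str, metadata: dict) -> str:
    """Assemble schema context for the retrieved tables plus the question."""
    lines = []
    for table in retrieve_tables(question, metadata):
        lines.append(f"Table {table}: {metadata[table]['description']}")
        for column, description in metadata[table]["columns"].items():
            lines.append(f"  - {column}: {description}")
    context = "\n".join(lines)
    return f"Schema context:\n{context}\n\nQuestion: {question}\nWrite a SQL query that answers the question."


question = "What is the total order amount per customer country?"
relevant = retrieve_tables(question, TABLE_METADATA)
prompt = build_prompt(question, TABLE_METADATA)
```

The quality of the generated SQL depends on the table and column descriptions being accurate, which is why documenting the data model is a prerequisite for this use case.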

Learning resources

Building a Text-to-SQL RAG

Conclusion

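To recap, we saw:

Transformations in SQL; Python runs them in order
Data model & its storage format impact usability
Ensure you get the correct data to users on time
Design patterns help you build maintainable systems
Clearly define requirements before building
LLMs can generate code; you need to understand it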

These skills will hold long into the future. Keep improving on these, experiment with AI, and you will always be in demand.

Finally, we work with humans, so be friendly.

Footnotes

¹ How AI Impacts Skill Formation
