With AI usage skyrocketing, there is uncertainty about our future as data engineers. How do we adapt? What skills do we need to continue growing in our careers?
The answer (as it always has been) is to focus all our efforts on enabling our users to make data-driven decisions.
As data engineers, we are responsible for delivering easy-to-use and reliable data to our users.
AI has made code generation cheap, but we still need to understand what to build, why to build it, and how to fix what we build.
In this post, we go over 6 data engineering concepts that will keep you in demand.
As you read, go over the linked resources, take notes, and digest the information; you will see how each new tool/framework/system is a means of delivering reliable data to users.
6 Data Engineering Concepts To Always Be in Demand
This image covers everything you need to focus on to always be in demand as a data engineer.
Transformations in SQL; Python runs them in order
Most data is tabular, and SQL provides the right abstraction for working with it. In data processing and analytics, the main patterns are:
- Reading data: select, where, db functions
- Enriching data: joins
- Identifying trends with metrics: window functions, group by
- Storing data: merge into, insert, overwrite
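To make these patterns concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables (orders, customers, region_sales) and their values are made up for illustration. SQLite has no MERGE INTO, so its INSERT ... ON CONFLICT upsert stands in for it here; the sketch also assumes a SQLite build recent enough to support window functions and upserts (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER,
        order_date TEXT, amount REAL
    );
    CREATE TABLE region_sales (region TEXT PRIMARY KEY, total_amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES
        (1, 1, '2024-01-01', 10.0),
        (2, 1, '2024-01-02', 20.0),
        (3, 2, '2024-01-01', 5.0);
""")

# Reading data: select + where
large_orders = conn.execute(
    "SELECT order_id, amount FROM orders WHERE amount >= 10 ORDER BY order_id"
).fetchall()

# Enriching data: join orders with customer attributes
enriched = conn.execute("""
    SELECT o.order_id, c.region, o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Identifying trends: window function (running total per customer)
running_totals = conn.execute("""
    SELECT order_id,
           SUM(amount) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS running_total
    FROM orders
    ORDER BY order_id
""").fetchall()

# Storing data: group by + upsert (SQLite's stand-in for MERGE INTO).
# "WHERE true" avoids a parsing ambiguity SQLite documents for
# INSERT ... SELECT combined with ON CONFLICT.
conn.execute("""
    INSERT INTO region_sales (region, total_amount)
    SELECT c.region, SUM(o.amount)
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE true
    GROUP BY c.region
    ON CONFLICT(region) DO UPDATE SET total_amount = excluded.total_amount
""")
region_sales = conn.execute(
    "SELECT region, total_amount FROM region_sales ORDER BY region"
).fetchall()

print(large_orders, enriched, running_totals, region_sales, sep="\n")
```

Running the upsert a second time leaves region_sales unchanged, which is the property you usually want from a "merge into" style write.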
With SQL, we define how to process data. But we still need to run it; this is where Python comes in.
Most data pipelines involve interacting with multiple systems. Python has APIs for almost all systems, and it is used as the glue code in data pipelines.
For example:
[Extract] Pull data from an API → [Load] Dump it into S3 → [Transform] Run a SQL query that reads data from S3 and transforms it.
In this pipeline, extraction and loading are performed in Python, and data transformation is performed in SQL.
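The pipeline above can be sketched end to end with only the standard library. Everything here is a stand-in chosen so the example runs locally: extract() returns hard-coded records instead of calling an API, a temporary directory plays the role of S3, and an in-memory SQLite database plays the role of the warehouse. All function, file, and column names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract() -> list[dict]:
    """[Extract] Stand-in for an API call: returns hard-coded raw records."""
    return [
        {"user_id": "1", "event": "click", "value": "3"},
        {"user_id": "1", "event": "click", "value": "4"},
        {"user_id": "2", "event": "view", "value": "1"},
    ]


def load(records: list[dict], landing_dir: Path) -> Path:
    """[Load] Dump the raw records to a file (stand-in for S3)."""
    path = landing_dir / "events.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "event", "value"])
        writer.writeheader()
        writer.writerows(records)
    return path


def transform(raw_path: Path) -> list[tuple]:
    """[Transform] Read the landed file and transform it with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, value TEXT)")
    with raw_path.open(newline="") as f:
        rows = [(r["user_id"], r["event"], r["value"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)
    return conn.execute("""
        SELECT CAST(user_id AS INTEGER) AS user_id,
               SUM(CAST(value AS INTEGER)) AS total_value
        FROM raw_events
        GROUP BY CAST(user_id AS INTEGER)
        ORDER BY 1
    """).fetchall()


def run_pipeline() -> list[tuple]:
    """Python is the glue: it runs the steps in order and passes data along."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = load(extract(), Path(tmp))
        return transform(raw_path)


result = run_pipeline()
print(result)
```

Note the split of responsibilities: Python moves data between systems, while the business logic (casting and aggregating) lives in the SQL statement.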
Learning resources
AI can write the code for you, but organizing it and ensuring it does what it’s supposed to is still a DE’s responsibility.
Data model & its storage format impact usability
A good data model will enable a user to quickly answer any questions they have. For example, with proper fact and dimension modeling, a user can answer any question about the business (assuming the data is collected).
Different users require different levels of sophistication. A data engineer may work with facts and dimensions, whereas a non-technical user may work with a summary table.
For this purpose, data is typically transformed in 3 stages. The terminology varies, but fundamentally, starting from the data as is from the source, the stages are: data type conversions → Kimball modeling (facts and dimensions) → summary tables.
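As a rough illustration of that flow, the sketch below builds each layer as a table in an in-memory SQLite database. The table and column names (raw_orders, typed_orders, dim_customer, fact_orders, daily_sales) are hypothetical, and the model is deliberately tiny: one dimension and one fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source data as is: everything arrives as text.
    CREATE TABLE raw_orders (
        order_id TEXT, customer_name TEXT, order_date TEXT, amount TEXT
    );
    INSERT INTO raw_orders VALUES
        ('1', 'Ada', '2024-01-01', '10.50'),
        ('2', 'Ada', '2024-01-02', '4.50'),
        ('3', 'Bob', '2024-01-02', '20.00');

    -- Data type conversions.
    CREATE TABLE typed_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           customer_name,
           DATE(order_date) AS order_date,
           CAST(amount AS REAL) AS amount
    FROM raw_orders;

    -- Kimball-style model: a customer dimension with a surrogate key...
    CREATE TABLE dim_customer AS
    SELECT ROW_NUMBER() OVER (ORDER BY customer_name) AS customer_key,
           customer_name
    FROM (SELECT DISTINCT customer_name FROM typed_orders);

    -- ...and an orders fact that references it.
    CREATE TABLE fact_orders AS
    SELECT t.order_id, d.customer_key, t.order_date, t.amount
    FROM typed_orders t
    JOIN dim_customer d ON t.customer_name = d.customer_name;

    -- Summary table for non-technical users.
    CREATE TABLE daily_sales AS
    SELECT order_date, COUNT(*) AS num_orders, SUM(amount) AS total_amount
    FROM fact_orders
    GROUP BY order_date;
""")

summary = conn.execute(
    "SELECT order_date, num_orders, total_amount FROM daily_sales ORDER BY order_date"
).fetchall()
print(summary)
```

A data engineer would typically query fact_orders and dim_customer directly, while a dashboard user only needs daily_sales.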
We also need to ensure that the data is physically stored for efficient processing. Most cloud providers charge based on the amount of data processed, and the more data to process, the slower the query.
In a data warehouse, data is read more often than it is written, so storing data optimized for reads is critical.
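One common way to store data optimized for reads is partitioning: laying files out by a column that queries usually filter on, so a query touches only the files it needs. The sketch below is a toy version under stated assumptions: CSV files in a temporary directory stand in for a table in object storage, order_date is assumed to be the common filter column, and all names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"order_date": "2024-01-01", "amount": "10"},
    {"order_date": "2024-01-01", "amount": "5"},
    {"order_date": "2024-01-02", "amount": "7"},
    {"order_date": "2024-01-03", "amount": "3"},
]


def write_partitioned(rows: list[dict], root: Path) -> None:
    """Write one folder per order_date, e.g. root/order_date=2024-01-01/data.csv."""
    for row in rows:
        part_dir = root / f"order_date={row['order_date']}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "data.csv"
        is_new_file = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.writer(f)
            if is_new_file:
                writer.writerow(["amount"])  # header only once per file
            writer.writerow([row["amount"]])


def total_for_date(root: Path, order_date: str) -> tuple[int, int]:
    """Return (total amount, rows scanned), reading only the matching partition."""
    path = root / f"order_date={order_date}" / "data.csv"
    if not path.exists():
        return 0, 0
    with path.open(newline="") as f:
        amounts = [int(r["amount"]) for r in csv.DictReader(f)]
    return sum(amounts), len(amounts)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(rows, root)
    total, rows_scanned = total_for_date(root, "2024-01-01")

print(f"total={total}, scanned {rows_scanned} of {len(rows)} rows")
```

The query for a single day reads only that day's folder instead of the whole table, which is what reduces both the amount of data processed and the query time.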
Learning resources
AI can speed up table creation (DDLs), query pattern analysis, and the grunt work.
But we still need to think through the tradeoffs and decide what to build.
Ensure you get the correct data to users on time
Data quality is extremely important because decisions made on incorrect data are almost impossible to undo.
You need to know the types of DQ checks to run, how to run them, and how to fix issues.
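As an example of what such checks can look like, here is a small sketch of four common ones (not-null, uniqueness, value range, and row count within a tolerance) written in plain Python. The check names, the sample orders data, and the thresholds are all assumptions for illustration; in practice the thresholds come from the business context.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows, column):
    nulls = sum(1 for row in rows if row.get(column) is None)
    return CheckResult(f"not_null({column})", nulls == 0, f"{nulls} null value(s)")


def check_unique(rows, column):
    values = [row[column] for row in rows if row.get(column) is not None]
    duplicates = len(values) - len(set(values))
    return CheckResult(f"unique({column})", duplicates == 0, f"{duplicates} duplicate value(s)")


def check_range(rows, column, low, high):
    out_of_range = sum(
        1 for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    )
    return CheckResult(f"range({column})", out_of_range == 0, f"{out_of_range} value(s) outside [{low}, {high}]")


def check_row_count(rows, expected, tolerance):
    """Pass if the row count is within `tolerance` (a fraction) of `expected`."""
    deviation = abs(len(rows) - expected) / expected
    return CheckResult("row_count", deviation <= tolerance, f"{len(rows)} rows, expected about {expected}")


# Hypothetical batch with a duplicate id, a negative amount, and a missing amount.
orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": 2, "amount": 7.5},
    {"order_id": 3, "amount": None},
]

results = [
    check_not_null(orders, "order_id"),
    check_unique(orders, "order_id"),
    check_not_null(orders, "amount"),
    check_range(orders, "amount", low=0, high=1000),
    check_row_count(orders, expected=4, tolerance=0.1),
]
failed = [r.name for r in results if not r.passed]

for r in results:
    print(f"{'PASS' if r.passed else 'FAIL'} {r.name}: {r.detail}")
```

A pipeline would typically run checks like these before publishing a table and stop (or alert) when `failed` is not empty, so bad data never reaches users.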
You need an orchestrator to run your pipelines on a schedule. Learn Airflow.
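To show the core idea behind an orchestrator without depending on any particular tool, the sketch below runs tasks in dependency order using the standard library's graphlib module. This is not Airflow's API; it is a toy model of the dependency graph that an orchestrator manages. The task names are hypothetical, and a real orchestrator adds scheduling, retries, alerting, and backfills on top.

```python
from graphlib import TopologicalSorter

run_log = []  # records the order in which tasks actually ran


def extract():
    run_log.append("extract")


def load():
    run_log.append("load")


def transform():
    run_log.append("transform")


def quality_check():
    run_log.append("quality_check")


tasks = {
    "extract": extract,
    "load": load,
    "transform": transform,
    "quality_check": quality_check,
}

# Each task maps to the set of upstream tasks that must finish first.
dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "quality_check": {"transform"},
}


def run_once(tasks, dependencies):
    """Run every task exactly once, upstream tasks before downstream ones."""
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()


run_once(tasks, dependencies)
print(run_log)
```

Declaring dependencies as data, rather than hard-coding the call order, is the same idea an orchestrator's DAG definition expresses.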
Learning resources
Data quality requires understanding the business context, tolerance thresholds, and stakeholder communication.
A data engineer needs to think carefully about these; they cannot be easily automated by AI.
Design patterns help you build maintainable systems
Your pipelines should be easy to maintain. Best practices and design principles exist to ensure this.
Learning resources
- Design Patterns: Data Flow
- Design Patterns: Code Patterns
- Implementation Best Practices
- Metadata and Logging Best Practices
Best practices are about tradeoffs and knowing when to break them.
Human design + AI code generation will take you far.
Clearly define requirements before building
Data products are complex systems. Before you build any data set, ensure you have a high-level understanding of your business using the Bus Matrix.
Then make sure to gather and agree on the requirements with your end user. This will ensure your work does not go to waste.
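The Bus Matrix mentioned above is a grid of business processes (rows) against the dimensions they use (columns). A minimal sketch of it as a data structure follows; the processes and dimensions listed are invented for illustration and would come from conversations with your stakeholders.

```python
# Rows: business processes. Values: the dimensions each process uses.
# All entries are hypothetical examples.
bus_matrix = {
    "orders": {"date", "customer", "product"},
    "shipments": {"date", "customer", "product", "warehouse"},
    "returns": {"date", "customer", "product"},
    "inventory": {"date", "product", "warehouse"},
}


def shared_dimensions(matrix):
    """Map each dimension used by more than one process to those processes."""
    usage = {}
    for process, dimensions in matrix.items():
        for dimension in dimensions:
            usage.setdefault(dimension, []).append(process)
    return {
        dimension: sorted(processes)
        for dimension, processes in usage.items()
        if len(processes) > 1
    }


def render(matrix):
    """Return the matrix as a text grid, marking used dimensions with 'x'."""
    columns = sorted(set().union(*matrix.values()))
    lines = ["process".ljust(12) + "".join(c.ljust(11) for c in columns)]
    for process in sorted(matrix):
        marks = "".join(
            ("x" if c in matrix[process] else "-").ljust(11) for c in columns
        )
        lines.append(process.ljust(12) + marks)
    return "\n".join(lines)


shared = shared_dimensions(bus_matrix)
print(render(bus_matrix))
print(shared["warehouse"])
```

Dimensions shared across several processes are the ones worth modeling once and reusing, which is why the matrix is a useful artifact to agree on before building anything.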
Learning resources
Talking to stakeholders cannot be replaced by AI.
Use AI to assist with notes and other grunt work.
LLMs can generate code; you need to understand it
Use LLMs to speed up development. However, you need to understand what you are building. LLMs are broadly used for:
- Development & debugging
- Enabling users with RAG, which requires metadata and semantic information
- Documenting your systems and data model
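To illustrate why RAG depends on metadata, here is a toy sketch of the retrieval step: given a user question, pick the most relevant tables from a catalog and format their descriptions as context for an LLM prompt. The catalog entries are hypothetical, and the keyword-overlap scoring is a deliberately simple stand-in for the embedding-based search a real system would use. No LLM is called here.

```python
import re

# Hypothetical data catalog: the metadata and semantic information
# that a RAG system would retrieve from.
catalog = [
    {
        "table": "fact_orders",
        "description": "One row per order with order date, customer key, and order amount",
        "columns": ["order_id", "customer_key", "order_date", "amount"],
    },
    {
        "table": "dim_customer",
        "description": "One row per customer with name and region",
        "columns": ["customer_key", "customer_name", "region"],
    },
    {
        "table": "daily_sales",
        "description": "Daily summary of number of orders and total sales amount",
        "columns": ["order_date", "num_orders", "total_amount"],
    },
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, catalog, top_k=2):
    """Rank catalog entries by how many words they share with the question."""
    question_tokens = tokenize(question)
    scored = []
    for entry in catalog:
        entry_text = entry["description"] + " " + " ".join(entry["columns"])
        scored.append((len(question_tokens & tokenize(entry_text)), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


def build_context(entries):
    """Format the retrieved metadata as text to place in an LLM prompt."""
    return "\n".join(
        f"Table {e['table']}: {e['description']}. Columns: {', '.join(e['columns'])}."
        for e in entries
    )


question = "What was the total order amount per customer?"
relevant = retrieve(question, catalog)
context = build_context(relevant)
print(context)
```

If the descriptions and column names in the catalog are missing or misleading, the retrieval step returns the wrong tables, and the LLM answers from the wrong context.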
Learning resources
Conclusion
To recap, we saw:
- Why SQL & Python are still essential
- How data modeling and storage impact usability
- How to ensure data correctness
- How to use best practices for maintainable pipelines
- The importance of defining requirements first
- How LLMs can speed up code generation
These skills will hold long into the future. Keep improving on these, experiment with AI, and you will always be in demand.
Finally, we work with humans, so be friendly.
