Warehouse transformation patterns
- TL;DR: Facts(events) and dimensions (business entity)
- Joins are used to enrich data(left join), or get complete data belonging to the same business entity(inner join), or get any data belonging to the same business entity (full outer join)
- Group by is used to condense numerical information into easy-to-understand numbers (aka metrics)
- UNION/UNION ALL is used to bring together data belonging to the same entity from different systems
- Window functions are used when you need to compare values between rows of your tables
Types of data writing
- insert/append: adds to the table
- overwrite: deletes existing data and inserts new data
- update: changes existing data
- MERGE/Upsert: one command to both update and /or insert rows as per logic
Systematic data processing with CTEs
- Understand your inputs: grain, business unit/event, relationship type (1:1, 1:many, many:many, many:1), completeness of inputs
- define what the output should be ^ + metrics and their calculation
- CTE: filter & group by data to get all the necessary parts in the same grain join the necessary parts and group by (if necessary) midnful of duplication/ incorrect grain when group by ing to compute metrics
- link: cte v temp
Optimizations
- narrow and wide tx: in mem proc & data shuffle
- paritition: partition pruning
- sort: range filter
- bucketing: partition for high cardinality cols
- join strategy: use broadcast join
- query plan: data execution in UI, how to read query plan and quickly identify bottleneck
4 Key Sql Transformation Patterns for Data Engineers
- When & Which JOINS
- When & How GROUP BY
- Use case for UNIONS
- Problems requiring Window Functions
4 Data Write Patterns and When to Use Them
Insert
Overwrite
Delete
Update
Mix: Update, Delete, Insert
Writing Clean Sql With CTEs
- Input understanding
- Defining output schema and metrics
- CTE as a step-by-step process to go from Inputs -> Output
Optimizing Query Performance by Pre-Processing It
- Partition Store
- Bucketing
- Sorting
Optimizing Data Processing by Reducing Shuffle and Scan
- Broadcast Join
- Storage Partition Join