1. Introduction
If you have worked in the data space, you’d inevitably come across tables with so many columns that it gets difficult to remember the exact names. If you are wondering
Is there a more straightforward way to represent data
How to use complex data types to represent relationships in data efficiently
This post is for you. In it, we will discuss how to use complex data types in SQL to represent relationships efficiently, improve developer ergonomics, and potentially reduce metric calculation issues.
By the end of this post, you will know how to use complex data types in your pipelines, their tradeoffs, and how to conceptually consider using complex data types as part of your table.
2. Code & Data
We will use the tpch data for data. The TPC-H data represents a car parts seller’s data warehouse, where we record orders, items that make up that order (lineitem), supplier, customer, part (parts sold), region, nation, and partsupp (parts supplier).

You can run the code on GitHub codespaces, by following this link & then going to ./concepts/nested_data_types/using_nested_data_types.ipynb.
You can also run the code locally using the following commands:
And going to ./concepts/nested_data_types/using_nested_data_types.ipynb.
How to use nested data types in SQL, YouTube Link (Oct 19th 2024 1PM EST) 
3. Using nested data types effectively
When we think of tables, we think of rows and columns. With the advent of nested data types, we can do the following.
- Represent related columns in a table as a
STRUCT. - Combine data from multiple rows into one row with
ARRAY[STRUCT]
While there are other data types (DBengine specific), exploring the STRUCT and ARRAY data types and their use cases will enable you to explore the other data types.
3.1. Use STRUCT for one-to-one & hierarchical relationships
If you are joining tables with a one-to-one relationship, it is a good use case for using STRUCT. Let’s consider that you are building an OBT table called wide_lineitem.
To create wide_lineitem, we must enrich the lineitem fact table with multiple dimensions(customer and supplier). Let’s see how we would do this without a nested data structure.
| l_orderkey | l_quantity | .. | c_custkey | .. | s_suppkey | .. |
|---|---|---|---|---|---|---|
| 1 | 17.00 | .. | 370 | .. | 93 | .. |
| 1 | 36.00 | .. | 370 | .. | 75 | .. |
We will end up with a very wide table, and the column names would be hard to keep track of. Now consider the same, but with a STRUCT data type to store the combined attributes of customer and order entities.

SELECT
l.*,
struct_pack(
id := c.c_custkey,
name := c.c_name,
address := c.c_address,
nationkey := c.c_nationkey,
phone := c.c_phone,
acctbal := c.c_acctbal,
mktsegment := c.c_mktsegment,
comment := c.c_comment
) AS customer,
struct_pack(
id := s.s_suppkey,
name := s.s_name,
address := s.s_address,
nationkey := s.s_nationkey,
phone := s.s_phone,
acctbal := s.s_acctbal,
comment := s.s_comment
) AS supplier
FROM
lineitem l
LEFT JOIN
orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;| l_orderkey | .. | customer | supplier | |
|---|---|---|---|---|
| 1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’} | {‘id’: 93, ‘name’: ‘Supplier#000000093’} | |
| 1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’} | {‘id’: 75, ‘name’: ‘Supplier#000000075’} |
Now, the data for customer and supplier attributes are available as STRUCT types. We can access them using an expressive notation such as customer.name, supplier.address, etc.
Let’s consider a hierarchical relationship. Consider a customer and supplier, each with its own nation attribute. A STRUCT allows us to represent this hierarchical data easily without having to rename the nation table to customer_nation_name, supplier_nation_name, etc. Let’s see how this is done.
-- Hierarchical data
SELECT
l.*,
struct_pack(
id := c.c_custkey,
name := c.c_name,
address := c.c_address,
nationkey := c.c_nationkey,
phone := c.c_phone,
acctbal := c.c_acctbal,
mktsegment := c.c_mktsegment,
comment := c.c_comment,
nation := struct_pack(
nationkey := n.n_nationkey,
name := n.n_name,
regionkey := n.n_regionkey,
comment := n.n_comment
)
) AS customer,
struct_pack(
id := s.s_suppkey,
name := s.s_name,
address := s.s_address,
nationkey := s.s_nationkey,
phone := s.s_phone,
acctbal := s.s_acctbal,
comment := s.s_comment,
nation := struct_pack(
nationkey := sn.n_nationkey,
name := sn.n_name,
regionkey := sn.n_regionkey,
comment := sn.n_comment
)
) AS supplier
FROM
lineitem l
LEFT JOIN
orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
nation n ON c.c_nationkey = n.n_nationkey
LEFT JOIN
supplier s ON l.l_suppkey = s.s_suppkey
LEFT JOIN
nation sn ON s.s_nationkey = sn.n_nationkey
LIMIT 5;| l_orderkey | .. | customer | supplier |
|---|---|---|---|
| 1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’, ‘nation’: {‘nationkey’: 12, ‘name’: ‘JAPAN’, ‘regionkey’: 2, ‘comment’: ’ quickly final packages. furiously i’}} | {‘id’: 93, ‘name’: ‘Supplier#000000093’, ‘nation’: {‘nationkey’: 16, ‘name’: ‘MOZAMBIQUE’, ‘regionkey’: 0, ‘comment’: ’ beans after the carefully regular accounts r’}} |
| 1 | .. | {‘id’: 370, ‘name’: ‘Customer#000000370’, ‘nation’: {‘nationkey’: 12, ‘name’: ‘JAPAN’, ‘regionkey’: 2, ‘comment’: ’ quickly final packages. furiously i’}} | {‘id’: 75, ‘name’: ‘Supplier#000000075’, ‘nation’: {‘nationkey’: 18, ‘name’: ‘CHINA’, ‘regionkey’: 2, ‘comment’: ‘ckly special packages cajole slyly. unusual, unusual theodolites mold furiously. slyly sile’}} |
3.2. Use ARRAY[STRUCT] for one-to-many relationships
When you have a one-to-many relationship(say fact—fact), you’d either aggregate both tables to the same grain (aka what one row represents in that table) or duplicate some facts(numerical data). Duplicated numerical data, if not careful, can lead to incorrect metrics.
To avoid confusion in calculating metrics and have all the necessary data in one place, we can store the lower-grained data as an ARRAY of STRUCTS.
Let’s look at an example where we create a wide table at order grain, but it has lineitems (items that make up an order) data as an ARRAY[STRUCT].

WITH line_items as (
SELECT
l_orderkey as orderkey,
array_agg(struct_pack(
lineitemkey := l.l_linenumber,
partkey := l.l_partkey,
suppkey := l.l_suppkey,
quantity := l.l_quantity,
extendedprice := l.l_extendedprice,
discount := l.l_discount,
tax := l.l_tax,
returnflag := l.l_returnflag,
linestatus := l.l_linestatus,
shipdate := l.l_shipdate,
commitdate := l.l_commitdate,
receiptdate := l.l_receiptdate,
shipinstruct := l.l_shipinstruct,
shipmode := l.l_shipmode,
comment := l.l_comment
)) AS lineitems
FROM
lineitem l
GROUP BY
l_orderkey)
SELECT o.*,
len(l.lineitems) as num_lineitems,
l.lineitems
FROM orders o
LEFT JOIN line_items l
on o.o_orderkey = l.orderkey
LIMIT 5;| orderkey | .. | lineitems |
|---|---|---|
| 2 | .. | [{‘lineitemkey’: 1, ‘partkey’: 1062, ‘suppkey’: 33, ‘quantity’: Decimal(‘38.00’), ‘extendedprice’: Decimal(‘36596.28’), ‘discount’: Decimal(‘0.00’), ‘tax’: Decimal(‘0.05’), ‘returnflag’: ‘N’, ‘linestatus’: ‘O’, ‘shipdate’: ‘1997-01-28’, ‘commitdate’: ‘1997-01-14’, ‘receiptdate’: ‘1997-02-02’, ‘shipinstruct’: ‘TAKE BACK RETURN’, ‘shipmode’: ‘RAIL’, ‘comment’: ‘re. enticingly regular instruct’}] |
In the above code, we aggregate lineitems to order grain as ARRAY[STRUCT] and join it to the order table.
NOTE: We must ensure that the size of a single value in a column does not exceed the allowed limits (check your DB documentation).
3.3. Using nested data types in data processing
We saw how we can model 1:1, 1:m, and hierarchical relationships with nested data types. In this section, we will see how we transform nested data types.
To demonstrate the tips and techniques, let’s create an OBT table called wide_orders, which has orders, lineitems, customer and nation data at order grain.
CREATE TABLE IF NOT EXISTS wide_orders AS
WITH line_items as (
SELECT
l_orderkey as orderkey,
array_agg(struct_pack(
lineitemkey := l.l_linenumber,
partkey := l.l_partkey,
suppkey := l.l_suppkey,
quantity := l.l_quantity,
extendedprice := l.l_extendedprice,
discount := l.l_discount,
tax := l.l_tax,
returnflag := l.l_returnflag,
linestatus := l.l_linestatus,
shipdate := l.l_shipdate,
commitdate := l.l_commitdate,
receiptdate := l.l_receiptdate,
shipinstruct := l.l_shipinstruct,
shipmode := l.l_shipmode,
comment := l.l_comment
)) AS lineitems
FROM
lineitem l
GROUP BY
l_orderkey)
SELECT
o.*,
l.lineitems,
struct_pack(
id := c.c_custkey,
name := c.c_name,
address := c.c_address,
nationkey := c.c_nationkey,
phone := c.c_phone,
acctbal := c.c_acctbal,
mktsegment := c.c_mktsegment,
comment := c.c_comment,
nation := struct_pack(
nationkey := n.n_nationkey,
name := n.n_name,
regionkey := n.n_regionkey,
comment := n.n_comment
)
) AS customer
FROM
orders o
LEFT JOIN
line_items l ON o.o_orderkey = l.orderkey
LEFT JOIN
customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
nation n ON c.c_nationkey = n.n_nationkey;3.3.1. STRUCT enables more straightforward data schema and data access
While structs may seem different, you can use them usually like any other column. All you have to do is use the . notation.
| o_orderkey | customer_name | customer_address | first_lineitem |
|---|---|---|---|
| 1 | Customer#000000370 | DtqbDO5rid | {‘lineitemkey’: 1, ‘partkey’: 1552, ‘suppkey’: 93} |
| 2 | Customer#000000781 | maoqGuL5,rHfX0leqZcFqHqpQH | {‘lineitemkey’: 1, ‘partkey’: 1062, ‘suppkey’: 33} |
In the above query, we access data from the customer struct with the . notation and access the first lineitem struct from an array of lifetime structs using the array notation [].
Note: In DuckDB, the array index starts at 1.
3.3.2. Nested data types can be sorted
Structs can be ordered as any other data type. The db engine compares each value of the struct to define the order.
Let’s see an example: When we order by our customer struct, the id field will be compared since that is the first key in the struct field.
We can also order elements in our lineitems ARRAY with the list_sort function.
3.3.3. UNNEST ARRAY to rows and GROUP rows to ARRAY
When you want to convert a set of rows into an array of structs or vice versa, use the ARRAY_AGG and UNNEST functions, respectively.
Let’s see how to convert a set of rows into an array.
WITH lineitems as (SELECT
o.o_orderkey,
UNNEST(o.lineitems) as line_item
FROM
wide_orders o),
unnested_line_items AS (
SELECT o_orderkey,
line_item.lineitemkey,
line_item.partkey,
line_item.quantity
FROM lineitems
)
SELECT o_orderkey,
array_agg(struct_pack(
line_item_key := lineitemkey,
part_key := partkey,
quantity := quantity)) as lineitems
FROM unnested_line_items
GROUP BY 1
LIMIT 5;Let’s see how to convert an array of elements into individual rows.
3.3.4. Improve OBT usability with nested data types
In wide tables, you often end up with column names like customer_name, supplier_name, etc. Using structs, you can enable more straightforward naming conventions with the. `notation, such ascustomer.name&supplier.name. These are even more applicable with deeply nested structures such ascustomer.nation.name`.
In addition, nested data types also help us avoid incorrect metrics computation. If we store data with multiple grains in the same table, we must be cautious when aggregating the data.
With nested data structure, we can avoid this, as shown below.

We are getting lineitem level metrics from the orders OBT table.
We are getting order-level metrics from the orders OBT table with lineitem data.
3.2. Ensure your performance meets your expectations
By using structs, we add overhead in terms of maintenance (the db engine will need to store the schema per struct), and the DB engine will have to parse out the struct when performing any operation (difficult with nested structs).
You’d want to check the performance of your DB engine + storage format using the EXPLAIN command. More often than not, the only additional step you’d see is a step to UNNEST to convert a single row into multiple rows and STRUCT_EXTRACT to get the required column from the struct data type.
4. Conclusion
To recap we saw
- Using STRUCT for one-to-one & hierarchical relationships
- Using ARRAY of STRUCT for one-to-many relationships
- Using nested data types in data processing
- Improving OBT usability with nested data types
- Checking performance of your queries
If your current data model seem inflexible try out nested data structures. When used correctly it can significanty improve table usability and maintainability.
Please let me know in the comment section below if you have any questions or comments.