Free 10-Minute Polars Tutorial for Data Engineers

Learn how to use Polars to build resilient data pipelines, in 10 minutes. With executable code!

Author

Joseph Machado

Published

August 11, 2025

Keywords

Polars, Data Analysis, Python Data Processing, ETL, Data Pipeline, DataFrame

Introduction

If you work with data, you’ve come across Pandas. If you feel that

  1. Pandas is confusing and wildly different from SQL, the lingua franca for data pipelines
  2. The Pandas API is unintuitive, complex, and over-engineered
  3. Debugging random data type changes is a nightmare
  4. Data gets mangled in unpredictable ways

then this post is for you.

Imagine building intuitive pipelines that are easy to maintain. You will actually find pride in your craft. That is what Polars enables you to do.

In this 10-minute tutorial, you will learn how simple and intuitive Polars is to use. You will learn the key concepts of Polars and use them to build resilient data pipelines.

Setup

Install uv and set up a directory as shown below.

# Set up
uv init polars-tutorial
cd polars-tutorial
uv python install 3.14
uv python pin 3.14
uv add polars
uv venv
uv sync 
source .venv/bin/activate  # or for windows .venv\Scripts\activate
python
  1. Create a Python project
  2. Install Python version 3.14 and set it as the default for the project
  3. Add the polars library
  4. Create a virtual environment
  5. Activate the virtual environment
  6. Start the Python REPL. You should see Python 3.14

10-Minute Polars Tutorial

The image below summarizes the key Polars concepts. Take a few minutes to get an overview.

[Image: Polars Concepts]

Polars Can Read & Write to Multiple Formats

Polars can read data from and write data to multiple formats and systems.

Example: Read data from a URL

import polars as pl

gist_url = "https://gist.githubusercontent.com/josephmachado/85f5c8d73ac840906cce590f657ffb06/raw/8d9d29b1466d49abc9dbf09b21d508f7a1071e69/your_file.csv"

supplier_df = pl.read_csv(gist_url)
supplier_df.write_csv("./supplier_df.csv")
  1. Read data from gist_url
  2. Write the data to a local file

Polars read and write functions follow the pattern below:

  1. read_format reads data from a source, where format can be parquet, csv, etc. These functions have optional parameters to read from S3, cloud storage, and other systems. The data is read into memory and represented as a DataFrame.
  2. write_format writes a DataFrame to a destination in a specific format, e.g., write_parquet.
  3. scan_format and sink_format are the read and write counterparts for processing data in chunks.

The scan_format functions create a lazy representation of the data called a LazyFrame. The sink_format functions only work on LazyFrames.

We will cover LazyAPI in a later section.
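For instance, with Parquet the same naming pattern looks like this (the file paths below are hypothetical and only illustrate the pattern; scan/sink are explained in the LazyAPI section below):

# Eager: read the file into a DataFrame in memory, then write it back out
orders_df = pl.read_parquet("./orders.parquet")
orders_df.write_parquet("./orders_copy.parquet")

# Lazy: scan returns a LazyFrame; sink streams the result to disk in chunks
orders_lazy = pl.scan_parquet("./orders.parquet")
orders_lazy.sink_parquet("./orders_streamed.parquet")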

Dataframe is a Set of Columns

A DataFrame represents a tabular data structure with rows and columns. In Polars, one or more columns (type = Series) make up a DataFrame.

Every column must be of one of the allowed data types.
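For instance, here is a minimal sketch that builds a DataFrame from two Series (the values are made up for illustration):

# A DataFrame is a collection of named, typed Series
names = pl.Series("s_name", ["Supplier#000000001", "Supplier#000000002"])
balances = pl.Series("s_acctbal", [5755.94, 1234.56])
toy_df = pl.DataFrame([names, balances])
toy_df.schema  # Schema([('s_name', String), ('s_acctbal', Float64)])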

Let’s inspect the data we read in as supplier_df:

# Print DataFrame Schema
supplier_df.schema
  1. Prints the column names and data types
Schema([('s_suppkey', Int64),
        ('s_name', String),
        ('s_address', String),
        ('s_nationkey', Int64),
        ('s_phone', String),
        ('s_acctbal', Float64),
        ('s_comment', String)])
# Print shape
supplier_df.shape
  1. Prints the number of rows and columns in supplier_df
(5, 7)

Define Transformations with Expressions

Expressions represent the transformations you want to perform. Let’s define an expression to uppercase the column s_name.

# Defining expression
s_name_uppercase = pl.col("s_name").str.to_uppercase().alias("s_name_uppercase")

def remove_prefix(col_name: str) -> pl.Expr:
    return pl.col(col_name).str.replace_all(r"^.*#", "")
  1. An expression to apply uppercase to the s_name column
  2. A function that creates an expression removing everything up to and including the # in the given column
Note

Expressions represent transformations and will not be executed unless used within a context.

Example: using expressions within a Context.

supplier_df\
.select(
  pl.col("s_name")
  , s_name_uppercase
  , remove_prefix("s_name").alias("prefix_removed_s_name")
)
shape: (5, 3)
s_name s_name_uppercase prefix_removed_s_name
str str str
"Supplier#000000001" "SUPPLIER#000000001" "000000001"
"Supplier#000000003" "SUPPLIER#000000003" "000000003"
"Supplier#000000004" "SUPPLIER#000000004" "000000004"
"Supplier#000000006" "SUPPLIER#000000006" "000000006"
"Supplier#000000009" "SUPPLIER#000000009" "000000009"

Window functions are also a type of expression.

national_ranking_expr = pl.col("s_acctbal")\
.rank("dense", descending=True)\
.over("s_nationkey")\
.alias("national_ranking")


supplier_df\
.select(
  pl.col("s_name")
  , pl.col("s_acctbal")
  , pl.col("s_nationkey")
  , national_ranking_expr
)\
.filter(
  pl.col("national_ranking") <= 3
)\
.sort("s_nationkey") 
  1. Define a window expression & alias the result as national_ranking
  2. Select the window expression
  3. Filter with the result of the window expression
shape: (5, 4)
s_name s_acctbal s_nationkey national_ranking
str f64 i64 u32
"Supplier#000000003" 4192.4 1 1
"Supplier#000000009" 5302.37 10 1
"Supplier#000000006" 1365.79 14 1
"Supplier#000000004" 4641.08 15 1
"Supplier#000000001" 5755.94 17 1

Context Executes Expressions and Returns a New Dataframe

Use a context to execute expressions and create a new DataFrame. Polars has four common contexts:

  1. select to select columns and expressions from an existing DataFrame.
  2. filter to filter a DataFrame based on a given criterion.
  3. group_by to aggregate a DataFrame.
  4. with_columns to add new columns to an existing DataFrame.
# Applying context to supplier_df
supplier_df.select(remove_prefix("s_name"))
  1. Apply remove_prefix within a select context
shape: (5, 1)
s_name
str
"000000001"
"000000003"
"000000004"
"000000006"
"000000009"
supplier_df.filter(
  remove_prefix("s_name").cast(pl.Int32) > 5
)
  1. Filter rows using remove_prefix
shape: (2, 7)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment
i64 str str i64 str f64 str
6 "Supplier#000000006" "zaux5FTzToEg" 14 "24-696-997-4969" 1365.79 " sleep fluffily against the bl…
9 "Supplier#000000009" ",gJ6K2MKveYxQTN 2EMG3pzg" 10 "20-403-398-8662" 5302.37 "asymptotes cajole along the fu…
supplier_df.group_by(
  pl.col("s_nationkey")
).agg(
  pl.len().alias("num_suppliers")
)
  1. Group by s_nationkey and compute count(*) as num_suppliers
shape: (5, 2)
s_nationkey num_suppliers
i64 u32
15 1
14 1
10 1
17 1
1 1
supplier_df.with_columns(
  s_acctbal_100x=pl.col("s_acctbal") * 100
)
  1. Create a new DataFrame with an additional column s_acctbal_100x
shape: (5, 8)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment s_acctbal_100x
i64 str str i64 str f64 str f64
1 "Supplier#000000001" "sdrGnXCDRcfriBvY0KL,ipCanOTyK … 17 "27-918-335-1736" 5755.94 " instructions. slyly unusual" 575594.0
3 "Supplier#000000003" "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… 1 "11-383-516-1199" 4192.4 "ong the fluffily idle packages… 419240.0
4 "Supplier#000000004" "qGTQJXogS83a7MBnEweGHKevK" 15 "25-843-787-7479" 4641.08 "al braids affix through the re… 464108.0
6 "Supplier#000000006" "zaux5FTzToEg" 14 "24-696-997-4969" 1365.79 " sleep fluffily against the bl… 136579.0
9 "Supplier#000000009" ",gJ6K2MKveYxQTN 2EMG3pzg" 10 "20-403-398-8662" 5302.37 "asymptotes cajole along the fu… 530237.0

Use the Same Expression on Multiple Columns with Expression Expansion

Expression expansion lets you apply the same computation to multiple columns without repeating the logic.

supplier_df.select(
    pl.col("s_name", "s_address").str.to_uppercase()
)
  1. Uppercase the s_name and s_address columns
shape: (5, 2)
s_name s_address
str str
"SUPPLIER#000000001" "SDRGNXCDRCFRIBVY0KL,IPCANOTYK …
"SUPPLIER#000000003" "BZ0KXCHUCHJX62L7CJZSQL7GBWQ6RP…
"SUPPLIER#000000004" "QGTQJXOGS83A7MBNEWEGHKEVK"
"SUPPLIER#000000006" "ZAUX5FTZTOEG"
"SUPPLIER#000000009" ",GJ6K2MKVEYXQTN 2EMG3PZG"
supplier_df.select(
  (
    pl.col("s_nationkey", "s_suppkey") / 100
  ).round(2)
)
  1. Divide s_nationkey and s_suppkey by 100
shape: (5, 2)
s_nationkey s_suppkey
f64 f64
0.17 0.01
0.01 0.03
0.15 0.04
0.14 0.06
0.1 0.09

Use regex to apply functions to a set of columns.

# Matching regex patterns
supplier_df.select(pl.col("^.*key$"))
  1. Select all columns with names ending in key
shape: (5, 2)
s_suppkey s_nationkey
i64 i64
1 17
3 1
4 15
6 14
9 10
supplier_df.select(pl.col(pl.Int64))
  1. Select all columns with the Int64 datatype
shape: (5, 2)
s_suppkey s_nationkey
i64 i64
1 17
3 1
4 15
6 14
9 10

You can select all columns and exclude columns as well.

supplier_df.select(pl.all()).head(3)
shape: (3, 7)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment
i64 str str i64 str f64 str
1 "Supplier#000000001" "sdrGnXCDRcfriBvY0KL,ipCanOTyK … 17 "27-918-335-1736" 5755.94 " instructions. slyly unusual"
3 "Supplier#000000003" "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… 1 "11-383-516-1199" 4192.4 "ong the fluffily idle packages…
4 "Supplier#000000004" "qGTQJXogS83a7MBnEweGHKevK" 15 "25-843-787-7479" 4641.08 "al braids affix through the re…
supplier_df.select(
    pl.all().exclude(r"^.*_na.*$")
)
  1. Select all columns, except those with _na in their name
shape: (5, 5)
s_suppkey s_address s_phone s_acctbal s_comment
i64 str str f64 str
1 "sdrGnXCDRcfriBvY0KL,ipCanOTyK … "27-918-335-1736" 5755.94 " instructions. slyly unusual"
3 "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… "11-383-516-1199" 4192.4 "ong the fluffily idle packages…
4 "qGTQJXogS83a7MBnEweGHKevK" "25-843-787-7479" 4641.08 "al braids affix through the re…
6 "zaux5FTzToEg" "24-696-997-4969" 1365.79 " sleep fluffily against the bl…
9 ",gJ6K2MKveYxQTN 2EMG3pzg" "20-403-398-8662" 5302.37 "asymptotes cajole along the fu…

Select Multiple Columns with Column Selectors

Column selectors provide functions to select columns based on data type, column name, or column position in the DataFrame.

Let’s look at some examples:

import polars.selectors as cs

supplier_df.select(cs.starts_with("s_n"))
  1. Select columns whose names start with s_n
shape: (5, 2)
s_name s_nationkey
str i64
"Supplier#000000001" 17
"Supplier#000000003" 1
"Supplier#000000004" 15
"Supplier#000000006" 14
"Supplier#000000009" 10
# position based selection
supplier_df.select(cs.last())
  1. Select the last column of the DataFrame
shape: (5, 1)
s_comment
str
" instructions. slyly unusual"
"ong the fluffily idle packages…
"al braids affix through the re…
" sleep fluffily against the bl…
"asymptotes cajole along the fu…

Combine column selectors with set operators (union, intersection, difference, & negate).

# `|` is UNION
supplier_df.select(cs.starts_with("s_n") | cs.last())
  1. Select all columns starting with s_n, plus the last column in the DataFrame
shape: (5, 3)
s_name s_nationkey s_comment
str i64 str
"Supplier#000000001" 17 " instructions. slyly unusual"
"Supplier#000000003" 1 "ong the fluffily idle packages…
"Supplier#000000004" 15 "al braids affix through the re…
"Supplier#000000006" 14 " sleep fluffily against the bl…
"Supplier#000000009" 10 "asymptotes cajole along the fu…
# `A-B` = Column is in A and not in B
supplier_df.select(cs.starts_with("s_n") - cs.last())
  1. Select columns starting with s_n and remove the last column from that set
shape: (5, 2)
s_name s_nationkey
str i64
"Supplier#000000001" 17
"Supplier#000000003" 1
"Supplier#000000004" 15
"Supplier#000000006" 14
"Supplier#000000009" 10
supplier_df.select(
  cs.starts_with("s_s") | 
  cs.starts_with("s_n") & 
  cs.first()
)
  1. Select columns starting with s_s, or columns starting with s_n that are also the first column (& binds tighter than |)
shape: (5, 1)
s_suppkey
i64
1
3
4
6
9
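Selectors can also be negated with ~, which selects the complement; a quick sketch on the same supplier_df:

# `~` is NEGATE: everything except columns starting with s_n
supplier_df.select(~cs.starts_with("s_n"))

This returns all columns except s_name and s_nationkey.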

Combine Dataframes with Join

Use joins to combine DataFrames.

nation_url = "https://gist.githubusercontent.com/josephmachado/4c48d64b7dcbbdb419cd6181e7c562c1/raw/068c121f30d9ccb64e26b979bb730902990618b1/nation.csv"

nation_df = pl.read_csv(nation_url)

supplier_df.join(
  nation_df, 
  left_on="s_nationkey", 
  right_on="n_nationkey"
)
  1. This will return a new DataFrame
shape: (5, 10)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment n_name n_regionkey n_comment
i64 str str i64 str f64 str str i64 str
3 "Supplier#000000003" "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… 1 "11-383-516-1199" 4192.4 "ong the fluffily idle packages… "ARGENTINA" 1 "al foxes promise slyly accordi…
9 "Supplier#000000009" ",gJ6K2MKveYxQTN 2EMG3pzg" 10 "20-403-398-8662" 5302.37 "asymptotes cajole along the fu… "IRAN" 4 "efully alongside of the slyly …
6 "Supplier#000000006" "zaux5FTzToEg" 14 "24-696-997-4969" 1365.79 " sleep fluffily against the bl… "KENYA" 0 "pending excuses haggle furious…
4 "Supplier#000000004" "qGTQJXogS83a7MBnEweGHKevK" 15 "25-843-787-7479" 4641.08 "al braids affix through the re… "MOROCCO" 0 "rns. blithely bold courts amon…
1 "Supplier#000000001" "sdrGnXCDRcfriBvY0KL,ipCanOTyK … 17 "27-918-335-1736" 5755.94 " instructions. slyly unusual" "PERU" 1 "platelets. blithely pending de…

Use the how input parameter to define the join type.
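For example, a left join keeps every supplier row even when there is no matching nation; a minimal sketch with the same DataFrames:

# Keep all suppliers; nation columns are null when there is no match
supplier_df.join(
  nation_df,
  left_on="s_nationkey",
  right_on="n_nationkey",
  how="left"
)

Other values for how include "inner" (the default), "full", "semi", "anti", and "cross".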

Use LazyAPI for Larger-than-Memory Data Processing

LazyAPI functions tell Polars to process data only when needed. Processing data lazily enables Polars to

  1. Apply optimizations, since Polars has a complete end-to-end view of all the transformations.
  2. Process data in chunks, enabling larger-than-memory data processing.
  3. Catch schema errors before processing the data.

LazyFrames are the lazy counterpart of DataFrames. You can create a LazyFrame with

  1. a scan_format function, which reads data directly into a LazyFrame, or
  2. the .lazy() method on an existing DataFrame.

Let’s look at how you can create a LazyFrame.

# Creating supplier_df lazy Frame
supplier_df = pl.scan_csv(gist_url)

transformed_supplier_df = supplier_df.filter(
    remove_prefix("s_name").cast(pl.Int32) > 5
).with_columns(s_acctbal_100x=pl.col("s_acctbal") * 100)

transformed_supplier_df.show_graph(optimized=False)
  1. Create a LazyFrame with the scan_csv function
  2. Define the transformations to be performed
  3. The show_graph method visualizes how Polars plans to execute the transformations
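If you prefer a plain-text plan over a rendered graph, the explain method returns the plan as a string; for example:

# Print the query plan as text
print(transformed_supplier_df.explain())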

A LazyFrame can be written to an external system using a sink_format function, as shown below.

# Data is processed
transformed_supplier_df.sink_csv("./lazy_output.csv")
  1. This step triggers the data to be processed

Polars will determine the chunk size based on the hardware it’s running on.

Caution

Certain operations, like pivot & group by, require the entire dataset to be processed.

You will have to convert the LazyFrame to a DataFrame with the .collect() method to perform such full-data computations.

Once done, you can convert the result back to a LazyFrame using the .lazy() method.

transformed_supplier_df\
.collect()\
.pivot(
  "s_nationkey", 
  index="s_suppkey", 
  values="s_acctbal_100x", 
  aggregate_function="max"
)\
.lazy()\
.sink_csv("./lazy_output.csv")
  1. Convert the transformed_supplier_df LazyFrame to a DataFrame
  2. Pivot the DataFrame. Note that this computation needs the entire dataset to be processed
  3. Convert the DataFrame back to a LazyFrame

Trade-Offs When Adopting Polars

Polars has an intuitive API, but it is a relatively new tool. Here are some trade-offs to be mindful of before adopting Polars in your pipelines:

  1. Integrations with the existing data ecosystem are not as extensive as Pandas'.
  2. Strict data types require careful planning and design of your pipelines. A bad data type can break your pipelines (see the sketch after this list). Ref: Polars datatypes.
  3. DS/DA familiarity: most data scientists and analysts are familiar with Pandas, so learning a new tool can be time-consuming.
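To make the strict-datatype point concrete, here is a small sketch with made-up values: casting a string column containing a bad value errors out by default, unless you explicitly relax the cast:

# Strict by default: casting "abc" to Float64 raises an error
messy_df = pl.DataFrame({"amount": ["12.5", "abc"]})
# messy_df.select(pl.col("amount").cast(pl.Float64))  # would raise a cast error

# strict=False converts unparseable values to null instead of failing
messy_df.select(pl.col("amount").cast(pl.Float64, strict=False))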

Conclusion

To recap, we saw how

  1. Polars can read from and write to multiple systems
  2. A DataFrame is a collection of columns (each of type Series)
  3. Expressions define transformations, and contexts execute them
  4. The LazyAPI can process larger-than-memory data
  5. There are trade-offs to weigh when considering Polars

If you are building a pipeline that requires production-level stability and has to process GBs of data without a hitch, give Polars a try.

It’s extremely easy to develop with and maintain.

Read These

  1. Python standard library reference for Data Engineers
  2. How to use Pytest