Free 10-Minute Polars Tutorial for Data Engineers

Learn how to use Polars to build resilient data pipelines, in 10 minutes. With executable code!

Author

Joseph Machado

Published

August 11, 2025

Keywords

Polars, Data Analysis, Python Data Processing, ETL, Data Pipeline, DataFrame

Introduction

If you work with data, you’ve come across Pandas. If you feel that

  1. Pandas is confusing and wildly different from SQL, the lingua franca for data pipelines
  2. The Pandas API is unintuitive, complex, and over-engineered
  3. Debugging random data type changes is a nightmare
  4. Data gets mangled in unpredictable ways

then this post is for you.

Imagine building intuitive pipelines that are easy to maintain. You will actually find pride in your craft. That is what Polars enables you to do.

In this 10-minute tutorial, you will learn how simple and intuitive Polars is to use. You will learn the key concepts of Polars and use them to build resilient data pipelines.

Setup

Install uv and set up a directory as shown below.

# Set up
uv init polars-tutorial
cd polars-tutorial
uv python install 3.14
uv python pin 3.14
uv add polars
uv venv
uv sync 
source .venv/bin/activate  # or for windows .venv\Scripts\activate
python
  1. Create a Python project
  2. Install Python version 3.14 and set it as the default for the project
  3. Add the polars library
  4. Create a virtual environment
  5. Activate the virtual environment
  6. Start the Python REPL. You should see Python 3.14

10-Minute Polars Tutorial

The image below summarizes the key Polars concepts. Take a few minutes to get an overview.

[Image: Polars Concepts]

Polars Can Read & Write to Multiple Formats

Polars can read data from and write data to multiple formats and systems.

Example: Read data from a URL

import polars as pl

gist_url = "https://gist.githubusercontent.com/josephmachado/85f5c8d73ac840906cce590f657ffb06/raw/8d9d29b1466d49abc9dbf09b21d508f7a1071e69/your_file.csv"

supplier_df = pl.read_csv(gist_url)
supplier_df.write_csv("./supplier_df.csv")
  1. Read data from gist_url
  2. Write the data to a local file

Polars read and write functions follow the pattern below:

  1. read_format reads data from a source, where format can be parquet, csv, etc. These functions have optional parameters to read from S3, cloud storage, and other systems. The data is read into memory and represented as a DataFrame.
  2. write_format writes a DataFrame to a destination in a specific format, e.g., write_parquet.
  3. scan_format and sink_format are the read and write counterparts for processing data in chunks.

The scan_format functions create a lazy representation of the data called a LazyFrame. The sink_format functions only work on LazyFrames.

We will cover LazyAPI in a later section.
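For instance, with Parquet the same naming pattern looks like this (the file paths below are hypothetical and only illustrate the pattern; scan/sink are explained in the LazyAPI section below):

# Eager: read the file into a DataFrame in memory, then write it back out
orders_df = pl.read_parquet("./orders.parquet")
orders_df.write_parquet("./orders_copy.parquet")

# Lazy: scan returns a LazyFrame; sink streams the result to disk in chunks
orders_lazy = pl.scan_parquet("./orders.parquet")
orders_lazy.sink_parquet("./orders_streamed.parquet")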

Dataframe is a Set of Columns

A DataFrame represents a tabular data structure with rows and columns. In Polars, one or more columns (type = Series) make up a DataFrame.

Every column must be of one of the allowed data types.
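For instance, here is a minimal sketch that builds a DataFrame from two Series (the values are made up for illustration):

# A DataFrame is a collection of named, typed Series
names = pl.Series("s_name", ["Supplier#000000001", "Supplier#000000002"])
balances = pl.Series("s_acctbal", [5755.94, 1234.56])
toy_df = pl.DataFrame([names, balances])
toy_df.schema  # Schema([('s_name', String), ('s_acctbal', Float64)])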

Let’s inspect the data we read in as supplier_df:

# Print DataFrame Schema
supplier_df.schema
  1. Prints the column names and data types
Schema([('s_suppkey', Int64),
        ('s_name', String),
        ('s_address', String),
        ('s_nationkey', Int64),
        ('s_phone', String),
        ('s_acctbal', Float64),
        ('s_comment', String)])
# Print shape
supplier_df.shape
  1. Prints the number of rows and columns in supplier_df
(5, 7)

Define Transformations with Expressions

Expressions represent the transformations you want to perform. Let’s define an expression to uppercase the column s_name.

# Defining expression
s_name_uppercase = pl.col("s_name").str.to_uppercase().alias("s_name_uppercase")

def remove_prefix(col_name: str) -> pl.Expr:
    return pl.col(col_name).str.replace_all(r"^.*#", "")
  1. An expression to apply uppercase to the s_name column
  2. A function that creates an expression removing everything up to and including the # in the given column
Note

Expressions represent transformations and will not be executed unless used within a context.

Example: using expressions within a Context.

supplier_df\
.select(
  pl.col("s_name")
  , s_name_uppercase
  , remove_prefix("s_name").alias("prefix_removed_s_name")
)
shape: (5, 3)
s_name s_name_uppercase prefix_removed_s_name
str str str
"Supplier#000000001" "SUPPLIER#000000001" "000000001"
"Supplier#000000003" "SUPPLIER#000000003" "000000003"
"Supplier#000000004" "SUPPLIER#000000004" "000000004"
"Supplier#000000006" "SUPPLIER#000000006" "000000006"
"Supplier#000000009" "SUPPLIER#000000009" "000000009"

Window functions are also a type of expression.

national_ranking_expr = pl.col("s_acctbal")\
.rank("dense", descending=True)\
.over("s_nationkey")\
.alias("national_ranking")


supplier_df\
.select(
  pl.col("s_name")
  , pl.col("s_acctbal")
  , pl.col("s_nationkey")
  , national_ranking_expr
)\
.filter(
  pl.col("national_ranking") <= 3
)\
.sort("s_nationkey") 
  1. Define a window expression & alias the result as national_ranking
  2. Select the window expression
  3. Filter with the result of the window expression
shape: (5, 4)
s_name s_acctbal s_nationkey national_ranking
str f64 i64 u32
"Supplier#000000003" 4192.4 1 1
"Supplier#000000009" 5302.37 10 1
"Supplier#000000006" 1365.79 14 1
"Supplier#000000004" 4641.08 15 1
"Supplier#000000001" 5755.94 17 1

Context Executes Expressions and Returns a New Dataframe

Use a context to execute expressions and create a new DataFrame. Polars has four common contexts:

  1. select to select columns and expressions from an existing DataFrame.
  2. filter to filter a DataFrame based on a given criterion.
  3. group_by to aggregate a DataFrame.
  4. with_columns to add new columns to an existing DataFrame.
# Applying context to supplier_df
supplier_df.select(remove_prefix("s_name"))
  1. Apply remove_prefix within a select context
shape: (5, 1)
s_name
str
"000000001"
"000000003"
"000000004"
"000000006"
"000000009"
supplier_df.filter(
  remove_prefix("s_name").cast(pl.Int32) > 5
)
  1. Filter rows using remove_prefix
shape: (2, 7)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment
i64 str str i64 str f64 str
6 "Supplier#000000006" "zaux5FTzToEg" 14 "24-696-997-4969" 1365.79 " sleep fluffily against the bl…
9 "Supplier#000000009" ",gJ6K2MKveYxQTN 2EMG3pzg" 10 "20-403-398-8662" 5302.37 "asymptotes cajole along the fu…
supplier_df.group_by(
  pl.col("s_nationkey")
).agg(
  pl.len().alias("num_suppliers")
)
  1. Group by s_nationkey and compute count(*) as num_suppliers
shape: (5, 2)
s_nationkey num_suppliers
i64 u32
15 1
14 1
10 1
17 1
1 1
supplier_df.with_columns(
  s_acctbal_100x=pl.col("s_acctbal") * 100
)
  1. Create a new DataFrame with an additional column s_acctbal_100x
shape: (5, 8)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment s_acctbal_100x
i64 str str i64 str f64 str f64
1 "Supplier#000000001" "sdrGnXCDRcfriBvY0KL,ipCanOTyK … 17 "27-918-335-1736" 5755.94 " instructions. slyly unusual" 575594.0
3 "Supplier#000000003" "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… 1 "11-383-516-1199" 4192.4 "ong the fluffily idle packages… 419240.0
4 "Supplier#000000004" "qGTQJXogS83a7MBnEweGHKevK" 15 "25-843-787-7479" 4641.08 "al braids affix through the re… 464108.0
6 "Supplier#000000006" "zaux5FTzToEg" 14 "24-696-997-4969" 1365.79 " sleep fluffily against the bl… 136579.0
9 "Supplier#000000009" ",gJ6K2MKveYxQTN 2EMG3pzg" 10 "20-403-398-8662" 5302.37 "asymptotes cajole along the fu… 530237.0

Use the Same Expression on Multiple Columns with Expression Expansion

Expression expansion lets you apply the same computation to multiple columns without repeating the logic.

supplier_df.select(
    pl.col("s_name", "s_address").str.to_uppercase()
)
  1. Uppercase the s_name and s_address columns
shape: (5, 2)
s_name s_address
str str
"SUPPLIER#000000001" "SDRGNXCDRCFRIBVY0KL,IPCANOTYK …
"SUPPLIER#000000003" "BZ0KXCHUCHJX62L7CJZSQL7GBWQ6RP…
"SUPPLIER#000000004" "QGTQJXOGS83A7MBNEWEGHKEVK"
"SUPPLIER#000000006" "ZAUX5FTZTOEG"
"SUPPLIER#000000009" ",GJ6K2MKVEYXQTN 2EMG3PZG"
supplier_df.select(
  (
    pl.col("s_nationkey", "s_suppkey") / 100
  ).round(2)
)
  1. Divide s_nationkey and s_suppkey by 100
shape: (5, 2)
s_nationkey s_suppkey
f64 f64
0.17 0.01
0.01 0.03
0.15 0.04
0.14 0.06
0.1 0.09

Use regex to apply functions to a set of columns.

# Matching regex patterns
supplier_df.select(pl.col("^.*key$"))
  1. Select all columns with names ending in key
shape: (5, 2)
s_suppkey s_nationkey
i64 i64
1 17
3 1
4 15
6 14
9 10
supplier_df.select(pl.col(pl.Int64))
  1. Select all columns with the Int64 datatype
shape: (5, 2)
s_suppkey s_nationkey
i64 i64
1 17
3 1
4 15
6 14
9 10

You can select all columns and exclude columns as well.

supplier_df.select(pl.all()).head(3)
shape: (3, 7)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment
i64 str str i64 str f64 str
1 "Supplier#000000001" "sdrGnXCDRcfriBvY0KL,ipCanOTyK … 17 "27-918-335-1736" 5755.94 " instructions. slyly unusual"
3 "Supplier#000000003" "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… 1 "11-383-516-1199" 4192.4 "ong the fluffily idle packages…
4 "Supplier#000000004" "qGTQJXogS83a7MBnEweGHKevK" 15 "25-843-787-7479" 4641.08 "al braids affix through the re…
supplier_df.select(
    pl.all().exclude(r"^.*_na.*$")
)
  1. Select all columns, except those with _na in their name
shape: (5, 5)
s_suppkey s_address s_phone s_acctbal s_comment
i64 str str f64 str
1 "sdrGnXCDRcfriBvY0KL,ipCanOTyK … "27-918-335-1736" 5755.94 " instructions. slyly unusual"
3 "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… "11-383-516-1199" 4192.4 "ong the fluffily idle packages…
4 "qGTQJXogS83a7MBnEweGHKevK" "25-843-787-7479" 4641.08 "al braids affix through the re…
6 "zaux5FTzToEg" "24-696-997-4969" 1365.79 " sleep fluffily against the bl…
9 ",gJ6K2MKveYxQTN 2EMG3pzg" "20-403-398-8662" 5302.37 "asymptotes cajole along the fu…

Select Multiple Columns with Column Selectors

Column selectors provide functions to select columns based on data type, column name, or column position in the DataFrame.

Let’s look at some examples:

import polars.selectors as cs

supplier_df.select(cs.starts_with("s_n"))
  1. Select columns whose names start with s_n
shape: (5, 2)
s_name s_nationkey
str i64
"Supplier#000000001" 17
"Supplier#000000003" 1
"Supplier#000000004" 15
"Supplier#000000006" 14
"Supplier#000000009" 10
# position based selection
supplier_df.select(cs.last())
  1. Select the last column of the DataFrame
shape: (5, 1)
s_comment
str
" instructions. slyly unusual"
"ong the fluffily idle packages…
"al braids affix through the re…
" sleep fluffily against the bl…
"asymptotes cajole along the fu…

Combine column selectors with set operators (union, intersection, difference, & negate).

# `|` is UNION
supplier_df.select(cs.starts_with("s_n") | cs.last())
  1. Select all columns starting with s_n, plus the last column in the DataFrame
shape: (5, 3)
s_name s_nationkey s_comment
str i64 str
"Supplier#000000001" 17 " instructions. slyly unusual"
"Supplier#000000003" 1 "ong the fluffily idle packages…
"Supplier#000000004" 15 "al braids affix through the re…
"Supplier#000000006" 14 " sleep fluffily against the bl…
"Supplier#000000009" 10 "asymptotes cajole along the fu…
# `A-B` = Column is in A and not in B
supplier_df.select(cs.starts_with("s_n") - cs.last())
  1. Select columns starting with s_n and remove the last column from that set
shape: (5, 2)
s_name s_nationkey
str i64
"Supplier#000000001" 17
"Supplier#000000003" 1
"Supplier#000000004" 15
"Supplier#000000006" 14
"Supplier#000000009" 10
supplier_df.select(
  cs.starts_with("s_s") | 
  cs.starts_with("s_n") & 
  cs.first()
)
  1. Select columns starting with s_s, or columns starting with s_n that are also the first column (& binds tighter than |)
shape: (5, 1)
s_suppkey
i64
1
3
4
6
9
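Selectors can also be negated with ~, which selects the complement; a quick sketch on the same supplier_df:

# `~` is NEGATE: everything except columns starting with s_n
supplier_df.select(~cs.starts_with("s_n"))

This returns all columns except s_name and s_nationkey.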

Combine Dataframes with Join

Use joins to combine DataFrames.

nation_url = "https://gist.githubusercontent.com/josephmachado/4c48d64b7dcbbdb419cd6181e7c562c1/raw/068c121f30d9ccb64e26b979bb730902990618b1/nation.csv"

nation_df = pl.read_csv(nation_url)

supplier_df.join(
  nation_df, 
  left_on="s_nationkey", 
  right_on="n_nationkey"
)
  1. This will return a new DataFrame
shape: (5, 10)
s_suppkey s_name s_address s_nationkey s_phone s_acctbal s_comment n_name n_regionkey n_comment
i64 str str i64 str f64 str str i64 str
3 "Supplier#000000003" "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… 1 "11-383-516-1199" 4192.4 "ong the fluffily idle packages… "ARGENTINA" 1 "al foxes promise slyly accordi…
9 "Supplier#000000009" ",gJ6K2MKveYxQTN 2EMG3pzg" 10 "20-403-398-8662" 5302.37 "asymptotes cajole along the fu… "IRAN" 4 "efully alongside of the slyly …
6 "Supplier#000000006" "zaux5FTzToEg" 14 "24-696-997-4969" 1365.79 " sleep fluffily against the bl… "KENYA" 0 "pending excuses haggle furious…
4 "Supplier#000000004" "qGTQJXogS83a7MBnEweGHKevK" 15 "25-843-787-7479" 4641.08 "al braids affix through the re… "MOROCCO" 0 "rns. blithely bold courts amon…
1 "Supplier#000000001" "sdrGnXCDRcfriBvY0KL,ipCanOTyK … 17 "27-918-335-1736" 5755.94 " instructions. slyly unusual" "PERU" 1 "platelets. blithely pending de…

Use the how input parameter to define the join type.
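For example, a left join keeps every supplier row even when there is no matching nation; a minimal sketch with the same DataFrames:

# Keep all suppliers; nation columns are null when there is no match
supplier_df.join(
  nation_df,
  left_on="s_nationkey",
  right_on="n_nationkey",
  how="left"
)

Other values for how include "inner" (the default), "full", "semi", "anti", and "cross".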

Use LazyAPI for Larger-than-Memory Data Processing

LazyAPI functions tell Polars to process data only when needed. Processing data lazily enables Polars to

  1. Apply optimizations, since Polars has a complete end-to-end view of all the transformations.
  2. Process data in chunks, enabling larger-than-memory data processing.
  3. Catch schema errors before processing the data.

LazyFrames are the lazy counterpart of DataFrames. You can create a LazyFrame with

  1. a scan_format function, which reads data directly into a LazyFrame, or
  2. the .lazy() method on an existing DataFrame.

Let’s look at how you can create a LazyFrame.

# Creating supplier_df lazy Frame
supplier_df = pl.scan_csv(gist_url)

transformed_supplier_df = supplier_df.filter(
    remove_prefix("s_name").cast(pl.Int32) > 5
).with_columns(s_acctbal_100x=pl.col("s_acctbal") * 100)

transformed_supplier_df.show_graph(optimized=False)
  1. Create a LazyFrame with the scan_csv function
  2. Define the transformations to be performed
  3. The show_graph method visualizes how Polars plans to execute the transformations
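If you prefer a plain-text plan over a rendered graph, the explain method returns the plan as a string; for example:

# Print the query plan as text
print(transformed_supplier_df.explain())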

A LazyFrame can be written to an external system using a sink_format function, as shown below.

# Data is processed
transformed_supplier_df.sink_csv("./lazy_output.csv")
  1. This step triggers the data to be processed

Polars will determine the chunk size based on the hardware it’s running on.

Caution

Certain operations, like pivot & group by, require the entire dataset to be processed.

You will have to convert the LazyFrame to a DataFrame with the .collect() method to perform such full-data computations.

Once done, you can convert the result back to a LazyFrame using the .lazy() method.

transformed_supplier_df\
.collect()\
.pivot(
  "s_nationkey", 
  index="s_suppkey", 
  values="s_acctbal_100x", 
  aggregate_function="max"
)\
.lazy()\
.sink_csv("./lazy_output.csv")
  1. Convert the transformed_supplier_df LazyFrame to a DataFrame
  2. Pivot the DataFrame. Note that this computation needs the entire dataset to be processed
  3. Convert the DataFrame back to a LazyFrame

Trade-Offs When Adopting Polars

Polars has an intuitive API, but it is a relatively new tool. Here are some trade-offs to be mindful of before adopting Polars in your pipelines:

  1. Integrations with the existing data ecosystem are not as extensive as Pandas'.
  2. Strict data types require careful planning and design of your pipelines. A bad data type can break your pipelines (see the sketch after this list). Ref: Polars datatypes.
  3. DS/DA familiarity: most data scientists and analysts are familiar with Pandas, so learning a new tool can be time-consuming.
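To make the strict-datatype point concrete, here is a small sketch with made-up values: casting a string column containing a bad value errors out by default, unless you explicitly relax the cast:

# Strict by default: casting "abc" to Float64 raises an error
messy_df = pl.DataFrame({"amount": ["12.5", "abc"]})
# messy_df.select(pl.col("amount").cast(pl.Float64))  # would raise a cast error

# strict=False converts unparseable values to null instead of failing
messy_df.select(pl.col("amount").cast(pl.Float64, strict=False))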

Conclusion

To recap, we saw how

  1. Polars can read from and write to multiple systems
  2. A DataFrame is a collection of columns (each of type Series)
  3. Expressions define transformations, and contexts execute them
  4. The LazyAPI can process larger-than-memory data
  5. There are trade-offs to weigh when considering Polars

If you are building a pipeline that requires production-level stability and has to process GBs of data without a hitch, give Polars a try.

It’s extremely easy to develop with and maintain.

Read These

  1. Python standard library reference for Data Engineers
  2. How to use Pytest