1. Read data from the `gist_url`
2. Write data to a local file
Introduction
If you work with data, you’ve come across Pandas. If you feel that:

- Pandas is confusing and wildly different from SQL, the lingua franca for data pipelines
- The Pandas API is unintuitive, complex, and over-engineered
- Debugging random data type changes is a nightmare
- Data gets mangled in unpredictable ways

then this post is for you.
Imagine building intuitive pipelines that are easy to maintain. You will actually find pride in your craft. That is what Polars enables you to do.
In this 10-minute tutorial, you will learn how simple and intuitive Polars is to use. You will learn the key concepts of Polars and use them to build resilient data pipelines.
Setup
Install uv and set up a directory as shown below.
1. Create a Python project
2. Install Python version 3.14 and set it as the default for the project
3. Add the `polars` library
4. Create a virtual environment
5. Activate the virtual environment
6. Start the Python REPL. You should see Python 3.14
10-Minute Polars Tutorial
This image represents key Polars concepts; take a few minutes to get an overview.
Polars Can Read & Write to Multiple Formats
Polars can read data from and write data to multiple formats and systems.
Example: Read data from a URL
Polars read and write functions follow the pattern below:
- `read_format` reads data from a source. The data format could be `parquet`, `csv`, etc. These functions have optional parameters to read from S3, Cloud Storage, etc. The data is read into memory and represented as a DataFrame.
- `write_destination` writes a DataFrame to a destination in a specific format, e.g., `write_parquet`.
- `scan_source` and `sink_destination` are read and write functions for processing data in chunks. `scan_source` creates a lazy representation of the data called a LazyFrame; `sink_destination` only works on LazyFrames.

We will cover LazyAPI in a later section.
Dataframe is a Set of Columns
A DataFrame represents a tabular data structure with rows and columns. In Polars, one or more columns (type = Series) make up a DataFrame.
Every column must be of one of the allowed data types.
Let’s inspect the data we read in as `supplier_df`:

1. Prints the column names and data types
Schema([('s_suppkey', Int64),
('s_name', String),
('s_address', String),
('s_nationkey', Int64),
('s_phone', String),
('s_acctbal', Float64),
('s_comment', String)])
Define Transformations with Expressions
Expressions represent the transformations you want to perform. Let’s define an expression to uppercase the column s_name.
1. An expression to apply uppercase to the `s_name` column
2. A function to create an expression removing everything before the `#` in the given column
Expressions represent transformations and will not be executed unless used within a context.
Example: using expressions within a Context.
| s_name | s_name_uppercase | prefix_removed_s_name |
|---|---|---|
| str | str | str |
| "Supplier#000000001" | "SUPPLIER#000000001" | "000000001" |
| "Supplier#000000003" | "SUPPLIER#000000003" | "000000003" |
| "Supplier#000000004" | "SUPPLIER#000000004" | "000000004" |
| "Supplier#000000006" | "SUPPLIER#000000006" | "000000006" |
| "Supplier#000000009" | "SUPPLIER#000000009" | "000000009" |
Window functions are also a type of expression.
```python
national_ranking_expr = (
    pl.col("s_acctbal")
    .rank("dense", descending=True)
    .over("s_nationkey")
    .alias("national_ranking")
)

supplier_df.select(
    pl.col("s_name"),
    pl.col("s_acctbal"),
    pl.col("s_nationkey"),
    national_ranking_expr,
).filter(
    pl.col("national_ranking") <= 3
).sort("s_nationkey")
```

1. Define a window expression & alias the result as `national_ranking`
2. Select the window expression
3. Filter with the result of the window expression
| s_name | s_acctbal | s_nationkey | national_ranking |
|---|---|---|---|
| str | f64 | i64 | u32 |
| "Supplier#000000003" | 4192.4 | 1 | 1 |
| "Supplier#000000009" | 5302.37 | 10 | 1 |
| "Supplier#000000006" | 1365.79 | 14 | 1 |
| "Supplier#000000004" | 4641.08 | 15 | 1 |
| "Supplier#000000001" | 5755.94 | 17 | 1 |
Context Executes Expressions and Returns a New Dataframe
Use a context to execute expressions and create a new DataFrame. Polars has four common contexts:

- `select` to select columns and expressions from an existing DataFrame
- `filter` to filter a DataFrame based on a given criterion
- `group_by` to aggregate a DataFrame
- `with_columns` to add new columns to an existing DataFrame
1. Apply `remove_prefix` within a select context
| s_name |
|---|
| str |
| "000000001" |
| "000000003" |
| "000000004" |
| "000000006" |
| "000000009" |
1. Filter rows using `remove_prefix`
| s_suppkey | s_name | s_address | s_nationkey | s_phone | s_acctbal | s_comment |
|---|---|---|---|---|---|---|
| i64 | str | str | i64 | str | f64 | str |
| 6 | "Supplier#000000006" | "zaux5FTzToEg" | 14 | "24-696-997-4969" | 1365.79 | " sleep fluffily against the bl… |
| 9 | "Supplier#000000009" | ",gJ6K2MKveYxQTN 2EMG3pzg" | 10 | "20-403-398-8662" | 5302.37 | "asymptotes cajole along the fu… |
1. Group by `s_nationkey` and print `count(*)`
| s_nationkey | num_suppliers |
|---|---|
| i64 | u32 |
| 15 | 1 |
| 14 | 1 |
| 10 | 1 |
| 17 | 1 |
| 1 | 1 |
1. Create a new DataFrame with an additional column `s_acctbal_100x`
| s_suppkey | s_name | s_address | s_nationkey | s_phone | s_acctbal | s_comment | s_acctbal_100x |
|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | str | f64 | str | f64 |
| 1 | "Supplier#000000001" | "sdrGnXCDRcfriBvY0KL,ipCanOTyK … | 17 | "27-918-335-1736" | 5755.94 | " instructions. slyly unusual" | 575594.0 |
| 3 | "Supplier#000000003" | "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… | 1 | "11-383-516-1199" | 4192.4 | "ong the fluffily idle packages… | 419240.0 |
| 4 | "Supplier#000000004" | "qGTQJXogS83a7MBnEweGHKevK" | 15 | "25-843-787-7479" | 4641.08 | "al braids affix through the re… | 464108.0 |
| 6 | "Supplier#000000006" | "zaux5FTzToEg" | 14 | "24-696-997-4969" | 1365.79 | " sleep fluffily against the bl… | 136579.0 |
| 9 | "Supplier#000000009" | ",gJ6K2MKveYxQTN 2EMG3pzg" | 10 | "20-403-398-8662" | 5302.37 | "asymptotes cajole along the fu… | 530237.0 |
Use the Same Expression on Multiple Columns with Expression Expansion
Expression expansion allows you to apply the same computation to multiple columns without repeating the logic.
1. Uppercase the `s_name` and `s_address` columns
| s_name | s_address |
|---|---|
| str | str |
| "SUPPLIER#000000001" | "SDRGNXCDRCFRIBVY0KL,IPCANOTYK … |
| "SUPPLIER#000000003" | "BZ0KXCHUCHJX62L7CJZSQL7GBWQ6RP… |
| "SUPPLIER#000000004" | "QGTQJXOGS83A7MBNEWEGHKEVK" |
| "SUPPLIER#000000006" | "ZAUX5FTZTOEG" |
| "SUPPLIER#000000009" | ",GJ6K2MKVEYXQTN 2EMG3PZG" |
1. Divide `s_nationkey` and `s_suppkey` by 100
| s_nationkey | s_suppkey |
|---|---|
| f64 | f64 |
| 0.17 | 0.01 |
| 0.01 | 0.03 |
| 0.15 | 0.04 |
| 0.14 | 0.06 |
| 0.1 | 0.09 |
Use regex to apply functions to a set of columns.
1. Select all columns with names ending in `key`
| s_suppkey | s_nationkey |
|---|---|
| i64 | i64 |
| 1 | 17 |
| 3 | 1 |
| 4 | 15 |
| 6 | 14 |
| 9 | 10 |
1. Select all columns with the `Int64` datatype
| s_suppkey | s_nationkey |
|---|---|
| i64 | i64 |
| 1 | 17 |
| 3 | 1 |
| 4 | 15 |
| 6 | 14 |
| 9 | 10 |
You can select all columns and exclude columns as well.
| s_suppkey | s_name | s_address | s_nationkey | s_phone | s_acctbal | s_comment |
|---|---|---|---|---|---|---|
| i64 | str | str | i64 | str | f64 | str |
| 1 | "Supplier#000000001" | "sdrGnXCDRcfriBvY0KL,ipCanOTyK … | 17 | "27-918-335-1736" | 5755.94 | " instructions. slyly unusual" |
| 3 | "Supplier#000000003" | "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… | 1 | "11-383-516-1199" | 4192.4 | "ong the fluffily idle packages… |
| 4 | "Supplier#000000004" | "qGTQJXogS83a7MBnEweGHKevK" | 15 | "25-843-787-7479" | 4641.08 | "al braids affix through the re… |
1. Select all columns, except those with `_na` in their name
| s_suppkey | s_address | s_phone | s_acctbal | s_comment |
|---|---|---|---|---|
| i64 | str | str | f64 | str |
| 1 | "sdrGnXCDRcfriBvY0KL,ipCanOTyK … | "27-918-335-1736" | 5755.94 | " instructions. slyly unusual" |
| 3 | "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… | "11-383-516-1199" | 4192.4 | "ong the fluffily idle packages… |
| 4 | "qGTQJXogS83a7MBnEweGHKevK" | "25-843-787-7479" | 4641.08 | "al braids affix through the re… |
| 6 | "zaux5FTzToEg" | "24-696-997-4969" | 1365.79 | " sleep fluffily against the bl… |
| 9 | ",gJ6K2MKveYxQTN 2EMG3pzg" | "20-403-398-8662" | 5302.37 | "asymptotes cajole along the fu… |
Select Multiple Columns with Column Selectors
Column selectors provide functions to select columns based on data type, column name, or column position in the DataFrame.
Let’s look at some examples:
1. Select columns whose names start with `s_n`
| s_name | s_nationkey |
|---|---|
| str | i64 |
| "Supplier#000000001" | 17 |
| "Supplier#000000003" | 1 |
| "Supplier#000000004" | 15 |
| "Supplier#000000006" | 14 |
| "Supplier#000000009" | 10 |
1. Select the last column from the dataframe
| s_comment |
|---|
| str |
| " instructions. slyly unusual" |
| "ong the fluffily idle packages… |
| "al braids affix through the re… |
| " sleep fluffily against the bl… |
| "asymptotes cajole along the fu… |
Combine column selectors with set operators (union, intersection, difference, & negate).
1. Select all columns starting with `s_n` or the last column in the dataframe
| s_name | s_nationkey | s_comment |
|---|---|---|
| str | i64 | str |
| "Supplier#000000001" | 17 | " instructions. slyly unusual" |
| "Supplier#000000003" | 1 | "ong the fluffily idle packages… |
| "Supplier#000000004" | 15 | "al braids affix through the re… |
| "Supplier#000000006" | 14 | " sleep fluffily against the bl… |
| "Supplier#000000009" | 10 | "asymptotes cajole along the fu… |
1. Select columns starting with `s_n` and remove the last column from that list
| s_name | s_nationkey |
|---|---|
| str | i64 |
| "Supplier#000000001" | 17 |
| "Supplier#000000003" | 1 |
| "Supplier#000000004" | 15 |
| "Supplier#000000006" | 14 |
| "Supplier#000000009" | 10 |
Combine Dataframes with Join
Use joins to combine DataFrames.

1. This will return a new DataFrame
| s_suppkey | s_name | s_address | s_nationkey | s_phone | s_acctbal | s_comment | n_name | n_regionkey | n_comment |
|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | str | f64 | str | str | i64 | str |
| 3 | "Supplier#000000003" | "BZ0kXcHUcHjx62L7CjZSql7gbWQ6RP… | 1 | "11-383-516-1199" | 4192.4 | "ong the fluffily idle packages… | "ARGENTINA" | 1 | "al foxes promise slyly accordi… |
| 9 | "Supplier#000000009" | ",gJ6K2MKveYxQTN 2EMG3pzg" | 10 | "20-403-398-8662" | 5302.37 | "asymptotes cajole along the fu… | "IRAN" | 4 | "efully alongside of the slyly … |
| 6 | "Supplier#000000006" | "zaux5FTzToEg" | 14 | "24-696-997-4969" | 1365.79 | " sleep fluffily against the bl… | "KENYA" | 0 | "pending excuses haggle furious… |
| 4 | "Supplier#000000004" | "qGTQJXogS83a7MBnEweGHKevK" | 15 | "25-843-787-7479" | 4641.08 | "al braids affix through the re… | "MOROCCO" | 0 | "rns. blithely bold courts amon… |
| 1 | "Supplier#000000001" | "sdrGnXCDRcfriBvY0KL,ipCanOTyK … | 17 | "27-918-335-1736" | 5755.94 | " instructions. slyly unusual" | "PERU" | 1 | "platelets. blithely pending de… |
Use the `how` input parameter to define the join type.
Use LazyAPI for Larger-than-Memory Data Processing
LazyAPI functions tell Polars to process data only when needed. Processing data lazily enables Polars to:

- Apply optimizations, as Polars has a complete end-to-end view of all the transformations
- Process data in chunks, enabling larger-than-memory data processing
- Catch schema errors before processing the data

LazyFrames are the lazy equivalent of DataFrames. You can create a LazyFrame using:

- a `scan_source` function to read data directly into a LazyFrame
- the `.lazy()` method on an existing DataFrame
Let’s look at how you can create a LazyFrame.
1. Create a LazyFrame with the `scan_csv` method
2. Define the transformations to be performed
3. The `explain` method shows you how Polars plans to execute the transformations
A LazyFrame can be written to an external system using a `sink_destination` function, as shown below.
1. This step will trigger the data to be processed
Polars will determine the chunk size based on the hardware it’s running on.
Certain operations, like pivot & group by, require the entire dataset to be processed.
You will have to convert the LazyFrame to a DataFrame with the `.collect()` method to perform such full-data computations.
Once complete, you can convert it back to a LazyFrame using the `.lazy()` method.
1. Convert the `transformed_supplier_df` LazyFrame to a DataFrame
2. Pivot the DataFrame. Note this computation needs the entire data to be processed
3. Convert the DataFrame back to a LazyFrame
Trade-Offs When Adopting Polars
Polars has an intuitive API, but it is a relatively new tool. Here are some trade-offs to be mindful of before adopting Polars in your pipelines:

- Integrations: the integrations with the existing data ecosystem are not as extensive as Pandas'.
- Strict datatypes: strict typing requires careful planning and design of pipelines; a bad datatype can break your pipelines (see the Polars datatypes reference).
- DS/DA familiarity: most data scientists and analysts are familiar with Pandas, and learning a new tool can be time-consuming.
Conclusion
To recap, we saw how
- Polars can read/write to multiple systems
- A DataFrame is a collection of columns (which are of type Series)
- Expressions define transformations, and contexts execute them
- LazyAPI can process larger-than-memory data
- The trade-offs to consider when adopting Polars
If you are building a pipeline that requires production-level stability and can process GBs of data without a hitch, give Polars a try.
It’s extremely easy to develop and maintain.
