Introduction
Data engineers often default to Pandas/Polars/SQL for pipelines, even when the transformations don’t require them. However, building resilient pipelines is not just about picking a tool; it is also about knowing the fundamentals.
This is especially jarring if you are new to data engineering (even if you have a ton of SWE experience)! Have you felt that:
- Pandas is so ridiculously complex that you are constantly reading about it without grasping how it works?
- SQL is awesome, but it’s not really compatible with testing?
- Pandas/Polars/Spark is being used to do everything, where the Python standard library would have been the right choice?
- You are shooting in the dark, since you never know what data (or types) come out of Pandas functions?
If so, this post is for you.
Imagine knowing the principles behind the fundamental system that most data people use: Python. Transform data the right way by knowing how to use the Python standard library.
No more shoehorning everything into a dataframe or SQL.
This post follows a cheatsheet format. Every use case (e.g., Replace regex patterns) is followed by Python code that you can copy-paste and use immediately.
Python Library
Each code block is followed by numbered annotations with added information about the corresponding lines.
Prerequisite
Install uv and setup a directory to run the code as shown below.
1. Create a Python project
2. Install Python version 3.14 and set it as the default for the project
3. Add the mypy library
4. Create a virtual environment
5. Activate the virtual environment
6. Start the Python REPL; you should see Python 3.14
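The setup commands themselves are not shown above; a minimal sketch with uv, matching the annotations (the project name is a placeholder):

```shell
uv init my-project             # 1. create a Python project
cd my-project
uv python install 3.14         # 2. install Python 3.14
uv python pin 3.14             #    and set it as the project default
uv add mypy                    # 3. add the mypy library
uv venv                        # 4. create a virtual environment
source .venv/bin/activate      # 5. activate it (Windows: .venv\Scripts\activate)
python                         # 6. start the REPL; you should see Python 3.14
```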
Download and unzip your_file.csv from this GitHub gist link for the code samples.
You are all set; you can copy-paste the code below into your Python REPL.
Data IO (reading & writing data)
- Represent file & folders location with
pathlib.Path
- Create and copy files with
pathlib.Path.touch
1. Create the file
2. Make a copy of new_file at some_new_file.csv
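A minimal sketch matching the annotations (file names are placeholders; the copy step uses shutil, since pathlib does not copy file contents):

```python
import shutil
from pathlib import Path

new_file = Path('new_file.csv')
new_file.touch()                            # 1. create the (empty) file
shutil.copy(new_file, 'some_new_file.csv')  # 2. make a copy at some_new_file.csv
print(Path('some_new_file.csv').exists())
```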
- Identify files that match a pattern with
pathlib.Path.glob
1. Find any Python file in the current directory
2. Find any Python file in the directory one level down
3. Find any Python file in directories at any level down
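A sketch of the three glob patterns (run it from any directory; the results depend on what is there):

```python
from pathlib import Path

cwd = Path('.')
print(list(cwd.glob('*.py')))     # 1. Python files in the current directory
print(list(cwd.glob('*/*.py')))   # 2. Python files one level down
print(list(cwd.glob('**/*.py')))  # 3. Python files at any level down
```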
- Read and write to a file with
open
1. Replace with your file name
2. Print one line at a time
3. Read all lines into a list of strings
The r indicates read mode; the key modes are:
- ‘r’: open for reading (default)
- ‘w’: open for writing, truncating the file first
- ‘a’: open for writing, appending to the end of the file if it exists
- ‘+’: open for updating (reading and writing)
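A sketch matching the annotations; it writes a small file first so the read is reproducible (your_file.csv stands in for your own file):

```python
file_name = 'your_file.csv'        # 1. replace with your file name
with open(file_name, 'w') as f:
    f.write('id,name\n1,emp_1\n')  # sample content so the snippet runs standalone

with open(file_name, 'r') as f:
    for line in f:                 # 2. print one line at a time
        print(line.rstrip('\n'))

with open(file_name, 'r') as f:
    lines = f.readlines()          # 3. read all lines into a list of strings
print(lines)
```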
- Read csv file with each row as a list of strings using
csv.reader
1. Replace with your file name
2. Read the CSV file as lists of strings
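A sketch of csv.reader; the sample data is written first so the snippet runs standalone:

```python
import csv
from pathlib import Path

file_name = 'your_file.csv'                        # 1. replace with your file name
Path(file_name).write_text('id,name\n1,emp_1\n')   # sample data for a standalone run

with open(file_name, newline='') as f:
    rows = list(csv.reader(f))                     # 2. each row is a list of strings
print(rows)
```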
- Write to a csv file using
csv.writer
1. Replace with the file name to write the data into
2. Write the rows into the CSV file
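A sketch of csv.writer matching the annotations (file name and rows are placeholders):

```python
import csv

file_name = 'your_file_to_write_to.csv'  # 1. the file to write the data into
data = [['id', 'name'], ['1', 'emp_1'], ['2', 'emp_2']]
with open(file_name, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)               # 2. write the rows into the CSV file
```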
- Read csv files with each row as a dict with
csv.DictReader
1. Replace with your file name
2. Each line is represented as a dict, with column names as keys and column values as values
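A sketch of csv.DictReader; sample data is written first so the snippet runs standalone:

```python
import csv
from pathlib import Path

file_name = 'your_file.csv'                       # 1. replace with your file name
Path(file_name).write_text('id,name\n1,emp_1\n')  # sample data for a standalone run

with open(file_name, newline='') as f:
    records = list(csv.DictReader(f))             # 2. each line becomes a dict of column name -> value
print(records)
```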
- Write a list of dict as a csv file with
csv.DictWriter
from pathlib import Path
import csv
file_name = 'your_file_to_write_to.csv'
file_path = Path(file_name)
fake_data = [{'id': 1, 'name': 'emp_1'}, {'id': 2, 'name': 'emp_2'}]
with open(file_path, 'w+') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name'])
    writer.writeheader()
    for line in fake_data:
        writer.writerow(line)
1. Replace with the file to write the dict data into
2. Write the headers (i.e., column names) to the file
3. Write each dict as a CSV row
- Create a temporary file with
tempfile.TemporaryFile
1. Write data into the temp file. The \n indicates a new line (Python will use the right newline character for your OS)
2. We need to go back to the start of the file with seek, else tf.read() will be empty
3. The TemporaryFile is removed
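A sketch of tempfile.TemporaryFile matching the annotations:

```python
import tempfile

with tempfile.TemporaryFile(mode='w+') as tf:
    tf.write('line_1\nline_2\n')  # 1. \n starts a new line
    tf.seek(0)                    # 2. go back to the start, else tf.read() is empty
    print(tf.read())
# 3. the TemporaryFile is removed once the with block exits
```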
- Create a temporary file that can be referenced by name with
tempfile.NamedTemporaryFile
1. Print the NamedTemporaryFile’s name
2. The NamedTemporaryFile is removed
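A sketch of tempfile.NamedTemporaryFile matching the annotations:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as ntf:
    print(ntf.name)                # 1. the temp file can be referenced by name
    temp_name = ntf.name
print(os.path.exists(temp_name))   # 2. False: the NamedTemporaryFile is removed
```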
- Pass input arguments to a Python script with
argparse
fake_file_operator.py
import pathlib
import argparse
parser = argparse.ArgumentParser(
prog='ProgramName',
description='What the program does')
parser.add_argument(
'-f',
'--filename',
default='some_file.csv',
type=pathlib.Path
)
parser.add_argument(
'-op',
'--operation',
required=True,
choices=['overwrite', 'append', 'truncate', 'delete']
)
args = parser.parse_args()
print(f"File name is {args.filename} and operation is {args.operation}")
1. Initialize the ArgumentParser to define what to expect
2. This argument can be passed in either as -f or as --filename
3. This argument has a default in case the user does not provide a value
4. This argument should represent a valid file or folder path
5. This argument is required; calling the Python script without it will raise an error
6. This argument only accepts one of the allowed choices
7. The inputs are validated against the settings defined with add_argument, and you now have access to the arguments passed to the script
- Run the Python script with input arguments, e.g., python fake_file_operator.py -f some_file.csv -op append
Data Structures
- Represent an ordered list of items with
list
1. A list can be made up of values of different types
2. Access elements in a list using [index]; indexes start at 0
3. Add elements to the end of a list using append
4. Add the elements of second_list to sample_list
5. Delete the element at index 2, which is “100”
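A sketch of the list operations in the annotations (the values are illustrative):

```python
sample_list = [1, 'two', 100, 3.5]   # 1. values of different types
print(sample_list[0])                # 2. indexes start at 0
sample_list.append('new_element')    # 3. add to the end of the list
second_list = [10, 20]
sample_list.extend(second_list)      # 4. add the elements of second_list to sample_list
del sample_list[2]                   # 5. delete the element at index 2, which is 100
print(sample_list)
```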
- Need to access data with an ID? Use dict
1. Define a dict as {key1: value1, key2: value2, ...}
2. This will raise a KeyError exception, since 3 is not a key of sample_dict
3. Exception-safe way to get a value from a dict; since 3 is not a key, you will get None
4. Similar to the above line, but with a fallback default of 10
5. This will print an empty list, since we used defaultdict(list)
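A sketch of the dict lookups in the annotations (keys and values are illustrative):

```python
from collections import defaultdict

sample_dict = {1: 'emp_1', 2: 'emp_2'}  # 1. {key1: value1, key2: value2, ...}
try:
    print(sample_dict[3])               # 2. raises KeyError, since 3 is not a key
except KeyError:
    print('no such key')
print(sample_dict.get(3))               # 3. exception-safe: prints None
print(sample_dict.get(3, 10))           # 4. with a fallback default of 10

dd = defaultdict(list)
print(dd[3])                            # 5. prints an empty list
```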
- Representing a unique set of values? Use a set
# Set
uniq_set = set()
uniq_set.add(10)
uniq_set.add(10)
uniq_set.add(10)
uniq_set.add(10)
print(uniq_set)
second_uniq_set = set()
second_uniq_set.add(10)
second_uniq_set.add(20)
second_uniq_set.add(30)
uniq_set.intersection(second_uniq_set)
uniq_set.difference(second_uniq_set)
second_uniq_set.difference(uniq_set)
1. Define a set data structure with set
2. The uniq_set variable will be {10}, as duplicates are not added to the set
3. The intersection operation finds the elements present in both sets, which is {10}
4. The difference operation finds elements in uniq_set but not in second_uniq_set, which is the empty set
5. This difference finds elements in second_uniq_set but not in uniq_set, which is {20, 30}
- Represent complex data with
dataclass
from dataclasses import asdict, dataclass, field
@dataclass
class Person:
    name: str
    friends: list[str] = field(default_factory=list)

    def get_friends(self):
        print(f"{self.name}s friends are {(',').join(self.friends)}")
p = Person('person_name_1', ['person_2', 'person_10'])
print(p.name, p.friends)
p.get_friends()
print(asdict(p))
1. Define a dataclass with the @dataclass decorator
2. Create functions to transform your data
3. asdict converts the dataclass object into a dict of keys and values
- Assigning Python data structures and objects to new variables copies references, not values. Use copy.deepcopy from the copy module to create an independent copy.
from copy import deepcopy
sample_dict = {1: 'emp_1', 2: 'emp_2'}
sample_dict_2 = sample_dict
sample_dict_2[1] = 'employee_1'
print(sample_dict[1])
sample_dict = {1: 'emp_1', 2: 'emp_2'}
sample_dict_2 = deepcopy(sample_dict)
sample_dict_2[1] = 'employee_1'
print(sample_dict[1])
@dataclass
class Person:
    name: str
    friends: list[str] = field(default_factory=list)
p = Person('person_name_1', ['person_2', 'person_10'])
p2 = p
p2.name = 'person_name_2'
print(p.name)
from copy import deepcopy
p = Person('person_name_1', ['person_2', 'person_10'])
p2 = deepcopy(p)
p2.name = 'person_name_2'
print(p.name)
1. This will be employee_1
2. This will be emp_1
3. This will be person_name_2
4. This will be person_name_1
- Use Decimal to represent precise decimal numbers; Python defaults to floats, which are imprecise.
1. This will print a value with a long list of 0s and some trailing decimals
2. Set the precision for Decimal types
3. This will print 3.3000
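A sketch matching the annotations (the example values are assumptions chosen to reproduce the described output):

```python
from decimal import Decimal, getcontext

print(1.1 + 2.2)                               # 1. float arithmetic is imprecise: 3.3000000000000003
getcontext().prec = 5                          # 2. set the precision for Decimal operations
total = Decimal('1.1000') + Decimal('2.2000')
print(total)                                   # 3. prints 3.3000
```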
- Use the datetime module to represent and manipulate dates & times
1. Prints the current date and time of your machine
2. strptime stands for String Parse Time; it converts a string into a datetime. The input string should match the format we specify
3. strftime stands for String Format Time; it converts a datetime object into a string with the specified format
The format codes for the various datetime representations can be found in the official docs.
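A sketch of the three annotations (the date string and formats are illustrative):

```python
from datetime import datetime

print(datetime.now())                                    # 1. current date and time of your machine
date_obj = datetime.strptime('2025-10-14', '%Y-%m-%d')   # 2. string -> datetime
print(date_obj)
date_str = date_obj.strftime('%Y/%m/%d')                 # 3. datetime -> string
print(date_str)
```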
- Use datetime.timedelta to calculate time differences
from datetime import datetime
date_str_1 = "2025-10-14"
date_obj_1 = datetime.strptime(date_str_1, "%Y-%m-%d")
date_str_2 = "2025-10-16"
date_obj_2 = datetime.strptime(date_str_2, "%Y-%m-%d")
# Calculate the time delta
time_delta = date_obj_2 - date_obj_1
print(f"Time delta: {time_delta}")
print(f"Days: {time_delta.days}")
print(f"Total seconds: {time_delta.total_seconds()}")
print(f"Hours: {time_delta.total_seconds() / 3600}")
1. The difference of two datetime objects is of the datetime.timedelta type
2. Prints the time delta
3. datetime.timedelta has attributes and methods to express the difference in various time units
JSON, a combination of lists & dicts, is a standard format for communication between systems.
- Use the json.dumps function to convert a dict to a JSON string (aka serialization)
1. Convert the dict into a string
- Use the json.loads function to convert a JSON string into a dict (aka deserialization)
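A sketch covering both directions (the dict contents are illustrative):

```python
import json

sample_dict = {'id': 1, 'name': 'emp_1'}
json_str = json.dumps(sample_dict)   # 1. serialization: dict -> JSON string
print(json_str)
round_trip = json.loads(json_str)    # deserialization: JSON string -> dict
print(round_trip)
```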
Transformations
- Create single-line functions, called lambda functions
1. Define a lambda function as lambda input1, input2, ...: some_operation
2. Define a lambda function with multiple input parameters
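A sketch matching the annotations:

```python
square = lambda x: x * x   # 1. lambda input: some_operation
add = lambda a, b: a + b   # 2. multiple input parameters
print(square(4), add(2, 3))
```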
- Apply a function to every element with
map
1. Apply the square function to every element in numbers
2. The square function is only executed when we need the output
3. Here, we force map to run by converting it to a list
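A sketch matching the annotations (square and numbers are the names the annotations assume):

```python
def square(n):
    return n * n

numbers = [1, 2, 3, 4]
squared = map(square, numbers)  # 1. apply square to every element
                                # 2. nothing runs yet: map is lazy
result = list(squared)          # 3. converting to a list forces map to run
print(result)
```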
- Filter elements with
filter
1. Only keep the elements in numbers for which is_even returns true
2. filter will not be executed until needed; here we force it by converting the result to a list
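A sketch matching the annotations (is_even is the predicate the annotations assume):

```python
def is_even(n):
    return n % 2 == 0

numbers = [1, 2, 3, 4, 5, 6]
evens = list(filter(is_even, numbers))  # 1. keep elements where is_even is true
print(evens)                            # 2. list() forces the lazy filter to run
```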
- Use min, max, or sum to get the min, max, & sum of elements. Note: sum only works with numerical data.
- Use custom logic to define min or max with the key parameter
students = [
{"name": "Alice", "score": 85, "age": 20},
{"name": "Bob", "score": 92, "age": 22},
{"name": "Charlie", "score": 78, "age": 21},
{"name": "David", "score": 95, "age": 19}
]
lowest_scorer = min(students, key=lambda s: s["score"])
highest_scorer = max(students, key=lambda s: s["score"])
print(f"Lowest scoring student: {lowest_scorer}")
print(f"Highest scoring student: {highest_scorer}")
1. Select based on the score key of each dict
- Sort using the sorted function.
1. Returns a new sorted list
- Sort with custom logic using the key input parameter.
words = ["banana", "apple", "cherry", "date"]
print("Sorted words:", sorted(words))
print("Sorted by length:", sorted(words, key=len))
students = [
{"name": "Alice", "score": 85},
{"name": "Bob", "score": 92},
{"name": "Charlie", "score": 78},
{"name": "David", "score": 88}
]
sorted_students = sorted(students, key=lambda s: s["score"])
print("Sorted by score:")
for student in sorted_students:
    print(f"{student['name']}: {student['score']}")
1. Sorted in alphabetical order
2. Sorted by the length of the word
3. Sorted by score
- Sort in place using the sort method.
1. sort updates numbers in place with the sorted elements
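A sketch of sorting in place (the values are illustrative):

```python
numbers = [3, 1, 2]
numbers.sort()   # 1. sort updates numbers in place
print(numbers)
```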
- Find if any or all of the values in an iterable are truthy (not empty, None, or False) with the any & all functions, respectively
1. Checks whether any element in the list is truthy (i.e., not None, empty, or 0); this will be True
2. Checks whether all elements in the list are truthy (i.e., not None, empty, or 0); this will be False
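A sketch matching the annotations (the values are illustrative):

```python
values = [None, '', 0, 10]
print(any(values))  # 1. True: at least one element (10) is truthy
print(all(values))  # 2. False: not every element is truthy
```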
- Combine multiple lists into a list of tuples with
zip
1. Prints a list of tuples; each tuple has a name, an age, and a city
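A sketch of zip matching the annotation (the sample values are illustrative):

```python
names = ['Alice', 'Bob']
ages = [30, 25]
cities = ['NYC', 'LA']
combined = list(zip(names, ages, cities))  # 1. list of (name, age, city) tuples
print(combined)
```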
- Loop through elements in a list using
for element in list:
- Loop through key, values in a dict using
for key, value in dict.items():
1. dict.items() returns the key-value pairs, like [(key1, value1), (key2, value2), ...]
2. Use dict.keys() and dict.values() to get the keys and the values of the dict, respectively
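A sketch covering both loop patterns (the data is illustrative):

```python
sample_list = ['a', 'b', 'c']
for element in sample_list:             # loop through a list
    print(element)

sample_dict = {1: 'emp_1', 2: 'emp_2'}
for key, value in sample_dict.items():  # 1. items() yields (key, value) pairs
    print(key, value)
print(list(sample_dict.keys()), list(sample_dict.values()))  # 2. keys and values separately
```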
- Use enumerate to get the index and element when looping
- Write single-line loops using
comprehensions
1. range(5) will return numbers 0-4, and we use a list comprehension to store them as a list
2. Apply str.upper() to every word in the words list
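A sketch matching the annotations (the words are illustrative):

```python
numbers = [i for i in range(5)]           # 1. 0-4 stored as a list
words = ['banana', 'apple']
upper_words = [w.upper() for w in words]  # 2. str.upper() applied to every word
print(numbers, upper_words)
```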
- Single line loops using
comprehensions with if..else
1. Use if..else along with a list comprehension
- Dictionary comprehension is similar to list comprehension, but with
{key: value}
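A sketch covering both the if..else comprehension and a dict comprehension (the values are illustrative):

```python
labels = ['even' if n % 2 == 0 else 'odd' for n in range(4)]  # 1. if..else inside a list comprehension
squares = {n: n * n for n in range(4)}                        # dict comprehension: {key: value}
print(labels, squares)
```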
- Only process data as needed with
generators
1. The count_up_to function is called
2. count_up_to yields 1 and moves control to the caller
3. num is printed, and count_up_to is resumed
4. The logic continues from i += 1, and so on, until the while condition fails
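A sketch of the count_up_to generator the annotations describe:

```python
def count_up_to(limit):
    i = 1
    while i <= limit:
        yield i        # 2. yields i and moves control back to the caller
        i += 1         # 4. resumes here on the next iteration

for num in count_up_to(3):  # 1. the generator is created; values are pulled one at a time
    print(num)              # 3. num is printed, then count_up_to resumes
```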
- Check multiple conditions in order with if..elif..else
1. Conditions are checked in the order they are specified
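A sketch matching the annotation (the thresholds are illustrative):

```python
score = 85
if score >= 90:      # 1. conditions are checked in the order specified
    grade = 'A'
elif score >= 80:
    grade = 'B'
else:
    grade = 'C'
print(grade)
```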
- Check if a regex pattern exists in a string with the re.search function
1. Search for the given regex pattern and return true if found
- Extract the string matching a regex pattern with the re.search function
1. Search for the given regex pattern and return the first matching substring if found, else None
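A sketch covering both re.search uses (the text and pattern are illustrative):

```python
import re

text = 'order id: ord-12345'
print(bool(re.search(r'ord-\d+', text)))  # 1. True if the pattern exists
match = re.search(r'ord-\d+', text)
print(match.group() if match else None)   # first matching substring, else None
```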
- Find all strings that match a given pattern with
re.findall
1. Search for the given regex pattern and return a list of all the matching substrings if found, else []
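A sketch of re.findall (the text and pattern are illustrative):

```python
import re

text = 'contact a@x.com or b@y.com'
emails = re.findall(r'\S+@\S+\.com', text)  # 1. all matching substrings, else []
print(emails)
```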
- Replace regex patterns with
re.sub
1. Replace the specified regex pattern with [REDACTED] in the text variable
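A sketch of re.sub matching the annotation (the phone-number pattern is illustrative):

```python
import re

text = 'call 555-123-4567 today'
redacted = re.sub(r'\d{3}-\d{3}-\d{4}', '[REDACTED]', text)  # 1. replace matches with [REDACTED]
print(redacted)
```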
- Split a string into a list of strings with
split
1. Specify the character on which to split the string; here, it is a comma
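A sketch of split (the string is illustrative):

```python
row = 'id,name,age'
print(row.split(','))  # 1. split on the comma
```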
- Replace character(s) of a string with the replace function.
1. Replace cats with dogs. Note that this can be done with the regex module, but this is simpler
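A sketch matching the annotation:

```python
text = 'cats are great; I like cats'
print(text.replace('cats', 'dogs'))  # 1. replace cats with dogs
```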
- Use f-strings to create a string interlaced with variables and expressions inside {}
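A sketch of an f-string with both a variable and an expression inside {}:

```python
name = 'emp_1'
count = 3
message = f"{name} has {count * 2} tasks"  # expressions are evaluated inside {}
print(message)
```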
- Ensure your code can handle errors and clean up with try/except/else/finally blocks
def divide_numbers(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        print("Error: Cannot divide by zero")
        return None
    except TypeError:
        print("Error: Please provide numbers")
        return None
    else:
        print(f"Success! {a} / {b} = {result}")
        return result
    finally:
        print("Division attempt completed\n")
# Test cases
print("Case 1: Valid division")
divide_numbers(10, 2)
print("Case 2: Division by zero")
divide_numbers(10, 0)
print("Case 3: Invalid type")
divide_numbers(10, "abc")
1. Code block to execute for specific types of exceptions
2. Runs only if no exception occurred
3. Always runs, whether there was an error or not
Maintenance
- Access environment variables with os.getenv
1. This will fail when no HOME environment variable is set
2. This will print SOME_DEFAULT_PATH if you do not have the HOME environment variable
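A sketch matching the annotations; the failing case is assumed to use os.environ, which raises KeyError, while os.getenv accepts a default:

```python
import os

try:
    print(os.environ['HOME'])                  # 1. raises KeyError when HOME is not set
except KeyError:
    print('HOME not set')
print(os.getenv('HOME', 'SOME_DEFAULT_PATH'))  # 2. falls back to the default
```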
- Log what is happening in your code with
logging
sample_log.py
By default, only warning and above messages are logged. But you can run it with another level, as shown below.
- Use contextlib.contextmanager to handle opening and closing connections to external systems in one place.
import sqlite3
from contextlib import contextmanager

@contextmanager
def sqlite_connection(db_path):
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

with sqlite_connection('database.db') as conn:
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM table_name')
    results = cursor.fetchall()
# with exit
1. Handle commits, rollbacks, and reuse of this logic in one place via the with statement
2. When code enters the with block, sqlite_connection is called
3. The sqlite_connection function yields an open connection conn
4. Upon exiting the with block, control is passed back to sqlite_connection
5. The connection is committed and closed, or rolled back if there was an exception
- Ensure that you are using the correct data type with
mypy
sample_type_check.py
1. The key 'a' is not valid and will raise a KeyError exception
However, if you had used types and a type checker, you would have been able to catch this issue before running your code.
sample_type_check.py
1. mypy will catch the error and show: sample_type_check.py: error: Invalid index type "str" for "dict[int, str]"; expected type "int" [index]. Found 1 error in 1 file (checked 1 source file)
The error can be fixed before running the code.
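The contents of sample_type_check.py are not shown; a sketch that reproduces the mypy error above (wrapped in try/except so it also runs):

```python
sample_dict: dict[int, str] = {1: 'emp_1', 2: 'emp_2'}
try:
    print(sample_dict['a'])  # 1. KeyError at runtime; mypy flags the str index before running
except KeyError:
    print('KeyError: mypy would have caught this before running')
```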
Testing
- Ensure that your code does what it’s supposed to do with
unittest
sample_test_file.py
1. assert checks if the value you get from your function is what you expect it to be
Run the tests as shown below:
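The contents of sample_test_file.py are not shown; a minimal sketch (square is a stand-in function):

```python
# sample_test_file.py (reconstructed sketch)
import unittest

def square(n):
    return n * n

class TestSquare(unittest.TestCase):
    def test_square(self):
        self.assertEqual(square(4), 16)  # 1. assert the function returns what we expect
```

Run it with python -m unittest sample_test_file.py.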
Environment Management
- Set up a Python project with uv
1. Create a project called my-project
2. Define the Python version to be used
3. Add the libraries that you need
- Create a virtual env for your Python project with
uv
1. Create a virtual environment with uv
2. Activate the virtual environment (use source .. on Unix, .venv\Scripts.. on Windows)
3. Run main.py inside the virtual environment
4. Deactivate the virtual environment when done
- Remove unnecessary packages with uv remove
- Create a single script with the library it needs using
uv add --script
1. Create an individual Python script with a specific Python version
2. Define the libraries for this specific Python script
The created sample_data_proc.py will look as shown below.
sample_data_proc.py
1. Defines the Python version; autogenerated when creating the script with uv
2. Defines the libraries; autogenerated when adding libraries
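The generated header uses PEP 723 inline script metadata; a sketch of what it might look like (the pandas dependency and version pin are illustrative):

```python
# /// script
# requires-python = ">=3.14"
# dependencies = [
#     "pandas",
# ]
# ///
```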
Conclusion
To recap, we covered:
- Data reading and writing with Python
- Python Data Structures
- Transformations in Python
- Writing maintainable code
- Testing
- Setting up Python environment
Use this cheatsheet the next time you encounter a complex dataframe or SQL code within Python.
Search this cheatsheet for what you want (e.g., sort with custom logic), and you will find something that helps.