Python Expertise Isn’t Memorizing Syntax
With drag-and-drop tools and LLM code generation, you may feel like you are falling behind in your Python skills!
If you are wondering:
- how (or whether) you can demonstrate your Python expertise
- whether you are losing your Python edge from relying on visual tools like ADF and LLMs
- how to get better at Python when you keep doing the same thing over and over again
This post is for you. Demonstrate your Python expertise even before an interview. Imagine showing employers that you can add value to their company before having to sell them on your resume.
In this post, you will learn how to do this by building and publishing a Python library that other engineers use. The economics of coding are changing, and it’s more important than ever to showcase your expertise and value to an employer.
By the end of this post, you will have a clear, step-by-step roadmap to follow to demonstrate your value.
Prerequisites
- Install uv
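If you don’t already have uv, one option is the official installer (see the uv documentation for alternatives such as pipx or pip):

```bash
# Official standalone installer for uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# or, if you prefer: pip install uv
```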
Build Libraries for Repeated Tasks
Add value by automating/streamlining common tasks. The key is choosing an appropriate problem to solve. Shown below are some examples of libraries you can build:
1. A boto3 wrapper with functionality like conforming to a company-standard S3 bucket path format, accessing secrets, choosing the right infrastructure for your pipeline, etc. The idea is to remove the tedium of duplicating similar functionality across your repo(s).
2. A database connection manager to handle connecting to the different databases in your pipeline.
3. Wrapping a manual workflow (e.g., a salesperson validating third-party input) into a CLI, web app, etc.
We will use a problem statement from an email that I received.
Shown below is the process we will follow. 
Define Project Scope and Build Iteratively
Start by defining who the library is for and what its features are, based on the problem statement above.

Who: Data Engineer
What: Tool to validate data
Features:
- The library will need to accept input(s).
- The library has to validate data.
- The library has to log the results.
Let’s make some assumptions to reduce the scope:
- Inputs: Let’s assume we are working with small-to-medium-sized datasets (< 100 GB), and limit the input to Polars DataFrames only.
- Results log: The results will be logged to a SQLite3 table.
- Validation types: Let’s assume, based on talking with users, that unique, not_null, accepted_values, and relationships are the most needed (dbt’s out-of-the-box tests).
Note: You may wonder why we don’t use an existing library such as Pandera, Cuallee, or Great Expectations. In short, complexity: their APIs and features don’t match our requirements.
Draw Your System Architecture
System architecture will help you see the components of your library and how they interact.
The C4 model for software architecture is effective for demonstrating how the library works (System Context diagram) and how its components interact (Container diagram).
System Context Diagram (C4 Model - Level 1)

graph TB
User["Data Engineer<br/>[Person]<br/><br/>Validates data quality<br/>using the library"]
System["Data Quality Checker<br/>[Software System]<br/><br/>Validates Polars DataFrames,<br/>logs results to SQLite,<br/>and returns pass/fail"]
Database[("SQLite Database<br/>[External System]<br/><br/>Stores validation<br/>results")]
User -->|Uses| System
System -->|Reads/Writes| Database
classDef person fill:#08427b,stroke:#052e56,color:#fff
classDef system fill:#1168bd,stroke:#0b4884,color:#fff
classDef external fill:#999999,stroke:#6b6b6b,color:#fff
class User person
class System system
class Database external
Container Diagram (C4 Model - Level 2)
graph TB
subgraph System["Data Quality Checker [Software System]"]
direction TB
subgraph Checker["DataQualityChecker Class"]
direction LR
Method1[is_column_unique]
Method2[is_column_not_null]
Method3[is_column_enum]
Method4[are_tables_referential_integral]
end
subgraph Connector["DBConnector Class"]
direction LR
Log[log]
Print[print_all_logs]
end
Checker -->|Uses| Connector
end
Database[(SQLite Database<br/>.db file<br/><br/>log table:<br/>id, timestamp,<br/>data_quality_check_type,<br/>result, additional_params)]
Connector -->|Logs to| Database
classDef container fill:#438dd5,stroke:#2e6295,color:#fff
classDef method fill:#85bbf0,stroke:#5d9dd5,color:#000
classDef external fill:#999999,stroke:#6b6b6b,color:#fff
class Checker,Connector container
class Method1,Method2,Method3,Method4,Log,Print method
class Database external
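For reference, the log table sketched in the diagram could be created with SQL roughly like the following. The column names come from the diagram; the column types and the database file name are assumptions, and in the library this work happens inside the DBConnector class defined later.

```python
import sqlite3

# Columns taken from the Container diagram; the types are assumptions.
DDL = """
CREATE TABLE IF NOT EXISTS log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    data_quality_check_type TEXT NOT NULL,
    result INTEGER NOT NULL,      -- SQLite has no BOOLEAN type; store 0/1
    additional_params TEXT        -- e.g., JSON-encoded kwargs
)
"""

with sqlite3.connect("data_quality.db") as conn:  # file name is illustrative
    conn.execute(DDL)
```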
Hint: Architecture diagrams are a great addition to your README.
Here is where you need system design skills.
- Read up on Python design patterns.
- Notice that the systems (validator and logger) are orthogonal; i.e., we can swap DBConnector for an implementation that writes to a different system, as sketched below.

In the future, we may want an InputReader class to accept inputs of varying types (S3 files, CSV, Parquet, SFTP, etc.).
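For example, because the checker only depends on the connector’s small interface (log and print_all_logs, per the Container diagram), a hypothetical JSONFileConnector (purely illustrative, not part of the library) could be swapped in without touching the validation logic:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


class JSONFileConnector:
    """Hypothetical drop-in alternative to DBConnector that appends
    log entries to a JSON-lines file instead of SQLite."""

    def __init__(self, log_file: Path):
        self.log_file = log_file

    def log(self, data_quality_check_type: str, result: bool, **kwargs: Any) -> None:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "data_quality_check_type": data_quality_check_type,
            "result": result,
            "additional_params": kwargs,
        }
        with self.log_file.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def print_all_logs(self) -> None:
        print(self.log_file.read_text())
```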
Also, note that we have already defined the components’ functions. While we can let LLMs take the wheel on this, it’s best left to the engineer to ensure it stays consistent with the library’s vision and scope. LLMs tend to be verbose, and more code leads to more bugs.
Define the Function Signatures and Generate Code with LLM
Let’s set up our library; uv makes it easy.
Let’s create the folder and files:
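The exact commands will vary, but with uv it looks roughly like this (the distribution name and module paths match the ones used later in this post):

```bash
uv init --lib data-quality-checker   # creates pyproject.toml and src/data_quality_checker/
cd data-quality-checker
uv add polars

# files for the two components from the Container diagram
touch src/data_quality_checker/main.py
mkdir -p src/data_quality_checker/connector
touch src/data_quality_checker/connector/__init__.py src/data_quality_checker/connector/output_log.py
```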
Based on the Container Diagram above, let’s define the function signature for the DataQualityChecker class at ./src/data_quality_checker/main.py.
from typing import Any, Literal

import polars as pl

from data_quality_checker.connector.output_log import DBConnector


class DataQualityChecker:
    def __init__(self, db_connector: DBConnector):
        self.db_connector = db_connector

    def is_column_unique(
        self, data_frame_to_validate: pl.DataFrame, unique_column: str
    ) -> bool:
        """Function to check if the `unique_column` in the `data_frame_to_validate` is unique

        Args:
            data_frame_to_validate (pl.DataFrame): A polars dataframe whose column is to be validated
            unique_column (str): The name of the column to be checked for uniqueness

        Returns:
            bool: True if the column is unique
        """

    def is_column_not_null(
        self, data_frame_to_validate: pl.DataFrame, not_null_column: str
    ) -> bool:
        """Function to check if the `not_null_column` in the `data_frame_to_validate` is not null

        Args:
            data_frame_to_validate (pl.DataFrame): A polars dataframe whose column is to be validated
            not_null_column (str): The name of the column to be checked for not null

        Returns:
            bool: True if the column is not null
        """

    def is_column_enum(
        self,
        data_frame_to_validate: pl.DataFrame,
        enum_column: str,
        enum_values: list[str],
    ) -> bool:
        """Function to check if a column only has accepted values

        Args:
            data_frame_to_validate (pl.DataFrame): A polars dataframe whose column is to be validated
            enum_column (str): The column to be checked if it only has the accepted values
            enum_values (list[str]): The list of accepted values

        Returns:
            bool: True if column only has the accepted values
        """

    def are_tables_referential_integral(
        self,
        data_frame_to_validate: pl.DataFrame,
        data_frame_to_validate_against: pl.DataFrame,
        join_keys: list[str],
    ) -> bool:
        """Function to check for referential integrity between dataframes

        Args:
            data_frame_to_validate (pl.DataFrame): A dataframe that is to be checked for referential integrity
            data_frame_to_validate_against (pl.DataFrame): A second dataframe that is to be checked for referential integrity
            join_keys (list[str]): The left and right join keys for data_frame_to_validate and data_frame_to_validate_against dataframes respectively

        Returns:
            bool: True if the dataframes have referential integrity based on the join keys
        """

    def log_results(
        self,
        data_quality_check_type: Literal[
            "is_column_unique",
            "is_column_not_null",
            "is_column_enum",
            "are_tables_referential_integral",
        ],
        result: bool,
        **kwargs: Any,
    ) -> None:
        """Function to log results of a data quality check to a log location

        Args:
            data_quality_check_type: Type of data quality check that was performed
            result: The boolean result of the check
            **kwargs: Additional parameters specific to the check
        """
        self.db_connector.log(data_quality_check_type, result, **kwargs)

Based on the Container Diagram above, let’s define the function signature for the DBConnector class at ./src/data_quality_checker/connector/output_log.py.
import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Any, Literal


class DBConnector:
    """Class to connect to a sqlite3 db
    and log data to a log table
    """

    def __init__(self, db_file: Path):
        """Function to initialize a sqlite3 db on the given db_file

        Args:
            db_file: The path to the db_file for sqlite3
        """
        self.db_file = db_file
        self._create_log_table()

    def _create_log_table(self):
        """Create the log table if it doesn't exist"""

    def log(
        self,
        data_quality_check_type: Literal[
            "is_column_unique",
            "is_column_not_null",
            "is_column_enum",
            "are_tables_referential_integral",
        ],
        result: bool,
        **kwargs: Any,
    ):
        """Function to load the results into a log table, with ts of when data is inserted

        Args:
            data_quality_check_type: Type of dq check that was run
            result: The result of the dq check
            **kwargs: Additional params to the dq check
        """

    def print_all_logs(self):
        """Print all log entries in the table"""

With type hints, input parameters, and function documentation, LLMs will create working code.
See this repository for full working code.
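For example, the generated is_column_unique logic could end up looking something like this. It is a plausible Polars sketch, written here as a standalone function so it can be run directly; inside the class it would also call self.log_results, and the actual repository code may differ.

```python
import polars as pl


def is_column_unique(data_frame_to_validate: pl.DataFrame, unique_column: str) -> bool:
    # The column is unique when the number of distinct values equals the row count
    return (
        data_frame_to_validate[unique_column].n_unique()
        == data_frame_to_validate.height
    )


if __name__ == "__main__":
    df = pl.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 20]})
    print(is_column_unique(df, "order_id"))      # True
    print(is_column_unique(df, "customer_id"))   # False
```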
Create Tests So That You Don’t Break Existing Logic (AKA Prevent Regression)
Tests enable you to make quick changes without worrying about breaking existing code. Let’s create unit tests.
Note: Pytest usage is required reading.
Let’s start by defining the fixtures:
- A temporary db file & DBConnector object for the duration of the test session, to test DBConnector.
- A mock for DBConnector, to test that it’s being called by the validation functions without actually writing to a SQLite3 DB (see the sketch below).
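A sketch of what these fixtures could look like (the fixture names and the use of tmp_path_factory are my choices; the LLM-generated tests in the repository may differ):

```python
from pathlib import Path
from unittest.mock import MagicMock

import pytest

from data_quality_checker.connector.output_log import DBConnector


@pytest.fixture(scope="session")
def db_connector(tmp_path_factory) -> DBConnector:
    """A real DBConnector backed by a temporary SQLite file for the whole test session."""
    db_file: Path = tmp_path_factory.mktemp("dq_logs") / "test_log.db"
    return DBConnector(db_file=db_file)


@pytest.fixture
def mock_db_connector() -> MagicMock:
    """A mock DBConnector so validation tests can assert that `log` was called
    without writing to SQLite."""
    return MagicMock(spec=DBConnector)
```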
I passed in the above fixture context and asked an LLM to generate test classes for DBConnector and DataQualityChecker, testing all of their functions. The LLM-generated tests are here.
We can run the tests from the project root.
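Assuming pytest has been added as a dev dependency, something like:

```bash
uv add --dev pytest   # once, if not already added
uv run pytest
```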
Increase Library Adoption with a “How to Use” Readme
A well-formatted, easy-to-understand README will significantly improve usage of your library.
Assume that we have built and published our library to the PyPI package index.
Create a README with the following sections.
- How to install and use the library
- Data validation feature list
- Example Python code block showing how to use the functions and how to see the output log table
- API reference
- Architecture diagrams: the System Context and Container diagrams from the C4 model
- Development steps
Paste the above into an LLM, and you will get a relatively well-defined README. Make sure to read through it carefully and make changes as you see fit. The final README should be something like this.
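For instance, the usage example in the README might look something like this. It is a sketch using the API defined earlier; whether the check methods log automatically or require an explicit log_results call is an implementation detail, so the result is logged explicitly here.

```python
from pathlib import Path

import polars as pl

from data_quality_checker.main import DataQualityChecker
from data_quality_checker.connector.output_log import DBConnector

connector = DBConnector(db_file=Path("dq_logs.db"))
checker = DataQualityChecker(db_connector=connector)

orders = pl.DataFrame({"order_id": [1, 2, 3], "status": ["open", "closed", "open"]})

result = checker.is_column_unique(orders, unique_column="order_id")
checker.log_results("is_column_unique", result, unique_column="order_id")

# Inspect everything that has been logged so far
connector.print_all_logs()
```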
If you are publishing to the public, make sure to add an appropriate license, such as the MIT license.
Build and Publish Your Library
Every company has its own way of publishing packages. In this section, we will see how to publish to PyPI, which allows anyone to install your library with pip install your-library.
Before we publish to PyPI, let’s publish and test with a test package index (TestPyPI).
- Create a TestPyPI account at test.pypi.org.
- Set up 2FA in Account Settings. This is required to create packages.
- Create an API token: go to Account Settings → API tokens → Add API token, with the scope set to Entire account.
- Make sure to copy and save the API token.
In your project root directory, build and upload the package to TestPyPI.
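The exact commands depend on your packaging tooling; assuming uv (installed in the prerequisites), it would be roughly the following, with the TestPyPI upload URL passed explicitly:

```bash
uv build
uv publish --publish-url https://test.pypi.org/legacy/
```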
You will be prompted for a username and a token; use __token__ and the API token, respectively.
Your project will have its own TestPyPI page (e.g., data-quality-checker).
Make sure to install the package and test it out.
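One way to do this (the --extra-index-url is there because TestPyPI does not host dependencies such as Polars, so they are pulled from the real PyPI):

```bash
uv pip install \
  --index-url https://test.pypi.org/simple/ \
  --extra-index-url https://pypi.org/simple/ \
  data-quality-checker
```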
When satisfied, do the same but with the non-test PyPI account at pypi.org.
Apply This Framework to Your Next Library
You now have a repeatable methodology for building Python libraries:
- Scope the problem with clear constraints
- Design the high-level architecture
- Implement and test with LLM guidance
- Create a helpful README
- Publish your library to PyPI, making it pip-installable
Next Steps:
- Identify 3 repetitive tasks in your current work
- Pick the simplest one and apply this framework
- Publish to PyPI within 2 weeks
- Add to your resume and GitHub portfolio
The validation library we built is just one example—this process works for any automation or tooling problem you encounter.