Demonstrate Python Expertise by Building Libraries: From Architecture to Published Package

Stand out in data engineering interviews with a published PyPI package. Master the systematic workflow: C4 system design, pytest patterns, LLM-guided development, and PyPI publishing—build a data validation library as portfolio proof.

TECHNICAL UPSKILL
ARCHITECTURE
HANDS-ON PROJECT
BREAK INTO DE
Author

Joseph Machado

Published

January 17, 2026

Keywords

Python, uv, polars, pytest, data quality, library, system design, open source, PyPI publishing, Python packaging, data validation, C4 model, SQLite, pytest fixtures, data engineering portfolio, Python library development, type hints, production Python

Python Expertise Isn’t Memorizing Syntax

With drag-and-drop tools and LLM code generation, you may feel like you are falling behind in your Python skills!

If you are wondering

  • how (or whether) you can demonstrate your Python expertise,
  • whether you are losing your Python edge from relying on visual tools like ADF and on LLMs, or
  • how to get better at Python when you keep doing the same tasks over and over,

this post is for you. You can demonstrate your Python expertise even before an interview: imagine showing employers that you can add value to their company before you ever have to sell them on your resume.

In this post, you will learn how to do this by building and publishing a Python library that other engineers use. The economics of coding are changing, and it’s more important than ever to showcase your expertise and value to an employer.

By the end of this post, you will have a clear, step-by-step roadmap to follow to demonstrate your value.

Prerequisites

  1. Install uv

Build Libraries for Repeated Tasks

Add value by automating/streamlining common tasks. The key is choosing an appropriate problem to solve. Shown below are some examples of libraries you can build:

  1. A boto3 wrapper with functionality like conforming to a company-standard S3 bucket path format, accessing secrets, choosing the right infrastructure for your pipeline, etc. The idea is to remove the tedium of duplicating similar functionality across your repo(s) (see the sketch after this list).
  2. A database connection manager to handle connecting to the different databases in your pipelines.
  3. A CLI/web app that wraps a manual workflow (e.g., a salesperson validating a 3rd-party input).
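For example, a minimal sketch of what such a boto3 wrapper might expose (the bucket naming convention and function names here are hypothetical):

import boto3


def standard_s3_path(team: str, dataset: str, partition_date: str) -> str:
    """Build an S3 URI that follows a (hypothetical) company-wide layout."""
    return f"s3://acme-data-{team}/{dataset}/dt={partition_date}/"


def get_secret(secret_name: str) -> str:
    """Fetch a secret string from AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_name)["SecretString"]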

We will use a problem statement from an email that I received.

I'm planning to write test scripts to make sure all the 
end-to-end pipeline tests are satisfied during 
the development phase. 

I was looking for a reference on which tools 
would be the best to use for SQL scripting 
or Python?

Shown below is the process we will follow: a framework for creating useful Python libraries.

Define Project Scope and Build Iteratively

Start by defining who this library is for and what its features are. Based on the above problem statement:

  1. Who: Data Engineer
  2. What: Tool to validate data
  3. Features:
    • The library will need to accept input(s).
    • The library has to validate data.
    • The library has to log the results.

Let’s make some assumptions to reduce the scope:

  1. Inputs: Let’s assume we are working with small-to-medium-sized datasets (< 100 GB). Let’s limit the input to only Polars DataFrames.
  2. Results log: The results will be logged to a SQLite3 table.
  3. Validation types: Let’s assume, based on talking with users, that unique, not_null, accepted_values, and relationships are the most needed (dbt out-of-box tests).

Note: You may wonder why not use an existing library like Pandera, Cuallee, or Great Expectations. In our case, their complexity, APIs, or feature sets do not meet our requirements.

Draw Your System Architecture

System architecture will help you see the components of your library and how they interact.

The C4 model for software architecture is effective for showing how the library fits into its environment (System Context diagram) and how its components interact (Container diagram).

System Context Diagram (C4 Model - Level 1)

graph TB
    User["Data Engineer<br/>[Person]<br/><br/>Validates data quality<br/>using the library"]
    
    System["Data Quality Checker<br/>[Software System]<br/><br/>Validates Polars DataFrames,<br/>logs results to SQLite, and returns pass/fail"]
    
    Database[("SQLite Database<br/>[External System]<br/><br/>Stores validation<br/>results")]
    
    User -->|Uses| System
    System -->|Reads/Writes| Database
    
    classDef person fill:#08427b,stroke:#052e56,color:#fff
    classDef system fill:#1168bd,stroke:#0b4884,color:#fff
    classDef external fill:#999999,stroke:#6b6b6b,color:#fff
    
    class User person
    class System system
    class Database external

Container Diagram (C4 Model - Level 2)

graph TB
    subgraph System["Data Quality Checker [Software System]"]
        direction TB
        
        subgraph Checker["DataQualityChecker Class"]
            direction LR
            Method1[is_column_unique]
            Method2[is_column_not_null]
            Method3[is_column_enum]
            Method4[are_tables_referential_integral]
        end
        
        subgraph Connector["DBConnector Class"]
            direction LR
            Log[log]
            Print[print_all_logs]
        end
        
        Checker -->|Uses| Connector
    end
    
    Database[(SQLite Database<br/>.db file<br/><br/>log table:<br/>id, timestamp,<br/>data_quality_check_type,<br/>result, additional_params)]
    
    Connector -->|Writes logs to| Database
    
    classDef container fill:#438dd5,stroke:#2e6295,color:#fff
    classDef method fill:#85bbf0,stroke:#5d9dd5,color:#000
    classDef external fill:#999999,stroke:#6b6b6b,color:#fff
    
    class Checker,Connector container
    class Method1,Method2,Method3,Method4,Log,Print method
    class Database external

Hint: Architecture diagrams are a great addition to your README.

Here is where you need system design skills.

  1. Read this guide to Python design patterns.
  2. Notice that the systems (validator and logger) are orthogonal; i.e., we can switch DBConnector to an implementation that writes to a different system (see the sketch below).
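For instance, one way to keep the logger swappable is to have the checker depend on a small logging protocol instead of the concrete SQLite class. This is an illustrative sketch, not the published library's API:

from typing import Any, Protocol


class ResultLogger(Protocol):
    """Anything that can record the result of a data quality check."""

    def log(self, data_quality_check_type: str, result: bool, **kwargs: Any) -> None: ...


class StdoutLogger:
    """An alternative logger that prints results instead of writing to SQLite."""

    def log(self, data_quality_check_type: str, result: bool, **kwargs: Any) -> None:
        print(f"{data_quality_check_type}: {'PASS' if result else 'FAIL'} {kwargs}")


# A checker typed against ResultLogger accepts DBConnector, StdoutLogger, or any
# future connector (Postgres, S3, etc.) without changes to the validation code.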

In the future, we may want to add an InputReader class to accept inputs of varying types (S3 files, CSV, Parquet, SFTP, etc.).

Also, note that we have already defined the components’ functions. While we can let LLMs take the wheel on this, it’s best left to the engineer to ensure it stays consistent with the library’s vision and scope. LLMs tend to be verbose, and more code leads to more bugs.

Define the Function Signatures and Generate Code with LLM

Let’s set up our library; uv makes it easy.

uv init --lib data_quality_checker # setup a project to be used as a Python library
cd data_quality_checker

# Install libraries
uv add polars 
uv add --dev pytest pytest-cov pytest-mock # only necessary during development

Let’s create the folder and files.

mkdir -p ./src/data_quality_checker/connector
# Validation logic 
touch ./src/data_quality_checker/main.py 
# Connector to sqlite3
touch ./src/data_quality_checker/connector/output_log.py 

# Create pytest folder
mkdir -p ./tests/unit
touch ./tests/conftest.py

Based on the Container Diagram above, let’s define the function signature for the DataQualityChecker class at ./src/data_quality_checker/main.py.

from typing import Any, Literal

import polars as pl

from data_quality_checker.connector.output_log import DBConnector


class DataQualityChecker:
    def __init__(self, db_connector: DBConnector):
        self.db_connector = db_connector

    def is_column_unique(
        self, data_frame_to_validate: pl.DataFrame, unique_column: str
    ) -> bool:
        """Function to check if the `unique_column` in the `data_frame_to_validate` is unique

        Args:
            data_frame_to_validate (pl.DataFrame): A polars dataframe whose column is to be validated
            unique_column (str): The name of the column to be checked for uniqueness

        Returns:
            bool: True if the column is unique
        """

    def is_column_not_null(
        self, data_frame_to_validate: pl.DataFrame, not_null_column: str
    ) -> bool:
        """Function to check if the `not_null_column` in the `data_frame_to_validate` is not null

        Args:
            data_frame_to_validate (pl.DataFrame): A polars dataframe whose column is to be validated
            not_null_column (str): The name of the column to be checked for not null

        Returns:
            bool: True if the column is not null
        """

    def is_column_enum(
        self,
        data_frame_to_validate: pl.DataFrame,
        enum_column: str,
        enum_values: list[str],
    ) -> bool:
        """Function to check if a column only has accepted values

        Args:
            data_frame_to_validate (pl.DataFrame): A polars dataframe whose column is to be validated
            enum_column (str): The column to be checked if it only has the accepted values
            enum_values (list[str]): The list of accepted values

        Returns:
            bool: True if column only has the accepted values
        """

    def are_tables_referential_integral(
        self,
        data_frame_to_validate: pl.DataFrame,
        data_frame_to_validate_against: pl.DataFrame,
        join_keys: list[str],
    ) -> bool:
        """Function to check for referential integrity between dataframes

        Args:
            data_frame_to_validate (pl.DataFrame): A dataframe that is to be checked for referential integrity
            data_frame_to_validate_against (pl.DataFrame): A second dataframe that is to be checked for referential integrity
            join_keys (list[str]): The left and right join keys for data_frame_to_validate and data_frame_to_validate_against dataframes respectively

        Returns:
            bool: True if the dataframes have referential integrity based on the join keys
        """

    def log_results(
        self,
        data_quality_check_type: Literal[
            "is_column_unique",
            "is_column_not_null",
            "is_column_enum",
            "are_tables_referential_integral",
        ],
        result: bool,
        **kwargs: Any,
    ) -> None:
        """Function to log results of a data quality check to a log location

        Args:
            data_quality_check_type: Type of data quality check that was performed
            result: The boolean result of the check
            **kwargs: Additional parameters specific to the check
        """
        self.db_connector.log(data_quality_check_type, result, **kwargs)

Based on the Container Diagram above, let’s define the function signature for the DBConnector class at ./src/data_quality_checker/connector/output_log.py.

import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Any, Literal


class DBConnector:
    """Class to connect to a sqlite3 db
    and log data to a log table
    """

    def __init__(self, db_file: Path):
        """Function to initialize a sqlite3 db on the given db_file

        Args:
            db_file: The path to the db_file for sqlite3
        """
        self.db_file = db_file
        self._create_log_table()

    def _create_log_table(self):
        """Create the log table if it doesn't exist"""

    def log(
        self,
        data_quality_check_type: Literal[
            "is_column_unique",
            "is_column_not_null",
            "is_column_enum",
            "are_tables_referential_integral",
        ],
        result: bool,
        **kwargs: Any,
    ):
        """Function to load the results into a log table, with ts of when data is inserted

        Args:
            data_quality_check_type: Type of dq check that was run
            result: The result of the dq check
            **kwargs: Additional params to the dq check
        """

    def print_all_logs(self):
        """Print all log entries in the table"""

With type hints, input parameters, and function documentation, LLMs will create working code.

See this repository for full working code.
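For illustration, here is roughly what the generated bodies could look like for one checker method and the SQLite logger. This is a simplified sketch (the class methods are shown as standalone functions, and it assumes _create_log_table built the log table from the Container diagram); the linked repository is the source of truth.

import json
import sqlite3
from pathlib import Path

import polars as pl


def is_column_unique(data_frame_to_validate: pl.DataFrame, unique_column: str) -> bool:
    """Simplified body for DataQualityChecker.is_column_unique."""
    # The column is unique when the count of distinct values equals the row count
    return (
        data_frame_to_validate[unique_column].n_unique()
        == data_frame_to_validate.height
    )


def log(db_file: Path, data_quality_check_type: str, result: bool, **kwargs) -> None:
    """Simplified body for DBConnector.log (self.db_file becomes a parameter here)."""
    conn = sqlite3.connect(db_file)
    with conn:  # commits the INSERT on success
        conn.execute(
            "INSERT INTO log (timestamp, data_quality_check_type, result, additional_params) "
            "VALUES (datetime('now'), ?, ?, ?)",
            (data_quality_check_type, int(result), json.dumps(kwargs)),
        )
    conn.close()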

Create Tests So That You Don’t Break Existing Logic (AKA Prevent Regression)

Tests enable making quick changes without worrying about breaking existing code. Let’s create unit tests.

Note: Pytest usage is required reading.

Let’s start by defining the fixtures (sketched below):

  1. A temporary db file and a DBConnector object that live for the duration of the test session, used to test DBConnector.
  2. A mock DBConnector, used to check that the validation functions call the logger without actually writing to a SQLite3 DB.
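A minimal conftest.py along these lines might look like this (a sketch; the fixture names are illustrative):

# tests/conftest.py
from pathlib import Path
from unittest.mock import MagicMock

import pytest

from data_quality_checker.connector.output_log import DBConnector


@pytest.fixture(scope="session")
def db_connector(tmp_path_factory) -> DBConnector:
    """A real DBConnector backed by a temporary SQLite file for the whole test session."""
    db_file: Path = tmp_path_factory.mktemp("db") / "test_log.db"
    return DBConnector(db_file)


@pytest.fixture
def mock_db_connector() -> MagicMock:
    """A mock DBConnector to assert that validators call log() without touching SQLite."""
    return MagicMock(spec=DBConnector)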

I passed in the above fixture context and asked an LLM to generate test classes for DBConnector and DataQualityChecker, testing all their functions. The LLM-generated tests are here.
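One of those generated tests might look roughly like this (a sketch that uses the mock fixture above, not the exact file in the repository):

# tests/unit/test_main.py (hypothetical path)
import polars as pl

from data_quality_checker.main import DataQualityChecker


class TestDataQualityChecker:
    def test_is_column_unique_returns_true_for_unique_column(self, mock_db_connector):
        checker = DataQualityChecker(db_connector=mock_db_connector)
        df = pl.DataFrame({"id": [1, 2, 3]})
        assert checker.is_column_unique(df, "id") is True

    def test_log_results_delegates_to_connector(self, mock_db_connector):
        checker = DataQualityChecker(db_connector=mock_db_connector)
        checker.log_results("is_column_unique", True, column="id")
        mock_db_connector.log.assert_called_once_with(
            "is_column_unique", True, column="id"
        )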

We can run the tests with:

uv run pytest tests/

Increase Library Adoption with a “How to Use” Readme

A well-formatted, easy-to-understand README will significantly improve usage of your library.

Assume that we have built and published our library to the PyPI package index.

Create a README with the following sections.

  1. How to install and use the library
  2. Data validation feature list
  3. Example Python code block showing how to use the functions and how to see the output log table
  4. API reference
  5. Architecture diagrams: the System Context and Container diagrams from the C4 model
  6. Development steps

Paste the above into an LLM, and you will get a relatively well-defined README. Make sure to read through it carefully and make changes as you see fit. The final README should be something like this.
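For example, the usage snippet (item 3 above) could look roughly like this (a sketch; the README linked above is the source of truth):

from pathlib import Path

import polars as pl

from data_quality_checker.connector.output_log import DBConnector
from data_quality_checker.main import DataQualityChecker

# Wire the checker to a SQLite file that will hold the validation log
connector = DBConnector(db_file=Path("dq_results.db"))
checker = DataQualityChecker(db_connector=connector)

orders = pl.DataFrame({"order_id": [1, 2, 3], "status": ["open", "closed", "open"]})

# Run a check, log the result, then inspect everything logged so far
result = checker.is_column_unique(orders, "order_id")
checker.log_results("is_column_unique", result, column="order_id")
connector.print_all_logs()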

If you are publishing to the public, make sure to add an appropriate license, such as the MIT license.

Build and Publish Your Library

Every company has its unique way of publishing packages. In this section, we will see how to publish to PyPI, which allows anyone to install your library with pip install your-library.

Before we publish to PyPI, let’s publish to and test with a test package index.

  1. Create a test pypi account at test.pypi.org
  2. Set up 2FA in Account Settings. This is required to create packages.
  3. Create an API token. Go to Account Settings → API tokens → Add API token, with scope as Entire account.
  4. Make sure to copy and save the API token.

In your project root directory, type in

uv build 
uv publish --publish-url https://test.pypi.org/legacy/

You will be prompted for a username and a token; use __token__ and the API token, respectively.

Your project will have its own PyPI page (e.g., data-quality-checker).

Make sure to install the package and test it out with

uv pip install --index-url https://test.pypi.org/simple/ data-quality-checker

Note that TestPyPI does not host every package, so the install may fail to resolve dependencies such as Polars; if that happens, pull the dependencies from the regular PyPI as well. When satisfied, do the same steps with the non-test PyPI account at pypi.org.

Apply This Framework to Your Next Library

You now have a repeatable methodology for building Python libraries:

  1. Scope the problem with clear constraints
  2. Design the high-level architecture
  3. Implement code and tests with LLM guidance
  4. Create a helpful README
  5. Publish your library to PyPI, making it pip installable

Next Steps:

  1. Identify 3 repetitive tasks in your current work
  2. Pick the simplest one and apply this framework
  3. Publish to PyPI within 2 weeks
  4. Add to your resume and GitHub portfolio

The validation library we built is just one example—this process works for any automation or tooling problem you encounter.
