How to Validate Datatypes in Python

Introduction

Data type issues are one of the biggest concerns when processing data in Python. If you are wondering how to

  1. Make sure that a column is of a specific data type (e.g., an int column does not contain commas, whitespace, etc.)
  2. Ensure correct data types in Python before inserting the data into a database

then this post is for you. In this post, we see how to parse data into expected data types using native Python. Then we will see how to build a reusable data type parsing pattern using Pydantic.

Note that there is a key difference between parsing and validation. Parsing refers to converting data to a specific data type. Validation refers to checking if a given input is of a certain type.

>>> int('123')  # parsing
123
>>> isinstance('123', int)  # validating
False

In this post, we use the term validate to mean that the data can be parsed into an expected type.
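
As a minimal sketch of that definition, the helper below treats a value as valid for a type when the type's constructor accepts it. The name can_parse is hypothetical; it is not part of any library.

```python
from decimal import Decimal, InvalidOperation


def can_parse(value, parser):
    """Return True if parser(value) succeeds, False if it raises a parsing error."""
    try:
        parser(value)
    except (ValueError, TypeError, InvalidOperation):
        return False
    return True


print(can_parse('123', int))          # True
print(can_parse('1,234', int))        # False: int() rejects the comma
print(can_parse(' 42 ', int))         # True: int() ignores surrounding whitespace
print(can_parse('100 USD', Decimal))  # False: Decimal raises InvalidOperation
```

Note that Decimal signals bad input with decimal.InvalidOperation rather than ValueError, which is why the helper catches both.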

Using Native Python

Let’s assume you are processing some marketing data. The dataset contains the user id, name, amount spent, and user status. The objective is to

  1. Convert user names to lower case
  2. Convert amount spent to decimal values
  3. Ensure that the status value belongs to one of ACTIVE or INACTIVE or SUSPENDED.
  4. Convert the rows into JSON rows

You can do something like

from decimal import Decimal

# assume data from some data source
marketing_data = [
    'uid,user_name,amount_spent,status',
    '1,PeRSON1,100,ACTIVE',
    '2,persOn2,100 USD,INACTIVE',
    '3,persON3,100$,SUSPENDED',
    '4,peRson4,10000.10,ACTIVE',
    '5,PERson5,,ACTIVE',
    '6,person6,20,ACTIVE',
]

output_data = []

for d in marketing_data[1:]:
    uid, user_name, amount_spent, status = d.split(',')
    clean_amt = amount_spent.replace('$', '').replace('USD', '').strip()
    op_amt = Decimal(clean_amt) if clean_amt else None
    clean_status = (
        status if status in {'ACTIVE', 'INACTIVE', 'SUSPENDED'} else None
    )
    output_data.append(
        {
            'uid': uid,
            'user_name': user_name.lower(),
            'amount_spent': op_amt,
            'status': clean_status,
        }
    )

print(output_data)
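
The loop above stops at a list of dictionaries, so objective 4 still needs a serialization step, and json.dumps cannot serialize Decimal by default. One possible sketch follows, assuming it is acceptable to write Decimal amounts as strings; the to_json_row helper and sample_row are hypothetical.

```python
import json
from decimal import Decimal


def to_json_row(row):
    """Serialize one parsed row; Decimal values are written as strings to keep precision."""
    def encode(value):
        # json.dumps calls this only for objects it cannot serialize itself
        if isinstance(value, Decimal):
            return str(value)
        raise TypeError(f'Cannot serialize {type(value).__name__}')

    return json.dumps(row, default=encode)


sample_row = {
    'uid': '4',
    'user_name': 'person4',
    'amount_spent': Decimal('10000.10'),
    'status': 'ACTIVE',
}
print(to_json_row(sample_row))
```

Writing the amount as a string avoids the rounding that a float conversion could introduce; a missing amount (None) is emitted as JSON null.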

You can see how data validation and parsing are coupled with data processing (e.g., the lower() call). This approach also requires a lot of try..except logic to catch parsing errors. Try running the for loop shown above with this data.

marketing_data = [
    'uid,user_name,amount_spent,status',
    '1,person1,x,ACTIVE',
]

You will get a decimal.InvalidOperation and have to use a try..except block or conditional logic to catch these issues. Although this is possible, it can become hard to manually validate data types and handle all such cases.
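
A sketch of what that manual handling might look like: the hypothetical parse_amount helper below reuses the cleanup from the loop above and turns an unparseable amount into a logged warning and None. Whether a bad value should become None or reject the whole row is an assumption that depends on your pipeline.

```python
import logging
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Strip currency markers and parse to Decimal; return None if empty or invalid."""
    cleaned = raw.replace('$', '').replace('USD', '').strip()
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        logging.warning('Could not parse amount %r', raw)
        return None


for raw in ['100', '100 USD', '100$', '', 'x']:
    print(repr(raw), '->', parse_amount(raw))
```

Every new column type needs a helper like this one, which is the repetition the rest of the post tries to avoid.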

Using Pydantic

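Pydantic is a popular library that parses data into the expected data types. We can declare the expected types on a dataclass and let Pydantic handle the parsing. Let’s see how we would handle our marketing data. The code below was written against the Pydantic 1.x API (it relies on __get_validators__ and pydantic.json.pydantic_encoder), so pinning a 1.x release is the safest way to reproduce the output shown.

pip install "pydantic<2"
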
import json
import logging
from decimal import Decimal, InvalidOperation
from enum import Enum
from typing import Any, Optional

from pydantic import ValidationError
from pydantic.dataclasses import dataclass
from pydantic.json import pydantic_encoder

# assume data from some data source
marketing_data = [
    'uid,user_name,amount_spent,status',
    '1,PeRSON1,100,ACTIVE',
    '2,persOn2,100 USD,INACTIVE',
    '3,persON3,100$,SUSPENDED',
    '4,peRson4,10000.10,ACTIVE',
    '5,PERson5,,ACTIVE',
    '6,person6,x,ACTIVE',
    '7,person7,200,',
]


# status can only be one of these 3
class StatusType(Enum):
    ACTIVE = 'ACTIVE'
    INACTIVE = 'INACTIVE'
    SUSPENDED = 'SUSPENDED'


# amount needs its own custom data type
# ref: https://pydantic-docs.helpmanual.io/usage/types/#custom-data-types
class ExcelNumber:
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v: Any) -> Optional[Decimal]:
        if v is not None and str(v) != '':
            try:
                return Decimal(
                    str(v)
                    .replace(',', '')
                    .replace('$', '')
                    .replace('USD', '')
                    .strip()
                )
            except InvalidOperation as io:
                logging.error(
                    f'Error while parsing {str(v)} to decimal. Error: {io}'
                )
                raise ValidationError
        return None


@dataclass
class MarketData:
    uid: int
    user_name: str
    amount_spent: Optional[
        ExcelNumber
    ]  # optional means that this field can be None
    status: StatusType


error_count = 0
output_data = []
for idx, d in enumerate(marketing_data[1:]):
    try:
        market_data = MarketData(*d.split(','))
        market_data.user_name = market_data.user_name.lower()
        output_data.append(json.dumps(market_data, default=pydantic_encoder))
    except ValidationError as ve:
        logging.error(f'row number: {idx + 1}, row: {d}, error: {ve}')
        error_count += 1

print(f'Error count: {error_count}')
print(output_data)

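You can see the validation issues for rows 6 and 7 in the error log. The odd __init__() message for row 6 is a side effect of the raise ValidationError statement in ExcelNumber.validate: the exception class is raised without the arguments its constructor requires, which produces a TypeError that Pydantic then reports. Raising ValueError from the validator instead would give a clearer message.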

ERROR:root:Error while parsing x to decimal. Error: [<class 'decimal.ConversionSyntax'>]
ERROR:root:row number: 6, row: 6,person6,x,ACTIVE, error: 1 validation error for MarketData
amount_spent
  __init__() takes exactly 3 positional arguments (1 given) (type=type_error)

ERROR:root:row number: 7, row: 7,person7,200,, error: 1 validation error for MarketData
status
  value is not a valid enumeration member; permitted: 'ACTIVE', 'INACTIVE', 'SUSPENDED' (type=type_error.enum; enum_values=[<StatusType.ACTIVE: 'ACTIVE'>, <StatusType.INACTIVE: 'INACTIVE'>, <StatusType.SUSPENDED: 'SUSPENDED'>])
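
The status error for row 7 has a simple cause: the trailing comma yields an empty string, and an empty string is not a member of StatusType. The same lookup can be reproduced with the standard library alone. This sketch only illustrates Enum lookup by value; it is not a description of Pydantic's internals.

```python
from enum import Enum


class StatusType(Enum):
    ACTIVE = 'ACTIVE'
    INACTIVE = 'INACTIVE'
    SUSPENDED = 'SUSPENDED'


# Splitting row 7 leaves an empty string in the status position
fields = '7,person7,200,'.split(',')
print(fields)  # ['7', 'person7', '200', '']

# Looking up a known value returns the matching member
print(StatusType('ACTIVE'))  # StatusType.ACTIVE

# Looking up an unknown value raises ValueError
try:
    StatusType(fields[3])
except ValueError as err:
    print(f'Invalid status: {err}')
```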

Pydantic makes it easy to

  1. Define data types for our data set
  2. Parse data to match defined datatypes
  3. Catch datatype errors with a ValidationError
  4. Define custom datatypes
  5. Catch type errors using mypy.

Pydantic offers many more features. Refer to its full documentation (listed under References) for details.

Pydantic Caveats

Although Pydantic is a great tool, there are a few caveats to be aware of.

  1. Possible data loss: If a field is declared as int and you pass in a float, Pydantic will convert the float to an int, silently dropping the fractional part. The Pydantic documentation includes an example of this, and the library’s author has explained the reasoning behind it. You can use Pydantic's strict types (e.g., StrictInt) to prevent this issue.
  2. BaseModel vs. dataclass: You can use Pydantic through its BaseModel or through its dataclass decorator. We used the dataclass in the code examples above, but it is important to understand the differences between the two; the Pydantic documentation covers them in more detail.
  3. Performance: Pydantic creates an object for each row, which adds some overhead.
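
To make the first caveat concrete without depending on a particular Pydantic version, the plain-Python sketch below contrasts a lenient conversion, which silently truncates, with a strict check that rejects non-integer input. Both helper names are hypothetical, and strict_int only mimics the idea behind strict types; it is not Pydantic's implementation.

```python
def lenient_int(value):
    """Coerce with int(); a float loses its fractional part."""
    return int(value)


def strict_int(value):
    """Accept only real ints (bool excluded); raise instead of coercing."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f'expected int, got {type(value).__name__}: {value!r}')
    return value


print(lenient_int(3.99))  # 3, the .99 is gone
print(strict_int(3))      # 3

try:
    strict_int(3.99)
except TypeError as err:
    print(f'Rejected: {err}')
```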

Conclusion

I hope this article gave you a good idea of how to ensure data type correctness in your data. Ensuring the right data types helps you avoid errors caused by incompatible data type transformations. To recap, we saw how to

  1. Define expected data types for any data set
  2. Create custom data types
  3. Parse incoming data to match the expected data types

When processing data from a source that might not always have the correct data types, try Pydantic to parse the data into the expected data types. You can save time handling edge cases and keep your code base clean by following the DRY principle.

Please leave any questions or comments in the comment section below.

Further reading

  1. Memory efficient data pipelines in python
  2. How to add tests to your data pipeline
  3. How to make your data pipelines idempotent
  4. Beginner DE project: batch
  5. DE project to impress hiring manager

References

  1. Pydantic documentation
  2. Pydantic is a parsing library
  3. Parse don’t validate