Modern data engineering and analysis workflows often rely on data manipulation libraries, which in the Python ecosystem usually means pandas. One problem you may have run into with this powerful tool is that a dataframe can be an opaque object that’s hard to reason about in terms of its contents, data types, and other properties.

One tool that may help you with this problem is pandera, which was accepted by pyOpenSci as part of its ecosystem of packages in September 2019. Pandera provides a flexible and expressive data validation toolkit that helps users make statistical assertions about pandas data structures.

A Statistical Data Validation Toolkit for Pandas


To illustrate pandera’s capabilities, let’s use a small toy example. Suppose you’re analyzing data for a mission-critical project, where it’s vital to ensure the quality of the datasets that you’re looking at.

Each row in the dataset is uniquely identified by a person_id, and the columns describe that person’s height_in_cm and age_category.

import pandas as pd

dataset = pd.DataFrame(
    data={
        "height_in_cm": [150, 145, 122, 176, 137, 151],
        "age_category": ["20-30", "10-20", "10-20", "20-30", "10-20", "20-30"],
    },
    index=pd.Series([100, 101, 102, 103, 104, 105], name="person_id"),
)
print(dataset)
           height_in_cm age_category
person_id
100                 150        20-30
101                 145        10-20
102                 122        10-20
103                 176        20-30
104                 137        10-20
105                 151        20-30

You want to ensure that some columns have the correct data type, or that the dataset fulfills certain statistical properties. Pandera allows you to validate a DataFrame to ensure that these conditions are met. It allows you to spend less time worrying about the correctness of a DataFrame’s data so you can make the right assumptions in analyzing it.

Column Presence and Type Checking

The most basic type of schema is one that simply checks that specific columns exist with specific datatypes.

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "height_in_cm": pa.Column(pa.Int),
        "age_category": pa.Column(pa.String),
    },
    index=pa.Index(pa.Int, name="person_id"),
)

schema(dataset)

The schema object is callable, so you can validate the dataset by passing it in as an argument to the schema call. If the dataframe passes schema validation, schema simply returns the dataframe.

If not, it’ll provide useful error messages:

invalid_dataframe = pd.DataFrame({
    "weight_in_kg": [44, 31, 55, 61, 55, 62],
    "age_category": ["20-30", "10-20", "10-20", "20-30", "10-20", "20-30"],
})

schema(invalid_dataframe)
SchemaError: column 'height_in_cm' not in dataframe
   weight_in_kg age_category
0            44        20-30
1            31        10-20
2            55        10-20
3            61        20-30
4            55        10-20

Basic Statistical Checks

If you want to make stricter assertions about the empirical properties of the dataset, you can supply the checks keyword argument to the Column and Index constructors with a Check or a list of Checks.

schema = pa.DataFrameSchema(
    columns={
        "height_in_cm": pa.Column(
            pa.Int,
            # height in centimeters should be between 100 and 300
            checks=pa.Check(lambda s: (100 < s) & (s < 300)),
        ),
        "age_category": pa.Column(
            pa.String,
            # check allowable age categories
            checks=pa.Check(lambda s: s.isin(["10-20", "20-30"]))
        ),
    },
    index=pa.Index(
        pa.Int,
        name="person_id",
        checks=[
            # id is a positive integer
            pa.Check(lambda s: s > 0),

            # id is unique
            pa.Check(lambda s: s.duplicated().sum() == 0),
        ]
    ),
)

schema(dataset)

A Check object specifies the exact implementation of how to validate a column or index. The first positional argument in its constructor is a callable with the signature:

Callable[[pd.Series], Union[bool, pd.Series[bool]]]

Notice that the only constraint on the callable is that it takes a Series as input and returns a boolean or a boolean Series. By design, checks have access to the entire pandas Series API to make assertions about the properties of a particular column or index.
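
To make the two return shapes concrete, here is a small pandas-only sketch. The function names in_range and mean_is_plausible are hypothetical and chosen only for illustration; either one could be passed as the first argument to a Check.

```python
import pandas as pd

heights = pd.Series([150, 145, 122], name="height_in_cm")


def in_range(s: pd.Series) -> pd.Series:
    """Element-wise check: returns one boolean per value."""
    return (100 < s) & (s < 300)


def mean_is_plausible(s: pd.Series) -> bool:
    """Aggregate check: returns a single boolean for the whole Series."""
    return bool(100 < s.mean() < 300)


element_result = in_range(heights)          # boolean Series, same index as heights
scalar_result = mean_is_plausible(heights)  # plain Python bool
```

The element-wise form is what lets failures be traced back to individual rows, while the aggregate form can only pass or fail the column as a whole.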

Indexed Error Messages

In cases where the Check returns a boolean Series, violations of the schema are reported by the index location of failure cases.

invalid_data = pd.DataFrame(
    data={
        "height_in_cm": [91, 105, 87, 87],
        "age_category": ["10-20", "10-20", "10-20", "10-20"]
    },
    index=pd.Series([200, 201, 202, 203], name="person_id")
)

schema(invalid_data)
pandera.errors.SchemaError: <Schema Column: 'height_in_cm' type=int64> failed element-wise validator 0:
<lambda>
failure cases:
               person_id  count
failure_case
87            [202, 203]      2
91                 [200]      1

The error is reported as a stringified dataframe where the failure_case index enumerates the height_in_cm values that failed data validation, the person_id column lists the index locations of each failure case, and the count column displays the number of instances of that failure case.
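
To see where such a table comes from, the sketch below rebuilds an equivalent summary with plain pandas. It illustrates the idea of grouping failing values by index location and is not pandera's internal implementation; the variable names are assumptions.

```python
import pandas as pd

heights = pd.Series(
    [91, 105, 87, 87],
    index=pd.Index([200, 201, 202, 203], name="person_id"),
    name="height_in_cm",
)

# Same condition as the check: heights must lie strictly between 100 and 300.
passed = (100 < heights) & (heights < 300)
failures = heights[~passed]

# Group the failing values and collect where each one occurred.
grouped = failures.reset_index().groupby("height_in_cm")["person_id"]
summary = pd.DataFrame({"person_id": grouped.apply(list), "count": grouped.size()})
summary.index.name = "failure_case"
print(summary)
```

The value 105 passes the check, so it never appears in the summary.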

Statistical Hypothesis Tests

What if we wanted to test the hypothesis that older people tend to be taller? We can achieve this with the Hypothesis check:

schema = pa.DataFrameSchema(
    columns={
        "height_in_cm": pa.Column(
            # perform a one-sided two-sample t-test of
            # the distribution of heights by age category,
            # with an alpha value of 5%
            checks=pa.Hypothesis.two_sample_ttest(
                groupby="age_category",
                sample1="20-30",
                relationship="greater_than",
                sample2="10-20",
                alpha=0.05,
                equal_var=True,
            )
        ),
        "age_category": pa.Column(
            pa.String,
            checks=pa.Check(lambda s: s.isin(["10-20", "20-30"])),
        )
    }
)

schema(dataset)
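
For intuition about what this check asserts, the same one-sided test can be run by hand. This is a sketch that assumes SciPy is installed and uses scipy.stats.ttest_ind directly; it is meant to mirror the statistical question, not to reproduce pandera's internals.

```python
import pandas as pd
from scipy import stats

dataset = pd.DataFrame(
    {
        "height_in_cm": [150, 145, 122, 176, 137, 151],
        "age_category": ["20-30", "10-20", "10-20", "20-30", "10-20", "20-30"],
    }
)

older = dataset.loc[dataset["age_category"] == "20-30", "height_in_cm"]
younger = dataset.loc[dataset["age_category"] == "10-20", "height_in_cm"]

# Two-sample t-test assuming equal variances; SciPy reports a two-sided p-value.
t_stat, two_sided_p = stats.ttest_ind(older, younger, equal_var=True)

# Convert to a one-sided p-value for the alternative "older > younger".
one_sided_p = two_sided_p / 2 if t_stat > 0 else 1 - two_sided_p / 2

# The hypothesis is supported at the 5% level if the null is rejected.
rejects_null = one_sided_p < 0.05
```

With only three observations per group the result is sensitive to every data point, so in practice you would want a larger sample before relying on a check like this.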

Validate your Pandas Dataframes Today!

Whether you use this tool in Jupyter notebooks, one-off scripts, ETL pipeline code, or unit tests, pandera enables you to make pandas code more readable and robust by enforcing the deterministic and statistical properties of pandas data structures at runtime.

Hopefully this post has given you a flavor of what pandera can do. It also offers a few more features that you may find useful.

What’s Next?

I’m actively developing this project and have some exciting features coming up soon, such as built-in checks, first-class Dask support, and YAML schema specification. If you’d like to contribute to this project, you’re welcome to head on over to the GitHub repo!

