Getting Started

This module provides basic table-type data normalization that I have personally needed across many different projects over the past couple of years. It allows the normalization of table data readable by the popular pyexcel library (mainly .xlsx, .xls, and .csv files).

It uses the concept of rules, which can be chained to form a cleaning process for these files. This module has a lot of room for improvement, but it gets the job done that I needed it to do. Hopefully I'll continue to contribute rules, features, and cleaner functionality as I need them.

Installation

Currently SandPaper is on PyPI and can easily be installed through pip!

pip install sandpaper

Usage

Using SandPaper is fairly simple and straightforward. First things first: in order to normalize any data, you have to create an instance of the SandPaper object to group together your normalization rules.

from sandpaper import SandPaper

# for an explicitly named sandpaper instance
my_sandpaper = SandPaper('my-sandpaper')

# for an implicitly named sandpaper instance
my_sandpaper = SandPaper()

Chaining Rules

Now that you have a SandPaper instance, you can start chaining in rules that should be applied in order to normalize the data.

Tip

For a full list of available rules, check out the list of rules here.

Rules can be applied by simply chaining the ordered normalization processes directly off of a SandPaper instance.

my_sandpaper.strip()

This will apply the strip() rule to the my_sandpaper instance. As it stands, the my_sandpaper instance will strip whitespace from all values (since no filters were given).
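Conceptually, what a strip-style rule does to a single record looks something like the following plain-Python sketch (a hypothetical illustration only, not SandPaper's actual implementation):

```python
# Hypothetical record, as a dict of column name -> value.
record = {"name": "  Jane Doe ", "state": " FL", "zip": "32801"}

# Strip leading and trailing whitespace from every value in the record.
stripped = {column: value.strip() for column, value in record.items()}

print(stripped)
# {'name': 'Jane Doe', 'state': 'FL', 'zip': '32801'}
```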

We can add another rule to my_sandpaper by simply calling it.

my_sandpaper.translate_text({
   r'FL': 'Florida',
   r'NC': 'North Carolina'
}, column_filter=r'state')

This will apply the translate_text() rule to the my_sandpaper instance.

Since the strip() rule has already been applied, stripping of all whitespace will occur before this rule is applied. The translate_text() rule will substitute the regular expression matches FL and NC with the values Florida and North Carolina respectively, but only in columns matching the filter state.
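In terms of plain Python, a translation like this boils down to regex substitution on values whose column name matches the column filter. A rough, hypothetical sketch (not SandPaper's actual code):

```python
import re

# Hypothetical translation table and column filter, mirroring the example above.
translations = {r'FL': 'Florida', r'NC': 'North Carolina'}
column_filter = r'state'

record = {"name": "Jane Doe", "state": "FL"}

for column, value in record.items():
    if not re.search(column_filter, column):
        continue  # column filter must match for the rule to apply
    for pattern, replacement in translations.items():
        if re.search(pattern, value):
            record[column] = re.sub(pattern, replacement, value)

print(record)
# {'name': 'Jane Doe', 'state': 'Florida'}
```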

The current state of the my_sandpaper instance could have also been initialized in one go using the chaining feature that rules provide.

my_sandpaper = SandPaper('my-sandpaper')\
   .strip()\
   .translate_text({
      r'FL': 'Florida',
      r'NC': 'North Carolina'
   }, column_filter=r'state')

Applying SandPaper

In order to run this SandPaper instance you need to call the apply() method with an input file and an output file.

my_sandpaper.apply('/path/to/input_file.csv', '/path/to/output_file.csv')

Important

If applying to .csv files, unnecessary quotations are implicitly removed as part of the reading and saving processes. Currently there is no way of disabling this… sorry 😞.

Typically when dealing with Excel-type files you will run into cases where the data that needs to be normalized doesn't start on the first row or even the first column. In this instance you can specify two options, start_row and start_column, in the apply() method to indicate where the reading of the file should start.

Note

Both start_row and start_column are 0-indexed. Therefore, if the data starts on row 2 in Excel you need to specify start_row=1 in the apply() method.
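In terms of plain Python lists, these 0-indexed offsets behave like a slice of the sheet's rows and columns. A hypothetical illustration (not SandPaper's code):

```python
# Hypothetical sheet where the real data starts at Excel row 2, column B.
sheet = [
    ["junk", "junk",      "junk"],
    ["junk", "my_column", "other"],
    ["junk", "1",         "2"],
]

start_row, start_column = 1, 1  # both 0-indexed

# Reading begins at (start_row, start_column), skipping everything before it.
data = [row[start_column:] for row in sheet[start_row:]]

print(data)
# [['my_column', 'other'], ['1', '2']]
```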

By default SandPaper will read data from the first available sheet (only for Excel-type files). To specify the sheet that the normalization should run on, pass the keyword argument sheet_name to the apply() method.

my_sandpaper.apply(
   '/path/to/input_file.xlsx',
   '/path/to/output_file.csv',
   sheet_name='Actual Data Sheet'
)

Rule Filters

An important thing to note about rules is that every value must first pass several optional filters before the rule is applied to that value.

column_filter : regex
A regular expression filter applied to the column name of the value (must have a match to pass)
value_filter : regex
A regular expression filter applied to the value itself (must have a match to pass)
callable_filter : callable
A callable reference that is executed for each value (must evaluate to true to pass)

Note

This callable should expect to receive the parameters record and column, in that order, as well as any specified rule kwargs. The callable should return a boolean value which is True if the rule should be applied, otherwise False.

These filters are processed in the order presented and are completely optional. If no filters are specified, then the rule is applied.
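Taken together, the filter chain can be pictured as a series of short-circuiting checks. The following is a hypothetical sketch of that ordering (not SandPaper's actual implementation):

```python
import re

def passes_filters(record, column, column_filter=None, value_filter=None,
                   callable_filter=None, **kwargs):
    """Hypothetical illustration of the ordered, optional filter chain."""
    value = record[column]
    # 1. column filter: regex must match the column name
    if column_filter and not re.search(column_filter, column):
        return False
    # 2. value filter: regex must match the value itself
    if value_filter and not re.search(value_filter, str(value)):
        return False
    # 3. callable filter: must evaluate to true
    if callable_filter and not callable_filter(record, column, **kwargs):
        return False
    # no filters specified (or all passed): the rule applies
    return True

record = {"state": "NC", "name": "Jane"}
print(passes_filters(record, "state", column_filter=r"state"))  # True
print(passes_filters(record, "name", column_filter=r"state"))   # False
print(passes_filters(record, "state"))                          # True (no filters)
```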

Saving SandPapers

It is possible to export a SandPaper instance using the __json__() method. This exports the configuration of the instance to a dictionary which is suitable for json serialization.

serialized = my_sandpaper.__json__()

This exported format can be used to bootstrap a new SandPaper instance by providing the serialization to the from_json() method.

new_sandpaper = SandPaper.from_json(serialized)

Important

The json serialization does not store any information about callables. A UserWarning is raised during serialization if a callable is found.

def _filter_handler(record, column, **kwargs):
   return record[column].lower().startswith('north')

paper = SandPaper().translate_text({
   r'FL': 'Florida',
   r'NC': 'North Carolina'
}, callable_filter=_filter_handler)

paper.__json__()
# raises: UserWarning because of _filter_handler

Limitations

Several limitations to the effectiveness of the reading and writing of normalized data still exist within this module. These are described in the subsections below…

One Sheet

SandPaper instances are really only meant to be applied to one sheet at a time. Of course, this only applies to those table storage types that implement sheets (.xlsx, .xls, etc…).

It is up to the user to create SandPaper instances as necessary for each sheet that requires cleaning. This module is intended to clean and return parsable normalized data, not do Excel's job for it.

Reading as Records

In order to provide all of the lovely filtering (Rule Filters) that makes specifying advanced normalization rules much easier, SandPaper reads rows of table-type data in as records (collections.OrderedDict). This allows us to tie row entries to column names easily, but unfortunately limits the format of data that can be properly read in. The main limitation is that table sheets with duplicate column names cannot be read properly.

Because pyexcel reads records as OrderedDict, only the last column with a duplicate name is considered.

For example, the following table data…

my_column | my_column
----------|----------
1         | 2
3         | 4

will only output the last my_column column (with values 2 and 4) in the resulting sanded data. This is because the reading of the record first reads the first column and then overwrites it with the second column.
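The overwrite behavior can be seen with plain Python dictionaries, which is essentially what happens when each row is keyed by its column names:

```python
from collections import OrderedDict

# Header row with duplicate column names, as in the example table above.
headers = ["my_column", "my_column"]
row = [1, 2]

# Building a record keyed by column name: the duplicate key is overwritten,
# so only the last column's value survives.
record = OrderedDict(zip(headers, row))

print(dict(record))
# {'my_column': 2}
```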

A fix for this issue is possible, but it would require a lot of refactoring and additional testing which (obviously) has not been done.

Translating Dates

The translate_date() rule is quite nifty, but also has a couple of limitations that need to be considered. Because dates are not a base type, this rule can sometimes incorrectly interpret strings as dates and apply a date translation where it is not needed. For this reason, it is recommended to specify at least a column_filter for every use of the rule. A value_filter would also help, but adds a lot of extra complexity that is most likely not required.