Getting Started¶
This module provides basic table-type data normalization that I have personally needed in many different projects over the past couple of years.
It allows the normalization of table data readable by the popular pyexcel library (mainly .xlsx, .xls, and .csv files).
It uses the concept of rules, which can be chained to form a cleaning process for these files. This module has a lot of room for improvement, but it gets the job done that I need it to. Hopefully I'll continue to contribute rules, features, and cleaner functionality as I need it.
Installation¶
Currently SandPaper is on PyPI and can easily be installed through pip!
pip install sandpaper
Usage¶
Using SandPaper is fairly simple and straightforward.
First things first, in order to normalize any data you have to create an instance of the SandPaper object to group together your normalization rules.
from sandpaper import SandPaper
# for an explicitly named sandpaper instance
my_sandpaper = SandPaper('my-sandpaper')
# for an implicitly named sandpaper instance
my_sandpaper = SandPaper()
Chaining Rules¶
Now that you have a SandPaper instance, you can start chaining in rules that should be applied, in order, to normalize the data.
Tip
For a full list of available rules, check out the list of rules here.
Rules can be applied by simply chaining the ordered normalization processes directly off of a SandPaper instance.
my_sandpaper.strip()
This will apply the strip() rule to the my_sandpaper instance.
As it stands, the my_sandpaper instance will strip all whitespace from all values (since no filters were given).
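Conceptually, an unfiltered strip rule just applies Python's str.strip to every value in every row. The sketch below illustrates the idea (it is not SandPaper's actual implementation):

```python
# Minimal sketch of what an unfiltered strip rule does conceptually;
# SandPaper's real implementation handles more than this.
def strip_record(record):
    """Strip surrounding whitespace from every value in a row record."""
    return {column: value.strip() for column, value in record.items()}

row = {'name': '  Ada  ', 'state': ' FL '}
print(strip_record(row))
# {'name': 'Ada', 'state': 'FL'}
```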
We can add another rule to my_sandpaper by simply calling it.
my_sandpaper.translate_text({
    r'FL': 'Florida',
    r'NC': 'North Carolina'
}, column_filter=r'state')
This will apply the translate_text() rule to the my_sandpaper instance.
Since the strip() rule has already been applied, stripping of all whitespace will occur before this rule is applied.
The translate_text() rule will substitute the regular expression matches FL and NC with the values Florida and North Carolina respectively, but only in the column matching the filter state.
The current state of the my_sandpaper instance could also have been built in one go using the chaining feature that rules provide.
my_sandpaper = SandPaper('my-sandpaper')\
    .strip()\
    .translate_text({
        r'FL': 'Florida',
        r'NC': 'North Carolina'
    }, column_filter=r'state')
Applying SandPaper¶
In order to run this SandPaper instance, you need to call the apply() method with an input and an output file path.
my_sandpaper.apply('/path/to/input_file.csv', '/path/to/output_file.csv')
Important
If applying to .csv files, unnecessary quotations are implicitly removed as part of the reading and saving processes.
Currently there is no way of disabling this… sorry 😞.
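The effect described above can be reproduced with the standard library's csv module: quoted fields are parsed on read, then re-written with minimal quoting, so quotation marks that are not strictly required disappear on the round trip.

```python
import csv
import io

# Quoted fields are parsed on read, then re-written with QUOTE_MINIMAL,
# so unnecessary quotation marks disappear on the round trip.
source = '"name","state"\r\n"Ada","FL"\r\n'
rows = list(csv.reader(io.StringIO(source)))

output = io.StringIO()
csv.writer(output, quoting=csv.QUOTE_MINIMAL).writerows(rows)

print(output.getvalue())  # 'name,state\r\nAda,FL\r\n'
```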
Typically, when dealing with Excel-type files, you will run into issues where the data that needs to be normalized isn't in the first row or even the first column.
In this case you can specify two options, start_row and start_column, in the apply() method to control where the reading of the file should start.
Note
Both start_row and start_column are 0-indexed. Therefore, if the data starts in row 2 in Excel, you need to specify start_row=1 in the apply() method.
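The 0-indexed offsets behave like Python list slicing. The following sketch mirrors that behaviour (it is not SandPaper's internal code):

```python
# Sketch of how 0-indexed start_row / start_column offsets select data;
# this mirrors Python slicing, not SandPaper's internals.
sheet = [
    ['junk', 'junk', 'junk'],
    ['junk', 'name', 'state'],  # real data starts at row 1, column 1
    ['junk', 'Ada',  'FL'],
]

start_row, start_column = 1, 1
data = [row[start_column:] for row in sheet[start_row:]]
print(data)  # [['name', 'state'], ['Ada', 'FL']]
```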
By default, SandPaper will read data from the first available sheet (only for Excel-type files).
In order to specify the sheet that you want the normalization to run on, pass the keyword argument sheet_name to the apply() method.
my_sandpaper.apply(
    '/path/to/input_file.xlsx',
    '/path/to/output_file.csv',
    sheet_name='Actual Data Sheet'
)
Rule Filters¶
An important thing to note about rules is that every value must first pass several optional filters before the rule is applied to that value.
column_filter (regex)
    A regular expression filter applied to the column name of the value (must have a match to pass)
value_filter (regex)
    A regular expression filter applied to the value (must have a match to pass)
callable_filter (callable)
    A callable reference that is executed for each value (must evaluate to true to pass)
Note
This callable should expect to receive the parameters record and column, in that order, as well as any specified rule kwargs. The callable should return a boolean value which is True if the rule should be applied, otherwise False.
These filters are processed in the order presented and are completely optional. If no filters are specified, then the rule is applied.
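The filter chain described above can be sketched as follows. This follows the documented semantics (ordered checks, pass-through when no filter is given), but it is an illustration, not SandPaper's actual implementation:

```python
import re

# Sketch of the documented filter chain; not SandPaper's actual code.
def passes_filters(record, column, column_filter=None,
                   value_filter=None, callable_filter=None, **kwargs):
    if column_filter is not None and not re.search(column_filter, column):
        return False
    if value_filter is not None and not re.search(value_filter, str(record[column])):
        return False
    if callable_filter is not None and not callable_filter(record, column, **kwargs):
        return False
    return True  # no filters given, or all filters passed

record = {'state': 'NC', 'name': 'Ada'}
print(passes_filters(record, 'state', column_filter=r'state'))  # True
print(passes_filters(record, 'name', column_filter=r'state'))   # False
```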
Saving SandPapers¶
It is possible to export a SandPaper instance using the __json__() method.
This exports the configuration of the instance to a dictionary which is suitable for JSON serialization.
serialized = my_sandpaper.__json__()
This exported format can be used to bootstrap a new SandPaper instance by providing the serialization to the from_json() method.
new_sandpaper = SandPaper.from_json(serialized)
Important
The JSON serialization does not store any information about callables. A UserWarning is raised during serialization if a callable is found.
def _filter_handler(record, column, **kwargs):
    return record[column].lower().startswith('north')

paper = SandPaper().translate_text({
    r'FL': 'Florida',
    r'NC': 'North Carolina'
}, callable_filter=_filter_handler)

paper.__json__()
# raises: UserWarning because of _filter_handler
Limitations¶
Several limitations still exist in how this module reads and writes normalized data. These are described in the subsections below…
One Sheet¶
SandPaper instances are really only meant to be applied to one sheet at a time.
Of course, this only applies to those table storage types that implement sheets (.xlsx, .xls, etc.).
It is up to the user to create SandPaper instances as necessary for each sheet that requires cleaning. This module is intended to clean and return parsable normalized data, not do Excel's job for it.
Reading as Records¶
In order to provide all of the lovely filtering (Rule Filters) that makes specifying advanced normalization rules much easier, SandPaper reads rows of table-type data in as records (collections.OrderedDict).
This allows us to tie row entries to column names easily, but unfortunately places limitations on the format of data that can be properly read in.
The main limitation is that table sheets with duplicate column names cannot be read properly.
Because pyexcel reads records as OrderedDict, the last column with a duplicate name is the only column considered.
For example, the following table data…

my_column | my_column
--------- | ---------
1         | 2
3         | 4
will only output the last my_column column (with values 2 and 4) in the resulting sanded data.
This is because the reading of the record first reads the first column and then overwrites it with the second column.
A fix for this issue is possible, but it would require a lot of refactoring and additional testing which (obviously) has not been done.
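The same overwriting behaviour can be demonstrated with the standard library alone: once rows are keyed by column name in a dict, the last duplicate column silently wins.

```python
import csv
import io

# Duplicate column names: keying rows by column name means the last
# duplicate column silently overwrites the earlier ones.
source = 'my_column,my_column\n1,2\n3,4\n'
rows = [dict(row) for row in csv.DictReader(io.StringIO(source))]
print(rows)  # [{'my_column': '2'}, {'my_column': '4'}]
```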
Translating Dates¶
The translate_date() rule is quite nifty, but it also has a couple of limitations that need to be considered.
Because dates are not a base type, the mentioned rule can sometimes incorrectly interpret strings as dates and apply a date translation where it is not needed.
For this reason, it is recommended to also specify at least a column_filter for all instances of the rule.
A value_filter would also help, but it adds a lot of extra complexity that is most likely not required.
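To illustrate why a column_filter helps, the snippet below restricts a hypothetical date rule to columns whose names look date-related, so ambiguous strings in other columns are left alone. The filter pattern and column names are illustrative assumptions, not part of SandPaper.

```python
import re

# Illustrative only: restrict a date rule to date-looking column names.
# The pattern and column names here are assumptions for the example.
column_filter = r'(date|_at)$'

columns = ['created_date', 'updated_at', 'invoice_number']
date_columns = [name for name in columns if re.search(column_filter, name)]
print(date_columns)  # ['created_date', 'updated_at']
```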