SandPaper¶
Documentation¶
Getting Started¶
This module provides basic table-type data normalization that I have personally needed in many different projects over the past couple years.
It allows the normalization of table data readable by the popular pyexcel library (mainly .xlsx
, .xls
, and .csv
files).
It uses the concept of rules which can be chained to form a cleaning processes for these files. This module has a lot of room for improvement, but it gets the job done that I needed it to. Hopefully I’ll continue to contribute rules, features, and more clean functionality as I need it.
Installation¶
Currently SandPaper is on PyPi and can easily be installed through pip!
pip install sandpaper
Usage¶
Using SandPaper is fairly simple and straightforward.
First things first, in order to normalize any data you have to create an instance of the SandPaper
object to group together your normalization rules.
from sandpaper import SandPaper
# for an explicitly named sandpaper instance
my_sandpaper = SandPaper('my-sandpaper')
# for an implicitly named sandpaper instance
my_sandpaper = SandPaper()
Chaining Rules¶
Now that you have a SandPaper
instance, you can start chaining in rules that should be applied in order to normalize the data.
Tip
For a full list of available rules, check out the list of rules here.
Rule can be applied by simply chaining the ordered normalization processes directly off of a SandPaper
isntance.
my_sandpaper.strip()
This will apply the strip()
rule to the my_sandpaper
instance.
The way it is now, the my_sandpaper
instance will strip all whitespace from all values (since no filters were given).
We can add another rule to my_sandpaper
by simply calling it.
my_sandpaper.translate_text({
r'FL': 'Florida',
r'NC': 'North Carolina'
}, column_filter=r'state')
This will apply the translate_text()
rule to the my_sandpaper
instance.
Since the strip()
rule has already been applied, stripping of all whitespace will occur before this rule is applied.
The translate_text()
rule will substitute the regular expression matches FL
and NC
with the values Florida
and North Carolina
respectively only in the column matching the filter state
.
The current state of the my_sandpaper
instance could have also been initialized in one go using the chaining feature that rules provide.
my_sandpaper = SandPaper('my-sandpaper')\
.strip()\
.translate_text({
r'FL': 'Florida',
r'NC': 'North Carolina'
}, column_filter=r'state')
Applying SandPaper¶
In order to run this SandPaper
instance you need to call the apply()
method to a file.
my_sandpaper.apply('/path/to/input_file.csv', '/path/to/output_file.csv')
Important
If applying to .csv
files, unnecessary quotations are implicitly removed as part of the reading and saving processes.
Currently there is no way of disabling this… sorry 😞.
Rule Filters¶
An important thing to note about rules is that every value has to first pass several optional filters if the rule is to be applied to that value.
column_filter
: regex- A regular expression filter applied to the column name of the value (must have a match to pass)
value_filter
: regex- A regular expression filter applied to the value (must have a match to pass)
callable_filter
: callableA callable reference that is executed for each value (must evaluate to true to pass)
Note
This callable should expect to receive the parameters
record
,column
in that order, as well as any specified rule kwargs. The callable should return a boolean value which is True if the rule should be applied, otherwise False.
These filters are processed in the order presented and are completely optional. If no filters are specified, then the rule is applied.
Saving SandPapers¶
It is possible to export a SandPaper
instance using the __json__()
function.
This exports the configuration of the intance to a dictionary which is suitable for json serialization.
serialized = my_sandpaper.__json__()
This exported format can be used to bootstrap a new SandPaper
instance by providing the serialization to the from_json()
method.
new_sandpaper = SandPaper.from_json(serialized)
Important
The json serialization does not store any information about callables.
A UserWarning
is raised during serialization if a callable is found.
def _filter_handler(record, column, **kwargs):
return record[column].lower().startswith('north')
paper = SandPaper().translate_text({
r'FL': 'Florida',
r'NC': 'North Carolina'
}, callable_filter=_filter_handler)
paper.__json__()
# raises: UserWarning because of _filter_handler
Limitations¶
Several limitations to the effectiveness of the reading and writing of normalized data still exist within this module. These are described in the subsections below…
One Sheet¶
SandPaper instance’s are really only meant to be applied to one sheet at a time.
Of course, this only applies to those table storage types that implement sheets (.xlsx
, .xls
, etc…).
It is up to the user to create SandPaper instances as necessary for each sheet that requires cleaning. This module is intended to clean and return parsable normalized data, not do Excell’s job for it.
Reading as Records¶
In order to provide all of the lovely filtering (Rule Filters) that make specifying advanced normalization rules much easier, SandPaper reads rows of table type data in as records (collections.OrderedDict
).
This allows us to tie row entries to column names easily but unfortunately causes limitations on the format of data that can be properly read in.
The main limitation is that table sheets with duplicate column names cannot be read properly.
Because pyexcel reads records as OrderedDict
, the last column with a duplicate name is the only column considered.
For example the following table data…
my_column | my_column |
---|---|
1 | 2 |
3 | 4 |
will only output the last my_column
column (with values 2 and 4) in the resulting sanded
data.
This is because the reading of the record first reads the first column and then overwrites it with the second column.
A fix for this issue is possible, however would cause a lot of refactoring and additional testing which (obviously) has not been done.
Translating Dates¶
The translate_date()
rule is quite nifty, but also has a couple limitations that need to be considered.
Because dates are not a base type, the mentioned rule can sometimes incorrectly interpret strings as dates and apply a date translation where it is not needed.
For this reason, it is recommended to also specify at least a column_filter
for all instances of the rule.
A value_filter
would also help, but causes a lot of extra complexity that is most likely not required.
Available Rules¶
Below are a list of available rules that can be attached to a SandPaper
instance.
All of these rules first must pass several optional filters discussed in Rule Filters.
In the following examples of these rules the symbol □ represents whitespace.
Value Rules¶
These rules are applied to every value that passes the specified rule filters documented in Rule Filters.
lower()
¶
A basic rule that lowercases the text in a value.
SandPaper().lower()
Input | Output |
---|---|
DATa | data |
upper()
¶
A basic rule that uppercases the text in a value.
SandPaper().upper()
Input | Output |
---|---|
daTa | DATA |
capitalize()
¶
A basic rule that capitalizes the text in a value.
SandPaper().capitalize()
Input | Output |
---|---|
daTa | Data |
title()
¶
A basic rule that titlecases the text in a value.
SandPaper().title()
Input | Output |
---|---|
mY dAta | My Data |
lstrip()
¶
A basic rule that strips all left whitespace from a value.
SandPaper().lstrip()
Input | Output |
---|---|
□□data | data |
rstrip()
¶
A basic rule that strips all right whitespace from a value.
SandPaper().rstrip()
Input | Output |
---|---|
data□□ | data |
strip()
¶
A basic rule that strips all whitespace from a value.
SandPaper().strip()
Input | Output |
---|---|
□data□ | data |
translate_text()
¶
A translation rule that translate regex matches to a specified format.
SandPaper().translate_text({
r'group_(?P<group_id>\d+)$': '{group_id}'
})
Input | Output |
---|---|
group_47 | 47 |
group_123 | 123 |
group_0 | 0 |
translate_date()
¶
A translation rule that translate greedily evaluated dates to a specified datetime format.
Note
This rule is very greedy and can potentailly evaluate dates incorrectly.
It is highly recommended that at the very least a column_filter
is supplied with this rule.
SandPaper().translate_date({
'%Y-%m-%d': '%Y',
'%Y-%m': '%Y',
'%Y': '%Y'
})
Input | Output |
---|---|
2017-01-32 | 2017 |
2017-01 | 2017 |
2017 | 2017 |
Record Rules¶
These rules are applied to every record regardless to most rule filters documented in Rule Filters.
add_columns()
¶
Adds a column to every record.
The given dictionary should be a key value pairing where the key is a new column name and the paired value is either a callable, string, or other low level data type for the newly added column’s value. If the value is a callable it should expect to receive the record as the only parameter and should return the value desired for the newly added column.
import uuid
def gen_uuid(record):
return uuid.uuid4()
SandPaper().add_columns({
'uuid': gen_uuid
})
Before | ||
---|---|---|
id | name | value |
1 | hello | world |
2 | test | table |
After | |||
---|---|---|---|
id | name | value | uuid |
1 | hello | world | a6a76896-c33d-4654-afdf-12aa80dd6238 |
2 | test | table | b1e171c2-fee9-4270-96e9-4853c3a6e130 |
remove_columns()
¶
Removes a column from every record.
SandPaper().remove_columns([
'name'
])
Before | ||
---|---|---|
id | name | value |
1 | hello | world |
2 | test | table |
After | |
---|---|
id | value |
1 | world |
2 | table |
keep_columns()
¶
Removes all other columns for every record.
SandPaper().keep_columns([
'id',
'name'
])
Before | ||
---|---|---|
id | name | value |
1 | hello | world |
2 | test | table |
After | |
---|---|
id | name |
1 | hello |
2 | test |
rename_columns()
¶
Renames a column for every record.
SandPaper().rename_columns([
'old_name': 'new_name'
])
Before | ||
---|---|---|
id | old_name | value |
1 | hello | world |
2 | test | table |
After | ||
---|---|---|
id | new_name | value |
1 | hello | world |
2 | test | table |
order_columns()
¶
Reorders columns from every record.
SandPaper().order_columns([
'value',
'name',
'id'
])
Before | ||
---|---|---|
id | name | value |
1 | hello | world |
2 | test | table |
After | ||
---|---|---|
value | id | name |
world | 1 | hello |
table | 2 | test |
Changelog¶
All notable changes to SandPaper will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
unreleased¶
- WIP
0.0.6 (2017-12-15)¶
- added functionality for a
row_filter
callable specified duringapply
- added
keep_columns
record rule - added built-in Sphinx theme (borrowed from pocoo click)
- fixed
__json__
export warnings with callables
0.0.5 (2017-11-03)¶
- added enforcement for strict date parsing in
translate_date
rule - added
rename_columns
andorder_columns
record rules - fixed the naming of
add_columns
andremove_columns
- fixed the messy structure of all rules (cleaner and more intuitive use)
- fixed documentation to match new rule structure
- fixed all existing tests to match new rule structure
- removed the
substitute
value rule (utilizetranslate_text
instead) - removed extraneous badges from README and documentation index
0.0.4 (2017-10-26)¶
- added a better badge provider for PyPi package status
- added support for a
sheet_filter
applied to both value rules and record rules - added precompilation of filter regexes before application
- added rule application statistics which is now returned from
apply
in a tuple (output_filepath, output_statistics,) - removed callable filters causing exporting and loading errors (just ignoring callable filters for now)
0.0.3 (2017-10-25)¶
- added more badges to documentation and the README
- fixed (hopefully) the building of documentation for readthedocs.io
- fixed README example with an example that acutally makes sense
0.0.2 (2017-10-24)¶
- added even more badges to the README
- added documentation improvements (linking rules to function references)
- added several small improvements to the tests (better code coverage)
Reference¶
sandpaper¶
This is the base sandpaper package that gets imported.
sandpaper.sandpaper module¶
-
sandpaper.sandpaper.
value_rule
(func)[source]¶ A meta wrapper for value normalization rules.
Note
Value rules take in a full record and a column name as implicit parameters. They are expected to return the value at
record[column]
that has be normalized by the rule.Parameters: func (callable) – The normalization rule Returns: The wrapped normalization rule Return type: callable
-
sandpaper.sandpaper.
record_rule
(func)[source]¶ A meta wrapper for table normalization rules.
Note
Record rules are applied after all value rules have been applied to a record. They take in a full record as an implicit parameter and are expected to return the normalized record back.
Parameters: func (callable) – The normalization rule Returns: The wrapped normalization rule Return type: callable
-
class
sandpaper.sandpaper.
SandPaper
(name=None)[source]¶ Bases:
object
The SandPaper object.
- Allows chained data normalization across multiple different table type
- data files such as
.csv
,.xls
, and.xlsx
.
-
name
¶ The descriptive name of the SandPaper instance.
Note
If no name has been given, a continually updating uid hash of the active rules is used instead
Getter: Returns the given or suitable name for a SandPaper instance Setter: Sets the descriptive name of the SandPaper instance Return type: str
-
uid
¶ A continually updating hash of the active rules.
A hexadecimal digest string
Getter: Returns a continually updating hash of the active rules Return type: str
-
rules
¶ This list of applicable rules for the SandPaper instance.
Getter: Returns the list of applicable rules for the instance Return type: list[tuple(callable, tuple(…,…), dict[str,…])]
-
value_rules
¶ The set of value rules for the SandPaper instance.
Getter: Returns the set rules for the SandPaper instance Return type: set(callable)
-
record_rules
¶ The set of record rules for the SandPaper instance.
Getter: Returns the set rules for the SandPaper instance Return type: set(callable)
-
lower
(record, column, **kwargs)[source]¶ A basic lowercase rule for a given value.
Only applies to text type variables
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value lowercased
- record (collections.OrderedDict) – A record whose value within
-
upper
(record, column, **kwargs)[source]¶ A basic uppercase rule for a given value.
Only applies to text type variables
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value uppercased
- record (collections.OrderedDict) – A record whose value within
-
capitalize
(record, column, **kwargs)[source]¶ A basic capitalization rule for a given value.
Only applies to text type variables
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value capatilized
- record (collections.OrderedDict) – A record whose value within
-
title
(record, column, **kwargs)[source]¶ A basic titlecase rule for a given value.
Only applies to text type variables
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value titlecased
- record (collections.OrderedDict) – A record whose value within
-
lstrip
(record, column, content=None, **kwargs)[source]¶ A basic lstrip rule for a given value.
Only applies to text type variables.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with left content stripped
- record (collections.OrderedDict) – A record whose value within
-
rstrip
(record, column, content=None, **kwargs)[source]¶ A basic rstrip rule for a given value.
Only applies to text type variables.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with right content stripped
- record (collections.OrderedDict) – A record whose value within
-
strip
(record, column, content=None, **kwargs)[source]¶ A basic strip rule for a given value.
Only applies to text type variables.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with all content stripped
- record (collections.OrderedDict) – A record whose value within
-
increment
(record, column, amount=1, **kwargs)[source]¶ A basic increment rule for a given value.
Only applies to numeric (int, float) type variables.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- amount (int or float) – The amount to increment by
- kwargs (dict) – Any named arguments
Returns: The value incremented by
amount
- record (collections.OrderedDict) – A record whose value within
-
decrement
(record, column, amount=1, **kwargs)[source]¶ A basic decrement rule for a given value.
Only applies to numeric (int, float) type variables.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- amount (int or float) – The amount to decrement by
- kwargs (dict) – Any named arguments
Returns: The value incremented by
amount
- record (collections.OrderedDict) – A record whose value within
-
replace
(record, column, replacements, **kwargs)[source]¶ Applies a replacements dictionary to a value.
Take for example the following SandPaper instance:
s = SandPaper('my-sandpaper').replace({ 'this_is_going_to_be_replaced': 'with_this', })
Parameters: Returns: The value with all replacements made
-
translate_text
(record, column, translations, **kwargs)[source]¶ A text translation rule for a given value.
Take for example the following SandPaper instance:
s = SandPaper('my-sandpaper').translate_text({ r'^group(?P<group_id>\d+)\s*(.*)$': '{group_id}' }, column_filter=r'^group_definition$')
This will translate all instances of the value
group<GROUP NUMBER>
to<GROUP NUMBER>
only in columns namedgroup_definition
.Important
Note that matched groups and matched groupdicts are passed as
*args
and**kwargs
to the format method of the returnedto_format
string.Parameters: Returns: The potentially translated value
-
translate_date
(record, column, translations, **kwargs)[source]¶ A date translation rule for a given value.
Take for example the following SandPaper instance:
s = SandPaper('my-sandpaper').translate_date({ '%Y-%m-%d': '%Y', '%Y': '%Y', '%Y-%m': '%Y' }, column_filter=r'^(.*)_date$')
This will translate all instances of a date value matching the given date formats in columns ending with
_date
to the date format%Y
.Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - column (str) – A column that indicates what value to normalize
- translations (dict[str, str]) – A dictionary of translations from an arrow based dateformats to a different format
- kwargs (dict) – Any named arguments
Returns: The value potentially translated value
- record (collections.OrderedDict) – A record whose value within
-
add_columns
(record, additions, **kwargs)[source]¶ Adds columns to a record.
Note
If the value of an entry in
additions
is a callable, then the callable should expect therecord
as the only parameter and should return the value that should be placed in the newly added column.If the value of an entry in
additions
is a string, the record is passed in as kwargs to the value’sformat
method.Otherwise, the value of an entry in
additions
is simply used as the newly added column’s value.Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - additions (dict[str,...]) – A dictionary of column names to callables, strings, or other values
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly added column
- record (collections.OrderedDict) – A record whose value within
-
remove_columns
(record, removes, **kwargs)[source]¶ Removes columns from a record.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - removes (list[str]) – A list of columns to remove
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly removed column
- record (collections.OrderedDict) – A record whose value within
-
keep_columns
(record, keeps, **kwargs)[source]¶ Removes all other columns from a record.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - keeps (list[str]) – A list of columns to keep
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly kept column
- record (collections.OrderedDict) – A record whose value within
-
rename_columns
(record, renames, **kwargs)[source]¶ Maps an existing column to a new column.
Parameters: - record (collections.OrderedDict) – A record whose value within
column
should be normalized and returned - renames (dict[str, str]) – A dictionary of column to column renames
- kwargs (dict) – Any named arguments
Returns: The record with the remapped column
- record (collections.OrderedDict) – A record whose value within
-
order_columns
(record, order, ignore_missing=False, **kwargs)[source]¶ Orders columns in a specific order.
Parameters: - record (collections.OrderedDict) – A record who should be ordered
- order (list[str]) – The order that columns need to be in
- ignore_missing (bool) – Boolean which inidicates if missing columns
from
order
should be ignored - kwargs (dict) – Any named arguments
Returns: The record with the columns reordered
-
apply
(from_file, to_file, sheet_name=None, row_filter=None, monitor_rules=False, **kwargs)[source]¶ Applies a SandPaper instance rules to a given glob of files.
Parameters: - from_file (str) – The path of the file to apply the rules to
- to_file (str) – The path of the file to write to
- sheet_name (str) – The name of the sheet to apply rules to (defaults to the first available sheet)
- row_filter (callable) – A callable which accepts a cleaned record and returns True if the record should be written out
- monitor_rules (bool) – Boolean flag that inidicates if the count of applied rules should be monitored
- kwargs (dict) – Any additional named arguments
(applied to the pyexcel
iget_records
method)
Returns: The rule statistics if
monitor_rules
is trueReturn type:
-
classmethod
from_json
(serialization)[source]¶ Loads a SandPaper instance from a json serialization.
Note
Raises a
UserWarning
when the loaded instance does not match the serialized instance’suid
.Parameters: serialization (dict) – The read json serialization Returns: A new SandPaper instance Return type: SandPaper