sandpaper
This is the base sandpaper package that gets imported.
sandpaper.sandpaper module
-
sandpaper.sandpaper.value_rule(func)
A meta wrapper for value normalization rules.
Note
Value rules take in a full record and a column name as implicit parameters. They are expected to return the value at record[column] that has been normalized by the rule.
Parameters: func (callable) – The normalization rule
Returns: The wrapped normalization rule
Return type: callable
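The contract in the note above can be sketched without the library. This is a minimal illustration, not the package's implementation: lower_rule is a hypothetical plain function that is not wrapped with value_rule, and returning non-text values unchanged is an assumption.

```python
from collections import OrderedDict


def lower_rule(record, column, **kwargs):
    """Hypothetical value rule: return record[column] lowercased."""
    value = record[column]
    # Assumption: values that are not text are returned unchanged.
    if isinstance(value, str):
        return value.lower()
    return value


record = OrderedDict([("name", "ALICE"), ("age", 30)])
print(lower_rule(record, "name"))  # alice
print(lower_rule(record, "age"))   # 30
```

The function receives the whole record plus a column name and returns only the normalized value for that column, which is the shape the note describes.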
-
sandpaper.sandpaper.record_rule(func)
A meta wrapper for table normalization rules.
Note
Record rules are applied after all value rules have been applied to a record. They take in a full record as an implicit parameter and are expected to return the normalized record back.
Parameters: func (callable) – The normalization rule
Returns: The wrapped normalization rule
Return type: callable
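As a rough sketch of the record rule contract described above, the following hypothetical function takes a full record and returns a normalized record. The name drop_empty_columns and its behavior are invented for illustration, and the function is not wrapped with record_rule.

```python
from collections import OrderedDict


def drop_empty_columns(record, **kwargs):
    """Hypothetical record rule: return the record without empty-string values."""
    return OrderedDict(
        (column, value) for column, value in record.items() if value != ""
    )


record = OrderedDict([("id", 1), ("name", "alice"), ("notes", "")])
cleaned = drop_empty_columns(record)
print(list(cleaned.keys()))  # ['id', 'name']
```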
-
class sandpaper.sandpaper.SandPaper(name=None)
Bases: object
The SandPaper object.
Allows chained data normalization across multiple different table type data files such as .csv, .xls, and .xlsx.
-
name
The descriptive name of the SandPaper instance.
Note
If no name has been given, a continually updating uid hash of the active rules is used instead.
Getter: Returns the given or suitable name for a SandPaper instance
Setter: Sets the descriptive name of the SandPaper instance
Return type: str
-
uid
A continually updating hash of the active rules.
A hexadecimal digest string.
Getter: Returns a continually updating hash of the active rules
Return type: str
-
rules
The list of applicable rules for the SandPaper instance.
Getter: Returns the list of applicable rules for the instance
Return type: list[tuple(callable, tuple(…,…), dict[str,…])]
-
value_rules
The set of value rules for the SandPaper instance.
Getter: Returns the set of value rules for the SandPaper instance
Return type: set(callable)
-
record_rules
The set of record rules for the SandPaper instance.
Getter: Returns the set of record rules for the SandPaper instance
Return type: set(callable)
-
lower(record, column, **kwargs)
A basic lowercase rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value lowercased
-
upper(record, column, **kwargs)
A basic uppercase rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value uppercased
-
capitalize(record, column, **kwargs)
A basic capitalization rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value capitalized
-
title(record, column, **kwargs)
A basic titlecase rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value titlecased
-
lstrip(record, column, content=None, **kwargs)
A basic lstrip rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with left content stripped
-
rstrip(record, column, content=None, **kwargs)
A basic rstrip rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with right content stripped
-
strip(record, column, content=None, **kwargs)
A basic strip rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with all content stripped
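The content parameter shared by lstrip, rstrip, and strip can be illustrated with Python's built-in string methods. This is only a sketch under assumptions: strip_value is a hypothetical stand-in, it assumes the rule behaves like str.strip (which treats content as a set of characters), and it assumes non-text values pass through unchanged.

```python
from collections import OrderedDict


def strip_value(record, column, content=None, **kwargs):
    """Hypothetical stand-in for a strip-style value rule."""
    value = record[column]
    # Assumption: values that are not text are returned unchanged.
    if not isinstance(value, str):
        return value
    # str.strip(None) removes surrounding whitespace; otherwise it removes
    # any of the characters in `content` from both ends.
    return value.strip(content)


record = OrderedDict([("code", "--abc--"), ("name", "  bob  "), ("count", 3)])
print(strip_value(record, "name"))               # bob
print(strip_value(record, "code", content="-"))  # abc
print(strip_value(record, "count"))              # 3
```

Swapping value.strip for value.lstrip or value.rstrip gives the left-only and right-only variants.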
-
increment(record, column, amount=1, **kwargs)
A basic increment rule for a given value.
Only applies to numeric (int, float) type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- amount (int or float) – The amount to increment by
- kwargs (dict) – Any named arguments
Returns: The value incremented by amount
-
decrement(record, column, amount=1, **kwargs)
A basic decrement rule for a given value.
Only applies to numeric (int, float) type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- amount (int or float) – The amount to decrement by
- kwargs (dict) – Any named arguments
Returns: The value decremented by amount
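A small sketch of the numeric-only behavior described for increment and decrement follows. The functions increment_value and decrement_value are hypothetical stand-ins; leaving non-numeric values untouched and excluding booleans are assumptions, not documented behavior.

```python
from collections import OrderedDict


def increment_value(record, column, amount=1, **kwargs):
    """Hypothetical stand-in for the increment rule."""
    value = record[column]
    # Assumption: only int and float values change; bool is excluded even
    # though it is a subclass of int, and other types pass through unchanged.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value + amount
    return value


def decrement_value(record, column, amount=1, **kwargs):
    """Hypothetical stand-in for the decrement rule."""
    return increment_value(record, column, amount=-amount)


record = OrderedDict([("count", 3), ("price", 2.5), ("label", "n/a")])
print(increment_value(record, "count"))              # 4
print(decrement_value(record, "price", amount=0.5))  # 2.0
print(increment_value(record, "label"))              # n/a
```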
-
replace(record, column, replacements, **kwargs)
Applies a replacements dictionary to a value.
Take for example the following SandPaper instance:
    s = SandPaper('my-sandpaper').replace({
        'this_is_going_to_be_replaced': 'with_this',
    })
Returns: The value with all replacements made
-
translate_text(record, column, translations, **kwargs)
A text translation rule for a given value.
Take for example the following SandPaper instance:
    s = SandPaper('my-sandpaper').translate_text({
        r'^group(?P<group_id>\d+)\s*(.*)$': '{group_id}'
    }, column_filter=r'^group_definition$')
This will translate all instances of the value group<GROUP NUMBER> to <GROUP NUMBER> only in columns named group_definition.
Important
Note that matched groups and matched groupdicts are passed as *args and **kwargs to the format method of the returned to_format string.
Returns: The potentially translated value
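The way matched groups feed the format string can be sketched with the standard re module. This is an illustration under assumptions, not the library's code: translate_text_value is a hypothetical name, the use of re.match and "first matching pattern wins" are assumptions, and column_filter is not modeled.

```python
import re
from collections import OrderedDict


def translate_text_value(record, column, translations, **kwargs):
    """Hypothetical stand-in for the translate_text rule."""
    value = record[column]
    # Assumption: values that are not text are returned unchanged.
    if not isinstance(value, str):
        return value
    for pattern, to_format in translations.items():
        match = re.match(pattern, value)
        if match is not None:
            # Positional groups go in as *args, named groups as **kwargs.
            return to_format.format(*match.groups(), **match.groupdict())
    # No pattern matched, so the value is left as it was.
    return value


translations = {r"^group(?P<group_id>\d+)\s*(.*)$": "{group_id}"}
record = OrderedDict([("group_definition", "group47 control"), ("other", "n/a")])
print(translate_text_value(record, "group_definition", translations))  # 47
print(translate_text_value(record, "other", translations))             # n/a
```

Because every group is also passed positionally, a format string such as '{0}' would produce the same result for this pattern as '{group_id}'.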
-
translate_date(record, column, translations, **kwargs)
A date translation rule for a given value.
Take for example the following SandPaper instance:
    s = SandPaper('my-sandpaper').translate_date({
        '%Y-%m-%d': '%Y',
        '%Y': '%Y',
        '%Y-%m': '%Y'
    }, column_filter=r'^(.*)_date$')
This will translate all instances of a date value matching the given date formats in columns ending with _date to the date format %Y.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- translations (dict[str, str]) – A dictionary of translations from arrow-based date formats to a different format
- kwargs (dict) – Any named arguments
Returns: The potentially translated value
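The idea of trying each source format in turn can be sketched with the standard library. Note the assumptions: this uses datetime.strptime rather than arrow, translate_date_value is a hypothetical name, formats are tried in dictionary order, and column_filter is not modeled.

```python
from collections import OrderedDict
from datetime import datetime


def translate_date_value(record, column, translations, **kwargs):
    """Hypothetical stand-in for the translate_date rule (stdlib only)."""
    value = record[column]
    # Assumption: values that are not text are returned unchanged.
    if not isinstance(value, str):
        return value
    for from_format, to_format in translations.items():
        try:
            parsed = datetime.strptime(value, from_format)
        except ValueError:
            # The value does not fit this format, so try the next one.
            continue
        return parsed.strftime(to_format)
    # No format matched, so the value is left as it was.
    return value


translations = {"%Y-%m-%d": "%Y", "%Y": "%Y", "%Y-%m": "%Y"}
record = OrderedDict(
    [("start_date", "2017-03-21"), ("end_date", "2017-03"), ("note", "unknown")]
)
print(translate_date_value(record, "start_date", translations))  # 2017
print(translate_date_value(record, "end_date", translations))    # 2017
print(translate_date_value(record, "note", translations))        # unknown
```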
-
add_columns(record, additions, **kwargs)
Adds columns to a record.
Note
If the value of an entry in additions is a callable, then the callable should expect the record as the only parameter and should return the value that should be placed in the newly added column.
If the value of an entry in additions is a string, the record is passed in as kwargs to the value’s format method.
Otherwise, the value of an entry in additions is simply used as the newly added column’s value.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- additions (dict[str,...]) – A dictionary of column names to callables, strings, or other values
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly added column
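The three cases in the note above (callable, string, anything else) can be sketched as follows. The function add_columns_to is a hypothetical stand-in; working on a copy of the record and letting later additions see earlier ones are assumptions.

```python
from collections import OrderedDict


def add_columns_to(record, additions, **kwargs):
    """Hypothetical stand-in for the add_columns rule."""
    # Assumption: work on a copy so the input record is not mutated.
    result = OrderedDict(record)
    for column, addition in additions.items():
        if callable(addition):
            # Callables receive the record as their only parameter.
            result[column] = addition(result)
        elif isinstance(addition, str):
            # Strings are formatted with the record passed in as kwargs.
            result[column] = addition.format(**result)
        else:
            # Any other value is used as the column value directly.
            result[column] = addition
    return result


record = OrderedDict([("first", "Ada"), ("last", "Lovelace")])
added = add_columns_to(
    record,
    {
        "full": "{first} {last}",
        "initials": lambda r: r["first"][0] + r["last"][0],
        "source": 1,
    },
)
print(added["full"])      # Ada Lovelace
print(added["initials"])  # AL
print(added["source"])    # 1
```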
-
remove_columns(record, removes, **kwargs)
Removes columns from a record.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- removes (list[str]) – A list of columns to remove
- kwargs (dict) – Any named arguments
Returns: The record with the listed columns removed
-
keep_columns(record, keeps, **kwargs)
Removes all other columns from a record.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- keeps (list[str]) – A list of columns to keep
- kwargs (dict) – Any named arguments
Returns: The record with only the listed columns kept
-
rename_columns(record, renames, **kwargs)
Maps an existing column to a new column.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- renames (dict[str, str]) – A dictionary of column to column renames
- kwargs (dict) – Any named arguments
Returns: The record with the remapped column
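The column-level record rules above (remove, keep, rename) can be sketched on a plain OrderedDict. All three function names are hypothetical stand-ins, and preserving the original column order (including the position of a renamed column) is an assumption.

```python
from collections import OrderedDict


def remove_columns_from(record, removes, **kwargs):
    """Hypothetical stand-in: drop every column listed in `removes`."""
    return OrderedDict((c, v) for c, v in record.items() if c not in removes)


def keep_columns_in(record, keeps, **kwargs):
    """Hypothetical stand-in: drop every column not listed in `keeps`."""
    return OrderedDict((c, v) for c, v in record.items() if c in keeps)


def rename_columns_in(record, renames, **kwargs):
    """Hypothetical stand-in: rename columns, keeping their positions."""
    return OrderedDict((renames.get(c, c), v) for c, v in record.items())


record = OrderedDict([("id", 1), ("fname", "Ada"), ("tmp", None)])
print(list(remove_columns_from(record, ["tmp"])))              # ['id', 'fname']
print(list(keep_columns_in(record, ["id"])))                   # ['id']
print(list(rename_columns_in(record, {"fname": "first_name"})))  # ['id', 'first_name', 'tmp']
```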
-
order_columns(record, order, ignore_missing=False, **kwargs)
Orders columns in a specific order.
Parameters:
- record (collections.OrderedDict) – A record that should be ordered
- order (list[str]) – The order that columns need to be in
- ignore_missing (bool) – Boolean which indicates if missing columns from order should be ignored
- kwargs (dict) – Any named arguments
Returns: The record with the columns reordered
-
apply(from_file, to_file, sheet_name=None, row_filter=None, monitor_rules=False, **kwargs)
Applies a SandPaper instance's rules to a given glob of files.
Parameters:
- from_file (str) – The path of the file to apply the rules to
- to_file (str) – The path of the file to write to
- sheet_name (str) – The name of the sheet to apply rules to (defaults to the first available sheet)
- row_filter (callable) – A callable which accepts a cleaned record and returns True if the record should be written out
- monitor_rules (bool) – Boolean flag that indicates if the count of applied rules should be monitored
- kwargs (dict) – Any additional named arguments (applied to the pyexcel iget_records method)
Returns: The rule statistics if monitor_rules is true
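The flow that apply describes can be approximated in memory: value rules run first, record rules run afterwards, and row_filter decides whether the cleaned record is written out. The sketch below is an assumption-heavy illustration; apply_rules, strip_text, and drop_notes are hypothetical names, and file reading, writing, sheet selection, and rule monitoring are left out.

```python
from collections import OrderedDict


def apply_rules(records, value_rules, record_rules, row_filter=None):
    """Yield cleaned records; file reading and writing are not modeled."""
    for record in records:
        cleaned = OrderedDict(record)
        # Value rules run first, once per column.
        for rule in value_rules:
            for column in list(cleaned.keys()):
                cleaned[column] = rule(cleaned, column)
        # Record rules run after all value rules have been applied.
        for rule in record_rules:
            cleaned = rule(cleaned)
        # row_filter receives the cleaned record and returns True to keep it.
        if row_filter is None or row_filter(cleaned):
            yield cleaned


def strip_text(record, column, **kwargs):
    """Hypothetical value rule: strip whitespace from text values."""
    value = record[column]
    return value.strip() if isinstance(value, str) else value


def drop_notes(record, **kwargs):
    """Hypothetical record rule: remove the 'notes' column."""
    return OrderedDict((c, v) for c, v in record.items() if c != "notes")


rows = [
    OrderedDict([("id", 1), ("name", "  alice "), ("notes", "x")]),
    OrderedDict([("id", 2), ("name", "   "), ("notes", "y")]),
]
result = list(
    apply_rules(rows, [strip_text], [drop_notes], row_filter=lambda r: r["name"] != "")
)
print(result)  # [OrderedDict([('id', 1), ('name', 'alice')])]
```

The second row is dropped because the filter sees the name after stripping, which is the point of filtering on the cleaned record rather than the raw one.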
-
classmethod from_json(serialization)
Loads a SandPaper instance from a JSON serialization.
Note
Raises a UserWarning when the loaded instance does not match the serialized instance’s uid.
Parameters: serialization (dict) – The read JSON serialization
Returns: A new SandPaper instance
Return type: SandPaper