sandpaper
This is the base sandpaper package that gets imported.
sandpaper.sandpaper module
-
sandpaper.sandpaper.value_rule(func)
A meta wrapper for value normalization rules.
Note
Value rules take in a full record and a column name as implicit parameters. They are expected to return the value at record[column] that has been normalized by the rule.
Parameters: func (callable) – The normalization rule
Returns: The wrapped normalization rule
Return type: callable
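As a rough illustration of the contract described in the note, the sketch below writes a value rule as a plain function. It does not import sandpaper; the function name exclaim and the sample record are hypothetical, and the real wrapper may pass additional arguments.

```python
from collections import OrderedDict


def exclaim(record, column, **kwargs):
    """Hypothetical value rule: return record[column] with a trailing '!'."""
    value = record[column]
    # Leave non-text values untouched, similar to the text-only built-in rules.
    if not isinstance(value, str):
        return value
    return value + "!"


record = OrderedDict([("name", "alice"), ("age", 30)])
print(exclaim(record, "name"))
print(exclaim(record, "age"))
```

The function only returns the normalized value; it does not write it back into the record.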
-
sandpaper.sandpaper.record_rule(func)
A meta wrapper for table normalization rules.
Note
Record rules are applied after all value rules have been applied to a record. They take in a full record as an implicit parameter and are expected to return the normalized record back.
Parameters: func (callable) – The normalization rule
Returns: The wrapped normalization rule
Return type: callable
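A record rule has a similar shape, except that it receives and returns the whole record. The following is a minimal sketch of that contract only; add_full_name and the column names are hypothetical and nothing here calls sandpaper itself.

```python
from collections import OrderedDict


def add_full_name(record, **kwargs):
    """Hypothetical record rule: derive full_name from two existing columns."""
    record["full_name"] = "{first} {last}".format(**record)
    # A record rule hands the (possibly modified) record back to the caller.
    return record


record = OrderedDict([("first", "Ada"), ("last", "Lovelace")])
result = add_full_name(record)
print(list(result.items()))
```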
-
class sandpaper.sandpaper.SandPaper(name=None)
Bases: object
The SandPaper object.
Allows chained data normalization across multiple different table type data files such as .csv, .xls, and .xlsx.
-
name
The descriptive name of the SandPaper instance.
Note
If no name has been given, a continually updating uid hash of the active rules is used instead.
Getter: Returns the given or suitable name for a SandPaper instance
Setter: Sets the descriptive name of the SandPaper instance
Return type: str
-
uid
A continually updating hash of the active rules.
A hexadecimal digest string.
Getter: Returns a continually updating hash of the active rules
Return type: str
-
rules
The list of applicable rules for the SandPaper instance.
Getter: Returns the list of applicable rules for the instance
Return type: list[tuple(callable, tuple(…,…), dict[str,…])]
-
value_rules
The set of value rules for the SandPaper instance.
Getter: Returns the set of value rules for the SandPaper instance
Return type: set(callable)
-
record_rules
The set of record rules for the SandPaper instance.
Getter: Returns the set of record rules for the SandPaper instance
Return type: set(callable)
-
lower(record, column, **kwargs)
A basic lowercase rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value lowercased
-
upper(record, column, **kwargs)
A basic uppercase rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value uppercased
-
capitalize(record, column, **kwargs)
A basic capitalization rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value capitalized
-
title(record, column, **kwargs)
A basic titlecase rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- kwargs (dict) – Any named arguments
Returns: The value titlecased
-
lstrip(record, column, content=None, **kwargs)
A basic lstrip rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with left content stripped
-
rstrip(record, column, content=None, **kwargs)
A basic rstrip rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with right content stripped
-
strip(record, column, content=None, **kwargs)
A basic strip rule for a given value.
Only applies to text type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- content (str) – The content to strip (defaults to whitespace)
- kwargs (dict) – Any named arguments
Returns: The value with all content stripped
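The behaviour of the content parameter shared by lstrip, rstrip, and strip can be sketched with the standard library. This assumes content is handed to str.strip, which treats it as a set of characters rather than a substring; the helper name strip_value is hypothetical and the library may behave differently.

```python
def strip_value(value, content=None):
    """Strip content characters from both ends; None means whitespace."""
    # Non-text values are returned unchanged, matching "only applies to text".
    if not isinstance(value, str):
        return value
    return value.strip(content)


print(repr(strip_value("  padded  ")))
print(repr(strip_value("--dashed--", content="-")))
print(repr(strip_value(42)))
```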
-
increment(record, column, amount=1, **kwargs)
A basic increment rule for a given value.
Only applies to numeric (int, float) type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- amount (int or float) – The amount to increment by
- kwargs (dict) – Any named arguments
Returns: The value incremented by amount
-
decrement(record, column, amount=1, **kwargs)
A basic decrement rule for a given value.
Only applies to numeric (int, float) type variables.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- amount (int or float) – The amount to decrement by
- kwargs (dict) – Any named arguments
Returns: The value decremented by amount
-
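replace(record, column, replacements, **kwargs)
Applies a replacements dictionary to a value.
Take for example the following SandPaper instance:
    s = SandPaper('my-sandpaper').replace({
        'this_is_going_to_be_replaced': 'with_this',
    })
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- replacements (dict) – A dictionary of replacements to apply to the value
- kwargs (dict) – Any named arguments
Returns: The value with all replacements made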
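One plausible reading of the replacements dictionary is that each key is replaced by its value wherever it occurs in the text. The sketch below makes that assumption explicit with str.replace; apply_replacements is a hypothetical helper, not part of sandpaper, and the library may instead match whole values.

```python
def apply_replacements(value, replacements):
    """Replace every occurrence of each key with its mapped value."""
    # Assumption: only text values are affected.
    if not isinstance(value, str):
        return value
    for old, new in replacements.items():
        value = value.replace(old, new)
    return value


replacements = {"this_is_going_to_be_replaced": "with_this"}
print(apply_replacements("this_is_going_to_be_replaced today", replacements))
```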
-
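translate_text(record, column, translations, **kwargs)
A text translation rule for a given value.
Take for example the following SandPaper instance:
    s = SandPaper('my-sandpaper').translate_text({
        r'^group(?P<group_id>\d+)\s*(.*)$': '{group_id}'
    }, column_filter=r'^group_definition$')
This will translate all instances of the value group<GROUP NUMBER> to <GROUP NUMBER> only in columns named group_definition.
Important
Note that matched groups and matched groupdicts are passed as *args and **kwargs to the format method of the returned to_format string.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- translations (dict[str, str]) – A dictionary of regular expression patterns to format strings
- kwargs (dict) – Any named arguments
Returns: The potentially translated value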
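The way matched groups feed the format string can be sketched with the standard re module. This is an approximation under stated assumptions: patterns are tried in dictionary order with re.match, the first match wins, and unmatched values are returned unchanged. The helper name translate_text_value is hypothetical.

```python
import re


def translate_text_value(value, translations):
    """Format the first matching pattern's target with the match's groups."""
    if not isinstance(value, str):
        return value
    for pattern, to_format in translations.items():
        match = re.match(pattern, value)
        if match is not None:
            # Positional groups become *args, named groups become **kwargs.
            return to_format.format(*match.groups(), **match.groupdict())
    return value


translations = {r"^group(?P<group_id>\d+)\s*(.*)$": "{group_id}"}
print(translate_text_value("group47 control", translations))
print(translate_text_value("ungrouped", translations))
```

Because both positional and named groups are passed, a target such as '{0}' or '{group_id}' would refer to the same captured text in this example.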
-
translate_date(record, column, translations, **kwargs)
A date translation rule for a given value.
Take for example the following SandPaper instance:
    s = SandPaper('my-sandpaper').translate_date({
        '%Y-%m-%d': '%Y',
        '%Y': '%Y',
        '%Y-%m': '%Y'
    }, column_filter=r'^(.*)_date$')
This will translate all instances of a date value matching the given date formats in columns ending with _date to the date format %Y.
Parameters:
- record (collections.OrderedDict) – A record whose value within column should be normalized and returned
- column (str) – A column that indicates what value to normalize
- translations (dict[str, str]) – A dictionary of translations from arrow based date formats to a different format
- kwargs (dict) – Any named arguments
Returns: The potentially translated value
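The idea of trying each source format in turn can be sketched with the standard library. Note that this uses datetime.strptime and strftime as a stand-in for the arrow based parsing mentioned above, so it is not the library's implementation; translate_date_value is a hypothetical helper, and values that match no format are assumed to pass through unchanged.

```python
from datetime import datetime


def translate_date_value(value, translations):
    """Reformat value using the first source format that parses it."""
    for from_format, to_format in translations.items():
        try:
            parsed = datetime.strptime(value, from_format)
        except (TypeError, ValueError):
            # Not text, or not in this format: try the next one.
            continue
        return parsed.strftime(to_format)
    return value


translations = {"%Y-%m-%d": "%Y", "%Y": "%Y", "%Y-%m": "%Y"}
print(translate_date_value("2017-05-03", translations))
print(translate_date_value("2017-05", translations))
print(translate_date_value("n/a", translations))
```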
-
add_columns(record, additions, **kwargs)
Adds columns to a record.
Note
If the value of an entry in additions is a callable, then the callable should expect the record as the only parameter and should return the value that should be placed in the newly added column.
If the value of an entry in additions is a string, the record is passed in as kwargs to the value’s format method.
Otherwise, the value of an entry in additions is simply used as the newly added column’s value.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- additions (dict[str,...]) – A dictionary of column names to callables, strings, or other values
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly added column
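The three cases in the note can be sketched as a small dispatch on the type of each addition. The helper add_columns_sketch and the sample columns are hypothetical; this mirrors the documented behaviour rather than reproducing the library's code.

```python
from collections import OrderedDict


def add_columns_sketch(record, additions):
    """Add one column per entry in additions, following the three cases."""
    for name, addition in additions.items():
        if callable(addition):
            # Case 1: callables receive the record and return the new value.
            record[name] = addition(record)
        elif isinstance(addition, str):
            # Case 2: strings are formatted with the record as keyword arguments.
            record[name] = addition.format(**record)
        else:
            # Case 3: any other value is used as-is.
            record[name] = addition
    return record


record = OrderedDict([("first", "Ada"), ("last", "Lovelace")])
result = add_columns_sketch(record, {
    "initials": lambda r: r["first"][0] + r["last"][0],
    "full_name": "{first} {last}",
    "verified": False,
})
print(list(result.items()))
```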
-
remove_columns(record, removes, **kwargs)
Removes columns from a record.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- removes (list[str]) – A list of columns to remove
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly removed column
-
keep_columns(record, keeps, **kwargs)
Removes all other columns from a record.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- keeps (list[str]) – A list of columns to keep
- kwargs (dict) – Any named arguments
Returns: The record with a potential newly kept column
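remove_columns and keep_columns are complements of one another, which the following sketch illustrates on a plain OrderedDict. Both helper names are hypothetical, and the sketch assumes that column names absent from the record are silently skipped.

```python
from collections import OrderedDict


def remove_columns_sketch(record, removes):
    """Return a record without the listed columns."""
    return OrderedDict((c, v) for c, v in record.items() if c not in removes)


def keep_columns_sketch(record, keeps):
    """Return a record with only the listed columns."""
    return OrderedDict((c, v) for c, v in record.items() if c in keeps)


record = OrderedDict([("id", 1), ("name", "Ada"), ("notes", "n/a")])
removed = remove_columns_sketch(record, ["notes"])
kept = keep_columns_sketch(record, ["id", "name"])
print(list(removed), list(kept))
```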
-
rename_columns(record, renames, **kwargs)
Maps an existing column to a new column.
Parameters:
- record (collections.OrderedDict) – A record that should be normalized and returned
- renames (dict[str, str]) – A dictionary of column to column renames
- kwargs (dict) – Any named arguments
Returns: The record with the remapped column
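A rename can be sketched by rebuilding the ordered record with mapped keys. The helper rename_columns_sketch is hypothetical, and the sketch assumes renamed columns keep their original position, which the documentation above does not state.

```python
from collections import OrderedDict


def rename_columns_sketch(record, renames):
    """Rebuild the record, swapping any column name found in renames."""
    return OrderedDict(
        (renames.get(column, column), value) for column, value in record.items()
    )


record = OrderedDict([("id", 1), ("nm", "Ada")])
renamed = rename_columns_sketch(record, {"nm": "name"})
print(list(renamed.items()))
```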
-
order_columns(record, order, ignore_missing=False, **kwargs)
Orders columns in a specific order.
Parameters:
- record (collections.OrderedDict) – A record whose columns should be ordered
- order (list[str]) – The order that columns need to be in
- ignore_missing (bool) – Boolean which indicates if missing columns from order should be ignored
- kwargs (dict) – Any named arguments
Returns: The record with the columns reordered
-
apply(from_file, to_file, sheet_name=None, row_filter=None, monitor_rules=False, **kwargs)
Applies a SandPaper instance’s rules to a given glob of files.
Parameters:
- from_file (str) – The path of the file to apply the rules to
- to_file (str) – The path of the file to write to
- sheet_name (str) – The name of the sheet to apply rules to (defaults to the first available sheet)
- row_filter (callable) – A callable which accepts a cleaned record and returns True if the record should be written out
- monitor_rules (bool) – Boolean flag that indicates if the count of applied rules should be monitored
- kwargs (dict) – Any additional named arguments (applied to the pyexcel iget_records method)
Returns: The rule statistics if monitor_rules is true
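A row_filter is just a predicate over a cleaned record. The sketch below shows one such predicate applied to an in-memory list; has_email and the sample records are hypothetical, and the list comprehension only stands in for the filtering that apply is described as performing before writing rows out.

```python
from collections import OrderedDict


def has_email(record):
    """Hypothetical row filter: keep only records with a non-empty email."""
    return bool(record.get("email"))


records = [
    OrderedDict([("name", "Ada"), ("email", "ada@example.com")]),
    OrderedDict([("name", "Bob"), ("email", "")]),
]
# Only records for which the filter returns True would be written out.
kept = [record for record in records if has_email(record)]
print([record["name"] for record in kept])
```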
-
classmethod from_json(serialization)
Loads a SandPaper instance from a JSON serialization.
Note
Raises a UserWarning when the loaded instance does not match the serialized instance’s uid.
Parameters: serialization (dict) – The read JSON serialization
Returns: A new SandPaper instance
Return type: SandPaper