sandpaper

This is the base sandpaper package that gets imported.

sandpaper.sandpaper module

sandpaper.sandpaper.value_rule(func)[source]

A meta wrapper for value normalization rules.

Note

Value rules take in a full record and a column name as implicit parameters. They are expected to return the value at record[column] that has be normalized by the rule.

Parameters:func (callable) – The normalization rule
Returns:The wrapped normalization rule
Return type:callable
sandpaper.sandpaper.record_rule(func)[source]

A meta wrapper for table normalization rules.

Note

Record rules are applied after all value rules have been applied to a record. They take in a full record as an implicit parameter and are expected to return the normalized record back.

Parameters:func (callable) – The normalization rule
Returns:The wrapped normalization rule
Return type:callable
class sandpaper.sandpaper.SandPaper(name=None)[source]

Bases: object

The SandPaper object.

Allows chained data normalization across multiple different table type
data files such as .csv, .xls, and .xlsx.
name

The descriptive name of the SandPaper instance.

Note

If no name has been given, a continually updating uid hash of the active rules is used instead

Getter:Returns the given or suitable name for a SandPaper instance
Setter:Sets the descriptive name of the SandPaper instance
Return type:str
uid

A continually updating hash of the active rules.

A hexadecimal digest string

Getter:Returns a continually updating hash of the active rules
Return type:str
rules

This list of applicable rules for the SandPaper instance.

Getter:Returns the list of applicable rules for the instance
Return type:list[tuple(callable, tuple(…,…), dict[str,…])]
value_rules

The set of value rules for the SandPaper instance.

Getter:Returns the set rules for the SandPaper instance
Return type:set(callable)
record_rules

The set of record rules for the SandPaper instance.

Getter:Returns the set rules for the SandPaper instance
Return type:set(callable)
lower(record, column, **kwargs)[source]

A basic lowercase rule for a given value.

Only applies to text type variables

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • kwargs (dict) – Any named arguments
Returns:

The value lowercased

upper(record, column, **kwargs)[source]

A basic uppercase rule for a given value.

Only applies to text type variables

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • kwargs (dict) – Any named arguments
Returns:

The value uppercased

capitalize(record, column, **kwargs)[source]

A basic capitalization rule for a given value.

Only applies to text type variables

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • kwargs (dict) – Any named arguments
Returns:

The value capatilized

title(record, column, **kwargs)[source]

A basic titlecase rule for a given value.

Only applies to text type variables

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • kwargs (dict) – Any named arguments
Returns:

The value titlecased

lstrip(record, column, content=None, **kwargs)[source]

A basic lstrip rule for a given value.

Only applies to text type variables.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • content (str) – The content to strip (defaults to whitespace)
  • kwargs (dict) – Any named arguments
Returns:

The value with left content stripped

rstrip(record, column, content=None, **kwargs)[source]

A basic rstrip rule for a given value.

Only applies to text type variables.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • content (str) – The content to strip (defaults to whitespace)
  • kwargs (dict) – Any named arguments
Returns:

The value with right content stripped

strip(record, column, content=None, **kwargs)[source]

A basic strip rule for a given value.

Only applies to text type variables.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • content (str) – The content to strip (defaults to whitespace)
  • kwargs (dict) – Any named arguments
Returns:

The value with all content stripped

increment(record, column, amount=1, **kwargs)[source]

A basic increment rule for a given value.

Only applies to numeric (int, float) type variables.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • amount (int or float) – The amount to increment by
  • kwargs (dict) – Any named arguments
Returns:

The value incremented by amount

decrement(record, column, amount=1, **kwargs)[source]

A basic decrement rule for a given value.

Only applies to numeric (int, float) type variables.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • amount (int or float) – The amount to decrement by
  • kwargs (dict) – Any named arguments
Returns:

The value incremented by amount

replace(record, column, replacements, **kwargs)[source]

Applies a replacements dictionary to a value.

Take for example the following SandPaper instance:

s = SandPaper('my-sandpaper').replace({
    'this_is_going_to_be_replaced': 'with_this',
})
Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • replacements (dict[str, str]) – A dictionary of replacements for the value
  • kwargs (dict) – Any named arguments
Returns:

The value with all replacements made

translate_text(record, column, translations, **kwargs)[source]

A text translation rule for a given value.

Take for example the following SandPaper instance:

s = SandPaper('my-sandpaper').translate_text({
    r'^group(?P<group_id>\d+)\s*(.*)$': '{group_id}'
}, column_filter=r'^group_definition$')

This will translate all instances of the value group<GROUP NUMBER> to <GROUP NUMBER> only in columns named group_definition.

Important

Note that matched groups and matched groupdicts are passed as *args and **kwargs to the format method of the returned to_format string.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • translations (dict[str, str]) – A dictionary of translations the value
  • kwargs (dict) – Any named arguments
Returns:

The potentially translated value

translate_date(record, column, translations, **kwargs)[source]

A date translation rule for a given value.

Take for example the following SandPaper instance:

s = SandPaper('my-sandpaper').translate_date({
    '%Y-%m-%d': '%Y',
    '%Y': '%Y',
    '%Y-%m': '%Y'
}, column_filter=r'^(.*)_date$')

This will translate all instances of a date value matching the given date formats in columns ending with _date to the date format %Y.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • column (str) – A column that indicates what value to normalize
  • translations (dict[str, str]) – A dictionary of translations from an arrow based dateformats to a different format
  • kwargs (dict) – Any named arguments
Returns:

The value potentially translated value

add_columns(record, additions, **kwargs)[source]

Adds columns to a record.

Note

If the value of an entry in additions is a callable, then the callable should expect the record as the only parameter and should return the value that should be placed in the newly added column.

If the value of an entry in additions is a string, the record is passed in as kwargs to the value’s format method.

Otherwise, the value of an entry in additions is simply used as the newly added column’s value.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • additions (dict[str,...]) – A dictionary of column names to callables, strings, or other values
  • kwargs (dict) – Any named arguments
Returns:

The record with a potential newly added column

remove_columns(record, removes, **kwargs)[source]

Removes columns from a record.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • removes (list[str]) – A list of columns to remove
  • kwargs (dict) – Any named arguments
Returns:

The record with a potential newly removed column

keep_columns(record, keeps, **kwargs)[source]

Removes all other columns from a record.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • keeps (list[str]) – A list of columns to keep
  • kwargs (dict) – Any named arguments
Returns:

The record with a potential newly kept column

rename_columns(record, renames, **kwargs)[source]

Maps an existing column to a new column.

Parameters:
  • record (collections.OrderedDict) – A record whose value within column should be normalized and returned
  • renames (dict[str, str]) – A dictionary of column to column renames
  • kwargs (dict) – Any named arguments
Returns:

The record with the remapped column

order_columns(record, order, ignore_missing=False, **kwargs)[source]

Orders columns in a specific order.

Parameters:
  • record (collections.OrderedDict) – A record who should be ordered
  • order (list[str]) – The order that columns need to be in
  • ignore_missing (bool) – Boolean which inidicates if missing columns from order should be ignored
  • kwargs (dict) – Any named arguments
Returns:

The record with the columns reordered

apply(from_file, to_file, sheet_name=None, row_filter=None, monitor_rules=False, **kwargs)[source]

Applies a SandPaper instance rules to a given glob of files.

Parameters:
  • from_file (str) – The path of the file to apply the rules to
  • to_file (str) – The path of the file to write to
  • sheet_name (str) – The name of the sheet to apply rules to (defaults to the first available sheet)
  • row_filter (callable) – A callable which accepts a cleaned record and returns True if the record should be written out
  • monitor_rules (bool) – Boolean flag that inidicates if the count of applied rules should be monitored
  • kwargs (dict) – Any additional named arguments (applied to the pyexcel iget_records method)
Returns:

The rule statistics if monitor_rules is true

Return type:

dict[str, int]

classmethod from_json(serialization)[source]

Loads a SandPaper instance from a json serialization.

Note

Raises a UserWarning when the loaded instance does not match the serialized instance’s uid.

Parameters:serialization (dict) – The read json serialization
Returns:A new SandPaper instance
Return type:SandPaper