Available Rules

Below are a list of available rules that can be attached to a SandPaper instance. All of these rules first must pass several optional filters discussed in Rule Filters.

In the following examples of these rules the symbol □ represents whitespace.

Value Rules

These rules are applied to every value that passes the specified rule filters documented in Rule Filters.

lower()

A basic rule that lowercases the text in a value.

SandPaper().lower()
Input Output
DATa data

upper()

A basic rule that uppercases the text in a value.

SandPaper().upper()
Input Output
daTa DATA

capitalize()

A basic rule that capitalizes the text in a value.

SandPaper().capitalize()
Input Output
daTa Data

title()

A basic rule that titlecases the text in a value.

SandPaper().title()
Input Output
mY dAta My Data

lstrip()

A basic rule that strips all left whitespace from a value.

SandPaper().lstrip()
Input Output
□□data data

rstrip()

A basic rule that strips all right whitespace from a value.

SandPaper().rstrip()
Input Output
data□□ data

strip()

A basic rule that strips all whitespace from a value.

SandPaper().strip()
Input Output
□data□ data

translate_text()

A translation rule that translate regex matches to a specified format.

SandPaper().translate_text({
   r'group_(?P<group_id>\d+)$': '{group_id}'
})
Input Output
group_47 47
group_123 123
group_0 0

translate_date()

A translation rule that translate greedily evaluated dates to a specified datetime format.

Note

This rule is very greedy and can potentailly evaluate dates incorrectly. It is highly recommended that at the very least a column_filter is supplied with this rule.

SandPaper().translate_date({
   '%Y-%m-%d': '%Y',
   '%Y-%m': '%Y',
   '%Y': '%Y'
})
Input Output
2017-01-32 2017
2017-01 2017
2017 2017

Record Rules

These rules are applied to every record regardless to most rule filters documented in Rule Filters.

add_columns()

Adds a column to every record.

The given dictionary should be a key value pairing where the key is a new column name and the paired value is either a callable, string, or other low level data type for the newly added column’s value. If the value is a callable it should expect to receive the record as the only parameter and should return the value desired for the newly added column.

import uuid

def gen_uuid(record):
   return uuid.uuid4()

SandPaper().add_columns({
   'uuid': gen_uuid
})
Before
id name value
1 hello world
2 test table

After
id name value uuid
1 hello world a6a76896-c33d-4654-afdf-12aa80dd6238
2 test table b1e171c2-fee9-4270-96e9-4853c3a6e130

remove_columns()

Removes a column from every record.

SandPaper().remove_columns([
   'name'
])
Before
id name value
1 hello world
2 test table

After
id value
1 world
2 table

keep_columns()

Removes all other columns for every record.

SandPaper().keep_columns([
   'id',
   'name'
])
Before
id name value
1 hello world
2 test table

After
id name
1 hello
2 test

rename_columns()

Renames a column for every record.

SandPaper().rename_columns([
   'old_name': 'new_name'
])
Before
id old_name value
1 hello world
2 test table

After
id new_name value
1 hello world
2 test table

order_columns()

Reorders columns from every record.

SandPaper().order_columns([
   'value',
   'name',
   'id'
])
Before
id name value
1 hello world
2 test table

After
value id name
world 1 hello
table 2 test