We call the anonymization rules for the table columns `transformers`.
You should use them in the rules sections of the configuration file.
Example:
rules:
  # `field_name` is a name of database column
  field_name:
    # gets a person full name
    person_name: {}
In other examples we will omit the `rules.field_name` part.
This document contains the full list of transformers (grouped by categories).
Many transformers don't need any configuration. You can use them just like this:
# gets a person last name
last_name: {}
If there are no specific notes in the documentation for some transformer, then there are no configuration options for it.
Also, many of them support locale configuration:
# gets a person last name
last_name:
  locale: RU
Please refer here for the list of available locales.
In this document we mark such transformers with the globe symbol 🌐.
For some transformers, specifying the locale now may not have any practical effect (but it may have an effect in the future).
You can specify that result values must be unique (they are not unique by default). You can use short or full syntax.
Short:
email:
  uniq: true
Full:
email:
  uniq:
    required: true
    try_count: 5
Uniqueness is ensured by regenerating a value when it duplicates one that was already produced.
You can customize the number of attempts with `try_count` (this field is optional; the default number of tries depends on the rule, and for some rules it is determined automatically).
Currently, uniqueness is supported by: email, ip, phone, random_num.
In the future, we plan to add support for the uniqueness option for all transformers.
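The retry behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the function name `generate_unique`, the default of 3 attempts, and returning `None` when every attempt collides are assumptions made for the example.

```python
from typing import Callable, Optional, Set


def generate_unique(generate: Callable[[], str], seen: Set[str],
                    try_count: int = 3) -> Optional[str]:
    """Call `generate` up to `try_count` times until it yields an unseen value."""
    for _ in range(try_count):
        value = generate()
        if value not in seen:
            seen.add(value)
            return value
    # What the real tool does when all attempts collide is not stated in this
    # document; this sketch simply reports the failure with None.
    return None


# A deterministic stand-in generator that repeats its first value once.
values = iter(["a@example.com", "a@example.com", "b@example.com"])
seen: Set[str] = set()
first = generate_unique(lambda: next(values), seen)
second = generate_unique(lambda: next(values), seen)
```

Here the second call hits a duplicate on its first attempt and succeeds on the retry.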
Gets a boolean value (TRUE/FALSE), with a given probability.
Examples:
The default:
boolean: {}
You can specify the probability of TRUE value:
boolean:
  # 40% for the TRUE and 60% for the FALSE
  ratio: 40
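As a rough illustration of what a percentage `ratio` means, here is a minimal Python sketch. The function name `random_boolean` and the default ratio of 50 are assumptions for the example; the document does not state the default probability.

```python
import random
from typing import Optional


def random_boolean(ratio: float = 50, rng: Optional[random.Random] = None) -> bool:
    """Return True with probability ratio/100 (ratio is a percentage)."""
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so ratio=0 never yields True
    # and ratio=100 always does.
    return rng.random() * 100 < ratio
```

With `ratio=40`, roughly 40 out of every 100 generated values would be TRUE.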
Generates random dates in the specified interval (granularity is a second).
Examples:
The default:
datetime: {}
You can specify a range:
datetime:
  from: "1990-01-01T00:00:00+00:00"
  to: "2010-12-31T00:00:00+00:00"
Also, you can specify the datetime format:
datetime:
  format: "%Y-%m-%d"
For the bounds (`from`/`to`) you should use the RFC 3339 format (`%Y-%m-%dT%H:%M:%S%.f%:z`).
The default output format is also RFC 3339. You don't need to change the format when using this transformer with datetime SQL fields.
Here you can look at the available formatting patterns.
Notes:
- `%C`, `%Z` and `%s` are not supported.
- `%.f` works like `%.9f` (always 9 digits). The behaviour of the `%+` pattern is the same in this regard.
- Patterns (e.g. `%x`, `%X`, `%c`) are not localized.
- Modifiers `_`, `-`, `0` are not supported yet (you can make a feature request).

These limitations exist because we removed the dependency on the chrono crate and use the time crate directly (because of a security issue in chrono).
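To make the idea of a random moment with one-second granularity inside an RFC 3339 interval concrete, here is a small Python sketch. It is not how the transformer is implemented: `random_datetime` is a hypothetical helper, the bounds are assumed to be inclusive, and Python's `strftime` patterns are similar to, but not the same as, the patterns the transformer accepts.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def random_datetime(start: str, end: str, fmt: Optional[str] = None,
                    rng: Optional[random.Random] = None) -> str:
    """Pick a random moment between two RFC 3339 bounds, in whole seconds."""
    rng = rng or random.Random()
    lower = datetime.fromisoformat(start)
    upper = datetime.fromisoformat(end)
    span_seconds = int((upper - lower).total_seconds())
    # Adding a whole number of seconds keeps the granularity at one second.
    moment = lower + timedelta(seconds=rng.randint(0, span_seconds))
    return moment.strftime(fmt) if fmt else moment.isoformat()


full = random_datetime("1990-01-01T00:00:00+00:00", "2010-12-31T00:00:00+00:00")
date_only = random_datetime("1990-01-01T00:00:00+00:00",
                            "2010-12-31T00:00:00+00:00", fmt="%Y-%m-%d")
```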
Generates a fixed text (a plain text).
Example:
plain: "some text"
Gets a random number.
Examples:
The default:
random_num: {}
You can specify a range (one border or both):
random_num:
  min: 10
  max: 20
The default range is from `0` to `2^64 - 1` (for 64-bit application binary).
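The range options can be pictured with the following Python sketch. The name `random_num` mirrors the transformer only for readability; treating both borders as inclusive is an assumption, since the document does not say whether `max` itself can be generated.

```python
import random
from typing import Optional

DEFAULT_MIN = 0
DEFAULT_MAX = 2**64 - 1  # the documented default upper border for 64-bit builds


def random_num(min_value: int = DEFAULT_MIN, max_value: int = DEFAULT_MAX,
               rng: Optional[random.Random] = None) -> int:
    """Return a random integer; either border may be left at its default."""
    rng = rng or random.Random()
    # randint includes both ends (an assumption about the real transformer).
    return rng.randint(min_value, max_value)
```

Passing only `min_value` or only `max_value` corresponds to specifying one border in the config.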
If you want to generate unique numbers, use this option:
random_num:
  uniq: true
The transformer will collect information about generated numbers and check their uniqueness.
If such a number already exists in the list, then the transformer will try to generate the value again.
You can limit the number of tries (the default is `3`):
random_num:
  uniq:
    required: true
    try_count: 5
Capitalizes a given value (from the database, or a previous value in the pipeline).
E.g., the `3 short words` value will be transformed to `3 Short Words`.
Example:
# You should use ~ (the null value in YAML) for this transformer
capitalize: ~
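The effect on the `3 short words` value can be reproduced with this short Python sketch. It is an approximation: whether the real transformer also lowercases the remaining letters of each word is not stated here, so this version leaves them untouched.

```python
def capitalize_words(value: str) -> str:
    """Upper-case the first character of every space-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in value.split(" "))


result = capitalize_words("3 short words")
```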
This transformer does nothing (a kind of `noop`).
Example:
# You should use ~ (the null value in YAML) for this transformer
none: ~
You can use pipelines with complicated rules to generate more complex values. You can use any transformers as steps (including other pipelines).
Example:
pipeline:
  pipes:
    - email: {}
    - capitalize: ~
The pipes will be executed in the order in which they are specified in the config.
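Conceptually, a pipeline feeds the output of each step into the next one. The Python sketch below shows that ordering with plain functions standing in for transformers; `run_pipeline` is a hypothetical name, not part of the tool.

```python
from functools import reduce
from typing import Callable, Iterable


def run_pipeline(value: str, pipes: Iterable[Callable[[str], str]]) -> str:
    """Apply each step to the result of the previous one, in the given order."""
    return reduce(lambda current, pipe: pipe(current), pipes, value)


def append_c(text: str) -> str:
    return text + "c"


# The same two steps give different results depending on their order.
upper_last = run_pipeline("ab", [append_c, str.upper])
upper_first = run_pipeline("ab", [str.upper, append_c])
```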
This is the most sophisticated and flexible transformer.
It uses the Tera template engine (inspired by Jinja2).
Specification:
Section | Mandatory | YAML type | Description |
---|---|---|---|
`format` | yes | text | The template for the generated value |
`rules` | no | list | Nested rules (transformers). You can use them in the template |
`variables` | no | dictionary | Template variables |
Examples:
template:
  format: "Hello, {{name}}! {{_1}}:{{_0 | upper}}"
  rules:
    - email: {}
  variables:
    name: Alex
where:
- `_0` - original value;
- `_1`, `_2`, ... `_N` - nested rules by index (starting from 1). You can use any transformer (including templates);
- `name` - the named variable from the `variables` section.
It will generate something like `Hello, Alex! [email protected]:ORIGINALVALUE`.
You can use any filter or markup from the Tera template engine.
Also, you can use the global variables in templates.
You can reference values of other row fields in templates.
Use the `prev` special variable for original values and the `final` special variable for anonymized ones:
tables:
  - name: some_table
    # You must specify the order of rule execution when using `final`
    rule_order:
      - greeting
      - options
    rules:
      first_name:
        first_name: {}
      greeting:
        template:
          # Keeping the first name, but anonymizing the last name
          format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
      options:
        template:
          # Using the anonymized value again
          format: "{greeting: \"{{ final.greeting }}\"}"
When using `final`, you must specify the order of rule execution with `rule_order`.
All rules not listed will be placed at the beginning (i.e., you only need to list the rules that use `final`).
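The ordering rule (unlisted rules first, then the listed ones in the given order) can be sketched as follows. This is an illustration only: `execution_order` is a hypothetical helper, and keeping the unlisted rules in their original relative order is an assumption.

```python
from typing import List, Sequence


def execution_order(rules: Sequence[str], rule_order: Sequence[str]) -> List[str]:
    """Place rules missing from `rule_order` first, then the listed ones."""
    listed = set(rule_order)
    unlisted = [name for name in rules if name not in listed]
    return unlisted + list(rule_order)


order = execution_order(["options", "first_name", "greeting"],
                        ["greeting", "options"])
```

For the config above this means `first_name` runs before `greeting`, and `greeting` before `options`.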
Also, we implemented a built-in key-value store that allows information to be exchanged between anonymized rows.
It is available via the custom functions in templates (you can read about Tera functions here).
Take a look at an example:
tables:
  - name: users
    rules:
      name:
        template:
          # Save a name to the store as a side effect, the key is `user_names.<USER_ID>`
          format: "{{ _1 }}{{ store_write(key='user_names.' ~ prev.id, value=_1) }}"
          rules:
            - person_name: {}
  - name: user_operations
    rules:
      user_name:
        template:
          # Using the saved value again
          format: "{{ store_read(key='user_names.' ~ prev.user_id) }}"
The full list of functions for working with the store:
- `store_read` - returns a value by key; when there is no such key, it returns the default value, or raises an error if no default value is provided. Arguments: `key`, `default` (the `default` arg is optional).
- `store_write` - stores a value in a key; raises an error when the key is already present. Arguments: `key`, `value`.
- `store_force_write` - like `store_write`, but overrides existing values without errors. Arguments: `key`, `value`.
- `store_inc` - increments the value in a key (the first time, it just stores the value). Works only with numbers. Arguments: `key`, `value`.
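The semantics of the four store functions can be modelled with a small in-memory class. This Python sketch only mirrors the behaviour described in the list above; the class name `Store`, the method names and the use of `KeyError`/`TypeError` are choices made for the example, not the tool's actual API.

```python
_MISSING = object()  # sentinel meaning "no default was provided"


class Store:
    """A toy key-value store following the documented function semantics."""

    def __init__(self) -> None:
        self._data = {}

    def read(self, key, default=_MISSING):
        if key in self._data:
            return self._data[key]
        if default is _MISSING:
            raise KeyError(f"no such key: {key}")
        return default

    def write(self, key, value) -> None:
        if key in self._data:
            raise KeyError(f"key already present: {key}")
        self._data[key] = value

    def force_write(self, key, value) -> None:
        self._data[key] = value  # overrides silently

    def inc(self, key, value) -> None:
        if not isinstance(value, (int, float)):
            raise TypeError("inc works only with numbers")
        # The first call just stores the value, later calls add to it.
        self._data[key] = self._data.get(key, 0) + value


store = Store()
store.write("user_names.1", "Ann")
store.force_write("user_names.1", "Bob")
store.inc("counter", 2)
store.inc("counter", 3)
```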
Also, you can use the template transformer for returning NULL values for your database.
For PostgreSQL, we must return `\N` from the transformer:
template:
  format: '\N'
If you need the `\N` literal in your database, please return `\\N` from the transformer.
If you need the `\\N` literal, return `\\\N`, and so on.
Warning! This behavior can be changed in the future.
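The escaping rule for PostgreSQL NULLs can be summarised as "add one backslash to any value that consists of backslashes followed by `N`". The Python sketch below encodes that reading; `to_transformer_output` is a hypothetical helper, and generalising the two documented cases with "and so on" into a regular expression is an assumption.

```python
import re
from typing import Optional

# One or more backslashes followed by a single "N" and nothing else.
_NULL_LIKE = re.compile(r"\\+N")


def to_transformer_output(value: Optional[str]) -> str:
    """Map a desired database value to what the transformer should return."""
    if value is None:
        return "\\N"          # a real NULL is written as \N
    if _NULL_LIKE.fullmatch(value):
        return "\\" + value   # a literal \N needs one extra backslash
    return value              # everything else passes through unchanged
```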
It extends the built-in filters in templates with some crypto functions:
- `bcrypt_hash` - generates a bcrypt hash for the input string. Arguments: `cost` (optional), the bcrypt cost.
Take a look at an example:
tables:
  - name: users
    rules:
      password_hash_default:
        template:
          format: "{{ _1 | bcrypt_hash }}"
          rules:
            - word: {} # Random word
      password_hash_with_cost:
        template:
          format: "{{ _1 | bcrypt_hash(cost=10) }}"
          rules:
            - word: {} # Random word
This transformer allows you to replace values in JSON and JSONB columns using JSONPath selectors.
It uses the jsonpath_lib crate.
Specification:
Section | Mandatory | YAML type | Description |
---|---|---|---|
`fields` | yes | list | List of selectors and related rules (transformers) |
`on_invalid` | no | text or dictionary | Reaction to invalid input JSON (the default reaction is to return `{}`) |
Example:
json:
  fields:
    - name: "user_name"
      selector: "$..user.name"
      quote: true
      rule:
        template:
          format: "UserName"
    - name: "user_age"
      selector: "$..user.age"
      rule:
        random_num:
          min: 25
          max: 55
If a value of the column is `{"user": {"name": "Andrew", "age": 20, "comment": "The comment"}}`, the transformed value will be something like this: `{"user": {"name": "UserName", "age": 30, "comment": "The comment"}}`.
The fields are transformed sequentially, in the order in which they are listed.
A list of field descriptions.
Specification of each field item:
Section | Mandatory | YAML type | Description |
---|---|---|---|
`name` | yes | text | Selector name (your choice, but it should be unique in the scope of this transformer) |
`selector` | yes | text | JSONPath selector |
`rule` | yes | dictionary | Transform rule |
`quote` | no | boolean | Whether a transformation result should be quoted (with `"`). The default is `false` |
There are three possible options for the reaction on an invalid input value (incorrectly formatted JSON):
- `as_is` - perform no transformation, just return the current value;
- `replace_with` - replace with a provided plain value or with the result of a provided transformer;
- `error` - stop with an error.
The default is to replace the invalid value with `{}`.
Examples:
This config returns an invalid value as is:
json:
  fields:
    - name: "user_name"
      selector: "$..user.name"
      quote: true
      rule:
        first_name: {}
  on_invalid: as_is
This one returns the specified JSON instead of an invalid value:
json:
  fields:
    - name: "user_name"
      selector: "$..user.name"
      quote: true
      rule:
        first_name: {}
  on_invalid:
    replace_with: '{"user": {"name": "John", "age": 30}}'
This one returns the result of the specified transformer instead of an invalid value:
json:
  fields:
    - name: "user_name"
      selector: "$..user.name"
      quote: true
      rule:
        first_name: {}
  on_invalid:
    replace_with:
      template:
        format: '{ "user": { "name": "{{ _1 }}", "age": 30 } }'
        rules:
          - person_name: {}
And this one raises an error on an invalid value:
json:
  fields:
    - name: "user_name"
      selector: "$..user.name"
      quote: true
      rule:
        first_name: {}
  on_invalid: error
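The three reactions to invalid JSON, plus the default, can be modelled in Python as shown below. This is a sketch under assumptions: `resolve_invalid_json` is a hypothetical helper, the `on_invalid` argument mimics the YAML shapes from the examples above, and a callable stands in for a nested transformer.

```python
import json


def resolve_invalid_json(raw: str, on_invalid=None) -> str:
    """Return `raw` if it parses as JSON, otherwise apply the chosen reaction."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        pass
    else:
        return raw  # valid input: the field rules would be applied here
    if on_invalid is None:
        return "{}"  # the documented default reaction
    if on_invalid == "as_is":
        return raw
    if on_invalid == "error":
        raise ValueError("invalid JSON input")
    replacement = on_invalid["replace_with"]
    # A callable plays the role of a transformer, a string is a plain value.
    return replacement() if callable(replacement) else replacement


broken = '{"user": '
```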
Gets a company activity description (e.g., `integrate vertical markets`).
Gets a company activity verb.
Gets a company activity adjective.
Gets a company activity noun.
Gets a company motto.
Gets a head component of a company motto.
Gets a middle component of a company motto.
Gets a tail component of a company motto.
Gets a company name.
Gets a company name (an alternative variant).
Gets a company name suffix (e.g., `Inc.` or `LLC`).
Gets an industry name.
Gets a profession name.
Gets a currency code (e.g., `EUR` or `USD`).
Gets a currency name.
Gets a currency symbol.
Gets a file directory path.
Gets a file extension.
Gets a file name.
Gets a file path.
Gets a domain suffix (e.g., `com`).
Gets a random email. You can specify a kind. The kind can be `Safe` (the default) or `Free`.
With the `Safe` kind the transformer generates only emails for example domains (e.g., [email protected]).
These are not real email addresses.
With the `Free` kind the transformer generates emails for free email providers (e.g., [email protected], [email protected], [email protected]).
You can add a random alphanumeric prefix and/or suffix (e.g., [email protected], [email protected], [email protected]).
This is useful when you need many unique emails.
Also, you can specify a fixed prefix/suffix (`test-` or `-test`) or use a transformer as a prefix/suffix (usually, a template).
The default separator for prefixes and suffixes is `-`. You can change it with the `affix_separator` option.
Examples:
The default:
email: {}
You can specify the kind:
email:
  kind: Free
With a random prefix:
email:
  # prefix length
  prefix: 5
With a random suffix:
email:
  # suffix length
  suffix: 5
With a fixed prefix:
email:
  # prefix content
  prefix: "test"
Using a transformer as prefix:
email:
  # prefix template
  prefix:
    template:
      format: "........"
      #.......
Custom `affix_separator` ([email protected]):
email:
  prefix: 5
  affix_separator: "__"
If you want to generate unique emails, use this option:
email:
  uniq: true
The transformer will collect information about generated emails and check their uniqueness.
If such an email already exists in the list, then the transformer will try to generate the value again.
You can limit the number of tries (the default is `3`):
email:
  uniq:
    required: true
    try_count: 5
Gets a free email provider name (e.g., `gmail.com`).
Generates an IP address.
You can specify the kind (`V4` or `V6`).
Examples:
The default:
ip: {}
The default kind is `V4`; you can specify `V6`:
ip:
  kind: V6
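For readers who want to see what the two kinds amount to, here is a Python sketch built on the standard `ipaddress` module. It is not the transformer's code: `random_ip` is a hypothetical helper that draws uniformly from the whole address space, which the real tool may or may not do.

```python
import ipaddress
import random
from typing import Optional


def random_ip(kind: str = "V4", rng: Optional[random.Random] = None) -> str:
    """Return a random IPv4 or IPv6 address as text."""
    rng = rng or random.Random()
    if kind == "V4":
        return str(ipaddress.IPv4Address(rng.getrandbits(32)))
    if kind == "V6":
        return str(ipaddress.IPv6Address(rng.getrandbits(128)))
    raise ValueError(f"unknown kind: {kind}")
```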
Gets a local cell phone number (for a given locale).
Gets a local phone number (for a given locale).
Gets a MAC address.
Generates a random password.
You can set minimum and maximum string length.
Examples:
The default:
password: {}
With a custom length (the default `min` option is `8` and the `max` option is `20`):
password:
  min: 5
  max: 10
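The `min`/`max` options bound the length of the generated string. A minimal Python sketch of that idea follows; the alphabet (ASCII letters and digits) and the helper name `random_password` are assumptions, since the document does not say which characters the transformer uses.

```python
import secrets
import string

# Assumed alphabet; the real transformer's character set is not documented here.
ALPHABET = string.ascii_letters + string.digits


def random_password(min_len: int = 8, max_len: int = 20) -> str:
    """Return a random string whose length lies between min_len and max_len."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    length = min_len + secrets.randbelow(max_len - min_len + 1)
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```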
Gets a random phone number.
Examples:
The default:
phone: {}
You can specify the phone format:
phone:
  format: "+7^#########"
where:
- `#` - any digit from 0 to 9;
- `^` - any digit from 1 to 9.
Also, you can use any other symbols in the format: `^##-00-### (##-##)`.
The default format is `+###########`.
If you want to generate unique phone numbers for this database column, use the `uniq` option:
phone:
  uniq: true
The transformer will collect information about generated numbers and check their uniqueness. If such a number already exists in the list, then the transformer will try to generate the value again. The number of attempts is limited by the number of available variants based on the format.
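The format placeholders and the idea of "available variants" can be illustrated with the Python sketch below. Both helpers are hypothetical: `phone_from_format` expands `#` and `^` as described above, and `variant_count` shows one plausible way to count the distinct numbers a format can produce.

```python
import random
from typing import Optional


def phone_from_format(fmt: str = "+###########",
                      rng: Optional[random.Random] = None) -> str:
    """Replace `#` with a digit 0-9 and `^` with a digit 1-9; keep the rest."""
    rng = rng or random.Random()
    out = []
    for ch in fmt:
        if ch == "#":
            out.append(str(rng.randint(0, 9)))
        elif ch == "^":
            out.append(str(rng.randint(1, 9)))
        else:
            out.append(ch)  # any other symbol is copied literally
    return "".join(out)


def variant_count(fmt: str) -> int:
    """Number of distinct values a format can yield (10 per `#`, 9 per `^`)."""
    return 10 ** fmt.count("#") * 9 ** fmt.count("^")


number = phone_from_format("+7^#########")
```

A format with few placeholders has few variants, which is why uniqueness cannot be retried indefinitely.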
Gets a User-Agent header.
Gets a username (login).
Gets a job field.
Gets a job position.
Gets a job seniority (e.g., `Lead`, `Senior` or `Junior`).
Gets a job title (seniority + field + position).
Gets a building number.
Gets a city name.
Gets a city prefix (e.g., `North-` or `East-`).
Gets a city suffix (e.g., `-town`, `-berg` or `-ville`).
Gets a country code (e.g., `RU`).
Gets a country name.
Gets a dwelling unit type (e.g., `Apt.` or `Suit.`).
Gets a dwelling unit part of the address (apartment, flat...).
Gets a latitude.
Gets a longitude.
Gets a post code.
Gets a state (or the equivalent) abbreviation (e.g., `AZ` or `LA`).
Gets a state (or the equivalent) name.
Gets a street name.
Gets a street suffix (e.g., `Avenue` or `Highway`).
Gets a time zone (e.g., `Europe/London`).
Gets a zip code.
Gets a person first name.
Gets a person last name.
Gets a person middle name (a patronymic name, if the locale has such a concept).
Gets a name suffix (e.g., `Jr.`).
Gets a person name (full).
Gets a person name with title.
Gets a person name title (e.g., `Mr` or `Ms`).
Gets a "lorem" paragraph (you can specify a count of sentences).
Examples:
The default:
paragraph: {}
This is equal to:
paragraph:
  locale: EN
  # Min count
  min: 2
  # Max count
  max: 5
Gets several "lorem" paragraphs (you can specify a count).
Examples:
The default:
paragraphs: {}
This is equal to:
paragraphs:
  locale: EN
  # Min count
  min: 2
  # Max count
  max: 5
Gets a "lorem" sentence (you can specify a count of words).
Examples:
The default:
sentence: {}
This is equal to:
sentence:
  locale: EN
  # Min count
  min: 2
  # Max count
  max: 5
Gets several "lorem" sentences (you can specify a count).
Examples:
The default:
sentences: {}
This is equal to:
sentences:
  locale: EN
  # Min count
  min: 2
  # Max count
  max: 5
Gets a "lorem" word.
Gets several "lorem" words (you can specify a count).
Examples:
The default:
words: {}
This is equal to:
words:
  locale: EN
  # Min count
  min: 2
  # Max count
  max: 5
Generates random Base64 tokens. You can set a token length (default is 32) and a padding (`=` symbols) length.
Examples:
With defaults:
#...
base64_token: {}
With a custom length:
base64_token:
  # the padding is included into the length, so we have 35 symbols and `=`
  len: 36
  pad: 1
Generates random Base64Url tokens.
You can set a token length (default is 32) and a padding, which is a number of `%3D` sequences.
Examples:
With defaults:
base64_token: {}
With a custom length:
base64_token:
  # the padding is included into the length, so we have 34 symbols and the padding (`%3D%3D`)
  len: 36
  pad: 2
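The relationship between `len` and `pad` in both token transformers (the padding counts towards the length) can be sketched in Python as follows. The helper `random_token`, the alphabets and the default of zero padding are assumptions for the example; only the arithmetic from the comments above is taken from this document.

```python
import secrets
import string

BASE64_ALPHABET = string.ascii_letters + string.digits + "+/"
BASE64URL_ALPHABET = string.ascii_letters + string.digits + "-_"


def random_token(length: int = 32, pad: int = 0,
                 alphabet: str = BASE64_ALPHABET, pad_symbol: str = "=") -> str:
    """Build `length - pad` random symbols followed by `pad` padding marks."""
    if pad > length:
        raise ValueError("pad must not exceed length")
    body = "".join(secrets.choice(alphabet) for _ in range(length - pad))
    return body + pad_symbol * pad


# len: 36, pad: 1 gives 35 symbols and one `=`.
base64_like = random_token(36, 1)
# len: 36, pad: 2 with URL-encoded padding gives 34 symbols and `%3D%3D`.
base64url_like = random_token(36, 2, BASE64URL_ALPHABET, "%3D")
```

Note that each `%3D` counts as one padding mark even though it is three characters long.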
Generates random hex tokens. You can set a token length (default is 32).
Examples:
The default:
hex_token: {}
With a custom length:
hex_token:
  len: 128
Generates random UUIDs. It uses the UUID version 4 algorithm.
Example:
uuid: ~
Gets a color code (e.g., `#ffffff`).
Gets a localized digit symbol (e.g., `2` or `5` for the English locale).