Datanymizer uses a configuration file (config.yml
) to determine what data to dump and how to anonymize it.
A config example (for the Postgres demo database DVD Rental):
tables:
- name: actor
rules:
first_name:
# random name
first_name: {}
last_name:
# random surname
last_name: {}
last_update:
# random date
datetime:
from: 1990-01-01T00:00:00+00:00
to: 2010-12-31T00:00:00+00:00
query:
# keeping data of the actor Jane Jackman unanonymized
transform_condition: "NOT (first_name = 'Jane' AND last_name = 'Jackman')"
# not dumping the actor with actor_id = 132 (Adam Hopper)
dump_condition: "actor_id <> 132"
- name: address
rules:
address:
# using template
template:
# using transformed (anonymized) value of district
format: "{{ final.district }}, {{ _1 }}, {{ _2 }}"
rules:
# random street name
- street_name: {}
# random building number
- building_number: {}
address2:
# using the template engine (Tera, it is very similar to Jinja) features: condition and built-in function:
# we add an address comment to roughly half of the rows
# the template engine is very agile
template:
format: "{% if get_random(start=1, end=2) == 1 %}Comment: {{ _1 }}{% endif %}"
rules:
# lorem ipsum words (the number of words is 1-2)
- words:
min: 1
max: 2
district:
template:
format: "{{ _1 }}, {{ _2 }}"
rules:
# nested template
- template:
format: "{{ _2 }} ({{ _1 }})"
rules:
# random country code
- country_code: {}
# random state abbreviation
- state_abbr: {}
- template:
format: "dst"
phone:
# random phone with some format
phone:
format: "7900#######"
# phones will be unique
uniq: true
postal_code:
# random postal code
post_code: {}
# you must specify the order of rule execution when using `final`
rule_order:
- address
- name: city
rules:
city:
city: {}
- name: customer
rules:
active:
# using anonymized `activebool` value
template:
format: "{% if final.activebool == 'TRUE' %}1{% else %}0{% endif %}"
activebool:
# the probability of `true` is 80%
boolean:
ratio: 80
create_date:
datetime:
from: 2000-01-01T00:00:00+00:00
to: 2020-12-31T00:00:00+00:00
email:
# using the original first name value in the anonymized email
# also using the anonymized value of `active`
template:
format: "{{ prev.first_name | lower }}-{{ final.active }}-{{ _1 }}"
rules:
# random email
- email: {}
last_name:
# using of original value (keep the first letter of the last name)
template:
format: "{{ _0 | truncate(length=1) }}"
rule_order:
- active
- email
- name: film
rules:
fulltext:
# no transformation
none: ~
length:
# random number
random_num:
min: 50
max: 200
rating:
pipeline:
# using pipelines
pipes:
- template:
format: "r"
- capitalize: ~
- name: film_actor
rules: {}
query:
# not dumping the actor with id = 132 (Adam Hopper)
dump_condition: "actor_id <> 132"
- name: payment
rules:
amount:
# using the value from globals
template:
format: "{{ prev.amount | float * payment_k }}"
- name: staff
rules:
email:
email: {}
username:
template:
# using the values from globals and template variables
format: "{{ global_value }}.{{ template_var }}.{{ _1 }}"
rules:
# random number
- random_num:
min: 100
max: 999
variables:
template_var: "tv456"
password:
# random hex token
hex_token:
len: 40
default:
locale: EN
# some global variables (they are available in templates)
globals:
global_value: "gv123"
payment_k: 1.73
The config file contains following sections:
Section | Mandatory | YAML type | Description |
---|---|---|---|
tables | yes | list | A list of anonymized tables |
table_order | no | list | An order of table dumping |
default | no | dictionary | Default values for different anonymization rules |
filter | no | dictionary | A filter for tables schema and data (what to skip when dumping) |
globals | no | dictionary | Some global values (they are available in anonymization templates) |
The tables
section is a list of anonymized tables.
This is a main element of the config.
Example (there are anonymization rules for two database tables: actor
and address
):
tables:
- name: actor
rules:
first_name:
# random name
first_name: {}
last_name:
# random surname
last_name: {}
last_update:
# random date
datetime:
from: 1990-01-01T00:00:00+00:00
to: 2010-12-31T00:00:00+00:00
query:
# keeping data of the actor Jane Jackman unanonymized
transform_condition: "NOT (first_name = 'Jane' AND last_name = 'Jackman')"
# not dumping the actor with actor_id = 132 (Adam Hopper)
dump_condition: "actor_id <> 132"
- name: address
rules:
address:
# using template
template:
# using transformed (anonymized) value of district
format: "{{ final.district }}, {{ _1 }}, {{ _2 }}"
rules:
# random street name
- street_name: {}
# random building number
- building_number: {}
Section | Mandatory | YAML type | Description |
---|---|---|---|
name |
yes | text | The table name in the database |
rules | yes | dictionary | Anonymization rules for this table (the column names are the dictionary keys) |
rule_order | no | list | An order of rule execution |
query | no | dictionary | Conditions for SQL queries for dumping data |
You can use table names with schema (e.g. public.users
) or without it (just users
). In the latter case, this means
that the rules will be applied to the users
table in any schema.
Anonymization rules (we call them transformers
) for the table columns.
Dictionary keys are the column names. Each value contains an anonymizing configuration for column (a name of transformer - an address, a company name, a person name, some template, etc, with its options).
Rule (transformer) | Description |
---|---|
email |
Emails with different options |
ip |
IP addresses. Supports IPv4 and IPv6 |
words |
Lorem words with different length |
first_name |
First name generator |
last_name |
Last name generator |
city |
City names generator |
phone |
Generate random phone with different format |
pipeline |
Use pipeline to generate more complicated values |
capitalize |
Like filter, it capitalizes input value |
template |
Template engine for generate random text with included rules |
digit |
Random digit (in range 0..9 ), localized |
random_num |
Random number with min and max options |
password |
Password with different length options (supports max and min options) |
datetime |
Make DateTime strings with options (from and to ) |
more than 70 rules in total... |
For the complete list of rules please refer this document.
Some transformer examples:
It gets a person first name.
Examples:
The default:
rules:
field_name:
first_name: {}
You can configure locale:
rules:
field_name:
first_name:
locale: RU
It gets a random phone number.
Examples:
The default:
rules:
field_name:
phone: {}
You can specify the phone format:
rules:
field_name:
phone:
format: "+7^#########"
where:
#
- any digit from 0 to 9^
- any digit from 1 to 9
Also, you can use any other symbols in format: ^##-00-### (##-##)
.
The default format is +###########
.
If you want to generate unique phone numbers for this database column, use the uniq
option:
rules:
field_name:
phone:
uniq: true
The transformer will collect information about generated numbers and check their uniqueness. If such a number already exists in the list, then the transformer will try to generate the value again. The number of attempts is limited by the number of available invariants based on the format.
Gets a random number.
Examples:
The default:
rules:
field_name:
random_num: {}
You can specify a range (one border or both):
rules:
field_name:
random_num:
min: 10
max: 20
The default range is from 0
to 2^64 - 1
(for 64-bit application binary).
If you want to generate unique numbers, use this option:
rules:
field_name:
random_num:
uniq: true
The transformer will collect information about generated numbers and check their uniqueness. If such a number already exists in the list, then the transformer will try to generate the value again. You can limit the number of tries (the default is 3):
rules:
field_name:
random_num:
uniq:
required: true
try_count: 5
This is the most sophisticated and flexible transformer.
It uses the Tera template engine (inspired by Jinja2).
Specification:
Section | Mandatory | YAML type | Description |
---|---|---|---|
format |
yes | text | The template for generated value |
rules |
no | list | Nested rules (transformers). You can use them in the template |
variables |
no | dictionary | Template variables |
Examples:
rules:
field_name:
template:
format: "Hello, {{name}}! {{_1}}:{{_0 | upper}}"
rules:
- email: {}
variables:
name: Alex
where:
_0
- transformed value (original);_1
,_2
, ..._N
- nested rules by index (started from 1). You can use any transformer (including templates);name
- the named variable from thevariables
section.
It will generate something like Hello, Alex! [email protected]:ORIGINALVALUE
.
You can use any filter or markup from the Tera template engine.
Also, you can use the global variables in templates.
You can reference values of other row fields in templates.
Use the prev
special variable for original values and the final
special variable - for anonymized:
tables:
- name: some_table
# You must specify the order of rule execution when using `final`
rule_order:
- greeting
- options
rules:
first_name:
first_name: {}
greeting:
template:
# Keeping the first name, but anonymizing the last name
format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
options:
template:
# Using the anonymized value again
format: "{greeting: \"{{ final.greeting }}\"}"
You must specify the order of rule execution when using final
with rule_order.
All rules not listed will be placed at the beginning (i.e., you must list only rules with final
).
A list of columns that will be processed in the specified order (after all columns that are not in the list). The order of execution for other columns is not guaranteed.
Look at this table configuration example:
name: customer
rules:
active:
# using anonymized `activebool` value
template:
format: "{% if final.activebool == 'TRUE' %}1{% else %}0{% endif %}"
activebool:
# the probability of `true` is 80%
boolean:
ratio: 80
create_date:
datetime:
from: 2000-01-01T00:00:00+00:00
to: 2020-12-31T00:00:00+00:00
email:
# using the original first name value in the anonymized email
# also using the anonymized value of `active`
template:
format: "{{ prev.first_name | lower }}-{{ final.active }}-{{ _1 }}"
rules:
# random email
- email: {}
last_name:
# using of original value (keep the first letter of the last name)
template:
format: "{{ _0 | truncate(length=1) }}"
rule_order:
- active
- email
The order of column processing will be as follows:
activebool
,create_date
,last_name
(the exact order is not guaranteed)active
email
You only need the rule_order
section when using the template
transformer with the final
special template variable.
For additional information please refer to the template transformer documentation.
Section | Mandatory | YAML type | Description |
---|---|---|---|
dump_condition |
no | text | SQL WHERE statement for dumped data |
limit |
no | integer | SQL LIMIT for dumped data |
transform_condition |
no | text | SQL WHERE statement for anonymizing data |
You can specify conditions (SQL WHERE
statement) and limit for dumped data from the table:
# config.yml
tables:
- name: people
query:
# don't dump some rows
dump_condition: "last_name <> 'Sensitive'"
# select maximum 100 rows
limit: 100
As the additional option, you can specify SQL conditions that define which rows will be transformed (anonymized):
# config.yml
tables:
- name: people
query:
# don't dump some rows
dump_condition: "last_name <> 'Sensitive'"
# preserve original values for some rows
transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"
# select maximum 100 rows
limit: 100
You can use the dump_condition
, transform_condition
and limit
options in any combination (only
transform_condition
; transform_condition
and limit
; etc).
If you don't need data from a particular table at all, please refer to the filter section.
A list of tables that will be dumped in the specified order (after all tables that are not in the list). The order of execution for other tables depends on foreign keys.
Look at this configuration example:
tables:
- name: "table1"
rules: {}
- name: "table2"
rules: {}
- name: "table3"
rules: {}
table_order:
- "table1"
- "table2"
The order of table dumping will be as follows:
table3
table1
table2
You may need this section when using the built-in key-value store in the template
transformer for sharing data between
tables.
For additional information please refer to the template transformer documentation.
Section | Mandatory | YAML type | Description |
---|---|---|---|
locale |
no | text | The default locale for transformers |
Supported locales are EN
(the default one), ZH_TW
(traditional chinese) and RU
(translation in progress).
We plan to support more locales in the future.
You can override the locale for each transformer (rule) in its options. Some transformers are not affected by locale.
Example:
default:
locale: RU
You can specify which tables you choose (whitelisting) or ignore (blacklisting) to dump.
You must use the full table names here (with schema).
You can use wildcards:
?
matches exactly one occurrence of any character;*
matches arbitrary many (including zero) occurrences of any character.
For dumping only public.markets
and public.users
data:
filter:
only:
- public.markets
- public.users
For ignoring these tables and dump data from others:
filter:
except:
- public.markets
- public.users
You can also specify data and schema filters separately.
This is equivalent to the previous example:
filter:
data:
except:
- public.markets
- public.users
For skipping schema and data from other tables:
filter:
schema:
only:
- public.markets
- public.users
For skipping schema for markets
table and dumping data only from users
table:
filter:
data:
only:
- public.users
schema:
except:
- public.markets
For skipping schema and data from all tables in the schema other
(you should use the quotes):
filter:
schema:
except:
- "other.*"
For dumping data only from public.table1
, public.table2
, public.table3
, etc:
filter:
- "public.table?"
If you need only a subset of the data, please refer to the query section.
You can specify some templates in config to reuse them in you template rules. There are different kinds of templates:
raw
templates is named templates which may be imported or included by name into your field template, you can use macros to extend complex template.files
templates is array of paths to files with template context.
tables:
- name: some_page
rules:
some_column:
template:
format: >
{% import "base" as macros -%}
{{ macros::decrement(n=10) }}
templates:
raw:
base: >
{% macro decrement(n) -%}
{% if n > 1 %}{{ n }}-{{ self::decrement(n=n-1) }}{% else %}1{% endif -%}
{% endmacro decrement -%}"#;
files:
- ./templates/button.html
You can specify global variables available in all template rules.
tables:
- name: payment
rules:
amount:
# using the value from globals
template:
format: "{{ prev.amount | float * payment_k }}"
- name: staff
rules:
username:
template:
# using the value from globals
format: "{{ global_value }}.{{ _1 }}"
rules:
# random number
- random_num:
min: 100
max: 999
# global variables (they are available in templates)
globals:
global_value: "gv123"
payment_k: 1.73