Configuration file specification

Datanymizer uses a configuration file (config.yml) to determine what data to dump and how to anonymize it.

A config example (for the Postgres demo database DVD Rental):

tables:
  - name: actor
    rules:
      first_name:
        # random name
        first_name: {}
      last_name:
        # random surname
        last_name: {}
      last_update:
        # random date
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
    query:
      # keeping data of the actor Jane Jackman unanonymized
      transform_condition: "NOT (first_name = 'Jane' AND last_name = 'Jackman')"
      # not dumping the actor with actor_id = 132 (Adam Hopper)
      dump_condition: "actor_id <> 132"

  - name: address
    rules:
      address:
        # using template
        template:
          # using transformed (anonymized) value of district
          format: "{{ final.district }}, {{ _1 }}, {{ _2 }}"
          rules:
            # random street name
            - street_name: {}
            # random building number
            - building_number: {}
      address2:
        # using template engine features (Tera is very similar to Jinja):
        # a condition and a built-in function;
        # this adds an address comment to roughly half of the rows
        # (the `end` bound of `get_random` is exclusive, so it returns 1 or 2 here)
        template:
          format: "{% if get_random(start=1, end=3) == 1 %}Comment: {{ _1 }}{% endif %}"
          rules:
            # lorem ipsum words (the number of words is 1-2)
            - words:
                min: 1
                max: 2
      district:
        template:
          format: "{{ _1 }}, {{ _2 }}"
          rules:
            # nested template
            - template:
                format: "{{ _2 }} ({{ _1 }})"
                rules:
                  # random country code
                  - country_code: {}
                  # random state abbreviation
                  - state_abbr: {}
            - template:
                format: "dst"
      phone:
        # random phone with some format
        phone:
          format: "7900#######"
          # phones will be unique
          uniq: true
      postal_code:
        # random postal code
        post_code: {}
    # you must specify the order of rule execution when using `final`
    rule_order:
      - address

  - name: city
    rules:
      city:
        city: {}

  - name: customer
    rules:
      active:
        # using anonymized `activebool` value
        template:
          format: "{% if final.activebool == 'TRUE' %}1{% else %}0{% endif %}"
      activebool:
        # the probability of `true` is 80%
        boolean:
          ratio: 80
      create_date:
        datetime:
          from: 2000-01-01T00:00:00+00:00
          to: 2020-12-31T00:00:00+00:00
      email:
        # using the original first name value in the anonymized email
        # also using the anonymized value of `active`
        template:
          format: "{{ prev.first_name | lower }}-{{ final.active }}-{{ _1 }}"
          rules:
            # random email
            - email: {}
      last_name:
        # using the original value (keep the first letter of the last name)
        template:
          format: "{{ _0 | truncate(length=1) }}"
    rule_order:
      - active
      - email

  - name: film
    rules:
      fulltext:
        # no transformation
        none: ~
      length:
        # random number
        random_num:
          min: 50
          max: 200
      rating:
        pipeline:
          # using pipelines
          pipes:
            - template:
                format: "r"
            - capitalize: ~

  - name: film_actor
    rules: {}
    query:
      # not dumping the actor with id = 132 (Adam Hopper)
      dump_condition: "actor_id <> 132"

  - name: payment
    rules:
      amount:
        # using the value from globals
        template:
          format: "{{ prev.amount | float * payment_k }}"

  - name: staff
    rules:
      email:
        email: {}
      username:
        template:
          # using the values from globals and template variables
          format: "{{ global_value }}.{{ template_var }}.{{ _1 }}"
          rules:
            # random number
            - random_num:
                min: 100
                max: 999
          variables:
            template_var: "tv456"
      password:
        # random hex token
        hex_token:
          len: 40

default:
  locale: EN

# some global variables (they are available in templates)
globals:
  global_value: "gv123"
  payment_k: 1.73

The config file contains the following sections:

Section	Mandatory	YAML type	Description
tables	yes	list	A list of anonymized tables
table_order	no	list	The order of table dumping
default	no	dictionary	Default values for different anonymization rules
filter	no	dictionary	A filter for table schemas and data (what to skip when dumping)
templates	no	dictionary	Reusable templates for template rules
globals	no	dictionary	Global values (they are available in anonymization templates)

tables

The tables section is a list of anonymized tables. This is the main element of the config.

Example (there are anonymization rules for two database tables: actor and address):

tables:
  - name: actor
    rules:
      first_name:
        # random name
        first_name: {}
      last_name:
        # random surname
        last_name: {}
      last_update:
        # random date
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
    query:
      # keeping data of the actor Jane Jackman unanonymized
      transform_condition: "NOT (first_name = 'Jane' AND last_name = 'Jackman')"
      # not dumping the actor with actor_id = 132 (Adam Hopper)
      dump_condition: "actor_id <> 132"

  - name: address
    rules:
      address:
        # using template
        template:
          # using transformed (anonymized) value of district
          format: "{{ final.district }}, {{ _1 }}, {{ _2 }}"
          rules:
            # random street name
            - street_name: {}
            # random building number
            - building_number: {}

table

Section	Mandatory	YAML type	Description
`name`	yes	text	The table name in the database
`rules`	yes	dictionary	Anonymization rules for this table (the column names are the dictionary keys)
`rule_order`	no	list	The order of rule execution
`query`	no	dictionary	Conditions for the SQL queries that dump the data

You can use table names with a schema (e.g. public.users) or without one (just users). In the latter case, the rules are applied to the users table in any schema.
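
As a sketch of the two naming styles, the fragment below configures one table with a schema-qualified name and one without. The orders table and its email column are hypothetical; only the name and rules keys and the transformers come from this specification.

```yaml
tables:
  # schema-qualified: matches only the users table in the public schema
  - name: public.users
    rules:
      first_name:
        first_name: {}
  # unqualified (hypothetical table): matches an orders table in any schema
  - name: orders
    rules:
      email:
        email: {}
```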

rules

Anonymization rules (we call them transformers) for the table columns.

Dictionary keys are the column names. Each value holds the anonymization configuration for that column: the name of a transformer (an address, a company name, a person name, a template, etc.) together with its options.

Rule (transformer)	Description
`email`	Emails with different options
`ip`	IP addresses (IPv4 and IPv6 are supported)
`words`	Lorem ipsum words (the number of words is configurable)
`first_name`	First name generator
`last_name`	Last name generator
`city`	City name generator
`phone`	Random phone numbers in a configurable `format`
`pipeline`	A pipeline of transformers for generating more complicated values
`capitalize`	Works like a filter: capitalizes the input value
`template`	Template engine for generating text, with support for nested rules
`digit`	Random digit (in range `0..9`), localized
`random_num`	Random number with `min` and `max` options
`password`	Password of configurable length (`min` and `max` options)
`datetime`	DateTime strings with `from` and `to` options

There are more than 70 rules in total. For the complete list of rules, please refer to this document.

Some transformer examples:

first_name

Generates a person's first name.

Examples:

The default:

rules:
  field_name:
    first_name: {}

You can configure locale:

rules:
  field_name:
    first_name:
      locale: RU

phone

Generates a random phone number.

Examples:

The default:

rules:
  field_name:
    phone: {}

You can specify the phone format:

rules:
  field_name:
    phone:
      format: "+7^#########"

where:

# - any digit from 0 to 9
^ - any digit from 1 to 9

You can also use any other symbols in the format, e.g. ^##-00-### (##-##).

The default format is +###########.

If you want to generate unique phone numbers for this database column, use the uniq option:

rules:
  field_name:
    phone:
      uniq: true

The transformer keeps track of the generated numbers and checks their uniqueness. If a number has already been generated, the transformer tries to generate the value again. The number of attempts is limited by the number of possible variants for the given format.
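
To illustrate how the format bounds the number of unique values, here is a sketch that combines format and uniq. The format string is made up for this example: it has one ^ position (9 choices) and three # positions (10 choices each), so at most 9 * 10 * 10 * 10 = 9000 distinct numbers exist. A column with more rows than that presumably cannot be filled with unique values.

```yaml
rules:
  field_name:
    phone:
      # "555-" is literal; ^ is a digit from 1 to 9, each # is a digit from 0 to 9
      format: "555-^###"
      # at most 9000 distinct values exist for this format
      uniq: true
```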

random_num

Generates a random number.

Examples:

The default:

rules:
  field_name:
    random_num: {}

You can specify a range (one border or both):

rules:
  field_name:
    random_num:
      min: 10
      max: 20

The default range is from 0 to 2^64 - 1 (for a 64-bit application binary).
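
Because either border can be omitted, a rule may set only one of them. A minimal sketch with just a lower border, assuming the missing max falls back to the default upper border:

```yaml
rules:
  field_name:
    random_num:
      # only the lower border is set; the upper border is assumed to be the default
      min: 1000
```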

If you want to generate unique numbers, use this option:

rules:
  field_name:
    random_num:
      uniq: true

The transformer will collect information about generated numbers and check their uniqueness. If such a number already exists in the list, then the transformer will try to generate the value again. You can limit the number of tries (the default is 3):

rules:
  field_name:
    random_num:
      uniq:
        required: true
        try_count: 5

template

This is the most sophisticated and flexible transformer.

It uses the Tera template engine (inspired by Jinja2).

Specification:

Section	Mandatory	YAML type	Description
`format`	yes	text	The template for generated value
`rules`	no	list	Nested rules (transformers). You can use them in the template
`variables`	no	dictionary	Template variables

Examples:

rules:
  field_name:
    template:
      format: "Hello, {{name}}! {{_1}}:{{_0 | upper}}"
      rules:
        - email: {}    
      variables:
        name: Alex

where:

_0 - the original value of the field being transformed;
_1, _2, ... _N - the results of the nested rules by index (starting from 1). You can use any transformer as a nested rule (including templates);
name - the named variable from the variables section.

It will generate something like Hello, Alex! some-fake-email@gmail.com:ORIGINALVALUE.

You can use any filter or markup from the Tera template engine.
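
For instance, Tera's built-in lower and replace filters can be chained to post-process a generated value. The sketch below is illustrative and is not taken from the Datanymizer documentation; it assumes the nested city rule may return names that contain spaces.

```yaml
rules:
  field_name:
    template:
      # lower-case the generated city name and replace spaces with underscores
      format: '{{ _1 | lower | replace(from=" ", to="_") }}'
      rules:
        - city: {}
```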

Also, you can use the global variables in templates.

You can reference the values of other fields of the same row in templates. Use the special variable prev for original values and the special variable final for anonymized ones:

tables:
  - name: some_table
    # You must specify the order of rule execution when using `final`
    rule_order:
      - greeting
      - options
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      greeting:
        template:
          # Keeping the first name, but anonymizing the last name   
          format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
      options:
        template:
          # Using the anonymized value again   
          format: "{greeting: \"{{ final.greeting }}\"}"

When using final, you must specify the order of rule execution with rule_order. All rules that are not listed are executed first, so you only need to list the rules that use final.

rule_order

A list of columns that will be processed in the specified order (after all columns that are not in the list). The order of execution for other columns is not guaranteed.

Look at this table configuration example:

name: customer
rules:
  active:
    # using anonymized `activebool` value
    template:
      format: "{% if final.activebool == 'TRUE' %}1{% else %}0{% endif %}"
  activebool:
    # the probability of `true` is 80%
    boolean:
      ratio: 80
  create_date:
    datetime:
      from: 2000-01-01T00:00:00+00:00
      to: 2020-12-31T00:00:00+00:00
  email:
    # using the original first name value in the anonymized email
    # also using the anonymized value of `active`
    template:
      format: "{{ prev.first_name | lower }}-{{ final.active }}-{{ _1 }}"
      rules:
        # random email
        - email: {}
  last_name:
    # using the original value (keep the first letter of the last name)
    template:
      format: "{{ _0 | truncate(length=1) }}"
rule_order:
  - active
  - email

The order of column processing will be as follows:

activebool, create_date, last_name (the exact order is not guaranteed)
active
email

You only need the rule_order section when using the template transformer with the final special template variable.

For additional information please refer to the template transformer documentation.

query

Section	Mandatory	YAML type	Description
`dump_condition`	no	text	SQL `WHERE` condition for dumped data
`limit`	no	integer	SQL `LIMIT` for dumped data
`transform_condition`	no	text	SQL `WHERE` condition that selects the rows to anonymize

You can specify a condition (an SQL WHERE clause) and a row limit for the data dumped from the table:

# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100

As an additional option, you can specify an SQL condition that defines which rows will be transformed (anonymized):

# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"      
      # select maximum 100 rows
      limit: 100

You can use the dump_condition, transform_condition and limit options in any combination (only transform_condition; transform_condition and limit; etc).
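
For example, a table section that sets only transform_condition and limit might look like the sketch below. The table and column names are placeholders, as in the examples above.

```yaml
# config.yml
tables:
  - name: people
    rules:
      first_name:
        first_name: {}
    query:
      # no dump_condition: every row is eligible for dumping,
      # but the row of John Doe keeps its original values
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"
      # select maximum 100 rows
      limit: 100
```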

If you don't need data from a particular table at all, please refer to the filter section.

table_order

A list of tables that will be dumped in the specified order (after all tables that are not in the list). The order of execution for other tables depends on foreign keys.

Look at this configuration example:

tables:
  - name: "table1"
    rules: {}
  - name: "table2"
    rules: {}
  - name: "table3"
    rules: {}    
table_order:
  - "table1"
  - "table2"

The order of table dumping will be as follows:

table3
table1
table2

You may need this section when using the built-in key-value store in the template transformer for sharing data between tables.

For additional information please refer to the template transformer documentation.

default

Section	Mandatory	YAML type	Description
`locale`	no	text	The default locale for transformers

Supported locales are EN (the default), ZH_TW (Traditional Chinese) and RU (translation in progress). We plan to support more locales in the future.

You can override the locale for each transformer (rule) in its options. Some transformers are not affected by locale.

Example:

default:
  locale: RU
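
A per-rule override can be sketched as follows: the default locale is RU, while one rule switches back to EN through its own locale option. The table name is hypothetical, and the sketch assumes that last_name is one of the locale-aware transformers.

```yaml
default:
  locale: RU

tables:
  - name: some_table
    rules:
      first_name:
        # overrides the default locale for this rule only
        first_name:
          locale: EN
      last_name:
        # no locale given, so the default locale (RU) is assumed to apply
        last_name: {}
```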

filter

You can specify which tables to dump (whitelisting) or which to ignore (blacklisting).

You must use the full table names here (with schema).

You can use wildcards:

? matches exactly one occurrence of any character;
* matches arbitrarily many (including zero) occurrences of any character.

Examples

For dumping only public.markets and public.users data:

filter:
  only:
    - public.markets
    - public.users

To ignore these tables and dump data from all others:

filter:
  except:
    - public.markets
    - public.users

You can also specify data and schema filters separately.

This is equivalent to the previous example:

filter:
  data:
    except:
      - public.markets
      - public.users

To skip both the schema and the data of all other tables:

filter:
  schema:
    only:
      - public.markets
      - public.users

To skip the schema of the markets table and dump data only from the users table:

filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets

To skip the schema and data of all tables in the schema other (you should use quotes):

filter:
  schema:
    except:
      - "other.*"

For dumping data only from public.table1, public.table2, public.table3, etc:

filter:
  only:
    - "public.table?"

If you need only a subset of the data, please refer to the query section.

templates

You can define templates in the config and reuse them in your template rules. There are two kinds of templates:

raw templates are named templates that can be imported or included by name in your field templates; you can use macros to build complex templates.
files templates are an array of paths to files containing templates.

tables:
  - name: some_page
    rules:
      some_column:
        template:
          format: >
            {% import "base" as macros -%}
            {{ macros::decrement(n=10) }}
templates:
  raw:
    base: >
      {% macro decrement(n) -%}
      {% if n > 1 %}{{ n }}-{{ self::decrement(n=n-1) }}{% else %}1{% endif -%}
      {% endmacro decrement -%}
  files:
    - ./templates/button.html

globals

You can specify global variables available in all template rules.

tables:
  - name: payment
    rules:
      amount:
        # using the value from globals
        template:
          format: "{{ prev.amount | float * payment_k }}"

  - name: staff
    rules:
      username:
        template:
          # using the value from globals
          format: "{{ global_value }}.{{ _1 }}"
          rules:
            # random number
            - random_num:
                min: 100
                max: 999

# global variables (they are available in templates)
globals:
  global_value: "gv123"
  payment_k: 1.73
