QUIPP-pipeline

Privacy-preserving synthetic data generation workflows

Collaboration and project management is in the QUIPP-collab repo.

Please do not open new issues in this repository; instead, open a new issue in QUIPP-collab and add the 'pipeline' label.

The QUiPP (Quantifying Utility and Preserving Privacy) project aims to produce a framework to facilitate the creation of synthetic population data where the privacy of individuals is quantified. In addition, QUiPP assesses the utility of the synthetic data in a variety of contexts: for example, does a model trained on the synthetic data generalize to the population as well as the same model trained on the confidential sample?

The proliferation of individual-level data sets has opened up new research opportunities. This individual information is tightly restricted in many contexts: in health and census records, for example. This creates difficulties in working openly and reproducibly, since full analyses cannot then be shared. Methods exist for creating synthetic populations that are representative of the existing relationships and attributes in the original data. However, understanding the utility of the synthetic data and simultaneously protecting individuals' privacy, such that these data can be released more openly, is challenging.

This repository contains a pipeline for synthetic population generation, using a variety of methods as implemented by several libraries. In addition, the pipeline emits measures of privacy and utility of the resulting data.

Docker

Note that a Docker image is provided with the dependencies pre-installed, as turinginst/quipp-env. More detail on setting this up can be found here.
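
For example, the image can be pulled and started from a shell as follows. The mount point and interactive shell shown here are illustrative assumptions; see the linked documentation for the recommended invocation:

# Pull the image with the pipeline dependencies pre-installed
docker pull turinginst/quipp-env

# Run it interactively, mounting a local clone of the repository
# (mount path and shell are assumptions)
docker run --rm -it -v "$(pwd)":/QUIPP-pipeline turinginst/quipp-env /bin/bash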

Local Installation

  • Clone the repository: git clone git@github.com:alan-turing-institute/QUIPP-pipeline.git

Dependencies

Installing the dependencies

R and Python dependencies

To install all of the dependencies, ensure you are using suitable versions of Python (>=3.8) and R (>=3.6), then run the following commands in a terminal from the root of this repository:

python -m pip install -r env-configuration/requirements.txt
R
> source("env-configuration/install.R")
> q()
Save workspace image? [y/n/c]: y
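
Alternatively, the R dependencies can be installed non-interactively from the shell. This is a minimal sketch using the same script as above; behaviour will depend on how install.R selects its package repositories:

# Non-interactive alternative for the R dependencies
Rscript env-configuration/install.R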

SGF

Another external dependency is the SGF implementation of plausible deniability:

  • Download SGF here
  • See the library's README file for instructions on compiling the code; a sketch of a typical build is shown after this list. You will need a recent version of cmake (tested with version 3.17), either installed through your system's package manager, or from here.
  • After compilation, the three executables of the SGF package (sgfinit, sgfgen and sgfextract) should have been built. Add their location to your PATH, or alternatively, set the environment variable SGFROOT to point to this location. That is, in bash,
    • either export PATH=$PATH:/path/to/sgf/bin,
    • or export SGFROOT=/path/to/sgf/bin
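
A typical out-of-source cmake build followed by setting SGFROOT might look like the sketch below. The build directory and output location are assumptions; follow the SGF README for the authoritative steps:

# Build SGF with cmake (illustrative; the SGF README is authoritative)
cd /path/to/sgf
mkdir build && cd build
cmake ..
make

# Point the pipeline at the built executables (output path is an assumption)
export SGFROOT=/path/to/sgf/build/bin

# Confirm that the executables are visible
test -x "$SGFROOT/sgfinit" && echo "sgfinit found"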

Top-level directory contents

The top-level directory structure mirrors the data pipeline.

  • doc: The QUiPP report - a high-level overview of the project, our work and the methods we have considered so far.

  • examples: Tutorial examples of using some of the methods (currently just CTGAN). These are independent of the pipeline.

  • binder: Configuration files to set up the pipeline using Binder

  • env-configuration: Set-up of the computational environment needed by the pipeline and its dependencies

  • generators: Quickly generating input data for the pipeline from a few tunable and well-understood models

  • datasets: Sample data that can be consumed by the pipeline.

  • datasets-raw: A few (public, open) datasets that we have used are reproduced here where licence and size permit. They are not necessarily of the correct format to be consumed by the pipeline.

  • synth-methods: One directory per library/tool, each of them implementing a complete synthesis method

  • utility-metrics: Scripts relating to computing the utility metrics

  • privacy-metrics: Scripts relating to computing the privacy metrics

  • run-inputs: Parameter json files (see below), one for each run

When the pipeline is run, additional directories are created:

  • generator-outputs: Sample generated input data (using generators)

  • synth-output: Contains the result of each run (as specified in run-inputs), which will typically consist of the synthetic data itself and a selection of utility and privacy scores

The pipeline

The following shows the full pipeline, as run on an input file called example.json. This input file has keywords dataset (the base of the filename to use for the original input data) and synth-method, which refers to one of the synthesis methods. As output, the pipeline produces:

  • synthetic data, in one or more files synthetic_data_1.csv, synthetic_data_2.csv, ...
  • the disclosure risk privacy score, in disclosure_risk.json
  • classification scores of utility, sklearn_classifiers.json

Flowchart of the pipeline

The files dataset.csv and dataset.json could be in a subdirectory of datasets, but this is not a requirement.

Running the pipeline

  1. Make a parameter json file, in run-inputs/, for each desired synthesis (see below for the structure of these files).

  2. Run make in the top-level QUIPP-pipeline directory to run all syntheses (one per parameter file); a command sketch is shown after this list. The output for run-inputs/example.json can be found in synth-output/example/. It will consist of:

    • one or more synthetic data sets, based on the original data (as specified in example.json), called synthetic_data_1.csv, synthetic_data_2.csv, ...
    • the file disclosure_risk.json, containing the disclosure risk scores
    • the file sklearn_classifiers.json, containing the classification scores
  3. make clean removes all synthetic output and generated data.
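
A concrete end-to-end sketch, assuming the dependencies above are installed and a parameter file run-inputs/example.json exists:

cd QUIPP-pipeline

# Run every enabled synthesis described by the files in run-inputs/
make

# Inspect the output of the run described by run-inputs/example.json
ls synth-output/example/
# synthetic_data_1.csv ... disclosure_risk.json sklearn_classifiers.json

# Remove all synthetic output and generated data
make clean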

Adding another synthesis method

  1. Make a subdirectory in synth-methods having the name of the new method.

  2. This directory should contain an executable file run that, when called as

    run $input_json $dataset_base $outfile_prefix

    runs the method with the input parameter json file on the dataset $dataset_base.{csv,json} (see the data format below), and puts its output files in the directory $outfile_prefix. A minimal sketch of such a script is shown after this list.

  3. In the parameter JSON file (a JSON file in run-inputs), the method can be used as the value of the "synth-method" name.
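
The sketch below shows one way such a run script could look. It delegates to a hypothetical mymethod.py, which is not part of the repository; the flags it accepts are likewise illustrative.

#!/usr/bin/env bash
# synth-methods/my-new-method/run
# Arguments supplied by the pipeline:
#   $1  parameter json file
#   $2  dataset prefix ($2.csv and $2.json must exist)
#   $3  directory in which to place the output files
set -euo pipefail

input_json="$1"
dataset_base="$2"
outfile_prefix="$3"

mkdir -p "$outfile_prefix"

# Delegate to a hypothetical synthesis script (not part of this repository)
python mymethod.py --params "$input_json" \
                   --data "$dataset_base.csv" \
                   --metadata "$dataset_base.json" \
                   --output-dir "$outfile_prefix"

Remember to make the script executable (chmod +x run) so that the pipeline can call it.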

Data file format

The input data should be present as two files with the same prefix: a csv file (with suffix .csv), which must contain column headings as well as the column data, and a json file (the "data json file") describing the types of the columns used for synthesis.

For example, see the Polish Social Diagnosis dataset. This contains the files

  • datasets/polish_data_2011/polish_data_2011.csv
  • datasets/polish_data_2011/polish_data_2011.json

and so has the prefix datasets/polish_data_2011/polish_data_2011 relative to the root of this repository.

The prefix of the data files (as an absolute path, or relative to the root of the repository) is given in the parameter json file (see the next section) as the top-level property dataset: there is no restriction on where these can be located, although a few examples can be found in datasets/.
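
As a quick check, the prefix-to-file mapping for the Polish Social Diagnosis dataset above looks like this:

ls datasets/polish_data_2011/polish_data_2011.*
# datasets/polish_data_2011/polish_data_2011.csv
# datasets/polish_data_2011/polish_data_2011.json

# The corresponding top-level entry in a parameter file would be:
#   "dataset": "datasets/polish_data_2011/polish_data_2011"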

Parameter file format

The pipeline takes a single json file, describing the data synthesis to perform, including any parameters the synthesis method might need, as well as any additional parameters for the privacy and utility methods. The json schema for this parameter file is here.

To be usable by the pipeline, the parameter input file must be located in the run-inputs directory.

Example

The following example is in run-inputs/synthpop-example-2.json.

{
    "enabled" : true,
    "dataset" : "generator-outputs/odi-nhs-ae/hospital_ae_data_deidentify",
    "synth-method" : "synthpop",
    "parameters":
    {
        "enabled" : true,
        "num_samples_to_fit": -1,
        "num_samples_to_synthesize": -1,
        "num_datasets_to_synthesize": 5,
        "random_state": 12345,
        "vars_sequence": [5, 3, 8, 1],
        "synthesis_methods": ["cart", "", "cart", "", "cart", "", "", "cart"],
        "proper": true,
        "tree_minbucket": 1,
        "smoothing": {}
    },
    "parameters_disclosure_risk":
    {
        "enabled": true,
        "num_samples_intruder": 100,
        "vars_intruder": ["Treatment", "Gender", "Age bracket"]
    },
    "parameters_sklearn_utility":
    {
        "enabled": true,
        "input_columns": ["Time in A&E (mins)"],
        "label_column": "Age bracket",
        "test_train_ratio": 0.2,
        "num_leaked_rows": 0
    }
}

Schema

The JSON schema for the parameter json file is here.

Description

The parameter JSON file must include the following names:

  • enabled (boolean): Run this example?
  • dataset (string): The prefix of the dataset (.csv and .json are appended to get the paths of the data files)
  • synth-method (string): The synthesis method used by the run. It must correspond to a subdirectory of synth-methods.
  • parameters (object): The parameters passed to the synthesis method. The contents of this object depend on the synth-method used and are documented separately for each method. The following names are common to all methods:
    • enabled (boolean): Perform the synthesis step?
    • num_samples_to_fit (integer): How many samples from the input dataset should be used as input to the synthesis procedure? To use all of the input records, pass a value of -1.
    • num_samples_to_synthesize (integer): How many synthetic samples should be produced as output? To produce the same number of output records as input records, pass a value of -1.
    • num_datasets_to_synthesize (integer): How many entire synthetic datasets should be produced?
    • random_seed (integer): the seed for the random number generator (most methods require a PRNG: the seed can be explicitly passed to aid with the testability and reproducibility of the synthetic output)
    • Additional options for CTGAN, SGF and synthpop
  • parameters_disclosure_risk (object): parameters needed to compute the disclosure risk privacy score
    • enabled (boolean): compute this score?
    • num_samples_intruder (integer): how many records corresponding to the original dataset exist in a dataset visible to an attacker.
    • vars_intruder (array):
      • items (string): names of the columns that are available in the attacker-visible dataset.
  • parameters_sklearn_utility (object): parameters needed to compute the classification utility scores with scikit learn:
    • enabled (boolean): compute this score?
    • input_columns (array):
      • items (string): names of the columns to use as the explanatory variables for the classification
    • label_column (string): the column to use for the category labels
    • test_train_ratio (number): fraction of records to use in the test set for the classification
    • num_leaked_rows (integer): the number of additional records from the original dataset with which to augment the synthetic data set before training the classifiers. This is primarily an option to enable testing of the utility metric (i.e. the more rows we leak, the better the utility should become). It should be set to 0 during normal synthesis tasks.