
Open-sourcing Datanymizer: in-flight template-driven data anonymization

Datanymizer is an open-source, GDPR-compliant, privacy-preserving data anonymization tool that is flexible about how the anonymization takes place.

January 2021 5 mins

Production systems often need to store sensitive data, including personally identifiable information (PII). Developers, meanwhile, often need their test systems to contain data that is as close to production data as is reasonably possible. Whilst this has always been best practice, legal data protection regimes such as HIPAA, HITECH, CPRA and GDPR mean it’s ever more important to ensure that any personal data remains only where it’s strictly needed, and is properly masked or anonymized when being transferred elsewhere.

There are a number of different ways to bridge this gap, such as designing a strict separation between database tables which hold PII and those which don’t, allowing the PII tables to be skipped on export and replaced with synthetic data on the development systems. This approach can certainly work, but it relies on the system adhering to this design pattern, and the synthetic data being kept closely enough in step with the production equivalents to not cause problems.

An alternative might be to generate a special kind of “cleansed” dump on the production system, with PII already masked or replaced with synthetic data, ready for developers to import, keeping the risk of any sensitive data ever leaving the production environment low.

This is the approach Datanymizer takes.

Fakers, anonymizers, and obfuscators: there are various free and open-source data anonymization tools that have been around for a long time and work pretty well, so why did we create a new one? Because we wanted one that supports globals, uniqueness constraints, inline rules, and other useful features.

We had some particular requirements we wanted our tool to meet. We didn't want the anonymizer to take a "raw" dump and mutate it; instead, we needed it to produce an already anonymized dump, so that recipients never have access to the real data. The configuration that determines how the real data is anonymized needed to be kept separate from that data.

We also wanted a tool that was flexible about how the anonymization itself takes place, ideally allowing the use of templates to populate field contents.

Enter Datanymizer: your flexible privacy-preserving friend

Datanymizer does all of these things: you define a configuration which specifies what to do (and what not to do), and it then dumps data directly from your database, applying the rules that you define. It even integrates the Tera templating engine, so complex values can be synthesized.

The output is an anonymized SQL dump, written either to a file or directly to standard output, ready to be imported into a database using your normal tools.

 

Getting started

There are several ways to install pg_datanymizer. Choose the option that is most convenient for you.

Pre-compiled binary:

# Linux / macOS / Windows (MINGW etc.). Installs into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s

# Or, a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s

# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sh -s -- -b /usr/local/bin v0.1.0

# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s

Homebrew / Linuxbrew:

# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer

# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer

Docker:

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer

The README contains an example configuration that you can use as a starting point. 
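
To give a rough feel for what such a configuration looks like, here is a minimal sketch; the users table and its columns are hypothetical, and the layout follows the simplified style used in the examples below, so check the README for the authoritative syntax.

# config.yml
tables:
  users:
    first_name:
      first_name: {}
    last_name:
      last_name: {}
    email:
      email: {}
    bio:
      template:
        format: "Anonymized bio"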

Now you can invoke Datanymizer to generate a cleansed dump of your data:

$ pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database

This creates a new dump file, /tmp/dump.sql, containing a native SQL dump of the PostgreSQL database. You can import the anonymized data from this dump into a new PostgreSQL database with the command:

$ psql -Upostgres -d new_database < /tmp/dump.sql
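
Since the dump can also be written to standard output, you can skip the intermediate file and pipe it straight into psql. The sketch below assumes that omitting the -f option sends the dump to stdout:

$ pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database | psql -U postgres -d new_database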

Table filters

You can specify filters to control which tables are included in a dump.

To dump only public.markets and public.users data:

# config.yml
#...
filter:
  only:
    - public.markets
    - public.users

To ignore those tables and dump data from all the others:

# config.yml
#...
filter:
  except:
    - public.markets
    - public.users

You can also specify data and schema filters separately.
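
As a sketch of what that might look like (the schema and data keys here are assumptions based on the project docs, so check the README for the exact spelling):

# config.yml
#...
filter:
  schema:
    only:
      - public.users
  data:
    except:
      - public.markets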

Global variables

You can specify global variables available from any template rule.

# config.yml
tables:
  users:
    bio:
      template:
        format: "User bio is {{var_a}}"
    age:
      template:
        format: "{{_0 * global_multiplicator}}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6

Built-in rules

Datanymizer includes built-in support (“rules”) for certain types of value, including a pipeline rule which allows multiple rules to be executed in sequence. Other rules include email, ip, words, first_name, last_name, city, phone, capitalize, template, digit, random_number, password, datetime and more.
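
As an illustration of the pipeline rule, the sketch below chains a template rule with capitalize for a single field; the pipes key and the exact chaining behaviour are assumptions based on the project README, so treat this as a sketch rather than a definitive reference.

# config.yml
tables:
  users:
    bio:
      pipeline:
        pipes:
          - template:
              format: "a short anonymized bio"
          - capitalize: ~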

Uniqueness constraints

Uniqueness is supported by the email, ip, phone, and random_number rules.

Uniqueness is ensured by keeping track of values that have been generated where uniqueness is required, and re-generating any which are duplicates of those in the list.

You can customize the number of attempts with try_count. This is an optional field; the default number of tries depends on the rule.
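
A sketch of what requesting uniqueness for an email field might look like; the uniq block with its required and try_count keys is an assumption based on the project docs, so check the README for the exact layout.

# config.yml
tables:
  users:
    email:
      email:
        uniq:
          required: true
          try_count: 5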

Future development

We plan to implement the following additional features soon:

  • Pre-filtering: for example, when you need to dump not all users but only those matching specific criteria (e.g., 100 users aged 27 or older, named Alexander), with support for arbitrary SQL queries for filtering.
  • Data generation: when you don’t need to anonymize existing data, but instead generate synthetic data based upon certain rules.

RDBMS support

Datanymizer currently supports PostgreSQL databases, although MySQL (and so also MariaDB) support is planned. Contributions are of course very welcome!

Client Review

We have created a tool that helps many projects get a database similar to their production database, while not violating the GDPR. It removes the intermediate stage of data duplication, reducing the likelihood of data leakage. We deliberately chose Rust as the development language, and it was a great choice! By combining the power of the language with template-engine capabilities that are not available in many similar projects, we managed to create an interesting and useful product.
Alexander Kirillov
CTO, Evrone.com