Data anonymization with Postgres

At Brella we hired a consultant who takes care of keeping Business Intelligence reporting up to date using Google Data Studio.

In order to always have the most recent and meaningful metrics at hand in these reports, we use a copy of our production Postgres data cloned daily to a dedicated instance. At the same time, for privacy reasons we cannot expose personal information about our customers, or the attendees of their events, to someone outside the company.

So I've been looking for ways to easily anonymize data with Postgres and came across the postgresql_anonymizer extension, which seemed perfect for our purpose, so I wanted to share a quick introduction on how to use it.

We use Google Cloud SQL for our Postgres instances in the various environments, but Cloud SQL doesn't allow third party extensions, so I had to set up a dedicated compute instance with a custom installation of Postgres on it.


Creating a read-only user in the production database

To create an anonymized version of the data you first need to fetch the actual data from somewhere. It's always good to use a read-only user for this, just in case. In my case I called this user bi-user. You can create this kind of user/role as follows: open a psql session against the production instance and first run

\c <production database name> 

to connect to it, then run

CREATE ROLE "bi-user";
ALTER ROLE "bi-user" WITH PASSWORD '<read-only user password>';
ALTER ROLE "bi-user" WITH LOGIN;
GRANT CONNECT ON DATABASE <production database name> TO "bi-user";
GRANT USAGE ON SCHEMA public TO "bi-user";
GRANT SELECT ON ALL TABLES IN SCHEMA public TO "bi-user";
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO "bi-user";
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO "bi-user";
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE, SELECT ON SEQUENCES TO "bi-user";

These statements will give the read-only user sufficient permissions to read the production data.
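
If you want to double check the grants, a quick sanity test is to connect as bi-user and attempt a write, which should be rejected (the users table here is just an example, any table in your schema will do):

PGPASSWORD=<read-only user password> psql -U bi-user -h <production IP> -d <production database name> -c "DELETE FROM users;"

This should fail with a "permission denied for table users" error.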


Setting up Postgres

Next, we need to set up the Postgres instance that will serve the anonymized copy of the data. In our case we are unfortunately still on version 11 (we are planning to upgrade soon), so you may have to adapt the following instructions if you are targeting a different version of Postgres. However, keep in mind that if you are also using Google Data Studio, the latest version of Postgres currently supported by their integration is v12. I'll assume you already have a virtual server or some other kind of instance for this.

Installing Postgres on a Debian based distro is as easy as adding the APT repo and installing the relevant package (replace focal with your distribution's release codename if it differs):

wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O- | sudo apt-key add -
echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt/ focal-pgdg main" | sudo tee /etc/apt/sources.list.d/postgresql.list
sudo apt update
sudo apt install -y postgresql-11

Verify that Postgres is running:

systemctl status postgresql


Updating the configuration for faster restores

To speed up restoring a copy of the production data I've tweaked some settings. You may have to adjust these depending on the resources available on your instance.

In my case I'm using a medium-sized instance and I got good results with these settings:

work_mem = 32MB
shared_buffers = 4GB
maintenance_work_mem = 2GB
full_page_writes = off
autovacuum = off
wal_buffers = -1

To change these settings you need to edit the file /etc/postgresql/11/main/postgresql.conf or equivalent for the version of Postgres you have installed.

While you are editing this file you may also want to have Postgres listen on all addresses so that it can be reached from the outside easily:

listen_addresses = '*'

Save and restart Postgres:

systemctl restart postgresql

Also, if you want to access the instance from the outside it's convenient to enable regular password authentication. For this, edit the file /etc/postgresql/11/main/pg_hba.conf or equivalent for your version and add the following line:

host    all             all             0.0.0.0/0            md5

and change 

local   all             postgres                                peer

to 

local   all             postgres                                md5

This ensures we can connect to Postgres with a regular password, including from remote clients.

To set a password for the postgres superuser:

sudo su - postgres
psql

and run 

\password
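
Once the password is set, you can verify that remote password authentication works from another machine (assuming your firewall allows traffic to port 5432):

PGPASSWORD=<postgres password> psql -U postgres -h <instance IP> -d postgres -c "SELECT 1;"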


Creating a dump of the production data

For this next step you need to be able to access your production Postgres instance from the instance you are setting up. In my case I used Cloud SQL Proxy for this. 

We need to restore a dump of the production data to our instance. First, let's create the dump of the production data:

PGPASSWORD=<production password> pg_dump --clean -U bi-user -h <production IP> -j 8 -Fd -f <destination directory> <database name>

This creates a dump in the directory format, which supports parallel jobs (-j 8 above) and is therefore a lot faster to create and restore.


Restoring the dump to the target instance

In order to restore the dump of the production data to the target instance, first create the target database. This has to have the same name as the production database to restore a dump created in directory format.
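
For example, with psql against the local instance:

PGPASSWORD=<replica password> psql -U postgres -h 127.0.0.1 -c "CREATE DATABASE <database name>;"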

Next, we can restore the dump:

PGPASSWORD=<replica password> pg_restore -U postgres -h 127.0.0.1 -j 8 --format=d -C -d <database name> <dump directory>

Here, <replica password> is the password of the target/local instance, and <dump directory> is of course the directory we created the dump into previously.


Preparing the data anonymizing extension

The extension we are going to use supports two ways of "masking" the data. Dynamic masking anonymizes the data on the fly when it is read, without changing it on disk at all. However, this comes with a performance penalty, so depending on the amount of data and the kind of queries you run you may prefer static masking, which writes the masked data to disk.

First, let's install the anonymizer extension:

sudo apt install build-essential postgresql-server-dev-11 pgxnclient -y
sudo pgxn install postgresql_anonymizer

Dynamic masking

The first step is to enable the extension in the target database:

export PGPASSWORD=<replica password>
export DATABASE=<target database>

psql -U postgres -h 127.0.0.1 -d $DATABASE -c "ALTER DATABASE $DATABASE SET session_preload_libraries = 'anon';"
psql -U postgres -h 127.0.0.1 -d $DATABASE -c "CREATE EXTENSION anon CASCADE;"
psql -U postgres -h 127.0.0.1 -d $DATABASE -c "ALTER DATABASE $DATABASE SET anon.restrict_to_trusted_schemas = on;"
psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SELECT anon.init();"

Once this is done, enabling dynamic masking is as simple as running:

psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SELECT anon.start_dynamic_masking();"

Then, you can ensure the data is masked for a specific user:

psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SECURITY LABEL FOR anon ON ROLE \"bi-user\" IS 'MASKED';"

NOTE: if you later need to disable the masking for the user, you can run the following command:

psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SECURITY LABEL FOR anon ON ROLE \"bi-user\" IS NULL;"

Finally, you need to configure which columns of which tables you want to mask/anonymize:

SECURITY LABEL FOR anon ON COLUMN users.email IS 'MASKED WITH VALUE ''CONFIDENTIAL''';

I'm using a static string here ("CONFIDENTIAL"), but you can also use functions, for example to replace the actual data with fake but realistic data. Please refer to the postgresql_anonymizer documentation for more details.
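
For example, something along these lines (this assumes a first_name column exists; check the extension's documentation for the exact faking functions available in your version):

SECURITY LABEL FOR anon ON COLUMN users.first_name IS 'MASKED WITH FUNCTION anon.fake_first_name()';
SECURITY LABEL FOR anon ON COLUMN users.email IS 'MASKED WITH FUNCTION anon.fake_email()';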

From now on, when bi-user reads the data, the columns labelled above will be masked and the value "CONFIDENTIAL" will be returned instead of the actual values.
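
You can verify this quickly by running the same query as the superuser and as the masked user (this assumes a users table with an email column, as in the example above, and that the bi-user role also exists on this instance):

psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SELECT email FROM users LIMIT 3;"
PGPASSWORD=<read-only user password> psql -U bi-user -h 127.0.0.1 -d $DATABASE -c "SELECT email FROM users LIMIT 3;"

The first command returns the real addresses, while the second returns "CONFIDENTIAL" for every row.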


Static masking

In our case, for performance reasons I opted for static masking, which writes the anonymized data to disk. This requires some additional processing compared to dynamic masking, which may take a while depending on the size of the database.

The preparation of the anonymization for the target database remains the same, but this time we need to create an anonymized dump from the copy of the production data we restored earlier:

/usr/lib/postgresql/11/bin/pg_dump_anon -U postgres -h 127.0.0.1 -f $ANONYMIZED_DUMP_PATH $PRODUCTION_DATABASE

As you can see, this time we are using pg_dump_anon, a variant of pg_dump that comes with the extension and applies the masking rules while dumping. Unfortunately it can only produce a plain SQL file, since the directory format is not supported, so this dump will be a bit slower.

The final step is restoring the dump with the masked data to a different database, which is what we'll use for reporting etc. (create this database first if it doesn't already exist):

psql -U postgres -h 127.0.0.1 $ANONYMIZED_DATABASE -f $ANONYMIZED_DUMP_PATH

With this, the columns that you have labelled for masking will have anonymized values in the new database.
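
A quick way to double check that nothing sensitive slipped through is to query one of the masked columns in the new database (again assuming the users.email example from earlier):

psql -U postgres -h 127.0.0.1 -d $ANONYMIZED_DATABASE -c "SELECT email FROM users LIMIT 5;"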


SSL connections

You will probably want to ensure that connections to Postgres from the outside are encrypted, so it's handy to enable SSL connections; for this we'll need to create a couple of certificates.

First, the root certificate:

openssl req -new -nodes -text -out root.csr \
  -keyout root.key -subj "/CN=brella.io"
chmod og-rwx root.key

openssl x509 -req -in root.csr -text -days 3650 \
  -extfile /usr/lib/ssl/openssl.cnf -extensions v3_ca \
  -signkey root.key -out root.crt

Of course, replace "brella.io" with your own domain.

Then, we'll create the server certificate:

openssl req -new -nodes -text -out server.csr \
  -keyout server.key -subj "/CN=bi.brella.io"
chmod og-rwx server.key

openssl x509 -req -in server.csr -text -days 365 \
  -CA root.crt -CAkey root.key -CAcreateserial \
  -out server.crt

cp root.crt server.crt server.key /var/lib/postgresql/11/main/

chown postgres /var/lib/postgresql/11/main/root.crt /var/lib/postgresql/11/main/server.crt /var/lib/postgresql/11/main/server.key

Next, edit /etc/postgresql/11/main/pg_hba.conf and add:

hostssl all             all             0.0.0.0/0               md5 clientcert=1

It's recommended you also remove the following line to prevent unencrypted connections:

host    all             all             0.0.0.0/0               md5

Then edit /etc/postgresql/11/main/postgresql.conf and change it as follows:

ssl = on
ssl_ca_file = 'root.crt'
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'

And restart Postgres:

systemctl restart postgresql

We can now create a client certificate:

openssl req -new -nodes -text -out client.csr \
  -keyout client.key -subj "/CN=client.brella.io"
chmod og-rwx client.key

openssl x509 -req -in client.csr -text -days 365 \
  -CA root.crt -CAkey root.key -CAcreateserial \
  -out client.crt

A client will then require the following information in order to connect securely to our instance:

  • IP address: the external IP of the compute instance
  • Database name: whatever you chose for the target database
  • Username: bi-user or whichever user you created
  • Password: the password for the read-only user
  • From the certificates we created: root.crt, client.crt and client.key

Then, to connect to the instance using the information above:

PGSSLCERT=client.crt PGSSLKEY=client.key PGSSLMODE=verify-full PGSSLROOTCERT=root.crt psql -U bi-user -d $ANONYMIZED_DATABASE -h $IP
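
If you want to confirm that the session is actually encrypted, once connected you can check pg_stat_ssl for your own backend:

SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid();

The ssl column should be t.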

Note: with some clients you may need a client key in pk8 format. In this case you can convert the key with the following command:

openssl pkcs8 -topk8 -inform PEM -outform DER -in client.key -out client.pk8 -nocrypt


Wrapping up

As you can see, data anonymization with Postgres is pretty straightforward with this extension. You can easily automate the various steps with scripts; for example, I scheduled the whole process to run every morning at 4am, so we always have fairly fresh, anonymized data for reporting without exposing private information.
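
For reference, a simple cron entry on the instance could look like this (refresh_anonymized_db.sh is a hypothetical wrapper script around the dump, restore and masking steps described above):

# /etc/cron.d/refresh-anonymized-db: run the refresh as the postgres user every day at 4am
0 4 * * * postgres /usr/local/bin/refresh_anonymized_db.sh >> /var/log/refresh_anonymized_db.log 2>&1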

Let me know in the comments if you run into issues or know of some other tool/method that is easier or faster etc.


© Vito Botta