Data anonymization with Postgres
In order to be able to always have the most recent and meaningful metrics at hand with these reports, we use a copy of our production Postgres data cloned daily to a dedicated instance, but at the same time we cannot expose personal information concerning our customers and the attendees to their events to someone outside the company, due to privacy concerns.
So I've been looking at ways to easily anonymize data with Postgres and I came across this handy extension that seemed perfect for our purpose, so I wanted to share some quick introduction on how to use it.
We use Google Cloud SQL for our Postgres instances in the various environments, but Cloud SQL doesn't allow us to use third party extensions so I had to set up a dedicated compute instance with a custom installation of Postgres on it.
Creating a read only user in the production database
\c <production database name>
to connect to it, then run
CREATE ROLE "bi-user"; ALTER ROLE "bi-user" WITH PASSWORD '<read-only user password>'; ALTER ROLE "bi-user" WITH LOGIN; GRANT CONNECT ON DATABASE <production database-name> TO "bi-user"; GRANT USAGE ON SCHEMA public TO "bi-user"; GRANT SELECT ON ALL TABLES IN SCHEMA public TO "bi-user"; GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO "bi-user"; ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO "bi-user"; ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE, SELECT ON SEQUENCES TO "bi-user";
These statements will give the read only user sufficient permissions to read the production data.
Setting up Postgres
Installing Postgres on a Debian based distro is as easy as adding the APT repo and installing the relevant package:
wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O- | sudo apt-key add - echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt/ focal-pgdg main" | sudo tee /etc/apt/sources.list.d/postgresql.list apt update apt install -y postgresql-11
Verify that Postgres is running:
systemctl status postgresql
Updating the configuration for faster restores
In my case I'm using a medium sized instance and I got good results with these settings:
work_mem = 32MB shared_buffers = 4GB maintenance_work_mem = 2GB full_page_writes = off autovacuum = off wal_buffers = -1
To change these settings you need to edit the file /etc/postgresql/11/main/postgresql.conf or equivalent for the version of Postgres you have installed.
While you are editing this file you may also want to have Postgres allow connections to 0.0.0.0 so that it can be reached from the outside easily:
listen_addresses = '*'
Save and restart Postgres:
systemctl restart postgresql
Also, if you want to access the instance from the outside it's convenient to enable regular password authentication. For this edit the file /etc/postgresql/11/main/pg_hba.conf or equivalent for your version and add the following line:
host all all 0.0.0.0/0 md5
and change
local all postgres peer
to
local all postgres md5
This will ensure we can connect to PG with a regular password also with a remote client.
To set a password for the root user:
sudo su - postgres psql
and run
\password
Creating a dump of the production data
We need to restore a dump of the production data to our instance. First, let's create the dump of the production data:
PGPASSWORD=<production password> pg_dump --clean -U bi-user -h <production IP> 8 -Fd -f <destination directory> <database name>
This way we can create a dump in directory format which is a lot faster to create and restore.
Restoring the dump to the target instance
Next, we can restore the dump:
PGPASSWORD=<replica password> pg_restore -U postgres -h 127.0.0.1 -j 8 --format=d -C -d <database name> <dump directory>
replica password is the password of the target/local instance, and dump directory is of course the directory that we used to create the dump previously.
Preparing the data anonymizing extension
First, let's install the anonymizer extension:
apt install build-essential postgresql-server-dev-11 pgxnclient -y pgxn install postgresql_anonymizer
Dynamic masking
export PGPASSWORD=<replica password> export DATABASE=<target database> psql -U postgres -h 127.0.0.1 -d $DATABASE -c "ALTER DATABASE <target database> SET session_preload_libraries = 'anon';"/ psql -U postgres -h 127.0.0.1 -d $DATABASE -c "CREATE EXTENSION anon CASCADE;"/ psql -U postgres -h 127.0.0.1 -d $DATABASE -c "ALTER DATABASE <target database> SET anon.restrict_to_trusted_schemas = on;"/ psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SELECT anon.init();"/
Done this, enabling the dynamic masking is as simple as running:
psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SELECT anon.start_dynamic_masking();"
Then, you can ensure the data is masked for a specific user:
psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SECURITY LABEL FOR anon ON ROLE ""bi-user"" IS 'MASKED';"
NOTE: if you later need to disable the masking for the user, you can run the following command:
psql -U postgres -h 127.0.0.1 -d $DATABASE -c "SECURITY LABEL FOR anon ON ROLE ""bi-user"" IS NULL;"/
Finally, you need to configure which columns of which tables you want to mask/anonymize:
SECURITY LABEL FOR anon ON COLUMN users.email IS 'MASKED WITH VALUE ''CONFIDENTIAL''';
I'm using a static string here ("CONFIDENTIAL") but you can also use some functions for example to replace the actual data with fake but realistic data. Please refer to the link I shared at the top of this post for more details.
Hereinafter, when the "bi-user" read the anonymized data, the columns above will be masked and the value "CONFIDENTIAL" will be returned instead of the actual values.
Static masking
The preparation of the anonymization for the target database remains the same, but now we need to create that anonymized dump from the production database:
/usr/lib/postgresql/11/bin/pg_dump_anon -U postgres -h 127.0.0.1 -f $ANONYMIZED_DUMP_PATH $PRODUCTION_DATABASE
As you can see, this time we are using a different version of pg_dump that comes with the extension and supports the anonymization. This will create a plain SQL file, unfortunately, since directory format exports are not supported, so the dump will be a bit slower.
The final step is restoring the dump with the masked data to a different database, which is what we'll use for reporting etc.
psql -U postgres -h 127.0.0.1 $ANONYMIZED_DATABASE -f $ANONYMIZED_DUMP_PATH
With this, the columns that you have labelled for masking will have anonymized values in the new database.
SSL connections
First, the root certificate:
openssl req -new -nodes -text -out root.csr \ -keyout root.key -subj "/CN=brella.io" chmod og-rwx root.key openssl x509 -req -in root.csr -text -days 3650 \ -extfile /usr/lib/ssl/openssl.cnf -extensions v3_ca \ -signkey root.key -out root.crt
Of course replace "brella" with the correct information.
Then, we'll create the server certificate:
openssl req -new -nodes -text -out server.csr \ -keyout server.key -subj "/CN=bi.brella.io" chmod og-rwx server.key openssl x509 -req -in server.csr -text -days 365 \ -CA root.crt -CAkey root.key -CAcreateserial \ -out server.crt cp root.crt server.crt server.key /var/lib/postgresql/11/main/ chown postgres /var/lib/postgresql/11/main/root.crt /var/lib/postgresql/11/main/server.crt /var/lib/postgresql/11/main/server.key
hostssl all all 0.0.0.0/0 md5 clientcert=1
It's recommended you also remove the following line to prevent unencrypted connections:
host all all 0.0.0.0/0 md5
Then edit /etc/postgresql/11/main/postgresql.conf and change it as follows:
ssl = on ssl_ca_file = 'root.crt' ssl_cert_file = 'server.crt' ssl_key_file = 'server.key'
And restart Postgres:
systemctl restart postgresql
We can now create a client certificate:
openssl req -new -nodes -text -out client.csr \ -keyout client.key -subj "/CN=client.brella.io" chmod og-rwx client.key openssl x509 -req -in client.csr -text -days 365 \ -CA root.crt -CAkey root.key -CAcreateserial \ -out client.crt
Hereinafter, a client will require the following information in order to connect securely to our instance:
- IP address: the external IP of the compute instance
- Database name: whatever you chose for the target database
- Username: bi-user or whichever user you created
- Password: the password for the read-only user
- From the certificates we created: root.crt, client.crt and client.key
Then, to connect to the instance using the information above:
PGSSLCERT=client.crt PGSSLKEY=client.key SSLMODE=verify-full SSLROOTCERT=root.crt psql -U bi-user -d $ANONYMIZED_DATABASE -h $IP
Note: with some clients you may need a client key in pk8 format. In this case you can convert the key with the following command:
openssl pkcs8 -topk8 -inform PEM -outform DER -in client.key -out client.pk8 -nocrypt
Wrapping up
Let me know in the comments if you run into issues or know of some other tool/method that is easier or faster etc.