[HN Gopher] PostgreSQL Anonymizer
___________________________________________________________________
PostgreSQL Anonymizer
Author : chynkm
Score : 209 points
Date : 2025-01-14 07:36 UTC (3 days ago)
(HTM) web link (postgresql-anonymizer.readthedocs.io)
(TXT) w3m dump (postgresql-anonymizer.readthedocs.io)
| sgt wrote:
| Just be careful that you don't anonymize your production data.
| go_prodev wrote:
| With great power comes great responsibility.
| lovasoa wrote:
| The principle of the software seems to be that the original
| data is never altered. It is a postgres extension that "masks"
| the data for certain postgres users. You can always connect as
| the root user and see everything when you need to.
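|
| Roughly, going by the docs (the exact init call differs between
| versions, and the table, column, and role names below are made
| up), the dynamic-masking setup looks something like:
|
|     CREATE EXTENSION anon CASCADE;
|     SELECT anon.init();    -- older releases use anon.load()
|     SECURITY LABEL FOR anon ON COLUMN people.lastname
|         IS 'MASKED WITH FUNCTION anon.fake_last_name()';
|     SECURITY LABEL FOR anon ON ROLE support IS 'MASKED';
|     SELECT anon.start_dynamic_masking();
|     -- the "support" role now sees fake last names; other roles
|     -- still see the real data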
| heeton wrote:
| It allows updating the original data - https://postgresql-
| anonymizer.readthedocs.io/en/stable/stati...
|
| > These methods will destroy the original data. Use with
| care.
| sgt wrote:
| So basically running SELECT anon.anonymize_database(); will
| do it.
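|
| E.g. with a rule declared first (column name made up), and note
| that this rewrites the real rows in place:
|
|     SECURITY LABEL FOR anon ON COLUMN customer.email
|         IS 'MASKED WITH FUNCTION anon.fake_email()';
|     SELECT anon.anonymize_database();  -- destructive, as the docs warn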
| antman wrote:
| Masking is for viewing by specific users
| dandiep wrote:
| This is a fantastic idea. Now how to get it on RDS...
| zdc1 wrote:
| Assuming it's for a support team or internal users with a
| lower SLA, I wonder if it's possible to have a small self-
| hosted PostgreSQL server that basically acts as a shim,
| holding a foreign-data wrapper connection to the actual RDS
| instance.
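|
| A rough sketch of that shim, with every name below made up and
| untested against RDS (whether the anon masking functions behave
| well on top of foreign tables is exactly the open question):
|
|     -- on the small self-hosted "shim" instance
|     CREATE EXTENSION postgres_fdw;
|     CREATE SERVER rds_prod FOREIGN DATA WRAPPER postgres_fdw
|         OPTIONS (host 'mydb.example.rds.amazonaws.com', dbname 'app');
|     CREATE USER MAPPING FOR support SERVER rds_prod
|         OPTIONS (user 'readonly', password 'secret');
|     CREATE SCHEMA shadow;
|     IMPORT FOREIGN SCHEMA public FROM SERVER rds_prod INTO shadow;
|     -- then install the anon extension locally and declare masking
|     -- rules / masked roles on top of the foreign tables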
| willgdjones wrote:
| +1 for RDS support. I have wanted to use this for a while in
| our production systems.
| hylaride wrote:
| It can be done manually:
|
| https://dba.stackexchange.com/questions/306661/how-to-
| instal...
| symfoniq wrote:
| Same. Lack of RDS support is the only reason we aren't using
| this.
| hylaride wrote:
| It can be done manually:
|
| https://dba.stackexchange.com/questions/306661/how-to-
| instal...
| edrenova wrote:
| Just to jump in here: Neosync supports RDS and more, and you
| can self-host it.
|
| https://github.com/nucleuscloud/neosync
|
| (I'm one of the co-founders)
| BoorishBears wrote:
| I tried to figure out how/if this does what I need and your
| README had no examples. I clicked a couple of levels deep,
| found no obvious demonstrations and left.
|
| I checked the homepage but I do not watch Loom-style demos
| personally, _definitely_ not 5 minute ones, and so I left.
|
| -
|
| When I click on OP's link, or just search for it on Google,
| it takes less than a full page for the extension to show me
| an extremely straightforward demonstration of its value.
| You should have something like that.
|
| A simple example of what queries will look like, what setup
| will look like, all concisely communicated, no 5 minute
| lectures involved.
| hylaride wrote:
| It can be done manually:
|
| https://dba.stackexchange.com/questions/306661/how-to-instal...
| debarshri wrote:
| In RDS, if you cannot use this, you can create a masked view
| and use query rewriting to make it work.
|
| In my experience PG anonymizer has performance issues when it
| comes to large queries.
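|
| To make the masked-view idea concrete (schema, table, and role
| names here are made up; nothing beyond core Postgres is needed):
|
|     CREATE SCHEMA masked;
|     CREATE VIEW masked.users AS
|     SELECT id,
|            md5(email)                       AS email,  -- pseudonymized
|            left(phone, 3) || repeat('x', 7) AS phone,
|            created_at
|     FROM   public.users;
|     GRANT USAGE  ON SCHEMA masked TO support_role;
|     GRANT SELECT ON masked.users  TO support_role;
|     -- "query rewrite" can be as simple as a search_path override:
|     ALTER ROLE support_role SET search_path = masked, public;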
| riffraff wrote:
| this seems great. I wonder tho, how do you ensure new columns are
| masked by default? It seems a safer alternative would be to start
| with all columns being statically masked and only unveil them
| selectively.
|
| I guess you can add some CI steps when modifying the db to ensure
| a given column is allowed or masked, but still, it would be nice if
| this was defaulted the other way around.
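|
| Something like this as a CI step is what I have in mind: flag any
| column of the tables you consider sensitive that has no rule
| declared with the 'anon' label provider (the table list below is
| made up):
|
|     SELECT c.relname, a.attname
|     FROM   pg_class c
|     JOIN   pg_attribute a
|            ON a.attrelid = c.oid AND a.attnum > 0 AND NOT a.attisdropped
|     LEFT JOIN pg_seclabel s
|            ON  s.objoid   = c.oid
|            AND s.classoid = 'pg_class'::regclass
|            AND s.objsubid = a.attnum
|            AND s.provider = 'anon'
|     WHERE  c.relname IN ('users', 'payments')  -- your sensitive tables
|     AND    s.label IS NULL;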
| phoronixrly wrote:
| I have some experience with the 'Masking Views' functionality. If
| you are going to rely on it, specifically in a Rails app, know
| that it works against the framework's conventions and thus is generally
| inconvenient. This may be the same with any other framework that
| features DB schema migrations.
|
| More specifically, the integration of this functionality at a
| fortunately ex-employer was purposefully kept away from the dev
| team (no motivation was offered, though I suspect some sort of
| segmentation was sought after) and thus did not take into
| account that tables with PII did in fact still need their schema
| changed from time to time.
|
| This led to the anonymizer extension, together with the
| confidential views, being installed only on production DB
| instances, with dev, test, and staging instances running vanilla
| postgres. As a result, the only chance to catch DB migration
| issues related to the confidential views was the release itself.
| That caused numerous failed releases which required the ops team
| to intervene, manually remove the views for the duration of the
| release, then manually re-create them.
|
| So,
|
| If you plan to use this extension, and specifically its views,
| make sure you have it set up in exactly the same way on all
| environments. Also make sure that its initialisation and view
| creation are part of your framework's DB migrations so that they
| are documented and easy to precisely reproduce on new
| environments.
| gkbrk wrote:
| ClickHouse has something similar called clickhouse-obfuscator
| [1]. It even works offline with data dumps so you can quickly
| prepare and send somewhat realistic example data to others.
|
| According to its --help output, it is designed to retain the
| following properties of data:
|
| - cardinalities of values (number of distinct values) for every
| column and for every tuple of columns;
|
| - conditional cardinalities: number of distinct values of one
| column under condition on value of another column;
|
| - probability distributions of absolute value of integers; sign
| of signed integers; exponent and sign for floats;
|
| - probability distributions of length of strings;
|
| - probability of zero values of numbers; empty strings and
| arrays, NULLs;
|
| - data compression ratio when compressed with LZ77 and entropy
| family of codecs;
|
| - continuity (magnitude of difference) of time values across
| table; continuity of floating point values.
|
| - date component of DateTime values;
|
| - UTF-8 validity of string values;
|
| - string values continue to look somewhat natural
|
| [1]:
| https://clickhouse.com/docs/en/operations/utilities/clickhou...
| bux93 wrote:
| The Dutch national statistics office has tools, called
| mu-argus[2] and tau-argus, intended to de-identify 'microdata'
| such that k-anonymity[1] is achieved.
|
| [1] A release of data is said to have the k-anonymity property
| if the information for each person contained in the release
| cannot be distinguished from at least k-1 individuals whose
| information also appear in the release.
| https://en.wikipedia.org/wiki/K-anonymity
| [2] https://research.cbs.nl/casc/mu.htm
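|
| The k-anonymity check itself is easy to express in plain SQL;
| with made-up quasi-identifier columns and k = 5, any group this
| returns is too small:
|
|     SELECT zip_code, birth_year, gender, count(*) AS group_size
|     FROM   released_data
|     GROUP  BY zip_code, birth_year, gender
|     HAVING count(*) < 5;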
| aeontech wrote:
| This is really cool, and deserves a submission of its own,
| I'd say!
| JosephRedfern wrote:
| There's a write up from Alexey of different approaches
| considered for clickhouse-obfuscator here:
| https://clickhouse.com/blog/five-methods-of-database-
| obfusca....
|
| The summary is pretty funny:
|
| > _" After trying four methods, I got so tired of this problem
| that it was time just to choose something, make it into a
| usable tool, and announce the solution"_
| pgryko wrote:
| Are these tools able to automatically identify PII or
| do you have to specify columns and data types manually? What
| happens if you have PII data in a string field? Do you just rely
| on something like spacy to identify the PII data?
| graindcafe wrote:
| From the website:
|
| > The project has a declarative approach of anonymization.
|
| > Finally, the extension offers a panel of detection functions
| that will try to guess which columns need to be anonymized.
|
| https://postgresql-anonymizer.readthedocs.io/en/stable/detec...
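|
| I haven't tried it myself, but from that page the detection pass
| appears to boil down to a single call along these lines (treat
| the exact function name and arguments as my assumption):
|
|     -- scan column names against built-in dictionaries and report
|     -- candidate identifier columns
|     SELECT * FROM anon.detect('en_US');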
| foreigner wrote:
| TIL that PostgreSQL has SECURITY LABEL! It seems like this could
| be useful for storing all sorts of metadata about database
| objects, not just security stuff. E.g. like the COMMENT but not
| global. From reading the docs it looks like you need a "label
| provider" to get it to work though. I can only seem to find a few
| label providers around, does anyone know of one that isn't
| security/anonymization related and could be used more
| generically?
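|
| For reference, the command itself is generic: the label is just a
| string that the active provider validates, and it ends up in the
| pg_seclabels catalog view. E.g. with the anon provider (column
| name made up):
|
|     SECURITY LABEL FOR anon ON COLUMN people.email
|         IS 'MASKED WITH FUNCTION anon.fake_email()';
|     SELECT objtype, objname, provider, label FROM pg_seclabels;
|
| Without a loaded provider willing to accept the label, the
| command errors out, which is what makes generic use awkward.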
| nickzelei wrote:
| This can work pretty well if you want to either mask the data in
| prod or update it in place.
|
| A good use case that comes to mind is using prod data in a Retool
| app or something for your internal team but you want to mask out
| certain bits.
|
| I've been building Neosync [1] to handle more advanced use cases
| where you want to anonymize data for lower level environments.
| This is more useful for stage or dev data. Then prod stays
| completely unexposed to anyone.
|
| It also has a transactional anonymization api too.
|
| [1]: https://github.com/nucleuscloud/neosync
| sam0x17 wrote:
| I was actually tasked with building essentially this same thing
| back in 2014 when I was a junior dev for a fintech startup. They
| needed an anonymized version of prod database suitable for
| support team to pull up when trying to reproduce bugs. Did this
| gigantic thing that would stream the db dump into a C++ app and
| anonymize it on the fly. Took a similar approach to their masking
| they do here. Fun project. Company should have productized it.
| nickzelei wrote:
| This is the exact use case that we are building for with Neosync
| (https://github.com/nucleuscloud/neosync)
| riskable wrote:
| One of the best ways to handle this sort of thing is to put
| things like PII in a separate database entirely and replace it
| with a token in the "main" database. When something like PII
| _actually_ needs to be retrieved you first retrieve the token and
| then search the other database for said token to get the real
| data.
|
| It certainly complicates things but it provides an additional
| security layer of separation between the PII and its related
| data. You can provide your end users access to a database without
| having to worry about them getting access to the "dangerous"
| data. If they do indeed need access to the data pointed to via
| the token they can request access to that related database.
|
| This method also provides better performance since you don't need
| to encrypt the entire database (which is often required when
| storing PII) and also don't need to add extra security context
| function calls to every database request.
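|
| As a sketch, with every table and column name invented for the
| example:
|
|     -- "main" database: business data references people only by token
|     CREATE TABLE orders (
|         id            bigserial PRIMARY KEY,
|         person_token  uuid    NOT NULL,  -- opaque handle, no PII here
|         total_cents   integer NOT NULL
|     );
|
|     -- separate, tightly locked-down database: the only place that
|     -- can resolve a token back into a person
|     CREATE TABLE pii_vault (
|         token      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
|         full_name  text NOT NULL,
|         email      text NOT NULL
|     );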
| VWWHFSfQ wrote:
| This is basically just a foreign database key which, in most
| cases, is not sufficient to satisfy industry and regulatory
| requirements for anonymization and storage of PII.
| ComputerGuru wrote:
| Eh. I get your point and truly appreciate structural safeguards
| as opposed to aspirational ones but this is really not as
| doable as you make it out to be, and doing it properly would be
| a full blown product in its own right. First, this only works
| if you have a very narrow interpretation of PII. Once you
| realize most of your non-int/uuid unique indexes (and all your
| join predicates) are probably PII in some way or the other, the
| scope of the problem greatly increases. How does your solution
| work when you need to group by PII, full text search by PII,
| filter by PII, etc?
| Cynddl wrote:
| I'm going to repeat myself as I do every time I encounter such
| tools. These tools DO NOT provide anonymization, and especially
| not at the level required by the EU's GDPR (where the notion of
| PII does not exist).
|
| As a computer scientist and academic researcher who has worked on
| this topic for more than a decade now (some of my work, if you are
| interested: [1, 2]), I can say that re-identification is often
| possible from just a few pieces of information. Masking or
| replacing a few values or columns will often not provide
| sufficient guarantees--especially when a lot of information is
| being released.
|
| What this tool does is called 'pseudonymization' and maybe, if
| one is not very careful with the terminology, 'de-identification'
| in some cases. With colleagues, I reviewed all the literature and
| industry practices a few months ago [3], and our conclusion was:
|
| > We find that, although no perfect solution exists, applying
| modern techniques while auditing their guarantees against attacks
| is the best approach to safely use and share data today.
|
| This is clearly not what this tool is doing.
|
| [1] https://www.nature.com/articles/s41467-019-10933-3
| [2] https://www.nature.com/articles/s41467-024-55296-6
| [3] https://www.science.org/doi/10.1126/sciadv.adn7053
| aquilaFiera wrote:
| Seems like if you're doing static masking and you mask enough
| data, this works just great. Am I missing something?
| joram87 wrote:
| I've been working on something similar, starting a company around
| the idea! We realized that a lot of people had concerns or
| challenges with installing an extension on their production
| database and also that they wanted non-technical folks in
| compliance or HR to be able to configure and maintain the rules
| for individual employees. pgAnonymizer is a database extension
| but we structured ours to be a proxy server that
| hides/anonymizes/filters the data. We made a web dashboard that
| simplifies the configuration process, and allows you to configure
| what to do if columns get added to the database (default mask or
| hide new columns). We're about to go GA, and if anyone has any
| feedback or wants a free beta trial, I'd love to chat.
| joram87 wrote:
| I probably should mention the name of the tool I'm working
| on... oops, it's VeilStream :)
| gmassman wrote:
| This is a very handy postgres extension! We've been using it at
| my job for a couple years now to generate test datasets for
| developers. We have a weekly job that restores a prod backup to a
| temporary DB, installs the `anon` extension, and runs pg_dump
| with the masking rules. Overall we've been very happy with this
| workflow since it gives us a very good idea of how new features
| will work with our production data. The masking rules do need
| maintenance as our DB schema changes, but that's par for the
| course with these kinds of dev tools.
|
| All that said, I wouldn't rely on this extension as a way to
| deliver anonymized data to downstream consumers outside of our
| software team. As others have pointed out, this is really more of
| a pseudonymization technique. It's great for removing phone
| numbers, emails, etc. from your data set, but it's not going to
| eradicate PII. Pretty much all anonymized records can be traced
| back to their source data through PKs or FKs.
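|
| For anyone wanting to copy the setup: one way to get the masked
| dump is to apply the rules statically in the temporary database
| and then take a plain pg_dump. The rule below is illustrative
| (our real list is longer), and the init call depends on the
| extension version:
|
|     CREATE EXTENSION anon CASCADE;
|     SELECT anon.init();
|     SECURITY LABEL FOR anon ON COLUMN users.first_name
|         IS 'MASKED WITH FUNCTION anon.fake_first_name()';
|     SELECT anon.anonymize_database();
|     -- a plain pg_dump of this database is what developers restore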
___________________________________________________________________
(page generated 2025-01-17 23:01 UTC)