[HN Gopher] PostgreSQL Anonymizer
       ___________________________________________________________________
        
       PostgreSQL Anonymizer
        
       Author : chynkm
       Score  : 209 points
       Date   : 2025-01-14 07:36 UTC (3 days ago)
        
 (HTM) web link (postgresql-anonymizer.readthedocs.io)
 (TXT) w3m dump (postgresql-anonymizer.readthedocs.io)
        
       | sgt wrote:
       | Just be careful that you don't anonymize your production data.
        
         | go_prodev wrote:
         | With great power comes great responsibility.
        
         | lovasoa wrote:
         | The principle of the software seems to be that the original
         | data is never altered. It is a postgres extension that "masks"
         | the data for certain postgres users. You can always connect as
         | the root user and see everything when you need to.
        
           | heeton wrote:
           | It allows updating the original data - https://postgresql-
           | anonymizer.readthedocs.io/en/stable/stati...
           | 
           | > These methods will destroy the original data. Use with
           | care.
        
             | sgt wrote:
             | So basically running SELECT anon.anonymize_database(); will
             | do it.
        
         | antman wrote:
         | Masking is for view for specific users
        
       | dandiep wrote:
       | This is a fantastic idea. Now how to get it on RDS...
        
         | zdc1 wrote:
         | Assuming if it's for a support team or internal users with a
         | lower SLA, I wonder if it's possible to have a small self-
         | hosted PostgreSQL server that basically acts as a shim by
         | holding a foreign-data wrapper connection to the actual RDS
         | instance
        
         | willgdjones wrote:
         | +1 for RDS support. I have wanted to use this for a while in
         | our production systems. reply
        
           | hylaride wrote:
           | It can be done manually:
           | 
           | https://dba.stackexchange.com/questions/306661/how-to-
           | instal...
           | 
           | reply
        
         | symfoniq wrote:
         | Same. Lack of RDS support is the only reason we aren't using
         | this.
        
           | hylaride wrote:
           | It can be done manually:
           | 
           | https://dba.stackexchange.com/questions/306661/how-to-
           | instal...
           | 
           | reply
        
           | edrenova wrote:
           | Just to jump in here -> We support RDS + more and you can
           | self-host, Neosync.
           | 
           | https://github.com/nucleuscloud/neosync
           | 
           | (I'm one of the co-founders)
        
             | BoorishBears wrote:
             | I tried to figure out how/if this does what I need and your
             | README had no examples. I clicked a couple of level deep,
             | found no obvious demonstrations and left.
             | 
             | I checked the homepage but I do not watch Loom-style demos
             | personally, _definitely_ not 5 minute ones, and so I left.
             | 
             | -
             | 
             | When I click on OP's link, or just search for it on Google,
             | it takes less than a full page for the extension to show me
             | an extremely straightforward demonstration of its value.
             | You should have something like that.
             | 
             | A simple example of what queries will look like, what setup
             | will look like, all concisely communicated, no 5 minute
             | lectures involved.
        
         | hylaride wrote:
         | It can be done manually:
         | 
         | https://dba.stackexchange.com/questions/306661/how-to-instal...
        
         | debarshri wrote:
         | In RDS, if you cannot use this, you can create masked view and
         | use query rewrite to make it work.
         | 
         | In my experience PG anonymizer has performance issues when it
         | comes to large queries.
        
       | riffraff wrote:
       | this seems great. I wonder tho, how do you ensure new columns are
       | masked by default? It seems a safer alternative would be to start
       | with all columns being statically masked and only unveil them
       | selectively.
       | 
       | I guess you can add some CI steps when modifying the db to ensure
       | a give column is allowed or masked, but still, would be nice if
       | this was defaulted the other way around.
        
       | phoronixrly wrote:
       | I have some experience with the 'Masking Views' functionality. If
       | you are going to rely on it and specifically in a Rails app, know
       | that it is against conventions and thus is generally
       | inconvenient. This may be the same with any other framework that
       | features DB schema migrations.
       | 
       | More specifically the integration of this functionality at a
       | fortunately ex-employer was purposefully kept away from the dev
       | team (no motivation was offered, however I suspect that some sort
       | of segmentation was sought after) and thus did not take into
       | account that tables with PII did in fact still need their schema
       | changed from time to time.
       | 
       | This lead to the anonymizer extension, together with the
       | confidential views to only be installed on production DB
       | instances with dev, test, and staging instances running vanilla
       | postgres. With this, the possibility to catch DB migration issues
       | related to the confidential views was pushed out to the release
       | itself. This lead to numerous failed releases which involved
       | having the ops team intervene, manually remove the views for the
       | duration of the release, then manually re-create them.
       | 
       | So,
       | 
       | If you plan to use this extension, and specifically its views,
       | make sure you have it set up in the exactly same way on all
       | environments. Also make sure that its initialisation and view
       | creation are part of your framework's DB migrations so that they
       | are documented and easy to precisely reproduce on new
       | environments.
        
       | gkbrk wrote:
       | Clickhouse has something similar called clickhouse-obfuscator
       | [1]. It even works offline with data dumps so you can quickly
       | prepare and send somewhat realistic example data to others.
       | 
       | According to its --help output, it is designed to retain the
       | following properties of data:
       | 
       | - cardinalities of values (number of distinct values) for every
       | column and for every tuple of columns;
       | 
       | - conditional cardinalities: number of distinct values of one
       | column under condition on value of another column;
       | 
       | - probability distributions of absolute value of integers; sign
       | of signed integers; exponent and sign for floats;
       | 
       | - probability distributions of length of strings;
       | 
       | - probability of zero values of numbers; empty strings and
       | arrays, NULLs;
       | 
       | - data compression ratio when compressed with LZ77 and entropy
       | family of codecs;
       | 
       | - continuity (magnitude of difference) of time values across
       | table; continuity of floating point values.
       | 
       | - date component of DateTime values;
       | 
       | - UTF-8 validity of string values;
       | 
       | - string values continue to look somewhat natural
       | 
       | [1]:
       | https://clickhouse.com/docs/en/operations/utilities/clickhou...
        
         | bux93 wrote:
         | The Dutch national office of statistics has tools intended to
         | de-identify 'microdata' such that k-anonimity[1] is achieved
         | called mu-argus[2] and tau-argus.
         | 
         | [1] A release of data is said to have the k-anonymity property
         | if the information for each person contained in the release
         | cannot be distinguished from at least k-1 individuals whose
         | information also appear in the release.
         | https://en.wikipedia.org/wiki/K-anonymity [2]
         | https://research.cbs.nl/casc/mu.htm
        
           | aeontech wrote:
           | This is really cool, and deserves a submission of its own,
           | I'd say!
        
         | JosephRedfern wrote:
         | There's a write up from Alexey of different approaches
         | considered for clickhouse-obfuscator here:
         | https://clickhouse.com/blog/five-methods-of-database-
         | obfusca....
         | 
         | The summary is pretty funny:
         | 
         | > _" After trying four methods, I got so tired of this problem
         | that it was time just to choose something, make it into a
         | usable tool, and announce the solution"_
        
       | pgryko wrote:
       | Are these tools able to automatically identify PII information or
       | do you have to specify columns and data types manually? What
       | happens if you have PII data in a string field? Do you just rely
       | on something like spacy to identify the PII data?
        
         | graindcafe wrote:
         | From the website:
         | 
         | > The project has a declarative approach of anonymization.
         | 
         | > Finally, the extension offers a panel of detection functions
         | that will try to guess which columns need to be anonymized.
         | 
         | https://postgresql-anonymizer.readthedocs.io/en/stable/detec...
        
       | foreigner wrote:
       | TIL that PostgreSQL has SECURITY LABEL! It seems like this could
       | be useful for storing all sorts of metadata about database
       | objects, not just security stuff. E.g. like the COMMENT but not
       | global. From reading the docs it looks like you need a "label
       | provider" to get it to work though. I can only seem to find a few
       | label providers around, does anyone know of one that isn't
       | security/anonymization related and could be used more
       | generically?
        
       | nickzelei wrote:
       | This can work pretty well if you want to either mask the data in
       | prod or update it in place.
       | 
       | A good use case that comes to mind is using prod data in a retool
       | app or something for your internal team but you want to mask out
       | certain bits.
       | 
       | I've been building Neosync [1] to handle more advanced use cases
       | where you want to anonymize data for lower level environments.
       | This is more useful for stage or dev data. Then prod stays
       | completely unexposed to anyone.
       | 
       | It also has a transactional anonymization api too.
       | 
       | [1]: https://github.com/nucleuscloud/neosync
        
       | sam0x17 wrote:
       | I was actually tasked with building essentially this same thing
       | back in 2014 when I was a junior dev for a fintech startup. They
       | needed an anonymized version of prod database suitable for
       | support team to pull up when trying to reproduce bugs. Did this
       | gigantic thing that would stream the db dump into a C++ app and
       | anonymize it on the fly. Took a similar approach to their masking
       | they do here. Fun project. Company should have productized it.
        
         | nickzelei wrote:
         | This is the exact usecase that we are building for with Neosync
         | (https://github.com/nucleuscloud/neosync)
        
       | riskable wrote:
       | One of the best ways to handle this sort of thing is to put
       | things like PII in a separate database entirely and replace it
       | with a token in the "main" database. When something like PII
       | _actually_ needs to be retrieved you first retrieve the token and
       | then search the other database for said token to get the real
       | data.
       | 
       | It certainly complicates things but it provides an additional
       | security layer of separation between the PII and it's related
       | data. You can provide your end users access to a database without
       | having to worry about them getting access to the "dangerous"
       | data. If they do indeed need access to the data pointed to via
       | the token they can request access to that related database.
       | 
       | This method also provides more performance since you don't need
       | to encrypt the entire database (which is often required when
       | storing PII) and also don't need to add extra security context
       | function calls to every database request.
        
         | VWWHFSfQ wrote:
         | The is basically just a foreign database key which, in most
         | cases, is not sufficient to satisfy industry and regulatory
         | requirements for anonymization and storage of PII.
        
         | ComputerGuru wrote:
         | Eh. I get your point and truly appreciate structural safeguards
         | as opposed to aspirational ones but this is really not as
         | doable as you make it out to be, and doing it properly would be
         | a full blown product in its own right. First, this only works
         | if you have a very narrow interpretation of PII. Once you
         | realize most of your non-int/uuid unique indexes (and all your
         | join predicates) are probably PII in some way or the other, the
         | scope of the problem greatly increases. How does your solution
         | work when you need to group by PII, full text search by PII,
         | filter by PII, etc?
        
       | Cynddl wrote:
       | I'm going to repeat myself as I do everytime I encounter such
       | tools. These tools DO NOT provide anonymization, and especially
       | not at the level required by the EU's GDPR (where the notion of
       | PII does not exist).
       | 
       | As a computer scientist and academic researcher having worked on
       | this topic for now more than a decade (some of my work if you are
       | interested: [1, 2]), re-identification is often possible from few
       | pieces of information. Masking or replacing a few values or
       | columns will often not provide sufficient guarantees--especially
       | when a lot of information is being released.
       | 
       | What this tool does is called 'pseudonymization' and maybe, if
       | not very carefully, 'de-identification' in some case. With
       | colleagues, reviewed all the literature and industry practices a
       | few months ago [3], and our conclusion was:
       | 
       | > We find that, although no perfect solution exists, applying
       | modern techniques while auditing their guarantees against attacks
       | is the best approach to safely use and share data today.
       | 
       | This is clearly not what this tool is doing.
       | 
       | [1] https://www.nature.com/articles/s41467-019-10933-3 [2]
       | https://www.nature.com/articles/s41467-024-55296-6 [3]
       | https://www.science.org/doi/10.1126/sciadv.adn7053
        
         | aquilaFiera wrote:
         | Seems like if you're doing static masking and you mask enough
         | data, this works just great. Am I missing something?
        
       | joram87 wrote:
       | I've been working on something similar, starting a company around
       | the idea! We realized that a lot of people had concerns or
       | challenges with installing an extension on their production
       | database and also that they wanted non-technical folks in
       | compliance or HR to be able to configure and maintain the rules
       | for individual employees. pgAnonymizer is a database extension
       | but we structured ours to be a proxy server that
       | hides/anonymizes/filters the data. We made a web dashboard that
       | simplifies the configuration process, and allows you to configure
       | what to do if columns get added to the database (default mask or
       | hide new columns). We're about to go GA and if anyone has any
       | feedback or wants to a free beta testing trial, I'd love to chat
        
         | joram87 wrote:
         | I probably should mention the name of the tool I'm working
         | on... oops, it's VeilStream :)
        
       | gmassman wrote:
       | This is a very handy postgres extension! We've been using it at
       | my job for a couple years now to generate test datasets for
       | developers. We have a weekly job that restores a prod backup to a
       | temporary DB, installs the `anon` extension, and runs pg_dump
       | with the masking rules. Overall we've been very happy with this
       | workflow since it gives us a very good idea of how new features
       | will work with our production data. The masking rules do need
       | maintenance as our DB schema changes, but that's par for the
       | course with these kinds of dev tools.
       | 
       | All that said, I wouldn't rely on this extension as a way to
       | deliver anonymized data to downstream consumers outside of our
       | software team. As others have pointed out, this is really more of
       | a pseudonymization technique. It's great for removing phone
       | numbers, emails, etc. from your data set, but it's not going to
       | eradicate PII. Pretty much all anonymized records can be traced
       | back to their source data through PKs or FKs.
        
       ___________________________________________________________________
       (page generated 2025-01-17 23:01 UTC)