[HN Gopher] Greenmask: PostgreSQL Dump and Obfuscation Tool
       ___________________________________________________________________
        
       Greenmask: PostgreSQL Dump and Obfuscation Tool
        
       Author : thunderbong
       Score  : 59 points
       Date   : 2024-02-17 13:18 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jensenbox wrote:
       | I would love to see a comparison with replibyte.
        
       | davidw wrote:
       | Looks interesting. I am struggling mightily with a problem at
       | work where we have a very large production DB and need a subset
       | of it for local dev environments. You can't just randomly sample
       | the data as that will break foreign keys and other relationships.
       | 
       | Anyone have a solution to this problem that works well for them?
        
         | paulryanrogers wrote:
         | At each employer I now start by writing a data extractor that
          | takes one client's data and loads it into my local DB. It
          | usually includes some 'skip' or filter options to make the
          | process faster, and some transformations like anonymizing PII.
         | 
          | With PG you can bypass the FK checks locally by temporarily
          | changing the session replication role from origin to replica.
          | Though if you don't eventually copy the referenced records, you
          | may run into FK errors during normal operations.
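          | 
          | For illustration, a minimal sketch of that trick in Python with
          | psycopg2 (the setting is session_replication_role; the database
          | and table names here are made up):
          | 
          |   import psycopg2
          | 
          |   # hypothetical local dev DB; changing the setting typically
          |   # requires superuser
          |   conn = psycopg2.connect("dbname=localdev")
          |   with conn, conn.cursor() as cur:
          |       # 'replica' keeps ordinary triggers, including the system
          |       # triggers that enforce foreign keys, from firing in this
          |       # session, so rows referencing missing parents still load
          |       cur.execute("SET session_replication_role = 'replica'")
          |       cur.execute(
          |           "INSERT INTO orders (id, customer_id) "
          |           "VALUES (%s, %s)",
          |           (1, 999),  # customer 999 may not exist locally
          |       )
          |       # restore normal enforcement for the rest of the session
          |       cur.execute("SET session_replication_role = 'origin'")
          |   conn.close()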
        
         | dewey wrote:
          | I'm using https://github.com/ankane/pgsync for that. It has a
          | very easy YAML config, and you can just define queries that get
          | pulled from the DB.
          | 
          | I have a simple one like "base_tables" that just pulls all the
          | fixtures into my local DB, then entity-specific ones that pull
          | an entity with a specific id plus all related entries in other
          | tables for debugging. As long as you can query it, you can set
          | everything up very easily.
        
         | antman wrote:
         | Jailer?
        
         | big_whack wrote:
         | I would distinguish between "local dev environment" and
         | "debugging specific data-dependent customer issue".
         | 
         | My experience is you can usually write the application with no
         | dependencies on customer data, and do local development
         | entirely with synthetic data.
         | 
         | For debugging specific customer escalations or understanding
         | data distributions for your synthetic data modeling, you can
         | use a replica of production. No need to extract just a subset
         | or have devs host it themselves. All access can be controlled
         | and logged this way.
         | 
         | Better than a replica would be something like [1] but that
         | isn't available everywhere.
         | 
          | I agree the problem of "dump a DB sample that respects FK
          | constraints" comes up sometimes and is interesting, but I'm not
          | sure I'd use it for the problem you describe.
         | 
         | [1] https://devcenter.heroku.com/articles/heroku-postgres-fork
        
         | markhalonen wrote:
          | Postgres can do a pg_dump with RLS (Row Level Security) enabled
          | to get only one tenant's data. Very convenient if you already
          | use RLS; if you don't, adding it wouldn't take too long as long
          | as you have a tenant_id on every table.
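          | 
          | A rough sketch of what that looks like, driven from Python (the
          | role, database and file names are made up; --enable-row-security
          | is the pg_dump flag that makes the dump honor RLS policies):
          | 
          |   import subprocess
          | 
          |   # tenant_42_ro: hypothetical role restricted by an RLS policy
          |   # to rows with tenant_id = 42
          |   subprocess.run(
          |       [
          |           "pg_dump",
          |           "--enable-row-security",
          |           "--dbname=myapp",
          |           "--username=tenant_42_ro",
          |           "--file=tenant_42.sql",
          |       ],
          |       check=True,
          |   )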
        
         | joescharf wrote:
          | I'm about to release v2.0 of
          | [DBSnapper](https://dbsnapper.com), which includes the
          | subsetting feature. Works with Postgres and MySQL.
         | 
          | I have a prerelease build (v2.0.0-alpha) available at
         | https://github.com/dbsnapper/dbsnapper/releases
         | 
          | And I'm updating the documentation for v2.0.0 in real time at
          | https://docs.dbsnapper.com. The new configuration is there, but
          | I still have some work to do adding docs for the new features.
         | 
          | Works on Mac and Linux (Windows soon).
         | 
         | If you're interested in giving the alpha a try, I'd love
         | feedback and can walk you through any issues you might
         | encounter.
         | 
         | Joe
        
           | karn09 wrote:
           | Built something like this a few years ago for work, but it
           | has become a monstrosity that is difficult to enhance and
           | maintain. We've been debating whether to build a more
           | extensible v2 since I haven't seen anything that fits our
           | needs on the market. I'm excited to check this out!
        
       | rst wrote:
       | Really interesting. Curious about one particular case --
       | obfuscating the values of columns which are part of a `unique`
       | constraint.
       | 
        | Let's take, say, a set of a hundred thousand nine-digit Social
        | Security numbers, stored in a DB column that has a uniqueness
       | constraint on it. This is a space small enough that hashing
       | doesn't really mask anything -- there are few enough possible
       | values that an adversary can compute hashes of all of them, and
       | unmask hashed data. But the birthday paradox says that the
       | RandomString transformer is highly unlikely to preserve
       | uniqueness -- ask a genuinely random string generator to generate
       | that many nine-digit strings, and you're extremely likely to get
       | the same string out more than once, violating the uniqueness
       | constraint.
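        | 
        | A quick check of that with the standard birthday approximation,
        | in Python:
        | 
        |   import math
        | 
        |   n, N = 100_000, 10**9
        |   # probability of at least one repeat among n uniform draws
        |   print(1 - math.exp(-n * (n - 1) / (2 * N)))  # ~0.993
        | 
        | So a repeat is all but guaranteed.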
       | 
       | One approach that I've seen to this is to assign replacement
       | strings sequentially, in the order that the underlying data is
       | seen -- that is, in effect, building up a dictionary of
       | replacements over time. But that requires a transformer with
       | mutable state, which looks kind of awkward in this framework. The
       | best way I can see to arrange this is a `Cmd` transformer that
       | holds the dictionary in memory. Is there a neater approach?
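        | 
        | For concreteness, a minimal sketch of that dictionary-building
        | transformer in Python (illustrative only, not Greenmask's API):
        | 
        |   class SequentialSSNMasker:
        |       def __init__(self, start: int = 0):
        |           # maps each original value to its replacement
        |           self._mapping: dict[str, str] = {}
        |           self._next = start
        | 
        |       def transform(self, ssn: str) -> str:
        |           # first sighting gets the next unused nine-digit
        |           # string; repeats reuse the same replacement, so
        |           # uniqueness is preserved
        |           if ssn not in self._mapping:
        |               self._mapping[ssn] = f"{self._next:09d}"
        |               self._next += 1
        |           return self._mapping[ssn]
        | 
        |   m = SequentialSSNMasker()
        |   m.transform("123456789")  # -> "000000000"
        |   m.transform("987654321")  # -> "000000001"
        |   m.transform("123456789")  # -> "000000000" again
        | 
        | The mapping is exactly the mutable state that's awkward to carry
        | in a per-column transformer.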
        
         | ngalstyan4 wrote:
          | Not sure what the approach of this library is, but can't you
          | generate a nonce from a larger alphabet, hash the column values
          | with the nonce as `hash(nonce || column)`, and crypto-shred the
          | nonce at the end?
          | 
          | Then, during hashing you just need constant, immutable state,
          | which effectively expands the hash space without incurring the
          | mutable-state overhead of the replacement-strings strategy.
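          | 
          | A small sketch of that in Python with hashlib and secrets (the
          | value here is made up):
          | 
          |   import hashlib
          |   import secrets
          | 
          |   # one secret, constant nonce for the whole dump run
          |   nonce = secrets.token_bytes(32)
          | 
          |   def mask(ssn: str) -> str:
          |       # same input, same output, so the column stays unique;
          |       # the 256-bit output space makes collisions negligible
          |       return hashlib.sha256(nonce + ssn.encode()).hexdigest()
          | 
          |   masked = mask("123456789")
          |   # after the dump, delete ("crypto-shred") the nonce; without
          |   # it the ~10^9 possible SSNs can no longer be enumerated and
          |   # matched against the hashes
          | 
          | One caveat: the hex digest is wider than the original nine-digit
          | column, so the destination column type has to allow that.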
        
       | kmtrowbr wrote:
       | This is a Rails tool I use: https://github.com/IFTTT/polo -- it
       | works well.
        
       | lakomen wrote:
       | What's wrong with pg_dump?
        
       ___________________________________________________________________
       (page generated 2024-02-17 23:01 UTC)