[HN Gopher] Greenmask: PostgreSQL Dump and Obfuscation Tool
___________________________________________________________________
Greenmask: PostgreSQL Dump and Obfuscation Tool
Author : thunderbong
Score : 59 points
Date : 2024-02-17 13:18 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jensenbox wrote:
| I would love to see a comparison with replibyte.
| davidw wrote:
| Looks interesting. I am struggling mightily with a problem at
| work where we have a very large production DB and need a subset
| of it for local dev environments. You can't just randomly sample
| the data as that will break foreign keys and other relationships.
|
| Anyone have a solution to this problem that works well for them?
| paulryanrogers wrote:
| At each employer I now start by writing a data extractor that
| takes one client's data and loads it into my local DB. It usually
| includes some 'skip' or filter options to make the process
| faster, and some transformations like anonymizing PII.
|
| With PG you can bypass the FK checks locally by temporarily
| changing session_replication_role from origin to replica. Though
| if you don't eventually copy the referenced records, you may run
| into FK errors during normal operation.
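|
| Roughly (run locally as a superuser; table and file names are
| just placeholders):
|
|     SET session_replication_role = 'replica';  -- skip FK triggers
|     \copy orders FROM 'orders_subset.csv' CSV
|     SET session_replication_role = 'origin';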
| dewey wrote:
| I'm using https://github.com/ankane/pgsync for that; it has a
| very easy YAML config and you can just define queries that get
| pulled from the db.
|
| I have a simple one like "base_tables" that just pulls all the
| fixtures into my local db, then entity-specific ones that pull
| an entity with a specific id plus all related entries in other
| tables for debugging. As long as you can query it, you can set
| everything up very easily.
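|
| From memory, the .pgsync.yml ends up looking roughly like this
| (URLs, table names, and where-clauses below are just
| placeholders):
|
|     from: postgres://readonly@prod-host:5432/app
|     to: postgres://localhost:5432/app_dev
|     exclude:
|       - audit_logs
|     groups:
|       customer:
|         customers: "where id = {1}"
|         orders: "where customer_id = {1}"
|
| and then e.g. `pgsync customer:42` pulls that customer and its
| related rows into the local db.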
| antman wrote:
| Jailer?
| big_whack wrote:
| I would distinguish between "local dev environment" and
| "debugging specific data-dependent customer issue".
|
| My experience is you can usually write the application with no
| dependencies on customer data, and do local development
| entirely with synthetic data.
|
| For debugging specific customer escalations or understanding
| data distributions for your synthetic data modeling, you can
| use a replica of production. No need to extract just a subset
| or have devs host it themselves. All access can be controlled
| and logged this way.
|
| Better than a replica would be something like [1] but that
| isn't available everywhere.
|
| I agree the problem of "dump a DB sample that respects FK
| constraints" comes up sometimes and is interesting but I'm not
| sure I'd use it for the problem you describe.
|
| [1] https://devcenter.heroku.com/articles/heroku-postgres-fork
| markhalonen wrote:
| Postgres can do a pg_dump with RLS (Row Level Security) enabled
| to only get one tenant's data. Very convenient if you already use
| RLS; if you don't, adding it wouldn't take too long if you have
| tenant_id on every table.
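|
| Roughly, assuming a policy like this on each table (names below
| are placeholders; the relevant pg_dump flags are
| --enable-row-security and --role):
|
|     ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
|     CREATE POLICY tenant_only ON orders
|         USING (tenant_id = current_setting('app.tenant')::int);
|
| then dump as a non-superuser role so the policies actually apply:
|
|     PGOPTIONS="-c app.tenant=42" pg_dump --enable-row-security \
|       --role=app_reader appdb > tenant_42.sql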
| joescharf wrote:
| I'm about to release v2.0 of
| [DBSnapper](https://dbsnapper.com), which includes the
| subsetting feature. It works with Postgres and MySQL.
|
| I have a prerelease build (v2.0.0-alpha) available at
| https://github.com/dbsnapper/dbsnapper/releases
|
| And I'm updating the documentation for v2.0.0 in real time at
| https://docs.dbsnapper.com. The new configuration is there, but
| I still have some work to do adding docs for the new features.
|
| It works on Mac and Linux (Windows soon).
|
| If you're interested in giving the alpha a try, I'd love
| feedback and can walk you through any issues you might
| encounter.
|
| Joe
| karn09 wrote:
| Built something like this a few years ago for work, but it
| has become a monstrosity that is difficult to enhance and
| maintain. We've been debating whether to build a more
| extensible v2 since I haven't seen anything that fits our
| needs on the market. I'm excited to check this out!
| rst wrote:
| Really interesting. Curious about one particular case --
| obfuscating the values of columns which are part of a `unique`
| constraint.
|
| Let's take, say, a set of a hundred thousand nine-digit social
| security numbers, stored in a DB column that has a uniqueness
| constraint on it. This is a space small enough that hashing
| doesn't really mask anything -- there are few enough possible
| values that an adversary can compute hashes of all of them, and
| unmask hashed data. But the birthday paradox says that the
| RandomString transformer is highly unlikely to preserve
| uniqueness -- ask a genuinely random string generator to generate
| that many nine-digit strings, and you're extremely likely to get
| the same string out more than once, violating the uniqueness
| constraint.
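| (Back-of-the-envelope: with n = 100,000 draws from N = 10^9
| possible values, the chance of at least one collision is about
| 1 - exp(-n^2/(2N)) = 1 - exp(-5), i.e. over 99%.)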
|
| One approach that I've seen to this is to assign replacement
| strings sequentially, in the order that the underlying data is
| seen -- that is, in effect, building up a dictionary of
| replacements over time. But that requires a transformer with
| mutable state, which looks kind of awkward in this framework. The
| best way I can see to arrange this is a `Cmd` transformer that
| holds the dictionary in memory. Is there a neater approach?
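|
| For concreteness, the stateful approach I mean, sketched in plain
| Python (not Greenmask's actual transformer interface):
|
|     class SequentialReplacer:
|         """Replacement dictionary built up in the order values
|         are first seen; every distinct input gets a distinct,
|         stable nine-digit output."""
|
|         def __init__(self):
|             self.mapping = {}   # original -> replacement
|             self.next_id = 0
|
|         def transform(self, ssn: str) -> str:
|             if ssn not in self.mapping:
|                 self.mapping[ssn] = f"{self.next_id:09d}"
|                 self.next_id += 1
|             return self.mapping[ssn]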
| ngalstyan4 wrote:
| Not sure what the approach of this library is, but can't you
| generate a nonce from a larger alphabet, hash the column values
| with the nonce as `hash(nonce || column)`, and crypto-shred the
| nonce at the end?
|
| Then, during hashing, you just need constant immutable state,
| which effectively expands the hash space without incurring the
| mutable-state overhead of the replacement-strings strategy.
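|
| Something like this, illustratively (plain Python, not this
| library's API; truncating the digest still keeps collisions
| negligible for ~100k rows):
|
|     import hashlib
|     import hmac
|     import secrets
|
|     key = secrets.token_bytes(32)   # the per-run secret ("nonce")
|
|     def mask(ssn: str) -> str:
|         # keyed hash: without the key an attacker can't just
|         # enumerate all 10^9 possible SSNs and compare
|         digest = hmac.new(key, ssn.encode(), hashlib.sha256)
|         return digest.hexdigest()[:16]
|
|     # ... run the dump through mask(), then discard the key
|     # (crypto-shredding) so the mapping can never be recomputed
|     del key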
| kmtrowbr wrote:
| This is a Rails tool I use: https://github.com/IFTTT/polo -- it
| works well.
| lakomen wrote:
| What's wrong with pg_dump?
___________________________________________________________________
(page generated 2024-02-17 23:01 UTC)