[HN Gopher] Replibyte - Seed your database with real data
___________________________________________________________________
Replibyte - Seed your database with real data
Author : evoxmusic
Score : 96 points
Date : 2022-07-10 18:39 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| CSSer wrote:
| I think the description in the man entry is better than the one
| in the README. Other than that, cool tool!
| bennyp101 wrote:
| How does it keep personal data safe? I had a look at "how it
| works" and the FAQs, but they don't answer how the data is kept
| safe. It also gets uploaded to S3?
|
| I might have missed it, but I need to know exactly where our PII
| is stored (so not on a dev laptop), how it knows what to
| replace, and what it does with any info it does replace.
|
| Edit: To answer my own question: via transformers. But that
| seems to imply each dev has to keep it up to date with any
| schema changes, etc.
|
| (Also some links are broken on GitHub)
| pistoriusp wrote:
| You may want to check out Snaplet at https://docs.snaplet.dev.
| I'm the co-founder, but we're not open-source (yet). Our goal
| is to give developers a database, and data, that they can code
| against.
|
| We identify PII by introspecting your database, suggest fields
| to transform, and provide a JavaScript runtime for writing
| transformations.
|
| Besides transforming data, you can reduce, and generate data.
| We are most excited about data-generation!
|
| The configuration lives in your repository, and you can capture
| the snapshots in GitHub Actions. So you get "gitops workflow"
| for data.
|
| A typical git-ops workflow:
|
| 1. Add a schema migration for a new column.
| 2. Add a JS function to generate new data for that column.
| 3. Add code to use the new column.
| 4. Later, once you have data, use the same function to
|    transform the original value. (Or just keep generating it.)
| ev0xmusic wrote:
| Hi, author of Replibyte here :)
|
| Yes, transformers are the way to go. I plan to add a way to
| detect schema changes and, at least, refuse to create a dump
| when the schema has changed. I don't think it can be done
| safely without a human admin check.
|
| (Thank you for your PR)
| crummy wrote:
| The user tells it what fields need replacing with the yaml
| config.
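| For context, a sketch of what such a config can look like. The
| key names below are written from memory of the Replibyte README
| and may differ from the current docs:

```yaml
# conf.yaml (sketch; verify key names against the Replibyte docs)
source:
  connection_uri: $DATABASE_URL
  transformers:
    - database: public
      table: employees
      columns:
        - name: last_name
          transformer_name: random
        - name: email
          transformer_name: email
        - name: phone
          transformer_name: phone-number
```

| Each listed column is rewritten by the named transformer when
| the dump is created; anything not listed passes through
| unchanged, which is why the config has to track schema changes.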
| dopidopHN wrote:
| The default seems to be to store the sanitized dump on S3.
|
| S3 isn't always available in a professional context, or
| uploading there might be considered data extraction.
|
| Keeping everything local and detailing exactly what goes where
| and how would be helpful.
| Svarto wrote:
| It would also be good if it's possible to run everything without
| uploading it to S3. As a smaller-time dev with projects in
| production, I would find this really interesting for debugging
| production database data in development. Uploading it to S3
| would needlessly complicate things for me (even though I can
| understand that enterprise customers might prefer it that way).
| evoxmusic wrote:
| You have a local storage option
| https://www.replibyte.com/docs/datastores#local-disk
| roskilli wrote:
| One feature I'd love to see is a transformer that, instead of
| providing a random value, provides a cryptographic one-way hash
| of the data (e.g. SHA-2). That way key uniqueness is preserved
| (so unique constraints on columns aren't violated), and the same
| value used in one place will still match the corresponding value
| in another table after transformation, which more accurately
| reflects the "shape" of the data.
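| A minimal sketch of that idea in plain Python (not Replibyte's
| API): a salted SHA-256 digest maps equal inputs to equal tokens
| and distinct inputs to (almost certainly) distinct tokens, so
| unique constraints and cross-table joins survive. The salt and
| truncation length are arbitrary choices for illustration:

```python
import hashlib

def pseudonymize(value: str, salt: str = "per-project-secret") -> str:
    """Deterministically map a value to an opaque token.

    The same input always yields the same token, so a customer ID
    that appears in two tables still matches after transformation,
    and distinct inputs keep distinct tokens, preserving unique
    constraints. A secret salt prevents trivial dictionary attacks
    on guessable values (emails, phone numbers).
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability; still 64 bits

# Same value in two "tables" transforms identically:
assert pseudonymize("alice@example.com") == pseudonymize("alice@example.com")
# Distinct values stay distinct:
assert pseudonymize("alice@example.com") != pseudonymize("bob@example.com")
```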
| pistoriusp wrote:
| We do this via Copycat (https://github.com/snaplet/copycat). We
| generate static "fake values" by hashing your original value to
| a number, and map that to a fake-value.
| MadsRC wrote:
| This will not work, at least not if we're talking about PII as
| it is defined by any Somewhat Sane (TM) privacy legislation.
|
| Sure, passwords and credit card info are obscured with your
| methodology, but hashed names, dates of birth, sexual
| orientation, telephone numbers, emails and IPs will remain
| unique. This uniqueness is what allows you to potentially
| identify a person given enough data.
| MadsRC wrote:
| I suppose what you'd have to do is change the data and then
| hash it. But once you've changed the data it's no longer PII,
| so there's no reason to hash it.
|
| Of course, given enough changed data, one could potentially
| deduce how the data was changed and thus revert it, at which
| point it would become PII again and you'd have a problem... but
| that's probably a fringe scenario.
| tyingq wrote:
| >Sure, passwords and credit card info is obscured with your
| methodology
|
| Even that's problematic, because there may be code that depends
| on the data being somewhat "real". Credit cards, for example,
| may need to pass Luhn checks, or have valid BIN sections, etc.
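| To illustrate the point: a fake-card transformer can't just emit
| random digits; it has to compute the Luhn check digit. A small
| sketch in Python (the BIN prefix below is made up for
| illustration, not a real issuer range):

```python
def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number missing its last digit."""
    total = 0
    # Walk right to left; double every second digit (starting with the
    # rightmost of the partial number), summing digit-wise.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_luhn_valid(number: str) -> bool:
    """True if the last digit is the correct Luhn check digit."""
    return number[-1] == luhn_check_digit(number[:-1])

# A fake card that keeps a card-like prefix but carries no real
# account: fill the account field, then append the check digit.
partial = "422222" + "000000001"
fake_card = partial + luhn_check_digit(partial)
assert is_luhn_valid(fake_card)
```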
| ev0xmusic wrote:
| Hi, author of Replibyte here. Feel free to open an issue and
| explain your use case. I'll be happy to work out a solution
| with the community.
___________________________________________________________________
(page generated 2022-07-10 23:00 UTC)