[HN Gopher] Replibyte - Seed your database with real data
___________________________________________________________________
Replibyte - Seed your database with real data
Author : evoxmusic
Score : 96 points
Date : 2022-07-10 18:39 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| CSSer wrote:
| I think the description in the man entry is better than the one
| in the README. Other than that, cool tool!
| bennyp101 wrote:
| How does it keep personal data safe? I had a look at "how it
| works" and the FAQs, but they don't answer how the data is kept
| safe. It also gets uploaded to S3?
|
| I might have missed it, but I need to know exactly where our PII
| is stored (so not on a dev laptop), how it knows what to
| replace, and what it does with any info it does replace.
|
| Edit: To answer my own question: via transformers. But that
| seems to imply each dev has to keep it up to date with any
| schema changes, etc.
|
| (Also some links are broken on GitHub)
| pistoriusp wrote:
| You may want to check out Snaplet at https://docs.snaplet.dev.
| I'm the co-founder, but we're not open-source (yet). Our goal
| is to give developers a database, and data, that they can code
| against.
|
| We identify PII by introspecting your database, suggest fields
| to transform, and provide a JavaScript runtime for writing
| transformations.
|
| Besides transforming data, you can reduce, and generate data.
| We are most excited about data-generation!
|
| The configuration lives in your repository, and you can capture
| the snapshots in GitHub Actions. So you get "gitops workflow"
| for data.
|
| A typical git-ops workflow:
|
| 1. Add a schema migration for a new column.
| 2. Add a JS function to generate new data for that column.
| 3. Add code to use the new column.
| 4. Later, once you have data, use the same function to
|    transform the original value. (Or just keep generating it.)
| ev0xmusic wrote:
| Hi, author of Replibyte here :)
|
| Yes, transformers are the way to go. I plan to add a way to
| detect schema changes and, at least, refuse to create a dump
| when the schema has changed. I don't think it can be done
| safely without a human admin check.
|
| (Thank you for your PR)
| crummy wrote:
| The user tells it what fields need replacing with the yaml
| config.
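| For context, a sketch of what such a config can look like. The
| key names below are written from memory of the Replibyte README
| and may differ from the current docs:

```yaml
# conf.yaml (sketch; verify key names against the Replibyte docs)
source:
  connection_uri: $DATABASE_URL
  transformers:
    - database: public
      table: employees
      columns:
        - name: last_name
          transformer_name: random
        - name: email
          transformer_name: email
        - name: phone
          transformer_name: phone-number
```

| Each listed column is rewritten by the named transformer when
| the dump is created; anything not listed passes through
| unchanged, which is why the config has to track schema changes.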
| dopidopHN wrote:
| The default seems to be to store the sanitized dump on S3.
|
| S3 isn't always available in a professional context, or
| uploading there might be considered data extraction.
|
| Keeping everything local and detailing exactly what goes where
| and how would be helpful.
| Svarto wrote:
| It would also be good if it's possible to run everything without
| uploading it to S3. As a smaller-time dev with projects in
| production, I would find this really interesting for debugging
| production database data in development. Uploading it to S3
| would needlessly complicate things for me (even though I can
| understand that enterprise customers might prefer it that way).
| evoxmusic wrote:
| You have a local storage option
| https://www.replibyte.com/docs/datastores#local-disk
| roskilli wrote:
| One feature I'd love to see is a transformer that, instead of
| providing a random value, provides a cryptographic one-way hash
| of the data (e.g. SHA-2). That way key uniqueness is preserved
| (so unique constraints on columns aren't violated), and the same
| value used in one place will still match the corresponding value
| in another table after transformation, which more accurately
| reflects the "shape" of the data.
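| A minimal sketch of that idea in plain Python (not Replibyte's
| API): a salted SHA-256 digest maps equal inputs to equal tokens
| and distinct inputs to (almost certainly) distinct tokens, so
| unique constraints and cross-table joins survive. The salt and
| truncation length are arbitrary choices for illustration:

```python
import hashlib

def pseudonymize(value: str, salt: str = "per-project-secret") -> str:
    """Deterministically map a value to an opaque token.

    The same input always yields the same token, so a customer ID
    that appears in two tables still matches after transformation,
    and distinct inputs keep distinct tokens, preserving unique
    constraints. A secret salt prevents trivial dictionary attacks
    on guessable values (emails, phone numbers).
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability; still 64 bits

# Same value in two "tables" transforms identically:
assert pseudonymize("alice@example.com") == pseudonymize("alice@example.com")
# Distinct values stay distinct:
assert pseudonymize("alice@example.com") != pseudonymize("bob@example.com")
```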
| pistoriusp wrote:
| We do this via Copycat (https://github.com/snaplet/copycat). We
| generate static "fake values" by hashing your original value to
| a number, and map that to a fake-value.
| MadsRC wrote:
| This will not work, at least not if we're talking about PII as
| it is defined by any Somewhat Sane (TM) privacy legislation.
|
| Sure, passwords and credit card info are obscured with your
| methodology, but hashed names, dates of birth, sexual
| orientation, telephone numbers, emails and IPs will remain
| unique. This uniqueness is what allows you to potentially
| identify a person given enough data.
| MadsRC wrote:
| I suppose what you'd have to do is change the data and then
| hash it. But once you've changed the data it's no longer PII,
| so there's no reason to hash it.
|
| Of course, given enough changed data, one could potentially
| deduce how the data was changed and thus revert it, at which
| point it would become PII again and you'd have a problem... but
| that's probably a fringe scenario.
| tyingq wrote:
| >Sure, passwords and credit card info is obscured with your
| methodology
|
| Even that's problematic, because there may be code that depends
| on the data being somewhat "real". Credit cards, for example,
| may need to pass Luhn checks, or have valid BIN sections, etc.
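| To illustrate the point: a fake-card transformer can't just emit
| random digits; it has to compute the Luhn check digit. A small
| sketch in Python (the BIN prefix below is made up for
| illustration, not a real issuer range):

```python
def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number missing its last digit."""
    total = 0
    # Walk right to left; double every second digit (starting with the
    # rightmost of the partial number), summing digit-wise.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_luhn_valid(number: str) -> bool:
    """True if the last digit is the correct Luhn check digit."""
    return number[-1] == luhn_check_digit(number[:-1])

# A fake card that keeps a card-like prefix but carries no real
# account: fill the account field, then append the check digit.
partial = "422222" + "000000001"
fake_card = partial + luhn_check_digit(partial)
assert is_luhn_valid(fake_card)
```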
| ev0xmusic wrote:
| Hi, author of Replibyte here. Feel free to open an issue and
| explain your use case. I'll be happy to work out a solution
| with the community.
___________________________________________________________________
(page generated 2022-07-10 23:00 UTC)