[HN Gopher] Show HN: Greenmask 0.2 - Database anonymization tool
       ___________________________________________________________________
        
       Show HN: Greenmask 0.2 - Database anonymization tool
        
       Hi! My name is Vadim, and I'm the developer of Greenmask
       (https://github.com/GreenmaskIO/greenmask). Greenmask is now
       almost a year old, and we recently published one of our most
       significant releases, with new features:
       https://github.com/GreenmaskIO/greenmask/releases/tag/v0.2.0, as
       well as a new website at https://greenmask.io.

       Before I describe Greenmask's features, I want to share the story
       of how and why I started implementing it.

       Everyone strives to have their staging
       environment resemble production as closely as possible because it
       plays a critical role in ensuring quality and improving metrics
       like time to delivery. To achieve this, many teams have started
       migrating databases and data from production to staging
       environments. Obviously this requires anonymizing the data, and for
       this people use either custom scripts or existing anonymization
       software.

       Having worked as a database engineer for 8 years, I
       frequently struggled with routine tasks like setting up development
       environments--this was a common request. Initially, I used custom
       scripts to handle this, but things became increasingly complex as
       the number of services grew, especially with the rise of
       microservices architecture.

       When I began exploring tools to solve this issue, I listed my key
       expectations for such software:

       * documentation;
       * type safety (the tool should validate any changes to the data);
       * streaming (I want the ability to stream the data while
         transformations are being applied);
       * consistency (transformations must maintain constraints,
         functional dependencies, and more);
       * reliability;
       * customizability;
       * interactivity and usability;
       * simplicity.

       I found a few options, but none fully met my expectations. Two
       interesting tools I discovered were pganonymizer and Replibyte. I
       liked the architecture of Replibyte, but when I tried it, it failed
       due to architectural limitations.

       With these
       thoughts in mind, I began developing Greenmask in mid-2023. My goal
       was to create a tool that meets all of these requirements, based on
       the design principles I laid out. Here are some key highlights:

       * It is a single utility - Greenmask delegates the schema dump to
         vendor utilities and takes responsibility only for data dumps and
         transformations.
       * Database Subset
         (https://docs.greenmask.io/latest/database_subset) - specify the
         subset condition and scale down the database size. We did deep
         research into graph algorithms, and Greenmask can now subset
         databases of almost any complexity.
       * Database type safety - Greenmask uses the DBMS driver to decode
         and encode data into real types (such as int, float, etc.) in the
         stream. This guarantees consistency and almost eliminates the
         chance of corrupted dumps.
       * Deterministic engine
         (https://docs.greenmask.io/latest/built_in_transformers/trans...) -
         generates data using a hash engine that produces the same output
         for the same input.
       * Dynamic parameters for transformers
         (https://docs.greenmask.io/latest/built_in_transformers/dynam...) -
         imagine having created_at and updated_at dates with functional
         dependencies. Dynamic parameters ensure these dates are generated
         correctly.

       We are actively maintaining the current project and
       continuously improving it; our public roadmap is at
       https://github.com/orgs/GreenmaskIO/projects/6. Soon, we will
       release a Python library along with transformation collections to
       help users develop their own complex transformations and integrate
       them with any service. We plan to support more database engines,
       with MySQL being the next one, and we are working on tools that
       will integrate seamlessly with your CI/CD systems.

       To get started, we've prepared a playground for you that can be
       easily set up using Docker Compose:
       https://docs.greenmask.io/latest/playground/

       I'd love to hear any questions or feedback from you!
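
       To make the deterministic and dynamic-parameter highlights above a
       bit more concrete, here is a minimal Python sketch of the general
       idea. It is not Greenmask's API, configuration, or implementation:
       the names SALT, mask_email, and mask_dates are hypothetical, and
       HMAC-SHA256 is simply one assumed way to build a hash engine. The
       point is only that a keyed hash gives the same fake value for the
       same input, and that updated_at can be derived from created_at so
       the dependency between the two columns is preserved.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Hypothetical secret for illustration only; not a Greenmask setting.
SALT = b"example-salt"


def _digest(value: str, purpose: str) -> bytes:
    """Keyed hash of the input: the same input always yields the same bytes."""
    message = f"{purpose}:{value}".encode()
    return hmac.new(SALT, message, hashlib.sha256).digest()


def mask_email(original: str) -> str:
    """Map an email address to a stable fake address."""
    token = _digest(original, "email").hex()[:12]
    return f"user_{token}@example.com"


def mask_dates(row_key: str, window_days: int = 365) -> tuple[datetime, datetime]:
    """Derive created_at from the hash, then updated_at from created_at.

    updated_at is created_at plus a non-negative offset, so
    updated_at >= created_at holds for every row.
    """
    digest = _digest(row_key, "dates")
    base = datetime(2020, 1, 1)
    created_offset = int.from_bytes(digest[:4], "big") % (window_days * 86400)
    updated_offset = int.from_bytes(digest[4:8], "big") % (30 * 86400)
    created_at = base + timedelta(seconds=created_offset)
    updated_at = created_at + timedelta(seconds=updated_offset)
    return created_at, updated_at
```

       Because the output depends only on the input and the salt, the
       same original value masked in two different tables still matches
       after anonymization, which is what keeps joins on masked columns
       usable in a staging copy.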
        
       Author : woyten
       Score  : 21 points
       Date   : 2024-10-16 20:37 UTC (2 hours ago)
        
        
       | btown wrote:
       | This is really awesome - and it's so amazing that you've built
       | this as a standalone tool!
       | 
       | I can absolutely speak to the pain of having a dozen pg_dump
       | --exclude-table-data arguments and having a developer experience
       | that makes it difficult to reproduce bugs due to drift between
       | production data and test fixtures (even if they share the same
       | schema, assumptions can change massively!).
       | 
       | Secure and robust database cloning also enables preview apps that
       | actually answer the stakeholder question "can I see/play with
       | what the new code would do, if applied to the actual
       | [document/record/product listing] that motivated the
       | feature/bugfix?" Subsetting and PII masking are both critical for
       | this, and it's amazing to see that you've thought about them as
       | integral parts of the same product.
       | 
       | I really want to see a product like this succeed! The easier the
       | tool is to use, the harder it might be to monetize... but there
       | are so many applications of a tool like this, including ones that
       | can materially improve security at organizations large and small
       | (https://nabeelqu.substack.com/i/150188028/secrets, posted here
       | earlier today, remarks on this!) that I'm sure you'll find
       | the right niche!
        
       ___________________________________________________________________
       (page generated 2024-10-16 23:00 UTC)