https://github.com/sdv-dev/SDV

sdv-dev / SDV Public

Synthetic data generation for tabular data

docs.sdv.dev/sdv

                                  
   This repository is part of The Synthetic Data Vault Project, a
                       project from DataCebo.


Overview

 

The Synthetic Data Vault (SDV) is a Python library designed to be
your one-stop shop for creating tabular synthetic data. The SDV uses
a variety of machine learning algorithms to learn patterns from your
real data and emulate them in synthetic data.

Features

 

 Create synthetic data using machine learning. The SDV offers
multiple models, ranging from classical statistical methods
(GaussianCopula) to deep learning methods (CTGAN). Generate data for
single tables, multiple connected tables or sequential tables.

 Evaluate and visualize data. Compare the synthetic data to the real
data against a variety of measures. Diagnose problems and generate a
quality report to get more insights.

 Preprocess, anonymize and define constraints. Control data
processing to improve the quality of synthetic data, choose from
different types of anonymization and define business rules in the
form of logical constraints.

Important Links

Tutorials     Get some hands-on experience with the SDV. Launch the
              tutorial notebooks and run the code yourself.
Docs          Learn how to use the SDV library with user guides and
              API references.
Blog          Get more insights about using the SDV, deploying models
              and our synthetic data community.
Community     Join our Slack workspace for announcements and
              discussions.
Website       Check out the SDV website for more information about
              the project.

Install

 

The SDV is publicly available under the Business Source License.
Install SDV using pip or conda. We recommend using a virtual
environment to avoid conflicts with other software on your device.

pip install sdv

conda install -c pytorch -c conda-forge sdv
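
After installing, it can help to confirm that the package is visible to the interpreter in the environment you activated. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of the SDV.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the version of a top-level package, or None if it cannot be imported."""
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata is registered under this name.
        return "unknown"


version = installed_version("sdv")
print("sdv is not installed" if version is None else f"sdv version: {version}")
```

If this reports that the package is missing, the most likely cause is that pip or conda installed it into a different environment than the one running the script.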

Getting Started

 

Load a demo dataset to get started. This dataset is a single table
describing guests staying at a fictional hotel.

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')

[Figure: Single Table Metadata Example]

The demo also includes metadata, a description of the dataset,
including the data types in each column and the primary key
(guest_email).
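
To make the idea of a table description concrete, here is a rough sketch written as a plain Python dictionary with a small consistency check. It is not the SDV metadata object itself: the dictionary layout, the helper `check_description`, and every column name other than `guest_email` and `amenities_fee` are assumptions made for illustration. The `sdtype` key borrows SDV terminology for a column's data type.

```python
# Illustrative table description. The layout is an assumption, and the
# room_type and checkin_date column names are hypothetical.
table_description = {
    "primary_key": "guest_email",
    "columns": {
        "guest_email": {"sdtype": "email"},
        "room_type": {"sdtype": "categorical"},
        "amenities_fee": {"sdtype": "numerical"},
        "checkin_date": {"sdtype": "datetime"},
    },
}


def check_description(description):
    """Return a list of problems found in a table description (empty if none)."""
    problems = []
    columns = description.get("columns", {})
    primary_key = description.get("primary_key")
    # The primary key must refer to a column that is actually declared.
    if primary_key is not None and primary_key not in columns:
        problems.append(f"primary key {primary_key!r} is not a declared column")
    # Every column needs a data type.
    for name, spec in columns.items():
        if "sdtype" not in spec:
            problems.append(f"column {name!r} has no data type")
    return problems


print(check_description(table_description))  # prints []
```

The point of the sketch is only that a description pairs each column with a type and names one column as the key; the real metadata object should be inspected through the SDV API.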

Synthesizing Data

 

Next, we can create an SDV synthesizer, an object that you can use to
create synthetic data. It learns patterns from the real data and
replicates them to generate synthetic data. Let's use the FAST_ML
preset synthesizer, which is optimized for performance.

from sdv.lite import SingleTablePreset

synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

synthetic_data = synthesizer.sample(num_rows=500)

The synthetic data will have the following properties:

  * Sensitive columns are fully anonymized. The email, billing
    address and credit card number columns contain new data so you
    don't expose the real values.
  * Other columns follow statistical patterns. For example, the
    proportion of room types, the distribution of check-in dates and
    the correlations between room rate and room type are preserved.
  * Keys and other relationships are intact. The primary key
    (guest_email) is unique for each row. If you have multiple
    tables, the connections between primary and foreign keys are
    preserved.
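
Two of the properties listed above can be spot-checked by hand: that the key column has no repeated values, and that no sensitive value from the real data reappears in the synthetic data. The sketch below uses only the standard library; the helper names `is_unique` and `leaked_values` and the sample email addresses are hypothetical stand-ins, not part of the SDV or the demo dataset.

```python
def is_unique(values):
    """Return True if no value appears more than once."""
    values = list(values)
    return len(set(values)) == len(values)


def leaked_values(real_values, synthetic_values):
    """Return the real values that reappear in the synthetic column."""
    return set(real_values) & set(synthetic_values)


# Hand-made stand-ins for real_data['guest_email'] and
# synthetic_data['guest_email'].
real_emails = ["ana@example.com", "bo@example.com", "cy@example.com"]
synthetic_emails = ["dee@example.org", "eli@example.org", "fay@example.org"]

print(is_unique(synthetic_emails))                   # prints True
print(leaked_values(real_emails, synthetic_emails))  # prints set()
```

Both helpers accept any iterable, so a pandas column such as `synthetic_data['guest_email']` can be passed directly. pandas also exposes `Series.is_unique` for the first check.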

Evaluating Synthetic Data

 

The SDV library allows you to evaluate the synthetic data by
comparing it to the real data. Get started by generating a quality
report.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata)

Creating report: 100%|##########| 4/4 [00:00<00:00, 19.30it/s]
Overall Quality Score: 89.12%
Properties:
Column Shapes: 90.27%
Column Pair Trends: 87.97%

This object computes an overall quality score on a scale of 0 to 100%
(100 being the best) as well as detailed breakdowns. For more
insights, you can also visualize the synthetic vs. real data.

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)

fig.show()

[Figure: Real vs. Synthetic Data]
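
Returning to the quality report output shown earlier, the overall score is consistent with being the plain average of the two property scores. The arithmetic below is a sketch under that assumption; it reproduces the printed figure but is not taken from the library's implementation.

```python
# Property scores copied from the report output shown earlier, in percent.
property_scores = {
    "Column Shapes": 90.27,
    "Column Pair Trends": 87.97,
}

# Assumption: the overall score is the unweighted mean of the property scores.
overall = sum(property_scores.values()) / len(property_scores)

print(f"Overall Quality Score: {overall:.2f}%")  # prints Overall Quality Score: 89.12%
```

Read this way, a low overall score can be traced to whichever property pulls the average down, which is where the detailed breakdowns become useful.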

What's Next?

 

Using the SDV library, you can synthesize single table, multi table
and sequential data. You can also customize the full synthetic data
workflow, including preprocessing, anonymization and adding
constraints.

To learn more, visit the SDV Demo page.

Credits

 

Thank you to our team of contributors who have built and maintained
the SDV ecosystem over the years!

View Contributors

Citation

 

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data
Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}

---------------------------------------------------------------------



The Synthetic Data Vault Project was first created at MIT's Data to
AI Lab in 2016. After 4 years of research and traction with
enterprise, we created DataCebo in 2020 with the goal of growing the
project. Today, DataCebo is the proud developer of SDV, the largest
ecosystem for synthetic data generation & evaluation. It is home to
multiple libraries that support synthetic data, including:

  *  Data discovery & transformation. Reverse the transforms to
    reproduce realistic data.
  *  Multiple machine learning models -- ranging from Copulas to
    Deep Learning -- to create tabular, multi table and time series
    data.
  *  Measuring quality and privacy of synthetic data, and comparing
    different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and
your one-stop shop for synthetic data. Or, use the standalone
libraries for specific needs.

Releases 54

 
v1.10.0 - 2024-02-15 Latest
Feb 15, 2024
+ 53 releases
