https://github.com/sdv-dev/SDV
sdv-dev/SDV: Synthetic data generation for tabular data (docs.sdv.dev/sdv)

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

Evaluate and visualize data.
Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization and define business rules in the form of logical constraints.

Important Links

* Tutorials: Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself.
* Docs: Learn how to use the SDV library with user guides and API references.
* Blog: Get more insights about using the SDV, deploying models and our synthetic data community.
* Community: Join our Slack workspace for announcements and discussions.
* Website: Check out the SDV website for more information about the project.

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

    pip install sdv

    conda install -c pytorch -c conda-forge sdv

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

    from sdv.datasets.demo import download_demo

    real_data, metadata = download_demo(
        modality='single_table',
        dataset_name='fake_hotel_guests'
    )

[Image: Single Table Metadata Example]

The demo also includes metadata, a description of the dataset, including the data types in each column and the primary key (guest_email).

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the FAST_ML preset synthesizer, which is optimized for performance.

    from sdv.lite import SingleTablePreset

    synthesizer = SingleTablePreset(metadata, name='FAST_ML')
    synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

    synthetic_data = synthesizer.sample(num_rows=500)

The synthetic data will have the following properties:

* Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data so you don't expose the real values.
* Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check-in dates and the correlations between room rate and room type are preserved.
* Keys and other relationships are intact. The primary key (guest_email) is unique for each row. If you have multiple tables, the connections between primary and foreign keys make sense.

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

    from sdv.evaluation.single_table import evaluate_quality

    quality_report = evaluate_quality(
        real_data,
        synthetic_data,
        metadata
    )

    Creating report: 100%|##########| 4/4 [00:00<00:00, 19.30it/s]

    Overall Quality Score: 89.12%

    Properties:
    Column Shapes: 90.27%
    Column Pair Trends: 87.97%

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well as detailed breakdowns. For more insights, you can also visualize the synthetic vs. real data.

    from sdv.evaluation.single_table import get_column_plot

    fig = get_column_plot(
        real_data=real_data,
        synthetic_data=synthetic_data,
        column_name='amenities_fee',
        metadata=metadata
    )

    fig.show()

[Image: Real vs. Synthetic Data]

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

    @inproceedings{
        SDV,
        title={The Synthetic data vault},
        author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
        booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
        year={2016},
        pages={399-410},
        doi={10.1109/DSAA.2016.49},
        month={Oct}
    }

---------------------------------------------------------------------

The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

* Data discovery & transformation. Reverse the transforms to reproduce realistic data.
* Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
* Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
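
A note on the quality report under Evaluating Synthetic Data: the printed numbers are consistent with the overall score being the plain average of the two property scores. The sketch below checks that arithmetic using only the values from that output. The helper name `overall_quality` is hypothetical and is not an SDV function, and the averaging rule is an assumption inferred from the printed numbers rather than something this README states.

```python
# Property scores copied from the quality report output shown in
# "Evaluating Synthetic Data" (expressed as fractions of 1).
property_scores = {
    "Column Shapes": 0.9027,
    "Column Pair Trends": 0.8797,
}


def overall_quality(scores):
    """Hypothetical helper (not part of SDV): mean of the property scores.

    Assumption: the overall score is the unweighted average of the
    individual property scores.
    """
    return sum(scores.values()) / len(scores)


overall = overall_quality(property_scores)
print(f"Overall Quality Score: {overall:.2%}")  # prints "Overall Quality Score: 89.12%"
```

If the reported overall score ever differs from this average for your data, treat the report itself as the source of truth and consult the SDV docs for how the score is defined.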

Latest release: v1.10.0 (2024-02-15)