# Data Profiler | What's in your data?
The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

**Loading Data** with a single command, the library automatically formats and loads files into a DataFrame. **Profiling the Data**, the library identifies the schema, statistics, entities (PII / NPI), and more. Data Profiles can then be used in downstream applications or reports.

Getting started only takes a few lines of code (example with a CSV file):

```python
import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")  # Auto-detect & load: CSV, AVRO, Parquet, JSON, Text
print(data.data.head(5))      # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data)      # Calculate statistics, entity recognition, etc.

readable_report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(readable_report, indent=4))
```

**Note**: The Data Profiler comes with a pre-trained deep learning model, used to efficiently identify sensitive data (PII / NPI). If desired, it's easy to add new entities to the existing pre-trained model or insert an entirely new pipeline for entity recognition.

For API documentation, visit the documentation page. If you have suggestions or find a bug, please open an issue.

---

## Install

To install the full package from PyPI:

```bash
pip install DataProfiler[ml]
```

If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package. The slimmer package disables the default sensitive data detection / entity recognition (labeler). Install from PyPI:

```bash
pip install DataProfiler
```

---

## What is a Data Profile?

In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics" or `global_stats`, which contain dataset-level data, and there are "column/row level statistics" or `data_stats` (each column is a new key-value entry). The format for a profile is below:

```
"global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
},
"data_stats": {
    <column name>: {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list(str),
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list(string),
            "null_types_index": {
                string: list(int)
            },
            "data_type_representation": [string, list(string)],
            "min": [null, float],
            "max": [null, float],
            "mean": float,
            "variance": float,
            "stddev": float,
            "histogram": {
                "bin_counts": list(int),
                "bin_edges": list(float),
            },
            "quantiles": {
                int: float
            },
            "vocab": list(char),
            "avg_predictions": dict(float),
            "data_label_representation": dict(float),
            "categories": list(str),
            "unique_count": int,
            "unique_ratio": float,
            "precision": {
                "min": int,
                "max": int,
                "mean": float,
                "var": float,
                "std": float,
                "sample_size": int,
                "margin_of_error": float,
                "confidence_level": float
            },
            "times": dict(float),
            "format": string
        }
    }
}
```
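Individual fields can be read straight out of a generated report. Below is a minimal sketch, assuming `your_file.csv` exists and that `data_stats` is keyed by column name as in the schema above:

```python
from dataprofiler import Data, Profiler

# Assumes "your_file.csv" exists; any supported format would work
data = Data("your_file.csv")
profile = Profiler(data)
report = profile.report(report_options={"output_format": "compact"})

# Dataset-level statistics live under "global_stats"
print("rows:", report["global_stats"]["row_count"])
print("columns:", report["global_stats"]["column_count"])

# Column-level statistics live under "data_stats", one entry per column
for column_name, column_stats in report["data_stats"].items():
    print(column_name, column_stats["data_type"], column_stats["data_label"])
```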
## Support

### Supported Data Formats

* Any delimited file (CSV, TSV, etc.)
* JSON object
* Avro file
* Parquet file
* Pandas DataFrame

### Data Types

Data types are determined at the column level for structured data:

* Int
* Float
* String
* DateTime

### Data Labels

Data labels are determined per cell for structured data (column/row when the profiler is used) or at the character level for unstructured data:

* UNKNOWN
* ADDRESS
* BAN (bank account number, 10-18 digits)
* CREDIT_CARD
* EMAIL_ADDRESS
* UUID
* HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
* IPV4
* IPV6
* MAC_ADDRESS
* PERSON
* PHONE_NUMBER
* SSN
* URL
* US_STATE
* DRIVERS_LICENSE
* DATE
* TIME
* DATETIME
* INTEGER
* FLOAT
* QUANTITY
* ORDINAL

## Get Started

### Load a File

The Data Profiler can profile the following data/file types:

* CSV file (or any delimited file)
* JSON object
* Avro file
* Parquet file
* Pandas DataFrame

The profiler should automatically identify the file type and load the data into a Data class. Along with other attributes, the Data class enables data to be accessed via a valid Pandas DataFrame:

```python
# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))
```

If the file type is not automatically identified (rare), you can specify it explicitly; see the section *Specifying a Filetype or Delimiter* in the documentation.

### Profile a File

This example uses a CSV file, but JSON, Avro, and Parquet files work the same way:

```python
import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv")

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))
```

### Updating Profiles

Currently, the Data Profiler is equipped to update its profile in batches:

```python
import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))
```

Note that if the data used to update the profile contains integer indices that overlap with the indices of the data originally profiled, the indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.
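Because updates are batched, a profile can be grown incrementally as new data arrives. A minimal sketch, assuming a hypothetical set of daily CSV files sharing a schema:

```python
import json
import dataprofiler as dp

# Hypothetical file names; any supported format would work
daily_files = ["day_1.csv", "day_2.csv", "day_3.csv"]

# Profile the first batch, then fold in each subsequent batch
profile = dp.Profiler(dp.Data(daily_files[0]))
for file_name in daily_files[1:]:
    profile.update_profile(dp.Data(file_name))

# The global statistics now cover every batch seen so far
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report["global_stats"], indent=4))
```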
### Merging Profiles

If you have two files with the same schema (but different data), it is possible to merge the two profiles together via an addition operator. This also enables profiles to be computed in a distributed manner:

```python
import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

# Merge the two profiles
profile3 = profile1 + profile2

# Print the report using json to prettify.
report = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))
```

Note that if the merged profiles have overlapping integer indices, the indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.

### Profile a Pandas DataFrame

```python
import json
import pandas as pd
import dataprofiler as dp

my_dataframe = pd.DataFrame([[1, 2.0], [1, 2.2], [-1, 3]])
profile = dp.Profiler(my_dataframe)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# Read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))
```

Visit the documentation page for additional examples and API details.

## References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions.
Anh Truong, Austin Walters, Jeremy Goodsitt. The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services, 2020. https://arxiv.org/abs/2012.09597