[HN Gopher] Show HN: DataProfiler - What's in your data? Extract...
___________________________________________________________________
Show HN: DataProfiler - What's in your data? Extract schema, stats
and entities
Author : lettergram
Score : 64 points
Date : 2021-05-10 14:30 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| MrPowers wrote:
| Deequ is the big data / Spark alternative for similar
| functionality in case anyone is interested:
| https://github.com/awslabs/deequ
|
| Looks like a great project. I wanted to highlight why projects
| like these are much more useful for file formats without
| embedded metadata (CSV) than for formats that carry it
| (Parquet).
|
| Parquet file footers store column-level statistics, file-level
| metadata, and schema information.
|
| Cluster computing technologies can use Parquet metadata to skip
| entire files (Parquet predicate pushdown filtering). If there is
| an age column in the data, then the min/max values will be stored
| in the Parquet footer. If you run a Spark query like df =
| spark.read.parquet("/some_folder").where("age > 90"), it'll
| only read in the files whose max age is greater than 90. Data
| skipping is one of the biggest performance benefits.
|
| I wrote a blog post on analyzing Parquet file metadata with
| PyArrow if you're interested in learning more:
| https://mungingdata.com/pyarrow/parquet-metadata-min-max-sta...
|
| File formats that make you infer the schema (CSV) are on the way
| out. Enjoy Parquet and the other benefits it provides, like
| column pruning!
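| The data-skipping idea above can be sketched in a few lines of
| plain Python, with dicts standing in for per-file footer
| statistics (the file names and stats layout are illustrative,
| not the real Parquet footer format):

```python
# Minimal sketch of Parquet-style data skipping. Each file's
# footer records min/max per column; a filter on that column can
# rule out whole files without reading them.

def files_to_read(footer_stats, column, threshold):
    """Return only the files whose max value for `column` exceeds
    `threshold` -- everything else can be skipped entirely."""
    return [
        name
        for name, stats in footer_stats.items()
        if stats[column]["max"] > threshold
    ]

footer_stats = {
    "part-0000.parquet": {"age": {"min": 1, "max": 45}},
    "part-0001.parquet": {"age": {"min": 30, "max": 95}},
    "part-0002.parquet": {"age": {"min": 88, "max": 102}},
}

# A filter like age > 90 only needs two of the three files.
print(files_to_read(footer_stats, "age", 90))
# → ['part-0001.parquet', 'part-0002.parquet']
```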
| BugsJustFindMe wrote:
| > _File formats that make you infer the schema (CSV) are on the
| way out._
|
| Lol, good luck. People will continue using delimited text files
| for as long as they have eyeballs.
| MrPowers wrote:
| Yea, haha, that was worded too strongly. Should have said
| "more folks are using Parquet for data workflows and adoption
| seems to be increasing". CSV is better when you need
| something mutable / human readable, so it'll always be
| around.
| citilife wrote:
| The main issue with Parquet is actually the metadata. CSV is
| nice because I can send it to a friend and they can
| instantly understand it; you can print it and understand
| it.
|
| In contrast, Parquet is good for systems, and I definitely
| recommend it in programming, but for sharing... not so much.
| stevesimmons wrote:
| > it'll only read in the files that have a max_age greater than
| 90. Data skipping is one of the best performance benefits.
|
| Actually, Parquet files have a chunked structure, with perhaps
| 200 row groups in a large Parquet file. The metadata is stored
| per row group in the file footer.
|
| So if the footer metadata suggests you only need a few chunks
| within a file, your data reads can be orders of magnitude
| faster.
|
| The data is stored column wise too. So if you only need a few
| of the columns, you can get another 10x increase in
| performance.
|
| It's like magic, if your data is ordered in a way that can
| exploit this.
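| The column-pruning benefit described above can be illustrated
| with a toy column-wise table in plain Python (real Parquet row
| groups and encodings are far more involved; this only shows why
| reading a subset of columns is cheap):

```python
# Toy illustration of columnar storage and column pruning.
# Data is laid out one list per column, so a reader that wants
# only "age" never touches the bytes of "name" or "city".

table = {  # column-wise layout: one list per column
    "name": ["alice", "bob", "carol"],
    "age": [34, 91, 88],
    "city": ["NYC", "LA", "SF"],
}

def read_columns(table, wanted):
    """Materialize only the requested columns."""
    return {col: table[col] for col in wanted}

print(read_columns(table, ["age"]))
# → {'age': [34, 91, 88]}
```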
| MrPowers wrote:
| Good to know, thanks for clarifying!
|
| Parquet predicate pushdown filtering is even more powerful
| than I thought!
| BugsJustFindMe wrote:
| They use a machine learning model to determine categories that
| have concrete membership rules?
| ZeroCool2u wrote:
| "Note: The Data Profiler comes with a pre-trained deep learning
| model, used to efficiently identify sensitive data (PII / NPI).
| If desired, it's easy to add new entities to the existing pre-
| trained model or insert an entire new pipeline for entity
| recognition."
|
| No, they seem to use the ML model for identifying data that
| typically may be considered private, sensitive, or PII, which
| can be very difficult to define depending on your data and
| your organization. For example: phone numbers, social
| security numbers, addresses, etc. All of those things can take
| a variety of formats, and writing a regex to identify them
| feels a bit silly when you can train a decent classification
| model to do it mostly automatically, one that also catches
| this kind of data in new cases where a hand-written
| recognition pattern wouldn't necessarily work.
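| The brittleness argument above is easy to demonstrate with a
| naive regex baseline for US SSNs (a hypothetical pattern, not
| how DataProfiler works): it catches the canonical form but
| misses a trivially reformatted variant of the same value.

```python
import re

# Naive SSN matcher: only the canonical ddd-dd-dddd form.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssns(text):
    """Return all canonical-form SSNs found in `text`."""
    return SSN_RE.findall(text)

print(find_ssns("SSN on file: 123-45-6789"))  # → ['123-45-6789']
print(find_ssns("SSN on file: 123 45 6789"))  # same value, missed: []
```

| A trained entity model is meant to generalize across such
| format variations instead of enumerating them by hand.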
| HenryBemis wrote:
| For the HN folks outside the USA, CapitalOne is a bank in the
| USA.
|
| From Wikipedia: Capital One Financial Corporation is an American
| bank holding company specializing in credit cards, auto loans,
| banking, and savings accounts, headquartered in McLean, Virginia
| with operations primarily in the United States. It is on the list
| of largest banks in the United States _and has developed a
| reputation for being a technology-focused bank._
|
| (italicization mine)
___________________________________________________________________
(page generated 2021-05-10 23:01 UTC)