[HN Gopher] Show HN: DataProfiler - What's in your data? Extract...
       ___________________________________________________________________
        
       Show HN: DataProfiler - What's in your data? Extract schema, stats
       and entities
        
       Author : lettergram
       Score  : 64 points
       Date   : 2021-05-10 14:30 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | MrPowers wrote:
       | Deequ is the big data / Spark alternative for similar
       | functionality in case anyone is interested:
       | https://github.com/awslabs/deequ
       | 
        | Looks like a great project. I wanted to highlight why projects
        | like these are much more useful for file formats that don't
        | carry metadata (CSV) than for file formats that do (Parquet).
       | 
        | Parquet file footers contain column-level metadata, file-level
        | metadata, and schema information.
       | 
       | Cluster computing technologies can use Parquet metadata to skip
       | entire files (Parquet predicate pushdown filtering). If there is
       | an age column in the data, then the min/max values will be stored
        | in the Parquet footer. If you run a Spark query like df =
        | spark.read.parquet("/some_folder").where("age > 90"), it'll only
        | read the files whose max age is greater than 90. Data skipping
       | is one of the best performance benefits.
       | 
       | I wrote a blog post on analyzing Parquet file metadata with
       | PyArrow if you're interested in learning more:
       | https://mungingdata.com/pyarrow/parquet-metadata-min-max-sta...
       | 
       | File formats that make you infer the schema (CSV) are on the way
       | out. Enjoy Parquet and the other benefits it provides, like
       | column pruning!
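        | 
        | A rough sketch of inspecting that footer metadata with PyArrow
        | (the file path and the column index here are placeholders, not
        | from the post):
        | 
        |     import pyarrow.parquet as pq
        | 
        |     # Opening the file only parses the footer; no data pages
        |     # are read yet.
        |     pf = pq.ParquetFile("some_folder/part-00000.parquet")
        | 
        |     print(pf.metadata)      # file-level: rows, row groups, ...
        |     print(pf.schema_arrow)  # schema recovered from the footer
        | 
        |     # Per-row-group, per-column statistics (min/max, nulls)
        |     for rg in range(pf.metadata.num_row_groups):
        |         col = pf.metadata.row_group(rg).column(0)
        |         if col.statistics is not None:
        |             print(rg, col.path_in_schema,
        |                   col.statistics.min, col.statistics.max)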
        
         | BugsJustFindMe wrote:
         | > _File formats that make you infer the schema (CSV) are on the
         | way out._
         | 
         | Lol, good luck. People will continue using delimited text files
         | for as long as they have eyeballs.
        
           | MrPowers wrote:
           | Yea, haha, that was worded too strongly. Should have said
           | "more folks are using Parquet for data workflows and adoption
            | seems to be increasing". CSV is better when you need
            | something mutable / human-readable, so it'll always be
            | around.
        
             | citilife wrote:
             | Main issue with Parquet is actually the metadata. CSV is
             | nice because I can send it to a friend and they can
             | instantly understand it; you can print it and understand
             | it.
             | 
              | In contrast, Parquet is good for systems and I definitely
              | recommend it for programming, but for sharing... not so
              | much.
        
         | stevesimmons wrote:
         | > it'll only read in the files that have a max_age greater than
         | 90. Data skipping is one of the best performance benefits.
         | 
         | Actually Parquet files have a chunked structure, with maybe 200
         | row groups in a large Parquet file. The metadata is stored per
         | chunk in the file footer.
         | 
          | So if the footer metadata suggests you only need a few chunks
          | within a file, your data reads can be orders of magnitude
          | faster.
         | 
          | The data is stored column-wise too. So if you only need a few
         | of the columns, you can get another 10x increase in
         | performance.
         | 
         | It's like magic, if your data is ordered in a way that can
         | exploit this.
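          | 
          | A minimal sketch of that kind of row-group and column skipping
          | with PyArrow (the path and column names are invented, and
          | column 0 is assumed to be the age column):
          | 
          |     import pyarrow.parquet as pq
          | 
          |     pf = pq.ParquetFile("large_file.parquet")
          | 
          |     # Use the per-row-group min/max stats from the footer to
          |     # decide which chunks are worth reading at all.
          |     wanted = []
          |     for rg in range(pf.metadata.num_row_groups):
          |         stats = pf.metadata.row_group(rg).column(0).statistics
          |         # Keep the chunk unless its stats prove age <= 90.
          |         if (stats is None or not stats.has_min_max
          |                 or stats.max > 90):
          |             wanted.append(rg)
          | 
          |     # Read only those row groups, and only the columns needed.
          |     table = pf.read_row_groups(wanted, columns=["age", "name"])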
        
           | MrPowers wrote:
           | Good to know, thanks for clarifying!
           | 
           | Parquet predicate pushdown filtering is even more powerful
           | than I thought!
        
       | BugsJustFindMe wrote:
       | They use a machine learning model to determine categories that
       | have concrete membership rules?
        
         | ZeroCool2u wrote:
         | "Note: The Data Profiler comes with a pre-trained deep learning
         | model, used to efficiently identify sensitive data (PII / NPI).
         | If desired, it's easy to add new entities to the existing pre-
         | trained model or insert an entire new pipeline for entity
         | recognition."
         | 
          | No, they seem to use the ML model for identifying data that
          | may typically be considered private, sensitive, or PII, which
          | can be very difficult to define depending on your data and
          | your organization. For example: phone numbers, social security
          | numbers, addresses, etc. All of those things can take a
          | variety of formats, and writing a regex to identify them feels
          | a bit silly when you can train a decent classification model
          | to do it mostly automatically, and then catch this kind of
          | data in new use cases where a hand-written recognition pattern
          | wouldn't necessarily work.
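          | 
          | A rough sketch of what running it over a file looks like,
          | going by the project's README (the exact report options may
          | differ):
          | 
          |     import json
          |     from dataprofiler import Data, Profiler
          | 
          |     data = Data("some_file.csv")   # auto-detects CSV, JSON,
          |                                    # Parquet, Avro, text
          |     profile = Profiler(data)       # stats, schema, and the
          |                                    # entity/PII labeler
          |     report = profile.report(
          |         report_options={"output_format": "pretty"})
          |     print(json.dumps(report, indent=4))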
        
       | HenryBemis wrote:
       | For the HN folks outside the USA, CapitalOne is a bank in the
       | USA.
       | 
       | From Wikipedia: Capital One Financial Corporation is an American
       | bank holding company specializing in credit cards, auto loans,
       | banking, and savings accounts, headquartered in McLean, Virginia
       | with operations primarily in the United States. It is on the list
       | of largest banks in the United States _and has developed a
       | reputation for being a technology-focused bank._
       | 
        | (italicisation is mine)
        
       ___________________________________________________________________
       (page generated 2021-05-10 23:01 UTC)