[HN Gopher] The Birth of Parquet
       ___________________________________________________________________
        
       The Birth of Parquet
        
       Author : whinvik
       Score  : 75 points
       Date   : 2024-05-07 11:09 UTC (1 day ago)
        
 (HTM) web link (sympathetic.ink)
 (TXT) w3m dump (sympathetic.ink)
        
       | jjgreen wrote:
       | Why is it not in Debian? Is there some deep and dark secret?
       | 
       | https://search.debian.org/cgi-bin/omega?DB=en&P=parquet
        
         | whinvik wrote:
         | Parquet is a file format. Should a file format be in Debian?
        
           | jjgreen wrote:
           | I mean library support for reading and writing it:
           | > apt-cache search hdf5 | wc -l       134       > apt-cache
           | search netcdf | wc -l       70       > apt-cache search
           | parquet | wc -l        0
        
             | whinvik wrote:
             | Are there Arrow libraries. Feel like a lot of applications
             | that read Parquet actually outsource raw reads to Arrow.
        
             | jjtheblunt wrote:
             | look for polars, https://pola.rs
        
               | jjgreen wrote:
               | https://packages.debian.org/search?searchon=names&keyword
               | s=p...
        
               | petre wrote:
               | Use pip, Luke!
               | 
               | https://docs.pola.rs/user-guide/getting-started
        
               | jjgreen wrote:
                | My question is "why isn't it in Debian?" I ask because
                | Debian has rather high standards, and absence from Debian
                | suggests some quality issue in the available libraries for
                | the format, or in the format itself.
               | 
               | Are there dark secrets?
        
               | jjtheblunt wrote:
                | Could the dark secret you seek be "Debian isn't for
                | bleeding-edge packages"?
                | 
                | It's very modern and perhaps hasn't been around long enough
                | for Debian maintainers to feel it's vetted.
                | 
                | For instance, the documentation for the Python bindings is
                | more advanced than for the Rust bindings, but the package
                | itself uses Rust at the low level.
        
               | jjgreen wrote:
                | Parquet is what, 12 years old? Hardly cutting edge. What
                | you say may well be true for polars (I'm not familiar with
                | it); if/when it (or something else) does get packaged I'll
                | give parquet another look ...
        
               | jjtheblunt wrote:
                | Yes, I wasn't clear: it's the polars library that's
                | actively changing, so that might be the issue, or just
                | the vast set of optional components configurable on
                | installation, which isn't the normal package manager
                | experience.
                | 
                | FWIW I think I share your general aversion to _not_ using
                | packages, just for the tidiness of installs and removals,
                | though I'm on Fedora and macOS.
        
               | jjtheblunt wrote:
                | Why did you ignore the link I gave you above?
                | 
                | If you follow that link, you'll see polars and parquet are
                | a large, highly configurable collection of tools for
                | format manipulation across many HPC formats. Debian
                | maintainers possibly don't want to bundle the entirety, as
                | it would be vast.
                | 
                | Might this help you, though?
               | 
               | https://cloudsmith.io/~opencpn/repos/polar-
               | prod/packages/det...
        
             | marginalia_nu wrote:
              | DuckDB is probably what you want, though I don't think it's
              | in Debian either. It is in Arch, though.
        
         | mistrial9 wrote:
          | pandas is a Python-centric, tabular data handler that works
          | well in clouds (and on desktop Debian). Pandas can read parquet
          | data today, as can other libs mentioned here. The binary .so
          | driver style is single-host centric and not the emphasis of
          | these cloudy projects (and their cloudy funders).
          | 
          | https://pandas.pydata.org/docs/reference/api/pandas.read_par...
          | 
          | https://packages.debian.org/buster/python3-pandas
          | 
          | Perhaps more alarm would be called for if this python+pandas
          | and parquet combination did not work on Debian, but that is not
          | the case today.
          | 
          | ps - data access in clouds often uses an s3:// endpoint, in
          | contrast to a POSIX endpoint using _fread()_ or similar. Many
          | parquet-aware clients prefer the cloudy, un-POSIX method of
          | accessing data, and that is another reason it is not a simple
          | package in Debian today.
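          | 
          | A minimal sketch of that pandas path (assuming a hypothetical
          | bucket, and that pyarrow or fastparquet plus s3fs are
          | installed):
          | ```
          | import pandas as pd
          | 
          | # pandas delegates the Parquet decoding to pyarrow (or fastparquet)
          | # and the s3:// access to s3fs; the path here is a made-up example.
          | df = pd.read_parquet("s3://my-bucket/data.parquet", engine="pyarrow")
          | ```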
        
           | jjgreen wrote:
            | As I understand it, pandas can read parquet if the pyarrow or
            | fastparquet packages are available, but those aren't packaged
            | in Debian, and attempts to fix that have been underway for
            | several years.
           | 
           | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
        
             | mistrial9 wrote:
             | hm
             | 
             | https://github.com/pandas-
             | dev/pandas/blob/main/pandas/io/par...
             | 
             | says "fastparquet" engine must be available if no pyarrow
             | 
             | https://pypi.org/project/fastparquet/
        
           | datadrivenangel wrote:
              | Pandas often has significant memory overhead, so it's not
              | uncommon to need ~3-5x your file size in memory.
           | 
           | Polars and DuckDB are much better about memory management.
        
             | marton78 wrote:
             | Also, Polars has a sane and well thought out API, unlike
             | the hot streaming mess that is Pandas.
        
       | calderwoodra wrote:
        | I've been struggling with a tough parquet problem for a few
        | months now.
        | 
        | I have a 15 GB parquet file in an S3 bucket and I need to "unzip"
        | and extract every row from the file to write into my database.
        | The contents of the file are emails and I need to integrate them
        | into our search function.
       | 
       | Is this possible to do without an unreasonable amount of RAM? Are
       | there any affordable services that can help here?
       | 
       | Feel free to contact me (email in bio), happy to pay for a
       | consult at the minimum.
        
         | martinky24 wrote:
         | Wouldn't partial reads fix the RAM problem? e.g. something like
         | this: https://stackoverflow.com/a/69888274
         | 
          | It might not be fast, but a quick one-off solution that you let
          | run for a while would probably do the job. There shouldn't be
          | a need to load the whole file into memory.
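          | 
          | One way to do such a partial read (not necessarily what the
          | linked answer does) is to scan the file as a pyarrow dataset
          | and process it batch by batch; the path and batch size below
          | are hypothetical:
          | ```
          | import pyarrow.dataset as ds
          | 
          | # Scan the Parquet file lazily and pull it through in fixed-size
          | # batches instead of materializing all 15 GB at once.
          | dataset = ds.dataset("s3://my-bucket/emails.parquet", format="parquet")
          | for batch in dataset.to_batches(batch_size=50_000):
          |     rows = batch.to_pylist()   # list of dicts for this batch only
          |     ...                        # write this chunk into the database
          | ```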
        
         | memset wrote:
         | I can help! Emailing you now :)
         | 
            | Our company (scratchdata.com, open source) is literally built
            | to solve the problem of schlepping large amounts of data
            | between sources and destinations, so I have worked on this
            | problem a lot personally and am happy to nerd out about what
            | works.
        
           | fock wrote:
            | Coming from an HPC background, I'm wondering quite a bit how
            | it happened that 15 GB files are considered large data. I'm
            | not a heavy parquet user, but:
            | 
            | - does this decompress to a giant size?
            | 
            | - can't you split the file easily, since it consists of
            | row-based segments?
            | 
            | - why does it take months to solve this for _one_ file?
        
             | semi-extrinsic wrote:
             | As a fellow HPC user, I tried a couple of years ago to do a
             | tricky data format conversion using these newfangled tools.
             | I was essentially just taking a huge (multi-terabyte) 3D
             | dataset, transposing it and changing the endianness.
             | 
              | The solutions I was able to put together using Dask and
              | Spark and such were all insanely slow; they just got killed
              | by Slurm without getting anywhere. In the end I went back
              | to good ole' shell scripting with xxd to handle most of the
              | heavy lifting. Finished in under an hour.
             | 
             | The appeal of these newfangled tools is that you can work
             | with data sizes that are infeasible to people who only know
             | Excel, yet you don't need to understand a single thing
             | about how your data is actually stored.
             | 
             | If you can be bothered to read the file format
             | specification, open up some files in a hex editor to
             | understand the layout, and write low-level code to parse
             | the data - then you can achieve several orders of magnitude
             | higher performance.
        
         | dijksterhuis wrote:
         | Try pyarrow.ParquetFile.iter_batches()
         | 
         | Streams batches of rows
         | 
         | https://arrow.apache.org/docs/python/generated/pyarrow.parqu...
         | 
         | Edit -- May need to do some extra work with s3fs too from what
         | I recall with the default pandas s3 reading
         | 
         | Edit 2 -- or check out pyarrow.fs.S3FileSystem :facepalm:
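          | 
          | Roughly, a sketch of that combination (bucket, key, and column
          | names are hypothetical):
          | ```
          | import pyarrow.parquet as pq
          | from pyarrow import fs
          | 
          | s3 = fs.S3FileSystem()  # picks up the usual AWS credential chain
          | with s3.open_input_file("my-bucket/emails.parquet") as f:
          |     pf = pq.ParquetFile(f)
          |     # Stream fixed-size batches instead of reading the whole file.
          |     for batch in pf.iter_batches(batch_size=10_000, columns=["id", "body"]):
          |         ...  # index this batch into the search backend
          | ```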
        
           | calderwoodra wrote:
            | I've spent many, many hours trying these suggestions and
            | didn't have much luck. iter_batches loads the whole file (or
            | some very large amount of it) into memory.
        
             | semi-extrinsic wrote:
              | It sounds like maybe your parquet file has no partitioning.
              | Apart from iterating over row groups as someone else
              | suggested, I suspect there is no better solution than
              | downloading the whole thing to your computer, partitioning
              | it in a sane way, and uploading it again. It's only 15 GB,
              | so it should be fine even on an old laptop.
             | 
             | Of course then you might as well do all the processing
             | you're interested in while the file is on your local disk,
             | since it is probably much faster than the cloud service
             | disk.
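              | 
              | For the local repartitioning step, a minimal sketch
              | (assuming made-up file names) is to split the downloaded
              | file into one smaller file per row group with pyarrow:
              | ```
              | import pyarrow.parquet as pq
              | 
              | # Split a big local Parquet file into one small file per row
              | # group, so each piece can be processed or re-uploaded alone.
              | pf = pq.ParquetFile("emails.parquet")
              | for i in range(pf.num_row_groups):
              |     table = pf.read_row_group(i)
              |     pq.write_table(table, f"emails_part_{i:04d}.parquet")
              | ```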
        
         | wild_egg wrote:
         | Have you given DuckDB a try? I'm using it to shuttle some hefty
         | data between postgres and some parquet files on S3 and it's
         | only a couple lines. Haven't noted any memory issues so far
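          | 
          | For reference, a rough sketch of that kind of DuckDB shuttle,
          | assuming the httpfs/aws/postgres extensions and a made-up
          | Postgres DSN, target table, and S3 path:
          | ```
          | import duckdb
          | 
          | con = duckdb.connect(config={"memory_limit": "2GB"})  # cap RAM use
          | for ext in ("httpfs", "aws", "postgres"):
          |     con.install_extension(ext)
          |     con.load_extension(ext)
          | 
          | con.sql("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
          | con.sql("ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE POSTGRES);")
          | con.sql("""
          |     INSERT INTO pg.emails  -- hypothetical target table
          |     SELECT * FROM read_parquet('s3://my-bucket/emails.parquet')
          | """)
          | ```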
        
           | csjh wrote:
           | Agreed on DuckDB, fantastic for working with most major data
           | formats
        
             | calderwoodra wrote:
             | Took your advice and tried DuckDB. Here's what I've got so
             | far:
             | 
              | ```
              | import duckdb
              | 
              | def _get_duck_db_arrow_results(s3_key):
              |     con = duckdb.connect(config={'threads': 1, 'memory_limit': '1GB'})
              |     con.install_extension("aws")
              |     con.install_extension("httpfs")
              |     con.load_extension("aws")
              |     con.load_extension("httpfs")
              |     con.sql("CALL load_aws_credentials('hadrius-dev', set_region=true);")
              |     con.sql("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
              |     results = con \
              |         .execute(f"SELECT * FROM read_parquet('{s3_key}');") \
              |         .fetch_record_batch(1024)
              |     for index, result in enumerate(results):
              |         print(index)
              |     return results
              | ```
             | 
              | I ran the above on a 1.4 GB parquet file and 15 minutes
              | later, all of the results were printed at once. This
              | suggests to me that the whole file was loaded into memory
              | at once.
        
         | lonesword wrote:
         | I work with pyspark and parquet quite a lot. I never had to
         | deal with parquet outside spark, but this is how I would do
         | this:
         | 
         | - Write a pandas_udf function in pyspark.
         | 
          | - Partition your data into smaller bits so that the pandas_udf
          | does not get too much data at the same time.
         | 
         | Something like:
         | 
          | ```
          | import pandas as pd
          | from pyspark.sql import SparkSession
          | import pyspark.sql.functions as f
          | 
          | @f.pandas_udf(returnType="long")  # use whatever return type fits
          | def ingest(doc: pd.Series) -> pd.Series:
          |     # doc is a pandas Series now
          |     # your processing goes here -> write to DB etc.
          |     # return a Series of zeros just to make Spark happy
          |     return pd.Series([0] * len(doc))
          | 
          | spark = SparkSession.builder.getOrCreate()
          | 
          | df = spark.read.parquet("s3 path")
          | df = df.repartition(1000)  # bump up this number if you run into memory issues
          | df = df.withColumn("foo", ingest(f.col("doc_column")))
          | ```
         | 
         | Now the trick is, you can limit how much data is given to your
         | pandas_udf by repartitioning your data. The more the
         | partitions, the smaller the pd.Series that your pandas_udf
         | gets. There's also the
         | `spark.sql.execution.arrow.maxRecordsPerBatch` config that you
         | can set in spark to limit memory consumption.
         | 
         | ^ Probably overkill to bring spark into the equation, but this
         | is one way to do it.
         | 
          | You can use a normal udf (i.e. `f.udf()`) instead of a
          | pandas_udf, but apparently that's slower due to Java <-> Python
          | serialization.
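          | 
          | For completeness, setting that Arrow batch-size config when
          | building the session looks roughly like this (the value is
          | just an example):
          | ```
          | from pyspark.sql import SparkSession
          | 
          | # Caps how many rows Spark hands to each pandas_udf call via Arrow.
          | spark = (
          |     SparkSession.builder
          |     .config("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
          |     .getOrCreate()
          | )
          | ```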
        
           | fifilura wrote:
           | Pyspark is probably the way to go.
           | 
           | I just wanted to mention that AWS Athena eats 15G parquet
           | files for breakfast.
           | 
           | It is trivial to map the file into Athena.
           | 
            | But you can't connect it to anything other than file output.
            | It can, however, help you to, for example, write the data out
            | in smaller chunks, or choose another output format such as CSV
            | (although arbitrary email content in a CSV feels like a setup
            | for parsing errors).
           | 
           | The benefit is that there is virtually no setup cost. And
           | processing cost for a 15G file will be just a few cents.
        
             | calderwoodra wrote:
             | Athena is probably my best bet tbh, especially if I can do
             | a few clicks and just get smaller files. Processing smaller
             | files is a no brainer / pretty easy and could be outsourced
             | to lambda.
        
               | fifilura wrote:
               | Yeah the big benefit is that it requires very little
               | setup.
               | 
               | You create a new partitioned table/location from the
               | originally mapped file using a CTAS like so:
                | CREATE TABLE new_table_name
                | WITH (
                |   format = 'PARQUET',
                |   parquet_compression = 'SNAPPY',
                |   external_location = 's3://your-bucket/path/to/output/',
                |   partitioned_by = ARRAY['partition_column_name']
                | ) AS
                | SELECT *
                | FROM original_table_name
               | 
               | You can probably create a hash and partition by the last
               | character if you want 16 evenly sized partitions. Unless
               | you already have a dimension to partition by.
        
           | jcgrillo wrote:
           | It's been a while (~5yr) since I've done anything with Spark,
           | but IIRC it used to be very difficult to make reliable jobs
           | with the Java or Python APIs due to the impedance mismatch
           | between Scala's lazy evaluation semantics and the eager
           | evaluation of Java and Python. I'd encounter perplexing OOMs
           | whenever I tried to use the Python or Java APIs, so I
           | (reluctantly) learned enough Scala to make the Spark go brr
           | and all was well. Is it still like this?
        
         | cycrutchfield wrote:
          | My suggestion is to load each row group individually, as row
          | groups will generally be much smaller than your total file
          | size. You can do this via pyarrow.ParquetFile.read_row_group.
          | To truly optimize this for reading from S3 you could use
          | fsspec's open_parquet_file, which lets you fetch one row group
          | at a time.
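          | 
          | A rough sketch of that row-group-at-a-time pattern, assuming a
          | hypothetical S3 URL and that fsspec/s3fs credentials are
          | configured:
          | ```
          | import fsspec
          | import pyarrow.parquet as pq
          | from fsspec.parquet import open_parquet_file
          | 
          | URL = "s3://my-bucket/emails.parquet"  # made-up path
          | 
          | # Read just the footer metadata to learn how many row groups exist.
          | with fsspec.open(URL, "rb") as f:
          |     num_row_groups = pq.ParquetFile(f).num_row_groups
          | 
          | # Then fetch and process one row group at a time.
          | for i in range(num_row_groups):
          |     with open_parquet_file(URL, row_groups=[i], engine="pyarrow") as f:
          |         table = pq.ParquetFile(f).read_row_group(i)
          |         ...  # write this chunk into the database
          | ```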
        
         | jfim wrote:
         | Spend a few bucks on an EC2 instance with a few terabytes of
         | RAM for an hour or so. u-3tb1.56xlarge is about $27/hr.
        
         | arrowleaf wrote:
          | Using polars in Python I've gotten something similar to work,
          | using LazyFrame and collect in streaming mode:
         | 
          | ```
          | import json
          | import polars as pl
          | 
          | df = pl.scan_parquet('tmp/' + DUMP_NAME + '_cleaned.parquet')
          | 
          | with open('tmp/' + DUMP_NAME + '_cleaned.jsonl', mode='w',
          |           newline='\n', encoding='utf8') as f:
          |     for row in df.collect(streaming=True).iter_rows(named=True):
          |         row = {k: v for k, v in row.items()
          |                if (v is not None and v != [] and v != '')}
          |         f.write(json.dumps(row, default=str) + '\n')
          | ```
        
         | chrisjc wrote:
         | DuckDB?
         | 
         | https://duckdb.org/2024/03/29/external-aggregation.html
         | 
         | https://duckdb.org/2021/06/25/querying-parquet.html
         | 
          | If your DB is MySQL or Postgres, then you could read a stream
          | from parquet, transform it inline, and write out to your DB:
         | https://duckdb.org/2024/01/26/multi-database-support-in-duck...
         | 
          | And an unrelated but interesting read about the Parquet bomb:
         | 
         | https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
        
           | calderwoodra wrote:
           | https://news.ycombinator.com/item?id=40301549
        
         | bluedemon wrote:
         | Perhaps look into using dlt from https://dlthub.com, using
         | pyarrow or polars. It handles large datasets well, especially
         | when using generators to process the data in chunks.
        
       | fizx wrote:
       | Fun story time: I was at twitter for a few years and tended to
       | write quick hacks that people wanted to replace with better
       | engineering.
       | 
        | We never had Scala Thrift bindings, and the Java ones were
        | awkward from Scala, so I wrote a Thrift plugin in JRuby that used
        | the Ruby Thrift parser and ERB templates to output some Scala
        | code. Integrated with our build pipeline, it worked great for the
        | company.
       | 
       | I also wrote one era of twitter's service deploy system on a
       | hacked up Capistrano.
       | 
       | These projects took a few days because they were dirty hacks, but
       | I still got a below perf review for getting easily distracted,
       | because I didn't yet know how to sell those company-wide
       | projects.
       | 
       | Anyhow, about a month before that team kicked off Parquet, I
       | showed them a columnar format I made for a hackweek based on
       | Lucene's codec packages, and was using to power a mixpanel-alike
       | analytics system.
       | 
       | I'm not sure whether they were inspired or terrified that my hack
       | would reach production, but I like to think I had a small hand in
       | getting Parquet kickstarted.
        
       | pradeepchhetri wrote:
        | Reading through this blog, it seems to me that Parquet is a lot
        | like ClickHouse's native data format.
        | 
        | The best part of ClickHouse's native data format is that I can
        | use the same ClickHouse queries locally or on a remote
        | server/cluster and let ClickHouse decide how to use the available
        | resources in the most performant way.
        | 
        | ClickHouse has a native (and the fastest) integration with
        | Parquet, so I can:
        | 
        | - Query local/S3 parquet data from the command line using
        | clickhouse-local.
        | 
        | - Query large amounts of local/S3 data programmatically by
        | offloading it to a ClickHouse server/cluster, which can do the
        | processing in a distributed fashion.
        
         | pradeepchhetri wrote:
          | If you are interested in the internals of using Parquet with
          | ClickHouse, read the following articles:
         | 
         | - https://clickhouse.com/blog/apache-parquet-clickhouse-
         | local-...
         | 
         | - https://clickhouse.com/blog/apache-parquet-clickhouse-
         | local-...
        
       ___________________________________________________________________
       (page generated 2024-05-08 23:01 UTC)