[HN Gopher] The Birth of Parquet
___________________________________________________________________
The Birth of Parquet
Author : whinvik
Score : 131 points
Date : 2024-05-07 11:09 UTC (2 days ago)
(HTM) web link (sympathetic.ink)
(TXT) w3m dump (sympathetic.ink)
| jjgreen wrote:
| Why is it not in Debian? Is there some deep and dark secret?
|
| https://search.debian.org/cgi-bin/omega?DB=en&P=parquet
| whinvik wrote:
| Parquet is a file format. Should a file format be in Debian?
| jjgreen wrote:
| I mean library support for reading and writing it:
|       > apt-cache search hdf5 | wc -l
|       134
|       > apt-cache search netcdf | wc -l
|       70
|       > apt-cache search parquet | wc -l
|       0
| whinvik wrote:
| Are there Arrow libraries? It feels like a lot of applications
| that read Parquet actually outsource the raw reads to Arrow.
| jjtheblunt wrote:
| look for polars, https://pola.rs
| jjgreen wrote:
| https://packages.debian.org/search?searchon=names&keywords=p...
| petre wrote:
| Use pip, Luke!
|
| https://docs.pola.rs/user-guide/getting-started
| jjgreen wrote:
| My question is "why isn't it in Debian?", I ask that
| since Debian has rather high standards and the absence
| from Debian suggests some quality issue in available
| libraries for the format or the format itself.
|
| Are there dark secrets?
| jjtheblunt wrote:
| Could the dark secret you seek be "debian isn't for
| bleeding edge packages"?
|
| It's very modern and perhaps hasn't been around long enough
| for Debian maintainers to feel it's vetted.
|
| For instance, documentation for the Python bindings is more
| advanced than for the Rust bindings, but the package itself
| uses Rust at the low level.
| jjgreen wrote:
| Parquet is what, 12 years old? Hardly cutting edge. What
| you say may well be true for polars (I'm not familiar with
| it); if/when it (or something else) does get packaged
| I'll give parquet another look ...
| jjtheblunt wrote:
| Yes, I wasn't clear: it's the polars library that's
| actively changing, so that might be the issue, or just
| the vast set of optional components configurable on
| installation, which isn't the normal package manager
| experience.
|
| FWIW I think I share your general aversion to _not_ using
| packages, just for the tidiness of installs and removals,
| though I'm on Fedora and macOS.
| petre wrote:
| Pandas is probably in Debian and it can read parquet
| files. Polars is fairly new and under active development.
| It's a Python library; I install those in $HOME/.local,
| as opposed to system-wide. One can also install it in a
| venv. With pip you can also uninstall packages and keep
| things fairly tidy.
| jjgreen wrote:
| Pandas is in Debian but it cannot read parquet files
| itself; it uses 3rd party "engines" for that purpose and
| those are not available in Debian:
|       Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
|       Type "help", "copyright", "credits" or "license" for more information.
|       >>> import pandas
|       >>> pandas.read_parquet('sample3.parquet')
|       Traceback (most recent call last):
|         File "<stdin>", line 1, in <module>
|         File "/usr/lib/python3/dist-packages/pandas/io/parquet.py", line 493, in read_parquet
|           impl = get_engine(engine)
|         File "/usr/lib/python3/dist-packages/pandas/io/parquet.py", line 53, in get_engine
|           raise ImportError(
|       ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
|       A suitable version of pyarrow or fastparquet is required for parquet support.
| petre wrote:
| Then the options you are left with are either polars via
| pip or a third party parquet-tools Debian package.
|
| https://github.com/hangxie/parquet-tools/blob/main/USAGE.md#...
| jjtheblunt wrote:
| polars via Rust (cargo) also
| jjtheblunt wrote:
| Why did you ignore the link i gave you above?
|
| If you follow that link, you'll see polars and parquet
| are a large highly configurable collection of tools for
| format manipulations across many HPC formats. Debian
| maintainers possibly don't want to bundle the entirety,
| as it would be vast.
|
| Might this help you, though?
|
| https://cloudsmith.io/~opencpn/repos/polar-prod/packages/det...
| marginalia_nu wrote:
| Duckdb is probably what you want, though I don't think it's
| in debian either. It's in Arch though.
| mistrial9 wrote:
| pandas is a Python-centric tabular data handler that works
| well in clouds (and on desktop Debian). Pandas can read parquet
| data today, as can other libs mentioned here. The binary .so
| driver style is single-host centric and not the emphasis of
| these cloudy projects (and their cloudy funders).
|
| https://pandas.pydata.org/docs/reference/api/pandas.read_par...
|
| https://packages.debian.org/buster/python3-pandas
|
| Perhaps more alarm would be called for if this python+pandas
| and parquet combination did not work on Debian, but that is
| not the case today.
|
| ps - data access in clouds often uses the S3:// endpoint.
| Contrast that to a POSIX endpoint using _fread()_ or similar:
| many parquet-aware clients prefer the cloudy, un-POSIX method
| to access data, and that is another reason it is not a simple
| package in Debian today.
| jjgreen wrote:
| As I understand it, pandas can read parquet if the pyarrow or
| fastparquet packages are available, but that's not the case
| and attempts to fix that have been underway for several
| years.
|
| https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
| mistrial9 wrote:
| hm
|
| https://github.com/pandas-dev/pandas/blob/main/pandas/io/par...
|
| says "fastparquet" engine must be available if no pyarrow
|
| https://pypi.org/project/fastparquet/
| jjgreen wrote:
| See https://news.ycombinator.com/item?id=40306445
| datadrivenangel wrote:
| Pandas often has significant memory overhead, so it's not
| uncommon to need ~3-5x the amount of memory as your file
| size.
|
| Polars and DuckDB are much better about memory management.
| marton78 wrote:
| Also, Polars has a sane and well thought out API, unlike
| the hot streaming mess that is Pandas.
| pabs3 wrote:
| No-one with the combination of motivation, time and skills
| needed has gotten it into Debian. Someone wanted to get a Python
| implementation in, but it looks like they never found the time.
|
| https://bugs.debian.org/838338
|
| These days Debian packaging has become a bit irrelevant, since
| you can just shove upstream releases into a container and go
| for it.
| mkesper wrote:
| If you want to depend on curl|bash setups, sure. If you want
| to get security updates, good packaging is still required.
| calderwoodra wrote:
| I've been struggling with a tough parquet problem for a few
| months now.
|
| I have a 15 GB parquet file in an S3 bucket and I need to
| "unzip" and extract every row from the file to write into my
| database. The contents of the file are emails and I need to
| integrate them into our search function.
|
| Is this possible to do without an unreasonable amount of RAM? Are
| there any affordable services that can help here?
|
| Feel free to contact me (email in bio), happy to pay for a
| consult at the minimum.
| martinky24 wrote:
| Wouldn't partial reads fix the RAM problem? e.g. something like
| this: https://stackoverflow.com/a/69888274
|
| It might not be fast, but a quick 1-off solution that you let
| run for a while would probably do that job. There shouldn't be
| a need to load the whole file into memory.
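|
| A rough sketch of that idea (file and table names are made up;
| it assumes the parquet has been downloaded locally):
|
|       import pyarrow.parquet as pq
|
|       pf = pq.ParquetFile("emails.parquet")
|       for batch in pf.iter_batches(batch_size=10_000):
|           chunk = batch.to_pandas()   # ~10k rows at a time, never the whole file
|           # insert `chunk` into the database here, e.g. chunk.to_sql(...)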
| memset wrote:
| I can help! Emailing you now :)
|
| Our company (scratchdata.com, open source) is literally built
| to solve the problem of schlepping large amounts of data
| between sources and destinations, so I have worked on this
| problem a lot personally and happy to nerd out about what
| works.
| fock wrote:
| Coming from an HPC background, I am wondering quite a bit how
| it happened that 15 GB files are considered large data. Not
| being a crazy parquet user, but:
|
| - does this decompress to giant sizes?
| - can't you split the file easily, because it includes row-based segments?
| - why does it take months to solve this for _one_ file?
| semi-extrinsic wrote:
| As a fellow HPC user, I tried a couple of years ago to do a
| tricky data format conversion using these newfangled tools.
| I was essentially just taking a huge (multi-terabyte) 3D
| dataset, transposing it and changing the endianness.
|
| The solutions I was able to put together using Dask and
| Spark and such were all insanely slow, they just got killed
| by Slurm without getting anywhere. In the end I went back
| to good ole' shell scripting with xxd to handle most of the
| heavy lifting. Finished in under an hour.
|
| The appeal of these newfangled tools is that you can work
| with data sizes that are infeasible to people who only know
| Excel, yet you don't need to understand a single thing
| about how your data is actually stored.
|
| If you can be bothered to read the file format
| specification, open up some files in a hex editor to
| understand the layout, and write low-level code to parse
| the data - then you can achieve several orders of magnitude
| higher performance.
| fifilura wrote:
| I think command line tools are going to be fine if all you
| do is process one row at a time, or if your data has a
| known order.
|
| But if you want to do some kind of grouping or for
| example pivoting rows to columns, I think you will still
| benefit from a distributed tool like Spark or Trino. That
| can do the map/reduce job for you in a distributed way.
| jjtheblunt wrote:
| Parquet is column oriented and so row-based manipulation
| can be inefficient
| fock wrote:
| > Hierarchically, a file consists of one or more row
| groups.
|
| https://parquet.apache.org/docs/concepts/
|
| Maybe the file in question only has one row group. Which
| would be weird, because the creator had to go out of
| their way to make it happen.
| jjtheblunt wrote:
| Yep. I use it all the time. But, as you said, it depends on
| specific layouts, so you can't expect it to be row-convenient.
| memset wrote:
| Because most people don't have an HPC background, aren't
| familiar with parquet internals, don't know how to make
| their language stream data instead of buffering it all in
| memory, have slow internet connections at home, are running
| out of disk space on their laptops, and only have 4 GB of
| ram to work with after Chrome and Slack take up the other
| 12 GB.
|
| 15 GB is a real drag to do anything with. So it's a real
| pain when someone says "I'll just give you 1 TB worth of
| parquet in S3", the equivalent of dropping a billion
| dollars on someone's doorstep in $1 bills.
| vladsanchez wrote:
| Funny analogy! I loved it. I'm ready to start with
| ScratchData, which, btw and respectfully, I'd never heard of.
|
| Thanks again for sharing your tool and insightful
| knowledge.
| fifilura wrote:
| How do you see the competition from Trino and Athena in
| your case?
|
| Depends a lot on what you want to do with the data of
| course, but if you want to filter and slice/dice it, my
| experience is that it is really fast and stable. And if
| you already have it on s3, the threshold for using it is
| extremely small.
| fock wrote:
| what is your point? They talked about 15GB of parquet -
| what does this have to do with 1TB of parquet?
|
| Also: How does the tool you sell here solve the problem -
| the data is already there and can't be processed (15GB -
| funny that seems to be the scale of YC startups?)? How
| does a tool to transfer the data into a new database help
| here?
| dijksterhuis wrote:
| Try pyarrow.ParquetFile.iter_batches()
|
| Streams batches of rows
|
| https://arrow.apache.org/docs/python/generated/pyarrow.parqu...
|
| Edit -- May need to do some extra work with s3fs too from what
| I recall with the default pandas s3 reading
|
| Edit 2 -- or check out pyarrow.fs.S3FileSystem :facepalm:
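|
| A minimal sketch of how that could be wired together (bucket,
| key, region and batch size are placeholders, not from the
| thread):
|
|       import pyarrow.parquet as pq
|       from pyarrow import fs
|
|       s3 = fs.S3FileSystem(region="us-east-1")
|       with s3.open_input_file("my-bucket/path/emails.parquet") as f:
|           pf = pq.ParquetFile(f)
|           for batch in pf.iter_batches(batch_size=5_000):
|               rows = batch.to_pylist()   # list of dicts for this batch only
|               # index `rows` into the search backend here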
| calderwoodra wrote:
| I've spent many many hours trying these suggestions, didn't
| have much luck. iter_batches loads the whole file (or some
| very large amount of it) into memory.
| semi-extrinsic wrote:
| It sounds like maybe your parquet file has no partitioning.
| Apart from the iterating over row groups like someone else
| suggested, I suspect there is no better solution than
| downloading the whole thing to your computer, partitioning
| it in a sane way, and uploading it again. It's only 15 GB
| so it should be fine even on an old laptop.
|
| Of course then you might as well do all the processing
| you're interested in while the file is on your local disk,
| since it is probably much faster than the cloud service
| disk.
| wild_egg wrote:
| Have you given DuckDB a try? I'm using it to shuttle some hefty
| data between postgres and some parquet files on S3 and it's
| only a couple lines. Haven't noted any memory issues so far
| csjh wrote:
| Agreed on DuckDB, fantastic for working with most major data
| formats
| calderwoodra wrote:
| Took your advice and tried DuckDB. Here's what I've got so
| far:
|
| ```
| import duckdb
|
| def _get_duck_db_arrow_results(s3_key):
|     con = duckdb.connect(config={'threads': 1, 'memory_limit': '1GB'})
|     con.install_extension("aws")
|     con.install_extension("httpfs")
|     con.load_extension("aws")
|     con.load_extension("httpfs")
|     con.sql("CALL load_aws_credentials('hadrius-dev', set_region=true);")
|     con.sql("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
|     results = con \
|         .execute(f"SELECT * FROM read_parquet('{s3_key}');") \
|         .fetch_record_batch(1024)
|     for index, result in enumerate(results):
|         print(index)
|     return results
| ```
|
| I ran the above on a 1.4 GB parquet file and 15 min later,
| all of the results were printed at once. This suggests to
| me that the whole file was loaded into memory at once.
| wild_egg wrote:
| It's been a long time since I've used python but that
| sounds like buffering in the library maybe? I use it from
| Go and it seems to behave differently.
|
| When I'm writing to postgres, though, I'm doing it
| entirely inside DuckDB with an `INSERT INTO ... SELECT
| ...` and that seems to stream it over.
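|
| Roughly what that looks like (a sketch, not wild_egg's actual
| code; the connection string, table and S3 path are invented,
| and S3 credentials are assumed to be configured already):
|
|       import duckdb
|
|       con = duckdb.connect()
|       con.install_extension("httpfs")
|       con.load_extension("httpfs")
|       con.install_extension("postgres")
|       con.load_extension("postgres")
|       con.sql("ATTACH 'dbname=search user=app host=localhost' AS pg (TYPE POSTGRES);")
|       # DuckDB reads the parquet and pushes rows into postgres itself
|       con.sql("""
|           INSERT INTO pg.emails
|           SELECT * FROM read_parquet('s3://my-bucket/emails.parquet');
|       """)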
| akdor1154 wrote:
| You asked it to fetch a batch (15min) then iterated over
| the batch (all at once).
|
| To stream, fetch more batches.
|
| What ddb does to get the batches depends on hand-wavy
| magic around available RAM, and also on the structure of the
| parquet.
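|
| In duckdb's Python API that would look something like this
| (a sketch; the query and batch size are just examples):
|
|       import duckdb
|
|       con = duckdb.connect()
|       reader = con.execute(
|           "SELECT * FROM read_parquet('emails.parquet')"
|       ).fetch_record_batch(1024)   # pyarrow.RecordBatchReader
|
|       while True:
|           try:
|               batch = reader.read_next_batch()   # pulls ~1024 rows at a time
|           except StopIteration:
|               break
|           df = batch.to_pandas()   # do the real per-batch work here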
| lonesword wrote:
| I work with pyspark and parquet quite a lot. I never had to
| deal with parquet outside spark, but this is how I would do
| this:
|
| - Write a pandas_udf function in pyspark.
|
| - Partition your data into smaller bits so that the pandas_udf
| does not get too much data at the same time.
|
| Something like:
|
| ```
| import pandas as pd
| from pyspark.sql import SparkSession
| import pyspark.sql.functions as f
|
| @f.pandas_udf("long")
| def ingest(doc: pd.Series) -> pd.Series:
|     # doc is a pandas Series now
|     # your processing goes here -> write to DB etc.
|     # return a Series of zeros just to make spark happy
|     return pd.Series([0] * len(doc))
|
| spark = SparkSession.builder.getOrCreate()
|
| df = spark.read.parquet("s3 path")
|
| df = df.repartition(1000)  # bump up this number if you run into memory issues
|
| df = df.withColumn("foo", ingest(f.col("doc_column")))
| ```
|
| Now the trick is, you can limit how much data is given to your
| pandas_udf by repartitioning your data. The more the
| partitions, the smaller the pd.Series that your pandas_udf
| gets. There's also the
| `spark.sql.execution.arrow.maxRecordsPerBatch` config that you
| can set in spark to limit memory consumption.
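|
| Setting that config on the `spark` session from the snippet
| above is a one-liner (1000 is just an example value, not a
| recommendation):
|
|       spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")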
|
| ^ Probably overkill to bring spark into the equation, but this
| is one way to do it.
|
| You can use a normal udf (i.e. `f.udf()`) instead of a
| pandas_udf, but apparently that's slower due to Java <-> Python
| serialization.
| fifilura wrote:
| Pyspark is probably the way to go.
|
| I just wanted to mention that AWS Athena eats 15G parquet
| files for breakfast.
|
| It is trivial to map the file into Athena.
|
| But you can't connect it to anything other than file output.
| It can, however, help you to, for example, write it out in
| smaller chunks, or choose another output format such as CSV
| (although arbitrary email content in a CSV feels like you are
| set up for parsing errors).
|
| The benefit is that there is virtually no setup cost. And
| processing cost for a 15G file will be just a few cents.
| calderwoodra wrote:
| Athena is probably my best bet tbh, especially if I can do
| a few clicks and just get smaller files. Processing smaller
| files is a no brainer / pretty easy and could be outsourced
| to lambda.
| fifilura wrote:
| Yeah the big benefit is that it requires very little
| setup.
|
| You create a new partitioned table/location from the
| originally mapped file using a CTAS like so:
|       CREATE TABLE new_table_name
|       WITH (
|           format = 'PARQUET',
|           parquet_compression = 'SNAPPY',
|           external_location = 's3://your-bucket/path/to/output/',
|           partitioned_by = ARRAY['partition_column_name']
|       ) AS
|       SELECT * FROM original_table_name
|
| You can probably create a hash and partition by the last
| character if you want 16 evenly sized partitions. Unless
| you already have a dimension to partition by.
| jcgrillo wrote:
| It's been a while (~5yr) since I've done anything with Spark,
| but IIRC it used to be very difficult to make reliable jobs
| with the Java or Python APIs due to the impedance mismatch
| between Scala's lazy evaluation semantics and the eager
| evaluation of Java and Python. I'd encounter perplexing OOMs
| whenever I tried to use the Python or Java APIs, so I
| (reluctantly) learned enough Scala to make the Spark go brr
| and all was well. Is it still like this?
| cycrutchfield wrote:
| My suggestion is to load each row group individually, as they
| generally will be much smaller than your total file size. You
| can do this via pyarrow.ParquetFile.read_row_group. To truly
| optimize this for reading from s3 you could use fsspec's
| open_parquet_file helper, which would allow you to load
| each row group one at a time.
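|
| A small sketch of the row-group loop (s3fs is one way to get a
| file handle; the bucket/key are placeholders):
|
|       import pyarrow.parquet as pq
|       import s3fs
|
|       fs = s3fs.S3FileSystem()
|       with fs.open("my-bucket/path/emails.parquet", "rb") as f:
|           pf = pq.ParquetFile(f)
|           for i in range(pf.num_row_groups):
|               table = pf.read_row_group(i)   # one row group in memory at a time
|               # process table.to_pandas() here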
| jfim wrote:
| Spend a few bucks on an EC2 instance with a few terabytes of
| RAM for an hour or so. u-3tb1.56xlarge is about $27/hr.
| nucleardog wrote:
| This is the answer for a one-off or occasional problem unless
| your time is worthless.
|
| $200 to rent a machine that can run the naive solution for an
| entire day is peanuts compared to the dev time for a "better"
| solution. Running that machine for eight hours would only
| cost enough to purchase about a half day of junior engineer
| time.
| arrowleaf wrote:
| Using polars in Python I've gotten something similar to work,
| using LazyFrame and collect in streaming mode:
|
| ```
| import json
| import polars as pl
|
| # DUMP_NAME is defined elsewhere in the script
| df = pl.scan_parquet('tmp/' + DUMP_NAME + '_cleaned.parquet')
|
| with open('tmp/' + DUMP_NAME + '_cleaned.jsonl', mode='w',
|           newline='\n', encoding='utf8') as f:
|     for row in df.collect(streaming=True).iter_rows(named=True):
|         row = {k: v for k, v in row.items()
|                if (v is not None and v != [] and v != '')}
|         f.write(json.dumps(row, default=str) + '\n')
| ```
| akdor1154 wrote:
| This collects all into memory, then iterates.
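|
| One way to keep memory bounded is to pull fixed-size slices of
| the lazy frame in a loop (a sketch, leaning on polars' slice
| pushdown; the paths and chunk size are made up, and pl.len()
| needs a recent polars -- older versions use pl.count()):
|
|       import json
|       import polars as pl
|
|       CHUNK = 100_000
|       lf = pl.scan_parquet('tmp/dump_cleaned.parquet')
|       total = lf.select(pl.len()).collect().item()
|
|       with open('tmp/dump_cleaned.jsonl', 'w', encoding='utf8') as f:
|           for offset in range(0, total, CHUNK):
|               chunk = lf.slice(offset, CHUNK).collect()  # only this slice materialized
|               for row in chunk.iter_rows(named=True):
|                   f.write(json.dumps(row, default=str) + '\n')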
| chrisjc wrote:
| DuckDB?
|
| https://duckdb.org/2024/03/29/external-aggregation.html
|
| https://duckdb.org/2021/06/25/querying-parquet.html
|
| If your DB is mysql or postgres, then you could read a stream
| from parquet, transform inline and write out to your DB
|
| https://duckdb.org/2024/01/26/multi-database-support-in-duck...
|
| And an unrelated, but interesting read about the parquet bomb
|
| https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
| calderwoodra wrote:
| https://news.ycombinator.com/item?id=40301549
| bluedemon wrote:
| Perhaps look into using dlt from https://dlthub.com, using
| pyarrow or polars. It handles large datasets well, especially
| when using generators to process the data in chunks.
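|
| Very roughly, and assuming postgres credentials are already
| configured for dlt (the file, pipeline and table names here
| are invented):
|
|       import dlt
|       import pyarrow.parquet as pq
|
|       def email_rows():
|           # stream the parquet in modest batches instead of loading it whole
|           pf = pq.ParquetFile("emails.parquet")
|           for batch in pf.iter_batches(batch_size=10_000):
|               yield from batch.to_pylist()   # plain dicts, one per row
|
|       pipeline = dlt.pipeline(pipeline_name="emails",
|                               destination="postgres",
|                               dataset_name="search")
|       pipeline.run(email_rows(), table_name="emails")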
| orf wrote:
| Understand the format of your data.
|
| Look at the parquet file metadata: use whatever tool you want
| for that. The Python parquet library is useful and supports s3.
|
| How big are your row groups? If it's one large row group then
| you will run into this issue.
|
| What's the number of rows in each row group?
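|
| For example, something like this shows the row-group layout
| (the local file name is a placeholder; pass an S3 filesystem
| handle instead if the file stays remote):
|
|       import pyarrow.parquet as pq
|
|       md = pq.ParquetFile("emails.parquet").metadata
|       print("rows:", md.num_rows, "row groups:", md.num_row_groups)
|       for i in range(md.num_row_groups):
|           rg = md.row_group(i)
|           print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")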
| sagia wrote:
| I had a similar issue, but for aggregations. The use case was
| to "compress" large datasets into smaller aggregations for
| insertion into a costly DB. At first we used DuckDB, but memory
| became an issue there and we also bumped into a couple of
| issues with how DuckDB handles arrays. We then moved this
| workload to clickhouse local, which was faster and had more
| fine-tuning options to our liking; in this case that meant
| limiting RAM usage with e.g. max_bytes_before_external_group_by.
| Hikikomori wrote:
| Should be able to do it with AWS Athena?
| xk3 wrote:
| You may want to give octosql a try. I was able to read 100GiB
| parquet files and aggregate them using less than 100MiB RAM,
| and it is pretty fast too.
| 392 wrote:
| I'm puzzled as to why this is a problem that has lasted months.
| My phone has enough RAM to work with this file in memory. Do
| not use pyspark, it is unbelievably slow and memory hogging if
| you hold it even slightly wrong. Spark is for TB-sized data, at
| minimum.
|
| Have you tried downloading the file from s3 to /tmp, opening it
| with pandas, iterating through 1000 row chunks, pushing to DB?
| The default DF to SQL built into pandas doesn't batch the
| inserts so it will be about 10x slower than necessary, but
| speeding that up is a quick google->SO away.
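|
| A bare-bones sketch of that flow, with the batching fix folded
| in (the path, table and connection string are all made up):
|
|       import pandas as pd
|       from sqlalchemy import create_engine
|
|       df = pd.read_parquet("/tmp/emails.parquet")   # file already copied down from S3
|       engine = create_engine("postgresql://user:pass@localhost/search")
|
|       # method="multi" batches rows into multi-row INSERTs instead of one per statement
|       df.to_sql("emails", engine, if_exists="append",
|                 index=False, chunksize=1000, method="multi")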
| fizx wrote:
| Fun story time: I was at twitter for a few years and tended to
| write quick hacks that people wanted to replace with better
| engineering.
|
| We never had scala thrift bindings, and the Java ones were
| awkward from Scala, so I wrote a thrift plugin in JRuby that used
| the Ruby thrift parser and ERb web templates to output some Scala
| code. Integrated with our build pipeline, worked great for the
| company.
|
| I also wrote one era of twitter's service deploy system on a
| hacked up Capistrano.
|
| These projects took a few days because they were dirty hacks, but
| I still got a below-expectations perf review for getting easily
| distracted, because I didn't yet know how to sell those
| company-wide projects.
|
| Anyhow, about a month before that team kicked off Parquet, I
| showed them a columnar format I made for a hackweek based on
| Lucene's codec packages, and was using to power a mixpanel-alike
| analytics system.
|
| I'm not sure whether they were inspired or terrified that my hack
| would reach production, but I like to think I had a small hand in
| getting Parquet kickstarted.
| pradeepchhetri wrote:
| Reading through this blog, it seems to me that Parquet is a lot
| like the ClickHouse native data format.
|
| The best part of the ClickHouse native format is that I can use
| the same ClickHouse queries locally or against a remote
| server/cluster and let ClickHouse decide how to use the
| available resources in the most performant way.
|
| ClickHouse also has a native (and the fastest) integration with
| Parquet, so I can:
|
| - Query local/S3 parquet data from the command line using
| clickhouse-local.
|
| - Query large amounts of local/S3 data programmatically by
| offloading it to a ClickHouse server/cluster, which can do the
| processing in a distributed fashion.
| pradeepchhetri wrote:
| If you are interested in reading internals of using Parquet
| with ClickHouse, do read following articles:
|
| - https://clickhouse.com/blog/apache-parquet-clickhouse-local-...
|
| - https://clickhouse.com/blog/apache-parquet-clickhouse-local-...
| colloydi wrote:
| Tangential to the topic, but regarding the supposed _Snowball
| Effect_: there is in real life no such thing. I have pushed large
| 'snowballs' down slopes --in reality they are snow cylinders as
| shown in the photo-- and they invariably do not get far. The
| reason is that when one side of the cylinder randomly thickens
| slightly with respect to the other side, the whole thing turns
| in the opposite direction.
|
| For example, if the RHS of your cylinder has a slightly larger
| radius than the LHS the cylinder will commence turning to the
| left.
|
| The upshot is the thick side picks up more snow than the thin
| side and the disparity in radii increases more rapidly still. The
| cylinder becomes a truncated cone which turns sideways and halts!
| whinvik wrote:
| Sounds to me like there's a long blog post waiting to be
| written.
| lsowen wrote:
| It might be rare, but it _can_ happen:
| https://www.youtube.com/watch?v=ZdIclqha014
| dendrite9 wrote:
| It is highly dependent on the snow conditions and the recent
| weather. Sometimes even just a couple of hours are enough to
| change the conditions to give a good chance of rollerballs. The
| climate also has an impact; in my experience more coastal areas
| have more periods when they form.
|
| And in some cases the rollerballs get too tall for the bonding
| strength of the snow, so they break into parts that can restart
| the cycle if the slope is steep enough.
| anentropic wrote:
| Following along, the subsequent blog posts in this series progress
| to Arrow and then to... OpenLineage.
|
| I'm curious if anyone has experience with OpenLineage/Marquez or
| similar they'd like to share
___________________________________________________________________
(page generated 2024-05-09 23:02 UTC)