[HN Gopher] Fast analysis with DuckDB and Pyarrow
___________________________________________________________________
Fast analysis with DuckDB and Pyarrow
Author : amrrs
Score : 78 points
Date : 2022-04-30 17:50 UTC (5 hours ago)
(HTM) web link (tech.gerardbentley.com)
(TXT) w3m dump (tech.gerardbentley.com)
| jagtesh wrote:
| First of all, thanks for sharing this OP! So glad to see a way to
| query a df using SQL without further transformation.
|
| Arrow has been truly revolutionary in this regard, providing a
| solid in-memory data format (with performant APIs in many
| languages) for interchange between different engines and even
| formats.
|
| You can go from ORC to Parquet to CSV on a local FS or S3.
|
| With DuckDB, it's like you can build your own AWS Athena at
| likely a fraction of the cost. Now if only someone would
| integrate vaex with DuckDB, it would make your powerful Apple
| Silicon machines a compelling alternative to running a
| full-fledged Spark/Hadoop cluster.
| singhrac wrote:
| This is mildly off topic, but I am very unhappy with Pandas.
| Every single API feels bolted on without any consideration of
| composability or ergonomics. After spending 4 years with a much
| better proprietary library I cannot deal with arbitrary functions
| I have to learn like "value_counts" or whatever the output of a
| "groupby" is.
| rdedev wrote:
| My go-to these days is Polars. You get good performance since
| it uses Arrow under the hood. Couple that with built-in lazy
| evaluation and its API design, and it's pretty good for me.
| There are some caveats you need to be aware of, though. It
| doesn't always work as a drop-in replacement for pandas.
| bsg75 wrote:
| Are you referring to https://www.pola.rs ?
| rdedev wrote:
| Yup. My bad. Looks like autocorrect screwed me
| shankr wrote:
| Yeah, even after working with pandas for years, I never feel
| very confident writing it. I always have to look up even
| simple stuff.
| isoprophlex wrote:
| Pandas is an absolute horrorshow: poor performance,
| inconsistent API, terrible implicit behavior leading to
| footguns.
|
| And everyone uses it because it's what you do when your boss
| tells you "we're transforming the analytics team, you're all to
| become data scientists because everyone has data scientists
| now". You just grab whatever had the biggest mindshare on SO
| and in random yt tutorials. Can't blame them.
|
| But hooo boy does pd get on my nerves.
|
| Care to share what proprietary stuff you were using?
| singhrac wrote:
| I can't really share in any detail, I think, but the best
| part was that "Series" were immutable and had sorted keys
| (indexes). Essentially they were (math) functions, so
| "indexes" had unique elements. All the important bits had
| fast numpy/Cython implementations, but the semantics were
| good because of unique keys.
|
| Honestly I still feel like I'm missing some sort of larger
| story about the semantics of Pandas (like the "functions"
| explanation above), so if anyone knows of anything that made
| Pandas click, please let me know.
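The "Series as a math function" semantics described above can be sketched in plain Python. This is a hypothetical illustration only: the class name `FrozenSeries` and its methods are invented here and are not the proprietary library's API, which the commenter did not share.

```python
from bisect import bisect_left


class FrozenSeries:
    """Hypothetical series: sorted, unique keys mapped to values."""

    def __init__(self, items):
        pairs = sorted(items, key=lambda kv: kv[0])
        keys = tuple(k for k, _ in pairs)
        # Unique keys are what make the series a function: one value per key.
        if len(set(keys)) != len(keys):
            raise ValueError("keys must be unique")
        # Tuples are used so the stored data cannot be modified in place.
        self._keys = keys
        self._values = tuple(v for _, v in pairs)

    def keys(self):
        return self._keys

    def __call__(self, key):
        # Sorted keys allow a binary search instead of a linear scan.
        i = bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            raise KeyError(key)
        return self._values[i]

    def map(self, fn):
        # Transformations return a new series and leave the original untouched.
        return FrozenSeries(zip(self._keys, (fn(v) for v in self._values)))


s = FrozenSeries([(3, 30.0), (1, 10.0), (2, 20.0)])
doubled = s.map(lambda v: v * 2)
print(s.keys(), s(2), doubled(3))
```

By contrast, a pandas index may contain duplicate labels, so a label lookup there can return either a scalar or a whole Series depending on the data.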
| minimaxir wrote:
| The fact that most data analysis/ETL tutorials on the internet
| have converged on the same CSV/pandas tactics over the past
| decade is disappointing when newer tools demonstrated here such
| as DuckDB/Arrow have practical advantages without much code
| complexity overhead.
|
| This post also links to another discussion about the Parquet data
| format (https://pythonspeed.com/articles/pandas-read-csv-fast/),
| also supported by Arrow, which is also extremely useful but I
| never see anyone talking about it. Granted, Parquet data can't
| natively be imported into Excel which is likely the main cause.
| teej wrote:
| These tools are very new compared to CSV+pandas. And most
| things you want to get data out of won't give it to you in
| parquet.
|
| The future is very promising, I am personally very excited
| about DuckDB. But it's too soon to be griping about old
| tutorials.
| philshem wrote:
| As of Pandas 1.4, you can use the pyarrow engine for reading a
| csv:
|
|     df = pd.read_csv("large.csv", engine="pyarrow")
|
| https://pythonspeed.com/articles/pandas-read-csv-fast/
| [deleted]
| mkl wrote:
| They do that in the article too.
| mritchie712 wrote:
| Is it me or do posts about data tools do better on HN than your
| average software post?
| minimaxir wrote:
| Higher signal-to-noise, at the least.
| tomrod wrote:
| I like pyarrow a lot, but this is my first time coming across
| DuckDB. I'll check it out!
|
| I'm curious how it loads so fast initially.
___________________________________________________________________
(page generated 2022-04-30 23:00 UTC)