[HN Gopher] Show HN: Easily Convert WARC (Web Archive) into Parq...
___________________________________________________________________
Show HN: Easily Convert WARC (Web Archive) into Parquet, Then Query
with DuckDB
Author : llambda
Score : 61 points
Date : 2022-06-24 18:26 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mritchie712 wrote:
  | Nice! I've been considering using DuckDB for our product (to
  | speed up joins and aggregates of in-memory data); it's an
  | incredible technology.
| wahnfrieden wrote:
| How does this compare with SQLite approaches shared recently?
| infogulch wrote:
  | Well, there's a virtual table extension for reading Parquet
  | files in SQLite. I haven't tried it myself.
| https://github.com/cldellow/sqlite-parquet-vtable
| westurner wrote:
| Could this work with datasette (which is a flexible interface
| to sqlite with a web-based query editor)?
| llambda wrote:
  | It's a great question: fundamentally, Parquet is a columnar
  | format. With datasets like these, there's some research[0]
  | indicating this is a preferable way of storing and querying
  | WARC.
|
| DuckDB, like SQLite, is serverless. Duck has a leg up on SQLite
| though when it comes to Parquet: Parquet is supported directly
| in Duck and this makes dealing with these datasets a breeze.
|
| [0] https://www.researchgate.net/figure/Comparing-WARC-CDX-
| Parqu...
| 1egg0myegg0 wrote:
| Good question! As a disclaimer, I work for DuckDB Labs.
|
| There are 2 big benefits to working with Parquet files in
| DuckDB, and both relate to speed!
|
  | DuckDB can query Parquet right where it sits, so there is no
  | need to insert it into the db first. This is typically much
  | faster. Also, DuckDB's engine is columnar (SQLite is row-
  | based), so it can run analytical queries faster against that
  | format. I have seen 20-100x speed improvements over SQLite in
  | analytical workloads.
|
| Happy to answer any questions!
| arpinum wrote:
| Do you see DuckDB as a possible replacement for AWS Athena?
| Where would Athena still be better than DuckDB + Parquet +
| Lambda?
| wenc wrote:
| DuckDB user here. As far as I can tell, DuckDB doesn't
| support distributed computation so you have to set that up
| yourself, whereas Athena is essentially Presto -- it
  | handles that detail for you. DuckDB also doesn't support
  | Avro or ORC yet.
|
| DuckDB excels at single machine compute where everything
| fits in memory or is streamable (data can be local or on
| S3) -- it's lightweight and vectorized. I use it in Jupyter
| notebooks and in Python code.
|
| But it may not be the right tool if you need distributed
| compute over a very large dataset.
| wenc wrote:
  | DuckDB has SQLite semantics but is natively built around
  | columnar formats (Parquet, in-memory Arrow) and strong types
  | (including dates). It also supports very complex SQL.
|
  | SQLite is a row store built around row-based transactional
  | workloads. DuckDB is built around analytics workloads (lots of
  | filtering, aggregations and transformations), and for these
  | workloads DuckDB is just way, way faster. Source: personal
  | experience.
___________________________________________________________________
(page generated 2022-06-24 23:00 UTC)