# warc-parquet

A utility for converting WARC to Parquet.

Repository: https://github.com/maxcountryman/warc-parquet

## Install

The binary may be installed via `cargo`:

    $ cargo install warc-parquet

To use the crate in your project, add the following to your `Cargo.toml` file:

    [dependencies]
    warc-parquet = "0.4"

## Usage

### The Binary

Once installed, the `warc-parquet` utility can be used to transform WARC into Parquet:

    $ wget --warc-file example 'https://example.com'
    $ cat example.warc.gz | warc-parquet --gzipped > example.snappy.parquet

`warc-parquet` is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

    $ wget --warc-file github 'https://github.com'
    $ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.snappy.parquet

It's also simple to preprocess via standard UNIX piping:

    $ cat example.warc.gz | gzip -d | warc-parquet > example.snappy.parquet

Various compression options, including the option to forgo compression altogether, are also available:

    $ cat example.warc.gz | warc-parquet --gzipped --compression brotli > example.brotli.parquet

`warc-parquet --help` displays complete options and usage information.

### The Crate

Refer to the docs for more details about how to use the `Reader` within your own programs.

## DuckDB

There are any number of ways to consume Parquet once you have it. However, a natural fit might be DuckDB:

    $ duckdb
    v0.3.3 fe9ba8003
    Enter ".help" for usage hints.
    Connected to a transient in-memory database.
    Use ".open FILENAME" to reopen on a persistent database.
    D select type, id from 'example.snappy.parquet';
    +----------+-------------------------------------------------+
    | type     | id                                              |
    +----------+-------------------------------------------------+
    | warcinfo |                                                 |
    | request  |                                                 |
    | response |                                                 |
    | metadata |                                                 |
    | resource |                                                 |
    | resource |                                                 |
    +----------+-------------------------------------------------+

    D describe select * from 'example.snappy.parquet';
    +-------------------------+-------------+------+-----+---------+-------+
    | column_name             | column_type | null | key | default | extra |
    +-------------------------+-------------+------+-----+---------+-------+
    | id                      | VARCHAR     | YES  |     |         |       |
    | content_length          | UINTEGER    | YES  |     |         |       |
    | date                    | TIMESTAMP   | YES  |     |         |       |
    | type                    | VARCHAR     | YES  |     |         |       |
    | content_type            | VARCHAR     | YES  |     |         |       |
    | concurrent_to           | VARCHAR     | YES  |     |         |       |
    | block_digest            | VARCHAR     | YES  |     |         |       |
    | payload_digest          | VARCHAR     | YES  |     |         |       |
    | ip_address              | VARCHAR     | YES  |     |         |       |
    | refers_to               | VARCHAR     | YES  |     |         |       |
    | target_uri              | VARCHAR     | YES  |     |         |       |
    | truncated               | VARCHAR     | YES  |     |         |       |
    | warc_info_id            | VARCHAR     | YES  |     |         |       |
    | filename                | VARCHAR     | YES  |     |         |       |
    | profile                 | VARCHAR     | YES  |     |         |       |
    | identified_payload_type | VARCHAR     | YES  |     |         |       |
    | segment_number          | UINTEGER    | YES  |     |         |       |
    | segment_origin_id       | VARCHAR     | YES  |     |         |       |
    | segment_total_length    | UINTEGER    | YES  |     |         |       |
    | body                    | BLOB        | YES  |     |         |       |
    +-------------------------+-------------+------+-----+---------+-------+
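
The conversion and query steps can also be wrapped in a small script. The following is a minimal sketch, not part of `warc-parquet` itself: the function names `convert_warcs` and `type_counts_sql` and the `WARC_PARQUET` override variable are hypothetical helpers introduced for illustration. Only the `--gzipped` flag, the stdin/stdout usage, and the `type` column are taken from the sections above.

```shell
#!/bin/sh
# Hypothetical override so a stand-in command can be used for dry runs.
WARC_PARQUET="${WARC_PARQUET:-warc-parquet}"

# convert_warcs OUT.parquet FILE.warc.gz...
# Concatenates the gzipped WARC files and pipes them through the converter,
# mirroring the multi-file example above.
convert_warcs() {
    out="$1"
    shift
    cat "$@" | "$WARC_PARQUET" --gzipped > "$out"
}

# type_counts_sql FILE.parquet
# Prints a query that counts records per WARC record type.
type_counts_sql() {
    printf "select type, count(*) as records from '%s' group by type order by records desc;\n" "$1"
}
```

Assuming both tools are on `PATH`, usage might look like `convert_warcs combined.snappy.parquet example.warc.gz github.warc.gz`, followed by `type_counts_sql combined.snappy.parquet | duckdb` to run the query from standard input.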