[HN Gopher] Show HN: WarcDB: Web crawl data as SQLite databases
___________________________________________________________________
Show HN: WarcDB: Web crawl data as SQLite databases
Author : fforflo
Score : 124 points
Date : 2022-06-19 13:26 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jbverschoor wrote:
| I like that you're logging responses instead of just the result /
| payload. It reminds me of an idea I had of using something like a
| queue as an intermediary between a webserver and the backend. I
| don't exactly remember my reasoning any more.
| uniqueuid wrote:
| Since this is pretty new, some background:
|
| WARC is a file format written and read by a rather small but
| specialized set of web crawler tools, most notably the Internet
| Archive's tooling. For example, its Java-based crawler Heritrix
| produces WARC files.
|
| There are a couple of other very cool tools, such as warcprox,
| which can create web archives from web browser activity by acting
| as a proxy; pywb, which plays those archives back to make them
| browsable; and related libraries [1] (shoutout to Noah Levitt,
| Ilya Kreymer and collaborators for building all this).
|
| The file format itself is an ISO standard, and it's very simple:
| capture all HTTP headers sent and received, and all HTTP bodies
| sent and received, and do simple de-duplication based on digests
| of the body.
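|
| To make that concrete, here's a minimal sketch of walking the
| records in an archive with Python's warcio library (the library
| choice and file name are my own illustration):
|
|     from warcio.archiveiterator import ArchiveIterator
|
|     with open('crawl.warc.gz', 'rb') as stream:
|         for record in ArchiveIterator(stream):
|             if record.rec_type == 'response':
|                 # WARC headers: capture metadata
|                 uri = record.rec_headers.get_header('WARC-Target-URI')
|                 # HTTP headers exactly as received on the wire
|                 status = record.http_headers.get_statuscode()
|                 # raw payload; the de-duplication digests are
|                 # computed over this body
|                 body = record.content_stream().read()
|                 print(status, uri, len(body))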
|
| There is a companion format, CDX, which builds indexes over WARC
| files (which in turn are just concatenated records, so rather
| robust).
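|
| Since a CDX line carries the WARC filename and the byte offset
| of each record, a reader can seek straight to a single capture
| instead of scanning the whole file. A hedged sketch (the offset
| and path stand in for values parsed from a real CDX line):
|
|     from warcio.archiveiterator import ArchiveIterator
|
|     offset = 123456  # hypothetical, taken from a CDX index
|     with open('crawl.warc.gz', 'rb') as f:
|         f.seek(offset)
|         # each record is its own gzip member, so iteration can
|         # start mid-file
|         record = next(iter(ArchiveIterator(f)))
|         print(record.rec_headers.get_header('WARC-Target-URI'))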
|
| Although all of this is great, I worry a bit about where we're
| heading with QUIC / UDP-based protocols, WebSockets and other
| very involved protocols which ultimately make archival much
| harder.
|
| If there's anything you can do to help these (or other) tools to
| keep our web's archival records alive and flowing, please do so.
|
| [1] https://github.com/internetarchive/warcprox
| lijogdfljk wrote:
| Would the WARC format reduce the effort needed to make Reader-
| like programs? I.e. strip pages of HTML cruft, leaving you with
| text, images, etc. - the content?
| simonw wrote:
| I've found the Readability.js library to be really good for
| that - here's my recipe for running it as a CLI:
| https://til.simonwillison.net/shot-scraper/readability
| uniqueuid wrote:
| Not at all.
|
| A WARC file gives you exactly what you would see on the wire,
| or in the network inspector tab of your browser. It does
| nothing to the content, and that's the point.
|
| The only thing you gain (and that's very important for other
| reasons as well) is an immutable ground truth to work from
| when creating the reader view of a given article.
| lijogdfljk wrote:
| Gotcha - yeah, I was hoping maybe it snapshotted the HTML or
| some such, sidestepping issues like dynamic text or JS
| shenanigans.
| fforflo wrote:
| The web archiving community is surprisingly small and
| fragmented (in terms of tools) given its impact. Thankfully the
| .warc format looks pretty powerful and standard for the web we
| have so far (which is a lot!).
|
| Now with the new protocols, I dunno, maybe it's too soon to
| worry? Then again, maybe it's an IPv4 / IPv6 analogy.
|
| > There is a companion format, CDX, which builds indexes over
| WARC files (which in turn are just concatenated records, so
| rather robust).
|
| Good point. I was planning to combine this fact with the ATTACH
| option that SQLite has, which allows querying across multiple
| database files [0].
|
| [0] https://www.sqlite.org/lang_attach.html
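|
| A quick sketch of what that could look like from Python (file
| and table names here are illustrative, not the final schema):
|
|     import sqlite3
|
|     db = sqlite3.connect('crawl_part1.warcdb')
|     db.execute("ATTACH DATABASE 'crawl_part2.warcdb' AS part2")
|     # query both archive pieces in one statement
|     for (uri,) in db.execute("""
|         SELECT warc_target_uri FROM response
|         UNION ALL
|         SELECT warc_target_uri FROM part2.response
|     """):
|         print(uri)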
| uniqueuid wrote:
| Oh hi, thanks for building this! I haven't had the chance to
| play with it, but my hunch is that SQLite for WARC can fill a
| great niche and would be much more portable (and probably more
| performant).
|
| Allowing multiple DB files is a great idea, since that
| fundamentally enables large archives, cold-storage and so on.
| [deleted]
| ma2rten wrote:
| Common Crawl is also in WARC format.
| tepitoperrito wrote:
| It'd be neat to extend the WARC format and tooling to support
| cached HTTP responses for things like REST endpoints. Then you
| could make sure everything you did in a session is recorded for
| later use.
|
| From the specification it would appear fairly straightforward
| once an approach was chosen...
|
| Here's the relevant extract from the WARC spec that informed my
| difficulty estimate - "The WARC (Web ARChive) file format offers
| a convention for concatenating multiple resource records (data
| objects), each consisting of a set of simple text headers and an
| arbitrary data block into one long file."
|
| Edit: Upon 2 minutes of reflection I think the way to go for what
| I'm envisioning is some kind of browser session recording ->
| replayable archive solution.
| uniqueuid wrote:
| Although this is a nice idea, it's extremely difficult to get
| it completely right.
|
| Consider an SPA where navigation happens via XHR or similar
| requests, and updates are JSON that's patched into the DOM.
| Even browsers have a hard time figuring out how to make this a
| coherent session.
|
| Now with WARC, you get a single record per _transfer_, i.e.
| every JSON file, every image, every CSS file is an individual
| record. It's completely up to the client/downstream tech to
| reassemble this into a coherent page.
|
| If you want to go down that road, my best suggestion would be
| to start with a browser's history - that's probably the most
| solid version of a session that we have right now.
| fforflo wrote:
| I built this as a small utility within a larger project I'm
| working on these days (contact me if you're curious or want to
| support it).
|
| The WARC format is extremely simple and yet so powerful. Most
| importantly though, there are already pebibytes of crawled
| archives out there.
|
| This is a fairly straightforward mapping of a .warc file to a
| .sqlite database. The goal is to make such archives SQL-able even
| in smaller pieces.
|
| The schema I've come up with is tailored to my requirements,
| but comment if you can spot any obvious pitfalls.
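|
| For a taste of the kind of query this enables (table and
| column names below are an illustration only):
|
|     import sqlite3
|
|     db = sqlite3.connect('crawl.warcdb')
|     # e.g. count captures per content type
|     for row in db.execute("""
|         SELECT content_type, COUNT(*) AS n
|         FROM response
|         GROUP BY content_type
|         ORDER BY n DESC
|         LIMIT 10
|     """):
|         print(row)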
|
| PS: I do believe that at some point .sqlite will become the de
| facto standard for such initiatives. Sure, it's not text... but
| it's pretty close.
| marginalia_nu wrote:
| > PS: I do believe that at some point .sqlite will become the
| de facto standard for such initiatives. Sure, it's not text...
| but it's pretty close.
|
| What is the advantage of moving around .sqlite files, over just
| loading the (compressed) WARCs into SQLite databases when you
| need them?
|
| I've been messing around with different formats for my own
| search engine crawls, and ended up with the conclusion that
| WARC is a pretty amazing intermediary format that balances the
| needs of the producer and the consumer very well. I don't use
| WARCs now, but something similar, and I will probably migrate
| toward that format eventually.
|
| WARC's real selling point is that it's such an extremely
| portable format.
| uniqueuid wrote:
| Anecdotal evidence, but I produced a medium-size crawl in the
| past (~20TB compressed). I used distributed resources with
| off-the-shelf libraries (i.e. warcprox etc.) and managed to
| get corrupted data in some cases where neither the length-
| delimited (i.e. offset + payload length) nor the newline-
| delimited (triple newlines between records) logic was valid
| any longer. It took me some time to build a repair tool for
| that.
|
| SQLite has an amazing set of well-understood and documented
| guarantees on top of performance, there's a host of potential
| validation tools to choose from, and you can even use
| transactions etc. So that alone seems like a great idea.
|
| What's more, you can potentially skip CDX files if you have
| SQLite databases (or quickly build your own meta SQLite
| database for the others).
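|
| For instance, a meta index over a directory of per-crawl
| databases could be built like this (the schema and column
| names are assumptions on my part):
|
|     import glob
|     import sqlite3
|
|     meta = sqlite3.connect('meta.db')
|     meta.execute(
|         "CREATE TABLE IF NOT EXISTS captures (db TEXT, uri TEXT)")
|     for path in glob.glob('*.warcdb'):
|         part = sqlite3.connect(path)
|         rows = [(path, uri) for (uri,) in part.execute(
|             "SELECT warc_target_uri FROM response")]
|         meta.executemany(
|             "INSERT INTO captures VALUES (?, ?)", rows)
|     meta.commit()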
| rengler33 wrote:
| Is there a forum or somewhere web crawlers hang out online?
| I'd love to learn about more sophisticated projects like
| this.
| uniqueuid wrote:
| In GitHub issues of said projects, and at scientific web
| archival conferences.
|
| Although I'd absolutely welcome some sort of channel!
| smcnally wrote:
| GitHub Topics collect issues and activity across projects.
| These topics are quite active, e.g.
|
| https://github.com/topics/crawling
| https://github.com/topics/web-scraping
| https://github.com/topics/web-archiving
| mynameismon wrote:
| > What is the advantage of moving around .sqlite files, over
| just loading the (compressed) WARCs into SQLite databases
| when you need them?
|
| I suppose a case could be made for easy extension: you don't
| need to change the entire spec to add another table to the
| SQLite database, maybe containing other metadata.
|
| > WARC's real selling point is that it's such an extremely
| portable format.
|
| I mean, so is SQLite: it is also approved as a LoC archival
| format. (See SQLite Archive [0].)
|
| [0]: https://www.sqlite.org/sqlar.html
| fforflo wrote:
| > What is the advantage of moving around .sqlite files, over
| just loading the (compressed) WARCs into SQLite databases
| when you need them?
|
| The .warc spec is ideal. I'm not saying we should replace it
| (ref. xkcd: standards).
|
| On top of what uniqueuid said, "loading" is much slower and
| more cumbersome than it sounds. I'm not saying SQLite will
| replace text (maybe my aphorism sounded too firm). I'm saying
| that maybe along with the .warc.gz archives at rest, one
| could have .sql.gz files at rest as well.
|
| In other words: why not have ACID-compliant archives moving
| around?
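|
| Producing such a file is close to a one-liner with the stdlib
| (a sketch; the file names are made up):
|
|     import gzip
|     import sqlite3
|
|     db = sqlite3.connect('crawl.warcdb')
|     # iterdump() yields the SQL statements that recreate the DB
|     with gzip.open('crawl.sql.gz', 'wt') as f:
|         for stmt in db.iterdump():
|             f.write(stmt + '\n')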
| marginalia_nu wrote:
| Seems like the benefit of SQLite is the sort of use cases
| where you maybe don't want to load everything you've
| crawled into a search engine, but want to be able to
| cherry-pick the data and retrieve specific documents for
| further processing. Which is certainly a use case that
| exists, and indeed not really what WARC is designed for.
| traverseda wrote:
| Great for most end-user-facing applications though.
| marginalia_nu wrote:
| End-user-facing applications usually don't consume
| website crawls, do they? That's impractical for many
| reasons, the sheer size alone being perhaps the biggest
| obstacle.
|
| If you want to do something like have an offline copy of
| a website, ZIM[1] is a far more suitable format as it's
| extremely space-efficient and also fast.
|
| [1] https://docs.fileformat.com/compression/zim/
| nlohmann wrote:
| Have you ever played with SQLite virtual tables
| (https://sqlite.org/vtab.html)? They could provide an
| SQLite interface while keeping the same structure on disk.
| Though it requires a bit of work (implementing the interface
| can be tedious), it can avoid the conversion in the first
| place.
| fforflo wrote:
| Then again, do you need virtual tables? The .warc structure
| won't change, so the tables won't change. But you can have
| SQL views defined instead for common queries.
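|
| Something along these lines, say (column names here are
| illustrative):
|
|     import sqlite3
|
|     db = sqlite3.connect('crawl.warcdb')
|     # a reusable view for one common query
|     db.execute("""
|         CREATE VIEW IF NOT EXISTS html_pages AS
|         SELECT warc_target_uri, payload
|         FROM response
|         WHERE content_type LIKE 'text/html%'
|     """)
|     rows = db.execute("SELECT * FROM html_pages LIMIT 5").fetchall()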
| fforflo wrote:
| Good point. Actually, Common Crawl provides Parquet files for
| its archives too.
|
| And there's this virtual-table extension for Parquet:
| https://github.com/cldellow/sqlite-parquet-vtable
|
| But for my use case virtual tables would be too complicated.
___________________________________________________________________
(page generated 2022-06-19 23:00 UTC)