[HN Gopher] Optimizing Large-Scale OpenStreetMap Data with SQLite
___________________________________________________________________
Optimizing Large-Scale OpenStreetMap Data with SQLite
Author : thunderbong
Score : 95 points
Date : 2024-07-03 08:58 UTC (3 days ago)
(HTM) web link (jtarchie.com)
(TXT) w3m dump (jtarchie.com)
| simonw wrote:
| I liked the trick used here for speeding up tag key/value queries
| using an FTS index:
|
|     SELECT id
|     FROM entries e
|     JOIN search s ON s.rowid = e.id
|     WHERE
|       -- use FTS index to find subset of possible results
|       search MATCH 'amenity cafe'
|       -- use the subset to find exact matches
|       AND tags->>'amenity' = 'cafe';
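| (For context, a minimal sketch of the kind of FTS5 setup such a
| query assumes; the table and column names mirror the query above,
| but the article's exact schema may differ.)
|
|     -- standalone FTS5 table whose rowid mirrors entries.id,
|     -- indexing a flattened "key value" rendering of each row's tags
|     CREATE VIRTUAL TABLE search USING fts5(tags);
|
|     INSERT INTO search(rowid, tags)
|     SELECT id,
|            (SELECT group_concat(j.key || ' ' || j.value, ' ')
|             FROM json_each(entries.tags) AS j)
|     FROM entries;
|
| The MATCH clause then prunes candidates cheaply before the exact
| tags->>'amenity' comparison runs on the remaining rows.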
| wcedmisten wrote:
| I recently discovered DuckDB's ST_ReadOSM() function [1], which
| lets you query OSM PBF files directly.
|
| For example, it's simple to count the cafes in North America in
| under 30s:
|
|     SELECT COUNT(*)
|     FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf')
|     WHERE tags['amenity'] = ['cafe'];
|
|     +--------------+
|     | count_star() |
|     |    int64     |
|     +--------------+
|     |        57150 |
|     +--------------+
|     Run Time (s): real 24.643 user 379.067204 sys 3.696217
|
| Unfortunately, I discovered there are still some bugs [2] that
| need to be ironed out, but it seems very promising for doing high
| performance queries with minimal effort.
|
| [1]:
| https://duckdb.org/docs/extensions/spatial.html#st_readosm--...
|
| [2]: https://github.com/duckdb/duckdb_spatial/issues/349
| winrid wrote:
| That's cool, but not what I would call high performance. If you
| run these queries often you would want an index, and each lookup
| should only take single-digit ms.
| wild_egg wrote:
| I'm not near a computer to try this out, but I'd be surprised if
| you couldn't get a huge speedup by selecting the whole file into
| a real table first and querying against that. DuckDB should be
| able to vectorize operations better then.
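| (Sketch of that approach, with an illustrative file path and
| table name: pay the PBF decoding cost once, then query DuckDB's
| native columnar storage.)
|
|     INSTALL spatial; LOAD spatial;
|
|     -- one-time materialization of the PBF into a DuckDB table
|     CREATE TABLE osm AS
|     SELECT * FROM st_readOSM('north-america-latest.osm.pbf');
|
|     -- repeated queries now scan columnar storage, not the PBF
|     SELECT COUNT(*) FROM osm WHERE tags['amenity'] = ['cafe'];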
| wcedmisten wrote:
| The reason I call it high performance is that it avoids the
| hours/days of processing (for the planet file)[1] that would
| be required for pulling the data out of PBF and indexing it.
| And you'd also need RAM at least the size of the planet file to
| even get that level of speed.
|
| You could certainly amortize this cost for repeated queries,
| but for one-off queries I haven't seen anything faster.
|
| [1]: https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks
| winrid wrote:
| You wouldn't need hundreds of gigs of RAM to answer this
| query in 5ms. Once the initial indexing is done, a few MB
| or so would be enough.
| ericjmorey wrote:
| How long do you think it would take to index it?
| wenc wrote:
| Fascinating use of DuckDB!
|
| Can I ask where you get official OSM PBF data from? (I found
| these two links, but not sure what data these contain)
|
| https://planet.openstreetmap.org/pbf/
|
| http://download.geofabrik.de/
| wcedmisten wrote:
| Those are the most popular sources, and I've used both!
|
| The first one is the official OpenStreetMap data, which
| contains the "planet file" - i.e. all the data for the entire
| world. But because OSM has so much stuff in it, the planet
| file is a whopping 76 GB, which can take a long time to
| process for most tasks. I also recommend using the torrent
| file for faster download speeds.
|
| As a result of the planet's size, the German company
| Geofabrik provides unofficial "extracts" of the data, which
| are filtered down to a specific region. E.g. all the data in
| a particular continent, country, or U.S. state. If you click
| on the "Sub Region" link it will show countries, and if you
| click on those it will show states.
| wenc wrote:
| So Geofabrik is the same data, but regionalized. This
| sounds like what I need. I already use DuckDB, so this is
| great.
|
| I appreciate your taking the time to share this tidbit!
| It's a game changer in what I do (geospatial).
| wenc wrote:
| I was able to find all 10 Whole Foods in the City of
| Chicago in 22.6s with DuckDB. It's amazing! (there are tons
| more Whole Foods in the metro area, but it found the exact
| 10 in the city)
|
|     SELECT
|         tags['addr:city'][1] AS city,
|         tags['addr:state'][1] AS state,
|         tags['brand'][1] AS brand,
|         *
|     FROM st_readosm('us-latest.osm.pbf')
|     WHERE 1=1
|         AND city = 'Chicago'
|         AND state = 'IL'
|         AND brand = 'Whole Foods Market'
|
| I'm sure there are ways to make this faster (partitioning,
| indexing, COPY TO native format, etc.) but querying a 9.8GB
| compressed raw format file with data (in key-value fields
| stored as strings) for the entire United States at this
| speed is pretty impressive to me.
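| (One way to amortize that cost, sketched with illustrative file
| names: convert the PBF to Parquet once, then query the Parquet.)
|
|     -- one-time conversion; later queries skip PBF decoding
|     COPY (SELECT * FROM st_readosm('us-latest.osm.pbf'))
|       TO 'us-latest.parquet' (FORMAT parquet);
|
|     SELECT COUNT(*)
|     FROM read_parquet('us-latest.parquet')
|     WHERE tags['brand'] = ['Whole Foods Market'];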
| sdenton4 wrote:
| So, hype aside, what's the over/under on DuckDB vs SQLite these
| days? I'm working on a Thing right now and started with SQLite
| due to it being a) good enough, b) stupendously optimized and
| hard as nails, and c) able to run on your phone, your toaster,
| and your server.
|
| What's DuckDB bringing to the table relative to SQLite, which
| seems like the boring-and-therefore-best choice?
| sdenton4 wrote:
| Ha, looks like the DuckDB devs have a great and balanced answer
| for this: https://marclamberti.com/blog/duckdb-getting-started-
| for-beg...
|
| If you want row storage, use sqlite. If you want columnar
| storage, use DuckDB.
| sdenton4 wrote:
| And thinking about my specific project, it's clear that a
| great solution would allow me to pick row-vs-column storage
| on a per-table basis.
|
| I basically have a graph over vectors, which is a very 'row-
| storage' kind of thing: I need to get a specific vector (a
| row of data), get all of its neighbors (rows in the edges
| table), and do some in-context computations to decide where
| to walk to next on the graph.
|
| However, we also have some data attached to vectors
| (covariates, tags, etc) which we often want to work with in a
| more aggregated way. These tables seem possibly more
| reasonable to approach with a columnar format.
| 1659447091 wrote:
| > a great solution would allow me to pick row-vs-column
| storage on a per-table basis ... and do some in-context
| computations to decide where to walk to next on the graph
|
| You may be interested in Datomic[0] or Datascript[1]
|
| [0]: https://www.datomic.com/benefits.html
|
| [1]: https://github.com/tonsky/datascript
| sitkack wrote:
| Knowing how little I know about your problem and
| constraints, I'd use SQLite for everything. Then if a
| specific part seems columnar, do some experiments in DuckDB
| and see what speedups or space savings you get and then use
| both.
|
| Sounds like there is more than one of you, so have some folks
| do some 4-hour spikes on DuckDB where you think it might be
| useful.
|
| I'd use Rust as the top level, embedding SQLite, DuckDB and
| Lua. In my application I used SQLite with FTS and then many
| SQLite databases not more than 2GB in size with rows
| containing binary blobs that didn't uncompress to more than
| 20 megs each.
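| (If it helps, DuckDB's sqlite extension can attach a SQLite file
| directly, so row-oriented tables stay in SQLite while columnar
| aggregation runs in DuckDB; a sketch with illustrative file and
| table names.)
|
|     INSTALL sqlite; LOAD sqlite;
|
|     -- row-oriented data lives in the SQLite file
|     ATTACH 'graph.db' AS rowdb (TYPE sqlite);
|
|     -- columnar-style aggregation over the attached table
|     SELECT tag, COUNT(*) FROM rowdb.vector_tags GROUP BY tag;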
| mnmatin wrote:
| Some time ago, I had to extract a large amount of data from OSM
| and found the process harder than it should have been. Similar
| to how you shrank the size by 50% after removing unnecessary
| tags, I ended up creating a Python package, 'earth-osm'
| (https://github.com/pypsa-meets-earth/earth-osm), that makes
| things more intuitive. I always wanted to push the data into a
| database but never got around to that....
| sitkack wrote:
| I went through a similar process when I converted wikipedia xml
| dumps into sqlite.
|
| https://news.ycombinator.com/item?id=28013347
| Scaevolus wrote:
| > CREATE INDEX entries_name ON entries(tags->>'name');
|
| > However, this requires an index per tag, which won't scale,
| > especially for a dynamic list of tags.
|
| That's not how indexes work at all. This will be fine.
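| (Sketch, reusing the column names from the quoted article: one
| expression index per tag you actually filter on lets SQLite turn
| the lookup into an index search.)
|
|     CREATE INDEX entries_amenity ON entries(tags->>'amenity');
|
|     EXPLAIN QUERY PLAN
|     SELECT id FROM entries WHERE tags->>'amenity' = 'cafe';
|     -- expected: SEARCH entries USING INDEX entries_amenity (<expr>=?)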
| DonnyV wrote:
| If you convert those PBFs to Parquet files, you can then use
| DuckDB to search them with sub-second response times. Plus you
| get the added bonus of being able to host them in S3-style
| cloud storage.
|
| https://duckdb.org/docs/extensions/httpfs/s3api
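| (A sketch of that setup, assuming the converted Parquet files
| already sit in a bucket; the bucket path and tag filter are
| illustrative, and S3 credentials are configured separately.)
|
|     INSTALL httpfs; LOAD httpfs;
|
|     -- DuckDB fetches only the row groups and columns it needs
|     SELECT COUNT(*)
|     FROM read_parquet('s3://my-bucket/osm/us/*.parquet')
|     WHERE tags['amenity'] = ['cafe'];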
| dzogchen wrote:
| > This highlights the importance of iterative refinement and the
| power of combining different technologies to solve problems.
|
| This uninformative non-sentence sounds an awful lot like ChatGPT.
| zamazan4ik wrote:
| If you are interested in optimizing the project further, I
| suggest rebuilding SQLite with Profile-Guided Optimization
| (PGO). I have collected as many materials as possible (including
| many related benchmarks) in my repo:
| https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and
| PGO, I have the following link:
| https://sqlite.org/forum/forumpost/19870fae957d8c1a
___________________________________________________________________
(page generated 2024-07-06 23:00 UTC)