[HN Gopher] Optimizing Large-Scale OpenStreetMap Data with SQLite
       ___________________________________________________________________
        
       Optimizing Large-Scale OpenStreetMap Data with SQLite
        
       Author : thunderbong
       Score  : 95 points
       Date   : 2024-07-03 08:58 UTC (3 days ago)
        
 (HTM) web link (jtarchie.com)
 (TXT) w3m dump (jtarchie.com)
        
       | simonw wrote:
        | I liked the trick used here for speeding up tag key/value queries
        | using an FTS index:
        | 
        |     SELECT id
        |     FROM entries e
        |     JOIN search s ON s.rowid = e.id
        |     WHERE
        |       -- use FTS index to find subset of possible results
        |       search MATCH 'amenity cafe'
        |       -- use the subset to find exact matches
        |       AND tags->>'amenity' = 'cafe';
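        | 
        | For context, a minimal sketch of how such an entries/search pair
        | might be set up. The schema here is assumed for illustration, not
        | taken from the article:
        | 
        |     -- hypothetical schema: `entries` holds the raw JSON tags,
        |     -- `search` is an FTS5 table whose rowid lines up with entries.id
        |     CREATE TABLE entries (
        |         id   INTEGER PRIMARY KEY,
        |         tags TEXT              -- OSM tags stored as a JSON object
        |     );
        | 
        |     CREATE VIRTUAL TABLE search USING fts5(tags_text);
        | 
        |     -- flatten each entry's tag keys and values into one searchable string
        |     INSERT INTO search (rowid, tags_text)
        |     SELECT e.id,
        |            (SELECT group_concat(j.key || ' ' || j.value, ' ')
        |               FROM json_each(e.tags) AS j)
        |     FROM entries AS e;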
        
       | wcedmisten wrote:
       | I recently discovered DuckDB's Read_OSM() function [1], which
       | lets you query OSM PBF files directly.
       | 
        | For example, it's simple to count the cafes in North America in
        | under 30s:
        | 
        |     SELECT COUNT(*)
        |     FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf')
        |     WHERE tags['amenity'] = ['cafe'];
        | 
        |     +--------------+
        |     | count_star() |
        |     |    int64     |
        |     +--------------+
        |     |        57150 |
        |     +--------------+
        |     Run Time (s): real 24.643 user 379.067204 sys 3.696217
       | 
       | Unfortunately, I discovered there are still some bugs [2] that
       | need to be ironed out, but it seems very promising for doing high
       | performance queries with minimal effort.
       | 
       | [1]:
       | https://duckdb.org/docs/extensions/spatial.html#st_readosm--...
       | 
       | [2]: https://github.com/duckdb/duckdb_spatial/issues/349
        
         | winrid wrote:
          | That's cool, but not what I would call high performance. If you
          | run these queries often you would want an index, and each lookup
          | should take only single-digit ms.
        
           | wild_egg wrote:
           | Not near a computer to try this out but I'd be surprised if
           | you couldn't get a huge speed up by selecting the whole file
            | into a real table first and querying against that. DuckDB
            | should be able to vectorize operations better that way.
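            | 
            | A minimal sketch of that approach, with an illustrative file
            | path and table name (not from this thread's exact setup):
            | 
            |     INSTALL spatial;
            |     LOAD spatial;
            | 
            |     -- parse the PBF once and materialize it as a native table
            |     CREATE TABLE osm AS
            |     SELECT * FROM st_readOSM('north-america-latest.osm.pbf');
            | 
            |     -- repeated queries now scan the columnar table instead of
            |     -- re-parsing the PBF on every run
            |     SELECT COUNT(*) FROM osm WHERE tags['amenity'] = ['cafe'];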
        
           | wcedmisten wrote:
           | The reason I call it high performance is that it avoids the
           | hours/days of processing (for the planet file)[1] that would
           | be required for pulling the data out of PBF and indexing it.
            | And you'd also need RAM at least the size of the planet file to
            | even get that level of speed.
           | 
           | You could certainly amortize this cost for repeated queries,
           | but for one-off queries I haven't seen anything faster.
           | 
           | [1]: https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks
        
             | winrid wrote:
              | You wouldn't need hundreds of gigs of RAM to answer this
              | query in 5ms. You'd need a few MB or so, after the initial
              | indexing is done of course.
        
               | ericjmorey wrote:
               | How long do you think it would take to index it?
        
         | wenc wrote:
         | Fascinating use of DuckDB!
         | 
         | Can I ask where you get official OSM PBF data from? (I found
         | these two links, but not sure what data these contain)
         | 
         | https://planet.openstreetmap.org/pbf/
         | 
         | http://download.geofabrik.de/
        
           | wcedmisten wrote:
           | Those are the most popular sources, and I've used both!
           | 
           | The first one is the official OpenStreetMap data, which
           | contains the "planet file" - i.e. all the data for the entire
           | world. But because OSM has so much stuff in it, the planet
           | file is a whopping 76 GB, which can take a long time to
           | process for most tasks. I also recommend using the torrent
           | file for faster download speeds.
           | 
           | As a result of the planet's size, the German company
           | Geofabrik provides unofficial "extracts" of the data, which
           | are filtered down to a specific region. E.g. all the data in
           | a particular continent, country, or U.S. state. If you click
           | on the "Sub Region" link it will show countries, and if you
           | click on those it will show states.
        
             | wenc wrote:
              | So Geofabrik is the same data, but regionalized. This
             | sounds like what I need. I already use DuckDB, so this is
             | great.
             | 
             | I appreciate your taking the time to share this tidbit!
             | It's a game changer in what I do (geospatial).
        
             | wenc wrote:
             | I was able to find all 10 Whole Foods in the City of
             | Chicago in 22.6s with DuckDB. It's amazing! (there are tons
             | more Whole Foods in the metro area, but it found the exact
              | 10 in the city)
              | 
              |     SELECT
              |         tags['addr:city'][1] city,
              |         tags['addr:state'][1] state,
              |         tags['brand'][1] brand,
              |         *,
              |     FROM st_readosm('us-latest.osm.pbf')
              |     WHERE 1=1
              |         and city = 'Chicago'
              |         and state = 'IL'
              |         and brand = 'Whole Foods Market'
             | 
             | I'm sure there are ways to make this faster (partitioning,
             | indexing, COPY TO native format, etc.) but querying a 9.8GB
             | compressed raw format file with data (in key-value fields
             | stored as strings) for the entire United States at this
             | speed is pretty impressive to me.
        
       | sdenton4 wrote:
        | So, hype aside, what's the over/under on DuckDB vs Sqlite these
        | days? I'm working on a Thing right now; I started with sqlite due
        | to it being a) good enough, b) stupendously optimized and hard as
        | nails, and c) able to run on your phone, your toaster, and your
        | server.
       | 
       | What's DuckDB bringing to the table relative to sqlite, which
       | seems like the boring-and-therefore-best choice?
        
         | sdenton4 wrote:
         | Ha, looks like the DuckDB devs have a great and balanced answer
         | for this: https://marclamberti.com/blog/duckdb-getting-started-
         | for-beg...
         | 
         | If you want row storage, use sqlite. If you want columnar
         | storage, use DuckDB.
        
           | sdenton4 wrote:
           | And thinking about my specific project, it's clear that a
           | great solution would allow me to pick row-vs-column storage
           | on a per-table basis.
           | 
            | I basically have a graph over vectors, which is a very 'row-
            | storage' kind of thing: I need to get a specific vector (a
           | row of data), get all of its neighbors (rows in the edges
           | table), and do some in-context computations to decide where
           | to walk to next on the graph.
           | 
           | However, we also have some data attached to vectors
           | (covariates, tags, etc) which we often want to work with in a
           | more aggregated way. These tables seem possibly more
           | reasonable to approach with a columnar format.
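            | 
            | A rough sqlite sketch of that split (table and column names are
            | just illustrative):
            | 
            |     -- row-oriented side: fetch one vector and its neighbors by key
            |     CREATE TABLE vectors (
            |         id  INTEGER PRIMARY KEY,
            |         vec BLOB NOT NULL             -- packed float array
            |     );
            | 
            |     CREATE TABLE edges (
            |         src INTEGER NOT NULL REFERENCES vectors(id),
            |         dst INTEGER NOT NULL REFERENCES vectors(id),
            |         PRIMARY KEY (src, dst)
            |     );
            | 
            |     -- covariates/tags that mostly get aggregated; this is the
            |     -- table that looks more columnar in practice
            |     CREATE TABLE covariates (
            |         vector_id INTEGER NOT NULL REFERENCES vectors(id),
            |         key       TEXT NOT NULL,
            |         value     TEXT
            |     );
            | 
            |     -- one step of a graph walk: look up a vector's neighbors
            |     SELECT dst FROM edges WHERE src = :current_id;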
        
             | 1659447091 wrote:
             | > a great solution would allow me to pick row-vs-column
             | storage on a per-table basis ... and do some in-context
             | computations to decide where to walk to next on the graph
             | 
             | You may be interested in Datomic[0] or Datascript[1]
             | 
             | [0]: https://www.datomic.com/benefits.html
             | 
             | [1]: https://github.com/tonsky/datascript
        
             | sitkack wrote:
             | Knowing how little I know about your problem and
             | constraints, I'd use SQLite for everything. Then if a
             | specific part seems columnar, do some experiments in DuckDB
             | and see what speedups or space savings you get and then use
             | both.
             | 
             | Sounds like you are more than one, so have some folks do
             | some 4 hour spikes on DuckDB where you think it might be
             | useful.
             | 
             | I'd use Rust as the top level, embedding SQLite, DuckDB and
             | Lua. In my application I used SQLite with FTS and then many
             | SQLite databases not more than 2GB in size with rows
             | containing binary blobs that didn't uncompress to more than
             | 20 megs each.
        
       | mnmatin wrote:
        | Some time ago, I had to extract a large amount of data from OSM
        | and found the process harder than it should have been. I did
        | something similar to how you shrunk the size by 50% by removing
        | unnecessary tags. I ended up creating a python package 'earth-osm'
       | (https://github.com/pypsa-meets-earth/earth-osm) that makes
       | things more intuitive. Always wanted to push the data into a
       | database but never got around to that....
        
       | sitkack wrote:
       | I went through a similar process when I converted wikipedia xml
       | dumps into sqlite.
       | 
       | https://news.ycombinator.com/item?id=28013347
        
       | Scaevolus wrote:
        | > CREATE INDEX entries_name ON entries(tags->>'name');
        | 
        | > However, this requires an index per tag, which won't scale,
        | > especially for a dynamic list of tags.
       | 
       | That's not how indexes work at all. This will be fine.
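        | 
        | A small sketch of how such an expression index behaves (requires
        | SQLite 3.38+ for the ->> operator; the table name follows the
        | article's query, the data and name value are made up):
        | 
        |     CREATE TABLE entries (
        |         id   INTEGER PRIMARY KEY,
        |         tags TEXT                 -- JSON object of OSM tags
        |     );
        | 
        |     -- expression index: only the computed tags->>'name' values are
        |     -- stored, one index entry per row
        |     CREATE INDEX entries_name ON entries(tags->>'name');
        | 
        |     -- a query that repeats the same expression can use the index
        |     EXPLAIN QUERY PLAN
        |     SELECT id FROM entries WHERE tags->>'name' = 'Blue Bottle Coffee';
        |     -- expected plan: SEARCH entries USING INDEX entries_name (<expr>=?)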
        
       | DonnyV wrote:
        | If you convert those PBFs to Parquet files, you can then use
        | DuckDB to search them with sub-second response times. Plus you get
        | the added bonus of being able to host them in S3-style cloud
        | storage.
       | 
       | https://duckdb.org/docs/extensions/httpfs/s3api
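        | 
        | A minimal sketch of that workflow (file names and the bucket path
        | are illustrative; S3 credentials still need to be configured for
        | httpfs):
        | 
        |     INSTALL spatial; LOAD spatial;
        |     INSTALL httpfs;  LOAD httpfs;
        | 
        |     -- one-off conversion: PBF -> Parquet
        |     COPY (
        |         SELECT * FROM st_readOSM('north-america-latest.osm.pbf')
        |     ) TO 'north-america.parquet' (FORMAT PARQUET);
        | 
        |     -- later, query the Parquet file straight out of object storage
        |     SELECT COUNT(*)
        |     FROM read_parquet('s3://my-bucket/north-america.parquet')
        |     WHERE tags['amenity'] = ['cafe'];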
        
       | dzogchen wrote:
       | > This highlights the importance of iterative refinement and the
       | power of combining different technologies to solve problems.
       | 
       | This uninformative non-sentence sounds an awful lot like ChatGPT.
        
       | zamazan4ik wrote:
        | If you are interested in optimizing the project further, I can
        | suggest rebuilding SQLite with Profile-Guided Optimization
        | (PGO). I have collected as many materials as possible (including
        | many related benchmarks) in my repo:
       | https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and
       | PGO, I have the following link:
       | https://sqlite.org/forum/forumpost/19870fae957d8c1a
        
       ___________________________________________________________________
       (page generated 2024-07-06 23:00 UTC)