https://community.openstreetmap.org/t/new-osm-file-format-30-smaller-than-pbf-5x-faster-to-import/137151
OpenStreetMap Community Forum
New OSM file format: 30% smaller than PBF, 5x faster to import
General talk (tags: geodesk, gol)

GeoDeskTeam (GeoDesk Team) October 23, 2025, 3:25pm #1

The OSM dataset is huge, and keeps growing every day. Great news, of course, but sometimes the sheer volume can be overwhelming - there are just gobs and gobs of data!

Hence, we created GOB ("Geo-Object Bundle"), a new file format that makes tackling OSM data faster and easier. It's a companion format to our now-familiar Geo-Object Library (essentially, a tightly-compressed GOL with its indexes stripped).

To support this new format, GOL Tool 2.1 has two new commands: save GOLs as GOBs and load GOBs into a GOL. (Of course, like all of the GeoDesk Toolkit, the GOL Tool is free & open-source.)

[Image: GOB Tiling & Size Statistics]

Advantages of GOB

* GOB files are on average half the size of a GOL, and 30% smaller than PBFs.
* Importing a GOB is 5 times faster than building a GOL from a PBF. A modern system loads a planet-size GOB into a GOL in 3 minutes. The speed advantage grows more pronounced on memory-constrained machines: gol build starts paging heavily with less than 32 GB of RAM, whereas gol load requires minimal resources (even a decade-old laptop loads the whole planet in under an hour).
* GOBs are organized into tiles, so it's easy to extract regional subsets (basically at file-copy speed) and stitch them back together; that makes GOB a convenient format for archiving and distributing geodata.

The image above shows some of the tiling structure, which mimics that of tile renderers. On the left, the smallest squares are zoom 6; the right shows the most granular level (zoom 12). A typical planet GOB has about 60,000 tiles.
Below are some size statistics for the planet file and popular regional extracts (without metadata). Percentages are relative to the PBF size:

| Region      | PBF     | GOL     | GOL vs. PBF | GOB     | GOB vs. PBF |
|-------------|---------|---------|-------------|---------|-------------|
| Planet      | 65.4 GB | 93.6 GB | +43.1%      | 46.0 GB | -29.7%      |
| California  | 1.18 GB | 1.59 GB | +35.0%      | 770 MB  | -36.5%      |
| France      | 4.54 GB | 5.89 GB | +29.7%      | 2.84 GB | -36.3%      |
| Germany     | 4.29 GB | 5.92 GB | +38.0%      | 2.67 GB | -37.5%      |
| Italy       | 1.96 GB | 2.63 GB | +34.0%      | 1.34 GB | -31.6%      |
| Japan       | 2.13 GB | 2.91 GB | +36.1%      | 1.34 GB | -37.0%      |
| Poland      | 1.84 GB | 2.72 GB | +47.6%      | 1.29 GB | -29.7%      |
| Switzerland | 487 MB  | 634 MB  | +30.1%      | 311 MB  | -36.2%      |

Dense, well-mapped areas tend to compress best as GOB. Less complete regions are below average in terms of GOB's size advantage (GOBs for Brazil and China are only 23% smaller).

Limitations

Just like GOLs, GOBs don't store:

* metadata (timestamp of last edit, changeset, username, etc.)
* history (each GOB is a snapshot of the OSM dataset)

Therefore, the format is not intended for editing, but for archival and distribution.

How to work with GOBs

You will need GOL Tool 2.1 or above (download).

To export a GOL as a GOB:

    gol save []

If the optional GOB name is omitted, it uses the same base name as the GOL. The .gol and .gob extensions are optional.

To limit the export to a specific area, use the --area (-a) option. You can specify a (multi)polygon as WKT, GeoJSON or simple coordinates (lon,lat pairs, rings are closed automatically), either directly or as a file. If no file extension is given, .wkt is assumed.

For example:

    gol save world bodensee -a 9.55,47.4,8.78,47.66,9.01,47.88,9.85,47.58,9.82,47.46

exports the tiles covering the region around the Bodensee (Lake Constance).

To import tiles into a GOL:

    gol load []

As with save, if the optional GOB name is omitted, the base name of the GOL is used. If the GOL does not exist, it is created. To load just a specific region, restrict it with the -a option. For example:

    gol load japan -a shikoku

loads tiles from japan.gob into japan.gol (creating it if it doesn't yet exist), but only those intersecting the area defined in shikoku.wkt.
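To illustrate the "simple coordinates" form of the -a option, here is a minimal Python sketch that turns a flat lon,lat list into WKT polygon text, closing the ring explicitly (the post says the GOL Tool does this automatically). The helper name coords_to_wkt is hypothetical and not part of the GeoDesk Toolkit; the sketch only shows what an equivalent .wkt file could contain.

```python
def coords_to_wkt(coords: str) -> str:
    """Turn 'lon,lat,lon,lat,...' into a WKT POLYGON string.

    Hypothetical helper for illustration; not part of the GOL Tool.
    """
    values = [float(v) for v in coords.split(",")]
    if len(values) % 2 != 0 or len(values) < 6:
        raise ValueError("need at least three lon,lat pairs")
    # Pair up alternating lon and lat values
    ring = list(zip(values[0::2], values[1::2]))
    # Close the ring if the last point differs from the first
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    points = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON (({points}))"


# Coordinates taken from the Bodensee example in the post
bodensee = "9.55,47.4,8.78,47.66,9.01,47.88,9.85,47.58,9.82,47.46"
print(coords_to_wkt(bodensee))
```

Assuming the GOL Tool accepts a plain WKT polygon as described above, the printed text could be saved as bodensee.wkt and referenced with -a bodensee.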
Available datasets

* Open Planet Data publishes the full planet as a GOB (< 50 GB, updated daily).

What's next

This is still a work in progress, so the format may change. I'm experimenting with different compression algos beyond zlib to make it even tighter and faster (zstd didn't yield any significant gains).

I'm also in the process of enabling gol load to download a GOB directly from a URL and build the GOL in the background, which would bring the wall-clock import time to zero.

As always, questions/feedback are welcome! Please stop by on GitHub and @geodesk@en.osm.town.

10 Likes

iandees (Ian Dees) October 23, 2025, 3:28pm #2

What's a GOL?

GeoDeskTeam (GeoDesk Team) October 23, 2025, 3:31pm #3

GOL = Geo-Object Library (a compact single-file database for OSM features)

Marcos_Dione (Marcos Dione) October 23, 2025, 6:48pm #4

# GeoDeskTeam:
It's a companion format to our now-familiar Geo-Object Library (essentially, a tightly-compressed GOL with its indexes stripped).

# GeoDeskTeam:
Importing a GOB is 5 times faster than building a GOL from a PBF.

So, it's a given format (GOL), stripped of everything that can be generated from the rest of the data (the indexes), then compressed, and it imports into the original format (GOL) 5 times faster than importing from another format [but only to that format] [probably because it just decompresses and generates the indexes].

GeoDeskTeam (GeoDesk Team) October 23, 2025, 7:24pm #5

Yes, in a nutshell (plus the ability to load/extract specific areas).

This is especially useful for OSM-based applications hosted on a low-power virtual server (e.g. 2 cores, 8 GB RAM). Loading a GOL from a GOB requires far less memory than building from a PBF (for which the process needs to keep a node index in memory to assemble the geometries of ways, or else it will start paging furiously). On that kind of machine, the speed increase will be closer to 15x.
From a GOL, you can then export to other formats (GeoJSON, WKT, CSV, OSM-XML), or perform queries using a Python script or C++/Java application.

2 Likes

pnorman (Paul Norman) October 24, 2025, 10:01am #6

It sounds like GOB is an intermediate format for your library which reduces data stored or transferred at the cost of some CPU time, is that correct?

What software outside of GOL can currently read or write GOB? Can any of the standard libraries people use read it and create geometries?

2 Likes

CommanderStorm (Commander Storm) October 24, 2025, 10:30am #7

Could you go into more detail about how the format works? I was unable to find a higher-level overview in the docs or on GitHub.

What encodings do you use? How effective are they for you?

Here is our research, in case you are interested: https://arxiv.org/pdf/2508.10791

2 Likes

GeoDeskTeam (GeoDesk Team) October 24, 2025, 2:05pm #8

Yes, GOB is a companion format to GOL:

* GOL is a single-file database. It is uncompressed and indexed for fast queries (~100 GB for the planet)
* GOB is for archival and distribution. Only the essential data is stored, in a tightly-compressed layout, then compressed further with zlib (~50 GB for the planet)

The hardest part of working with OSM data as a data consumer is assembling OSM elements into geometries. Since ways store references to nodes rather than actual coordinates, this requires a lookup strategy that turns node IDs into longitude/latitude. The typical approach is a hashmap for smaller sets, or a dense array for the full planet. (Old news to you, of course - just summarizing for other readers.)

Keeping the node coordinates in a dense array takes up close to 100 GB. The GOL Tool uses a different indexing approach that brings this down to about 25 GB, but that's still too heavy for most laptops or virtual servers (gol build won't run out of memory, but it will take several hours due to paging).
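To make the lookup problem concrete, here is a rough Python sketch of the two generic strategies mentioned above: a hashmap for small extracts, and a dense array indexed by node ID for the full planet. This is not GOL Tool code, and it does not show the GOL Tool's own (more compact) index. All function names are hypothetical, and the node count used in the size estimate is an assumed round figure that does not come from this thread.

```python
from array import array


# Strategy 1: hashmap from node ID to coordinates (fine for small extracts)
def resolve_way(node_refs, coords_by_id):
    """Replace a way's node IDs with (lon, lat) tuples."""
    return [coords_by_id[ref] for ref in node_refs]


nodes = {101: (9.55, 47.40), 102: (8.78, 47.66), 103: (9.01, 47.88)}
way = [101, 102, 103, 101]  # closed way: first node repeated at the end
print(resolve_way(way, nodes))


# Strategy 2: dense array with two 32-bit integers per node ID
# (lon and lat in units of 1e-7 degrees). Typecode "i" is assumed to be
# 4 bytes wide, which holds on common platforms, giving 8 bytes per node.
def dense_index(max_node_id):
    return array("i", [0]) * (2 * (max_node_id + 1))


def put(index, node_id, lon, lat):
    index[2 * node_id] = round(lon * 1e7)
    index[2 * node_id + 1] = round(lat * 1e7)


def get(index, node_id):
    return index[2 * node_id] / 1e7, index[2 * node_id + 1] / 1e7


def dense_index_bytes(max_node_id, bytes_per_node=8):
    """Size of a dense index that has one slot for every possible node ID."""
    return (max_node_id + 1) * bytes_per_node


# ASSUMPTION: highest node ID of roughly 13 billion (illustrative figure only)
print(f"{dense_index_bytes(13_000_000_000) / 1e9:.0f} GB")
```

The dense array is fast because a lookup is a single index operation, but its size depends on the highest node ID rather than on the number of nodes actually present, which is why it only pays off for planet-scale imports.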
So those users can now download a GOB (which contains the ways with their geometries fully resolved, and relations with optimized member references) and turn it into a GOL, with minimal RAM needed. Basically, gol load reads the required tiles from the GOB, unzips them, transforms them into a layout suitable for querying, indexes the features, and writes the tiles into the GOL.

At this point, only the GOL Tool supports GOBs. In theory, the tiles within a GOB could be exported directly into other formats, but it's unlikely I'll implement that capability, since it's fast to turn a GOB back into a GOL (which already supports multiple export formats).

If there's enough interest, I may refactor the GOB functionality from the GOL Tool codebase into a separate library (I'd prefer to keep it out of libgeodesk, which is meant to be lightweight).

As a side note, at some point I'd like to propose GOL as an alternative input format for osm2pgsql. This should shave at least a third off the import time (and make it feasible to run on low-end hardware), while only adding 200 KB to the executable size. Since you're a key contributor, I'd love to know your thoughts.

GeoDeskTeam (GeoDesk Team) October 24, 2025, 2:09pm #9

Thanks for posting the research paper, I'll have a more thorough read-through later. The SIMD and GPU-based accelerations sound fascinating.

I haven't yet published a technical spec for the GOL/GOB file formats. I will do this eventually, since this will be essential to get more developers involved in the project.

Here's a high-level overview:

The basic file structure is broken into tiles (contiguous chunks of storage up to 1 GB in size, typically 500 KB to 4 MB). The tiling scheme is the same as commonly used by tile servers: a single square that covers the world in Mercator projection at zoom 0, recursively divided into quadrants.
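For readers unfamiliar with that scheme, the following Python sketch shows the standard Web Mercator tile arithmetic used by tile servers: which tile contains a given lon/lat at a given zoom, and how a tile relates to its parent and its four quadrants. Whether GOL/GOB number their tiles exactly this way internally is an assumption; the function names are hypothetical.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Column/row of the Web Mercator tile containing a point at a zoom level."""
    n = 1 << zoom  # number of tiles along each axis
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the outer edge stay inside the grid
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def parent(x, y, zoom):
    """The tile one zoom level up that contains this tile."""
    return x >> 1, y >> 1, zoom - 1


def children(x, y, zoom):
    """The four quadrants a tile is divided into at the next zoom level."""
    return [(2 * x + dx, 2 * y + dy, zoom + 1) for dy in (0, 1) for dx in (0, 1)]


# A point near the Bodensee at the finest level mentioned in the thread
x, y = lonlat_to_tile(9.55, 47.4, 12)
print(x, y, parent(x, y, 12), children(x, y, 12))
```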
There's one important difference: whereas tile servers produce MVTs/PNGs for every tile at a given zoom level, the tiling in a GOL is sparse. In low-density areas such as oceans and deserts, tiling stops at level 4 or possibly 6, whereas in dense urban areas, tile granularity can go up to level 12. That's why a typical planet-size GOL stores about 50K tiles instead of millions.

A recurring theme is the design around locality of reference, cutting down not only IO, but also access to main memory, by defining structures that maximize use of CPU caches. Tiles themselves are divided into "hot" and "cold" zones, attempting to keep frequently accessed data together, and also ensuring that features that are spatially close and/or related thematically are placed in contiguous storage locations. This reduces the number of pages that need to be loaded to perform queries. By contrast, an SQL-based database that treats individual features as rows essentially stores them wherever there is space. Even though SSDs don't have seek costs in the traditional sense, most still perform dramatically better on sequential reads, and a tighter layout means more data can be cached in memory.

For low-level storage, both GOL and GOB use techniques similar to MVT, e.g. delta-encoded LEB128 for coordinates.

GOL/GOB also extensively de-duplicate structures. For example, a tile of a city might contain thousands of palm trees, or buildings of the same type. These can share a common tag-table. The same goes for strings. (A traditional database typically stores a separate set of tags for each feature, leading to needless bloat, especially if stored columnar.)

Here are some performance stats (10x2.3 Haswell Xeon, 32 GB, PCI3 NVMe):

* Find all Italian restaurants (points and polygons) in the U.S. (based on a detailed admin-area polygon), using a planet-wide dataset (i.e. world("na[amenity=restaurant][cuisine=italian]").within(usa)): 52 milliseconds
* Measure the length of all canals in a bounding box covering 500 square km: 47 microseconds
* Find all features in a bounding box spanning multiple city blocks: 3 microseconds

Results are fairly consistent across the 3 toolkits, with Java about 50% slower for polygon-constrained queries. Measurements are median timings based on a mixed-workload simulation (so large portions of at least the indexes will be in cache). Further optimizations could possibly gain another 20% speedup (e.g. SIMD instructions for bbox checks).

As far as the data structures themselves, I typically don't design for particular CPU architectures, because that field is evolving so rapidly. The general trend in CPUs is heavily multi-core, and this benefits the GOL design. A regional query can easily span 100+ tiles, which can be processed by multiple cores in parallel.

pnorman (Paul Norman) October 24, 2025, 4:01pm #10

# GeoDeskTeam:
I'd like to propose GOL as an alternative input format for osm2pgsql. This should shave at least a third off the import time (and make it feasible to run on low-end hardware), while only adding 200 KB to the executable size. Since you're a key contributor, I'd love to know your thoughts.

We support whatever libosmium supports and don't do any parsing of different file formats within OSM.

I haven't done any recent measurements but I can't see any file format cutting a third of the import time. Far less than a third of the time is spent reading the file. Most of the time is spent in Postgres or building geometries.

I have found osm2pgsql feasible to run on any hardware that can handle the resulting database. osm2pgsql is designed to create databases for certain uses, and those uses tend to require more RAM/IOPS than osm2pgsql.

1 Like

lonvia (Lonvia) October 24, 2025, 6:47pm #11

Do you have a technical specification of the file format?
Note that I'm not interested in how the GOL tool works but in how the data is practically stored in the file. Ideally you have a specification somewhere that is detailed enough that in theory one could write a parser for the format.

Not that I want to write another parser. What you propose sounds really interesting and I just want to understand how it works. From your description I take it that the format rearranges the data (the tile-based processing), and it very much sounds like the conversion from PBF is lossy in more ways than just "dropping metadata". Both can be very reasonable design choices but might also limit the way in which the data can be processed. So it might be really useful for some use cases and less ideal for others. But that is really hard to judge without having more technical details.

1 Like

0235 (0235) October 24, 2025, 6:47pm #12

Will you be releasing an official Torrent for this data?