[HN Gopher] Hexagons and Hilbert curves - The headaches of distr...
       ___________________________________________________________________
        
       Hexagons and Hilbert curves - The headaches of distributed spatial
       indices
        
       Author : max_sendfeld
       Score  : 57 points
       Date   : 2024-03-22 08:20 UTC (14 hours ago)
        
 (HTM) web link (hivekit.io)
 (TXT) w3m dump (hivekit.io)
        
       | otoburb wrote:
       | I knew about hexagons but not about R-trees and Hilbert curves,
       | and then combining the two. Even though that was an inbound
       | content article, it _was_ at least informative.
        
       | michelpp wrote:
       | I experimented with geospatial Hilbert Curves as a Postgres
       | extension [0] for PostGIS using the S2 [1] spherical geometry
       | library. S2 uses a scale free cell coverage pattern that is
       | numbered using six interlocking space filling Hilbert Curves [2].
       | This approach is similar to what is used in this article. S2
       | doesn't use hexagons but a different cell structure but the idea
       | is generally the same.
       | 
       | By having both high level (cell) and low level (cell id)
       | geometries it was a very powerful library which allowed
       | projection from the hilbert space into a Postgres spatial index
       | (spgist) including various trees, like noted in this article. It
       | appears to be still quite active in development.
       | 
       | [0] https://github.com/michelp/pgs2
       | 
       | [1] https://s2geometry.io/
       | 
       | [2] https://s2geometry.io/devguide/s2cell_hierarchy
        
         | max_sendfeld wrote:
         | Ah - I've used S2 in the past. Great work. It scales really
         | well for larger datasets.
         | 
         | Interesting that you mention the use of multiple hilbert curves
         | as well. We also experimented with two Hilbert Curves, rotated
         | by 90 degrees. This helps to get around what we've dubbed the
         | "Hilbert Equator" problem where two objects are quite far on
         | the curve because they are placed close to one of the major
         | fault lines in the fractal (for lack of a better word)
        
           | michelpp wrote:
           | Tackling these boundary problems are the literal "edge cases"
           | in geospatial indexing and they exist everywhere, so this i a
           | good reason for using an existing library as the authors have
           | already solved them.
           | 
           | Hexagons are cool, but they are not necessarily the bestagon
           | for a spherical geometry since you cannot break a hexagon
           | into smaller hexagons, whereas an S2 cell is a "cube" with
           | spherical sides or HEALPix uses a rhombic dodecahedron [0]
           | both of which can be split into smaller divisions of
           | themselves.
           | 
           | Not to discourage you from your experimentation, it's all
           | trade offs and you might find a good one. Good luck!
           | 
           | [0] https://en.wikipedia.org/wiki/HEALPix
        
           | plagiarist wrote:
           | Does that end up with specific points that have that same
           | difficulty at intersections of equators? I suppose even if
           | that is the case, a point is something that's easier to
           | position over a dead area than a line would be.
        
           | jandrewrogers wrote:
           | There are always caveats with these types of indexing
           | methods, especially if you require dynamic high-performance
           | indexing and support for rectangles/polygons. You can only
           | move the edge cases around, there is no way to eliminate
           | them. This allows you to tailor an indexing scheme for
           | specific workload assumptions but this obviously breaks down
           | if you need an algorithm that generalizes to many workloads
           | and data models. There isn't just one type of edge case with
           | this type of indexing, there are several which may or may not
           | be relevant depending on what you are trying to do.
           | 
           | Some research from the 1980s showed it is only possible to
           | mitigate bounded categories of edge case, thereby improving
           | generality, by indexing on complex higher-dimensionality
           | embeddings. Mitigating more categories requires more
           | dimensions and more complexity. However, no one could figure
           | out how to construct these embeddings for even basic cases or
           | deal with more practical curse of dimensionality issues, so
           | that is largely forgotten (the researchers themselves made
           | comments to the effect that they didn't think a tractable
           | solution was possible). I've never even been able to find
           | that literature in electronic form, unfortunately.
           | 
           | Like AI, it is an interesting open-ended problem space. You
           | can prove that an elegant optimal solution is not tractable
           | so it ends up being a search for asymptotically optimal
           | algorithms that become exponentially more complex the closer
           | you get to optimal.
        
         | durkie wrote:
         | any idea why postgis hasn't formally incorporated S2 cell
         | functions? It seems like it'd be right up their alley.
        
       | patches11 wrote:
       | It would be interesting to know what other solutions exist in
       | this space and why they weren't chosen.
       | 
       | I know of at least one, GeoMesa, which seems like it could at
       | least provide the building blocks to achieve what they are trying
       | to do.
        
         | joe_the_user wrote:
         | There's a long of study on the subject (that I've only
         | surveyed). See my comment
         | https://news.ycombinator.com/item?id=39792884
        
       | tromp wrote:
       | > a vehicle with a Hilbert Curve position of 0.34 is really close
       | to one with 0.35 and really far from one with 0.89
       | 
       | But points with a large difference in their single curve
       | coordinate can be either far apart or close together. E.g. on
       | this 16 point Hilbert curve                    __.  .__
       | __|  |__          |   __   |         |__|  |__|
       | 
       | the '.' marked points at 1/16th and 15/16th along the curve are
       | adjacent.
        
         | wolframhempel wrote:
         | It's an important point. We've ended up using two Hilbert
         | curves, rotated by 90 degrees to tackle this problem. If it is
         | close on either, it is close in 2D space
        
           | tromp wrote:
           | Interesting. So one might think that being close together in
           | 2D space corresponds to being close on
           | 
           | BOTH cartesian coordinates / EITHER Hilbert coordinate
           | 
           | And that being far apart in 2D space corresponds to being far
           | apart on
           | 
           | EITHER cartesian coordinate / BOTH Hilbert coordinates
           | 
           | But if we consider the two commas below which are close
           | together in 2D space, we see they are far apart in any
           | rotation of this Hilbert curve?!                       __ __
           | __ __         |__|   __|  |__   |__|          __   |__    __|
           | __         |  |__ __|  ;__ __|  |         |__    __,__ __
           | __|          __|  |__    __|  |__         |   __   |  |   __
           | |         |__|  |__|  |__|  |__|
        
           | sp332 wrote:
           | So you converted your 2D space into another 2D space?
        
           | dwallin wrote:
           | As mentioned in other replies, this doesn't fully solve a lot
           | of the cases. I think z order curves might actually be better
           | here, they are slightly less accurate individually, but in a
           | way that would seemingly minimize the worst case results when
           | taking the minimum of the two. But at that point you are
           | probably better off using a single z-order curve and using
           | the coordinates to check the manhattan or Chebyshev distance
           | using bitwise operations.
        
             | jandrewrogers wrote:
             | I would second the use of Z-order curves. Hilbert curves
             | were famously known as more optimal on spinning disk in a
             | specific context a few decades ago. Inertia has kept them
             | top-of-mind even though modern systems do not have the
             | problems Hilbert curves were better at solving and those
             | curves are more complex to use than e.g. Z-order curves.
        
       | zX41ZdbW wrote:
       | I've recently implemented a map of the Internet using a similar
       | technique: https://reversedns.space/
       | 
       | In ClickHouse, we have almost every mentioned technique: H3 and
       | S2, geohashes, and indexing by space-filling curves.
        
         | esafak wrote:
         | For reference:
         | 
         | https://clickhouse.com/docs/en/sql-reference/functions/geo/h...
         | 
         | https://clickhouse.com/docs/en/sql-reference/functions/geo/s...
        
       | joe_the_user wrote:
       | The problem of finding the shortest path on a map (a la google
       | maps etc) is often solved with approaches like "hub networks" and
       | "contraction hierarchies" rather than simple spatial networks.
       | Things like dijkstra's algorithm are only efficient if you're
       | input is the raw graph and point and you only want one path once.
       | 
       | See just for a start: https://arxiv.org/abs/1504.05140
        
       | spenczar5 wrote:
       | Oh boy, this gives me a chance to talk about one of the gems of
       | astronomy software which deserves to be better known: HEALPixel
       | tesselation!
       | 
       | HEALPixels stand for 'Hierarchical Equal-Area Iso-latitudinal
       | Pixels'. It is a scheme that was developed to analyze signals
       | that cover the entire sky, but with variable density.
       | 
       | Like HTM or Hilbert curves, this can be used to organize spatial
       | data.
       | 
       | The tesselation looks kind of funny but has many good features -
       | it doesn't have discontinuities at poles, and is always equal
       | area. And with the "nested" healpixel formulation, pixels are
       | identified by integers. Pixel IDs are hierarchical based on
       | leading bits - so, for example, pixel 106 (=0110 1010) contains
       | pixel 1709 (=0110 1010 1101). This lets you do some marvelous
       | optimizations in queries if you structure your data
       | appropriately. Nearest neighbor searches can be extremely quick
       | if things are HEALPix-indexed - and so can radius searches, and
       | arbitrary polygon searches.
       | 
       | HEALPixels are used today for more than just their original
       | intent. LSST will use them for storing all-sky data and point
       | source catalogs, for example.
       | 
       | More here:
       | 
       | - Original NASA/JPL site: https://healpix.jpl.nasa.gov/
       | 
       | - Popular Python implementation:
       | https://healpy.readthedocs.io/en/latest/
       | 
       | - Good PDF primer: https://healpix.jpl.nasa.gov/pdf/intro.pdf
       | 
       | And an experimental database being built on healpix for extremely
       | large data volumes (certainly many TB, maybe single-digit PB):
       | https://github.com/astronomy-commons/hipscat
        
         | aimor wrote:
         | Thanks for sharing, I've needed this for years and never
         | stumbled across it.
        
         | eh_why_not wrote:
         | Gem indeed! Here's its wikipedia page:
         | 
         | https://en.wikipedia.org/wiki/HEALPix
        
       | feverzsj wrote:
       | The purpose of using space-filling curves is to convert
       | coordinates into 1 dimensional data, thus they can be indexed by
       | ordinary database index, no matter distrusted or not. The problem
       | is it's much slower than a "real" spatial index like R-tree, and
       | the supported queries are rather limited.
       | 
       | On the other hand, spatial data can be easily partitioned,
       | rendering the aforementioned method less useful. Also, I've never
       | seen anyone needed more than a single instance of postgres to
       | index all the spatial data, unless you are dealing with spatial
       | temporal data, which can grow infinitely large. Then again,
       | spatial temporal data can be easily partitioned.
        
       | Lichtso wrote:
       | Here is one of the recent advancements in performing similarity
       | (including nearest neighbor) search on sparse high dimensional
       | data using learned indices: https://arxiv.org/abs/2204.10028
       | 
       | The paper basically deals with the most adverse case but the
       | approach should work in lower dimensions and in data with a non-
       | uniform density distribution as well.
        
       ___________________________________________________________________
       (page generated 2024-03-22 23:01 UTC)