Instead of using integer IDs for the hexes, we created an encoded version of the ID with the property that removing a character gives you the containing parent cell. This means we can do basic containment queries by querying with a low-resolution hex (a short string) as a prefix query. If a GPS track goes through that larger parent cell, the track will have hexes with the same prefix. You don't get perfect control of distances because hexes have varying diameters (or rather approximate diameters, since they're hexagons, not circles), but in practice, at scale, for a product that doesn't require high precision, it's very effective.
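A rough sketch of the idea (not the commenter's actual code): it assumes h3-py v4 function names, uses the 15-character res-0 cell string as a fixed-length root token, and appends one child-position digit per resolution, so dropping a trailing digit gives the parent's ID.

```python
import h3

DIGITS = "0123456"  # every H3 cell has at most 7 children

def prefix_id(cell: str) -> str:
    """Root token plus one child-position digit per resolution level."""
    res = h3.get_resolution(cell)
    out = [h3.cell_to_parent(cell, 0)]                      # fixed-length root token
    for r in range(1, res + 1):
        parent = h3.cell_to_parent(cell, r - 1)
        ancestor = h3.cell_to_parent(cell, r)
        siblings = list(h3.cell_to_children(parent, r))     # 6 or 7 cells
        out.append(DIGITS[siblings.index(ancestor)])        # position within the parent
    return "".join(out)

# Index a GPS track as a set of res-9 prefix IDs, then ask whether it passes through
# a coarse res-6 cell with a plain "starts with" check (or a prefix query in a store).
track = [(37.7749, -122.4194), (37.7790, -122.4150)]        # made-up points
track_ids = {prefix_id(h3.latlng_to_cell(lat, lng, 9)) for lat, lng in track}
coarse = prefix_id(h3.latlng_to_cell(37.7749, -122.4194, 6))
print(any(i.startswith(coarse) for i in track_ids))
```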
I think by the end of this year we'll have about 6 TB of these hex sets in a four-node, eight-process ES cluster. Performance is pretty good. It also acts as our full-text search; half the time we want a geo search, we also want keyword matching / filtering / etc. on the metadata of these trips.
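A combined query might look something like this; the index and field names are made up, and it assumes the 8.x Elasticsearch Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
coarse_prefix = "8029fffffffffff012345"   # placeholder prefix ID like the ones sketched above

resp = es.search(
    index="trips",
    query={
        "bool": {
            "must": [{"match": {"description": "gravel climb"}}],   # full-text part
            "filter": [
                {"prefix": {"hex_cells": coarse_prefix}},            # geo containment via prefix
                {"term": {"activity_type": "ride"}},                 # plain metadata filter
            ],
        }
    },
)
print(resp["hits"]["total"])
```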
Pretty fun system to build, and the concept works with a wide variety of data stores. It felt like a total hack job, but it has stood the test of time.
Thanks Uber, H3 is a great library!
If joins are a critical, performance-sensitive operation, the most important property of a DGGS is congruency, i.e. that child cells exactly tile their parent. H3 is not congruent; it was optimized for visualization, where congruency doesn't matter, rather than for analytical computation. For example, the article talks about deduplication, which isn't even necessary with a congruent DGGS. You can do joins with H3, but as a general rule it's not recommended unless the data is small enough that you can afford to brute-force it to some extent.
H3 is great for point-geometry aggregates; it shines at that. Not so much at geospatial joins, though. DGGSs optimized for analytical computation (and, by implication, joins) do exist; they just aren't optimal for trivial visualization.
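A quick way to see the non-congruency, assuming h3-py v4: for points near cell edges, the res-8 cell that geographically contains the point is not always the index parent of the point's res-9 cell, which is one reason parent/prefix-based containment and joins produce edge effects that need cleanup.

```python
import random
import h3

random.seed(42)
trials, mismatches = 20000, 0
for _ in range(trials):
    # random points in a small, arbitrary area around San Francisco
    lat = 37.7 + random.random() * 0.1
    lng = -122.5 + random.random() * 0.1
    fine = h3.latlng_to_cell(lat, lng, 9)      # res-9 cell containing the point
    coarse = h3.latlng_to_cell(lat, lng, 8)    # res-8 cell containing the point
    # The index parent of the fine cell is not always the coarse cell
    # that geographically contains the point:
    if h3.cell_to_parent(fine, 8) != coarse:
        mismatches += 1
print(f"{mismatches}/{trials} points fall outside their res-9 cell's res-8 parent")
```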
At 500 stations:
- H3: 218µs, 4.7KB, 109 allocs
- Fallback: 166µs, 1KB, 37 allocs
- Fallback is 31% faster
At 1000 stations:
- H3: 352µs, 4.7KB, 109 allocs
- Fallback: 312µs, 1KB, 37 allocs
- Fallback is 13% faster
At 2000 stations:
- H3: 664µs, 4.7KB, 109 allocs
- Fallback: 613µs, 1KB, 37 allocs
- Fallback is 8% faster
At 4500 stations (real-world scale):
- H3: 1.40ms, 4.7KB, 109 allocs
- Fallback: 1.34ms, 1KB, 37 allocs
- Fallback is 4% faster
Conclusion: the gap narrows as the station count increases. At 4500 stations they're nearly equivalent. H3 has a fixed overhead (~4.7 KB / 109 allocs for the k=2 ring), while the fallback scales linearly. The crossover point where H3 wins is likely around 10-20K entries.
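The numbers read like Go benchmark output; purely as illustration, here is a Python sketch of the two strategies as I understand them from the description (made-up station data and resolution, assuming h3-py v4): a lookup table keyed by H3 cell probed over a k=2 grid_disk, versus a plain linear scan.

```python
import math
import h3

RES = 8  # hypothetical indexing resolution

def dist(lat1, lng1, lat2, lng2):
    # Equirectangular approximation; good enough for ranking nearby stations.
    x = math.radians(lng2 - lng1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return math.hypot(x, y)

def build_index(stations):
    """Map H3 cell -> list of stations located in that cell."""
    index = {}
    for s in stations:
        index.setdefault(h3.latlng_to_cell(s["lat"], s["lng"], RES), []).append(s)
    return index

def nearest_via_h3(index, lat, lng, k=2):
    """Only consider stations inside the k-ring around the query point's cell."""
    candidates = []
    for cell in h3.grid_disk(h3.latlng_to_cell(lat, lng, RES), k):
        candidates.extend(index.get(cell, []))
    return min(candidates, key=lambda s: dist(lat, lng, s["lat"], s["lng"]), default=None)

def nearest_fallback(stations, lat, lng):
    """Brute-force linear scan over every station."""
    return min(stations, key=lambda s: dist(lat, lng, s["lat"], s["lng"]), default=None)

# usage:
# stations = [{"id": 1, "lat": 52.52, "lng": 13.405}, ...]
# idx = build_index(stations)
# print(nearest_via_h3(idx, 52.53, 13.40), nearest_fallback(stations, 52.53, 13.40))
```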
I wonder how it compares with geohashing. I know geohashes aren't as efficient in terms of partitioning, and queries end up a bit awkward since you have to handle decoding neighbor cells, but finding all elements of a cell is a "starts with" query, which lets you store the data effectively on most NoSQL databases that have some sort of text sorting.
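For reference, a minimal self-contained geohash encoder (the standard algorithm, nothing project-specific) that shows the truncation property being described: a coarse geohash is a prefix of every finer geohash inside it, so cell membership really is a "starts with" query.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lng: float, precision: int = 9) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits, 5 bits per char."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, n_bits, use_lng = [], 0, 0, True
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            bit = lng >= mid
            lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | int(bit)
        n_bits += 1
        use_lng = not use_lng
        if n_bits == 5:
            chars.append(BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)

# Truncating a geohash yields the containing cell, so coarse containment is a prefix query:
full = geohash(48.8584, 2.2945, 9)
assert full.startswith(geohash(48.8584, 2.2945, 5))
print(full)
```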