The Problem with Traditional Indexes and Spatial Queries

The Problem with Traditional Indexes and Spatial Queries

传统索引与空间查询的问题

THE PROBLEM Let’s say you’re building an app called “Tomato” to help people find killer Indian Restaurants within 10km (or 6.2 miles) of them. Normally, your first instinct is to throw a standard B-Tree index on your latitude and longitude coordinates.

问题所在 假设你正在开发一款名为“Tomato”的应用程序,旨在帮助用户找到方圆 10 公里(约 6.2 英里)内最棒的印度餐厅。通常,你的第一直觉是在经纬度坐标上建立一个标准的 B-Tree 索引。

CREATE TABLE restaurants (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    latitude DECIMAL(10, 8),
    longitude DECIMAL(11, 8)
);

CREATE INDEX idx_lat ON restaurants(latitude);
CREATE INDEX idx_lng ON restaurants(longitude);

Looks totally fine, right? But here’s where it all falls apart. When you try to run a query to get restaurants within 10km of a specific lat and long, you’re going to hit a wall.

看起来完全没问题,对吧?但问题就在这里。当你尝试运行查询以获取特定经纬度 10 公里范围内的餐厅时,你就会碰壁。

-- 10 km radius. Conversions:
-- 1° latitude ≈ 111 km → 10 / 111 ≈ 0.0901° lat
-- 1° longitude ≈ 111 km × cos(37.77°) → 10 / (111 × 0.7906) ≈ 0.1140° lng

SELECT id, name, latitude, longitude 
FROM restaurants 
WHERE latitude BETWEEN 37.7749 - 0.0901 AND 37.7749 + 0.0901 -- 37.6848 .. 37.8650
  AND longitude BETWEEN -122.4194 - 0.1140 AND -122.4194 + 0.1140; -- -122.5334 .. -122.3054

You’ll probably attempt a range scan on one of the indexes. If it grabs the latitude first, the index does its thing and spits out restaurants between the two latitudes. But here’s the catch: that latitude query returns a massive, wrap-around-the-globe strip of results, bringing in restaurants from Mexico, Dubai, and India all at once. And as for longitude? It doesn’t even do a range query. It drops to a painfully slow per-candidate scan and filter. Even if you try to get clever and force an index intersection, the database still has to sweat through merging two giant sets of results. And the best part… Even if you pull off that intersection perfectly, you get a rectangle, not a true 10km radius circle. You still have to write extra math logic to shave off the corners and filter out the noise. The core issue is that slapping B-Tree indexes directly onto geo-spatial data just doesn’t work. They process the coordinates completely independently and have zero understanding of the actual 2D relationship between them.

你可能会尝试对其中一个索引进行范围扫描。如果它先处理纬度,索引会按预期工作,吐出两个纬度之间的所有餐厅。但问题在于:该纬度查询返回的是一条环绕全球的巨大结果带,将墨西哥、迪拜和印度的餐厅一并囊括。至于经度呢?它甚至无法进行范围查询,只能退化为极其缓慢的逐个候选扫描和过滤。即使你试图耍小聪明强制进行索引交集(Index Intersection),数据库仍然需要费力地合并两个巨大的结果集。最棒的部分是……即使你完美地完成了交集运算,你得到的也只是一个矩形,而不是真正的 10 公里半径圆。你仍然需要编写额外的数学逻辑来切掉边角并过滤掉噪音。核心问题在于,直接在地理空间数据上使用 B-Tree 索引根本行不通。它们完全独立地处理坐标,对坐标之间实际的二维关系毫无感知。

THE SOLUTION Every solid fix for this problem boils down to one simple idea: keep points that are physically close together close together in the index. Here are the three most popular ways to pull that off:

解决方案 解决这个问题的每一个可靠方案都归结为一个简单的理念:在索引中,将物理位置相近的点存储在一起。以下是实现这一目标的三种最流行的方法:

1. Geohash Think of a Geohash as a zip code for literally anywhere on Earth. It flattens a 2D coordinate into a short string of text. It’s designed so that places near each other start with the exact same letters. Since it’s just a regular string, any basic database can index and range-scan it—no fancy plugins required.

1. Geohash 可以将 Geohash 想象成地球上任何地点的邮政编码。它将二维坐标扁平化为一段简短的文本字符串。其设计初衷是让地理位置相近的地方以完全相同的前缀字母开头。由于它只是一个普通的字符串,任何基础数据库都可以对其进行索引和范围扫描——无需任何花哨的插件。

Example: A café and a bookshop on the same block in Bengaluru might both hash to tdr1vfp. Finding neighbors is as easy as asking the database, “give me everything starting with tdr1v.”

示例:班加罗尔同一街区的一家咖啡馆和一家书店的哈希值可能都是 tdr1vfp。寻找邻居就像问数据库“给我所有以 tdr1v 开头的数据”一样简单。

The Catch: Two spots could be a few meters apart but fall on opposite sides of a grid line. When that happens, they won’t share much of a prefix, so your code has to know to peek into neighboring grid cells just in case.

缺点:两个地点可能仅相隔几米,却恰好位于网格线的两侧。在这种情况下,它们不会共享太多的前缀,因此你的代码必须知道去查看相邻的网格单元以防万一。

2. Quadtree Imagine taking a map and cutting it into four equal squares. You only cut a square into four smaller squares if it gets too crowded with points. An empty desert stays as one giant square, while a packed city center gets chopped up into tiny ones. Detail scales with density. To find what’s nearby, you traverse down to the right square and check its immediate neighbors.

2. 四叉树 (Quadtree) 想象一下,将一张地图切成四个相等的正方形。只有当某个正方形内的点过于密集时,你才将其进一步切成四个更小的正方形。荒无人烟的沙漠保持为一个巨大的正方形,而拥挤的市中心则被切成微小的方块。细节程度随密度而变化。要查找附近的内容,你只需遍历到对应的方块并检查其直接邻居即可。

The Catch: This isn’t something you can easily hack onto a standard database index. It requires a purpose-built tree structure, so nowadays, you’ll see it more in tutorials and map-tiling systems than as a default production database index.

缺点:这并不是可以轻易通过标准数据库索引实现的。它需要专门构建的树结构,因此如今你更多是在教程和地图切片系统中看到它,而不是作为生产环境数据库的默认索引。

3. R-tree Instead of forcing a rigid grid over the world, R-trees draw loose bounding boxes snugly around natural clusters of things. Then, it groups those boxes into bigger boxes, like a set of nesting dolls. When you run a search, it only opens the boxes that actually overlap your search area and completely ignores the rest. Empty space literally costs nothing. This is the heavy hitter that PostGIS and MySQL use under the hood for spatial indexing.

3. R-tree R-tree 不会强行在世界上覆盖一个僵化的网格,而是围绕事物的自然聚类绘制松散的边界框。然后,它将这些框组合成更大的框,就像一套套娃。当你进行搜索时,它只会打开与你的搜索区域重叠的框,并完全忽略其余部分。空白空间几乎不产生任何成本。这是 PostGIS 和 MySQL 在底层用于空间索引的“重型武器”。

The Catch: Writes are heavy. When you insert new points, the database sometimes has to split and rebalance the rectangles. Also, because boxes can overlap, searches occasionally have to check more branches than you’d like.

缺点:写入开销较大。当你插入新点时,数据库有时必须拆分并重新平衡这些矩形。此外,由于框之间可能会重叠,搜索时偶尔需要检查比预期更多的分支。

TL;DR

  • The Problem: Standard database indexes (B-Trees) are terrible at handling maps. They process latitude and longitude separately, so asking for a 10km radius usually hands you a useless, globe-spanning strip of data instead of a neat local circle.
  • The Fix: You need a spatial index. They actually understand 2D space and keep physically close locations stored close together in the database.
  • The Big Three:
    • Geohash: Flattens coordinates into a short text string, acting like a zip code. If two spots share the same starting letters, they’re close. It’s incredibly easy to drop into any standard database.
    • Quadtree: Chops the map into four squares, and keeps subdividing squares only where things get crowded. It’s awesome for scaling detail based on density, but it requires a custom setup rather than a standard index.
    • R-tree: Draws snug, nesting boxes around natural clusters of points and completely ignores empty space. It’s the heavy-hitting industry standard (used in PostGIS and MySQL), though inserting new data can be a bit heavy on database writes.

总结

  • 问题: 标准数据库索引(B-Tree)在处理地图时表现极差。它们分别处理经度和纬度,因此查询 10 公里半径时,通常会返回一条无用的、环绕全球的数据带,而不是一个整洁的局部圆。
  • 修复: 你需要空间索引。它们真正理解二维空间,并将物理位置相近的地点存储在数据库中相邻的位置。
  • 三大主流方案:
    • Geohash: 将坐标扁平化为简短的文本字符串,充当邮政编码。如果两个地点共享相同的前缀字母,它们就是相近的。它非常容易集成到任何标准数据库中。
    • 四叉树 (Quadtree): 将地图切成四个正方形,仅在点密集的地方进行细分。它非常适合根据密度调整细节,但需要自定义设置,而非标准索引。
    • R-tree: 在自然聚类的点周围绘制紧密的嵌套框,并完全忽略空白空间。这是行业标准的“重型武器”(PostGIS 和 MySQL 均使用此技术),尽管插入新数据时数据库写入开销较大。