Friday, February 4, 2011

Spatial range queries using R-trees

Often, systems that holds geometric data require some sort of a spatial range query mechanism. That is, given a set of n object in k dimensions, we want to retrieve  a subset of n that overlaps a given range. When k=1, the range is represented as a line segment, when k=2 it would be a rectangle, and when k=3 a box. Of course this can be generalized for any k.

There are many known data structures for doing range queries, such as Quadtree, Octree and Kd-tree, just to name a few. A classic data structure, very popular in the world of GIS, is the R-tree. There are many variants of it (such as R+tree, R*tree), but in this post, I will describe an Haskell implementation of the basic, packed R-tree for 2D objects. The data structure is static, meaning that it is built once, and no shapes are added or removed later on (implementing a dynamic R-tree is a more complex task).

Here is a brief description on building a packed R-tree. Given a set of n shapes: points, polylines or polygons (Figure 1), we find the bounding box of each shape (Figure 2). Now, we need to set a parameter m which denotes the maximum children each node can have. Assume m=3. We sort all the shapes lexicographically according to the lower left corner point of their bounding box, and find the bounding box of each successive shapes triplet (Figure 3), these would be the leaf nodes. Then, we sort all the leaf nodes (again, according to the lower left corner of the bounding boxes), and group each successive triplet to an inner node. We repeat this process recursively until there is only one node, the root of the tree. Figure 4 illustrates the R-tree that would be created in our example. There is one inner node (the root) and three leaf nodes, each containing 3 shapes.

Figure 1 Figure 2

 So, to summarize the previous paragraph - we have constructed a tree such that each node covers the area of all its children. Now, to perform a query for a given range, at each inner node (starting from the root), we check which children nodes (if any) overlaps the search rectangle, and continue searching these nodes recursively, until we reach the leaf nodes, where we finally check which child shape overlaps.

The worst case run time complexity for a given search is O(n), although for most practical cases the running time is more closer to a logarithmic run time. The following link gives further details on the complexity analysis.

Figure 3 Figure 4

Now, we can present an Haskell implementation for the R-tree algorithm. We start by defining the shapes on which our R-tree will operate, namely - points, polylines and polygons. We also define the RNode data type which represents either an inner node, or a leaf:

For polylines and polygons we store the bounding box. For a point, the point itself is its bounding box. We define some helper functions for extracting the bounding box for each shape, and for performing two operation on bounding box - testBoundingBoxOverlap which returns true if two bounding boxes overlap, and totalBoundingBox which compute the minimal bounding box of two bounding boxes:

Now, we can write the functions needed for building the tree. buildPackedRTree is the main funciton. It accepts the number m and the list of shapes and returns the root of the tree. It sorts the shapes according to the lower left corner of the bounding box.It then uses an helper function, splitList, which splits the list of shapes to lists of size m. Then, a list of leaf nodes is passed to the buildInnerNodes helper function, which constructs the inner nodes recursively

Finally, we define the search function queryRTree, which takes the root node and a bounding box. It uses the searchNodes helper function, to filter out recursively all the relevant inner nodes, up to the leaf nodes.

In the above implementation, the sorting of the shapes is done simply by doing a lexicographical sort according to the lower left corner of the bounding box. There is an interesting R-tree variant, called Hilbert R-tree, which uses the Hilbert curve for sorting the bounding boxes. This approach increases the locality of the queried objects, thus increasing the the overall performance of the algorithm. The complete description of the Hilbert R-tree is given in this link.

No comments:

Post a Comment