1. Technical Field
The invention is related to systems and processes for identifying objects and/or points nearby a given object or point based on information accessed from a database of geometric data.
2. Background Art
There are in existence today large electronic databases containing information on objects associated with geometric systems. A very simple example of this is an electronic road map database. This type of database will typically contain information about the location of roads, towns and cities, and numerous other landmarks laid out in a planar geometry. In addition, the database will typically contain information about the landmarks found on the map. There are numerous software packages available, as well as sites on the Internet, where a user can access one of these electronic roadmap databases.
Another example of a large database containing geometric-type data in existence today is the Sloan Digital Sky Survey (SDSS) database. This database, which is accessible over the Internet using SkyServer database management applications, contains astronomical information for astronomers and for general science education. The database will eventually contain the results of a 5-year survey of the Northern sky to about ½ arcsecond resolution using a modern ground-based telescope. It will characterize about 200M objects in 5 optical bands, and will measure the spectra of a million objects.
More particularly, raw astronomical data is gathered by the SDSS telescope at Apache Point, N. Mex., and fed through data analysis software pipelines. Imaging pipelines analyze data from the imaging camera to extract about 400 attributes for each celestial object along with a 5-color “cutout” image. The spectroscopic pipelines analyze data from the spectrographs, to extract calibrated spectra, redshifts, absorption and emission lines, and many other attributes. The result is a large high-quality catalog of the Northern sky (and of a small stripe of the Southern sky). When complete, the survey data will occupy about 40 terabytes (TB) of image data, and about 3 TB of processed data. After calibration, the pipeline output is available to the astronomers in the SDSS consortium. Then, after approximately a year, the SDSS publishes the data to the astronomy community and the public—so in 2007 all the SDSS data will be available to everyone everywhere. The first year's SDSS data is now public. It is about 80 GB containing about 14 million objects and 50 thousand spectra.
Still another example of a database containing a large amount of geometric-type data is Microsoft's TerraServer Web site which provides free public access to a vast data store of maps and aerial photographs of the United States. TerraServer is a valuable resource for researchers who want to study geography, environmental issues or archeological mysteries. TerraServer database management applications allow a user to easily navigate the enormous amount of information in the database by selecting a location on a map or entering a place name. TerraServer is operated by the Microsoft Corporation as a research project for developing advanced database technology, and was born at the Microsoft Bay Area Research Center. Maps and images for TerraServer are supplied by the U.S. Geological Survey.
One of the most common tasks required of database management programs used in large databases containing geometric data, such as the aforementioned SkyServer and TerraServer applications, is to find all objects nearby a given object or location. For example, in the context of the SkyServer database, astronomers are especially interested in galactic clustering and the large-scale structure of the universe. As a result, astronomers routinely ask for all objects in a certain area of the celestial sphere. In the context of an electronic road map or the TerraServer, a user often wants to know what points of interest can be found in the vicinity of a particular location. It should be noted that while this problem of finding nearby objects or points of interest can be characterized in planar terms for maps (e.g., using a Cartesian scheme, or latitude and longitude, as a basis for measurement), in the case of the SkyServer search task the search has to be performed in terms of spherical coordinates (e.g., using the equatorial coordinate system of right ascension and declination, or an (x,y,z) unit vector in J2000 coordinates). As the problem of finding nearby objects when spherical coordinates are involved is the more difficult task, the SkyServer example will be used in the following description of the issues involved in finding nearby objects in a large database containing geometric information.
As indicated earlier, SkyServer applications and queries often need to find all objects nearby a given object in the celestial sphere. This is such a common operation that it is implemented as a series of table-valued functions that return all objects within a certain radius of a given equatorial coordinate point (right ascension and declination) or a given x,y,z unit vector in the J2000 coordinate system. In terms of the SQL relational database language used in the SkyServer application these functions are denoted as fGetNearbyObjEq and fGetNearbyObjXyz, respectively. The “Get Nearby Objects” functions first use the hierarchical triangular mesh (HTM) code to limit the scope of search, and then filter the objects identified in the search using an equation to compute the actual distance to ensure the object is within the specified distance from the object being considered.
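The two-stage structure described above (a coarse search followed by an exact distance filter) can be sketched as follows. This is a minimal illustration in Python, not the SkyServer implementation: the names radec_to_xyz and nearby_objects are hypothetical, and a simple bounding-cube test on the unit-vector coordinates stands in for the HTM-based coarse search.

```python
import math

def radec_to_xyz(ra_deg, dec_deg):
    """Convert equatorial coordinates (degrees) to a unit vector (x, y, z)."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def nearby_objects(objects, ra_deg, dec_deg, radius_arcmin):
    """Return IDs of objects within radius_arcmin of (ra_deg, dec_deg).

    objects is an iterable of (obj_id, cx, cy, cz) with unit-vector coordinates.
    """
    x, y, z = radec_to_xyz(ra_deg, dec_deg)
    radius = math.radians(radius_arcmin / 60.0)
    # Straight-line (chord) length on the unit sphere that subtends the radius.
    chord = 2.0 * math.sin(radius / 2.0)

    # Stage 1 (coarse): cheap bounding-cube test. ASSUMPTION: this is only a
    # stand-in for the HTM search, which limits the scope by triangle instead.
    candidates = [o for o in objects
                  if abs(o[1] - x) <= chord
                  and abs(o[2] - y) <= chord
                  and abs(o[3] - z) <= chord]

    # Stage 2 (exact): keep candidates whose chord to the center is short enough.
    hits = []
    for obj_id, cx, cy, cz in candidates:
        d2 = (cx - x) ** 2 + (cy - y) ** 2 + (cz - z) ** 2
        if d2 <= chord * chord:
            hits.append(obj_id)
    return hits
```

The coarse stage may admit false positives but never rejects a true neighbor, so the exact stage only has to examine a small candidate set.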
In regard to the HTM search task, HTM processes inscribe the celestial sphere within an octahedron and project each celestial point onto the surface of this octahedron. The projection is approximately iso-area. HTM partitions the sphere into the 8 faces of an octahedron. It then hierarchically decomposes each face with a recursive sequence of triangles, so that each level of the recursion divides each triangle into 4 sub-triangles, as shown in FIG. 1. An HTM index number is then assigned to each point on the sphere. Most spatial queries use the HTM index to limit searches to a small set of triangles. An HTM index is built as an extension of SQL Server's B-trees. SkyServer uses a 20-deep HTM so that the individual triangles are less than 0.1 arcseconds on a side. There are basic routines to convert between the equatorial coordinates (i.e., right ascension (ra), declination (dec)) and HTM coordinates. Importantly, all the points within a triangle, such as triangle 6,1,2,2, will have HTM IDs that lie between the ID of that triangle and the ID of the next (e.g., between 6,1,2,2 and 6,1,2,3). So, when the HTM IDs are mapped into a B-tree index they provide a quick index for all the objects within a given triangle. For example, when it is desired to know what objects are nearby a certain object, or it is desired to know all the objects in a certain area, the fGetNearestObjEq(1,1,1) function returns the nearest object within one arcminute of equatorial coordinate (1°, 1°).
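The range property described above can be illustrated with a short Python sketch. The encoding used here is an assumption made for illustration (a leading marker bit, three bits for the root face, then two bits per level of subdivision), and the function names are hypothetical. A sorted list searched with bisect stands in for the B-tree index.

```python
import bisect

DEPTH = 20  # assumed number of subdivision levels, matching the 20-deep HTM in the text

def htm_id(face, children):
    """Encode a triangle path such as face 6, children (1, 2, 2) as an integer.

    ASSUMED encoding: 8 + face for the root (face in 0..7), then two bits
    appended for each child digit (0..3).
    """
    tid = 8 + face
    for c in children:
        tid = (tid << 2) | c
    return tid

def id_range(face, children, depth=DEPTH):
    """Half-open range [lo, hi) of full-depth IDs that fall inside the triangle."""
    shift = 2 * (depth - len(children))
    tid = htm_id(face, children)
    # hi is the first full-depth ID of the next triangle at the same level.
    return tid << shift, (tid + 1) << shift

def objects_in_triangle(sorted_ids, face, children, depth=DEPTH):
    """Return the full-depth IDs from sorted_ids that lie inside the triangle."""
    lo, hi = id_range(face, children, depth)
    start = bisect.bisect_left(sorted_ids, lo)
    end = bisect.bisect_left(sorted_ids, hi)
    return sorted_ids[start:end]
```

Under this encoding, the range for triangle 6,1,2,2 runs from the first full-depth ID under 6,1,2,2 up to, but not including, the first full-depth ID under 6,1,2,3, so a single range scan of the sorted index retrieves every object in the triangle.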
In regard to the aforementioned filtering task, this process eliminates any objects identified in the HTM search that are determined to be outside the prescribed distance from the object or location under consideration. The actual distance θ in degrees between the object or location under consideration (which is known to be at point x,y,z) and an object identified in the search (i.e., object o with celestial coordinates o.cx, o.cy, o.cz) is computed using the following equation:

\[ \sin(\theta/2) = \frac{\left|\overrightarrow{o.xyz} - \overrightarrow{xyz}\right|}{2} \qquad (1) \]

as shown by the geometric relations depicted in FIG. 2. Thus,

\[ \theta = \mathrm{degrees}\left(2 \times \mathrm{asin}\left(\frac{\sqrt{(o.cx - x)^2 + (o.cy - y)^2 + (o.cz - z)^2}}{2}\right)\right) \qquad (2) \]
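As a minimal sketch of the distance computation in equation (2), the following Python function takes the two unit vectors and returns the angle between them in degrees. The function name is hypothetical, and the clamp before the arcsine is an added safeguard against floating-point round-off rather than part of the equation.

```python
import math

def angular_distance_deg(o_cx, o_cy, o_cz, x, y, z):
    """Angle in degrees between unit vectors (o_cx, o_cy, o_cz) and (x, y, z)."""
    # Length of the straight-line chord joining the two points on the unit sphere.
    chord = math.sqrt((o_cx - x) ** 2 + (o_cy - y) ** 2 + (o_cz - z) ** 2)
    # Half the chord is sin(theta / 2); clamp to 1.0 to guard against round-off.
    half = min(1.0, chord / 2.0)
    return math.degrees(2.0 * math.asin(half))
```

For example, the perpendicular unit vectors (1, 0, 0) and (0, 1, 0) are joined by a chord of length √2, which gives an angle of 90 degrees.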
Some queries need to compare each of several hundred million objects with all of its neighbors. Searches for gravitational lenses and for clusters are examples of such queries. To speed these queries, the SkyServer application precomputes a Neighbors Table, which for each object lists all its neighbors within 30 arcseconds along with summary attributes. This table averages about 9 neighbors per object, although some objects have hundreds of neighbors and some have none.
Computing the neighbors table using the fGetNearbyObjXyz( ) function can take a long time: on the fifteen million object SDSS early data release, the computation took 56 hours, or about 74 objects per second. Fortunately, the computation was done only a few times in the load process and then used many times in queries. But it was obvious that some speedup would be needed as the SDSS database grows twenty-fold over the next 3 years. Indeed, with the SDSS Data Release 1 (DR1), the database is about to grow six-fold, so the naive computation would take about 2 weeks on a dual 800 MHz Xeon processor with 2 GB of RAM and 12 disks with 150 MB/s bandwidth. Using that system, the full dataset would take about 2 months to compute. All subsequent measurements reported here are performed on an even slower computer: a 722 MHz Pentium III with 0.5 GB of RAM and 1 disk with 10 MB/s of bandwidth.
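The throughput and projected run times quoted above can be checked with a short calculation. It assumes, as a simplification not stated in the text, that run time scales linearly with the number of objects.

```latex
\begin{align*}
\text{rate} &= \frac{15 \times 10^{6}\ \text{objects}}{56\ \text{h} \times 3600\ \text{s/h}} \approx 74\ \text{objects/s} \\
t_{\text{DR1}} &\approx 6 \times 56\ \text{h} = 336\ \text{h} = 14\ \text{days} \\
t_{\text{full}} &\approx 20 \times 56\ \text{h} = 1120\ \text{h} \approx 47\ \text{days}
\end{align*}
```

The six-fold figure matches the 2 weeks cited for DR1, and the twenty-fold figure is of the same order as the roughly 2 months cited for the full dataset.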
The basic problem is that SQL can evaluate equation (2) at the rate of about 170,000 records per second (5.6 μs per row), while the HTM functions run at about 170 records per second (5.9 ms per row to return the nearest object). This is a 1,000:1 performance difference. The high cost of the HTM functions is a combination of the HTM procedures, the expensive linkage to SQL via external stored procedures (a string interface), and the use of table-valued functions. It appears, based on preliminary timing tests, that the HTM code uses about 3 ms and that the other costs (linkage, string conversion, and table-valued function) are in the range of 2 ms.
It is noted, however, that the foregoing computation is parallel and inherently CPU-bound. In other words, each object's neighbors can be computed independently. Therefore, the computation could be accelerated by using multiple processors. For example, a 7-node processor farm could do the 2-week DR1 job in 2 days. While this solution is viable, it is not very efficient in that it requires the use of multiple processors and does nothing to reduce the overall processing costs. It makes more sense to solve the problem using a better process on a single processor. This is a goal of the present invention, as will be described in the sections to follow.
The issues involved with finding nearby objects when planar coordinates are involved are identical, although typically a regional quadtree approach is taken to limit the scope of search where the quadtree IDs are mapped into a B-tree to provide a quick index for all the objects within a given block. This is as opposed to the HTM approach used for geometric data in spherical coordinates. In addition, the distance computations are somewhat simpler in the filtering stage for the planar data case.
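The planar counterpart can be sketched in the same way. The Python example below is an illustration under stated assumptions, not a description of any particular product: it assumes quadtree IDs formed by interleaving the bits of integer cell coordinates (a Z-order or Morton key), uses hypothetical function names, and again lets a sorted list searched with bisect stand in for the B-tree.

```python
import bisect

def morton_key(ix, iy, levels):
    """Interleave the bits of cell coordinates (0 <= ix, iy < 2**levels).

    Each level contributes two bits: the y bit above the x bit.
    """
    key = 0
    for level in range(levels - 1, -1, -1):
        key = (key << 2) | (((iy >> level) & 1) << 1) | ((ix >> level) & 1)
    return key

def block_range(bx, by, block_level, levels):
    """Half-open key range [lo, hi) covering quadtree block (bx, by).

    block_level is the depth of the block (0 is the whole plane); bx and by
    are the block's coordinates at that depth.
    """
    shift = 2 * (levels - block_level)
    prefix = morton_key(bx, by, block_level)
    return prefix << shift, (prefix + 1) << shift

def points_in_block(sorted_keys, bx, by, block_level, levels):
    """Return the keys from sorted_keys that fall inside the given block."""
    lo, hi = block_range(bx, by, block_level, levels)
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_left(sorted_keys, hi)
    return sorted_keys[start:end]
```

As with the HTM IDs, every cell inside a block shares the block's key prefix, so one contiguous range of the sorted index covers the whole block.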