Often it is useful to identify the locations of “hot spots” (or clusters) and “cold spots” (e.g. zones of sparseness or exclusion) in a spatial point pattern in a statistically correct and computationally efficient way. One example is the analysis of the spatial distribution of exocytosis events in cells. Other applications include monitoring patterns of human behavior, the epidemiology of disease distributions, assessing earthquake risk, military surveys, cell-based screens, etc. The ability to identify unusual patterns in datasets representing these applications has a large commercial value.
Conventional spatial analysis techniques do not offer the flexibility, practicality, and rigor required for certain applications. Spatial statistics methods (such as Ripley's K function, FIG. 1) are often vulnerable to bias due to edge effects, because areas near the edge are likely to have fewer points in their proximity than areas far from the edge. These deficiencies can be difficult to correct. Many methods also rely on specific assumptions that do not hold for all types of data, and many are quite complex to compute. FIG. 1 shows Ripley's K function used to analyze exocytosis events, for example, similar to those shown in the cell in FIG. 7. The K function counts the number of neighbors each data point has at a given spatial scale (the maximum distance between a point and a neighbor). Line 12 indicates the K function of the real data plotted against the spatial scale (also called the radius). When line 12 is within the envelope formed by lines 14 and 16, the distribution of the data is not significantly different from the distribution of homogeneous random points belonging to a spatial Poisson point process. When line 12 is above the envelope, the data are significantly clustered, and when line 12 is below the envelope, the data are significantly dispersed. Here, Ripley's K function is above the envelope at all scales, indicating that the points are significantly clustered at all scales, a conclusion that can be drawn simply by inspecting the exocytosis events. Moreover, Ripley's K function cannot identify the presence of large areas void of exocytosis events, nor can it identify the location of specific regions of interest (either hot spots or cold spots) within the data.
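The envelope test described above can be sketched in a few lines. The following is a minimal illustration only, assuming a unit-square study region, no edge correction, and a fixed number of uniformly placed points as a stand-in for the homogeneous Poisson process; the function names (`ripley_k`, `csr_envelope`, `classify`) and all parameter values are hypothetical and are not taken from FIG. 1.

```python
import math
import random


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate at one radius, with NO edge correction."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                pairs += 1
    # Ordered neighbor pairs within `radius`, scaled by area / (n * (n - 1)).
    return area * pairs / (n * (n - 1))


def csr_envelope(n, radius, n_sim=99, seed=0):
    """Min and max of K over simulated random patterns in the unit square.

    Assumption: each simulation places exactly n uniform points, i.e. the
    Poisson process is conditioned on the observed number of points.
    """
    rng = random.Random(seed)
    ks = []
    for _ in range(n_sim):
        sim = [(rng.random(), rng.random()) for _ in range(n)]
        ks.append(ripley_k(sim, radius, 1.0))
    return min(ks), max(ks)


def classify(points, radius, n_sim=99, seed=0):
    """Compare the observed K with the simulated envelope at one radius."""
    k_obs = ripley_k(points, radius, 1.0)
    lo, hi = csr_envelope(len(points), radius, n_sim, seed)
    if k_obs > hi:
        return "clustered"
    if k_obs < lo:
        return "dispersed"
    return "not significant"
```

Note that the result is a single verdict per radius for the whole pattern: nothing in this computation indicates where a cluster or a void is located, which is the limitation noted above.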
Simple heat map visualizations are commonly used to provide a more practical representation of spatial point data. Generally, areas of high and low density are displayed using different colors or intensities to provide a visually intuitive map of the data. However, a simple heat map provides no statistical basis for identifying truly high- or low-density areas in the data, and can therefore give misleading results. The highest-density areas may not be significantly dense when compared to the expected density, yet a heat map display would generally exaggerate the significance of such regions.
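The gap between "densest" and "significantly dense" can be illustrated with a small sketch. It assumes a unit-square region binned into a regular grid and a homogeneous Poisson null model for the cell counts; the helper names are hypothetical, and no correction for testing many cells at once is attempted.

```python
import math


def bin_counts(points, n_bins):
    """Count points per cell of an n_bins x n_bins grid over the unit square."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for x, y in points:
        col = min(int(x * n_bins), n_bins - 1)
        row = min(int(y * n_bins), n_bins - 1)
        counts[row][col] += 1
    return counts


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def densest_cell_p_value(points, n_bins):
    """Return the peak cell count and its upper-tail probability.

    Assumption: under the null model every cell has the same expected
    count, namely the total number of points divided by the number of cells.
    """
    counts = bin_counts(points, n_bins)
    expected = len(points) / n_bins**2
    peak = max(max(row) for row in counts)
    return peak, poisson_upper_tail(peak, expected)
```

A heat map would color the peak cell as the hottest region regardless of the returned probability; a cell holding three points where about two are expected would stand out visually while being entirely consistent with random placement.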
An exploratory method with better statistical grounding is the Geographical Analysis Machine (GAM), which is a grid-based method of identifying local clusters in spatial point data. The GAM method involves dividing the region into a regular grid, placing circles of various radii at the grid locations, counting the number of points within each of the circles, and identifying those circles with a significantly high number of points. The method is useful for identifying clusters, but it is a simple method suffering from several limitations. First, although Monte Carlo simulations can be used to identify clusters, only simple comparison methods utilizing rankings (e.g. percentile rankings) have been suggested to test for statistical significance. Most applications and derivatives of the GAM have used analytical methods to identify significant circles using theoretical models of the Poisson process and other simple point processes; however, many useful applications require testing against more complicated models that are not analytically tractable and require more complex simulation techniques than have been applied to the GAM. The GAM also produces many false positives in some cases because of the difficulty of correcting for multiple testing. Secondly, the GAM does not deal well with edge effects, suffering from much the same problems as Ripley's K function because of the GAM's reliance on theoretical models. Because of this, important features of data near the edge are often lost using the GAM. For the same reason, the GAM also has difficulty with regions that contain holes or other discontinuities. Thirdly, the method is designed only to identify clusters, not exclusion zones, holes, and other low-density areas in the data.
Fourthly, the GAM is intended to be used with geographic information systems (GIS) and geographic-scale data, making it difficult, if not impossible, to apply to data at other scales, such as biological data from individual cells or data describing the location of astronomical objects, such as stars and galaxies. Finally, the visualization methods used to display the results of the GAM are typically imprecise: generally, either all significant circles at all radii are drawn, producing a crowded display, or all circles of the same arbitrarily chosen size are drawn, producing a display that is only slightly less crowded.
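The basic GAM procedure described above (regular grid, circles of several radii, point counts, significance test) can be sketched as follows. This is an illustrative reading of that description rather than the published GAM implementation: it assumes a unit-square region, an analytical Poisson test with a uniform expected density, and hypothetical names and parameters (`gam_scan`, `grid_step`, `alpha`). It deliberately omits edge correction and multiple-testing correction, so it shares the limitations listed above.

```python
import math


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def gam_scan(points, grid_step, radii, alpha=0.01):
    """Return (cx, cy, r, count) for every circle with an unusually high count.

    Assumptions: the study region is the unit square, and the expected count
    in a circle is (points per unit area) * (circle area), even where the
    circle extends past the region boundary.
    """
    density = len(points)  # points per unit area in the unit square
    n_steps = int(round(1.0 / grid_step))
    hits = []
    for i in range(n_steps + 1):
        for j in range(n_steps + 1):
            cx, cy = i * grid_step, j * grid_step
            for r in radii:
                count = sum(
                    1 for x, y in points if math.hypot(x - cx, y - cy) <= r
                )
                expected = density * math.pi * r * r
                # Only the upper tail is tested: sparse circles are never flagged.
                if poisson_upper_tail(count, expected) < alpha:
                    hits.append((cx, cy, r, count))
    return hits
```

Because every grid location is tested at every radius, the number of tests grows quickly, and circles centered on the boundary are compared against an expected count computed for a full circle; both effects are visible directly in the loop structure.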
An evaluation of spatial analysis methods in the scientific literature reveals no existing method that offers the flexibility, practicality, and rigor required for handling large point sets. Importantly, computational power is now so great that the cost of stochastic, simulation-based studies is no longer necessarily a limiting factor.