How Optimized Hot Spot Analysis works

AllSource 1.3

Optimized Hot Spot Analysis executes the Hot Spot Analysis (Getis-Ord Gi*) tool using parameters derived from characteristics of your input data. Similar to the way that the automatic setting on a digital camera will use lighting and subject versus ground readings to determine an appropriate aperture, shutter speed, and focus, the Optimized Hot Spot Analysis tool interrogates your data to obtain the settings that will yield optimal hot spot results. If, for example, the Input Features dataset contains incident point data, the tool will aggregate the incidents into weighted features. Using the distribution of the weighted features, the tool will identify an appropriate scale of analysis. The statistical significance reported in the Output Features will be automatically adjusted for multiple testing and spatial dependence using the False Discovery Rate (FDR) correction method.

Each decision the tool makes to provide the best possible results is reported as a message during tool execution, and the calculations behind these decisions are explained below.

Just like your camera has a manual mode that allows you to override the automatic settings, the Hot Spot Analysis (Getis-Ord Gi*) tool allows you full control over all parameter options. Running the Optimized Hot Spot Analysis tool and noting the parameter settings it uses may help you refine the parameters you provide to the full control Hot Spot Analysis (Getis-Ord Gi*) tool.

The workflow for the Optimized Hot Spot Analysis tool includes the components described below. The calculations and algorithms used within each of these components are also provided.

Initial data assessment

In this component, the Input Features and the optional Analysis Field, Bounding Polygons Defining Where Incidents Are Possible, and Incident Data Aggregation Method are examined to ensure that there are sufficient features and adequate variation in the values to be analyzed. If the tool encounters records with corrupt or missing geometry, or if an Analysis Field is specified and null values are present, the associated records will be listed as bad records and excluded from analysis.

The Optimized Hot Spot Analysis tool uses the Getis-Ord Gi* (pronounced Gee Eye Star) statistic and, as with many statistical methods, the results are not reliable when there are fewer than 30 features. If you provide polygon Input Features, or point Input Features and an Analysis Field, you will need a minimum of 30 features to use this tool. The minimum number of Polygons For Aggregating Incidents Into Points is also 30. The feature layer representing Bounding Polygons Defining Where Incidents Are Possible may include one or more polygons.

The Gi* statistic also requires values to be associated with each feature it analyzes. When the Input Features you provide represent incident data (when you don't provide an Analysis Field), the tool will aggregate the incidents and the incident counts will serve as the values to be analyzed. After the aggregation process completes, there still must be a minimum of 30 features; so with incident data, you will want to start with more than 30 features. The table below documents the minimum number of features for each Incident Data Aggregation Method:

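Minimum Number of Incidents | Aggregation Method | Minimum Number of Features After Aggregation
60 | Count incidents within fishnet grid or Count incidents within hexagon grid, without Bounding Polygons Defining Where Incidents Are Possible | 30
30 | Count incidents within fishnet grid or Count incidents within hexagon grid, with Bounding Polygons Defining Where Incidents Are Possible | 30
30 | Count incidents within aggregation polygons | 30
60 | Snap nearby incidents to create weighted points | 30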

The Gi* statistic was also designed for an Analysis Field with a variety of different values. For example, the statistic is not appropriate for binary data. The Optimized Hot Spot Analysis tool will check the Analysis Field to confirm that the values have at least some variation.

Locational outliers are features that are much farther away from neighboring features than the majority of features in the dataset. Think of an urban environment with large, densely populated cities in the center, and smaller, less densely populated cities at the periphery. If you computed the average nearest neighbor distance for these cities, you would find that the result would be smaller if you excluded the peripheral locational outliers and focused only on the cities near the urban center. This is an example of how locational outliers can have a strong impact on spatial statistics, such as Average Nearest Neighbor. Since the Optimized Hot Spot Analysis tool uses the average and the median nearest neighbor calculations for aggregation and also to identify an appropriate scale of analysis, the Initial Data Assessment component of the tool will also identify any locational outliers in the Input Features or Polygons For Aggregating Incidents Into Points and will report the number it encounters. To do this, the tool computes each feature's average nearest neighbor distance and evaluates the distribution of all of these distances. Features that are more than a three standard deviation distance away from their closest noncoincident neighbor are considered locational outliers.
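The outlier rule described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: it assumes planar (x, y) coordinates rather than geodesic distances, and it reads "more than three standard deviations" as a nearest neighbor distance greater than the mean plus three standard deviations of all nearest neighbor distances. The function names are hypothetical.

```python
import math

def nearest_neighbor_distances(points):
    """Distance from each point to its closest noncoincident neighbor.

    Brute-force O(n^2) search; assumes planar coordinates (an assumption,
    the tool itself works with geodesic distances).
    """
    dists = []
    for i, (xi, yi) in enumerate(points):
        best = math.inf
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if 0 < d < best:  # d == 0 means a coincident point, so skip it
                best = d
        dists.append(best)
    return dists

def locational_outliers(points, n_std=3.0):
    """Indices of points whose nearest neighbor distance exceeds
    mean + n_std * standard deviation (assumed reading of the rule)."""
    dists = nearest_neighbor_distances(points)
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i for i, d in enumerate(dists) if d > mean + n_std * std]
```

For example, a 5 by 5 grid of points spaced one unit apart plus a single point 100 units away from the grid yields exactly one outlier, the distant point.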

Incident aggregation

For incident data, the next component in the workflow aggregates your data. There are three possible approaches based on the Incident Data Aggregation Method you select. The algorithms for each of these approaches are described below.

  • Count incidents within fishnet grid or Count incidents within hexagon grid:
    1. Collapse coincident points yielding a single point at each unique location in the dataset, using the same method employed by the Collect Events tool.
    2. Compare the density of the N Input Features to the density of N random features based on the minimum bounding polygon of the Input Features (in geodesic meters). The average nearest neighbor distance for a random set of N points in the given minimum bounding polygon is computed. If twice this average nearest neighbor distance for the random feature distribution is less than the max extent of the study area divided by 100, the dataset is considered dense and the grid Cell Size used is max extent divided by 100.
    3. If the dataset is not considered dense using the method above, the Cell Size is twice the larger of the average and the median nearest neighbor distances. The average nearest neighbor distance (ANN) for all of the unique location points, excluding locational outliers, is computed by summing the distance to each feature's nearest neighbor and dividing by the number of features (N). The median nearest neighbor distance (MNN) is computed by sorting the nearest neighbor distances from smallest to largest and selecting the distance that falls in the middle of the sorted list (also excluding locational outliers). The larger distance (ANN or MNN) is multiplied by 2 and used as the grid Cell Size.
    4. Construct a fishnet or hexagon polygon grid using the optimized Cell Size and overlay the grid with the incident points.
    5. Count the incidents in each polygon cell.
    6. When you provide Bounding Polygons Defining Where Incidents Are Possible, all polygon cells within the bounding polygons are retained. When you do not provide Bounding Polygons Defining Where Incidents Are Possible, polygon cells with zero incidents are removed.
    7. If the aggregation process results in fewer than 30 polygon cells or if the counts in all the polygon cells are identical, a message appears indicating the Input Features you provided are not appropriate for the Incident Data Aggregation Method selected; otherwise, the aggregation component for this method completes successfully.
  • Count incidents within aggregation polygons:
    1. For this Incident Data Aggregation Method, a Polygons For Aggregating Incidents Into Points feature layer is required. These aggregation polygons overlay the incident points.
    2. Count the incidents within each polygon.
    3. Ensure that there is sufficient variation in the incident counts for analysis. If the aggregation process results in all polygons having the same number of incidents, a message appears indicating the data is not appropriate for the Incident Data Aggregation Method you selected.
  • Snap nearby incidents to create weighted points:
    1. Collapse coincident points yielding a single point at each unique location in the dataset, using the same method employed by the Collect Events tool. Count the number of unique location features (UL).
    2. Compute both the average and the median nearest neighbor distances on all of the unique location points, excluding locational outliers. The average nearest neighbor distance (ANN) is computed by summing the distance to each feature's nearest neighbor and dividing by the number of features (N). The median nearest neighbor distance (MNN) is computed by sorting the nearest neighbor distances smallest to largest and selecting the distance that falls in the middle of the sorted list.
    3. Set the initial snap distance (SD) to the smaller of either ANN or MNN.
    4. Adjust the snap distance to account for coincident points. Scalar = (UL/N), where N is the number of features in the Input Features layer. The adjusted snap distance becomes SD * Scalar.
    5. Integrate the incident points in three iterations, first using the adjusted snap distance times 0.10, then using the adjusted snap distance times 0.25, and finally integrating with a snap distance equal to the fully adjusted snap distance. Performing the integrate step in three passes minimizes distortion of the original point locations.
    6. Collapse the snapped points yielding a single point at each location with a weight to indicate the number of incidents that were snapped together. This part of the aggregation process uses the Collect Events method.
    7. If the aggregation process results in fewer than 30 weighted points or if the counts for all of the points are identical, a message appears indicating the Input Features you provided are not appropriate for the Incident Data Aggregation Method selected; otherwise, the aggregation component for this method completes successfully.
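The distance arithmetic in the grid and snap approaches above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: the function and parameter names are hypothetical, the nearest neighbor distances are assumed to be precomputed with locational outliers already excluded, and the median of an even-length list is taken as the mean of the two middle values (the text only says "the distance that falls in the middle").

```python
import statistics

def grid_cell_size(nn_distances, random_ann, max_extent):
    """Cell size for the fishnet or hexagon grid.

    nn_distances: nearest neighbor distances of the unique locations,
                  locational outliers excluded.
    random_ann:   average nearest neighbor distance for N random points
                  in the minimum bounding polygon.
    max_extent:   maximum extent of the study area.
    """
    dense_cell = max_extent / 100.0
    if 2.0 * random_ann < dense_cell:
        return dense_cell  # dataset is considered dense
    ann = statistics.mean(nn_distances)
    mnn = statistics.median(nn_distances)
    return 2.0 * max(ann, mnn)

def snap_distances(nn_distances, unique_locations, n_features):
    """The three snap distances used for the successive integrate passes."""
    ann = statistics.mean(nn_distances)
    mnn = statistics.median(nn_distances)
    sd = min(ann, mnn)                      # initial snap distance
    scalar = unique_locations / n_features  # UL / N
    adjusted = sd * scalar
    return [adjusted * 0.10, adjusted * 0.25, adjusted]
```

With nearest neighbor distances of 1, 2, 3, and 10 (ANN = 4, MNN = 2.5), a sparse dataset gets a cell size of 8, while 80 unique locations out of 100 incidents give an adjusted snap distance of 2 and passes at 0.2, 0.5, and 2.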

Scale of analysis

This next component of the Optimized Hot Spot Analysis workflow is applied to weighted features, either because you provided Input Features with an Analysis Field or because the Incident Data Aggregation Method has created weights from incident counts. The next step is to identify an appropriate scale of analysis. The ideal scale of analysis is a distance that matches the scale of the question you are asking (if you are looking for hot spots of a disease outbreak and know that the mosquito vector has a range of 10 miles, for example, using a 10-mile distance would be most appropriate). When you can't justify any specific distance to use for your scale of analysis, there are strategies to help with this. The Optimized Hot Spot Analysis tool uses these strategies.

The first strategy tried is Incremental Spatial Autocorrelation. Whenever you see spatial clustering in the landscape, you are seeing evidence of underlying spatial processes at work. The Incremental Spatial Autocorrelation tool runs the Global Moran's I statistic for a series of increasing distances, measuring the intensity of spatial clustering for each distance. Locational outliers are excluded from the calculations of the beginning and increment distances used in Incremental Spatial Autocorrelation. The intensity of clustering is determined by the z-score returned. Typically, as the distance increases, so does the z-score, indicating intensification of clustering. At some particular distance, however, the z-score generally peaks. Peaks reflect distances where the spatial processes promoting clustering are most pronounced. The Optimized Hot Spot Analysis tool identifies peak distances using Incremental Spatial Autocorrelation. If a peak distance is found, this distance becomes the scale of analysis. If multiple peak distances are found, the first peak distance is selected.
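Selecting the first peak can be illustrated with a short sketch. It assumes the z-scores for each distance have already been computed, and it treats a peak as a simple local maximum (a z-score higher than both of its neighbors); the tool's exact peak criterion is not described here, so that definition and the function name are assumptions.

```python
def first_peak_distance(distances, z_scores):
    """Return the distance at the first local maximum of z_scores,
    or None when the z-scores never peak."""
    for i in range(1, len(z_scores) - 1):
        if z_scores[i] > z_scores[i - 1] and z_scores[i] > z_scores[i + 1]:
            return distances[i]
    return None
```

For distances of 100 through 500 with z-scores 1.2, 2.5, 2.1, 3.0, and 2.8, the sketch returns 200 even though the later peak at 400 is higher, because the first peak is the one selected.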

When no peak distance is found, Optimized Hot Spot Analysis examines the spatial distribution of the features and computes the average distance that would yield K neighbors for each feature. K is computed as 0.05 * N, where N is the number of features in the Input Features layer. K will be adjusted so it is never smaller than 3 or larger than 30. If the average distance that would yield K neighbors exceeds one standard distance, the scale of analysis will be set to one standard distance; otherwise, it will reflect the K neighbor average distance.
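The fallback rule can be sketched as below. The names are hypothetical, and because the text does not say how 0.05 * N is rounded, the sketch rounds down using integer arithmetic; treat that as an assumption. The average distance that yields K neighbors and the standard distance are taken as precomputed inputs.

```python
def neighbor_count_k(n_features):
    """K = 0.05 * N, kept within the range 3 to 30.

    Rounding down is an assumption; integer arithmetic avoids
    floating point surprises.
    """
    k = n_features * 5 // 100
    return max(3, min(30, k))

def fallback_scale(avg_k_neighbor_distance, standard_distance):
    """Scale of analysis when no peak distance is found: the K neighbor
    average distance, capped at one standard distance."""
    return min(avg_k_neighbor_distance, standard_distance)
```

For example, 40 features give K = 3 (0.05 * 40 = 2, raised to the minimum), 200 features give K = 10, and 1,000 features give K = 30.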

Incremental Spatial Autocorrelation can take a long time to finish for large, dense datasets. Consequently, when a feature with 500 or more neighbors is encountered, the incremental analysis is skipped, and the average distance that would yield 30 neighbors is computed and used for the scale of analysis.

Hot spot analysis

At this point in the Optimized Hot Spot Analysis workflow, all of the checks and parameter settings have been made. The next step is to run the Getis-Ord Gi* statistic. Details about the mathematics for this statistic are outlined in How Hot Spot Analysis (Getis-Ord Gi*) works. Results from the Gi* statistic will be automatically corrected for multiple testing and spatial dependence using the False Discovery Rate (FDR) correction method.
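To give a feel for what an FDR correction does, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. The text above names the FDR method but does not detail the procedure, so treating it as Benjamini-Hochberg with a single significance level alpha is an assumption, as is the function name.

```python
def fdr_significant(p_values, alpha=0.05):
    """Flag p-values that remain significant after a Benjamini-Hochberg
    False Discovery Rate correction (assumed procedure).

    Returns a list of booleans in the same order as p_values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at or below alpha * rank / n.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            cutoff_rank = rank
    # Every p-value ranked at or below the cutoff is significant.
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```

With p-values of 0.001, 0.04, 0.03, and 0.5 at alpha = 0.05, only the first remains significant: 0.03 and 0.04 would pass an uncorrected 0.05 threshold but fail their rank-adjusted thresholds of 0.025 and 0.0375.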

Output

The last component of the Optimized Hot Spot Analysis tool creates the Output Features. If the Input Features represent incident data requiring aggregation, the Output Features will reflect the aggregated weighted features (fishnet or hexagon polygon cells, the aggregation polygons you provided for the Polygons For Aggregating Incidents Into Points parameter, or weighted points). Each feature will have a z-score, a p-value, a Gi Bin result, and the number of neighbors the feature included in its calculations.

Additional resources

Getis, A. and J.K. Ord. 1992. "The Analysis of Spatial Association by Use of Distance Statistics" in Geographical Analysis 24(3).

Ord, J.K. and A. Getis. 1995. "Local Spatial Autocorrelation Statistics: Distributional Issues and an Application" in Geographical Analysis 27(4).

The spatial statistics resource page has short videos, tutorials, web seminars, articles, and a variety of other materials to help you get started with spatial statistics.