Density-based Clustering (Spatial Statistics)

Summary

Finds clusters of point features within surrounding noise based on their spatial distribution. Time can also be incorporated to find space-time clusters.

Learn more about how Density-based Clustering works

Illustration

Density-based Clustering tool illustration

Usage

  • This tool extracts clusters from the Input Point Features parameter value and identifies any surrounding noise.

  • There are three Clustering Method parameter options:

    • The Defined distance (DBSCAN) option finds clusters of points that are in close proximity based on a specified search distance.
    • The Self-adjusting (HDBSCAN) option finds clusters of points in a way similar to DBSCAN but uses varying distances, allowing for clusters with varying densities based on cluster probability (or stability).
    • The Multi-scale (OPTICS) option orders the input points based on the smallest distance to the next point. A reachability plot is then constructed, and clusters are obtained based on the fewest points to be considered a cluster, a search distance, and characteristics of the reachability plot (such as the slope and height of peaks).

  • This tool produces an output feature class with a new integer field, CLUSTER_ID, showing the cluster each point falls into. Default rendering is based on the COLOR_ID field. Colors are reused, so multiple clusters can share a color, but they are assigned so that each cluster is visually distinct from its neighboring clusters.

  • This tool also creates messages and charts that you can use to understand the characteristics of the identified clusters. To access the messages, hover over the progress bar, click the pop-out button, or expand the messages section in the Geoprocessing pane. You can also access the messages for a previous run of the Density-based Clustering tool in the geoprocessing history. You can access the charts in the Contents pane.

  • For more information about the output messages and charts and to learn more about the algorithms this tool uses, see How Density-based Clustering works.

  • If Self-adjusting (HDBSCAN) is chosen for the Clustering Method parameter, the output feature class will also contain the fields PROB, which is the probability the point belongs in its assigned group, OUTLIER, which designates that the point may be an outlier within its own cluster (a high value indicates the point is more likely to be an outlier), and EXEMPLAR, which denotes the points that are the most prototypical or most representative of each cluster.

  • If Multi-scale (OPTICS) is chosen for the Clustering Method parameter, the output feature class will also contain the fields REACHORDER, which is how the Input Point Features values were ordered for analysis, and REACHDIST, which is the distance between each point and its closest unvisited neighbor.

  • For the Clustering Method parameter's Defined distance (DBSCAN) and Multi-scale (OPTICS) options, the default Search Distance parameter value is the highest core distance found in the dataset, excluding those core distances in the top 1 percent (that is, excluding the most extreme core distances).

  • For the Clustering Method parameter's Defined distance (DBSCAN) and Multi-scale (OPTICS) options, the time of each point can be provided in the Time Field parameter. If provided, the tool will find clusters of points that are close to each other in space and time. The Search Time Interval parameter must be provided to determine whether a point is close enough in time to a cluster to be included in the cluster.

    • For the Defined distance (DBSCAN) option, when searching for cluster members, the number of points specified by the Minimum Features per Cluster parameter must be found within the Search Distance and Search Time Interval values to form a cluster.
    • For the Multi-scale (OPTICS) option, all points outside of the Search Time Interval value will be excluded when the point calculates its core distance, searches all neighbor distances within the specified Search Distance value, and calculates the reachability distance.

  • When the Time Field parameter value is provided, the output feature class will include a Time Span per Cluster chart displaying the time span of each space-time cluster. Four additional fields will also be included: Mean Time, Start Time, End Time, and Time Exaggeration. The output feature class is time-enabled, and it is recommended that you set the time to the Mean Time field, so the clusters can be visualized across time using the Time Slider. The spatiotemporal pattern can also be shown in a 3D scene by setting Time Exaggeration as the feature elevation.

  • The Search Time Interval parameter does not control the overall time span of the resulting space-time clusters. For example, using a search time interval of 3 days can result in a cluster with points spanning 10 days or more. This is because the search time interval is only used to determine whether a single point is included in a cluster. By forming clusters of multiple points, the overall time span of the cluster can be larger than the search time interval. This is analogous to how a spatial cluster can be larger than the Search Distance value, as long as each point has neighbors within the cluster that are closer than the search distance.

  • When the Input Point Features values are not projected (that is, when coordinates are in degrees, minutes, and seconds) or when the output coordinate system is set to a geographic coordinate system, distances are computed using chordal measurements. Chordal distance measurements are used because they can be computed quickly and provide good estimates of true geodesic distances, at least for points within about 30 degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to connect those two points. Chordal distances are reported in meters.

    Caution:

    It is a best practice to project the data, especially if the study area extends beyond 30 degrees. Chordal distances are not a good estimate of geodesic distances beyond 30 degrees.

  • This tool includes z-values in its calculations. If z-values are present, the result will be 3D.

  • This tool supports parallel processing and uses 50 percent of available processors by default. The number of processors can be increased or decreased using the Parallel Processing Factor environment.
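
The chordal distance described above for unprojected data can be illustrated with a short sketch. It is not the tool's implementation: it assumes the WGS84 spheroid constants, and the helper names to_ecef and chordal_distance are hypothetical. Each latitude-longitude pair is converted to three-dimensional earth-centered coordinates, and the chordal distance is the straight-line distance between the two positions, in meters.

```python
import math

# WGS84 spheroid constants (an assumption for this sketch; the tool works
# with the spheroid of the data's own coordinate system).
WGS84_A = 6378137.0                   # semi-major axis, meters
WGS84_F = 1 / 298.257223563           # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)    # first eccentricity squared


def to_ecef(lat_deg, lon_deg, height_m=0.0):
    """Convert geodetic latitude/longitude (degrees) to earth-centered x, y, z (meters)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Radius of curvature in the prime vertical at this latitude
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - WGS84_E2) + height_m) * math.sin(lat)
    return x, y, z


def chordal_distance(p, q):
    """Straight-line distance through the earth between two (lat, lon) points, in meters."""
    return math.dist(to_ecef(*p), to_ecef(*q))


# Compare the chord with the arc along the equator for points 30 degrees apart.
chord = chordal_distance((0.0, 0.0), (0.0, 30.0))
arc = WGS84_A * math.radians(30.0)
shortfall_pct = 100 * (1 - chord / arc)
print(f"chord = {chord:.0f} m, arc = {arc:.0f} m, shortfall = {shortfall_pct:.2f} %")
```

For two points on the equator 30 degrees apart, the chord is roughly 1 percent shorter than the arc along the surface. The shortfall grows quickly with separation (about 10 percent at 90 degrees), which is consistent with the caution above.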

Parameters

Label | Explanation | Data Type
Input Point Features

The point features for which density-based clustering will be performed.

Feature Layer
Output Features

The output feature class that will receive the cluster results.

Feature Class
Clustering Method

Specifies the method that will be used to define clusters.

  • Defined distance (DBSCAN): A specified distance will be used to separate dense clusters from sparser noise. DBSCAN is the fastest of the clustering methods but is only appropriate if there is a clear distance to use that works well to define all clusters that may be present. This results in clusters that have similar densities.
  • Self-adjusting (HDBSCAN): Varying distances will be used to separate clusters of varying densities from sparser noise. HDBSCAN is the most data-driven of the clustering methods and requires the least user input.
  • Multi-scale (OPTICS): The distance between neighbors and a reachability plot will be used to separate clusters of varying densities from noise. OPTICS offers the most flexibility in fine-tuning the clusters that are detected, though it is computationally intensive, particularly with a large search distance.
String
Minimum Features per Cluster

The minimum number of points that will be considered a cluster. Any cluster with fewer points than the number provided will be considered noise.

Long
Search Distance
(Optional)

The maximum distance that will be considered.

For the Clustering Method parameter's Defined distance (DBSCAN) option, the number of points specified by the Minimum Features per Cluster parameter must be found within this distance for cluster membership. Individual clusters will be separated by at least this distance. If a point is located farther than this distance from the next closest point in the cluster, it will not be included in the cluster.

For the Clustering Method parameter's Multi-scale (OPTICS) option, this parameter is optional and is used as the maximum search distance when creating the reachability plot. For OPTICS, the reachability plot, combined with the Cluster Sensitivity parameter value, determines cluster membership. If no distance is specified, the tool will search all distances, which will increase processing time.

If left blank, the default distance will be the highest core distance found in the dataset, excluding core distances in the top 1 percent (the most extreme core distances). If the Time Field parameter value is provided, no default is calculated and a search distance must be specified.

Linear Unit
Cluster Sensitivity

An integer between 0 and 100 that determines the compactness of clusters. A number close to 100 will result in a higher number of dense clusters. A number close to 0 will result in fewer, less compact clusters. If left blank, the tool will find a sensitivity value using the Kullback-Leibler divergence, choosing the value at which adding more clusters does not add information.

Long
Time Field

The field containing the time stamp for each record in the dataset. This field must be of type Date. If provided, the tool will find clusters of points that are close to each other in space and time. The Search Time Interval parameter value must be provided to determine whether a point is close enough in time to a cluster to be included in the cluster.

Field
Search Time Interval

The time interval that will be used to determine whether points form a space-time cluster. The search time interval spans before and after the time of each point; for example, an interval of 3 days around a point will include all points starting 3 days before and ending 3 days after the time of the point.

  • For the Clustering Method parameter's Defined distance (DBSCAN) option, the number of points specified by the Minimum Features per Cluster parameter must be found within the search distance and the search time interval for points to be included in a cluster.
  • For the Clustering Method parameter's Multi-scale (OPTICS) option, all points outside of the search time interval will be excluded when calculating core distances, neighbor-distances, and reachability distances.

The search time interval does not control the overall time span of the resulting space-time clusters. The time span of points within a cluster can be larger than the search time interval as long as each point has neighbors within the cluster that are within the search time interval.

Time Unit
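
The way the search distance and the search time interval work together for the Defined distance (DBSCAN) option can be sketched in plain Python. The following is a textbook DBSCAN with an added time condition, not the tool's implementation. The function name, the (x, y, t) tuples with t in days, the label -1 for noise, and counting a point as its own neighbor are all assumptions made for illustration.

```python
import math
from collections import deque

NOISE = -1  # label this sketch gives to points that belong to no cluster


def space_time_dbscan(points, min_features, search_distance, search_time_interval):
    """Label each (x, y, t) point with a cluster number (1, 2, ...) or NOISE."""

    def neighbors(i):
        # Points within the search distance AND within the search time
        # interval before or after point i (the point itself is included).
        xi, yi, ti = points[i]
        return [
            j for j, (xj, yj, tj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= search_distance
            and abs(ti - tj) <= search_time_interval
        ]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_features:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbors(j)
            if len(reachable) >= min_features:
                queue.extend(reachable)  # j is dense enough to grow the cluster
    return labels


# Six events at one location, 2 days apart, plus one distant event.
events = [(0.0, 0.0, float(day)) for day in range(0, 12, 2)]
events.append((5000.0, 5000.0, 4.0))

labels = space_time_dbscan(events, min_features=2,
                           search_distance=100.0, search_time_interval=3.0)
cluster_times = [t for (_, _, t), label in zip(events, labels) if label == 1]
time_span = max(cluster_times) - min(cluster_times)
print(labels, time_span)
```

In this example, each event is linked only to the events 2 days before and after it, yet the resulting cluster spans 10 days. This illustrates why the search time interval does not limit the overall time span of a space-time cluster.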

arcpy.stats.DensityBasedClustering(in_features, output_features, cluster_method, min_features_cluster, {search_distance}, {cluster_sensitivity}, {time_field}, {search_time_interval})
Name | Explanation | Data Type
in_features

The point features for which density-based clustering will be performed.

Feature Layer
output_features

The output feature class that will receive the cluster results.

Feature Class
cluster_method

Specifies the method that will be used to define clusters.

  • DBSCAN: A specified distance will be used to separate dense clusters from sparser noise. DBSCAN is the fastest of the clustering methods but is only appropriate if there is a clear distance to use that works well to define all clusters that may be present. This results in clusters that have similar densities.
  • HDBSCAN: Varying distances will be used to separate clusters of varying densities from sparser noise. HDBSCAN is the most data-driven of the clustering methods and requires the least user input.
  • OPTICS: The distance between neighbors and a reachability plot will be used to separate clusters of varying densities from noise. OPTICS offers the most flexibility in fine-tuning the clusters that are detected, though it is computationally intensive, particularly with a large search distance.
String
min_features_cluster

The minimum number of points that will be considered a cluster. Any cluster with fewer points than the number provided will be considered noise.

Long
search_distance
(Optional)

The maximum distance that will be considered.

For the cluster_method parameter's DBSCAN option, the number of points specified by the min_features_cluster parameter must be found within this distance for cluster membership. Individual clusters will be separated by at least this distance. If a point is located farther than this distance from the next closest point in the cluster, it will not be included in the cluster.

For the cluster_method parameter's OPTICS option, this parameter is optional and is used as the maximum search distance when creating the reachability plot. For OPTICS, the reachability plot, combined with the cluster_sensitivity parameter value, determines cluster membership. If no distance is specified, the tool will search all distances, which will increase processing time.

If left blank, the default distance will be the highest core distance found in the dataset, excluding core distances in the top 1 percent (the most extreme core distances). If the time_field parameter value is provided, no default is calculated and a search distance must be specified.

Linear Unit
cluster_sensitivity

An integer between 0 and 100 that determines the compactness of clusters. A number close to 100 will result in a higher number of dense clusters. A number close to 0 will result in fewer, less compact clusters. If left blank, the tool will find a sensitivity value using the Kullback-Leibler divergence, choosing the value at which adding more clusters does not add information.

Long
time_field

The field containing the time stamp for each record in the dataset. This field must be of type Date. If provided, the tool will find clusters of points that are close to each other in space and time. The search_time_interval parameter value must be provided to determine whether a point is close enough in time to a cluster to be included in the cluster.

Field
search_time_interval

The time interval that will be used to determine whether points form a space-time cluster. The search time interval spans before and after the time of each point; for example, an interval of 3 days around a point will include all points starting 3 days before and ending 3 days after the time of the point.

  • For the cluster_method parameter's DBSCAN option, the number of points specified by the min_features_cluster parameter must be found within the search distance and the search time interval for points to be included in a cluster.
  • For the cluster_method parameter's OPTICS option, all points outside of the search time interval will be excluded when calculating core distances, neighbor-distances, and reachability distances.

The search time interval does not control the overall time span of the resulting space-time clusters. The time span of points within a cluster can be larger than the search time interval as long as each point has neighbors within the cluster that are within the search time interval.

Time Unit
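
Before calling the function, it can help to check an argument combination against the rules stated in the parameter descriptions above. The helper below is hypothetical and is not part of arcpy. It encodes only rules given in this topic: time is supported by the DBSCAN and OPTICS methods only, a time field requires both a search distance and a search time interval, and cluster sensitivity is an integer between 0 and 100.

```python
def check_clustering_args(cluster_method, search_distance=None,
                          cluster_sensitivity=None, time_field=None,
                          search_time_interval=None):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []

    if cluster_method not in ("DBSCAN", "HDBSCAN", "OPTICS"):
        problems.append(f"unknown cluster_method: {cluster_method!r}")

    if time_field:
        # Space-time clustering is described only for DBSCAN and OPTICS.
        if cluster_method == "HDBSCAN":
            problems.append("time_field is only supported by DBSCAN and OPTICS")
        # With a time field, no default search distance is calculated.
        if not search_distance:
            problems.append("search_distance is required when time_field is provided")
        if not search_time_interval:
            problems.append("search_time_interval is required when time_field is provided")

    if cluster_sensitivity is not None:
        is_valid = isinstance(cluster_sensitivity, int) and 0 <= cluster_sensitivity <= 100
        if not is_valid:
            problems.append("cluster_sensitivity must be an integer between 0 and 100")

    return problems


# A consistent space-time combination reports no problems.
print(check_clustering_args("DBSCAN", "200 Meters", None, "Pickup_Time", "10 Minutes"))

# HDBSCAN with a time field and no distance or interval reports three problems.
for problem in check_clustering_args("HDBSCAN", time_field="Pickup_Time"):
    print(problem)
```

Returning a list of messages rather than raising on the first problem lets a script report every issue at once. The tool performs its own validation when it runs, so a check like this is only a convenience.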

Code sample

DensityBasedClustering example 1 (Python window)

The following Python window script demonstrates how to use the DensityBasedClustering function.

import arcpy
arcpy.env.workspace = r"C:\Analysis"
arcpy.DensityBasedClustering_stats("Chicago_Arson", "Arson_HDB", "HDBSCAN", 15)

DensityBasedClustering example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the DensityBasedClustering function.

# Clustering crime incidents in a downtown area using the DensityBasedClustering
# function

# Import system modules
import arcpy

# Overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"E:\working\data.gdb"
arcpy.env.workspace = workspace

# Run Density-based Clustering with the HDBSCAN Cluster Method using a minimum 
# of 15 features per cluster
arcpy.stats.DensityBasedClustering("Chicago_Arson", "Arson_HDB", "HDBSCAN", 15)

# Run Density-based Clustering again using OPTICS with a Search Distance and 
# Cluster Sensitivity to create tighter clusters
arcpy.stats.DensityBasedClustering("Chicago_Arson", "Arson_Optics", "OPTICS", 
                                   15, "1200 Meters", 70)
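
Example 2 supplies an explicit search distance for OPTICS. When no search distance is supplied and no time field is used, the default is the highest core distance in the dataset after the top 1 percent of core distances is excluded. A minimal sketch of that rule follows. It assumes a common definition of core distance (the distance from a point to its min_features-th nearest point, counting the point itself) and a simple rule for discarding the top 1 percent; the tool may compute both differently. The function names are hypothetical.

```python
import math


def core_distances(points, min_features):
    """Distance from each (x, y) point to its min_features-th nearest point.

    The point itself counts as the first nearest point (an assumption).
    """
    result = []
    for xi, yi in points:
        dists = sorted(math.hypot(xi - xj, yi - yj) for xj, yj in points)
        result.append(dists[min_features - 1])
    return result


def default_search_distance(points, min_features, trim_percent=1):
    """Highest core distance after dropping the largest trim_percent of values."""
    core = sorted(core_distances(points, min_features))
    n_trim = len(core) * trim_percent // 100   # number of extreme values to drop
    kept = core[:len(core) - n_trim]
    return kept[-1]


# 99 evenly spaced points plus one distant outlier (100 points in total).
sample = [(float(x), 0.0) for x in range(99)]
sample.append((1000.0, 0.0))

untrimmed_max = max(core_distances(sample, min_features=2))
trimmed_default = default_search_distance(sample, min_features=2)
print(untrimmed_max, trimmed_default)
```

In this sample, the single outlier has a core distance far larger than every other point. Excluding the top 1 percent keeps that one extreme value from setting the default search distance.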

DensityBasedClustering example 3 (stand-alone script)

The following stand-alone Python script demonstrates how to use the DensityBasedClustering function with time.

# The following stand-alone Python script demonstrates how to use 
# the DensityBasedClustering function with time to find space-time clusters.

# The time field and search time interval are only supported by the DBSCAN and OPTICS methods

# Import system modules
import arcpy

# Overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"E:\working\data.gdb"
arcpy.env.workspace = workspace

# Run Density-based Clustering with DBSCAN Cluster Method, and choose 50 as the minimum
# features per cluster, 200 meter search distance, and 10 minute search time interval
arcpy.stats.DensityBasedClustering("New_York_Taxi_PickingUp", "New_York_Taxi_DBSCAN_Time", 
                        "DBSCAN",  50, "200 Meters", None, "Pickup_Time", "10 Minutes")

# Run Density-based Clustering with OPTICS Method, and choose 50 as the minimum
# of features per cluster, 200 meter search distance, and 10 minute search time interval. 
# A low cluster sensitivity of 15 is used, which favors fewer, less compact clusters
arcpy.stats.DensityBasedClustering("New_York_Taxi_PickingUp", "New_York_Taxi_OPTICS_Time", 
                        "OPTICS",  50, "200 Meters", 15, "Pickup_Time", "10 Minutes")
