Grouping Analysis (Spatial Statistics)—ArcGIS AllSource

Summary

Groups features based on feature attributes and optional spatial or temporal constraints.

Legacy:

This is a deprecated tool. The algorithm behind this tool has been enhanced and new functionality has been added to these methods. To simplify the features and methods, this tool has been replaced by two tools. Use the Spatially Constrained Multivariate Clustering tool if you would like to create spatially constrained groups. Use the Multivariate Clustering tool to create groups with no spatial constraints.

Usage

Legacy:

The algorithm behind the Grouping Analysis tool has been enhanced and new functionality has been added to these methods at ArcGIS AllSource 2.1. To simplify the new features and methods, two new tools have been created to replace the Grouping Analysis tool. Use the Spatially Constrained Multivariate Clustering tool if you would like to create spatially contiguous groups. Use the Multivariate Clustering tool to create groups with no spatial constraints.

This tool produces an output feature class with the fields used in the analysis plus a new integer field named SS_GROUP. Default rendering is based on the SS_GROUP field and shows you which group each feature falls into. If you indicate that you want three groups, for example, each record will contain a 1, 2, or 3 for the SS_GROUP field. The SS_SEED field indicates which features were used as starting points to grow groups. The number of nonzero values in the SS_SEED field will match the value you entered for the Number of Groups parameter.

Note:

Creating the report file can add substantial processing time. Consequently, while Grouping Analysis will create the Output Feature Class showing group membership, the PDF report file will not be created if you specify more than 15 groups or more than 15 variables.

The Unique ID Field provides a way for you to link records in the Output Feature Class back to data in the original input feature class. Consequently, the Unique ID Field values must be unique for every feature and typically should be a permanent field that remains with the feature class. If you don't have a Unique ID Field in your dataset, you can easily create one by adding a new integer field to your feature class table and calculating the field values to be equal to the FID/OID field. You cannot use the FID/OID field directly for the Unique ID Field parameter.
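The field creation described above could be scripted roughly as follows. This is a minimal sketch, not part of the tool: the field name `UniqueID` and both helper names are hypothetical, and `arcpy` is imported inside the function so that the expression helper can be used without an ArcGIS installation.

```python
def oid_expression(oid_field_name):
    """Build a Python-parser Calculate Field expression that copies the FID/OID value."""
    return "!{}!".format(oid_field_name)


def add_unique_id_field(feature_class, id_field="UniqueID"):
    """Add an integer field and fill it with each feature's FID/OID value.

    `id_field` is a hypothetical field name chosen for this illustration.
    """
    import arcpy  # requires an ArcGIS installation

    # Look up the name of the FID/OID field (for example, FID or OBJECTID).
    oid_field = arcpy.Describe(feature_class).OIDFieldName

    # Add the new integer field, then copy the FID/OID values into it.
    arcpy.management.AddField(feature_class, id_field, "LONG")
    arcpy.management.CalculateField(feature_class, id_field,
                                    oid_expression(oid_field), "PYTHON3")
    return id_field
```

The returned field name can then be passed as the Unique ID Field.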
The Analysis Fields should be numeric and should contain a variety of values. Fields with no variation (that is, the same value for every record) will be dropped from the analysis but will be included in the Output Feature Class. Categorical fields may be used with the Grouping Analysis tool if they are represented as dummy variables (a value of one for all features in a category and zeros for all other features).
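As an illustration of the dummy variable encoding mentioned above, the following sketch turns one categorical column into a set of 0/1 columns. The function name and the land-use values are made up for the example; in practice each resulting list would be written to its own numeric field.

```python
def dummy_variables(categories):
    """Return one 0/1 list per distinct category value.

    Each list holds 1 for features in that category and 0 for all others.
    """
    distinct = sorted(set(categories))
    return {
        value: [1 if c == value else 0 for c in categories]
        for value in distinct
    }


# Hypothetical categorical values, one per feature.
land_use = ["residential", "commercial", "residential", "industrial"]
for value, flags in dummy_variables(land_use).items():
    print(value, flags)
```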
Additional Spatial Constraints, such as fixed distance, may be imposed by using the Generate Spatial Weights Matrix tool to first create an SWM file and then providing the path to that file for the Spatial Weights Matrix File parameter.
Note:
Defining a spatial constraint ensures compact, contiguous, or proximal groups. Including spatial variables in your list of Analysis Fields can also encourage these group attributes. Examples of spatial variables would be distance to freeway on-ramps, accessibility to job openings, proximity to shopping opportunities, measures of connectivity, and even coordinates (X, Y). Including variables representing time, day of the week, or temporal distance can encourage temporal compactness among group members.
When there is a distinct spatial pattern to your features (an example would be three separate, spatially distinct clusters), it can complicate the spatially constrained grouping algorithm. Consequently, the grouping algorithm first determines if there are any disconnected groups. If the number of disconnected groups is larger than the Number of Groups specified, the tool cannot solve and will fail with an appropriate error message. If the number of disconnected groups is exactly the same as the Number of Groups specified, the spatial configuration of the features alone determines group results, as shown in (A) below. If the Number of Groups specified is larger than the number of disconnected groups, grouping begins with the disconnected groups already determined. For example, if there are three disconnected groups and the Number of Groups specified is 4, one of the three groups will be divided to create a fourth group, as shown in (B) below.
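The three cases above can be sketched with a simple connected-components count. This is only an illustration of the logic, not the tool's implementation; the neighbor dictionary, function names, and returned messages are assumptions, and neighbor relationships are assumed to be symmetric.

```python
def count_disconnected_groups(neighbors):
    """Count connected components in a neighbor graph.

    `neighbors` maps each feature ID to the IDs of its neighbors and is
    assumed to be symmetric (if A lists B, B lists A).
    """
    unvisited = set(neighbors)
    components = 0
    while unvisited:
        components += 1
        stack = [unvisited.pop()]
        # Walk everything reachable from this starting feature.
        while stack:
            current = stack.pop()
            for other in neighbors[current]:
                if other in unvisited:
                    unvisited.remove(other)
                    stack.append(other)
    return components


def describe_case(neighbors, number_of_groups):
    """Classify a request against the number of disconnected groups."""
    disconnected = count_disconnected_groups(neighbors)
    if disconnected > number_of_groups:
        return "cannot solve"
    if disconnected == number_of_groups:
        return "spatial configuration alone determines groups"
    return "disconnected groups are subdivided"
```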
While there is a tendency to want to include as many Analysis Fields as possible, for this tool, it works best to start with a single variable and build. Results are much easier to interpret with fewer analysis fields. It is also easier to determine which variables are the best discriminators when there are fewer fields.

Note:

When using random seeds, you may wish to choose a seed to initiate the random number generator through the Random Number Generator Environment setting. However, the Random Number Generator used by this tool is always Mersenne Twister.

Any values of 1 in the Initialization Field will be interpreted as a seed. If there are more seed features than Number of Groups, the seed features will be randomly selected from those identified by the Initialization Field. If there are fewer seed features than specified by Number of Groups, the additional seed features will be selected so they are far away (in data space) from those identified by the Initialization Field.
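The seed rules above can be illustrated with the sketch below. The tool's actual selection method is not documented here, so the farthest-point heuristic used to add missing seeds is an assumption, as are the function name and the data layout (a dictionary of feature ID to analysis values).

```python
import math
import random


def choose_seeds(features, flagged_ids, number_of_groups, rng=None):
    """Pick seed feature IDs following the rules described above.

    features: dict of feature ID -> tuple of analysis field values.
    flagged_ids: IDs whose Initialization Field value is 1.
    """
    rng = rng or random.Random()
    seeds = list(flagged_ids)

    # More flagged features than groups: select randomly among them.
    if len(seeds) > number_of_groups:
        return rng.sample(seeds, number_of_groups)

    # Fewer flagged features than groups: add features that are far away
    # (in data space) from the seeds chosen so far. Assumed heuristic.
    while len(seeds) < number_of_groups:
        candidates = [fid for fid in features if fid not in seeds]
        if not candidates:
            break
        if not seeds:
            seeds.append(rng.choice(candidates))
            continue
        farthest = max(
            candidates,
            key=lambda fid: min(math.dist(features[fid], features[s])
                                for s in seeds),
        )
        seeds.append(farthest)
    return seeds
```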
Sometimes you know the Number of Groups most appropriate for your data. In the case that you don't, however, you may have to try different numbers of groups, noting which values provide the best group differentiation. When you check the Evaluate Optimal Number of Groups parameter, a pseudo F-Statistic will be computed for grouping solutions with 2 through 15 groups. If no other criteria guide your choice for Number of Groups, use a number associated with one of the largest pseudo F-Statistic values. The largest F-Statistic values indicate solutions that perform best at maximizing both within-group similarities and between-group differences. When you specify an optional Output Report File, that PDF report will include a graph showing the F-Statistic values for solutions with 2 through 15 groups.
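To make the pseudo F-Statistic more concrete, here is a minimal sketch that assumes the Calinski-Harabasz form, F = (R² / (nc − 1)) / ((1 − R²) / (n − nc)) with R² = (SST − SSE) / SST, where n is the number of features and nc the number of groups. It skips any standardization of the analysis fields that the tool may apply, so the values are illustrative rather than a reproduction of the tool's output. It requires at least two groups and some within-group variation.

```python
def pseudo_f_statistic(values, groups):
    """Compute a Calinski-Harabasz style pseudo F-statistic.

    values: list of tuples, one tuple of analysis values per feature.
    groups: list of group labels, in the same order as `values`.
    """
    n = len(values)
    labels = set(groups)
    nc = len(labels)
    n_vars = len(values[0])

    def mean_vector(rows):
        return [sum(r[k] for r in rows) / len(rows) for k in range(n_vars)]

    def sum_sq(rows, center):
        return sum((r[k] - center[k]) ** 2
                   for r in rows for k in range(n_vars))

    # Total sum of squares about the global mean.
    sst = sum_sq(values, mean_vector(values))

    # Within-group sum of squares about each group's own mean.
    sse = 0.0
    for label in labels:
        members = [v for v, g in zip(values, groups) if g == label]
        sse += sum_sq(members, mean_vector(members))

    r2 = (sst - sse) / sst
    return (r2 / (nc - 1)) / ((1 - r2) / (n - nc))
```

For the same features, a grouping that separates low from high values yields a much larger statistic than one that mixes them, which is why larger values indicate better group differentiation.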
Regardless of the Number of Groups you specify, the tool will stop if division into additional groups becomes arbitrary. Suppose, for example, that your data consists of three spatially clustered polygons and a single analysis field. If all the features in a cluster have the same analysis field value, it becomes arbitrary how any one of the individual clusters is divided after three groups have been created. If you specify more than three groups in this situation, the tool will still only create three groups. As long as at least one of the analysis fields in a group has some variation of values, division into additional groups can continue.
Groups will not be divided further if there is no variation in the analysis field values.
When you include a spatial or space-time constraint in your analysis, the pseudo F-Statistics are comparable (as long as the Input Features and Analysis Fields don't change). Consequently, you can use the F-Statistic values to determine not only optimal Number of Groups but also to help you make choices about the most effective Spatial Constraints option, Distance Method, and Number of Neighbors.
The Grouping Analysis tool returns three derived output values for potential use in custom models and scripts. These are the pseudo F-Statistic for the Number of Groups (Output_FStat), the largest pseudo F-Statistic for groups 2 through 15 (Max_FStat), and the number of groups associated with the largest pseudo F-Statistic value (Max_FStat_Group). When you do not elect to Evaluate Optimal Number of Groups, all of the derived output variables are set to None.
The group number assigned to a set of features may change from one run to the next. For example, suppose you partition features into two groups based on an income variable. The first time you run the analysis you might see the high income features labeled as group 2 and the low income features labeled as group 1; the second time you run the same analysis, the high income features might be labeled as group 1.
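Because only the numbering changes, two runs can be compared by checking whether their group labels map one-to-one onto each other. The helper below is a sketch with a hypothetical name; it assumes both label lists are in the same feature order (for example, sorted by the Unique ID Field).

```python
def same_grouping(labels_a, labels_b):
    """Return True if two runs form the same groups, ignoring group numbers."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b = {}
    b_to_a = {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must always pair with the same label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Same partition, with group numbers 1 and 2 swapped between runs.
print(same_grouping([1, 1, 2, 2], [2, 2, 1, 1]))
```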
While you can select to create a very large number of different groups, in most scenarios you will likely be partitioning features into just a few groups. Because the graphs and maps become difficult to interpret with lots of groups, no report is created when you enter a value larger than 15 for the Number of Groups parameter or select more than 15 Analysis Fields. You can increase this limitation on the maximum number of groups, however.
Dive-in:
Because you have the Python source code for the Grouping Analysis tool, you may override the 15 variables or 15 groups report limitation, if desired. This upper limit is set by two variables in both the Partition.py script file and the tool's validation code inside the Spatial Statistics Toolbox:
```
maxNumGroups = 15
maxNumVars = 15
```
For more information about the Output Report File, see Learn more about how Grouping Analysis works.

Parameters

Label	Explanation	Data Type
Input Features	The feature class or feature layer for which you want to create groups.	Feature Layer
Unique ID Field	An integer field containing a different value for every feature in the input feature class. If you don't have a Unique ID field, you can create one by adding an integer field to your feature class table and calculating the field values to equal the FID or OBJECTID field.	Field
Output Feature Class	The new output feature class created containing all features, the analysis fields specified, and a field indicating to which group each feature belongs.	Feature Class
Number of Groups	The number of groups to create. The Output Report parameter will be disabled for more than 15 groups.	Long
Analysis Fields	A list of fields you want to use to distinguish one group from another. The Output Report parameter will be disabled for more than 15 fields.	Field
Spatial Constraints	Specifies if and how spatial relationships among features should constrain the groups created. CONTIGUITY_EDGES_ONLY—Groups contain contiguous polygon features. Only polygons that share an edge can be part of the same group. CONTIGUITY_EDGES_CORNERS—Groups contain contiguous polygon features. Only polygons that share an edge or a vertex can be part of the same group. DELAUNAY_TRIANGULATION—Features in the same group will have at least one natural neighbor in common with another feature in the group. Natural neighbor relationships are based on Delaunay Triangulation. Conceptually, Delaunay Triangulation creates a nonoverlapping mesh of triangles from feature centroids. Each feature is a triangle node and nodes that share edges are considered neighbors. K_NEAREST_NEIGHBORS—Features in the same group will be near each other; each feature will be a neighbor of at least one other feature in the group. Neighbor relationships are based on the nearest K features, where you specify an Integer value, K, for the Number of Neighbors parameter. GET_SPATIAL_WEIGHTS_FROM_FILE—Spatial, and optionally temporal, relationships are defined by a spatial weights file (.swm). Create the spatial weights matrix file using the Generate Spatial Weights Matrix tool or the Generate Network Spatial Weights tool. NO_SPATIAL_CONSTRAINT—Features will be grouped using data space proximity only. Features do not have to be near each other in space or time to be part of the same group.	String
Distance Method (Optional)	Specifies how distances are calculated from each feature to neighboring features. EUCLIDEAN—The straight-line distance between two points (as the crow flies) MANHATTAN—The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates	String
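Number of Neighbors (Optional)	This parameter may be specified whenever the Spatial Constraints parameter is K_NEAREST_NEIGHBORS or one of the contiguity methods (CONTIGUITY_EDGES_ONLY or CONTIGUITY_EDGES_CORNERS). For K_NEAREST_NEIGHBORS, the default number of neighbors is 8 and the value cannot be smaller than 2. This value reflects the exact number of nearest neighbor candidates to consider when building groups. A feature will not be included in a group unless one of the other features in that group is a K nearest neighbor. The default for CONTIGUITY_EDGES_ONLY and CONTIGUITY_EDGES_CORNERS is 0. For the contiguity methods, this value reflects the minimum number of neighbor candidates to consider. Additional nearby neighbors for features with fewer than the Number of Neighbors specified will be based on feature centroid proximity.	Long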
Weights Matrix File (Optional)	The path to a file containing spatial weights that define spatial relationships among features.	File
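Initialization Method (Optional)	Specifies how initial seeds are obtained when the Spatial Constraints parameter is NO_SPATIAL_CONSTRAINT. Seeds are used to grow groups. If you indicate you want three groups, for example, the analysis will begin with three seeds. FIND_SEED_LOCATIONS—Seed features will be selected to optimize performance. GET_SEEDS_FROM_FIELD—Nonzero entries in the Initialization Field will be used as starting points to grow groups. USE_RANDOM_SEEDS—Initial seed features will be randomly selected.	String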
Initialization Field (Optional)	The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow groups.	Field
Output Report File (Optional)	The full path for the PDF report file to be created summarizing group characteristics. This report provides a number of graphs to help you compare the characteristics of each group. Creating the report file can add substantial processing time.	File
Evaluate Optimal Number of Groups (Optional)	Specifies whether the tool will assess the optimal number of groups, 2 through 15. Checked—Groupings from 2 to 15 will be evaluated. Unchecked—No evaluation of the number of groups will be performed. This is the default.	Boolean

Derived Output

Label	Explanation	Data Type
F-Statistic	The output pseudo F-Statistic value.	Double
Maximum F-Statistic Group	The number of groups associated with the largest pseudo F-Statistic value.	Long
Maximum F-Statistic	The largest pseudo F-Statistic for groups 2 through 15.	Double

arcpy.stats.GroupingAnalysis(Input_Features, Unique_ID_Field, Output_Feature_Class, Number_of_Groups, Analysis_Fields, Spatial_Constraints, {Distance_Method}, {Number_of_Neighbors}, {Weights_Matrix_File}, {Initialization_Method}, {Initialization_Field}, {Output_Report_File}, {Evaluate_Optimal_Number_of_Groups})

Name	Explanation	Data Type
Input_Features	The feature class or feature layer for which you want to create groups.	Feature Layer
Unique_ID_Field	An integer field containing a different value for every feature in the input feature class. If you don't have a Unique ID field, you can create one by adding an integer field to your feature class table and calculating the field values to equal the FID or OBJECTID field.	Field
Output_Feature_Class	The new output feature class created containing all features, the analysis fields specified, and a field indicating to which group each feature belongs.	Feature Class
Number_of_Groups	The number of groups to create. The Output Report parameter will be disabled for more than 15 groups.	Long
Analysis_Fields [analysis_field,...]	A list of fields you want to use to distinguish one group from another. The Output Report parameter will be disabled for more than 15 fields.	Field
Spatial_Constraints	Specifies if and how spatial relationships among features should constrain the groups created. CONTIGUITY_EDGES_ONLY—Groups contain contiguous polygon features. Only polygons that share an edge can be part of the same group. CONTIGUITY_EDGES_CORNERS—Groups contain contiguous polygon features. Only polygons that share an edge or a vertex can be part of the same group. DELAUNAY_TRIANGULATION—Features in the same group will have at least one natural neighbor in common with another feature in the group. Natural neighbor relationships are based on Delaunay Triangulation. Conceptually, Delaunay Triangulation creates a nonoverlapping mesh of triangles from feature centroids. Each feature is a triangle node and nodes that share edges are considered neighbors. K_NEAREST_NEIGHBORS—Features in the same group will be near each other; each feature will be a neighbor of at least one other feature in the group. Neighbor relationships are based on the nearest K features where you specify an Integer value, K, for the Number_of_Neighbors parameter. GET_SPATIAL_WEIGHTS_FROM_FILE—Spatial, and optionally temporal, relationships are defined by a spatial weights file (.swm). Create the spatial weights matrix file using the Generate Spatial Weights Matrix tool or the Generate Network Spatial Weights tool. NO_SPATIAL_CONSTRAINT—Features will be grouped using data space proximity only. Features do not have to be near each other in space or time to be part of the same group.	String
Distance_Method (Optional)	Specifies how distances are calculated from each feature to neighboring features. EUCLIDEAN—The straight-line distance between two points (as the crow flies) MANHATTAN—The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates	String
Number_of_Neighbors (Optional)	This parameter may be specified whenever the Spatial_Constraints parameter is K_NEAREST_NEIGHBORS or one of the contiguity methods (CONTIGUITY_EDGES_ONLY or CONTIGUITY_EDGES_CORNERS). For K_NEAREST_NEIGHBORS, the default number of neighbors is 8 and the value cannot be smaller than 2. This value reflects the exact number of nearest neighbor candidates to consider when building groups. A feature will not be included in a group unless one of the other features in that group is a K nearest neighbor. The default for CONTIGUITY_EDGES_ONLY and CONTIGUITY_EDGES_CORNERS is 0. For the contiguity methods, this value reflects the minimum number of neighbor candidates to consider. Additional nearby neighbors for features with fewer than the Number_of_Neighbors specified will be based on feature centroid proximity.	Long
Weights_Matrix_File (Optional)	The path to a file containing spatial weights that define spatial relationships among features.	File
Initialization_Method (Optional)	Specifies how initial seeds are obtained when the Spatial_Constraints parameter is NO_SPATIAL_CONSTRAINT. Seeds are used to grow groups. If you indicate you want three groups, for example, the analysis will begin with three seeds. FIND_SEED_LOCATIONS—Seed features will be selected to optimize performance. GET_SEEDS_FROM_FIELD—Nonzero entries in the Initialization Field will be used as starting points to grow groups. USE_RANDOM_SEEDS—Initial seed features will be randomly selected.	String
Initialization_Field (Optional)	The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow groups.	Field
Output_Report_File (Optional)	The full path for the PDF report file to be created summarizing group characteristics. This report provides a number of graphs to help you compare the characteristics of each group. Creating the report file can add substantial processing time.	File
Evaluate_Optimal_Number_of_Groups (Optional)	Specifies whether the tool will assess the optimal number of groups, 2 through 15. EVALUATE—Groupings from 2 to 15 will be evaluated. DO_NOT_EVALUATE—No evaluation of the number of groups will be performed. This is the default.	Boolean

Derived Output

Name	Explanation	Data Type
Output_FStat	The output pseudo F-Statistic value.	Double
Max_FStat_Group	The number of groups associated with the largest pseudo F-Statistic value.	Long
Max_FStat	The largest pseudo F-Statistic for groups 2 through 15.	Double

Code sample

GroupingAnalysis example 1 (Python window)

The following Python window script demonstrates how to use the GroupingAnalysis function.

```
import arcpy
arcpy.env.workspace = r"C:\GA"
arcpy.stats.GroupingAnalysis("Dist_Vandalism.shp", "TARGET_FID", "outGSF.shp",
                             "4", "Join_Count;TOTPOP_CY;VACANT_CY;UNEMP_CY",
                             "NO_SPATIAL_CONSTRAINT", "EUCLIDEAN", "", "",
                             "FIND_SEED_LOCATIONS", "", "outGSF.pdf",
                             "DO_NOT_EVALUATE")
```

GroupingAnalysis example 2 (stand-alone script)

The following stand-alone Python script demonstrates how to use the GroupingAnalysis function.


```
# Grouping Analysis of vandalism data in a metropolitan area
# using the Grouping Analysis tool

# Import system modules
import arcpy

# Overwrite existing output by default
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\GA"

    # Join the vandalism features to the reporting district features
    # Process: Spatial Join
    fieldMappings = arcpy.FieldMappings()
    fieldMappings.addTable("ReportingDistricts.shp")
    fieldMappings.addTable("Vandalism2006.shp")

    sj = arcpy.SpatialJoin_analysis("ReportingDistricts.shp",
                                    "Vandalism2006.shp", "Dist_Vand.shp",
                                    "JOIN_ONE_TO_ONE", "KEEP_ALL",
                                    fieldMappings, "COMPLETELY_CONTAINS")

    # Use the Grouping Analysis tool to create groups based on different
    # variables or analysis fields
    # Process: Grouping Analysis
    ga = arcpy.stats.GroupingAnalysis("Dist_Vand.shp", "TARGET_FID",
                                      "outGSF.shp", "4",
                                      "Join_Count;TOTPOP_CY;VACANT_CY;UNEMP_CY",
                                      "NO_SPATIAL_CONSTRAINT", "EUCLIDEAN", "",
                                      "", "FIND_SEED_LOCATIONS", "",
                                      "outGSF.pdf", "DO_NOT_EVALUATE")

    # Use the Summary Statistics tool to get the mean of each grouping
    # variable, summarized by the SS_GROUP field that Grouping Analysis adds
    # Process: Summary Statistics
    SumStat = arcpy.Statistics_analysis("outGSF.shp", "outSS",
                                        [["Join_Count", "MEAN"],
                                         ["VACANT_CY", "MEAN"],
                                         ["TOTPOP_CY", "MEAN"],
                                         ["UNEMP_CY", "MEAN"]],
                                        "SS_GROUP")

except arcpy.ExecuteError:
    # If an error occurred when running a tool, print out the tool messages.
    print(arcpy.GetMessages())
```

Environments

Output Coordinate System, Geographic Transformations, Current Workspace, Scratch Workspace, Qualified Field Names, Output has M values, M Resolution, M Tolerance, Output has Z values, Default Output Z Value, Z Resolution, Z Tolerance, XY Resolution, XY Tolerance, Random number generator

Special cases

Output Coordinate System: Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds, geodesic distances are estimated using chordal distances.
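
As an illustration of a chordal distance, the sketch below measures the straight line through the Earth between two longitude/latitude points by converting them to 3D Cartesian coordinates. It assumes a sphere with a mean radius of 6,371,000 meters; the tool's exact computation is not specified here, so treat the function and the radius as assumptions.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius, not taken from the tool


def chordal_distance(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_M):
    """Straight-line (chord) distance in meters between two points on a sphere.

    Coordinates are in decimal degrees.
    """
    def to_xyz(lon, lat):
        lam, phi = math.radians(lon), math.radians(lat)
        return (radius * math.cos(phi) * math.cos(lam),
                radius * math.cos(phi) * math.sin(lam),
                radius * math.sin(phi))

    return math.dist(to_xyz(lon1, lat1), to_xyz(lon2, lat2))
```

For nearby features the chord and the geodesic are nearly identical; the chord increasingly underestimates the geodesic as the separation grows.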

Random number generator: The Random Generator Type used is always Mersenne Twister.
