Train Random Trees Regression Model (Image Analyst)

Available with Image Analyst license.

Summary

Models the relationship between explanatory variables (independent variables) and a target dataset (dependent variable).

Usage

  • The tool can be used to train with a variety of data types. The input rasters (explanatory variables) can be one raster or a list of rasters, a single-band raster, a multiband raster in which each band is an explanatory variable, a multidimensional raster in which the variables in the raster are the explanatory variables, or a combination of data types.

  • An input mosaic dataset will be treated as a raster dataset (not a collection of rasters). To use a collection of rasters as input, build multidimensional info for the mosaic dataset and use the result as input.

  • The input target can be a feature class or a raster. When the target is a feature, the Target Value Field value must be set to a numeric field.

  • If the input target feature has a date field or a field that defines a dimension, specify a value for both the Target Value Field and Target Dimension Field parameters.

  • The input raster target can also be a multidimensional raster.

  • If the input target is multidimensional, the corresponding input explanatory variables must have at least one multidimensional raster. Those that intersect the target dimensions will be used in training; other dimensionless rasters in the list will be applied to all dimensions. If no explanatory variables intersect or they are all dimensionless, no training will occur.

  • If the input target is dimensionless and the explanatory variables have dimension, the first slice will be used.

  • If the output is a multidimensional raster, use CRF format. If the output is a dimensionless raster, it can be stored in any output raster format.

  • The cell sizes of the input explanatory variables will affect the training result and the processing time. By default, the tool uses the cell size of the first explanatory raster; you can change it using the Cell Size environment setting. In general, training with a cell size smaller than that of your data is not recommended.

  • The Output Importance Table parameter value can be used to analyze how much each explanatory variable contributes to predicting the target variable.

  • Specify a value for the Percent of Samples for Testing parameter to compute three types of errors: errors on training points, errors on test points, and errors on test location points. For example, if the percent value is set to 10, 10 percent of the training sample points will be set aside as reference points based on location. These reference points, called test location points, are used to measure the error for interpolation in space. The remaining training sample points will be divided into two groups: one containing 90 percent of the points, which is used to train the regression model, and one containing 10 percent of the points, which is used in testing to derive the accuracy.

  • Specifying a value for the Percent of Samples for Testing parameter will also produce a scatter plot of the predicted versus reference training sample values. The coefficient of determination (R-squared) is computed as an estimate of the goodness of fit.

  • To create a scatter plot of predicted values and training values, you can use the Sample tool to extract predicted values from the predicted rasters. Then perform a table join using the LocationID field in the Sample tool output and the ObjectID field in the target feature class. If the target input is a raster, you can generate random points and extract values from both the input target raster and the predicted raster.
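
The join described in the last usage note, followed by a goodness-of-fit check, can be sketched in plain Python. This is an illustration only: the rows are made-up values standing in for the Sample tool output and the target feature class, `join_on_id` and `r_squared` are hypothetical helper names, and the formula used is the standard definition of the coefficient of determination, which is assumed rather than confirmed to match what the tool reports.

```python
# Hypothetical rows; in practice these would be read from the Sample tool
# output table and from the target feature class.
sample_rows = [  # (LocationID, predicted value)
    (1, 11.0), (2, 19.5), (3, 31.0), (4, 39.5),
]
target_rows = [  # (ObjectID, observed target value)
    (1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0),
]


def join_on_id(sample_rows, target_rows):
    """Pair observed and predicted values where LocationID == ObjectID."""
    observed_by_id = dict(target_rows)
    return [(observed_by_id[loc_id], predicted)
            for loc_id, predicted in sample_rows
            if loc_id in observed_by_id]


def r_squared(pairs):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    observed = [obs for obs, _ in pairs]
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((obs - pred) ** 2 for obs, pred in pairs)
    ss_tot = sum((obs - mean_obs) ** 2 for obs in observed)
    return 1.0 - ss_res / ss_tot


pairs = join_on_id(sample_rows, target_rows)
print(pairs)             # (observed, predicted) pairs, ready for plotting
print(r_squared(pairs))  # close to 1 when predictions track observations
```

The resulting pairs can be passed to any plotting tool to draw the predicted-versus-observed scatter plot.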

Parameters

Label | Explanation | Data Type
Input Rasters

The single-band, multidimensional, or multiband raster datasets, or mosaic datasets containing explanatory variables.

Mosaic Dataset; Mosaic Layer; Raster Dataset; Raster Layer; Image Service; String
Target Raster or Points

The raster or point feature class containing the target variable (dependent variable) data.

Feature Class; Feature Layer; Raster Dataset; Raster Layer; Mosaic Layer; Image Service
Output Regression Definition File

A JSON-format file with an .ecd extension that contains attribute information, statistics, or other information for the regression model.

File
Target Value Field
(Optional)

The field name of the information to model in the target point feature class or raster dataset.

Field
Target Dimension Field
(Optional)

A date field or numeric field in the input point feature class that defines the dimension values.

Field
Raster Dimension
(Optional)

The dimension name of the input multidimensional raster (explanatory variables) that links to the dimension in the target data.

String
Output Importance Table
(Optional)

A table containing information describing the importance of each explanatory variable used in the model. A larger number indicates that the corresponding variable is more strongly correlated with the predicted variable and will contribute more to the prediction. Values range between 0 and 1, and the sum of all the values equals 1.

Table
Max Number of Trees
(Optional)

The maximum number of trees in the forest. Increasing the number of trees will lead to higher accuracy rates, although this improvement will level off. The number of trees increases the processing time linearly. The default is 50.

Long
Max Tree Depth
(Optional)

The maximum depth of each tree in the forest. Depth determines the number of rules each tree can create, resulting in a decision. Trees will not grow any deeper than this setting. The default is 30.

Long
Max Number of Samples
(Optional)

The maximum number of samples that will be used for the regression analysis. A value that is less than or equal to 0 means that the system will use all the samples from the input target raster or point feature class to train the regression model. The default value is 10,000.

Long
Average Points Per Cell
(Optional)

Specifies whether the average will be calculated when multiple training points fall into one cell. This parameter is applicable only when the input target is a point feature class.

  • Unchecked—All points will be used when multiple training points fall into a single cell. This is the default.
  • Checked—The average value of the training points in a cell will be calculated.

Boolean
Percent of Samples for Testing
(Optional)

The percentage of test points that will be used for error checking. The tool checks for three types of errors: errors on training points, errors on test points, and errors on test location points. The default is 10.

Double
Output Scatter Plots (pdf or html)
(Optional)

The output scatter plots in PDF or HTML format. The output will include scatter plots of training data, test data, and location test data.

File
Output Sample Features
(Optional)

The output feature class that will contain target values and predicted values for training points, test points, and location test points.

Feature Class
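
The arithmetic behind the Percent of Samples for Testing parameter can be illustrated with a short sketch. It only counts how many samples land in each group, following the 10 percent example in the usage notes. `split_sample_counts` is a hypothetical helper, the rounding rule is an assumption, and the location-based selection of points that the tool performs is not modeled here.

```python
def split_sample_counts(total_samples, percent_testing):
    """Return (test_location, train, test) counts for a percent value.

    Assumption: counts are rounded to the nearest whole sample; the
    tool's actual rounding behavior is not documented here.
    """
    fraction = percent_testing / 100.0
    # First, set aside the location-based reference points.
    test_location = round(total_samples * fraction)
    remaining = total_samples - test_location
    # Then split the remainder into test and training groups.
    test = round(remaining * fraction)
    train = remaining - test
    return test_location, train, test


print(split_sample_counts(1000, 10))  # (100, 810, 90)
```

With 1,000 samples and a value of 10, 100 points are held out as test location points, and the remaining 900 are split into 810 training points and 90 test points.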

TrainRandomTreesRegressionModel(in_rasters, in_target_data, out_regression_definition, {target_value_field}, {target_dimension_field}, {raster_dimension}, {out_importance_table}, {max_num_trees}, {max_tree_depth}, {max_samples}, {average_points_per_cell}, {percent_testing}, {out_scatterplots}, {out_sample_features})
Name | Explanation | Data Type
in_rasters
[in_rasters,...]

The single-band, multidimensional, or multiband raster datasets, or mosaic datasets containing explanatory variables.

Mosaic Dataset; Mosaic Layer; Raster Dataset; Raster Layer; Image Service; String
in_target_data

The raster or point feature class containing the target variable (dependent variable) data.

Feature Class; Feature Layer; Raster Dataset; Raster Layer; Mosaic Layer; Image Service
out_regression_definition

A JSON-format file with an .ecd extension that contains attribute information, statistics, or other information for the regression model.

File
target_value_field
(Optional)

The field name of the information to model in the target point feature class or raster dataset.

Field
target_dimension_field
(Optional)

A date field or numeric field in the input point feature class that defines the dimension values.

Field
raster_dimension
(Optional)

The dimension name of the input multidimensional raster (explanatory variables) that links to the dimension in the target data.

String
out_importance_table
(Optional)

A table containing information describing the importance of each explanatory variable used in the model. A larger number indicates that the corresponding variable is more strongly correlated with the predicted variable and will contribute more to the prediction. Values range between 0 and 1, and the sum of all the values equals 1.

Table
max_num_trees
(Optional)

The maximum number of trees in the forest. Increasing the number of trees will lead to higher accuracy rates, although this improvement will level off. The number of trees increases the processing time linearly. The default is 50.

Long
max_tree_depth
(Optional)

The maximum depth of each tree in the forest. Depth determines the number of rules each tree can create, resulting in a decision. Trees will not grow any deeper than this setting. The default is 30.

Long
max_samples
(Optional)

The maximum number of samples that will be used for the regression analysis. A value that is less than or equal to 0 means that the system will use all the samples from the input target raster or point feature class to train the regression model. The default value is 10,000.

Long
average_points_per_cell
(Optional)

Specifies whether the average will be calculated when multiple training points fall into one cell. This parameter is applicable only when the input target is a point feature class.

  • KEEP_ALL_POINTS: All points will be used when multiple training points fall into a single cell. This is the default.
  • AVERAGE_POINTS_PER_CELL: The average value of the training points in a cell will be calculated.
Boolean
percent_testing
(Optional)

The percentage of test points that will be used for error checking. The tool checks for three types of errors: errors on training points, errors on test points, and errors on test location points. The default is 10.

Double
out_scatterplots
(Optional)

The output scatter plots in PDF or HTML format. The output will include scatter plots of training data, test data, and location test data.

File
out_sample_features
(Optional)

The output feature class that will contain target values and predicted values for training points, test points, and location test points.

Feature Class
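
The difference between the KEEP_ALL_POINTS and AVERAGE_POINTS_PER_CELL options can be illustrated with a small sketch. It is not the tool's implementation: the function name, the cell origin, and the way points are assigned to cells (floor division of the coordinates by the cell size) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def average_points_per_cell(points, cell_size, origin=(0.0, 0.0)):
    """Group (x, y, value) points by grid cell and average each group.

    Assumption: a point belongs to the cell whose column and row are the
    floor of its offset from the origin divided by the cell size.
    """
    cells = defaultdict(list)
    for x, y, value in points:
        col = math.floor((x - origin[0]) / cell_size)
        row = math.floor((y - origin[1]) / cell_size)
        cells[(col, row)].append(value)
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}


# Two points share cell (0, 0); the third falls in cell (1, 0).
points = [(1.0, 1.0, 10.0), (2.0, 3.0, 20.0), (12.0, 1.0, 30.0)]
averaged = average_points_per_cell(points, cell_size=10.0)
print(averaged)  # {(0, 0): 15.0, (1, 0): 30.0}
```

With averaging, the three points yield two training samples, one per occupied cell. Keeping all points would instead yield three samples, two of which share the same explanatory values.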

Code sample

TrainRandomTreesRegressionModel example 1 (Python window)

This Python window script models the relationship between explanatory variables and a target dataset.

# Import system modules
import arcpy
from arcpy.ia import *

# Check out the ArcGIS Image Analyst extension license
arcpy.CheckOutExtension("ImageAnalyst")

# Execute - pass the two explanatory rasters as a list
arcpy.ia.TrainRandomTreesRegressionModel(
    ["weather_variables.crf", "dem.tif"], "pm2.5.shp",
    r"c:\data\pm2.5_trained.ecd", "mean_pm2.5", "date_collected", "StdTime",
    r"c:\data\pm2.5_importance.csv", 50, 30, 10000)

TrainRandomTreesRegressionModel example 2 (stand-alone script)

This Python stand-alone script models the relationship between explanatory variables and a target dataset.

# Import system modules
import arcpy
from arcpy.ia import *

# Check out the ArcGIS Image Analyst extension license
arcpy.CheckOutExtension("ImageAnalyst")

# Define input parameters
in_weather_variables = "C:/Data/ClimateVariables.crf"
in_dem_variable = "C:/Data/dem.tif"
in_target = "C:/Data/pm2.5_observations.shp"
target_value_field = "mean_pm2.5"
target_date_field = "date_collected"
raster_dimension = "StdTime"
out_model_definition = "C:/Data/pm2.5_trained_model.ecd"
out_importance_table = "C:/Data/pm2.5_importance_table.csv"
max_num_trees = 50
max_tree_depth = 30
max_num_samples = 10000

# Execute - train the random trees regression model
# The explanatory rasters are passed as a list, and the importance table
# is passed in its positional slot before max_num_trees
arcpy.ia.TrainRandomTreesRegressionModel(
    [in_weather_variables, in_dem_variable], in_target, out_model_definition,
    target_value_field, target_date_field, raster_dimension,
    out_importance_table, max_num_trees, max_tree_depth, max_num_samples)
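
The examples above stop at max_samples. The remaining optional parameters in the syntax (average_points_per_cell, percent_testing, out_scatterplots, and out_sample_features) can be supplied by name, as in the following sketch. The parameter names come from the syntax above. The output paths for the scatter plots and sample features are hypothetical, and `build_training_params` and `train` are illustrative helpers. arcpy is imported inside `train` so that the parameter dictionary can be inspected without an ArcGIS installation.

```python
def build_training_params():
    """Collect the tool arguments by name; all paths are hypothetical."""
    return {
        "in_rasters": ["C:/Data/ClimateVariables.crf", "C:/Data/dem.tif"],
        "in_target_data": "C:/Data/pm2.5_observations.shp",
        "out_regression_definition": "C:/Data/pm2.5_trained_model.ecd",
        "target_value_field": "mean_pm2.5",
        "target_dimension_field": "date_collected",
        "raster_dimension": "StdTime",
        "out_importance_table": "C:/Data/pm2.5_importance_table.csv",
        "max_num_trees": 50,
        "max_tree_depth": 30,
        "max_samples": 10000,
        "average_points_per_cell": "KEEP_ALL_POINTS",
        "percent_testing": 10,
        "out_scatterplots": "C:/Data/pm25_scatterplots.pdf",  # hypothetical
        "out_sample_features": "C:/Data/pm25_samples.shp",    # hypothetical
    }


def train(params):
    """Run the tool with keyword arguments (requires ArcGIS Pro)."""
    import arcpy
    import arcpy.ia

    arcpy.CheckOutExtension("ImageAnalyst")
    try:
        return arcpy.ia.TrainRandomTreesRegressionModel(**params)
    finally:
        # Return the license whether or not the tool succeeded.
        arcpy.CheckInExtension("ImageAnalyst")


params = build_training_params()
print(sorted(params))  # the 14 parameter names from the syntax
# train(params)  # uncomment on a machine with an Image Analyst license
```

Passing arguments by name avoids the positional slip that is easy to make with a 14-parameter signature, such as omitting out_importance_table and shifting every later value by one position.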

Related topics