Train Using AutoML (GeoAI)

Summary

Trains a deep learning model by building training pipelines and automating much of the training process. This includes exploratory data analysis, feature selection, feature engineering, model selection, hyperparameter tuning, and model training. Its outputs include performance metrics of the best model on the training data, as well as the trained deep learning model package (.dlpk) that can be used as input for the Predict Using AutoML tool to predict on a new dataset.

Learn more about how AutoML works

Usage

  • You must install the proper deep learning framework for Python in ArcGIS AllSource.

    Learn how to install deep learning frameworks for ArcGIS

  • The time it takes for the tool to produce the trained model depends on the following:

    • The amount of data provided during training
    • The AutoML Mode parameter value

    By default, the timer for all modes is set at 60 minutes. Regardless of the amount of data used in training, the Basic option of the AutoML Mode parameter will not take the entire 60 minutes to find the optimum model. The fit process will complete as soon as the optimum model is identified. The Advanced option will take more time due to the additional tasks of feature engineering, feature selection, and hyperparameter tuning.

    In addition to the new features obtained by combining multiple features from the input, the tool creates spatial features with names from zone3_id through zone7_id. These new features will be extracted from the location information in the input data and will be used to train better models. For more information about the new spatial features, see How AutoML Works.

    If the amount of data being trained is large, all combinations of the models may not be evaluated within 60 minutes. In such cases, the best performing model determined within 60 minutes will be considered the optimum model. You can then either use this model or rerun the tool with a higher Total Time Limit (Minutes) parameter value.
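
    To give a rough feel for grid-based spatial features such as zone3_id through zone7_id, the following minimal sketch assigns a point to square grid cells at several resolutions and returns one ID per level. It is illustrative only: the cell sizes, the ID format, and the function name zone_ids are assumptions, and the tool's actual gridding scheme may differ (see How AutoML Works).

```python
import math


def zone_ids(x, y, levels=range(3, 8), base_size=1000.0):
    """Return hypothetical zoneN_id values for a point at (x, y).

    Assumption: level N uses square cells whose side is
    base_size / 2 ** (N - 3), so higher levels mean finer grids.
    This is not the tool's documented scheme.
    """
    ids = {}
    for level in levels:
        size = base_size / (2 ** (level - 3))
        col = math.floor(x / size)
        row = math.floor(y / size)
        # Encode the level and cell position as a categorical label.
        ids[f"zone{level}_id"] = f"{level}_{col}_{row}"
    return ids


features = zone_ids(2500.0, 750.0)
```

    Nearby points share the same ID at coarse levels and diverge at finer ones, which is what lets a model pick up location effects at more than one scale.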

  • An ArcGIS Spatial Analyst extension license is required to use rasters as explanatory variables.

  • The Output Report parameter value is a file in HTML format that provides a way to review the information in the working directory.

    The first page in the output report includes links to each of the models evaluated and shows their performance on a validation dataset along with the time it took to train them. Based on the evaluation metric, the report shows the best performing model that was chosen.

    RMSE is the default evaluation metric for regression problems, while Logloss is the default metric for classification problems. The following metrics are available in the output report:

    • Classification—AUC, Logloss, F1, Accuracy, Average precision
    • Regression—MSE, RMSE, MAE, R2, MAPE, Spearman coefficient, Pearson coefficient

    When you click a model combination, details about the training for that model combination are displayed including the learning curves, variable importance curves, hyperparameters used, and so on.
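
    For reference, the two default evaluation metrics named above can be written out in a few lines. This is a minimal sketch using the standard textbook definitions of RMSE and binary Logloss; the function names are illustrative, and it is not the tool's own implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regression problem."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        # Clip probabilities so log(0) is never evaluated.
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


regression_error = rmse([3.0, 5.0], [1.0, 5.0])
classification_error = log_loss([1, 0], [0.5, 0.5])
```

    Lower values are better for both metrics.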

  • Example use cases for the tool include training an annual solar energy generation model based on weather factors, training a crop prediction model using related variables, and training a house value prediction model.

  • For information about requirements for running this tool and issues you may encounter, see Deep Learning frequently asked questions.

  • To use the Add Image Attachments parameter, prepare the Input Training Features parameter value for image attachments by doing the following:

    • Ensure that the feature layer includes a field with image file paths for each record.
    • Enable attachments for the feature layer using the Enable Attachments tool.
    • Use the Add Attachments tool to specify the image path field and add it as an image attachment to the feature layer.
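
    The preparation steps above might be scripted as follows. This is a sketch under assumptions: it uses the feature class itself as the match table for Add Attachments, and the function name, the ap parameter, and the field names in the usage note are hypothetical. The arcpy module is imported lazily so the function can be exercised with a stand-in object where ArcGIS is not installed.

```python
def prepare_image_attachments(feature_class, key_field, path_field, ap=None):
    """Enable attachments, then attach the images listed in path_field.

    ap is the arcpy module (or a stand-in exposing the same management
    functions). key_field identifies each record, for example OBJECTID.
    """
    if ap is None:
        import arcpy as ap  # requires an ArcGIS installation

    # Step 2 of the list above: enable attachments on the feature class.
    ap.management.EnableAttachments(feature_class)

    # Step 3: add the file referenced in path_field to each record.
    # Assumption: the feature class doubles as the match table.
    ap.management.AddAttachments(
        feature_class, key_field, feature_class, key_field, path_field
    )
    return feature_class
```

    A call such as prepare_image_attachments(fc, "OBJECTID", "image_path") would then be followed by running the tool with the Add Image Attachments parameter checked.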

Parameters

Label | Explanation | Data Type
Input Training Features

The input feature class that will be used to train the model.

Feature Layer; Table View
Output Model

The output trained model that will be saved as a deep learning package (.dlpk file).

File
Variable to Predict

A field from the Input Training Features parameter that contains the values that will be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
Treat Variable as Categorical
(Optional)

Specifies whether the Variable to Predict parameter value will be treated as a categorical variable.

  • Checked—The Variable to Predict parameter value will be treated as a categorical variable and classification will be performed.
  • Unchecked—The Variable to Predict parameter value will be treated as continuous and regression will be performed. This is the default.

Boolean
Explanatory Training Variables
(Optional)

A list of fields representing the explanatory variables that will help predict the value or category of the Variable to Predict parameter value. Check the accompanying check box for any variables that represent classes or categories (such as land cover, presence, or absence).

Value Table
Explanatory Training Distance Features
(Optional)

The features whose distances from the input training features will be estimated automatically and added as more explanatory variables. Distances will be calculated from each of the input explanatory training distance features to the nearest input training features. Point and polygon features are supported, and if the input explanatory training distance features are polygons, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
Explanatory Training Rasters
(Optional)

The rasters whose values will be extracted and used as explanatory variables for the model. Each layer forms one explanatory variable. For each feature in the input training features, the value of the raster cell will be extracted at that exact location. Bilinear raster resampling will be used when extracting the raster value for continuous rasters. Nearest neighbor assignment will be used when extracting a raster value from categorical rasters. If the Input Training Features parameter value contains polygons and a value is provided for this parameter, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters. Check the Categorical column check box for any raster that represents classes or categories such as land cover, presence, or absence.

Value Table
Total Time Limit (Minutes)
(Optional)

The total time limit, in minutes, for AutoML model training. The default is 60 (1 hour).

Double
AutoML Mode
(Optional)

Specifies the goal of AutoML and how intensive the AutoML search will be.

  • Basic—Basic is used to explain the significance of the different variables and the data. Feature engineering, feature selection, and hyperparameter tuning will not be performed. Full descriptions and explanations for model learning curves, feature importance plots generated for tree-based models, and SHAP plots for all other models will be included in reports. This mode takes the least amount of processing time. This is the default.
  • Intermediate—Intermediate is used to train a model that will be used in real-life use cases. This mode uses 5-fold cross validation (CV) and produces output of learning curves and importance plots in the reports, but SHAP plots are not available.
  • Advanced—Advanced is used for machine learning competitions (for maximum performance). This mode uses 10-fold cross validation (CV) and performs feature engineering, feature selection, and hyperparameter tuning. Input training features are assigned to multiple spatial grids of different sizes based on their location, and the corresponding grid IDs are passed as additional categorical explanatory variables to the model. The report only includes learning curves; model explainability is not available.
String
Algorithms
(Optional)

Specifies the algorithms that will be used during the training.

By default, all the algorithms will be used.

  • Linear—The Linear regression supervised algorithm will be used to train a regression machine learning model. If this is the only option specified, ensure that the total number of records is less than 10,000 and the number of columns is less than 1,000. Other models can accommodate larger datasets, and it is recommended that you use this option with other algorithms and not as the sole algorithm.
  • Random Trees—The Random Trees decision tree-based supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • XGBoost—The XGBoost (extreme gradient boosting) supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • Light GBM—The Light GBM gradient boosting ensemble algorithm, which is based on decision trees, will be used. It can be used for both classification and regression. Light GBM is optimized for high performance with distributed systems.
  • Decision Tree—The Decision Tree supervised machine learning algorithm, which classifies or regresses the data using true and false answers to certain questions, will be used. Decision trees are easily understood and are good for explainability.
  • Extra Tree—The Extra Tree (extremely randomized trees) ensemble supervised machine learning algorithm, which uses decision trees, will be used. This algorithm is similar to Random Trees but can be faster.
  • CatBoost—The CatBoost algorithm will be used. It uses decision trees for classification and regression. This option can use a combination of categorical and noncategorical explanatory variables without preprocessing.
String
Validation Percentage
(Optional)

The percentage of input data that will be used for validation. The default value is 10.

Long
Output Report
(Optional)

The output report that will be generated as an .html file. If the path provided is not empty, the report will be created in a new folder under the provided path. The report will contain details of the various models as well as details of the hyperparameters that were used during the evaluation and the performance of each model. Hyperparameters are parameters that control the training process. They are not updated during training and include model architecture, learning rate, number of epochs, and so on.

File
Output Importance Table
(Optional)

An output table containing information about the importance of each explanatory variable (fields, distance features, and rasters) used in the model.

Table
Output Feature Class
(Optional)

The feature layer containing the values predicted by the best performing model for the training feature layer. It can be used to verify model performance by visually comparing the predicted values with the ground truth.

Feature Class
Add Image Attachments
(Optional)

Specifies whether images will be used as explanatory variables from the Input Training Features parameter value for training a multimodal or mixed data model. Training a multimodal or mixed data tabular model involves using machine and deep learning backbones in AutoML to learn from multiple types of data formats by a single model. The input data can consist of a combination of explanatory variables from a diverse set of data sources such as text descriptions, corresponding images, and any additional categorical and continuous variables.

  • Checked—The image attachments will be downloaded and treated as explanatory variables and multimodal data training will be performed.
  • Unchecked—The image attachments will not be used during the training. This is the default.

Boolean
Sensitive Feature Attributes
(Optional)

Assesses and improves the fairness of models trained on tabular data for classification and regression. Set the following two components for this parameter:

  • Sensitive Features—An attribute such as race, gender, socioeconomic status, or age that can introduce bias in machine learning or deep learning models. Biases associated with the selected sensitive features will be mitigated to produce a less biased model.
  • Underprivileged Groups—The discriminated group from the Sensitive Features value provided.

Value Table
Fairness Metric
(Optional)

Specifies the fairness metric that will be used to measure fairness for classification and regression problems. The metric is used in the grid search that selects the best fair model.

  • Demographic parity ratio—This metric is used in classification models. The ratio of selection rates between different groups of individuals will be measured. The selection rate is the proportion of individuals who are classified as positive by the model. The ideal value for this metric is 1, which indicates that the selection rates for different groups are equal. The fair range for this metric is 0.8 to 1, meaning that the selection rates of the groups should differ by no more than 20 percent.
  • Demographic parity difference—This metric is used in classification models. It is similar to the demographic parity ratio metric, but the difference in selection rates between different groups of individuals will be measured, rather than the ratio. The selection rate is the proportion of individuals who are classified as positive by the model. The ideal value for this metric is 0, which indicates that there is no difference in selection rates between groups. The fair range for this metric is 0 to 0.25, meaning that the selection rates of the groups should differ by no more than 25 percentage points.
  • Equalized odds ratio—This metric is used in classification models. The ratio of error rates between groups of individuals, such as different racial or gender groups, will be measured. The ideal value for this metric is 1, which indicates that the error rates for different groups are equal. The fair range for this metric is 0.8 to 1, meaning that the error rates of the groups should differ by no more than 20 percent.
  • Equalized odds difference—This metric is used in classification models. It is similar to the equalized odds ratio metric, but the difference in error rates between different groups of individuals will be measured, rather than the ratio. The ideal value for this metric is 0, which indicates that there is no difference in error rates between groups. The fair range for this metric is 0 to 0.25, meaning that the error rates of the groups should differ by no more than 25 percentage points.
  • Group loss ratio—This metric is used in regression models. The ratio of the average loss or error for one subgroup compared to another subgroup will be measured. It provides a relative measure of the disparity in losses between groups. A value of 1 indicates no difference in losses between the groups, while values greater or smaller than 1 indicate relative disparities.
String
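
As a worked illustration of the two demographic parity metrics described above, the following sketch computes them from a list of binary predictions and a matching list of group labels. It follows the definitions given in the parameter description (selection rate is the share of records classified as positive) and is not the tool's internal code; the function names and the sample data are hypothetical.

```python
def selection_rates(y_pred, groups):
    """Return the share of positive (1) predictions for each group."""
    totals, positives = {}, {}
    for pred, group in zip(y_pred, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_ratio(y_pred, groups):
    """Lowest selection rate divided by the highest (ideal value: 1)."""
    rates = list(selection_rates(y_pred, groups).values())
    highest = max(rates)
    # Assumption: if no group has a positive prediction, treat rates as equal.
    return 1.0 if highest == 0 else min(rates) / highest


def demographic_parity_difference(y_pred, groups):
    """Highest selection rate minus the lowest (ideal value: 0)."""
    rates = list(selection_rates(y_pred, groups).values())
    return max(rates) - min(rates)


# Hypothetical predictions for two groups of four records each.
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = demographic_parity_ratio(y_pred, groups)
difference = demographic_parity_difference(y_pred, groups)
```

With these sample predictions, group a is selected 75 percent of the time and group b 50 percent, so the ratio is about 0.67 (outside the 0.8 to 1 range) and the difference is 0.25 (at the edge of the 0 to 0.25 range).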

arcpy.geoai.TrainUsingAutoML(in_features, out_model, variable_predict, {treat_variable_as_categorical}, {explanatory_variables}, {distance_features}, {explanatory_rasters}, {total_time_limit}, {autoML_mode}, {algorithms}, {validation_percent}, {out_report}, {out_importance}, {out_features}, {add_image_attachments}, {sensitive_feature}, {fairness_metric})
Name | Explanation | Data Type
in_features

The input feature class that will be used to train the model.

Feature Layer; Table View
out_model

The output trained model that will be saved as a deep learning package (.dlpk file).

File
variable_predict

A field from the in_features parameter that contains the values that will be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Field
treat_variable_as_categorical
(Optional)

Specifies whether the variable_predict parameter value will be treated as a categorical variable.

  • CATEGORICAL—The variable_predict parameter value will be treated as a categorical variable and classification will be performed.
  • CONTINUOUS—The variable_predict parameter value will be treated as continuous and regression will be performed. This is the default.
Boolean
explanatory_variables
[explanatory_variables,...]
(Optional)

A list of fields representing the explanatory variables that will help predict the value or category of the variable_predict parameter value. Pass the true value ("<name_of_variable> true") for any variables that represent classes or categories (such as land cover, presence, or absence).

Value Table
distance_features
[distance_features,...]
(Optional)

The features whose distances from the input training features will be estimated automatically and added as more explanatory variables. Distances will be calculated from each of the input explanatory training distance features to the nearest input training features. Point and polygon features are supported, and if the input explanatory training distance features are polygons, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
explanatory_rasters
[explanatory_rasters,...]
(Optional)

The rasters whose values will be extracted and used as explanatory variables for the model. Each layer forms one explanatory variable. For each feature in the input training features, the value of the raster cell will be extracted at that exact location. Bilinear raster resampling will be used when extracting the raster value for continuous rasters. Nearest neighbor assignment will be used when extracting a raster value from categorical rasters. If the in_features parameter value contains polygons and a value is provided for this parameter, one raster value for each polygon will be used in the model. Each polygon is assigned the average value for continuous rasters and the majority for categorical rasters. Pass the true value using "<name_of_raster> true" for any raster that represents classes or categories such as land cover, presence, or absence.

Value Table
total_time_limit
(Optional)

The total time limit, in minutes, for AutoML model training. The default is 60 (1 hour).

Double
autoML_mode
(Optional)

Specifies the goal of AutoML and how intensive the AutoML search will be.

  • BASIC—Basic is used to explain the significance of the different variables and the data. Feature engineering, feature selection, and hyperparameter tuning will not be performed. Full descriptions and explanations for model learning curves, feature importance plots generated for tree-based models, and SHAP plots for all other models will be included in reports. This mode takes the least amount of processing time. This is the default.
  • INTERMEDIATE—Intermediate is used to train a model that will be used in real-life use cases. This mode uses 5-fold cross validation (CV) and produces output of learning curves and importance plots in the reports, but SHAP plots are not available.
  • ADVANCED—Advanced is used for machine learning competitions (for maximum performance). This mode uses 10-fold cross validation (CV) and performs feature engineering, feature selection, and hyperparameter tuning. Input training features are assigned to multiple spatial grids of different sizes based on their location, and the corresponding grid IDs are passed as additional categorical explanatory variables to the model. The report only includes learning curves; model explainability is not available.
String
algorithms
[algorithms,...]
(Optional)

Specifies the algorithms that will be used during the training.

  • LINEAR—The Linear regression supervised algorithm will be used to train a regression machine learning model. If this is the only option specified, ensure that the total number of records is less than 10,000 and the number of columns is less than 1,000. Other models can accommodate larger datasets, and it is recommended that you use this option with other algorithms and not as the sole algorithm.
  • RANDOM TREES—The Random Trees decision tree-based supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • XGBOOST—The XGBoost (extreme gradient boosting) supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • LIGHT GBM—The Light GBM gradient boosting ensemble algorithm, which is based on decision trees, will be used. It can be used for both classification and regression. Light GBM is optimized for high performance with distributed systems.
  • DECISION TREE—The Decision Tree supervised machine learning algorithm, which classifies or regresses the data using true and false answers to certain questions, will be used. Decision trees are easily understood and are good for explainability.
  • EXTRA TREE—The Extra Tree (extremely randomized trees) ensemble supervised machine learning algorithm, which uses decision trees, will be used. This algorithm is similar to Random Trees but can be faster.
  • CATBOOST—The CatBoost algorithm will be used. It uses decision trees for classification and regression. This option can use a combination of categorical and noncategorical explanatory variables without preprocessing.

By default, all the algorithms will be used.

String
validation_percent
(Optional)

The percentage of input data that will be used for validation. The default value is 10.

Long
out_report
(Optional)

The output report that will be generated as an .html file. If the path provided is not empty, the report will be created in a new folder under the provided path. The report will contain details of the various models as well as details of the hyperparameters that were used during the evaluation and the performance of each model. Hyperparameters are parameters that control the training process. They are not updated during training and include model architecture, learning rate, number of epochs, and so on.

File
out_importance
(Optional)

An output table containing information about the importance of each explanatory variable (fields, distance features, and rasters) used in the model.

Table
out_features
(Optional)

The feature layer containing the values predicted by the best performing model for the training feature layer. It can be used to verify model performance by visually comparing the predicted values with the ground truth.

Feature Class
add_image_attachments
(Optional)

Specifies whether images will be used as explanatory variables from the in_features parameter value for training a multimodal or mixed data model. Training a multimodal or mixed data tabular model involves using machine and deep learning backbones in AutoML to learn from multiple types of data formats by a single model. The input data can consist of a combination of explanatory variables from a diverse set of data sources such as text descriptions, corresponding images, and any additional categorical and continuous variables.

  • TRUE—The image attachments will be downloaded and treated as explanatory variables and multimodal data training will be performed.
  • FALSE—The image attachments will not be used during the training. This is the default.
Boolean
sensitive_feature
[sensitive_feature,...]
(Optional)

Assesses and improves the fairness of models trained on tabular data for classification and regression. Set the following two components for this parameter:

  • Sensitive Features—An attribute such as race, gender, socioeconomic status, or age that can introduce bias in machine learning or deep learning models. Biases associated with the selected sensitive features will be mitigated to produce a less biased model.
  • Underprivileged Groups—The discriminated group from the Sensitive Features value provided.

Value Table
fairness_metric
(Optional)

Specifies the fairness metric that will be used to measure fairness for classification and regression problems. The metric is used in the grid search that selects the best fair model.

  • DEMOGRAPHIC_PARITY_RATIO—This metric is used in classification models. The ratio of selection rates between different groups of individuals will be measured. The selection rate is the proportion of individuals who are classified as positive by the model. The ideal value for this metric is 1, which indicates that the selection rates for different groups are equal. The fair range for this metric is 0.8 to 1, meaning that the selection rates of the groups should differ by no more than 20 percent.
  • DEMOGRAPHIC_PARITY_DIFFERENCE—This metric is used in classification models. It is similar to the demographic parity ratio metric, but the difference in selection rates between different groups of individuals will be measured, rather than the ratio. The selection rate is the proportion of individuals who are classified as positive by the model. The ideal value for this metric is 0, which indicates that there is no difference in selection rates between groups. The fair range for this metric is 0 to 0.25, meaning that the selection rates of the groups should differ by no more than 25 percentage points.
  • EQUALISED_ODDS_RATIO—This metric is used in classification models. The ratio of error rates between groups of individuals, such as different racial or gender groups, will be measured. The ideal value for this metric is 1, which indicates that the error rates for different groups are equal. The fair range for this metric is 0.8 to 1, meaning that the error rates of the groups should differ by no more than 20 percent.
  • EQUALISED_ODDS_DIFFERENCE—This metric is used in classification models. It is similar to the equalized odds ratio metric, but the difference in error rates between different groups of individuals will be measured, rather than the ratio. The ideal value for this metric is 0, which indicates that there is no difference in error rates between groups. The fair range for this metric is 0 to 0.25, meaning that the error rates of the groups should differ by no more than 25 percentage points.
  • GROUP_LOSS_RATIO—This metric is used in regression models. The ratio of the average loss or error for one subgroup compared to another subgroup will be measured. It provides a relative measure of the disparity in losses between groups. A value of 1 indicates no difference in losses between the groups, while values greater or smaller than 1 indicate relative disparities.
String
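
The INTERMEDIATE and ADVANCED options of the autoML_mode parameter use 5-fold and 10-fold cross validation, respectively. As a reminder of what k-fold cross validation means, the sketch below splits record indices into k folds so that each record is used for validation exactly once. It is a generic illustration with a hypothetical function name; whether the tool shuffles or stratifies records before splitting is not described here.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, validation
        start += size


# Ten records split the way a 5-fold run would see them.
folds = list(k_fold_indices(10, 5))
```

Each model is trained k times, once per fold, which is one reason these modes take longer than BASIC.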

Code sample

TrainUsingAutoML example (Python window)

This example shows how to use the TrainUsingAutoML function.

# Name: TrainUsingAutoML.py
# Description: Train a machine learning model on feature or tabular data with
# automatic hyperparameter selection.
  
# Import system modules
import arcpy
import os

# Set local variables

datapath  = "path_to_data" 
out_path = "path_to_trained_model"

in_feature = os.path.join(datapath, "train_data.gdb", "name_of_data")
out_model = os.path.join(out_path, "model.dlpk")

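# Run the Train Using AutoML tool.
# Positional arguments after the target field ("price"):
# treat_variable_as_categorical (None keeps the default), explanatory_variables,
# distance_features, explanatory_rasters, total_time_limit (minutes), autoML_mode.
arcpy.geoai.TrainUsingAutoML(in_feature, out_model, "price", None,
                             "bathrooms #;bedrooms #;square_fee #", None, None,
                             60, "BASIC")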
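
The explanatory_variables string in the sample can be tedious to write by hand when some fields are categorical. The following sketch builds it from a dictionary and passes it to a classification run. It assumes the formats shown on this page ("<name_of_variable> true" for categorical fields and the "#" placeholder otherwise); the helper names, the ap parameter, and the field names in the usage note are hypothetical.

```python
def value_table(variables):
    """Build a semicolon-delimited value table string.

    variables maps each field name to True when the field is categorical.
    Categorical fields are flagged with "true"; others get "#".
    """
    return ";".join(
        f"{name} {'true' if categorical else '#'}"
        for name, categorical in variables.items()
    )


def train_classifier(in_features, out_model, target, variables, ap=None):
    """Run TrainUsingAutoML for a categorical target in INTERMEDIATE mode."""
    if ap is None:
        import arcpy as ap  # requires an ArcGIS installation
    return ap.geoai.TrainUsingAutoML(
        in_features, out_model, target, "CATEGORICAL",
        value_table(variables), None, None,
        120, "INTERMEDIATE",  # time limit in minutes, AutoML mode
    )


explanatory = value_table({"bathrooms": False, "bedrooms": False, "zoning": True})
```

Here explanatory evaluates to "bathrooms #;bedrooms #;zoning true", which marks only the zoning field as categorical.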