Train Using AutoML (GeoAI)

Trains a deep learning model by building training pipelines and automating much of the training process. This includes exploratory data analysis, feature selection, feature engineering, model selection, hyperparameter tuning, and model training. Its outputs include performance metrics of the best model on the training data, as well as the trained deep learning model package (.dlpk) that can be used as input for the Predict Using AutoML tool to predict on a new dataset.

  • You must install the proper deep learning framework for Python in AllSource.

  • The time it takes for the tool to produce the trained model depends on the following:

    • The amount of data provided during training
    • The AutoML Mode parameter value

    By default, the timer for all modes is set to 60 minutes. Regardless of the amount of training data, the Basic option will not take the entire 60 minutes to find the optimum model; the fit process completes as soon as the optimum model is identified.

    The Advanced option takes more time because of the additional tasks of feature engineering, feature selection, and hyperparameter tuning. In addition to the new features obtained by combining multiple features from the input, the tool creates spatial features named zone3_id through zone7_id. These features are extracted from the location information in the input data and are used to train better models. For more information about the new spatial features, see How AutoML Works.

    If the amount of training data is large, some model combinations may not be evaluated within 60 minutes. In such cases, the best performing model determined within 60 minutes is considered the optimum model. You can then either use this model or rerun the tool with a higher Total Time Limit (Minutes) parameter value.

  • An ArcGIS Spatial Analyst extension license is required to use rasters as explanatory variables.

  • The Output Report parameter value is a file in HTML format that provides a way to review the information in the working directory.

    The first page in the output report includes links to each of the models evaluated and shows their performance on a validation dataset along with the time it took to train them. Based on the evaluation metric, the report shows the best performing model that was chosen.

    RMSE is the default evaluation metric for regression problems, while Logloss is the default metric for classification problems. The following metrics are available in the output report:

    • Classification—AUC, Logloss, F1, Accuracy, Average precision
    • Regression—MSE, RMSE, MAE, R2, MAPE, Spearman coefficient, Pearson coefficient

    When you click a model combination, details about the training for that model combination are displayed including the learning curves, variable importance curves, hyperparameters used, and so on.

  • Potential use cases for the tool include training an annual solar energy generation model based on weather factors, training a crop prediction model using related variables, and training a house value prediction model.
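
The two default evaluation metrics named in the output report notes above, RMSE for regression and Logloss for classification, can be illustrated with a minimal sketch. This is plain Python written for illustration, not the tool's own implementation; the function names and sample values are assumptions, and the log loss shown is the binary form only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for regression predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; y_true holds 0/1 labels, p_pred the predicted P(class 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) is never evaluated
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical sample values: lower is better for both metrics.
regression_rmse = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
classification_logloss = log_loss([1, 0, 1], [0.9, 0.2, 0.6])
print(round(regression_rmse, 4), round(classification_logloss, 4))
```

Both metrics are errors, so when comparing models in the report, the model with the smaller value ranks higher.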

Label | Explanation | Data Type
Input Training Features

The input feature class that will be used to train the model.

Feature Layer; Table View
Output Model

The output trained model that will be saved as a deep learning package (.dlpk file).

Variable to Predict

A field from the Input Training Features parameter value that contains the values that will be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations.

Treat Variable as Categorical

Specifies whether the Variable to Predict parameter value will be treated as a categorical variable.

  • Checked—The Variable to Predict parameter value will be treated as a categorical variable and the tool will perform classification.
  • Unchecked—The Variable to Predict parameter value will be treated as continuous and the tool will perform regression. This is the default.

Explanatory Training Variables

A list of fields representing the explanatory variables that will help predict the value or category of the Variable to Predict parameter value. Check the accompanying check box for any variables that represent classes or categories (such as land cover, presence, or absence).

Value Table
Explanatory Training Distance Features

The features whose distances from the input training features will be estimated automatically and added as more explanatory variables. Distances will be calculated from each of the input explanatory training distance features to the nearest input training features. Point and polygon features are supported, and if the input explanatory training distance features are polygons, the distance attributes will be calculated as the distance between the closest segments of the pair of features.

Feature Layer
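
The idea of a distance-based explanatory variable can be sketched for the simplest case, point features in a projected (planar) coordinate system. This is an illustrative assumption, not the tool's algorithm: the `nearest_distance` helper, the `hospitals` layer, and the coordinates are hypothetical, and polygon inputs (which use closest segments) are not covered.

```python
import math

def nearest_distance(point, candidates):
    """Planar distance from one training point to the closest candidate point."""
    return min(math.dist(point, c) for c in candidates)

training_points = [(0.0, 0.0), (10.0, 0.0)]
hospitals = [(3.0, 4.0), (10.0, 6.0)]  # hypothetical distance features

# One new explanatory value per training feature.
dist_to_hospital = [nearest_distance(p, hospitals) for p in training_points]
print(dist_to_hospital)
```

Each distance feature layer contributes one such column, so three layers would add three distance variables to every training record.
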
Explanatory Training Rasters

The rasters whose cell values will be extracted and used as explanatory variables for the model. Each layer forms one explanatory variable. For each feature in the input training features, the value of the raster cell will be extracted at that exact location. Bilinear resampling will be used when extracting values from continuous rasters, and nearest neighbor assignment will be used when extracting values from categorical rasters. If the Input Training Features parameter value contains polygons and you have specified this parameter, one raster value for each polygon will be used in the model: each polygon is assigned the average value for continuous rasters and the majority value for categorical rasters. Check the Categorical column check box for any raster that represents classes or categories such as land cover, presence, or absence.

Value Table
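
The difference between the two extraction methods can be shown with a small sketch. It assumes a raster stored as a list of rows, positions expressed in cell-center units (row, column), and a sample position inside the grid; the function names and grids are hypothetical and this is not the tool's own code.

```python
import math

def sample_nearest(grid, row, col):
    """Nearest neighbor: take the value of the closest cell center."""
    r = min(max(math.floor(row + 0.5), 0), len(grid) - 1)
    c = min(max(math.floor(col + 0.5), 0), len(grid[0]) - 1)
    return grid[r][c]

def sample_bilinear(grid, row, col):
    """Bilinear: distance-weighted average of the four surrounding cell centers."""
    r0 = min(max(math.floor(row), 0), len(grid) - 2)
    c0 = min(max(math.floor(col), 0), len(grid[0]) - 2)
    fr, fc = row - r0, col - c0
    top = grid[r0][c0] * (1 - fc) + grid[r0][c0 + 1] * fc
    bottom = grid[r0 + 1][c0] * (1 - fc) + grid[r0 + 1][c0 + 1] * fc
    return top * (1 - fr) + bottom * fr

elevation = [[10.0, 20.0], [30.0, 40.0]]  # continuous raster
land_cover = [[1, 2], [3, 4]]             # categorical raster (class codes)

elevation_value = sample_bilinear(elevation, 0.25, 0.75)
land_cover_value = sample_nearest(land_cover, 0.25, 0.75)
print(elevation_value, land_cover_value)
```

Nearest neighbor is the safer choice for class codes because averaging two codes, say 2 and 4, would produce a value that corresponds to no real category.
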
Total Time Limit (Minutes)

The total time limit, in minutes, for AutoML model training. The default is 60 (1 hour).
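
How a time limit shapes the result can be sketched as a budgeted search: candidates are evaluated in order until the budget is spent, and the best model seen so far is returned. The function, candidate names, and error values below are hypothetical, and the injectable `clock` exists only to keep the example deterministic; the tool's actual scheduling is not described here.

```python
import time

def search_within_budget(candidates, budget_seconds, clock=time.monotonic):
    """Evaluate candidates in order until the budget runs out; return the best so far.

    candidates: list of (name, evaluate) pairs, where evaluate() returns a
    validation error (lower is better).
    """
    deadline = clock() + budget_seconds
    best_name, best_error = None, float("inf")
    for name, evaluate in candidates:
        if clock() >= deadline:
            break  # budget spent: remaining candidates are never evaluated
        error = evaluate()
        if error < best_error:
            best_name, best_error = name, error
    return best_name, best_error

# A fake clock stands in for real elapsed time: readings of 0, 1, 2, then 5 seconds.
fake_ticks = iter([0, 1, 2, 5])
candidates = [
    ("linear", lambda: 0.9),
    ("random_trees", lambda: 0.4),
    ("xgboost", lambda: 0.1),  # would win, but the budget ends first
]
best = search_within_budget(candidates, budget_seconds=3, clock=lambda: next(fake_ticks))
print(best)
```

In this sketch the third candidate is never reached, which mirrors the case where a higher time limit could produce a better model.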

AutoML Mode

Specifies the goal of AutoML and how intensive the AutoML search will be.

  • Basic—Basic is used to explain the significance of the different variables and the data. Feature engineering, feature selection, and hyperparameter tuning will not be performed. Full descriptions and explanations for model learning curves, feature importance plots generated for tree-based models, and SHAP plots for all other models will be included in reports. This mode takes the least amount of processing time. This is the default.
  • Intermediate—Intermediate is used to train a model that will be used in real-life use cases. This mode uses 5-fold cross validation (CV) and produces output of learning curves and importance plots in the reports, but SHAP plots are not available.
  • Advanced—Advanced is used for machine learning competitions (for maximum performance). This mode uses 10-fold cross validation (CV) and performs feature engineering, feature selection, and hyperparameter tuning. Input training features are assigned to multiple spatial grids of different sizes based on their location, and the corresponding grid IDs are passed as additional categorical explanatory variables to the model. The report only includes learning curves; model explainability is not available.
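
The grid assignment described for the Advanced mode can be pictured with a rough sketch. The gridding scheme the tool uses is not specified here, so this example assumes square cells in a projected coordinate system; the cell sizes, the `grid_<size>_id` field names, and the sample record are all hypothetical.

```python
import math

def grid_cell_id(x, y, cell_size):
    """Label the square cell of the given size that contains point (x, y)."""
    col = math.floor(x / cell_size)
    row = math.floor(y / cell_size)
    return f"{col}_{row}"

def add_grid_features(record, cell_sizes):
    """Return a copy of the record with one categorical grid ID per cell size."""
    enriched = dict(record)
    for size in cell_sizes:
        enriched[f"grid_{size}_id"] = grid_cell_id(record["x"], record["y"], size)
    return enriched

sample = add_grid_features({"x": 1250.0, "y": 430.0, "price": 310000}, [1000, 500, 100])
print(sample)
```

Coarse cells let a model learn broad regional effects, while fine cells capture local ones, which is why several grid sizes are used together.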

Algorithms

Specifies the algorithms that will be used during training.

By default, all the algorithms will be used.

  • Linear—The Linear regression supervised algorithm will be used to train a regression machine learning model. If Linear is the only algorithm specified, ensure that the total number of records is less than 10,000 and the number of columns is less than 1,000. Other models can accommodate larger datasets, and it is recommended that you use Linear with other algorithms and not as the sole algorithm.
  • Random Trees—The Random Trees decision tree-based supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • XGBoost—The XGBoost (extreme gradient boosting) supervised machine learning algorithm will be used. It can be used for both classification and regression.
  • Light GBM—The Light GBM gradient boosting ensemble algorithm, which is based on decision trees, will be used. It can be used for both classification and regression. Light GBM is optimized for high performance with distributed systems.
  • Decision Tree—The Decision Tree supervised machine learning algorithm, which classifies or regresses the data using true and false answers to certain questions, will be used. Decision trees are easily understood and are good for explainability.
  • Extra Tree—The Extra Tree (extremely randomized trees) ensemble supervised machine learning algorithm, which uses decision trees, will be used. This algorithm is similar to Random Trees but can be faster.
  • CatBoost—The CatBoost algorithm will be used. It uses decision trees for classification and regression. This option can use a combination of categorical and noncategorical explanatory variables without preprocessing.
Validation Percentage

The percentage of input data that will be used for validation. The default value is 10.
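
As a rough illustration of what a validation percentage means, the sketch below holds out 10 percent of a set of records after shuffling. How the tool actually selects validation records (for example, whether it stratifies by class) is not stated here; the function name, seed, and record IDs are assumptions.

```python
import random

def split_for_validation(records, validation_percent=10, seed=0):
    """Shuffle records and hold out validation_percent of them for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = round(len(shuffled) * validation_percent / 100)
    return shuffled[n_validation:], shuffled[:n_validation]

# 200 hypothetical record IDs: 180 are used for training, 20 for validation.
train, validation = split_for_validation(range(200))
print(len(train), len(validation))
```

The held-out records are never used for fitting, so metrics computed on them estimate how a model behaves on data it has not seen.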

Output Report

The output report that will be generated as an .html file. If the path provided is not empty, the report will be created in a new folder under the provided path. The report will contain details of the various models as well as details of the hyperparameters that were used during the evaluation and the performance of each model. Hyperparameters are parameters that control the training process. They are not updated during training and include model architecture, learning rate, number of epochs, and so on.

Output Importance Table

An output table containing information about the importance of each explanatory variable (fields, distance features, and rasters) used in the model.

Output Feature Class

The feature layer containing the values predicted by the best performing model for the training features. It can be used to verify model performance by visually comparing the predicted values with the ground truth.

Feature Class