Deep learning arguments

AllSource 1.3

Available with Image Analyst license.

Arguments are one of the many ways to control how deep learning models are trained and used. In this topic, the first table lists the supported model arguments for training deep learning models. The second table lists the arguments to control how deep learning models are used for inferencing.

Training arguments

The Train Deep Learning Model tool includes arguments for training deep learning models. These arguments vary, depending on the model architecture. You can change the values of these arguments to train a model. The arguments are as follows:

  • attention_type—Specifies the module type. The default is PAM.
  • attn_res—The number of attentions in residual blocks. This is an optional integer value; the default is 16. This argument is only supported when the Backbone Model parameter value is SR3.
  • backend—Controls the backend framework to be used for this model. To use TensorFlow, switch the processor type to CPU. The default is pytorch.
  • bias—Bias for Single Shot Detector (SSD) head. The default is -0.4.
  • box_batch_size_per_image—The number of proposals that are sampled during training of the classification. The default is 512.
  • box_bg_iou_thresh—The maximum intersection over union (IoU) between the proposals and the ground truth (GT) box, so that they can be considered as negative during training of the classification head. The default is 0.5.
  • box_detections_per_img—The maximum number of detections per image, for all classes. The default is 100.
  • box_fg_iou_thresh—The minimum IoU between the proposals and the GT box, so that they can be considered as positive during training of the classification head. The default is 0.5.
  • box_nms_thresh—The non-maximum suppression (NMS) threshold for the prediction head; used during inferencing. The default is 0.5.
  • box_positive_fraction—The proportion of positive proposals in a mini-batch during training of the classification head. The default is 0.25.
  • box_score_thresh—The classification score threshold that must be met in order to return proposals during inferencing. The default is 0.05.
  • channel_mults—Optional depth multipliers for subsequent resolutions in U-Net. The default is 1, 2, 4, 4, 8, 8. This argument is only supported when the Backbone Model parameter value is SR3.
  • channels_of_interest—A list of spectral bands (channels) of interest. Bands in rasters of multi-temporal time series that are not in this list will be filtered out. For example, if the dataset has bands 0-4 but the training will use only bands 0, 1, and 2, the list will be [0,1,2].
  • chip_size—The size of the image that will be used to train the model. Images will be cropped to the specified chip size.
  • class_balancing—Specifies whether the cross-entropy loss will be balanced inversely to the frequency of pixels per class. The default is False.
  • d_k—The dimension of the key and query vectors. The default is 32.
  • decode_params—A dictionary that controls how the Image captioner will run. It is composed of the following parameters: embed_size, hidden_size, attention_size, teacher_forcing, dropout, and pretrained_emb. The teacher_forcing parameter is the probability of teacher forcing. Teacher forcing is a strategy for training recurrent neural networks in which the ground truth from a prior time step, instead of the model's own previous output, is used as the input at the current time step. The pretrained_emb parameter specifies whether pretrained text embedding will be used. If True, it will use fast text embedding. If False, it will not use the pretrained text embedding.
  • depth—The depth of the model. The default is 17.
  • dice_loss_average—Specifies whether micro or macro averaging will be used. A macro-average will compute the metric independently for each class and then take the average, thereby treating all classes equally. A micro-average will aggregate the contributions of all classes to compute the average metric. In a multiclass classification setup, micro-average is preferable if you suspect there might be a class imbalance where there are many more samples of one class than of other classes. The default is micro.
  • dice_loss_fraction—Adjusts the weight of the default loss (or focal loss) relative to dice loss in the total loss that guides training. The default is 0. If focal_loss is set to true, focal loss is used in place of the default loss. If dice_loss_fraction is set to 0, the training will use the default loss (or focal loss) alone as the total loss. If dice_loss_fraction is greater than 0, the total loss is calculated using the following formula:
    total_loss = (1 - dice_loss_fraction) * default_loss + dice_loss_fraction * dice_loss
  • downsample_factor—The factor to downsample the images. The default is 4.
  • drop—The dropout probability. To reduce overfitting, increase the value. The default is 0.3.
  • dropout—The dropout probability. To reduce overfitting, increase the value. This argument is only supported when the Backbone Model parameter value is SR3.
  • embed_dim—The dimension of embeddings. The default is 768.
  • feat_loss—Specifies whether to use discriminator feature matching loss. The default is True.
  • focal_loss—Specifies whether focal loss will be used. The default is False.
  • gaussian_thresh—The Gaussian threshold, which sets the required road width. The valid range is 0.0 to 1.0. The default is 0.76.
  • gen_blocks—The number of ResNet blocks to use in generator. The default is 9.
  • gen_network—Select the model to use for the generator. Use global if the machine's GPU memory is low. The default is local.
  • grids—The number of grids the image will be divided into for processing. For example, setting this argument to 4 means the image will be divided into 4 x 4 or 16 grid cells. If no value is specified, the optimal grid value will be calculated based on the input imagery.
  • ignore_classes—The list of class values on which the model will not incur loss.
  • inner_channel—The dimension of the first U-net layer. This is an optional integer value. The default is 64. This argument is only supported when the Backbone Model parameter value is SR3.
  • keep_dilation—Specifies whether to use keep_dilation. When set to True and the PointRend architecture is used, it can potentially improve the accuracy at the expense of memory consumption. The default is False.
  • lambda_feat—The weight for feature matching loss. The default is 10.
  • lambda_l1—The weight for feature matching loss. The default is 100. This is not supported for 3-band imagery.
  • linear_end—An optional numeric value to schedule the end. The default is 1e-06. This argument is only supported when the Backbone Model parameter value is SR3.
  • linear_start—An optional numeric value to schedule the start. The default is 1e-02. This argument is only supported when the Backbone Model parameter value is SR3.
  • lsgan—Specifies whether to use mean squared error in the training. If False, it will use binary cross entropy instead. The default is True.
  • location_loss_factor—Sets the weight of the bounding box loss. This factor adjusts the focus of the model on the location of the bounding box. When this is set to None, it gives equal weight to both location and classification loss.
  • min_points—The number of pixels to sample from each masked region during training; this value must be a multiple of 64.
  • mixup—Choose whether to use mixup. When set to True, it creates new training images by randomly mixing training set images. The default is False.
  • mlp_ratio—The ratio of multilayer perceptron (MLP). The default is 4.
  • mlp1—The dimensions of the successive feature spaces of MLP1. The default is 32,64.
  • mlp2—The dimensions of the successive feature spaces of MLP2. The default is 128,128.
  • mlp4—The dimensions of decoder MLP. The default is 64,32.
  • model—The backbone model used to train the model. The available backbones depend on the specified Model Type parameter value. This argument is only supported for the MMDetection and MMSegmentation model types. The default for MMDetection is cascade_rcnn. The default for MMSegmentation is mask2former.
  • model_weight—Specifies whether pretrained model weights will be used. The default is False. The value can also be a path to a configuration file containing the weights of a model from the MMDetection repository or the MMSegmentation repository.
  • monitor—Specifies the metric to monitor while checkpointing and early stopping. The available metrics depend on the Model Type parameter value. The default is valid_loss.
  • mtl_model—Specifies the architecture type that will be used to create the model. The options are linknet and hourglass, for linknet-based and hourglass-based neural architectures, respectively. The default is hourglass.
  • n_blocks_global—The number of residual blocks in the global generator network. The default is 9.
  • n_blocks_local—The number of residual blocks in the local enhancer network. The default is 3.
  • n_downsample_global—The number of downsampling layers in global generator network.
  • n_dscr—The number of discriminators to use. The default is 2.
  • n_dscr_filters—The number of discriminator filters in first convolution layer. The default is 64.
  • n_gen_filters—The number of generator filters in the first convolution layer. The default is 64.
  • n_head—The number of attention heads. The default is 4.
  • n_layers_dscr—The number of layers for the Discriminator Network used in Pix2PixHD. The default is 3.
  • n_local_enhancers—The number of local enhancers to use. The default is 1.
  • n_masks—Represents the maximum number of class labels and instances any image can contain. The default is 30.
  • n_timestep—An optional value for the number of diffusion time steps. The default is 1000. This argument is only supported when the Backbone Model parameter value is SR3.
  • norm—Specifies whether to use instance normalization or batch normalization. The default is instance.
  • norm_groups—The number of groups for group normalization. This is an optional integer value. The default is 32. This argument is only supported when the Backbone Model parameter value is SR3.
  • num_heads—The number of attention heads. The default is 12.
  • orient_bin_size—The bin size for orientation angles. The default is 20.
  • orient_theta—The width of orientation mask. The default is 8.
  • oversample—Specifies whether to use over sampling. If set to True, it oversamples unbalanced classes of the dataset during training. This is not supported with MultiLabel datasets. The default is False.
  • patch_size—The patch size for generating patch embeddings. The default is 16.
  • perceptual_loss—Specifies whether to use perceptual loss in the training. The default is False.
  • pointrend—Specifies whether to use the PointRend architecture on top of the segmentation head. For more information about the PointRend architecture, see the PointRend PDF. The default is False.
  • pooling—The pixel-embedding pooling strategy to use. The default is mean.
  • pyramid_sizes—The number and size of convolution layers to be applied to the different subregions. The default is [1,2,3,6]. This argument is specific to the Pyramid Scene Parsing Network model.
  • qkv_bias—Specifies whether to use QK Vector bias in the training. The default is False.
  • ratios—The list of aspect ratios to use for the anchor boxes. In object detection, an anchor box represents the ideal location, shape, and size of the object being predicted. For example, setting this argument to [1.0,1.0], [1.0, 0.5] means the anchor box is a square (1:1) or a rectangle in which the horizontal side is half the size of the vertical side (1:0.5). The default for RetinaNet is [0.5,1,2]. The default for Single Shot Detector is [1.0, 1.0].
  • res_blocks—The number of residual blocks. This is an optional integer value. The default is 3. This argument is only supported when the Backbone Model parameter value is SR3.
  • rpn_batch_size_per_image—The number of anchors that are sampled during training of the RPN for computing the loss. The default is 256.
  • rpn_bg_iou_thresh—The maximum IoU between the anchor and the GT box so that they can be considered as negative during training of the RPN. The default is 0.3.
  • rpn_fg_iou_thresh—The minimum IoU between the anchor and the GT box so that they can be considered as positive during training of the RPN. The default is 0.7.
  • rpn_nms_thresh—The NMS threshold used for postprocessing the RPN proposals. The default is 0.7.
  • rpn_positive_fraction—The proportion of positive anchors in a mini-batch during training of the RPN. The default is 0.5.
  • rpn_post_nms_top_n_test—The number of proposals to keep after applying NMS during testing. The default is 1000.
  • rpn_post_nms_top_n_train—The number of proposals to keep after applying NMS during training. The default is 2000.
  • rpn_pre_nms_top_n_test—The number of proposals to keep before applying NMS during testing. The default is 1000.
  • rpn_pre_nms_top_n_train—The number of proposals to keep before applying NMS during training. The default is 2000.
  • scales—The number of scale levels each cell will be scaled up or down. The default is [1, 0.8, 0.63].
  • schedule—Optional argument to set the type of schedule to use. The options are linear, warmup10, warmup50, const, jsd, and cosine. The default is linear. This argument is only supported when the Backbone Model parameter value is SR3.
  • T—The period to use for the positional encoding. The default is 1000.
  • timesteps_of_interest—The list of time steps of interest; this will filter multi-temporal time series based on the list of time steps specified. For example, if the dataset has time steps 0, 1, 2, and 3, but only time steps 0, 1, and 2 are used in the training, this parameter would be set to [0,1,2]; the rest of the time steps will be filtered out.
  • use_net—Specifies whether the U-Net decoder will be used to recover data once the pyramid pooling is complete. The default is True. This argument is specific to the Pyramid Scene Parsing Network model.
  • vgg_loss—Specifies whether to use VGG feature matching loss. This is only supported for 3-band imagery. The default is True.
  • zooms—The number of zoom levels each grid cell will be scaled up or down. Setting this argument to 1 means all the grid cells will remain at the same size or zoom level. A zoom level of 2 means all the grid cells will become twice as large (zoomed in 100 percent). Providing a list of zoom levels means all the grid cells will be scaled using all the numbers in the list. The default is 1.
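
The dice_loss_fraction weighting described in the list above can be illustrated with a short sketch. This is a minimal example in plain Python, not the tool's implementation; the function name combined_loss and the sample loss values are hypothetical and only show how the two loss terms are blended.

```python
def combined_loss(default_loss, dice_loss, dice_loss_fraction=0.0):
    """Blend the default (or focal) loss with dice loss.

    Hypothetical helper for illustration only. With dice_loss_fraction=0,
    the result is the default (or focal) loss alone.
    """
    if not 0.0 <= dice_loss_fraction <= 1.0:
        raise ValueError("dice_loss_fraction must be between 0 and 1")
    return (1.0 - dice_loss_fraction) * default_loss + dice_loss_fraction * dice_loss


# Sample loss values chosen only for illustration.
print(combined_loss(0.8, 0.4))        # fraction 0: default loss only
print(combined_loss(0.8, 0.4, 0.25))  # 75 percent default loss, 25 percent dice loss
```

Raising dice_loss_fraction toward 1 shifts the total loss toward dice loss; a value of 1 would use dice loss alone.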

Model type | Argument | Valid values

Change detector

(pixel classification)


PAM (Pyramid Attention Module) or BAM (Basic Attention Module). The default is PAM.


Integers between 0 and image size.


valid_loss, precision, recall, and f1.


(pixel classification)


Integers between 0 and image size.


0.0 to 1.0. The default is 0.76.


valid_loss, accuracy, miou, and dice.


linknet or hourglass.


A positive number. The default is 20.


A positive number. The default is 8.


(image translation)


A positive integer. The default is 9.


true or false. The default is true.


(pixel classification)


Integers between 0 and image size.


true or false.


micro or macro. The default is micro.


Floating point value between 0 and 1. The default is 0.


true or false.


Valid class values.


true or false. The default is false.


true or false.


valid_loss and accuracy.


true or false. The default is false.


(Object detection)


Positive integers. The default is 512.


Floating point value between 0 and 1. The default is 0.5.


Positive integers. The default is 100.


Floating point value between 0 and 1. The default is 0.5.


Floating point value between 0 and 1. The default is 0.5.


Floating point value between 0 and 1. The default is 0.25.


Floating point value between 0 and 1. The default is 0.05.


Positive integers. The default is 256.


Floating point value between 0 and 1. The default is 0.3.


Floating point value between 0 and 1. The default is 0.7.


Floating point value between 0 and 1. The default is 0.7.


Floating point value between 0 and 1. The default is 0.5.


Positive integers. The default is 1000.


Positive integers. The default is 2000.


Positive integers. The default is 1000.


Positive integers. The default is 2000.

Feature Classifier

(Object classification)


pytorch or tensorflow. The default is pytorch.


true or false. The default is false.


true or false. The default is false.

Image captioner

(image translation)


Integers between 0 and image size.

The decode_params argument is composed of the following parameters:

  • embed_size
  • hidden_size
  • attention_size
  • teacher_forcing
  • dropout
  • pretrained_emb

The default is {'embed_size':100, 'hidden_size':100, 'attention_size':100, 'teacher_forcing':1, 'dropout':0.1, 'pretrained_emb':False}.


valid_loss, accuracy, corpus_bleu, and multi_label_fbeta.


(Object detection)


Positive integers. The default is 512.


Floating point value between 0 and 1. The default is 0.5.


Positive integers. The default is 100.


Floating point value between 0 and 1. The default is 0.5.


Floating point value between 0 and 1. The default is 0.5.


Floating point value between 0 and 1. The default is 0.25.


Floating point value between 0 and 1. The default is 0.05.


Positive integers. The default is 256.


Floating point value between 0 and 1. The default is 0.3.


Floating point value between 0 and 1. The default is 0.7.


Floating point value between 0 and 1. The default is 0.7.


Floating point value between 0 and 1. The default is 0.5.


Positive integers. The default is 1000.


Positive integers. The default is 2000.


Positive integers. The default is 1000.


Positive integers. The default is 2000.


(panoptic segmentation)


Positive integers. The default is 30.


(object detection)


Integers between 0 and image size.


atss, carafe, cascade_rcnn, cascade_rpn, dcn, deeplabv3, detectors, dino, double_heads, dynamic_rcnn, empirical_attention, fcos, foveabox, fsaf, ghm, hrnet, libra_rcnn, nas_fcos, pafpn, pisa, regnet, reppoints, res2net, sabl, and vfnet.

The default is deeplabv3.


true or false.


(pixel classification)


Integers between 0 and image size.


ann, apcnet, ccnet, cgnet, deeplabv3, deeplabv3plus, dmnet, dnlnet, emanet, fastscnn, fcn, gcnet, hrnet, mask2former, mobilenet_v2, nonlocal_net, ocrnet, prithvi100m, psanet, pspnet, resnest, sem_fpn, unet, and upernet.

The default is mask2former.


true or false.

Multi Task Road Extractor

(pixel classification)


Integers between 0 and image size.


0.0 to 1.0. The default is 0.76.


valid_loss, accuracy, miou, and dice.


linknet or hourglass.


A positive number. The default is 20.


A positive number. The default is 8.


(image translation)


true or false. The default is false.


(image translation)


local or global. The default is local.


true or false. The default is true.


Positive integer values. The default is 10.


Positive integer values. The default is 100.


true or false. The default is true.


Positive integer values. The default is 9.


Positive integer values. The default is 3.


Positive integer values. The default is 4.


Positive integer values. The default is 2.


Positive integer values. The default is 64.


Positive integer values. The default is 64.


Positive integer values. The default is 3.


Positive integer values. The default is 1.


instance or batch. The default is instance.


true or false. The default is true.


(pixel classification)


List of band numbers (positive integers).


Positive integer values. The default is 32.


Floating point value between 0 and 1. The default is 0.2.


Integer multiples of 64.


List of positive integers. The default is 32, 64.


List of positive integers. The default is 128, 128.


List of positive integers. The default is 64, 32.


Positive integer values. The default is 4.


mean, std, max, or min.


Positive integer values. The default is 1000.


List of positive integers.

Pyramid Scene Parsing Network

(pixel classification)


Integers between 0 and image size.


true or false.


micro or macro. The default is micro.


Floating point value between 0 and 1. The default is 0.


true or false.


Valid class values.


true or false. The default is false.


valid_loss or accuracy.


true or false.


true or false. The default is false.


[convolution layer 1, convolution layer 2, ... , convolution layer n]


true or false.


(object detection)


Integers between 0 and image size.


valid_loss or average_precision.


Ratio value 1, ratio value 2, ratio value 3.

The default is 0.5,1,2.


[scale value 1, scale value 2, scale value 3]

The default is [1, 0.8, 0.63].


(pixel classification)


true or false.


Valid class values.

Single Shot Detector

(object detection)


pytorch or tensorflow. The default is pytorch.


Floating point value. The default is -0.4.


Integers between 0 and image size. The default is 0.3.


Floating point value between 0 and 1.


true or false. The default is false.


Integer values greater than 0.


Floating point value between 0 and 1.


valid_loss or average_precision.


[horizontal value, vertical value]


The zoom value in which 1.0 is normal zoom.

Super Resolution with SR3 backbone

(image translation)


Integers greater than 0. The default is 16.


Integer multiplier sets. The default is [1, 2, 4, 4, 8, 8].


Positive integer value. The default is 4.


Floating point value. The default is 0.


Integer value greater than 0. The default is 64.


Time integer. The default is 1e-02.


Time integer. The default is 1e-06.


Integer value greater than 0. The default is 1000.


Integer value greater than 0. The default is 32.


Integer value greater than 0. The default is 3.


linear, warmup10, warmup50, const, jsd, or cosine.

The default is linear.

Super Resolution with SR3_UViT backbone

(image translation)


Positive integer value. The default is 17.


Positive integer value. The default is 768.


Positive floating point value. The default is 4.0.


Positive integer value. The default is 12.


Positive integer value. The default is 16.


true or false. The default is false.


(pixel classification)


Integers between 0 and image size.


true or false.


micro or macro. The default is micro.


Floating point value between 0 and 1. The default is 0.


true or false.


Valid class values.


valid_loss or accuracy.


true or false.
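
The valid values in the table above can be checked before a training run. The following is a minimal sketch in plain Python; check_training_arguments is a hypothetical helper, not part of the Train Deep Learning Model tool, and it covers only three of the constraints listed in this topic (min_points as a multiple of 64, and dice_loss_fraction and gaussian_thresh between 0 and 1).

```python
def check_training_arguments(args):
    """Return a list of problems found in a dict of training arguments.

    Hypothetical helper; it checks only a few constraints from this topic.
    """
    problems = []

    # min_points must be a multiple of 64.
    min_points = args.get("min_points")
    if min_points is not None and min_points % 64 != 0:
        problems.append("min_points must be a multiple of 64")

    # These arguments accept floating point values from 0 to 1.
    for name in ("dice_loss_fraction", "gaussian_thresh"):
        value = args.get(name)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be between 0 and 1")

    return problems


# Sample argument sets chosen only for illustration.
print(check_training_arguments({"min_points": 128, "gaussian_thresh": 0.76}))
print(check_training_arguments({"min_points": 100, "dice_loss_fraction": 1.5}))
```

An empty list means none of the checked constraints were violated; arguments that the sketch does not know about are ignored.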

Inferencing arguments

The following arguments are available to control how deep learning models are used for inferencing. The information from the Model Definition parameter will be used to populate the Arguments parameter in the inferencing tools. These arguments vary, depending on the model architecture. ArcGIS pretrained models and custom deep learning models may have additional arguments that the tool supports.

Argument | Inference type | Valid values


The number of image tiles processed in each step of the model inference. This depends on the memory of your graphics card. The argument is available for all model architectures.

Classify Objects

Classify Pixels

Detect Change

Detect Objects

Integer values greater than 0; the value is usually a power of 2.


The image is translated from one domain to another. For more information about this argument, see How CycleGAN works.

The argument is only available for the CycleGAN architecture.

Classify Pixels

Available options are AtoB and BtoA.


If true, potentially truncated detections near the edges that are in the padded region of image chips will be filtered.

The argument is available for SSD, RetinaNet, YOLOv3, DETReg, MMDetection, and Faster RCNN only.

Detect Objects

true or false.


The policy for merging augmented predictions. This is only applicable when test time augmentation is used.

For the Classify Pixels Using Deep Learning tool, the argument is available for the MultiTaskRoadExtractor and ConnectNet architectures. If IsEdgeDetection is present in the model's .emd file, BDCNEdgeDetector, HEDEdgeDetector, and MMSegmentation are also available architectures.

For the Detect Objects Using Deep Learning tool, the argument is only available for MaskRCNN.

Classify Pixels

Detect Objects

Available options are mean, max, and min.


The maximum overlap ratio for two overlapping features, which is defined as the ratio of intersection area over union area. The argument is available for all model architectures.

Detect Objects

A floating point value from 0.0 to 1.0. The default is 0.1.


The path to the output raster. The argument is only available for MaXDeepLab.

Detect Objects

The file path and name for the output classified raster.


The number of pixels at the border of image tiles from which predictions are blended for adjacent tiles. To smooth the output while reducing artifacts, increase the value. The maximum value of the padding can be half the tile size value. The argument is available for all model architectures.

Classify Pixels

Detect Change

Detect Objects

Integer values greater than 0 and less than half the tile size value.


If true, the background class is also classified. The argument is available for UNET, PSPNET, DeepLab, and MMSegmentation.

Classify Pixels

true or false.


If true, it will output a probability raster. A probability raster is a raster whose pixels specify the probability that the variable of interest is above or below a specified threshold value.

If ArcGISLearnVersion is 1.8.4 or later in the model's .emd file, the MultiTaskRoadExtractor and ConnectNet architectures are available. If ArcGISLearnVersion is 1.8.4 or later and IsEdgeDetection is present in the model's .emd file, the BDCNEdgeDetector, HEDEdgeDetector, and MMSegmentation architectures are also available.

Classify Pixels

true or false.


The predictions above this confidence score are included in the result. The argument is available for all model architectures.

Classify Objects

0 to 1.0.


Performs test time augmentation while predicting. If true, predictions of flipped and rotated variants of the input image will be merged into the final output. The argument is available for most model architectures.

Classify Objects

Classify Pixels

true or false.


The predictions that have a confidence score higher than this threshold are included in the result.

For the Classify Pixels Using Deep Learning tool, if ArcGISLearnVersion is 1.8.4 or later in the model's .emd file, the MultiTaskRoadExtractor and ConnectNet architectures are available. If ArcGISLearnVersion is 1.8.4 or later and IsEdgeDetection is present in the model's .emd file, the BDCNEdgeDetector, HEDEdgeDetector, and MMSegmentation architectures are also available.

For the Detect Objects Using Deep Learning tool, the argument is available for all model architectures.

Classify Pixels

Detect Objects

0 to 1.0.


Thins or skeletonizes the predicted edges.

If IsEdgeDetection is present in the model's .emd file, BDCNEdgeDetector, HEDEdgeDetector, and MMSegmentation are available architectures.

Classify Pixels

true or false.


tile_size—The width and height of image tiles into which the imagery is split for prediction.

For the Classify Pixels Using Deep Learning tool, the argument is only available for the CycleGAN architecture.

For the Detect Objects Using Deep Learning tool, the argument is only available for MaskRCNN.

Classify Pixels

Detect Objects

Integer values greater than 0 and less than the size of the image.
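
The overlap ratio described in the table above (intersection area over union area) can be sketched for the simple case of two axis-aligned rectangles. This is a minimal illustration in plain Python under that assumption, not the tool's implementation; overlap_ratio and the sample boxes are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area over union area for two axis-aligned boxes.

    Each box is (xmin, ymin, xmax, ymax). Hypothetical helper for illustration.
    """
    # Width and height of the overlapping region (0 if the boxes do not overlap).
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = ix * iy

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10 x 10 boxes that share half their area: 50 / 150.
print(overlap_ratio((0, 0, 10, 10), (5, 0, 15, 10)))
```

With a maximum overlap ratio of 0.1 (the default listed above), this pair of boxes would exceed the limit, because their ratio is about 0.33.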
