
Introduction to the model

Banner image for the model showing data with missing values in red

This document explains how to use the DistilGPT2 Based Imputation pretrained model available on ArcGIS Living Atlas of the World. The model is used to fill in missing numeric values in feature and tabular datasets.

Missing values and incomplete feature information can compromise accuracy, produce misleading patterns, and weaken spatial decision-making. Effective data imputation techniques are therefore essential to maintain data continuity and improve analytical reliability. The DistilGPT2 Based Imputation model is built on the DistilGPT2 transformer architecture, a compact, efficient generative language model produced through knowledge distillation. Unlike conventional methods that treat missing data column by column, this model learns patterns across entire records at once. By adapting Natural Language Processing (NLP) techniques to tabular data, it allows you to impute missing numeric attributes in complex, high-dimensional, multivariate datasets.

The model applies a deep-learning, generative, row-wise imputation strategy. Each row is serialized into a sentence-like sequence of feature-value pairs, and the model is fine-tuned directly on the input dataset using autoregressive next-token prediction. During inference, missing values are masked, allowing the model to generate the missing data conditioned on all other observed fields within that same row. Once generated, the outputs are converted back into their native numeric formats to update the dataset seamlessly.
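The row-wise strategy described above can be illustrated with a minimal Python sketch. The source does not state the exact serialization format, so the "field is value" template, the helper names (`serialize_row`, `build_prompt`, `parse_numeric`), and the sample field names are all assumptions made for illustration. The sketch covers only the text conversion steps, not the fine-tuning or generation performed by the model.

```python
import math
import re


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def serialize_row(row):
    """Turn the observed fields of a row into a sentence-like sequence.

    The "name is value" template is an assumed format, not the model's own.
    """
    return ", ".join(
        f"{name} is {value}" for name, value in row.items() if not is_missing(value)
    )


def build_prompt(row, target):
    """Condition on the other observed fields and leave the target field open."""
    observed = {name: value for name, value in row.items() if name != target}
    context = serialize_row(observed)
    return f"{context}, {target} is" if context else f"{target} is"


def parse_numeric(generated):
    """Convert generated text back to a float, or None if no number is found."""
    match = re.search(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", generated)
    return float(match.group()) if match else None


# Hypothetical record with one missing numeric field.
row = {"elevation": 412.5, "slope": 3.2, "rainfall": None}
prompt = build_prompt(row, "rainfall")
print(prompt)

# A language model would continue the prompt; this string stands in for its output.
print(parse_numeric(" 87.4, landcover is"))
```

In this sketch, the prompt ends at the open field so that an autoregressive model would generate the value next, and `parse_numeric` plays the role of converting the generated text back into a numeric type.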

Model details

This model has the following characteristics:

  • Input—Use a feature class or stand-alone table containing missing values in one or more fields.
  • Output—The model returns a feature class or stand-alone table with missing values filled in the selected fields using model predictions.
  • Compute—This workflow is compute-intensive. A GPU with a minimum CUDA compute capability of 6.0 is recommended, and the model requires at least 8 GB of GPU memory.
  • Applicable geographies—The model is expected to work globally.
  • Architecture—DistilGPT2 is a compact, decoder-only Transformer architecture derived from GPT-2 through knowledge distillation.
  • Limitations—The workflow may run out of memory when the dataset contains more than 30 to 40 columns or when columns include long text values. Datasets containing embedding vectors are not currently supported. For large datasets or datasets with embeddings, XGBoost Based Imputation is recommended.
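Before running the workflow, it can help to screen a dataset against the limitations listed above. The following Python sketch is one possible pre-check on rows held as dictionaries. The function name `find_limit_warnings` is hypothetical, the 30-column default is taken from the lower end of the range in the text, and the 200-character text threshold is an assumption because the source gives no specific length.

```python
def find_limit_warnings(records, max_columns=30, max_text_length=200):
    """Return warnings for rows (dictionaries) that may hit the stated limits.

    max_text_length is an assumed threshold; the source does not define
    what counts as a long text value.
    """
    columns, long_text, vectors = set(), set(), set()
    for record in records:
        for name, value in record.items():
            columns.add(name)
            if isinstance(value, str) and len(value) > max_text_length:
                long_text.add(name)
            elif isinstance(value, (list, tuple)):
                # List-like cells are treated as embedding vectors.
                vectors.add(name)

    warnings = []
    if len(columns) > max_columns:
        warnings.append(
            f"{len(columns)} columns exceed the {max_columns}-column guideline"
        )
    if long_text:
        warnings.append(f"long text values in: {', '.join(sorted(long_text))}")
    if vectors:
        warnings.append(f"embedding-like vectors in: {', '.join(sorted(vectors))}")
    return warnings


# Hypothetical sample row with a long text field and an embedding field.
sample = [{"elevation": 412.5, "notes": "x" * 500, "embedding": [0.1, 0.2, 0.3]}]
for message in find_limit_warnings(sample):
    print(message)
```

An empty result does not guarantee the workflow will fit in memory. It only indicates that none of the documented warning signs were found.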

Access and download the model

Download the DistilGPT2 Based Imputation pretrained model from ArcGIS Living Atlas of the World. Alternatively, access the model directly from ArcGIS Pro in the Predict Missing Values Using AI Model tool.

  1. Browse to ArcGIS Living Atlas of the World.
  2. Sign in with your ArcGIS Online credentials.
  3. Search for DistilGPT2 Based Imputation and open the item page from the search results.
  4. Click the Download button to download the model.

    You can use the downloaded .dlpk file directly in ArcGIS Pro.

Release notes

The following are the release notes:

Date        Description
May 2026    First release of the DistilGPT2 Based Imputation model