You can use the Vision Language Context-Based Classification pretrained model with the Classify Objects Using Deep Learning tool, available in the Image Analyst toolbox in ArcGIS Pro.
Classify imagery
Complete the following steps to classify imagery using the Vision Language Context-Based Classification model:
- Download the Vision Language Context-Based Classification model.
Note: This model requires an internet connection to work. The data used for classification, including the imagery and possible class labels, will be shared with OpenAI.
- Click Add Data to add an image to the Contents pane.
You'll run the prediction on this image.
- On the Analysis tab, click Tools to open the Geoprocessing pane.
- In the Geoprocessing pane, click Toolboxes and expand Image Analyst Tools. Select the Classify Objects Using Deep Learning tool under Deep Learning.
- On the Parameters tab, set the variables as follows:
- Input Raster—Choose an input image from the drop-down menu or from a folder location.
- Input Features (optional)—Select a feature layer to limit the processing to specific regions of the raster identified by the feature class.
- Output Classified Objects Feature Class—Set the output feature layer that will contain the classification labels.
- Model Definition—Select the pretrained model .dlpk file.
- Arguments (optional)—Change the values of the arguments if required.
- batch_size—The size of the batch processed during model inferencing. This depends on the memory of your graphics card.
- classes—The classes into which the images are to be classified. Provide multiple classes separated by commas. For example, if you are classifying properties damaged during a hurricane, the input would be as follows: Minor Damage, Major Damage, No Damage.
- additional_context—Briefly describe the image to provide context to the vision language model and help it classify your images more accurately. The tool clips the input raster into multiple smaller images. If you provide input features, the raster is clipped around the input features; otherwise, it is clipped into chips of 1024 x 1024 pixels. When writing the description, describe the individual clipped image, not the raster itself. For example, to classify the damage incurred by all buildings in a neighborhood, you might pass the raster as the input and the bounding boxes of the building footprints as the Input Features parameter value. In that scenario, the additional_context value might be as follows: 'This is an aerial view of a house from a hurricane impacted area.' Notice that it describes an individual image, not the entire raster.
- strict_classification—Large language models, including vision language models, are prone to hallucinations, so the model may return a class that is not among the classes provided to the tool. If this parameter is set to True, all hallucinated classes are labeled Unknown. If you want to keep the hallucinated classes in the output feature class, set this parameter to False.
- ai_connection_file—The path to the .json file that holds the connection details of the OpenAI instance to connect to. Currently, OpenAI vision language models deployed on OpenAI or Microsoft Azure are supported. The path must be accessible from the machine where the geoprocessing tool is run. The following is an example of a connection file for an Azure OpenAI deployment:
{
" service_provider":"AzureOpenAI",
"api_key": "YOUR_API_KEY",
"azure_endpoint": "YOUR_AZURE_ENDPOINT",
"api_version": "YOUR_API_VERSION",
"deployment_name": "YOUR_DEPLOYMENT_NAME"
}
To connect to OpenAI directly, change the service_provider value to OpenAI and provide your OpenAI API key. The last three parameters are needed only when connecting to OpenAI hosted on Azure and can be left blank when connecting to OpenAI directly.
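For reference, a connection file for a direct OpenAI connection can be written with a short Python script such as the following sketch; the output path and API key are placeholders, and the Azure-specific values are left blank, per the note above:
import json

# Connection details for a direct OpenAI connection.
# The last three keys are Azure-specific and are left blank here.
connection = {
    "service_provider": "OpenAI",
    "api_key": "YOUR_API_KEY",
    "azure_endpoint": "",
    "api_version": "",
    "deployment_name": ""
}

# Write the file to a path accessible from the machine where the
# geoprocessing tool will run.
with open(r"C:\data\openai_connection.json", "w") as f:
    json.dump(connection, f, indent=2)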
- On the Environments tab, set the variables as follows:
- Processing Extent—Select the default extent or any other option from the drop-down menu.
- Cell Size—Set the cell size, in meters, so that the visibility of the objects of interest is maximized throughout the chosen extent. Consider a larger cell size for detecting larger objects and a smaller cell size for detecting smaller objects. For example, set the cell size for cloud detection to 10 meters, while for car detection, set it to 0.3 meters (30 centimeters). For more information about cell size, refer to the provided resource.
- Processor Type—Select CPU or GPU as needed.
It is recommended that you select GPU, if available, and set GPU ID to the GPU to be used.
- Click Run. When processing finishes, the output layer is added to the map, and the predicted classes are added to the attribute table of the output layer.
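The same workflow can also be scripted. The following is a minimal ArcPy sketch based on the arcpy.ia.ClassifyObjectsUsingDeepLearning function; the paths, field name, and class values are placeholders, and the semicolon-separated model arguments format is assumed from comparable deep learning geoprocessing tools:
import arcpy
from arcpy.ia import ClassifyObjectsUsingDeepLearning

# Check out the Image Analyst extension license.
arcpy.CheckOutExtension("ImageAnalyst")

# Environment settings corresponding to the Environments tab.
arcpy.env.cellSize = 0.3          # meters; 0.3 suits car-sized objects
arcpy.env.processorType = "GPU"   # or "CPU" if no GPU is available
arcpy.env.gpuId = "0"             # ID of the GPU to use

# Model arguments as semicolon-separated name-value pairs (assumed format).
model_arguments = ("batch_size 4;"
                   "classes Minor Damage,Major Damage,No Damage;"
                   "additional_context This is an aerial view of a house "
                   "from a hurricane impacted area.;"
                   "strict_classification True;"
                   "ai_connection_file C:/data/openai_connection.json")

# Run the tool. Parameters are passed positionally, following the tool's
# documented syntax: input raster, output feature class, model definition,
# input features, class label field, processing mode, and arguments.
ClassifyObjectsUsingDeepLearning(
    "C:/data/neighborhood.tif",
    "C:/data/results.gdb/damage_assessment",
    "C:/data/VisionLanguageClassification.dlpk",
    "C:/data/building_footprints.shp",
    "ClassLabel",
    "PROCESS_AS_MOSAICKED_IMAGE",
    model_arguments)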