You can use the Vision Language Context-Based Classification pretrained model in the Classify Objects Using Deep Learning tool from the Image Analyst toolbox in ArcGIS Pro.
Classify imagery
To classify imagery using the Vision Language Context-Based Classification model, complete the following steps:
- Download the Vision Language Context-Based Classification model.
Note:
This model requires an internet connection if you're using OpenAI's vision language models. The data used for classification, including the imagery and possible class labels, will be shared with OpenAI. However, if you are using the Llama Vision model, it operates locally and does not require an internet connection, ensuring that your data remains on your machine without being shared externally.
- Click Add data to add an image to the Contents pane.
You'll run the prediction on this image.
- Click the Analysis tab and click Tools.
- In the Geoprocessing pane, click Toolboxes, expand Image Analyst
Tools, and select the Classify Objects Using Deep Learning tool under Deep Learning.
- On the Parameters tab, set the parameters as follows:
- Input Raster—Choose an input image from the drop-down menu or from a folder location.
- Input Features (optional)—Select the feature layer if you want to limit the processing to specific regions in the raster identified by the feature class.
- Output Classified Objects Feature Class—Set the output feature layer that will contain the classification labels.
- Model Definition—Select the pretrained model .dlpk file.
- Arguments (optional)—Change the values of the arguments if required (example values for the arguments are shown after the connection file details below).
- classes—Provide the classes into which the images will be classified. Multiple classes can be provided using a comma (,) as the separator. For example, if you are classifying properties damaged during a hurricane, the input would be Minor Damage, Major Damage, No Damage.
- additional_context—Briefly describe the image to give the vision language model context and help it classify the images more accurately. The tool clips the input raster into multiple smaller images. If you provide input features, the raster is clipped around those features; otherwise, it is clipped into chips of 1024 x 1024 pixels. Describe the individual image, not the entire raster. For example, to classify damage to all buildings in a neighborhood, you can provide the input raster along with the building footprints for the neighborhood as the input features. In this scenario, the additional context might be "This is an aerial view of a house from a hurricane impacted area." Notice that the individual image is described, not the entire raster.
- strict_classification—Large language models, including vision language models, are prone to hallucinations, so the model may return a class that is not among the classes provided to the tool. If this parameter is set to True, any hallucinated classes are labeled as Unknown. If you want to keep hallucinated classes in the output feature class, leave this parameter set to False.
- ai_connection_file—This is the path to the .json file that contains the connection details of the model to be used. Currently, OpenAI vision language models deployed on OpenAI or Azure and Llama Vision models installed locally on the machine are supported, giving you the flexibility to choose between cloud-based solutions with OpenAI and locally deployed models with Llama Vision. The file path must be valid on the machine where the geoprocessing tool is run. The following is an example of how the connection file JSON would look for an OpenAI Azure deployment instance:
OpenAI Azure connection JSON file
{ "service_provider" : "AzureOpenAI", "api_key" : "YOUR_API_KEY", "azure_endpoint" : "YOUR_AZURE_ENDPOINT", "api_version" : "YOUR_API_VERSION", "deployment_name" : "YOUR_DEPLOYMENT_NAME" }
To connect to OpenAI directly, change the service provider to OpenAI and provide your OpenAI key. The last three parameters can be left blank if you're connecting to OpenAI directly; they are only needed if you are connecting to OpenAI hosted on Azure.
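For reference, a direct OpenAI connection file might look like the following, with the Azure-specific parameters omitted. This is an illustrative sketch; it assumes the service provider value is OpenAI as described above, and the key is a placeholder.
OpenAI connection JSON file
{
    "service_provider" : "OpenAI",
    "api_key" : "YOUR_API_KEY"
}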
To use Llama Vision without cloud-based services, download the Llama Vision model weights by running the following commands in the Python Command Prompt:
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct
Once the weights are downloaded, you can modify the connection JSON to use Llama Vision as follows:
Llama Vision connection JSON file
{ "service_provider" : "local-llama", }
- On the Environments tab, set the environments as follows:
- Processing Extent—Select the default extent or any other option from the drop-down menu.
- Cell Size—Set the cell size, in meters, to a value that maximizes the visibility of the objects of interest throughout the chosen extent. Use a larger cell size for detecting larger objects and a smaller cell size for detecting smaller objects. For example, set the cell size for cloud detection to 10 meters, and for car detection, set it to 0.30 meters (30 centimeters). For more information about cell size, see Cell size of raster data.
- Processor Type—Select CPU or GPU as needed.
It is recommended that you select GPU, if available, and set GPU ID to the GPU that will be used.
- Click Run.
As soon as processing finishes, the output layer is added to the map, and the predicted classes are added to the attribute table of the output layer.
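If you prefer to script the workflow instead of using the tool dialog box, the following Python sketch shows roughly how an equivalent run could be driven with arcpy. It is a minimal sketch, not a definitive implementation: the parameter order, the format of the arguments string, and all paths and values are assumptions to verify against the Classify Objects Using Deep Learning Python reference for your ArcGIS Pro version.
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Hypothetical paths; replace with your own imagery, features, model, and connection file.
in_raster = r"C:\data\hurricane_imagery.tif"
in_features = r"C:\data\building_footprints.shp"
out_features = r"C:\data\classified_buildings.shp"
model = r"C:\models\VisionLanguageClassification.dlpk"

# Model arguments mirror the tool parameters described above; verify the exact
# name-value string format in the tool's Python reference.
arguments = ("classes 'Minor Damage,Major Damage,No Damage';"
             "additional_context 'This is an aerial view of a house from a hurricane impacted area.';"
             "strict_classification True;"
             "ai_connection_file C:\\connections\\openai_connection.json")

# Environment settings corresponding to the Environments tab.
arcpy.env.cellSize = 0.3
arcpy.env.processorType = "GPU"
arcpy.env.gpuId = "0"

arcpy.ia.ClassifyObjectsUsingDeepLearning(
    in_raster, out_features, model, in_features,
    "ClassLabel", "PROCESS_AS_MOSAICKED_IMAGE", arguments)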