Imagery storage in the cloud—Imagery Workflows

Cloud storage options

There are two broad categories of virtual storage—file storage and object storage.

Object storage, the most common type of cloud storage, uses metadata to organize and access pieces of data in a storage pool.

File storage, such as what you use on your local machine (for example, C:/), stores data hierarchically and allows machines on the same network to share files using the Server Message Block (SMB) network protocol. It’s possible to use file storage on a disk attached to the virtual machine (the disk can be shared), but this solution doesn’t scale well.

Object storage is distinct from file storage in a number of ways:

It’s relatively inexpensive.
Many machines can simultaneously and efficiently access the same storage.
It’s REST based (allowing HTTP requests).
It’s nearly unlimited in size (object storage can be many petabytes).
It has high latency (an object storage request might take 40 milliseconds to return a single section of a file, which doesn’t scale in applications that make thousands of small requests).

Suitable cloud optimized formats and caching are used to mitigate the high latency of object storage. Ideally, you’ll use cloud storage to store massive volumes of data, and then cache data as it’s accessed on the ephemeral disks. Since people tend to revisit the same data, this will improve performance.
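As a rough illustration of why latency matters and why caching helps, the following sketch multiplies the 40-millisecond figure above by a request count and then simulates repeat visits with an in-memory cache. The read_tile function is a hypothetical stand-in for a ranged read from object storage, not an ArcGIS or cloud provider API.

```python
from functools import lru_cache

LATENCY_S = 0.040  # the ~40 ms per-request figure quoted above


def total_latency_seconds(num_requests: int, latency_s: float = LATENCY_S) -> float:
    """Latency cost alone of issuing the requests one after another."""
    return num_requests * latency_s


fetch_count = 0  # counts simulated trips to object storage


@lru_cache(maxsize=1024)
def read_tile(object_key: str, tile_index: int) -> bytes:
    """Hypothetical stand-in for a ranged read from object storage."""
    global fetch_count
    fetch_count += 1
    return f"{object_key}:{tile_index}".encode()


# Thousands of small sequential requests spend minutes waiting on latency.
print(total_latency_seconds(5000), "seconds of latency for 5,000 requests")

# Revisiting the same 10 tiles 100 times only reaches storage on the first pass.
for _ in range(100):
    for tile in range(10):
        read_tile("scene.tif", tile)
print(fetch_count, "real fetches for 1,000 tile reads")
```

The cache here lives in memory for simplicity; the same idea applies when tiles are cached on an ephemeral disk.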

Access to imagery stored in the cloud

There are four ways to access imagery stored in cloud storage:

VSI file handlers (/vsicurl/, /vsis3/, /vsiaz/)—You can access imagery natively within ArcGIS by using VSI file handlers to read the data directly. Imagery accessed this way has limited caching and limited security options. Only a single cloud storage access policy can be defined, so all data needs to come from a single cloud account.
Cloud storage connection (ACS) files—Just like you can create a connection to a database, you can create a connection to cloud storage using cloud storage connection files (ACS files) in ArcGIS Pro. When you create the connection, you enter security credentials to be stored in the ACS file. You can then access that cloud storage in a way that looks very similar to a local file system, browsing and selecting files to add to ArcGIS Pro. You can use multiple cloud security profiles to access imagery from one machine. With ACS files, data can be cached in the server’s temp directory for a short period of time, reducing repeated data requests for frequently accessed imagery.
Note:
The temp directory on the server, called localTempFolder, is defined in the server admin system properties. For example, for https://zero:6443/arcgis/admin/system/properties, it would be defined as {"localTempFolder":"E:/Temp/data"}. For desktop, the temp location is defined by an environment variable called TempFolder.
Raster proxies—These are small XML files that embed information about the raster file in cloud storage; ArcGIS treats raster proxies as rasters, then accesses the actual data from the cloud only as needed. They can reference most GDAL-readable formats, can have any file extension, and can be referenced or embedded in a mosaic dataset or used directly in ArcGIS.
- When accessed, the pixels that have been read and the index to the tiles are cached locally so subsequent requests don’t have to go back to the cloud. You’ll need to consider the cache location and manage the cache periodically.
Cloud raster store—This is created with ArcGIS Server Manager and defines a cloud location to store rasters. These are typically used with ArcGIS Image Server for the output of raster analytics or for image hosting. If you use the ArcGIS Image Server hosting capabilities, the rasters will be stored in the cloud raster store in CRF format.
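Several of the options above cache data locally, and the raster proxy cache in particular needs periodic management. A minimal sketch of age-based pruning follows, assuming the cache is an ordinary directory of files and that modification time is an acceptable proxy for staleness. The function name and the seven-day threshold are illustrative, not part of ArcGIS.

```python
import os
import tempfile
import time
from pathlib import Path


def prune_cache(cache_dir: str, max_age_days: float) -> int:
    """Delete files under cache_dir older than max_age_days; return the count removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(Path(cache_dir).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


# Demonstration in a throwaway directory rather than a real cache location.
with tempfile.TemporaryDirectory() as demo_dir:
    fresh = Path(demo_dir) / "fresh.bin"
    stale = Path(demo_dir) / "stale.bin"
    fresh.write_bytes(b"recent tile")
    stale.write_bytes(b"old tile")
    ten_days_ago = time.time() - 10 * 86400
    os.utime(stale, (ten_days_ago, ten_days_ago))  # backdate one file

    removed_count = prune_cache(demo_dir, max_age_days=7)
    remaining = sorted(p.name for p in Path(demo_dir).iterdir())
    print(removed_count, remaining)
```

Test a routine like this against a copy of the cache before scheduling it, since the right retention period depends on how often the same imagery is revisited.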

References to cloud-based imagery in a mosaic dataset

When you create a mosaic dataset, it typically references a file on disk, which isn't suitable in the cloud. The following are options for adding rasters to a mosaic dataset that don’t reference files on disk:

A cloud storage connection (ACS) file—Create the ACS file, then use it to add data to a mosaic dataset. You can create a connection to the cloud store, or add all the files accessed by the ACS file directly to the mosaic dataset. On the Add Rasters dialog box, under input data, choose File, change File List (*.csv) to All Files (*.*), and browse to the ACS file.
Note:
A mosaic dataset that accesses data in the cloud via an ACS file must be published by reference. First, register the cloud store with the server using the ACS credentials. Once it is registered, the mosaic dataset can be published by reference, and the published layer will access the data using the cloud store.
Embed raster proxies in a mosaic dataset. The raster proxy text string is embedded into the mosaic dataset—no external file is referenced. There are two ways to do this:
- Create the mosaic dataset using raster proxy files, and use MDTools (part of MDCS) to embed them in the mosaic dataset.
- Use OptimizeRasters to create the raster proxies as a table, and use that table to create your mosaic dataset.
File share to raster proxies—Raster proxies are small XML files that include a reference to a cloud storage location. Since the files are so small, they can be placed in the same file structure on the authoring machine and the server.
VSI file handlers (/vsicurl/, /vsis3/, /vsiaz/)—Using the Table raster type, you can use these paths to reference rasters directly in the mosaic dataset.
A file share—In this case, the same path for authoring the raster must be available to the server, which is not suitable for cloud storage.
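For the VSI file handler option, the input table needs one path per raster. The sketch below converts cloud URLs into GDAL virtual file system paths (/vsis3/, /vsiaz/, /vsicurl/) and writes them to CSV text. The column name Raster and the az:// URL shorthand are assumptions made for illustration; check the Table raster type documentation for the fields it expects.

```python
import csv
import io
from urllib.parse import urlparse


def to_vsi_path(url: str) -> str:
    """Map a cloud URL to a GDAL virtual file system path."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        return f"/vsis3/{parsed.netloc}{parsed.path}"
    if parsed.scheme == "az":  # hypothetical shorthand for container/blob
        return f"/vsiaz/{parsed.netloc}{parsed.path}"
    if parsed.scheme in ("http", "https"):
        return f"/vsicurl/{url}"
    raise ValueError(f"unsupported URL scheme: {url}")


def raster_table_csv(urls) -> str:
    """Return CSV text with one VSI path per row under an assumed 'Raster' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Raster"])
    for url in urls:
        writer.writerow([to_vsi_path(url)])
    return buffer.getvalue()


# Bucket, container, and host names below are placeholders.
table_text = raster_table_csv([
    "s3://example-bucket/scenes/a.tif",
    "az://example-container/scenes/b.tif",
    "https://example.com/scenes/c.tif",
])
print(table_text)
```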

Cloud security for imagery

There are multiple ways of handling access and security.

Public buckets

No restrictions—You can have a public bucket that anyone can read.
Public, no-list permissions with obfuscated files—You can have a public bucket with listing disabled (a no-list option). Users with the URL to a file can access it, but a query of the bucket’s contents returns nothing. If the path of the file is also obfuscated, the URL is practically impossible to guess. However, anyone who is given the URL can also access the file, and access can’t then be restricted without removing the file. The level of security is analogous to allowing someone to download a copy of the file through a secured connection; once downloaded, they could share it with someone else.

Note:

Public buckets often make use of Requester Pays, where the user pays for egress, which requires an account with the cloud storage provider.
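The obfuscated-file option relies on object paths that cannot be guessed. One way to produce such paths, sketched here with Python's standard secrets module, is to prepend a random token to each file name. The function name and the 16-byte token length are illustrative choices, not a provider requirement.

```python
import secrets


def obfuscated_key(filename: str, token_bytes: int = 16) -> str:
    """Prefix the file name with a URL-safe random token to form an object key."""
    token = secrets.token_urlsafe(token_bytes)  # 16 bytes gives 128 bits of randomness
    return f"{token}/{filename}"


key = obfuscated_key("scene_001.tif")
print(key)
```

Keep a private record that maps each generated key back to its source file, because the bucket itself cannot be listed to recover it.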

Secure buckets

Access control list (ACL)—File-level permissions for specific users or system processes.
Role-based access control (RBAC)—Permissions are set by user or role. RBAC can be combined with presigned URLs (token-based access), access control lists that define permissions at the file level, and bucket policies that provide fine-grained control, for example by canonical ID (all data is shared with whatever system presents that ID).
Token-based control—This includes presigned URLs, AWS’s query string request authentication, and Azure’s Shared Access Signature (SAS).
Bucket policies—Nuanced access control. You can set this according to canonical IDs, IP addresses, and so on.
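To show the idea behind token-based control, the sketch below signs a URL with an expiry time and verifies it later. It is a deliberately simplified illustration using HMAC-SHA256; it is not the AWS or Azure signing scheme, and the secret, parameter names, and URL are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key held only by the signing service


def _signature(base_url: str, expires_at: int, secret: bytes) -> str:
    message = f"{base_url}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Append an expiry time and a signature covering the URL and that expiry."""
    query = urlencode({"expires": expires_at,
                       "signature": _signature(base_url, expires_at, secret)})
    return f"{base_url}?{query}"


def verify_url(signed_url: str, now: int, secret: bytes = SECRET) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(signed_url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires_at, secret)
    return hmac.compare_digest(signature, expected) and now < expires_at


signed = sign_url("https://example.com/bucket/scene.tif", expires_at=1000)
print(verify_url(signed, now=999))   # within the validity window
print(verify_url(signed, now=1001))  # after expiry
```

Unlike an obfuscated public URL, a link of this kind stops working on its own once the expiry passes, which limits the impact of a link being shared.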

Note that cross-origin resource sharing (CORS) can be an issue. If you have data in the cloud that you want web apps to access directly, configure these settings carefully so that you don’t block that access.

Which security option to use depends on many factors, but each level of security can affect performance. Typically, the public and obfuscated public options provide faster performance, since they do not require additional security checks.

Performance optimization with cloud storage

Performance in the cloud can be affected by the volume of data read, how efficient the process is, latency, bandwidth, and data structure. ArcGIS does a large amount of back-end optimization to improve performance, including minimizing the number of requests to cloud storage, caching when required, and so on.

Implementing general image management best practices will also improve performance.

File format

It is important to structure your data correctly to minimize the number of requests, since extra requests slow down processing. Different file types (simple TIFF files, netCDF, GRIB, different varieties of GeoTIFF, COG, MRF, and CRF) all have advantages and disadvantages:

Tiling enables partial access to the file, reducing data transfer for large datasets.
Compression reduces storage and transfer but has additional compute requirements.
Some raster data structures are more complex, decreasing performance by requiring multiple requests to access.
Pyramids provide faster access at smaller scales.
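The effect of tiling on data transfer can be estimated with simple arithmetic. The sketch below lists the tiles that a pixel window intersects and compares the bytes read with a row-organized file, under the simplifying assumptions that the data is uncompressed and that an untiled read fetches every intersecting row across the full image width. All sizes are illustrative.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return (tile_row, tile_col) pairs that intersect the pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def bytes_read_tiled(num_tiles, tile_size, bytes_per_pixel):
    """Whole tiles are read, even where only part of a tile is needed."""
    return num_tiles * tile_size * tile_size * bytes_per_pixel


def bytes_read_untiled(image_width, window_height, bytes_per_pixel):
    """Simplification: each intersecting row is read across the full width."""
    return image_width * window_height * bytes_per_pixel


# A 1000 x 1000 pixel window inside a 20000-pixel-wide, 3-band, 8-bit image.
window_tiles = tiles_for_window(3000, 3000, 1000, 1000, tile_size=512)
tiled_bytes = bytes_read_tiled(len(window_tiles), 512, 3)
untiled_bytes = bytes_read_untiled(20000, 1000, 3)
print(len(window_tiles), tiled_bytes, untiled_bytes)
```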

Below are summaries of common raster formats.

TIFF or GeoTIFF (Untiled)

Diagram showing how untiled TIFFs are structured

Popular format for imagery and rasters.
Supports different bit depths and numbers of bands.
Includes additional metadata in tags internal to the file.
Can include georeferencing information embedded as tags (sometimes called GeoTIFF).
Often doesn’t include pyramids and doesn’t use compression.
TIFF files from data providers are often in the simplest form and inefficient to access, both generally and in the cloud.

Tiled TIFF

Diagram showing how tiled TIFF files are structured

Type of TIFF or GeoTIFF.
Pixels are structured into tiles to optimize access, especially for large files. This minimizes the number of disk access requests to get a subset of pixels.
Tiling is done by including an index to the tiles, which is stored as part of the tags.
Optional JPEG, LZW/Deflate, or Limited Error Raster Compression (LERC) can reduce file sizes.
Optional pyramids (sometimes referred to as reduced-resolution datasets or overviews) increase access efficiency at smaller scales. These pyramids increase the file size by 30 to 50 percent depending on the compression and type of data.
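The lower end of that range follows from pixel counts alone. Assuming each pyramid level halves the width and height (a factor of 2, which is common but not the only option), every level holds about a quarter of the pixels of the level below it, so the extra pixels approach one third of the original. The sketch below computes this for an illustrative image size; actual file growth also depends on how well each level compresses.

```python
import math


def pyramid_levels(width, height, tile_size=512, factor=2):
    """Return (width, height) of each reduced level until one tile covers the image."""
    levels = []
    w, h = width, height
    while w > tile_size or h > tile_size:
        w = math.ceil(w / factor)
        h = math.ceil(h / factor)
        levels.append((w, h))
    return levels


def pyramid_pixel_overhead(width, height, tile_size=512, factor=2):
    """Extra pixels stored in the pyramid, as a fraction of the full-resolution image."""
    extra = sum(w * h for w, h in pyramid_levels(width, height, tile_size, factor))
    return extra / (width * height)


levels = pyramid_levels(16384, 16384)
overhead = pyramid_pixel_overhead(16384, 16384)
print(levels)
print(f"{overhead:.1%} additional pixels")
```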

COG

Diagram showing how Cloud Optimized GeoTIFFs are structured

Cloud Optimized GeoTIFF (COG) is a type of tiled TIFF where pyramids are required and the index and pyramids are moved to the beginning of the file.
This file restructuring can provide a slight performance improvement in applications that only view the image at small scales or need to crawl for the metadata.
Writing COG files directly to cloud storage can take longer than writing tiled TIFF files, because the pyramids and tags must be moved to the start of each file.
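Because a COG keeps its index at the beginning of the file, a client can typically request just the leading bytes to find out where the tiles are. The sketch below builds an HTTP Range header and parses the fixed TIFF header (byte order, magic number, offset of the first image file directory) from those bytes. The 16 KB request size is an assumption for illustration, and the header bytes are synthesized so that no network access is needed.

```python
import struct


def range_header(start: int, length: int) -> dict:
    """HTTP header asking for `length` bytes beginning at `start`."""
    return {"Range": f"bytes={start}-{start + length - 1}"}


def parse_tiff_header(data: bytes) -> dict:
    """Read byte order, TIFF flavor, and first IFD offset from the leading bytes."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    if data[:2] == b"II":
        endian = "<"  # little-endian
    elif data[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")
    (magic,) = struct.unpack(endian + "H", data[2:4])
    if magic == 42:  # classic TIFF: 32-bit offset follows
        (offset,) = struct.unpack(endian + "I", data[4:8])
        return {"bigtiff": False, "first_ifd_offset": offset}
    if magic == 43:  # BigTIFF: 64-bit offset stored at byte 8
        if len(data) < 16:
            raise ValueError("need at least 16 bytes for BigTIFF")
        (offset,) = struct.unpack(endian + "Q", data[8:16])
        return {"bigtiff": True, "first_ifd_offset": offset}
    raise ValueError("unrecognized TIFF magic number")


first_request = range_header(0, 16384)
sample_header = struct.pack("<2sHI", b"II", 42, 8)  # synthesized classic TIFF header
header_info = parse_tiff_header(sample_header)
print(first_request, header_info)
```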

MRF

Diagram showing how MRFs are structured

MRF is a tile-based format developed by NASA for storing and accessing rasters more efficiently and improving performance, especially in cloud storage.
The data is tiled and has pyramids (like tiled TIFF or COG).
The pyramids, index, and metadata can be stored as separate files, which can improve access speed when the small index and metadata files are stored separately in fast-access ephemeral storage.
MRF supports LERC and QB3 in addition to JPEG or LZW/Deflate. LERC and QB3 provide better and faster compression and decompression than LZW/Deflate. LERC also supports controlled lossy compression (important for large-bit-depth rasters such as elevation data or digital camera imagery). Both LERC and QB3 save additional storage space while speeding up data access.
The way NoData is managed helps remove artifacts at the edges of some images (a result of LERC compression and the way the JPEG tiles are stored).
When using caching for MRF, the caching performance can be faster because the cache can be stored without requiring it to be decoded first.

netCDF, HDF, or GRIB

Diagram showing how multidimensional file formats are structured

These file types are used to store multidimensional data.
Metadata and data are spread among multiple files, so you need to access many files to read a given subset.
Accessing these file types from the cloud results in poor performance.

CRF

This format is optimized for storing large rasters in cloud storage.
The raster is split into bundles, each of which has its own index and a set number of tiles.
This structure enables separate processes to write to different bundles in parallel.
The tile structure is built into the directory structure, but can result in a large number of files.
The format is best suited for large rasters, since the file is divided into multiple directories and files.
When accessed in ArcGIS Pro, each required bundle is read and cached locally or by tile, depending on a configuration setting.
CRF is not accessed through GDAL but uses an ArcGIS optimized pipeline.

Transposed CRF

This is a CRF option that is optimized for multidimensional data.
Conceptually, it creates a copy of the data cube turned on its side to optimize time series queries.
GDAL has no suitable API for additional dimensions.
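The idea of turning the data cube on its side can be shown with nested lists. In the sketch below, the original cube is indexed as cube[time][row][col], so a time series at one pixel touches every time slice, while the transposed copy is indexed as transposed[row][col][time], so the same series is a single contiguous run. This is a conceptual illustration only and does not reflect the on-disk CRF layout.

```python
def build_cube(times, rows, cols):
    """Synthetic cube indexed as cube[time][row][col]."""
    return [[[t * 100 + r * 10 + c for c in range(cols)]
             for r in range(rows)]
            for t in range(times)]


def transpose_cube(cube):
    """Copy of the cube indexed as transposed[row][col][time]."""
    times, rows, cols = len(cube), len(cube[0]), len(cube[0][0])
    return [[[cube[t][r][c] for t in range(times)]
             for c in range(cols)]
            for r in range(rows)]


def series_from_slices(cube, row, col):
    """Touches every time slice once to collect one value from each."""
    return [time_slice[row][col] for time_slice in cube]


def series_from_transposed(transposed, row, col):
    """Reads one contiguous run of values."""
    return transposed[row][col]


cube = build_cube(times=5, rows=3, cols=4)
transposed = transpose_cube(cube)
print(series_from_slices(cube, 1, 2))
print(series_from_transposed(transposed, 1, 2))
```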

JP2

This format is optimized for high compression.
There are many types of JP2, some more optimized than others for storage size or access.
It uses wavelet compression, which typically provides higher compression than JPEG and supports additional bit depths but is relatively slow to access, especially from cloud storage. It is, therefore, not recommended.

Compression

Choosing a compression format means balancing the following factors:

The reduction in required storage
Loss of data (lossy, controlled lossy, or lossless compression)
Time required to write the compressed file
Time required to read the compressed file

Different types of compression balance these factors differently. For lossless or controlled lossy compression, LERC generally provides the best compression and is fast to both compress and decompress. LERC has the most value when used with floating-point data, such as elevation. Deflate often provides good compression. QB3 is a newer lossless compression that has good compression and is extremely fast to read and write. For lossy compression, JPEG provides fast compression and decompression while maintaining most of the image information.
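The trade-off between lossless and controlled lossy compression can be demonstrated with the standard library alone. The sketch below is not LERC: it uses Deflate (via zlib) on synthetic floating-point values, first losslessly and then after quantizing the values so that no value changes by more than a chosen maximum error. The synthetic data, the function names, and the 0.1 tolerance are all assumptions for illustration.

```python
import math
import random
import struct
import zlib


def make_elevation(n, seed=0):
    """Synthetic elevation-like profile: a smooth trend plus small noise."""
    rng = random.Random(seed)
    return [100.0 + 50.0 * math.sin(i / 200.0) + rng.uniform(-0.5, 0.5)
            for i in range(n)]


def deflate_lossless(values, level=6):
    """Deflate the values stored as 32-bit floats."""
    return zlib.compress(struct.pack(f"<{len(values)}f", *values), level)


def quantize(values, max_error):
    """Round to integer multiples of 2 * max_error, so the error stays within max_error."""
    step = 2 * max_error
    return [round(v / step) for v in values]


def dequantize(quantized, max_error):
    step = 2 * max_error
    return [q * step for q in quantized]


def deflate_controlled_lossy(values, max_error, level=6):
    """Deflate the quantized values stored as 32-bit integers."""
    quantized = quantize(values, max_error)
    return zlib.compress(struct.pack(f"<{len(quantized)}i", *quantized), level)


values = make_elevation(20000)
raw_size = 4 * len(values)
lossless_size = len(deflate_lossless(values))
lossy_size = len(deflate_controlled_lossy(values, max_error=0.1))
restored = dequantize(quantize(values, 0.1), 0.1)
worst_error = max(abs(a - b) for a, b in zip(values, restored))
print(raw_size, lossless_size, lossy_size, worst_error)
```

Results will vary with the data, which is why measuring on a representative sample of your own imagery is worthwhile before settling on a compression type.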

Compression performance is dependent on the data source, but the following table shows typical results for a variety of common compression types:

Table showing the time to write, time to read, and resulting size of various raster compression formats

File conversion

Converting imagery to an optimized format is best done before or during the upload process.

There are various ways to convert imagery into an optimized format and upload it at the same time, such as:

The Export Raster pane in ArcGIS Pro
The Copy Raster geoprocessing tool in ArcGIS Pro
GDAL
OptimizeRasters, an open-source tool (available from Esri on GitHub) that uses GDAL behind the scenes. It performs cloud optimization of files while maintaining the data structure and metadata. It can also be used to copy data to and from the cloud and to create raster proxy files and tables containing raster proxies, which provide higher efficiency when working with large numbers of images.
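For the GDAL option, conversion is often a single gdal_translate call. The sketch below builds the argument list for a COG or a tiled GeoTIFF and only runs it if gdal_translate is found on the system path. The COG output driver requires GDAL 3.1 or later; the file names and the choice of Deflate compression are placeholders.

```python
import shutil
import subprocess


def gdal_translate_command(src, dst, output_format="COG", compress="DEFLATE"):
    """Build a gdal_translate argument list for a cloud-friendly output file."""
    cmd = ["gdal_translate", "-of", output_format]
    if output_format == "GTiff":
        cmd += ["-co", "TILED=YES"]  # the COG driver always writes tiles
    cmd += ["-co", f"COMPRESS={compress}", src, dst]
    return cmd


def run_conversion(src, dst, **options):
    """Run the conversion; raises if GDAL's command-line tools are not installed."""
    if shutil.which("gdal_translate") is None:
        raise RuntimeError("gdal_translate was not found on PATH")
    subprocess.run(gdal_translate_command(src, dst, **options), check=True)


cog_cmd = gdal_translate_command("input.tif", "output_cog.tif")
tiled_cmd = gdal_translate_command("input.tif", "output_tiled.tif",
                                   output_format="GTiff")
print(" ".join(cog_cmd))
print(" ".join(tiled_cmd))
```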

File transfer

The following are additional upload options that will transfer your files without optimizing the format:

The Transfer Files tool in ArcGIS Pro is recommended because it connects via ACS files to most cloud stores and provides efficient download and upload.
The ArcGIS Enterprise portal (uploads go to the raster store).
OptimizeRasters can be used to upload data to the cloud with or without file conversion.
Third-party tools such as CloudBerry and the AWS CLI.
White glove services, where the cloud storage company ships you a disk to copy your data onto. You send the disk back, and the company uploads the data to the cloud for you. It’s a quick way to move large amounts of data into the cloud.
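Whether a shipped disk is quicker than a network upload comes down to arithmetic on data volume and sustained bandwidth. A minimal sketch follows; the 80 percent efficiency factor is an assumption to account for protocol overhead and contention, and the volumes and link speeds are examples only.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  efficiency: float = 0.8) -> float:
    """Estimated days to upload the data over a link at the given sustained rate."""
    bits = terabytes * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6 * efficiency)
    return seconds / 86400


# 100 TB over a 1 Gbps link versus a 100 Mbps link.
fast_link_days = transfer_days(100, 1000)
slow_link_days = transfer_days(100, 100)
print(f"{fast_link_days:.1f} days at 1 Gbps, {slow_link_days:.1f} days at 100 Mbps")
```

If the estimate runs to weeks, a shipped disk is likely to be the faster route.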
