
Data structure

In most cases, the directory structure in which the data is stored on disk is not critical, and it is often best to leave existing data in its original structure; otherwise, existing applications may have difficulty accessing the data. If new data is being acquired, the recommendations below can help you structure the files to simplify data management. As a general rule, structure imagery using a directory naming convention and hierarchy agreed on by the organization.

The number of files in a directory can affect access performance. Listing the contents of a directory is a common operation, and it can take a long time when the directory contains a very large number of files. Even if you never create a file listing yourself, the operating system often needs to do so, for example to resolve case-insensitive file names. It is therefore recommended to restrict the number of files in a directory to about 1,000. For large collections of files, create subdirectories to reduce the number of files in any single directory.

The following are recommendations for data structure.

Directory structure

Define a hierarchy that makes sense for the data, and plan ahead to provide sufficient granularity later. A typical hierarchy could be as follows:

DataType\Source\Type\Geography\Date

For example: Satellite\GeoEye\GeoEye1\Europe\2001

Even if you initially only have imagery of one specific subcategory, planning in advance with sufficient granularity makes it easier to extend later.
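
As an illustration, the following Python sketch shows how such a hierarchy might be assembled from per-image attributes. The build_image_path helper, the root path, and the attribute names are hypothetical, not part of any ArcGIS API:

    import os

    def build_image_path(root, data_type, source, sensor, geography, date):
        # Hypothetical helper: assemble DataType\Source\Type\Geography\Date
        # under an agreed root, creating the directories if they do not exist.
        path = os.path.join(root, data_type, source, sensor, geography, date)
        os.makedirs(path, exist_ok=True)
        return path

    # Example: Satellite\GeoEye\GeoEye1\Europe\2001
    target = build_image_path(r"\\server\imagery",
                              "Satellite", "GeoEye", "GeoEye1", "Europe", "2001")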

File naming

File names are generally defined by the data provider, and it is recommended that you do not rename them. If you are creating new files, include key descriptors in the name, such as the acquisition date and a geospatial indicator (for example, a tile or scene identifier). Unique names also make it easier to link metadata attributes later.
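
As an illustration only (the naming pattern below is an assumption, not a provider convention), a new file name might combine sensor, acquisition date, and a tile identifier:

    from datetime import date

    def make_image_name(sensor, acquired, tile_id, ext="tif"):
        # Hypothetical naming pattern: sensor_YYYYMMDD_tileID.ext
        return f"{sensor}_{acquired:%Y%m%d}_{tile_id}.{ext}"

    print(make_image_name("GeoEye1", date(2001, 6, 15), "E045N12"))
    # GeoEye1_20010615_E045N12.tif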

Files per directory

Try to keep the number of files per directory under 1,000. There is no specific maximum, but as the number of files in a directory grows, file access typically slows down. This affects not only the time taken to list the files in a directory, but also the time taken to open individual files. The effect is more noticeable on Windows-based operating systems, but it also affects ArcGIS on Linux-based operating systems.
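
A minimal sketch of how a large flat directory could be split into numbered subdirectories of roughly 1,000 files each (the part_NNN naming is an assumption):

    import os
    import shutil

    def split_into_subdirs(src_dir, max_per_dir=1000):
        # Move files from one large directory into numbered subdirectories,
        # keeping roughly max_per_dir files in each.
        files = sorted(f for f in os.listdir(src_dir)
                       if os.path.isfile(os.path.join(src_dir, f)))
        for i, name in enumerate(files):
            sub = os.path.join(src_dir, f"part_{i // max_per_dir:03d}")
            os.makedirs(sub, exist_ok=True)
            shutil.move(os.path.join(src_dir, name), os.path.join(sub, name))

Note that moving files breaks any existing references to them (for example, from mosaic datasets), so a reorganization like this is best done before the imagery is referenced elsewhere.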

Case sensitivity

Although file names are case insensitive on Windows, using a mixture of extension cases can cause issues on many other storage systems. It is recommended to use lowercase file extensions.
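
A short sketch that reports (but does not rename) files whose extensions are not lowercase, so they can be reviewed before any change is made:

    import os

    def find_mixed_case_extensions(root):
        # List files whose extension contains uppercase characters (e.g. .TIF).
        for dirpath, _, files in os.walk(root):
            for name in files:
                ext = os.path.splitext(name)[1]
                if ext != ext.lower():
                    print(os.path.join(dirpath, name))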

Setting files as read-only

It is recommended to set data files such as TIFFs to read-only, but do not make the directory read-only.

Images generally do not change. Files such as pyramids (.ovr) and metadata (.aux.xml) may be added, but in most cases it is possible and recommended to set the main imagery files (.tif and so on) as read-only. Some processes (such as setting the spatial reference) may change the file itself or may instead modify the associated .aux.xml file. Setting image files as read-only helps ensure that the original files are not modified. This helps preserve the authenticity of the files and prevents them from being unnecessarily backed up multiple times.

It is recommended not to set the directory in which the files are stored as read-only, since many workflows write additional pyramid, statistics, or metadata files alongside the source files. If the directories are read-only during the authoring process, these files are written to separate locations disconnected from the originals.
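
The following is a minimal sketch of clearing the write permission on the imagery files themselves while leaving the containing directories writable; the extension list is an example:

    import os
    import stat

    def set_images_read_only(root, extensions=(".tif", ".tiff")):
        # Set the main imagery files to read-only; directories are left writable
        # so that .ovr and .aux.xml files can still be written alongside them.
        for dirpath, _, files in os.walk(root):
            for name in files:
                if name.lower().endswith(extensions):
                    os.chmod(os.path.join(dirpath, name), stat.S_IREAD)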

Drive performance

As described in the section above, imagery is too large for the system to load entirely into memory, so ArcGIS reads the required pixels from the disk system as needed. The performance of the disk system is therefore an important component of optimization, and the server must have fast access to the imagery. If the imagery is highly compressed, this is less of a concern, because less data needs to be read from disk. It is recommended that you check the performance of the disk subsystem by using a disk speed testing utility.
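
If a dedicated disk-testing utility is not available, a rough sequential-read check can be sketched as follows; this gives only a crude estimate, and purpose-built tools give more reliable numbers:

    import time

    def read_throughput(path, block_size=4 * 1024 * 1024):
        # Crude sequential-read benchmark: report MB/s for a single large file.
        # Results are affected by operating-system caching, so test against a
        # file that has not been read recently.
        start = time.perf_counter()
        total = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += len(block)
        elapsed = time.perf_counter() - start
        return total / (1024 * 1024) / elapsed

    # print(read_throughput(r"\\server\imagery\sample.tif"))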

UNC versus drive letters

On some file systems, it may be better to reference files by mapped drive letters rather than UNC paths, or vice versa. Whether there is a performance difference, and which form is faster, depends on many factors, and it is best to determine this by testing.
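
One way to test is to time the same read through both path forms; the paths below are placeholders:

    import time

    def time_open_read(path, n_bytes=16 * 1024 * 1024):
        # Time reading the first n_bytes of a file through a given path form.
        # Run several times and discard the first (cached) run.
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read(n_bytes)
        return time.perf_counter() - start

    # Compare a mapped drive letter against the equivalent UNC path:
    # print(time_open_read(r"Z:\imagery\sample.tif"))
    # print(time_open_read(r"\\server\share\imagery\sample.tif"))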

Metadata

Metadata about imagery is important to maintain. Unfortunately, few metadata standards are consistently adhered to. Imagery from a major data provider typically comes with an associated metadata file for each scene. When a large number of files belong to a single project, such as an aerial survey mission or a large collection of tiled images that make up one dataset, metadata may instead be provided as a table, or only as a document stored with the files.

It is recommended that any metadata that comes with the imagery be stored alongside the imagery. If a mosaic dataset is created from the imagery, then the appropriate metadata should be either directly ingested as part of the raster type used to ingest the data or added as additional fields to the mosaic dataset. The schema for additional fields in a mosaic dataset is not enforced; therefore, fields with any valid name or type can be added. Attribute fields are copied from a source mosaic dataset to derived mosaic datasets, so standardizing the field naming of the source mosaic datasets will reduce the presence of empty attributes in the derived mosaic datasets. The following is a list of some proposed field names for commonly used metadata fields:

  • AcquisitionDate [Date]

  • CloudCover [Long]-(0 - 100%)

  • NadirAngle [Long]-(0 - 90 deg) 0 is vertical

  • LE90 [Float]-Vertical accuracy (linear error, 90%)

  • CE90 [Float]-Horizontal accuracy (circular error, 90%)
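
A minimal arcpy sketch, assuming a mosaic dataset already exists at a placeholder path, that adds the fields listed above as additional attribute fields:

    import arcpy

    mosaic = r"C:\data\imagery.gdb\SatelliteImagery"  # placeholder path

    # Add the commonly used metadata fields listed above to the mosaic dataset.
    fields = [
        ("AcquisitionDate", "DATE"),
        ("CloudCover", "LONG"),
        ("NadirAngle", "LONG"),
        ("LE90", "FLOAT"),
        ("CE90", "FLOAT"),
    ]
    for name, field_type in fields:
        arcpy.management.AddField(mosaic, name, field_type)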

Workflows for specific data types may have additional recommendations.