
Data structure

In most cases, the directory structure in which the data is stored on disk is not critical, and it is often best to leave existing data in its original structure; otherwise, existing applications may have difficulty accessing the data. If new data is being acquired, the recommendations below can help you structure the files to simplify data management. As a general rule, structure imagery using a directory naming convention and hierarchy agreed on by the organization.

The number of files in a directory can affect access performance. Listing the contents of a directory is a common operation, and it can take a long time when the directory contains a very large number of files. Even if you never create a file listing yourself, the operating system often needs to do so, for example to resolve case-insensitive file names. It is therefore recommended to restrict the number of files in a directory to about 1,000. For large collections of files, create subdirectories to reduce the number of files in any single directory.

The following are recommendations for data structure.

Directory structure

Define a hierarchy that makes sense for the data, and plan ahead to provide sufficient granularity later. A typical hierarchy could be as follows:

DataType\Source\Type\Geography\Date

For example: Satellite\GeoEye\GeoEye1\Europe\2001

Even if you initially only have imagery of one specific subcategory, planning in advance with sufficient granularity makes it easier to extend later.
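
As an illustration, the following Python sketch shows how such a hierarchy might be assembled from per-image attributes. The build_image_path helper, the root path, and the attribute names are hypothetical, not part of any ArcGIS API:

    import os

    def build_image_path(root, data_type, source, sensor, geography, date):
        # Hypothetical helper: assemble DataType\Source\Type\Geography\Date
        # under an agreed root, creating the directories if they do not exist.
        path = os.path.join(root, data_type, source, sensor, geography, date)
        os.makedirs(path, exist_ok=True)
        return path

    # Example: Satellite\GeoEye\GeoEye1\Europe\2001
    target = build_image_path(r"\\server\imagery",
                              "Satellite", "GeoEye", "GeoEye1", "Europe", "2001")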

File naming

File names are generally defined by the data provider, and it is recommended that you do not rename them. If you are creating new files, include key descriptors in the name, such as the acquisition date and a geospatial indicator (for example, a tile or scene identifier). Unique names also make it easier to link metadata attributes later.
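
As an illustration only (the naming pattern below is an assumption, not a provider convention), a new file name might combine sensor, acquisition date, and a tile identifier:

    from datetime import date

    def make_image_name(sensor, acquired, tile_id, ext="tif"):
        # Hypothetical naming pattern: sensor_YYYYMMDD_tileID.ext
        return f"{sensor}_{acquired:%Y%m%d}_{tile_id}.{ext}"

    print(make_image_name("GeoEye1", date(2001, 6, 15), "E045N12"))
    # GeoEye1_20010615_E045N12.tif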

Files per directory

Try to keep the number of files per directory under 1,000. There is no specific maximum, but as the number of files in a directory grows, file access typically slows down. This affects not only the time taken to list the files in a directory, but also the time taken to open individual files. The effect is more noticeable on Windows-based operating systems, but it also affects ArcGIS on Linux-based operating systems.
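
A minimal sketch of how a large flat directory could be split into numbered subdirectories of roughly 1,000 files each (the part_NNN naming is an assumption):

    import os
    import shutil

    def split_into_subdirs(src_dir, max_per_dir=1000):
        # Move files from one large directory into numbered subdirectories,
        # keeping roughly max_per_dir files in each.
        files = sorted(f for f in os.listdir(src_dir)
                       if os.path.isfile(os.path.join(src_dir, f)))
        for i, name in enumerate(files):
            sub = os.path.join(src_dir, f"part_{i // max_per_dir:03d}")
            os.makedirs(sub, exist_ok=True)
            shutil.move(os.path.join(src_dir, name), os.path.join(sub, name))

Note that moving files breaks any existing references to them (for example, from mosaic datasets), so a reorganization like this is best done before the imagery is referenced elsewhere.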

Case sensitivity

Although file names are case insensitive on Windows, using a mixture of extension cases can cause issues on many other storage systems. It is recommended to use lowercase file extensions.
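
A short sketch that reports (but does not rename) files whose extensions are not lowercase, so they can be reviewed before any change is made:

    import os

    def find_mixed_case_extensions(root):
        # List files whose extension contains uppercase characters (e.g. .TIF).
        for dirpath, _, files in os.walk(root):
            for name in files:
                ext = os.path.splitext(name)[1]
                if ext != ext.lower():
                    print(os.path.join(dirpath, name))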

Setting files as read-only

It is recommended to set data files such as TIFFs to read-only, but do not make the directory read-only.

Images generally do not change. Files such as pyramids (.ovr) and metadata (.aux.xml) may be added, but in most cases it is possible and recommended to set the main imagery files (.tif and so on) as read-only. Some processes (such as setting the spatial reference) may change the file itself or may instead modify the associated .aux.xml file. Setting image files as read-only helps ensure that the original files are not modified. This helps preserve the authenticity of the files and prevents them from being unnecessarily backed up multiple times.

It is recommended not to set the directory in which the files are stored as read-only, since many workflows write additional pyramid, statistics, or metadata files alongside the source files. If the directories are read-only during the authoring process, these files are written to separate locations disconnected from the originals.
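
The following is a minimal sketch of clearing the write permission on the imagery files themselves while leaving the containing directories writable; the extension list is an example:

    import os
    import stat

    def set_images_read_only(root, extensions=(".tif", ".tiff")):
        # Set the main imagery files to read-only; directories are left writable
        # so that .ovr and .aux.xml files can still be written alongside them.
        for dirpath, _, files in os.walk(root):
            for name in files:
                if name.lower().endswith(extensions):
                    os.chmod(os.path.join(dirpath, name), stat.S_IREAD)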

Drive performance

As described in the section above, imagery is too large for the system to load entirely into memory, so ArcGIS reads the required pixels from the disk system as needed. The performance of the disk system is therefore an important component of optimization, and the server must have fast access to the imagery. If the imagery is highly compressed, this is less of a concern, because less data needs to be read from disk. It is recommended that you check the performance of the disk subsystem by using a disk speed testing utility.
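
If a dedicated disk-testing utility is not available, a rough sequential-read check can be sketched as follows; this gives only a crude estimate, and purpose-built tools give more reliable numbers:

    import time

    def read_throughput(path, block_size=4 * 1024 * 1024):
        # Crude sequential-read benchmark: report MB/s for a single large file.
        # Results are affected by operating-system caching, so test against a
        # file that has not been read recently.
        start = time.perf_counter()
        total = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += len(block)
        elapsed = time.perf_counter() - start
        return total / (1024 * 1024) / elapsed

    # print(read_throughput(r"\\server\imagery\sample.tif"))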

UNC versus drive letters

On some file systems, it may be better to reference files by mapped drive letters rather than UNC paths, or vice versa. Whether there is a performance difference, and which form is faster, depends on many factors, and it is best to determine this by testing.
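
One way to test is to time the same read through both path forms; the paths below are placeholders:

    import time

    def time_open_read(path, n_bytes=16 * 1024 * 1024):
        # Time reading the first n_bytes of a file through a given path form.
        # Run several times and discard the first (cached) run.
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read(n_bytes)
        return time.perf_counter() - start

    # Compare a mapped drive letter against the equivalent UNC path:
    # print(time_open_read(r"Z:\imagery\sample.tif"))
    # print(time_open_read(r"\\server\share\imagery\sample.tif"))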

Metadata

Metadata about imagery is important to maintain. Unfortunately, few metadata standards are consistently adhered to. Imagery from a major data provider typically comes with an associated metadata file for each scene. When a large number of files belong to a single project, such as an aerial survey mission or a large collection of tiled images that make up one dataset, metadata may instead be provided as a table, or only as a document stored with the files.

It is recommended that any metadata that comes with the imagery be stored alongside the imagery. If a mosaic dataset is created from the imagery, then the appropriate metadata should be either directly ingested as part of the raster type used to ingest the data or added as additional fields to the mosaic dataset. The schema for additional fields in a mosaic dataset is not enforced; therefore, fields with any valid name or type can be added. Attribute fields are copied from a source mosaic dataset to derived mosaic datasets, so standardizing the field naming of the source mosaic datasets will reduce the presence of empty attributes in the derived mosaic datasets. The following is a list of some proposed field names for commonly used metadata fields:

  • AcquisitionDate [Date]

  • CloudCover [Long]-(0 - 100%)

  • NadirAngle [Long]-(0 - 90 deg) 0 is vertical

  • LE90 [Float]-Vertical accuracy (linear error, 90%)

  • CE90 [Float]-Horizontal accuracy (circular error, 90%)
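
A minimal arcpy sketch, assuming a mosaic dataset already exists at a placeholder path, that adds the fields listed above as additional attribute fields:

    import arcpy

    mosaic = r"C:\data\imagery.gdb\SatelliteImagery"  # placeholder path

    # Add the commonly used metadata fields listed above to the mosaic dataset.
    fields = [
        ("AcquisitionDate", "DATE"),
        ("CloudCover", "LONG"),
        ("NadirAngle", "LONG"),
        ("LE90", "FLOAT"),
        ("CE90", "FLOAT"),
    ]
    for name, field_type in fields:
        arcpy.management.AddField(mosaic, name, field_type)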

Workflows for specific data types may have additional recommendations.