Data Pipelines performs batch processing on stored vector and tabular data, such as data in a feature layer or in a cloud or object store such as Amazon S3 or Google BigQuery. It provides data preparation and engineering capabilities so you can combine and shape your data and integrate it into ArcGIS. Processing is performed with tools that are grouped into the following toolset categories:
- Clean—Clean the data. For example, you can remove unnecessary fields. You can also modify the fields or fill missing values.
- Construct—Create fields that are derived from existing fields or properties of the layer. For example, you can add and calculate a new field; standardize, transform, or reclassify an existing field; and add a field based on the input layer's geometry.
- Format—Change the format of the fields or reorganize the fields in the table or feature class. For example, you can convert time fields, encode categorical fields, or reduce the dimensions of existing fields.
- Integrate—Integrate or add data from another data source to the input table or feature class. For example, you can join fields or add fields by enriching the data.
- Output datasets—Choose the output type to write and store the result.
The following are example scenarios in which Data Pipelines can be used:
- As a data scientist, you can combine disparate datasets and calculate variables as fields using Arcade functions.
- As a GIS analyst, you can build and share reproducible data preparation workflows.
- As an environmental scientist, you can combine and standardize field information that is stored as a collection of .csv files.
The following sections describe the tools in each category in the Data Pipelines editor.
The following tools are in the Clean category:
The Filter by attribute tool returns a subset of a dataset based on a query. The output is a new dataset containing only the records that meet the condition specified in the query.
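The behavior described above can be sketched in plain Python. This is only an illustration of attribute filtering, not the Data Pipelines API; the function name `filter_by_attribute`, the sample records, and the query are all assumptions made for the example.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
records = [
    {"name": "Oak", "height_m": 12.5},
    {"name": "Pine", "height_m": 30.0},
    {"name": "Birch", "height_m": 8.0},
]

def filter_by_attribute(rows, predicate):
    """Return a new list holding only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

# Hypothetical query: keep records taller than 10 meters.
tall = filter_by_attribute(records, lambda row: row["height_m"] > 10)
print([row["name"] for row in tall])
```

The input list is left unchanged, which mirrors the tool writing its result to a new dataset.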
The Filter by extent tool returns a subset of a dataset based on a specified spatial extent. The output is a new dataset containing only the records that are geographically within the specified extent.
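As a rough sketch of the idea, the following plain-Python example keeps only point records that fall inside a rectangular extent. It assumes point data with hypothetical `x` and `y` fields and treats the extent boundary as inclusive; how the actual tool handles lines, polygons, and boundary cases is not covered here.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
def filter_by_extent(points, xmin, ymin, xmax, ymax):
    """Return the points whose coordinates lie inside the extent (edges inclusive)."""
    return [
        p for p in points
        if xmin <= p["x"] <= xmax and ymin <= p["y"] <= ymax
    ]

sites = [
    {"id": 1, "x": 2.0, "y": 3.0},
    {"id": 2, "x": 15.0, "y": 3.0},   # outside: x is greater than xmax
    {"id": 3, "x": 10.0, "y": 10.0},  # on the corner of the extent, kept
]

inside = filter_by_extent(sites, xmin=0, ymin=0, xmax=10, ymax=10)
print([p["id"] for p in inside])
```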
The Remove duplicates tool removes duplicate records based on one or more key fields. The output is a new dataset with no duplicate records.
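Key-based duplicate removal can be illustrated with a short plain-Python sketch. It assumes that the first record seen for each key is the one that is kept; the text above does not say which duplicate the tool retains, so treat that choice, along with the function name `remove_duplicates`, as an assumption.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
def remove_duplicates(rows, key_fields):
    """Keep the first row seen for each combination of key field values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

readings = [
    {"station": "A", "date": "2024-01-01", "value": 4.2},
    {"station": "A", "date": "2024-01-01", "value": 4.3},  # same key as the first row
    {"station": "B", "date": "2024-01-01", "value": 5.0},
]

deduplicated = remove_duplicates(readings, ["station", "date"])
print(len(deduplicated))
```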
The Select fields tool keeps one or more specified fields in the output dataset. The output is a new dataset containing only the specified fields.
The Simplify geometry tool reduces the complexity of polylines or polygons by removing unnecessary vertices while retaining the most critical ones.
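The text above does not name the simplification algorithm the tool uses. As one common way to drop unnecessary vertices while keeping critical ones, the following sketch implements the Douglas-Peucker algorithm in plain Python. The choice of algorithm, the function names, and the tolerance value are assumptions made for illustration.

```python
# Illustrative only: Douglas-Peucker is an assumed technique, not the tool's documented method.
import math

def perpendicular_distance(point, start, end):
    """Distance from point to the infinite line through start and end."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # Degenerate segment: fall back to point-to-point distance.
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def simplify(points, tolerance):
    """Return a simplified copy of a polyline given as a list of (x, y) tuples."""
    if len(points) < 3:
        return list(points)
    # Find the interior vertex farthest from the line joining the endpoints.
    max_distance, index = 0.0, 0
    for i in range(1, len(points) - 1):
        distance = perpendicular_distance(points[i], points[0], points[-1])
        if distance > max_distance:
            max_distance, index = distance, i
    if max_distance <= tolerance:
        # Every interior vertex is within tolerance, so keep only the endpoints.
        return [points[0], points[-1]]
    # Otherwise keep the farthest vertex and simplify both halves recursively.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right

line = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
simplified = simplify(line, tolerance=0.1)
print(simplified)
```

In this example the vertex at `(1, 0.05)` deviates from its neighbors by less than the tolerance and is removed, while the peak at `(3, 2)` is retained.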
The following tools are in the Construct category:
The Calculate field tool calculates field values for a new or existing field. You can use Arcade functions to define the calculation expression.
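In Data Pipelines the calculation is written as an Arcade expression. The sketch below uses a Python function in place of an Arcade expression to show the general idea of deriving a field value for every record; the helper name `calculate_field` and the sample fields are hypothetical.

```python
# Illustrative only: a Python callable stands in for the Arcade expression.
def calculate_field(rows, field_name, expression):
    """Return new rows with field_name set to expression(row), adding or overwriting it."""
    return [{**row, field_name: expression(row)} for row in rows]

districts = [
    {"name": "North", "population": 1200, "area_km2": 4.0},
    {"name": "South", "population": 900, "area_km2": 3.0},
]

with_density = calculate_field(
    districts,
    "density",
    lambda row: row["population"] / row["area_km2"],
)
print([row["density"] for row in with_density])
```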
The Create date time tool creates a date time field using existing field values.
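A minimal sketch of the concept, assuming the date and the time are stored as strings in two separate fields: the values are concatenated and parsed with Python's standard `datetime.strptime`. The field names, the format string, and the helper `create_date_time` are assumptions for this example, not the tool's parameters.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
from datetime import datetime

def create_date_time(rows, date_field, time_field, out_field, fmt="%Y-%m-%d %H:%M"):
    """Return new rows with out_field holding a datetime parsed from two string fields."""
    return [
        {**row, out_field: datetime.strptime(f"{row[date_field]} {row[time_field]}", fmt)}
        for row in rows
    ]

observations = [{"id": 1, "obs_date": "2024-03-05", "obs_time": "14:30"}]
stamped = create_date_time(observations, "obs_date", "obs_time", "observed_at")
print(stamped[0]["observed_at"].isoformat())
```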
The Create geometry tool creates a geometry field using one or more fields.
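For example, a point geometry can be built from a pair of coordinate fields. The sketch below does this in plain Python and stores the result as a GeoJSON-style point; the `lon` and `lat` field names, the output structure, and the helper `create_point_geometry` are assumptions, and the actual tool may support other geometry types and input formats.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
def create_point_geometry(rows, x_field, y_field, out_field="geometry"):
    """Return new rows with a GeoJSON-style point built from two coordinate fields."""
    result = []
    for row in rows:
        point = {
            "type": "Point",
            "coordinates": [float(row[x_field]), float(row[y_field])],
        }
        result.append({**row, out_field: point})
    return result

# Coordinates often arrive as text, for example from a .csv file.
stations = [{"id": "A", "lon": "-117.19", "lat": "34.05"}]
located = create_point_geometry(stations, "lon", "lat")
print(located[0]["geometry"])
```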
The following tools are in the Format category:
The following tools are in the Integrate category:
The Join tool joins datasets based on the specified relationships. Datasets can be joined using matching attributes, spatial relationships, temporal relationships, or any combination of the three.
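Of the relationship types listed above, the attribute join is the simplest to sketch. The following plain-Python example assumes an inner join on a single matching field, with fields from the left dataset taking precedence on a name collision. Those choices and the helper `attribute_join` are illustrative assumptions; spatial and temporal joins are not shown.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
from collections import defaultdict

def attribute_join(left, right, left_key, right_key):
    """Inner join: one output row per matching pair of left and right rows."""
    index = defaultdict(list)
    for right_row in right:
        index[right_row[right_key]].append(right_row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[left_key], []):
            # Left fields are applied last, so they win on a name collision.
            joined.append({**right_row, **left_row})
    return joined

parcels = [
    {"parcel_id": 1, "zone_code": "R1"},
    {"parcel_id": 2, "zone_code": "C2"},
    {"parcel_id": 3, "zone_code": "X9"},  # no matching zone, dropped by the inner join
]
zones = [
    {"code": "R1", "description": "Residential"},
    {"code": "C2", "description": "Commercial"},
]

joined = attribute_join(parcels, zones, "zone_code", "code")
print([(row["parcel_id"], row["description"]) for row in joined])
```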
The Merge tool combines two or more datasets into a single, new dataset. You can combine point, line, polygon, or tabular datasets.
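Conceptually, a merge appends the records of each input to one output. The sketch below assumes the output schema is the union of all input fields, with missing values filled with `None`. How the actual tool reconciles differing schemas is not stated above, so this behavior and the helper name `merge` are assumptions.

```python
# Illustrative only: plain Python, not the Data Pipelines API.
def merge(*datasets):
    """Append all rows into one list, using the union of field names as the schema."""
    fields = []
    for dataset in datasets:
        for row in dataset:
            for name in row:
                if name not in fields:
                    fields.append(name)
    # Rows that lack a field get None for it.
    return [
        {name: row.get(name) for name in fields}
        for dataset in datasets
        for row in dataset
    ]

survey_2023 = [{"site": "A", "count": 10}]
survey_2024 = [{"site": "B", "count": 7, "observer": "Lee"}]

merged = merge(survey_2023, survey_2024)
print(merged)
```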
The following output dataset is supported:
The Feature layer output tool writes data pipeline datasets to a hosted feature layer or hosted table. You can create a new feature layer or table, replace the data in an existing feature layer or table, or add and update records in an existing feature layer or table.
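The "add and update" option behaves like an upsert: incoming records that match an existing record update it, and the rest are added. The sketch below models that logic on in-memory lists, assuming a single unique identifier field. It does not call any ArcGIS API, and the helper `add_and_update` and the field names are hypothetical.

```python
# Illustrative only: models upsert logic in memory; not the Data Pipelines API.
def add_and_update(existing, incoming, key_field):
    """Update rows whose key already exists and append rows with new keys."""
    by_key = {row[key_field]: row for row in existing}
    for row in incoming:
        key = row[key_field]
        # Merge onto the existing row if there is one; otherwise start from empty.
        by_key[key] = {**by_key.get(key, {}), **row}
    return list(by_key.values())

layer_rows = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
]
pipeline_rows = [
    {"id": 2, "status": "closed"},  # matches an existing record: update
    {"id": 3, "status": "open"},    # new identifier: add
]

result = add_and_update(layer_rows, pipeline_rows, "id")
print(result)
```

By contrast, the replace option would amount to discarding `layer_rows` and writing `pipeline_rows` in their place.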