The data pipelines you create in the ArcGIS Data Pipelines app are stored as items in your content. You'll use the Data Pipelines editor to create and edit data pipelines. The following sections outline the data pipeline elements and explain how to create and run a data pipeline in the editor.
Data pipeline elements
The following are the three elements of a data pipeline:
- An input is used to load data into the data pipeline for downstream processing. There are many input source types available. For more information about sources and source types, see Dataset configuration.
  - There can be multiple data sources in a single data pipeline. At least one is required in a data pipeline workflow.
- Tools process data that has been loaded from input datasets.
  - There can be multiple tools in a single data pipeline.
  - Tools can be connected to each other when the output of one tool represents the input of the next tool.
  - To learn more about the available tools and how to use them, see Data processing.
- An output defines what will be done with the results of the data pipeline.
  - You can output data pipeline results to a new feature layer, replace the data in an existing feature layer, or add to and update the existing data in a feature layer.
  - There can be multiple outputs in a single data pipeline.
  - You can configure multiple outputs for a single tool result or input dataset. At least one is required to run a data pipeline.
  - To learn more about writing results, see Feature layer.
Data pipeline workflow
The data pipeline workflow is composed of the three elements outlined above: connect to existing data, perform data engineering, and write out the newly prepared data. When a data pipeline is run, it generates one or more outputs. All output results are available in your portal content.
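The same input → tools → output pattern can be sketched in code. The following is an illustrative pandas analogy, not the Data Pipelines API; the sample records and output file name are invented for the example:

```python
# Illustrative analogy of the data pipeline pattern using pandas.
# The data and file name below are invented for the sketch.
import pandas as pd

# Input: load data for downstream processing (an in-memory stand-in
# for a source dataset such as a feature layer or CSV file).
records = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"],
                        "pop_millions": [0.7, 10.7, 3.1]})

# Tool: a processing step, here keeping only cities above 1 million.
large = records[records["pop_millions"] > 1.0]

# Output: write the prepared result out (a feature layer in
# Data Pipelines; a CSV file in this sketch).
large.to_csv("large_cities.csv", index=False)
```

Each stage corresponds to one element in the diagram: one or more inputs, optional tools, and at least one output.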
Connect to the data
The first step in creating a data pipeline is to connect to the data. On the editor toolbar, under Inputs, choose the source type to connect to. For example, choose Feature layer and browse to the layer, or choose Amazon S3 and browse to the data store item representing the bucket and folder containing the dataset. To learn more about connecting to data and how to optimize read performance, see Dataset configuration.
Perform data processing
The second step is to process the input data. On the editor toolbar, under Tools, choose the process to complete on the dataset. For example, to calculate locations for CSV data and filter the locations for a specific area of interest, you can use the Create geometry and Filter by extent tools.
To specify the dataset to use as input to a tool, do one of the following:
- Draw a line by dragging the pointer from the connector of one element to the connector of the other.
- Use the input dataset parameter to identify the input dataset.
Processing the data is optional. After connecting to the dataset, you can write it out as a feature layer with no processing.
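The Create geometry and Filter by extent example above can be sketched outside the app. The following pandas stand-in (the coordinates and extent values are invented for illustration) pairs CSV coordinate columns into points and keeps only the records inside a bounding box:

```python
# Illustrative stand-in, not the Data Pipelines API: build point
# geometries from CSV coordinate columns and filter to an extent.
import pandas as pd

csv_rows = pd.DataFrame({
    "name": ["A", "B", "C"],
    "lon": [-118.2, -73.9, -122.4],
    "lat": [34.1, 40.7, 37.8],
})

# "Create geometry": pair the coordinate columns into point tuples.
csv_rows["geometry"] = list(zip(csv_rows["lon"], csv_rows["lat"]))

# "Filter by extent": keep only points inside a bounding box
# (roughly California in this invented example).
xmin, ymin, xmax, ymax = -125.0, 32.0, -114.0, 42.0
in_extent = csv_rows[
    csv_rows["lon"].between(xmin, xmax) & csv_rows["lat"].between(ymin, ymax)
]
```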
To improve the performance of data pipeline processing, you can limit the amount of data you are working with by using one or a combination of the following tools:
- Select fields—Maintain only the fields of interest. For example, you have a census dataset with fields for the years 2000 and 2010 but you are only interested in the year 2010. Select only the fields that represent 2010 values.
- Filter by attribute—Maintain a subset of records that contain certain attribute values. For example, filter an earthquakes dataset for earthquakes with a magnitude greater than 5.5.
- Filter by extent—Maintain a subset of records within a certain spatial extent. For example, filter a dataset of United States flood hazard areas to the extent of another dataset that represents a state boundary.
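The Select fields and Filter by attribute examples above can be sketched with pandas as a stand-in; the census and earthquake records below are invented for the illustration:

```python
# Illustrative stand-in for two of the data-limiting tools.
import pandas as pd

# Select fields: keep only the fields of interest (the 2010 values).
census = pd.DataFrame({
    "tract": ["T1", "T2"],
    "pop_2000": [1200, 900],
    "pop_2010": [1500, 1100],
})
subset = census[["tract", "pop_2010"]]

# Filter by attribute: keep records with certain attribute values
# (earthquakes with a magnitude greater than 5.5).
quakes = pd.DataFrame({"id": [1, 2, 3], "magnitude": [4.8, 6.1, 7.0]})
strong = quakes[quakes["magnitude"] > 5.5]
```

Because each step discards data the rest of the pipeline never needs, applying these tools early reduces the amount of data every downstream tool must process.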
Preview data pipeline elements
Use preview to investigate your data at any step of the workflow. Preview offers the following methods to inspect your data:
- Table preview—Display a tabular representation of the data.
- Map preview—Display the locations of the dataset on a map. In map preview, you can pan, zoom, and inspect attributes.
- Schema—View the schema of the dataset.
- Messages—Review messages returned from the preview action.
Previews show up to 8000 data records.
When you preview date-time fields, the values are shown in your browser's time zone. When the values are written to a feature layer, they are stored in UTC.
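As a sketch of that conversion, assuming for illustration a browser in the America/Los_Angeles time zone:

```python
# Sketch of the preview-versus-storage time zone behavior.
# The browser time zone and timestamp are invented for the example.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A value previewed in the browser's local time zone...
local = datetime(2024, 6, 1, 9, 30, tzinfo=ZoneInfo("America/Los_Angeles"))

# ...corresponds to a UTC value when stored in the feature layer.
stored = local.astimezone(timezone.utc)  # 2024-06-01 16:30 UTC (PDT is UTC-7)
```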
Previewing datasets with complex geometries can consume a large amount of the available memory. If memory thresholds are exceeded, map previews may not render, or the status may change to reconnecting while it recovers. To improve preview performance, consider the following:
- For any geometry type, consider adding a filter to the dataset using the Filter by attribute tool or the Filter by extent tool.
- For polygon geometries, consider generalizing the geometries using the Simplify geometry tool.
To write the full dataset to a feature layer, make sure you remove the filtering or simplification tool before running the data pipeline.
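To see why simplification helps preview performance, consider this naive vertex-thinning sketch: fewer vertices per geometry means less memory per record. Real tools such as Simplify geometry use tolerance-based algorithms (for example, Douglas-Peucker); the step-based thinning here is invented purely for illustration:

```python
# Naive illustration of geometry simplification: keep every other
# vertex plus the endpoints. Not the Simplify geometry algorithm.
def thin_vertices(ring, step=2):
    kept = list(ring[::step])
    if kept[-1] != ring[-1]:  # always keep the closing vertex
        kept.append(ring[-1])
    return kept

dense = [(x, x % 3) for x in range(101)]  # a 101-vertex polyline
sparse = thin_vertices(dense)             # roughly half the vertices
```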
Run a data pipeline
Use the Run button in the data pipeline diagram to run the configured processes. To run a data pipeline, at least one output Feature layer element must be configured. Job results and messages can be accessed from the latest run details console. You can click a result to open the item details page.
To run a data pipeline on an automated schedule, you can create a task. To learn more about creating scheduled tasks for data pipelines, see Schedule a data pipeline task.