Note:
This feature is currently in beta. Share your experience and seek support through the Beta Features Feedback forum in the Data Pipelines Community.
Use records from a Databricks (Beta) table as input to ArcGIS Data Pipelines.
Usage notes
Keep the following in mind when working with Databricks (Beta):
- To use a dataset from Databricks (Beta), you must first create a data store item. Data store items securely store credentials and connection information so the data can be read by Data Pipelines. To create a data store, follow the steps in the Connect to Databricks (Beta) section below.
- To change the data store item you configured, use the Data store item parameter to remove the currently selected item, and choose one of the following options:
- Add data store—Create a new data store item.
- Select item—Browse your content to select an existing data store item.
- Use the Schema parameter to specify the schema that contains the dataset you want to use.
- Use the Table parameter to specify the dataset you want to use. If you're unsure of the schema or table names, the sketch after this list shows one way to look them up in Databricks.
- To improve the performance of reading input datasets, consider the following options:
- Use the Use caching parameter to store a copy of the dataset. The cached copy is only maintained while at least one browser tab with the editor open remains connected. Caching may make it faster to access the data during processing. If the source data has changed since it was cached, uncheck this parameter and preview or run the tool again.
- After configuring an input dataset, use any of the following tools to limit the amount of data being processed:
- Filter by attribute—Maintain a subset of records that contain certain attribute values.
- Filter by extent—Maintain a subset of records within a certain spatial extent.
- Select fields—Maintain only the fields of interest.
- Clip—Maintain a subset of records that intersect with specific geometries.
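The Schema and Table parameters correspond to how Databricks organizes tables within a catalog. If you want to confirm the names before configuring the input, the following minimal sketch lists them by querying Databricks directly with the databricks-sql-connector Python package. This is not part of Data Pipelines; the hostname, HTTP path, token, and schema name are placeholders you would replace with your own values.

```python
from databricks import sql

# Placeholder connection values; use the same ones configured in the data store item.
with sql.connect(
    server_hostname="my_account.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse_id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Schemas available in the connection's catalog.
        cursor.execute("SHOW SCHEMAS")
        print(cursor.fetchall())
        # Tables in a schema of interest (placeholder name).
        cursor.execute("SHOW TABLES IN my_schema")
        print(cursor.fetchall())
```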
Connect to Databricks (Beta)
To use data stored in Databricks (Beta), complete the following steps to create a data store item in the Data Pipelines editor:
- On the Data Pipelines editor toolbar, click Inputs, and choose Databricks (Beta).
The Select a data store connection dialog box appears.
- Choose Add a new data store, and click Next.
The Add a connection to a data store dialog box appears.
- Provide the server URL for the Databricks account, for example, my_account.azuredatabricks.net.
Do not include https:// in the server URL; validation may fail if you do.
- Choose one of the following authentication types:
- OAuth machine-to-machine—Provide the client ID and client secret for your Databricks account.
- Personal access token—Provide the token for your Databricks account.
- In the HTTP path parameter, provide the HTTP path of the Databricks compute resource to use.
It is recommended to use a serverless warehouse; serverless warehouses may be quicker to connect than classic warehouses. A sketch that checks these connection values outside of Data Pipelines is included after these steps.
- In the Catalog (optional) parameter, provide the name of the catalog containing the datasets to use.
If you do not specify a catalog, the data store item connects to the default catalog for your Databricks account.
- Click Next.
The item details pane appears.
- Provide a title for the new data store item.
This title will appear in your content. You can also store the item in a specific folder and provide item tags or a summary.
- Click Create connection to create the data store item.
The Select datasets dialog box appears.
- In the Schema parameter, provide the name of the schema that contains the table to load records from.
- In the Table parameter, provide the name of the table that contains the records to use as input to the data pipeline.
- Click Add.
A Databricks (Beta) element is added to the canvas.
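As a quick check of the connection values used in the steps above (the server URL without https://, the HTTP path, and your credentials), the following minimal sketch connects to the warehouse with the databricks-sql-connector Python package. It assumes personal access token authentication and placeholder values; for OAuth machine-to-machine authentication, the connector would instead be configured with the client ID and client secret (see the Databricks documentation for the connector's OAuth options).

```python
from databricks import sql

SERVER_HOSTNAME = "my_account.azuredatabricks.net"  # server URL, without https://
HTTP_PATH = "/sql/1.0/warehouses/<warehouse_id>"    # HTTP path of the warehouse
ACCESS_TOKEN = "<personal-access-token>"            # personal access token

# Connecting also starts the warehouse if it is stopped,
# so the first attempt can take a few minutes.
with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=HTTP_PATH,
    access_token=ACCESS_TOKEN,
) as connection:
    with connection.cursor() as cursor:
        # Confirm which catalog and schema the connection resolves to by default.
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchone())
```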
Limitations
The following are known limitations:
- If your organization has blocked beta apps and capabilities, you cannot access the Databricks (Beta) input option.
- To work with data from a Databricks table, the Databricks warehouse must first be started. Data Pipelines starts the warehouse when the data is requested, so it may take a few minutes to load records or fields from a Databricks table, depending on how long the warehouse takes to start. For improved performance, it is recommended that you use a serverless warehouse instead of a classic warehouse.
- Field types that cannot be queried using Databricks SQL cannot be used in Data Pipelines. One way to review a table's field types before using it is shown in the sketch after this list.
- To use a data store item to connect to external data sources, you must be the owner of the data store item. Data store items are private and cannot be shared.
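To review a table's field types before adding it as an input, one option is to describe the table with Databricks SQL. The following minimal sketch assumes the databricks-sql-connector Python package and placeholder connection, catalog, schema, and table names.

```python
from databricks import sql

# Placeholder connection values; use the same ones configured in the data store item.
with sql.connect(
    server_hostname="my_account.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse_id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # List each column and its Databricks SQL data type for a placeholder table.
        cursor.execute("DESCRIBE TABLE my_catalog.my_schema.my_table")
        for col_name, data_type, comment in cursor.fetchall():
            print(col_name, data_type)
```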
Licensing requirements
The following licensing and configuration are required:
- Creator or Professional user type
- Publisher, Facilitator, or Administrator role, or an equivalent custom role
To learn more about Data Pipelines requirements, see Requirements.