This lesson will demonstrate how to construct a big data analytic using ArcGIS Velocity. You will assume the role of a transportation planner looking to better understand motor vehicle accidents involving bicyclists over a multi-year period. Your findings will be used to help identify where the construction of new bicycle-friendly infrastructure, such as bicycle lanes or lane barriers, would generate the largest impact for bicyclist safety.
The data used in this lesson can be downloaded from the New York City (NYC) OpenData site. The complete dataset of over 1.5 million records was downloaded from this site in CSV format. For this lesson, the CSV file has been conveniently hosted on a public Amazon S3 bucket, with connection information in the steps below.
As you work through the steps in this lesson, you will learn skills including creating a new big data analytic, creating a data source, configuring a variety of tools, and generating an output feature layer containing analytic results that can be viewed in a web map.
This lesson is designed for beginners. You must have an ArcGIS Online account with access to ArcGIS Velocity. Estimated time: 30 minutes.
Create a new big data analytic
To begin, you will create a new big data analytic using the ArcGIS Velocity application.
- In a web browser, navigate to https://velocity.arcgis.com and sign in with your ArcGIS Online credentials.
For the best experience, use Google Chrome or Mozilla Firefox.
If you encounter issues signing in, contact your ArcGIS Online administrator. You may need to be assigned to an ArcGIS Online role with privileges to use ArcGIS Velocity.
- In the main menu, click Big Data to access the Big Data Analytics page.
From here, you can view existing big data analytics and create new ones, as well as start, stop, edit, clone, and delete them and check their validity and running status.
- Click Create big data analytic to open the options for connecting to a data source.
Configure a data source
When configuring a big data analytic, you must first configure a data source from which to load data that will be analyzed by the big data analytic.
- In the Select a type of data source page, under the Cloud category, click See all to view the cloud data source options.
All big data analytics must have at least one data source as an input.
- From the Cloud options, choose Amazon S3.
- In the Configure Amazon S3 page, for the Configure Amazon S3 Bucket step, set the parameters as follows:
- For Access mode, choose Public.
- For Bucket name, enter the following text:
- For Region, choose US West (Oregon).
- For Folder path (optional), enter the following text:
- For Dataset, enter the following text:
- Click Next to apply the Amazon S3 bucket parameters.
With the Amazon S3 bucket parameters set, you will now confirm the schema of the dataset.
Confirm the data schema
When configuring a data source, it is important to define the schema of the data you are receiving. Velocity makes a best attempt to define the schema when it samples the data, estimating properties such as the data format, field delimiter, field types, and field names.
- In the Confirm Schema step, explore the schema of the dataset returned.
Velocity tested the connection to the data source, sampled the first few data records, and interpreted the schema of the data based on the sampled records. At this point, you can optionally change field types, field names, and data formats to ensure a valid schema. For this lesson, you will accept the default schema properties.
- Without making any changes, click Next to confirm the schema as sampled.
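Sampling-based schema inference can be pictured with a short sketch. The following is a simplified illustration only, not Velocity's actual implementation, and the sample field names are drawn from the NYC dataset for context:

```python
import csv
import io

def infer_field_type(values):
    """Guess a field type from sampled string values (simplified sketch)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "Int32"
    if all_parse(float):
        return "Float64"
    return "String"

def infer_schema(csv_text, sample_size=5):
    """Sample the first few records and infer a type for each column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))[:sample_size]
    return {name: infer_field_type([r[name] for r in rows]) for name in rows[0]}

sample = "LATITUDE,BOROUGH,NUMBER OF CYCLIST INJURED\n40.71,BROOKLYN,1\n40.83,QUEENS,0\n"
print(infer_schema(sample))
# {'LATITUDE': 'Float64', 'BOROUGH': 'String', 'NUMBER OF CYCLIST INJURED': 'Int32'}
```

This is why confirming the schema matters: an inference based on only a few sampled records can be wrong for later rows, so the Confirm Schema step gives you the chance to correct field types before analysis.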
Identify the key fields
Next, you will configure some important fields so Velocity can properly construct geometry, date information, and a unique identifier for the data.
- In the Identify Key Fields step, configure the Location parameters as follows:
- For Location type, choose X/Y fields.
- For X (longitude), choose LONGITUDE.
- For Y (latitude), choose LATITUDE.
- For Z (altitude), choose None.
- For Spatial reference, choose GCS WGS 1984.
- For Does your data have date fields?, choose No.
This parameter can be used to choose a start and end date or date/time field in the data source. If the incoming data has date information in a string format, then a date format is required. For details, see Define date and time properties. For the purpose of the analysis in this lesson, you will not specify any date and time information.
- For Track ID, choose Data does not have a Track ID.
This property can be used to designate a Track ID field in the data source. For details, see Track ID. For the purpose of the big data analysis in this lesson, you will not define a Track ID.
- Click Complete to finish configuring the data source.
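As context for the date-fields option you skipped: when a dataset stores dates as strings, an explicit format pattern tells the system how to interpret them. The sketch below illustrates the idea in Python (the sample value and format are assumptions for illustration, not Velocity configuration):

```python
from datetime import datetime

# Hypothetical example: a string-typed date field holding values like
# "09/23/2021" needs an explicit format pattern to be parsed into a date,
# analogous to the date format Velocity requires for string date fields.
raw = "09/23/2021"
parsed = datetime.strptime(raw, "%m/%d/%Y")
print(parsed.date().isoformat())  # 2021-09-23
```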
Create the big data analytic
With the Amazon S3 data source now configured, the analytic editor will open. In the analytic editor, you can add tools, additional data sources, and outputs which can be used to define the flow and analysis you wish to perform on the data. You will now create a new analytic.
- In the New Big Data Analytic page, click Create analytic.
- In the Create Analytic window, for Title, enter the following text:
NYC Cyclist Accidents
- For Summary, enter the following text:
Process motor vehicle accidents to identify and analyze those involving cyclists
- Click Create Analytic to create the new big data analytic.
Once the analytic is saved, the toolbar at the top of the analytic editor displays additional options and controls for saving, starting, and scheduling the analytic, as well as its run settings.
Add and configure tools in the analytic
With the new analytic created, you will now add tools to the analytic that will perform the big data analysis on the NYC cyclist accident data. With Velocity, you configure an analysis pipeline in which the output of one step is the input to the next. You will configure sequential tools to better understand motor vehicle accidents involving injuries to cyclists.
You will first create a new field called TotalCyclistCasualties which sums the values in the NUMBER OF CYCLIST INJURED and NUMBER OF CYCLIST KILLED fields for each individual record from the data source.
- From the Manage Data folder, choose the Calculate Field tool.
- Configure the Calculate Field tool as follows:
- Ensure New field is chosen.
- For Field, enter the following text:
TotalCyclistCasualties
- For Type, choose Int32, which specifies this will be a 32-bit integer field.
- Click the pencil icon to open the Configure an Arcade expression window.
- In the Expression pane, enter the following text:
$feature["NUMBER OF CYCLIST INJURED"] + $feature["NUMBER OF CYCLIST KILLED"]
- Click OK to save the expression.
- In the Add field calculation column, click Add to add the new field.
- Click Apply to save the Calculate Field tool.
The Calculate Field tool will be added to the analytic after the Amazon S3 data source you configured above.
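The Arcade expression above is a simple per-record sum. In plain Python terms (illustrative only, not Velocity code), the Calculate Field tool does the equivalent of:

```python
def total_cyclist_casualties(record):
    """Equivalent of the Arcade expression: injured + killed for one record."""
    return record["NUMBER OF CYCLIST INJURED"] + record["NUMBER OF CYCLIST KILLED"]

# The tool applies this calculation to every record and stores the result
# in the new TotalCyclistCasualties field.
record = {"NUMBER OF CYCLIST INJURED": 1, "NUMBER OF CYCLIST KILLED": 0}
record["TotalCyclistCasualties"] = total_cyclist_casualties(record)
print(record["TotalCyclistCasualties"])  # 1
```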
With the Calculate Field tool created, you will now filter the NYC motor vehicle accident data to identify the accidents that resulted in a cyclist injury or death, and additionally, only the accidents with valid location coordinates.
- From the Manage Data folder, choose the Filter By Expression tool and configure as follows:
- Click the pencil icon to open the Configure an Arcade expression window.
- In the Expression pane, enter the following expression:
$feature.TotalCyclistCasualties > 0 && $feature.LATITUDE > 0
In this dataset, there are records with invalid coordinates. These records can be ignored by filtering out records where the latitude value is less than or equal to 0.
- Click OK to return to the Filter by Expression tool configuration.
- Click Apply to apply the expression.
The Filter by Expression tool will be added to the analytic editor after the Calculate Field tool you created earlier.
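The filter keeps only records with at least one cyclist casualty and a plausible latitude. As a plain-Python sketch (illustrative only):

```python
def keep_record(record):
    """Mirror of the Arcade filter: a cyclist casualty occurred
    and the record has valid (positive) latitude coordinates."""
    return record["TotalCyclistCasualties"] > 0 and record["LATITUDE"] > 0

records = [
    {"TotalCyclistCasualties": 2, "LATITUDE": 40.71},
    {"TotalCyclistCasualties": 0, "LATITUDE": 40.65},  # no cyclist casualties
    {"TotalCyclistCasualties": 1, "LATITUDE": 0.0},    # invalid coordinates
]
print([r for r in records if keep_record(r)])
# [{'TotalCyclistCasualties': 2, 'LATITUDE': 40.71}]
```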
With the filter added, you will now add another tool that will aggregate points spatially in order to represent the number of accidents involving cyclist injury or death as regular hexagonal bins.
- From the Summarize Data folder, choose the Aggregate Points tool and configure as follows:
- For Aggregate points into, choose Bins.
- For Bin type, choose Hexagon.
- For Bin size, enter: 250
- Leave the unit of measure set to Meters.
- Click Advanced options.
- In the Summary fields section, for Attribute, choose the TotalCyclistCasualties field.
- For Statistic, choose Sum.
- For Output field name leave the default TotalCyclistCasualties_Sum.
- Click Add to add this summary field.
- Click Apply to apply the tool parameters.
The Aggregate Points tool will be added to the analytic editor after the Filter by Expression tool you configured in the previous step.
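Conceptually, hexagonal binning assigns each point to the nearest cell of a hexagonal grid and then sums an attribute per cell. The sketch below uses a standard pointy-top axial-coordinate scheme with cube rounding on planar coordinates; it is a simplified illustration, not Velocity's implementation (which also handles projecting latitude/longitude into a meter-based grid for you):

```python
import math
from collections import defaultdict

def hex_bin(x, y, size):
    """Assign a planar point to a pointy-top hexagonal bin of the given size
    (axial coordinates with cube rounding; simplified sketch)."""
    q = (math.sqrt(3) / 3 * x - 1 / 3 * y) / size
    r = (2 / 3 * y) / size
    # Cube-round the fractional axial coordinates to the nearest hex center.
    xc, zc = q, r
    yc = -xc - zc
    rx, ry, rz = round(xc), round(yc), round(zc)
    dx, dy, dz = abs(rx - xc), abs(ry - yc), abs(rz - zc)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)

def aggregate(points, size=250):
    """Sum TotalCyclistCasualties per hexagonal bin."""
    sums = defaultdict(int)
    for x, y, casualties in points:
        sums[hex_bin(x, y, size)] += casualties
    return dict(sums)

# Two nearby accidents fall into the same 250-unit bin; a distant one
# gets its own bin.
points = [(10.0, 20.0, 1), (15.0, 25.0, 2), (900.0, 900.0, 1)]
print(aggregate(points, size=250))
# {(0, 0): 3, (1, 2): 1}
```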
With the data source and a pipeline of analysis tools configured, you will now add an output that will allow you to visualize the results of the big data analysis in a web map. In this lesson, you will be writing the output to a new feature layer you will create using the steps below.
- In the analytic editor, click Add output to open the output options.
- Click See all under the ArcGIS category.
- Under Feature Layer, choose Feature Layer (new) from the list.
- In the Configure Feature Layer (new) window, on the Configure Feature Layer step, set the following parameters:
- For Data storage method, choose Add new features.
The Keep latest feature storage method would be used if you were working with a data source that had a Track ID defined. With that storage method, each time a new feature is received for a given Track ID, the stored feature associated with that Track ID is replaced by the new feature.
- For Each time the analytic runs, choose Replace existing features and schema.
When Replace existing features and schema is chosen, each time the big data analytic is run, the features and schema in the output feature layer will be overwritten. This can be useful when you are developing a big data analytic and adding, removing, or changing tools in between runs. Alternatively, the Keep existing features and schema option can be useful if you want to append additional records each time the big data analytic is run.
- Click Next to proceed to the next step.
- For Feature layer name, enter the following text:
NYC_Cyclist_Accident_Aggregation
- Click Complete to save the new output.
The new Feature Layer (new) output will be added after the Aggregate Points tool you added previously.
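The difference between the two storage methods can be sketched in a few lines. This is an illustration only (the `vehicle_id` Track ID field is a hypothetical example, not part of this lesson's data):

```python
def add_new_features(store, features):
    """Add new features: every incoming feature is appended."""
    store.extend(features)
    return store

def keep_latest_feature(store, features, track_id_field="vehicle_id"):
    """Keep latest feature: one stored feature per Track ID, newest wins."""
    latest = {f[track_id_field]: f for f in store}
    for f in features:
        latest[f[track_id_field]] = f
    return list(latest.values())

incoming = [{"vehicle_id": "A", "speed": 30}, {"vehicle_id": "A", "speed": 45}]
print(len(add_new_features([], incoming)))    # 2 (both observations kept)
print(len(keep_latest_feature([], incoming))) # 1 (only the latest for "A")
```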
- On the top of the Velocity application, click Save to save the NYC Cyclist Accidents big data analytic.
Start the analytic
At this point, you have successfully configured a big data analytic. The analytic will load millions of records from a delimited text file using a defined schema, process the event records through a variety of tools, and write the analysis output to a new feature layer. Next, you will start the NYC Cyclist Accidents big data analytic.
- On the top of the Velocity application, click Start to start the NYC Cyclist Accidents analytic.
Notice the Start button transitions to Stop Initialization and then to Stop, indicating the analytic has started and is running.
Velocity feeds and real-time analytics remain running once they are started. Big data analytics run until the analysis is completed and then stop. Big data analytics can be configured to run in a recurring manner using the options available from the Schedule dropdown. This means that big data analytics can be run every few minutes or hours, on certain days of the week, or at certain times of the day. For more information on how to schedule a big data analytic, see Schedule recurring big data analysis.
- Monitor the analytic until the Stop button changes to Start once again.
The Stop button changing to Start indicates the analytic has run and is now complete and no longer running. Additionally, you can monitor the status of your big data analytics from the Big Data Analytics page in the Velocity application.
Explore the analytic results in a web map
When you started the big data analytic in the previous section, an output feature layer was created. You will now open that output feature layer in a web map and view the results of the big data analysis on the NYC cyclist accident data.
- In the main menu, click Layers under OUTPUT to open the Layers page.
- Locate the NYC_Cyclist_Accident_Aggregation feature layer in the list and click Open in Map Viewer to view the layer in a web map.
Output layers created by real-time or big data analytics do not appear on the layers page until the analytic has successfully run and generated output.
- Zoom in to the extent of the data in the New York City region.
- Change the basemap to Dark Gray Canvas.
- On the layer, click the Change Style button and for the Choose an attribute to show step, choose COUNT from the dropdown.
- For the Select a drawing style step, choose Counts and Amounts (Color) and click OPTIONS.
- Click Symbols and change the color ramp to a Red / Orange / White color ramp and click OK.
- Check the Classify Data checkbox.
- For the Using dropdown, choose Standard Deviation and set the class size to 1 standard deviation.
- Accept the other default properties and click OK and then Done.
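Standard-deviation classification places class breaks at multiples of the standard deviation above and below the mean of the attribute, so unusually high bins stand out. A minimal sketch of the idea (not Map Viewer's implementation; it uses the population standard deviation for simplicity):

```python
import statistics

def std_dev_breaks(values, interval=1.0, num=2):
    """Class breaks at mean + k * interval * std dev, for k in -num..num."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [mean + k * interval * sd for k in range(-num, num + 1)]

# Hypothetical per-bin casualty sums; the outlier bin (10) lands in the
# highest class, which is what makes hotspots visually pop.
counts = [1, 2, 2, 3, 3, 3, 4, 4, 10]
print([round(b, 2) for b in std_dev_breaks(counts)])
# [-1.35, 1.1, 3.56, 6.01, 8.46]
```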
- Pan and zoom around the web map to explore the results of the big data analysis. Identify areas that had more cyclist-related injuries and deaths compared to those with fewer.
Congratulations! In this lesson, you created and ran a big data analytic that processed millions of motor vehicle accident records to identify the areas in NYC with the highest numbers of cyclist injuries and deaths. With these results, you are now able to make more informed decisions on where new bicycle infrastructure could have the greatest impact.