This lesson demonstrates how to create a big data analytic using ArcGIS Velocity. You will assume the role of a transportation planner looking to better understand motor vehicle accidents involving bicyclists over a multiyear period. Your findings will be used to help identify where the construction of new bicycle-friendly infrastructure, such as bicycle lanes or lane barriers, will have the greatest impact on bicyclist safety.
The data used in this lesson can be downloaded from the New York City (NYC) OpenData site. The complete dataset of more than 1.5 million records was downloaded from this site in CSV format. For this lesson, the CSV file is hosted on a public Amazon S3 bucket, with connection information in the steps below.
As you work through the steps, you will create a big data analytic and a data source, configure a variety of tools, and generate an output feature layer containing analytic results that can be viewed in a web map.
This lesson is designed for beginners. You must have an ArcGIS Online account with access to ArcGIS Velocity. The estimated time to complete this lesson is 30 minutes.
Create a big data analytic
To begin, you will create a big data analytic in ArcGIS Velocity.
- In a web browser, open the ArcGIS Velocity app and sign in with your ArcGIS Online credentials.
For the best experience, use Google Chrome or Mozilla Firefox.
Note:
If you encounter issues signing in, contact your ArcGIS Online administrator. You may need to be assigned an ArcGIS Online role with privileges to use ArcGIS Velocity.
- From the main menu, click Big Data under ANALYTICS to access the Big Data Analytics page.
On this page, you can view existing big data analytics and create new ones. You can also start, stop, edit, clone, and delete an analytic, and check its validity and running status.
- Click Create big data analytic to select a type of data source.
Configure the data source
When configuring a big data analytic, you must first configure the data source that you'll use to load data that will be analyzed by the big data analytic.
- On the Select a type of data source window, click See all in the Cloud category.
Note:
All big data analytics must have at least one data source as an input.
- Under Cloud options, choose Amazon S3.
For details about the cloud providers, see the providers' websites for Azure Blob Store, Azure Cosmos DB, or Amazon S3.
- In the Configure Amazon S3 window, for the Configure Amazon S3 Bucket step, set the parameters as follows:
- For Access mode, choose Public.
- For Bucket name, type arcgis-velocity-public.
- For Region, choose US West (Oregon).
- For Folder path (optional), type /nyc-motor-vehicle-collisions.
- For Dataset, type NYPD_Motor_Vehicle_Collisions.csv.
- Click Next to apply the Amazon S3 bucket parameters.
The data source validates.
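The parameters above identify a single object in a public bucket. As a minimal sketch, assuming the bucket is reachable through the standard virtual-hosted-style S3 endpoint, the same values can be combined into a URL. The function name `public_s3_url` and the `REGIONS` lookup are illustrative and are not part of Velocity; the sketch does not check whether the file is still hosted there.

```python
# Map the region label shown in the Velocity dialog to its AWS region code.
REGIONS = {"US West (Oregon)": "us-west-2"}


def public_s3_url(bucket, region_name, folder_path, dataset):
    """Build a virtual-hosted-style S3 URL (assumed endpoint layout)."""
    region = REGIONS[region_name]
    # Drop leading/trailing slashes so the object key has no empty segments.
    key = f"{folder_path.strip('/')}/{dataset}"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


url = public_s3_url(
    "arcgis-velocity-public",
    "US West (Oregon)",
    "/nyc-motor-vehicle-collisions",
    "NYPD_Motor_Vehicle_Collisions.csv",
)
print(url)
```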
Confirm the data schema
With the Amazon S3 bucket parameters set, you will now confirm the data schema. When configuring a data source, it is important to define the schema of the data being received. Velocity derives the schema by sampling the data, which includes estimating the data format, field delimiter, field types, and field names.
- In the Confirm Schema step, review and confirm the schema of the data.
Velocity tested the connection to the data source, sampled the first few data records, and interpreted the schema of the data based on the sampled records. At this point, you can change data formats, field delimiter, field types, and field names to ensure a valid schema. In this lesson, you will accept the default schema properties.
- Click Next to confirm the schema as sampled.
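To illustrate what sampling a schema involves, here is a minimal sketch that guesses a delimiter and field types from the first few records of a delimited file. It is not Velocity's actual algorithm. The sample rows are made up, the `STREET` column is hypothetical, and the type names `Float64` and `String` are assumed labels (only `Int32` appears in this lesson).

```python
import csv
import io

# Made-up sample records; only the first four field names come from the lesson.
SAMPLE = """LATITUDE,LONGITUDE,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,STREET
40.71,-74.00,1,0,BROADWAY
,,0,0,
40.68,-73.97,0,1,FLATBUSH AVENUE
"""


def infer_type(values):
    """Return the narrowest type that parses every non-empty sampled value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "String"
    for name, cast in (("Int32", int), ("Float64", float)):
        try:
            for value in non_empty:
                cast(value)
            return name
        except ValueError:
            continue
    return "String"


def infer_schema(text, sample_rows=5):
    """Guess the delimiter and field types from the first few records."""
    header_line = text.splitlines()[0]
    # Pick the candidate delimiter that occurs most often in the header line.
    delimiter = max(",;\t|", key=header_line.count)
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [row for _, row in zip(range(sample_rows), reader)]
    columns = zip(*rows)
    return delimiter, {name: infer_type(col) for name, col in zip(header, columns)}


delimiter, schema = infer_schema(SAMPLE)
print(delimiter, schema)
```

Because only a sample is inspected, a later record can contradict the guess, which is why the Confirm Schema step lets you override field types and names.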
Identify the key fields
Next, you will configure key fields so Velocity can properly construct geometry, date information, and a unique identifier for the data.
- In the Identify Key Fields step, configure the Location parameters as follows:
- For Location type, select X/Y fields.
- For X (longitude), select LONGITUDE.
- For Y (latitude), select LATITUDE.
- For Z (altitude), select None.
- For Spatial reference, choose GCS WGS 1984.
- For Does your data have date fields, choose No.
This parameter can be used to set a start and end date or date/time field in the data source. If the incoming data includes date information in a string format, a date format is required. For details, see Define date and time properties. For this lesson, you will not specify date or time information.
- For Track ID, choose Data does not have a Track ID.
This parameter can be used to designate a track ID field in the data source. For details, see Track ID. For this lesson, you will not define a Track ID.
- Click Complete to create the data source.
Create the big data analytic
With the Amazon S3 data source configured, the analytic editor opens. In the analytic editor, you can add tools, data sources, and outputs that can be used to define the flow and analyses to perform on the data. You will now create the big data analytic.
- On the New Big Data Analytic page, click Create analytic.
- In the Create Analytic window, for Title, type NYC Cyclist Accidents.
- For Summary, type Process motor vehicle accidents to identify and analyze those involving cyclists.
- Click Create analytic to create the analytic.
Once the analytic is created, the toolbar at the top of the analytic editor displays additional controls for saving, starting, and scheduling the analytic, as well as its run settings.
Add and configure tools in the analytic
With the new analytic created, you will now add tools to the analytic that will perform the big data analysis on the NYC cyclist accident data. With Velocity, you configure an analysis pipeline in which the output of one step is the input to the next. You will configure sequential tools to better understand motor vehicle accidents involving injuries to cyclists.
First, you will add a field called TotalCyclistCasualties that sums the values in the NUMBER OF CYCLIST INJURED and NUMBER OF CYCLIST KILLED fields for each individual record from the data source.
- In the Manage Data folder, select the Calculate Field tool.
- Configure the Calculate Field tool as follows:
- Choose New field.
- For Field, type:
TotalCyclistCasualties
- For Type, choose Int32.
This specifies it will be a 32-bit integer field.
- Click the Configure Arcade expression button to open the Configure an Arcade expression window.
- In the Expression pane, type:
$feature["NUMBER OF CYCLIST INJURED"] + $feature["NUMBER OF CYCLIST KILLED"]
The result should look similar to the illustration below.
- Click OK to save the expression.
- In the Add field calculation column, click Add to add the new field.
- Click Apply to save the Calculate Field tool.
The Calculate Field tool will be added to the analytic after the Amazon S3 data source you configured above.
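For readers who want to check the calculation outside Velocity, the following is a rough Python equivalent of the Arcade expression, applied to each record in turn. The sample records are made up for illustration, and the helper name `total_cyclist_casualties` is hypothetical.

```python
def total_cyclist_casualties(record):
    """Sum the injured and killed counts for one record, as a whole number."""
    injured = int(record["NUMBER OF CYCLIST INJURED"])
    killed = int(record["NUMBER OF CYCLIST KILLED"])
    return injured + killed


# Made-up records standing in for rows read from the CSV file.
records = [
    {"NUMBER OF CYCLIST INJURED": "1", "NUMBER OF CYCLIST KILLED": "0"},
    {"NUMBER OF CYCLIST INJURED": "0", "NUMBER OF CYCLIST KILLED": "0"},
    {"NUMBER OF CYCLIST INJURED": "2", "NUMBER OF CYCLIST KILLED": "1"},
]

# Add the new field to every record, mirroring the Calculate Field tool.
for record in records:
    record["TotalCyclistCasualties"] = total_cyclist_casualties(record)

print([r["TotalCyclistCasualties"] for r in records])
```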
With the Calculate Field tool added, you will now filter the NYC motor vehicle accident data to identify the accidents with valid location coordinates that resulted in a cyclist injury or death.
- In the Manage Data folder, select the Filter By Expression tool and configure it as follows:
- Click the Configure an Arcade expression button to open the Configure an Arcade expression window.
- In the Expression pane, type:
$feature.TotalCyclistCasualties > 0 && $feature.LATITUDE > 0
Records with invalid coordinates exist in this dataset. These records can be ignored by filtering out records in which the latitude value is less than or equal to 0.
- Click OK to return to the Filter by Expression tool configuration wizard.
- Click Apply to apply the expression.
The Filter by Expression tool will be added to the analytic editor after the Calculate Field tool you created earlier.
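The filter keeps a record only when both conditions hold. A minimal Python sketch of the same logic follows; the sample records are made up, and a latitude of 0 stands in for an invalid coordinate.

```python
def keep_record(record):
    """True when the record has a cyclist casualty and a usable latitude."""
    return record["TotalCyclistCasualties"] > 0 and record["LATITUDE"] > 0


# Made-up records: one valid, one without casualties, one with a bad latitude.
records = [
    {"TotalCyclistCasualties": 1, "LATITUDE": 40.71},
    {"TotalCyclistCasualties": 0, "LATITUDE": 40.68},
    {"TotalCyclistCasualties": 2, "LATITUDE": 0.0},
]

filtered = [record for record in records if keep_record(record)]
print(len(filtered))
```

Checking only the latitude is sufficient for this dataset's area of interest because every valid New York City latitude is positive; a dataset spanning the equator would need a different validity test.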
With the filter added, you will now add another tool that will aggregate points spatially to represent the number of accidents involving cyclist injury or death as regular hexagonal bins.
- In the Summarize Data folder, choose the Aggregate Points tool and configure it as follows:
- For Aggregate points into, select Bins.
- For Bin type, select Hexagon.
- For Bin size, type 250. Leave the unit of measure set to Meters.
- Click Advanced options.
- In the Summary fields section, for Attribute, choose TotalCyclistCasualties.
- For Statistic, choose Sum.
- For Output field name, leave the default TotalCyclistCasualties_Sum.
- Click Add to add the summary field.
- Click Apply to apply the tool parameters.
The Aggregate Points tool will be added to the analytic editor after the Filter by Expression tool you configured in the previous step.
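To make the aggregation step concrete, here is a minimal sketch of hexagonal binning with a per-bin count and sum. It is not the Aggregate Points implementation. It assumes the points are already projected to planar coordinates in meters, and it interprets the bin size as the distance between the centers of adjacent hexagons, which may differ from how Velocity defines it. The function names and sample points are hypothetical.

```python
import math
from collections import defaultdict


def cube_round(q, r):
    """Round fractional axial hex coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the component with the largest rounding error so that
    # the three cube coordinates still sum to zero.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def hex_bin(x, y, bin_size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    # Assumption: bin_size is the center-to-center spacing of adjacent hexagons.
    radius = bin_size / math.sqrt(3)
    q = (math.sqrt(3) / 3 * x - y / 3) / radius
    r = (2 / 3 * y) / radius
    return cube_round(q, r)


def aggregate(points, bin_size=250):
    """Count points and sum casualties per hexagonal bin."""
    bins = defaultdict(lambda: {"Count": 0, "TotalCyclistCasualties_Sum": 0})
    for x, y, casualties in points:
        entry = bins[hex_bin(x, y, bin_size)]
        entry["Count"] += 1
        entry["TotalCyclistCasualties_Sum"] += casualties
    return dict(bins)


# Made-up points as (x meters, y meters, TotalCyclistCasualties).
points = [(0, 0, 1), (10, 10, 2), (240, 5, 1)]
result = aggregate(points)
print(result)
```

Each bin ends up with a `Count` of contributing points and the summed casualties, matching the two attributes the output layer carries for styling.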
Configure an output
With a data source and a pipeline of analysis tools configured, you will now add an output that will allow you to visualize the results of the big data analysis in a web map. You will write the output to a new feature layer that you will create using the steps below.
- In the analytic editor, click Add output to choose an output.
- Click See all in the ArcGIS category.
- Choose Feature Layer and choose Feature Layer (new).
- In the Configure Feature Layer (new) window, for the Configure Feature Layer step, set the following parameters:
- For Data storage method, select Add new features.
If you were working with a data source that had a track ID defined, you'd use the Keep latest feature method. With this storage method, each time a new feature is received for a certain track ID, the stored feature associated with that track ID is replaced by the new feature.
- For Each time the analytic runs, select Replace existing features and schema.
When Replace existing features and schema is used, each time the big data analytic is run, the features and schema in the output feature layer are overwritten. This can be useful when you are developing a big data analytic and adding, removing, or changing tools between running the analytic. Alternatively, the Keep existing features and schema option can be useful when you want to append records each time the big data analytic is run.
- Click Next.
- In the Save step, for Feature layer name, type NYC_Cyclist_Accident_Aggregation.
- Click Complete to save the new output.
The new Feature Layer (new) output will be added after the Aggregate Points tool you added previously.
- At the top of the Velocity app, click Save to save the NYC Cyclist Accidents big data analytic.
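The difference between the two data storage methods described in this section can be sketched with plain Python collections. This is only an illustration of the behavior described above, not how the feature layer stores data; the `track_id` field name and the sample features are hypothetical.

```python
def add_new_features(store, features):
    """Add new features: every incoming feature is appended as a new row."""
    store.extend(features)
    return store


def keep_latest_feature(store, features, track_field="track_id"):
    """Keep latest feature: a newer feature replaces the stored one per track ID."""
    for feature in features:
        store[feature[track_field]] = feature
    return store


# Made-up features; two of them share the same track ID.
incoming = [
    {"track_id": "A", "value": 1},
    {"track_id": "B", "value": 2},
    {"track_id": "A", "value": 3},
]

appended = add_new_features([], incoming)
latest = keep_latest_feature({}, incoming)
print(len(appended), len(latest), latest["A"]["value"])
```

The hexagonal bins produced in this lesson have no track ID, so every bin is simply added as a new feature.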
Start the big data analytic
You have successfully configured a big data analytic. The analytic will load more than 1.5 million records from a delimited text file using a defined schema, process the event records through a series of tools, and write the analysis output to a new feature layer. Next, you will start the NYC Cyclist Accidents big data analytic.
- At the top of the Velocity app, click Start to start the NYC Cyclist Accidents analytic.
The Start button text changes to Stop Initialization and then to Stop, indicating that the analytic has started and is running.
Note:
Velocity feeds and real-time analytics remain running once they are started. Big data analytics run until the analysis is complete, at which point they stop automatically. Big data analytics can be configured to run on a recurring basis using the options available from the Schedule drop-down menu. This means that big data analytics can be run every few minutes or hours, on certain days of the week, or at certain times of the day. For details on scheduling a big data analytic, see Schedule recurring big data analysis.
- Monitor the analytic until the Stop button text changes to Start.
The Stop button changing to Start indicates that the analytic has run, is now complete, and is no longer running. Additionally, you can monitor the status of big data analytics from the Big Data Analytics page in the Velocity app.
Explore the analytic results in a web map
When you started the big data analytic in the previous section, an output feature layer was created. You will now open that output feature layer in a web map and view the results of the big data analysis on the NYC cyclist accident data.
- From the main menu, click Layers under OUTPUT to open the Layers page.
- Find the NYC_Cyclist_Accident_Aggregation feature layer in the list and click Open in map viewer to view the layer in a web map.
Note:
Output layers created by real-time or big data analytics do not appear on the Layers page until the analytic has successfully run and generated output.
- Zoom in to the extent of the data in the New York City area.
- Change the basemap to Dark Gray Canvas.
- On the layer, click the Change Style button and for the Choose an attribute to show step, choose Count from the drop-down menu.
- For the Select a drawing style step, choose Counts and Amounts (Color) and click Options.
- Click Symbols, change the color ramp to a Red/Orange/White color ramp, and click OK.
- Check the Classify Data check box.
- From the Using drop-down menu, choose Standard Deviation and set the class size to 1 standard deviation.
- Accept the other default properties, click OK, and click Done.
- Pan and zoom around the web map to explore the results of the big data analysis. Compare areas that had more cyclist-related injuries and deaths with areas that had fewer.
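As a rough illustration of what a standard deviation classification with a class size of 1 does, the sketch below assigns each bin count to a class based on how many standard deviations it lies from the mean. The exact class breaks used by the map viewer are not documented in this lesson, so placing breaks at whole multiples of the standard deviation is an assumption, and the counts are made up.

```python
import math
import statistics


def std_dev_classes(values, class_size=1.0):
    """Assign each value a class index: 0 covers [mean, mean + 1 class width)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # All values are identical, so they share a single class.
        return [0 for _ in values]
    width = sd * class_size
    return [math.floor((value - mean) / width) for value in values]


# Made-up per-bin counts; one bin is far above the rest.
counts = [1, 2, 3, 4, 10]
classes = std_dev_classes(counts)
print(classes)
```

Bins in the highest classes are the ones that stand out most from the citywide average, which is what the color ramp emphasizes.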
Next steps
In this lesson, you created and ran a big data analytic that processed more than 1.5 million motor vehicle collision records to identify the areas in New York City with the highest numbers of accidents involving cyclists. With these results, you can make informed decisions on where new bicycle infrastructure will have the greatest impact.
Review the following resources as you continue to work with Velocity: Essential ArcGIS Velocity vocabulary, Perform big data analysis, and Use Arcade expressions.