Design a big data analytic

In this lesson you will learn how to create a big data analytic using ArcGIS Velocity. You will assume the role of a transportation planner looking to better understand motor vehicle accidents involving bicyclists over a multiyear period. Your findings will be used to help identify where the construction of new bicycle friendly infrastructure such as bicycle lanes or lane barriers will generate the greatest impact for bicyclist safety.

The data used in this lesson can be downloaded from the New York City (NYC) OpenData site. The complete dataset of more than 1.5 million records was downloaded from this site in CSV format. For this lesson, the CSV file is hosted on a public Amazon S3 bucket, with connection information in the steps below.

As you work through the steps, you will create a big data analytic and a data source, configure a variety of tools, and generate an output feature layer containing analytic results that can be viewed in a web map.

This lesson is designed for beginners. You must have an ArcGIS Online account with access to ArcGIS Velocity. The estimated time to complete this lesson is 30 minutes.

Create a big data analytic

To begin, you will create a big data analytic in ArcGIS Velocity.

  1. In a web browser, open ArcGIS Velocity and sign in with your ArcGIS Online credentials.

    For the best experience, use Google Chrome or Mozilla Firefox.

    Note:

    If you encounter issues signing in, contact your ArcGIS Online administrator. You may need to be assigned an ArcGIS Online role with privileges to use ArcGIS Velocity. For more information on creating roles and assigning users, see the Create roles and assign users topic.

  2. From the main menu, click Big Data under ANALYTICS to access the Big Data Analytics page.

    On the Big Data Analytics page, you can view existing and create new big data analytics as well as start and stop, check status, edit existing, and clone and delete big data analytics.

  3. Click Create big data analytic to open the configuration wizard for creating new big data analytics.

Configure the data source

When configuring a big data analytic, you must first select the type of data source.

  1. In the Select a type of data source window, click See all in the Cloud category.

    Select a type of data source window

    Note:

    All big data analytics must have at least one data source as an input.

  2. Under Cloud options, choose Amazon S3.

    Cloud data source options

    For details about the cloud providers, see the providers websites at Azure Blob Store, Azure Cosmos DB, or Amazon S3.

  3. In the Configure Amazon S3 window, for the Configure Amazon S3 Bucket step, set the parameters as follows:
    1. For Access mode, choose Public.
    2. For Bucket name, type arcgis-velocity-public.
    3. For Region, chooseUS West (Oregon).
    4. For Folder path (optional), type /nyc-motor-vehicle-collisions.
    5. For Dataset, type NYPD_Motor_Vehicle_Collisions.csv.

    Amazon S3 data source configuration wizard

  4. Click Next to apply the Amazon S3 bucket parameters.

    The data source validates and returns sampled event data you will review and confirm in the next section.

Confirm the data schema

With the Amazon S3 bucket parameters set, you will now confirm the data schema. When configuring a data source, it is important to define the schema of the data being loaded. Velocity defines the schema when it samples the source data including the data format, field delimiter, field types, and field names.

  1. In the Confirm Schema step, review and confirm the schema of the data is similar to the below illustration.

    Confirm the schema of the data source

    Velocity tested the connection to the data source, sampled the first few data records, and interpreted the schema of the data based on the sampled records. At this point, you could change the data format, field delimiter, field types, and field names if necessary to ensure a valid schema. However, in this lesson, you will accept the default schema parameters.

  2. Click Next to confirm the schema as sampled.

Identify the key fields

Next, you will configure the key fields so Velocity can properly construct geometry, date information, and a unique identifier for the data.

  1. In the Identify Key Fields step, configure the Location, Date and Time, and Tracking parameters as follows:
    1. For Location type, choose X/Y fields.
    2. For X (longitude), choose LONGITUDE.
    3. For Y (latitude), choose LATITUDE.
    4. For Z (altitude), choose None.
    5. For Spatial reference, accept the default GCS WGS 1984.
    6. For Does your data have date fields?, choose No.

      This parameter can be used to set a start and end date or date/time field in the data source. If the incoming data includes date information in a string format, a date format is required. For more information, see Date and time parameters. For this lesson, you will not specify date or time information.

    7. For Track ID, choose Data does not have a Track ID.

      This parameter can be used to designate a Track ID field in the data source. For more information on Track ID's, see Track ID. For this lesson, you will not define a Track ID.

      Identify key fields in the data source

  2. Click Complete to create the new data source.

    The new Amazon S3 data source is added to the analytic editor.

Create the big data analytic

With the data source now added to the analytic editor, you will now create the big data analytic.

  1. On the New Big Data Analytic page, click Create analytic.
  2. In the Create Analytic window, for Title, type NYC Cyclist Accidents.
  3. For Summary, type Process motor vehicle accidents to identify and analyze those involving cyclists.
  4. For Folder, choose the folder to create the big data analytic in.

    Create Analytic window

  5. Click Create analytic to create the analytic.

    Once the analytic is created, the toolbar at the top of the analytic editor provides additional options and controls for saving, starting, scheduling, as well as run settings for the analytic.

Add tools to the analytic

With the new analytic created, you will now add tools to the analytic that will perform the big data analysis on the NYC cyclist accident data. With Velocity, you can configure an analytic pipeline in which the output of one step is the input to the next. You will now configure sequential tools to better understand motor vehicle accidents involving injuries to cyclists.

First, you will add the Calculate Field tool and then you will add a field called TotalCyclistCasualties that sums the values in the NUMBER_CYCLIST_INJURED and NUMBER_CYCLIST_KILLED fields for each individual record from the data source.

  1. From the Add Node menu on the left, click the Manage Data folder and choose the Calculate Field tool.

    Calculate Field tool in the Manage Data folder

    The Calculate Field tool is added to the analytic editor.

  2. Connect the Amazon S3 data source to the Calculate Field tool.

    Amazon S3 data source connected to the Calculate Field tool

    You must connect the two nodes so the Calculate Field tool knows the source data it will be working with in the next step.

  3. Double-click the Calculate Field tool to access the properties.
  4. Configure the Calculate Field tool as follows:
    1. Ensure the New field option is chosen.
    2. In the Field column, type TotalCyclistCasualties.
    3. In the Type column, click the drop-down and choose Int64.

      This specifies the field type will be a 64-bit integer field.

    4. Click the Configure an Arcade expression button to open the Configure an Arcade expression window.
    5. In the left pane, use the Arcade expression builder or type $feature.NUMBER_CYCLIST_INJURED+$feature.NUMBER_CYCLIST_KILLED.
    6. Click Run to run the Arcade expression.

      The result of the run should look similar to the illustration below.

      Configure an Arcade expression window

    7. Click OK to save the expression.
    8. In the Add field calculation column, click Add to add the new field.

      Configured Calculate Field tool

    9. Click Apply to apply the properties.

      With the Calculate Field tool now configured and connected to the Amazon S3 data source, you will now filter the NYC motor vehicle accident data to identify the accidents with valid location coordinates that resulted in a cyclist injury or death.

  5. In the analytic editor, click Save to save the current big data analytic configuration.
  6. From the Add Node menu, click the Manage Data folder and choose the Filter By Expression tool.

    A new Filter by Expression tool is added to the analytic editor.

  7. Drag and drop the Filter By Expression tool to the right of the Calculate Field tool and connect the two nodes.

    Filter by Expression tool added to the model

  8. Double-click the Filter By Expression tool to open the properties and configure as follows:
    1. Click the Configure an Arcade expression button to open the Configure an Arcade expression window.
    2. In the left pane, use the Arcade expression builder or type $feature.TotalCyclistCasualties>0&$feature.LATITUDE>0.

      Records with invalid coordinates exist in this dataset. These records can be ignored by filtering out records in which the latitude value is less than or equal to 0.

    3. Click Run to run the Arcade expression.

      The result of the run should look similar to the illustration below.

      Configure Arcade expression window

    4. Click OK to return to the Filter by Expression tool properties.
    5. Click Apply to apply the expression.

    With the filter added, you will add another tool, Aggregate Points, that will aggregate points spatially to represent the number of accidents involving cyclist injury or death as regular hexagonal bins.

  9. From the Add Node menu, click the Summarize Data folder and choose the Aggregate Points tool.

    The Aggregate Points tool is added to the analytic editor.

  10. In the analytic editor, click Save to save the updated big data analytic configuration.
  11. Drag and drop the Aggregate Points tool to the right of the Filter By Expression tool and connect the two nodes.

    Filter by Expression tool added to the model

  12. Double-click the Aggregate Points tool to open the properties and configure as follows:
    1. For Aggregate points into, choose Bins.
    2. For Bin type, choose Hexagon.
    3. For Bin size, type 250 and keep the unit of measure set to Meters.
    4. For Summary fields, for Attribute, choose TotalCyclistCasualties.
    5. For Statistic, choose Sum.
    6. For Output field name, leave the default TotalCyclistCasualties_Sum.
    7. Click Add to add the summary field.

      Aggregate Points tool properties

    8. Click Apply to apply the properties.

      You have successfully added three analytic tools that will process the accident data. Next you will add an output.

Add an output to the analytic

With the data source and analytic tools created, the last step is to add an output that will send the processed event data to a feature layer, which can be visualized in a web map.

  1. From the Add Node menu, click the Outputs folder and choose the Feature Layer (new) output.

    The Configure Feature Layer (new) window opens.

  2. On the Configure Feature Layer step, configure the properties as follows:
    1. For Store data in spatiotemporal feature layer option, turn the toggle on.
    2. For Data storage method, choose Add all new features.

      If you were working with a data source that had a Track ID defined, you'd use the Keep only latest feature for each Track ID value method. With this storage method, each time a new feature is received for a certain Track ID, the stored feature associated with that Track ID is replaced by the new feature.

    3. For Each time the analytic runs, choose Replace existing features and schema.

      Configure a new feature layer output

      When Replace existing features and schema is chosen, each time the big data analytic runs, the features and schema in the output feature layer are overwritten. This is useful when you are creating a big data analytic and adding, removing, or changing tools between analytic runs. Alternatively, the Keep existing features and schema option is useful when you want to append records each time the big data analytic runs.

  3. Click Next.
  4. In the Save step, for Feature layer name, type Cyclist_Accident_Aggregation.
  5. For Feature layer summary, type NYC cyclist accident aggregated feature layer.
  6. For Folder, choose the folder to save the feature layer.

    Save the new feature layer output

  7. Click Complete to save the new output.

    The new Cyclist_Accident_Aggregation output is added to the analytic editor.

  8. Drag and drop the Cyclist_Accident_Aggregation output to the right of the Aggregate Points tool and connect the two nodes.

    You can move the nodes to make the model more visually appealing.

    Final configured big data analytic
  9. Click Save to save the NYC Cyclist Accidents big data analytic.

Start the big data analytic

Now that you have successfully configured a big data analytic with all the necessary nodes, you will start the analytic and run it once. The analytic will load over 1.5 million records from the CSV file using a defined schema, process the event data through a variety of tools, and write the analysis output to a new feature layer.

  1. In the analytic editor, click Start to start the NYC Cyclist Accidents big data analytic.

    Start the big data analytic

    The Start button text transitions to Initializing and then to Stop, indicating that the analytic has started and is running.

    Note:

    Feeds and real-time analytics in Velocity remain running once they are started. Big data analytics, on the other hand, run until the analysis is complete and stop automatically. Big data analytics can be configured to run in a recurring manner using the options available from the Schedule drop-down menu in the analytic editor. Options include the ability to run the analytic once, periodically, or at a recurring time. For more information on scheduling big data analytics, see Schedule recurring big data analysis.

  2. Monitor the analytic until the Stop button changes to Start.

    The Stop button changing to Start indicates that the analytic ran once, is now finished, and is no longer running. Additionally, you can monitor the status of big data analytics from the Big Data Analytics page.

Explore the analytic results in a web map

When you started the big data analytic in the previous section, an output feature layer was created. You will now open that output feature layer in a web map and view the results of the big data analysis on the NYC cyclist accident data.

  1. From the main menu, click Layers under OUTPUT to open the Layers page.
  2. Find the Cyclist_Accident_Aggregation feature layer in the list and click Open in map viewer to view the feature layer in a web map.

    Open feature layer in a map viewer

    Note:

    Output layers created by real-time and big data analytics do not appear on the Layers page until the analytic runs and generates output.

  3. Zoom in to the extent of the data in the New York City, USA.
  4. Change the basemap to Dark Gray Canvas.
  5. On the Cyclist Accident Aggregation layer, click Change Style and for Choose an attribute to show, choose COUNT from the drop-down menu.
  6. For Select a drawing style, choose Counts and Amounts (Color) and click OPTIONS.
  7. Click Symbols, change the color ramp to a Red/Orange/White color ramp and click OK.
  8. Check the Classify Data checkbox.
  9. From the Using drop-down menu, choose Standard Deviation and set the class size to 1 standard deviation.
  10. Accept the other default properties, click OK and then click DONE.

    Feature layer added and symbolized in a web map

  11. Pan and zoom around the web map to explore the results of the big data analysis. Identify areas that had more cyclist-related injuries and deaths to those areas with less.

Next steps

In this lesson, you created and ran a big data analytic that analyzed roughly 1.5 million cyclist accidents to identify areas in New York City with the highest number of accidents. With these results, you can make better informed decisions on where new bicycle infrastructure would have the greatest impact.

Additional resources are available as you continue to work with ArcGIS Velocity including What's new in the latest release, Essential ArcGIS Velocity vocabulary, Big data analysis, Real-time analysis, and Use Arcade expressions.