Design a big data analytic

This lesson demonstrates how to construct a big data analytic using ArcGIS Analytics for IoT. You will assume the role of a transportation planner looking to better understand motor vehicle accidents involving bicyclists over a multi-year period. Your findings will be used to help identify where the construction of new bicycle-friendly infrastructure, such as bicycle lanes or lane barriers, would have the largest impact on bicyclist safety.

The data used in this lesson can be downloaded from the New York City (NYC) OpenData site. The complete dataset of over 1.5 million records was downloaded from this site in CSV format. For this lesson, the CSV file has been conveniently hosted on a public Amazon S3 bucket, with connection information in the steps below.

As you work through the steps in this lesson, you will learn skills including creating a new big data analytic, creating a data source, configuring a variety of tools, and generating an output feature layer containing analytic results that can be viewed in a web map.

This lesson is designed for beginners. You must have an ArcGIS account with access to ArcGIS Analytics for IoT. Estimated time: 30 minutes.

Create a new big data analytic

To begin, you will create a new big data analytic using the ArcGIS Analytics for IoT application.

  1. Go to https://iot.arcgis.com and sign in with your ArcGIS Online credentials.

    For the best experience, use Google Chrome or Mozilla Firefox.

    Note:

    If you encounter issues signing in, contact your ArcGIS Online administrator. You may need to be assigned to an ArcGIS Online role with privileges to use ArcGIS Analytics for IoT.

  2. In the main menu on the left, click Big Data to access the Big Data Analytics page.

    From here, you can create new big data analytics and view existing ones. You can also start and stop analytics, check their validity and running status, and edit, clone, or delete them.

  3. Click Create Analytic to launch the analytic configuration wizard.

    Big Data Analytics page

Configure a data source

When configuring a big data analytic, you must first configure a data source from which to load data that will be analyzed by the big data analytic.

  1. In the Select a type of data source page, click See More under the Cloud category.

    Select a type of data source page

    Note:

    All big data analytics must have at least one data source as an input.

  2. Click Amazon S3 to launch the data source configuration wizard.

    Cloud data source options

    For information about the cloud providers, see the Amazon S3 or Azure Blob Store sites.

  3. In the Configure Amazon S3 page, configure the Amazon S3 Bucket Properties as follows:
    1. For Access Mode, choose Public.
    2. For S3 Bucket Name, enter the following text:

      a4iot-public

    3. For Region, choose US West (Oregon).
    4. For Folder Path, enter the following text:

      /nyc-motor-vehicle-collisions

    5. For Dataset, enter the following text:

      NYPD_Motor_Vehicle_Collisions.csv

    6. Click Next to apply the Amazon S3 bucket properties.

    Amazon S3 data source configuration wizard

    With the Amazon S3 bucket properties set, you will now confirm the schema of the dataset.
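For reference, the bucket properties entered above can be combined into a single object URL. The following is a minimal sketch, not part of Analytics for IoT: the helper name `public_s3_url` is hypothetical, `us-west-2` is the AWS region code for US West (Oregon), and the virtual-hosted-style URL pattern is the general Amazon S3 convention. Whether the bucket is still publicly readable at that address is an assumption.

```python
from urllib.parse import quote


def public_s3_url(bucket, region, folder_path, dataset):
    """Build a virtual-hosted-style HTTPS URL for a public S3 object (hypothetical helper)."""
    # Join folder and file name into an object key without leading or trailing slashes.
    parts = [p.strip("/") for p in (folder_path, dataset) if p.strip("/")]
    key = "/".join(parts)
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


url = public_s3_url(
    bucket="a4iot-public",
    region="us-west-2",  # region code for US West (Oregon)
    folder_path="/nyc-motor-vehicle-collisions",
    dataset="NYPD_Motor_Vehicle_Collisions.csv",
)
print(url)
```

The leading slash in the folder path is stripped because an S3 object key does not begin with a slash.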

Confirm the data schema

When configuring a data source, it is important to define the schema of the data you are receiving. Analytics for IoT makes a best attempt to define the schema when it samples the data, including estimating the data format, field delimiter, field types, and field names.

  1. In the Confirm Schema step, explore the schema of the dataset returned.

    Confirm the schema of the data source

    Analytics for IoT tested the connection to the data source, sampled the first few data records, and interpreted the schema of the data based on the sampled records. At this point, you can optionally change field types, field names, and data formats to ensure a valid schema. For this lesson, you will accept the default schema properties.

  2. Without making any changes, click Next to confirm the schema as sampled.
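The idea of sampling a few records and estimating field types can be illustrated with a short sketch. This is not how Analytics for IoT is implemented; it only shows one plausible approach. The sample rows and their values are made up, the `DATE` and `BOROUGH` field names are hypothetical, and the type names `Float64` and `String` are assumed labels (only `Int32` appears in this lesson).

```python
import csv
import io

# Made-up sample rows standing in for the first few records of the CSV file.
SAMPLE = """DATE,BOROUGH,LATITUDE,LONGITUDE,NUMBER OF CYCLIST INJURED
01/01/2019,BROOKLYN,40.6782,-73.9442,1
01/02/2019,QUEENS,40.7282,-73.7949,0
"""


def infer_type(values):
    """Return the narrowest type name that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "String"
    for type_name, caster in (("Int32", int), ("Float64", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "String"


def infer_schema(text, delimiter=","):
    """Map each field name in the header row to an estimated field type."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return {name: infer_type([row[name] for row in rows]) for name in reader.fieldnames}


schema = infer_schema(SAMPLE)
for field, field_type in schema.items():
    print(f"{field}: {field_type}")
```

Because the estimate rests on only a few records, a field that looks like an integer in the sample may contain decimals or text later in the file, which is why reviewing the schema before confirming it is worthwhile.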

Identify the key fields

Next, you will configure some important fields so Analytics for IoT can properly construct geometry, date information, and a unique identifier for the data.

  1. In the Identify Key Fields step, configure the key fields as follows:
    1. For Location Type, choose X/Y fields.
    2. For X (Longitude), choose LONGITUDE.
    3. For Y (Latitude), choose LATITUDE.
    4. For Spatial reference (WKID), enter: 4326.
    5. Choose the GCS WGS 1984 spatial reference from the search results.
    6. For Does your data have date fields?, choose No.

      This property can be used to choose a start and end date or date/time field in the data source. If the incoming data has date information in a string format, then a date format is required. For details, see Define date and time properties. For the purpose of the analysis in this lesson, you will not specify any date and time information.

    7. For Track ID, choose My data does not have a Track ID.

      This property can be used to designate a Track ID field in the data source. For details, see Track ID. For the purpose of the analysis in this lesson, you will not define a Track ID.

      Identify the key fields in the data source

  2. Click Complete to finish configuring the data source.
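Conceptually, the key fields chosen above tell the application how to turn each record into a point feature. The sketch below shows that mapping using the Esri JSON point layout (`x`, `y`, `spatialReference`). The function name `to_point` and the sample record are hypothetical, and the sketch does not claim to reflect the product's internal behavior.

```python
def to_point(record, x_field="LONGITUDE", y_field="LATITUDE", wkid=4326):
    """Build an Esri JSON style point from a record's X/Y fields (hypothetical helper)."""
    x, y = record.get(x_field), record.get(y_field)
    if x is None or y is None:
        return None  # no geometry can be built without both coordinates
    return {"x": float(x), "y": float(y), "spatialReference": {"wkid": wkid}}


# Made-up record using field names from the lesson's dataset.
record = {"LONGITUDE": -73.9442, "LATITUDE": 40.6782}
point = to_point(record)
print(point)
```

WKID 4326 identifies GCS WGS 1984, so `x` holds longitude and `y` holds latitude in decimal degrees.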

Create the analytic

With the Amazon S3 data source now configured, the analytic editor will open. In the analytic editor, you can add tools, additional data sources, and outputs which can be used to define the flow and analysis you wish to perform on the data. You will now create a new analytic.

  1. In the New Big Data Analytic page, click Create Analytic in the upper right.

    Create analytic button

  2. In the Create Analytic window, for Title, enter the following text:

    NYC Cyclist Accidents

  3. For Description, enter the following text:

    Process motor vehicle accidents to identify and analyze those involving cyclists

    Create new analytic window

  4. Click Create Analytic to create the new big data analytic.

    Once the analytic is saved, the toolbar at the top of the analytic editor will display additional options and controls for saving, starting, scheduling, as well as run settings for the analytic.

Add and configure tools in the analytic

With the new analytic created, you will now add tools to the analytic that will perform the big data analysis on the NYC cyclist accident data. With Analytics for IoT, you configure an analysis pipeline in which the output of one step is the input to the next. You will configure sequential tools to better understand motor vehicle accidents involving injuries to cyclists.

You will first create a new field called TotalCyclistCasualties, which sums the values in the NUMBER OF CYCLIST INJURED and NUMBER OF CYCLIST KILLED fields for each individual record from the data source.

  1. Click the Manage Data folder, choose the Calculate Field tool, and configure it as follows:

    Add Calculate Fields tool

    1. Ensure New field is chosen.
    2. For Field, enter the following text:

      TotalCyclistCasualties

    3. For Type, choose Int32, which specifies that this will be a 32-bit integer field.
    4. Click the pencil icon to open the Configure an Arcade expression window.
    5. In the Expression pane, enter the following text:

      $feature["NUMBER OF CYCLIST INJURED"] + $feature["NUMBER OF CYCLIST KILLED"]

      Configure an Arcade expression window

    6. Click OK to save the expression.
    7. In the Add field calculation column, click + to add the new field.

      Configured Calculate Fields tool

    8. Click Apply to save the Calculate Field tool.

      The Calculate Field tool will be added to the analytic after the Amazon S3 data source you configured above.

    With the Calculate Field tool configured, you will now filter the NYC motor vehicle accident data to keep only the accidents that resulted in a cyclist injury or death and that have valid location coordinates.

  2. Click the Manage Data folder, choose the Filter By Expression tool, and configure it as follows:
    1. Click the pencil icon to open the Configure an Arcade expression window.
    2. In the Expression pane, enter the following expression:

      $feature.TotalCyclistCasualties > 0 && $feature.LATITUDE > 0

      In this dataset, there are records with invalid coordinates. These records can be ignored by filtering out records where the latitude value is less than or equal to 0.

    3. Click OK to return to the Filter by Expression tool configuration.
    4. Click Apply to apply the expression.

      Configured Filter by Expression tool

      The Filter by Expression tool will be added to the analytic after the Calculate Field tool you just configured.

    With the filter added, you will now add another tool that will aggregate points spatially in order to represent the number of accidents involving cyclist injury or death as regular hexagonal bins.

  3. Click the Summarize Data folder, then click the Aggregate Points tool and configure as follows:
    1. For Bin type, choose Hexagon.
    2. For Bin size, enter: 250
    3. Leave the unit of measure set to Meters.
    4. Click Show advanced options.
    5. In the Summary Fields section, for Attribute, choose the TotalCyclistCasualties field.
    6. Click the + icon to add this summary field.
    7. For Statistic, choose Sum.
    8. For Output field name, leave the default TotalCyclistCasualties_Sum.
    9. Click Apply to apply the tool settings.

      Configured Aggregate Points tool

      The Aggregate Points tool will be added to the analytic after the Filter by Expression tool you configured in the previous step.
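The three tools configured above form a pipeline: calculate a field, filter records, then aggregate points into hexagonal bins. The following Python sketch mirrors that flow on a handful of made-up records. It is an illustration only, not the product's implementation. Several choices are assumptions: coordinates are converted to meters with a simple equirectangular approximation around a reference latitude of 40.7, hexagons are pointy-topped, and the 250 meter bin size is interpreted as the center-to-corner distance, which may differ from how Analytics for IoT defines it. All function names are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters


def add_total_casualties(record):
    """Calculate Field step: sum the injured and killed counts into a new field."""
    out = dict(record)
    out["TotalCyclistCasualties"] = (
        record["NUMBER OF CYCLIST INJURED"] + record["NUMBER OF CYCLIST KILLED"]
    )
    return out


def keep_record(record):
    """Filter step: at least one casualty and a latitude greater than 0."""
    lat = record.get("LATITUDE")
    return record["TotalCyclistCasualties"] > 0 and lat is not None and lat > 0


def to_meters(lon, lat, ref_lat=40.7):
    """Approximate planar meters (assumption: equirectangular around ref_lat)."""
    x = math.radians(lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y


def hex_bin(x, y, size):
    """Return axial (q, r) indices of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Cube rounding: recompute the coordinate with the largest rounding error.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def aggregate(records, size=250.0):
    """Aggregate step: count records and sum casualties per hexagonal bin."""
    bins = defaultdict(lambda: {"COUNT": 0, "TotalCyclistCasualties_Sum": 0})
    for rec in map(add_total_casualties, records):
        if not keep_record(rec):
            continue
        key = hex_bin(*to_meters(rec["LONGITUDE"], rec["LATITUDE"]), size)
        bins[key]["COUNT"] += 1
        bins[key]["TotalCyclistCasualties_Sum"] += rec["TotalCyclistCasualties"]
    return dict(bins)


def _rec(lon, lat, injured, killed):
    return {"LONGITUDE": lon, "LATITUDE": lat,
            "NUMBER OF CYCLIST INJURED": injured, "NUMBER OF CYCLIST KILLED": killed}


# Made-up records: two at the same location, one far away, two that get filtered out.
records = [_rec(-73.99, 40.73, 1, 0), _rec(-73.99, 40.73, 2, 0),
           _rec(-73.90, 40.85, 1, 0), _rec(0.0, 0.0, 1, 0), _rec(-73.99, 40.73, 0, 0)]
result = aggregate(records)
print(result)
```

Each stage consumes the output of the previous one, which is the same ordering the analytic editor shows: data source, Calculate Field, Filter by Expression, Aggregate Points.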

Configure output

With a data source and a pipeline of analysis tools configured, you will now add an output that will allow you to visualize the results of the big data analysis in a web map. In this lesson, you will write the output to a new feature layer, which you will create using the steps below.

  1. Click the Add output button and, in the ArcGIS category, click See More.
  2. Under ArcGIS options, expand Feature Layer and choose Feature Layer (new) to open the Configure Feature Layer (new) configuration wizard.
  3. In the Feature Layer Options step, for Data Storage Method, choose Add New Features.

    The Keep Latest Feature storage method is intended for data sources that have a Track ID defined. With this storage method, each time a new feature is received for a certain Track ID, the stored feature associated with that Track ID is replaced by the new feature.

  4. For Each time the analytic is run, choose Replace existing features and schema.

    Configure new feature layer output

    When Replace existing features and schema is chosen, each time the big data analytic is run, the features and schema in the output feature layer will be overwritten. This can be useful when you are developing a big data analytic and adding, removing, or changing tools between runs. Alternatively, the Keep existing features and schema option can be useful if you want to append additional records each time the big data analytic is run.

  5. Click Next to proceed to the next step.
  6. For Feature Layer Name, enter the following text:

    NYC_Cyclist_Accident_Aggregation

  7. Click Complete to save the new output.

    Output feature layer name and summary

    The new Feature layer (new) output will be added after the Aggregate Points tool you added previously.

  8. On the top of the Analytics for IoT application, click Save to save the NYC Cyclist Accidents analytic.
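The output options described above differ in how newly produced features relate to features already stored in the layer. The sketch below models that behavior with plain Python lists. It is a conceptual illustration only: the function names and the `"replace"` and `"keep"` labels are hypothetical stand-ins for the options in the wizard, and schema handling is left out.

```python
def write_output(existing, new_features, on_run="replace"):
    """Model the 'Each time the analytic is run' choice (hypothetical labels)."""
    if on_run == "replace":
        # Replace existing features: the layer holds only the latest run's results.
        return list(new_features)
    if on_run == "keep":
        # Keep existing features: the latest run's results are appended.
        return list(existing) + list(new_features)
    raise ValueError(f"unknown on_run option: {on_run!r}")


def keep_latest(existing, new_features, track_id_field):
    """Model Keep Latest Feature: one stored feature per Track ID value."""
    latest = {f[track_id_field]: f for f in existing}
    for feature in new_features:
        latest[feature[track_id_field]] = feature  # newer feature replaces older
    return list(latest.values())


# Made-up features for illustration.
first_run = [{"bin": "A", "COUNT": 3}]
second_run = [{"bin": "A", "COUNT": 4}, {"bin": "B", "COUNT": 1}]

replaced = write_output(first_run, second_run, on_run="replace")
appended = write_output(first_run, second_run, on_run="keep")
print(len(replaced), len(appended))
```

In this lesson the data has no Track ID and the analytic may be rerun while you experiment, which is why Add New Features combined with Replace existing features and schema is the natural pairing.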

Start the analytic

At this point, you have successfully configured a big data analytic. The analytic will load more than 1.5 million records from a delimited text file using a defined schema, process the records through a sequence of tools, and write the analysis output to a new feature layer. Next, you will start the NYC Cyclist Accidents big data analytic.

  1. On the top of the Analytics for IoT application, click Start to start the NYC Cyclist Accidents analytic.

    Start the big data analytic

    Notice the Start button transitions to Initializing and then to Stop, indicating the analytic has started and is running.

    Note:

    Analytics for IoT feeds and real-time analytics remain running once they are started. Big data analytics run until the analysis is completed and then stop. Big data analytics can be configured to run in a recurring manner using the Schedule button. This means that big data analytics can be run every few minutes or hours, on certain days of the week, or at certain times of the day. For more information on how to schedule a big data analytic, see Schedule recurring big data analysis.

  2. Monitor the analytic until the Stop button changes to Start once again.

    The Stop button changing to Start indicates the analytic has run and is now complete and no longer running. Additionally, you can monitor the status of your real-time or big data analytics from the Big Data Analytics page in the application.

Explore the analytic results in a web map

When you started the big data analytic in the previous section, an output feature layer was created. You will now open that output layer in a web map and view the results of the big data analysis on the NYC cyclist accident data.

  1. In the main menu on the left, click Layers under OUTPUT to open the Layers page.
  2. Locate the NYC_Cyclist_Accident_Aggregation feature layer in the list and click the Open in map viewer icon to view the layer in a web map.

    Open output layer in map viewer

    Note:

    Output layers created by real-time or big data analytics do not appear on the layers page until the analytic has successfully run and generated output.

  3. Zoom in to the extent of the data in the New York City region.
  4. Change the basemap to Dark Gray Canvas.
  5. On the layer, click the Change Style button and for the Choose an attribute to show step, choose COUNT from the drop-down.
  6. For the Select a drawing style step, choose Counts and Amounts (Color) and click OPTIONS.
  7. Click Symbols and change the color ramp to a Red / Orange / White color ramp and click OK.
  8. Check the Classify Data checkbox.
  9. For the Using drop-down, choose Standard Deviation and set the class size to 1 standard deviation.
  10. Accept the other default properties and click OK and then Done.

    Results of the big data analysis in a web map

  11. Pan and zoom around the web map to explore the results of the big data analysis. Compare areas that had more cyclist-related injuries and deaths with areas that had fewer.
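The Standard Deviation classification chosen in the styling steps groups bins by how far their value lies from the mean, measured in standard deviations. The sketch below illustrates the idea with made-up bin counts. It assumes class breaks at whole multiples of one standard deviation from the mean and uses the population standard deviation. The map viewer may place breaks differently, so treat this as an approximation of the concept rather than a reproduction of its output.

```python
import math
import statistics


def std_dev_class(value, mean, std):
    """Return the class index: 0 covers [mean, mean + std), -1 covers [mean - std, mean)."""
    if std == 0:
        return 0  # all values are identical, so there is only one class
    return math.floor((value - mean) / std)


counts = [1, 2, 3, 4, 5]  # made-up COUNT values for five bins
mean = statistics.mean(counts)
std = statistics.pstdev(counts)  # population standard deviation (assumption)
classes = [std_dev_class(c, mean, std) for c in counts]
print(classes)
```

Bins in the highest positive classes are the ones that stand out on the map as having unusually many accidents relative to the citywide average.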

Next steps

Congratulations! In this lesson, you created and ran a big data analytic that processed more than 1.5 million motor vehicle collision records to identify the areas in NYC with the highest numbers of accidents involving cyclists. With these results, you can make more informed decisions about where new bicycle infrastructure could have the greatest impact.

What's next? Take a look at the following additional resources as you continue to work with Analytics for IoT: Essential ArcGIS Analytics for IoT vocabulary, Perform big data analysis, and Use Arcade expressions.