Causal Inference Analysis (Spatial Statistics)

Summary

Estimates the causal effect of a continuous exposure variable on a continuous outcome variable by approximating a randomized experiment and controlling for confounding variables.

In statistical experiments, the cause-and-effect relationship between an exposure variable (such as the dose of a drug) and an outcome variable (such as a health outcome) is determined by randomly assigning each participant a particular exposure level so that any differences in the outcomes must be due only to the differences in the exposures and not any other attributes of the participants, such as age, preexisting conditions, and healthcare access. However, it is frequently impossible or unethical to perform controlled experiments, so relationships are often established through observational studies. For example, to study the effect of pollution on depression rates, you cannot intentionally expose individuals to high pollution to see the effect on depression. Instead, you can only observe the exposure to pollution and the depression rates of the individuals in your sample. However, because there are many variables (called confounding variables) that impact both pollution and depression, the causal effect cannot be directly estimated without controlling for these variables.

To emulate the process of a randomized, controlled experiment, the tool calculates propensity scores for each observation, and the propensity scores are used to weight the observations in such a way that the causal relationship between the exposure and outcome variables is maintained, but correlations between the confounding variables and the exposure variable are removed. This weighted dataset is often called a pseudopopulation, and it has analogous properties to a controlled experiment in which each participant is randomly assigned an exposure. Using the weighted observations, the tool creates an exposure-response function (ERF) that estimates what the average outcome would be if all members of the population received a given exposure value but did not change their confounding variables.
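The role of the balancing weights can be illustrated with a short conceptual sketch. The following Python code (using NumPy, with simulated exposure, outcome, and balancing weight values as stand-ins) estimates a weighted exposure-response curve with a simple Gaussian kernel average; it is a simplified illustration of the pseudopopulation idea, not the exact estimator used by the tool.

import numpy as np

rng = np.random.default_rng(0)
exposure = rng.uniform(0, 10, 500)                       # stand-in exposure values
outcome = 2.0 + 0.5 * exposure + rng.normal(0, 1, 500)   # stand-in outcome values
weights = rng.uniform(0.5, 2.0, 500)                     # stand-in balancing weights

def weighted_erf(grid, exposure, outcome, weights, bandwidth=1.0):
    """Weighted kernel average of the outcome at each exposure value in grid,
    approximating the average outcome of the pseudopopulation at that exposure."""
    curve = []
    for a in grid:
        kernel = np.exp(-0.5 * ((exposure - a) / bandwidth) ** 2)  # Gaussian kernel
        w = kernel * weights                                       # combine kernel and balancing weights
        curve.append(np.sum(w * outcome) / np.sum(w))
    return np.array(curve)

grid = np.linspace(exposure.min(), exposure.max(), 200)
print(weighted_erf(grid, exposure, outcome, weights)[:5])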

Learn more about how Causal Inference Analysis works

Illustration

Causal Inference Analysis tool illustration
The causal effect between an exposure and outcome is estimated by balancing confounding variables.

Usage

  • In causal inference analysis, it is assumed that all important confounding variables are included in the model. This means that if any variables that impact the exposure and outcome variables are not included as confounding variables, the estimate of the causal effect will be biased. The tool cannot determine whether all important confounding variables have been included, so it is critical that you consider which variables might be related to your exposure and outcome variables and include them in the model. If there are important confounding variables that are not available for inclusion, you should interpret the results of the tool with extreme caution and skepticism or consider not using the tool until you can acquire data for all confounding variables.

  • The exposure variable must be continuous (not binary or categorical), but the confounding variables can be continuous, categorical, or binary. It is recommended that the outcome variable be continuous, but binary outcome variables are allowed and can often be interpreted as probabilities or proportions.

  • The tool accepts both spatial and nonspatial input datasets. You can use tables, points, polygons, and polylines as input, and the output will be the same type as the input.

  • The primary output of the tool is an ERF, which is returned as a scatter plot chart on the output features and as an image in the geoprocessing messages. You can also create a table containing various exposure-response values using the Output Exposure-Response Function Table parameter.

    Learn more about the ERF

  • The Propensity Score Calculation Method parameter allows you to specify how propensity scores will be estimated. Propensity scores are the likelihoods (or probabilities) of receiving a particular exposure value, given a set of confounding variables. Propensity scores are estimated by creating a model that predicts the exposure variable from the confounding variables. The following propensity score calculation methods are available:

    • Regression—OLS regression will be used to estimate the propensity scores. This is the default.
    • Gradient boosting—Gradient boosted regression trees will be used to estimate the propensity scores.

  • The Balancing Method parameter allows you to specify how the propensity scores will be used to balance the confounding variables. Two balancing methods are available:

    • Propensity score matching—Each observation is matched with various other observations that have similar propensity scores but different exposure values. By comparing the outcome value of the observation to the outcome values of the matches, you can see what the observation's outcome value might have been if it had a different exposure. After matching all observations to various other observations, each observation is assigned a balancing weight equal to the number of times the observation was matched to any other observation. The reasoning behind this weighting scheme is that observations with high match counts have confounding variables that were common across many values of the exposure variable, so they are most representative of the causal effect.
    • Inverse propensity score weighting—Balancing weights are assigned to each observation by inverting the propensity score and multiplying by the overall probability of having the given exposure. This provides higher balancing weights to observations with low propensity scores and lower balancing weights to observations with high propensity scores. The reasoning behind this weighting scheme is that the propensity scores are a measure of how common or uncommon the exposure value is for the particular set of confounding variables. By increasing the influence (increasing the balancing weight) of uncommon observations (observations with low propensity scores) and decreasing the influence of common observations, the overall distributions of confounding variables are kept in proportion across all values of the exposure variable. A minimal sketch of this weighting scheme is shown after this list.

    Learn more about propensity scores, propensity score matching, and inverse propensity score weighting
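    The following sketch illustrates the inverse propensity score weighting scheme for a continuous exposure, assuming (as with the tool's regression option) that the propensity scores come from an OLS model of the exposure on a confounding variable. All data are simulated and NumPy is used for the calculations; it is a conceptual illustration, not the tool's exact implementation.

import numpy as np

rng = np.random.default_rng(1)
n = 500
confounder = rng.normal(0, 1, n)
exposure = 1.5 * confounder + rng.normal(0, 1, n)   # the confounder influences the exposure

# Fit exposure ~ confounder by OLS and estimate the residual standard deviation.
X = np.column_stack([np.ones(n), confounder])
beta, *_ = np.linalg.lstsq(X, exposure, rcond=None)
sigma = (exposure - X @ beta).std(ddof=X.shape[1])

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Propensity score: likelihood of the observed exposure, given the confounder.
propensity = normal_pdf(exposure, X @ beta, sigma)

# Overall (marginal) likelihood of the observed exposure, ignoring the confounder.
marginal = normal_pdf(exposure, exposure.mean(), exposure.std(ddof=1))

# Balancing weight: invert the propensity score and multiply by the overall
# likelihood, so uncommon exposure values receive larger weights.
balancing_weight = marginal / propensity
print(balancing_weight.min(), balancing_weight.max())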

  • By default, the tool trims (removes from analysis) observations that have the top and bottom 1 percent of exposure values. Extreme values or outliers in the exposure variable can introduce bias to causal inference analyses. By trimming these extreme values, you can reduce the impact of influential observations that may distort the estimation of causal effect. You can change the amount of exposure trimming using the Lower Exposure Quantile and Upper Exposure Quantile parameters. You can also trim observations based on their propensity scores using the Lower Propensity Score Quantile and Upper Propensity Score Quantile parameters, but no propensity score trimming is performed by default. When using inverse propensity score weighting, it is often necessary to trim some of the lowest propensity scores because propensity scores close to zero can produce large and unstable balancing weights.
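    As a quick illustration of quantile trimming (a minimal sketch assuming the exposure values are in a NumPy array):

import numpy as np

exposure = np.random.default_rng(2).lognormal(size=1000)   # stand-in exposure values

lower, upper = np.quantile(exposure, [0.01, 0.99])         # default 1 percent tails
keep = (exposure >= lower) & (exposure <= upper)           # observations retained in the analysis
print(int((~keep).sum()), "observations trimmed")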

  • The output features or table will contain fields for the propensity scores and balancing weights, and a field indicating whether the feature was trimmed (0 means that the feature was trimmed, and 1 means that the feature was included in the analysis). Copies of the exposure, outcome, and confounding variables are also included.

  • Achieving balance between the confounding variables and the exposure variable is key to deriving the causal relationship between the exposure and outcome variables. To determine whether the balancing weights effectively balance the confounding variables, the tool calculates weighted correlations between each confounding variable and the exposure variable (weighted by the balancing weights). The weighted correlations are then aggregated and compared to a threshold value. If the aggregated correlation is less than the threshold, the confounding variables are considered balanced. You can specify the aggregation type (mean, median, or maximum absolute correlation) using the Balance Type parameter and provide the threshold value in the Balance Threshold parameter. By default, the tool will use the mean absolute correlation and a threshold value of 0.1. Using 0.1 as a threshold is a common convention, but the threshold value should be tailored to align with domain expertise, research objectives, and the intrinsic characteristics of the population being studied. A lower threshold value indicates less tolerance for bias in the estimation of the causal effect; however, it is more difficult to achieve balance with lower thresholds.
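    The balance check can be sketched as follows, assuming continuous confounding variables in a NumPy array, an exposure array, and balancing weights from either balancing method (all values below are simulated stand-ins):

import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation between x and y."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    sx = np.sqrt(np.average((x - mx) ** 2, weights=w))
    sy = np.sqrt(np.average((y - my) ** 2, weights=w))
    return cov / (sx * sy)

rng = np.random.default_rng(3)
confounders = rng.normal(size=(500, 3))                  # stand-in confounding variables
exposure = confounders @ [0.5, -0.3, 0.2] + rng.normal(size=500)
weights = rng.uniform(0.5, 2.0, 500)                     # stand-in balancing weights

abs_corrs = [abs(weighted_corr(confounders[:, j], exposure, weights))
             for j in range(confounders.shape[1])]

# Mean balance type with the default 0.1 balance threshold.
print("Balanced:", np.mean(abs_corrs) < 0.1)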

  • If the balancing weights do not sufficiently balance the confounding variables, the tool will return an error and not produce an ERF; however, various messages are displayed with information about how effectively the confounding variables were balanced. It is recommended that you first attempt to resolve the error through the selection of confounding variables and different options for the Propensity Score Calculation Method and Balancing Method parameters. If the error is still not resolved, you can use a different option for the Balance Type parameter or increase the value of the Balance Threshold parameter to produce an ERF, but this may introduce bias into the estimation of the causal effect. One way to work through these options programmatically is sketched below.

    Learn more about achieving balanced confounding variables
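    The following script sketches one way to work through balancing failures by trying the propensity score calculation and balancing methods in the recommended order. Only documented parameter keywords are used; the workspace, layer, and field names are hypothetical.

import arcpy

arcpy.env.workspace = r"c:\data\health.gdb"

# Try the propensity score calculation and balancing methods in the
# recommended order until the confounding variables are balanced; the tool
# raises an error if balance is not achieved.
attempts = [
    ("REGRESSION", "MATCHING"),
    ("REGRESSION", "WEIGHTING"),
    ("GRADIENT_BOOSTING", "MATCHING"),
    ("GRADIENT_BOOSTING", "WEIGHTING"),
]

for ps_method, balancing_method in attempts:
    try:
        arcpy.stats.CausalInferenceAnalysis(
            in_features="tracts",
            outcome_field="depression_rate",
            exposure_field="pm25",
            confounding_variables="median_income false;land_use true",
            out_features="tracts_causal",
            ps_method=ps_method,
            balancing_method=balancing_method,
        )
        print(f"Balance achieved with {ps_method} and {balancing_method}")
        break
    except arcpy.ExecuteError:
        # Review the balance diagnostics in the messages before the next attempt.
        print(arcpy.GetMessages())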

  • The confounding variables should contain a variety of values across the entire range of the exposure variable. For categorical confounding variables, there should be a wide range of exposure values within each level of the category, and there can be no more than 60 categories in each categorical variable. For propensity score matching, if there is not enough variation of the exposure variable across all values of each confounding variable, it will be difficult to achieve balance.
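    For example, you can quickly check the number of categories and the spread of exposure values within each category before running the tool (a minimal sketch using pandas with hypothetical field names):

import pandas as pd

df = pd.DataFrame({
    "exposure": [2.1, 3.5, 8.0, 1.2, 7.4, 6.6, 0.9, 5.3],
    "land_use": ["urban", "urban", "rural", "rural", "urban", "rural", "urban", "rural"],
})

print("Number of categories:", df["land_use"].nunique())                # must not exceed 60
print(df.groupby("land_use")["exposure"].agg(["min", "max", "count"]))  # exposure spread per category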

  • The Target Outcome Values for Calculating New Exposures parameter can be used to explore what-if scenarios (sometimes called counterfactual scenarios) for each observation. Using a local ERF for each observation, the tool calculates the exposure level that the observation would need to achieve the desired outcome. For example, you can estimate the pollution level that each county would need to bring its asthma hospitalization rate under a given target. If target outcome values are provided, the output features or table will include two additional fields for each target outcome: one for the new exposure value, and the other for the difference between the new and current exposure value. If there are multiple exposure values that would produce the target outcome, the tool will use the one that is closest to the current exposure value of the observation. Similarly, you can also provide target exposure values in the Target Exposure Values for Calculating New Outcomes parameter to investigate how the outcome variable might change locally for various target exposures.

    If an output ERF table is created, any target outcome or target exposure values will be appended to the end of the table. If there are multiple solutions for a target outcome, all solutions will be included in the table. A sketch of providing target outcome and target exposure values is shown below.
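    The following call sketches how both target parameters might be provided, continuing the air quality example; the layer and field names are hypothetical.

import arcpy

arcpy.stats.CausalInferenceAnalysis(
    in_features="counties",
    outcome_field="asthma_hosp_rate",
    exposure_field="air_quality_index",
    confounding_variables="median_income false;smoking_rate false",
    out_features="counties_causal",
    out_erf_table="counties_erf",
    target_outcomes=[0.01, 0.005, 0.001],   # adds new exposure and change-in-exposure fields
    target_exposures=[25, 50, 75],          # adds new outcome and change-in-outcome fields
)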

  • If the Enable Exposure-Response Function Pop-ups parameter is checked, local exposure-response functions will be created for each observation. The local ERFs display as charts in the pop-ups of the output features or table. Creating local ERFs requires the additional assumption of a fixed treatment effect, which is often violated for variables such as race, income, and gender.

    Learn more about local ERF estimation and assumptions

    Caution:

    If there are many observations, creating pop-ups can be memory and computationally intensive. It is recommended that you run the tool without enabling pop-ups in the exploratory stages of modeling and only create them once all other tool parameters have been determined.

  • A common misconception is that the causal effect can be estimated solely by including the confounding variables as explanatory variables in a predictive model such as the Generalized Linear Regression or Forest-based and Boosted Classification and Regression tools. However, this is only true when all explanatory variables are independent of the exposure variable and all relevant variables are included in the model. Because the variables in most datasets are mutually related, the causal effect cannot be estimated directly in this way.

  • The general methodology of the tool is based on the following references:

    • Khoshnevis, Naeem, Xiao Wu, and Danielle Braun. 2023. "CausalGPS: Matching on Generalized Propensity Scores with Continuous Exposures." R package version 0.4.0. https://CRAN.R-project.org/package=CausalGPS.
    • Wu, Xiao, Fabrizia Mealli, Marianthi-Anna Kioumourtzoglou, Francesca Dominici, and Danielle Braun. 2022. "Matching on Generalized Propensity Scores with Continuous Exposures." Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2022.2144737.

Parameters

LabelExplanationData Type
Input Features or Table

The input features or table containing fields of the exposure, outcome, and confounding variables.

Feature Layer; Table View
Outcome Field

The numeric field of the outcome variable. This is the variable that responds to changes in the exposure variable. The outcome variable must be continuous or binary (not categorical).

Field
Exposure Field

The numeric field of the exposure variable (sometimes called the treatment variable). This is the variable that causes changes in the outcome variable. The exposure variable must be continuous (not binary or categorical).

Field
Confounding Variables

The fields of the confounding variables. These are the variables that are related to both the exposure and outcome variables, and they must be balanced in order to estimate the causal effect between the exposure and outcome variables. The confounding variables can be continuous, categorical, or binary. Text fields must be categorical, integer fields can be either categorical or continuous, and other numeric fields must be continuous.

For the exposure-response function to be unbiased, all variables that are related to the exposure and outcome variables must be included as confounding variables.

Value Table
Output Features or Table

The output features or table containing the propensity scores, balancing weights, and a field indicating whether the feature was trimmed (excluded from the analysis). The exposure, outcome, and confounding variables are also included.

Feature Class; Table
Propensity Score Calculation Method
(Optional)

Specifies the method that will be used for calculating the propensity scores of each observation.

The propensity score of an observation is the likelihood (or probability) of receiving the observed exposure value, given the values of the confounding variables. Large propensity scores mean that the exposure is common for individuals with the associated confounding variables, and low propensity scores mean that the exposure value is uncommon for individuals with the confounding variables. For example, if an individual has high blood pressure (exposure) but has no risk factors for high blood pressure (confounders), this individual would have a low propensity score because it is uncommon to have high blood pressure without any risk factors. Conversely, high blood pressure for an individual with many risk factors would have a larger propensity score because it is more common.

Propensity scores are estimated by a statistical model that predicts the exposure variable using the confounding variables as explanatory variables. You can use an OLS regression model or a machine learning model that uses gradient boosted regression trees. It is recommended that you first use regression and only use gradient boosting if regression fails to balance the confounding variables.

  • Regression—OLS regression will be used to estimate the propensity scores. This is the default.
  • Gradient boosting—Gradient boosted regression trees will be used to estimate the propensity scores.
String
Balancing Method
(Optional)

Specifies the method that will be used for balancing the confounding variables.

Each method estimates a set of balancing weights that removes the correlation between the confounding variables and the exposure variable. It is recommended that you first use matching and only use inverse propensity score weighting if matching fails to balance the confounding variables. Inverse propensity score weighting will calculate faster than propensity score matching, so it is also recommended when the calculation time of matching is not feasible for the data.

  • Propensity score matching—Propensity score matching will be used to balance the confounding variables. This is the default.
  • Inverse propensity score weighting—Inverse propensity score weighting will be used to balance the confounding variables.
String
Enable Exposure-Response Function Pop-ups
(Optional)

Specifies whether pop-up charts that display the local ERF for the observation will be created for each observation.

  • Checked—Local ERF pop-up charts will be created on the output features or table.
  • Unchecked—Local ERF pop-up charts will not be created on the output features or table. This is the default.
Boolean
Output Exposure-Response Function Table
(Optional)

A table containing values of the exposure-response function. The table will contain 200 evenly spaced exposure values between the minimum and maximum exposure (after trimming) along with the estimated response from the exposure-response function. The response field represents the average value of the outcome variable if all members of the population received the associated exposure value. If bootstrapped confidence intervals are created, additional fields will be created containing the upper and lower bounds of the confidence interval for the exposure value, as well as the standard deviation and number of samples used to construct the confidence interval. If any target outcome or exposure values are provided, they will be appended to the end of the table.

Table
Target Outcome Values for Calculating New Exposures
(Optional)

A list of target outcome values. For each observation, the tool will calculate the change in exposure that is required to achieve each target outcome. For example, if the exposure variable is an air quality index and the outcome variable is the annual asthma hospitalization rate of counties, you can determine how much the air quality index needs to decrease to achieve asthma hospitalization rates below 0.01, 0.005, and 0.001. For each provided target outcome value, two new fields will be created on the output. The first field contains the exposure value that would result in the target outcome, and the second field contains the required change in the exposure variable to produce the target outcome (positive values indicate that the exposure needs to increase, and negative values indicate that the exposure needs to decrease). In some cases, there will be no solution for some observations, so you should only provide target outcomes that are feasible to achieve by changing the exposure variable. For example, there is no PM2.5 level that can result in an asthma hospitalization rate of zero, so using a target outcome equal to zero will result in no solutions. If there are multiple exposure values that would result in the target outcome, the one that requires the smallest change in exposure will be used.

If an output exposure-response function table is created, it will include any target outcome values and the associated exposure values appended to the end of the table. If there are multiple solutions, multiple records will be appended to the table with repeated outcome values.

If local ERF pop-up charts are created, the target outcomes and associated exposure values will be shown in the pop-ups of each observation.

Double
Target Exposure Values for Calculating New Outcomes
(Optional)

A list of target exposure values that will be used to calculate new outcomes for each observation. For each target exposure value, the tool estimates the new outcome value that the observation would receive if its exposure variable was changed to the target exposure. For example, if the exposure variable is an air quality index and the outcome variable is the annual asthma hospitalization rate of counties, you can estimate how the hospitalization rate for each observation would change for different levels of air quality. For each provided target exposure value, two new fields will be created on the output. The first field contains the estimated outcome value if the observation received the target exposure, and the second field contains the estimated change in the outcome variable (positive values indicate that the outcome variable will increase, and negative values indicate that the outcome variable will decrease). The target exposures must be within the range of the exposure variable after trimming.

If an output exposure-response function table is created, it will include any target exposure values and the associated response values appended to the end of the table.

If local ERF pop-up charts are created, the target exposure values and associated outcomes will be shown in the pop-ups of each feature.

Double
Lower Exposure Quantile
(Optional)

The lower quantile that will be used to trim the exposure variable. Any observations with exposure values below this quantile will be excluded from the analysis before estimating propensity scores. The value must be between 0 and 1. The default is 0.01, meaning that the bottom 1 percent of exposure values will be trimmed. It is recommended that you trim some of the lowest exposure values to improve the estimation of propensity scores.

Double
Upper Exposure Quantile
(Optional)

The upper quantile that will be used to trim the exposure variable. Any observations with exposure values above this quantile will be excluded from the analysis before estimating propensity scores. The value must be between 0 and 1. The default is 0.99, meaning that the top 1 percent of exposure values will be trimmed. It is recommended that you trim some of the highest exposure values to improve the estimation of propensity scores.

Double
Lower Propensity Score Quantile
(Optional)

The lower quantile that will be used to trim the propensity scores. Any observations with propensity scores below this quantile will be excluded from the analysis before performing propensity score matching or inverse propensity score weighting. The value must be between 0 and 1. The default is 0, meaning that no trimming will be performed.

Lower propensity score trimming is often needed when using inverse propensity score weighting. Propensity scores near zero can produce large and unstable balancing weights.

Double
Upper Propensity Score Quantile
(Optional)

The upper quantile that will be used to trim the propensity scores. Any observations with propensity scores above this quantile will be excluded from the analysis before performing propensity score matching or inverse propensity score weighting. The value must be between 0 and 1. The default is 1, meaning that no trimming will be performed.

Double
Number of Exposure Bins
(Optional)

The number of exposure bins that will be used for propensity score matching. In matching, the exposure variable is divided into evenly spaced bins (equal intervals), and matching is performed within each bin. At least two exposure bins are required, and it is recommended that at least five exposure values be included within each bin. If no value is provided, the value will be estimated while the tool runs and displayed in the messages.

Long
Relative Weight of Propensity to Exposure
(Optional)

The relative weight (sometimes called the scale) of the propensity score to the exposure variable that will be used when performing propensity score matching. Within each exposure bin, matches are determined using the differences in the propensity scores and in the values of the exposure variable. This parameter specifies how to prioritize the two criteria. For example, a value equal to 0.5 means that the propensity score and exposure variables are given equal weight when finding matching observations.

If no value is provided, the value will be estimated while the tool runs and displayed in the messages. The value that will provide the best balance is difficult to predict, so it is recommended that you allow the tool to estimate the value. You can provide a manual value to reduce the calculation time or to reproduce prior results. If the resulting exposure-response function shows vertical bands of observations with large weights, increasing the relative weight may provide a more realistic and accurate exposure-response function.

Double
Balance Type
(Optional)

Specifies the method that will be used to determine whether the confounding variables are balanced. After estimating weights with propensity score matching or inverse propensity score weighting, weighted correlations are calculated for each confounding variable. If the mean, median, or maximum absolute correlation is less than the balance threshold, the confounding variables are considered balanced, meaning they are sufficiently uncorrelated with the exposure variable.

  • Mean—Confounding variables will be considered balanced if the mean absolute correlation is less than the balance threshold. This is the default.
  • Median—Confounding variables will be considered balanced if the median absolute correlation is less than the balance threshold.
  • Maximum—Confounding variables will be considered balanced if the maximum absolute correlation is less than the balance threshold.
String
Balance Threshold
(Optional)

The threshold value that will be compared to the weighted correlations of the confounding variables to determine if they are balanced. The value must be between 0 and 1. A larger balance threshold indicates a larger tolerance for imbalance in the confounding variables and bias in the exposure-response function. The default is 0.1.

Double
Bandwidth Estimation Method
(Optional)

Specifies the method that will be used to estimate the bandwidth of the exposure-response function.

  • Plug-in—A plug-in method will be used to estimate the bandwidth. This is the default.
  • Cross validation—The bandwidth that minimizes the mean square cross validation error will be used.
  • Manual—A custom bandwidth will be used.
String
Bandwidth
(Optional)

The bandwidth value of the exposure-response function when using a manual bandwidth.

Double
Create Bootstrapped Confidence Intervals
(Optional)

Specifies whether 95 percent confidence intervals for the exposure-response function will be created using M-out-of-N bootstrapping. The confidence intervals will appear in the output graphics layer as dashed lines above and below the exposure-response function.

  • Checked—Confidence intervals for the exposure-response function will be created.
  • Unchecked—Confidence intervals for the exposure-response function will not be created. This is the default.
Boolean

arcpy.stats.CausalInferenceAnalysis(in_features, outcome_field, exposure_field, confounding_variables, out_features, {ps_method}, {balancing_method}, {enable_erf_popups}, {out_erf_table}, {target_outcomes}, {target_exposures}, {lower_exp_trim}, {upper_exp_trim}, {lower_ps_trim}, {upper_ps_trim}, {num_bins}, {scale}, {balance_type}, {balance_threshold}, {bw_method}, {bandwidth}, {create_bootstrap_ci})
NameExplanationData Type
in_features

The input features or table containing fields of the exposure, outcome, and confounding variables.

Feature Layer; Table View
outcome_field

The numeric field of the outcome variable. This is the variable that responds to changes in the exposure variable. The outcome variable must be continuous or binary (not categorical).

Field
exposure_field

The numeric field of the exposure variable (sometimes called the treatment variable). This is the variable that causes changes in the outcome variable. The exposure variable must be continuous (not binary or categorical).

Field
confounding_variables
[[var1, cat1], [var2, cat2],...]

The fields of the confounding variables. These are the variables that are related to both the exposure and outcome variables, and they must be balanced in order to estimate the causal effect between the exposure and outcome variables. The confounding variables can be continuous, categorical, or binary. Text fields must be categorical, integer fields can be either categorical or continuous, and other numeric fields must be continuous.

For the exposure-response function to be unbiased, all variables that are related to the exposure and outcome variables must be included as confounding variables.

Value Table
out_features

The output features or table containing the propensity scores, balancing weights, and a field indicating whether the feature was trimmed (excluded from the analysis). The exposure, outcome, and confounding variables are also included.

Feature Class; Table
ps_method
(Optional)

Specifies the method that will be used for calculating the propensity scores of each observation.

The propensity score of an observation is the likelihood (or probability) of receiving the observed exposure value, given the values of the confounding variables. Large propensity scores mean that the exposure is common for individuals with the associated confounding variables, and low propensity scores mean that the exposure value is uncommon for individuals with the confounding variables. For example, if an individual has high blood pressure (exposure) but has no risk factors for high blood pressure (confounders), this individual would have a low propensity score because it is uncommon to have high blood pressure without any risk factors. Conversely, high blood pressure for an individual with many risk factors would have a larger propensity score because it is more common.

Propensity scores are estimated by a statistical model that predicts the exposure variable using the confounding variables as explanatory variables. You can use an OLS regression model or a machine learning model that uses gradient boosted regression trees. It is recommended that you first use regression and only use gradient boosting if regression fails to balance the confounding variables.

  • REGRESSION—OLS regression will be used to estimate the propensity scores. This is the default.
  • GRADIENT_BOOSTING—Gradient boosted regression trees will be used to estimate the propensity scores.
String
balancing_method
(Optional)

Specifies the method that will be used for balancing the confounding variables.

Each method estimates a set of balancing weights that removes the correlation between the confounding variables and the exposure variable. It is recommended that you first use matching and only use inverse propensity score weighting if matching fails to balance the confounding variables. Inverse propensity score weighting will calculate faster than propensity score matching, so it is also recommended when the calculation time of matching is not feasible for the data.

  • MATCHING—Propensity score matching will be used to balance the confounding variables. This is the default.
  • WEIGHTING—Inverse propensity score weighting will be used to balance the confounding variables.
String
enable_erf_popups
(Optional)

Specifies whether pop-up charts that display the local ERF for the observation will be created for each observation.

  • CREATE_POPUP—Local ERF pop-up charts will be created on the output features or table.
  • NO_POPUP—Local ERF pop-up charts will not be created on the output features or table. This is the default.
Boolean
out_erf_table
(Optional)

A table containing values of the exposure-response function. The table will contain 200 evenly spaced exposure values between the minimum and maximum exposure (after trimming) along with the estimated response from the exposure-response function. The response field represents the average value of the outcome variable if all members of the population received the associated exposure value. If bootstrapped confidence intervals are created, additional fields will be created containing the upper and lower bounds of the confidence interval for the exposure value, as well as the standard deviation and number of samples used to construct the confidence interval. If any target outcome or exposure values are provided, they will be appended to the end of the table.

Table
target_outcomes
[target_outcomes,...]
(Optional)

A list of target outcome values. For each observation, the tool will calculate the change in exposure that is required to achieve each target outcome. For example, if the exposure variable is an air quality index and the outcome variable is the annual asthma hospitalization rate of counties, you can determine how much the air quality index needs to decrease to achieve asthma hospitalization rates below 0.01, 0.005, and 0.001. For each provided target outcome value, two new fields will be created on the output. The first field contains the exposure value that would result in the target outcome, and the second field contains the required change in the exposure variable to produce the target outcome (positive values indicate that the exposure needs to increase, and negative values indicate that the exposure needs to decrease). In some cases, there will be no solution for some observations, so you should only provide target outcomes that are feasible to achieve by changing the exposure variable. For example, there is no PM2.5 level that can result in an asthma hospitalization rate of zero, so using a target outcome equal to zero will result in no solutions. If there are multiple exposure values that would result in the target outcome, the one that requires the smallest change in exposure will be used.

If an output exposure-response function table is created, it will include any target outcome values and the associated exposure values appended to the end of the table. If there are multiple solutions, multiple records will be appended to the table with repeated outcome values.

If local ERF pop-up charts are created, the target outcomes and associated exposure values will be shown in the pop-ups of each observation.

Double
target_exposures
[target_exposures,...]
(Optional)

A list of target exposure values that will be used to calculate new outcomes for each observation. For each target exposure value, the tool estimates the new outcome value that the observation would receive if its exposure variable was changed to the target exposure. For example, if the exposure variable is an air quality index and the outcome variable is the annual asthma hospitalization rate of counties, you can estimate how the hospitalization rate for each observation would change for different levels of air quality. For each provided target exposure value, two new fields will be created on the output. The first field contains the estimated outcome value if the observation received the target exposure, and the second field contains the estimated change in the outcome variable (positive values indicate that the outcome variable will increase, and negative values indicate that the outcome variable will decrease). The target exposures must be within the range of the exposure variable after trimming.

If an output exposure-response function table is created, it will include any target exposure values and the associated response values appended to the end of the table.

If local ERF pop-up charts are created, the target exposure values and associated outcomes will be shown in the pop-ups of each feature.

Double
lower_exp_trim
(Optional)

The lower quantile that will be used to trim the exposure variable. Any observations with exposure values below this quantile will be excluded from the analysis before estimating propensity scores. The value must be between 0 and 1. The default is 0.01, meaning that the bottom 1 percent of exposure values will be trimmed. It is recommended that you trim some of the lowest exposure values to improve the estimation of propensity scores.

Double
upper_exp_trim
(Optional)

The upper quantile that will be used to trim the exposure variable. Any observations with exposure values above this quantile will be excluded from the analysis before estimating propensity scores. The value must be between 0 and 1. The default is 0.99, meaning that the top 1 percent of exposure values will be trimmed. It is recommended that you trim some of the highest exposure values to improve the estimation of propensity scores.

Double
lower_ps_trim
(Optional)

The lower quantile that will be used to trim the propensity scores. Any observations with propensity scores below this quantile will be excluded from the analysis before performing propensity score matching or inverse propensity score weighting. The value must be between 0 and 1. The default is 0, meaning that no trimming will be performed.

Lower propensity score trimming is often needed when using inverse propensity score weighting. Propensity scores near zero can produce large and unstable balancing weights.

Double
upper_ps_trim
(Optional)

The upper quantile that will be used to trim the propensity scores. Any observations with propensity scores above this quantile will be excluded from the analysis before performing propensity score matching or inverse propensity score weighting. The value must be between 0 and 1. The default is 1, meaning that no trimming will be performed.

Double
num_bins
(Optional)

The number of exposure bins that will be used for propensity score matching. In matching, the exposure variable is divided into evenly spaced bins (equal intervals), and matching is performed within each bin. At least two exposure bins are required, and it is recommended that at least five exposure values be included within each bin. If no value is provided, the value will be estimated while the tool runs and displayed in the messages.

Long
scale
(Optional)

The relative weight (sometimes called the scale) of the propensity score to the exposure variable that will be used when performing propensity score matching. Within each exposure bin, matches are determined using the differences in the propensity scores and in the values of the exposure variable. This parameter specifies how to prioritize the two criteria. For example, a value equal to 0.5 means that the propensity score and exposure variables are given equal weight when finding matching observations.

If no value is provided, the value will be estimated while the tool runs and displayed in the messages. The value that will provide the best balance is difficult to predict, so it is recommended that you allow the tool to estimate the value. You can provide a manual value to reduce the calculation time or to reproduce prior results. If the resulting exposure-response function shows vertical bands of observations with large weights, increasing the relative weight may provide a more realistic and accurate exposure-response function.

Double
balance_type
(Optional)

Specifies the method that will be used to determine whether the confounding variables are balanced. After estimating weights with propensity score matching or inverse propensity score weighting, weighted correlations are calculated for each confounding variable. If the mean, median, or maximum absolute correlation is less than the balance threshold, the confounding variables are considered balanced, meaning they are sufficiently uncorrelated with the exposure variable.

  • MEAN—Confounding variables will be considered balanced if the mean absolute correlation is less than the balance threshold. This is the default.
  • MEDIAN—Confounding variables will be considered balanced if the median absolute correlation is less than the balance threshold.
  • MAXIMUM—Confounding variables will be considered balanced if the maximum absolute correlation is less than the balance threshold.
String
balance_threshold
(Optional)

The threshold value that will be compared to the weighted correlations of the confounding variables to determine if they are balanced. The value must be between 0 and 1. A larger balance threshold indicates a larger tolerance for imbalance in the confounding variables and bias in the exposure-response function. The default is 0.1.

Double
bw_method
(Optional)

Specifies the method that will be used to estimate the bandwidth of the exposure-response function.

  • PLUG_IN—A plug-in method will be used to estimate the bandwidth. This is the default.
  • CV—The bandwidth that minimizes the mean square cross validation error will be used.
  • MANUAL—A custom bandwidth will be used.
String
bandwidth
(Optional)

The bandwidth value of the exposure-response function when using a manual bandwidth.

Double
create_bootstrap_ci
(Optional)

Specifies whether 95 percent confidence intervals for the exposure-response function will be created using M-out-of-N bootstrapping.

  • CREATE_CI—Confidence intervals for the exposure-response function will be created.
  • NO_CI—Confidence intervals for the exposure-response function will not be created. This is the default.
Boolean

Code sample

CausalInferenceAnalysis example 1 (Python window)

The following Python script demonstrates how to use the CausalInferenceAnalysis function.

import arcpy
arcpy.stats.CausalInferenceAnalysis(
    in_features="crop_locations",
    outcome_field="corn_yield",
    exposure_field="fertilizer",
    confounding_variables="soil_type true;temperature false",
    out_features=r"CausalInference_corn_yield",
    ps_method="REGRESSION",
    balancing_method="MATCHING",
    enable_erf_popups="CREATE_POPUP",
    out_erf_table=r"erftable",
    target_outcomes=[],
    target_exposures=[],
    lower_exp_trim=0.01,
    upper_exp_trim=0.99,
    lower_ps_trim=0,
    upper_ps_trim=1,
    num_bins=None,
    scale=None,
    balance_type="MEAN",
    balance_threshold=0.1,
    bw_method="PLUG_IN",
    create_bootstrap_ci="CREATE_CI"
)
CausalInferenceAnalysis example 2 (stand-alone script)

The following Python script demonstrates how to use the CausalInferenceAnalysis function.

# Estimate the causal effect between fertilizer amount 
# and corn yield using soil type and temperature as
# confounding variables.

# Import required modules.
import arcpy

# Set the workspace.
arcpy.env.workspace = "c:/data/crops.gdb"

# Run Causal Inference Analysis tool with gradient boosting
# and inverse propensity score weighting.
try:
    arcpy.stats.CausalInferenceAnalysis(
        in_features="crop_locations",
        outcome_field="corn_yield",
        exposure_field="fertilizer",
        confounding_variables="soil_type true;temperature false",
        out_features=r"CausalInference_corn_yield",
        ps_method="GRADIENT_BOOSTING",
        balancing_method="WEIGHTING",
        enable_erf_popups="CREATE_POPUP",
        out_erf_table=r"erftable",
        target_outcomes=[],
        target_exposures=[],
        lower_exp_trim=0.01,
        upper_exp_trim=0.99,
        lower_ps_trim=0,
        upper_ps_trim=1,
        num_bins=None,
        scale=None,
        balance_type="MEAN",
        balance_threshold=0.1,
        bw_method="PLUG_IN",
        create_bootstrap_ci="CREATE_CI"
    )

except arcpy.ExecuteError:
    # If an error occurred when running the tool, print the error message.
    print(arcpy.GetMessages())

Related topics