Link analysis is a technique that focuses on relationships and connections in a dataset. Link analysis allows you to calculate centrality measures—namely degree, betweenness, closeness, and eigenvector—and visualize the connections on a link chart or link map.
Link analysis uses a network of interconnected links and nodes to identify and analyze relationships that are not easily discerned in raw data. Common types of networks include the following:
- Social networks that show who talks to whom
- Semantic networks that illustrate topics that are related to each other
- Conflict networks indicating alliances of connections between players
- Airline networks indicating which airports have connecting flights
Examples
The following are example scenarios for using link analysis:
- A crime analyst is investigating a criminal network. Data from cell phone records can be used to determine the relationship and hierarchy between members of the network.
- A credit card company is developing a new system to detect credit card theft. The system uses the known patterns of transactions for each client—such as the city, stores, and types of transactions—to identify anomalies and alert the client of a potential theft.
- A public health analyst is researching the opioid crisis in North America. The analyst uses prescription and demographic data to identify new patterns that are emerging as the crisis spreads.
How link analysis works
The following table provides an overview of the terminology in link analysis:
Term | Description | Examples |
---|---|---|
Network | A set of interconnected nodes and links. | An online social network that uses a network of profiles and relationships to connect users. Airline networks that use a network of airports and flights to transport travelers from their origin to their destination. |
Node | A point or vertex that represents an object, such as a person, place, crime type, or tweet. The node can also include associated properties. | The profiles in a social network. Associated properties may include the user's name, home town, or employer. The airports in an airline network. Associated properties may include the airport name. |
Link | The relationships or connections between nodes. The link can also include associated properties. | The relationship between profiles in the network, such as friend, follower, or connection. Associated properties may include the length of the relationship. The flights between airports in an airline network. Associated properties may include the number of flights between airports. |
Centrality
Centrality is a measure of importance for nodes in a network.
Overall centrality is used for the following purposes:
- Evaluate the influence of a node over other nodes in the network. For example, which user will reach the most users when sharing a piece of news or a job opportunity?
- Identify the nodes that are most influenced by other nodes. For example, which airport will be most affected by canceled flights from a storm in a different region?
- Observe the flow or spread of something throughout the network, including information, objects, or phenomena. For example, how does a package move from the warehouse to the delivery address?
- Understand which nodes spread phenomena through the network most efficiently. For example, which newspaper or channel should be contacted so the story reaches the most people?
- Locate nodes that can block or prevent the spread of phenomena. For example, where should vaccination clinics be located to stop the spread of a virus?
There are four ways to measure centrality in Insights: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality.
Calculations for betweenness, closeness, and eigenvector centralities can be weighted or unweighted.
Degree centrality
Degree centrality is based on the number of direct connections a node has. Use degree centrality when you want to determine which nodes have the most direct influence. For example, in a social network, the users with the most connections have a higher degree centrality.
Degree centrality of node x is calculated using the following equation:
degCentrality(x)=deg(x)/(NodesTotal-1)
where:
- NodesTotal = The number of nodes in the network
- deg(x) = The number of nodes connected to node x
If the links are directed, meaning that information flows between nodes in one direction only, the degree centrality can be measured either as indegree or outdegree. In the case of a social network, the indegree would be based on the number of profiles the user is following, and the outdegree would be based on the number of followers the user has.
Indegree centrality is calculated using the following equation:
indegCentrality(x)=indeg(x)/(NodesTotal-1)
where:
- NodesTotal = The number of nodes in the network
- indeg(x) = The number of nodes connected to node x with flow directed toward node x
Outdegree centrality is calculated using the following equation:
outdegCentrality(x)=outdeg(x)/(NodesTotal-1)
where:
- NodesTotal = The number of nodes in the network
- outdeg(x) = The number of nodes connected to node x with flow directed away from node x
For directed graphs, Insights sizes nodes by outdegree centrality by default.
Betweenness centrality
Betweenness centrality is based on the extent a node is part of the shortest path between other nodes. Use betweenness centrality when you want to determine which nodes are used to connect other nodes to each other. For example, a user in a social network with connections to multiple groups of friends will have a higher betweenness centrality than users with connections in only one group.
Betweenness centrality of node x is calculated using the following equation:
btwCentrality(x)=Σa,bϵNodes(pathsa,b(x)/pathsa,b)
where:
- Nodes = All the nodes in the network
- pathsa,b = The number of shortest paths between all nodes a and b
- pathsa,b(x) = The number of shortest paths between nodes a and b that connect through node x
The betweenness centrality equation above does not account for the size of the network, so large networks will tend to have greater betweenness centrality values than small networks. To allow comparisons between networks of different sizes, the betweenness centrality equation must be normalized by dividing by the number of node pairs in the chart.
The following equation is used to normalize an undirected chart:
1/2(NodesTotal-1)(NodesTotal-2)
where:
- NodesTotal = The number of nodes in the network
The following equation is used to normalize a directed chart:
(NodesTotal-1)(NodesTotal-2)
where:
- NodesTotal = The number of nodes in the network
Closeness centrality
Closeness centrality is based on the average of the shortest network path distance between nodes. Use closeness centrality when you want to determine which nodes are most closely associated to the other nodes in the network. For example, a user with more connections in the social network will have a higher closeness centrality than a user who is connected through other people (in other words, a friend of a friend).
Note:
The distance between nodes refers to the number of links separating them, not the geographical distance.
Closeness centrality of node x is calculated using the following equation:
closeCentrality(x)=(nodes(x,y)/(NodesTotal-1))*(nodes(x,y)/dist(x,y)Total)
where:
- NodesTotal = The number of nodes in the network
- nodes(x,y) = The number of nodes that are connected to node x
- dist(x,y)Total = The sum of the shortest path distances from node x to other nodes
Eigenvector centrality
Eigenvector centrality is based on important nodes being connected to other important nodes. Use eigenvector centrality when you want to determine which nodes are part of a cluster of influence. For example, a user in a social network with many connections to other users with many connections will have a higher eigenvector centrality than a user with few connections, or who is connected to other users with few connections.
Eigenvector centrality of node x is calculated using power iteration to find the largest eigenvector using the following equation:
Ax=λx
where:
- λ = The eigenvalue
- x = The eigenvector
- A = The matrix describing the linear transformation
Edge weight
Calculations for closeness, betweenness, and eigenvector centralities can be weighted or unweighted. An unweighted centrality calculation sets edges to a uniform weight with a value of 1, and a weighted calculation uses field values to assign a value to each edge.
Note:
Undefined weights are assigned a value of 1. It is best practice to assign a field without null or missing values for the edge weight.
For eigenvector centrality, weights are used to determine the strength of the connection between nodes. Since eigenvector centrality measures the importance of nodes within the network, higher weight values correspond to higher values for their connecting nodes.
For closeness and betweenness centralities, weight values signify the distance between nodes. Higher edge weights mean a larger distance between nodes and reduce the likelihood of the edge being used in the shortest path. If a higher number in the desired weight field indicates increased importance (for example, the number of messages sent between members in a social network indicates how connected members are), a new field must be calculated with inverse values. Use the following equation to calculate a field of inverse values:
weight=ABS(field-MAX(field))+IF(MIN(field)<0, ABS(MIN(field)), MIN(field))
For an unweighted closeness or betweenness calculation, the shortest path is the path that uses the least number of links. The example below shows a network with four nodes (A, B, C, and D) and uniform weights. There are two paths that join node A to node D: A-B-D or A-B-C-D. Since A-B-D has fewer links, it is the shortest path.
A weighted calculation applies weights to each edge based on field values. Weighted closeness and betweenness centralities use the Bellman-Ford algorithm to find the shortest paths between nodes.
The example below shows a network with four nodes and weighted edges. Path A-B-D has a value of 15 and path A-B-C-D has a value of 9. Since A-B-C-D has the lowest edge value, it is the shortest path.
Weighted closeness and betweenness centrality calculations do not support negative weight cycles. If a negative weight cycle is detected, all centrality values are set to 0. A negative weight cycle can occur in the following circumstances:
- The graph contains a negative cycle.
- The graph contains a negative self-loop.
- The graph is undirected and contains a negative edge.
Resources
Use the following resources to learn more about link analysis: