Link Scoring & Prediction Algorithms

Link scoring is the task of measuring the strength of existing links. Link prediction is the task of estimating the the likelihood of existence of a link between a pair of disconnected nodes. The techniques applied in both problems are similar and they are based on the idea that the likelihood of two nodes interacting can be evaluated based on their neighborhood within the network.

NOESIS provides implementations for multiple link scoring and prediction algorithms. Roughly speaking, these algorithms have been divided into two categories according to the amount of information in the network that they consider: local and global. Examples of how to use these algorithms are provided below.

Scoring links

In NOESIS, links can be scored using the LinkScorer class. The constructor of this class takes the network for which the links will be scored and the algorithm that will be used to compute the scores. The supported link scoring algorithms are described below. For example, to use the resource allocation algorithm, first we need to instantiate it as follows:

NodePairMeasureTask measure = new KatzScore(network);

Then, this node-pair metric is used to create a link scorer as follows:

LinkScorer scorer = new LinkScorer(network, measure);

Once the scorer has been instantiated, links can be scored using the compute method. The following example shows how to score the link between a pair of nodes with indices 15 and 28:

double score = scorer.compute(15, 28);

Predicting links

To perform link prediction using NOESIS, we have to instantiate the link predictor that we want to use. For example, if we want to compute the likelihood of link between a pair of nodes in a given network using random walks, we would instantiate RandomWalkScore as follows:

NodePairMeasureTask predictor = new RandomWalkScore(network);

The instantiated predictor provides the call method, which computes and returns a matrix. This matrix contains the computed scores, so that the (i,j)-element is the score associated to the pair of nodes with indices i and j. An example of how to use this method is provided in the following code snippet:

Matrix matrix = predictor.call();

Creating a custom link scorer and predictor

New link-related algorithms can be created by extending the NodePairMeasureTask class. Extending this class requires implementing the compute method, which must return the matrix of computed scores. An example is shown below:

public class MyLinkMeasure extends NodePairMeasureTask 
{
  public MyLinkMeasure(Network network) 
  {
    super(network);
  }

  @Override
  public void compute() 
  {
    Network network = getNetwork();
    int numNodes = network.nodes();
    Matrix matrix = new DenseMatrix(numNodes, numNodes);  // Warning: Quadratic space requirements!!!
    
    ... // Implement your link scoring & prediction algorithm here 
    
    setResult(matrix);
  }
}

The implemented class can be used as link scorer and as link predictor, as shown in the previous sections. On the one hand, to compute a score proportional to the likelihood of link existence between a pair of nodes with indices 12 and 24, we would use the call method as follows:

MyLinkMeasure measure = new MyLinkMeasure(network);
Matrix scores = measure.call();
double score = scores.get(12, 14);

On the other hand, this algorithm can be used for link scoring by creating and using a link scorer as follows:

MyLinkScorer measure = new MyLinkScorer(network);
LinkScorer scorer = new LinkScorer(network, measure);
double score = scorer.compute(15, 28);

The score variable contains the score computed for the link between the pair of nodes with indices 15 and 28.


Local link scoring & prediction

Local algorithms use only neighborhood information in the network. As a result of this, these algorithms are efficient and scalable.


Common neighbors score

The common neighbors score defines the similarity between pairs of nodes as the number of shared neighbors between both nodes.

The following example shows how to use the CommonNeighborsScore class, which implements this metric.

Network network = ...
CommonNeighborsScore commonNeighborsScorer = new CommonNeighborsScore(network);
Matrix scores = commonNeighborsScorer.call();

References:

  • Liben-Nowell, D. & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7), 1019-1031. DOI: 10.1145/956863.956972.
  • Newman, M. E. (2001). The structure of scientific collaboration networks. Proceedings of the national academy of sciences, 98(2), 404-409. DOI: 10.1073/pnas.98.2.404.


Adamic-Adar score

This metric defines the similarity between a pair of nodes according to the number of shared neighbors. However, in contrast to the common neighbors score, the Adamic-Adar score weighs the contribution of each shared neighbor inversely proportional to the logarithm of its degree.

An example of how to use this scorer is shown below:

Network network = ...
AdamicAdarScore adamicAdarScorer = new AdamicAdarScore(network);
Matrix scores = adamicAdarScorer.call();

References:

  • Adamic, L. A. & Adar, E. (2003). Friends and neighbors on the web. Social networks, 25(3), 211-230. DOI: 10.1016/S0378-8733(03)00009-1.


Resource allocation score

This score, which is motivated by the resource allocation process that takes place in complex networks, models the transmission of units of resources between pairs of unconnected nodes through their neighbors. This metric is similar to the Adamic-Adar score, but the weighting is computed directly according to the degree.

The following example shows how to use the resource allocation score:

Network network = ...
ResourceAllocationScore ra = new ResourceAllocationScore(network);
Matrix scores = ra.call();

References:

  • Zhou, T., Lü, L. & Zhang, Y. C. (2009). Predicting missing links via local information. The European Physical Journal B, 71(4), 623-630. DOI: 10.1140/epjb/e2009-00335-8.


Jaccard score

The Jaccard score is computed as the size of the intersection between the set of neighbors of both nodes, divided by the size of their union.

An example of how to use the JaccardScore class to compute this metric is shown below:

Network network = ...
JaccardScore jaccard = new JaccardScore(network);
Matrix scores = jaccard.call();


Preferential attachment score

The node degrees follow a power law distribution in many real world networks, also known as scale-free networks. The preferential attachment score is derived from results in the study of scale-free networks, and measures the interaction between a pair of nodes as the product of their degrees. Therefore, this metric prioritizes links between pairs of high-degree nodes.

The following example shows how to use the preferential attachment score:

Network network = ...
PreferentialAttachmentScore pa = new PreferentialAttachmentScore(network);
Matrix scores = pa.call();

References:

  • Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509-512. DOI: 10.1126/science.286.5439.509.
  • Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2), 226-251. DOI: 10.1080/15427951.2004.10129088.


Salton score

This score, also known as the cosine similarity, is closely related to the Jaccard index. It is defined as the size of the intersection between the sets of neighbors of both nodes, divided by the root of the product of their size.

The Salton score can be computed using the SaltonScore class, as shown below:

Network network = ...
SaltonScore salton = new SaltonScore(network);
Matrix scores = salton.call();

References:

  • Gerard Salton, Michael McGill: Introduction to Modern Information Retrieval. McGraw-Hill Book Company 1984, ISBN 0-07-054484-0.


Sørensen score

This score, which was initially proposed to compare the similarity between different ecological community data samples, is closely related to the Jaccard score. This metric is proportional to the number of shared neighbors, divided by the sum of the degree of both nodes. Some authors suggest that the Sørensen score is less sensitive to outliers than the Jaccard score.

The following code snippet shows how to compute the Sørensen score:

Network network = ...
SorensenScore sorensen = new SorensenScore(network);
Matrix scores = sorensen.call();

References:

  • Sørensen, T. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr., 5, 1-34.


Hub-promoted score

This score, which was proposed as a result of studying modularity in metabolic networks, models behaviors that lead to hierarchically structured networks. This similarity score avoids link formation between hub nodes and promotes link formation between low-degree nodes and hubs. This leads to small highly internally connected groups of nodes that are also highly isolated from other groups.

This local link scoring & prediction method can be used as follows:

Network network = ...
HubPromotedScore hp = new HubPromotedScore(network);
Matrix scores = hp.call();

References:

  • Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., & Barabási, A. L. (2002). Hierarchical organization of modularity in metabolic networks. science, 297(5586), 1551-1555. DOI: 10.1126/science.1073374.


Hub-depressed score

This metric has the opposite goal of the hub promoted score. It promotes link formation between hubs and between low-degree nodes, but not between hubs and low-degree nodes.

This score, which is implemented by the HubDepressedScore class, can be used as follows:

Network network = ...
HubDepressedScore hd = new HubDepressedScore(network);
Matrix scores = hd.call();

References:

  • Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., & Barabási, A. L. (2002). Hierarchical organization of modularity in metabolic networks. science, 297(5586), 1551-1555. DOI: 10.1126/science.1073374.


Local Leicht-Holme-Newman score

This score is defined as the ratio of length two paths between a pair of nodes and a value proportional to the expected number of length two paths between them.

This score can be computed as shown below:

Network network = ...
LocalLeichtHolmeNewmanScore llhn = new LocalLeichtHolmeNewmanScore(network);
Matrix scores = llhn.call();

References:

  • Leicht, E. A., Holme, P., & Newman, M. E. (2006). Vertex similarity in networks. Physical Review E, 73(2), 026120. DOI: 10.1103/PhysRevE.73.026120.


Global link scoring & prediction

In contrast to local methods, global methods take into account the complete network topology. This allows the design of more sophisticated approaches, which are also more inefficient and less scalable.


Katz score

This score is computed as the sum of the influence of all possible paths between pairs of nodes. Path influence is incrementally penalized proportionally to path length according to a damping factor. This damping factor, with a value between 0 and 1, controls path length penalization. A larger damping factor increases the influence of longer paths.

The following example shows how to compute the Katz score for a given graph using a damping factor of 0.2:

Network network = ...
KatzScore katz = new KatzScore(network, 0.2);
Matrix scores = katz.call();

References:

  • Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1), 39-43. DOI: 10.1007/BF02289026.


Global Leicht-Holme-Newman score

Similar to the Katz score, the global Leicht-Holme-Newman score measures node similarity proportional to the number of paths between nodes. This technique also introduces a damping factor, which controls the penalization of paths according to their length. The main different with the Katz score is that the global Leicht-Holme-Newman score approximates the number of paths by computing the largest eigenvalue of the network adjacency matrix.

The code listed below shows how to compute the global Leicht-Holme-Newman score using a damping factor of 0.15:

Network network = ...
GlobalLeichtHolmeNewmanScore glhn = new GlobalLeichtHolmeNewmanScore(network, 0.15);
Matrix scores = glhn.call();

References:

  • Leicht, E. A., Holme, P., & Newman, M. E. (2006). Vertex similarity in networks. Physical Review E, 73(2), 026120. DOI: 10.1103/PhysRevE.73.026120.


Random walk score

A random walk on the graph is defined as Markov chain of randomly selected nodes. Starting from a given node, a theoretical walker randomly jumps from one node to another through links. Random walks have been widely used to study lots of stochastic processes in many fields of study. Random walks can be used to compute the probability of reaching a node from another. This probability is used by the random walk score to estimate the similarity between pairs of nodes.

An example of how to use the RandomWalkScore class, which implements this metric, is shown below:

Network network = ...
RandomWalkScore rw = new RandomWalkScore(network);
Matrix scores = rw.call();

References:

  • Liu, W., & Lü, L. (2010). Link prediction based on local random walk. Europhysics Letters, 89(5), 58007. DOI: 10.1209/0295-5075/89/58007.


Random walk with restart score

Random walk with restart is an extension of the original random walk algorithm. In this new model, the random walker moves to adjacent nodes with a probability given by an alpha parameter. However, with complementary probability 1-alpha, the random walker returns to the starting node. This metric is closely related to the PageRank algorithm, described in the Structural Network Properties section.

An example of how to compute this metric with an alpha value of 0.8, which implies a restart probability equal to 0.2, is shown in the following fragment of code:

Network network = ...
RandomWalkWithRestartScore rwr = new RandomWalkWithRestartScore(network, 0.8);
Matrix scores = rwr.call();

References:

  • Tong, H., Faloutsos, C., & Pan, J. Y. (2006). Fast random walk with restart and its applications. ICDM, 613-622. DOI: 10.1109/ICDM.2006.70.


Flow propagation score

Random walk-based metrics require the normalization of the network adjacency matrix. Previously described approaches required row normalization, so the sum of the elements in each row is equal to 1. The flow propagation score is a variant of Random walk with restart score that replaces the normalized adjacency matrix with the normalized Laplacian matrix.

The following example shows how to compute flow propagation in a given network using an alpha value of 0.85:

Network network = ...
FlowPropagationScore fp = new FlowPropagationScore(network, 0.85);
Matrix scores = fp.call();

References:

  • Vanunu, O., & Sharan, R. (2008). A propagation-based algorithm for inferring gene-disease associations. In German Conference on Bioinformatics, 54-63.


Pseudoinverse Laplacian score

The Laplacian matrix is an alternative graph representation widely used in spectral graph theory. The Moore-Penrose pseudoinverse of the Laplacian matrix can be computed using singular value decomposition (SVD). This matrix is a kernel, which means that it can be used as a node similarity matrix. The pseudoinverse Laplacian score is computed as the node similarity given by the pseudoinverse Laplacian matrix, divided by the root of the product of the self-similarity of both nodes.

This metric can be computed using the PseudoinverseLaplacianScore class as shown below:

Network network = ...
PseudoinverseLaplacianScore pl = new PseudoinverseLaplacianScore(network);
Matrix scores = pl.call();

References:

  • Fouss, F., Pirotte, A., Renders, J. M., & Saerens, M. (2007). Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on knowledge and data engineering, 19(3), 355-369. DOI: 10.1109/TKDE.2007.46.


Average commute time score

The average commute time score is equal to the average number of steps that a random walker, starting from a given node, takes to reach other given node for the first time and go back to the starting node. This number is computed using the pseudoinverse of the Laplacian matrix.

The average commute time score for a given network can be computed as follows:

Network network = ...
AverageCommuteTimeScore act = new AverageCommuteTimeScore(network);
Matrix scores = act.call();

References:

  • Fouss, F., Pirotte, A., Renders, J. M., & Saerens, M. (2007). Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on knowledge and data engineering, 19(3), 355-369. DOI: 10.1109/TKDE.2007.46.


Random forest kernel score

The random forest kernel measures the accessibility between pairs of nodes. This score is based on the matrix-tree theorem, which states that the number of spanning trees in a graph can be estimated using the Laplacian matrix. This number can be used to estimate the likelihood of link existence between every pair of nodes in a given network.

The following example shows how to compute this score using the RandomForestKernelScore class:

Network network = ...
RandomForestKernelScore rfk = new RandomForestKernelScore(network);
Matrix scores = rfk.call();

References:

  • Chebotarev, P., & Shamis, E. (2002). The Forest Metrics for Graph Vertices. Electronic Notes in Discrete Mathematics, 11, 98-107. DOI: 10.1016/S1571-0653(04)00058-7.