How Google works: Markov chains and eigenvalues

Originating author is Christiane Rousseau.
From its very beginning, Google became “the” search engine. This comes from the supremacy of its ranking algorithm: the PageRank algorithm. Indeed, with the enormous quantity of pages on the World-Wide-Web, many searches end up with thousands or millions of results. If these are not properly ordered, then the search may not be of any help, since no one can explore millions of entries.


How does the PageRank algorithm work?
We will explain this. But before, let us make a search on Google. On June 4 2010, you get 16,300,000 results for Klein project, even though the project is just beginning. On that precise date, the first entry is


http://www.mathunion.org/icmi/other-activities/klein-project/introduction/

rather than


http://www.kleinproject.org/
The first url is the url of a page, which is located on the website of the International Mathematical Union: http://www.mathunion.org. Because the International Mathematical Union is an important body, its official website comes first when you make the search ” International Mathematical Union”. Moreover, it communicates some of its importance to all of its pages, one of which is

http://www.mathunion.org/icmi/other-activities/klein-project/introduction/
In a few months or years from now, we could expect that the page

http://www.kleinproject.org/

will appear first in a search for Klein project.

To explain the algorithm, we model the web as an oriented graph. The vertices are the pages, and the oriented edges are the links between pages. As we just explained, each page corresponds to a different url. Hence, a website may contain many pages. The model makes no difference between the individual pages of a website and its front page. But, most likely, the algorithm will rank better the front page of an important website.

A simple example

Let us look at the simple web on the left with five pages named A, B, C, D, and E. This web has few links. If we are on page A, then there is only one link to page B, while, if we are on page C, we find three links, and we can choose to move to either page A, or B, or E. Notice that there is at least one link from each page.

We play a game, which is simply a random walk on the oriented graph. Starting from a page, at each step we choose at random a link from the page where we are, and we follow it. For instance, in our example, if we start on page B, then we can go to A or to C with probability 1/2 for each case while, if we start on D, then we necessarily go to A with probability 1. We iterate the game.

Where will we be after n steps?
To automatize the process, we summarize the web in the following matrix P, where each column represents the departing page, and each row, the page where we arrive.

    \[P=\begin{matrix} \begin{matrix} A & B & C & D & E \end{matrix} & \\ \left(\ \ \begin{matrix} 0 & \frac12 & \frac13 & 1 & 0 \\ 1 & 0 & \frac13 & 0 & \frac13 \\ 0 & \frac12 & 0 & 0 & \frac13 \\ 0 & 0 & 0 & 0 & \frac13 \\ 0 & 0 & \frac13 & 0 & 0 \end{matrix} \ \ \right) & \begin{matrix} A \\ B\\ C\\ D\\ E\end{matrix} \end{matrix}\]

Let us remark that the sum of entries of any column of P is equal to 1 and that all entries are greater or equal to zero. Matrices having these two properties are very special: each such matrix is the matrix of a Markov chain process, also called Markov transition matrix. It always has 1 as an eigenvalue and there exists an eigenvector with eigenvalue 1, all the components of which are less than or equal to 1 and greater than or equal to 0, with sum equal to 1. But, before recalling the definitions of eigenvalue and eigenvector, let us explore the advantage of the matrix representation of the web graph.

We consider a random variable X_n with values in the set of pages \{A,B,C,D, E\}, which contains N pages (here N=5). X_n represents the page where we are after n steps of the random walk. Hence, if we call p_{ij} the entry of the matrix P located on the i-th row and j-th column, then p_{ij} is the conditional probability that we are on the i-th page at step n+1, given that we were on the j-th page at step n:

    \[p_{ij}=\text{Prob}(X_{n+1}= i\mid X_n=j).\]

Note that this probability is independent of n! We say that a Markov chain process has no memory of the past. It is not difficult to figure that the probabilities after two steps can be summarized in the matrix P^2.

Let us prove it (you can skip the proof if you prefer). The theorem of total probabilities yields

    \[\text{Prob}(X_{n+2}=i \mid X_n=j)=\sum_{k=1}^N\text{Prob}(X_{n+2}= i \;\text{and}\; X_{n+1}=k \mid X_n=j).\]

By definition of conditional probability

    \[\text{Prob}(X_{n+2}=i \mid X_n=j)=\sum_{k=1}^N\frac{\text{Prob}(X_{n+2}= i \;\text{and}\; X_{n+1}=k \;\text{and}\; X_n=j)} {\text{Prob}(X_n=j)}.\]

We use a familiar trick: multiply and divide by the same quantity:

    \begin{multline*} \text{Prob}(X_{n+2}=i \mid X_n=j)=\\ \sum_{k=1}^N\frac{\text{Prob}(X_{n+2}= i \;\text{and}\; X_{n+1}=k \;\text{and}\; X_n=j)} {\text{Prob}(X_{n+1}=k \;\text{and}\; X_n=j)}\frac{\text{Prob}(X_{n+1}=k \;\text{and} \; X_n=j)}{\text{Prob}(X_n=j)}. \end{multline*}

The first quotient is equal to

    \[\text{Prob}(X_{n+2}= i | X_{n+1}=k \;\text{and}\; X_n=j)= \text{Prob}(X_{n+2}= i | X_{n+1}=k ),\]

since a Markov chain process has no memory past the previous
step.
Hence,

    \begin{align*}\text{Prob}(X_{n+2}=i \mid X_n=j)&=\sum_{k=1}^N\text{Prob}(X_{n+2}= i \mid X_{n+1}=k) \text{Prob}(X_{n+1}=k \mid X_n=j) \\ &=\sum_{k=1}^N p_{ik}p_{kj} \\ &=(P^2)_{ij}. \end{align*}

In our example

    \[P^2=\begin{matrix} \begin{matrix} A & B & C & D & E \end{matrix} & \\ \left(\ \ \begin{matrix}\frac12 & \frac16 & \frac16 & 0 & \frac{11}{18} \\ 0 & \frac23 & \frac49 & 1 & \frac19 \\ \frac12 & 0 & \frac5{18} & 0 & \frac16 \\ 0 & 0 & \frac19 & 0 & 0 \\ 0 & \frac16 & 0 & 0 & \frac19 \end{matrix} \ \ \right) & \begin{matrix} A \\ B\\ C\\ D\\ E\end{matrix} \end{matrix}\]

Iterating this idea, it is clear that the entry (P^m)_{ij}, of the matrix P^m describes the probability \text{Prob}(X_{n+m}=i\mid X_n=j). For instance,

    \[P^{32} =\begin{matrix} \begin{matrix} \:\:A \quad& \:\:B\quad & \:\:C \quad& \:\:D\quad &\:\: E \quad\end{matrix} & \\ \left(\ \ \begin{matrix} 0.293 & 0.293 & 0.293 & 0.293 & 0.293 \\ 0.390 & 0.390 & 0.390 & 0.390 & 0.390 \\ 0.220 & 0.220 & 0.220 & 0.220 & 0.220 \\ 0.024 & 0.024 & 0.024 & 0.024 & 0.024 \\ 0.073 & 0.073 & 0.073 & 0.073 & 0.073 \end{matrix} \ \ \right) & \begin{matrix} A \\ B\\ C\\ D\\ E\end{matrix}\end{matrix}\]

All columns of P^{32} are identical if we choose precision to 3 decimals, and the same as the columns of P^n when n>32. If we choose a higher precision, we also find stabilization, but for n larger than 32. Hence, after n steps, where n is sufficiently large, the probability of being on a page is independent from where we started!

Moreover, let us consider the vector

    \[\pi^t=(0.293,0.390,0.220,0.024,0.073)\]

(\pi is a vertical vector, and its transpose \pi^t is a horizontal vector). It is easily checked that P\pi= \pi. If we consider the i-th coordinate of the vector \pi^t as the probability of being on page i at a given time n, and hence \pi^t as the probability distribution of pages at time n, then it is also the probability distribution at time n+1. For this reason, the vector \pi is called the stationary distribution. This stationary distribution allows to order the pages. In our example, we order the pages as B, A, C, E, D, and we declare B the most important page.

The general case

The general case can be treated exactly as our example. We represent the web as an oriented graph in which the N vertices represent the N pages of the web, and the oriented edges represent the links between pages. We summarize the graph in an N\times N matrix, P, with the j-column representing the j-th departing page and the i-row, the ith arrival page. In our example we found a vector \pi satisfying P\pi=\pi. This vector is an eigenvector of the eigenvalue 1. Let us recall the definition of eigenvalue and eigenvector.

Definition. Let P be an N\times N matrix. A number \lambda\in \mathbb C is an eigenvalue of P if there exists a nonzero vector X\in \mathbb C^N such that PX=\lambda X. Any such vector X is called an eigenvector of P.

We also recall the method to find the eigenvalues and eigenvectors.

Proposition Let P be an N\times N matrix. The eigenvalues of P are the roots of the characteristic polynomial \det(\lambda I- P)=0, where I is the N\times N identity matrix The eigenvectors of an eigenvalue \lambda are the nonzero solutions of the linear homogeneous system (\lambda I - P)X=0.

The following deep theorem of Frobenius guarantees that for the matrix associated to a web graph, we will always find a stationary solution.

Theorem (Frobenius theorem) We consider an N\times N Markov transition matrix P=(p_{ij}) (i.e. p_{ij}\in [0,1] for all i,j, and the sum of entries on each column is equal to 1, namely \sum_{i=1}^Np_{ij}=1). Then

  1. \lambda=1 is one eigenvalue of P.
  2. Any eigenvalue \lambda of P satisfies |\lambda|\leq 1.
  3. There exists an eigenvector \pi of the eigenvalue 1, all the coordinates of which are greater
    or equal than zero. Without loss of generality we can suppose that
    the sum of its coordinates be 1.

It is now time to admire the power of this theorem. For this purpose, we will make the simplifying hypothesis that the matrix P has a basis of eigenvectors \mathcal{B}=\{v_1, \dots , v_n\} and we suppose that v_1 is the vector \pi of Frobenius theorem. For each v_i, there exists \lambda_i such that Pv_i=\lambda_iv_i. Let us take any nonzero vector X with X^t= (x_1, \dots, x_N), where x_i\in[0,1] and \sum_{i=1}^N x_i=1. We decompose X in the basis \mathcal{B}:

    \[X=\sum_{i=1}^N a_iv_i.\]

A technical proof that we skip allows to prove that a_1=1. Now, let us calculate PX:

    \[PX= \sum_{i=1}^Na_i Pv_i= \sum_{i=1}^n a_i\lambda_iv_i,\]

since v_i is an eigenvector of the eigenvalue \lambda_i. And if we iterate we get

    \[P^nX= \sum_{i=1}^n a_i\lambda_i^nv_i.\]

If all \lambda_i for i>1 satisfy |\lambda|<1, then

    \[\lim_{n\to \infty} P^nX = a_1v_1=\pi.\]

This is exactly what happens in our example!

Notice however that the theorem does not guarantee that any matrix P satisfying the hypotheses of the theorem will have this property. Let us describe the possible pathologies and the remedy.

Possible pathologies.

  • The eigenvalue 1 may be a multiple root of the characteristic polynomial \det(\lambda I -A)=0.
  • The matrix P may have other eigenvalues \lambda than 1, with modulus equal to 1.

What do we do in this case?

We call a Markov transition matrix with no pathology a regular Markov transition matrix, namely

Definition. A Markov transition matrix is regular if

  • The eigenvalue 1 is a simple root of the characteristic polynomial \det(\lambda I -A)=0.
  • All other eigenvalues \lambda of P than 1 have modulus less than 1.Remark that most matrices P are regular. So, if the Markov transtion matrix of a web is not regular, then the strategy is to deform it slightly to a regular one.

    Remedy. We consider the N\times N matrix Q= (q_{ij}), with q_{ij} = \frac1N for all i,j. We replace the matrix P of the web by the matrix

    (1)   \begin{equation*}P_\beta= (1-\beta) P+ \beta Q, \end{equation*}

    for some small \beta\in [0,1]. (The value \beta= 0.15 has been used by Google.) Notice that the matrix P_\beta still has nonnegative entries and that the sum of the entries of each column is still equal to 1, so it is still a Markov transition matrix. The following theorem guarantees that there exists some small \beta for which we have fixed the pathologies.

    Theorem Given any Markov transition matrix P, there always exists a positive \beta, as small as we wish, such that the matrix P_\beta is regular. Let \pi be the eigenvector of 1 for the matrix P_\beta, normalized so that the sum of its coordinates be 1. For the matrix P_\beta, given any nonzero vector X, where X^t= (p_1, \dots, p_N) with p_i\in[0,1] and \sum_{i=1}^N p_i=1, then

    (2)   \begin{equation*}\lim_{n\to \infty} (P_\beta)^nX= \pi.\end{equation*}

    Link with the Banach fixed point theorem

    Another (coming) vignette treats the Banach fixed point theorem. The theorem above can be seen as a particular application of it. You can skip this section if you have not read the other vignette. Indeed,
    Theorem Let P be a regular Markov transition matrix. We consider \mathcal{S}=\{X\mid X^t=(p_1, \dots p_N), p_i\in [0,1],\sum_{i=1}^N p_i=1\}, with an adequate distance d(X,Y) between points (this distance depends on P). On \mathcal{S}, we consider the linear operator L: \mathcal{S} \rightarrow \mathcal{S} defined by L(X)=PX. The operator L is a contraction on \mathcal{S}, namely there exists c\in [0,1[ such that for all X,Y\in \mathcal{S},

        \[d(L(X),L(Y)) \leq c\,d(X,Y).\]

    Then, there exists a unique vector \pi\in \mathcal{S} such that L(\pi)=\pi.

    Moreover, given any X_0\in \mathcal{S}, we can define the sequence \{X_n\} by induction, where X_{n+1} = L(X_n). Then, \lim_{n\to \infty} X_n=\pi.

    Definition of the distance d. The definition of the distance d is a bit subtle and can be skipped. We include it for the sake of completeness for the reader who wants to push far the details. We limit ourselves to the case where the matrix P is diagonalizable. Let \mathcal{B}=\{v_1=\pi, v_2, \dots, v_N\} be a basis of eigenvectors. Vectors X,Y\in\mathcal{S} can be written in the basis \mathcal{B}:

        \[X=\sum_{i=1}^N a_iv_i, \quad Y=\sum_{i=1}^N b_iv_i,\]

    where a_i,b_i\in \mathbb C. Then we define

        \[d(X,Y)=\sum_{i=1}^n |a_i-b_i|.\]

    With this distance, \mathcal{S} is a complete metric space, i.e. all Cauchy sequences converge.

    This theorem not only guarantees the existence of \pi, but gives a method to construct it, as the limit of the sequence \{X_n\}. We have seen an illustration of this convergence in our example. Indeed, the j-th column of P^n is the vector P^n e_j, where e_j^t= (0,\dots, 0, 1, 0, \dots,0) is the j-th vector of the canonical basis. Of course, in our example, we could also have found directly the vector \pi by solving the system (I-P)X=0 with matrix

        \[I-P= \left(\begin{matrix} 1 &- \frac12 & -\frac13 & -1 & 0 \\ -1 & 1 & -\frac13 & 0 & -\frac13 \\ 0 & -\frac12 & 1 & 0 & -\frac13 \\ 0 & 0 & 0 & 1 & -\frac13 \\ 0 & 0 & -\frac13 & 0 & 1 \end{matrix}\right).\]

    We would have found that all solutions are of the form (4s, \frac{16}3s,3s,\frac13 s,s)^t for s\in\mathbb R. The solution whose sum of coordinates is 1 is hence \pi, where

        \[\pi^t=\left(\frac{12}{41}, \frac{16}{41}, \frac{9}{41}, \frac1{41}, \frac{3}{41}\right).\]

    Practical calculation of the stationary distribution

    We have identified the simple idea underlying the algorithm. However finding the stationary distribution \pi, i.e. an eigenvector of the eigenvalue 1 for the matrix P_\beta in (1), is not at all a trivial task when the matrix has billions of lines and columns: both the time of the computation and the memory space to achieve it represent real challenges. The usual method of Gauss elimination is useless in this case, both because of the size of the computation, and because it requires to divide by small coefficients. A more efficient algorithm makes use of the property (2) (see [LM]). This makes the link with the coming vignette on the Banach fixed point theorem which will enlighten that the proof of Banach fixed point theorem provides an algorithm to construct the fixed point.

    Indeed, we start with X_0 such that

        \[X_0^t= (\frac1{N},\frac1{N}, \dots, \frac1{N}),\]

    and we need to calculate \lim_{n\to \infty} (P_\beta)^nX_0. Usually, P_\beta^nX_0 for some n between 50 and 100 gives a pretty good approximation of \pi. By induction, we calculate X_{n+1} =P_\beta X_n. Even such calculations are quite long. Indeed, due to its construction, the matrix P_\beta in (1) has no zero entries. On the other hand, most of the entries of the matrix P are zero. So, we must decompose the computation to take advantage of this fact, namely

        \[X_{n+1} = (1-\beta)PX_n +\beta Q X_n.\]

    Because of the special form of Q, it is easy to verify that if X is a vector the sum of entries of which is 1, then QX= X_0. Hence, it suffices to calculate the sequence

        \[X_{n+1} = (1-\beta)PX_n +\beta X_0.\]

    Conclusion

    We have presented the public part of Google’s PageRank algorithm. Already you can experiment with simple webs and find the tricks to improve the ranking of your personal page by adding internal and external links in an optimal way. Some private, more sophisticated parts continue to be developed. Some of them consist in replacing the ” neutral” matrix Q of (1) by matrices reflecting the taste of the surfer on the web. Others ensure that the ranking is not too sensitive to the manipulations of those who try to improve the ranking of their pages.

    As general conclusion, what have we observed? A simple, clever idea has led to a immense breakthrough in the efficiency of search engines, and to the birth of a commercial empire. Even if the implementation is in itself a computational exploit, the initial idea required ” elementary” mathematics, namely linear algebra and probability theory. These relatively standard mathematical tools, in particular the diagonalization of matrices, have given their full power when they have been used outside of their “normal” context. Also, we have highlighted the unifying ideas inside science, with Banach fixed point theorem having applications so remote from its origin.

    Bibliography

    [E] M. Eisermann, Comment Google classe les pages webb, http://images.math.cnrs.fr/Comment-Google-classe-les-pages.html, 2009.

    [LM] A. N. Langville and C. D. Meyer, A Survey of Eigenvector Methods
    for Web Information Retrieval, SIAM Review, Volume 47, Issue 1,
    (2005), pp. 135–161.

    [RS] C. Rousseau and Y. Saint-Aubin, Mathematics and
    technology
    , SUMAT Series, Springer-Verlag, 2008 (A French version of the book exists, published in the same series.)

This post is also available in: French, Italian, Spanish, Arabic

Create PDF    Send article as PDF   
This entry was posted in Mathematics Within the Last 100 Years. Bookmark the permalink.

  1. yov says:

    Thanks for this piece, but I am still lacking the connection between google’s algorithm and the analysis above. Does the stationary distribution define the web-page’s ranking in a search?

  2. Nitsa Movshovitz-Hadar says:

    The next challenge is to expose high school students to this and other articles on Klein project blog website. A study dealing with such a challenge is on going in Israel in the past 2 years based upon a preliminary presentation at ESU5 (Movshovitz-Hadar 2007), and an action research presented at ESU6 (Amit and Movshovitz-Hadar 2011). We develop short, single-topic PowerPoint presentations named Math-News-Snapshots to be interwoven in the curriculum once a fortnight so that progress in the mandatory curriculum is not harmed yet students get the sense of what’s happening in contemporary mathematics. We will be glad to share more information about the study. Please send personal e-mail to: nitsa at technion dot ac dot il

Leave a Reply

Your email address will not be published. Required fields are marked *