MAT103
Franklin College
Erich Prisner

8,000,000,000 variables and 8,000,000,000 equations---maybe the worlds largest system of equations

Google's PageRank

Description

"PageRank" is Google's method to measure "importance" of web pages. The position where web pages are displayed in a Google search depends on the occurence of the search phrase in the document, but also on the PageRank, therefore achieving a high rank for their pages is of immense importance for web masters.

"PageRank" was introduced by Sergey Brin and Lawrence Page. It is a concept patented by Stanford University (since both were students at Stanford when deriving it) but leased to Google, the company both founded right after graduation. The term "Page" refers to Lawrence Page [].

So what makes a web page "important"? It is clear, that many other web pages would link towards such a page, therefore an obvious but too simple approach would be to just count the number of all incoming links and take this as the measure. Since people mostly only link to pages they consider to be important, by linking to it they perform a kind of vote [3].

This first approach is too simplistic for two reasons: A page with hundreds of outgoing links would have hundreds of votes, much more voting power than a page with only few outgoing links. When somebody assures you "I love you, but also I also love Alice and Alma and Beth and ...", this may carry less weight than if somebody loves only you. So it seems fair to relate these links to the total number of outgoing links. A link from a page with 100 outgoing links would count as 1/100, whereas a link from a page with only 10 outgoing links would have the higher count of 1/10.

This first modification made sure that the total number of all votes from any web page would add up to 1. But this democracy is maybe not what we want in this kind of voting. The second modification gives more "important" pages more voting power. It is like a shareholder meeting, where everybody has as many votes as shares. Note that our process now gets recursive---the importance of a page depends on the importance of the pages that link to that page, which themselves depend on the importance of other pages, ..., and so on without an end, since there may be link cycles.

So we arrive at the following formula. Assume the web pages B1, B2, ... , Bn link towards the web pages A. Let the variables A, B1, B2, ... , Bn also denote the PageRank of these pages. Assume the total number of outbound links of pages B1, B2, ... , Bn are d1, d2, ... , dn respectively. Then

A = B1/d1 + B2/d2 + ... + Bn/dn.

The formula needs a last, technical modification (as otherwise the system of equations may not have a solution). Let m be a constant between 0 and 1, (usually m=0.85 is chosen [1,2,3]). Then the formula reads

A = m*B1/d1 + m*B2/d2 + ... + m*Bn/dn + (1-m).

Only m of the voting power of a web page is distributed.

An example

Let us treat a simple example of three pages, A, B, and C. A is linking to both B and C, and both are linking back to A. The equations are

A = m*B/1 + m*C/1 + (1-m).
B = m*A/2 + (1-m)
C = m*A/2 + (1-m)

We solve this system (note that m is a constant) and get, by substituting first B and then C,

A = m(mA/2 + (1-m)) + m(mA/2 + (1-m)) + (1-m) =
= m2A/2 + m(1-m) + m2A/2 + m(1-m) + (1-m) =
= m2A + 1 + m - 2m2
and
(1-m2)A = 1 + m - 2m2
A = (1 + m - 2m2)/(1-m2) = (1-m)(2m+1)/((1+m)(1-m))
A = (2m+1)/(m+1) =1.459...
B = m*(2m+1)/(2m+2) + 1-m = (m*(2m+1)+(1-m)(2m+2))/(2m+2) =
= (2m2+m+2m+2-2m2-2m)/(2m+2)
B =(m+2)/(2m+2) = 0.77...
C = (m+2)/(2m+2) = 0.77...

Note also that, for every possible m, A is larger than B and C,

Another example

Details

Note that there are much more details. It may be that multiple links from one page to another are neglected. So-called "dangling pages", pages without a single link outside, are reduced to minimum PageRank [4]. Actually the exact formula (equations) used by Google nowadays is a well-kept secret. 

Google also doesn't solve these 8 Billions equations using the substitution method, but rather uses a fast method approximating the values closer and closer [1,2]. 

Moreover, the numbers (usually 1 to 10) displayed in the Google toolbar are probably the logarithms of that numbers we created above (to an unknown base) [2,3].

References

[1] Chris Ridings, Mike Shishigin, PageRank Uncovered, 2002,

[2] Ian Rodgers, "Page Rank Explained---The Google Pagerank Algorithm and how it works", PageRank Explained correctly with examples, September 20, 2005 <http://www.iprcom.com/papers/pagerank/>

[3] Phil Craven, Google's PageRank explained and how to make the most of it, September 20, 2005,  <http://www.webworkshop.net/pagerank.html>

[4] Sergey Brin, Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, The Anatomy of a Search Engine, September 19, 2005 <http://www-db.stanford.edu/~backrub/google.html>


Erich Prisner, September 2005