Google's success in the past years is strongly connected with its web search delivering precise results to queries. Over 200 factors are taken into account before the resulting pages are listed according to an overall relevancy score. One of the factors is the PageRank, which is a numerical value used to establish an importance ranking between all the web pages.
This paper tries to give a brief overview of the PageRank algorithm and its related topics. In section 2 the basic concepts and a first definition of PageRank are introduced. Due to problems arising with the simple model, the random surfer model leading to the Google matrix is depicted in section 3. Section 4 deals with convergence and sensitivity issues of the PageRank vector. The computation of the PageRank vector using different methods is shown in section 5.
An Introduction to the PageRank Algorithm
Haoyue Hu
1 Introduction
Google’s success in the past years is strongly connected with its web search delivering precise results to queries. Over 200 factors are taken into account before the resulting pages are listed according to an overall relevancy score. One of the factors is the PageRank, which is a numerical value used to establish an importance ranking between all the web pages.
The following quote from Google’s web page emphasizes the significance of PageRank:
“The heart of our software is PageRankTM, a system for ranking web pages developed by our founders Larry Page and Sergey Brin at Stanford University. And while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to provide the basis for all of our web search tools.”
This paper tries to give a brief overview of the PageRank algorithm and its related topics. In section 2 the basic concepts and a first definition of PageRank are introduced. Due to problems arising with the simple model, the random surfer model leading to the Google matrix is depicted in section 3. Section 4 deals with convergence and sensitivity issues of the PageRank vector. The computation of the PageRank vector using different methods is shown in section 5.
2 Basics of PageRank
2.1 The World Wide Web as a Directed Graph
A directed graph always consists of nodes and directed edges. In the context of the World Wide Web, each node represents a web page and each directed edge represents a hyperlink. We distinguish between inlinks pointing into nodes and outlinks pointing out from nodes. If a node has no outlinks, it is called a dangling node. An example of a tiny web graph consisting of 5 pages is shown in figure 1. In this example node 2 is a dangling node.
illustration not visible in this excerpt
Figure 1: An example web graph
2.2 The Basic Idea Behind PageRank
A link to a page can be seen as a recommendation to that page. A page is considered to be important if it is recommended by a lot of other important pages or in other words, if it is pointed to by other important pages. As a consequence, the more inlinks a page has, the more important is the page. Not only the amount of inlinks plays a role, but also the quality of the pages linking to a specific page. A weighting factor can be introduced to evaluate the importance of the inlinking pages: Inlinks from important pages carry more weight than inlinks from unimportant or even spam pages. If a page points to several pages (i.e. has several outlinks), its weight is evenly distributed among all the outlinking pages.
2.3 The Original Summation Formula for PageRank
With regard to the basic concepts given above, the PageRank is defined as follows:
The PageRank of a page i is the sum of the PageRanks of all pages pointing into page i:
illustration not visible in this excerpt
where
- r(Pj) is the PageRank of page j
- BPi is the set of pages pointing into page i
- |Pj | is the number of outlinks from page j
To calculate the PageRank of a page, the other PageRanks have to be known. This definition implies that the calculation is based on a recursive procedure. The (k + 1)-th iteration step is[Abbildung in dieser Leseprobe nicht enthalten]
In equation (2) the initial value r0(Pi) can be chosen arbitrarily. Usually an equal ranking to all the pages in the web graph is assumed at the beginning, so the initial value is set to[Abbildung in dieser Leseprobe nicht enthalten]
where n is the total number of web pages in the web graph.
To simplify the computation, equation (2) can be reformulated into a vector-matrix multiplication [Abbildung in dieser Leseprobe nicht enthalten]
with p being the PageRank vector of size (1 × n) which holds the PageRanks for all n pages and H being the hyperlink matrix of size (n × n). H represents the link structure and is defined as:[Abbildung in dieser Leseprobe nicht enthalten]
Each iteration involves one vector-matrix multiplication which has the computational effort of O(n2 ). It is estimated that a web page has an average of 10 outlinks[4]. Therefore H is a very sparse matrix and only minimal storage schemes are needed. The effort is reduced to O(nnz(H)), where nnz(H) ≈ 10n is the number of nonzeros in H, so the algorithm has O(n). Another observation concerns dangling nodes: Their respective rows in H only have zero entries, whereas the rows of nondangling nodes are stochastic (sum of all entries in the row is equal to 1).
Using this PageRank definition, two problems might occur: The first problem is the appearance of so called rank sinks which are pages that cannot share their PageRanks with other pages due to the lack of outlinks. In figure 2, the dangling node 2 is a rank sink.
illustration not visible in this excerpt
Figure 2: Graph with a dangling node acting as a rank sink
The second problem are cycles; the simplest case of a cycle is shown in figure [3]. Because page 1 only points to page 2 and vice versa, thus creating a cycle or in other words an infinite loop, no convergence can be achieved in the PageRank iterations. For example, consider [Abbildung in dieser Leseprobe nicht enthalten] as the starting vector. Then the PageRank vector jumps between [Abbildung in dieser Leseprobe nicht enthalten] in all further iteration steps.
illustration not visible in this excerpt
Figure 3: Graph with a cycle
Therefore, we need some modifications to the PageRank algorithm to overcome these problems.
3 Adjustments to the Basic Model - Random Surfer Model
Imagine a web surfer who successively clicks on links randomly. At dangling nodes (e.g. pdf files), he jumps to any page. When he gets bored by clicking on arbitrary links, he simply enters a new page in the browser‘s URL line. The proportion of time he spends on a page corresponds to its relative importance.
To model the random surfer in a mathematical context, the zero rows in H corresponding to dangling nodes are replaced by[Abbildung in dieser Leseprobe nicht enthalten].Weobtainthe[
illustration not visible in this excerpt
[...]
- Quote paper
- Haoyue Hu (Author), 2012, An Introduction to the PageRank Algorithm, Munich, GRIN Verlag, https://www.grin.com/document/209302
-
Upload your own papers! Earn money and win an iPhone X. -
Upload your own papers! Earn money and win an iPhone X. -
Upload your own papers! Earn money and win an iPhone X. -
Upload your own papers! Earn money and win an iPhone X. -
Upload your own papers! Earn money and win an iPhone X.