Jeremy Bradley: MSc Projects

Reference: Matthew Cook. Calculation of PageRank Over a Peer-To-Peer Network. MSc Thesis, Department of Computing, Imperial College London, London SW7 2BZ, UK, September, 2004.

Abstract: Modern computer networks contain enormous quantities of unstructured data, and providing efficient methods of identifying interesting documents continues to be one of the largest challenges such networks present. Internet search company Google have had the most prominent success in this arena, and their achievements are in part due to their development and use of the PageRank algorithm, which is a technique employed to improve the relevance of search results by analysing the structure of hyperlinks in the underlying web graph. Currently PageRank is calculated centrally on a monthly batch cycle. This report suggests that calculating document relevance ratings in this way has drawbacks in terms of scalability, overreliance on centralised coordination and latency of resource discovery and inclusion, and presents an alternative method that may overcome these problems with the use of a modified algorithm calculated over a next-generation peer-to-peer protocol such as Chord.

Paper: 2004-cook-0.pdf 992768 bytes

Reference: Douglas de Jager. PageRank: Three Distributed Algorithms. MSc Thesis, Department of Computing, Imperial College London, London SW7 2BZ, UK, September, 2004.

Abstract: This paper shows for the first time that there are multiple PageRank definitions, and that more is required to justify PageRank than is offered in the literature ([1],[2] and [5]). Adopting a formal approach, this paper provides a justification of PageRank. It also shows that the irreducibility restriction on the PageRank transition matrix [5] is unnecessary. This is important because it gives the personalisation vector more intuitive force. Noting the difficulties in calculating PageRank centrally, the paper then shows, via the Chazan- Miranker theorem [10], that distributed calculation algorithms are possible; and, it presents three such novel algorithms. The first of these takes documents/pages as being of principal interest; the second and third, in an attempt to reduce communication overheads,focus attention on nodes/ webservers. Empirical test results are shown. These confirm reduced message numbers for algorithms 2 and 3. They also show that more investigation is required into suitable epsilon-thresholds within asynchronous environs.

Paper: 2004-dejager-0.pdf 4309987 bytes

Reference: Saar Miron. Distributed Asynchronous Link-Based and Content-Sensitive Pagerank Algorithms. MSc Thesis, Department of Computing, Imperial College London, London SW7 2BZ, UK, September, 2004.

Abstract: This paper presents a fully distributed implementation of a non-iterative algorithm for computing Pagerank in large scale hypertext environments. The implementation introduced here does not imitate the centralized Pageank algorithm implemented effectively in Google [Goo], or convert it into a distributed system. Instead, it offers a different kind of implementation, one that is suitable for both the scale and the dynamic behavior of the Web. In view of this, it is possible that the distributed ranking implementation presented here may be a candidate for the replacement of the traditional centralized Pagerank computation. This paper criticizes the use of pages` link-structures as the sole criterion for determining the "importance" of pages in a ranking algorithm. Instead, it argues that the weight of links in a ranking equation should be partly determined by the content relevance of the connected pages. Content analysis can be effectively incorporated into a distributed Pagerank algorithm using the fact that nodes (Web-servers) have immediate access to their pages` contents. This paper proposes incorporating content analysis into a Pagerank algorithm in order to improve both rank quality and convergence performance.

Paper: 2004-miron-0.pdf 8765349 bytes

Reference: Lida Tsimonou. Distributed Pagerank: Comparisons between a Simulation and Peer-to-Peer Implementation of the Algorithm. MSc Thesis, Department of Computing, Imperial College London, London SW7 2BZ, UK, September, 2004.

Abstract: The Pagerank algorithm was first introduced by Sergey Brin and Lawrence Page. Pagerank can be described as a measure of popularity assigned to each web page. The idea behind the algorithm is to infer the popularity of a web page based on information provided by the link structure of the web graph. Except from web page ranking, the algorithm has also been applied to ranking documents in peer- to-peer file-sharing systems as a means to reduce network traffic. The algorithm has traditionally been implemented centrally, on a set of pages obtained by a centralized web crawler. This approach suffers from performance and scalability issues and as a result distributed Pagerank algorithms have been proposed in literature. The present project presents an implementation of a peer-to-peer, distributed Pagerank algorithm, first introduced by Sankaralingam et al. in 2003, which is based on chaotic relaxation. In order to be able to evaluate the accuracy of the Pagerank values generated by this system, an additional application has been developed which runs as a stand-alone system, simulating the multiple-peer environment, but avoiding possible peer communication issues. The results obtained suggest that the peer-to-peer system developed can converge with considerable accuracy when compared to the simulated version. In addition, the system can deal effectively with dynamic changes like node arrivals and page insertions. Certain issues related to dealing with increased message traffic during Pagerank calculations, as well as a trend of the distributed system to underestimate Pagerank values have been identified. Possible strategies for improvement, as well as suggestions for future work are discussed.

Paper: 2004-tsiminou-0.pdf 636363 bytes

Jeremy Bradley: Supervised MSc Projects and Abstracts