Estimating PageRank Values of Wikipedia Articles using MapReduce


Due: Wednesday, Sept. 30, 5:00 PM
Submission: via Canvas, individual submission
Instructor: Sangmi Pallickara
Web page: http://www.cs.colostate.edu/~cs535/assignments.html

Objectives

The goal of this programming assignment is to enable you to gain experience in:

- Implementing iterative algorithms that estimate PageRank values of Wikipedia articles using Wikipedia dump data
- Designing and implementing batch-layer computation using Hadoop MapReduce and HDFS

(1) Overview

In this assignment you will design and implement a system that calculates the PageRank [1] values of currently available Wikipedia articles. The PageRank algorithm is used by Google Search to rank web pages in its search engine query results. PageRank measures the importance of web pages; the underlying assumption is that more important web pages are likely to receive more links from other web pages. In this assignment, you will implement algorithms to estimate PageRank values over the Wikipedia dataset. The results can be used to rank search results within Wikipedia.

Most Wikipedia articles contain links to other articles or to external web content. We will consider only internal Wikipedia articles when calculating PageRank. Wikipedia currently contains more than 4.9 million articles [2] and publishes data dumps in various formats periodically [3]. In this assignment, you will perform:

- Estimation of PageRank values under ideal conditions
- Estimation of PageRank values while considering dead-end articles
- Analysis of the above results

To perform the PageRank algorithm over the Wikipedia dump, you are required to use MapReduce in an iterative fashion.

(2) Requirements of Programming Assignment 1

Your implementation of this assignment should provide:

A. A sorted list (in descending order) of Wikipedia pages based on their ideal PageRank value. Each row should contain the title of the article and its PageRank value. This computation should be performed on your own Hadoop cluster with at least 10 machines, and you should present results from at least 25 iterations.

B. A sorted list (in descending order) of Wikipedia pages based on their PageRank value with taxation. Each row should contain the title of the article and its PageRank value. This computation should be performed on your own Hadoop cluster with at least 10 machines, and you should present results from at least 25 iterations.

C. The average difference between the ideal PageRank values (from A) and the PageRank values with taxation (from B) over all Wikipedia articles.

There are many algorithms for estimating PageRank. You should use the algorithms included in this description, and you are not allowed to use existing PageRank implementations.

The demonstration of your software will take place on the machines in CSB-120 and will include an interview discussing implementation and design details. Your submission should include your source code and outputs, and should be made via Canvas.

(3) PageRank

The term PageRank comes from Larry Page, a founder of Google. PageRank is a function that assigns a real number to each page in the Web (in our case, each Wikipedia article). A page with a higher PageRank value is considered more important. There are many algorithms for assigning PageRank; in this assignment, we will consider two of them [4]: (1) idealized PageRank and (2) taxation-based PageRank.

A. Idealized PageRank

Suppose that our Web is a directed graph where pages are the nodes and there is an edge from page p1 to page p2 whenever p1 contains a link to p2. Figure 1 depicts the links between 4 pages (A, B, C, and D) as a graph. Page A has links to each of the other three pages B, C, and D. Page B has links to A and D. Page C has a link only to A. Finally, page D has links to B and C.

[Figure 1. Web graph of the example]

Suppose a random surfer starts at page A (in Figure 1). The probability that the random surfer will visit each of the pages B, C, and D in the next step is 1/3. Note that if a page contains a link to itself, it should be represented in the graph as a loop. A random surfer who has arrived at page B will visit A with probability 1/2. This graph can be described by the transition matrix of the Web (here, Wikipedia); the transition matrix for the example depicted in Figure 1 is:

            | 0    1/2   1    0  |
        M = | 1/3   0    0   1/2 |
            | 1/3   0    0   1/2 |
            | 1/3  1/2   0    0  |

In the above matrix, the order of the pages is the natural one: A, B, C, and D. The first column shows that a random surfer at A has a probability of 1/3 of visiting each of the other three pages.

An idealized PageRank supposes that a random surfer may start from any of the n pages with equal probability. The initial vector v0 for our example is (1/4, 1/4, 1/4, 1/4). After one step, the distribution of the surfer will be Mv0; after two steps it will be M(Mv0) = M^2 v0, and so on. Multiplying the initial vector v0 by M a total of i times gives the distribution of the surfer after i steps. After a sufficient number of iterations, the vector shows little change from round to round. For this assignment, your iteration count is as specified earlier.
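
To make the idealized update concrete, the following short Java program (an illustrative sketch only, not the required MapReduce implementation; the class name and the 25-iteration count simply mirror the requirements above) runs the power iteration v_{i+1} = M v_i on the four-page example of Figure 1:

    import java.util.Arrays;

    // Minimal sketch: power iteration on the example of Figure 1.
    public class IdealizedPageRankDemo {
        public static void main(String[] args) {
            // Transition matrix M; M[i][j] is the probability of moving
            // from page j to page i, so every column sums to 1.
            double[][] m = {
                {0.0,       0.5, 1.0, 0.0},
                {1.0 / 3.0, 0.0, 0.0, 0.5},
                {1.0 / 3.0, 0.0, 0.0, 0.5},
                {1.0 / 3.0, 0.5, 0.0, 0.0}
            };
            double[] v = {0.25, 0.25, 0.25, 0.25};  // initial vector v0

            for (int iter = 0; iter < 25; iter++) { // 25 iterations, as required
                double[] next = new double[4];
                for (int i = 0; i < 4; i++) {
                    for (int j = 0; j < 4; j++) {
                        next[i] += m[i][j] * v[j];  // v' = M v
                    }
                }
                v = next;
            }
            System.out.println("Estimated ranks (A, B, C, D): " + Arrays.toString(v));
        }
    }

After about 25 multiplications the vector barely changes between rounds; your MapReduce implementation performs the same multiplication, but with n in the millions.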

B. Taxation

Wikipedia contains dead-end articles [5]. A dead-end article is a page that contains no links to other Wikipedia articles. Since there is no link to follow out, once a random surfer visits a dead-end article, the surfer never leaves it. As the number of iterations increases, this often drives the probabilities of many pages toward 0. To deal with dead-end articles, the PageRank algorithm modifies the idealized process by allowing random surfers to move without following links. This method is referred to as taxation. Taxation gives each random surfer a small probability of teleporting to a random page, rather than following an out-link from the current page. With taxation, the new vector for estimating the PageRank is

    v' = βMv + (1 - β)e/n

where β is a chosen constant, usually in the range of 0.8 to 0.9, e is a vector of all 1's with the appropriate number of components (i.e., the number of Wikipedia pages), and n is the number of nodes in the Web graph. In this assignment, you should use β = 0.85.
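
The sketch below (again illustrative, with hypothetical names; the real computation must run as a MapReduce job) applies the taxation step repeatedly to the in-memory example, using β = 0.85 as required:

    // Minimal sketch: taxation step, v' = beta*M*v + (1 - beta)*e/n.
    public class TaxationDemo {
        static double[] taxationStep(double[][] m, double[] v, double beta) {
            int n = v.length;
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double mv = 0.0;
                for (int j = 0; j < n; j++) {
                    mv += m[i][j] * v[j];               // (M v)_i
                }
                next[i] = beta * mv + (1.0 - beta) / n; // add the teleport share
            }
            return next;
        }

        public static void main(String[] args) {
            double[][] m = {
                {0.0,       0.5, 1.0, 0.0},
                {1.0 / 3.0, 0.0, 0.0, 0.5},
                {1.0 / 3.0, 0.0, 0.0, 0.5},
                {1.0 / 3.0, 0.5, 0.0, 0.0}
            };
            double[] v = {0.25, 0.25, 0.25, 0.25};
            for (int iter = 0; iter < 25; iter++) {
                v = taxationStep(m, v, 0.85);           // beta = 0.85, per the assignment
            }
            System.out.printf("Taxed ranks (A, B, C, D): %.4f %.4f %.4f %.4f%n",
                    v[0], v[1], v[2], v[3]);
        }
    }

Note that with β = 1.0 the step reduces to the idealized update v' = Mv, so a single implementation can serve both variants.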

(4) Iterative PageRank Algorithm using MapReduce

A. Generating the transition matrix

To calculate the PageRank of Wikipedia articles, you should generate your initial vector and transition matrix from the Wikipedia dump. For n Wikipedia articles, your initial vector should be (1/n, 1/n, ..., 1/n), with n dimensions. The transition matrix should contain n columns and n rows. To fill out the transition matrix, your software should record all of the articles linked from each article. For an article with m links (do not count links to web pages external to Wikipedia), each of the linked articles receives the value 1/m. The output of this step is the transition matrix of Wikipedia articles. You should perform this processing using MapReduce. Please ignore all of the possible errors due to the data segmentation by HDFS.

B. Performing an iteration

To perform a single iteration, you should design and implement a MapReduce job that calculates Mv0 for the idealized PageRank, or βMv0 + (1 - β)e/n for taxation, where M is the output of the previous MapReduce job (from (4)-A). You should perform this process using MapReduce; a sketch of one possible job structure appears after this section.

C. Further iterations

You are required to perform the prescribed number of iterations for each case. You can reuse the MapReduce code from (4)-B. You should keep your intermediate files in your own directory (e.g., on HDFS). You should perform this step for each of the two methods: idealized and taxation.
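
One common way to organize the job in (4)-B is to store, for every article, its current rank together with its out-links, so that each iteration is a single map/reduce pass. The sketch below assumes a line format of "title<TAB>rank<TAB>comma-separated out-links" and a job configuration key pagerank.num.pages; these names, the LINKS marker, and the omitted driver class are illustrative choices of this sketch, not requirements of the assignment:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Minimal sketch of one PageRank iteration as a MapReduce job.
    public class PageRankIteration {

        public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t");
                String title = parts[0];
                double rank = Double.parseDouble(parts[1]);
                String outLinks = parts.length > 2 ? parts[2] : "";
                // Re-emit the link structure so the reducer can rebuild the row.
                ctx.write(new Text(title), new Text("LINKS\t" + outLinks));
                // Distribute this page's rank evenly over its m out-links (1/m each).
                if (!outLinks.isEmpty()) {
                    String[] links = outLinks.split(",");
                    for (String link : links) {
                        ctx.write(new Text(link),
                                new Text(Double.toString(rank / links.length)));
                    }
                }
            }
        }

        public static class RankReducer extends Reducer<Text, Text, Text, Text> {
            private static final double BETA = 0.85;
            private long n;  // total number of articles, set by the (omitted) driver

            @Override
            protected void setup(Context ctx) {
                n = ctx.getConfiguration().getLong("pagerank.num.pages", 1L);
            }

            @Override
            protected void reduce(Text title, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0.0;
                String outLinks = "";
                for (Text val : values) {
                    String s = val.toString();
                    if (s.startsWith("LINKS\t")) {
                        outLinks = s.substring(6);    // recover the link structure
                    } else {
                        sum += Double.parseDouble(s); // incoming rank contributions
                    }
                }
                // Taxation update: v' = beta*M*v + (1 - beta)*e/n.
                // Using BETA = 1.0 here gives the idealized variant instead.
                double newRank = BETA * sum + (1.0 - BETA) / n;
                ctx.write(title, new Text(newRank + "\t" + outLinks));
            }
        }
    }

Because the reducer's output uses the same line format as its input, the driver can simply chain 25 such jobs, feeding each iteration's output directory to the next.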

(5) Programming Environment

A. Requirements

All software demonstrations for this assignment will take place in CSB-120, so you should make sure your software works on the machines there. You are required to use Java (Version 7, as installed in CSB-120).

B. Setting up your Hadoop cluster

This assignment requires you to set up your own Hadoop cluster with HDFS on every node. We will be staging the Wikipedia dump dataset on a read-only cluster set up in CSB-120. For writing your intermediate and output data, you should use your own cluster.

(6) Dataset

You should use the Wikipedia dump [6] of Aug. 05, 2015. We will be staging this Wikipedia dump dataset on a read-only HDFS cluster in CSB-120, so you do not need to download the dataset (12.1 GB as bzip2) and deflate it in your own directory. However, you should keep your output datasets from each iteration and the final computations in your own directory. If you have any problems with storage capacity, you should contact Prof. Pallickara immediately.
     1: <page>
     2:   <title>Abraham Lincoln</title>
     3:   <ns>0</ns>
     4:   <id>307</id>
     5:   <revision>
     6:     <id>672009362</id>
     7:     <parentid>672009304</parentid>
     8:     <timestamp>2015-07-18T16:12:04Z</timestamp>
     9:     <contributor>
    10:       <username>Sunshineisles2</username>
    11:       <id>10907185</id>
    12:     </contributor>
    13:     <minor />
    14:     <model>wikitext</model>
    15:     <format>text/x-wiki</format>
    16:     <text xml:space="preserve">{{About|the American president}}
    17: {{pp-protected|small=yes}}
    18: {{pp-move-indef}}
    19:
    20: {{Use mdy dates|date=January 2015}}
    21: '''Abraham Lincoln''' {{IPAc-en|audio=lincoln.ogg|ˈ|eɪ|b|r|ə|h|æ|m|_|ˈ|l|ɪ|ŋ|k|ən}}
    22: (February 12, 1809 Apri
    23: l 15, 1865) was the [[List of Presidents of the United States|16th President of the
    24: United States]], serving
    25: from March 1861 until [[Assassination of Abraham Lincoln|his assassination in April
    26: 1865]]. Lincoln led the U
    27: nited States through its [[American Civil War|Civil War]] its bloodiest war and its
    28: greatest moral, constitut
    29: ional, and political crisis.&lt;ref&gt;{{cite book|author=William A.
    30: Pencak|title=Encyclopedia of the Veteran
    31: in
    32: America|url=https://books.google.com/?id=yyvmcmsnnb4c&amp;pg=PA222|year=2009|publisher
    33: =ABC-CLIO|page=222|
    34: isbn=978-0-313-08759-2}}&lt;/ref&gt;&lt;ref&gt;{{cite book|author1=Paul
    35: Finkelman|author2=Stephen E. Gottlieb
    36: |title=Toward a Usable Past: Liberty Under State
    37: Constitutions|url=https://books.google.com/books?id=xjuxt1sv
    38: hFcC&amp;pg=PA388|year=2009|publisher=U of Georgia Press|page=388|isbn=978-0-8203-
    39: 3496-7}}&lt;/ref&gt; In doi
    40: ng so, he preserved the [[Union (American Civil War)|Union]], abolished slavery,
    41: strengthened the federal gov
    42: ernment, and modernized the economy.
    43:

The dataset follows the XML format described at the beginning of the dump file; the format is also documented on the Dump format page [7]. The size of the data dump is 12.1 GB as a bz2 file, and about 50 GB when decompressed.

You should use the value of the XML tag <title> as the title of an article. In the above example, the title of the article is Abraham Lincoln, as specified in the <title> tag in line 2. Do not use the complete URL of the article.

Links are specified in [[ ]]. Consider only link targets that exist as article titles, and use only the target title of each link (the part before the | separator), not its anchor text. For example, the link written as [[Union (American Civil War)|Union]] in line 40 should be translated as Union (American Civil War). Although a page named Union exists, it is not the actual link from the Abraham Lincoln page; Union is merely the word that carries the link to Union (American Civil War). Do not treat links to external web pages as valid links. Pages that contain a redirection (i.e., a <redirect title="abc" /> tag) should be ignored as well. For a link to a subsection of an article, the main article title should represent the link.
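
The following minimal sketch (a hypothetical helper, not a required interface) extracts link targets from a chunk of wikitext according to the rules above: it keeps only the target title before any | separator and strips any #subsection suffix so the link resolves to the main article title. Filtering out redirect pages and targets that are not actual article titles would happen elsewhere in your pipeline:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Minimal sketch: extract internal link targets from wikitext.
    public class LinkExtractor {
        // Matches [[target]] or [[target|anchor]]; the capture group stops
        // at '|' (anchor text) or '#' (subsection), per the rules above.
        private static final Pattern LINK =
                Pattern.compile("\\[\\[([^\\]|#]+)[^\\]]*\\]\\]");

        public static List<String> extractLinks(String wikitext) {
            List<String> targets = new ArrayList<String>();
            Matcher m = LINK.matcher(wikitext);
            while (m.find()) {
                String target = m.group(1).trim();
                if (!target.isEmpty()) {
                    targets.add(target);
                }
            }
            return targets;
        }

        public static void main(String[] args) {
            // Prints [Union (American Civil War)], as in the line-40 example.
            System.out.println(extractLinks(
                    "he preserved the [[Union (American Civil War)|Union]], abolished slavery"));
        }
    }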

(7) Grading

Your submission will be graded based on a demonstration to the instructor in CSB-120. All of the items described in (2) will be evaluated. The grading will be done on a 200-point scale, and this assignment accounts for 20% of your final course grade.

- 80 points: Implementation of the idealized PageRank algorithm
- 80 points: Implementation of the taxation-based PageRank algorithm
- 20 points: Cluster setup
- 20 points: Analysis

(8) Late Policy

Please check the late policy posted on the course web page.

Notes

[1] Larry Page, "PageRank: Bringing Order to the Web," Stanford Digital Library Project, 1997.
[2] https://en.wikipedia.org/wiki/Wikipedia:Statistics
[3] https://meta.wikimedia.org/wiki/Data_dumps
[4] Jure Leskovec, Anand Rajaraman, and Jeff Ullman, "Chapter 5: Link Analysis," in Mining of Massive Datasets, Cambridge University Press. http://infolab.stanford.edu/~ullman/mmds/ch5.pdf
[5] https://en.wikipedia.org/wiki/Wikipedia:Dead-end_pages
[6] enwiki dump progress on 20150805, https://dumps.wikimedia.org/enwiki/20150805/
[7] https://meta.wikimedia.org/wiki/Data_dumps/Dump_format