Hadoop SNS. renren.com. Saturday, December 3, 11

Similar documents
Similarity Search in a Very Large Scale Using Hadoop and HBase

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Big Data and Scripting map/reduce in Hadoop

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Machine Learning using MapReduce

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CSE-E5430 Scalable Cloud Computing Lecture 2

Apache Hama Design Document v0.6

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Hadoop Parallel Data Processing

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Distributed Computing and Big Data: Hadoop and MapReduce

Cloudera Certified Developer for Apache Hadoop

Map-Reduce for Machine Learning on Multicore

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

MapReduce Job Processing

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Analysis of MapReduce Algorithms

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Hadoop Design and k-means Clustering

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Fast Data in the Era of Big Data: Twitter s Real-

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Open source large scale distributed data management with Google s MapReduce and Bigtable

HiBench Introduction. Carson Wang Software & Services Group

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Open source Google-style large scale data analysis with Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

DATA CLUSTERING USING MAPREDUCE

Using Map-Reduce for Large Scale Analysis of Graph-Based Data

Case Study : 3 different hadoop cluster deployments

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Advanced Big Data Analytics with R and Hadoop


Data Structures and Performance for Scientific Computing with Hadoop and Dumbo

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Big Data Processing with Google s MapReduce. Alexandru Costan

Jeffrey D. Ullman slides. MapReduce for data intensive computing

High Performance Computing MapReduce & Hadoop. 17th Apr 2014

Hadoop: Embracing future hardware

Map Reduce & Hadoop Recommended Text:

Information Retrieval and Web Search Engines

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Constructing a Data Lake: Hadoop and Oracle Database United!

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Hadoop Usage At Yahoo! Milind Bhandarkar

I/O Considerations in Big Data Analytics

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

Advances in Natural and Applied Sciences

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Hadoop and Map-Reduce. Swati Gore

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA


Hadoop Ecosystem B Y R A H I M A.

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

Integrating Big Data into the Computing Curricula

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

How Companies are! Using Spark

Challenges for Data Driven Systems

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Big Data With Hadoop

Clustering UE 141 Spring 2013

A very short Intro to Hadoop

A comparison of various clustering methods and algorithms in data mining

THE HADOOP DISTRIBUTED FILE SYSTEM

Duke University

ITG Software Engineering

Big Systems, Big Data

Part 2: Community Detection

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Workshop on Hadoop with Big Data

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Fault Tolerance in Hadoop for Work Migration

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Large-Scale Test Mining

ITG Software Engineering

Extending Hadoop beyond MapReduce

HADOOP MOCK TEST HADOOP MOCK TEST II

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Transcription:

Hadoop SNS renren.com Saturday, December 3, 11

2.2 190 40 Saturday, December 3, 11

Saturday, December 3, 11

Saturday, December 3, 11

Saturday, December 3, 11

Saturday, December 3, 11

Saturday, December 3, 11 7

200 Hadoop 0.21.0 4k+ / 700TB Used/1.2PB Total Hive/HBase/Streaming 30 Hadoop 0.20.3 HBase only Saturday, December 3, 11

XceLog M/R HDFS Web Hive HBase ActiveTrack Streaming Saturday, December 3, 11

BI ios Android WAP Hive M/R Saturday, December 3, 11

Social computing at Renren Data driven applications

Distribute TB of daily log data to be analyzed Millions of blogs, videos to be recommended Hundreds of millions of friends to be recommended The most Computational- Intensive applications with highly structured big data

MapReduce Data Points Adjacent list data structure A sparse representation facilitate to pass graph structures from iteration to the next iteration Parallel Breadth-first search To find shortest path in the graph Google page rank Impact index passing through graph links Distributed k-means clustering Clustering large data into pre-defined number of groups

Case I: Friend recommendation by agglomerative hierarchical clustering Primary problem of friend recommendation User familiarity Common friends User profile User access User interest

People you may know Friends friends similarity( user1, user2) friendset1 friendset2

Hierarchy Clustering to find communities in social network All in one community share some properties. These overlapping communities reveal some social relationship of different levels. They help to building new friendships in the social network.

Clustering: unsupervised learning flat clustering hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical clustering

Hierarchical agglomerative clustering Monotonic 0 0.2 0.4 0.6 0.8 Method: Merge the nearest clusters until a single cluster is left Procedure HAC (N points, stop criterion) { (1) Initialize n points as n cluster centers; (2) Iterate over centers until stop criterion is satisfied: a. Compute pair-wise similarity between any two centers sim( ci, c j) b. Find the nearest pair of centers c. Merge the two centers i, j argmax sim( c, c ) i, j (3) Output the hierarchical clusters. } i j

Pair-wise distances 0 j B1 B2 B3 Bk i Symmetric similarity matrix

Iterative map/reduce Data Points Assignment table Iterative map/reduce Mapper: assign Block IDs to each data point. Reducer: each reducer is responsible for find the local nearest pair of centers. Find the global nearest centers and merge them

Assignment table 0 j B1 k k B2 B3 Bk i

Iterative map/reduce Data Points Data Points Iterative map/reduce Mapper: preload centers in memory and compute distances from each center to all the centers coming into mappers. Reducer: each reducer is responsible for find the local nearest pair of centers. Find the global nearest centers and merge them

Block map/reduce Data Points Data Points Data Points Data splits Data splits Data splits New centers Find the global nearest pair of centers and merge them

Partition 100,000,000 users have to be partitioned to blocks before clustering. User profile helps to partition users into overlapping blocks. There are millions of blocks and each block contains several thousands of users on average.

Speed-up For small blocks, iterative map/reduce is not efficient for the overhead of start and end of a job. Only few of elements of the similarity matrix need to be updated.

One-off map/reduce Data Points One-off map/reduce Mapper: Passing friend list to reducers. Reducer: Agglomeration until clustering stops and output clustering results.

Scalability Only suitable for m*n matrix where m,n < 4 10

Distance caching Avoid re-calculating pair-wise distance between centers not for agglomeration from iteration to the next. j i

Compressed storage for lower triangle Choose a row compressed storage mode to keep pair-wise distance between centers in memory. 0 i 0 j n a ij n K=(i+1)*i/2+j a a 00 10 a11 aij a n 1 n 1 0 1 2 k n(n+1)/2-1

Performance 180 Iterative map/reduce 160 140 Time(sec) 120 100 80 60 40 20 0 1000 2000 3000 4000 5000 6000 7000 8000 Number of data points

Performance 400 Overall efficiency Time(m) 350 300 250 200 150 100 Iterative one-off Cached distances 50 0 10 20 30 40 Num of data points(million)

Case II: Topic detection by Mean-shift clustering Clustering to find topics in news feed High density indicates hot news. Without knowing the number of topics.

Mean shift

Mean shift Region of interest

Mean shift Region of interest Center of mass

Mean shift Region of interest Center of mass Mean Shift vector

Mean shift Region of interest Center of mass Mean Shift vector

Mean shift clustering Compute center of mass give a random point iteration Generate a new center Possible merge between close centers

Iterative map/reduce Data Points Assign label to points within bandwidth Iterative map/reduce Mapper: select a point as the center and compute distances from the center to all the data points, and assign a label to the points within the bandwidth. Reducer: collect data points of the same label and compute the center of mass of them. Update centers

Performance 180 Iterative map/reduce Time(sec) 160 140 120 100 80 60 40 20 HAC Mean shift 0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Num of data points

Performance 140 Overall efficiency 120 Time(m) 100 80 60 40 HAC Mean shift 20 0 10 20 30 40 Num of points(million)

bochun.bai@renren-inc.com http://renren.com/bbc yeyin.zhang@renren-inc.com http://renren.com/hartebeest Friday, December 2, 11