Hadoop in Social Network Analysis - overview on tools and some best practices




Hadoop in Social Network Analysis - overview on tools and some best practices. GridKa School 2013, Karlsruhe, 2013-08-27. Mirko Kämpf, mirko@cloudera.com

Mirko Kämpf: Physicist (TU Chemnitz, 2009), Java Trainer since 2003, Java Developer since 1996, Committer and PPMC member @ ASF, Hadoop Trainer at Cloudera, Inc. Research project: SOCIONICAL, Martin-Luther-Universität Halle-Wittenberg. Open source activity: Hadoop Development Tools (Apache HDT), Hadoop.TS (on GitHub).

WHAT'S COMING? 1) Complex Systems, from time series to networks... 2) Data, data, and even more data... but how to handle it? 3) Some results of our project... 4) Lessons learned, some recommendations...

Hadoop in Social Network Analysis. Abstract: A Hadoop cluster is the tool of choice for many large scale analytics applications. A large variety of commercial tools is available for typical SQL-like or data warehouse applications, but how to deal with networks and time series? How to collect and store data for social media analysis, and what are good practices for working with libraries like Mahout and Giraph? The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing and for time dependent graph analysis. Keywords: Bulk Processing, Datatypes, Pig, I/O-Formats, SequenceFile, OpenTSDB, Hive, HBase, Distributed Storage, Random-Access, Sampling Rate, VectorWritable, Partitions.

(A) The Hadoop ecosystem offers a new technology to store and process large data sets, which are in the focus of interdisciplinary research. (B) Our data sets are created or generated by highly dynamic and flexible social media applications. (C) This requires new scientific approaches from complex systems research and also new technology... and the loop is closed.

Social networks consist of nodes, which are the real-world objects, and edges, which are, e.g., relations, interactions or dependencies between nodes.

Complex Networks - Definitions: "A system comprised of a (usually large) number of (usually strongly) interacting entities, processes, or agents, ... the understanding of which requires the development, or the use of, new scientific tools, nonlinear models, out-of-equilibrium descriptions and computer simulations." [Advances in Complex Systems Journal] "A system that can be analyzed into many components having relatively many relations among them, so that the behavior of each component depends on the behavior of others." [Herbert Simon] "A system that involves numerous interacting agents whose aggregate behaviors are to be understood. Such aggregate activity is nonlinear, hence it cannot simply be derived from summation of individual components behavior." [Jerome Singer] Social networks are complex networks. Characteristic aspects: nonlinear models, out-of-equilibrium descriptions, aggregation, dynamics of components, dynamics of subsystems, interaction, hierarchical systems, superposition not possible, dependency cycles. Based on Rocha, Luis M. [1999]. BITS: Computer and Communications News. Computing, Information, and Communications Division. Los Alamos National Laboratory. Nov. 1999.

INTRODUCTION OF OUR PROJECT. Social online systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social online systems and their relations. Webpages (the nodes of the WWW) are linked in different, but related ways: direct links pointing from one page to another (binary, directional); similar access activity (cross-correlated time series of download rates); similar edit activity (synchronized events of edits or changes). We extract the time evolution of these three networks from real data. Nodes are identical for all three studied networks, but links and network structure as well as dynamics are different. We quantify how the interrelations and inter-dependencies between the three networks change in time and affect each other. Example: Wikipedia - reconstruct co-evolving networks: 1. cross-link network between articles (pages, nodes); 2. access behavior network (characterizing similarity in article reading behavior); 3. edit activity network (characterizing similarities in edit activity for each article).

CHALLENGES... A complex system has element properties (the measured data is disconnected), system properties (derived from relations between elements and the structure of the network), and a time evolution. Data points (time series) collected at independent locations or obtained from individual objects do not show dependencies directly. It is a common task to calculate several types of correlations, but how are these results affected by special properties of the raw data? What meaning do different correlations have, and how can we eliminate artifacts of the calculation method?

TYPICAL TASK: Time Series Analysis. If our data set is well prepared and we have records with well defined properties (as in an RDBMS), then Hive and Pig work well. But: How to organize the loose data in records? How to deal with sliding windows? How to handle intermediate data?

TIME SERIES: WIKIPEDIA USER ACTIVITY. Node = article (specific topic). Available data: 1. hourly access frequency (number of article downloads for each hour in 300 days); 2. edit events (time stamps for all changes in the Wikipedia pages). Figure: examples of Wikipedia access time series for three articles with (a,b) stationary access rates ('Illuminati (book)'), (c,d) an endogenous burst of activity ('Heidelberg'), and (e,f) an exogenous burst of activity ('Amoklauf Erfurt'). The left parts show the complete hourly access rate time series (from January 1, 2009, till October 21, 2009, i.e. for 42 weeks = 294 days = 7056 hours); the right parts show edit-event data for the three representative articles.

Typical preprocessing steps, implemented with Map-Reduce or UDFs: filtering and resampling; preprocessing and calculations on single records; feature extraction (peak detection); creation of (non-)overlapping episodes or (sliding) windows; creation of time series pairs for cross-correlation or event synchronisation.
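
To make the episode step concrete, here is a minimal plain-Java sketch (not from the original slides) that cuts one hourly access-rate series into overlapping or non-overlapping windows; the 168-hour window mentioned in the comments is only an illustrative choice.

import java.util.ArrayList;
import java.util.List;

// Cuts one hourly access-rate series into (possibly overlapping) episodes.
public class EpisodeBuilder {

    // series: full hourly series of one article (e.g. 7056 values for 294 days)
    // window: episode length in hours (e.g. 168 = one week, illustrative)
    // step:   shift between consecutive episodes; step < window means overlap
    public static List<double[]> episodes(double[] series, int window, int step) {
        List<double[]> result = new ArrayList<double[]>();
        for (int start = 0; start + window <= series.length; start += step) {
            double[] episode = new double[window];
            System.arraycopy(series, start, episode, 0, window);
            result.add(episode);
        }
        return result;
    }
}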

Graph Analysis. If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or Mahout work well. But: only if the appropriate data structures and Input-Format readers exist.
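
As an illustration of what such a data structure could look like, the following is a hypothetical Hadoop Writable for one adjacency-list entry; the class name and the line format mentioned in the comments are made up for this sketch and are not part of the talk.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Minimal adjacency-list value type: the outgoing neighbours of one article node.
// A matching InputFormat / RecordReader could parse lines like "pageA pageB pageC"
// into (Text pageA, AdjacencyListWritable [pageB, pageC]).
public class AdjacencyListWritable implements Writable {

    private String[] neighbours = new String[0];

    public AdjacencyListWritable() { }

    public AdjacencyListWritable(String[] neighbours) {
        this.neighbours = neighbours;
    }

    public String[] getNeighbours() {
        return neighbours;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(neighbours.length);
        for (String n : neighbours) {
            out.writeUTF(n);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        neighbours = new String[in.readInt()];
        for (int i = 0; i < neighbours.length; i++) {
            neighbours[i] = in.readUTF();
        }
    }
}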

TYPES OF NETWORKS: a) unipartite network, one type of nodes and links; b) bipartite network, two types of nodes, one type of connections; c) hypergraph, one link relates more than two nodes. In all three cases there are links of just one single type.

TYPES OF NETWORKS (continued): multiplex networks, with links of multiple different types; hierarchical networks.

Graph Analysis: Large scale raw data sets have to be stored and processed in a scalable distributed system. How to organize node/edge properties? How to deal with time dependent properties? How to calculate link properties on the fly? Some observations: creation of the adjacency matrix is not trivial; the adjacency matrix is an inefficient format; files stored in HDFS are read-only and cannot be changed. Therefore: store dynamic edge/node properties in HBase; aggregate relevant data into network snapshots; store intermediate results back to HBase and preprocess this data in a following utility step.
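
A minimal sketch of the "store dynamic edge properties in HBase" idea, assuming the HBase client API as it looked around 2013; the table name "edges", the column family "p" and the "source|target" row key layout are illustrative assumptions, not the project's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Stores one time-dependent edge property (link strength at a given hour) in HBase.
// Row key = "sourceArticle|targetArticle", column family "p", qualifier "weight";
// the HBase cell timestamp carries the observation time, so older values stay readable.
public class EdgePropertyStore {

    public static void storeWeight(String source, String target,
                                   long timestamp, double weight) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "edges");   // hypothetical table name
        try {
            Put put = new Put(Bytes.toBytes(source + "|" + target));
            put.add(Bytes.toBytes("p"), Bytes.toBytes("weight"),
                    timestamp, Bytes.toBytes(weight));
            table.put(put);
        } finally {
            table.close();
        }
    }
}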

RECAP: WHAT IS HADOOP? A distributed platform to store and process massive amounts of data in parallel. It implements the Map-Reduce paradigm on top of the Hadoop Distributed File System (HDFS). Map-Reduce: a Java API to implement a map and a reduce phase; the map phase uses key/value pairs, the reduce phase uses key/value-list pairs. HDFS files consist of one or more blocks (distributed chunks of data). Chunks are distributed transparently (in the background) and processed in parallel, using data locality when possible by assigning the map task to a node that holds the chunk locally.
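
For readers new to the API, here is a minimal word-count-style example of the map and reduce phases just described; it is a sketch, not code from the talk, and the tab-separated input format "article<TAB>count" is an assumption.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: parses one input line per (article, hour) access record,
// e.g. "Heidelberg<TAB>42", and emits (article, count) key/value pairs.
public class AccessCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length == 2) {
            context.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
        }
    }
}

// Reduce phase: receives (article, list of counts) and emits the total per article.
// In a real project this class would live in its own source file.
class AccessCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text article, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        context.write(article, new LongWritable(sum));
    }
}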

MAP-REDUCE: TYPICAL APPLICATIONS. Filter, group, and join operations on large data sets: the data set (or, if partitioning is used, a part of it) is streamed and processed in parallel, but usually not in real time. Algorithms like k-means clustering (Apache Mahout) or Map-Reduce based implementations of SSSP (single-source shortest path) work in multiple iterations; data is loaded from disk to CPU in each iteration, which causes a heavy I/O workload.

HADOOP: a platform for large scale data integration, algorithm & method development, and scientific applications.

OVERVIEW - DATA FLOW (diagram): data collection and storage with Sqoop, Flume, HBase (as a cache) and HDFS (static data); data (pre)processing with Hive, Pig, Impala, Map-Reduce jobs and UDFs; graph and network analysis with Giraph and Mahout. Note: network clustering != k-means clustering. The Hadoop ecosystem in action...

RESULTS: CROSS-CORRELATION. Distribution of cross-correlation coefficients for pairs of access-rate time series of Wikipedia pages (top: measured data), compared to surrogate data (bottom); 100 shuffled configurations are considered.
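
For reference, the standard zero-lag Pearson estimator behind such a distribution can be written in a few lines of Java; whether the project used exactly this estimator, or a lagged variant, is not stated in the slides.

// Zero-lag Pearson cross-correlation coefficient of two equal-length series.
public final class CrossCorrelation {

    public static double coefficient(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}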

EVOLUTION OF CORRELATION NETWORKS (network snapshots at successive times). Obviously, the distribution of link strength values changes in time, but only a calculation of structural properties of the underlying network allows a detailed view on the dynamics of the system.

EXAMPLES OF RECONSTRUCTED NETWORKS. Spatially embedded correlation networks for the Wikipedia pages of all German cities: the Wikipedia access network and the Wikipedia edit network.

RECOMMENDATIONS (1.) Create algorithms based on reusable components! Use or create stable and standardized I/O-Formats! Do preprocessing, e.g. a re-organization of unstructured data, if you have to process the data many times. Collect event data in HBase and create Time-Series Buckets for advanced procedures, maybe on a subset of the data. Store intermediate data (e.g. time dependent properties) in HBase, close to the raw data, and allow random access.
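
One possible shape of such a time-series bucket, assuming SequenceFile storage and Mahout's VectorWritable (both appear in the keyword list earlier in the talk); the class name and schema are illustrative, not the project's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Writes one time-series bucket (article name -> hourly access-rate vector)
// as a record of a SequenceFile on HDFS, ready for Mahout-style processing.
public class TimeSeriesBucketWriter {

    public static void writeBucket(Path out, String article, double[] hourlyRates)
            throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, VectorWritable.class);
        try {
            writer.append(new Text(article),
                          new VectorWritable(new DenseVector(hourlyRates)));
        } finally {
            writer.close();
        }
    }
}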

RECOMMENDATIONS (2.) Consider design patterns: partitioning vs. binning, map-side vs. reduce-side joins. Use Bulk Synchronous Processing for graph processing instead of Map-Reduce, or even a combination of both. As in classical programming (and also in Hadoop!): find a good data representation in order to find good algorithms. Think about access patterns: streaming vs. random access.
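
As one concrete example of the partitioning pattern, here is a custom Partitioner that keeps all records of one article on the same reducer; the composite key layout "article|hour" is an assumption made for this sketch.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes all records of one article (the key part before '|') to the same reducer,
// so a complete per-article time series can be assembled in a single reduce call.
public class ArticlePartitioner extends Partitioner<Text, LongWritable> {

    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        String article = key.toString().split("\\|")[0];
        return (article.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(ArticlePartitioner.class).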

REFERENCES

MANY THANKS!!! To the Wikimedia Foundation; to the audience, here in Karlsruhe; to my supervisor and collaborators at MLU: PD Dr. Jan W. Kantelhardt, Berit Schreck, Arne Böcker; and to my colleagues at Cloudera, Inc.: Kai Voigt, Glynn Durham, and Tom Wheeler.