Hadoop in Social Network Analysis - overview on tools and some best practices
1 Hadoop in Social Network Analysis - overview on tools and some best practices. GridKa School 2013, Karlsruhe. Mirko Kämpf
2 Mirko Kämpf: Physicist, TU Chemnitz, 2009; Java Trainer, since 2003; Java Developer, since 1996; Committer, ASF; Hadoop Trainer, Cloudera, Inc. Research Project: SOCIONICAL, Martin-Luther Universität Halle-Wittenberg. Open Source Activity: Hadoop Development Tools (Apache HDT), Hadoop.TS (on GitHub)
3 WHAT'S COMING? 1) Complex Systems, from Time Series to Networks... 2) Data, data, and even more data... but how to handle it? 3) Some results of our project... 4) Lessons learned, some recommendations...
4 Hadoop in Social Network Analysis. Abstract: A Hadoop cluster is the tool of choice for many large-scale analytics applications. A large variety of commercial tools is available for typical SQL-like or data warehouse applications, but how to deal with networks and time series? How to collect and store data for social media analysis, and what are good practices for working with libraries like Mahout and Giraph? The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing and for time-dependent graph analysis. Keywords: Bulk Processing, Datatypes, Pig, I/O-Formats, SequenceFile, OpenTSDB, Hive, HBase, Distributed Storage, Random-Access, Sampling Rate, VectorWritable, Partitions
5 Hadoop in Social Network Analysis: (A) The Hadoop ecosystem offers a new technology to store and process large data sets, which are in the focus of interdisciplinary research. (B) Our data sets are created or generated by highly dynamic and flexible social media applications. (C) This requires new scientific approaches from complex systems research and also new technology... and the loop is closed.
6 Hadoop in Social Network Analysis: Social networks consist of nodes, which are the real-world objects, and edges, which are, e.g., relations, interactions or dependencies between nodes.
7 Complex Networks. Definitions: (1) "A system comprised of a (usually large) number of (usually strongly) interacting entities, processes, or agents, ... the understanding of which requires the development, or the use of, new scientific tools, nonlinear models, out-of-equilibrium descriptions and computer simulations." [Advances in Complex Systems Journal] (2) "A system that can be analyzed into many components having relatively many relations among them, so that the behavior of each component depends on the behavior of others." [Herbert Simon] (3) "A system that involves numerous interacting agents whose aggregate behaviors are to be understood. Such aggregate activity is nonlinear, hence it cannot simply be derived from summation of individual components' behavior." [Jerome Singer] Social networks are complex networks. Key aspects: nonlinear models; out-of-equilibrium; aggregation; dynamics of components; dynamics of subsystems; interaction; hierarchical systems; superposition not possible; dependency cycles. Based on Rocha, Luis M. [1999]. BITS: Computer and Communications News. Computing, Information, and Communications Division. Los Alamos National Laboratory. Nov
8 INTRODUCTION OF OUR PROJECT Social online systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social online systems and their relations.
9 INTRODUCTION OF OUR PROJECT Webpages (the nodes of the WWW) are linked in different, but related ways: (1) direct links pointing from one page to another (binary, directional); (2) similar access activity (cross-correlated time series of download rates); (3) similar edit activity (synchronized events of edits or changes). We extract the time evolution of these three networks from real data. Nodes are identical for all three studied networks, but links and network structure as well as dynamics are different. We quantify how the interrelations and interdependencies between the three networks change in time and affect each other.
10 INTRODUCTION OF OUR PROJECT Example: Wikipedia. We reconstruct co-evolving networks: 1. Cross-link network between articles (pages, nodes). 2. Access behavior network (characterizing similarity in article reading behavior). 3. Edit activity network (characterizing similarities in edit activity for each article).
11 CHALLENGES... Complex System: Time evolution??? Data points (time series) collected at independent locations or obtained from individual objects do not show dependencies directly. It is a common task to calculate several types of correlations, but how are these results affected by special properties of the raw data? What meaning do different correlations have, and how can we eliminate artifacts of the calculation method?
13 CHALLENGES... Element properties: measured data is disconnected.
15 CHALLENGES... System properties: derived from relations between elements and structure of the network.
17 TYPICAL TASK: Time Series Analysis. If our data set is well prepared and we have records with well-defined properties (as in an RDBMS), then Hive and Pig work well. How to organize the loose data in records? How to deal with sliding windows? How to handle intermediate data?
18 TIME SERIES: WIKIPEDIA USER ACTIVITY Node = article (specific topic). Available data: 1. Hourly access frequency (number of article downloads for each hour in 300 days). 2. Edit events (time stamps for all changes in the Wikipedia pages). Figure: Examples of Wikipedia access time series for three articles with (a,b) stationary access rates ('Illuminati (book)'), (c,d) an endogenous burst of activity ('Heidelberg'), and (e,f) an exogenous burst of activity ('Amoklauf Erfurt'). The left parts show the complete hourly access rate time series (from January 1, 2009, till October 21, 2009; i.e. for 42 weeks = 294 days = 7056 hours). The right parts show edit-event data for the three representative articles.
19 TIME SERIES: WIKIPEDIA USER ACTIVITY Processing steps (Map-Reduce / UDF): filtering, resampling; preprocessing, calculation on single records; feature extraction (peak detection); creation of (non-)overlapping episodes or (sliding) windows; creation of time series pairs for cross-correlation or event synchronisation.
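The window and pair-creation steps can be sketched in plain Python, outside of Hadoop. This is a minimal illustration, not the project's implementation: the function names (`sliding_windows`, `pearson`, `windowed_correlation`) and the sample values are assumptions, and the Pearson coefficient stands in for whichever cross-correlation measure is actually used.

```python
from math import sqrt


def sliding_windows(series, width, step):
    """Yield (start_index, window) pairs.

    step == width gives non-overlapping episodes,
    step < width gives overlapping (sliding) windows.
    """
    for start in range(0, len(series) - width + 1, step):
        yield start, series[start:start + width]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant window: undefined, reported as 0 in this sketch
    return cov / sqrt(var_x * var_y)


def windowed_correlation(a, b, width, step):
    """Correlate two time series window by window."""
    windows_a = sliding_windows(a, width, step)
    windows_b = sliding_windows(b, width, step)
    return [(start, pearson(x, y))
            for (start, x), (_, y) in zip(windows_a, windows_b)]


# Hypothetical hourly access counts for two articles.
hourly_a = [1, 2, 3, 4, 5, 6, 7, 8]
hourly_b = [2, 4, 6, 8, 7, 5, 3, 1]
result = windowed_correlation(hourly_a, hourly_b, width=4, step=4)
```

In a Map-Reduce setting, the window index would be a natural candidate for the key, so that all pairs belonging to one window end up in the same reduce call.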
20 Graph Analysis. If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or Mahout work well. But: only if the appropriate data structures and Input-Format Readers exist.
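As an illustration of the preparation step, the following sketch turns a weighted edge list into an adjacency list and serializes it as one text line per node. The line layout (`node<TAB>target:weight,...`) is a made-up example format, not a format prescribed by Giraph or Mahout; a matching input-format reader would still have to exist or be written.

```python
from collections import defaultdict


def to_adjacency_list(edges):
    """Build {node: {neighbor: weight}} from (source, target, weight) triples."""
    adjacency = defaultdict(dict)
    for source, target, weight in edges:
        adjacency[source][target] = weight
        adjacency.setdefault(target, {})  # keep nodes without outgoing links
    return dict(adjacency)


def to_lines(adjacency):
    """Serialize to a hypothetical text layout: node<TAB>target:weight,..."""
    return [
        f"{node}\t" + ",".join(f"{t}:{w}" for t, w in sorted(neighbors.items()))
        for node, neighbors in sorted(adjacency.items())
    ]


# Hypothetical correlation links between three articles.
edges = [
    ("Heidelberg", "Karlsruhe", 0.8),
    ("Heidelberg", "Erfurt", 0.3),
    ("Karlsruhe", "Heidelberg", 0.8),
]
adjacency = to_adjacency_list(edges)
lines = to_lines(adjacency)
```

Compared with an adjacency matrix, this representation only stores existing links, which matters for sparse networks.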
21 TYPES OF NETWORKS a) unipartite network: one type of nodes and links; b) bipartite network: one type of connections; c) hypergraph: one link relates more than two nodes. In all three cases: links of just one single type.
22 TYPES OF NETWORKS Multiplex networks: links of multiple different types. Hierarchical networks.
23 Graph Analysis. Large-scale raw data sets have to be stored and processed in a scalable distributed system. How to organize node/edge properties? How to deal with time-dependent properties? How to calculate link properties on the fly? Observations: creation of an adjacency matrix is not trivial; an adjacency matrix is an inefficient format; files stored in HDFS are read-only and cannot be changed. Approach: store dynamic edge/node properties in HBase; aggregate relevant data to network snapshots; store intermediate results back to HBase and preprocess this data in a following utility step.
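The idea of aggregating time-dependent edge data into network snapshots can be sketched as follows. This is only an in-memory illustration under stated assumptions: the record layout `(time, source, target, weight)`, the function name `network_snapshots` and the summing of weights per window are all choices made for this example, and no HBase access is shown.

```python
from collections import defaultdict


def network_snapshots(edge_events, window):
    """Aggregate time-stamped edge records into one edge map per time window.

    edge_events: iterable of (time, source, target, weight), time as an integer.
    Returns {window_index: {(source, target): summed_weight}}.
    """
    snapshots = defaultdict(lambda: defaultdict(float))
    for time, source, target, weight in edge_events:
        snapshots[time // window][(source, target)] += weight
    return {index: dict(edges) for index, edges in sorted(snapshots.items())}


# Hypothetical edge events, time given in hours.
edge_events = [
    (0, "A", "B", 1.0),
    (5, "A", "B", 0.5),
    (12, "B", "C", 2.0),
    (25, "A", "B", 1.0),
]
snapshots = network_snapshots(edge_events, window=10)
```

Each snapshot is a small static network, so structural properties can be computed per window and compared over time.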
24 RECAP: WHAT IS HADOOP? Distributed platform to store and process massive amounts of data in parallel. Implements the Map-Reduce paradigm on top of the Hadoop Distributed File System (HDFS).
25 RECAP: WHAT IS HADOOP? Map-Reduce: Java API to implement a map and a reduce phase. The map phase uses key/value pairs, the reduce phase uses key/value-list pairs.
26 RECAP: WHAT IS HADOOP? HDFS files consist of one or more blocks (distributed chunks of data).
27 RECAP: WHAT IS HADOOP? Chunks are distributed transparently (in the background) and processed in parallel.
28 RECAP: WHAT IS HADOOP? Data locality is used when possible by assigning the map task to a node that contains the chunk locally.
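The key/value and key/value-list contract of the two phases can be made concrete with a small in-memory simulation. This is not the Hadoop Java API; the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the tab-separated log layout are assumptions for illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call emits (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, turning key/value pairs into key/value-list pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with the full list of values."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def access_mapper(line):
    # Hypothetical log layout: article<TAB>hour<TAB>count
    article, _hour, count = line.split("\t")
    yield article, int(count)


def sum_reducer(_key, values):
    return sum(values)


log = ["Heidelberg\t0\t12", "Erfurt\t0\t3", "Heidelberg\t1\t8"]
totals = reduce_phase(shuffle(map_phase(log, access_mapper)), sum_reducer)
```

In a real cluster the grouping step is done by the framework between the two phases, and mappers run on the nodes that hold the input chunks.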
29 MAP REDUCE: TYPICAL APPLICATIONS Filter, group, and join operations on large data sets: the data set (or a part of it, if partitioning is used) is streamed and processed in parallel, but usually not in real time. Algorithms like k-means clustering (Apache Mahout) or Map-Reduce based implementations of SSSP (single-source shortest paths) work in multiple iterations: data is loaded from disk to CPU in each iteration, which means a heavy I/O workload.
30 HADOOP: Platform for large-scale data integration; Algorithm & Method Development; Scientific Applications
31 OVERVIEW - DATA FLOW
32 OVERVIEW - DATA FLOW Data collection and storage. [Diagram labels: Sqoop; Flume; HBase (cache); HDFS (static group); & Tools]
33 OVERVIEW - DATA FLOW Data (pre)processing. [Diagram labels: Hive, Pig, Impala; MR, UDF; Giraph, Mahout; & Tools]
34 OVERVIEW - DATA FLOW Network clustering != k-means clustering. [Diagram labels: Hive, Pig, Impala; MR, UDF; Giraph, Mahout; & Tools]
35 OVERVIEW - DATA FLOW The Hadoop ecosystem in action... [Diagram labels: Sqoop; Flume; HBase (cache); HDFS (static group); Hive, Pig, Impala; MR, UDF; Giraph, Mahout; & Tools]
36 RESULTS: CROSS-CORRELATION Figure: Distribution of cross-correlation coefficients for pairs of access-rate time series of Wikipedia pages (top, measured data) compared to surrogate data (bottom); shuffled configurations are considered.
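One common way to obtain surrogate data is to shuffle one of the two series, which destroys the temporal order while keeping the value distribution. The sketch below follows that idea; it is an assumption that this matches the procedure behind the figure, and the function names, the fixed seed and the sample series are made up for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant series: undefined, reported as 0 in this sketch
    return cov / sqrt(var_x * var_y)


def surrogate_correlations(x, y, n_shuffles, seed=0):
    """Correlate x with randomly shuffled copies of y (surrogate data)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    values = []
    for _ in range(n_shuffles):
        shuffled = list(y)
        rng.shuffle(shuffled)
        values.append(pearson(x, shuffled))
    return values


# Hypothetical access-rate series of two articles with a common burst.
x = [3, 5, 9, 14, 9, 5, 3, 2, 2, 3]
y = [2, 4, 8, 15, 10, 6, 3, 2, 1, 2]
measured = pearson(x, y)
surrogates = surrogate_correlations(x, y, n_shuffles=100)
# Fraction of surrogates that reach the measured value.
exceed_fraction = sum(1 for c in surrogates if c >= measured) / len(surrogates)
```

A measured coefficient that lies far outside the surrogate distribution is unlikely to be an artifact of the value distribution alone.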
37 EVOLUTION OF CORRELATION NETWORKS [Figure: two panels, each with a time axis]
39 EVOLUTION OF CORRELATION NETWORKS The distribution of link strength values clearly changes in time, but only a calculation of structural properties of the underlying network allows a detailed view of the dynamics of the system.
41 EXAMPLES OF RECONSTRUCTED NETWORKS Spatially embedded correlation networks for the Wikipedia pages of all German cities. Panels: Wikipedia Access Network; Wikipedia Edit Network.
42 RECOMMENDATIONS (1.) Create algorithms based on reusable components! Use or create stable and standardized I/O-Formats! Do preprocessing, e.g. a re-organization of unstructured data, if you have to process the data many times. Collect event data in HBase and create Time-Series Buckets for advanced procedures, maybe on a subset of the data. Store intermediate data (e.g. time dependent properties) in HBase, close to the raw data, and allow random access.
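The step from raw event data to time-series buckets can be sketched in plain Python. The bucket key `(article, day)` with 24 hourly counters is a hypothetical layout chosen for this example, loosely modelled on a row-key design; no HBase client calls are shown, and the helper `ts` only builds sample timestamps.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_events(events):
    """Group (article, unix_timestamp) events into daily buckets of hourly counts.

    Returns {(article, "YYYY-MM-DD"): [count_hour_0, ..., count_hour_23]}.
    """
    buckets = defaultdict(lambda: [0] * 24)
    for article, timestamp in events:
        t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        buckets[(article, t.strftime("%Y-%m-%d"))][t.hour] += 1
    return dict(buckets)


def ts(year, month, day, hour, minute):
    """Helper for the example: UTC wall-clock time to unix timestamp."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc).timestamp()


# Hypothetical edit events.
events = [
    ("Heidelberg", ts(2009, 1, 1, 0, 15)),
    ("Heidelberg", ts(2009, 1, 1, 0, 40)),
    ("Heidelberg", ts(2009, 1, 1, 13, 5)),
    ("Erfurt", ts(2009, 1, 2, 23, 59)),
]
buckets = bucket_events(events)
```

Once events are bucketed like this, a fixed-length time series per article and day can be read with a single lookup instead of a scan over all raw events.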
43 RECOMMENDATIONS (2.) Consider design patterns: Partitioning vs. Binning; Map-Side vs. Reduce-Side Joins. Use Bulk Synchronous Processing for graph processing instead of Map-Reduce, or even a combination of both. As in classical programming (and also in Hadoop!): find a good data representation to find good algorithms. Think about access patterns: streaming vs. random access.
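To show what Bulk Synchronous Processing looks like for a graph problem, here is a single-process sketch of single-source shortest paths in the vertex-centric superstep style. It is not Giraph code: the function name `bsp_sssp`, the dictionary-based graph and the message handling are assumptions for illustration, and edge weights are assumed to be non-negative.

```python
import math


def bsp_sssp(graph, source):
    """Single-source shortest paths in supersteps.

    graph: {vertex: {neighbor: weight}}; every neighbor must also be a key.
    Returns (distances, number_of_supersteps).
    """
    distance = {vertex: math.inf for vertex in graph}
    inbox = {source: [0.0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            candidate = min(messages)
            if candidate < distance[vertex]:
                distance[vertex] = candidate
                # Only vertices whose value improved send messages.
                for neighbor, weight in graph[vertex].items():
                    outbox.setdefault(neighbor, []).append(candidate + weight)
        inbox = outbox  # synchronization barrier between supersteps
        supersteps += 1
    return distance, supersteps


# Hypothetical weighted network with an isolated vertex "E".
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
    "E": {},
}
distance, supersteps = bsp_sssp(graph, "A")
```

The vertex state stays in memory between supersteps and only messages move, whereas an iterative Map-Reduce implementation reloads the graph from disk in every iteration.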
45 MANY THANKS!!! Wikimedia Foundation
46 MANY THANKS!!! To the audience, here in Karlsruhe! To my supervisor and collaborators at MLU: PD Dr. Jan W. Kantelhardt, Berit Schreck, Arne Böcker. To my colleagues at Cloudera, Inc.: Kai Voigt, Glynn Durham, and Tom Wheeler.
ITG Software Engineering
Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Peers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
I/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang
The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
Hadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?
BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand? The Big Data Buzz big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Data Analyst Program- 0 to 100
Development Data Analyst Program- 0 to 100 Master the Data Analysis tools like Pig and hive Data Science Build a recommendation engine 1 Data Analyst Program- 0 to 100 HADOOP SCHOOL OF TRAINING Basics
A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. [email protected] (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 [email protected] (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
BIG DATA HADOOP TRAINING
BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Hadoop: The Definitive Guide
FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
Apache Hadoop Ecosystem
Apache Hadoop Ecosystem Rim Moussa ZENITH Team Inria Sophia Antipolis DataScale project [email protected] Context *large scale systems Response time (RIUD ops: one hit, OLTP) Time Processing (analytics:
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
ITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
Big Data : Experiments with Apache Hadoop and JBoss Community projects
Big Data : Experiments with Apache Hadoop and JBoss Community projects About the speaker Anil Saldhana is Lead Security Architect at JBoss. Founder of PicketBox and PicketLink. Interested in using Big
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
Building Recommenda/on Pla1orm with Hadoop. Jayant Shekhar
Building Recommenda/on Pla1orm with Hadoop Jayant Shekhar 1 Agenda Why Big Data Recommenda/on Pla1orm? Architecture Design & Implementa/on Common Recommenda/on PaEerns 2 Recommendations is one of the commonly
TRAINING PROGRAM ON BIGDATA/HADOOP
Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,
BIG DATA - HADOOP PROFESSIONAL amron
0 Training Details Course Duration: 30-35 hours training + assignments + actual project based case studies Training Materials: All attendees will receive: Assignment after each module, video recording
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
HADOOP. Revised 10/19/2015
HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies
Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies 1 Copyright 2011, Oracle and/or its affiliates. All rights Big Data, Advanced Analytics:
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark-project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark-project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50-person
Hadoop 101. Lars George. NoSQL Matters, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL Matters, Cologne April 26, 2013 1 What's Ahead? Overview of Apache Hadoop (and related tools) What it is Why it's relevant How it works No prior experience needed Feel free
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Apache Flink. Fast and Reliable Large-Scale Data Processing
Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream
BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview
BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM An Overview Contents Contents... 1 BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM... 1 Program Overview... 4 Curriculum... 5 Module 1: Big Data: Hadoop
CS 378 Big Data Programming. Lecture 2 Map-Reduce
CS 378 Big Data Programming Lecture 2 Map-Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22nd, 2014 1 What This Is; What This Is Not It's not specific to IoT It's not about any specific
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing by leaps and bounds, 40+% Year over Year! 2009 =.8 Zettabytes =.08
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
How Companies are Using Spark
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Big Data and Apache Hadoop's MapReduce
Big Data and Apache Hadoop's MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Oracle's Big Data solutions. Roger Wullschleger
Oracle's Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Complete Java Classes Hadoop Syllabus Contact No: 8888022204
1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis
Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am
Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases
INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents About the Authors About the Technical Reviewer Acknowledgments Introduction Chapter 1: Motivation for Big
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at [email protected].
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data? Bulk Amount Unstructured Lots of Applications which need to handle huge amounts of data (in terms of 500+ TB per day) If a regular machine needs to
CS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map-Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Apache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Big Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce
Big Data and Hadoop Module 1: Introduction to Big Data and Hadoop Learn about Big Data and the shortcomings of the prevailing solutions for Big Data issues. You will also get to know, how Hadoop eradicates
Hadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
