Processing of Big Data. Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015



Similar documents
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

BIG DATA What it is and how to use?

Hadoop Ecosystem B Y R A H I M A.

Big Data With Hadoop

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Hadoop in the Enterprise

How Companies are! Using Spark

Spark and the Big Data Library

Hadoop and Map-Reduce. Swati Gore

Hadoop implementation of MapReduce computational model. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo

Big Data Analytics. Lucas Rego Drumond

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

A very short Intro to Hadoop

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: (O) Volume 1 Issue 3 (September 2014)

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Apache Hama Design Document v0.6

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Unified Big Data Analytics Pipeline. 连 城

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Chapter 7. Using Hadoop Cluster and MapReduce

YARN Apache Hadoop Next Generation Compute Platform

Big Data and Data Science: Behind the Buzz Words

Machine- Learning Summer School

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Map Reduce & Hadoop Recommended Text:

Scaling Out With Apache Spark. DTL Meeting Slides based on

Apache Flink Next-gen data analysis. Kostas

Chase Wu New Jersey Ins0tute of Technology

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Hadoop. Sunday, November 25, 12

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Big Data Analytics - Accelerated. stream-horizon.com

Distributed Computing and Big Data: Hadoop and MapReduce

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Manifest for Big Data Pig, Hive & Jaql

Hadoop Job Oriented Training Agenda

Big Data Analytics Hadoop and Spark

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

MapReduce with Apache Hadoop Analysing Big Data

Bayesian networks - Time-series models - Apache Spark & Scala

How To Scale Out Of A Nosql Database

Implement Hadoop jobs to extract business value from large and varied data sets

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Challenges for Data Driven Systems

Log Mining Based on Hadoop s Map and Reduce Technique

Constructing a Data Lake: Hadoop and Oracle Database United!

Internals of Hadoop Application Framework and Distributed File System

How To Handle Big Data With A Data Scientist

Unified Big Data Processing with Apache Spark. Matei

All You Wanted to Know About Big Data Projects Chida Jan 2014

Big Data Introduction

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Intro to Map/Reduce a.k.a. Hadoop

I/O Considerations in Big Data Analytics

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Data-Intensive Computing with Map-Reduce and Hadoop

The 4 Pillars of Technosoft s Big Data Practice

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

A Brief Introduction to Apache Tez

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Introduction to Hadoop

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Dell In-Memory Appliance for Cloudera Enterprise

Open source Google-style large scale data analysis with Hadoop

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN

SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Managing large clusters resources

The basic data mining algorithms introduced may be enhanced in a number of ways.

Sujee Maniyam, ElephantScale

Scalable Developments for Big Data Analytics in Remote Sensing

A Brief Outline on Bigdata Hadoop

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Moving From Hadoop to Spark

Transcription:

Processing of Big Data Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015

Acknowledgement Some slides in this set of slides were provided by EMC Corporation and Sandra Avila, University of Campinas

Map Reduce

What is MapReduce? A parallel programming model suitable for big data processing Split data into distributable chunks ( shards ) Define the steps to process those chunks Run that process in parallel on the chunks Scalable by adding more machines to process chunks Leverage commodity hardware to tackle big jobs The foundation for Hadoop MapReduce is a parallel programming model Hadoop is a concrete platform that implements MapReduce Copyright EMC Corporation 4

The Map part of MapReduce Transform (Map) input values to output values: <k1,v1> <k2,v2> Input Key/Value Pairs For instance, Key = line number, Value = text string Map Function Steps to transform input pairs to output pairs For example, count the different words in the input Output Key/Value Pairs For example, Key = <word>, Value = <count> Map output is the input to Reduce Copyright EMC Corporation

The Reduce Part of MapReduce Merge (Reduce) Values from the Map phase Reduce is optional. Sometimes all the work is done in the Mapper Input Values for a given Key from all the Mappers Reduce Function Steps to combine (Sum?, Count?, Print?, ) the values Output Print values?, load into a DB? send to the next MapReduce job? 6

Coordinator (Default MapReduce uses Filesystem) Data partitions by key Map computation partitions Reduce computation partitions Redistribution by output s key ("shuffle")

Word Count (Beach, 1) Map This is the Hello World of MapReduce Distribute the text of millions of documents over hundreds of machines. (Beach, 1) (Beach, 1) MAPPERS can be word-specific. They run through the stacks and shout One! every time they see the word beach (Beach, 1) (Beach, 1) REDUCERS listen to all the Mappers and total the counts for each word. (Beach, 2) (Beach, 1) (Beach, 2) transition Reduce (Beach, 5) finalize Copyright EMC Corporation 8

Word Counting

Example: UFOs Attack July 15 th, 2010. Raytown, Missouri When I fist noticed it, I wanted to freak out. There it was an object floating in on a direct path, It didn&apos;t move side to side or volley up and down. It moved as if though it had a mission or purpose. I was nervous, and scared, So afraid in fact that I could feel my knees buckling. I guess because I didn&apos;t know what to expect and I wanted to act non aggressive. I though that I was either going to be taken, blasted into nothing, or Q: What is the witness describing? A: An encounter with a UFO. Q: What is the emotional state of the witness? A: Frightened, ready to flee. Source: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada Copyright EMC Corporation

Example: UFOs Attack If we really are on the cusp of a major alien invasion, eyewitness testimony is the key to our survival as a species. Strangely, the computer finds this account unreliable! Machine error When I fist noticed it, I wanted to freak out. There it was an object floating in on a direct path, It didn&apos;t move side to side or volley up and down. It moved as if though it had a mission or purpose. I was nervous, and scared, So Typo afraid in fact that I could feel my knees buckling. I guess because I didn&apos;t know what to expect and I wanted to act non aggressive. Turn of I though phrasethat I was either going to Ambiguous be taken, blasted meaning into nothing, or Source: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada UFO keyword missing 11

Example: UFOs Attack Investigators need to Search Manage Understand for keywords and phrases, but your topic may be very complicated or keywords may be misspelled within the document document meta-data like time, location and author. Later retrieval may be key to identifying this metadata early, and the document may be amenable to structure. content via sentiment analysis, custom dictionaries, natural language processing, clustering, classification and good ol domain expertise. with computer-aided text mining 12

Map Reduce

Map-Reduce Facebook Trace analysis: 30% to 50% of running time took up by communication phase

Big Data Processing Environment

What is..? People use Hadoop to mean one of four things: > MapReduce paradigm. > Java Classes for HDFS types and MapReduce job management. > Massive unstructured data storage on commodity hardware. > HDFS: The Hadoop distributed file system. (ideas) (actual Hadoop) With Hadoop, you can do MapReduce jobs quickly and efficiently. https://hadoop.apache.org/ Module 5: Advanced Analytics - Technology and Tools 17

What do we Mean by Hadoop A framework for performing big data analytics An implementation of the MapReduce paradigm Hadoop glues the storage and analytics together and provides reliability, scalability, and management Two Main Components Storage (Big Data) HDFS Hadoop Distributed File System Reliable, redundant, distributed file system optimized for large files MapReduce (Analytics) Programming model for processing sets of data Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s) Module 5: Advanced Analytics - Technology and Tools 18

Putting it all Together: MapReduce and HDFS Client/Dev 2 Job Tracker Map Job Map Job Reduce Job Reduce Job Task Tracker 3 Task Tracker Task Tracker Map Job 4 Map Job Map Job 1 Reduce Job Reduce Job Reduce Job Large Data Set (Log files, Sensor Data) Hadoop Distributed File System (HDFS) Module 5: Advanced Analytics - Technology and Tools 19

Hadoop and HDFS Module 5: Advanced Analytics - Technology and Tools 20

Which Interface Should You Choose? Pig Replacement for MapReduce Java coding When need exists to customize part of the processing phases (UDF) Hive Use when SQL skills are available Customize part of the processing via UDFs HBase Use when random queries and partial processing is required, or when specific file layouts are needed Module 5: Advanced Analytics - Technology and Tools 21

Hadoop v2 Main changes: MapReduce NextGen aka YARN aka MRv2 divides the two major functions of the JobTracker: resource management and job lifecycle management into separate components. Now, MapReduce is just one of the applications running on top of YARN. YARN permits alternative programming models. HDFS Federation Scale the name service horizontally.

Hadoop Hadoop 1.0 Hadoop 2.0

Hadoop MRv1 versus MRv2 Hadoop v2 divides the two major functions of the JobTracker: resource management and data processing into separate components.

YARN Multitenancy: multiple access engine (batch, interactive and real-time) Cluster utilization: dynamic allocation of cluster resource Resource management: scale scheduling as number of nodes grows Compatibility: retrocompatible applications BATCH (MapReduce) Applications Run Natively in Hadoop INTERACTIVE (Tez) ONLINE (HBase) STREAMING (Storm, S4, ) GRAPH (Giraph) IN-MEMORY (Spark) YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) HPC MPI (OpenMPI) OTHER (Search) (Weave )

Apache Spark In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms. https://spark.apache.org/

Spark Architecture Spark Core and Resilient Distributed Datasets (RDDs): provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark SQ - provides support for structured and semi-structured data. Spark Streaming - leverages Spark Core's fast scheduling capability to perform streaming analytics. MLlib Machine Learning Library - distributed machine learning framework on top of Spark, GraphX - a distributed graph processing framework on top of Spark

Spark Users

Machine Learning

What is Machine Learning (ML)? A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Tom Mitchell (1997), Carnegie Mellon University 32

Machine Learning Algorithms Supervised Learning induces a prediction function using a set of examples, called a training set. data is labelled to predict the correct label associated with any new observation Unsupervised Learning consider unlabeled training examples and try to uncover regularities in the data 33

Continuous Categorical Machine Learning Algorithms Supervised (Classification) k Nearest Neighbors Support Vector Machine Neural Networks Unsupervised Hidden Markov Model Decision Trees Random Forests Regression (Clustering and Dimensionality Reduction) Principal Component Analysis Singular Value Decomposition Gaussian Mixture Model 34

Continuous Categorical Machine Learning Algorithms Supervised (Classification) k Nearest Neighbors Support Vector Machine Neural Networks Unsupervised Hidden Markov Model Decision Trees Random Forests Regression (Clustering and Dimensionality Reduction) Principal Component Analysis Singular Value Decomposition Gaussian Mixture Model 4

How do we classify (images)? 5

k Nearest Neighbors (knn) [ One of the simplest of all machine learning classifiers. No training phase. The examples are classified based on the class of their k nearest neighbors in the descriptor space. [Duda et al., 2001] Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification. John Wiley and Sons, 2 edition, 2001. 6

k Nearest Neighbors (knn) Label it red, when k = 3. Label it blue, when k = 7. 7

Support Vector Machines (SVM) Given training instances (x,y), learn a model f such that f(x) = y. Use f to predict y for new x. Good generalization (in theory & in practice!) Work well with few training instances Find globally best model Efficient algorithms [Vapnik, 1998] Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, 1998. 8

Support Vector Machines (SVM) What if data is not linearly separable? Mapping into a new feature space 12

Artificial Neural Networks Human brain as a collection of biological neurons and synapses [Hubel and Wiesel] Hubel, D. and Wiesel, T. Receptive fields of single neurones in the cat s striate cortex. The Journal of Physiology, 148(3):574 591, 1959. [Rosenblatt, 1959] Rosenblatt, F.. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386 408, 1959. 13

Artificial Neural Networks An n-dimensional input vector x is mapped to output variable o by means of the scalar product and a nonlinear function mapping f. 14

Deep Neural Networks Same networks as before, just BIGGER. Combination of three factors: Big data! Better algorithms! Parallel computing (GPU)! [Hinton et al., 2006] Hinton, G., Osindero, S. and Teh, Y-W.. A Fast Learning Algorithm for Deep Belief Nets. Neural Computing, 18(7):1527 1554, 2006. 17

Qirong Ho and Eric P. Xing,Big ML Software for Modern ML Algorithms,, IEEE BigData 2014

Qirong Ho and Eric P. Xing,Big ML Software for Modern ML Algorithms