Big Data Analytics



CS 5331, Rattikorn Hewett, Texas Tech University

Outline
- Big Data
- Data Analytics: Challenges and Issues; Misconceptions
- Big Data Infrastructure
- Scalable Distributed Computing: Hadoop
- Programming in Hadoop: the MapReduce Paradigm; Example
- MongoDB
- Some Comparisons & Summary

Big Data Issues
Laney [2001] first defined the 3 V's of Big Data management:
- Volume: size continues to increase
- Variety: different types of data (e.g., text, sensor data, audio, video, graph)
- Velocity: streams of data
Two more V's have since been added:
- Variability: variation in user interpretation and changes in the structure of the data
- Value: business value to organizations in decision-making

Big Data Common Issues
- Scalability
- Speed
- Accuracy
- Trust, provenance, privacy
- Interactiveness

Some Tools
- Apache Hadoop: platform for distributed parallel computing with the MapReduce programming paradigm (covered later)
- Other Apache family: Pig, Hive, HBase, ZooKeeper, Cassandra, Cascading, etc.
- Apache S4: platform for real-time processing of continuous data streams
- Storm: streaming data for distributed applications; similar to S4, but implemented by Twitter

Data Analytics: Challenges
- Analytic architecture: how to deal with historic and real-time data at the same time? One answer is a batch layer plus serving layer (e.g., Hadoop) combined with a speed layer (e.g., Storm)
- Statistical significance: with huge data and many questions asked at once, answers can be no better than random
- Distributed analytics: not all existing algorithms can be parallelized or distributed
- Time-evolving data: analytics must be able to detect and adapt to evolving data
- Data store and analytics: compression (more time, less space) vs. sampling (loses information)
- Visualization
- Tagged data analytics: only about 3% of data are tagged, and even less is analyzed

Big Data: Some Misconceptions
- Bigger data are not always better
- Big data analytics does not necessarily use MapReduce and Hadoop
- In real-time data analytics, data size is not as important as data recency
- Accuracy can be misleading: the number of spurious correlations grows as the number of variables grows

What is Hadoop?
- A general-purpose storage and data-analysis platform
- Open-source Apache software, implemented in Java
- Makes parallel processing easier
- Designed to run on a set of computers built from cheap commodity hardware
- Consists mainly of: a distributed file system and a MapReduce programming framework

Sequential vs. Parallel Processing
- Sequential processing: the program has a single code path from beginning to end; typical data-analytics tasks require more
- Parallel processing: the program executes multiple code paths simultaneously. Almost all applications are multi-threaded or multi-tasked (e.g., an email app displays mail while downloading new mail); in data analytics, different tasks run on different machines in a parallel/distributed system

Why We Need Parallel Processing
- Moore's law (the number of transistors on a chip doubles roughly every two years) no longer brings faster clocks: CPUs gain more cores while clock speed stays flat, so we must process on multiple cores and multiple machines simultaneously. BUT writing parallelized code is hard.
- What about hard drives? They are cheap and plentiful BUT slow to read and write, so we split data across multiple drives and read/write simultaneously. BUT connected machines are failure-prone, so the code must deal with these failures.
- We need parallel processing, but it is hard.

But Parallel Processing Is Not New
- High Performance Computing (HPC) and Grid Computing distribute work across a cluster of machines that access a shared file system (hosted by a SAN, Storage Area Network)
- Pro: good for compute-intensive jobs
- Con: bad for data-intensive jobs that access large data volumes; network bandwidth becomes the bottleneck and compute nodes sit under-utilized

A Simple Hadoop Example: Word Count
- Task: count the number of occurrences of each word
- Input: large text file(s) with space-separated words
- Output: text file with a count of each distinct word
- Why use Hadoop for this task? Lots of I/O; time savings (hundreds of seconds for GB-sized files); easy to parallelize, since words can be counted on different pieces of the file independently; matches the MapReduce paradigm

Word Count: Sequential Solution
  Repeat
    Read a line
    Recognize words in the line
    Update frequencies in a hash table
  Until no more text lines to read
- Works, but performance is limited by the disk read speed, and it may run out of RAM for the hash table

Parallelized Word Count, First Try
- Single machine, multiple threads: each thread processes a portion of the text, then the per-thread results are combined
- Problem: performance is still limited by the read speed of the hard drive and may even be slower than sequential processing; adding machines won't help, and the hash table can still run out of RAM
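The sequential solution above can be sketched in Python (the deck later notes that MapReduce code can be written in Python). This is a minimal illustration of the hash-table approach; `count_words` and the sample text are made up for this sketch:

```python
from collections import Counter

def count_words(lines):
    """Sequential word count: one pass over the text,
    tallies kept in a hash table (here, a Counter)."""
    counts = Counter()
    for line in lines:                # "Read a line"
        for word in line.split():     # "Recognize words in the line"
            counts[word] += 1         # "Update frequencies in a hash table"
    return dict(counts)

text = ["A quick brown fox", "jumps over the lazy dog", "the dog sleeps"]
print(count_words(text))
```

As the slide says, this is limited by disk read speed and by the RAM available for the table; those limits are what motivate the parallel attempts that follow.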

Parallelized Word Count, Second Try
- Multiple servers and a central file server: the text file lives on the file server; each server fetches a portion of the text from it and counts the words in that portion, and the per-server counts are then combined
- Result: faster than sequential, but speed relies on the file server
- Problem: bandwidth/network speed

Parallelized Word Count with Distributed Data
- No centralized file server: a portion of the data is stored on each server (compute node) that processes it, which reduces network traffic and bandwidth
- Steps:
  0. Spread the text across the servers
  1. Each server gets the count for its portion of the text; the system must keep track of which portions are unprocessed, pending, or processed
  2. If a machine goes down, detect it and restart its work
  3. Combine counts from all of the servers, even though the servers may not finish at the same time
- This bookkeeping is not Hadoop-specific: any solution must take care of all of these potential problems

Hadoop Distributed File System (HDFS)
Issues and motivations:
- "Big Data" requires multiple disks
- Mainframes are expensive, while clusters are cheap
- Data management in clusters can fail
- "Big Data" requires efficient data access and processing that are platform-independent

HDFS: Goals and Features
- Cluster environment
- Fault resistant: detects faults and provides quick, automatic recovery
- Move computation, not data: applications move themselves to where the data is located
- Streaming data access: batch processing with high data-access throughput
- Portability: portable across heterogeneous hardware and software platforms

HDFS: Basic Architecture
- NameNode: the master server; manages the file system namespace and regulates client access to files
- DataNodes: manage the storage attached to the nodes they run on
- Files are split into one or more blocks (typically 64 MB) that are stored on a set of DataNodes; blocks are replicated for fault tolerance
(Diagram: a client reading data from HDFS; from p. 63 of http://ce.sysu.edu.cn/hope/uploadfiles/education/2011/10/201110221516245419.pdf)

MapReduce Paradigm
A programming model for processing large data sets:
- a map function processes a key/value pair to generate a set of intermediate key/value pairs
- a reduce function merges all intermediate values associated with the same intermediate key
MapReduce provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and status monitoring.
(Source: http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0002.html)
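The block-and-replica idea can be made concrete with a toy sketch. This is not HDFS code, just an illustration of the bookkeeping a NameNode performs: split a file into 64 MB blocks and assign each block to three DataNodes (the classic default replication factor). All names here are invented, and real HDFS placement also considers racks and free space:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # typical HDFS block size (64 MB)
REPLICATION = 3                 # each block is stored on 3 DataNodes

def place_blocks(file_size, datanodes):
    """Toy NameNode logic: compute the block list for a file and
    pick DataNodes for each replica, round-robin across the cluster."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    ring = itertools.cycle(datanodes)
    return {b: [next(ring) for _ in range(REPLICATION)]
            for b in range(n_blocks)}

nodes = ["dn1", "dn2", "dn3", "dn4"]
# A 200 MB file does not fit in 3 blocks, so it becomes 4 blocks.
print(place_blocks(200 * 1024 * 1024, nodes))
```

Losing any single DataNode still leaves two replicas of every block, which is the fault-tolerance property the slide describes.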

MapReduce Process
Steps:
1. Split the data across servers into multiple inputs, and keep track of them
2. Run the map code on each input, producing key/value pairs
3. Shuffle the outputs: send each output to the processor assigned to its key
4. Run the reduce code to aggregate the outputs for each key
5. Produce the final output: collect the reduce results and generate the final output

Programming Model
The user specifies two basic functions:
- map (in_key, in_value) -> list(out_key, intermediate_value): takes an input key/value pair and generates a set of intermediate pairs
- reduce (out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values that share the same key into a set of merged output values (usually one)
Inspired by operators in Lisp and other functional programming languages.

Pseudocode: Word Count

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
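The five steps and the pseudocode above can be simulated in a single process. This is a minimal Python sketch, not a distributed implementation; the names `map_fn`, `reduce_fn`, and `map_reduce` are made up for illustration:

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # map(in_key, in_value) -> list of (out_key, intermediate_value)
    return [(word, 1) for word in doc_contents.split()]

def reduce_fn(word, counts):
    # reduce(out_key, list of intermediate values) -> out_value
    return sum(counts)

def map_reduce(inputs):
    # Step 2: run map on each input split
    intermediate = [pair for k, v in inputs for pair in map_fn(k, v)]
    # Step 3: shuffle -- group intermediate values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Steps 4-5: run reduce per key and collect the final output
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = [("d1", "a quick brown fox"), ("d2", "the lazy dog and the fox")]
print(map_reduce(docs))
```

In real Hadoop, the map calls run on many machines, the shuffle moves data over the network, and the reduces run in parallel per key; the data flow, however, is exactly this.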

Parallel Execution
(Diagram of parallel map/reduce execution omitted.)

MapReduce in Hadoop
- JobTracker: schedules tasks, monitors them, and re-executes failed tasks
- TaskTrackers: one per cluster node; execute tasks as directed by the JobTracker
- MapReduce and HDFS run on the same set of nodes, so tasks can be scheduled where the data is already present
- The USER specifies the input/output locations and the Map/Reduce code
- The Hadoop framework is implemented in Java, BUT MapReduce code can be written in Python, Ruby, C++, etc.
(Source: http://ce.sysu.edu.cn/hope/uploadfiles/education/2011/10/201110221516245419.pdf)

WordCount.java, Map and Reduce classes (from http://wiki.apache.org/hadoop/WordCount):

  public static class Map
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line); // separate by spaces
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);                            // emits (w, 1)
      }
    }
  }

  public static class Reduce
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;                     // e.g., receives (w, (1, 1, 1))
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));              // emits (w, 3)
    }
  }
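As the slide notes, MapReduce code need not be Java. With Hadoop Streaming, a mapper and a reducer are just programs that read lines on stdin and write tab-separated key/value lines on stdout. A sketch of the WordCount pair in Python follows; the local pipeline at the bottom stands in for the framework's sort/shuffle phase, and the function names are made up for illustration:

```python
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for every token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: its input arrives sorted by key, so
    consecutive lines for the same word can be summed with groupby."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Emulate the framework locally: map, sort (the shuffle), then reduce.
mapped = list(mapper(["a quick brown fox", "the fox"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

On a real cluster each function would live in its own script reading `sys.stdin` and printing its output, passed to Hadoop via Streaming's mapper and reducer options; the exact invocation depends on the Hadoop installation.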

Running Hadoop
An example command line (the data is the input file, the jar holds the code, and the arguments name the input and output directories):

  $ hadoop fs -mkdir brown
  $ hadoop fs -put /corpora/icame/texts/brown1 brown/input
  $ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output

(From http://depts.washington.edu/uwcl/twiki/bin/view.cgi/main/hadoopwordcountexample)

MongoDB (by 10gen)
- For Big Data, an RDBMS can't handle the high volume of data or the variety of unstructured data; MongoDB is one solution
- (Diagrams: normalized data in a relational DB vs. the same data in a document DB)
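The relational-vs-document contrast the slide's diagrams show can be illustrated with plain Python structures. The tables, fields, and values below are invented for illustration only:

```python
# Relational (normalized): one order is spread across three tables,
# tied together by keys and reassembled with joins at read time.
customers = [{"cust_id": 1, "name": "Ada"}]
orders    = [{"order_id": 10, "cust_id": 1}]
items     = [{"order_id": 10, "sku": "X1", "qty": 2},
             {"order_id": 10, "sku": "Y9", "qty": 1}]

# Document (MongoDB-style): the same order is one nested document,
# so a read needs no join -- and documents may differ in their fields.
order_doc = {
    "order_id": 10,
    "customer": {"name": "Ada"},
    "items": [{"sku": "X1", "qty": 2}, {"sku": "Y9", "qty": 1}],
}

# 'Join' in the normalized model vs. direct access in the document model.
oid = orders[0]["order_id"]
skus_rel = [i["sku"] for i in items if i["order_id"] == oid]
skus_doc = [i["sku"] for i in order_doc["items"]]
print(skus_rel == skus_doc)
```

Both layouts answer the same question; the document layout trades normalization (and rigid schema) for locality of access, which is the design point the next slide's feature list builds on.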

MongoDB
- Implemented in C++; data serialized as BSON
- Runs nearly everywhere; memory caching
- Combines the best features of key/value stores, document DBs, and relational DBs in one
- Designed as an operational DB for Big Data; not a data-processing engine
- MongoDB's built-in MapReduce is quite capable but is limited by its JavaScript engine; for heavy processing needs, hand the work to Hadoop MapReduce

RDBMS & MapReduce

              Traditional RDBMS       MapReduce
  Data size   Giga/terabytes          Peta/exabytes
  Access      Interactive and batch   Batch
  Updates     Read/write many times   Write once, read many times
  Structure   Static schema           Dynamic schema
  Integrity   High (ACID)             Low
  Scaling     Nonlinear               Linear

(From http://ce.sysu.edu.cn/hope/uploadfiles/education/2011/10/201110221516245419.pdf)

Hadoop & MongoDB
- MongoDB: a NoSQL, document-oriented, schema-less datastore with looser data-consistency models (e.g., one record may have 5 fields while another has 8); designed for real-time processing and storing Big Data (the OLTP side)
- Hadoop: an open-source implementation of the MapReduce technology; designed for analytical purposes (the OLAP side)
(See http://www.slideshare.net/spf13/mongodb-and-hadoop)

Summary
- HPC parallel processing is good for computation-intensive work but is limited by disk-access and network-speed bottlenecks
- Hadoop offers cheap and easy parallel processing with high I/O throughput; it takes care of the problems of processing distributed data, so users can focus on the core algorithm
- Hadoop is not the be-all end-all solution for parallel processing, but it is not just for toy problems like Word Count: MapReduce-style systems have been deployed to index the web at Google and Yahoo
- MongoDB is a powerful data store for Big Data that can be used alongside Hadoop for complex data analytics