BIG DATA APPLICATIONS
- Annabelle Reeves
- 8 years ago
1 BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI
Boyu Zhang, Research Computing, ITaP
2 BIG DATA APPLICATIONS
- Big data has become one of the most important aspects of scientific computing and business analytics
- Provides key insights to scientists and decision makers
- The amount of data is exploding
3 CHARACTERISTICS OF BIG DATA
- Commonly accepted 3Vs of big data: Volume, Velocity, Variety
- Additional Vs: Veracity, Value
References: M. Stonebraker: Big Data Means at Least Three Different Things; M. Walker: Data Veracity
4 VOLUME
- The amount of data to process
Figure 1: The digital universe: 50-fold growth from the beginning of 2010 to the end of 2020. Source: IDC's Digital Universe Study, sponsored by EMC, December
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and"
Reference: J. Gantz and D. Reinsel: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East
5 VOLUME
- Fortress archive at Purdue
6 VELOCITY
- How much data is generated on the internet every minute
- The global internet population: ~3.2 billion
Reference: J. James: Data Never Sleeps 3.0
7 VARIETY
- Data in many forms
- Database, photo, video, audio, web data
8 VARIOUS DOMAINS
- Biology, Physics, Finance, Computer Science
9 BIG DATA ANALYTICS
- Finding the value
- Need a set of powerful tools!
10 OUTLINE
- Introduction to big data analytics
- Tools available
  - MapReduce
  - Hadoop
  - Spark
- ITaP resource Hathi
  - Platform
  - Examples
- Summary
11 TRADITIONAL METHODS
- R, SAS, Matlab, C/C++, MPI, Java, Python
- Challenges
  - Need parallel processing
  - Programmers need to handle parallelization explicitly: work distribution, data distribution, communication, fault tolerance
  - Time to solution is far too long
- Need a set of tools that free programmers from all of the above and let them focus on the problem logic
12 MAPREDUCE PARADIGM
- Programmer writes map and reduce functions that run in parallel
- Map
  - Input: <key_in, val_in> pairs
  - Output: list of <key_i, val_i> pairs
  - <key_in, val_in> -> map -> list<key_i, val_i>
- Reduce
  - Input: <key_j, list(val)> pairs
  - Output: <key_out, val_out> pairs
  - <key_j, list(val)> -> reduce -> <key_out, val_out>
13 MAPREDUCE RUNTIME
- Executes map and reduce functions in parallel
- Communicates all the values associated with the same key from map to reduce in the shuffle stage
  - <key_in, val_in> -> map -> list<key_i, val_i> -> shuffle -> <key_j, list(val)> -> reduce -> <key_out, val_out>
  - Several such map and reduce pipelines run side by side and meet at the shuffle
- Popular runtime libraries support this model: Hadoop, Spark, MapReduce-MPI (MR-MPI)
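The three stages can be sketched in a few lines of plain Python. This is only a sequential, in-memory illustration of the model, not how Hadoop, Spark, or MR-MPI implement it; the function name `run_mapreduce` and the line-length example are assumptions made up for this sketch.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce sequentially over in-memory records."""
    # Map stage: each <key_in, val_in> pair yields a list of <key_i, val_i> pairs.
    intermediate = []
    for key_in, val_in in records:
        intermediate.extend(map_fn(key_in, val_in))

    # Shuffle stage: gather every value that shares the same key.
    grouped = defaultdict(list)
    for key, val in intermediate:
        grouped[key].append(val)

    # Reduce stage: one call per distinct key, producing <key_out, val_out>.
    return [reduce_fn(key, vals) for key, vals in sorted(grouped.items())]


# Hypothetical example: longest line length per file.
def line_length_map(filename, line):
    return [(filename, len(line))]


def max_reduce(filename, lengths):
    return (filename, max(lengths))


records = [("a.txt", "short"), ("a.txt", "a longer line"), ("b.txt", "mid size")]
result = run_mapreduce(records, line_length_map, max_reduce)
print(result)
```

A real runtime executes the map calls and the reduce calls on many processes at once; only the shuffle requires communication between them.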
14 EXAMPLE - WORDCOUNT
- Count the occurrences of each word in a collection of text documents
- Process 0
  - Input: "This is a test."
  - map: <This, 1> <is, 1> <a, 1> <test., 1>
  - after shuffle: <This, {1, 1}> <is, {1, 1}>
  - reduce: <This, 2> <is, 2>
- Process 1
  - Input: "This is also a test."
  - map: <This, 1> <is, 1> <also, 1> <a, 1> <test., 1>
  - after shuffle: <a, {1, 1}> <test., {1, 1}> <also, {1}>
  - reduce: <a, 2> <test., 2> <also, 1>
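The word count walk-through can be reproduced with a small Python sketch using the same two sentences. The partition function based on `zlib.crc32` is an assumption chosen only because it is deterministic; which keys land on which reducer may differ from the slide, but the final counts are the same.

```python
import zlib
from collections import defaultdict

documents = ["This is a test.", "This is also a test."]
NUM_REDUCERS = 2  # assumed: one reducer per process, as in the example


def map_words(text):
    """Emit <word, 1> for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]


def partition(word, num_reducers):
    """Pick the reducer that owns a key; crc32 is a stand-in stable hash."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


# Map: each process handles one document.
map_outputs = [map_words(doc) for doc in documents]

# Shuffle: route each pair to the reducer that owns its key.
reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for pairs in map_outputs:
    for word, one in pairs:
        reducer_inputs[partition(word, NUM_REDUCERS)][word].append(one)

# Reduce: each reducer sums the list of ones for every key it received.
counts = {}
for grouped in reducer_inputs:
    for word, ones in grouped.items():
        counts[word] = sum(ones)

print(counts)
```

Because every occurrence of a word is routed to the same reducer, no reducer needs to see another reducer's data to produce a correct total.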
15 MAPREDUCE ADVANTAGES
- Simplicity
  - Developer's choice of language: C/C++, Java, Python, R, etc.
  - Automatic parallelization and communication in a restricted manner
  - Built-in fault tolerance
- Performance
  - Brings computation to the data location
- Scalability
  - Petabytes of data, tens of thousands of compute nodes
16 HADOOP
- A library that supports the execution of MapReduce applications
- Uses Hadoop Distributed File System (HDFS) for data storage
- Uses Hadoop NextGen MapReduce (YARN) for scheduling and executing MapReduce applications
- Stack diagram (top to bottom): MR, Pig, Hive; YARN; HDFS
17 HADOOP - HDFS
- Master/slave architecture
  - Single NameNode, one DataNode per compute node in the cluster
- Storage and access model
  - Each file is stored as a sequence of blocks of the same size
  - Each block is replicated multiple times
  - Write-once-read-many access model for files
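The storage model can be made concrete with a little arithmetic. The sketch below assumes a 128 MB block size and a replication factor of 3, and it uses hypothetical DataNode names with a simple round-robin placement. Real HDFS placement is rack-aware and more involved, and the last block of a file is usually shorter than the others.

```python
def split_into_blocks(file_size, block_size):
    """Return the length in bytes of each block of a file."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("replication exceeds the number of DataNodes")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
BLOCK_SIZE = 128 * MB                     # assumed block size
REPLICATION = 3                           # assumed replication factor
datanodes = ["dn0", "dn1", "dn2", "dn3"]  # hypothetical node names

blocks = split_into_blocks(300 * MB, BLOCK_SIZE)
placement = place_replicas(len(blocks), datanodes, REPLICATION)
raw_bytes = sum(blocks) * REPLICATION     # disk space used across the cluster

print([b // MB for b in blocks], placement, raw_bytes // MB)
```

Under these assumptions a 300 MB file occupies three blocks and 900 MB of raw disk, and losing any single DataNode still leaves two copies of every block.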
18 HADOOP - YARN
- Master/slave architecture
  - Single ResourceManager, one NodeManager per compute node in the cluster, one ApplicationMaster per application
- ResourceManager
  - Manages the use of resources across the cluster
- NodeManager
  - Launches and monitors containers
  - A container executes an application-specific process
- ApplicationMaster
  - Negotiates resource containers from the ResourceManager
  - Tracks status and progress
19 ANATOMY OF A YARN APPLICATION RUN
- A client contacts the ResourceManager and asks it to run an ApplicationMaster process.
- The ResourceManager finds a NodeManager that can launch the ApplicationMaster in a container.
- The ApplicationMaster may run a computation in the container or request more containers from the ResourceManager.
Reference: T. White: Hadoop: The Definitive Guide, 4th Edition
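The sequence above can be mimicked with a toy model in Python. The classes below borrow the YARN role names purely for readability. They are hypothetical stand-ins, not the YARN API, and they count containers instead of tracking memory or cores.

```python
class NodeManager:
    """Launches containers on one compute node (toy model)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # how many containers this node can hold
        self.containers = []

    def has_room(self):
        return len(self.containers) < self.capacity

    def launch(self, process):
        self.containers.append(process)
        return (self.name, process)


class ResourceManager:
    """Hands out containers across all NodeManagers (toy model)."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, process):
        for nm in self.node_managers:
            if nm.has_room():
                return nm.launch(process)
        raise RuntimeError("no free container in the cluster")


class ApplicationMaster:
    """Negotiates containers for one application (toy model)."""

    def __init__(self, rm, app_name):
        self.rm = rm
        self.app_name = app_name

    def request_containers(self, count):
        return [self.rm.allocate(f"{self.app_name}-task{i}") for i in range(count)]


def submit(rm, app_name, num_tasks):
    # The client asks the ResourceManager to run an ApplicationMaster;
    # the ResourceManager picks a NodeManager that can host it.
    am_location = rm.allocate(f"{app_name}-AM")
    # The ApplicationMaster then requests more containers for its tasks.
    am = ApplicationMaster(rm, app_name)
    return am_location, am.request_containers(num_tasks)


rm = ResourceManager([NodeManager("node0", 2), NodeManager("node1", 2)])
am_location, task_locations = submit(rm, "wordcount", 2)
print(am_location, task_locations)
```

With two slots per node, the ApplicationMaster and the first task share `node0` and the second task spills over to `node1`.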
20 EXAMPLE - WORDCOUNT
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit <word, 1> for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce: sum all the counts received for one word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: configure the job and submit it.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
- 47 Lines of Code!
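The driver above registers IntSumReducer as the combiner as well as the reducer. A combiner applies the reduce logic to each map task's local output before the shuffle, which is valid here because addition is associative and commutative. The Python sketch below illustrates the effect on a made-up input split; it is a plain-Python analogy, not Hadoop code.

```python
from collections import Counter


def map_words(text):
    """Emit <word, 1> for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Sum counts per word locally, using the same logic as the reducer."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


split = "to be or not to be"  # hypothetical input handled by one map task
mapped = map_words(split)     # pairs produced by the map function
combined = combine(mapped)    # pairs actually sent through the shuffle

print(len(mapped), len(combined), combined)
```

In this sketch the map task would shuffle four pairs instead of six. On real text with many repeated words the reduction in network traffic can be much larger.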
21 HADOOP ECOSYSTEM
Reference: Hadoop Ecosystem Overview
22 SPARK
- Extends the MapReduce model to support more types of computations, such as interactive queries and stream processing
- Stack diagram
  - Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
  - Core: Spark
  - Scheduler: Standalone, YARN, Mesos, EC2
  - Storage: HDFS, local FS, Amazon S3
23 SPARK PROGRAMMING ABSTRACTION
- Resilient Distributed Dataset (RDD)
  - A collection of items distributed among cluster nodes
- Operations
  - Transformations construct a new RDD from a previous one: filter(), map(), sample()
  - Actions compute a result based on an RDD: reduce(), collect(), count()
- Transformations are only computed when an action requires a result to be returned to the driver program (lazy evaluation)
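Lazy evaluation is easier to see in a small experiment. The class below is a toy stand-in written for this sketch: `LazyDataset` is a hypothetical name, it is not Spark's RDD API, and it runs on a single machine. It only shows that transformations are recorded first and executed when an action is called.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _evaluate(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the recorded pipeline and return a result.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())

    def reduce(self, fn):
        return fold(fn, self._evaluate())


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has been computed so far
result = evens_squared.collect()  # the action runs map and filter
print(calls_before_action, result, len(calls))
```

Deferring the work this way lets a real engine see the whole chain of transformations before running it, so it can plan the computation instead of materializing every intermediate dataset.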
24 ANATOMY OF A SPARK APPLICATION RUN
- Driver program
  - Launches parallel operations on a cluster
  - Manages a number of executors
- SparkContext object
  - Represents a connection to a cluster
  - Builds RDDs
  - Runs operations on RDDs
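25 EXAMPLE - WORDCOUNT
    from pyspark import SparkContext

    sc = SparkContext(appName="PythonWordCount")
    # hdfs:/// (empty authority) resolves against the cluster's default NameNode;
    # with hdfs://user/... the word "user" would be parsed as a host name.
    text_file = sc.textFile("hdfs:///user/myname/text.txt")
    # Split lines into words, pair each word with 1, then sum the ones per word.
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs:///user/myname/count.txt")
    sc.stop()
- 6 Lines of Code!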
26 SPARK ADVANTAGES
- Flexibility
  - Supports a wider range of workflows in addition to MapReduce
  - Provides APIs in Python, Scala, Java, SQL, and R
- Performance
  - Runs computation in memory
Figure: 2014 Daytona Gray Sort 100 TB Benchmark, sort rate per node (GB/min), Hadoop vs. Spark
27 HATHI OVERVIEW
- Compute nodes run user jobs
- Front-end nodes handle user login, simple file manipulations, and other miscellaneous operations
- Adm node runs Hadoop master daemons
- Nodes in the diagram: hathi-fe00, hathi-fe01, hathi-adm, hathia000, hathia001, hathia002, hathia003, hathia004, hathia005
28 HATHI SUPPORTED LIBRARIES
- Hadoop MapReduce: Java API, Hadoop Streaming to run any executable
- Spark: Scala, Python, R API
- Hive: SQL-like language
- Pig: Pig Latin statements
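Hadoop Streaming passes records to the executable as text lines on standard input and reads tab-separated key/value lines from its standard output, with the reducer's input sorted by key. The sketch below writes a word count mapper and reducer as Python generator functions and dry-runs them locally. The function names are illustrative, and `sorted()` stands in for the framework's shuffle and sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each run of identical keys; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run on two sample lines.
mapped = list(mapper(["This is a test.", "This is also a test."]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

In an actual streaming job the mapper and reducer would be two separate scripts that read `sys.stdin` and print their output lines, so the same logic can be tested on a laptop before it is submitted to the cluster.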
29 SUBMITTING TO HATHI
- Prepare input
    hdfs dfs -copyFromLocal test.txt /user/my/input
- Submit the job
    hadoop jar wordcount.jar org.myorg.WordCount /user/my/input /user/my/output
- Monitor the job
  - Command line output and web interfaces: HDFS, All applications, Hadoop JobHistory server, Spark History Server
- Retrieve output
    hdfs dfs -copyToLocal /user/my/output output
30 SUMMARY
- MapReduce and related tools are the key to big data analytics
- Hadoop and Spark are popular runtime libraries that support MapReduce-type operations
- Hathi is a free resource that is available to students who work on projects with professors
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationE6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
More informationHadoop: Understanding the Big Data Processing Method
Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
More informationBIG DATA ANALYTICS USING HADOOP TOOLS. A Thesis. Presented to the. Faculty of. San Diego State University. In Partial Fulfillment
BIG DATA ANALYTICS USING HADOOP TOOLS A Thesis Presented to the Faculty of San Diego State University In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science by
More informationSystems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
More informationConnecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
More informationSpark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
More informationCase Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo moon@nflabs.com HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
More informationIstanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
More informationHadoop, Hive & Spark Tutorial
Hadoop, Hive & Spark Tutorial 1 Introduction This tutorial will cover the basic principles of Hadoop MapReduce, Apache Hive and Apache Spark for the processing of structured datasets. For more information
More informationStep 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
More informationHow To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra)
MapReduce framework - Operates exclusively on pairs, - that is, the framework views the input to the job as a set of pairs and produces a set of pairs as the output
More informationHadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
More informationHDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More information