Beyond Hadoop MapReduce: Apache Tez and Apache Spark
Prakasam Kannan
Computer Science Department, San Jose State University, San Jose, CA

ABSTRACT
Hadoop MapReduce has become the de facto standard for processing voluminous data on large clusters of machines. However, it requires every problem to be formulated as a strict three-stage process composed of Map, Shuffle/Sort, and Reduce. The lack of choice in inter-stage communication facilities makes Hadoop MapReduce a poor fit for iterative workloads (graph processing, machine learning) and interactive workloads (streaming data processing, interactive data mining and analysis). This paper delves into the architecture of Hadoop and MapReduce and its shortcomings, and examines alternatives such as Apache Tez and Apache Spark for their suitability to iterative and interactive workloads.

1. HADOOP
Traditionally, large-scale computation has been processor bound and has handled relatively small amounts of data (typically several GBs). Once the data is copied into memory, the processor can perform complex computations, usually taking more time than it took to copy the data into memory. Such computation was typically scaled by growing the hardware vertically; while vertical scaling was very expensive, it greatly reduced the complexity of the application code. Nevertheless, the expense of such systems limited their adoption to areas where cost was not a major concern.

Distributed computing was invented to reduce cost by scaling hardware horizontally. However, programming such systems is very complex, and failure of nodes is the defining characteristic of such systems. These systems also stored data centrally and therefore suffered from the limited bandwidth of the network.

The arrival of the Internet age spawned large websites and petabyte-scale databases; storing such massive amounts of data centrally and shuttling it across the network for processing has become impractical. Hadoop was created to process data at this scale using large clusters of desktop-class hardware. Hadoop's design is based on Google's GFS (Google File System) and MapReduce framework, so Hadoop is likewise made up of two primary components: HDFS (Hadoop Distributed File System) and MapReduce. Hadoop distributes the data across the nodes in the cluster in advance, and computation on this data is performed locally on those nodes; Hadoop preserves reliability by replicating the data across two or more nodes. MapReduce applications that process this distributed data are written using higher-level abstractions that free application developers from concerns like scheduling, node failure, network access, and temporal dependencies.

1.1 HDFS
HDFS designates one or more nodes in the cluster as Name Nodes and the rest as Data Nodes. Name Nodes maintain the metadata about the files on HDFS, while the actual data resides on one or more Data Nodes according to the replication settings. Data Nodes also double as task executors, where the actual processing of the data is performed. When a file is copied into HDFS, it is split into blocks of 64 MB (or as specified in the configuration) and distributed across the cluster.

Figure 1. HDFS sharding and replication
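Both settings are controlled by cluster (or per-file) configuration. A minimal hdfs-site.xml sketch of the two properties described above is shown below; the values are illustrative assumptions, not recommendations:

<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block kept on three Data Nodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value> <!-- 64 MB blocks, as described above -->
  </property>
</configuration>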
1.2 MapReduce
Applications that process the data stored on the Hadoop cluster are expressed in terms of Map and Reduce functions. Data in the file is presented to the Map function by the framework as a key/value pair, and the Map function maps it into another key/value pair. These keys and values can be any user-defined object representation of the underlying data, expressible in languages like Java, C, or Python. All values produced by the mappers for a given key are collected into a list and sorted by the framework, and this sorted list of values and its key are presented to the Reduce function as value and key respectively.

When a MapReduce application is submitted to a Hadoop cluster, the map function is first scheduled to run on one or more nodes in the cluster, based on the number of splits estimated for the input file. Once mapping is complete, the outputs of the mapping-function instances are merged, sorted by key, and all values for a given key are handed to a single instance of the reduce function, which in turn produces one or more key/value pairs.
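As a concrete illustration, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (the job-configuration driver is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: (offset, line) -> (word, 1) for every token in the line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total); the framework has already
  // grouped and sorted the mapper output by key during the shuffle phase.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}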
Figure 2. MapReduce application runtime model

1.3 Limitations
While MapReduce greatly ameliorates the complexity of writing a distributed application, it achieves that goal through severe restrictions on the input/output model and on problem composition. MapReduce requires that every solution be composed into just two phases, although it allows several such MR phases to be chained and executed one after another. While this requirement simplifies application development, it also makes iterative, interactive, and graph-processing workloads highly inefficient. The requirements that nodes in the cluster keep interactions to a minimum, that intermediate output be stored on disk, and that almost all communication happen through the transfer of files make turnaround times undesirable for such workloads.

These inherent limitations of MapReduce, coupled with ever-cheaper DRAM and highly reliable datacenter LANs, have engendered new models of processing data on Hadoop clusters, most notably Apache Tez and Apache Spark.

2. APACHE TEZ
Apache Tez is a new application development model for Hadoop clusters, based on Microsoft's work on Dryad [2]. While MapReduce can be thought of as a DAG (Directed Acyclic Graph) of three vertices, Dryad allows an application to be composed into a DAG of as many vertices as desired, and the interactions among them can be performed using shared memory, network pipes, or files.

2.1 Application Model
Tez models an application as a dataflow graph whose vertices represent the application logic and whose edges represent the flow of data among those vertices. Query plans of declarative languages like SQL map naturally onto such DAGs, which improves the fidelity of the API. Vertices can interact using shared memory, TCP/IP, or files stored on HDFS or local file systems. These communications can be one-to-one or broadcast to a group of vertices, or producer vertices can scatter data into shards across the cluster while consumer vertices gather those shards. This scatter-gather model is used to execute legacy MapReduce applications on the DAG execution engine instead of the MapReduce framework.

A group of vertices can be scheduled on a single node or in a single process when the downstream vertices depend only on the data from the upstream vertices on the same node or process. These vertices can be scheduled to execute sequentially or concurrently. The output of a vertex can be persisted to a local file system or to HDFS, with or without replication, based on the reliability requirements. Tez also provides ephemeral storage of output, where the output is destroyed after consumption by the downstream vertices.

DAG API

// Define DAG
DAG dag = new DAG();
// Define vertices
Vertex Map1 = new Vertex(Processor.class);
// Define edge
Edge edge = new Edge(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
// Connect them
dag.addVertex(Map1).addEdge(edge);

Figure 3. SQL query as DAG

Dynamic Graph Configuration
Different vertices of a DAG will usually have different levels of complexity and different data characteristics, so the Tez DAG execution engine supports pluggable vertex-management modules that gather runtime information such as data samples and sizes.
The execution engine uses the data collected by such modules to optimize and reconfigure the downstream execution plan. Figure 4 illustrates how the data gathered by such modules is used to determine the number of reducers.

Figure 4. Dynamic graph configuration
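For intuition only, the sketch below mimics the kind of decision such a module makes: choosing reducer parallelism from the observed map-output size. The class and method names are hypothetical and do not reflect the actual Tez plugin API.

// Hypothetical sketch; not the real Tez vertex-manager interface
class AutoParallelismHelper {
  // Choose the reducer count once the upstream output size is observed at runtime.
  static int chooseReducerCount(long observedOutputBytes, long targetBytesPerReducer) {
    int n = (int) Math.ceil((double) observedOutputBytes / targetBytesPerReducer);
    return Math.max(1, n); // always schedule at least one reducer
  }
}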
Most of this flexibility and efficiency is achieved declaratively, without bogging down the application developer in the nitty-gritty of scheduling and networking APIs. This level of flexibility and efficiency makes Tez a strong candidate for interactive data-mining applications.

3. APACHE SPARK
Apache Spark was developed in 2009 at the UC Berkeley AMP Lab and then open sourced in 2010. While MapReduce and DAG execution engines abstract distributed processing on a cluster, Spark provides an abstraction over distributed memory on such clusters. Spark is quite different from distributed-memory abstractions like Memcached in that the latter provides fine-grained read and modify operations on mutable objects, which suits OLTP systems. In such systems fault tolerance is provided through replication of data, which may be impractical at the data volumes of OLAP systems. Spark instead provides an abstraction based on coarse-grained transformations that apply the same operation to many data items, and it keeps track of enough information about the lineage of those transformations that they can be recomputed in the event of failures.

3.1 Resilient Distributed Datasets
An RDD is a read-only, partitioned collection of records that can be created only through deterministic operations on data in stable storage or on other RDDs [3]. These deterministic operations are generally referred to as transformations. As mentioned earlier, RDDs embed enough information about how they were derived that they can be recomputed, going all the way back to stable storage if needed.

Spark lets application programmers control how RDDs are partitioned and persisted based on the use case. For example, an RDD that is needed by a different application, or by a rerun of the same application, can be saved to disk. On the other hand, transient datasets or lookup tables can be kept entirely in memory, and datasets that are to be joined can be partitioned using the same hash function so they are collocated. Spark takes the liberty of spilling partitions of RDDs that cannot fit in memory to disk, or of discarding them entirely, based on heuristics about whether they can be trivially recomputed or are no longer needed.

Spark API

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
// Count the errors:
errors.count()
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).collect()

Figure 5. Lineage graph for the above query

3.2 Spark Application Model
Spark uses a DSL-like API similar to DryadLINQ [4], written in Scala, with bindings for other popular languages like Java and Python. Spark distributions come with Scala- and Python-based shells that can be used to run short scripts or execute interactive queries; full-fledged applications can also be built using any of the language bindings. Transformations and Actions are the two sets of operations provided by Spark. Transformations are used to define RDDs; these operations are lazy and are not realized until a terminal action is executed on the resulting RDDs, as the sketch below illustrates. Table 1 lists the principal transformations and actions.
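Since the Java binding is mentioned above, here is the same log-mining example as a minimal, self-contained Java sketch (the HDFS path is a placeholder) that makes the lazy-transformation/terminal-action split explicit:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ErrorCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs://...");                 // transformation: lazy
    JavaRDD<String> errors = lines.filter(l -> l.startsWith("ERROR")); // transformation: lazy
    errors.cache();               // mark for in-memory persistence across actions

    long total = errors.count();  // action: triggers the actual computation
    System.out.println("errors: " + total);
    sc.stop();
  }
}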
Table 1. Transformations and Actions

Transformations:
  map(f: T => U)
  filter(f: T => Bool)
  flatMap(f: T => Seq[U])
  sample(fraction: Float)
  groupByKey()
  reduceByKey(f: (V, V) => V)
  union()
  join()
  cogroup()
  crossProduct()
  mapValues(f: V => W)
  sort(c: Comparator[K])
  partitionBy(p: Partitioner[K])

Actions:
  count()
  collect()
  reduce(f: (T, T) => T)
  lookup(k: K)
  save(path: String)

3.3 Spark Runtime
Spark converts all transformations and terminal actions into a DAG and executes it using a DAG execution engine similar to that of Dryad.
Spark uses a master-slave architecture similar to that of MapReduce to execute such DAGs on the cluster. A Spark application's runtime consists of two major components: the Driver and the Executors. The Driver coordinates a large number of Executors running on the cluster. A Spark application is launched on the cluster using an external service called a cluster manager; cluster managers are pluggable components, and Spark currently supports Hadoop YARN and Apache Mesos.

4. PERFORMANCE BENCHMARK
The following benchmarks measure the response time of relational queries such as scans, aggregations, and joins on various Hadoop-like systems. These benchmarks were performed by AMP Labs on three datasets of varying sizes; further information can be found on the AMP Lab big-data benchmark page.

4.1 Scan Query

SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X

4.2 Aggregation Query

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
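To connect the benchmark back to the Spark model, the aggregation query above can be expressed directly as RDD transformations. The sketch below assumes a comma-separated uservisits layout with sourceIP in column 0 and adRevenue in column 3, and fixes X = 8; all of these are assumptions for illustration.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AggregationQuery {
  // SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)
  static JavaPairRDD<String, Double> run(JavaSparkContext sc, String path) {
    JavaRDD<String> visits = sc.textFile(path); // assumed CSV layout
    return visits
        .mapToPair(line -> {
          String[] f = line.split(",");
          String prefix = f[0].substring(0, Math.min(8, f[0].length())); // SUBSTR(sourceIP, 1, 8)
          return new Tuple2<>(prefix, Double.parseDouble(f[3]));         // assumed adRevenue column
        })
        .reduceByKey((a, b) -> a + b); // SUM(adRevenue) per prefix
  }
}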
4.3 Join Query

SELECT sourceIP, totalRevenue, avgPageRank
FROM (SELECT UV.sourceIP,
             AVG(pageRank) AS avgPageRank,
             SUM(adRevenue) AS totalRevenue
      FROM Rankings AS R, UserVisits AS UV
      WHERE R.pageURL = UV.destURL
        AND UV.visitDate BETWEEN Date(` ') AND Date(`X')
      GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC
LIMIT 1
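In the RDD model, the join above becomes a keyed join() (one of the transformations in Table 1). A schematic sketch, with key layouts assumed and the date filter and per-IP averaging omitted:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class JoinQuery {
  // Sketch only: ranks keyed by pageURL, visits keyed by destURL.
  static JavaPairRDD<String, Tuple2<Integer, Double>> joinByUrl(
      JavaPairRDD<String, Integer> ranks,   // (pageURL, pageRank)
      JavaPairRDD<String, Double> visits) { // (destURL, adRevenue)
    // join() collocates records that share a URL key, mirroring the
    // WHERE R.pageURL = UV.destURL predicate in the SQL above
    return ranks.join(visits);
  }
}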
5. CONCLUSION
It is clear from the cited sources and the benchmarks that Apache Tez and Apache Spark provide a significant advance over the Hadoop MapReduce framework in terms of performance, data-source diversity, and expressiveness of the API. It is also important to note that these advances were enabled as much by advances in hardware (lower prices for memory modules, solid-state drives, and fast Ethernet) as by advances in software.

6. REFERENCES
[1] T. White. Hadoop: The Definitive Guide, 3rd Edition.
[2] M. Isard, M. Budiu, Y. Yu, A. Birrell. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks.
[3] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
[4] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P.K. Gunda, and J. Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI '08, 2008.