The Stratosphere Big Data Analytics Platform
|
|
|
- Barbara Ball
- 10 years ago
- Views:
Transcription
1 The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, / 44
2 Big Data small data big data Amir H. Payberah (SICS) Stratosphere June 4, / 44
3 Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand. Amir H. Payberah (SICS) Stratosphere June 4, / 44
4 Where Does Big Data Come From? Amir H. Payberah (SICS) Stratosphere June 4, / 44
5 Big Data Market Driving Factors The number of web pages indexed by Google, which were around one million in 1998, have exceeded one trillion in 2008, and its expansion is accelerated by appearance of the social networks. Mining big data: current status, and forecast to the future [Wei Fan et al., 2013] Amir H. Payberah (SICS) Stratosphere June 4, / 44
6 Big Data Market Driving Factors The amount of mobile data traffic is expected to grow to 10.8 Exabyte per month by Worldwide Big Data Technology and Services Forecast [Dan Vesset et al., 2013] Amir H. Payberah (SICS) Stratosphere June 4, / 44
7 Big Data Market Driving Factors More than 65 billion devices were connected to the Internet by 2010, and this number will go up to 230 billion by The Internet of Things Is Coming [John Mahoney et al., 2013] Amir H. Payberah (SICS) Stratosphere June 4, / 44
8 Big Data Market Driving Factors Many companies are moving towards using Cloud services to access Big Data analytical tools. Amir H. Payberah (SICS) Stratosphere June 4, / 44
9 Big Data Market Driving Factors Open source communities Amir H. Payberah (SICS) Stratosphere June 4, / 44
10 Amir H. Payberah (SICS) Stratosphere June 4, / 44
11 Big Data Analytics Stack Amir H. Payberah (SICS) Stratosphere June 4, / 44
12 Hadoop Big Data Analytics Stack Amir H. Payberah (SICS) Stratosphere June 4, / 44
13 Spark Big Data Analytics Stack Amir H. Payberah (SICS) Stratosphere June 4, / 44
14 Stratosphere Big Data Analytics Stack * Under development ** Stratosphere community is thinking to integrate with Mahout Amir H. Payberah (SICS) Stratosphere June 4, / 44
15 Amir H. Payberah (SICS) Stratosphere June 4, / 44
16 What is Stratosphere? An efficient distributed general-purpose data analysis platform. Built on top of HDFS and YARN. Focusing on ease of programming. Amir H. Payberah (SICS) Stratosphere June 4, / 44
17 Project Status Research project started in 2009 by TU Berlin, HU Berlin, and HPI An Apache Incubator project 25 contributors v0.5 is released Amir H. Payberah (SICS) Stratosphere June 4, / 44
18 Motivation MapReduce programming model has not been designed for complex operations, e.g., data mining. Very expensive, i.e., always goes to disk and HDFS. Amir H. Payberah (SICS) Stratosphere June 4, / 44
19 Solution Extends MapReduce with more operators. Support for advanced data flow graphs. In-memory and out-of-core processing. Amir H. Payberah (SICS) Stratosphere June 4, / 44
20 Stratosphere Programming Model Amir H. Payberah (SICS) Stratosphere June 4, / 44
21 Stratosphere Programming Model A program is expressed as an arbitrary data flow consisting of transformations, sources and sinks. Amir H. Payberah (SICS) Stratosphere June 4, / 44
22 Transformations Higher-order functions that execute user-defined functions in parallel on the input data. Amir H. Payberah (SICS) Stratosphere June 4, / 44
23 Transformations Higher-order functions that execute user-defined functions in parallel on the input data. Amir H. Payberah (SICS) Stratosphere June 4, / 44
24 Transformations: Map All pairs are independently processed. Amir H. Payberah (SICS) Stratosphere June 4, / 44
25 Transformations: Map All pairs are independently processed. val input: DataSet[(Int, String)] =... val mapped = input.flatmap { case (value, words) => words.split(" ") } Amir H. Payberah (SICS) Stratosphere June 4, / 44
26 Transformations: Reduce Pairs with identical key are grouped. Groups are independently processed. Amir H. Payberah (SICS) Stratosphere June 4, / 44
27 Transformations: Reduce Pairs with identical key are grouped. Groups are independently processed. val input: DataSet[(String, Int)] =... val reduced = input.groupby(_._1).reducegroup(words => words.minby(_._2)) Amir H. Payberah (SICS) Stratosphere June 4, / 44
28 Transformations: Join Performs an equi-join on the key. Join candidates are independently processed. Amir H. Payberah (SICS) Stratosphere June 4, / 44
29 Transformations: Join Performs an equi-join on the key. Join candidates are independently processed. val counts: DataSet[(String, Int)] =... val names: DataSet[(Int, String)] =... val join = counts.join(names).where(_._2).isequalto(_._1).map { case (l, r) => l._1 + "and" + r._2 } Amir H. Payberah (SICS) Stratosphere June 4, / 44
30 Transformations: and More... Groups each input on key Amir H. Payberah (SICS) Stratosphere June 4, / 44
31 Transformations: and More... Groups each input on key Builds a Cartesian Product Amir H. Payberah (SICS) Stratosphere June 4, / 44
32 Transformations: and More... Groups each input on key Builds a Cartesian Product Merges two or more input data sets, keeping duplicates Amir H. Payberah (SICS) Stratosphere June 4, / 44
33 Example: Word Count val input = TextFile(textInput) val words = input.flatmap { line => line.split(" ") map(word => (word, 1)) } val counts = words.groupby(_._1).reduce((w1, w2) => (w1._1, w1._2 + w2._2)) val output = counts.write(wordsoutput, CsvOutputFormat()) Amir H. Payberah (SICS) Stratosphere June 4, / 44
34 Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } joined2 = small.join(joined1).where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) Amir H. Payberah (SICS) Stratosphere June 4, / 44
35 Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } joined2 = small.join(joined1).where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) (1) Partitioned hash-join - Hash/Sort tables into N buckets on join key - Send buckets to parallel machines - Join every pair of buckets Amir H. Payberah (SICS) Stratosphere June 4, / 44
36 Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } (2) Broadcast hash-join - Broadcast the smaller table joined2 = small.join(joined1) - Avoids the network overhead.where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) Amir H. Payberah (SICS) Stratosphere June 4, / 44
37 Join Optimization val large = env.readcsv(...) val medium = env.readcsv(...) val small = env.readcsv(...) joined1 = large.join(medium).where(_._3).isequalto(_._1).map { (left, right) =>... } joined2 = small.join(joined1).where(_._1).isequalto(_._2).map { (left, right) =>...} result = joined2.groupby(_._3).reducegroup(_.maxby(_._2)) (3) Grouping/Aggregation reuses the partitioning from step (1) - no shuffle Amir H. Payberah (SICS) Stratosphere June 4, / 44
38 What about Iteration? Amir H. Payberah (SICS) Stratosphere June 4, / 44
39 Iterative Algorithms Algorithms that need iterations: Clustering, e.g., K-Means Gradient descent, e.g., Logistic Regression Graph algorithms, e.g., PageRank... Amir H. Payberah (SICS) Stratosphere June 4, / 44
40 Iteration Loop over the working data multiple times. Amir H. Payberah (SICS) Stratosphere June 4, / 44
41 Iteration Loop over the working data multiple times. Iterations with hadoop Slow: using HDFS Everything has to be read over and over again Amir H. Payberah (SICS) Stratosphere June 4, / 44
42 Types of Iterations Two types of iteration at stratosphere: Bulk iteration Delta iteration Both operators repeatedly invoke the step function on the current iteration state until a certain termination condition is reached. Amir H. Payberah (SICS) Stratosphere June 4, / 44
43 Bulk Iteration In each iteration, the step function consumes the entire input, and computes the next version of the partial solution. A new version of the entire model in each iteration. Amir H. Payberah (SICS) Stratosphere June 4, / 44
44 Bulk Iteration - Example // 1st 2nd 10th map(1) -> 2 map(2) -> 3... map(10) -> 11 map(2) -> 3 map(3) -> 4... map(11) -> 12 map(3) -> 4 map(4) -> 5... map(12) -> 13 map(4) -> 5 map(5) -> 6... map(13) -> 14 map(5) -> 6 map(6) -> 7... map(14) -> 15 Amir H. Payberah (SICS) Stratosphere June 4, / 44
45 Bulk Iteration - Example // 1st 2nd 10th map(1) -> 2 map(2) -> 3... map(10) -> 11 map(2) -> 3 map(3) -> 4... map(11) -> 12 map(3) -> 4 map(4) -> 5... map(12) -> 13 map(4) -> 5 map(5) -> 6... map(13) -> 14 map(5) -> 6 map(6) -> 7... map(14) -> 15 val input: DataSet[Int] =... def step(partial: DataSet[Int]) = partial.map(a => a + 1) val numiter = 10; val iter = input.iterate(numiter, step) Amir H. Payberah (SICS) Stratosphere June 4, / 44
46 Delta Iteration Only parts of the model change in each iteration. Amir H. Payberah (SICS) Stratosphere June 4, / 44
47 Delta Iteration - Example Amir H. Payberah (SICS) Stratosphere June 4, / 44
48 Delta Iteration vs. Bulk Iteration Computations performed in each iteration for connected communities of a social graph. Amir H. Payberah (SICS) Stratosphere June 4, / 44
49 Stratosphere Execution Engine Amir H. Payberah (SICS) Stratosphere June 4, / 44
50 Stratosphere Architecture Master-worker model Job Manager: handles job submission and scheduling. Task Manager: executes tasks received from JM. All operators start in-memory and gradually go out-of-core. Amir H. Payberah (SICS) Stratosphere June 4, / 44
51 Job Graph and Execution Graph Jobs are expressed as data flows. Job graphs are transformed into the execution graph. Execution graphs consist information to schedule and execute a job. Amir H. Payberah (SICS) Stratosphere June 4, / 44
52 Channels Channels transfer serialized records in buffers. Pipelined Online transfer to receiver. Materialized Sender writes result to disk, and afterwards it transferred to receiver. Used in recovery. Amir H. Payberah (SICS) Stratosphere June 4, / 44
53 Fault Tolerance Task failure compensated by backup task deployment. Track the execution graph back to the latest available result, and recompute the failed tasks. Amir H. Payberah (SICS) Stratosphere June 4, / 44
54 Amir H. Payberah (SICS) Stratosphere June 4, / 44
55 Amir H. Payberah (SICS) Stratosphere June 4, / 44
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Apache Flink. Fast and Reliable Large-Scale Data Processing
Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Big Data looks Tiny from the Stratosphere
Volker Markl http://www.user.tu-berlin.de/marklv [email protected] Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type
Spark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
Big Data Research in Berlin BBDC and Apache Flink
Big Data Research in Berlin BBDC and Apache Flink Tilmann Rabl [email protected] dima.tu-berlin.de bbdc.berlin 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2015 Agenda About Data Management,
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
MapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Managing large clusters resources
Managing large clusters resources ID2210 Gautier Berthou (SICS) Big Processing with No Locality Job( /crawler/bot/jd.io/1 ) submi t Workflow Manager Compute Grid Node Job This doesn t scale. Bandwidth
Big Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
MLlib: Scalable Machine Learning on Spark
MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph
Beyond Hadoop with Apache Spark and BDAS
Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Ali Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph Sebastian Schelter Invited talk at GameDuell Berlin 29th May 2012 the mandatory about me slide PhD student at the Database Systems and Information Management
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
CIEL A universal execution engine for distributed data-flow computing
Reviewing: CIEL A universal execution engine for distributed data-flow computing Presented by Niko Stahl for R202 Outline 1. Motivation 2. Goals 3. Design 4. Fault Tolerance 5. Performance 6. Related Work
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
Spark: Making Big Data Interactive & Real-Time
Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
Apache Spark and Distributed Programming
Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley
WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley Disclaimer: This material is protected under copyright act AnalytixLabs, 2011. Unauthorized use and/ or duplication of this material or
Apache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage
Big Graph Analytics on Neo4j with Apache Spark Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage My background I only make it to the Open Stages :) Probably because Apache Neo4j
Hadoop Design and k-means Clustering
Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data
Spark and Shark High- Speed In- Memory Analytics over Hadoop and Hive Data Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li,
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014
Apache Mahout's new DSL for Distributed Machine Learning Sebastian Schelter GOO Berlin /6/24 Overview Apache Mahout: Past & Future A DSL for Machine Learning Example Under the covers Distributed computation
Interactive data analytics drive insights
Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has
BIG DATA AND ANALYTICS
BIG DATA AND ANALYTICS Björn Bjurling, [email protected] Daniel Gillblad, [email protected] Anders Holst, [email protected] Swedish Institute of Computer Science AGENDA What is big data and analytics? and why one must bother
Brave New World: Hadoop vs. Spark
Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source
Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org #apacheignite Agenda Apache Ignite (tm) In- Memory
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
Hadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume
International Journal of Emerging Technology & Research
International Journal of Emerging Technology & Research High Performance Clustering on Large Scale Dataset in a Multi Node Environment Based on Map-Reduce and Hadoop Anusha Vasudevan 1, Swetha.M 2 1, 2
Mining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits
Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
How To Choose A Data Flow Pipeline From A Data Processing Platform
S N A P L O G I C T E C H N O L O G Y B R I E F SNAPLOGIC BIG DATA INTEGRATION PROCESSING PLATFORMS 2 W Fifth Avenue Fourth Floor, San Mateo CA, 94402 telephone: 888.494.1570 www.snaplogic.com Big Data
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Big Data Analytics: Where is it Going and How Can it Be Taught at the Undergraduate Level?
Big Data Analytics: Where is it Going and How Can it Be Taught at the Undergraduate Level? Dr. Frank Lee Chair, ECE/CS/IT New York Institute of Technology Old Westbury, NY 11568 Topics This talk describes:
