Big Data looks Tiny from the Stratosphere
1 Volker Markl Big Data looks Tiny from the Stratosphere
2 Data and analyses are becoming increasingly complex!
   - Data: Size (volume), Freshness (velocity), Format/Media Type (variability), Uncertainty/Quality (veracity), etc.
   - Analysis: Selection/Grouping (map/reduce), Relational Operators (join/correlation), Extraction & Integration, ML, Optimization (map/reduce or dataflow systems), Predictive Models (R, S+, Matlab), etc.
   Slide 3
3 Overview of Big Data Systems. [Diagram comparing Big Data system stacks by their entry points and internals: scripting, SQL--, XQuery?, column store++, query plans, scalable parallel sort.] Slide 4
4 Outline
   - Stratosphere architecture: a layered and flexible stack for massively parallel data management
   - The PACT programming model: using second-order functions for data parallelism
   - The Stratosphere optimizer: deeply embedding Map/Reduce-style UDFs into a query optimizer
   - Iterative algorithms in Stratosphere: teaching a dataflow system to execute iterative algorithms with performance comparable to specialized systems
   Slide 5
5 The Stratosphere System Stack: a layered approach with several entry points to the system. [Stack diagram: Meteor scripts (via the SOPREMO compiler), PACT programs, and Pact4Scala (via a Scala compiler plugin) feed the Stratosphere optimizer, which drives the runtime operators on top of the Nephele parallel dataflow engine.] Slide 6
6 The Stratosphere System Stack. Nephele parallel dataflow engine: resource allocation, scheduling, task communication, fault tolerance, execution monitoring. Slide 9
7 The Stratosphere System Stack. Runtime engine: memory management, asynchronous I/O, query execution (sorting, hashing, ...). Slide 10
8 The Stratosphere System Stack. The Stratosphere optimizer picks physical execution strategies, partitioning strategies, and operator order. Slide 11
9 The PACT Programming Model: Second-order functions for data parallelism
10 Parallelization Contracts (PACTs) generalize Map/Reduce: a second-order function (the PACT) describes how the input is partitioned and grouped, i.e., what is processed together, and the first-order function (the user code, UDF) is called once per input group.
   - Map PACT (record at a time, 1-dimensional): each input record forms a group; each record is processed independently by the UDF.
   - Reduce PACT (set at a time, 1-dimensional): one attribute is the designated key; all records with the same key value form a group.
   Slide 14
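To make the contract/UDF split concrete, here is a minimal word-count-style sketch (not from the slides), written in the simplified Record/Collector style that the code-analysis example later in the talk uses. The Record and Collector types are illustrative stand-ins, not the actual Stratosphere stub API.

    import java.util.Iterator;

    // Illustrative stand-ins for the Record/Collector style used on the slides;
    // the real Stratosphere stub classes and data types differ in detail.
    class Record {
        private final Object[] fields;
        Record(int width) { fields = new Object[width]; }
        Object get(int i) { return fields[i]; }
        void set(int i, Object v) { fields[i] = v; }
    }

    interface Collector {
        void emit(Record r);
    }

    class WordCountUDFs {
        // Map PACT: called once per input record, so records are processed
        // independently and can be spread over any number of parallel instances.
        static void map(Record line, Collector col) {
            for (String word : ((String) line.get(0)).split("\\s+")) {
                Record out = new Record(2);
                out.set(0, word); // field 0 is the key
                out.set(1, 1);    // partial count
                col.emit(out);
            }
        }

        // Reduce PACT: the contract groups all records with the same key
        // (field 0) and calls this once per group.
        static void reduce(Iterator<Record> group, Collector col) {
            Record result = null;
            int sum = 0;
            while (group.hasNext()) {
                Record r = group.next();
                result = r;
                sum += (Integer) r.get(1);
            }
            result.set(1, sum);
            col.emit(result);
        }
    }

Only the contract changes how these functions are driven in parallel: the Map UDF sees one record per call, the Reduce UDF sees one key group per call, and neither body knows anything about partitioning.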
11 More Parallelization Contracts:
   - Cross PACT (record at a time, 2-dimensional): each pair of input records forms a group; a distributed Cartesian product.
   - Match PACT (record at a time, 2-dimensional): each pair with equal key values forms a group; a distributed equi-join.
   - CoGroup PACT (set at a time, 2-dimensional): all pairs with equal key values form a group; a 2-dimensional Reduce.
   More PACTs are currently under consideration, e.g., for similarity operators and stream processing.
   Slide 15
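In the same illustrative style, and reusing the Record/Collector stand-ins from the sketch above, the two-input contracts differ only in what a single UDF call gets to see. The bodies below are placeholders that only demonstrate the signatures; they are not code from the talk.

    import java.util.Iterator;

    class TwoInputUDFs {
        // Match PACT (record at a time, 2-D): called once per pair of records
        // with equal key values, i.e., the body of a distributed equi-join.
        static void match(Record left, Record right, Collector col) {
            Record out = new Record(2);
            out.set(0, left.get(1));
            out.set(1, right.get(1));
            col.emit(out);
        }

        // CoGroup PACT (set at a time, 2-D): called once per key value with
        // all matching records from both inputs, like a 2-dimensional Reduce.
        static void coGroup(Iterator<Record> left, Iterator<Record> right, Collector col) {
            int leftCount = 0, rightCount = 0;
            while (left.hasNext())  { left.next();  leftCount++; }
            while (right.hasNext()) { right.next(); rightCount++; }
            Record out = new Record(2);
            out.set(0, leftCount);
            out.set(1, rightCount);
            col.emit(out);
        }
    }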
12 PACT Programming Model. [Example plan: Source 1 -> Extract (A,B) -> Map [C := max(A,B)] and Source 2 -> Extract (D,E) -> Map [if (D>4) emit] feed a Match (A = D) [if (A>3) emit], whose output goes through a Reduce (on A) [sum(B), avg(C)] into Sink 1.]
   - A PACT program is an arbitrary dataflow DAG consisting of operators.
   - An operator consists of a second-order function (SOF) signature, the PACT, and a user-defined first-order function (FOF) written in Java.
   - PACT programs serve as an intermediate representation, but are also exposed to the user, e.g., to implement UDFs for functionality not supported by Meteor.
   Slide 16
13 The Stratosphere Optimizer: Opening the Black Boxes (VLDB 2012)
14 Optimizer Design. The cost-based optimizer produces a physical execution plan for a given PACT program:
   - annotates edges with distribution patterns, e.g., broadcast, partition
   - chooses physical execution strategies (e.g., hash/sort)
   - reorders PACT functions
   - constructs the Nephele job graph
   Challenge: the semantics of the user-defined functions are unknown. How to derive correct transformations (this talk), how to cost the functions (ongoing work), and how to mix and match UDFs and native operators (ongoing work).
   Slide 18
15 Optimization Overview. Approach: statically analyze the user code of each PACT UDF and extract properties; based on these properties, derive semantically correct transformations and enumerate semantically equivalent plans. Contribution: how to deeply embed MapReduce-style functions into a query optimizer, covering parallelization and reordering. The technique applies to dataflows composed (in part) of functions written in arbitrary imperative code and is exportable to Scope and SQL/MapReduce systems (e.g., Aster, Greenplum). Slide 19
16 via Static Code Analysis. Example Match UDF:
    void match(Record left, Record right, Collector col) {
        Record out = copy(left);
        if (left.get(0) > 3) {
            double a = right.get(2);
            out.set(2, 1.0 / a);
        }
        out.set(1, 42);
        out.set(3, right.get(0));
        out.set(4, right.get(1));
        out.set(5, right.get(2));
        col.emit(out);
    }
   Feasible because of the record data model with its fixed access API and because there is no control flow between operators. Correct despite the difficulty of different code paths: correctness is guaranteed through conservatism, i.e., attributes are added to R and W when in doubt. Slide 20
17 Opening the Black Boxes. Analyze the user code to discover (O_f, R_f, W_f, E_f):
   - Output schema O_f: the schema of the output record given the schema of the input record(s)
   - Read set R_f: the attributes of the input record(s) that might influence the output of PACT f
   - Write set W_f: the attributes of the output record(s) that might have different values from the respective input attributes
   - Emit cardinality E_f: bounds on the number of records emitted per call (1, >1, ...)
   Slide 21
18 Code Analysis Algorithm (applied to the match UDF on the previous slide):
   - R_f is collected from the get statements
   - W_f by a backwards traversal of the data flow graph, starting from the emit statement
   - E_f by traversing the control flow graph
   With Input 1 = [A,B,C] and Input 2 = [D,E,F], this yields Output = [A,B,C,D,E,F], R_f = {A,B,C,D,E,F}, W_f = {B,C}, E_f = 1.
   Slide 22
19 Automatic Parallelization. [Physical plan for the example program: the Source 2 branch is partitioned on D and builds a hash table (buildht D), the Source 1 branch is partitioned/sorted on A and probes it (probeht A) in the Match (A = D), and the Match output reaches the Reduce (on A) over a fifo connection.]
   - The optimizer can pick partitioning strategies from the PACT signature, e.g., for Match: broadcast, partition, SFR.
   - Partitioning strategies are propagated top-down as interesting properties.
   - Preserved partitioning can be inferred via the R/W sets, which identifies pass-through UDFs.
   - A Reduce does not always imply a physical sort operator.
   Slide 23
20 Operator Reordering. [Alternative physical plan in which the Reduce (on A) is executed before the Match (A = D): the Source 1 branch is partitioned on A, aggregated, and builds the hash table (buildht A), while the Source 2 branch is partitioned/sorted on D and probes it (probeht D).]
   - Reordering PACTs can reduce the data volume and introduce new partitioning opportunities.
   - Reordering, partitioning, and physical operators are optimized in one stage to obtain the optimal execution plan.
   - Powerful transformations based on read and write conflicts can emulate most relational optimizations without knowing the operator semantics.
   Slide 24
21 Example Transformations.
   - Theorem 1: two Map operators can be reordered if their UDFs have only read-read conflicts.
   - Theorem 2: for a Map and a Reduce, we additionally need the Reduce key groups to be preserved.
   Enabled optimizations: selection push-down, (bushy) join reordering, aggregation push-down. The VLDB 2012 paper gives formal proofs and conditions for safe reorderings of all possible PACT pairs based on R_f, W_f, E_f, shows the equivalence to the invariant grouping transformation [Chaudhuri & Shim 1994], and covers the reordering of non-relational Reduce functions.
   Slide 25
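As a toy illustration (not from the talk) of how read/write sets licence reordering, again using the illustrative Record type from above: f and g share only a read-read conflict on field 0 and can be swapped, whereas h writes a field that f reads and cannot.

    // f reads {0}, writes {1}; g reads {0, 2}, writes {3}.
    // The only conflict is read-read on field 0, so f and g commute and the
    // optimizer may execute them in either order.
    class ReorderingExample {
        static void f(Record r, Collector col) {
            r.set(1, ((Integer) r.get(0)) + 1);
            col.emit(r);
        }

        static void g(Record r, Collector col) {
            r.set(3, ((Integer) r.get(0)) * ((Integer) r.get(2)));
            col.emit(r);
        }

        // h writes field 0, which f reads (a write-read conflict), so h and f
        // must not be swapped: the result could change.
        static void h(Record r, Collector col) {
            r.set(0, 0);
            col.emit(r);
        }
    }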
22 Support for Iterative Queries: Spinning Fast Iterative Dataflows (VLDB 2012)
23 Bulk Iterations. [PageRank dataflow: the dynamic rank set S (pid, r) is joined with the constant adjacency data A (pid, tid, p) by a Match (on pid) emitting (tid, k = r*p), and a Reduce (on tid) sums up the partial ranks into the new rank set (pid = tid, r = sum(k)).]
   - Bulk iterations recompute the state at each iteration; the conceptual feedback edge in the dataflow makes lazy unrolling possible.
   - Distinguish the dynamic data path (different data each iteration) from the constant data path (same data each iteration).
   - Caching heuristics apply where the constant and dynamic paths meet; cached data may be indexed.
   - The optimizer weighs the costs of the constant and dynamic data paths differently and automatically favors plans that push work to the constant path.
   Slide 29
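For intuition, here is a self-contained, single-machine Java sketch of the same bulk iteration (assumed structure, not the actual PACT plan or API): the adjacency data plays the role of the constant path and is built once, while the rank vector is the dynamic path recomputed in every iteration.

    import java.util.*;

    // Bulk-iteration PageRank sketch: the edges (constant path) are set up once,
    // the ranks (dynamic path) are recomputed in every iteration.
    class PageRankSketch {
        public static void main(String[] args) {
            // Constant path: pid -> list of target tids (graph topology).
            Map<Integer, List<Integer>> adjacency = Map.of(
                    0, List.of(1, 2),
                    1, List.of(2),
                    2, List.of(0));
            int n = adjacency.size();
            double damping = 0.85;

            // Dynamic path: current rank per page.
            Map<Integer, Double> rank = new HashMap<>();
            adjacency.keySet().forEach(pid -> rank.put(pid, 1.0 / n));

            for (int iter = 0; iter < 20; iter++) {
                // "Match (on pid)": join each page's rank with its outgoing edges
                // and emit a partial rank per target.
                Map<Integer, Double> partial = new HashMap<>();
                for (Map.Entry<Integer, List<Integer>> e : adjacency.entrySet()) {
                    double share = rank.get(e.getKey()) / e.getValue().size();
                    for (int tid : e.getValue()) {
                        partial.merge(tid, share, Double::sum);
                    }
                }
                // "Reduce (on tid)": sum up the partial ranks into the new rank set.
                for (int pid : adjacency.keySet()) {
                    rank.put(pid, (1 - damping) / n
                            + damping * partial.getOrDefault(pid, 0.0));
                }
            }
            System.out.println(rank);
        }
    }

In the distributed plan, the point of the constant/dynamic distinction is that the optimizer can cache (and even index) the adjacency side once and only re-execute the dynamic side per iteration.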
24 PageRank: Two Optimizer Plans. [Figure: two alternative physical plans for the PageRank iteration. They differ in which input builds versus probes the hash table for the Match (on pid), in what is cached on the constant data path, and in where the partition/sort (on tid) for the Reduce happens.] Slide 30
25 Sparse Computational Dependencies. Only parts of the state change at each iteration, based on the parts that changed in the previous iteration; this holds for most graph algorithms and beyond. Huge savings are possible if we do not recompute the whole state. This requires in-place updates of persistent state while still surfacing a pure functional model. [Figure: per-vertex state (vid, cid) evolving over iterations S_0, S_1, ...]
26 Incremental Iterations. A different programming abstraction based on workset algorithms (see the sketch below):
   - The algorithm works on two sets: the solution set and the workset.
   - A delta to the solution set is computed from the workset.
   - A new workset is computed at each iteration.
   - The solution set is efficiently merged with the delta set.
   Slide 32
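A minimal single-machine sketch of the workset abstraction (assumed structure, not the actual Stratosphere API), using connected components: the solution set holds each vertex's current component id, the workset holds the vertices whose component changed, and each iteration only touches the neighbours of changed vertices before merging the delta back into the solution set.

    import java.util.*;

    // Workset (delta) iteration sketch: only neighbours of changed vertices are
    // revisited, and the delta is merged into the solution set in place.
    class ConnectedComponentsSketch {
        public static void main(String[] args) {
            Map<Integer, List<Integer>> neighbours = Map.of(
                    1, List.of(2), 2, List.of(1, 3), 3, List.of(2),
                    4, List.of(5), 5, List.of(4));

            // Solution set: vertex -> component id (initially its own id).
            Map<Integer, Integer> solution = new HashMap<>();
            neighbours.keySet().forEach(v -> solution.put(v, v));

            // Workset: vertices whose component changed in the last iteration.
            Set<Integer> workset = new HashSet<>(neighbours.keySet());

            while (!workset.isEmpty()) {
                Map<Integer, Integer> delta = new HashMap<>();
                for (int v : workset) {
                    int candidate = solution.get(v);
                    for (int w : neighbours.get(v)) {
                        if (candidate < solution.get(w)
                                && candidate < delta.getOrDefault(w, Integer.MAX_VALUE)) {
                            delta.put(w, candidate); // w adopts the smaller component id
                        }
                    }
                }
                solution.putAll(delta);   // merge the delta into the solution set
                workset = delta.keySet(); // changed vertices form the new workset
            }
            System.out.println(solution); // components: {1,2,3} -> 1 and {4,5} -> 4
        }
    }

Once the workset is empty the iteration terminates, which is exactly the property the runtime exploits: the work shrinks as fewer and fewer vertices change.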
27 Pregel as an Extended PACT Plan. The workset W_i holds the messages sent by the vertices; the delta set D_{i+1} holds the state of the changed vertices. A CoGroup of the messages with the vertex state aggregates the messages and derives the new state; a Match with the graph topology creates the messages W_{i+1} from the new state; and the delta is applied as in-place updates to the solution set S_i in a persistent hash table. Slide 33
28 Big Data looks Tiny from the Stratosphere. Stratosphere is a declarative, massively parallel Big Data analytics system, funded by DFG and EIT, and available open source under the Apache license.
   - A compiler translates the data analysis program into a PACT dataflow program.
   - The Stratosphere optimizer turns it into an execution plan: automatic selection of parallelization, shipping and local strategies, operator order and placement.
   - Runtime operators provide hash- and sort-based out-of-core operator implementations and memory management.
   - The parallel execution engine handles task scheduling, network data transfers, resource allocation, and checkpointing of the job graph / execution graph.
   stratosphere.eu
   Slide 34