MBrace: Cloud Computing with Monads
1 MBrace: Cloud Computing with Monads
Jan Dzik, Nick Palladinos, Kostas Rontogiannis, Eirik Tsarpalis, Nikolaos Vathis
Nessos Information Technologies, SA
7th Workshop on Programming Languages and Operating Systems, November 3, 2013
Eirik Tsarpalis (Nessos IT) MBrace: Cloud Computing with Monads PLOS '13 1 / 29

2 Introduction / Motivation
- Distributed computation is challenging.
- Key to success: choose the right distribution framework.
- Each framework is tied to a particular programming abstraction.
- Map-Reduce, Actor model, Dataflow model, etc.

3 Introduction / Motivation / Established distributed frameworks
- Restrict to specific distribution patterns.
- Not expressive enough for certain classes of algorithms.
- Difficult to influence task granularity.
- Time consuming to deploy, manage and debug.

4 Introduction / What is MBrace?
1. A new programming model for the cloud.
2. An elastic, fault tolerant, multitasking cluster infrastructure.

5 Introduction / In This Talk
- Concentrate on the programming model.
- Distributed computation.
- Distributed data.
- Benchmarks.

6 The MBrace Programming Model / The Cloud Monad
- A monad for composing distribution workflows.
- Essentially a continuation monad that admits distribution.
- Based on F# computation expressions.
- Inspired by the successful F# asynchronous workflows.

7 The MBrace Programming Model / The Cloud Monad / A basic cloud workflow

    let download (url : string) =
        cloud {
            let client = new System.Net.WebClient()
            let content = client.DownloadString(url)
            return content
        }   // : Cloud<string>

8 The MBrace Programming Model / The Cloud Monad / Composing cloud workflows

    let downloadSequential () =
        cloud {
            // The URL arguments are missing from the source.
            let! c1 = download "..."
            let! c2 = download "..."
            let c = c1 + c2
            return c
        }

9 The MBrace Programming Model / Distribution Combinators / Parallel Composition

    let downloadParallel () =
        cloud {
            // The URL arguments are missing from the source.
            let! c1, c2 = download "..." <||> download "..."
            return c1 + c2
        }

10 The MBrace Programming Model / Distribution Combinators / Distribution Primitives: an overview
- Binary parallel operator:
    <||> : Cloud<'T> -> Cloud<'U> -> Cloud<'T * 'U>
- Variadic parallel combinator:
    Cloud.Parallel : Cloud<'T> [] -> Cloud<'T []>
- Non-deterministic parallel combinator:
    Cloud.Choice : Cloud<'T option> [] -> Cloud<'T option>

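The binary parallel operator can be approximated on a single machine with F# asynchronous workflows, which the talk names as the inspiration for cloud workflows. The following is a minimal local sketch, not MBrace code: `Async<'T>` stands in for `Cloud<'T>`, the operator is redefined here over `Async`, and `fakeDownload` is a hypothetical placeholder for `download`.

```fsharp
// Local, single-machine analogue of the binary parallel operator.
// Assumption: Async<'T> stands in for Cloud<'T>; this is NOT the MBrace implementation.
let (<||>) (left : Async<'T>) (right : Async<'U>) : Async<'T * 'U> =
    async {
        // Start both computations as child workflows so that they can overlap.
        let! leftHandle = Async.StartChild left
        let! rightHandle = Async.StartChild right
        // Wait for both results and pair them.
        let! l = leftHandle
        let! r = rightHandle
        return l, r
    }

// Hypothetical stand-in for `download`: returns the length of its argument.
let fakeDownload (url : string) : Async<int> =
    async { return url.Length }

let total : int =
    async {
        let! c1, c2 = fakeDownload "abc" <||> fakeDownload "de"
        return c1 + c2
    }
    |> Async.RunSynchronously
```

Both child workflows are started before either result is awaited, which is what distinguishes this from sequencing two `let!` bindings.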
11 The MBrace Programming Model / Additional Constructs / Cloud Monad: additional constructs
- Monadic for loops.
- Monadic while loops.
- Monadic exception handling.

12 The MBrace Programming Model / Additional Constructs / Example: Inverse squares

    let inverseSquares (inputs : int []) =
        cloud {
            let jobs : Cloud<float> [] =
                [| for i in inputs -> cloud { return 1.0 / float (i * i) } |]
            try
                let! results = Cloud.Parallel jobs
                return Array.sum results
            with :? DivideByZeroException ->
                return -1.0
        }

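The same shape of exception handling around a parallel fan-out can be tried locally with `Async.Parallel`. This is a hedged sketch rather than MBrace code: `Async` replaces `Cloud`, and the hypothetical `inverseSquaresScaled` uses integer division, because .NET floating-point division by zero yields infinity instead of raising `DivideByZeroException`.

```fsharp
open System

// Local analogue of the slide's example. Assumption: Async stands in for Cloud.
// Integer arithmetic is used so that a zero input actually raises an exception.
let inverseSquaresScaled (inputs : int []) : Async<int> =
    async {
        let jobs = [| for i in inputs -> async { return 1000 / (i * i) } |]
        try
            // An exception raised in any child job surfaces here, in the parent.
            let! results = Async.Parallel jobs
            return Array.sum results
        with :? DivideByZeroException ->
            return -1
    }

let okResult = inverseSquaresScaled [| 1; 2 |] |> Async.RunSynchronously
let badResult = inverseSquaresScaled [| 1; 0 |] |> Async.RunSynchronously
```

The handler sits in the parent workflow, yet it catches a failure that occurred inside one of the parallel jobs, which is the behaviour the slide calls monadic exception handling.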
13 The MBrace Programming Model / Evaluation in the Cloud / How is it all executed?
- Scheduler/worker cluster organization.
- Symbolic execution stack (free monad/trampolines).
- Scheduler interprets monadic skeleton.
- Native leaf expressions dispatched to workers.
- Symbolic stack winds across multiple machines.

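One way to picture the symbolic execution stack is a workflow represented as data, plus an interpreter that keeps pending continuations in an explicit list. The toy below is a sketch under heavy assumptions, not the MBrace scheduler: the `Workflow` type, `runOnWorker`, and `interpret` are hypothetical names, values are fixed to `int` for brevity, and everything runs in one process.

```fsharp
// Toy "monadic skeleton": the workflow is plain data that an interpreter can walk.
type Workflow =
    | Return of int
    | Leaf of (unit -> int)                               // native code, run by a "worker"
    | Bind of Workflow * (int -> Workflow)                // sequential composition
    | Parallel of Workflow list * (int list -> Workflow)  // fan-out, then continue

// Hypothetical stand-in for dispatching a native leaf expression to a worker node.
let runOnWorker (job : unit -> int) : int = job ()

// The "scheduler": interprets the skeleton, keeping pending continuations
// in an explicit stack (a list) instead of relying on the native call stack.
let rec interpret (stack : (int -> Workflow) list) (workflow : Workflow) : int =
    match workflow, stack with
    | Return value, [] -> value
    | Return value, next :: rest -> interpret rest (next value)
    | Leaf job, _ -> interpret stack (Return (runOnWorker job))
    | Bind (first, next), _ -> interpret (next :: stack) first
    | Parallel (branches, next), _ ->
        // Each branch starts with its own empty stack; here they run one after another.
        let results = List.map (interpret []) branches
        interpret stack (next results)

let leaves = [ Leaf (fun () -> 2 * 3); Leaf (fun () -> 4 + 1) ]
let summed = Parallel (leaves, (fun results -> Return (List.sum results)))
let example = Bind (summed, (fun sum -> Return (sum * 10)))
let result = interpret [] example
```

Because the stack is an ordinary value, it could in principle be serialized and resumed elsewhere, which is the intuition behind a stack that winds across multiple machines.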
14 The MBrace Programming Model / Map-Reduce / A Map-Reduce implementation

    let rec mapReduce (map : 'T -> Cloud<'R>)
                      (reduce : 'R -> 'R -> Cloud<'R>)
                      (identity : 'R)
                      (input : 'T list) =
        cloud {
            match input with
            | [] -> return identity
            | [value] -> return! map value
            | _ ->
                let left, right = List.split input
                let! r1, r2 =
                    (mapReduce map reduce identity left)
                    <||> (mapReduce map reduce identity right)
                return! reduce r1 r2
        }

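The divide-and-conquer structure of this Map-Reduce can be exercised locally by substituting `Async` for `Cloud`. The sketch below is an illustration under that assumption, not the MBrace version: it uses `Async.Parallel` in place of the binary parallel operator and `List.splitAt` to halve the input, and the `square` and `add` helpers are hypothetical.

```fsharp
// Local analogue of the slide's mapReduce. Assumption: Async stands in for Cloud.
let rec mapReduce (map : 'T -> Async<'R>)
                  (reduce : 'R -> 'R -> Async<'R>)
                  (identity : 'R)
                  (input : 'T list) : Async<'R> =
    async {
        match input with
        | [] -> return identity
        | [value] -> return! map value
        | _ ->
            // Halve the input and process both halves in parallel.
            let left, right = List.splitAt (List.length input / 2) input
            let leftJob = mapReduce map reduce identity left
            let rightJob = mapReduce map reduce identity right
            let! results = Async.Parallel [ leftJob; rightJob ]
            return! reduce results.[0] results.[1]
    }

// Hypothetical map and reduce functions: sum of squares.
let square (x : int) = async { return x * x }
let add (a : int) (b : int) = async { return a + b }

let sumOfSquares =
    mapReduce square add 0 [ 1; 2; 3; 4 ] |> Async.RunSynchronously
```

Task granularity is decided by the recursion itself: each split produces two parallel jobs, so a coarser cut-off could be introduced by handling short lists sequentially.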
15 The Distributed Data Programming Model / Introduction / What about Data Distribution?
- MBrace does NOT include a storage service (for now).
- Relies on third-party storage services.
- Storage Provider plugin architecture.
- Out-of-the-box support for FileSystem, SQL and Azure.
- Future support for HDFS and Amazon S3.

16 The Distributed Data Programming Model / The MBrace Data Programming Model
- Storage services interfaced through data primitives.
- Data primitives act as references to distributed resources.
- Initialized or updated through the monad.
- Come in immutable or mutable flavors.

17 The Distributed Data Programming Model / Cloud Ref
- Simplest distributed data primitive of MBrace.
- Generic reference to a stored value.
- Conceptually similar to ML ref cells.
- Immutable by design.
- Cached in worker nodes for performance.

18 The Distributed Data Programming Model / Cloud Ref / Cloud Ref: Example

    let createRef (inputs : int []) =
        cloud {
            let! ref = CloudRef.New inputs
            return ref   // : CloudRef<int []>
        }

    let deref (ref : CloudRef<int []>) =
        cloud {
            let content = ref.Value
            return content   // : int []
        }

19 The Distributed Data Programming Model / Cloud Ref / Application: Data Sharding

    type DistribTree<'T> =
        | Leaf of 'T
        | Branch of CloudRef<DistribTree<'T>> * CloudRef<DistribTree<'T>>

    let rec map (f : 'T -> 'S) (tree : DistribTree<'T>) =
        cloud {
            match tree with
            | Leaf t -> return! CloudRef.New (Leaf (f t))
            | Branch(l, r) ->
                let! l', r' = map f l.Value <||> map f r.Value
                return! CloudRef.New (Branch(l', r'))
        }

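The sharded tree can be mimicked in local memory to see how the recursion and the references fit together. In this sketch, which is not MBrace code, `LocalRef<'T>` is a hypothetical immutable box that stands in for `CloudRef<'T>`, `Async` stands in for `Cloud`, and `sum` is a helper added only to inspect the result.

```fsharp
// Hypothetical stand-in for CloudRef<'T>: an immutable box held in local memory.
type LocalRef<'T> = { Value : 'T }

type DistribTree<'T> =
    | Leaf of 'T
    | Branch of LocalRef<DistribTree<'T>> * LocalRef<DistribTree<'T>>

// Maps f over every leaf; both subtrees of a branch are processed in parallel.
let rec map (f : 'T -> 'S) (tree : DistribTree<'T>) : Async<LocalRef<DistribTree<'S>>> =
    async {
        match tree with
        | Leaf t -> return { Value = Leaf (f t) }
        | Branch (l, r) ->
            let! mapped = Async.Parallel [ map f l.Value; map f r.Value ]
            return { Value = Branch (mapped.[0], mapped.[1]) }
    }

// Helper for inspection only: adds up all leaves of an integer tree.
let rec sum (tree : DistribTree<int>) : int =
    match tree with
    | Leaf v -> v
    | Branch (l, r) -> sum l.Value + sum r.Value

let tree =
    Branch ({ Value = Leaf 1 },
            { Value = Branch ({ Value = Leaf 2 }, { Value = Leaf 3 }) })

let doubled = map (fun x -> x * 2) tree |> Async.RunSynchronously
let treeTotal = sum doubled.Value
```

Because the references are immutable, `map` never updates the input tree; it builds a new tree of new references, which is what makes caching on worker nodes safe in the real system.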
20 The Distributed Data Programming Model / Cloud File
- References files in the distributed store.
- Untyped, immutable, binary blobs.

21 The Distributed Data Programming Model / Cloud File / Cloud File: Example

    let getSize (file : CloudFile) =
        cloud {
            let! bytes = CloudFile.ReadAllBytes file
            return bytes.Length / 1024
        }

    cloud {
        let! files = CloudDir.GetFiles "/path/to/files"
        let jobs = Array.map getSize files
        let! sizes = Cloud.Parallel jobs
        return Array.sum sizes
    }

22 The MBrace Framework / Performance
- We tested MBrace against Hadoop.
- Both frameworks were run on Windows Azure.
- Clusters consisted of 4, 8, 16 and 32 quad-core nodes.
- Two algorithms were tested, grep and k-means.
- Source code available on GitHub.

23 The MBrace Framework / Performance / Distributed Grep (Windows Azure)
- Count occurrences of given pattern from input files.
- Straightforward Map-Reduce algorithm.
- Input data was 32, 64, 128 and 256 GB of text.

24 The MBrace Framework / Performance / Distributed Grep (Windows Azure)
[Figure: running time (sec) against number of worker cores, MBrace vs. Hadoop.]

25 The MBrace Framework / Performance / k-means Clustering (Windows Azure)
- Centroid computation out of a set of vectors.
- Iterative algorithm.
- Not naturally definable with Map-Reduce workflows.
- Hadoop implementation from Apache Mahout library.
- Input was 10^6 randomly generated, 100-dimensional points.

26 The MBrace Framework / Performance / k-means Clustering (Windows Azure)
[Figure: running time (sec) against number of worker cores, MBrace vs. Hadoop.]

27 Conclusions & Future Work / Conclusions
- A big data platform for the .NET Framework.
- Language-integrated cloud workflows.
- User-specifiable parallelism patterns and task granularity.
- Distributed exception handling.
- Pluggable storage services.
- Data API integrated with programming model.

28 Conclusions & Future Work / Future Work
- Improved C# support.
- A rich library of combinators and parallelism patterns.
- A LINQ provider for data parallelism.
- Support for the Mono framework and Linux.

29 Conclusions & Future Work / Thank You!
Questions?
