Information Processing, Big Data, and the Cloud




James Horey, Computational Sciences & Engineering, Oak Ridge National Laboratory
Fall Creek Falls 2010

Information Processing Systems
- Data-intensive computing: the fourth paradigm of science [J. Gray]
- Modeling & simulation (traditional HPC strength): set model parameters and generate lots of data; assumes a priori that you know what you are doing
- Information processing (traditional commercial strength): take lots of data and produce summaries via queries, aggregation, data mining, and visualization; you may not know what you want
- These disciplines are not mutually incompatible; the overlap is where we can do interesting work
[Diagram: model parameters drive simulation, which produces data; observations produce data for processing]

Sources of Data
- Text (Web, papers, emails): trend analysis, natural-language translation
- Social media (Facebook, Twitter, Netflix): clique analysis, epidemic spread
- Computer logs (hardware failures, software bugs, security): predictive analysis, intrusion detection
- Geospatial (GIS, satellite imagery, volunteered data): zonal statistics, population modeling

Significance
- Modeling & simulation is heavily invested in parallel and distributed systems (MPI, OpenMP, ...); more computers mean better simulations
- Modeling & simulation is very heterogeneous and needs complicated tools (MPI, "the most popular programming tool in the world")
- IPS is mostly done on single machines (workstation tools: Matlab, R, Python); "only statisticians like R"
- Data tools emphasize ease of use and visualization; IPS applications are often homogeneous
- But things are changing (Google, Facebook, Twitter): more data, more complexity, more tools

Data Singularity
- Once upon a time we were data starved: collecting data was (and is) hard, slow, and analog (< 100 KB / day)
- Now we have more data than we know what to do with: collecting data is becoming easier, faster, and digital (sensors, mobile devices, volunteered data)
- Data is generated at increasing rates; we must analyze data at a rate faster than collection or we will drown

What is Big Data?
- Data-related problems that exceed local capacity
- Data becoming available at collection/storage limits [Shankar]; the gap between data volume and memory capacity is increasing year over year
- Single-machine analysis is approaching memory limits ("my graph won't fit in memory, so I can't use R")
- Traditional data tools need to evolve: employ distributed resources, push computation to the edge for aggressive filtering, enable a wider range of data scientists
- What will these tools look like? Where can I get lots of machines?

Cloud Computing
- "Maybe I'm an idiot, but I have no idea what anyone is talking about" [L. Ellison]
- [Hardware, platform]-as-a-service: no contracts, no grants, almost instantly available
- Enables scale-up and scale-down
- Not so great for HPC: opaque interconnect
- Can Clouds help us with the Big Data problem?

MapReduce as a Starting Point
- MapReduce is a functional, data-parallel model [Dean, Ghemawat '04]
- Maps compute over individual data elements and output <key, value> pairs; the system sorts by key; reduce operations compute over collections of data elements grouped by key
- Applications: reverse index, PageRank, word histograms
- Map operations run in parallel; reduce operations are parallelized by key
- Redundant map/reduce instances provide fault tolerance
- Scales to > 10,000 cores
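The map/sort/reduce pipeline above can be sketched in a few lines of single-process Python, using the word-histogram example from the slide. The function names and data are illustrative only, not part of any MapReduce framework's API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: compute over individual data elements, emitting <key, value> pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Group pairs by key (the framework's sort step), then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: compute over the collection of values that share a key
    return {key: sum(values) for key, values in groups.items()}

docs = ["the cloud scales", "the cloud stores data"]
counts = reduce_phase(map_phase(docs))
```

In a real system the groups produced by the sort are handed to many reducers in parallel, one partition of keys each, which is why reduce parallelism is "by key".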

MapReduce for Graph Processing: Google Pregel [Malewicz 2010]
- Local functions are applied to every vertex; vertices send and receive messages across edges at each timestep
- Vertices retain state across iterations (no need to touch disk)
- Global aggregation; hashing plus local lookup for communication
- Applications: consensus, clustering, bipartite matching
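A toy sketch of the vertex-centric loop described above: each vertex keeps its state across supersteps and exchanges messages with its neighbors until nothing changes, here propagating the maximum value as a simple consensus. The function signature and data layout are assumptions for illustration, not Pregel's actual API:

```python
def pregel_max(edges, values, max_supersteps=30):
    """Vertex-centric loop: at each superstep, active vertices send their
    value to neighbors; each vertex adopts the max of its state and its
    inbox, staying active only if it changed. State survives supersteps."""
    values = dict(values)
    active = set(values)
    for _ in range(max_supersteps):
        inbox = {v: [] for v in values}
        for v in active:                      # message passing along edges
            for nbr in edges.get(v, []):
                inbox[nbr].append(values[v])
        active = set()
        for v, msgs in inbox.items():         # local compute at every vertex
            new = max([values[v]] + msgs)
            if new != values[v]:
                values[v] = new
                active.add(v)
        if not active:                        # global halt: consensus reached
            break
    return values

result = pregel_max({1: [2], 2: [1, 3], 3: [2]}, {1: 5, 2: 1, 3: 9})
```

Because vertices only touch their own state and inbox, each superstep parallelizes across vertices, which is the point of the model.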

MapReduce for Geospatial: SenseReduce
- Designed for spatiotemporal data: operates over temporal windows
- Ability to layer multiple data sources and to read nearby geographic features ("join these vector files and perform zonal statistics with this raster file")
- Bulk synchronous, abstract regions
- Stack multiple analysis functions (perform average, then min, etc.) to reduce redundant work

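The "join vector files with a raster file" operation boils down to zonal statistics: map each raster cell to the containing vector feature, then reduce per feature. A minimal Python sketch, where the zone_of lookup function, cell coordinates, and values are hypothetical stand-ins (this illustrates the computation only, not SenseReduce's interface):

```python
from collections import defaultdict

def zonal_mean(cells, zone_of):
    """Map each raster cell to the vector feature (zone) containing it,
    then reduce per zone to a mean value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for coord, value in cells.items():
        zone = zone_of(coord)          # the vector-side 'join'
        sums[zone] += value
        counts[zone] += 1
    return {z: sums[z] / counts[z] for z in sums}

# Hypothetical raster: three cells, split between two zones by x-coordinate
cells = {(0, 0): 2.0, (0, 1): 4.0, (5, 5): 10.0}
stats = zonal_mean(cells, lambda p: "A" if p[0] < 3 else "B")
```

Stacking analysis functions, as the slide suggests, means running several reductions (mean, min, ...) over the same grouped cells in one pass instead of re-joining for each statistic.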

Partitioning Algorithms
- Data must be split across multiple machines
- Divide features into equal areas: may artificially split features; metadata update consistency becomes a problem
- Divide by logical features: may lead to imbalance if features have different complexity
- Customized partitioning schemes, including the ability to stack (i.e., first split by feature, then by area)

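The stacked scheme (first split by logical feature, then by area) can be sketched as follows; the bounding-box representation, feature names, and area threshold are illustrative assumptions:

```python
def partition(features, max_area):
    """First split by logical feature, then split any feature whose
    bounding box is too large into equal-area halves. Note the trade-off
    from the slide: the second step may artificially split a feature."""
    parts = []
    for name, (xmin, ymin, xmax, ymax) in features.items():
        if (xmax - xmin) * (ymax - ymin) <= max_area:
            parts.append((name, (xmin, ymin, xmax, ymax)))
        else:
            xmid = (xmin + xmax) / 2.0
            parts.append((name + "/west", (xmin, ymin, xmid, ymax)))
            parts.append((name + "/east", (xmid, ymin, xmax, ymax)))
    return parts

# Hypothetical features: the large one gets split, the small one does not
parts = partition({"lake": (0, 0, 4, 2), "pond": (0, 0, 1, 1)}, max_area=4)
```

The area threshold caps per-partition work even when feature complexity varies, at the cost of the metadata-consistency problem noted above.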

Scheduling and Load Balancing
- Use a geography-constrained aggregation tree: parents perform a partial reduce and also point to geographically close nodes
- This minimizes communication across nodes and simplifies load balancing
- Streaming data changes the computation load over time (mobile sensors, weather remote sensing)
- Rebalance computation during the map and reduce stages, using the number of features as an indicator of load
[Diagram: reduce operations over partitions with and without a shared boundary]
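A minimal sketch of the partial-reduce idea behind the aggregation tree: each parent combines its subtree's results before forwarding, so only one value per subtree travels upward. The tree shape and loads are made up, and a real system would additionally constrain the tree by geography:

```python
def tree_reduce(children, loads, root):
    """Each parent performs a partial reduce over its subtree before
    forwarding, so the root receives one value per child subtree rather
    than one per node. The reduction here sums feature counts, the load
    indicator mentioned on the slide."""
    def aggregate(node):
        partial = loads[node]
        for child in children.get(node, []):
            partial += aggregate(child)   # partial reduce at the parent
        return partial
    return aggregate(root)

total = tree_reduce({"root": ["a", "b"], "a": ["c"]},
                    {"root": 1, "a": 2, "b": 3, "c": 4}, "root")
```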

Programming at the Edge
- Data comes from smart networks of sensors (CPU + storage + communication)
- Edge processing reduces cost and latency
- What is the right model for edge processing? MapReduce data logger, functional data streams [Newton], spreadsheets [Horey, DCOSS], ...
- How do we schedule aggregation on these devices? Naive scheduling can lead to increased latency
- Assumption: information processing has to take place somewhere [Horey, KDCloud]
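Why aggregation placement matters can be shown with a toy message count: if readings are relayed raw to the gateway, every reading costs one message per hop, while in-network aggregation sends one combined message per link. Message count here is only a proxy for the latency and energy costs the slide mentions, and the tree topology is a made-up example:

```python
def uplink_messages(children, root, aggregate):
    """Counts messages crossing tree links when every sensor reports one
    reading to the root. With in-network aggregation each node forwards a
    single combined message to its parent; naive relaying forwards every
    reading hop by hop, so each reading costs its depth in messages."""
    total = 0
    def visit(node, depth):
        nonlocal total
        total += (1 if depth > 0 else 0) if aggregate else depth
        for child in children.get(node, []):
            visit(child, depth + 1)
    visit(root, 0)
    return total

# A three-node chain: gateway <- relay <- sensor
chain = {"gateway": ["relay"], "relay": ["sensor"]}
naive = uplink_messages(chain, "gateway", aggregate=False)
clever = uplink_messages(chain, "gateway", aggregate=True)
```

The gap between the two grows with network depth, which is why a scheduler that ignores topology can inflate latency.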

Dangers and Pitfalls
- Lots of data means lots of opportunities for security and privacy violations
- People will voluntarily give you sensitive information, and sue you later for using it
- Can we implement information processing applications in an anonymous fashion? In general, no (the Netflix prize); for specific applications, yes
- Negative Quad Tree: don't report your exact location, yet still build spatial distributions (~100,000 samples in a 32x32 area, 1 km^2)
[Diagram callouts: "I am here"; "These are the places where I definitely am not"; "But I will report one of these blue locations"]
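The Negative Quad Tree is the author's scheme; the underlying "report where you are not" idea can be illustrated with a flat negative survey over k cells. If a user in cell j reports one of the other k-1 cells uniformly at random, the true histogram is still recoverable from the report counts as t_i = n - (k-1) * r_i. A deterministic Python sketch using expected report counts (cell labels and counts are made up, and a real deployment would see sampling noise):

```python
def expected_reports(true_counts):
    """A user in cell j reports uniformly among the other k-1 cells, so
    the expected report count for cell i is (n - t_i) / (k - 1)."""
    k, n = len(true_counts), sum(true_counts.values())
    return {i: (n - t) / (k - 1) for i, t in true_counts.items()}

def reconstruct(reports, n):
    """Invert the expectation: t_i = n - (k - 1) * r_i."""
    k = len(reports)
    return {i: n - (k - 1) * r for i, r in reports.items()}

true = {"A": 10, "B": 30, "C": 60}
recovered = reconstruct(expected_reports(true), n=100)
```

No individual report reveals a user's cell, yet the aggregate spatial distribution comes back exactly in expectation.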

Open Questions
- Do we have a single programming model across cloud, mobile, sensors, text, graphs, and geospatial data?
- Will computation actually run at the edge? We need compelling cyber-physical examples (latency, cost, etc.)
- Are there lessons from HPC that can be applied to IPS, and vice versa? (e.g., non-blocking collectives [Hoefler])
- What is the right storage paradigm? POSIX, SQL, BigTable, VDB
- What programming language? IPS users are more varied
- What application domains can benefit from both approaches? (e.g., computational transportation)

Conclusions
- HPC applications and platforms will directly affect and benefit from IPS: bigger simulations produce more data and demand more IPS; simulations plus sensor data yield better simulations
- This will require new methods and tools: better programming tools (MPI, MapReduce) and bigger computational platforms
- Tools and methods benefit each other
[Diagram: model parameters drive simulation, which produces data; observations produce data for processing]