Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect



Similar documents
Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses

Challenges for Data Driven Systems

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Daniel J. Adabi. Workshop presentation by Lukas Probst

Putting Apache Kafka to Use!

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

The Stratosphere Big Data Analytics Platform

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Spark: Making Big Data Interactive & Real-Time

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

Cloud Computing at Google. Architecture

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Unified Big Data Processing with Apache Spark. Matei

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Architectures for Big Data Analytics A database perspective

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

How To Use Big Data For Telco (For A Telco)

THEMIS: Fairness in Data Stream Processing under Overload

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Brave New World: Hadoop vs. Spark

Spark and the Big Data Library

CSE-E5430 Scalable Cloud Computing Lecture 11

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

CSE-E5430 Scalable Cloud Computing Lecture 2

Unified Batch & Stream Processing Platform

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Unified Big Data Analytics Pipeline. 连 城

Reference Architecture, Requirements, Gaps, Roles

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

An Approach to Implement Map Reduce with NoSQL Databases

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Fast Data in the Era of Big Data: Twitter s Real-

Apache Flink Next-gen data analysis. Kostas

Big Data Processing with Google s MapReduce. Alexandru Costan

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Lecture 10 - Functional programming: Hadoop and MapReduce

Big Data Analytics - Accelerated. stream-horizon.com

Big Data looks Tiny from the Stratosphere

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Big Data and Apache Hadoop s MapReduce

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Can the Elephants Handle the NoSQL Onslaught?

Big Data With Hadoop

NoSQL for SQL Professionals William McKnight

What s next for the Berkeley Data Analytics Stack?

Large-Scale Data Processing

A1 and FARM scalable graph database on top of a transactional memory layer

Tolerating Late Timing Faults in Real Time Stream Processing Systems. Stuart Perks

A Brief Introduction to Apache Tez

Map Reduce / Hadoop / HDFS

Promise of Low-Latency Stable Storage for Enterprise Solutions

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Moving From Hadoop to Spark

Chapter 7. Using Hadoop Cluster and MapReduce

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Real Time Big Data Processing

Introduction to Hadoop

Streaming items through a cluster with Spark Streaming

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Big Data and Industrial Internet

Using In-Memory Computing to Simplify Big Data Analytics

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Developing MapReduce Programs

Testing Big data is one of the biggest

Big Data Research in the AMPLab: BDAS and Beyond

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Apache Flink. Fast and Reliable Large-Scale Data Processing

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Scaling Out With Apache Spark. DTL Meeting Slides based on

Using Data Mining and Machine Learning in Retail

Hadoop IST 734 SS CHUNG

BIG DATA What it is and how to use?

Report Data Management in the Cloud: Limitations and Opportunities

The basic data mining algorithms introduced may be enhanced in a number of ways.

SAP and Hortonworks Reference Architecture

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Parallel Computing. Benson Muite. benson.

Distributed Computing and Big Data: Hadoop and MapReduce

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Transcription:

Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Simple past - Traditional Databases Database Management System (DBMS): Data relatively static but queries dynamic Queries DBMS Results Index Data Persistent relations Random access Low update rate Unbounded disk storage One-time queries Finite query result Queries exploit (static) indices 2

Simple past Small Data Small data OnLine Transaction Processing (OLTP) & OnLine Analytic Processing (OLAP) end of one-size fits all Transactions store data, moved to warehouse for reporting & analysis: e.g., client orders RDBMS difficult to scale Tx processing has substantial overhead for ACID properties, OLAP bottlenecked by ingestion (indexing)

Infinite Present - MapReduce Explosive growth of Data Volume (Big Data), e.g. clickstreams Dynamic resource availability (Cloud Computing) MapReduce much simpler data and processing model log everything, process in parallel (infinitely scalable) Input key*value pairs Input key*value pairs... Data store 1 map Data store n map (key 1, values...) (key 2, values...) (key 3, values...) (key 1, values...) (key 2, values...) (key 3, values...) == Barrier == : Aggregates intermediate values by output key key 1, intermediate values key 2, intermediate values key 3, intermediate values reduce reduce reduce final key 1 values final key 2 values final key 3 values

Infinite Present - MapReduce problems Costs: data still grows faster than storage and processing capacity Latency: everything is on disks, slow iteration (e.g., PageRank, Logistic Regression, Gradient descent, etc..) Map step: break page rank into even fragments to distribute to link targets Reduce step: add together fragments into next PageRank Iterate for next step... Batch processing model: stale results(e.g.,add 1 link) Page 5 View, Master, Slide Master to change this text to the title of your presentation

Infinite Present - Spark Raise of in-memory processing: Spark intermediate results can be stored in memory: fast iteration Can you go even lower latency? Page 6 View, Master, Slide Master to change this text to the title of your presentation

Fast Continuous - Stream Processing Systems Fast Data systems require quick responses (Velocity): fraud detection, high frequency trading, ad-prediction, etc... Event stream processing process one item at a time in real-time, state stored in memory, e.g., online machine learning Stream DSPS Results Transient streams Sequential access Potentially high rate Bounded main memory Continuous queries Working Storage Queries Produce time-dependant result stream Indexing?

Fast Continuous State in Stream Processing Data flow graph with processing operators - Apache S4, Twitter Storm, Nokia Dempsy,... Most interesting operators are stateful 8

Fast Continuous - Challenges Consider streaming recommender system (collaborative filtering) Input streams: user activities (item purchases, page views, clicks etc) Operators: maintain recommendation matrices Output streams: recommendations for users State: user-item matrix State: item-item matrix Activity user 1 Activity user 2 Activity user 3 Item recommendations 9 Processing needs to keep up with varying input rate acquire resources dynamically Fault-tolerance cannot restart processing from the start if there is a failure

Fast Continuous - Parallelising Stream Processing E Exploit data parallelism for scaling out 10

Fast Continuous: Elasticity and Fault tolerance Elasticity Provisioning for workload peaks unnecessarily conservative 100%" Utilisation 80%" 60%" 40%" 20%" Courtesy of MSRC 0%" 09/07" 09/08" 09/09" 09/10" 09/11" 09/12" 09/13" Date Failure tolerance Failure is a common occurrence Active fault-tolerance requires 2x resources Passive fault-tolerance leads to long recovery times 11

SEEP: Stream Processing System [SIGMOD 13] Dynamic scale out: increase resources when workload peaks occur need to partition the state and distribute across operators Hybrid fault-tolerance: low resource overhead with fast recovery checkpoint state periodically and reprocess a small subset of recent data Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management, ACM International Conference on Management of Data (SIGMOD), New York, NY, June 2013 12

Simpler Imperative dataflow programming How do we write data processing queries? Common intermediate representation is dataflow models, tedious to use directly Functional approach is popular and easy to translate to dataflows e.g. Spark Can we do something for poor imperative programmers out there?

pers, Java distributed e dataflow ogram and to an ex- Using prossing tasks e variablelies on the processing. An SDG te: it is a hich exeory state. large state an process ned across n local inmputation. ints to acan be rec- Simpler Imperative Imperative Big Data Processing [USENIX ATC 14] Take annotated Java, and translate to a dataflow model Algorithm 1: Online collaborative filtering 1 @Partitioned Matrix useritem = new Matrix(); 2 @Partial Matrix coocc = new Matrix(); 3 4 void addrating(int user, int item, int rating) { 5 useritem.setelement(user, item, rating); 6 Vector userrow = useritem.getrow(user); 7 for (int i= ; i < userrow.size(); i++) 8 if (userrow.get(i) > ){ 9 int count = coocc.getelement(item, i); 10 coocc.setelement(item, i, count + 1); 11 coocc.setelement(i, item, count + 1); 12 } 13 } 14 Vector getrec(int user) { 15 Vector userrow = useritem.getrow(user); 16 @Partial Vector userrec = @Global coocc.multiply( userrow); 17 Vector rec = merge(@global userrec); 18 return rec; 19 } 20 Vector merge(@collection Vector[] alluserrec) { 21 Vector rec = new Vector(allUserRec[ ].size()); 22 for (Vector cur : alluserrec) 23 for (int i = ; i < alluserrec.length; i++) 24 rec.set(i, cur.get(i) + rec.get(i)); 25 return rec; 26 } 2Page 14 State in Data-Parallel Processing new rating rec request updateuseritem View, Master, Slide Master to change this text to the title of your presentation n 1 n 2 Task Element (TE) We describe an imperative implementation of a machine getuservec user Item State Element (SE) dataflow updatecoocc getrecvec coocc n 3 merge rec result

Simpler Imperative - Imperative Big Data Processing Two simple abstractions: partitioned state & partial state state merge partitioned state partial state Partition state state is partitioned across operators instances Partial state each instance has local state that can be merged when necessary Page 15 View, Master, Slide Master to change this text to the title of your presentation

Conditional Future Perfect One model to rule them all? Recap: Going to the zoo Transaction processing Batch processing: MapReduce Iterative processing: Spark Stream processing: Storm, Seep More: graph processing, deep learning, etc.. Can a new comprehensive model emerge? Spark can do graph and batch processing, has been extended to "micro-batch" stream processing Microsoft Naiad & our SEEP integrate stream and iterative + batch Can we integrate transaction & stream processing? Page 16 View, Master, Slide Master to change this text to the title of your presentation

Conditional Future Perfect Current Data Processing Architectures Online processing User Near-real-time processing Offline processing User Web Front End events Apache Kafka HDFS Web Front End transactions m m m r r r acce share Transaction Processing System Serving Layer Stream Processing System Batch Processing Unified Stream and Transactio Processing System (a) Different models, complex interoperation, duplicate code, data movement, resource sharing (b)

Conditional Future Perfect Unified Stream and Transaction processing Online processing User Near-real-time processing Offline processing User Web Front End events Apache Kafka HDFS Web Front End transactions m m m r r r access to shared state Transaction Processing System Serving Layer Stream Processing System Batch Processing Unified Stream and Transaction Processing System (a) (b) Page 18 View, Master, Slide Master to change this text to the title of your presentation

Conditional Future Perfect Unified Stream and Tx processing Why now? Renewed interest in in-memory databases, most business core DBs fit in memory! Stream processing are becoming more sophisticated in terms of state management Fault-tolerance approaches are becoming similar: deterministic databases, allow replication/ recomputation approaches Challenges Unify processing models operations (Tx with nondeterministic ordering) vs events (mostly deterministic) Synchronisation and ordering are key to scalable performance Performance vs generality tradeoffs (80% on 80%?) Page 19 View, Master, Slide Master to change this text to the title of your presentation

Conclusion Rapid evolution of data processing systems Transaction processing MapReduce In-Memory Processing Stream processing Each with their own strength Lack of cohesive view and interaction problems It is time to look back and map existing solutions work on integration, synergies, and tradeoffs Happy to collaborate on any of the above Page 20 View, Master, Slide Master to change this text to the title of your presentation