Online data processing with S4 and Omid*


Flavio Junqueira, Microsoft Research, Cambridge
* Work done while at Yahoo! Research

Big Data defined

Wikipedia: "In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."

Hortonworks: A Big Data system has four properties:
- It uses local storage to be fast but inexpensive
- It uses clusters of commodity hardware to be inexpensive
- It uses free software to be inexpensive
- It is open source to avoid expensive vendor lock-in

http://hortonworks.com/blog/big-data-defined/
http://hortonworks.com/blog/big-data-defined-part-deux-value-definition/

Hadoop @ Yahoo! Eric Baldeschwieler @IBM Big Data, May 2011

Context: Back in 2008
- Needed scalable real-time processing
  - Direct feedback
  - Optimization
  - Adaptation
- Use case
  - Ad ranking with clickthrough analysis
- Solution
  - Distributed stream processing platform
- At the time
  - No generic platform available
  - Research project
(Image source: unbounce.com)

Stream Processing Platform
- Enables applications that process streams of events
- Desirable properties
  - Online, meaning low-latency
  - Best effort
  - Scalable
  - Fault tolerance (perhaps limited)
  - Flexible
(Image source: ji-make.com)

S4: Simple Scalable Streaming System http://incubator.apache.org/s4

S4 Evolution
- Aug. 08: Internal Yahoo! codebase
- Nov. 10: Open source release (github), 0.3
- Sep. 11: Apache project, 0.4
- Sep. 11: S4 piper, 0.5

System overview
[Diagram labels: External stream, Internal stream, APP 1, APP 2, Sub-system 1, Sub-system 2, Complete application]
- Resources allocated independently

App
[Diagram labels: External stream, AdapterApp, MyApp, Cluster 1, Cluster 2, S4 Node, Processing Element (PE)]
- Send events according to hash partitioning
- App defines keys
- One PE per key
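The routing rule on this slide (events hashed by key to a node, one PE instance per key) can be sketched in plain Java. This is an illustration only, not S4 code: `CountingPE`, `SketchNode`, `HashPartitioner` and `PartitioningSketch` are hypothetical names, and taking the key's `hashCode()` modulo the node count is an assumption about how the partitioning might be done.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a PE: counts the events seen for its key. */
class CountingPE {
    private int count = 0;

    int onEvent() {
        return ++count;
    }
}

/** Hypothetical node: lazily creates one PE instance per key value. */
class SketchNode {
    private final Map<String, CountingPE> pesByKey = new HashMap<>();

    int dispatch(String key) {
        // One PE per key: the first event for a key creates its PE.
        return pesByKey.computeIfAbsent(key, k -> new CountingPE()).onEvent();
    }

    int peCount() {
        return pesByKey.size();
    }
}

/** Hypothetical router: picks the target node by hashing the event key. */
class HashPartitioner {
    private final SketchNode[] nodes;

    HashPartitioner(int nodeCount) {
        nodes = new SketchNode[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            nodes[i] = new SketchNode();
        }
    }

    int partitionOf(String key) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), nodes.length);
    }

    /** Routes one event and returns how many events its key has seen so far. */
    int send(String key) {
        return nodes[partitionOf(key)].dispatch(key);
    }

    SketchNode node(int index) {
        return nodes[index];
    }
}

class PartitioningSketch {
    public static void main(String[] args) {
        HashPartitioner cluster = new HashPartitioner(3);
        for (String name : new String[] {"alice", "bob", "alice"}) {
            int partition = cluster.partitionOf(name);
            int seen = cluster.send(name);
            System.out.println(name + " -> node " + partition + ", events for key: " + seen);
        }
    }
}
```

Because the same key always hashes to the same node, all state for that key stays local to one PE, which is what lets the application keep per-key state in memory.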

Deployment
[Diagram labels: Logical cluster, Physical cluster, Blessed Repository, Zookeeper, S4R, Publish new application]

Fault tolerance: Fail-over http://incubator.apache.org/s4/doc/0.6.0/fault_tolerance/

Fault tolerance: Checkpointing
- Pluggable backend
- Uncoordinated and asynchronous checkpoints
- Lazy recovery
  - PE state recovered upon message
- Scheme is lossy
  - Prevents loss of state accumulated over extended periods
http://incubator.apache.org/s4/doc/0.6.0/fault_tolerance/
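The checkpointing behaviour listed above can be illustrated with a small, self-contained Java sketch. It is not the S4 implementation: `CheckpointStore`, `CheckpointingNode` and `CheckpointSketch` are hypothetical names, an in-memory map stands in for the pluggable backend, and checkpointing every N events is an assumed policy chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical pluggable backend: an in-memory map stands in for remote storage. */
class CheckpointStore {
    private final Map<String, Integer> saved = new ConcurrentHashMap<>();

    void save(String key, int state) {
        saved.put(key, state);
    }

    Integer load(String key) {
        return saved.get(key);
    }
}

/** Hypothetical node hosting one counter per key, checkpointed every `interval` events. */
class CheckpointingNode {
    private final Map<String, Integer> counters = new HashMap<>();
    private final CheckpointStore store;
    private final int interval;
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    CheckpointingNode(CheckpointStore store, int interval) {
        this.store = store;
        this.interval = interval;
    }

    int onEvent(String key) {
        Integer current = counters.get(key);
        if (current == null) {
            // Lazy recovery: state is fetched only when the first event for the key arrives.
            Integer restored = store.load(key);
            current = (restored != null) ? restored : 0;
        }
        final int updated = current + 1;
        counters.put(key, updated);
        if (updated % interval == 0) {
            // Asynchronous and uncoordinated: the write happens off the event path,
            // and each key is checkpointed independently of the others.
            writer.submit(() -> store.save(key, updated));
        }
        return updated;
    }

    /** Waits for pending checkpoint writes, then stops the writer thread. */
    void shutdown() {
        writer.shutdown();
        try {
            writer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class CheckpointSketch {
    public static void main(String[] args) {
        CheckpointStore store = new CheckpointStore();
        CheckpointingNode node = new CheckpointingNode(store, 2);
        for (int i = 0; i < 5; i++) {
            node.onEvent("alice");
        }
        node.shutdown(); // the node goes away; its in-memory counters are lost

        CheckpointingNode replacement = new CheckpointingNode(store, 2);
        System.out.println("count after recovery: " + replacement.onEvent("alice"));
        replacement.shutdown();
    }
}
```

In this run the fifth event was never checkpointed, so the replacement node resumes from the saved count of 4. That is the lossy part of the scheme: recent updates can be lost, but state accumulated over a long period is not.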

Writing an app

Skeleton of an app
- HelloInputAdapter
  - Events from external source
- HelloApp
  - Creates topology
  - Connects adapter to first PE
- HelloPE
  - Process events

HelloInputAdapter: events from external source

    public class HelloInputAdapter extends AdapterApp {

        @Override
        protected void onStart() {
            // The slide elides the socket-handling code that defines
            // `line` (the text read from the client) and `connectedSocket`.
            Event event = new Event();
            event.put("name", String.class, line);
            // Forward the event to the remote stream of the target application.
            getRemoteStream().put(event);
            connectedSocket.close();
        }
    }

HelloApp: creates topology, connects adapter to first PE

    public class HelloApp extends App {

        @Override
        protected void onInit() {
            // Create a prototype.
            HelloPE helloPE = createPE(HelloPE.class);
            // Create a stream that listens to the "names" stream and
            // passes events to the helloPE instance.
            createInputStream("names", new KeyFinder<Event>() {
                @Override
                public List<String> get(Event event) {
                    // The event is keyed by its "name" attribute.
                    return Arrays.asList(new String[] { event.get("name") });
                }
            }, helloPE);
        }
    }

HelloPE: process events

    public class HelloPE extends ProcessingElement {

        // Per-key state; the slide uses `seen` without showing its declaration.
        boolean seen = false;

        public void onEvent(Event event) {
            System.out.println("Hello " + (seen ? "again " : "") + event.get("name") + "!");
            // Remember that this key has been greeted already.
            seen = true;
        }
    }

S4 Piper Lessons learned

Lessons from initial design
- State loss upon node crash
- Rigid communication layer
  - UDP only
  - No retransmission, no flow control
- Hard to use/debug/deploy
  - Subjective, but that's the overall feeling
- Isolated applications
- No regression tests

S4 Piper improvements
- Dynamic coupling of applications
  - Via a simple registration scheme
- Communication via TCP
  - Throttling
  - Retransmission and flow control
- Fault tolerance
  - Checkpointing
  - Node failover

Other ways of achieving low latency? Omid project: https://github.com/yahoo/omid

Context
- Incremental processing a la Percolator (Google, OSDI 2010)
  - Distributed transactions
  - Observers
  - Bigtable
- Use case
  - Search index
  - Online updates
  - Crawl to index in 5s
- Omid is about transactions

Why transactions?
- Offline: MapReduce
  - Fault tolerance
  - Scalability
- Online: Event processing
  - Shared state
  - Fault tolerance
  - Scalability
  - Txns

How does it differ from S4 stream processing?
- S4
  - Data lives in memory
  - Access to databases is expensive
- Incremental processing
  - Computation is close to the data
  - Higher latency
- Omid
  - Targets lower latency for transactions

Omid architecture
- WAL for recoverability
- Centralized Status Oracle (SO)
- Lock-free scheme
- Runs on unmodified HBase
- Status replicated to clients
- Replication reduces SO load
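A minimal sketch of how a centralized oracle might decide commits without transactions holding row locks, assuming snapshot-isolation-style write-write conflict checks: a transaction aborts if any row in its write set was committed by another transaction after it started. `StatusOracleSketch`, `begin` and `tryCommit` are hypothetical names for illustration, not the Omid API, and the sketch does not model the WAL, HBase, or the replication of status to clients.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical single-process stand-in for a centralized status oracle. */
class StatusOracleSketch {
    private long clock = 0;
    // Commit timestamp of the last transaction that wrote each row.
    private final Map<String, Long> lastCommit = new HashMap<>();

    /** Hands out a start timestamp that defines the transaction's snapshot. */
    synchronized long begin() {
        return ++clock;
    }

    /**
     * Returns a commit timestamp, or -1 if the transaction must abort.
     * `synchronized` only protects the oracle's own bookkeeping; transactions
     * never take locks on the rows they read or write.
     */
    synchronized long tryCommit(long startTs, Set<String> writeSet) {
        for (String row : writeSet) {
            Long committed = lastCommit.get(row);
            if (committed != null && committed > startTs) {
                return -1; // someone else committed this row after our snapshot
            }
        }
        long commitTs = ++clock;
        for (String row : writeSet) {
            lastCommit.put(row, commitTs);
        }
        return commitTs;
    }
}

class StatusOracleDemo {
    public static void main(String[] args) {
        StatusOracleSketch oracle = new StatusOracleSketch();
        long t1 = oracle.begin();
        long t2 = oracle.begin();
        Set<String> row = Collections.singleton("cluster-42");
        System.out.println("t1 committed: " + (oracle.tryCommit(t1, row) != -1));
        System.out.println("t2 committed: " + (oracle.tryCommit(t2, row) != -1));
    }
}
```

Here both transactions write the same row. The first to reach the oracle commits, and the second aborts because the row changed after its snapshot was taken; a client would typically retry it with a fresh start timestamp.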

Throughput vs. Latency

Use case
- News recommendation system
  - Users with similar interests are clustered
- Upon a new article
  - Check which clusters might be interested in that article
  - Recommend article to users in the cluster
- Problems txns solve
  - Concurrent operations reconfiguring the clusters
  - Queries while clusters are being reconfigured
http://www.slideshare.net/danielgmezferro/omid-efficient-transaction-management-and-incremental-processing-for-hbase-hadoop-summit-europe-2013

Wrap up

Online processing
- Goal
  - Receive events
  - Make them ready for consumption fast
- Two techniques
  - Stream processing
    - Events processed against small amount of local memory
    - Very low latency (+250k events/node/s)
  - Incremental processing
    - Shared state in the form of a datastore
    - Events processed against the datastore
    - Higher latency

Acknowledgements
- S4: Matthieu (lead developer), Daniel Gómez Ferro, Leo Neumeyer, Kishore Gopalakrishna
- Omid: Daniel Gómez Ferro (lead developer), Maysam Yabandeh, Ivan Kelly, Ben Reed

S4 project: http://incubator.apache.org/s4
Omid project: https://github.com/yahoo/omid