Apache S4: A Distributed Stream Computing Platform

Apache S4: A Distributed Stream Computing Platform
Presented at Stanford Infolab, Nov 4, 2011
http://incubator.apache.org/projects/s4 (migrating from http://s4.io)
S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org
Presented by Leo Neumeyer (@leoneu)

About Me
Born in Buenos Aires, Argentina; studied EE.
School and work in Canada (signal processing, speech coding).
SRI Int'l (Menlo Park): Speech Lab, DARPA benchmarks; the lab founded the speech recognition spin-off Nuance Comm Inc.
Mindstech: startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available).
Yahoo! Labs: search advertising (optimization, auctions).
Quantbench: its mission is to create a marketplace for data scientists, data providers, and investment funds.

S4 Project History
Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real time.
Open sourced in September 2009.
Moved to Apache Incubator in October 2011.

Motivation
Given multiple event streams, extract information using data-driven models in real time, with low latency, at scale.
Example applications: personalized search, Twitter trends, online parameter optimization, market price prediction, automatic trading, network intrusion detection, spam filtering, sensor networks.
It's fun!

S4 Architecture
[Diagram: a node runs a server that hosts apps; each app contains PE prototypes, PE instances, and streams.]
Nodes: the number of nodes is unlimited, and each node runs exactly one server process.
Server: loads and unloads apps.
Apps: encapsulate units of work and can consume and produce event streams. An app is a graph composed of PE (processing element) prototypes and streams that produce, consume, and transmit messages.
PE instances: clones of a PE prototype. Each instance is associated with a unique key and holds the state for that key.
S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable, event-driven, pluggable platform that allows programmers to easily implement applications for processing continuous, unbounded streams of data.
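The relationship between a PE prototype and its keyed instances can be sketched in a few lines of Python. This is only an illustration of the idea described on the slide, not the S4 API: the names CounterPE, Stream, put, and on_event are hypothetical, and everything runs in a single process.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CounterPE:
    """Hypothetical processing element that counts the events for one key."""
    key: str = ""
    count: int = 0

    def on_event(self, event: dict) -> None:
        self.count += 1


@dataclass
class Stream:
    """Routes each event to the PE instance that owns the event's key."""
    prototype: CounterPE
    key_field: str
    instances: dict = field(default_factory=dict)

    def put(self, event: dict) -> None:
        key = event[self.key_field]
        pe = self.instances.get(key)
        if pe is None:
            # First event seen for this key: clone the prototype.
            pe = copy.deepcopy(self.prototype)
            pe.key = key
            self.instances[key] = pe
        pe.on_event(event)


stream = Stream(prototype=CounterPE(), key_field="word")
for word in ["s4", "stream", "s4"]:
    stream.put({"word": word})

# One instance per distinct key; the prototype itself never receives events.
counts = {key: pe.count for key, pe in stream.instances.items()}
```

The prototype holds configuration only, while each clone holds the state for a single key, which is what lets keys be spread across nodes.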

Latency vs. Accuracy
Zero errors. Latency: unconstrained. Why: reproducible results. Use: debugging, training models.
Real time. Latency: constrained. Why: limited control over inbound data rate and computing complexity. Use: processing unstructured data, tolerance to small errors, graceful recovery from inbound data streams.

Design
Actors programming model.
Probabilistic thinking in both algorithms and systems.
Run on commodity hardware.
All in-memory, no disk bottlenecks.
Pluggable (protocols, applications, serialization, etc.).
Object-oriented design with POJOs.
Static typing, no string literals, minimal type casting.
Science friendly: constant change, ease of use.

Programming Model
Example: estimate the click-through rate in a web application after applying a filter to remove bot traffic.
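A minimal sketch of this example in plain Python follows. It is not S4 code: the event fields (type, ad, user_agent), the is_bot rule, and the CtrEstimator class are assumptions made for illustration. In S4 the filter and the estimator would be separate PEs connected by a stream.

```python
from collections import defaultdict


def is_bot(event: dict) -> bool:
    # Hypothetical filter rule: treat user agents containing "bot" as bots.
    return "bot" in event["user_agent"].lower()


class CtrEstimator:
    """Keeps per-ad impression and click counts and reports their ratio."""

    def __init__(self) -> None:
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event: dict) -> None:
        if is_bot(event):
            return  # bot traffic is dropped before counting
        if event["type"] == "impression":
            self.impressions[event["ad"]] += 1
        elif event["type"] == "click":
            self.clicks[event["ad"]] += 1

    def ctr(self, ad: str) -> float:
        shown = self.impressions.get(ad, 0)
        return self.clicks.get(ad, 0) / shown if shown else 0.0


events = [
    {"type": "impression", "ad": "a1", "user_agent": "Mozilla/5.0"},
    {"type": "impression", "ad": "a1", "user_agent": "Mozilla/5.0"},
    {"type": "click", "ad": "a1", "user_agent": "Mozilla/5.0"},
    {"type": "impression", "ad": "a1", "user_agent": "ExampleBot/1.0"},
    {"type": "click", "ad": "a1", "user_agent": "ExampleBot/1.0"},
]

estimator = CtrEstimator()
for event in events:
    estimator.on_event(event)

ctr_a1 = estimator.ctr("a1")  # one human click over two human impressions
```

Without the filter the same events would give two clicks over three impressions, so the bot traffic would inflate the estimate.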

Coding an App

Research Areas: Systems
Checkpointing strategies.
Replication strategies.
Dynamic load balancing.
Adaptive load management.
Query languages.

Fault Tolerance
Problem: high availability. Approaches: warm/hot failover, cold failover. S4: warm failover using standby nodes and Apache ZooKeeper.
Problem: state loss (crashes, system updates). Approaches: lossy checkpointing, lossless checkpointing. S4: lossy checkpointing.
Low latency: decouple stream processing from checkpointing, asynchronous writes, uncoordinated checkpointing.
Approach: checkpoints are count based or time based, a pluggable backend supports any data store, PE state is restored lazily, and tuning is application dependent.
Research by M. Morel and F. Junqueira, Yahoo! Research Europe, 2011.
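The checkpointing approach on this slide (count-based triggers, asynchronous writes, a pluggable backend, lazy restore) can be sketched as follows. This is a single-process illustration under stated assumptions, not the S4 implementation: DictStore stands in for an arbitrary data store, and the CheckpointedCounter class and its `every` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class DictStore:
    """Stand-in for a pluggable storage backend (hypothetical interface)."""

    def __init__(self):
        self.data = {}

    def save(self, key, state):
        self.data[key] = state

    def load(self, key):
        return self.data.get(key)


class CheckpointedCounter:
    """Counter PE that checkpoints every `every` events (count based)."""

    def __init__(self, key, store, writer, every=3):
        self.key, self.store, self.writer, self.every = key, store, writer, every
        self.count = None  # state is restored lazily, on the first event
        self.since_checkpoint = 0

    def on_event(self):
        if self.count is None:
            self.count = self.store.load(self.key) or 0
        self.count += 1
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.every:
            self.since_checkpoint = 0
            # Asynchronous write: event processing does not wait for the store.
            self.writer.submit(self.store.save, self.key, self.count)


store = DictStore()
with ThreadPoolExecutor(max_workers=1) as writer:
    pe = CheckpointedCounter("s4", store, writer, every=3)
    for _ in range(7):
        pe.on_event()  # checkpoints are written after events 3 and 6

# Simulate a crash and restart: a fresh instance restores from the store.
with ThreadPoolExecutor(max_workers=1) as writer:
    restored = CheckpointedCounter("s4", store, writer, every=3)
    restored.on_event()
```

The restored instance resumes from the last checkpoint (6), so the seventh event processed before the crash is lost. That loss is the trade-off lossy checkpointing accepts in exchange for keeping storage writes off the processing path.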

Resilience in a Distributed Word Count Task

Research Areas: Algorithms
Self-adaptive models: adaptive language models using small amounts of data.
Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in real time.
Trend detection: find personal Twitter trends relevant to you.
Intrusion detection: summarize the high-level state of the network and detect unusual patterns.
Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors.

Personalized Search Ads
Goal is to maximize: revenue, click yield, user experience.
By controlling: ranking, pricing, filtering, placement.
S. Schroedl, A. Kesari, and L. Neumeyer, "Personalized ad placement in web search," in ADKDD '10: Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.

Personalized Search Ads
Model ad click intent using recent user activity. If the user is more likely to click, show more North ads (ads placed above the search results).
Example 1: the first query is "digital slr camera" and the next query is "canon slr". The user is more likely than average to click another ad.
Example 2: a repeated query without previous clicks. The user is less likely to click another ad.

Personalized Search Ads
Modeling user session. Typical features:
Number of searches/clicks by the user in the past 24 hrs.
User COPC: ratio of observed clicks to predicted clicks.
Identical query searched before / clicked before.
Time (seconds) since last search/click.
Similarity measures: current vs. previous queries.
Modeling technique: stochastic gradient-descent boosted trees (GDBT).
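As a small illustration of the COPC feature, the ratio can be computed from a user's observed click count and the click probabilities the model predicted for the ads shown to that user. The function name, its arguments, and the neutral fallback value of 1.0 are assumptions for this sketch, not details from the talk.

```python
from typing import Sequence


def copc(observed_clicks: int, predicted_click_probs: Sequence[float]) -> float:
    """Ratio of observed clicks to predicted (expected) clicks."""
    expected_clicks = sum(predicted_click_probs)
    if expected_clicks == 0:
        # Assumption: with no predictions, fall back to a neutral ratio.
        return 1.0
    return observed_clicks / expected_clicks


# Four ad impressions, each predicted to be clicked with probability 0.5,
# give 2.0 expected clicks; the user actually clicked 3 times.
user_copc = copc(3, [0.5, 0.5, 0.5, 0.5])
```

A value above 1 means the user clicks more often than the model predicts, and a value below 1 means less often.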

Personalized Search Ads
Target: P[CLICK | ad, query, user]
Approximation: P[CLICK | ad, query] * UCP[user, session]
The non-personalized long-term model P[CLICK | ad, query] is computed using Hadoop.
The User Click Propensity (UCP) for the user session is computed using S4.
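The approximation amounts to scaling the long-term click probability by a per-session factor. A minimal sketch, assuming both inputs are already available as numbers; the function name and the clamp to 1.0 are assumptions added here so the result stays a valid probability, and the input values are made up for illustration.

```python
import math


def personalized_click_prob(base_prob: float, ucp: float) -> float:
    """Approximate P[CLICK | ad, query, user] as P[CLICK | ad, query] * UCP."""
    # Assumption: clamp so the product never exceeds a valid probability.
    return min(1.0, base_prob * ucp)


# Long-term model (batch, e.g. Hadoop) says 4%; the session's UCP from the
# streaming side is 1.5, so the personalized estimate is about 6%.
estimate = personalized_click_prob(0.04, 1.5)
```

The split lets the expensive model be retrained offline while only the lightweight session factor is updated in real time.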

Personalized Search Ads
Results: we can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue. Alternatively, for a given ad footprint we can increase click yield by ~2%.

Thank you!
Join the Apache S4 project:
s4-user-subscribe@incubator.apache.org
s4-dev-subscribe@incubator.apache.org