BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Similar documents
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Big Data Analytics - Accelerated. stream-horizon.com

Real Time Big Data Processing

Elephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Real-time Big Data Analytics with Storm

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data Web Analytics Platform on AWS for Yottaa

Next-Generation Cloud Analytics with Amazon Redshift

Analytics on Spark &

Big Data and Fast Data combined is it possible?

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Architectures for massive data management

SPARK USE CASE IN TELCO. Apache Spark Night ! Chance Coble!

Big Data on Microsoft Platform

Predictive Analytics with Storm, Hadoop, R on AWS

Principles and best practices of scalable real-time data systems. Nathan Marz James Warren 6$03/(&+$37(5 MANNING

Apache Hadoop. Alexandru Costan

The 4 Pillars of Technosoft s Big Data Practice

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Hadoop & Spark Using Amazon EMR

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data and Industrial Internet

Getting Real Real Time Data Integration Patterns and Architectures

Big data blue print for cloud architecture

CRITEO INTERNSHIP PROGRAM 2015/2016

A Brief Introduction to Apache Tez

Big Data Infrastructure at Spotify

the missing log collector Treasure Data, Inc. Muga Nishizawa

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Big Data Analytics! Architectures, Algorithms and Applications! Part #3: Analytics Platform

Building Scalable Big Data Pipelines

Hadoop and Map-Reduce. Swati Gore

Data Challenges in Telecommunications Networks and a Big Data Solution

Apache Kafka Your Event Stream Processing Solution

Chapter 7. Using Hadoop Cluster and MapReduce

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Scalable Architecture on Amazon AWS Cloud

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Big Data Analysis: Apache Storm Perspective

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Openbus Documentation

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

From Spark to Ignition:

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Rakam: Distributed Analytics API

Unified Batch & Stream Processing Platform

SAP and Hortonworks Reference Architecture

3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Syslog Analyzer ABOUT US. Member of the TeleManagement Forum

Open source Google-style large scale data analysis with Hadoop

BIG DATA What it is and how to use?

Native Connectivity to Big Data Sources in MSTR 10

Big Data With Hadoop

Big Data at Cloud Scale

CSE-E5430 Scalable Cloud Computing Lecture 11

A Brief Outline on Bigdata Hadoop

Manifest for Big Data Pig, Hive & Jaql

BIG DATA TRENDS AND TECHNOLOGIES

Big Data Pipeline and Analytics Platform

Understanding traffic flow

A stream computing approach towards scalable NLP

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Amazon Kinesis and Apache Storm

CitusDB Architecture for Real-Time Big Data

IBM System x reference architecture solutions for big data

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Dashboard Engine for Hadoop

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

The basic data mining algorithms introduced may be enhanced in a number of ways.

Real-time Ad-hoc Analytics on S3 with MemSQL

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Testing Big data is one of the biggest

MEAP Edition Manning Early Access Program Big Data version 1

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Transcription:

BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane

Executive Summary Growing data volumes and real time decision making requirements are making traditionally used processing systems redundant. Increasingly, businesses are leveraging Big Data technologies for enabling high impact decision making in real time. In the case of mobile campaign management, high data volumes and velocities pose a data processing challenge. The lack of real time views on campaign performance often leads to over or under delivering campaigns. Often, campaigns are run for shorter durations and over delivering entails heavy losses in terms of revenues and missed opportunities. This paper talks about the Lambda architecture- a combination of batch and real time processing capabilities. And how this was used as a solution on a big data platform, to enable real time decision making for mobile ad campaign management. 2

Background Traditionally, batch jobs were used to update dashboards meant for monitoring or decision making purposes. In a batch processing system, data is processed in batches set at specified intervals, and then made available to users. To handle the continuously growing volumes of data, batch processing technologies have evolved from Java based ones to Hadoop based big data ones. However, since processing is done in batches, there are large time lags between the data arriving onto the platform and being In a highly dynamic displayed to users. Also, if the velocity of data generation business world, the is high, each batch takes longer to process, leading to inability to take strategic decisions almost instantly leads to heavy further delays in generating reports. While this issue is easily overcome by adding more servers to the system, it is a costly thing to do. losses in terms of revenues and business opportunities. In turn, making real time processing capabilities crucial to have. Large time ranges have considerable impact on business decisions, often leading to heavy revenue losses and missed opportunities. Real time data processing tools help in overcoming this issue. Real time processing systems continuously process data as it arrives. Whenever data has to be viewed, reports are generated based on the most recent data. This makes monitoring and decision making more precise than in the case of a batch processing system. 3

The issue with real time processing, is that factors such as delayed events, code changes, or system failures, call for re-processing the time ranges and then re-computing the metrics. Making it a time and resource consuming affair. Citing the example we have used for this paper. Advertising campaigns have become commonplace on the mobile platform. These campaigns are usually run for short durations. For campaigns to be successful, their perfomance needs to be constantly monitored and priorities adjusted accordingly. If campaigns under deliver, advertisers lose visibility. However, if they over deliver, platforms lose revenue. Reprioritizing campaign objectives based on historic data often leads to under or over delivering campaigns. To ensure this does not happen real time reporting is essential. In our case batch jobs were being used for processing campaign data. With over 500 GB data being generated per hour at almost 200k messages/second, campaign priorities took a while to calculate. In most cases, the campaigns would over deliver in terms of the number of impressions, leading to considerable losses in terms of revenue. To minimize the revenue loss, campaign performance reports needed to be generated in real time. 4

Dealing with Volume and Velocity In our case, the volume of data generated and the speed at which data was generated were considerably high. To ensure success of the campaigns, data processing needed to factor in both the volume and velocity aspects of the data. Using a Big Data based batch processing system, large data sets could be easily handled. Depending on the batch intervals, large volumes of data could be easily processed, along with scope for managing incremental increases. However, at the high velocity this data was being The lambda architecture is an amalgamation of batch and real time processing. It provides a solution where real time decision making is enabled for critical areas and cost efficiency is maintained by not making the entire processing happen in real time. generated, the time taken for processing each batch would increase, adding to the delay in generating reports. A workaround would entail an expensive addition of servers. Handling high velocity data required a real-time processing tool that compensated for the high-latency of batch systems. However, using such a system alone would mean risking time consuming and costly reprocessing whenever delayed events, code changes, or system failures occurred. Since, neither batch nor real time systems alone, could handle the volume and velocity conundrum. We looked at a solution where both the systems worked in tandem. The batch component would store the master data set and batch process all the metrics. The real time component would continuously process recent data and integrate the old data into the batch. 5

Lambda Architecture Using a combination of batch and real time processing systems in parallel is an approach which businesses are increasingly exploring now. While the batch layer ensures scalability and fault resilience for the system, the real time layer takes care of processing metrics required on the go. Nathan Marz coined the term Lambda Architecture (LA) to describe a generic, scalable and fault-tolerant data processing architecture. Data that enters the system is dispatched to the batch and the speed layers for simultaneous processing. The batch layer serves two functions: Managing the master dataset (an immutable, append-only set of raw data) Pre-computing the batch views. The serving layer indexes the batch views so that they can be queried in a low-latency and ad-hoc manner. The speed layer compensates for the high latency of updates to the serving layer and deals only with recent data. Any incoming query can be answered by merging the results from both batch and real-time views. 6

Solution Overview To process data at almost 200k messages / second, high velocity receivers and storage systems were required for processing and storing incoming records. Kafka proved to be the best choice as a message queue. Apache Storm Stream and Redshift seemed best suited to manage real time data and data storage respectively. For processing historic data and generating adhoc reports (service layer), batch-processing services like Amazon Elastic MapReduce, PIG and Apache Spark were used. The Architecture The figure below provides an overview of how the lambda architecture looks like. Amazon Redshift Slave Node Master Node Slave Node Data Batch Layer MySQL Real Time Apache Storm Serving layer Slave Node Query Amazon S3 Slave Node Batch View EMR Amazon S3 Speed Layer 7

Components Batch layer The Kafka messaging queue passes the log messages to Amazon S3 where they are stored as the master dataset. These logs are then batch processed on Amazon Elastic MapReduce using PIG scripts, MapReduce and Apache Spark. The metrics derived from this processing are loaded into the Amazon Redshift Data warehouse and are made available as key value stores for to visualization / reporting tools such as tableau. Speed Layer In order to generate real time reports, the Kafka messaging queue passes the logs to the Apache Storm Cluster via a Message Queue Kafka Storm Topology. The storm topology reads the events from Batch Layer Amazon S3, EMR, MapReduce, Apache Spark, Amazon Redshift joins, data aggregations, and transformations are carried Speed Layer Apache Storm, MySQL layer where regular purging makes the data sizes smaller. In the Kafka Queue with the help of Kafka Spouts. Stream out using Storm bolts. The metrics derived are calculated on the cluster and stored in MySQL. A majority of the issues related to concurrent updates are managed on the speed turn, helping to reduce the processing complexity. 8

Serving Layer Views get created separately on batch as well as real time data. The serving layer is responsible for merging the views created on batch and real time layers. The merged views are then used for generating reports. Real time views are transient in nature show only the most recent data, older data is discarded once it passes through the serving layer and is stored progressively in the batch layer. Error Recovery The lambda architecture enables error rectification by allowing the views to be re-computed. If this seems time consuming in a particular case, we can simply revert to the previous, non-corrupted version of the data. Doing this was possible since the data in the master dataset was immutable - could not be altered after being created. The data in the master dataset does not get updated and is only appended to (time-based ordering). This makes for a Human Fault Tolerant System where bad data can be completely removed and re-computation can be easily done. 9

Conclusion In the case of mobile campaign management, the Advantages Input data retained without any changes Data can be reprocessed to re-derive an output Obeys the CAP theorem Disadvantages Maintaining a single code to produces the same result in two complex distributed systems Heavy operational load of running and debugging two systems Latencies caused during merging data at theserving layer Lambda architecture works as a cost effective and scalable solution. The speed layer uses most recent data for real time processing. The batch layer maintains the master dataset for non real time processing. This framework helps churn out real time output where needed while ensuring that too much data does not remain in the real time processing system at any given point of time. 10

About Talentica Talentica Software is an innovative outsourced product development company that helps startups build their own products. We help technology companies transform their ideas into successful products by partnering in their roadmap from pre-funded startups to a profitable acquisition. We have successfully built core intellectual property for more than 60 customers so far. We have the deep technological expertise, proven track record and unique methodology to build products successfully. Our customers include some of the most innovative product companies across USA, Europe and India. Office No. 501, Amar Megaplex Baner, Pune 411045 Tel: +91 20 4660 4000 Fax: +91 20 4075 6699 www.talentica.com 11