BIG DATA: WEB ORIGINATED TECHNOLOGY MEETS TELEVISION
BHAVAN GANDHI, ADVANCED RESEARCH ENGINEER
SANJEEV MISHRA, DISTINGUISHED ADVANCED RESEARCH ENGINEER

TABLE OF CONTENTS

INTRODUCTION
    WHAT IS BIG DATA?
        Web Originated
        Big Data for Television
ANALYTICS SYSTEM ARCHITECTURE
    Data Collection
    Storage & Persistence
    Data Analysis & Processing
ANALYTICS FOR DIGITAL AD INSERTION
DEPLOYMENT MODELS
ACKNOWLEDGEMENTS
CONCLUSION
REFERENCES

INTRODUCTION

WHAT IS BIG DATA?

Web Originated

The terms typically used to characterize Big Data are volume, velocity, and variety [1, 2, 3]. Volume refers to the sheer amount of data that currently exists and is being created continuously. Velocity is the speed with which this data can be processed and analyzed for meaningful use. Variety is simply the different types of data that can be collected. From the perspective of the Internet, zettabytes (2^70 bytes) of data are created and stored in the form of images, documents, tweets, videos, maps, social interactions, etc. This data has to be analyzed and classified quickly so that end users can find and retrieve the information they need to accomplish their tasks. Much of this data is unstructured; even where there is inherent structure, the variety of the data means that multiple structures must be supported.

The advent of the Internet, and the volume of data that consequently needs to be managed, stored, and analyzed, led to the emergence of modern-day Big Data technologies. Yahoo and Google developed breakthrough technologies for processing Internet-scale, disparate data and for storing the processed results on distributed commodity servers to achieve high availability and scalability. The Google File System (GFS) [4] and Bigtable [5] showcase early pioneering work on storing data at Google, and are the precursors of the mainstream, open source Big Data technologies now developed and evolved by open source communities such as Apache: Apache HBase [6] and the Apache Hadoop Distributed File System (HDFS) [7] are the open source equivalents of Bigtable and GFS, respectively.

Given that companies like Google, Yahoo, and Facebook store petabytes (2^50 bytes) of data, there is a need to access this vast repository and derive meaning from it quickly. MapReduce is a framework that works on top of HDFS and HBase; it allows parallel, incremental, and distributed processing of huge amounts of data to derive useful meaning that can be consumed by subsequent services or applications. Hive and Pig are higher-level languages that allow MapReduce jobs to be programmed or scripted. Researchers at Facebook developed Hive to allow SQL-like queries against their Big Data sets [8, 9]; Pig was developed by Yahoo staff to reduce the complexity of programming MapReduce jobs.
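To make the programming model concrete, the following is a minimal sketch of a MapReduce job written against Hadoop's Java API, counting tune events per channel. The log layout (tab-delimited records with a channel identifier in the third field) and all class names are illustrative assumptions, not a production schema.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts tune events per channel from tab-delimited logs on HDFS. */
public class ChannelTuneCount {

    public static class TuneMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text channel = new Text();

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed log layout: timestamp <TAB> deviceId <TAB> channelId
            String[] fields = line.toString().split("\t");
            if (fields.length >= 3) {
                channel.set(fields[2]);
                ctx.write(channel, ONE); // emit (channelId, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text channel, Iterable<LongWritable> counts,
                Context ctx) throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(channel, new LongWritable(total)); // (channelId, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "channel-tune-count");
        job.setJarByClass(ChannelTuneCount.class);
        job.setMapperClass(TuneMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same aggregation collapses to a single GROUP BY statement in Hive, or a few lines of Pig, which is exactly the programming complexity these higher-level languages absorb.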

Big Data for Television

The television ecosystem is migrating from dedicated hardware and software solutions to more flexible sets of services that allow customized and interactive experiences for end consumers. The power of Big Data technologies from the web world can be harnessed to create measurable and customized experiences in the television space. This is especially true as experiences migrate from single primary screen, in-home, linear viewing to multi-screen, on-the-go, on-demand content consumption.

There are essentially three stakeholders that benefit from Big Data and the corresponding data science: the operator, the content provider, and the end consumer. The operator benefits by having meaningful information for monitoring and optimizing system performance, for capacity planning, and for measuring impact. The content provider can use this information to optimize and understand how its programming is being used by the end consumer. Finally, end consumers benefit from interactive, targeted experiences that are more meaningful for them or their household.

A well-instrumented TV ecosystem mirrors the web world in terms of the volume, velocity, and variety of data that is collected, albeit not at the same scale. A volume of data is collected from loosely coupled systems and modules that span application, control, and delivery services. With disparate services comes data variety. All collected data has to be analyzed with velocity to provide meaningful and useful information to subsequent applications impacting user interaction, or to systems impacting operator workflows. The analyzed results can be provided through application programming interfaces (APIs), or through visualizations and operational dashboards.

ANALYTICS SYSTEM ARCHITECTURE

Analytics systems comprise three major components: data collection, storage, and analysis. A high-level architecture of a general analytics system is given in Figure 1.

Data Collection

The key to any analytics system is collecting data, which is federated across a number of services within an ecosystem. This requires instrumenting networks, client services and applications, or equipment to generate and report events, which are then sent to an event intake mechanism. From an application and services perspective, event reporting should be a lightweight process in which the complexities of ingesting events are abstracted away from the event collection probes.

Figure 1 - High Level Analytics Architecture

At ARRIS, we have implemented an event-reporting library (in Java) that abstracts the event collection logistics away from the instrumented service. The instrumented service only needs to define the events that are meaningful within its context and report them, either through the abstracted library or directly to an event intake mechanism. To keep this lightweight for the instrumented services, it is best to collect raw or minimally processed events, placing most of the compute burden on the analytics system.

We are currently using Spring XD [10] as our real-time event intake mechanism. It is an open source project being developed under the Spring Framework umbrella. Spring XD is a distributed event intake mechanism supporting HTTP, TCP, and Syslog, among others, and it allows these events to be processed before they are stored and persisted. We define one Spring XD stream per event type or event source.
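As an illustration only (the library itself is internal), the sketch below shows the shape such an abstraction might take, assuming events are POSTed as JSON over HTTP to an intake endpoint such as a Spring XD http source. The class name, JSON fields, and endpoint are hypothetical.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Illustrative event reporter: the instrumented service only names the
 * event and its attributes; transport details stay hidden behind report().
 */
public class EventReporter {
    private final URL intakeUrl; // e.g., a Spring XD http source endpoint

    public EventReporter(String intakeEndpoint) throws Exception {
        this.intakeUrl = new URL(intakeEndpoint);
    }

    /** Fire-and-forget: serialize the event as JSON and POST it. */
    public void report(String eventType, String deviceId, String detail) {
        String json = String.format(
            "{\"type\":\"%s\",\"device\":\"%s\",\"detail\":\"%s\",\"ts\":%d}",
            eventType, deviceId, detail, System.currentTimeMillis());
        try {
            HttpURLConnection conn = (HttpURLConnection) intakeUrl.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            conn.getResponseCode(); // drain the response; body is ignored
        } catch (Exception e) {
            // Event reporting must never break the instrumented service.
        }
    }
}
```

An instrumented service would construct one reporter per intake endpoint and call report() wherever a meaningful event occurs; note that reporting failures are deliberately swallowed so that analytics can never degrade the instrumented service itself.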

Storage & Persistence

Once events are ingested, the question becomes which database technologies the system should employ. The answer is quite simply a function of the system requirements and the types of data being housed. The choices are either relational database management systems (RDBMS, i.e., SQL) or non-relational databases (NoSQL). For storing and subsequently processing events from a variety of sources, where there is no obvious cohesive relational structure, a non-relational store such as HBase, or files stored directly on HDFS and indexed with Hive, is the obvious choice. As previously mentioned, this approach stems from the Big Data technologies of the Internet, and it allows raw or minimally processed events to be stored for subsequent domain-specific analysis. The results of any analysis or processing done on the data can also be stored in the same DBMS.

The flexibility of Hadoop makes it a great choice for systems collecting information from loosely coupled, disparate sources, where processing typically needs to be done across the events to derive meaningful, domain-specific insight.

Data Analysis & Processing

Domain-specific analysis needs to be done to accomplish the targeted goals of the system. This is application dependent and determined by an articulated set of key performance indicators. For example, in the web domain, search and retrieval are the important tasks; more complex tasks may include content or social recommendations. Similarly, in the TV space, the data science applied determines the efficacy of the solution.

A Hadoop / HDFS system is designed to run MapReduce jobs for pulling and processing information. Higher-level languages that allow MapReduce scripting include Pig, Hive, and Scalding. Processing may be simple statistical processing, correlating and tabulating information across various individual services, or it may require machine learning techniques for predictive analysis that profiles human or system behavior: predicting user behavior facilitates end-user interaction, while predicting system behavior and trends helps circumvent system limitations.
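As a concrete example of the higher-level route, the sketch below issues a Hive query from Java over JDBC and lets Hive compile the GROUP BY into MapReduce jobs over HDFS. The host name, table, and column names are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Illustrative Hive query: tabulate ad impressions per zone for one day. */
public class ImpressionReport {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://analytics-host:10000/default");
             Statement stmt = conn.createStatement();
             // Hive compiles this GROUP BY into MapReduce jobs over HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT zone_id, COUNT(*) AS impressions " +
                 "FROM ad_events WHERE event_date = '2014-03-14' " +
                 "GROUP BY zone_id")) {
            while (rs.next()) {
                System.out.printf("%s\t%d%n", rs.getString(1), rs.getLong(2));
            }
        }
    }
}
```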

A typical data analysis system houses sensitive user or subscriber information, which has to be protected and secured. A policy-based access control mechanism is imperative for ensuring that subscriber-specific information is exposed only to authorized external applications or business intelligence tools.

Analysis and processing of information have three primary goals:

1. Create visuals and dashboards to provide business intelligence,
2. Provide application programming interfaces for downstream usage to drive system behavior or user enhancements, and
3. Derive atomic and aggregate data sets on which additional analysis and exploration can be performed.

This is the data science, or intelligence, that rides on top of the Big Data technologies to derive business value.

ANALYTICS FOR DIGITAL AD INSERTION

Dynamic ad insertion (DAI) is a prime example of an application where analytics plays a differentiating role in increasing operator efficiency, measurement, and the relevance of ad targeting. Following the principles and architecture outlined previously, a data analytics system for DAI comprises instrumented data sources for event collection, event intake, persistence / storage, and analysis. The results of the analysis are exposed via an application interface, or to a business intelligence tool that visualizes system information through appropriate dashboards.

Figure 2 shows an architecture specifically designed for a DAI system. In this example, we follow an SCTE 130-based approach for DAI targeting multi-screen devices based on subscriber / user profiles [11]. The sources of event information are:

- ADS (Ad Decision Service) for capturing ad placement request, response, and notification events
- ADM (Ad Decision Manager) for content delivery events; in the case of HLS, this may include manifest-related information delivered to a client / application
- CIS (Content Information Service) for processed content and ad metadata
- POIS (Placement Opportunity Information Service) for ad placement location events
- Clients: where possible, actual client / application events are collected to get content and ad consumption information directly from the consumer-facing application

In addition to these operational events, bulk data imports are needed to correlate collected events with actual subscribers and programming. For this, we collect Subscriber Management System (SMS) information and Electronic Program Guide (EPG) information.
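To make the correlation step concrete, the following sketch shows one plausible shape for a normalized DAI event prior to persistence. The field names and row-key scheme are illustrative assumptions for this paper, not the SCTE 130 message schema.

```java
/**
 * Illustrative, minimal shape for a normalized DAI event as it might be
 * persisted (e.g., one HBase row or one HDFS record). Field names are
 * assumptions, not the SCTE 130 message schema.
 */
public final class DaiEvent {
    public enum Source { ADS, ADM, CIS, POIS, CLIENT }

    private final long   timestampMs;  // event time from the reporting service
    private final Source source;       // which DAI component emitted the event
    private final String eventType;    // e.g., "placementRequest", "adStarted"
    private final String sessionId;    // ties request/response/notification together
    private final String subscriberId; // joined later against SMS bulk data
    private final String assetId;      // joined later against EPG / CIS metadata

    public DaiEvent(long timestampMs, Source source, String eventType,
                    String sessionId, String subscriberId, String assetId) {
        this.timestampMs = timestampMs;
        this.source = source;
        this.eventType = eventType;
        this.sessionId = sessionId;
        this.subscriberId = subscriberId;
        this.assetId = assetId;
    }

    /** Row-key sketch: subscriber + time, so per-user scans stay contiguous. */
    public String rowKey() {
        return subscriberId + ":" + timestampMs + ":" + source;
    }

    // getters omitted for brevity
}
```

Keying rows by subscriber and time, for example, keeps one subscriber's events contiguous for the per-user scans that profile derivation requires.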

Figure 2 - Analytics Architecture for DAI

The events and bulk data are ingested via an event intake module, which can filter and process incoming data before storing it in our persistence / storage service. The collected information is processed using a variety of statistical techniques to derive user / usage profiles and to collect historical information about DAI system usage. The user profiles are exposed via an SIS-compliant interface so that ADS systems can do appropriate ad targeting based on this user information. Historical performance and measurement information for content and ad delivery is visualized through appropriate dashboards for business intelligence.
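As an illustration of the profile derivation step, the sketch below folds per-genre viewing time into normalized affinities of the kind an SIS-style interface could expose to an ADS. The class name is hypothetical, and we assume the per-genre totals were already derived upstream by correlating delivery events with EPG metadata; in production this would be a scheduled MapReduce or Hive job over the event store, not an in-process loop.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative profile derivation: fold a subscriber's viewing history
 * into normalized genre affinities for downstream ad targeting.
 */
public class GenreProfileBuilder {

    /**
     * @param secondsByGenre seconds watched per genre, assumed to be derived
     *        upstream by correlating delivery events with EPG metadata
     * @return affinity per genre in [0,1], proportional to time watched
     */
    public Map<String, Double> build(Map<String, Long> secondsByGenre) {
        long total = 0;
        for (long s : secondsByGenre.values()) {
            total += s;
        }
        Map<String, Double> affinities = new HashMap<>();
        if (total == 0) {
            return affinities; // no viewing history yet: empty profile
        }
        for (Map.Entry<String, Long> e : secondsByGenre.entrySet()) {
            affinities.put(e.getKey(), e.getValue() / (double) total);
        }
        return affinities;
    }
}
```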

In a typical DAI analytics system, the value is in:

- Classifying system behavior and usage. This has operational benefits for understanding how the system is being used, and is useful for operators, content providers, and advertisers.
- Characterizing subscribers so that appropriate insight is available when targeting specific ads, for better effectiveness.
- Understanding network and system limitations by analyzing trends in system usage and comparing them to the current capacity the system is able to support. This has implications for scalability, storage, and bandwidth management.

DEPLOYMENT MODELS

The architecture supports various deployment models, either dedicated hardware-based or cloud (public or private). To enable cloud deployment, the services are developed to be hardware agnostic and can be scaled up or down as needed. Except in certain cases such as database interactions, the communication between different services is restricted to REST or SOAP (preferably with JSON or XML encoding) for simplicity. Since nodes can be added or removed, services are addressed through a proxy service.

The architecture is analogous to any Internet-scale web application: a distributed n-tier system. Unlike a standard web application that handles millions of requests and responses from users, the DAI application handles millions of events from different components of the TV ecosystem, which we denote as event sources. So as not to add additional burden on the event sources, we ensure that events are consumed asynchronously, with any additional filtering or processing burden falling on the event consumers; a sketch of this decoupling appears below.

Figure 3 - Analytics Deployment Model

Special attention is given to high availability and fault tolerance throughout the system, especially at the event consumer and storage / persistence layers. The Spring XD layer is deployed behind a software load balancer that allows us to deploy a replica of each Spring XD container, one container per event type. A similar approach is taken at the storage layer. Moreover, the Hadoop infrastructure (HDFS, HBase, etc.) is already equipped with high availability and fault tolerance. Figure 3 depicts a typical deployment model.
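The decoupling described above can be sketched as a bounded in-memory buffer between the accept path and a pool of consumer threads, so that a slow consumer never blocks an event source. The class, queue capacity, and thread count below are illustrative assumptions, not Spring XD internals; in the deployed system, Spring XD streams behind the load balancer play this role.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative asynchronous intake: the accept path does a constant-time
 * enqueue so event sources are never blocked; filtering and persistence
 * happen on consumer threads.
 */
public class AsyncEventIntake {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(100_000);
    private final ExecutorService consumers = Executors.newFixedThreadPool(4);

    public AsyncEventIntake() {
        for (int i = 0; i < 4; i++) {
            consumers.submit(this::drain);
        }
    }

    /** Called on the request path; never blocks the event source. */
    public boolean accept(String rawEvent) {
        return buffer.offer(rawEvent); // drop (and count) when saturated
    }

    private void drain() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String event = buffer.take();
                persist(event); // filter/enrich, then hand off to storage
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void persist(String event) {
        // placeholder: write to HBase / HDFS via the storage service
    }
}
```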

ACKNOWLEDGEMENTS

This paper represents work being done by the Applied Research Center in the ARRIS Network & Cloud business.

CONCLUSION

Big Data technologies were developed primarily to target specific needs of the Internet. However, as our TV delivery systems grow more complex, and as their technologies evolve to be more web-like, there is direct applicability and flexibility in using these technologies for television systems. A dynamic digital ad insertion system is just one example that can employ Big Data, with appropriate analysis, to provide more meaningful user interaction and targeting, better operator measurement, and improved network efficiency.

MEET OUR EXPERT: Bhavan Gandhi

Bhavan Gandhi is currently a Senior Director at ARRIS, where he manages Cloud Services & Analytics in the Advanced Technology group. He leads a multi-disciplinary team of researchers prototyping end-to-end solutions that create new interactive and converged media applications spanning the home and mobile spaces. Part of this work is understanding how these systems can be instrumented to collect data for improving operational efficiencies and user experiences. Bhavan is a multimedia expert who has successfully taken media-centric technologies from R&D to product impact. He holds sixteen issued patents, one of which was recognized by Motorola as the Business Patent of the Year. Bhavan is well versed in image and video processing. He received his B.S. and M.S. degrees in Electrical Engineering from the University of Illinois at Urbana-Champaign. He came to ARRIS via the acquisition of the Home division of Motorola Mobility in 2013. Prior to joining Motorola in 1998, Bhavan was a Research Engineer at Eastman Kodak Co.

REFERENCES

1. Arthur, L. (2013, August 15). What is big data? Retrieved from http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/
2. Couts, A. (2012, September 25). Breaking down 'big data' and the Internet in the age of variety, volume, and velocity. Retrieved from http://www.digitaltrends.com/web/state-of-the-web-big-data/
3. Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. In Application Delivery Strategies. Stamford, CT: META Group, Inc.
4. Ghemawat, S., Gobioff, H., & Leung, S. (2003, October). The Google file system. ACM SOSP '03, Bolton Landing, NY.
5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., & Gruber, R. E. (2006, November). Bigtable: A distributed storage system for structured data. OSDI '06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA.
6. Apache HBase. (2014, March 17). Retrieved from http://hbase.apache.org
7. Apache Hadoop. (2014, March 14). Retrieved from http://hadoop.apache.org
8. Thusoo, A., Sen Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., & Murthy, R. (2009, August). Hive: A warehousing solution over a map-reduce framework. VLDB '09, Lyon, France.
9. Apache Hadoop, Hive, and Pig on Google Compute Engine. (n.d.). Retrieved from https://cloud.google.com/developers/articles/apache-hadoop-hive-and-pig-on-google-compute-engine
10. Spring XD. (2014). Retrieved from http://projects.spring.io/spring-xd/
11. American National Standard, Society of Cable Telecommunications Engineers (2011). Digital program insertion advertising systems interfaces: Part 1 - Advertising systems overview (ANSI/SCTE 130-1 2011).

ARRIS Enterprises, Inc. 2014. All rights reserved. No part of this publication may be reproduced in any form or by any means or used to make any derivative work (such as translation, transformation, or adaptation) without written permission from ARRIS Enterprises, Inc. ("ARRIS"). ARRIS reserves the right to revise this publication and to make changes in content from time to time without obligation on the part of ARRIS to provide notification of such revision or change.