SPARK USE CASE IN TELCO
Apache Spark Night, 9-2-2014
Chance Coble

Use Case Profile
Telecommunications company with shared business problems/pain:
- Scalable analytics infrastructure is a problem
- Pushing infrastructure to its limits
- Open to a proof-of-concept engagement with emerging technology
- Wanted to test on historical data
We introduced Spark Streaming:
- Technology would scale
- Could prove it enabled new analytic techniques (incident detection)
- Open to Scala requirement
- Wanted to prove it was easy to deploy; EC2 helped

Spark Streaming in Telco
Telecommunications wholesale business process:
- 90 million calls per day
- Scales up to 1,000 calls per second, nearly half a million calls in a 5-minute window
Technology is loosely split into:
- Operational Support Systems (OSS)
- Business Support Systems (BSS)
Core technology is mature: analytics on a LAMP stack, and the technology team is strongly skilled in that stack

Jargon: Number
- Comprised of a country code (possibly), area code (NPA), exchange (NXX), and 4 other digits
- Area codes and exchanges are often geo-coded
- Example: 1 512 867 5309
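
As a minimal sketch of that decomposition (not from the deck; the CallNumber type and parseNumber helper are hypothetical and only handle the North American cases shown above):

// Illustrative sketch: split a number such as "1 512 867 5309" into
// country code, NPA (area code), NXX (exchange) and line digits.
case class CallNumber(countryCode: Option[String], npa: String, nxx: String, line: String)

def parseNumber(raw: String): Option[CallNumber] = {
  val digits = raw.filter(_.isDigit)
  digits.length match {
    case 10 =>
      Some(CallNumber(None, digits.substring(0, 3), digits.substring(3, 6), digits.substring(6)))
    case 11 if digits.startsWith("1") =>
      Some(CallNumber(Some("1"), digits.substring(1, 4), digits.substring(4, 7), digits.substring(7)))
    case _ => None // international or malformed numbers would need more rules
  }
}

// parseNumber("1 512 867 5309")
// => Some(CallNumber(Some("1"), "512", "867", "5309"))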

Jargon: Trunk Group, Route, QoS
- Trunk Group: a trunk is a line connecting transmissions between two points. A trunk group is a set of trunks with some common property, in this case being owned by the same entity. Transmissions from ingress trunks are routed to transmissions on egress trunks.
- Route: in this case, the selection of a trunk group to facilitate termination at the call's destination
- QoS: Quality of Service, governed by metrics
- Call Duration: short calls are an indication of quality problems
- ASR (Average Supervision Rate): this company measures it as #connected calls / #calls attempted
- Real-time: within 15 minutes
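
A rough sketch of the ASR definition above over an RDD of call-detail lines. It assumes the seventh (0-based index 6) semicolon-delimited field carries the S/U connection status, as in the count example later in the deck; the field position comes from that example, not from a standard.

// Hypothetical sketch: ASR = connected calls / attempted calls
def asr(cdrLines: org.apache.spark.rdd.RDD[String]): Double = {
  val attempted = cdrLines.count()
  val connected = cdrLines.filter(_.split(";")(6) == "S").count()
  if (attempted == 0) 0.0 else connected.toDouble / attempted
}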

The Problem
- A switch handles most of their routing
- A configuration table in the switch governs routing with if-this-then-that style logic
- Proprietary technology handles adjustments to that table; manual intervention is also required
[Diagram: Call Logs -> Business Rules Application -> Database -> Intranet Portal]
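
To make the if-this-then-that style concrete, here is a toy illustration of prefix-based route selection; the rule shape, names, and longest-prefix policy are invented for illustration, not the switch's actual table format.

// Toy sketch of routing rules: pick the most specific prefix match for the dialed number.
case class RouteRule(prefix: String, trunkGroup: String)

def selectTrunkGroup(dialed: String, rules: Seq[RouteRule]): Option[String] =
  rules.filter(r => dialed.startsWith(r.prefix))
       .sortBy(-_.prefix.length)
       .headOption
       .map(_.trunkGroup)

// selectTrunkGroup("15128675309", Seq(RouteRule("1512", "TG-AUSTIN-1"), RouteRule("1", "TG-DEFAULT")))
// => Some("TG-AUSTIN-1")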

The Problem
- The backend system receives a log of calls from the switch; a file is dumped every few minutes
- 180 well-defined fields represent the features of a call event
- Supports downstream analytics once enriched with pricing, geocoding, and account information
- Their job is to connect calls at the most efficient price without sacrificing quality
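
A minimal sketch of turning one of those 180-field log lines into a typed record. The case class and all field positions except the S/U status field (index 6, taken from the count example on a later slide) are placeholders for illustration only.

// Placeholder sketch of extracting a typed record from a semicolon-delimited CDR line.
case class CallDetailRecord(callId: String, status: String, durationSeconds: Double,
                            ingressTrunk: String, egressTrunk: String)

def toCdr(fields: Array[String]): CallDetailRecord =
  CallDetailRecord(
    callId = fields(0),                    // hypothetical position
    status = fields(6),                    // "S" = connected, "U" = unconnected
    durationSeconds = fields(10).toDouble, // hypothetical position
    ingressTrunk = fields(20),             // hypothetical position
    egressTrunk = fields(21)               // hypothetical position
  )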

Why Spark?
- Interesting technology
- The workbench can simplify operationalizing analytics
- They can skip a generation of clunky big data tools
- Works with their data structures
- Will scale out rather than up
- Can handle fault-tolerant in-memory updates

Spark Basics - Architecture
[Diagram: the Spark Driver holds the Spark Context and submits tasks through the Master to multiple Workers; each Worker runs tasks and keeps a cache]

Spark Basics - Call Status Count Example

// Count successful ("S") vs. unsuccessful ("U") calls in a semicolon-delimited CDR file
import org.apache.spark.{SparkConf, SparkContext}

val cdrLogPath = "/cdrs/cdr20140731042210.ssv"
val conf = new SparkConf().setAppName("CDR Count")
val sc = new SparkContext(conf)
val cdrLines = sc.textFile(cdrLogPath)
val cdrDetails = cdrLines.map(_.split(";"))
val successful = cdrDetails.filter(x => x(6) == "S").count()
val unsuccessful = cdrDetails.filter(x => x(6) == "U").count()
println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))

Spark Basics - RDDs
Operations on data generate distributable tasks through a Directed Acyclic Graph (DAG). Functional programming FTW!
- Resilient: data is redundantly stored and can be recomputed through the generated DAG
- Distributed: the DAG can process each small task, as well as a subset of the data, through optimizations in the Spark planning engine
- Dataset: this construct is native to Spark computation

Spark Basics - RDDs
Lazy: transformations are only recorded as tasks over data slices; nothing executes until an action forces evaluation
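
To make the laziness concrete, a small sketch reusing the sc and CDR path from the count example above: map and filter only build up the DAG, and the file is only read when an action such as count is called.

// Transformations are lazy: these lines only describe the computation.
val lazyCdrLines = sc.textFile("/cdrs/cdr20140731042210.ssv")       // no file read yet
val lazyConnected = lazyCdrLines.map(_.split(";")).filter(_(6) == "S") // still nothing executed

// Actions trigger the DAG; only now are the slices read and the tasks run.
val connectedCount = lazyConnected.count()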

Streaming Applications - Why try it?
Example streaming applications:
- Site activity statistics
- Spam detection
- System monitoring
- Intrusion detection
- Telecommunications network data

Streaming Models
Record-at-a-time:
- Receive one record and process it
- Simple, low latency
Micro-batch:
- Receive records and periodically run a batch process over a window
- High throughput
- The process *must* run fast enough to handle all records collected, so latency is harder to reduce
- Easy reasoning, global state, fault tolerance, unified code

DStreams
- Stands for Discretized Streams
- A series of RDDs; Spark already provides a computation model on RDDs
- Note that records are ordered as they are received, and are time-stamped for global computation in that order
- Is that always the way you want to see your data?

Fault Tolerance - Parallel Recovery
- Failed nodes
- Stragglers!

Fault Tolerance - Recompute

Throughput vs. Latency

Anatomy of a Spark Streaming Program

// Generate a queue of RDDs and process them as a stream, one batch per second
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.SynchronizedQueue

val sparkConf = new SparkConf().setAppName("QueueStream")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val rddQueue = new SynchronizedQueue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
val mappedStream = inputStream.map(x => (x % 10, 1))
val reducedStream = mappedStream.reduceByKey(_ + _)
reducedStream.print()
ssc.start()
for (i <- 1 to 30) {
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  Thread.sleep(1000)
}
ssc.stop()

Utilities are also available for:
- Twitter
- Kafka
- Flume
- File streams

Windows
[Diagram: a sliding window over the DStream, defined by a window length and a slide interval]

Streaming Call Analysis with Windows

// iteration, window and slide are batch/window/slide durations (in seconds) defined elsewhere;
// extractCallDetailRecord and detectIncidents are the application's own helpers
val path = "/Users/chance/Documents/cdrdrop"
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("CDRIncidentDetection")
  .set("spark.executor.memory", "8g")
val ssc = new StreamingContext(conf, Seconds(iteration))
val callStream = ssc.textFileStream(path)
val cdr = callStream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
val cdrArr = cdr.filter(c => c.length > 136).map(c => extractCallDetailRecord(c))
val result = detectIncidents(cdrArr)
result.foreachRDD(rdd => rdd.take(10).foreach { case (x, (d, high, low, res)) =>
  println(x + "," + high + "," + d + "," + low + "," + res)
})
ssc.start()
ssc.awaitTermination()

Demonstration

Can we enable new analytics? Incident detection
- Chose a univariate technique [1] to detect behavior out of profile relative to recent events
- The technique identifies out-of-profile events and dramatic shifts in the profile
- Easy to understand
[Figure: recent window]
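
The deck does not show the detection code itself. As one plausible reading of "out of profile relative to recent events", here is a hedged sketch: per key (for example, per trunk group), flag values that fall outside the mean plus or minus three standard deviations of that key's recent window. The tuple shape mirrors the (x, (d, high, low, res)) pattern consumed in the windowed code above, but this is an illustration of a univariate outlier test, not the presenter's actual detectIncidents implementation.

// Hypothetical univariate incident detector over one window of (key, value) pairs.
import org.apache.spark.SparkContext._   // pair-RDD operations (required on older Spark versions)
import org.apache.spark.rdd.RDD

def detectIncidentsSketch(windowed: RDD[(String, Double)]): RDD[(String, (Double, Double, Double, Boolean))] = {
  // Per-key mean and standard deviation over the recent window
  val stats = windowed.groupByKey().mapValues { vs =>
    val n = vs.size
    val mean = vs.sum / n
    val variance = vs.map(v => (v - mean) * (v - mean)).sum / n
    (mean, math.sqrt(variance))
  }
  // Flag values outside mean +/- 3 sigma
  windowed.join(stats).map { case (key, (value, (mean, stddev))) =>
    val high = mean + 3 * stddev
    val low  = mean - 3 * stddev
    (key, (value, high, low, value > high || value < low))
  }
}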

Is it simple to deploy? No, but EC2 helped
- The client had no Hadoop and little NoSQL expertise
Develop and deploy:
- Built with sbt, ran on the master
Architecture involved:
- Pushing new call detail logs to HDFS on EC2
- Streaming picks up new data and updates RDDs accordingly
Results were explored in two ways:
- Accessing results through data virtualization
- Writing (small) RDD results to a SQL database and using a business intelligence tool to create report content
[Diagram: Call Logs -> HDFS on EC2 -> Streaming Data / Current Processing -> Multiple Delivery Options -> Analysis and Reporting / Dashboards]
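
For the "writing small RDD results to a SQL database" path, one common approach is to collect the already-aggregated result on the driver and insert it over plain JDBC. The sketch below assumes that approach; the connection string, table, and column names are invented placeholders, not the project's actual schema.

// Sketch: persist a small, already-aggregated result to a SQL table over JDBC.
import java.sql.DriverManager

def writeIncidents(results: Array[(String, Double, Boolean)], jdbcUrl: String): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO incident_results (metric_key, metric_value, is_incident) VALUES (?, ?, ?)")
    results.foreach { case (key, value, isIncident) =>
      stmt.setString(1, key)
      stmt.setDouble(2, value)
      stmt.setBoolean(3, isIncident)
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}

// Usage: writeIncidents(resultRdd.collect(), "jdbc:postgresql://dbhost/reporting")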

Summary of Results
- Technology would scale: handled 5 minutes of data in just a few seconds
- Proved new analytics enabled: solved single-variable incident detection with small, simple code
- Made a case for Scala and Hadoop adoption, though the team is still skeptical
- Wanted to prove it was easy to deploy: EC2 helped, but we were burned by a forward-slash bug in the AWS secret token

Incident Visual

References
[1] Zaharia et al., "Discretized Streams"
[2] Zaharia et al., "Discretized Streams: Fault-Tolerant Streaming"
[3] Das, "Spark Streaming: Real-time Big-Data Processing"
[4] Spark Streaming Programming Guide
[5] Running Spark on EC2
[6] Spark on EMR
[7] Ahelegby, "Time Series Outliers"

Contact Us
Email: chance at blacklightsolutions.com
Phone: 512.795.0855
Web: www.blacklightsolutions.com
Twitter: @chancecoble