SPARK USE CASE IN TELCO
Apache Spark Night, 9-2-2014
Chance Coble

Use Case Profile
- Telecommunications company
- Shared business problems/pain:
  - Scalable analytics infrastructure is a problem
  - Pushing infrastructure to its limits
- Open to a proof-of-concept engagement with emerging technology
  - Wanted to test on historical data
- We introduced Spark Streaming
  - Technology would scale
  - Could prove it enabled new analytic techniques (incident detection)
  - Open to the Scala requirement
  - Wanted to prove it was easy to deploy; EC2 helped

Spark Streaming in Telco
- Telecommunications wholesale business
- Processes 90 million calls per day
  - Scales up to 1,000 calls per second
  - Nearly half a million calls in a 5-minute window
- Technology is loosely split into:
  - Operational Support Systems (OSS)
  - Business Support Systems (BSS)
- Core technology is mature
  - Analytics on a LAMP stack
  - Technology team is strongly skilled in that stack

Jargon
Number
- Comprised of a country code (possibly), an area code (NPA), an exchange (NXX), and 4 other digits
- Area codes and exchanges are often geo-coded
- Example: 1 512 867 5309 (country code 1, NPA 512, NXX 867, line 5309)

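A minimal Scala sketch of that decomposition. The fixed offsets and the optional leading "1" are assumptions for illustration; real numbering plans need a proper library.

    // Hypothetical parser for a North American number such as "1 512 867 5309".
    case class ParsedNumber(countryCode: Option[String], npa: String, nxx: String, line: String)

    def parseNumber(raw: String): ParsedNumber = {
      val digits = raw.filter(_.isDigit)
      val hasCc  = digits.length == 11 && digits.startsWith("1")
      val national = if (hasCc) digits.drop(1) else digits
      require(national.length == 10, "expected a 10-digit national number")
      ParsedNumber(
        countryCode = if (hasCc) Some("1") else None,
        npa  = national.substring(0, 3),  // area code, often geo-coded
        nxx  = national.substring(3, 6),  // exchange, often geo-coded
        line = national.substring(6)      // remaining 4 digits
      )
    }

    // parseNumber("1 512 867 5309") == ParsedNumber(Some("1"), "512", "867", "5309")
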
Jargon
Trunk Group
- A trunk is a line connecting transmissions between two points. A trunk group is a set of trunks with some common property, in this case being owned by the same entity. Transmissions from ingress trunks are routed to egress trunks.
Route
- In this case, the selection of a trunk group to facilitate termination at the call's destination
QoS
- Quality of Service, governed by metrics
Call Duration
- Short calls are an indication of quality problems
ASR
- Answer-Seizure Ratio; this company measures it as #connected calls / #calls attempted
Real-time
- Within 15 minutes

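As a quick illustration of the ASR metric, a hedged Scala sketch over a batch of parsed CDR lines. The field position (6) and the "S" status code are borrowed from the call-status count example later in the deck, so treat them as assumptions here.

    // Compute ASR = #connected calls / #calls attempted over a batch of CDRs.
    // Assumes field 6 holds "S" for a connected call, per the count example below.
    def asr(cdrFields: org.apache.spark.rdd.RDD[Array[String]]): Double = {
      val attempted = cdrFields.count()
      val connected = cdrFields.filter(f => f(6) == "S").count()
      if (attempted == 0) 0.0 else connected.toDouble / attempted
    }
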
The Problem
- A switch handles most of their routing
- A configuration table in the switch governs routing with if-this-then-that style logic
- Proprietary technology handles adjustments to that table
- Manual intervention is also required
(Diagram: Call Logs -> Business Rules Application -> Database -> Intranet Portal)

The Problem
- The backend system receives a log of calls from the switch
  - A file is dumped every few minutes
  - 180 well-defined fields represent the features of a call event
  - Supports downstream analytics once enriched with pricing, geocoding, and account information
- Their job is to connect calls at the most efficient price without sacrificing quality

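A sketch of how one of those call records might be modeled once parsed. The field names and positions are invented for illustration; the actual 180-field layout is proprietary and not shown in the deck.

    // Hypothetical shape of a parsed call detail record (CDR). The real feed has
    // 180 semicolon-delimited fields; these names and positions are assumptions.
    case class CallDetailRecord(
      callId: String,            // assumed field
      status: String,            // "S" = connected, "U" = not connected (field 6, per the count example)
      durationSeconds: Double,   // short calls indicate quality problems
      ingressTrunkGroup: String,
      egressTrunkGroup: String,
      calledNumber: String       // NPA/NXX prefix supports geocoding
    )

    // Enrichment (pricing, geocoding, account info) would join on fields like
    // calledNumber and the trunk groups before downstream analytics.
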
Why Spark?
- Interesting technology
- The workbench can simplify operationalizing analytics
- They can skip a generation of clunky big-data tools
- Works with their data structures
- Will scale out rather than up
- Can handle fault-tolerant in-memory updates

Spark Basics - Architecture
(Diagram: a Spark Driver holding the SparkContext talks to a Master, which schedules Tasks onto Workers; each Worker runs tasks and holds a Cache)

Spark Basics - Call Status Count Example

    import org.apache.spark.{SparkConf, SparkContext}

    val cdrlogpath = "/cdrs/cdr20140731042210.ssv"
    val conf = new SparkConf().setAppName("CDR Count")
    val sc = new SparkContext(conf)
    val cdrlines = sc.textFile(cdrlogpath)                       // one line per call event
    val cdrdetails = cdrlines.map(_.split(";"))                  // semicolon-delimited fields
    val successful = cdrdetails.filter(x => x(6) == "S").count()
    val unsuccessful = cdrdetails.filter(x => x(6) == "U").count()
    println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))

Spark Basics - RDDs
- Operations on data generate distributable tasks through a Directed Acyclic Graph (DAG)
- Functional programming FTW!
- Resilient: data is redundantly stored, and can be recomputed through a generated DAG
- Distributed: the DAG can process each small task, as well as a subset of the data, through optimizations in the Spark planning engine
- Dataset: this construct is native to Spark computation

Spark Basics - RDDs
- Lazy transformations for tasks and slices (see the sketch below)

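A minimal sketch of that laziness, reusing the SparkContext sc from the count example (the path is hypothetical): transformations such as map and filter only record lineage in the DAG; nothing executes until an action forces it.

    // Nothing is read or computed yet: map/filter only build up the DAG.
    val lines     = sc.textFile("/cdrs/sample.ssv")   // hypothetical path
    val fields    = lines.map(_.split(";"))           // transformation: lazy
    val connected = fields.filter(_(6) == "S")        // transformation: lazy

    // The action triggers the whole pipeline, split into tasks over slices/partitions.
    val n = connected.count()
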
Streaming Applications
Why try it? Example streaming applications:
- Site activity statistics
- Spam detection
- System monitoring
- Intrusion detection
- Telecommunications network data

Streaming Models
Record-at-a-time
- Receive one record and process it
- Simple, low latency
Micro-batch
- Receive records and periodically run a batch process over a window
- The process *must* run fast enough to handle all records collected
- Harder to reduce latency
- High throughput
- Easy reasoning, global state, fault tolerance, unified code

DStreams
- Stands for Discretized Streams
- A series of RDDs
- Spark already provided a computation model on RDDs
- Note that records are ordered as they are received
- They are also time-stamped for global computation in that order
- Is that always the way you want to see your data?

Fault Tolerance
- Parallel recovery
- Failed nodes
- Stragglers!

Fault Tolerance - Recompute

Throughput vs. Latency

Anatomy of a Spark Streaming Program

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable.SynchronizedQueue

    val sparkConf = new SparkConf().setAppName("QueueStream")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    val rddQueue = new SynchronizedQueue[RDD[Int]]()
    val inputStream = ssc.queueStream(rddQueue)
    val mappedStream = inputStream.map(x => (x % 10, 1))
    val reducedStream = mappedStream.reduceByKey(_ + _)
    reducedStream.print()
    ssc.start()

    for (i <- 1 to 30) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
      Thread.sleep(1000)
    }
    ssc.stop()

Input utilities are also available for Twitter, Kafka, Flume, and file streams.

Windows
(Diagram: a sliding window over a DStream; window length vs. slide interval)

Streaming Call Analysis with Windows

    val path = "/Users/chance/Documents/cdrdrop"
    val conf = new SparkConf()
      .setMaster("local[12]")
      .setAppName("CDRIncidentDetection")
      .set("spark.executor.memory", "8g")
    val ssc = new StreamingContext(conf, Seconds(iteration))
    val callstream = ssc.textFileStream(path)
    val cdr = callstream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
    val cdrarr = cdr.filter(c => c.length > 136).map(c => extractCallDetailRecord(c))
    val result = detectIncidents(cdrarr)
    result.foreachRDD(rdd => rdd.take(10).foreach { case (x, (d, high, low, res)) =>
      println(x + "," + high + "," + d + "," + low + "," + res)
    })
    ssc.start()
    ssc.awaitTermination()

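The program references iteration, window, slide, extractCallDetailRecord, and detectIncidents without defining them. A plausible sketch of the extraction side follows; every value and field position here is an assumption, and the record is reduced to a (trunk group, duration) pair for the detector sketched under "Can we enable new analytics?" below.

    // Hypothetical supporting definitions; the real ones were not shown in the deck.
    val iteration = 60   // batch interval in seconds (assumed)
    val window    = 300  // 5-minute window, matching the volumes quoted earlier (assumed)
    val slide     = 60   // slide interval in seconds (assumed)

    // Parse a split CDR line into a keyed metric: (trunk group, call duration).
    def extractCallDetailRecord(fields: Array[String]): (String, Double) = {
      val trunkGroup = fields(10)                                               // assumed position
      val duration   = try fields(20).toDouble catch { case _: Throwable => 0.0 } // assumed position
      (trunkGroup, duration)
    }
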
Demonstration

Can we enable new analytics?
Incident detection
- Chose a univariate technique [1] to detect behavior that is out of profile relative to recent events
- The technique identifies out-of-profile events and dramatic shifts in the profile
- Easy to understand
- The profile is computed over a recent window (a sketch of one such detector follows below)

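A hedged sketch of one univariate out-of-profile detector, standing in for the detectIncidents referenced in the streaming program. The exponentially weighted mean/variance profile and the 3-sigma band are assumptions for illustration, not the technique from [1] or the client's actual method.

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations
    import org.apache.spark.streaming.dstream.DStream

    // Keeps an exponentially weighted mean/variance per key and flags values
    // outside mean +/- 3 sigma. Note: updateStateByKey requires ssc.checkpoint(...)
    // to be configured.
    def detectIncidents(metrics: DStream[(String, Double)])
        : DStream[(String, (Double, Double, Double, Boolean))] = {
      val alpha = 0.2 // smoothing factor; an assumption

      // Per-key profile state: (ewma, ewm variance)
      val profiles = metrics.updateStateByKey[(Double, Double)] {
        (values: Seq[Double], state: Option[(Double, Double)]) =>
          val init = state.getOrElse((values.headOption.getOrElse(0.0), 0.0))
          Some(values.foldLeft(init) { case ((m, v), x) =>
            val diff = x - m
            (m + alpha * diff, (1 - alpha) * (v + alpha * diff * diff))
          })
      }

      // Compare each new observation against its key's profile band.
      metrics.join(profiles).map { case (key, (d, (mean, variance))) =>
        val sigma = math.sqrt(variance)
        val (high, low) = (mean + 3 * sigma, mean - 3 * sigma)
        (key, (d, high, low, d > high || d < low))
      }
    }
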
Is it simple to deploy?
- No, but EC2 helped
- The client had no Hadoop and little NoSQL expertise
Develop and Deploy
- Built with sbt, ran on the master
- Architecture involved:
  - Pushing new call detail logs to HDFS on EC2
  - Streaming picks up the new data and updates RDDs accordingly
- Results were explored in two ways:
  - Accessing results through data virtualization
  - Writing (small) RDD results to a SQL database, then using a business intelligence tool to create report content (a sketch of this path follows below)
(Diagram: Call Logs -> HDFS on EC2 -> Streaming / Current Processing -> Multiple Delivery Options -> Analysis and Reporting, Dashboards)

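A hedged sketch of the "write small RDD results to SQL" delivery path. The JDBC URL, credentials, and table schema are invented; only the one-connection-per-partition pattern is the point.

    import java.sql.DriverManager

    // Hypothetical sink: write a small incident-result RDD to a SQL table.
    def writeIncidents(rdd: org.apache.spark.rdd.RDD[(String, (Double, Double, Double, Boolean))]): Unit = {
      rdd.foreachPartition { rows =>
        // One connection per partition; fine for small result sets.
        val conn = DriverManager.getConnection("jdbc:mysql://reporting-db/telco", "user", "pass")
        val stmt = conn.prepareStatement(
          "INSERT INTO incidents (trunk_group, value, high, low, out_of_profile) VALUES (?, ?, ?, ?, ?)")
        rows.foreach { case (key, (d, high, low, res)) =>
          stmt.setString(1, key); stmt.setDouble(2, d)
          stmt.setDouble(3, high); stmt.setDouble(4, low); stmt.setBoolean(5, res)
          stmt.executeUpdate()
        }
        stmt.close(); conn.close()
      }
    }
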
Summary of Results
- Technology would scale
  - Handled 5 minutes of data in just a few seconds
- Proved new analytics were enabled
  - Solved single-variable incident detection
  - Small, simple code
  - Made a case for Scala and Hadoop adoption
  - Team is still skeptical
- Wanted to prove it was easy to deploy
  - EC2 helped
  - Burned by a forward-slash bug in the AWS secret token

Incident Visual

References
[1] Zaharia et al., "Discretized Streams"
[2] Zaharia et al., "Discretized Streams: Fault-Tolerant Streaming Computation at Scale"
[3] Das, "Spark Streaming - Real-time Big-Data Processing"
[4] Spark Streaming Programming Guide
[5] Running Spark on EC2
[6] Spark on EMR
[7] Ahelegby, "Time Series Outliers"

Contact Us
Email: chance at blacklightsolutions.com
Phone: 512.795.0855
Web: www.blacklightsolutions.com
Twitter: @chancecoble