NOT IN KANSAS ANY MORE How we moved into Big Data Dan Taylor - JDSU
Dan Taylor: an Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile. I'm currently lucky enough to manage the Data Science team in JDSU's Location Intelligence business unit, where boredom is seldom an issue. arieso.com | jdsu.com | logicalgenetics.com
JDSU Arieso Location Intelligence Arieso's solutions harness the power of customer-generated, geolocated intelligence to tackle some of the biggest challenges facing mobile operators today. ariesoGEO locates and analyses data from billions of mobile connection events, giving operators a rich source of intelligence to help boost network performance and enrich user experience.
THE MOTHER OF INVENTION
Necessity: a new customer!
- Monitor the LTE network for the entire USA
- Geolocate every event within 2 minutes
- Process and store 34+ billion calls daily
- 24x7 operation
- Build in the flexibility to grow with the network
Daily Data Volumes 100TB data storage
Daily Data Size Comparison
- Facebook hits: 100 billion
- LTE connections in the new GEO: 34 billion
- Calls in GEO globally today: 12 billion
- Google searches: 4 billion
- Tweets: 500 million
- LinkedIn page views: 47 million
A Traditional Geo System
[Diagram: call trace data flows through loaders into an Oracle DB server, which feeds the app servers that run analyses.]
Lambda Architecture
- Streaming: real-time processing
- Batch: historical processing
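The split can be sketched in a few lines of plain Scala. This is a toy illustration of the Lambda idea only (the `Event` type and counts are invented, not our production code): the same immutable event log feeds a slow, exact batch recompute and a fast incremental speed layer, and queries merge the two views.

```scala
// Toy Lambda sketch: one immutable log, two processing paths, merged at query time
case class Event(key: String, value: Long)

val masterLog = Vector(Event("cell_A", 1), Event("cell_B", 2), Event("cell_A", 3))

// Batch layer: recompute the whole view from the full log (slow, exact)
def batchView(log: Vector[Event]): Map[String, Long] =
  log.groupBy(_.key).view.mapValues(_.map(_.value).sum).toMap

// Speed layer: fold in only the events that arrived since the last batch run
def speedView(recent: Vector[Event]): Map[String, Long] =
  recent.groupBy(_.key).view.mapValues(_.map(_.value).sum).toMap

// Serving layer: merge the batch and real-time views to answer a query
def query(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
  batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

val batch  = batchView(masterLog)
val recent = Vector(Event("cell_A", 10))           // arrived after the last batch run
println(query(batch, speedView(recent), "cell_A")) // 1 + 3 + 10 = 14
```

The batch view can be rebuilt from scratch at leisure; the speed view only ever has to cover the short window since the last rebuild.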
STREAMING
Streaming Technology Ecosystem
Tools - Storm
"Doing for realtime processing what Hadoop did for batch processing."
- Distributed computing platform
- Manages execution and message transport between processing blocks
- Free and open source
- Developed by Twitter to process high volumes of event data
- High performance; fault tolerant
- Windows and Linux support
Tools - Storm
[Diagram: a Storm cluster in which Nimbus coordinates a topology of spouts feeding chains of bolts.]
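The spout/bolt model is easy to mimic without a cluster. The sketch below is a plain-Scala simulation of the flow, not the real Storm API (Storm's `ISpout`/`IBolt` interfaces and Nimbus scheduling are far richer); the call-event format and bolt names are invented for illustration.

```scala
// Plain-Scala simulation of Storm's spout -> bolt pipeline (not the Storm API)
trait Spout[A]    { def nextTuple(): Option[A] }
trait Bolt[A, B]  { def execute(tuple: A): Seq[B] }

// A spout replaying call events from memory (a real one would read a live feed)
class CallEventSpout(events: Iterator[String]) extends Spout[String] {
  def nextTuple(): Option[String] = if (events.hasNext) Some(events.next()) else None
}

// A bolt that parses "id,duration" lines, dropping malformed ones
object ParseBolt extends Bolt[String, (String, Int)] {
  def execute(tuple: String): Seq[(String, Int)] = tuple.split(",") match {
    case Array(id, d) if d.nonEmpty && d.forall(_.isDigit) => Seq((id, d.toInt))
    case _                                                 => Seq.empty
  }
}

// A bolt that keeps only calls of a minute or more
object FilterBolt extends Bolt[(String, Int), (String, Int)] {
  def execute(t: (String, Int)): Seq[(String, Int)] = if (t._2 >= 60) Seq(t) else Seq.empty
}

// Drain the spout through both bolts, as the cluster's workers would
def run(spout: Spout[String]): Vector[(String, Int)] =
  Iterator.continually(spout.nextTuple()).takeWhile(_.isDefined).flatten
    .flatMap(ParseBolt.execute).flatMap(FilterBolt.execute).toVector

println(run(new CallEventSpout(Iterator("a,120", "bad line", "b,30", "c,75"))))
// Vector((a,120), (c,75))
```

In real Storm the framework handles the message transport, parallelism and restarts between those stages; the application only supplies the spouts and bolts.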
Tools - Kafka
"Apache Kafka is publish-subscribe messaging rethought as a distributed commit log."
- Distributed and fault-tolerant messaging system
- Scalable and durable
- Guaranteed delivery
- Built in Scala; Windows and Linux compatible
- Originally developed by LinkedIn
Tools - Kafka
[Diagram: writes landing on three partitions - Partition A (Carolinas) holding messages 1-10, Partition B (North Texas) holding 1-7, Partition C (Los Angeles) holding 1-9.]
- Each partition is an ordered, immutable sequence of messages
- All messages are persisted for a configurable time
Tools - Kafka
[Diagram: a consumer issuing Read(4), Read(5), Read(6) against Partition A (Carolinas).]
- Consumers read at a specified offset
- It is up to the consumer to manage the offset; reads don't have to be sequential
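A toy model makes the offset semantics concrete. This is a sketch of the idea, not the Kafka client API (the partition name and messages are invented): the partition is just an immutable log, and each consumer owns its own read position.

```scala
// Toy Kafka partition: an ordered, immutable sequence of messages
case class Partition(name: String, messages: Vector[String]) {
  // Reading hands back a slice of the log; nothing is ever deleted on read
  def read(offset: Int, max: Int): Vector[String] =
    messages.slice(offset, offset + max)
}

val carolinas = Partition("A - Carolinas", Vector("m1", "m2", "m3", "m4", "m5"))

// Two consumers read independently because each tracks its own offset
var dashboardOffset = 0
var reportingOffset = 3

val forDashboard = carolinas.read(dashboardOffset, 2) // Vector(m1, m2)
dashboardOffset += forDashboard.size                  // "commit": offset is now 2

val forReporting = carolinas.read(reportingOffset, 2) // Vector(m4, m5)
```

Because the broker keeps messages for the configured retention window regardless of who has read them, a consumer can rewind its offset and replay at any time.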
Tools - Redis
- High-speed, in-memory key-value store
- Master/slave replication support
- Primarily supports Linux; a 64-bit Windows build is maintained by Microsoft
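The master/slave pattern can be sketched in miniature. This is an illustrative in-memory model, not the Redis protocol or a Redis client (the key names are invented): writes go to the master, which forwards them to its replicas, and reads can be served from any node.

```scala
import scala.collection.mutable

// Toy key-value node: get/set over an in-memory map
class Node {
  private val data = mutable.Map.empty[String, String]
  def get(key: String): Option[String] = data.get(key)
  def set(key: String, value: String): Unit = data(key) = value
}

// A master applies each write locally, then pushes it to every replica
class Master(replicas: Seq[Node]) extends Node {
  override def set(key: String, value: String): Unit = {
    super.set(key, value)
    replicas.foreach(_.set(key, value))
  }
}

val slave  = new Node
val master = new Master(Seq(slave))
master.set("cell:42:name", "Cell_A")

println(master.get("cell:42:name")) // Some(Cell_A)
println(slave.get("cell:42:name"))  // Some(Cell_A)
```

Real Redis replication is asynchronous, so a slave can briefly lag the master; the sketch above replicates synchronously for simplicity.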
Streaming Architecture - Have It Your Way
Application teams choose their favourite fillings for their own custom burger; our platform provides the bun!
Streaming Architecture (Simplified Lots)
[Diagram: components include a control framework (Storm); a CTR parser and CTUM bridge; a quick geolocator serving the 2-minute world and a streaming location feed; an intensive geolocator serving the 15-minute world and a data loader; and distributed services (Redis): identity matrix, network service, Geo (x28).]
BATCH PROCESSING
Batch Processing Technology Ecosystem
HBase
"Use Apache HBase when you need random, realtime read/write access to your Big Data."
- Hosting of very large tables - billions of rows by millions of columns - atop clusters of commodity hardware
HBase - Freeform App Schemas
[Diagram: a "Loading Thing" and an "Application Thing" both work directly against the same data store.]
HBase - It's not relational!
[Diagram: the relational design - a Cell table (Name:string, PSC:int, Cpich:float, RNC_ID:int) joined many-to-one to an RNC table (Name:string, MCC:int, MNC:int).]
HBase - It's not relational!
[Diagram: the denormalised alternative - one wide Cell table (CellName:string, PSC:int, Cpich:float, RNCName:string, MCC:int, MNC:int).]
- Structure tables to suit the application, not the database
- Stop thinking in 3rd Normal Form
- Don't worry about duplication or repetition
- Big wide tables are the way to do it
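The wide-table idea can be modelled as nested maps. This is HBase semantics in miniature (column family -> qualifier -> value), not the HBase client API, and the cell/RNC values are invented: the point is that the owning RNC's details are repeated on every cell row instead of being joined from a second table.

```scala
// A row is column-family -> qualifier -> value (everything a string, as in HBase)
type Row = Map[String, Map[String, String]]

def cellRow(cellName: String, psc: Int, cpich: Double,
            rncName: String, mcc: Int, mnc: Int): (String, Row) =
  cellName -> Map(
    "cell" -> Map("CellName" -> cellName, "PSC" -> psc.toString, "Cpich" -> cpich.toString),
    // Denormalised: the RNC's details live on the cell row itself
    "rnc"  -> Map("RNCName" -> rncName, "MCC" -> mcc.toString, "MNC" -> mnc.toString)
  )

val table: Map[String, Row] = Map(
  cellRow("Cell_A", 187, 30.0, "RNC_North", 234, 15),
  cellRow("Cell_B", 145, 31.6, "RNC_North", 234, 15) // same RNC data, repeated
)

// One get answers the whole application query - no join needed
println(table("Cell_A")("rnc")("RNCName")) // RNC_North
```

The duplication costs disk, which is cheap; what it buys is that every application read is a single row fetch.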
HBase - Cell Versioning
[Diagram: row Cell_A with columns CpichPower, PSC and Cheese, each holding timestamped versions (e.g. PSC = 187 @ 01:00:00, 166 @ 01:30:00, 145 @ 02:00:00; CpichPower = 30.0 @ 01:00:00, 31.6 @ 02:00:00; Cheese = Stilton @ 03:00:00). Queries at 01:45:00, 02:34:00 and 03:15:00 each see the newest version written at or before the query time.]
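The timed-read rule is simple to state in code. This is a toy model of versioning, not the HBase API (which keys versions by millisecond timestamps and bounds them per column family); the "HH:mm:ss" strings mirror the diagram for brevity.

```scala
// Toy HBase cell versioning: each column keeps several timestamped versions
case class Version(timestamp: String, value: String) // "HH:mm:ss" for brevity

// A timed query returns the newest version at or before the query time
def readAsOf(versions: Seq[Version], queryTime: String): Option[String] =
  versions.filter(_.timestamp <= queryTime) // lexicographic order works for HH:mm:ss
          .sortBy(_.timestamp)
          .lastOption
          .map(_.value)

val psc = Seq(Version("01:00:00", "187"), Version("01:30:00", "166"), Version("02:00:00", "145"))

println(readAsOf(psc, "01:45:00")) // Some(166)
println(readAsOf(psc, "02:34:00")) // Some(145)
println(readAsOf(psc, "00:30:00")) // None - nothing written yet
```

This is what lets a query "as of 01:45" see the network configuration that was actually live at 01:45, without maintaining separate history tables.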
Spark
"A fast and general engine for large-scale data processing."
- Distributed, general-purpose data processing
- Scala or Java development
- Execute analyses next to the data
Spark Example

    // Set up the Spark context
    // (the connection and environment for the job)
    val sc = new SparkContext(new SparkConf().setAppName("TaxiFraudsters").setMaster("local[*]"))

    // Load the data into a friendly class
    // (parseCsv, Trip and getDistanceTo are helpers from our project, not Spark)
    val tripRows = sc.textFile("d:/data/ny Taxi/trip_data_*.csv").parseCsv()
    val trips = tripRows.map(row => new Trip(
      row("medallion"),
      row("trip_distance").toDouble,
      (row("pickup_longitude").toDouble, row("pickup_latitude").toDouble),
      (row("dropoff_longitude").toDouble, row("dropoff_latitude").toDouble)
    ))

    // Data cleansing - lots of dodgy lats and longs in the files
    val newYorkCity = (-73.979681, 40.7033127)
    val cleanedTrips = trips
      .filter(trip => trip.pickupLocation.getDistanceTo(newYorkCity) < 100)
      .filter(trip => trip.dropoffLocation.getDistanceTo(newYorkCity) < 100)

    // Find the difference between reported and straight-line distance
    val tripDistances = cleanedTrips
      .map(trip => (trip, trip.pointToPointDistance - trip.reportedDistance))
      .filter({ case (trip, difference) => difference > trip.reportedDistance / 10.0 })

    // Total unaccounted-for distance per medallion
    val fraudsters = tripDistances
      .map({ case (trip, difference) => (trip.medallion, difference) })
      .reduceByKey(_ + _)

    // Find the naughtiest drivers
    val sorted = fraudsters
      .map({ case (key, data) => (data, key) })
      .sortByKey(ascending = false)

    // Print the result - nothing is executed until now
    sorted.take(10).foreach(println)

THE END!