NOT IN KANSAS ANY MORE How we moved into Big Data Dan Taylor - JDSU
Dan Taylor: an Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile. I'm currently lucky enough to manage the Data Science team in JDSU's Location Intelligence business unit, where boredom is seldom an issue. arieso.com | jdsu.com | logicalgenetics.com
JDSU Arieso Location Intelligence Arieso's solutions harness the power of customer-generated, geolocated intelligence to tackle some of the biggest challenges facing mobile operators today. ariesoGEO locates and analyses data from billions of mobile connection events, giving operators a rich source of intelligence to help boost network performance and enrich user experience.
THE MOTHER OF INVENTION
Necessity: a new customer!
- Monitor the LTE network for the entire USA
- Geolocate every event within 2 minutes
- Process and store 34+ billion calls daily
- 24x7 operation
- Build in the flexibility to grow with the network
Daily Data Volumes 100TB data storage
Daily Data Size Comparison
- Facebook hits: 100 billion
- LTE connections in the new GEO: 34 billion
- Calls in GEO globally today: 12 billion
- Google searches: 4 billion
- Tweets: 500 million
- LinkedIn page views: 47 million
A Traditional Geo System
[Diagram: call trace data flows through loaders into an Oracle DB server, which feeds the app servers that run analyses.]
Lambda Architecture
- Streaming: real-time processing
- Batch: historical processing
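The split can be sketched in a few lines of plain Scala. This is a toy illustration of the Lambda idea only (the `Event` type and counts are invented, not our production code): the same immutable event log feeds a slow, exact batch recompute and a fast incremental speed layer, and queries merge the two views.

```scala
// Toy Lambda sketch: one immutable log, two processing paths, merged at query time
case class Event(key: String, value: Long)

val masterLog = Vector(Event("cell_A", 1), Event("cell_B", 2), Event("cell_A", 3))

// Batch layer: recompute the whole view from the full log (slow, exact)
def batchView(log: Vector[Event]): Map[String, Long] =
  log.groupBy(_.key).view.mapValues(_.map(_.value).sum).toMap

// Speed layer: fold in only the events that arrived since the last batch run
def speedView(recent: Vector[Event]): Map[String, Long] =
  recent.groupBy(_.key).view.mapValues(_.map(_.value).sum).toMap

// Serving layer: merge the batch and real-time views to answer a query
def query(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
  batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

val batch  = batchView(masterLog)
val recent = Vector(Event("cell_A", 10))           // arrived after the last batch run
println(query(batch, speedView(recent), "cell_A")) // 1 + 3 + 10 = 14
```

The batch view can be rebuilt from scratch at leisure; the speed view only ever has to cover the short window since the last rebuild.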
STREAMING
Streaming Technology Ecosystem
Tools - Storm
"Doing for realtime processing what Hadoop did for batch processing."
- Distributed computing platform
- Manages execution and message transport between processing blocks
- Free and open source
- Developed by Twitter to process high volumes of event data
- High performance; fault tolerant
- Windows and Linux support
Tools - Storm
[Diagram: a Storm cluster in which Nimbus coordinates a topology of spouts feeding chains of bolts.]
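The spout/bolt model is easy to mimic without a cluster. The sketch below is a plain-Scala simulation of the flow, not the real Storm API (Storm's `ISpout`/`IBolt` interfaces and Nimbus scheduling are far richer); the call-event format and bolt names are invented for illustration.

```scala
// Plain-Scala simulation of Storm's spout -> bolt pipeline (not the Storm API)
trait Spout[A]    { def nextTuple(): Option[A] }
trait Bolt[A, B]  { def execute(tuple: A): Seq[B] }

// A spout replaying call events from memory (a real one would read a live feed)
class CallEventSpout(events: Iterator[String]) extends Spout[String] {
  def nextTuple(): Option[String] = if (events.hasNext) Some(events.next()) else None
}

// A bolt that parses "id,duration" lines, dropping malformed ones
object ParseBolt extends Bolt[String, (String, Int)] {
  def execute(tuple: String): Seq[(String, Int)] = tuple.split(",") match {
    case Array(id, d) if d.nonEmpty && d.forall(_.isDigit) => Seq((id, d.toInt))
    case _                                                 => Seq.empty
  }
}

// A bolt that keeps only calls of a minute or more
object FilterBolt extends Bolt[(String, Int), (String, Int)] {
  def execute(t: (String, Int)): Seq[(String, Int)] = if (t._2 >= 60) Seq(t) else Seq.empty
}

// Drain the spout through both bolts, as the cluster's workers would
def run(spout: Spout[String]): Vector[(String, Int)] =
  Iterator.continually(spout.nextTuple()).takeWhile(_.isDefined).flatten
    .flatMap(ParseBolt.execute).flatMap(FilterBolt.execute).toVector

println(run(new CallEventSpout(Iterator("a,120", "bad line", "b,30", "c,75"))))
// Vector((a,120), (c,75))
```

In real Storm the framework handles the message transport, parallelism and restarts between those stages; the application only supplies the spouts and bolts.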
Tools - Kafka
"Apache Kafka is publish-subscribe messaging rethought as a distributed commit log."
- Distributed and fault-tolerant messaging system
- Scalable and durable
- Guaranteed delivery
- Built in Scala; Windows and Linux compatible
- Originally developed by LinkedIn
Tools - Kafka
[Diagram: writes landing on three partitions - Partition A (Carolinas) holding messages 1-10, Partition B (North Texas) holding 1-7, Partition C (Los Angeles) holding 1-9.]
- Each partition is an ordered, immutable sequence of messages
- All messages are persisted for a configurable time
Tools - Kafka
[Diagram: a consumer issuing Read(4), Read(5), Read(6) against Partition A (Carolinas).]
- Consumers read at a specified offset
- It is up to the consumer to manage the offset; reads don't have to be sequential
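A toy model makes the offset semantics concrete. This is a sketch of the idea, not the Kafka client API (the partition name and messages are invented): the partition is just an immutable log, and each consumer owns its own read position.

```scala
// Toy Kafka partition: an ordered, immutable sequence of messages
case class Partition(name: String, messages: Vector[String]) {
  // Reading hands back a slice of the log; nothing is ever deleted on read
  def read(offset: Int, max: Int): Vector[String] =
    messages.slice(offset, offset + max)
}

val carolinas = Partition("A - Carolinas", Vector("m1", "m2", "m3", "m4", "m5"))

// Two consumers read independently because each tracks its own offset
var dashboardOffset = 0
var reportingOffset = 3

val forDashboard = carolinas.read(dashboardOffset, 2) // Vector(m1, m2)
dashboardOffset += forDashboard.size                  // "commit": offset is now 2

val forReporting = carolinas.read(reportingOffset, 2) // Vector(m4, m5)
```

Because the broker keeps messages for the configured retention window regardless of who has read them, a consumer can rewind its offset and replay at any time.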
Tools - Redis
- High-speed, in-memory key-value store
- Master/slave replication support
- Primarily supports Linux; a 64-bit Windows build is maintained by Microsoft
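The master/slave pattern can be sketched in miniature. This is an illustrative in-memory model, not the Redis protocol or a Redis client (the key names are invented): writes go to the master, which forwards them to its replicas, and reads can be served from any node.

```scala
import scala.collection.mutable

// Toy key-value node: get/set over an in-memory map
class Node {
  private val data = mutable.Map.empty[String, String]
  def get(key: String): Option[String] = data.get(key)
  def set(key: String, value: String): Unit = data(key) = value
}

// A master applies each write locally, then pushes it to every replica
class Master(replicas: Seq[Node]) extends Node {
  override def set(key: String, value: String): Unit = {
    super.set(key, value)
    replicas.foreach(_.set(key, value))
  }
}

val slave  = new Node
val master = new Master(Seq(slave))
master.set("cell:42:name", "Cell_A")

println(master.get("cell:42:name")) // Some(Cell_A)
println(slave.get("cell:42:name"))  // Some(Cell_A)
```

Real Redis replication is asynchronous, so a slave can briefly lag the master; the sketch above replicates synchronously for simplicity.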
Streaming Architecture - Have It Your Way
Application teams choose their favourite fillings for their own custom burger; our platform provides the bun!
Streaming Architecture (Simplified Lots)
[Diagram: components include a control framework (Storm); a CTR parser and CTUM bridge; a quick geolocator serving the 2-minute world and a streaming location feed; an intensive geolocator serving the 15-minute world and a data loader; and distributed services (Redis): identity matrix, network service, Geo (x28).]
BATCH PROCESSING
Batch Processing Technology Ecosystem
HBase
"Use Apache HBase when you need random, realtime read/write access to your Big Data."
- Hosting of very large tables - billions of rows by millions of columns - atop clusters of commodity hardware
HBase - Freeform App Schemas
[Diagram: a "Loading Thing" and an "Application Thing" both work directly against the same data store.]
HBase - It's not relational!
[Diagram: the relational design - a Cell table (Name:string, PSC:int, Cpich:float, RNC_ID:int) joined many-to-one to an RNC table (Name:string, MCC:int, MNC:int).]
HBase - It's not relational!
[Diagram: the denormalised alternative - one wide Cell table (CellName:string, PSC:int, Cpich:float, RNCName:string, MCC:int, MNC:int).]
- Structure tables to suit the application, not the database
- Stop thinking in 3rd Normal Form
- Don't worry about duplication or repetition
- Big wide tables are the way to do it
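The wide-table idea can be modelled as nested maps. This is HBase semantics in miniature (column family -> qualifier -> value), not the HBase client API, and the cell/RNC values are invented: the point is that the owning RNC's details are repeated on every cell row instead of being joined from a second table.

```scala
// A row is column-family -> qualifier -> value (everything a string, as in HBase)
type Row = Map[String, Map[String, String]]

def cellRow(cellName: String, psc: Int, cpich: Double,
            rncName: String, mcc: Int, mnc: Int): (String, Row) =
  cellName -> Map(
    "cell" -> Map("CellName" -> cellName, "PSC" -> psc.toString, "Cpich" -> cpich.toString),
    // Denormalised: the RNC's details live on the cell row itself
    "rnc"  -> Map("RNCName" -> rncName, "MCC" -> mcc.toString, "MNC" -> mnc.toString)
  )

val table: Map[String, Row] = Map(
  cellRow("Cell_A", 187, 30.0, "RNC_North", 234, 15),
  cellRow("Cell_B", 145, 31.6, "RNC_North", 234, 15) // same RNC data, repeated
)

// One get answers the whole application query - no join needed
println(table("Cell_A")("rnc")("RNCName")) // RNC_North
```

The duplication costs disk, which is cheap; what it buys is that every application read is a single row fetch.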
HBase - Cell Versioning
[Diagram: row Cell_A with columns CpichPower, PSC and Cheese, each holding timestamped versions (e.g. PSC = 187 @ 01:00:00, 166 @ 01:30:00, 145 @ 02:00:00; CpichPower = 30.0 @ 01:00:00, 31.6 @ 02:00:00; Cheese = Stilton @ 03:00:00). Queries at 01:45:00, 02:34:00 and 03:15:00 each see the newest version written at or before the query time.]
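The timed-read rule is simple to state in code. This is a toy model of versioning, not the HBase API (which keys versions by millisecond timestamps and bounds them per column family); the "HH:mm:ss" strings mirror the diagram for brevity.

```scala
// Toy HBase cell versioning: each column keeps several timestamped versions
case class Version(timestamp: String, value: String) // "HH:mm:ss" for brevity

// A timed query returns the newest version at or before the query time
def readAsOf(versions: Seq[Version], queryTime: String): Option[String] =
  versions.filter(_.timestamp <= queryTime) // lexicographic order works for HH:mm:ss
          .sortBy(_.timestamp)
          .lastOption
          .map(_.value)

val psc = Seq(Version("01:00:00", "187"), Version("01:30:00", "166"), Version("02:00:00", "145"))

println(readAsOf(psc, "01:45:00")) // Some(166)
println(readAsOf(psc, "02:34:00")) // Some(145)
println(readAsOf(psc, "00:30:00")) // None - nothing written yet
```

This is what lets a query "as of 01:45" see the network configuration that was actually live at 01:45, without maintaining separate history tables.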
Spark
"A fast and general engine for large-scale data processing."
- Distributed, general-purpose data processing
- Scala or Java development
- Execute analyses next to the data
Spark Example

    // Set up the Spark context
    // (the connection and environment for the job)
    val sc = new SparkContext(new SparkConf().setAppName("TaxiFraudsters").setMaster("local[*]"))

    // Load the data into a friendly class
    // (parseCsv, Trip and getDistanceTo are helpers from our project, not Spark)
    val tripRows = sc.textFile("d:/data/ny Taxi/trip_data_*.csv").parseCsv()
    val trips = tripRows.map(row => new Trip(
      row("medallion"),
      row("trip_distance").toDouble,
      (row("pickup_longitude").toDouble, row("pickup_latitude").toDouble),
      (row("dropoff_longitude").toDouble, row("dropoff_latitude").toDouble)
    ))

    // Data cleansing - lots of dodgy lats and longs in the files
    val newYorkCity = (-73.979681, 40.7033127)
    val cleanedTrips = trips
      .filter(trip => trip.pickupLocation.getDistanceTo(newYorkCity) < 100)
      .filter(trip => trip.dropoffLocation.getDistanceTo(newYorkCity) < 100)

    // Find the difference between reported and straight-line distance
    val tripDistances = cleanedTrips
      .map(trip => (trip, trip.pointToPointDistance - trip.reportedDistance))
      .filter({ case (trip, difference) => difference > trip.reportedDistance / 10.0 })

    // Total unaccounted-for distance per medallion
    val fraudsters = tripDistances
      .map({ case (trip, difference) => (trip.medallion, difference) })
      .reduceByKey(_ + _)

    // Find the naughtiest drivers
    val sorted = fraudsters
      .map({ case (key, data) => (data, key) })
      .sortByKey(ascending = false)

    // Print the result - nothing is executed until now
    sorted.take(10).foreach(println)

THE END!