Predictive Analytics with Storm, Hadoop, and R on AWS
Douglas Moore, Principal Consultant & Architect
February 2013
Leading provider of data science and engineering services
Accelerating your time to value using big data
- IMAGINE: Strategy and Roadmap
- ILLUMINATE: Training and Education
- IMPLEMENT: Hands-On Data Science and Data Engineering
Boston Storm Meetup, 2013-02-28
Agenda
- Intro
- Project Information
- Predictive Analytics
- Storm Overview
- Architecture & Design
- Deployment
- Lessons
- Best Practices
- Future
- Bonus: Storm & Big Data Patterns
Project Definition
AdGlue: solving the biggest problem for local advertisers: "Where's my ad?"
Their needs:
- Scale up for new business deals
- A more lively site
- Better predictions
- Recommendations
Use cases:
- Scale the batch analysis pipeline; generate timely stats
- Recommendations
- Predictions: how many page views in the next 30 days?
Environment:
- AWS
- Version 1 of the site & analytics in production
Project plan:
- 8-9 weeks
- Combined data engineering + data science engagement
- Staffing: 1 architect + 1 PM, 1 data engineer, 2 data scientists, 3 client engineers
Predictive Analytics Process
Model design & build:
- Listening & learning
- Discovery (digging through the data)
- Creating a research agenda
- Testing & learning
Production predictive model development:
- Data cleansing, aggregation, and conditioning
- Predictive model training process
- Predictive model execution process
Challenges:
- What functional forms predict future impression counts, given counts up to time T?
- Robust estimators, such as medians rather than means, to cope with outliers (see the sketch after this list)
- How do we distinguish new articles from old articles we are seeing for the first time?
- How well do impression counts correspond to real humans?
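To make the robustness point concrete, here is a toy sketch (not from the deck; the numbers are invented) showing how a single bot-inflated hour skews a mean while leaving a median untouched:

```java
import java.util.Arrays;

// Toy illustration of the median-vs-mean point: one outlier hour
// (e.g., a bot burst) drags the mean far more than the median.
public final class RobustStats {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] hourlyImpressions = {120, 135, 128, 131, 9500}; // last hour: bot burst
        System.out.println(mean(hourlyImpressions));   // 2002.8 -- badly skewed
        System.out.println(median(hourlyImpressions)); // 131.0  -- unaffected
    }
}
```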
Solution Approach
Analyze a massive historical data set, analyze the recent past, and combine the two for near-realtime prediction:
- Massive historical set = S3
- Analyze (batch) = Hadoop + Pig + R
- Recent past = Storm + NoSQL
- Analyze (realtime) = R + web service
Storm Overview
DAG processing of never-ending streams of data
- Open sourced: https://github.com/nathanmarz/storm/wiki
- Used at Twitter plus more than 24 other companies
- Reliable: at-least-once semantics
- Think MapReduce for data streams
- Java/Clojure based; bolts in Java, plus shell bolts (see the bolt sketch below)
- Not a queue, but usually reads from a queue
Related: S4, CEP
Compromises:
- Static topologies & cluster sizing avoid messy dynamic rebalancing
- Nimbus is a single point of failure (SPOF)
- Strong community support, but no commercial support
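As a minimal illustration of a Java bolt in the Storm 0.8-era (backtype.storm) API, here is a hypothetical bolt, not from the deck, that splits a raw log line into words:

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt: emits one tuple per word in an incoming line.
public class SplitLineBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // BaseBasicBolt acks the input tuple automatically when execute() returns.
        for (String word : input.getStringByField("line").split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```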
Storm Concepts Review
- Cluster: Supervisor, Worker
- Topology
- Streams
- Spout
- Bolt
- Tuple
- Stream groupings: shuffle, fields
- Trident
- DRPC
A short wiring sketch tying these concepts together follows this list.
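A minimal sketch of how these concepts compose, using the SplitLineBolt above; LineSpout and WordCountBolt are hypothetical stand-ins:

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("lines", new LineSpout(), 2);           // spout: source of a stream
builder.setBolt("split", new SplitLineBolt(), 4)         // bolt: a processing step
       .shuffleGrouping("lines");                        // shuffle grouping: round-robin
builder.setBolt("count", new WordCountBolt(), 4)
       .fieldsGrouping("split", new Fields("word"));     // fields grouping: same word -> same task
```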
Why Storm? Why Realtime?
- Needed a better way to manage queue readers and the logic pipeline
- Much better than rolling your own
- Reliable (message guarantees, fault tolerant)
- Multi-node scaling (1M messages/sec on a 10-node cluster)
- It works
- For more reasons: https://github.com/nathanmarz/storm/wiki/rationale
Better end-user experience:
- View an ad, see the counter move
Need to catch fast-moving events:
- Content half-life is measured in hours
Path to additional real-time capabilities:
- Trend analysis, for example to recommend hot articles
- Ability to bolt on additional analytics
Overall Architecture
[Architecture diagram] Ad-serving edge servers behind a load balancer emit impression, ad-view, and interaction events to SQS; Storm consumes the queues, with ElastiCache used for tuple state tracking; logs are archived to S3; EMR (Hadoop) handles cleansing, model training, and recommendations; DynamoDB holds performance counters and impression buckets; RDS (MySQL) holds model parameters for the R model behind a getprediction web service; CloudWatch and SNS provide metrics, alarms, and notifications.
Storm's responsibilities (an illustrative SQS spout sketch follows this list):
- Queue management
- Simple bot filtering
- Real-time bucketization
- Performance counters
- Event logging
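As a hedged illustration of the queue-management piece, a minimal SQS-backed spout; this is not the production spout from the deck, and class and field names are assumptions based on the Storm 0.8-era API and the AWS SDK for Java:

```java
import java.util.List;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;

// Hypothetical SQS spout: emits one tuple per message, deletes on ack,
// and relies on the SQS visibility timeout to redeliver failed messages.
public class SqsEventSpout extends BaseRichSpout {
    private final String queueUrl;
    private transient AmazonSQSClient sqs;          // not serializable: build in open()
    private transient SpoutOutputCollector collector;

    public SqsEventSpout(String queueUrl) { this.queueUrl = queueUrl; }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.sqs = new AmazonSQSClient();           // credentials from the default chain
    }

    @Override
    public void nextTuple() {
        List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
        for (Message m : messages) {
            // Use the receipt handle as the message id so ack() can delete it later.
            collector.emit(new Values(m.getBody()), m.getReceiptHandle());
        }
    }

    @Override
    public void ack(Object msgId) {
        sqs.deleteMessage(new DeleteMessageRequest(queueUrl, (String) msgId));
    }

    @Override
    public void fail(Object msgId) {
        // Do nothing: the message reappears after its visibility timeout.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```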
Analytics Architecture
[Architecture diagram]
- Batch path: impressions land in S3; EMR (Hadoop) performs impression bucketization, producing impression buckets (batch), and a training step produces predictive model parameters for the R model.
- Realtime path: a Storm topology (S3 adapter, impression spout, simple bot annotator, BucketBolt) produces impression buckets (realtime).
- Serving path: a web request for an impression prediction is answered by the R model, using the buckets and the trained parameters.
Storm Topology (Greatly Simplified)
[Topology diagram] An SQS-fed Event Spout and a Command Spout feed a SimpleBotFilter bolt; filtered events flow both to an S3 Adapter<T> (archival to S3) and to Performance Counters<T>, which feed a DynamoDB Adapter<T> that writes to DynamoDB. A hedged wiring sketch follows.
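A speculative reconstruction of that wiring in the backtype.storm API; apart from SqsEventSpout from the earlier sketch, every component class, constant, and grouping here is a hypothetical stand-in, not the production code:

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// EVENT_QUEUE_URL, COMMAND_QUEUE_URL, ARCHIVE_BUCKET, COUNTER_TABLE: placeholders.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new SqsEventSpout(EVENT_QUEUE_URL), 2);
builder.setSpout("commands", new SqsCommandSpout(COMMAND_QUEUE_URL), 1);
builder.setBolt("bot-filter", new SimpleBotFilterBolt(), 4)
       .shuffleGrouping("events");
builder.setBolt("s3-adapter", new S3AdapterBolt(ARCHIVE_BUCKET), 2)
       .shuffleGrouping("bot-filter");                    // archive raw events to S3
builder.setBolt("counters", new PerformanceCounterBolt(), 4)
       .fieldsGrouping("bot-filter", new Fields("adId")); // same ad -> same task
builder.setBolt("dynamo", new DynamoDbAdapterBolt(COUNTER_TABLE), 2)
       .shuffleGrouping("counters")
       .allGrouping("commands");                          // commands fan out to all tasks
```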
Storm Deployment
storm-deploy project: https://github.com/nathanmarz/storm-deploy/wiki
- Uses the Pallet & jclouds projects to deploy a cluster
- Configured through the conf/clusters.yaml & ~/.pallet/config.clj files (see the illustrative config below)
Pros:
- Quick and easy AWS deployment
- Tip: use Puppet/Chef for production deployment
Cons:
- Requires Leiningen v1.x, with no warning if another version is used
- Project not kept up to date
- Changes & debugging are in Clojure
- Recovering a node is possible but slow
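For orientation, a clusters.yaml along the lines of the storm-deploy wiki's example; the AMI IDs, instance sizes, and counts here are placeholders, not settings from this project:

```yaml
nimbus.image: "us-east-1/ami-XXXXXXXX"      # placeholder AMI
nimbus.hardware: "m1.large"
supervisor.count: 2
supervisor.image: "us-east-1/ami-XXXXXXXX"
supervisor.hardware: "m1.large"
zookeeper.count: 1
zookeeper.image: "us-east-1/ami-XXXXXXXX"
zookeeper.hardware: "m1.large"
```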
Lessons
Easy to develop, hard to debug:
- Timeouts
Storm infinite loop of failures:
- Use Memcached to count the number of tuple failures
At-least-once processing:
- Hadoop-based read-repair job
Performance counters not getting flushed:
- Tick tuples (see the sketch below)
Always ACK
Batching to S3:
- Run a compaction & event de-duplication job in Hadoop
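A minimal sketch of the tick-tuple fix, assuming a hypothetical counter bolt; the flush interval and field names are invented:

```java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical counter bolt that flushes on tick tuples rather than per event,
// so counters cannot sit unflushed when traffic is quiet.
public class CounterFlushBolt extends BaseBasicBolt {
    private transient Map<String, Long> counts;

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 10 seconds (assumed interval).
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }

    private static boolean isTick(Tuple t) {
        return Constants.SYSTEM_COMPONENT_ID.equals(t.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(t.getSourceStreamId());
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (counts == null) counts = new HashMap<String, Long>();
        if (isTick(tuple)) {
            flush(counts);                  // hypothetical: commit counters to the store
            counts.clear();
        } else {
            String key = tuple.getStringByField("bucket");
            Long c = counts.get(key);
            counts.put(key, c == null ? 1L : c + 1L);
        }
    }

    private void flush(Map<String, Long> counts) { /* write to DynamoDB, etc. */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}
```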
Lessons (continued)
Understand your timescales:
- The frequency at which you emit running totals/averages/stats
- The frequency at which you write logs to S3
- The frequency at which you commit to DynamoDB / RDS / ...
Tuning procedures are painful when your topology carries lots of tuples (see the config sketch below):
- TOPOLOGY_MESSAGE_TIMEOUT_SECS
- TOPOLOGY_MAX_SPOUT_PENDING
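Where those two knobs live, as a sketch; the values are illustrative starting points only, since the right settings depend on tuple volume and latency:

```java
import backtype.storm.Config;

Config conf = new Config();
conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60); // replay tuples not fully acked within 60s
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1000);  // cap un-acked tuples per spout task
```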
Storm Best Practices
Debug and unit test topology application logic in local mode (see the harness sketch below):
- Mock testing
- Multiple environments
- Exception handling & logging
When running distributed:
- Start with a small number of workers and slots, so there are fewer log files to dig through
- Automated deployment
Use metrics:
- Instrument your spouts and bolts
- Needed when scaling, in order to optimize performance
- Helps diagnose problems
- The latest WIP versions of Storm add specialized metrics and improve Nimbus reporting
Use test data that is similar to production data:
- Distribution across the topology is data dependent
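A minimal local-mode harness, assuming the `builder` wiring from the earlier sketches:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.utils.Utils;

Config conf = new Config();
conf.setDebug(true);                        // log every emitted tuple

LocalCluster cluster = new LocalCluster();  // in-process pseudo-cluster for tests
cluster.submitTopology("test-topology", conf, builder.createTopology());
Utils.sleep(10000);                         // let the topology run for ten seconds
cluster.killTopology("test-topology");
cluster.shutdown();
```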
Future Improvements
Exactly-once semantics:
- Trident (see the sketch below)
S3 small file sizes:
- Segment the topology just for S3 persistence
- Incremental S3 uploads (faster, too)
DynamoDB costs:
- Use DRPC to access time series and metrics
Deploy using Chef/Puppet:
- AWS OpsWorks?
Revisit analytical models:
- Compare performance
- Compare with other models: do they perform better?
- Feature analysis
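A hedged sketch of what the Trident move could look like, patterned on Trident's standard counting example; the spout and field names are invented, and the in-memory state would be swapped for a persistent transactional store:

```java
import backtype.storm.tuple.Fields;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

TridentTopology topology = new TridentTopology();
topology.newStream("impressions", new ImpressionSpout()) // hypothetical spout emitting "bucket"
        .groupBy(new Fields("bucket"))
        .persistentAggregate(new MemoryMapState.Factory(), // exactly-once counter state
                             new Count(),
                             new Fields("count"));
```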
Bonus
Storm & Big Data Patterns
[Pattern diagram] Transactional source systems (CRUD events) and fleets of edge servers and devices feed events into Storm, which parses, maps, enriches, filters, and distributes them to downstream consumers: log aggregation, ETL, dimensional counts, indexers, analytics, and subscription services, backed by a DFS system of record and feeding OLAP, fuzzy search, dashboards, and partners.
Questions?