Building Real-Time Analytics Into Big Data Applications

Similar documents

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Real Time Big Data Processing

Scalability in the Cloud HPC Convergence with Big Data in Design, Engineering, Manufacturing

Hadoop & Spark Using Amazon EMR

Thing Big: How to Scale Your Own Internet of Things.

Building your Big Data Architecture on Amazon Web Services

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Amazon Web Services. Lawrence Berkeley LabTech Conference 9/10/15. Jamie Baker Federal Scientific Account Manager AWS WWPS

Introduction to Amazon Web Services! Leo Senior Solutions Architect

Innovative Geschäftsmodelle Ermöglicht durch die AWS Cloud

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Amazon Web Services Annual ALGIM Conference. Tim Dacombe-Bird Regional Sales Manager Amazon Web Services New Zealand

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Amazon EC2 Product Details Page 1 of 5

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

Introduction to AWS in Higher Ed

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Amazon Glacier. Developer Guide API Version

Amazon Elastic Beanstalk

LONDON. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

AWS Cloud for HPC and Big Data

/ Cloud Computing. Recitation 5 September 29 th & October 1 st 2015

Amazon Kinesis and Apache Storm

AIST Data Symposium. Ed Lenta. Managing Director, ANZ Amazon Web Services

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

CLOUD COMPUTING FOR THE ENTERPRISE AND GLOBAL COMPANIES Steve Midgley Head of AWS EMEA

Architectures for massive data management

AWS Lambda. Developer Guide

Scalable Architecture on Amazon AWS Cloud

Amazon Elastic MapReduce. Jinesh Varia Peter Sirota Richard Cole

Monitoring Linux and Windows Logs with Graylog Collector. Bernd Ahlers Graylog, Inc.

Razvoj Java aplikacija u Amazon AWS Cloud: Praktična demonstracija

How To Use Aws.Com

Technology Enablement

Why should you look at your logs? Why ELK (Elasticsearch, Logstash, and Kibana)?

Big Data Pipeline and Analytics Platform

Amazon Web Services. Steve Spano, Brig Gen (Ret), USAF GM, Defense and National Security AWS, Worldwide Public Sector

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Logentries Insights: The State of Log Management & Analytics for AWS

Time series IoT data ingestion into Cassandra using Kaa

Preparing Your IT for the Holidays. A quick start guide to take your e-commerce to the Cloud

Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases

Systems Integration in the Cloud Era with Apache Camel. Kai Wähner, Principal Consultant

AWS Account Setup and Services Overview

Microservices on AWS

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

IAN MASSINGHAM. Technical Evangelist Amazon Web Services

Cloud Computing with Amazon Web Services and the DevOps Methodology.

Deep Dive: Maximizing EC2 & EBS Performance

Professional Hadoop Solutions

Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014

Bernd Ahlers Michael Friedrich. Log Monitoring Simplified Get the best out of Graylog2 & Icinga 2

AWS Directory Service. Simple AD Administration Guide Version 1.0

AWS Performance Tuning

SECURITY IS JOB ZERO. Security The Forefront For Any Online Business Bill Murray Director AWS Security Programs

Cloudera Certified Developer for Apache Hadoop

ITIL Event Management in the Cloud

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Cloud computing - Architecting in the cloud

Migration Scenario: Migrating Batch Processes to the AWS Cloud

Application Security Best Practices. Matt Tavis Principal Solutions Architect

Amazon Web Services. For Government, Education, and Nonprofit Organizations. Jakob Huhn. Partner Manager Benelux, Public Sector

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Headline Goes Here Hari Shreedharan Speaker Name or Subhead Goes Here

MapReduce (in the cloud)

Big Data Management and Security

Cloud Big Data Architectures

Unified Batch & Stream Processing Platform

Amazon Cloud Storage Options

Amazon Simple Notification Service. Developer Guide API Version

3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS

A programming model in Cloud: MapReduce

Xiaoming Gao Hui Li Thilina Gunarathne

Backup and Recovery of SAP Systems on Windows / SQL Server

EEDC. Scalability Study of web apps in AWS. Execution Environments for Distributed Computing

Building Success on Acquia Cloud:

Monitoring and Scaling My Application

Java SE 8 Programming

How To Set Up Wiremock In Anhtml.Com On A Testnet On A Linux Server On A Microsoft Powerbook 2.5 (Powerbook) On A Powerbook 1.5 On A Macbook 2 (Powerbooks)

Edge Configuration Series Reporting Overview

PERFORMANCE MODELS FOR APACHE ACCUMULO:

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

StorReduce Technical White Paper Cloud-based Data Deduplication

How to Leverage Cloud to Quickly Build Scalable Applications

Amazon Relational Database Service (RDS)

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Cloud Computing project Report

The Inside Scoop on Hadoop

Transcription:

Building Real-Time Analytics Into Big Data Applications Shawn Gandhi, Solutions Architect @shawnagram 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

{ } "payerid": "Joe", "productcode": "AmazonS3", "clientproductcode": "AmazonS3", "usagetype": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" Metering Record 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36-0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 Common Log Entry <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [examplesdid@32473 iut="3" eventsource="application" eventid="1011"][examplepriority@32473 class="high"] Syslog Entry SeattlePublicWater/Kinesis/123/Realtime 412309129140 MQTT Record <R,AMZN,T,G,R1> NASDAQ OMX Record

Data volume Generated data Available for analysis Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012 2016 Forecast and 2011 Vendor Shares

Abraham Wald (1902-1950)

Big data: best served fresh Big data Hourly server logs: were your systems misbehaving one hour ago Weekly / Monthly Bill: what you spent this billing cycle Daily customer-preferences report from your website s click stream: what deal or ad to try next time Daily fraud reports: was there fraud yesterday Real-time big data Amazon CloudWatch metrics: what went wrong now Real-time spending alerts/caps: prevent overspending now Real-time analysis: what to offer the current customer now Real-time detection: block fraudulent use now

Talk outline A tour of Amazon Kinesis concepts in the context of a Twitter Trends service Implementing the Twitter Trends service Amazon Kinesis in the broader context of a big data ecosystem Q&A

Sending and reading data from Amazon Kinesis streams Sending Reading HTTP Post Get* APIs AWS SDK LOG4J Kinesis Client Library + Connector Library Apache Storm Flume Fluentd Amazon Elastic MapReduce

A Tour of Amazon Kinesis Concepts in the Context of a Twitter Trends Service

Demo http://bit.ly/kinesisdemo

twitter-trends.com website twitter-trends.com Elastic Beanstalk twitter-trends.com

Too big to handle on one box twitter-trends.com

The solution: streaming map/reduce My top-10 twitter-trends.com My top-10 Global top-10 My top-10

Core concepts Data record Stream Partition key twitter-trends.com My top-10 My top-10 Global top-10 My top-10 Data record Shard Worker Shard: 1 4 1 7 1 8 2 1 2 3 Sequence number

How this relates to Amazon Kinesis twitter-trends.com Kinesis Kinesis application

Core concepts recapped Data record ~ tweet Stream ~ all tweets (the Twitter Firehose) Partition key ~ Twitter topic (every tweet record belongs to exactly one of these) Shard ~ all the data records belonging to a set of Twitter topics that will get grouped together Sequence number ~ each data record gets one assigned when first ingested Worker ~ processes the records of a shard in sequence number order

Using the Amazon Kinesis API directly twitter-trends.com K I N E S I S process(records): iterator = getsharditerator(shardid, { LATEST); while for (record (true) { in records) { [records, updatelocaltop10(record); iterator] = } getnextrecords(iterator, maxrecstoreturn); if process(records); (timetodooutput()) { } writelocaltop10toddb(); } } } while (true) { localtop10lists = scanddbtable(); updateglobaltop10list( localtop10lists); sleep(10);

Challenges with using the Amazon Kinesis API directly Manual creation How many of workers and assignment How per many EC2 to shards instance? EC2 instances? twitter-trends.com K I N E S I S Kinesis application

Using the Amazon Kinesis Client Library twitter-trends.com K I N E S I S Shard mgmt table Kinesis application

Elastic scale and load balancing Auto scaling Group Amazon CloudWatch twitter-trends.com K I N E S I S Shard mgmt table

Shard management Auto scaling Group Amazon CloudWatch Amazon SNS AWS Console twitter-trends.com K I N E S I S Shard mgmt table

The challenges of fault tolerance K I N E S I S X X X Shard mgmt table twitter-trends.com

Fault tolerance support in KCL K I N E S I S Availability Zone X1 Shard mgmt table twitter-trends.com Availability Zone 3

Checkpoint, replay design pattern Amazon Kinesis 21 23 17 18 14 10 8 3 5 2 Host A 18X 0310 Shard ID Shard-i Local top-10 {#Obama: 10235, #Congress: #Seahawks: 9910, 9835, } 23 21 18 17 14 10 Shard-i 23 8 5 21 3 18 2 17 14 Host B 10 23 Shard ID Lock Seq num Shard-i Host AB 10 0

Catching up from delayed processing Amazon Kinesis Host A How long until we ve caught up? 23 21 18 17 14 10 Shard-i 8 5 3 10 2 5 3 X2 Host B Shard throughput SLA: - 1 MBps in - 2 MBps out => Catch up time ~ down time 23 8 21 18 17 5 14 3 2 Unless your application can t process 2 MBps on a single host => Provision more shards

Concurrent processing applications Amazon Kinesis twitter-trends.com 23 018 310 23 18 21 17 10 14 8 3 5 2 23 21 18 17 14 10 8 5 3 2 Shard-i 10 23 21 18 8 17 5 14 3 2 spot-the-revolution.org 10 023

Why not combine applications? Amazon Kinesis twitter-trends.com Output state and checkpoint every 10 seconds twitter-stats.com Output state and checkpoint every hour

Implementing twitter-trends.com

Creating an Amazon Kinesis stream

How many shards? Additional considerations: - How much processing is needed per data record/data byte? - How quickly do you need to catch up if one or more of your workers stalls or fails? You can always dynamically reshard to adjust to changing workloads

Twitter Trends shard processing code Class TwitterTrendsShardProcessor implements IRecordProcessor { public TwitterTrendsShardProcessor() { } @Override public void initialize(string shardid) { } @Override public void processrecords(list<record> records, IRecordProcessorCheckpointer checkpointer) { } } @Override public void shutdown(irecordprocessorcheckpointer checkpointer, ShutdownReason reason) { }

Twitter Trends shard processing code Class TwitterTrendsShardProcessor implements IRecordProcessor { private Map<String, AtomicInteger> hashcount = new HashMap<>(); private long tweetsprocessed = 0; { @Override public void processrecords(list<record> records, IRecordProcessorCheckpointer checkpointer) computelocaltop10(records); } } if ((tweetsprocessed++) >= 2000) { } emittodynamodb(checkpointer);

Twitter Trends shard processing code private void computelocaltop10(list<record> records) { for (Record r : records) { String tweet = new String(r.getData().array()); String[] words = tweet.split(" \t"); for (String word : words) { if (word.startswith("#")) { if (!hashcount.containskey(word)) hashcount.put(word, new AtomicInteger(0)); hashcount.get(word).incrementandget(); } } } private void emittodynamodb(irecordprocessorcheckpointer checkpointer) { persistmaptodynamodb(hashcount); try { checkpointer.checkpoint(); } catch (IOException KinesisClientLibDependencyException InvalidStateException ThrottlingException ShutdownException e) { // Error handling } hashcount = new HashMap<>(); tweetsprocessed = 0; }

Amazon Kinesis as a gateway into the big data ecosystem

Big data has a lifecycle Twitter trends anonymized archive Twitter trends statistics Twitter trends processing application twitter-trends.com

Connector framework public interface IKinesisConnectorPipeline<T> { IEmitter<T> getemitter(kinesisconnectorconfiguration configuration); IBuffer<T> getbuffer(kinesisconnectorconfiguration configuration); ITransformer<T> gettransformer( KinesisConnectorConfiguration configuration); } IFilter<T> getfilter(kinesisconnectorconfiguration configuration);

Transformer and Filter interfaces public interface ITransformer<T> { public T toclass(record record) throws IOException; } public byte[] fromclass(t record) throws IOException; public interface IFilter<T> { } public boolean keeprecord(t record);

Tweet transformer for S3 archival public class TweetTransformer implements ITransformer<TwitterRecord> { private final JsonFactory fact = JsonActivityFeedProcessor.getObjectMapper().getJsonFactory(); @Override public TwitterRecord toclass(record record) { try { return fact.createjsonparser( record.getdata().array()).readvalueas(twitterrecord.class); } catch (IOException e) { throw new RuntimeException(e); } }

Example Amazon Redshift connector Amazon Redshift K I N E S I S Amazon S3

Example Amazon Redshift connector v2 K I N E S I S K I N E S I S Amazon Redshift Amazon S3

Obrigado! @shawnagram