CAPTURING & PROCESSING REAL-TIME DATA ON AWS © 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Agenda: Real-Time Analytics; Data Ingestion; Data Processing (Architecture, AWS Lambda); Customer Implementations
Real-Time Analytics. Real-time Ingest: highly scalable, durable, elastic, replay-able reads. Continuous Processing FX: load-balancing incoming streams; fault-tolerance, checkpoint / replay; elastic; enable multiple apps to process in parallel. Continuous, real-time workloads: low end-to-end latency, continuous data flow.
Data Ingestion
Starting simple... foo-analysis.com Global top-10
Distributing the workload Elastic Beanstalk foo-analysis.com Global top-10
Or using an Elastic Data Broker Local top-10 Local top-10 Local top-10 Elastic Beanstalk foo-analysis.com Global top-10
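The local-to-global top-10 flow on this slide can be sketched as follows. This is a minimal illustration, not the deck's implementation: the function names `local_top` and `global_top` and the sample events are hypothetical, and merging truncated local lists only approximates the true global ranking when a popular item falls just outside a worker's local top-n.

```python
from collections import Counter

def local_top(events, n=10):
    """Count the items seen by one worker and keep its n most frequent."""
    return Counter(events).most_common(n)

def global_top(local_results, n=10):
    """Merge per-worker (item, count) lists into a single global top-n."""
    merged = Counter()
    for result in local_results:
        for item, count in result:
            merged[item] += count
    return merged.most_common(n)

# Hypothetical events seen by three workers.
workers = [
    ["a", "b", "a", "c"],
    ["b", "d"],
    ["a", "d", "d", "d"],
]
partials = [local_top(events) for events in workers]
top = global_top(partials, n=2)
print(top)
```

Sending full local counts instead of truncated lists would make the global result exact, at the cost of more data per worker.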
Amazon Kinesis Managed Stream. Diagram labels: Elastic Beanstalk, foo-analysis.com, Kinesis, Partition Key, Worker, My top-10, Sequence Number, Data Record, Global top-10, Stream, Shard (sequence numbers 14, 17, 18, 21, 23)
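How a partition key selects a shard can be sketched as below. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The sketch assumes the shards split the hash space evenly, which holds for a freshly created stream but not necessarily after splits or merges; `hash_key` and `shard_for` are hypothetical helper names.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value

def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def shard_for(partition_key: str, num_shards: int) -> int:
    """Return the shard index, assuming evenly split hash key ranges."""
    return hash_key(partition_key) * num_shards // HASH_SPACE

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering by sequence number.
for key in ["user-1", "user-2", "user-1"]:
    print(key, "-> shard", shard_for(key, num_shards=4))
```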
Amazon Kinesis Common Data Broker Data Sources Data Sources Availability Zone Availability Zone Availability Zone [Data Archive] App. 1 App. 2 S3 Data Sources Data Sources AWS Endpoint Shard 1 Shard 2 Shard N [Metric Extraction] App. 3 [Sliding Window Analysis] DynamoDB Redshift App. 4 Data Sources [Machine Learning] EMR
Amazon Kinesis Distributed Streams: From batch to continuous processing. Scale shards elastically up or down without losing sequencing. Workers can replay records for up to 24 hours. Scale up to GB/sec without losing durability; records are stored across multiple Availability Zones. Multiple parallel Kinesis apps can output to anything: RDBMS, S3, in-house data warehouse, messaging, another stream, Java SDK, Python SDK, etc.
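The checkpoint and replay behaviour can be illustrated with a toy in-memory shard. This is only a sketch of the idea, not the Kinesis API: `ToyShard` and its methods are hypothetical, timestamps are plain seconds, and the 24-hour retention is taken from the slide.

```python
RETENTION_SECONDS = 24 * 60 * 60  # replay window stated on the slide

class ToyShard:
    """In-memory stand-in for one shard (hypothetical, for illustration)."""

    def __init__(self):
        self._records = []  # (sequence_number, arrival_time, data)
        self._next_seq = 0

    def put(self, data, now):
        """Append a record and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._records.append((seq, now, data))
        return seq

    def read_after(self, checkpoint, now):
        """Return retained records with a sequence number above checkpoint."""
        cutoff = now - RETENTION_SECONDS
        return [(seq, data) for seq, ts, data in self._records
                if seq > checkpoint and ts >= cutoff]

shard = ToyShard()
shard.put("a", now=0)
shard.put("b", now=3600)
shard.put("c", now=7200)

# A worker that checkpointed sequence number 0 and then crashed
# resumes from its checkpoint and sees the records it missed.
replayed = shard.read_after(checkpoint=0, now=7200)

# A new worker starting from the beginning more than 24 hours after
# record "a" arrived no longer sees it.
late = shard.read_after(checkpoint=-1, now=RETENTION_SECONDS + 1)
print(replayed, late)
```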
Data Processing
Emerging Architecture Data Streams Spark Storm KCL Streaming Analytics Notifications & Alerts APIs Dashboards/ visualizations Real Time Micro Batch Data Archive DW Hadoop Batch Analysis Dashboards/ visualizations Deep Learning Batch
Real-time: Event-based processing Producer Amazon Kinesis Kinesis Storm Spout Apache Storm ElastiCache (Redis) Node.js Client (D3) http://blogs.aws.amazon.com/bigdata/post/tx36lyscy2r0a9b/implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
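The sliding-window idea behind this pipeline can be sketched in plain Python, independent of Storm or Redis. The class below is a hypothetical illustration that assumes events arrive in non-decreasing timestamp order; it is not the code from the linked blog post.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Count keys seen during the last window_seconds (illustrative only)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, key), oldest first
        self._counts = Counter()

    def add(self, timestamp, key):
        """Record one event and drop events that left the window."""
        self._events.append((timestamp, key))
        self._counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, old_key = self._events.popleft()
            self._counts[old_key] -= 1
            if self._counts[old_key] == 0:
                del self._counts[old_key]

    def top(self, n=10):
        """Return the n most frequent keys currently in the window."""
        return self._counts.most_common(n)

window = SlidingWindowCounter(window_seconds=10)
for ts, key in [(0, "x"), (5, "y"), (8, "x"), (12, "y")]:
    window.add(ts, key)
print(window.top(1))  # the event at t=0 has expired by t=12
```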
Micro-Batches: Drip feeding the data http://blogs.aws.amazon.com/bigdata/post/tx2anln1pgeldju/best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
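A micro-batcher buffers records and flushes them when the batch is large enough or old enough. The sketch below is a generic illustration under assumed thresholds; `MicroBatcher`, its parameters, and the `flush` callback are hypothetical, and a real loader would write each batch to S3 and load it into Redshift rather than append it to a list.

```python
class MicroBatcher:
    """Buffer records and flush by size or age (illustrative only)."""

    def __init__(self, flush, max_records=500, max_age_seconds=60):
        self._flush = flush              # callback that receives a list of records
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self._buffer = []
        self._opened_at = None

    def add(self, record, now):
        """Add a record; flush if the batch is full or too old."""
        if not self._buffer:
            self._opened_at = now
        self._buffer.append(record)
        too_big = len(self._buffer) >= self.max_records
        too_old = now - self._opened_at >= self.max_age_seconds
        if too_big or too_old:
            self.flush_now()

    def flush_now(self):
        """Hand the current batch to the callback and start a new one."""
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []

batches = []
batcher = MicroBatcher(batches.append, max_records=3, max_age_seconds=60)
for record, now in [(1, 0), (2, 1), (3, 2), (4, 10), (5, 75)]:
    batcher.add(record, now)
print(batches)
```

The first batch flushes on size, the second on age.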
Offline Batch: Hadoop for discovery Offline Analysis Producer Amazon Kinesis Kinesis Application S3 EMR Ad-hoc Analysis Amazon Kinesis Hive Pig EMR Cascading MapReduce http://blogs.aws.amazon.com/bigdata/post/tx36lyscy2r0a9b/implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
Putting it together Producer Amazon Kinesis Apache Storm DynamoDB App Client Real Time KCL Redshift BI Tools Micro Batch Batch KCL S3 EMR
AWS Lambda: An event-driven computing service for dynamic applications. AWS Lambda functions can be triggered by data stream updates from Amazon Kinesis and Amazon DynamoDB. For instance, you can watch for a pattern, such as an address, and trigger an alert.
A focus on functions, data and events S3 event notifications DynamoDB Streams Kinesis events Custom events Cloud functions
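The "watch for a pattern and trigger an alert" example can be sketched as a Lambda-style handler. Kinesis delivers records to Lambda with the payload base64-encoded under `record["kinesis"]["data"]`; everything else here is an assumption for illustration: the address regex is a deliberately naive placeholder, and the handler returns matches instead of publishing to a notification service.

```python
import base64
import re

# Hypothetical, deliberately simple pattern for a street address.
ADDRESS_PATTERN = re.compile(r"\b\d+ [A-Z][a-z]+ (Street|Avenue|Road)\b")

def handler(event, context):
    """Decode each Kinesis record and collect payloads that match."""
    alerts = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        payload = base64.b64decode(kinesis["data"]).decode("utf-8")
        if ADDRESS_PATTERN.search(payload):
            alerts.append({
                "sequenceNumber": kinesis["sequenceNumber"],
                "payload": payload,
            })
    return alerts

def _record(seq, text):
    """Build a fake Kinesis record for local testing."""
    data = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {"kinesis": {"sequenceNumber": seq, "partitionKey": "k", "data": data}}

fake_event = {"Records": [
    _record("1", "hello world"),
    _record("2", "ship to 12 Main Street"),
]}
alerts = handler(fake_event, None)
print(alerts)
```

In a deployed function, the alert step would typically call a notification service instead of returning a list.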
Putting AWS Lambda to work Server-free back-end Data triggers IoT Stream processing Indexing & synchronization
AWS Lambda for reactive computing Photo bucket S3 Extract Metadata Cloud Function Metadata DynamoDB Trending Cloud Function Trending DynamoDB Notify Cloud Function SNS Push notification
Processing Events from Kinesis: Write millions of events from Kinesis into Elasticsearch with only 60 lines of code!!! https://gist.github.com/tylr/e8baf45c07ced23ef013 http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser.html
Customer deployments on AWS
GREE International re:Invent 2014 GAM301 - Real-Time Game Analytics with Amazon Kinesis, Redshift, and DynamoDB Session: https://www.youtube.com/watch?v=elpwlj6yi44 Slides: http://www.slideshare.net/amazonwebservices/gam301-realtime-game-analytics-with-amazon-kinesisamazon-redshift-and-amazon-dynamodb-awsreinvent-2014
Key Requirements for Analytics. Initial Requirements: data collection & streaming to database; zero data loss; zero data corruption; guaranteed data delivery. New Requirements: near real-time data latency; real-time ad-hoc analysis; ease of adding consumers; managed service.
Data Collection. Source of Data: mobile devices, game servers, ad networks. Data Sizes: size of event ~1 KB; 500M+ events/day; 500 GB+/day and growing; JSON format.
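The data sizes on this slide are consistent with each other, as a quick calculation shows. The sketch below also estimates a minimum shard count for the average rate, assuming per-shard write limits of 1,000 records/s and 1 MB/s; those limits are an assumption not stated in the deck and should be checked against current Kinesis documentation. Peak traffic would need more headroom than the average suggests.

```python
import math

events_per_day = 500_000_000   # 500M+ events/day (from the slide)
event_size_bytes = 1_000       # ~1 KB per event, decimal units
seconds_per_day = 24 * 60 * 60

bytes_per_day = events_per_day * event_size_bytes
gb_per_day = bytes_per_day / 1e9

avg_events_per_second = events_per_day / seconds_per_day
avg_bytes_per_second = bytes_per_day / seconds_per_day

# Assumed per-shard write limits (not from the deck).
SHARD_RECORDS_PER_SECOND = 1_000
SHARD_BYTES_PER_SECOND = 1_000_000

min_shards = max(
    math.ceil(avg_events_per_second / SHARD_RECORDS_PER_SECOND),
    math.ceil(avg_bytes_per_second / SHARD_BYTES_PER_SECOND),
)

print(f"{gb_per_day:.0f} GB/day")
print(f"{avg_events_per_second:.0f} events/s on average")
print(f"{avg_bytes_per_second / 1e6:.2f} MB/s on average")
print(f"at least {min_shards} shards for the average rate")
```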
Architecture
SocialMetrix re:Invent 2014 ARC202: Real-World Real-Time Analytics Session: https://www.youtube.com/watch?v=nia33zwfa8e Slides: http://www.slideshare.net/zer0/arc202-arc202-real-world-real-time-analytics20141109mhfinaledit
Drivers for architecture evolution More customers, bigger customers Add new features Keep costs under control
Requirements at 4th iteration Monitor millions of social media profiles Make data accessible (exploration, PoC) Improve UI response times Testing our data pipelines Reprocessing (faster)
Architecture
[Chart: Cost over Architecture. Series: Costs and Active Customers, across architecture iterations #1 to #4]
THANK YOU!!! http://aws.amazon.com/big-data