Analytics in the Cloud
Peter Sirota, GM, Elastic MapReduce
Data-Driven Decision Making
- Data is the new raw material for any business, on par with capital, people, and labor.
What is Big Data?
- Terabytes of semi-structured log data in which businesses want to:
  - find correlations / perform pattern matching
  - generate recommendations
  - calculate advanced statistics (e.g., TP99)
- Twitter Firehose: 50 million tweets per day, growing 1,400% per year. How can advertisers drink from it?
- Social graphs: value increases with exponential growth in data connections
- Big Data is full of valuable, unanswered questions!
Why is Big Data Hard (and Getting Harder)?
- Today's data warehouses:
  - need to consolidate data from multiple sources, in multiple formats, across multiple businesses
  - face unconstrained growth of this business-critical information
- Today's users:
  - expect faster response times on fresher data
  - find sampling not good enough, and history is important
  - demand inexpensive experimentation with new data
  - are becoming increasingly sophisticated
- Data scientists find that current systems don't scale (and weren't meant to):
  - long lead times to provision more infrastructure
  - specialized DB expertise required
  - expensive and inelastic solutions
We need tools built specifically for Big Data!
What is this thing called Hadoop?
- Dealing with Big Data requires two things:
  - distributed, scalable storage
  - inexpensive, flexible analytics
- Apache Hadoop is an open-source software platform that addresses both needs:
  - includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
  - uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets (see the word-count sketch below)
- Key benefits:
  - Affordable: cost per TB is a fraction of traditional options
  - Proven at scale: numerous petabyte implementations in production; linear scalability
  - Flexible: data can be stored with or without a schema
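Word count is the canonical way to see MapReduce in action. The minimal sketch below (not from the deck) is in the Hadoop Streaming style, where mapper and reducer read stdin and write stdout; Hadoop handles the shuffle that groups each word's counts onto one reducer.

```python
#!/usr/bin/env python
# Minimal word count in the Hadoop Streaming style: the mapper emits
# one "word<TAB>1" pair per word; Hadoop sorts by key, so the reducer
# sees each word's pairs contiguously and can sum them in one pass.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    # Invoked as "wordcount.py map" or "wordcount.py reduce".
    mapper() if sys.argv[1:] == ["map"] else reducer()
```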
RDBMS vs. MapReduce/Hadoop
- RDBMS:
  - predefined schema
  - strategic data placement for query tuning
  - exploits indexes for fast retrieval
  - SQL only
  - doesn't scale linearly
- MapReduce/Hadoop:
  - no schema required
  - random data placement
  - fast scans of the entire dataset, with uniform query performance
  - scales linearly for reads and writes
  - supports many languages, including SQL
They are complementary technologies.
Why Amazon Elastic MapReduce?
- Managed Apache Hadoop web service
  - monitors thousands of clusters per day
  - use cases span from university students to the Fortune 50
- Reduces the complexity of Hadoop management
  - handles node provisioning, customization, and shutdown
  - tunes Hadoop to your hardware and network
  - provides tools to debug and monitor your Hadoop clusters
- Provides tight integration with AWS services (see the launch sketch below)
  - improved performance working with S3
  - automatic re-provisioning on node failure
  - dynamic expanding/shrinking of cluster size
  - Spot integration
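To make "managed" concrete, here is a hedged sketch of starting a job flow with boto, the Python AWS library of the era; the bucket, scripts, and instance choices are hypothetical, and parameter names may differ across boto versions.

```python
# Hedged sketch: launch an EMR job flow with boto. The bucket names
# and the wordcount.py script are hypothetical.
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

step = StreamingStep(
    name="Word count",
    mapper="s3://my-bucket/wordcount.py map",
    reducer="s3://my-bucket/wordcount.py reduce",
    input="s3://my-bucket/input/",
    output="s3://my-bucket/output/",
)

# EMR provisions the nodes, installs and tunes Hadoop, runs the step,
# and (by default) shuts the cluster down when the work is done.
jobflow_id = conn.run_jobflow(
    name="Example cluster",
    log_uri="s3://my-bucket/logs/",
    steps=[step],
    num_instances=4,
    master_instance_type="m1.large",
    slave_instance_type="m1.large",
)
print(jobflow_id)
```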
Elastic MapReduce Key Features
- Simplified cluster configuration/management
  - resize running job flows
  - support for EIP/IAM/tagging
  - workload-specific configurations
  - bootstrap actions (see the sketch below)
- Enhanced monitoring/debugging
  - free CloudWatch metrics/alarms
  - Hadoop metrics in the console
  - Ganglia support
- Improved performance
  - S3 multipart upload
  - Cluster Compute instances
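Bootstrap actions are scripts that run on every node before Hadoop starts, which is how workload-specific configurations get applied. A hedged boto sketch follows; the configure-hadoop script path and the tuning value are illustrative.

```python
# Hedged sketch: attach a bootstrap action that tweaks a Hadoop setting
# on every node before the daemons start. Script path and the tuned
# value are illustrative.
import boto.emr
from boto.emr.bootstrap_action import BootstrapAction

action = BootstrapAction(
    "Tune Hadoop",
    "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
    ["-m", "mapred.tasktracker.map.tasks.maximum=4"],
)

conn = boto.emr.connect_to_region("us-east-1")
jobflow_id = conn.run_jobflow(
    name="Cluster with bootstrap action",
    log_uri="s3://my-bucket/logs/",   # hypothetical bucket
    bootstrap_actions=[action],
    num_instances=4,
)
```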
Analytics Use Cases
- Targeted advertising / clickstream analysis
- Data warehousing applications
- Bioinformatics (genome analysis)
- Financial simulation (Monte Carlo simulation)
- File processing (e.g., resizing JPEGs)
- Web indexing
- Data mining and BI
Apache Hive: Data Warehouse for Hadoop
- Open-source project started at Facebook
- Turns data on Hadoop into a virtually limitless data warehouse
- Provides data summarization, ad hoc querying, and analysis
- Enables SQL-like queries on structured and unstructured data
  - e.g., arbitrary field separators are possible, such as "," in CSV file formats (see the sketch below)
- Inherits the linear scalability of Hadoop
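For a flavor of the SQL-like interface, here is the kind of HiveQL one might submit as a Hive step, held in a Python string to keep these examples in one language; the table, columns, and bucket are hypothetical. Note that the field separator is declared as a table property, so Hive maps a schema onto raw CSV files at query time.

```python
# Hypothetical HiveQL for raw CSV logs in S3: declare a schema over the
# files (including the "," separator), then query them like a table.
hive_script = """
CREATE EXTERNAL TABLE impressions (
  ad_id   STRING,
  user_id STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/impression-logs/';

SELECT ad_id, COUNT(*) AS views
FROM impressions
GROUP BY ad_id
ORDER BY views DESC
LIMIT 10;
"""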
AWS Data Warehousing Architecture
Elastic Data Warehouse
- Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
- Reduce costs by increasing server utilization
- Improve performance during high-usage periods
- Example cycle: the data warehouse runs at a steady state, expands to 25 instances for batch processing, then shrinks back to 9 instances (see the resize sketch below)
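A hedged sketch of the resize step with boto, assuming a resizable task instance group with a known ID; the modify_instance_groups call is an assumption about boto's EMR module of the era.

```python
# Hedged sketch: resize a running job flow's task instance group.
# The group ID is hypothetical; the instance counts are the slide's.
import boto.emr

conn = boto.emr.connect_to_region("us-east-1")

# Expand to 25 instances for overnight batch processing...
conn.modify_instance_groups(["ig-EXAMPLETASK"], [25])
# ...then shrink back to 9 for steady-state daytime queries.
conn.modify_instance_groups(["ig-EXAMPLETASK"], [9])
```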
Reducing Costs with Spot Instances
- Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
- Scenario #1 (On-Demand only), job flow duration 14 hours:
  - cost without Spot: 4 instances * 14 hrs * $0.50 = $28
- Scenario #2 (adding Spot), job flow duration 7 hours:
  - cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 Spot instances * 7 hrs * $0.25 = $8.75; total = $22.75
- Time savings: 50%; cost savings: ~19% (worked through below)
- Other EMR + Spot use cases:
  - run the entire cluster on Spot for the biggest cost savings
  - reduce the cost of application testing
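The same comparison worked through in code, to keep the arithmetic honest:

```python
# The slide's cost comparison, verified in code. Prices are the
# slide's per-instance-hour figures.
on_demand, spot = 0.50, 0.25

cost_without_spot = 4 * 14 * on_demand              # $28.00
cost_with_spot = 4 * 7 * on_demand + 5 * 7 * spot   # $14.00 + $8.75 = $22.75

print(cost_without_spot)                       # 28.0
print(cost_with_spot)                          # 22.75
print(1 - cost_with_spot / cost_without_spot)  # 0.1875, i.e. ~19% savings
```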
Monitoring Clusters with CloudWatch
- Free CloudWatch metrics and alarms
- Track Hadoop job progress
- Alarm on degradations in cluster health
- Monitor aggregate Elastic MapReduce usage (see the sketch below)
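A hedged sketch of pulling one of the free EMR metrics out of CloudWatch with boto; the job flow ID is hypothetical, and the metric and dimension names are assumptions about the integration of the era.

```python
# Hedged sketch: read an EMR cluster metric from CloudWatch. The job
# flow ID is hypothetical; metric/namespace names are assumptions.
from datetime import datetime, timedelta
import boto.ec2.cloudwatch

cw = boto.ec2.cloudwatch.connect_to_region("us-east-1")

now = datetime.utcnow()
datapoints = cw.get_metric_statistics(
    period=300,
    start_time=now - timedelta(hours=1),
    end_time=now,
    metric_name="RunningMapTasks",
    namespace="AWS/ElasticMapReduce",
    statistics=["Average"],
    dimensions={"JobFlowId": "j-EXAMPLE12345"},
)
for point in datapoints:
    print(point["Timestamp"], point["Average"])
```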
Big Data Ecosystem and Tools
We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
- Business intelligence: MicroStrategy, Pentaho
- Analytics: Datameer, Karmasphere, Quest
- Open source: Ganglia, SQuirreL SQL
Resources
- Amazon Elastic MapReduce: aws.amazon.com/elasticmapreduce
- Articles: aws.amazon.com/articles/elastic-mapreduce
- Forum: forums.aws.amazon.com/forum.jspa?forumid=52