on AWS Services Overview Bernie Nallamotu Principal Solutions Architect
So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share them.
Challenges start at relatively small volumes (scale shown: 100 GB to 1,000 PB)
Unconstrained data growth (scale: GB, TB, PB, EB, ZB). 95% of the 1.2 zettabytes of data in the digital universe is unstructured; 70% of this is user-generated content. Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% from 2008 to 2012. Source: IDC
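To make the 62% CAGR figure concrete, here is a minimal sketch of the compounding arithmetic. It assumes that "2008 to 2012" means four annual compounding periods, which the slide does not state explicitly; under that assumption the volume grows roughly 6.9 times.

```python
# Compound annual growth: volume_n = volume_0 * (1 + rate) ** years
CAGR = 0.62          # 62% per year, from the slide (source: IDC)
YEARS = 2012 - 2008  # assumption: four annual compounding periods


def growth_factor(rate: float, years: int) -> float:
    """Return the total multiplier after compounding `rate` for `years` years."""
    return (1.0 + rate) ** years


factor = growth_factor(CAGR, YEARS)
print(f"Growth factor over {YEARS} years: {factor:.2f}x")
```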
Where does it come from? Web sites: blogs, reviews, emails, pictures. Social graphs: Facebook, LinkedIn, contacts. Application server logs: web sites, games. Sensor data: weather, water, smart grids. Images/videos: traffic, security cameras. Twitter: 50m tweets/day, 1,400% growth/year.
Why AWS and big data? Storage Innovation: Amazon DynamoDB, Amazon Redshift, Amazon S3, HPC, Spot, EMR
Services. Amazon EMR (Elastic MapReduce): hosted Hadoop framework. AWS Data Pipeline: move data among AWS services and on-premises data sources. Amazon Redshift: petabyte-scale data warehouse service. AWS Worldwide Public Sector Team
How do you get your slice of it? AWS Direct Connect: dedicated low latency bandwidth. AWS Import/Export: physical media shipping. Queuing: highly scalable event buffering. AWS Storage Gateway: sync local storage to the cloud.
Where do you put your slice of it? Amazon Relational Database Service: fully managed database (MySQL, Oracle, MS SQL Server, PostgreSQL). Amazon SimpleDB: NoSQL, schema-less, smaller datasets. Amazon DynamoDB: NoSQL, schema-less, provisioned throughput database. Amazon S3: object datastore, up to 5 TB per object, 99.999999999% durability.
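As a rough illustration of what eleven nines of durability means, the sketch below computes the expected object loss for a store of a given size. It assumes the figure is an annual, per-object design target; the object count of 10,000,000 is a hypothetical value chosen for illustration.

```python
from fractions import Fraction

# Assumption: 99.999999999% is an annual, per-object durability target,
# so the annual loss probability per object is 1 in 10**11.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)

OBJECT_COUNT = 10_000_000  # hypothetical number of stored objects

# Expected number of objects lost per year across the whole store.
expected_losses_per_year = OBJECT_COUNT * ANNUAL_LOSS_PROBABILITY

# Average number of years between single-object losses.
years_per_expected_loss = 1 / expected_losses_per_year

print(f"Expected losses per year: {float(expected_losses_per_year)}")
print(f"About one object lost every {int(years_per_expected_loss):,} years")
```

Exact fractions are used instead of floats so that the very small loss probability is not distorted by rounding.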
Where do you put your slice of it? Amazon Glacier: long-term cold storage, from $0.01 per GB/month, 99.999999999% durability.
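A quick sketch of what the quoted Glacier price implies for a sizeable archive. The slide says "from" $0.01, so this is a lower bound on storage cost only; the 100 TB archive size and the 1 TB = 1,024 GB conversion are assumptions made for illustration, and retrieval or request charges are not modeled.

```python
from decimal import Decimal

PRICE_PER_GB_MONTH = Decimal("0.01")  # "from" price quoted on the slide
GB_PER_TB = 1024                      # assumption: binary terabytes


def monthly_storage_cost(terabytes: int) -> Decimal:
    """Return the monthly storage cost in dollars for an archive of `terabytes` TB."""
    return PRICE_PER_GB_MONTH * terabytes * GB_PER_TB


archive_tb = 100  # hypothetical archive size
print(f"{archive_tb} TB costs about ${monthly_storage_cost(archive_tb)} per month")
```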
How quickly do you need to read it? Amazon DynamoDB (single-digit ms): social scale applications, provisioned throughput performance, flexible consistency models. Amazon S3 (10s-100s ms): any object, any app; 99.999999999% durability; objects up to 5 TB in size. Amazon Glacier (<5 hours): media & asset archives, extremely low cost, S3 levels of durability. Axes: performance, scale, price.
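The latency bands on this slide can be read as a simple selection rule. The sketch below is one possible reading, not an AWS API: it maps the three bands, taken in order, to DynamoDB, S3, and Glacier, uses rough upper bounds for each band, and assumes that slower tiers are cheaper, so it picks the slowest tier that still fits a latency budget. The `Tier` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_read_latency_s: float  # rough upper bound read off the slide


# Ordered from fastest to slowest; the numeric bounds are assumptions.
TIERS = [
    Tier("Amazon DynamoDB", 0.010),      # "single digit ms"
    Tier("Amazon S3", 1.0),              # "10s-100s ms"
    Tier("Amazon Glacier", 5 * 3600.0),  # "<5 hours"
]


def slowest_tier_within(budget_s: float) -> Tier:
    """Pick the slowest (assumed cheapest) tier whose latency bound fits the budget."""
    candidates = [t for t in TIERS if t.max_read_latency_s <= budget_s]
    if not candidates:
        raise ValueError("no tier meets the latency budget")
    return max(candidates, key=lambda t: t.max_read_latency_s)


for budget in (0.05, 2.0, 24 * 3600.0):
    print(f"budget {budget} s -> {slowest_tier_within(budget).name}")
```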
Operate at any scale Unlimited data Performance Scale Price
Data has gravity (source: http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/)
and inertia at volume
easier to move applications to the data
Bring compute capacity to the data. Very large dataset seeks strong & consistent compute for short-term relationship, possibly longer.
Flexible compute resources, on demand
Amazon Elastic Compute Cloud (EC2): basic unit of compute capacity. Range of CPU, memory & local disk options. 27 instance types available, from micro through cluster compute to SSD backed. Vertical scaling. From $0.02/hr.

Feature            Details
Flexible           Run Windows or Linux distributions
Scalable           Wide range of instance types from micro to cluster compute
Machine Images     Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control       Full root or administrator rights
VM Import/Export   Import and export VM images to transfer configurations in and out of EC2
Monitoring         Publishes metrics to CloudWatch
Inexpensive        On-demand, Reserved and Spot instance types
Secure             Full firewall control via Security Groups
Elastic capacity as you need it: On and Off, Fast Growth, Variable peaks, Predictable peaks
Elastic capacity as you need it: On and Off, Fast Growth, Variable peaks, Predictable peaks. With fixed capacity: WASTE or CUSTOMER DISSATISFACTION.
Elastic capacity as you need it [Chart: capacity over time, comparing traditional IT capacity, elastic cloud capacity, and your IT needs]
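The contrast between traditional and elastic capacity can be sketched numerically. In the example below, all demand and capacity numbers are hypothetical; unused capacity stands in for the WASTE label on the earlier slide, and unmet demand stands in for CUSTOMER DISSATISFACTION.

```python
def waste_and_shortfall(demand, capacity):
    """Return (unused capacity, unmet demand) summed over all periods."""
    waste = sum(max(c - d, 0) for d, c in zip(demand, capacity))
    shortfall = sum(max(d - c, 0) for d, c in zip(demand, capacity))
    return waste, shortfall


# Hypothetical demand per period, in arbitrary capacity units.
demand = [20, 35, 80, 120, 60, 30]

# Traditional IT: one fixed level provisioned for every period.
fixed_capacity = [80] * len(demand)

# Idealized elastic capacity: provisioned level tracks demand exactly.
elastic_capacity = list(demand)

print("fixed:  ", waste_and_shortfall(demand, fixed_capacity))
print("elastic:", waste_and_shortfall(demand, elastic_capacity))
```

Real elastic capacity lags demand somewhat, so the zero waste and zero shortfall of the idealized case is a best-case bound rather than a prediction.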
From one instance
to thousands
Why AWS and big data? Storage Innovation: DynamoDB, S3, Redshift, HPC, Spot, EMR
AWS EMR Elastic MapReduce
Amazon Elastic MapReduce: a key tool in the toolbox to help with challenges. Makes possible analytics processes previously not feasible. Cost effective when leveraged with EC2 spot market. Broad ecosystem of tools to handle specific use cases.
What is EMR? Hadoop-as-a-service. Map-Reduce engine, integrated with tools. Massively parallel. Integrated to AWS services. Cost effective AWS wrapper.
HDFS: reliable storage. MapReduce: data analysis.
EC2 instance: Input file -> map -> reduce -> Output file
Three EC2 instances in parallel, each: Input file -> map -> reduce -> Output file
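The picture of several instances each running map and reduce over their own input file can be sketched on a single machine. This is only a stand-in, not how EMR or Hadoop is implemented: threads play the role of EC2 instances, the input lines are hypothetical, word counting is an arbitrary choice of job, and the final merge of per-instance outputs is an added step that the slide does not show.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_reduce_one_file(lines):
    """Stand-in for one instance: map lines to words, reduce to word counts."""
    mapped = (word.lower() for line in lines for word in line.split())  # map
    return Counter(mapped)                                              # reduce


# Hypothetical input files, one per simulated instance.
input_files = [
    ["big data on aws", "data has gravity"],
    ["bring compute to the data"],
    ["elastic capacity as you need it", "big data"],
]

# Each worker processes its own input independently of the others.
with ThreadPoolExecutor(max_workers=len(input_files)) as pool:
    output_files = list(pool.map(map_reduce_one_file, input_files))

# Added step (not on the slide): merge the per-instance outputs.
combined = sum(output_files, Counter())
print(combined.most_common(3))
```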
Map? Reduce?

Input:
Person   Start     End
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

map:
Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10

reduce:
Person   Total
Alice    25
Bob      49
Charlie  63
David    29
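The Person/Start/End example above can be reproduced with a few lines of plain Python. This is a minimal single-process sketch of the map and reduce steps, not EMR or Hadoop code; the function names are hypothetical, and the durations are taken to be seconds, which the slide implies but does not state.

```python
from collections import defaultdict
from datetime import datetime

# Input records copied from the slide: (person, start, end).
RECORDS = [
    ("Bob", "00:44:48", "00:45:11"), ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"), ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"), ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"), ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"), ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"), ("Alice", "23:42:01", "23:42:11"),
]

TIME_FORMAT = "%H:%M:%S"


def map_duration(record):
    """Map step: turn one (person, start, end) record into (person, seconds)."""
    person, start, end = record
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(delta.total_seconds())


def reduce_totals(pairs):
    """Reduce step: sum the durations for each person."""
    totals = defaultdict(int)
    for person, seconds in pairs:
        totals[person] += seconds
    return dict(totals)


durations = [map_duration(record) for record in RECORDS]
totals = reduce_totals(durations)
print(sorted(totals.items()))
# prints [('Alice', 25), ('Bob', 49), ('Charlie', 63), ('David', 29)]
```

Because each record is mapped independently, the map step can be spread across many machines, and only the per-person sums need to be brought together in the reduce step.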
AWS Elastic MapReduce Architecture
[Architecture diagram, built up over six slides: Amazon EMR with HDFS; analytics languages (Pig); data management; connected services Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline]
Useful Resources & Links
AWS: http://aws.amazon.com/big-data
AWS HPC: http://aws.amazon.com/hpc-applications
Architecture Center: http://aws.amazon.com/architecture
Documentation: http://aws.amazon.com/documentation
Security Center: http://aws.amazon.com/security
Whitepapers: http://aws.amazon.com/whitepapers
Resources: http://aws.amazon.com/resources
Case Studies: http://aws.amazon.com/solutions/case-studies
Solution Providers: http://aws.amazon.com/solutions/global-solution-providers
Calculator: http://calculator.s3.amazonaws.com/calc5.html
TCO Calculator: http://aws.amazon.com/tco-calculator
AWS Blog: http://aws.typepad.com
The Power of 60: http://www.powerof60.com
Thank you! Tim Bixler Manager, Federal Solutions Architecture tbixler@amazon.com