Building your Big Data Architecture on Amazon Web Services Abhishek Sinha @abysinha sinhaar@amazon.com
AWS Services Deployment & Administration Application Services Compute Storage Database Networking AWS Global Infrastructure
AWS Global Infrastructure: 9 Regions, 25 Availability Zones, continuous expansion
$5.2B retail business; 7,800 employees; a whole lot of servers. Every day, AWS adds enough server capacity to power that whole $5B enterprise.
Powering the Most Popular Internet Businesses
We have partners and technologies ready to help
Solving Problems for Organizations Around the World
Value proposition of the AWS cloud. No upfront investment: replace capital expenditure with variable expense. Speed and agility: infrastructure in minutes, not weeks. Low ongoing cost: customers leverage our economies of scale (37 price reductions). Focus on business: not undifferentiated heavy lifting. Flexible capacity: no need to guess capacity requirements and overprovision. Global reach: go global in minutes and reach a global audience.
Gartner Magic Quadrant for Cloud Infrastructure as a Service (August 19, 2013) Gartner Magic Quadrant for Cloud Infrastructure as a Service, Lydia Leong, Douglas Toombs, Bob Gill, Gregor Petri, Tiny Haynes, August 19, 2013. This Magic Quadrant graphic was published by Gartner, Inc. as part of a larger research note and should be evaluated in the context of the entire report. The Gartner report is available upon request from Steven Armstrong (asteven@amazon.com). Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
An engineer's definition: when your data sets become so large that you have to start innovating in how you collect, store, organize, analyze, and share them
Generation Collection & storage Analytics & computation Collaboration & sharing
Lower cost, higher throughput Generation Collection & storage Analytics & computation Collaboration & sharing
Lower cost, higher throughput Generation Collection & storage Highly constrained Analytics & computation Collaboration & sharing
Data volume: generated data versus data available for analysis. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares.
Amazon Web Services helps remove constraints
Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
More than 25 million streaming members; 50 billion events per day; 30 million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; device location, time, day, week, etc.; social data
10 TB of streaming data per day
Who buys video games?
Per day: 3.5 billion records, 13 TB of click stream logs, 71 million unique cookies
Today
Big Data tools Elastic MapReduce and Redshift
How does EMR work? Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc. Put the data into S3. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS.
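The "choose" and "launch" steps above can be sketched with the SDK. This is a minimal sketch, assuming the Python SDK (boto3) and its EMR `run_job_flow` call; the cluster name, bucket name, release label, instance type, and role names are hypothetical placeholders rather than values from the talk. The snippet only builds the request, and the actual call is left as a comment because it needs AWS credentials.

```python
def build_emr_request(name, node_count, node_type, log_bucket, applications,
                      keep_alive=True):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Every concrete value here is an illustrative assumption.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release (selects the Hadoop distribution)
        "LogUri": f"s3://{log_bucket}/emr-logs/",  # logs go to S3
        "Applications": [{"Name": app} for app in applications],  # Hive/Pig/etc.
        "Instances": {
            "MasterInstanceType": node_type,   # type of nodes
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,       # number of nodes
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(
    name="example-cluster",           # hypothetical
    node_count=5,
    node_type="m5.xlarge",            # hypothetical instance type
    log_bucket="my-example-bucket",   # hypothetical bucket
    applications=["Hive", "Pig"],
)

# To actually launch the cluster (requires credentials and the roles above):
# import boto3
# boto3.client("emr").run_job_flow(**request)
```

Input and output data would live under S3 paths that the job steps reference; those steps are omitted here to keep the sketch short.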
What can you run on EMR?
Resize nodes: you can easily add and remove nodes
A 10-node cluster running for 10 hours costs exactly the same as a 100-node cluster running for 1 hour
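The point above is plain node-hour arithmetic. Here is a minimal sketch, assuming billing is strictly per node-hour and the job parallelizes well enough to finish ten times faster on ten times the nodes; the hourly price is a hypothetical figure, not one from the talk.

```python
def cluster_cost(nodes, hours, price_per_node_hour):
    """Cost of a cluster billed per node-hour."""
    return nodes * hours * price_per_node_hour


PRICE_PER_NODE_HOUR = 0.50  # hypothetical price in US$

small_and_slow = cluster_cost(10, 10, PRICE_PER_NODE_HOUR)   # 100 node-hours
big_and_fast = cluster_cost(100, 1, PRICE_PER_NODE_HOUR)     # 100 node-hours

print(small_and_slow == big_and_fast)
```

Both clusters consume 100 node-hours, so the bill is identical; only the time to an answer changes.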
Big Data tools Elastic MapReduce and Redshift
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
Amazon Redshift. Parallelize and distribute everything: MPP, load, query, resize, backup, restore. Dramatically reduce I/O: direct-attached storage, large data block sizes, column data store, data compression, zone maps.
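To illustrate how zone maps cut I/O, here is a toy sketch of the general technique: keep the minimum and maximum value of each block of a column, and skip any block whose range cannot contain the value being searched for. This is a simplified model for illustration, not Redshift's actual implementation; the block size and data are made up.

```python
def build_zone_map(values, block_size):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_equal(blocks, zone_map, target):
    """Find rows equal to target, reading only blocks whose range can match."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, zone_map):
        if target < low or target > high:
            continue  # the zone map proves this block has no match: skip the read
        blocks_read += 1
        matches.extend(v for v in block if v == target)
    return matches, blocks_read


column = list(range(1000))  # a sorted column, where zone maps work best
blocks, zone_map = build_zone_map(column, block_size=100)
matches, blocks_read = scan_equal(blocks, zone_map, 742)

print(f"read {blocks_read} of {len(blocks)} blocks, found {matches}")
```

On sorted data the block ranges do not overlap, so most blocks are skipped. On unsorted data the ranges overlap and far fewer blocks can be skipped.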
Amazon Redshift. Protect operations: Redshift data is encrypted; continuously backed up to S3; automatic node recovery; transparent handling of disk failures. Simplify provisioning: create a cluster in minutes; automatic OS and software patching; scale up to 1.6PB with a few clicks and no downtime.
Amazon Redshift: start small and grow big. Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE; 1 node (2TB) or 2-32 node cluster (64TB). Eight Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE; 2-100 node cluster (1.6PB).
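The node limits above translate into a simple sizing calculation. The sketch below assumes raw capacity only (it ignores compression and any headroom you would leave in practice) and uses the per-node storage and cluster limits from the slide; the function and dictionary names are hypothetical.

```python
import math

# Per-node storage and cluster size limits as listed on the slide.
NODE_TYPES = {
    "XL": {"tb_per_node": 2, "min_nodes": 1, "max_nodes": 32},
    "8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def nodes_needed(data_tb, node_type):
    """Smallest node count of the given type that holds data_tb of raw data."""
    spec = NODE_TYPES[node_type]
    nodes = max(spec["min_nodes"], math.ceil(data_tb / spec["tb_per_node"]))
    if nodes > spec["max_nodes"]:
        raise ValueError(
            f"{data_tb} TB exceeds the {node_type} limit of "
            f"{spec['max_nodes'] * spec['tb_per_node']} TB"
        )
    return nodes


print(nodes_needed(1, "XL"))      # fits on a single XL node
print(nodes_needed(40, "XL"))     # a multi-node XL cluster
print(nodes_needed(500, "8XL"))   # needs the larger node type
```

A data set too large for the XL limit raises an error, which is the cue to move to 8XL nodes.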
Amazon Redshift: easy to provision and scale; no upfront costs, pay as you go; high performance at a low price; open and flexible with support for popular BI tools.
Price per hour for an XL node (US$), Sydney / Singapore / Tokyo: On-Demand $1.25; 1-Year Reservation $0.75; 3-Year Reservation $0.45
For example, 1 XL node reserved for 3 years: $0.45 x number of hours in a month (about 730) = roughly $330 per month. A 1 XL node cluster gives you: 2 cores, 16 GB RAM, 2 TB disk, plus 2 TB of storage in S3 for backups and snapshots.
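The monthly figure is the hourly price multiplied by the hours in the month, so it varies a little with the month's length. Here is a minimal sketch of that arithmetic, taking the $0.45 reserved hourly figure from the pricing slide at face value; the function name is hypothetical.

```python
HOURS_PER_DAY = 24


def monthly_cost(price_per_hour, days_in_month, nodes=1):
    """Cost in US$ of running `nodes` nodes around the clock for one month."""
    return price_per_hour * HOURS_PER_DAY * days_in_month * nodes


RESERVED_3YR_XL = 0.45  # US$ per hour, from the pricing slide

for days in (28, 30, 31):
    print(days, "days:", round(monthly_cost(RESERVED_3YR_XL, days), 2))
```

For a single XL node this lands between roughly $300 and $335 per month, depending on the number of days.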
Big Data + Cloud = Awesome Combination. Big data: potentially massive datasets; iterative, experimental style of data manipulation and analysis; frequently not a steady-state workload, with peaks and valleys; data is a combination of structured and unstructured data in many formats. AWS Cloud: massive, virtually unlimited capacity; iterative, experimental style of infrastructure deployment and usage; at its most efficient with highly variable workloads; tools for managing structured and unstructured data.
THANK YOU Please come visit us at the Solution Architects Corner at AWS booth sinhaar@amazon.com @abysinha