Big Data for everyone Democratizing big data with the cloud Steffen Krause Technical Evangelist @AWS_Aktuell skrause@amazon.de
Does this Data make me look big?
Overview Designing big data solutions in the cloud Not the only way to do it (but one that we have seen)
Big Data with AWS (Storage, Big Data, Compute): challenges start at relatively small volumes, on a scale from 100 GB to 1,000 PB
Big Data with AWS (Storage, Big Data, Compute): when data sets and data analytics need to scale to the point that you have to start innovating around how to collect, store, organize, analyze and share it
Invest in data centers?
Generation Collection & storage Analytics & computation Collaboration & sharing
Data has gravity (diagram: storage, compute, apps and data) http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
... and inertia at volume
... so it is easier to move applications to the data
S3 as a single source of truth S3 Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-incloud.html
Generation Collection & storage Analytics & computation Collaboration & sharing
Hadoop-based analysis: Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon S3, Amazon EMR
Amazon Elastic MapReduce (EMR)? EMR is Hadoop in the cloud
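For readers new to Hadoop, the MapReduce programming model that EMR runs can be sketched in a few lines of plain Python. This is an illustration of the model only, not of the EMR or Hadoop APIs. The function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit one (word, 1) pair per word in a single input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in-process (Hadoop distributes these steps)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "Big Data for everyone"]))
```

On a real cluster the map and reduce steps run on many machines at once, which is why adding instances shortens the job.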
1 instance for 100 hours = 100 instances for 1 hour
Small instance = $6
1 instance for 1000 hours = 1000 instances for 1 hour
Small instance = $60
When you turn off your cloud resources, you actually stop paying for them
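The arithmetic behind these slides can be written out as a small sketch. The rate of 6 cents per instance-hour is not stated on the slides. It is an assumption derived from "$6 for 100 instance-hours", and the sketch ignores any other charges or billing granularity.

```python
# Assumption: 6 cents per small-instance hour, implied by "$6 for 100 instance-hours".
RATE_CENTS_PER_INSTANCE_HOUR = 6


def cost_dollars(instances, hours, rate_cents=RATE_CENTS_PER_INSTANCE_HOUR):
    """Cost depends only on total instance-hours, not on how they are split."""
    return instances * hours * rate_cents / 100


if __name__ == "__main__":
    for instances, hours in [(1, 100), (100, 1), (1, 1000), (1000, 1)]:
        print(f"{instances} instance(s) x {hours} h = ${cost_dollars(instances, hours):.2f}")
```

The same budget buys either a long wait on one machine or a short wait on many, as long as the job parallelizes.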
SQL-based processing: Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon S3, Amazon EMR (pre-processing framework), Amazon Redshift (petabyte-scale columnar data warehouse)
Generation Collection & storage Analytics & computation Collaboration & sharing
Sharing results and visualizations: Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon S3, Amazon EMR, Amazon Redshift, business intelligence tools
The complete architecture: Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon S3, Amazon EMR, Amazon Redshift, AWS Data Pipeline, business intelligence tools, visualization tools, GIS tools, GIS tools on Hadoop
Use cases
Lesson 1: Don't leave your Amazon account logged in at home
Lesson 2: Use the data you have to drive proactive processes
Analyzing credit risk requires 5 million simulations. On AWS, the simulation time was reduced from 23 hours to 20 minutes.
3000 cores for risk analysis (Monte Carlo), 300 cores during the weekend [Chart: CPU cores, from 300 to 3000, by day of the week, Wed to Tue]
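Monte Carlo runs scale out well because each batch of simulations is independent of the others. The sketch below is a toy stand-in, not the customer's risk model. The portfolio size, default probability, loss limit and per-chunk seeding are all assumptions made for illustration.

```python
import random


def simulate_chunk(seed, n_sims, n_loans=50, p_default=0.02, loss_limit=3):
    """Count simulations in which more than loss_limit loans default.

    All parameters are hypothetical; a real credit risk model is far richer.
    """
    rng = random.Random(seed)  # one seeded generator per chunk keeps runs reproducible
    breaches = 0
    for _ in range(n_sims):
        defaults = sum(rng.random() < p_default for _ in range(n_loans))
        if defaults > loss_limit:
            breaches += 1
    return breaches


def estimate_breach_probability(total_sims, n_chunks):
    """Split the work into chunks; each chunk could run on its own core or instance."""
    sims_per_chunk = total_sims // n_chunks
    counts = [simulate_chunk(seed, sims_per_chunk) for seed in range(n_chunks)]
    return sum(counts) / (sims_per_chunk * n_chunks)


# Speedup reported on the slide: 23 hours down to 20 minutes.
SPEEDUP = (23 * 60) / 20

if __name__ == "__main__":
    print("estimated breach probability:", estimate_breach_probability(2000, 4))
    print("reported speedup factor:", SPEEDUP)
```

Here the chunks run one after another. On a cluster the list comprehension would be replaced by a distributed map, and only the small per-chunk counts need to be collected.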
In 60 minutes, Channel 4 can analyze and model in-session data to deliver highly targeted ads to viewers before a program ends. To get closer to growing video-on-demand (VOD) audiences and match them with advertisers, Channel 4 chose a cloud-based solution to help make sense of and monetize its unprecedented volumes of platform data.
Features powered by Amazon Elastic MapReduce: "People Who Viewed this Also Viewed", review highlights, autocomplete as you type on search, search spelling suggestions, top searches, ads. 200 Elastic MapReduce jobs per day, processing 3 TB of data.
SkillPages customer use case: everyone needs skilled people, at home, at work, in life, repeatedly
Data Architecture: web servers and internal web (Join via Facebook, Add a Skill Page, Invite Friends, Get Data), user action trace events, raw events, raw data (Amazon S3), EMR Hive scripts, aggregated data (Amazon S3), Amazon Redshift, data analyst (Excel, Tableau)
Process content: Process log files with regular expressions to parse out the info we need. Process cookies into useful searchable data such as Session, UserId, and API security token. Filter surplus info like internal Varnish logging.
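The slide describes this processing as Hive scripts on EMR. The same three steps (regex parsing, cookie extraction, filtering) can be sketched in Python as a stand-in. The log line layout, the cookie key `ApiToken`, and the `/internal/varnish` path prefix are hypothetical, since the slide does not show the actual log format.

```python
import re

# Hypothetical access-log layout: ip, timestamp, request, status, quoted cookie string.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
    r'"(?P<cookie>[^"]*)"'
)

# Hypothetical marker for internal Varnish traffic that should be dropped.
INTERNAL_PREFIX = "/internal/varnish"


def parse_cookie(cookie):
    """Turn 'Session=a; UserId=b' into a dict of searchable fields."""
    fields = {}
    for part in cookie.split(";"):
        if "=" in part:
            key, value = part.strip().split("=", 1)
            fields[key] = value
    return fields


def parse_line(line):
    """Return a flat record for one log line, or None if it should be filtered."""
    match = LOG_PATTERN.match(line)
    if match is None or match.group("path").startswith(INTERNAL_PREFIX):
        return None
    record = match.groupdict()
    record.update(parse_cookie(record.pop("cookie")))
    return record


if __name__ == "__main__":
    sample = ('203.0.113.7 - - [12/Mar/2013:10:15:32 +0000] '
              '"GET /skills/add HTTP/1.1" 200 '
              '"Session=abc123; UserId=42; ApiToken=tok-9"')
    print(parse_line(sample))
```

Records that come back as `None` are discarded. The remaining flat records are the kind of aggregated output that could then be loaded into a warehouse such as Amazon Redshift.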
Foursquare
"We found that Amazon Redshift offers the performance we needed while freeing us from the licensing costs of our previous solution. With Amazon Redshift and Tableau, anyone in the company can set up any queries they like, from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts have had in different areas. It's very flexible." Jon Hoffman, Software Engineer, Foursquare
[Charts: Gender (Female, Male); Age; "When do people go to a place?" for Gorilla Coffee, Gray's Papaya, Amorino]
Stack analysis and sharing
Application stack: Scala/Liftweb, Scala, Mongo/Postgres/flat files; WWW machines, API machines, databases; application code, batch jobs, logs; mongoexport, postgres dump, Flume
Data stack: Amazon S3 (database dumps, log files); Hadoop on Elastic MapReduce (Hive/Ruby/Mahout, MapReduce jobs); analytics dashboard
Everything that was a limited resource is now a programmable resource
Resources
Hadoop Technology and Use Cases: http://www.powerof60.com/ http://aws.amazon.com/de
Start with the Free Tier: http://aws.amazon.com/de/free/
US$25 in credits for new German customers: http://aws.amazon.com/de/campaigns/account/
Twitter: @AWS_Aktuell
Facebook: http://www.facebook.com/awsaktuell
Webinars: http://aws.amazon.com/de/about-aws/events/