Real Time Big Data Processing



Similar documents
SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Hadoop & Spark Using Amazon EMR

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

How to Leverage Cloud to Quickly Build Scalable Applications

Service Organization Controls 3 Report

Thing Big: How to Scale Your Own Internet of Things.

Building your Big Data Architecture on Amazon Web Services

Amazon Web Services Yu Xiao

BIG DATA TRENDS AND TECHNOLOGIES

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

So What s the Big Deal?

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Using Amazon EMR and Hunk to explore, analyze and visualize machine data

Information Builders Mission & Value Proposition

The Future of Data Management

Native Connectivity to Big Data Sources in MSTR 10

Amazon Web Services Annual ALGIM Conference. Tim Dacombe-Bird Regional Sales Manager Amazon Web Services New Zealand

Amazon Elastic Beanstalk

Service Organization Controls 3 Report

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Razvoj Java aplikacija u Amazon AWS Cloud: Praktična demonstracija

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Scalable Architecture on Amazon AWS Cloud

Big Data and Market Surveillance. April 28, 2014

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

Predictive Analytics with Storm, Hadoop, R on AWS

Analyzing Big Data with AWS

From Spark to Ignition:

Real-time Big Data Analytics with Storm

Expand Your Infrastructure with the Elastic Cloud. Mark Ryland Chief Solutions Architect Jenn Steele Product Marketing Manager

Big Data at Cloud Scale

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Building 1000 node cluster on EMR Manjeet Chayel

Cloud Big Data Architectures

Introduction to AWS in Higher Ed

ur skills.com

Large scale processing using Hadoop. Ján Vaňo

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Open source Google-style large scale data analysis with Hadoop

AWS Performance Tuning

Testing Big data is one of the biggest

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Deploying for Success on the Cloud: EBS on Amazon VPC. Phani Kottapalli Pavan Vallabhaneni AST Corporation August 17, 2012

CLOUD COMPUTING FOR THE ENTERPRISE AND GLOBAL COMPANIES Steve Midgley Head of AWS EMEA

HDP Hadoop From concept to deployment.

Big data blue print for cloud architecture

The Inside Scoop on Hadoop

Introduction to Amazon Web Services! Leo Senior Solutions Architect

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Amazon Web Services. Lawrence Berkeley LabTech Conference 9/10/15. Jamie Baker Federal Scientific Account Manager AWS WWPS

Big Data Analytics Nokia

Scalability in the Cloud HPC Convergence with Big Data in Design, Engineering, Manufacturing

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Using ArcGIS for Server in the Amazon Cloud

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Upcoming Announcements

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Luncheon Webinar Series May 13, 2013

Big Data Analytics - Accelerated. stream-horizon.com

Evolution from Big Data to Smart Data

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Next-Generation Cloud Analytics with Amazon Redshift

DLT Solutions and Amazon Web Services

Big Data and Industrial Internet

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

Big Data for everyone Democratizing big data with the cloud. Steffen Krause Technical

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Transforming the Telecoms Business using Big Data and Analytics

A Survey on Cloud Storage Systems

Deploying Database clusters in the Cloud

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Intro to AWS: Storage Services

BIG DATA What it is and how to use?

Amazon EC2 Product Details Page 1 of 5

Amazon Kinesis and Apache Storm

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

The Internet of Things and Big Data: Intro

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

Big Data and Analytics in Government

Big Data on Microsoft Platform

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Hadoop implementation of MapReduce computational model. Ján Vaňo

Transcription:

Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services

Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure

Global Infrastructure Region US-WEST (Oregon) EU-WEST (Ireland) GOV CLOUD ASIA PAC (Tokyo) US-EAST (Virginia) US-WEST (San Francisco) SOUTH AMERICA (Sao Paulo) ASIA PAC (Singapore) ASIA PAC (Sydney)

Global Infrastructure Availability Zone

Ingestion Integration

Elastic Block Store Availability High performance block storage device 99.99% 1GB to 1TB in size Durability Mount as drives to instances with 99.999999999% snapshot/cloning functionalities Is a Web Store Not a file system No Single Points of Failure Eventually consistent Simple Storage Service Highly scalable object storage for the internet 1 byte to 5TB in size 99.999999999% durability Paradigm Performance Redundancy Security Pricing Typical use case Limits IMAGE Object store Very Fast Across Availability Zones Public Key / Private Key $0.095/GB/month Write once, read many 100 Buckets, Unlimited Storage, 5TB Objects

Billions Objects in S3 2100 2000 1500 Peak Requests: 1.2 Million / Second 1300 1000 762 500 262 0 102 14 40 Q4 2007 Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 Today

Storage Lifecycle Integration Simple Storage Service Highly scalable object storage 1 byte to 5TB in size 99.999999999% durability Glacier Long term object archive Extremely low cost per gigabyte 99.999999999% durability

Complex Analytics Unstructured Data Parallel ETL

Analytics Elastic MapReduce Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure Elastic MapReduce Managed, elastic Hadoop (1.x & 2.x) cluster Integrates with S3, DynamoDB and Redshift Install Storm, Spark & Shark, Hive, Pig, Impala & End User Tools Automatically Support for Spot Instances Integrated HBase NOSQL Database

Analytics Orchestration Input: RDS Table Table: User-Demographics SQL Precondition: Select last_update from table > #{YY-MM-DD} Input: DynamoDB Table Table: User-Event-Data-#{year-month} Activity: EMR Transform Hive Query: user-metrics.hql Frequency: Daily Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure Data Pipeline Automatically Provision EC2 & EMR Resources Manage Dependencies & Scheduling Automatically Retry and Notify of Success & Failure Output: S3 file Path: s3://trend-data/#{year-month-day}.csv Success Notification: metrics@example.com Failure Notification: emr-admin@example.com Delay Notification: : emr-admin@example.com

Structured Data Management

Database RDS Dynamo DB Redshift Elasticache Deployment & Administration App Services Analytics Compute Storage Database Networking DynamoDB Provisioned throughput NoSQL database Fast, predictable, configurable performance Fully distributed, fault tolerant HA architecture Integration with EMR & Hive AWS Global Infrastructure

Database RDS Dynamo DB Redshift ElastiCache Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure Redshift Managed Massively Parallel Petabyte Scale Data Warehouse Streaming Backup/Restore to S3 Load data from S3, DynamoDB and EMR Extensive Security Features Scale from 160 GB -> 1.6 PB Online

Amazon Redshift parallelizes and distributes everything Common BI Tools Query JDBC/ ODBC Load Backup Restore Resize Leader Node 10GigE Mesh Compute Node Compute Node Compute Node

VOLUME VELOCITY VARIETY

VOLUME VELOCITY VARIETY

Big Data is Moving Fast ZB EB GB TB PB IT Application logs, Infrastructure logs, Metering, Audit logs, Change and Configuration logs Web sites / Mobile Apps/ Ads Clickstream, User Engagement Sensor data Weather, Smart Grids, Wearables Social Media, User Content 450MM+ Tweets/day

The Move to Real Time Analytics Recency Requirements are moving from Daily to Minute Granularity Query Engine Approach Streaming Big Data (Data Warehouse, YesSQL, Processing Approach NoSQL databases) Real-time response to Repeated queries over the same well-structured data content in semi-structured Pre-computations like data streams indices and dimensional Relatively simple views improve query computations on data performance (aggregates, filters, sliding Batch Engines (Map- Reduce) window, etc.) Semi-structured data is Enables data lifecycle by processed once or twice moving data to different The query is run on the stores / open source systems data. There are no precomputations.

Application Services Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add and Remove Shards for Performance Use Kinesis Worker Library to Process Data Integration with S3, Redshift and Dynamo DB

AWS Endpoint Amazon Kinesis App.1 Data Sources [Aggregate & De-Duplicate] Data Sources Availability Zone Availability Zone Availability Zone App.2 S3 Data Sources Shard 1 Shard 2 Shard N [Metric Extraction] DynamoDB Data Sources App.3 [Sliding Window Analysis] Redshift Data Sources App.4 [Machine Learning]

Kinesis Connectors Analytics Tooling Integration S3 Batch Write Files for Archive into S3 Sequence Based File Naming Redshift Once Written to S3, Load to Redshift Manifest Support Kinesis User Defined Transformers DynamoDB BatchPut Append to Table User Defined Transformers Storm S3 Dynamo DB Redshift Storm Use Kinesis as a Spout Provided as Open Source on GitHub amazon-kinesis-connectors

EMR Integration with Kinesis Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams No Intermediate Data Persistence Required Simple way to introduce real time sources into Batch Oriented Systems Multi-Application Support & Automatic Checkpointing

Integrated Real Time Analytics

@AWS_UKI for local AWS events & news @AWScloud for Global AWS News and Announcements Thank you! Join the sessions in our AWS Lab Meet the team at our Amazon Web Services Partner Village