Cloud Big Data Architectures


Cloud Big Data Architectures
Lynn Langit
QCon Sao Paulo, Brazil 2016

About this Workshop
Real-world Cloud Scenarios w/ AWS, Azure and GCP
1. Big Data Solution Types
2. Data Pipelines
3. ETL and Visualization
4. Bonus (if time allows)

Save ALL of your Data

What is the ACTUAL Cost of Saving all Data?
-- Using newer technologies
-- Going beyond Relational

About this Workshop
Real-world Cloud Scenarios w/ AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus (if time allows)

1. Big Data Yes! But what kind?

Pattern 1: Which type(s) of Big Data work best?
-- when to use Hadoop
-- when to use NoSQL and which type, i.e. key-value, document, graph, etc.
-- when to use Big Relational and what type of workload for hot, warm or cold data

Choice is good, right?

When do I use? Hadoop NoSQL Big Relational

Size Matters

One Vendor's View

Where is Hadoop Used?

Hadoop is your LAST CHOICE
-- Volume: 10 TB or greater to start, growth of 25% YOY, where FROM, where TO
-- Velocity and Variety: Spark over Hive, Kafka and Samza
-- Veracity
-- Pay, train and hire team: top $$$ for talent, IF you can find it
-- WATCH OUT for cloud vendors who promise easy access
-- Complexity of ecosystem: Cloudera knows best
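
A minimal sketch of the volume check on this slide, assuming the two numbers (10 TB to start, 25% year-over-year growth) are read as a joint threshold. That reading and the function names are illustrative assumptions, not something the slide spells out.

```python
def projected_volume_tb(start_tb: float, yoy_growth: float, years: int) -> float:
    """Compound a starting volume (in TB) by a fixed year-over-year growth rate."""
    return start_tb * (1.0 + yoy_growth) ** years


def meets_hadoop_thresholds(
    volume_tb: float,
    yoy_growth: float,
    min_volume_tb: float = 10.0,  # slide: "10 TB or greater to start"
    min_growth: float = 0.25,     # slide: "Growth of 25% YOY"
) -> bool:
    """Assumption: both thresholds must hold before Hadoop is even considered."""
    return volume_tb >= min_volume_tb and yoy_growth >= min_growth


if __name__ == "__main__":
    # 10 TB growing 25% per year, projected three years out
    print(round(projected_volume_tb(10.0, 0.25, 3), 2))
    print(meets_hadoop_thresholds(12.0, 0.30))
    print(meets_hadoop_thresholds(2.0, 0.50))
```

If either check fails, the rest of the deck suggests looking at NoSQL or Big Relational first.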

When do I use? Hadoop NoSQL Big Relational

225 NoSQL Database Types to Choose From

Let's review some NoSQL concepts
-- Key-Value: Redis, Riak, Aerospike
-- Graph: Neo4j
-- Document: MongoDB
-- Wide-Column: Cassandra, HBase

Key Questions - Storage
-- Volume: how much now, what growth rate?
-- Variety: what type(s) of data? rectangular, graph, k-v, etc.
-- Velocity: batches, streams, both, what ingest rate?
-- Veracity: current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?

NoSQL Example
Open Source is Free
-- Rapid iteration, innovation
-- Can start up for free (on premise)
-- Can rent for cheap or free on the cloud
-- Can use with the command line for free
-- Some vendors offer free online training
-- Ex. www.neo4j.org
Not Free
-- Constant releases
-- Can be deceptively hard to set up (time is money)
-- Don't forget to turn it off if on the cloud!
-- GUI tools, support, training cost $$$
-- Ex. www.neo4j.com

Practice Applying Concepts - NoSQL

NoSQL Applied
-- Log Files: ???
-- Product Catalogs: ???
-- Social Games: ???
-- Social aggregators: ???
-- Line-of-Business: ???

NoSQL Applied
Workload            Type         Example
Log Files           Columnstore  HBase
Product Catalogs    Key/Value    Redis
Social Games        Document     MongoDB
Social aggregators  Graph        Neo4j
Line-of-Business    RDBMS        SQL Server
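
The pairings on this slide can be kept as a small lookup table. This is only a sketch: the dictionary copies the slide's answers, while the function name `suggest_store` and the lower-cased keys are hypothetical choices made for illustration.

```python
from typing import Dict, Tuple

# Workload -> (store type, example product), copied from the slide.
NOSQL_APPLIED: Dict[str, Tuple[str, str]] = {
    "log files": ("Columnstore", "HBase"),
    "product catalogs": ("Key/Value", "Redis"),
    "social games": ("Document", "MongoDB"),
    "social aggregators": ("Graph", "Neo4j"),
    "line-of-business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload: str) -> Tuple[str, str]:
    """Return the slide's (type, example) pairing for a workload name."""
    key = workload.strip().lower()
    if key not in NOSQL_APPLIED:
        # The slide only covers five workloads; anything else needs its own analysis.
        raise KeyError(f"no suggestion on the slide for workload: {workload!r}")
    return NOSQL_APPLIED[key]


if __name__ == "__main__":
    print(suggest_store("Social Games"))
```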

More than NoSQL
NoSQL
-- Non-relational
-- Can be optimized in-memory
-- Eventually consistent
-- Schema on Read
-- Example: Aerospike
NewSQL
-- Relational plus more
-- Often in-memory
-- Some kind of SQL-layer
-- Schema on Write
-- Example: MemSQL
U-SQL
-- What???
-- Microsoft's universal SQL language
-- Example: Azure Data Lake

Focus

How Best to Store your Data?
        Complexity  Scalability  Developer Cost
RDBMS   easy        medium       low
NoSQL   medium      big          high
Hadoop  hard        huge         very high

Real World Big Data -- When do I use what?
Hadoop: 5%
NoSQL: 30%
RDBMS: 65%

Do the Cloud Vendors Understand Big Data Realities?

Cloud Big Data Vendors - Storage
AWS
-- 5-10X market share of next competitor
-- Most complete offering
-- Most mature offering
-- Notable: Big Relational
GCP
-- Lean, mean and cheap
-- Fastest player
-- Requires top developers
-- Notable: Query as a Service
Azure
-- Catching up
-- Best tooling integration
-- Notable: On-premise integration

AWS Console: 17 Data Services

GCP Console: 8 Data Services

Azure Console: 15 Data Services

Cloud Offerings - Big Data
                                AWS                            Google                             Microsoft
Managed RDBMS                   RDS, Aurora                    Cloud SQL                          Azure SQL
Data Warehouse                  Redshift                       BigQuery                           Azure SQL Data Warehouse
NoSQL buckets                   S3, Glacier                    Cloud Storage, Nearline            Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column   DynamoDB                       Bigtable, Cloud Datastore          Azure Tables
NoSQL Document / Graph          MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE       DocumentDB, Neo4j on Azure
Hadoop                          Elastic MapReduce              DataProc                           Data Lake, HDInsight

Practice Applying Concepts Real Cost of Storage Types

Cloud NoSQL Applied - AWS
-- Log Files
-- Product Catalogs
-- Social Games
-- Social aggregators
-- Line-of-Business

Cloud NoSQL Applied - AWS
Workload            Type              Service
Log Files           Stream or Hadoop  Kinesis or EMR
Product Catalogs    Key/Value         DynamoDB
Social Games        Document          MongoDB
Social aggregators  Graph             Neo4j
Line-of-Business    RDBMS             RDS

The fastest growing cloud-based Big Data products are ???

The fastest growing cloud-based Big Data products are Relational

When do I use? Hadoop NoSQL Big Relational

Practice Applying Concepts Real Cost of Storage Types

Reasons to use Big Relational Cloud Services
-- Developers
-- DevOps
-- Cloud Vendors - AWS
-- Developers
-- DevOps
-- Cloud Vendors - GCP

Reasons to use Big Relational Cloud Services
Developers
-- Most know RDBMS query patterns
-- Many know basic administration
DevOps
-- Most know RDBMS administration
-- Many know basic RDBMS queries
-- Many know query optimization
Cloud Vendors - AWS
-- Aurora: RDBMS up to 64 TB
-- Redshift: $1k USD / 1 TB / year
-- Rich partner ecosystem, ETL
-- Integration with AWS products
Developers
-- Most know coding language patterns to interact with RDBMS systems
DevOps
-- Familiar RDBMS security patterns
-- Familiar auditing
-- Partner tooling integration
Cloud Vendors - GCP
-- BigQuery: familiar SQL queries
-- No hassle streaming ingest
-- No hassle pay-as-you-go
-- Zero administration
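
A rough worked example of what the Redshift figure on this slide means over time. It takes the slide's "$1k USD / 1 TB / year" at face value and assumes a constant yearly growth rate; the growth rate and the function name are illustrative assumptions, and actual pricing depends on the vendor's current price list.

```python
from typing import List

REDSHIFT_USD_PER_TB_YEAR = 1_000.0  # slide figure: "$1k USD / 1 TB / year"


def yearly_storage_costs(
    start_tb: float,
    yoy_growth: float,
    years: int,
    usd_per_tb_year: float = REDSHIFT_USD_PER_TB_YEAR,
) -> List[float]:
    """Cost per year when the stored volume grows by a fixed rate each year."""
    costs = []
    volume_tb = start_tb
    for _ in range(years):
        costs.append(volume_tb * usd_per_tb_year)
        volume_tb *= 1.0 + yoy_growth  # volume carried into the next year
    return costs


if __name__ == "__main__":
    # 10 TB today, assumed 25% growth per year, three years of bills
    print(yearly_storage_costs(10.0, 0.25, 3))
```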

My top Big Data Cloud Services

ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data take up the majority of the billable time for new Big Data projects.
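
To make "surveying, cleaning and loading" concrete, here is a minimal sketch of the three steps using only the Python standard library. The sample CSV, the `purchases` table and the cleaning rules (normalize the country code, drop unparseable amounts, drop exact duplicates) are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with the usual problems: stray whitespace,
# inconsistent case, a duplicate row, a missing value and a bad number.
RAW_CSV = """user_id,country,amount
1,BR,10.50
2,br ,7.25
2,br ,7.25
3,,4.00
4,US,not-a-number
"""


def extract(text):
    """Survey: parse the raw text into dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean: normalize country, drop bad or duplicate rows."""
    seen = set()
    clean = []
    for row in rows:
        country = row["country"].strip().upper()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount does not parse as a number
        if not country:
            continue  # missing country
        record = (int(row["user_id"]), country, amount)
        if record in seen:
            continue  # exact duplicate of an earlier row
        seen.add(record)
        clean.append(record)
    return clean


def load(rows, conn):
    """Load: insert the cleaned rows and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load(transform(extract(RAW_CSV)), connection))
```

Even in this toy case most of the code is in the cleaning step, which is the point the slide makes about billable time.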


2. Data Pipelines Build vs. Buy

Pattern 2: How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds

Key Questions - Ingestion and ETL
-- Volume: how much and how fast, now and future?
-- Variety: what type(s) of data, any pre-processing needed?
-- Velocity: batches or streaming?
-- Veracity: verification on ingest needed? new data needed?

Together How does your data pipeline flow?

Considering: Initial Load/Transform, Data Quality, Batch vs. Stream

Pipeline Phases
-- Phase 0: Eval Current Data - Quality & Quantity
-- Phase 1: Get New Data - Free or Premium
-- Phase 2: Build MVP & Forecast volume and growth
-- Phase 3: Load test at scale
-- Phase 4: Deploy - secure, audit and monitor

Cloud Big Data Vendors - ETL
AWS
-- 5X market share of next competitor
-- Notable: Many, strong ETL Partners
GCP
-- Lean, mean and cheap
-- Fastest player
-- Notable: DataFlow requires Java or Python developers
Azure
-- Difficulty with scale
-- Best tooling integration
-- Notable: Nothing

How Best to Ingest and ETL your Data?
        Complexity  Scalability  Developer Cost
RDBMS   medium      medium       low
NoSQL   medium      big          high
Hadoop  hard        huge         very high

Considering: Initial Load/Transform, Data Quality, Batch vs. Stream

Building a Streaming Pipeline: Stream, Interval, Window
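
One common way to relate a stream, an interval and a window is a fixed-size (tumbling) window: every event falls into exactly one interval of `window_seconds`. The slide does not say which windowing style it has in mind, so treat this as one possible illustration; the event shape `(timestamp, payload)` with integer-second timestamps is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[int, int]:
    """Count events per fixed, non-overlapping window.

    Each event is (timestamp_in_seconds, payload). The key of the result is
    the start time of the window the event falls into.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    stream = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
    print(tumbling_window_counts(stream, 10))
```

Managed streaming services provide their own windowing primitives; this only shows the bookkeeping behind the idea.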

Near Real-time Streams: Load Test All The Things
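
A load test starts with a measurement. The sketch below pushes synthetic events through an ingest callable and reports events per second; `synthetic_events`, the event fields and the in-memory sink are hypothetical stand-ins for a real producer and a real pipeline endpoint.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def synthetic_events(n: int) -> Iterator[dict]:
    """Generate n small, predictable events for a test run."""
    for i in range(n):
        yield {"id": i, "value": i % 7}


def measure_throughput(
    ingest: Callable[[dict], None], events: Iterable[dict]
) -> Tuple[int, float]:
    """Feed every event to ingest() and return (count, events per second)."""
    start = time.perf_counter()
    count = 0
    for event in events:
        ingest(event)
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


if __name__ == "__main__":
    sink = []  # stand-in for the real ingest endpoint
    total, per_second = measure_throughput(sink.append, synthetic_events(100_000))
    print(total, "events at roughly", int(per_second), "events/second")
```

Swapping `sink.append` for a call into the real pipeline gives a first number to compare against the forecast ingest rate.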

Key Questions - Streaming
-- Volume: how much data now and predicted over next 12 months?
-- Variety: what types of data now and future?
-- Velocity: volume of input data / time now and near future?
-- Veracity: volume of EXISTING data now

Cloud Big Data Vendors - Streaming
AWS
-- 5X market share of next competitor
-- Most complete offering
-- Most mature offering
-- Notable: Kinesis Firehose
GCP
-- Lean, mean and cheap
-- Fastest player
-- Requires top developers
-- Notable: DataFlow flexible
Azure
-- Catching up
-- Best tooling integration
-- Notable: Stream Analytics integration with other products


Cloud Offerings - Data and Pipelines
                                AWS                            Google                             Microsoft
Managed RDBMS                   RDS, Aurora                    Cloud SQL                          Azure SQL
Data Warehouse                  Redshift                       BigQuery                           Azure SQL Data Warehouse
NoSQL buckets                   S3, Glacier                    Cloud Storage, Nearline            Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column   DynamoDB                       Bigtable, Cloud Datastore          Azure Tables
Streaming or ML                 Kinesis, AWS Machine Learning  DataFlow, Google Machine Learning  StreamInsight, Azure ML
NoSQL Document / Graph          MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE       DocumentDB, Neo4j on Azure
Hadoop                          Elastic MapReduce              DataProc                           Data Lake, HDInsight
Cloud ETL                       Data Pipelines                 DataFlow                           Azure Data Pipeline

How Best to Stream your Data?
           Complexity      Scalability  Developer Cost
Batches    easy            medium       low
Windows    difficult       big          high
Real-time  very difficult  huge         high
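
The "Batches" row is the cheapest to build because the logic is little more than chunking. A minimal sketch, assuming the stream is any Python iterable and that a fixed batch size is acceptable (the function name `micro_batches` is illustrative):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream into lists of at most batch_size items, in arrival order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(7), 3):
        print(batch)
```

Windows and real-time processing add time semantics and ordering concerns on top of this, which is where the table's extra complexity and developer cost come from.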

Practice Applying Concepts

Designing Cloud Data Pipelines
-- Log Files
-- Product Catalogs
-- Social Games
-- Social aggregators
-- Line-of-Business


3. Making Sense of Data Analytics and Presentation

Pattern 3: How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- how best to present data to clients - partner visualization products or roll your own

Making Sense of Data: Reports, Machine Learning, Presentation

Key Questions - Query: Volume, Variety, Velocity, Veracity

Graphs: What is the nature of your questions?

Cloud Big Data Vendors - Query
AWS
-- 5X market share of next competitor
-- Most complete offering
-- Most mature offering
-- Notable: Big Relational
GCP
-- Lean, mean and cheap
-- Fastest player
-- Notable: Flexible, powerful machine learning
Azure
-- WATCH OUT: Cost!
-- Notable: Developer Tooling

Query Languages
SQL
-- Everyone knows it
-- But how well do they know it?
NoSQL Vendor Language
-- Too many to list
-- How will you learn it?
Cypher
-- Query language for graph databases
-- The future?
ORM
-- Good, bad or horrible?
-- Again, how well do they know it?
HIVE
-- Shown in too many vendor demos
-- Really hard to make performant
Machine Learning Queries
-- SciPy, NumPy or Python
-- R Language
-- Julia Language
-- Many more

Practice Applying Concepts Understanding D3

How Best to Query your Data?
Columns: Business Analytics, Predictive Analytics, Developer Cost
Rows: RDBMS, NoSQL, Hadoop

How Best to Query your Data?
        Business Analytics  Predictive Analytics  Developer Cost
RDBMS   easy                medium                low
NoSQL   hard                very hard             very high
Hadoop  hard                hard                  very high
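
The "easy" in the RDBMS row for business analytics comes down to queries like the one below: a single GROUP BY answers a typical reporting question. The sketch uses Python's built-in `sqlite3` as a stand-in for any relational service; the table name, columns and sample rows are invented for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple


def revenue_by_country(rows: Iterable[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Return (country, order count, total amount), highest total first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (country TEXT, amount REAL)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        return conn.execute(
            "SELECT country, COUNT(*), SUM(amount) "
            "FROM purchases "
            "GROUP BY country "
            "ORDER BY SUM(amount) DESC"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("BR", 10.0), ("BR", 5.0), ("US", 7.5)]
    for line in revenue_by_country(sample):
        print(line)
```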

Machine Learning aka Predictive Analytics
AWS
-- ML for developers
-- GUI-based
GCP
-- 3 Flavors of ML
-- Python-based languages
Azure
-- ML for Data Scientists
-- R Language
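
For contrast with report-style queries, predictive analytics fits a model to past data and uses it to estimate unseen values. The sketch below is the smallest possible case, an ordinary least-squares line in plain Python. It is not how any of the three vendor services work internally; the function names and sample data are illustrative assumptions.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope: float, intercept: float, x: float) -> float:
    """Estimate y for a new x using the fitted line."""
    return slope * x + intercept


if __name__ == "__main__":
    months = [1, 2, 3, 4]
    volume_tb = [3.0, 5.0, 7.0, 9.0]  # made-up history
    slope, intercept = fit_line(months, volume_tb)
    print(predict(slope, intercept, 5))
```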

Presentation: If you can't see it, it's not worth it.

Innovation in Data Visualization
-- Dashboards
-- More than KPIs
-- Mobile Alerts
-- Data Stories
-- Reports
-- Level of Detail
-- Meaningful Taxonomies
-- Fast enough
-- Drill for Data

D3: The language of Data Visualization

Cloud Big Data Vendors - Visualization
AWS
-- Most complete offering
-- Notable: Partners & QuickSight
GCP
-- BigQuery Partners
-- Notable: New Dashboards
Azure
-- Integrated
-- Notable: PowerBI


4. About IoT: It's happening now

Data Generation Device

IoT is Big Data Realized

The IoT Market
-- $235,000,000,000
-- 20 billion devices, and a lot of users
-- By the year 2017

IoT all the Things

Cloud Big Data Vendors - IoT
AWS
-- First to market
-- Most complete offering
-- Most mature offering
-- Notable: AWS IoT Rules
GCP
-- Still in Beta
-- Fastest player
-- Requires top developers
-- Notable: Weave
Azure
-- Catching up
-- Best tooling integration
-- Notable: Device Mgmt.

Save ALL of your Data

The Next Generation

Obrigada (thank you)!
Any questions?
You can find me at @lynnlangit