Cloud Big Data Architectures. Lynn Langit, QCon Sao Paulo, Brazil, 2016
About this Workshop: Real-world Cloud Scenarios with AWS, Azure and GCP. 1. Big Data Solution Types 2. Data Pipelines 3. ETL and Visualization 4. Bonus (if time allows)
Save ALL of your Data
What is the ACTUAL Cost of Saving all Data? Using newer technologies. Going beyond Relational.
About this Workshop: Real-world Cloud Scenarios with AWS, Azure and GCP. 1. When to use which type of Big Data Solution 2. The new world of Data Pipelines 3. ETL and Visualization Practicalities 4. Bonus (if time allows)
1. Big Data: Yes! But what kind?
Pattern 1: Which type(s) of Big Data work best? -- when to use Hadoop -- when to use NoSQL and which type, i.e. key-value, document, graph, etc. -- when to use Big Relational and what type of workload for hot, warm or cold data
Choice is good, right?
When do I use: Hadoop, NoSQL, Big Relational?
Size Matters
One Vendor's View
Where is Hadoop Used?
Hadoop is your LAST CHOICE
Volume: 10 TB or greater to start; growth of 25% YOY; where FROM, where TO
Velocity and Variety: Spark over HIVE; Kafka and Samza
Veracity: pay, train and hire team; top $$$ for talent IF you can find it
WATCH OUT for Cloud Vendors who promise easy access: complexity of ecosystem; Cloudera knows best
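The volume guideline on this slide (10 TB or greater to start, growing 25% year over year) can be turned into a quick back-of-the-envelope check. The sketch below is a minimal illustration; the function names are hypothetical, and treating both numbers as hard thresholds is an assumption rather than a rule stated in the talk.

```python
# Thresholds taken from the slide; treating them as hard cut-offs is an assumption.
HADOOP_MIN_START_TB = 10.0
HADOOP_MIN_YOY_GROWTH = 0.25


def projected_volume_tb(start_tb: float, yoy_growth: float, years: int) -> float:
    """Compound the starting volume by a fixed year-over-year growth rate."""
    return start_tb * (1.0 + yoy_growth) ** years


def meets_hadoop_volume_bar(start_tb: float, yoy_growth: float) -> bool:
    """True only if both the starting volume and the growth rate reach the slide's numbers."""
    return start_tb >= HADOOP_MIN_START_TB and yoy_growth >= HADOOP_MIN_YOY_GROWTH


if __name__ == "__main__":
    # Print the projected volume for the next few years of a 10 TB data set.
    for year in range(4):
        print(year, round(projected_volume_tb(10.0, 0.25, year), 2))
    # A 2 TB data set does not reach the volume bar, whatever its growth.
    print(meets_hadoop_volume_bar(2.0, 0.25))
```

Volume is only one of the criteria on the slide; team cost and ecosystem complexity still have to be weighed separately.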
When do I use: Hadoop, NoSQL, Big Relational?
225 NoSQL Database Types to Choose From
Let's review some NoSQL concepts. Key-Value: Redis, Riak, Aerospike. Graph: Neo4j. Document: MongoDB. Wide-Column: Cassandra, HBase.
Key Questions - Storage. Volume: how much now, what growth rate? Variety: what type(s) of data? rectangular, graph, k-v, etc. Velocity: batches, streams, both, what ingest rate? Veracity: current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
NoSQL Example: Open Source is...
Free: rapid iteration, innovation; can start up for free (on premise); can rent for cheap or free on the cloud; can use with the command line for free; some vendors offer free online training. Ex. www.neo4j.org
Not Free: constant releases; can be deceptively hard to set up (time is money); don't forget to turn it off if on the cloud!; GUI tools, support, training cost $$$. Ex. www.neo4j.com
Practice Applying Concepts - NoSQL
NoSQL Applied: Log Files ??? Product Catalogs ??? Social Games ??? Social aggregators ??? Line-of-Business ???
NoSQL Applied
Workload             Storage type   Example
Log Files            Columnstore    HBase
Product Catalogs     Key/Value      Redis
Social Games         Document       MongoDB
Social aggregators   Graph          Neo4j
Line-of-Business     RDBMS          SQL Server
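The pairings on this slide amount to a small lookup table. A minimal sketch of that idea follows; the dictionary contents are copied from the slide, while the names `WORKLOAD_TO_STORE` and `suggest_store` are hypothetical and only there for illustration.

```python
from typing import Dict, Tuple

# Workload -> (storage type, example product), copied from the slide.
WORKLOAD_TO_STORE: Dict[str, Tuple[str, str]] = {
    "Log Files": ("Columnstore", "HBase"),
    "Product Catalogs": ("Key/Value", "Redis"),
    "Social Games": ("Document", "MongoDB"),
    "Social aggregators": ("Graph", "Neo4j"),
    "Line-of-Business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload: str) -> Tuple[str, str]:
    """Return the (storage type, example) pair the slide gives for a workload."""
    try:
        return WORKLOAD_TO_STORE[workload]
    except KeyError:
        # The slide only covers five workloads; anything else needs its own analysis.
        raise ValueError(f"no suggestion on the slide for workload: {workload!r}") from None


if __name__ == "__main__":
    for name in WORKLOAD_TO_STORE:
        storage_type, example = suggest_store(name)
        print(f"{name}: {storage_type} (e.g. {example})")
```

These are starting points for discussion, not a substitute for answering the volume, variety, velocity and veracity questions for a real project.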
More than NoSQL
NoSQL: non-relational; can be optimized in-memory; eventually consistent; Schema on Read. Example: Aerospike
NewSQL: relational plus more; often in-memory; some kind of SQL-layer; Schema on Write. Example: MemSQL
U-SQL: What??? Microsoft's universal SQL language. Example: Azure Data Lake
Focus
How Best to Store your Data?
         Complexity   Scalability   Developer Cost
RDBMS    easy         medium        low
NoSQL    medium       big           high
Hadoop   hard         huge          very high
Real World Big Data -- When do I use what? Hadoop 5%, NoSQL 30%, RDBMS 65%
Do the Cloud Vendors Understand Big Data Realities?
Cloud Big Data Vendors - Storage
AWS: 5-10X market share of next competitor; most complete offering; most mature offering. Notable: Big Relational
GCP: lean, mean and cheap; fastest player; requires top developers. Notable: Query as a Service
Azure: catching up; best tooling integration. Notable: On-premise integration
[Screenshot: AWS Console, 17 Data Services]
[Screenshot: GCP Console, 8 Data Services]
[Screenshot: Azure Console, 15 Data Services]
Cloud Offerings - Big Data
                                AWS                            Google                         Microsoft
Managed RDBMS                   RDS, Aurora                    Cloud SQL                      Azure SQL
Data Warehouse                  Redshift                       BigQuery                       Azure SQL Data Warehouse
NoSQL buckets                   S3, Glacier                    Cloud Storage, Nearline        Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column   DynamoDB                       Big Table, Cloud Datastore     Azure Tables
NoSQL Document / Graph          MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE   DocumentDB, Neo4j on Azure
Hadoop                          Elastic MapReduce              DataProc                       Data Lake, HDInsight
Practice Applying Concepts Real Cost of Storage Types
Cloud NoSQL Applied - AWS: Log Files ??? Product Catalogs ??? Social Games ??? Social aggregators ??? Line-of-Business ???
Cloud NoSQL Applied - AWS
Workload             Storage type       Example
Log Files            Stream or Hadoop   Kinesis or EMR
Product Catalogs     Key/Value          DynamoDB
Social Games         Document           MongoDB
Social aggregators   Graph              Neo4j
Line-of-Business     RDBMS              RDS
The fastest growing cloud-based Big Data products are ???
The fastest growing cloud-based Big Data products are Relational
When do I use: Hadoop, NoSQL, Big Relational?
Practice Applying Concepts Real Cost of Storage Types
Reasons to use Big Relational Cloud Services: Developers, DevOps, Cloud Vendors - AWS; Developers, DevOps, Cloud Vendors - GCP
Reasons to use Big Relational Cloud Services
Developers: most know RDBMS query patterns; many know basic administration
DevOps: most know RDBMS administration; many know basic RDBMS queries; many know query optimization
Cloud Vendors - AWS: Aurora RDBMS up to 64 TB; Redshift at $1k USD / 1 TB / year; rich partner ecosystem ETL; integration with AWS products
Developers: most know coding language patterns to interact with RDBMS systems
DevOps: familiar RDBMS security patterns; familiar auditing; partner tooling integration
Cloud Vendors - GCP: BigQuery with familiar SQL queries; no hassle streaming ingest; no hassle pay-as-you-go; zero administration
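The Redshift figure quoted on this slide ($1k USD per TB per year) makes a rough storage budget easy to estimate. The following is a minimal sketch, not a pricing calculator: the function name is hypothetical, the per-TB rate is the slide's 2016 number, and billing each year on the volume held at the start of that year is a simplifying assumption.

```python
# Per-TB yearly rate as quoted on the slide (2016); real pricing varies.
REDSHIFT_USD_PER_TB_YEAR = 1_000.0


def storage_cost_usd(
    start_tb: float,
    years: int,
    yoy_growth: float = 0.0,
    usd_per_tb_year: float = REDSHIFT_USD_PER_TB_YEAR,
) -> float:
    """Sum the yearly cost, growing the stored volume once per year.

    Assumption: each year is billed on the volume held at the start of that year.
    """
    total = 0.0
    volume_tb = start_tb
    for _ in range(years):
        total += volume_tb * usd_per_tb_year
        volume_tb *= 1.0 + yoy_growth
    return total


if __name__ == "__main__":
    # 10 TB kept flat for one year, then 10 TB growing 25% per year for three years.
    print(storage_cost_usd(10.0, 1))
    print(storage_cost_usd(10.0, 3, yoy_growth=0.25))
```

The estimate ignores compression, node sizing and reserved-instance discounts, so it is best read as an order-of-magnitude figure.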
My top Big Data Cloud Services
ETL is 75% of all Big Data Projects. Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
2. Data Pipelines: Build vs. Buy
Pattern 2: How to build optimized cloud-based data pipelines? -- Cloud-based ETL tools and processes -- includes load-testing patterns and security practices -- including connecting between different vendor clouds
Key Questions - Ingestion and ETL. Volume: how much and how fast, now and future? Variety: what type(s) of data, any pre-processing needed? Velocity: batches or streaming? Veracity: verification on ingest needed? new data needed?
Together: How does your data pipeline flow?
Considering: Initial Load/Transform, Data Quality, Batch vs. Stream
Pipeline Phases. Phase 0: Eval Current Data - Quality & Quantity. Phase 1: Get New Data - Free or Premium. Phase 2: Build MVP & Forecast volume and growth. Phase 3: Load test at scale. Phase 4: Deploy secure, audit and monitor.
Cloud Big Data Vendors - ETL
AWS: 5X market share of next competitor. Notable: Many, strong ETL Partners
GCP: lean, mean and cheap; fastest player. Notable: DataFlow requires Java or Python developers
Azure: difficulty with scale; best tooling integration. Notable: Nothing
How Best to Ingest and ETL your Data?
         Complexity   Scalability   Developer Cost
RDBMS    medium       medium        low
NoSQL    medium       big           high
Hadoop   hard         huge          very high
Considering: Initial Load/Transform, Data Quality, Batch vs. Stream
Building a Streaming Pipeline: Stream, Interval, Window
Near Real-time Streams: Load Test All The Things
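To make the stream, interval and window vocabulary concrete, here is a minimal sketch of a fixed-size (tumbling) window in plain Python. It is an illustration only and does not use Kinesis, DataFlow or Stream Analytics; the function name and the `(timestamp_seconds, payload)` event shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[int, int]:
    """Count events per fixed, non-overlapping window.

    Each event is (timestamp_seconds, payload). The window an event belongs to
    is identified by the window's start time.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, _payload in events:
        # Round the timestamp down to the start of its window (the interval).
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    stream = [(0, "a"), (3, "b"), (10, "c"), (12, "d"), (25, "e")]
    print(tumbling_window_counts(stream, window_seconds=10))
```

A managed streaming service adds the hard parts this sketch leaves out, such as late or out-of-order events and state that survives restarts.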
Key Questions - Streaming. Volume: how much data now and predicted over next 12 months? Variety: what types of data now and future? Velocity: volume of input data / time, now and near future? Veracity: volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS: 5X market share of next competitor; most complete offering; most mature offering. Notable: Kinesis Firehose
GCP: lean, mean and cheap; fastest player; requires top developers. Notable: DataFlow flexible
Azure: catching up; best tooling integration. Notable: Stream Analytics integration with other products
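The velocity question above (volume of input data divided by time) is simple arithmetic, and it gives a first target for load testing. The sketch below is a rough illustration; the function names, the decimal unit convention (1 GB = 1000 MB) and the idea of multiplying by a growth rate and a peak factor are assumptions for the example, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_mb_per_s(gb_per_day: float) -> float:
    """Average ingest rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return gb_per_day * 1000.0 / SECONDS_PER_DAY


def load_test_target_mb_per_s(
    gb_per_day: float, yoy_growth: float, peak_factor: float
) -> float:
    """Rate to load test against: next year's average scaled by a peak factor.

    Both yoy_growth and peak_factor are inputs you have to estimate yourself.
    """
    return average_ingest_mb_per_s(gb_per_day) * (1.0 + yoy_growth) * peak_factor


if __name__ == "__main__":
    # 86.4 GB per day averages out to about 1 MB/s.
    print(round(average_ingest_mb_per_s(86.4), 3))
    # Same feed, 25% growth expected, peaks assumed to be 4x the average.
    print(round(load_test_target_mb_per_s(86.4, 0.25, 4.0), 3))
```

Averages hide bursts, which is why the peak factor matters more than the daily total when sizing a streaming pipeline.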
Cloud Offerings - Data and Pipelines
                                AWS                             Google                             Microsoft
Managed RDBMS                   RDS, Aurora                     Cloud SQL                          Azure SQL
Data Warehouse                  Redshift                        BigQuery                           Azure SQL Data Warehouse
NoSQL buckets                   S3, Glacier                     Cloud Storage, Nearline            Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column   DynamoDB                        Big Table, Cloud Datastore         Azure Tables
Streaming or ML                 Kinesis, AWS Machine Learning   DataFlow, Google Machine Learning  StreamInsight, Azure ML
NoSQL Document / Graph          MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE       DocumentDB, Neo4j on Azure
Hadoop                          Elastic MapReduce               DataProc                           Data Lake, HDInsight
Cloud ETL                       Data Pipelines                  DataFlow                           Azure Data Pipeline
How Best to Stream your Data?
            Complexity       Scalability   Developer Cost
Batches     easy             medium        low
Windows     difficult        big           high
Real-time   very difficult   huge          high
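The "Batches" row is rated easy partly because batching needs very little code. As a minimal sketch (the function name is hypothetical and no particular cloud loader is assumed), a generator can cut any record source into fixed-size chunks ready for a bulk load:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull the next chunk; an empty chunk means the source is exhausted.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for chunk in batches(range(5), 2):
        print(chunk)
```

Windowed and real-time processing add ordering, state and latency concerns on top of this, which is where the higher complexity and cost ratings come from.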
Practice Applying Concepts
Designing Cloud Data Pipelines: Log Files, Product Catalogs, Social Games, Social aggregators, Line-of-Business
3. Making Sense of Data: Analytics and Presentation
Pattern 3: How best to Query and Visualize -- when to use business analytics vs. predictive analytics (machine learning) -- how best to present data to clients: partner visualization products or roll your own
Making Sense of Data: Reports, Machine Learning, Presentation
Key Questions - Query: Volume, Variety, Velocity, Veracity
What is the nature of your questions? Graphs
Cloud Big Data Vendors - Query
AWS: 5X market share of next competitor; most complete offering; most mature offering. Notable: Big Relational
GCP: lean, mean and cheap; fastest player. Notable: Flexible, powerful machine learning
Azure: WATCH OUT, cost! Notable: Developer Tooling
Query Languages
SQL: Everyone knows it. But how well do they know it?
NoSQL Vendor Language: Too many to list. How will you learn it?
Cypher: Query language for graph databases. The future?
ORM: Good, bad or horrible? Again, how well do they know it?
HIVE: Shown in too many vendor demos. Really hard to make performant.
Machine Learning Queries: SciPy, NumPy or Python; R Language; Julia Language; many more
Practice Applying Concepts: Understanding D3
How Best to Query your Data?
         Business Analytics   Predictive Analytics   Developer Cost
RDBMS    ???                  ???                    ???
NoSQL    ???                  ???                    ???
Hadoop   ???                  ???                    ???
How Best to Query your Data?
         Business Analytics   Predictive Analytics   Developer Cost
RDBMS    easy                 medium                 low
NoSQL    hard                 very hard              very high
Hadoop   hard                 hard                   very high
Machine Learning aka Predictive Analytics. AWS: ML for developers; GUI-based. GCP: 3 Flavors of ML; Python-based languages. Azure: ML for Data Scientists; R Language.
Presentation: If you can't see it, it's not worth it.
Innovation in Data Visualization
Dashboards: More than KPIs; Mobile Alerts; Data Stories
Reports: Level of Detail; Meaningful Taxonomies; Fast enough; Drill for Data
D3: The language of Data Visualization
Cloud Big Data Vendors - Visualization
AWS: most complete offering. Notable: Partners & QuickSight
GCP: BigQuery Partners. Notable: New Dashboards
Azure: integrated. Notable: PowerBI
4. About IoT: It's happening now
[Screenshot: Data Generation Device]
IoT is Big Data Realized
The IoT Market: $235,000,000,000 and 20 billion devices (and a lot of users) by the year 2017
IoT all the Things
Cloud Big Data Vendors - IoT
AWS: first to market; most complete offering; most mature offering. Notable: AWS IoT Rules
GCP: still in Beta; fastest player; requires top developers. Notable: Weave
Azure: catching up; best tooling integration. Notable: Device Mgmt.
Save ALL of your Data
The Next Generation
Obrigada (thank you)! Any questions? You can find me at @lynnlangit