Big Data & Cloud Computing Faysal Shaarani
Agenda: Business Trends in Data; What is Big Data?; Traditional Computing vs. Cloud Computing; Snowflake Architecture for the Cloud
Business Trends in Data: Data is a critical decision-making tool and driver for business over time. Different types of data (the Internet of Things will only increase the volume and structure of this data). Data volume and variety have made it very difficult and costly to ingest, process, and distill information for timely and accurate decision making.
What is Big Data? Various attempts have been made to define it. One definition: data sets too large and complex to manipulate or interrogate with standard methods or tools. Businesses that do not look into it will likely be in trouble.
Big Data is not just about Volume
Big Data at a microscopic level: Structured Data; Semi-Structured Data; Multi-Structured Data. Governed by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.
What Makes up Big Data?
Big Data Analytics is a Must: Large amounts of data available. Competitive advantage. Better strategic and operational business decisions. Identification of hidden patterns and unknown correlations. Effective marketing, customer satisfaction, and increased revenue.
Applications for Big Data Analytics
Big Data for Smarter Healthcare
Big Data Market Size
The Data World is Different Today: Conventional Data Warehouses: Static, predictable queries on highly refined data were the norm. Knobs and parameters for users or DBAs to tune based on their knowledge of the queries and workloads to be run. That's near impossible to manage in today's world. Cloud Data Warehouses: Reduce up-front project costs. Pay-as-you-go, on-demand, elastic scalability model. Enable organizations to scale their applications as required while paying only for the resources they use. Provide significant benefits for both the business and IT.
Limitations of Traditional Databases: Limited elasticity: compute and data. Complex to manage: data indexing, partitioning, DBAs, query tuning. Costly: infrastructure management, licensing, tools, and skills. Both shared-nothing and shared-disk database architectures have two dimensions of scalability (data and compute). Neither is elastic.
Benefits of Cloud Computing: Infinite resources, elasticity on demand. Pay only for what you use. Bring solutions to market quickly. No need to involve IT.
Public Clouds & Ecosystem Tools: Infrastructure: AWS, Azure, Google Cloud. Data Warehousing: Redshift, Snowflake, others. BI Tools: Tableau, Looker, MicroStrategy, SAS. ETL: Talend, Informatica, CloverETL.
On-premise Databases in the Cloud: Any on-premise database can be hosted in the cloud, e.g. Oracle, MySQL, SQL Server, DB2, etc. Amazon Redshift (open source moved to the cloud): Fast, fully managed, petabyte-scale data warehouse service. Simple and cost-effective way to analyze all your data using any existing business intelligence tools. Just $0.25 per CPU hour and $1,000 per terabyte per year. No commitments or upfront costs. Less than a tenth of the cost of most other data warehousing solutions.
Databases Architected for the Cloud: Snowflake: Fast, fully managed, petabyte-scale data warehouse service. Simple and cost-effective way to analyze all your data using any existing business intelligence tools. Just $1 to $2 per server/hr and $200 per terabyte per month. No commitments or upfront costs. Less than a tenth of the cost of most other data warehousing solutions.
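The pay-as-you-go pricing quoted on this slide can be turned into a rough monthly estimate. The sketch below is only an illustration: the function name, the parameter names, and the example workload (4 servers, 100 hours, 10 TB) are assumptions, and the default rates are simply the upper compute figure ($2 per server-hour) and the storage figure ($200 per terabyte per month) quoted above.

```python
def monthly_cost(servers, hours_used, storage_tb,
                 rate_per_server_hour=2.0, storage_rate_per_tb_month=200.0):
    """Estimate one month's bill under a pay-as-you-go model.

    Compute is billed only for the hours actually used; storage is billed
    per terabyte per month. Rates default to the figures quoted on the slide.
    """
    compute = servers * hours_used * rate_per_server_hour
    storage = storage_tb * storage_rate_per_tb_month
    return compute + storage


# Hypothetical workload: 4 servers running 100 hours over 10 TB of data.
busy_month = monthly_cost(servers=4, hours_used=100, storage_tb=10)
# Same data, but no queries were run: only storage is charged.
idle_month = monthly_cost(servers=4, hours_used=0, storage_tb=10)
print(busy_month, idle_month)
```

With no compute hours the estimate collapses to the storage charge alone, which is the point of paying only for the resources used.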
Data Warehousing Cloud Service: The Database is separate from the Virtual Warehouse. One Virtual Warehouse, multiple Databases. One Database, multiple Virtual Warehouses. A Virtual Warehouse scales independently from the Database. Data loading does not impact query performance. [Diagram labels: ETL & Data Loading; Finance Users; Marketing Users; Test/Dev Users; Sales Users; Biz Dev User; Virtual Warehouses; Databases]
Data Warehousing Cloud Service: Supports structured and semi-structured data: JSON and Avro. The tools you know + Snowflake web UI. [Diagram labels: ETL & Data Loading; Finance Users; Marketing Users; Test/Dev Users; Sales Users; Biz Dev User; Databases]
Multidimensional Elasticity: Three dimensions of elasticity: Data, Workload, Users. [Diagram labels: ETL & Data Loading; Finance Users; Test/Dev Users; Marketing Users; Sales Users; Biz Dev User; Virtual Warehouses; Databases; Workload Elasticity; Data Elasticity]
Inside Multidimensional Elasticity: CSV loading (running on EC2). Columnar compressed FDN files (stored on S3). The Virtual Warehouse (running on EC2) adaptively caches FDN files in local flash storage. Query optimization and runtime execution prune data for efficiency.
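The slide states only that query optimization and runtime execution prune data. One common way such pruning can work is to keep a minimum and maximum value per file and skip any file whose range cannot match the query. The sketch below illustrates that general idea; it is an assumption, not a description of Snowflake's internals, and the `FileStats` class, the `prune` function, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file metadata: the smallest and largest value of one column."""
    name: str
    min_value: int
    max_value: int


def prune(files, low, high):
    """Return the names of files whose [min, max] range overlaps [low, high].

    Files outside the requested range cannot contain matching rows,
    so they never need to be read from storage.
    """
    return [f.name for f in files
            if f.max_value >= low and f.min_value <= high]


# Hypothetical files, each covering a distinct range of order IDs.
files = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]

# A query for order IDs between 150 and 220 only needs two of the three files.
to_scan = prune(files, 150, 220)
print(to_scan)  # ['part-1', 'part-2']
```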
Snowflake Architecture: User Interface: ODBC Driver, JDBC Driver, Web UI. Cloud Services: Optimization, Query Mgmt, Warehouse Mgmt, Security, Metadata. Virtual Warehouse Processing: EC2. Database Storage: S3. Cloud Infrastructure: Amazon AWS. [Diagram labels: Data; Sales; Marketing Materials; Customer Service; Financial Analysts; Quality Control; Loading]
Relational Processing of Semi-Structured Data
1. Variant data type compresses storage of semi-structured data.
2. Data is analyzed during load to discern repetitive attributes within the hierarchy.
3. Repetitive attributes are columnar compressed and statistics are collected for relational query optimization.
4. SQL extensions enable relational queries against both semi-structured and structured data.
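Steps 2 and 3 can be pictured with a small sketch: scan JSON records at load time, find the attribute paths that repeat across most records, pull those into columns, and record simple statistics for each. This is a toy illustration under stated assumptions, not Snowflake's implementation; the function names, the 80% threshold, the choice of min/max as the statistics, and the sample records are all hypothetical.

```python
import json
from collections import Counter


def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def load(records, threshold=0.8):
    """Extract attributes present in at least `threshold` of the records.

    Returns the extracted columns plus (min, max) statistics per column.
    """
    rows = [dict(flatten(json.loads(r))) for r in records]
    counts = Counter(path for row in rows for path in row)
    common = sorted(p for p, n in counts.items() if n / len(rows) >= threshold)
    columns = {p: [row.get(p) for row in rows] for p in common}
    stats = {}
    for path, values in columns.items():
        present = [v for v in values if v is not None]
        stats[path] = (min(present), max(present))
    return columns, stats


# Hypothetical input: three JSON records with partly overlapping attributes.
records = [
    '{"user": {"id": 1, "city": "Oslo"}, "amount": 10}',
    '{"user": {"id": 2, "city": "Lima"}, "amount": 25}',
    '{"user": {"id": 3}, "amount": 7, "coupon": "X1"}',
]

columns, stats = load(records)
print(columns)  # {'amount': [10, 25, 7], 'user.id': [1, 2, 3]}
print(stats)    # {'amount': (7, 25), 'user.id': (1, 3)}
```

Attributes that appear in only some records (`user.city`, `coupon`) stay out of the extracted columns in this sketch; a real system would still keep them queryable in the original document.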
Security: General Availability Features: Account 2-factor authentication. Federated authentication. Data encryption over the Internet. Encryption of data at rest. Roles and privilege management. Auditing/security logging of all operations. [Diagram labels: Account; Service Account; Snowflake Operations]
10TB Query Workload Comparison
Metric | Oracle | Redshift | Snowflake | Snowflake Improvement
Upfront Commitment | $1.7M | $48,000 (1-Year Reserved Instance) | $10,000 | 5x
Load | 9 hours | 14 hours, $100 | 1.4 hours, $45 | 10x
Query | 1.5 hours | 3.5 hours, $25 | 40 min, $20 | 5x
Resize | Forklift upgrade | 3 hours, full data migration | 5 minutes, no data migration | 30x
Monthly Idle Cost | $47,000 | $5,500 | $500 (10TB DB Storage) | 10x
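The improvement factors in this comparison can be checked with simple division. The sketch below assumes the "Snowflake Improvement" figures are measured against the Redshift figures and then rounded; the slide does not state the baseline, so that reading is an assumption, and the dictionary keys are hypothetical names for the rows.

```python
# Redshift and Snowflake figures as quoted in the comparison.
redshift = {
    "upfront_usd": 48_000,
    "load_hours": 14,
    "query_hours": 3.5,
    "resize_minutes": 3 * 60,   # 3 hours
    "idle_usd_month": 5_500,
}
snowflake = {
    "upfront_usd": 10_000,
    "load_hours": 1.4,
    "query_hours": 40 / 60,     # 40 minutes
    "resize_minutes": 5,
    "idle_usd_month": 500,
}

# Ratio of the Redshift figure to the Snowflake figure for each row.
ratios = {key: redshift[key] / snowflake[key] for key in redshift}

for key, value in ratios.items():
    print(f"{key}: {value:.1f}x")
```

Under that assumption the ratios come out at roughly 4.8x, 10x, 5.3x, 36x, and 11x, which is consistent with the rounded 5x, 10x, 5x, 30x, and 10x shown on the slide.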