Big data architectures and the data lake

Similar documents
Modernizing Your Data Warehouse for Hadoop

Bringing Big Data to People

Please give me your feedback

HDP Hadoop From concept to deployment.

HDP Enabling the Modern Data Architecture

Azure Data Lake Analytics

Ganzheitliches Datenmanagement

Microsoft Analytics Platform System. Solution Brief

The Future of Data Management

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Technologies Compared June 2014

Traditional BI vs. Business Data Lake A comparison

Parallel Data Warehouse

SAP and Hortonworks Reference Architecture

Upcoming Announcements

Big Data Processing: Past, Present and Future

Data Services Advisory

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Reference Architecture, Requirements, Gaps, Roles

Hadoop in the Hybrid Cloud

Production ready hadoop. By Deepak Rao Na,onal Head Datawarehousing Bajaj Finserv

A Modern Data Architecture with Apache Hadoop

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

The Future of Data Management with Hadoop and the Enterprise Data Hub

James Serra Sr BI Architect

Comprehensive Analytics on the Hortonworks Data Platform

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Building an Effective Data Warehouse Architecture James Serra

Hadoop, the Data Lake, and a New World of Analytics

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Luncheon Webinar Series May 13, 2013

Big Data: Making Sense of it all!

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

2015 Ironside Group, Inc. 2

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Data Integration Checklist

The Inside Scoop on Hadoop

Information Builders Mission & Value Proposition

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Investor Presentation. Second Quarter 2015

Big Data Realities Hadoop in the Enterprise Architecture

The Digital Enterprise Demands a Modern Integration Approach. Nada daveiga, Sr. Dir. of Technical Sales Tony LaVasseur, Territory Leader

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Microsoft Big Data. Solution Brief

Cloudera Enterprise Data Hub in Telecom:

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Virtualizing Apache Hadoop. June, 2012

Deploying an Operational Data Store Designed for Big Data

Modern Data Architecture for Predictive Analytics

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

How To Handle Big Data With A Data Scientist

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

#TalendSandbox for Big Data

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Apache Hadoop: The Big Data Refinery

Big Data Storage Challenges for the Industrial Internet of Things

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Modern Data Warehousing

Protecting Big Data Data Protection Solutions for the Business Data Lake

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

More Data in Less Time

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Big Data at Cloud Scale

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

How to Enhance Traditional BI Architecture to Leverage Big Data

Information Architecture

Agile Business Intelligence Data Lake Architecture

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

Dominik Wagenknecht Accenture

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

Dell In-Memory Appliance for Cloudera Enterprise

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

The Microsoft Modern Data Warehouse

Hadoop & Spark Using Amazon EMR

Building Blocks of Cortana Intelligence Suite

Polybase for SQL Server 2016

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Cisco IT Hadoop Journey

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Building Your Big Data Team

SELLING PROJECTS ON THE MICROSOFT BUSINESS ANALYTICS PLATFORM

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

BIG DATA AND MICROSOFT. Susie Adams CTO Microsoft Federal

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

The big data revolution

Oracle Database 12c Plug In. Switch On. Get SMART.

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Microsoft Big Data Solutions. Anar Taghiyev P-TSP

Transcription:

Big data architectures and the data lake James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com

About Me Microsoft, Big Data Evangelist In IT for 30 years, worked on many BI and DW projects Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer Been perm employee, contractor, consultant, business owner Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions Blog at JamesSerra.com Former SQL Server MVP Author of book Reporting with Microsoft SQL Server 2012

Agenda Big Data Architectures Why data lakes Top-down vs Bottom-up Data lake defined Hadoop as the data lake Modern Data Warehouse Federated Querying Solution in the cloud SMP vs MPP

Big Data Architectures

Enterprise data warehouse augmentation Seen when EDW has been in existence a while and EDW can t handle new data Cons: not offloading EDW work, can t use existing tools, data hub difficulty understanding data

Data hub plus EDW Data hub is used as temporary staging and refining, no reporting Cons: data hub is temporary, no reporting/analyzing done with the data hub (temporary)

All-in-one Data hub is total solution, no EDW Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations

Modern Data Warehouse Evolution of three previous scenarios Ultimate goal Supports future data needs Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance

Why data lakes

10 Traditional business analytics process 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create a Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema ( schema-on-write ) 5. Create reports. Analyze data Dedicated ETL tools (e.g. SSIS) Relational ETL pipeline Queries LOB Applications Defined schema Results All data not immediately required is discarded or archived

Need to collect any data Harness the growing and changing nature of data Structured Unstructured Streaming Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data Get the right information to the right people at the right time in the right format

The three V s

13 New big data thinking: All data has value All data has potential value Data hoarding No defined schema stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit Iterate Gather data from all sources Store indefinitely Analyze See results

Top-down vs Bottom-up

Two Approaches to Information Management for Analytics: Top-Down + Bottoms-Up How can we make it happen Theory Hypothesis Observation Confirmation What happened Descriptive Analytics Why did it happen Diagnostic Analytics What will happen Predictive Analytics Prescriptive Analytics Theory Hypothesis Pattern Observation

Data Warehousing Uses A Top-Down Approach Understand Corporate Strategy Gather Requirements Business Requirements Implement Data Warehouse Reporting & Analytics Design Reporting & Analytics Development Dimension Modelling Physical Design Technical Requirements ETL Design ETL Development Data sources Setup Infrastructure Install and Tune

The data lake Uses A Bottoms-Up Approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Devices Batch queries Interactive queries Real-time analytics Machine Learning Data warehouse

Data Lake + Data Warehouse Better Together What happened Descriptive Analytics What will happen Predictive Analytics Why did it happen Diagnostic Analytics Data sources How can we make it happen Prescriptive Analytics

Data lake defined

What is a data lake A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. A place to store unlimited amounts of data in any format inexpensively Allows collection of data that you may or may not use later: just in case A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: just in time or schema on read Complements EDW and can be seen as a data source for the EDW capturing all data but only passing relevant data to the EDW Frees up expensive EDW resources (storage and processing), especially for data refinement Allows for data exploration to be performed without waiting for the EDW team to model and load the data Some processing in better done on Hadoop than ETL tools like SSIS

Traditional Approaches Current state of a data warehouse MONITORING AND TELEMETRY DATA SOURCES ETL DATA WAREHOUSE BI AND ANALYTCIS OLTP ERP CRM LOB Star schemas, views other readoptimized structures Emailed, centrally stored Excel reports and dashboards Well manicured, often relational sources Known and expected data volume and formats Little to no change Complex, rigid transformations Required extensive monitoring Transformed historical into read structures Flat, canned or multi-dimensional access to historical data Many reports, multiple versions of the truth 24 to 48h delay

Traditional Approaches Current state of a data warehouse MONITORING AND TELEMETRY DATA SOURCES ETL DATA WAREHOUSE BI AND ANALYTCIS OLTP ERP CRM LOB Star schemas, views other readoptimized structures Emailed, centrally stored Excel reports and dashboards INCREASING DATA VOLUME NON-RELATIONAL DATA INCREASE IN TIME STALE REPORTING Increase in variety of data sources Increase in data volume Increase in types of data Pressure on the ingestion engine Complex, rigid transformations can t longer keep pace Monitoring is abandoned Delay in data, inability to transform volumes, or react to new sources Repair, adjust and redesign ETL Reports become invalid or unusable Delay in preserved reports increases Users begin to innovate to relieve starvation

New Approaches Data Lake Transformation (ELT not ETL) DATA SOURCES DATA WAREHOUSE Star schemas, views other readoptimized structures BI AND ANALYTCIS Discover and consume predictive analytics, data sets and other reports EXTRACT AND LOAD DATA LAKE DATA REFINERY PROCESS (TRANSFORM ON READ) OLTP ERP CRM LOB Transform relevant data into data sets NON-RELATIONAL DATA FUTURE DATA SOURCES All data sources are considered Leverages the power of on-prem technologies and the cloud for storage and capture Native formats, streaming data, big data Extract and load, no/minimal transform Storage of data in near-native format Orchestration becomes possible Streaming data accommodation becomes possible Refineries transform data on read Produce curated data sets to integrate with traditional warehouses Users discover published data sets/services using familiar tools

Data Lake layers Raw data layer raw events are stored for historical reference Cleansed data layer raw events are transformed into directly consumable data sets. Aim is to uniform the way files are stored in terms of encoding, format, data types and content (i.e. strings) Application data layer business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc) Still need data governance so your data lake does not turn into a data swamp!

Should I use Hadoop or NoSQL for the data lake Open-source software ecosystem that allows for massively parallel computing No inherent structure (no conversion to JSON needed) Good for batch processing, large files, volume writes, parallel scans, sequential access (NoSQL designed for large-scale OLTP) Large ecosystem of products Low cost Most implementations use Hadoop as the data lake

Hadoop as the data lake

What is Hadoop Distributed, scalable system on commodity HW Composed of a few parts: HDFS Distributed file system MapReduce Programming model Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Flume, Storm OPERATIONAL SERVICES AMBARI OOZIE FALCON Core Services FLUME SQOOP LOAD & EXTRACT NFS WebHDFS DATA SERVICES HBASE Hadoop Cluster compute & storage... PIG MAP REDUCE YARN HDFS HIVE & HCATALOG Main players are Hortonworks, Cloudera, MapR....... compute & storage Microsoft Confidential WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead) Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware

Hortonworks Data Platform 2.4 Simply put, Hortonworks ties all the open source products together (22)

The real cost of Hadoop http://www.wintercorp.com/tcod-report/

Use cases using Hadoop and a DW in combination Bringing islands of Hadoop data together Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) Exporting relational data to Hadoop (copy) (Hadoop as backup/dr, analysis, cloud use) Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)

Modern Data Warehouse

Modern Data Warehouse Think about future needs: Increasing data volumes Real-time performance New data sources and types Cloud-born data Multi-platform solution Hybrid architecture

The Dream Modern Data Warehouse

The Reality

Roles when use both Data Lake and DW Data Lake/Hadoop (staging and processing environment) Batch reporting Data refinement/cleaning ETL workloads Store historical data Sandbox for data exploration One-time reports Data scientist workloads Data Warehouse/RDBMS (serving and compliance environment) Low latency High number of users Additional security Large support for tools Easily create reports (Self-service BI) A data lake is just a glorified file folder with data files in it how many end-users can accurately create reports from it

Microsoft data platform solutions Product Category Description More Info SQL Server 2016 RDBMS Earned top spot in Gartner s Operational Database magic quadrant. JSON support SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support (preview) SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost Analytics Platform System (APS) MPP RDBMS Big data analytics appliance for high performance and seamless integration of all your data Azure Data Lake Store (preview) Hadoop storage Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics Azure Data Lake Analytics (preview) Hadoop-as-a-service/Big Data-as-a-service Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U- SQL, a new big data query language HDInsight PaaS Hadoop compute A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy DocumentDB Azure Table Storage PaaS NoSQL: Document Store PaaS NoSQL: Key-value Store Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax Store large amount of semi-structured data in the cloud https://www.microsoft.com/en-us/servercloud/products/sql-server-2016/ https://azure.microsoft.com/enus/services/sql-database/ https://azure.microsoft.com/enus/services/sql-data-warehouse/ https://www.microsoft.com/en-us/servercloud/products/analytics-platformsystem/ https://azure.microsoft.com/enus/services/data-lake-store/ https://azure.microsoft.com/enus/services/data-lake-analytics/ https://azure.microsoft.com/enus/services/hdinsight/ https://azure.microsoft.com/enus/services/documentdb/ https://azure.microsoft.com/enus/services/storage/tables/

Cortana Intelligence Suite Integrated as part of an end-to-end suite Data Sources Information Management Big Data Stores Machine Learning and Analytics Intelligence People Data Factory Data Lake Store Machine Learning Cognitive Services Data Catalog SQL Data Warehouse Data Lake Analytics Bot Framework Web Apps Event Hubs HDInsight (Hadoop and Spark) Cortana Apps Mobile Stream Analytics Bots Dashboards & Visualizations Sensors and devices Power BI Automated Systems Data Intelligence Action

Federated Querying

Federated Querying Other names: Data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data as it sits from multiple data sources, so as to not need to use ETL or learn more than one retrieval technology

PolyBase Query relational and non-relational data with T-SQL

Solution in the cloud

Benefits of the cloud Agility Grow hardware as demand is needed (unlimited elastic scale). Change hardware instantly Fire up a server quickly (abbreviated infrastructure implementation build-out times) Reduce hardware as demand lessons (pay for what you need) Make it easy to experiment, fail fast Availability High availability and disaster recovery built-in or easy to implement Reliability Four nines SLA, storage durability, network redundancy, automatic geography redundancy Security The cloud datacenters have the ultimate in security Other Cost savings: co-location space, hardware, power, etc No need to manage the hardware infrastructure, reallocating staff No commitment or long-term vendor lock Allows companies to benefit from changes in the technology impacting the latest storage solutions More frequent updates to OS, sql server, etc, done for you Really helpful for proof-of-concept (POC) or development projects with a known lifespan

Constraints of on-premise data Scale constrained to on-premise procurement Capex up-front costs, most companies instead prefer a yearly operating expense (OpEx) A staff of employees or consultants must be retained to administer and support the hardware and software in place Expertise needed for tuning and deployment

Talking points when using the cloud for DW Public and private cloud Cloud-born data vs on-prem born data Transfer cost from/to cloud and on-prem Sensitive data on-prem, non-sensitive in cloud Look at hybrid solutions

SMP vs MPP

SMP vs MPP SMP - Symmetric Multiprocessing Multiple CPUs used to complete individual processes simultaneously All CPUs share the same memory, disks, and network controllers (scale-up) All SQL Server implementations up until now have been SMP Mostly, the solution is housed on a shared SAN MPP - Massively Parallel Processing Uses many separate CPUs running in parallel to execute a single program Shared Nothing: Each CPU has its own memory and disk (scale-out) Segments communicate using high-speed network between nodes

DW SCALABILITY SPIDER CHART Data Volume Mixed Workload Strategic, Tactical Loads, SLA Strategic, Tactical Loads Strategic, Tactical 5 PB 500 TB 100 TB 50 TB 10 TB 1.000 10.000 Query Concurrency MPP Multidimensional Scalability SMP Tunable in one dimension on cost of other dimensions The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Data Freshness Near Real Time Data Feeds Daily Load Ad Hoc Queries Data Analysis/Mining Weekly Load Strategic Batch Reporting, Repetitive Queries 100 Simple Star MB s 3-5 Way Joins Multiple, Integrated Stars 5-10 Way Joins Normalized Query complexity Joins + OLAP operations + Aggregation + Complex Where constraints + Views Parallelism Big Data support is newest dimension. Query Freedom GB s TB s Multiple, Integrated Stars and Normalized Schema Sophistication Query Data Volume

Summary We live in an increasingly data-intensive world Much of the data stored online and analyzed today is more varied than the data stored in recent years More of our data arrives in near-real time This present a large business opportunity. Are you ready for it

Other Related Presentations Building a Big Data Solution How does Microsoft solve Big Data Benefits of the Azure cloud Should I move my database to the cloud Implement SQL Server on a Azure VM Relational databases vs Non-relational databases Introduction to Microsoft s Hadoop solution (HDInsight) Introducing Azure SQL Database Introducing Azure SQL Data Warehouse Visit my blog at: JamesSerra.com (where these slide decks are posted under the Presentation tab)

Resources Why use a data lake http://bit.ly/1wdy848 Big Data Architectures http://bit.ly/1rbbabs The Modern Data Warehouse: http://bit.ly/1xux4py Hadoop and Data Warehouses: http://bit.ly/1xuxfu9

Q & A James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/jamesserra Visit my blog at: JamesSerra.com (where this slide deck is posted under the Presentations tab)