BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Ralph Kimball Associates 2014

The Data Warehouse Mission
- Identify all possible enterprise data assets
- Select those assets that have actionable content and can be accessed
- Bring the data assets into a logically centralized enterprise data warehouse
- Expose those data assets most effectively for decision making

Enormous RDBMS Legacy
- Legacy RDBMSs have been spectacularly successful, and we will continue to use them.
- Too successful
  - If all you have is a hammer, everything looks like a nail.
- RDBMS dilemma: a new ocean of new data types that are being monetized for strategic advantage
  - Unstructured, semi-structured and machine data
  - Evolving schemas, just-in-time schemas
  - Links, images, genomes, geo-positions, log data

Houston: we have a problem
- Traditional RDBMSs cannot handle
  - The new data types
  - Extended analytic processing
  - Terabytes/hour loading with immediate query access
- We want to use SQL and SQL-like languages, but we don't want the RDBMS storage constraints
- The disruptive solution: Hadoop

The Data Warehouse Stack in Hadoop
- Hadoop is an open source distributed storage and processing framework
- To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:

Hadoop for Exploratory DW/BI
- Sources: Transactions; Free text; Images; Machines/Sensors; Links/Networks; EDW Overflow
- Purpose built for EXTREME I/O speeds; use ETL tool or Sqoop
- HDFS Files: Industry standard HW; Fault tolerant; Replicated; Write once(!); Agnostic content; Scalable to infinity
- Metadata (system table): HCatalog
  - All clients can use this to read files
- Query Engines: Hive SQL; Impala SQL; Others
  - These are query engines, not databases!
  - Query engines can access HDFS files before ETL
- BI Tools: Tableau; Bus Obj; Cognos; QlikView; Others
  - BI tools are the ultimate glue integrating EDW resources

Data Load to Query in One Step
- Copy into HDFS with ETL tool, Sqoop, or Flume
  - into standard HDFS files (write once)
  - registering metadata with HCatalog
- Declare query schema in Hive or Impala (no data copying or reloading)
- Immediately launch familiar SQL queries: Exploratory BI
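
The load-then-declare flow above is schema on read. A minimal Python sketch of the idea follows, using only the standard library as a stand-in for HDFS and Hive/Impala. The file contents, the `SCHEMA` list, and the helpers `read_with_schema` and `total_by` are hypothetical names chosen for illustration; none of them comes from the slides or from any Hadoop API.

```python
import csv
import io

# Stand-in for a write-once raw file landed in HDFS (contents are made up).
RAW_FILE = (
    "2014-03-01,store_7,19.50\n"
    "2014-03-01,store_9,5.00\n"
    "2014-03-02,store_7,12.50\n"
)

# Schema declared after the load, at query time: (column name, type converter).
SCHEMA = [("sale_date", str), ("store", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; the raw text is never rewritten."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_by(rows, key, measure):
    """Rough equivalent of SELECT key, SUM(measure) ... GROUP BY key."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0.0) + row[measure]
    return totals


sales_by_store = total_by(read_with_schema(RAW_FILE, SCHEMA), "store", "amount")
print(sales_by_store)
```

Declaring a second schema over the same `RAW_FILE` (for example, reading `amount` as a string) would need no reload, which is the point the slide makes about exploratory BI.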

Typical Large Hadoop Cluster
- 100 nodes (5 racks)
- Each node
  - Dual hex core CPU running at 3 GHz
  - 64-378 GB of RAM
  - 24-36 TB disk storage (6-10 TB effective storage with default redundancy of 3X)
- Overall cluster (!)
  - 6.4-37.8 TB of RAM (wow, think about this)
  - Up to a PB of effective storage
- Approximate fully loaded cost per TB: $1000 +/-
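
The cluster totals follow from the per-node figures by simple arithmetic, sketched below. The constant and function names are hypothetical. Note that dividing raw disk by the 3X redundancy alone gives 8-12 TB per node; the slide quotes a lower 6-10 TB effective figure and does not say what accounts for the difference, so the sketch reports both rather than guessing.

```python
NODES = 100
RAM_GB_PER_NODE = (64, 378)             # low and high figures from the slide
RAW_DISK_TB_PER_NODE = (24, 36)         # low and high figures from the slide
SLIDE_EFFECTIVE_TB_PER_NODE = (6, 10)   # effective storage as quoted on the slide
REPLICATION = 3                         # default redundancy quoted on the slide


def cluster_ram_tb(ram_gb_per_node, nodes=NODES):
    """Total cluster RAM in decimal terabytes."""
    return ram_gb_per_node * nodes / 1000


def replicated_capacity_tb(raw_tb, replication=REPLICATION):
    """Capacity left after storing every block `replication` times."""
    return raw_tb / replication


ram_low, ram_high = (cluster_ram_tb(gb) for gb in RAM_GB_PER_NODE)
node_low, node_high = (replicated_capacity_tb(tb) for tb in RAW_DISK_TB_PER_NODE)
cluster_effective_pb = SLIDE_EFFECTIVE_TB_PER_NODE[1] * NODES / 1000

print(f"Cluster RAM: {ram_low}-{ram_high} TB")
print(f"Per-node capacity after 3X replication only: {node_low}-{node_high} TB")
print(f"Cluster effective storage at the slide's upper figure: {cluster_effective_pb} PB")
```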

Committing to High Performance HDFS Files with Embedded Schemas
- Sources: Transactions; Free text; Images; Machines/Sensors; Links/Networks; EDW Overflow
- Purpose built for EXTREME I/O speeds; use ETL tool or Sqoop
- HDFS Raw Files: Commodity HW; Fault tolerant; Replicated; Append Only(!); Agnostic content; Scalable to infinity
- Parquet Columnar Files: Read optimized schema defined column store
- Metadata (system table): HCatalog
  - All clients can use this to read files
- Query Engines: Hive SQL; Impala SQL; Others
  - These are query engines, not databases!
- BI Tools: Tableau; Bus Obj; Cognos; QlikView; Others

High Performance Data Warehouse Thread in Hadoop
- Copy data from raw HDFS file into Parquet columnar file
  - Parquet is not a database: it's a file accessible to multiple query and analysis apps
  - Parquet data can be updated and the schema modified
- Query Parquet data with Hive or Impala
  - At least 10x performance gain over simple raw file
  - Hive launches MapReduce jobs: relation scan
    - Ideal for ETL and transfer to conventional EDW
  - Impala launches in-memory individual queries
    - Ideal for interactive query in Hadoop destination DW
  - Impala at least 10x additional performance gain over Hive
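
One reason a columnar file helps analytic queries is that an aggregate over one column does not have to touch the others. The toy Python sketch below counts values read under a row layout and a column layout. It is an illustration only: it does not model Parquet's actual file format (row groups, encoding, compression), it does not reproduce the 10x figure above, and the data and function names are hypothetical.

```python
COLUMN_NAMES = ("sale_date", "store", "product", "quantity", "amount")

# Row layout: each record stored together (made-up data).
ROWS = [
    ("2014-03-01", "store_7", "widget", 3, 19.5),
    ("2014-03-01", "store_9", "gadget", 1, 5.0),
    ("2014-03-02", "store_7", "widget", 2, 12.5),
]


def to_columns(rows, names):
    """Pivot records into one list per column (the columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def sum_from_rows(rows, names, measure):
    """SUM(measure) over the row layout; every field of every row is read."""
    idx = names.index(measure)
    total, values_read = 0.0, 0
    for row in rows:
        values_read += len(row)
        total += row[idx]
    return total, values_read


def sum_from_columns(columns, measure):
    """SUM(measure) over the column layout; only one column is read."""
    column = columns[measure]
    return sum(column), len(column)


row_total, row_reads = sum_from_rows(ROWS, COLUMN_NAMES, "amount")
col_total, col_reads = sum_from_columns(to_columns(ROWS, COLUMN_NAMES), "amount")
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts return the same total; the column layout reads one value per row instead of five.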

Use Hadoop as Platform for Direct Analysis or ETL to Text/Number DB
- Huge array of special analysis apps for
  - Unstructured text
  - Hyper structured text/numbers (machine data)
  - Positional data from GPS
  - Images
  - Audio, video
- Consume results with increasing SQL support from these individual apps
- Or, write text/number data into Hadoop from unstructured source or external EDW relational DBMS

The Larger Picture: Why Use Hadoop as Part of Your EDW?
- Strategic:
  - Open floodgates to new kinds of data
  - New kinds of analysis impossible in RDBMS
  - Schema on read for exploratory BI
  - Attack same data from multiple perspectives
  - Choose SQL and non-SQL approaches at query time
  - Keep hyper granular data in active archive forever
  - No compromise data analysis
  - Compliance
  - Simultaneous incompatible analysis modes on same data files
  - Enterprise data hub: one location for all data resources
- Tactical:
  - Dramatically lowered operational costs
  - Linear scaling across response time, concurrency, and data size well beyond petabytes
  - Highly reliable write-once, redundantly stored data
  - Meet ETL SLAs

It's Not That Difficult
- Important existing tools already work in Hadoop
  - ETL tool suites: familiar data flows and user interfaces
  - BI query tools: identical user interfaces, integration
  - Standard job schedulers, sort packages (e.g. SyncSort)
- Skills you need anyway:
  - Java, Python or Ruby, C, SQL, Sqoop data transfer
  - Linux admin
  - but, MapReduce programming no longer needed
- Investigate and add incrementally:
  - Analytic tools: MADLib extensions to RDBMS, SAS, R
  - Specialty data tools, e.g., Splunk (machine data)

Integration is Crucial
- Integration is MORE than bringing separate data sources onto a common platform.
- Suppose you have two customer-facing data sources in your DW producing the following results. Is this integration?

Doing Integration the Right Way
- Teaspoon sip of EDW 101 for Hadoop Professionals!
- Build a conformed dimension library
- Plan to download dimensions from EDW
- Attach conformed dimensions to every possible source
- Join dimensions at query time to fact tables in SQL-capable files
- Embed dimension content as columns in NoSQL structures, and also HBase.

Integrating Big Data
- Remember: Data warehouse integration is drilling across:
  - Establish conformed attributes (e.g., Customer Category) in each database
  - Fetch separate answer sets from different platforms grouped on the same conformed attributes
  - Sort-merge the answer sets at the BI layer (or is this Post-BI? Depends on the way you fetch)
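
The drill-across steps can be sketched in a few lines of Python. This is a minimal illustration, assuming each platform has already returned one row per value of the conformed attribute (here Customer Category). The two answer sets, their measures, and the `drill_across` function are hypothetical.

```python
def drill_across(left, right):
    """Sort-merge two answer sets of (conformed_key, measure) pairs.

    Returns (key, left_measure, right_measure) rows; a side with no
    matching key contributes None. Assumes keys are unique per answer set.
    """
    left, right = sorted(left), sorted(right)
    i = j = 0
    merged = []
    while i < len(left) or j < len(right):
        lk = left[i][0] if i < len(left) else None
        rk = right[j][0] if j < len(right) else None
        if rk is None or (lk is not None and lk < rk):
            merged.append((lk, left[i][1], None))     # key only on the left
            i += 1
        elif lk is None or rk < lk:
            merged.append((rk, None, right[j][1]))    # key only on the right
            j += 1
        else:
            merged.append((lk, left[i][1], right[j][1]))  # key on both sides
            i += 1
            j += 1
    return merged


# Answer set 1: e.g. web visits per category, fetched from a Hadoop query engine.
web_visits = [("SMB", 80), ("Consumer", 120), ("Enterprise", 300)]
# Answer set 2: e.g. support calls per category, fetched from the EDW.
support_calls = [("Enterprise", 14), ("Consumer", 42), ("Government", 5)]

report = drill_across(web_visits, support_calls)
for row in report:
    print(row)
```

The merge only works because both sources label their rows with the same conformed attribute values; without that, there is nothing to line the rows up on.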

Out of the Box Possibility: Billions of Rows, Millions of Columns
- Tough problem for all current relational platforms: huge Name-Value data sources (e.g. customer observations)
- Think about HBase (!)
  - Intended for impossibly wide schemas
  - Fully general binary data content
  - Fire hose SCD1 and SCD2 updates of individual records
  - Continuously growing rows and columns
  - Only simple SQL direct access possible now: no joins
  - Not yet ready for full EDW membership
- Stay tuned!
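
To make the name-value idea and the two update styles concrete, here is a toy in-memory sketch in Python. It is not the HBase client API and makes no claim about HBase internals; the `SparseWideTable` class, its methods, and the sample data are hypothetical. SCD1 is shown as an overwrite that discards history, and SCD2 as an append that keeps every timestamped version.

```python
from collections import defaultdict


class SparseWideTable:
    """Toy store: row key -> column name -> list of (timestamp, value)."""

    def __init__(self):
        self._cells = defaultdict(lambda: defaultdict(list))

    def put_scd1(self, row_key, column, value, ts):
        """Type 1 change: overwrite the cell, keeping no history."""
        self._cells[row_key][column] = [(ts, value)]

    def put_scd2(self, row_key, column, value, ts):
        """Type 2 change: append a new version, preserving history."""
        self._cells[row_key][column].append((ts, value))

    def get(self, row_key, column, as_of=None):
        """Return the latest value, or the latest at or before `as_of`."""
        versions = self._cells.get(row_key, {}).get(column, [])
        visible = [c for c in versions if as_of is None or c[0] <= as_of]
        return max(visible, key=lambda cell: cell[0])[1] if visible else None

    def columns(self, row_key):
        """Columns present on this row; rows need not share columns."""
        return sorted(self._cells.get(row_key, {}))


table = SparseWideTable()
table.put_scd1("cust_42", "email", "old@example.com", ts=1)
table.put_scd1("cust_42", "email", "new@example.com", ts=2)
table.put_scd2("cust_42", "segment", "SMB", ts=1)
table.put_scd2("cust_42", "segment", "Enterprise", ts=5)
table.put_scd1("cust_43", "last_page_viewed", "/pricing", ts=3)

print(table.get("cust_42", "email"))
print(table.get("cust_42", "segment", as_of=3))
print(table.columns("cust_42"), table.columns("cust_43"))
```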

Summing Up: The Data Warehouse Renaissance
- Hadoop DW becomes equal partner with Enterprise DW
- Hadoop will be the strategic environment of choice for new data types and new analysis modes
- Hadoop:
  - Extreme data type diversity
  - Huge library of specialty analysis tools with SQL extensions
  - Starting point for exploratory BI and ETL-to-EDW processing
  - Destination point for serious BI
  - Permanent active archive of hyper granular data
- BI tools implement Hadoop-to-EDW integration
  - BI tools must step up to deliver the final integration payload

The Kimball Group Resource
- www.kimballgroup.com
- Best selling data warehouse books
  - NEW BOOK! The Classic Toolkit 3rd Ed.
- In depth data warehouse classes taught by primary authors
  - Dimensional modeling (Ralph/Margy)
  - ETL architecture (Ralph/Bob)
- Dimensional design reviews and consulting by Kimball Group principals
- White Papers on Integration, Data Quality, and Big Data Analytics