New Modeling Challenges: Big Data, Hadoop, Cloud



Similar documents
Bringing Big Data to People

Implement Hadoop jobs to extract business value from large and varied data sets

Open Source Technologies on Microsoft Azure

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Understanding NoSQL Technologies on Windows Azure

Azure Data Lake Analytics

Cloud Scale Distributed Data Storage. Jürmo Mehine

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Modernizing Your Data Warehouse for Hadoop

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Oracle Database 12c Plug In. Switch On. Get SMART.

HDP Hadoop From concept to deployment.

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

The Future of Data Management

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Microsoft Azure Data Technologies: An Overview

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Upcoming Announcements

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Native Connectivity to Big Data Sources in MSTR 10

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

Integrating Big Data into the Computing Curricula

Understanding NoSQL on Microsoft Azure

NoSQL for SQL Professionals William McKnight

Data Modeling for Big Data

Agile Business Intelligence Data Lake Architecture

How To Handle Big Data With A Data Scientist

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

The Future of Data Management with Hadoop and the Enterprise Data Hub

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

How to Enhance Traditional BI Architecture to Leverage Big Data

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Constructing a Data Lake: Hadoop and Oracle Database United!

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

CIO Guide How to Use Hadoop with Your SAP Software Landscape

Reference Architecture, Requirements, Gaps, Roles

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Workshop on Hadoop with Big Data

TRAINING PROGRAM ON BIGDATA/HADOOP

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Luncheon Webinar Series May 13, 2013

Testing Big data is one of the biggest

Please give me your feedback

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

BIG DATA What it is and how to use?

How To Scale Out Of A Nosql Database

Oracle Big Data SQL Technical Update

HDP Enabling the Modern Data Architecture

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. DATA MANAGEMENT FOR ANALYTICS

Hadoop Job Oriented Training Agenda

INTRODUCTION TO CASSANDRA

Introduction to Apache Cassandra

Qsoft Inc

Can the Elephants Handle the NoSQL Onslaught?

Bringing Big Data into the Enterprise

Peers Techno log ies Pv t. L td. HADOOP

Big Data Course Highlights

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Building Your Big Data Team

BIG DATA TRENDS AND TECHNOLOGIES

Data Modeling in the Age of Big Data

Data Analyst Program- 0 to 100

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Preparing Your Data For Cloud

Practical Hadoop by Example

Big Data and Data Science: Behind the Buzz Words

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Apache Hadoop: The Big Data Refinery

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Sentimental Analysis using Hadoop Phase 2: Week 2

Enterprise Operational SQL on Hadoop Trafodion Overview

Traditional BI vs. Business Data Lake A comparison

Modern Data Warehousing

Dell In-Memory Appliance for Cloudera Enterprise

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Safe Harbor Statement

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

So What s the Big Deal?

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

A Brief Outline on Bigdata Hadoop

Transcription:

New Modeling Challenges: Big Data, Hadoop, Cloud Karen López @datachick www.datamodel.com Karen Lopez Love Your Data InfoAdvisors.com @datachick Senior Project Manager & Architect 1

Disclosure I m a Data Architect / Modeler I am biased You have been warned 4 POLL: Who Are You? 2014 - NoSQLNow! - D'Antoni / Lopez 2

5 POLL: Architect Much? 2014 - NoSQLNow! - D'Antoni / Lopez 6 POLL: Cloud Much? 2014 - NoSQLNow! - D'Antoni / Lopez 3

7 2014 - NoSQLNow! - D'Antoni / Lopez POLL: SQL Much? NoSQL Much? Hadoop Much? What we will be doing Terminology, Foundations and Thoughts. Modern Architectural Components Architecting versus building Cost, benefit, risk Demos! 4

What would you like to do? Outcomes 10 Concepts Features Existing Tech New Tech Hybrid Evaluate Pilot Production Benefits Trade-offs What Where When Why Architects Devs DBAs Modeling Dev Test Who How Costs Trade-offs How much 5

Terminology and Thoughts Because we need a vocabulary to share ideas Schemaless Hadoop Cloud Data versus Processing Read-Optimized Scale Up / Out BIG DATA Variable Data Polyglot Volume Analytics Data Reservoir Unstructured data Write-Optimized Logical Data Model Data Lake 6

Polyschematic SQL Server Relational Read-Optimized Persistence HA / DR On Premises Velocity Volume Business Intelligence External Data Physical Data Model JBIC data Data Swamp ROI Architect Fit Technology Business Data Protection Cost, Benefit, Risk 7

BIG DATA [x] Vs Data so big it s awkward to work with Always capitalized Big Data A confusing term because it defines what it IS NOT. NoSQL Scale Not SQL? Not Relational? Not Only Relational? A confusing term because it defines what it IS NOT. 8

Terminology ACID Atomic, Consistent, Isolated, Durable BASE Basically available, Soft state, Eventual consistency Eventual consistency Schemaless Constraints / Have-to/ MUST / OBEY / Rigid / Inflexible Relational Tables with rows On purpose With many benefits Write-optimized This is a FEATURE Transactionoptimized same columns with the same datatypes with the same constraints with the same domains Data integrity Data quality Consistent 9

Data Modeling Now What Data Models? Reverse engineered data models (diagrams) Physical Data Models Faux Logical Data Models Logical Data Models Conceptual Data Models Data Models Traditional Process Conceptual (Data) Model Logical Data Model Physical Data Model(s) MART MART OLTP OLTP OLTP OLTP OLTP OLTP OLTP Aug 2014 OLTP 10

Traditional Data Architect Involvement Project Initiation Architecture and Infrastructure Design SW Requirements Development Deployment The Big Data Story Lots of data Coming at us fast Lots of variety in format & quality We want all the data Highly available It s web scale 11

What do we really mean by scale? Bringing computing to the data Massively parallel processing Cheap, commodity hardware, but lots of it Optimized for Query/Reads/Questions/Telling stories Cloud Can we fit another buzzword in? Enable on-demand scaling Pay as you go pricing Click to deploy Service licensing, not product licensing, if any Managed by others, not your data center 12

But The Cloud is Different Fine Tuning nope Patching, maintaining Skills Professional development Optimized subsystems Putting off upgrades How you work Who you work for How you stay up to date How you think about problems How you get new features We ve been down this road before Traditional transactional applications Reportingoptimized tables/struct ures Data Warehouse / Dimensional Modeling 13

There was a lot of contention NoSQL, Not Only SQL Relational Graph Columnar/Column Family Key Value Document Databases Others 14

Graph Databases Key Value Pair 15

Columnar InfoAdvisors - infoadvisors.com Aug 2014 Modern Architectural Components & Concepts 16

Hadoop Ecosystem (a Zoo) Stats processing (RHadoop) Machine Learning (Mahout) Scripting (Pig) Query (Hive) Event Pipeline (Flume) Distributed Processing (MapReduce) Distributed Storage (HDFS) Scripting (Pig) Data Integration ( ODBC / SQOOP/ REST) Metadata (HCatalog) NoSQL Database (HBase) Pipeline / Workflow (Oozie) Graph (Pegasus) SQL-like Query (HiveQL) Distributed Processing (MapReduce) Legend Red = Core Hadoop Blue = Data processing Purple = Microsoft integration points and value adds Yellow = Data Movement Green = Packages Hadoop Zoo data warehouse SQL-like Query (Impala) Distributed Storage (HDFS) 17

MapReduce Shuffle and sort data Get from large data to smaller data Parallel processing Hive SQL-like query language Abstraction on top of MapReduce Metastructure on top of HDFS 18

HBase Column Family NoSQL database Key Value Store Hundreds of Millions/Billions of rows Based on Google Big Table Another type of HDInsight Cluster Large, Schemaless (really, Schema on read) Optimized for retrieving specific rows from large datasets Strictly Consistent Classic DW Architecture Data Mart ETL EDW Data Mart 19

Modern DW Architecture Hadoop ETL Analytics Mart EDW Data Mart What s new Computing Storage Icing 20

What s new Analytics Mart Scale and Nodes Scale up versus scale out Nodes, not beefier servers Data stays, clusters come and go Quick Create for playing Custom Create for architecting a solution 21

Every design decision should include cost, benefit and risk - Karen Lopez Common Themes In-memory capabilities Build Now, Analyze Later Framework Optimized for bulk loads, not singleton updates Read optimized workloads Most products have HA/DR built in by design Most products have scale out capabilities Text 22

Cloud vs On-Premises On-Premises Short Term Use Rapid Scale Test Use Cases Pay as you go Internet data source Large long term implementations Well known workloads Shared clusters Large initial investment Windows vs. Other OSs (Linux) 23

Graph Databases There s a book And it s FREE! Graphdatabases.com 24

Document Databases JSON, BSON, XML, YAML Key Value Pair 25

Key Value Pair Cassandra CQL SELECT balance FROM accounts WHERE account_id=3476 CREATE TABLE monkeyspecies ( species text PRIMARY KEY, common_name text, population varint, average_size int ) WITH comment='important biological records' AND read_repair_chance = 1.0; 26

Columnar 53 => CREATE TABLE Retail.Product_Dimension ( Product_Key integer NOT NULL, Product_Description varchar(128), SKU_Number char(32) NOT NULL, Vertica SQL Category_Description char(32), Department_Description char(32) NOT NULL, Package_Type_Description char(32), Package_Size char(32), Fat_Content integer, Diet_Type char(32), Weight integer, Weight_Units_of_Measure char(32), Shelf_Width integer, Shelf_Height integer, SELECT balance FROM accounts WHERE account_id=3476 Shelf_Depth integer ); 27

The CLOUD Infrastructure as a Service (IaaS) Platform as a Service (PaaS) Database as a Service (DBaaS) 55 Let s talk billing. Per hour Per node head node security node data nodes Data egress, not ingress Geo-redundancy 28

Blob versus HDFS HDFS HDInsight Blob versus HDFS HDFS HDInsight 29

Where Data Models Can Help Create a Physical Data Model of Key Data Sources Model the metadata Published in a shareable portal One Entity per tab or Worksheet Someone Normalized (master data + combinations) definitions gotchas expected domains metadata Cross model searchability Data Lineage Resuse http://www.irs.gov/file_source/pub/irs-soi/zipcode08flds.xls The Data Story *http://www.dsireusa.org/incentives/incentive.cfm?incentive_code=us37f 30

Discussion - Who What technical skills are needed? What professional skills are needed? What personal characteristics are needed? Discussion - Tools What data tools are needed? What data architecture tools are needed? What project management tools are needed? 31

Discussion - Money How much does a traditional DW/BI project cost? How much does operations & maintenance cost? How much does the software cost? How much do servers cost? Modern Data Architect Involvement Project Initiation Architecture and Infrastructure Design SW + Data Requirements Development Deployment 32

Anti-patterns for new DBs & Data Schemaless means faster delivery It doesn t matter if data changes We don t have to steward data Its selfdocumented We don t need a data architect We don t need ANY architects We don t have to understand data Cc Kazuhisa OTSUBO 33

10 Tips for Architects Understand the use cases for hybrid technologies. Evaluate / profile your data requirements for suitability for each database/datastore type. Understand the licensing / editions for commercial database features / products. Hadoop and other NoSQL technologies are optimized for read. Data Modeling and Analysis still happens, just later*. 34

10 Tips for Architects Hands on experience makes a real difference in your understanding. Evaluate which analytical features each product supports. Test your current development tools for support. Test your database design / data modeling tools. Leverage your existing metadata / models. Finally There business models of Open Source technologies means there are hundreds, maybe thousands of new things to learn. That s both a pro and a con for the architect. 35

Outcomes Concepts Features Existing Tech New Tech Hybrid Evaluate Pilot Production Benefits Trade-offs What Where When Why Architects Devs DBAs Modeling Dev Test Who How Wrap up Costs Trade-offs How much NoSQL = SQL + NoSQL External Data will become more important Modern Data Architecture = Hybrids, not either/or 36

Karen Lopez Love Your Data InfoAdvisors.com @datachick Senior Project Manager & Architect @datachick www.infoadvisors.com www.datamodel.com www.dataversity.com @venusbarbie @data_model 74 37

Thank You I d love to help you with your big data, nosql, cloud or other projects. karenlopez@infoadvisors.com 38