Copyright 2012, SAS Institute Inc. All rights reserved. DATA MANAGEMENT FOR ANALYTICS

Transcription:

DATA MANAGEMENT FOR ANALYTICS

WHAT IS ANALYTICS? A VERY BROAD TERM, OFTEN CONFUSED
Descriptive analytics: What happened? When? Why?
Advanced analytics: What will happen? When? Why? How do we benefit? What actions should I take?

THE DISCONNECT: TOH-MAY-TOH, TOH-MAH-TOH
"I need data."
"Can you be more specific?"
"Nope, not yet."

PARADIGM SHIFT: IT'S ABOUT THE DATA'S DESTINATION
Design, Extract, Transform, Load, Validate, Refresh

PARADIGM SHIFT: IT'S ABOUT THE DATA'S JOURNEY
Access, Explore, Clean, Transform, Analytic method

PREPARING DATA FOR ANALYTICS
Data Access: Access to multiple sources of data; new sources combined with existing and legacy sources. (Data types, data movement, combine, filter)
Understand: Validate data movement and verify consistency and completeness. (Statistical analysis, distributions, associations)
Cleansing: Perform cleansing functions on joined data to increase value. (De-duplication, enrichment, standardization, missing values)
Reshape: Data is rarely in the format needed, and different methods of analytics require different shapes of data. (Wide, long, transposition)
Leverage Metadata: Metadata is a valuable asset to assist in the collaboration between business and IT. (Understand how models are built, collaborate on the data)
Traditionally vs. operationalized, manual process vs. automated process: 80%, 20%, 80%, 20%

ACCESS: SO MANY DATA TYPES AND SOURCES
Sources: Access, Excel, SQLServer, Oracle, MySQL
boolean: Yes/No, Bit, Byte, N/A, Boolean
integer: Number, Int, Number, Int, Int
float: Number (single), Float, Number, Float, Numeric
currency: Currency, Money, NA, NA, Money
string: NA, Char, Char, Char, Char
string: Text, VarChar, VarChar, VarChar, VarChar
binary: OLE Obj Memo, Binary, Varbinary, Image, Long Raw, Blob, Text, Binary, Varbinary

ACCESS: SO MUCH DATA MOVEMENT
Diagram labels: Data, Data, Data, SAS Server
Push some, or ALL, processing to the data
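To make "push processing to the data" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a remote source. The `sales` table and its values are hypothetical and are not from the slides. Both options give the same totals, but the second moves only the summary rows.

```python
import sqlite3

# Hypothetical in-memory table standing in for a remote data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)

# Option A: move every detail row to the client, then aggregate locally.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
local_totals = {}
for region, amount in rows:
    local_totals[region] = local_totals.get(region, 0.0) + amount

# Option B: push the aggregation to the data; only the summary moves.
pushed_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

rows_moved_a = len(rows)           # every detail row was transferred
rows_moved_b = len(pushed_totals)  # one row per region was transferred
conn.close()
```

With real data volumes, the gap between the two row counts is what drives the data movement cost.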

UNDERSTAND: WHAT DO I HAVE AND HOW USEFUL IS IT?
Is my data consistent?
Is my data complete?
Is my data highly unique?
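The three profiling questions can be approximated with simple ratios. The sketch below is illustrative only: the `profile` helper, the sample `customers` records, and the choice of "distinct value types" as a crude consistency signal are all assumptions, not part of the presentation.

```python
def profile(rows, column):
    """Return completeness, uniqueness and a crude consistency signal."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    total = len(values)
    return {
        # Share of rows where the column is filled in.
        "complete": len(present) / total if total else 0.0,
        # Share of filled-in values that are distinct.
        "unique": len(set(present)) / len(present) if present else 0.0,
        # More than one type name here hints at inconsistent data.
        "distinct_types": sorted({type(v).__name__ for v in present}),
    }

# Hypothetical sample records.
customers = [
    {"id": 1, "state": "NC"},
    {"id": 2, "state": "nc"},
    {"id": 3, "state": None},
    {"id": 4, "state": "NC"},
]
id_profile = profile(customers, "id")
state_profile = profile(customers, "state")
```

Here `id` is fully complete and unique, while `state` has a missing value and mixed casing that a later cleansing step would need to standardize.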

UNDERSTAND: WHAT DO I HAVE AND HOW USEFUL IS IT?
Is my data normal?
Is my data linear?
What are the associations in the data?
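As a rough sketch of how these questions might be checked numerically, sample skewness gives a first hint about normality (a value near zero means a symmetric distribution) and the Pearson correlation coefficient measures linear association. The helper names and sample values are hypothetical. Neither statistic is a full test on its own.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample: x is symmetric, and y is an exact multiple of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
symmetric_skew = skewness(x)
linear_r = pearson(x, y)
```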

CLEAN: FILLING IN THE GAPS AND STANDARDIZING
Standardizing text
Standardizing numeric
De-duplication
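A minimal sketch of the three cleansing steps named on this slide, assuming very simple rules: trimming and upper-casing for text, stripping thousands separators for numbers, and keeping the first record per standardized key. The function names and sample records are hypothetical. Real data quality tools apply far richer matching logic.

```python
def standardize_text(value):
    """Trim, collapse internal whitespace, and upper-case a text value."""
    return " ".join(value.split()).upper()

def standardize_number(value):
    """Parse strings such as '1,200.50' into a float."""
    return float(str(value).replace(",", "").strip())

def deduplicate(records, key_fields):
    """Keep the first record seen for each standardized key."""
    seen, result = set(), []
    for record in records:
        key = tuple(standardize_text(str(record[f])) for f in key_fields)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result

# Hypothetical raw records; the first two describe the same company.
raw = [
    {"name": "  Acme  corp ", "revenue": "1,200.50"},
    {"name": "ACME CORP", "revenue": "1200.5"},
    {"name": "Globex", "revenue": " 300 "},
]
clean = [
    {"name": standardize_text(r["name"]),
     "revenue": standardize_number(r["revenue"])}
    for r in deduplicate(raw, ["name"])
]
```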

CLEAN: FILLING IN THE GAPS AND STANDARDIZING
Dropping outliers
Grouping or binning data
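One common way to drop outliers is Tukey's interquartile-range rule, and binning can be done with a list of upper edges. The slides do not say which methods were used, so the rule, the bin edges, and the sample ages below are all assumptions chosen for illustration.

```python
import statistics

def drop_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]

def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # values at or above the last edge

# Hypothetical ages; 430 is a data-entry error.
ages = [22, 25, 31, 38, 44, 52, 61, 430]
kept = drop_outliers(ages)
groups = [bin_value(a, [30, 50], ["young", "middle", "senior"]) for a in kept]
```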

RESHAPE: PURPOSE-BUILT DATA STRUCTURE
Efficient storage; fast retrieval; defined schema
WIDE tables / time series data; iteration (build, test, repeat); schema-less

RESHAPE: TURNING DATA AROUND
Add up all the quantities for each product purchased in each product category.
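The aggregation described here amounts to a group-by sum keyed on category and product. A minimal sketch, with hypothetical purchase records and a hypothetical `total_quantities` helper:

```python
from collections import defaultdict

# Hypothetical purchase records, one row per transaction line.
purchases = [
    {"category": "fruit", "product": "apple", "quantity": 3},
    {"category": "fruit", "product": "pear", "quantity": 1},
    {"category": "fruit", "product": "apple", "quantity": 2},
    {"category": "dairy", "product": "milk", "quantity": 4},
]

def total_quantities(rows):
    """Sum quantity for each (category, product) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["category"], row["product"])] += row["quantity"]
    return dict(totals)

totals = total_quantities(purchases)
```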

RESHAPE: TURNING DATA AROUND
Each product category becomes its own row, and each product purchased becomes its own column.
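This long-to-wide transposition can be sketched as a pivot: one output row per category and one column per product, with zero where a product was not purchased in that category. The input tuples and the `pivot` helper are hypothetical.

```python
# Hypothetical long-format input: (category, product, total quantity).
long_rows = [
    ("fruit", "apple", 5),
    ("fruit", "pear", 1),
    ("dairy", "milk", 4),
]

def pivot(rows):
    """Turn long rows into one wide row per category."""
    products = sorted({product for _, product, _ in rows})
    wide = {}
    for category, product, quantity in rows:
        # Start each category with a zero for every product column.
        row = wide.setdefault(category, dict.fromkeys(products, 0))
        row[product] += quantity
    return products, wide

columns, wide = pivot(long_rows)
```

The wide shape is what many analytic methods expect, which is why the same data often has to be stored one way and analyzed another.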

PARADIGM SHIFT: IT'S ABOUT THE DATA'S JOURNEY
Access, Explore, Clean, Transform, Analytic method
How can we do this better?

METADATA: LINEAGE & TRACEABILITY
A view into existing data sources/targets, jobs and the associated owners

METADATA: COLLABORATION AND REPEATABILITY
Managed, collaborative environments with shared content, data sources and personal development space

LEVERAGING A FRAMEWORK FOR SUCCESS
Sources: XML, Cloud, MQ, RDBMS
Data management: event stream processing, data integration, data access, data quality, data virtualization, master data management
Data governance
Consumers

GROWTH OF THE INTERNET OF THINGS TRENDS TODAY

ENGINEERED FOR FAST AND ADAPTIVE ACTION
Event Stream Processing Model: low-latency assessment of high-volume, high-velocity data streams to detect, filter, aggregate & analyze
Diagram labels: Publish, Subscribe, Streaming Events, Event Actions, Continuous Query, SAS In-Memory, SAS-generated Insights, Enrichment Data, Analytic Models, Business Rules
Copyright 2015, SAS Institute Inc. All rights reserved.

STREAMING DATA
Take real-time action: Detect and monitor events of interest and trigger appropriate real-time actions & alerts
Apply multi-phase analytics: Apply multi-phase analytics to determine events that can benefit from deeper and more complex analysis
Focus on relevant data: Continuous loading of relevant streaming data for in-depth analytics
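The detect, filter and aggregate pattern can be illustrated with a plain Python generator that keeps a sliding window over incoming events. This is only a conceptual sketch and does not use any SAS Event Stream Processing API. The event fields, the threshold and the window size are hypothetical.

```python
from collections import deque

def continuous_query(events, threshold, window_size):
    """Yield (event, window_average) alerts as events stream in."""
    window = deque(maxlen=window_size)
    for event in events:
        if event["sensor"] != "temp":   # filter: focus on relevant data
            continue
        window.append(event["value"])   # aggregate: sliding window
        average = sum(window) / len(window)
        if average > threshold:         # detect: trigger an action
            yield event, average

# Hypothetical event stream.
stream = [
    {"sensor": "temp", "value": 70},
    {"sensor": "humidity", "value": 40},
    {"sensor": "temp", "value": 90},
    {"sensor": "temp", "value": 110},
]
alerts = list(continuous_query(stream, threshold=85, window_size=2))
```

Because the query is a generator, each alert is produced as soon as its triggering event arrives rather than after the whole stream has been stored.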

HADOOP TRENDS: WHY HADOOP?
1. Store data for less
2. Process data more quickly (for less $)

HADOOP TRENDS: ROLES IT'S PLAYING
Stage structured data.
Process structured data.
Archive any data.
Process any data.
Access any data (via data warehouse).
Access any data (via Hadoop).

TERMINOLOGY: TRADITIONAL
Primary Key, RDBMS, Relationship, Index, Normalize, Database, Constraint, Table, Foreign Key, SQL, Schema

TERMINOLOGY: HADOOP
Hadoop Cluster, NameNode, Pig, Hive, DataNode, HDFS, YARN, Block, Cloudera, JobTracker, MapReduce
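Of these terms, MapReduce is the one most easily shown in a few lines. The sketch below imitates the map, shuffle and reduce phases of a word count in plain Python. It runs in a single process and uses no Hadoop API. The function names and input strings are hypothetical, and the `blocks` list only loosely stands in for data blocks spread across DataNodes.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical input, one string per block.
blocks = ["big data big insight", "data lake"]
mapped = [pair for line in blocks for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```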

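TERMINOLOGY: IT'S ALL GREEK TO ME (MOST)!
Παραδεισένιο νησί (paradise island). Αρχαίοι ναοί (ancient temples). Είναι όλα τα ελληνικά μου (it's all Greek to me). Σαλάτα (salad). Ο Θεός της βροντής (the god of thunder). Γιαούρτι (yogurt). Ολυμπιακοί Αγώνες (Olympic Games). Ελληνορωμαϊκή (Greco-Roman). Όμορφη αρχιτεκτονική (beautiful architecture). Μεγάλοι της λογοτεχνίας και της φιλοσοφίας (greats of literature and philosophy). Τραγωδία (tragedy). Μεσογείου (Mediterranean).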

SAS & INTEL STUDY: HADOOP ADOPTION & CHALLENGES
Research summary: SAS and Intel asked more than 300 IT managers from the largest companies in Denmark, Finland, Norway and Sweden about the adoption of Big Data analytics and Hadoop. http://nordichadoopsurvey.com
Results & Key Findings
60% - cited advanced analytics, data discovery, or as an analytical lab
Primary reason for considering Hadoop: 22% - would like to speed up processing
Adoption / Obstacles: 35% - cited Resources and Competencies

HADOOP: BIG DATA CHALLENGES

CHALLENGES: HADOOP SKILLS SHORTAGE; CURRENT USER TOOLS ARE NOT BIG DATA ENABLED
1) Performing even the simplest tasks in Hadoop typically requires mastering disparate tools and writing hundreds of lines of code: MapReduce, Pig Latin, HiveQL, HDFS, Sqoop and Oozie.
2) User tools are not engineered to process data inside Hadoop.
   Tools are not optimized for Hadoop.
   Users move data out of Hadoop to do data management and data quality.
   This requires more processing time.
   Data is duplicated and more storage is required.
   Users do not use the Hadoop platform as it was designed.

SELF-SERVICE DATA PREPARATION FOR HADOOP
Manage data inside Hadoop
Reduce complexity of Hadoop
Accelerate user adoption

SAS DATA LOADER FOR HADOOP: SELF-SERVICE DATA PREPARATION FOR HADOOP
Reduce complexity of Hadoop; manage data inside Hadoop; accelerate user adoption
Capabilities: Query, Join and Filter; Transform and Integrate; Analytics; Profile; Load into Hadoop and memory; Cleanse
Empower Business Users - Unburden IT - Harness the Power of Big Data

SAS DATA MANAGEMENT: THE DATA MANAGEMENT JOURNEY, GETTING STARTED
What does Data Governance mean to us?
How do we implement and sustain a program?
How do we even get started?
You can get there from here!

REVERSE IT: BE MORE PRODUCTIVE
20% / 80%

MERCI BEAUCOUP! (THANK YOU VERY MUCH!) www.sas.com