DATA MANAGEMENT FOR ANALYTICS
WHAT IS ANALYTICS? A VERY BROAD TERM, OFTEN CONFUSED. Descriptive analytics: What happened? When? Why? Advanced analytics: What will happen? When? Why? How do we benefit? What actions should I take?
THE DISCONNECT: TOH-MAY-TOH, TOH-MAH-TOH. "I need data." "Can you be more specific?" "Nope, not yet."
PARADIGM SHIFT: IT'S ABOUT THE DATA'S DESTINATION. Design, Extract, Transform, Load, Validate, Refresh.
PARADIGM SHIFT: IT'S ABOUT THE DATA'S JOURNEY. Access, Explore, Clean, Transform, Analytic method.
PREPARING DATA FOR ANALYTICS
Data Access: Access to multiple sources of data. New sources combined with existing and legacy sources. Topics: Data Types, Data Movement, Combine, Filter.
Understand: Validate data movement and verify consistency and completeness. Topics: Statistical Analysis, Distributions, Associations.
Cleansing: Perform cleansing functions on joined data to increase value. Topics: De-duplication, Enrichment, Standardization, Missing values.
Reshape: Data is rarely in the format needed, and different methods of analytics require different shapes of data. Topics: Wide, Long, Transposition.
Leverage Metadata: Metadata is a valuable asset to assist in the collaboration between business and IT. Topics: Understand how models are built, Collaborate on the data.
TRADITIONALLY / OPERATIONALIZED. MANUAL PROCESS / AUTOMATED PROCESS. 80% / 20%, 80% / 20%.
ACCESS: SO MANY DATA TYPES AND SOURCES
Sources: Access, Excel, SQLServer, Oracle, MySQL
boolean: Yes/No | Bit | Byte | N/A | Boolean
integer: Number | Int | Number | Int | Int
float: Number (single) | Float | Number | Float | Numeric
currency: Currency | Money | N/A | N/A | Money
string: N/A | Char | Char | Char | Char
string: Text | VarChar | VarChar | VarChar | VarChar
binary: OLE Obj, Memo | Binary, Varbinary, Image | Long Raw | Blob, Text | Binary, Varbinary
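One way to cope with the naming differences on this slide is a lookup from native type names to a logical type. The sketch below is only an illustration: the rows are transcribed from the slide, the two string rows are merged, and the function names (`build_lookup`, `logical_types`) are hypothetical. It deliberately returns a set, because some native names (for example Number or Text) appear under more than one logical type.

```python
from collections import defaultdict

# Rows transcribed from the slide: logical type -> native names seen across sources.
# The slide's two "string" rows are merged here; "N/A" entries are left out.
TYPE_ROWS = {
    "boolean": ["Yes/No", "Bit", "Byte", "Boolean"],
    "integer": ["Number", "Int"],
    "float": ["Number (single)", "Float", "Number", "Numeric"],
    "currency": ["Currency", "Money"],
    "string": ["Char", "Text", "VarChar"],
    "binary": ["OLE Obj", "Memo", "Binary", "Varbinary", "Image",
               "Long Raw", "Blob", "Text"],
}


def build_lookup(rows):
    """Invert the table: native type name (case-insensitive) -> set of logical types."""
    lookup = defaultdict(set)
    for logical, natives in rows.items():
        for native in natives:
            lookup[native.casefold()].add(logical)
    return dict(lookup)


def logical_types(native_name, lookup):
    """Return the candidate logical types for a native name, or an empty set if unknown."""
    return lookup.get(native_name.strip().casefold(), set())


LOOKUP = build_lookup(TYPE_ROWS)
print(sorted(logical_types("VarChar", LOOKUP)))  # one candidate
print(sorted(logical_types("Number", LOOKUP)))   # ambiguous: two candidates
```

An ambiguous result is a signal that the source system must be known before the column can be typed safely.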
ACCESS: SO MUCH DATA MOVEMENT. (Diagram: several data sources and a SAS Server.) Push some, or ALL, processing to the data.
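The idea of pushing processing to the data can be sketched with Python's built-in `sqlite3` module standing in for a remote database. The `sales` table and both function names are assumptions made for illustration; the point is only the contrast between moving every row to the client and letting the database aggregate.

```python
import sqlite3

# An in-memory database stands in for the remote data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)


def totals_client_side(conn):
    """Move every row to the client, then aggregate there."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_pushed_down(conn):
    """Let the database aggregate; only one row per region is moved."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))


print(totals_client_side(conn) == totals_pushed_down(conn))
```

Both functions return the same totals; the second transfers far fewer rows as the table grows.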
UNDERSTAND: WHAT DO I HAVE AND HOW USEFUL IS IT? Is my data consistent? Is my data complete? Is my data highly unique?
UNDERSTAND: WHAT DO I HAVE AND HOW USEFUL IS IT? Is my data normal? Is my data linear? What are the associations in the data?
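The three questions above map onto simple column profiling measures. The following is a minimal sketch, assuming a column is a plain Python list in which `None` and blank strings count as missing; `profile_column` is a hypothetical helper, not part of any SAS product.

```python
def profile_column(values):
    """Report completeness, uniqueness, and the value types found in one column."""
    total = len(values)
    # Treat None and blank strings as missing (an assumption of this sketch).
    present = [v for v in values if v is not None and str(v).strip() != ""]
    distinct = set(present)
    return {
        # Share of rows that carry a value: "Is my data complete?"
        "completeness": len(present) / total if total else 0.0,
        # Share of present values that are distinct: "Is my data highly unique?"
        "uniqueness": len(distinct) / len(present) if present else 0.0,
        # More than one type hints at inconsistency: "Is my data consistent?"
        "types": sorted({type(v).__name__ for v in present}),
    }


print(profile_column(["a", "b", "a", None, ""]))
print(profile_column([1, "1", 2]))
```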
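Two rough numeric checks can help with these questions: sample skewness as a first hint about symmetry, and the Pearson correlation coefficient as a measure of linear association. This is a sketch in plain Python under those assumptions. A skewness near zero does not prove normality, and a formal test would be needed for that.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def skewness(xs):
    """Population skewness: third central moment divided by variance ** 1.5."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def pearson(xs, ys):
    """Pearson correlation coefficient: +1 or -1 means a perfectly linear relation."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(skewness([1, 2, 3, 4, 5]))       # symmetric sample
print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
```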
CLEAN: FILLING IN THE GAPS AND STANDARDIZING. Standardizing text, standardizing numeric values, de-duplication.
CLEAN: FILLING IN THE GAPS AND STANDARDIZING. Dropping outliers; grouping or binning data.
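A minimal sketch of these cleansing steps in plain Python follows. The function names and the chosen rules (title case for text, commas stripped from numbers, first record wins on duplicates) are assumptions for illustration, not a description of any particular tool.

```python
import re


def standardize_text(value):
    """Collapse runs of whitespace, trim, and title-case a text field."""
    return re.sub(r"\s+", " ", value).strip().title()


def standardize_numeric(value):
    """Parse strings such as '1,200.50' or ' 42 ' into a float."""
    return float(str(value).replace(",", "").strip())


def fill_missing(values, fill):
    """Replace None with a caller-supplied fill value."""
    return [fill if v is None else v for v in values]


def deduplicate(records, key):
    """Keep the first record for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


customers = [{"name": " acme corp"}, {"name": "ACME  Corp"}, {"name": "Globex"}]
# Standardize before comparing so spelling variants collapse to one key.
print(deduplicate(customers, key=lambda r: standardize_text(r["name"])))
```

Standardizing before de-duplicating matters: without it, the two Acme spellings would be treated as different customers.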
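Both operations can be sketched with the Python standard library. The interquartile-range rule with a multiplier of 1.5 is one common convention for flagging outliers, chosen here as an assumption; the bin edges and labels in the example are likewise illustrative.

```python
import bisect
import statistics


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def bin_values(values, edges, labels):
    """Assign a label per value; each edge is the inclusive lower bound of the next bin.

    `edges` must be sorted and `labels` must have len(edges) + 1 entries.
    """
    return [labels[bisect.bisect_right(edges, v)] for v in values]


print(drop_outliers([10, 12, 11, 13, 12, 11, 100]))
print(bin_values([5, 18, 40, 65, 80], [18, 65], ["minor", "adult", "senior"]))
```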
RESHAPE: PURPOSE-BUILT DATA STRUCTURE. Efficient storage, fast retrieval, defined schema. Wide tables / time series data, iteration (build, test, repeat), schema-less.
RESHAPE: TURNING DATA AROUND. Add up all the quantities for each product purchased in each product category.
RESHAPE: TURNING DATA AROUND. Each product category becomes its own row, and each product purchased becomes its own distinct column.
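The reshape described on these two slides, summing quantities and then turning long rows into one wide row per category, can be sketched as follows. The `(category, product, quantity)` row layout and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def pivot_quantities(rows):
    """Turn long (category, product, quantity) rows into one wide row per category."""
    table = defaultdict(lambda: defaultdict(int))
    for category, product, quantity in rows:
        # Add up all quantities for each product within each category.
        table[category][product] += quantity
    # Every product becomes a column; absent combinations are filled with 0.
    products = sorted({product for _, product, _ in rows})
    return {
        category: {product: cells.get(product, 0) for product in products}
        for category, cells in table.items()
    }


purchases = [
    ("fruit", "apple", 2),
    ("fruit", "apple", 3),
    ("fruit", "pear", 1),
    ("dairy", "milk", 4),
]
for category, columns in pivot_quantities(purchases).items():
    print(category, columns)
```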
PARADIGM SHIFT: IT'S ABOUT THE DATA'S JOURNEY. Access, Explore, Clean, Transform, Analytic method. How can we do this better?
METADATA: LINEAGE & TRACEABILITY. A view into existing data sources/targets, jobs, and the associated owners.
METADATA: COLLABORATION AND REPEATABILITY. Managed, collaborative environments with shared content, data sources, and personal development space.
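Lineage of this kind can be modeled as a small graph of jobs that read sources and write targets. The sketch below is hypothetical throughout: the job records, dataset names, and owners are invented for illustration and do not reflect any specific metadata repository.

```python
# Hypothetical job metadata: each job reads sources, writes targets, and has an owner.
JOBS = [
    {"job": "load_orders", "owner": "alice",
     "sources": ["crm.orders"], "targets": ["stage.orders"]},
    {"job": "build_mart", "owner": "bob",
     "sources": ["stage.orders", "ref.products"], "targets": ["mart.sales"]},
]


def upstream(dataset, jobs):
    """Return (job, owner, source) for everything feeding `dataset`, directly or indirectly."""
    result = []
    stack = [dataset]
    visited = set()
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # guard against cycles and repeated visits
        visited.add(current)
        for job in jobs:
            if current in job["targets"]:
                for source in job["sources"]:
                    result.append((job["job"], job["owner"], source))
                    stack.append(source)
    return result


for job, owner, source in upstream("mart.sales", JOBS):
    print(f"{source} -> {job} (owner: {owner})")
```

Walking the graph in the other direction, from a source to its targets, would answer the impact question: what breaks if this source changes.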
LEVERAGING A FRAMEWORK FOR SUCCESS. SOURCES: XML, Cloud, MQ, RDBMS. DATA MANAGEMENT: Event Stream Processing, Data Integration, Data Access, Data Quality, Data Virtualization, Master Data Mgmt. DATA GOVERNANCE. CONSUMERS.
GROWTH OF THE INTERNET OF THINGS TRENDS TODAY
ENGINEERED FOR FAST AND ADAPTIVE ACTION: Event Stream Processing Model. Low-latency assessment of high-volume, high-velocity data streams to detect, filter, aggregate & analyze. Diagram labels: Publish, Subscribe, Streaming Events, Event Actions, Continuous Query, SAS In-Memory, SAS-generated Insights, Enrichment Data, Analytic Models, Business Rules. Copyright 2015, SAS Institute Inc. All rights reserved.
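The detect, filter, aggregate pattern of a continuous query can be sketched with a Python generator over an event stream. This is not the SAS Event Stream Processing API; the event fields, the temperature rule, and the window size are all assumptions chosen for illustration.

```python
from collections import deque


def continuous_query(events, threshold, window_size):
    """Yield an alert whenever the rolling mean of temperature readings exceeds the threshold."""
    window = deque(maxlen=window_size)
    for event in events:
        # Filter: ignore events that are not relevant to this query.
        if event["type"] != "temperature":
            continue
        # Aggregate: keep only the most recent readings.
        window.append(event["value"])
        average = sum(window) / len(window)
        # Detect: act only once the window is full and the rule fires.
        if len(window) == window_size and average > threshold:
            yield {"action": "alert", "sensor": event["sensor"], "average": average}


stream = [
    {"type": "temperature", "sensor": "s1", "value": 70},
    {"type": "temperature", "sensor": "s1", "value": 72},
    {"type": "humidity", "sensor": "s1", "value": 40},
    {"type": "temperature", "sensor": "s1", "value": 90},
    {"type": "temperature", "sensor": "s1", "value": 95},
]
for alert in continuous_query(stream, threshold=80, window_size=2):
    print(alert)
```

Because the query is a generator, each event is processed as it arrives and nothing beyond the window is held in memory.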
STREAMING DATA. TAKE REAL-TIME ACTION: Detect and monitor events of interest and trigger appropriate real-time actions & alerts. APPLY MULTI-PHASE ANALYTICS: Apply multi-phase analytics to determine events that can benefit from deeper and more complex analysis. FOCUS ON RELEVANT DATA: Continuous loading of relevant streaming data for in-depth analytics.
HADOOP TRENDS: WHY HADOOP? 1. Store data for less. 2. Process data more quickly (for less $).
HADOOP TRENDS: ROLES IT'S PLAYING. Stage structured data. Process structured data. Archive any data. Process any data. Access any data (via data warehouse). Access any data (via Hadoop).
TERMINOLOGY: TRADITIONAL. Primary Key, RDBMS, Relationship, Index, Normalize, Database, Constraint, Table, Foreign Key, SQL, Schema.
TERMINOLOGY: HADOOP. Hadoop Cluster, NameNode, Pig, Hive, DataNode, HDFS, YARN, Block, Cloudera, JobTracker, MapReduce.
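TERMINOLOGY: IT'S ALL GREEK TO ME (MOST)! Παραδεισένιο νησί (paradise island). Αρχαίοι ναοί (ancient temples). Είναι όλα τα ελληνικά μου (it's all Greek to me). Σαλάτα (salad). Ο Θεός της βροντής (the god of thunder). Γιαούρτι (yogurt). Ολυμπιακοί Αγώνες (Olympic Games). Ελληνορωμαϊκή (Greco-Roman). Όμορφη αρχιτεκτονική (beautiful architecture). Μεγάλοι της λογοτεχνίας και της φιλοσοφίας (greats of literature and philosophy). Τραγωδία (tragedy). Μεσογείου (of the Mediterranean).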
SAS & INTEL STUDY: HADOOP ADOPTION & CHALLENGES. Results & Key Findings. Research summary: SAS and Intel asked more than 300 IT managers from the largest companies in Denmark, Finland, Norway and Sweden about the adoption of Big Data analytics and Hadoop. http://nordichadoopsurvey.com 60% cited advanced analytics, data discovery, or use as an analytical lab. Primary reason for considering Hadoop: 22% would like to speed up processing. Adoption / Obstacles: 35% cited resources and competencies.
HADOOP: BIG DATA CHALLENGES
CHALLENGES: HADOOP SKILLS SHORTAGE. CURRENT USER TOOLS ARE NOT BIG DATA ENABLED.
1) Performing even the simplest tasks in Hadoop typically requires mastering disparate tools and writing hundreds of lines of code: MapReduce, Pig Latin, HiveQL, HDFS, Sqoop and Oozie.
2) User tools are not engineered to process data inside Hadoop.
   - Tools are not optimized for Hadoop.
   - Users move data out of Hadoop to do data management and data quality.
   - This requires more processing time.
   - Data is duplicated and more storage is required.
   - Users do not use the Hadoop platform as it was designed.
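To give a feel for the MapReduce model named above, here is a single-process word count in Python that mimics the map, shuffle, and reduce phases. It is only an illustration of the programming model and does not use any Hadoop API; a real Hadoop job would also need job configuration, input and output formats, and cluster deployment, which is where much of the extra code comes from.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop data"]))
```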
SELF-SERVICE DATA PREPARATION FOR HADOOP. Manage data inside Hadoop. Reduce complexity of Hadoop. Accelerate user adoption.
SAS DATA LOADER FOR HADOOP: SELF-SERVICE DATA PREPARATION FOR HADOOP. Reduce complexity of Hadoop. Manage data inside Hadoop. Accelerate user adoption. Capabilities: Query, Join and Filter; Transform and Integrate; Analytics; Profile; Load into Hadoop and memory; Cleanse. Empower Business Users, Unburden IT - Harness the Power of Big Data.
SAS DATA MANAGEMENT: THE DATA MANAGEMENT JOURNEY, GETTING STARTED. What does Data Governance mean to us? How do we implement and sustain a program? How do we even get started? You can get there from here!
REVERSE IT: BE MORE PRODUCTIVE. 20% / 80%.
THANK YOU VERY MUCH! www.sas.com