ACHIEVING BUSINESS VALUE WITH BIG DATA. By W H Inmon. copyright 2014 Forest Rim Technology, all rights reserved

First there were Hollerith punched cards. Then there were magnetic tape files. Then there was disk storage, followed by parallel disk storage. With each advance in storage technology came new capabilities - speed, capacity, direct access of data - and a drop in the unit cost of storage. Now there is Big Data, where we can store even greater volumes of data at even lower cost per unit of storage. Big Data has opened the door to the collection of volumes of data and types of data that previously have gone unnoticed in the corporation. Now organizations can afford to capture and store data like never before.

But with the advent of Big Data comes a related issue - how do we get business value out of the data collected into Big Data? Fig 1 shows that ultimately business value needs to be derived from Big Data in order for Big Data to be adopted and used by organizations.

Fig 1: Big Data -> Business value

Big Data has several characteristics that mark it as different from other types of data. Some of those characteristics are:

- The unit cost of storage of Big Data is low (as compared to other forms of storage)
- Big Data is collected and stored in an unstructured fashion
- Big Data is accessed and managed in an indirect manner that does not lend itself to high performance transaction processing
- The volumes of data stored in Big Data are orders of magnitude greater than what was possible with other technologies

There are undoubtedly other characteristics of Big Data that set it apart from earlier storage technologies. So what are organizations' technicians and business managers focusing on today as Big Data becomes an accepted part of the technical landscape? Fig 2 shows some of the aspects of Big Data that people are focusing on.

Fig 2: Hive, Hadoop, Pig, MapReduce, Cirro, HBase

In Fig 2 are found such topics as Mongo, Cirro, Pig, Hadoop, Hive, HBase, MapReduce and other aspects and features of Big Data. These and other technological aspects are part of the technological landscape of Big Data, and each of these topics is of interest and deserving of study. But these are technologies, and as interesting as they are, they do not directly address the topic of deriving business value out of Big Data.

Another way to understand the phenomenon of Big Data and deriving business value is shown by Fig 3.

Fig 3: Big Data -> Unstructured data -> Business value

Fig 3 makes the point that in order to get business value out of Big Data you have to go through another, entirely different technological barrier, and that barrier is making sense of the unstructured data - raw text - that comes with Big Data. The technology of Big Data is one challenge to be conquered, but once you have met the Big Data challenge, you are automatically confronted with the challenge of making sense of unstructured data, because all data in Big Data is unstructured. It is noted that languages that access Big Data - NoSQL, for example - certainly have a place, but any access language, by itself, does not address the fundamental challenge of making sense of unstructured data. Instead, unstructured data needs to pass through a separate transformation process before the unstructured data found in Big Data can be used to achieve business value. Stated differently, in order to achieve business value you have to address two very different technological challenges - the technology of Big Data and the transformations that come with unstructured data. In a way, Big Data merely introduces unstructured text. Once introduced, unstructured text must then be transformed in order to achieve business value.

TRANSFORMATIONAL CHALLENGES

In order to illustrate the challenges of transformation of unstructured data, consider several important issues that arise in trying to make sense of unstructured data. Some of those many issues of transformation are:

- The need for context
- The need for interpretation
- The need for simple editing
- The need for certain forms of standardization
- The need for complex editing
- The need for acronym resolution, and so forth

This list merely scratches the surface of the many and diverse transformations that need to be done to unstructured data in order to unlock its business value. For further explanation of many of these transforms refer to the web site www.forestrimtech.com. However, this list is a good starting point to explain why raw text cannot be used to achieve business value by itself.

CONTEXT

The first and most obvious shortcoming of raw text is that it needs context in order to be used to achieve business value. The raw text found in Big Data has either no context or very limited context. Contrast the raw text found in Big Data with the classical records of data typically found in a dbms. In a classical dbms there are records. A record is made up of keys and attributes. The attributes found in the dbms provide context to the data contained in the dbms.

As a simple example, suppose that in Big Data we find the string of text "...Joe Foster...". Fig 4 shows the unstructured string of text that contains the name Joe Foster.

Fig 4: "...Joe Foster..."

What is the meaning of Joe Foster in Big Data? Is Joe a customer? A policeman? A preacher? A race car driver? We just don't know much about Joe Foster when Joe is found in a raw unstructured string of data. But suppose we have a dbms record, and in the dbms record data is recorded as:

Customer number - 12998177
Customer name - Joe Foster
Customer address - 635 Adrian Lane, Tuscaloosa, Alabama
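To make the difference concrete, the short Python sketch below uses hypothetical data (not any particular product or schema). Against the raw string the only question available is whether the text occurs; against the attributed record, the attributes supply the context needed to ask something more specific.

    # A minimal sketch (hypothetical data) contrasting what can be asked of raw text
    # versus a contextualized, dbms-style record.

    raw_big_data = "...call from Joe Foster yesterday about the missing invoice..."

    # Against raw text, the only query available is a simple existence check.
    print("Joe Foster" in raw_big_data)   # True - but we learn nothing else about Joe

    # A record with attributes supplies context, so more specific questions can be asked.
    customer_record = {
        "customer_number": "12998177",
        "customer_name": "Joe Foster",
        "customer_address": "635 Adrian Lane, Tuscaloosa, Alabama",
    }

    # A more sophisticated question: is Joe Foster one of our customers in Alabama?
    is_alabama_customer = (
        customer_record["customer_name"] == "Joe Foster"
        and "Alabama" in customer_record["customer_address"]
    )
    print(is_alabama_customer)            # True - the attributes make the answer meaningful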

Once we find the record for Joe Foster in a dbms, the attribute metadata tells us a lot about Joe Foster. In a dbms there is context about Joe that allows us to use Joe accurately and succinctly in making business decisions in the computer. So one of the fundamental differences between unstructured strings of text found in Big Data and the records of text found in a dbms is the lack of context that exists in a string of unstructured data.

Now consider the types of query that can be done on a string of unstructured text. In a query against the unstructured string we can ask - does the string "Joe Foster" exist? Fig 5 shows we can ask if Joe Foster is found in Big Data.

Fig 5: Simple query - "Can you find Joe Foster?" against "...Joe Foster..." "...Joe Foster..."

The query found in Fig 5 is a simple query. We can look through huge amounts of data in Big Data and find out whether Joe Foster is found there. But because the data is all unstructured text we can't find out much else. We can't find out if Joe is a policeman, a soldier, a customer or a newly born baby. Because there is no (or very little) context found in unstructured text, the types of query that we can do against unstructured strings of data are very limited. And where the type of query is limited, the business value is limited. In order to do a more sophisticated query we need context to be able to understand the data that we operate on. The need for context in the process of query can be stated very simply, as seen in Fig 6.

Fig 6: simple query - raw text; sophisticated query - raw text + context

Indeed, trying to make business decisions on text that does not have context can be anything but positive. As a simple example of the importance of context, consider two men standing on a street corner as a young lady passes by. One man says to the other, "She's hot." What is meant by the expression "she's hot"? Is the young lady attractive, and one of the men wishes he could have a date with her? Does "she's hot" mean that the lady is attractive? Or are the men on a street corner in Houston, Texas in July, where the temperature is 99 degrees and the humidity is 100%, and the lady is sweating from every pore in her body and is physically very hot? Or has the young lady just been given a traffic ticket and she is mad? Does "she's hot" mean that the lady is irritated? Or are there even more interpretations of what is meant by "she's hot"?

Unless we know the context of what is being said, we can make a very wrong and embarrassing business decision. Not understanding context leads to incorrect interpretation, and incorrect interpretation leads to wrong assumptions. You cannot possibly make good business decisions when you do not understand the interpretation and underlying assumptions of raw text. Furthermore, the need to understand the context of information is hardly limited to the words "she's hot". ALL words - ALL raw text - are open to interpretation and need context in order to be understood properly. And Big Data is nothing but raw text.

The problem is that when raw text is found in the unstructured strings in Big Data, there is little or no context that comes with the raw text. The lack of context of raw text GREATLY limits the sophistication of the queries that can be done against unstructured data. Only the most basic of queries can be done as long as the data being queried has no context. And there is little business value in being able to make only unsophisticated queries. In order to achieve business value you need to be able to make sophisticated queries.

INTERPRETATION AND COMPONENT CONTEXTUALIZATION

As important as context is in making sophisticated queries of data, context is not the only element that needs to be considered. Another, equally important element of making sophisticated queries is the ability to interpret raw text. Suppose that the following string of data is found in raw text:

20121215-f450-dnv-chi lybag cno po-009-12-ag 12356

This kind of string might be found in a log record. What kind of query can be done against this text? A simple query can be done to find whether the string "f450" is present. And indeed Big Data can conduct such a simple search. But what business value is there in finding the value "f450"? The answer is - not much business value at all.

In order to unlock the business value of the raw text, two activities need to be performed against the string "20121215-f450-dnv-chi lybag cno po-009-12-ag 12356". The first activity is that the raw string must be interpreted. Then, after it is interpreted, it needs to be component contextualized. As an example of interpretation, the string can be interpreted to mean:

20121215-f450-dnv-chi lybag cno po-009-12-ag 12356 == December 15, 2012, flight 450, Frontier Airlines, Denver to Chicago, domestic flight, lost yellow bag, claim number po-009-12, agent Ann Maeda

Then, after interpretation of the raw text is done, component contextualization parses the string and creates a context identification that might look like:

date - December 15, 2012
airline - Frontier
flight number - 450
flight type - domestic
action item - lost item
item type - bag
bag description - yellow
flight source - Denver
flight destination - Chicago
claim number - po-009-12
agent number - 12356
agent name - Ann Maeda

Once the component contextualization is done and is described to the system holding the data, all sorts of analytical processing can be done against the raw string of data, as shown in the sketch below - queries such as:

- How many bags are lost on domestic flights?
- How many incidents occur in the month of December?
- How many times has agent Ann Maeda had to track lost bags? Lost yellow bags?
- Of bags that are lost, how many are yellow?

And so forth.
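As a rough illustration of what interpretation and component contextualization involve, the Python sketch below assumes a fixed layout for this one kind of log record and uses small, hypothetical lookup tables (AIRPORTS, AGENTS, INCIDENT_CODES) standing in for the reference data a real transformation product such as Textual ETL would supply; the carrier and the flight type are simply assumed attributes of the feed, not parsed from the string.

    import re

    # Hypothetical reference data that interpretation would normally draw on.
    AIRPORTS = {"dnv": "Denver", "chi": "Chicago"}
    AGENTS = {"12356": "Ann Maeda"}
    INCIDENT_CODES = {"lybag": ("lost item", "bag", "yellow")}

    def contextualize(raw):
        """Parse one baggage-log string into named components (assumed fixed layout)."""
        # e.g. "20121215-f450-dnv-chi lybag cno po-009-12-ag 12356"
        m = re.match(
            r"(?P<date>\d{8})-f(?P<flight>\d+)-(?P<src>\w+)-(?P<dst>\w+)\s+"
            r"(?P<incident>\w+)\s+cno\s+(?P<claim>[\w-]+?)-ag\s+(?P<agent>\d+)",
            raw,
        )
        if m is None:
            raise ValueError("unrecognized log layout")
        action, item, colour = INCIDENT_CODES[m["incident"]]
        return {
            "date": m["date"],
            "airline": "Frontier",           # assumed: carrier implied by the feed
            "flight_number": m["flight"],
            "flight_type": "domestic",       # assumed: flight type implied by the feed
            "action_item": action,
            "item_type": item,
            "item_description": colour,
            "flight_source": AIRPORTS[m["src"]],
            "flight_destination": AIRPORTS[m["dst"]],
            "claim_number": m["claim"],
            "agent_number": m["agent"],
            "agent_name": AGENTS[m["agent"]],
        }

    records = [contextualize("20121215-f450-dnv-chi lybag cno po-009-12-ag 12356")]

    # Once contextualized, questions like those above become simple filters.
    lost_yellow_by_maeda = [
        r for r in records
        if r["agent_name"] == "Ann Maeda"
        and r["action_item"] == "lost item"
        and r["item_description"] == "yellow"
    ]
    print(len(lost_yellow_by_maeda))   # 1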

Once interpretation and component contextualization are done, then sophisticated queries can be done. But until the raw text is properly interpreted, no sophisticated analytical processing can occur. And as long as no sophisticated analysis can occur, only limited business value can be attained. Once again, there is very little business value in making queries that are simple. In order to achieve business value, you need to be able to create sophisticated queries. And in order to create sophisticated queries, you need to transform the raw text.

SIMPLE STANDARDIZATION

Contextualization and interpretation are absolutely essential in order to take raw unstructured text (like that found in Big Data) and turn it into a form that can be analyzed in a sophisticated manner. But these are hardly the only activities that need to be done to the raw text found in unstructured data (i.e., Big Data) to transform it into a form that yields business value. Another simple but important activity that must be done in the transformation of unstructured textual information is the standardization of certain types of data.

As an example of the need for standardization, date values need to be standardized. Suppose that the following unstructured strings of data are found in Big Data: "...March 13, 1999...", "...2012/27/01..." and "...12/31/2010...". To the reader's eye it is obvious that these unstructured strings of text contain date values. But can the computer do comparison processing against these dates? Indeed, does the computer even understand that these are date values? The problem is that these dates are in different formats. Trying to get the computer to do a comparison against these unstructured forms of date will yield unpredictable and unuseful results. To the computer these strings just look like raw text and are not particularly meaningful. In order to be useful, the unstructured forms of date must be converted into a standardized format. In this case the following transformations of date must be done:

March 13, 1999 == date 19990313
2012/27/01 == date 20120127
12/31/2010 == date 20101231

Once the raw text is recognized as dates, once the dates have their values recognized, and once the dates are converted into a common format, then the dates can be meaningfully compared to each other. But until a common standardized format is achieved, no meaningful comparison of dates can be done. And unless a comparison of dates can be achieved, limited business value can be derived from the raw unstructured text.
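As an illustration of the kind of work this standardization involves, the short Python sketch below tries a small list of candidate date layouts and emits each date in a common yyyymmdd form. The list of formats is hypothetical and far shorter than a real transform would carry, and month-name parsing assumes an English-language locale.

    from datetime import datetime

    # Candidate layouts this sketch recognizes; a real transform would carry
    # a much longer (and locale-aware) list.
    CANDIDATE_FORMATS = [
        "%B %d, %Y",   # March 13, 1999
        "%Y/%d/%m",    # 2012/27/01 (year/day/month, as read in the example above)
        "%m/%d/%Y",    # 12/31/2010
    ]

    def standardize_date(text):
        """Return the date as yyyymmdd, or raise if no known layout matches."""
        for fmt in CANDIDATE_FORMATS:
            try:
                return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
            except ValueError:
                continue
        raise ValueError("unrecognized date layout: " + repr(text))

    for raw in ["March 13, 1999", "2012/27/01", "12/31/2010"]:
        print(raw, "==", standardize_date(raw))
    # March 13, 1999 == 19990313
    # 2012/27/01 == 20120127
    # 12/31/2010 == 20101231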

Once again, until the raw text passes through a fundamental transformation, the queries that are done against it are not very sophisticated. And as long as queries are not sophisticated, there is limited business value. Until queries are sophisticated, business value cannot be achieved.

OTHER TRANSFORMS

Certainly contextualization, interpretation, component contextualization and standardization are absolutely necessary in order to turn raw text into a form that can be used for sophisticated analytical processing. And in order to achieve business value it is necessary to do sophisticated analysis. But these transforms are hardly the only type of transforms that are needed in order to turn unstructured text into a form of data that can be analyzed in a sophisticated manner. Indeed, as important as these transforms are, they barely scratch the surface of what is needed in order to turn raw text into useful information from which business value can be derived. To see a discussion of more transforms, refer to Textual ETL as described on the web site www.forestrimtech.com. With a complete and mature set of transforms, raw text can be turned into a form of text that can yield business value, as depicted by Fig 7.

Fig 7: Big Data -> textual transformation -> Business value

AN ARCHITECTURAL RENDERING

From an architectural perspective, how then does the business derive value from Big Data? Fig 8 shows that raw text is placed into Big Data as the first step.

Fig 8: raw text -> Big Data

After raw text is placed into Big Data, one possibility is to use textual ETL to transform the raw text into a transformed state, which is then placed into a standard dbms. Fig 9 shows this possibility.

Fig 9: raw text -> Big Data -> Textual ETL -> Standard dbms

The raw text is transformed into a form that is able to support sophisticated queries and analytical processing. The transformation is done by textual ETL. However, the output from textual ETL does not have to go into a standard dbms. An alternative is to send the output - i.e., the transformed text - back into Big Data. By sending the output of textual ETL back into Big Data, whole new query possibilities are opened up. Now it is possible to use NoSQL for sophisticated query of data. Fig 10 shows the possibilities for different query types when textual ETL is used and the output from Textual ETL is placed in either a standard dbms or Big Data.

Fig 10: The Inmon/Krishnan Big Data Architecture (copyright Forest Rim Technology, 2012, all rights reserved) - raw text in Big Data supports simple query; Textual ETL feeds a standard dbms (sophisticated query -> business value) or places transformed text back into Big Data (sophisticated query -> business value)

In Fig 10, simple queries can be made on raw text found in Big Data, or sophisticated queries can be made from a standard dbms or from transformed text found in Big Data that has arrived via textual ETL. Once the queries are made against transformed text, those queries can be very sophisticated. And once sophisticated queries can be run against the data that originates from Big Data, business value is easily achievable.
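To make the flow of Fig 9 and Fig 10 concrete, the Python sketch below walks two hypothetical log strings through the architecture: raw text sits in a Big Data store (a plain list here), a toy transform stands in for Textual ETL, and the transformed rows land in a standard dbms (sqlite3 used purely as a stand-in), where a sophisticated query becomes ordinary SQL. The parsing and the data are illustrative only and are not the actual Textual ETL product.

    import sqlite3

    # Raw log strings as they might sit in Big Data (hypothetical examples).
    big_data = [
        "20121215-f450-dnv-chi lybag cno po-009-12-ag 12356",
        "20121218-f221-dnv-chi lybag cno po-014-12-ag 12356",
    ]

    def toy_textual_etl(raw):
        """Reduce one log string to (date, flight, source, destination, claim, agent)."""
        head, _incident, _cno, claim_ag, agent = raw.split()
        date, flight, src, dst = head.split("-")
        claim = claim_ag.rsplit("-ag", 1)[0]
        return (date, flight, src, dst, claim, agent)

    # The "standard dbms" side of Fig 9 - an in-memory sqlite3 database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lost_bags (dte TEXT, flight TEXT, src TEXT, dst TEXT, claim TEXT, agent TEXT)"
    )
    conn.executemany(
        "INSERT INTO lost_bags VALUES (?, ?, ?, ?, ?, ?)",
        [toy_textual_etl(r) for r in big_data],
    )

    # A sophisticated query: lost-bag incidents per agent in December 2012.
    for row in conn.execute(
        "SELECT agent, COUNT(*) FROM lost_bags WHERE dte LIKE '201212%' GROUP BY agent"
    ):
        print(row)   # ('12356', 2)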

INTEGRATING BIG DATA AND CORPORATE DATA

Another way to look at the value of Textual ETL is to consider the question - how is it possible to combine the data found in Big Data with the existing organizational environment? With current technology there are analytical tools for Big Data and there are analytical tools for the existing corporate environment. However, for all practical purposes the two environments are about as different as night and day. With textual ETL there is a way to integrate and incorporate the data.

Fig 11: Big Data <-> Textual ETL <-> Corporate systems

In Fig 11 it is seen that with Textual ETL, data can be exchanged and integrated freely between the Big Data environment and the Corporate Systems environment.

Bill Inmon, the father of data warehousing, works for Forest Rim Technology, a company dedicated to technology for the management and usage of unstructured text. Forest Rim Technology is located in Castle Rock, CO. Forest Rim has a web site, forestrimtech.com.