Big Data. Dr. Douglas Harris, December 12, 2013




Dr. Douglas Harris, December 12, 2013. Gowtham Reddy, Fall 2013

Table of Contents
Computing History
Why Big Data and Why Now?
Information Life-Cycle Management
Goals
Information Management Policies
Governance
Master Data Management
Metadata
Benefits of Information Life-Cycle Management

The biggest phenomenon to capture the attention of the modern computing industry since the Internet is "Big Data". The term was first popularized in a paper on the subject by McKinsey & Co., and the foundational definition was first popularized by Doug Laney (then at META Group, which Gartner later acquired). The fundamental reason "Big Data" is popular today is that the technology platforms that have emerged alongside it can process data of multiple formats and structures without the constraints associated with traditional systems and database platforms. Big data refers to large, complex data sets, specifically those that fit the "3Vs" model: high volume, high velocity, and high variety. For example, big data techniques can be applied to the wealth of information derived from social media, or to the data generated by the billions of mobile phones in daily use.

Computing history:

In the late 1980s, we were introduced to the concepts of decision support and data warehousing. This wave of being able to identify trends, perform historical analysis, and provide predictive analytics with highly scalable metrics created a series of solutions, companies, and an industry in itself. In 1995, with the opening of the Internet to commercial use, we saw the advent of the "dotcom" world and got our first taste of peer-to-peer communication in a consumer context. With this capability also came a significant increase in the volume and variety of data. Over the following five to seven years, web commerce (e-commerce) drove a number of advancements that rapidly changed the business landscape. New models emerged and rapidly became adopted standards, including business-to-consumer direct buying and selling (websites), consumer-to-consumer marketplace trading (eBay and Amazon), and business-to-business-to-consumer selling (Amazon).
This entire flurry of activity drove data volumes up more than ever before. Along with the volume, we began to see additional kinds of data, such as consumer reviews, feedback on experience, peer surveys, and the emergence of word-of-mouth marketing. This newer data brings subtle layers of complexity to data processing and integration. Between 1997 and 2002, we saw the definition and redefinition of mobility solutions. Cellular phones became ubiquitous, and the use of voice and text to share sentiments, opinions, and trends among people became a vibrant trend. This increased people's ability to communicate and to form crowd-based affinities to products and services, which has significantly driven the last decade of technology innovation, leading to even more disruption of the business landscape and of data management in terms of data volume, velocity, variety, complexity, and usage.

The years 2000 to 2010 were a defining period in the history of data: the emergence of search engines (Google, Yahoo), the personalization of music (iPod), tablet computing (iPad), bigger mobile solutions (smartphones, 3G networks, mobile broadband, Wi-Fi), and the emergence of social media (driven by Facebook, MySpace, Twitter, and Blogger). All of these contributed to the consumerization of data, from the perspectives of data creation, acquisition, and consumption. The business models and opportunities that came with the large-scale growth of data drove the need for powerful metrics to tap the knowledge of the crowd driving that growth and, in return, to offer personalized services addressing the need of the moment. The challenge was not limited to technology companies; large multinational organizations like P&G and Unilever wanted solutions that could handle large-scale data processing and also feed its output into their existing analytics platforms. Google, Yahoo, Facebook, and several other companies invested in technology solutions for data management, allowing us to consume large volumes of data in a short amount of time, across many formats, with varying degrees of complexity, to create a powerful decision support platform. These technologies and their implementations are discussed in detail in later chapters of this book.

Why Big Data and Why Now?

These are the two questions on the mind of almost any computing professional: Why Big Data? Why now? The promise of Big Data is the ability to access large volumes of data and gain critical insights by processing repeated or unique patterns of data or behavior. This learning process can be executed as a machine-managed process with minimal human intervention, making the analysis simpler and less error-prone. The answer to the second question, "Why now?",
is the availability of commodity infrastructure combined with new data processing frameworks and platforms such as Hadoop and NoSQL, which deliver significantly lower costs and higher scalability than traditional data management platforms. That scalability and processing architecture were precisely the limitations of traditional data processing technologies, even though the algorithms and methods already existed. The key thing to understand is that the data part of Big Data was always present; it was simply used in a manual fashion, with a great deal of human processing and analytic refinement, before eventually feeding a decision-making process. What has changed, and created the buzz around Big Data, is automated data processing that is extremely fast, scalable, and flexible. While each organization will have its own set of data requirements for Big Data processing, here are some examples:

Weather data: Governmental agencies around the world, scientific organizations, and consumers such as farmers report a large amount of weather data. What we hear on television or radio is an analytic key performance indicator (KPI) of temperature and forecasted conditions based on several factors.
Contract data: There are many types of contracts that an organization executes every year, and there are multiple liabilities associated with each of them.
Labor data: Elastic labor brings a set of problems that organizations need to solve.
Maintenance data: Records from the maintenance of facilities, machines, non-computer-related systems, and more.
Financial reporting data: Corporate performance reports and annual filings to Wall Street.
Compliance data: Financial, healthcare, life sciences, hospital, and many other organizations file compliance data for their corporations.
Clinical trials data: Pharmaceutical companies have long wanted to minimize the processing life cycle for clinical trials data and manage it with rules-based processing; this is an opportunity for Big Data.
Doctors' notes on diagnosis and treatment: Another key area of hidden insight and value, for disease-state management and proactive diagnosis; a key machine learning opportunity.
Contracts: Every organization writes many types of contracts every year, and must process and mine their content, along with metrics to measure the risks and penalties.

Information Life-Cycle Management

Information life-cycle management is the practice of managing the life cycle of data across an enterprise, from its creation or acquisition to its archival. The concept has existed as "records management" since the early days of computing, but managing records meant archival and deletion, with extremely limited capability to reuse the data when needed later.
Today, with advances in technology and the commoditization of infrastructure, managing data is no longer confined to periods of time; it is treated as a data management exercise. Why manage data? The answer lies in the fact that data is a corporate asset and needs to be treated as such. To manage this asset, you need to understand the needs of the enterprise with regard to the data life cycle, data security, compliance requirements, regulatory requirements, auditability and traceability, storage and management, metadata and master data requirements, and data stewardship and ownership. This understanding will help you design and implement a robust data governance and management strategy.
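One of the dimensions listed above, the data life cycle, can be made concrete with a small sketch. This is a hypothetical example, not a prescribed implementation: the stage names and retention thresholds are invented for illustration, and real values would come from the enterprise's compliance and regulatory requirements.

```python
from datetime import date, timedelta

# Hypothetical thresholds; in practice these are set by retention policy.
ONLINE_DAYS = 365        # keep in the online database for one year
ARCHIVE_DAYS = 7 * 365   # keep in archive for seven years, then purge

def lifecycle_stage(created: date, today: date) -> str:
    """Classify a record into a life-cycle stage purely by age."""
    age = today - created
    if age <= timedelta(days=ONLINE_DAYS):
        return "online"
    if age <= timedelta(days=ARCHIVE_DAYS):
        return "archive"
    return "purge"
```

A rule this simple is rarely sufficient on its own; legal holds, compliance audits, and business value typically override age, which is exactly why the policy dimensions above must be understood before such rules are written.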

Information life-cycle management forms one of the foundational pillars of data management within an enterprise. It is the platform on which the three pillars of data management are designed: the first pillar represents process, the second the people, and the third the technology.

Goals

Establish data management as an enterprise function.
Improve the operational efficiency of systems and processes.
Reduce total cost of ownership by streamlining the use of hardware and resources.
Gain productivity by reducing errors and automating data management and the data life cycle.
Implement an auditable system.
Reduce the risk of system failure.
Provide business continuity.
Maintain flexibility to add new requirements.

Information life-cycle management consists of the subcomponents shown in the figure below (figure: information life-cycle management components as applied to the enterprise).

Information Management Policies

The policies that define the business rules for the data life cycle, covering acquisition, cleansing, transformation, retention, and security, are called information management policies. Data acquisition policies are defined for the applications where data is acquired:
Applications where data entry functions are performed

Web and OLTP applications
Data warehouse or data mart ETL or CDC processes
Analytical database ETL processes

Data transformation policies are the business rules for transforming data from source to destination, including transformations of granularity levels, keys, hierarchies, metrics, and aggregations. Data quality policies are defined as part of the data transformation processes.

Data retention: Traditionally, data retention policies have aimed to manage database volumes across the enterprise's systems efficiently, using business rules and processes to relocate data from online storage in the database to offline storage in files. The offline data can be stored at remote secure sites. A retention policy also needs to consider requirements for data that must support legal case management, compliance auditing, and electronic discovery. With Big Data and distributed storage on commodity hardware, "offline storage" is now more of a label: all data is considered active and accessible all the time, and the goals of data retention shift to managing the compression and storage of data across the disk architecture. The challenge for the new generation will be finding the most efficient techniques for this kind of data management.

Data security policies cover securing data from an encryption and obfuscation perspective, as well as from a user-access perspective.

Governance

Information governance and program governance are two important aspects of managing information within an enterprise. Information governance deals with setting up governance models for data within the enterprise; program governance deals with implementing the policies and processes set forth by information governance. Both tasks are fairly people-specific, as they involve both business users and technology teams. A governance process is a multi-tiered organization of people who play different roles in managing information.
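Returning briefly to the security policies above: the obfuscation requirement can be illustrated with a small, hypothetical masking routine (the function name and the eight-character hash length are invented for this sketch). Encryption, by contrast, should always use a vetted library rather than anything hand-rolled.

```python
import hashlib

def mask_email(email: str) -> str:
    """Obfuscate an e-mail address for non-production use:
    keep the domain (still useful for analysis) but replace the
    local part with a short, stable hash so that the same input
    always yields the same token and joins across systems survive."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return f"{digest}@{domain}"
```

Because the masking is deterministic, two systems masking the same address independently produce the same token, which preserves referential integrity without exposing the identity.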
The hierarchy of the different bodies of the governance program is shown in the figure, and their roles and responsibilities are outlined in the following subsections.

Data governance teams:

Executive Governance Board. Consists of stakeholders from the executive teams or their direct reports. Responsible for overall direction and funding.

Program Governance Council. Consists of program owners who are director-level members of the executive organization. A small organization can have multiple representatives on one team, while a large organization can have multiple smaller teams that fold into one large charter team. Responsible for the overall direction of the program, management of the business and IT team owners, coordination of activities, management of the budget, and prioritization of tasks and programs.

Business Owners. Represent the business teams on the council. These are program heads within a business unit (marketing, finance, sales, etc.). Responsible for leading the business unit's program initiative and its implementation as stakeholders.

Business Teams. Consist of members of a particular business unit, for example marketing, market research, or sales. Responsible for implementing the program and data governance policies in their projects, reporting issues and setbacks to the council, and working with the council on resolution strategies.

IT Owners. Consist of IT project managers assigned to lead the implementation and support for a specific business unit. Responsible for leading the IT teams working on the initiative, project delivery, issue resolution, and conflict management, and for working with the council to solve any issue that could affect a wider audience.

IT Teams. Consist of members of IT teams assigned to work with a particular business team to implement the technology layers and support the program. Responsible for implementing the program and data governance technologies and frameworks in their assigned projects, reporting issues and setbacks to the council, and working with the council on resolution strategies.

Data Governance Council. Consists of business and IT stakeholders from each unit in the enterprise. The members are subject-matter experts who own the data for their business unit and are responsible for making the appropriate decisions for integrating that data into the enterprise architecture while maintaining their unit's specific requirements within the same framework. Responsible for:
Data definitions
Data-quality rules
Metadata
Data access policy
Encryption requirements
Obfuscation requirements
Master data management policies
Issue and conflict resolution
Data retention policies

Master Data Management

Is implemented as a standalone program.
Is implemented in multiple cycles for customers and products.
Is implemented for location, organization, and other smaller data sets as an add-on by the implementing organization.
Is measured as the percentage of changes processed from source systems in each execution.

Is operationalized as business rules for key management across operational, transactional, warehouse, and analytical data.

Metadata

Is implemented as a data definition process by business users.
Has business-oriented definitions of data for each business unit; one central definition is regarded as the enterprise metadata view of the data.
Has IT definitions for metadata related to data structures, data management programs, and semantic layers within the database.
Has definitions for the semantic layers implemented for business intelligence and analytical applications.

All the technologies used in the processes described above have a database, a user interface for managing data, rules, and definitions, and reports on the processing of each component and its associated metrics. There are many books and conferences on the subjects of data governance and program governance; we recommend that readers peruse the available material for continued reading on implementing governance for a traditional data warehouse.

Benefits of Information Life-Cycle Management

Increases process efficiency.
Helps enterprises optimize data quality.
Accelerates ROI.
Helps reduce the total cost of ownership of data and infrastructure investments.

Data management strategies help manage data and holistically improve all the related processes, providing:
1. Predictable system availability
2. Optimized system performance
3. Improved reusability of resources
4. Improved management of metadata and master data
5. Improved systems life-cycle management
6. Streamlined operations management of the data life cycle
7. Support for legal and compliance requirements
8. Metadata life-cycle management
9. Master data management
10. Optimized spending and costs
11. Reduced data-related risks
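The master data management metric mentioned earlier, the percentage of changes processed in each execution from source systems, reduces to a simple calculation. This sketch assumes hypothetical counters (changes received and changes successfully processed) supplied by the load process.

```python
def change_processing_rate(changes_received: int, changes_processed: int) -> float:
    """Percentage of source-system changes successfully processed
    in one MDM execution. Reports 100.0 when no changes arrived,
    so an idle run does not register as a failure."""
    if changes_received == 0:
        return 100.0
    return 100.0 * changes_processed / changes_received
```

Tracked run over run, a falling rate is an early signal that source feeds, matching rules, or survivorship logic need attention before the master data drifts.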