Dr. Douglas Harris | December 12, 2013 | Gowtham Reddy | Fall 2013
Table of Contents
Computing History
Why Big Data and Why Now?
Information Life-Cycle Management
Goals
Information Management Policies
Governance
Master Data Management
Metadata
Benefits of Information Life-Cycle Management
"Big Data" is the biggest phenomenon to capture the attention of the modern computing industry since the Internet. The term was first popularized in a paper on the subject by McKinsey & Co., and the foundational definition was first popularized by Doug Laney of Gartner. The fundamental reason "Big Data" is popular today is that the technology platforms that have emerged along with it provide the capability to process data of multiple formats and structures without the constraints associated with traditional systems and database platforms. Big data describes large, complex data sets, specifically data that fits the "3Vs model": high volume, high velocity, and high variety. For example, big data can be applied to the wealth of information derived from social media or the data obtained from the billions of mobile phones in use daily.

Computing History

In the late 1980s, we were introduced to the concepts of decision support and data warehousing. This wave of being able to identify trends, perform historical analysis, and provide predictive analytics and highly scalable metrics created a series of solutions, companies, and an industry in itself. In 1995, with the clearance to create a commercial Internet, we saw the advent of the "dotcom" world and got the first taste of peer-to-peer communication in a consumer world. With this capability, we also saw a significant increase in the volume and variety of data. In the following five to seven years, we saw a number of advancements driven by web commerce, or e-commerce, which rapidly changed the business landscape for organizations. New models emerged and became rapidly adopted standards, including business-to-consumer direct buying and selling (websites), consumer-to-consumer marketplace trading (eBay and Amazon), and business-to-business-to-consumer selling (Amazon).
This entire flurry of activity drove up data volumes more than ever before. Along with the volume, we began to see the emergence of additional data, such as consumer reviews, feedback on experience, peer surveys, and word-of-mouth marketing. This newer data brings subtle layers of complexity to data processing and integration. Along the way, between 1997 and 2002, we saw the definition and redefinition of mobility solutions. Cellular phones became ubiquitous, and the use of voice and text to share sentiments, opinions, and trends among people became a vibrant practice. This increased the ability to communicate and to create crowd-based affinity for products and services, which has significantly driven the last decade of technology innovation, leading to even more disruption of the business landscape and of data management in terms of data volume, velocity, variety, complexity, and usage.
The years 2000 to 2010 were a defining period in the history of data: the emergence of search engines (Google, Yahoo), the personalization of music (iPod), tablet computing (iPad), bigger mobile solutions (smartphones, 3G networks, mobile broadband, Wi-Fi), and the emergence of social media (driven by Facebook, MySpace, Twitter, and Blogger). All these entities contributed to the consumerization of data, from the data creation, acquisition, and consumption perspectives. The business models and opportunities that came with the large-scale growth of data drove the need to create powerful metrics that tap the knowledge of the crowd driving them and, in return, offer personalized services to address the need of the moment. This challenge was not limited to technology companies; large multinational organizations like P&G and Unilever wanted solutions that could address data processing, and additionally wanted to feed the output from large-scale data processing into their existing analytics platforms. Google, Yahoo, Facebook, and several other companies invested in technology solutions for data management, allowing us to consume large volumes of data in a short amount of time, across many formats, with varying degrees of complexity, to create a powerful decision support platform. These technologies and their implementations are discussed in detail in later chapters of this book.

Why Big Data and Why Now?

These are the two questions on the mind of any computing professional: Why Big Data? And why now? The promise of Big Data is the ability to access large volumes of data and gain critical insights by processing repeated or unique patterns of data or behavior. This learning process can be executed as a machine-managed process with minimal human intervention, making the analysis simpler and less error-prone. The answer to the second question, "Why now?",
is the availability of commodity infrastructure combined with new data processing frameworks and platforms such as Hadoop and NoSQL, resulting in significantly lower costs and higher scalability than traditional data management platforms. The scalability and processing architectures of the new platforms address limitations of traditional data processing technologies, even though the algorithms and methods had long existed. The key thing to understand is that the data behind Big Data was always present and was used in a manual fashion, with a lot of human processing and analytic refinement, before eventually feeding a decision-making process. What has changed, and created the buzz around Big Data, is automated data processing that is extremely fast, scalable, and flexible. While each organization will have its own set of data requirements for Big Data processing, here are some examples:
- Weather data: governmental agencies around the world, scientific organizations, and consumers such as farmers report a large amount of weather data. What we hear on television or radio is an analytic key performance indicator (KPI) of temperature and forecasted conditions based on several factors.
- Contract data: an organization executes many types of contracts every year, and there are multiple liabilities associated with each of them.
- Labor data: elastic labor brings a set of problems that organizations need to solve.
- Maintenance data: records from the maintenance of facilities, machines, non-computer-related systems, and more.
- Financial reporting data: corporate performance reports and annual filings to Wall Street.
- Compliance data: financial, healthcare, life sciences, hospital, and many other organizations file compliance data for their corporations.
- Clinical trials data: pharmaceutical companies want to minimize the processing life cycle of clinical trials data and manage it with rules-based processing; this is an opportunity for Big Data.
- Doctors' notes on diagnoses and treatments: another key area of hidden insights and value for disease-state management and proactive diagnosis; a key machine learning opportunity.
- Contracts: every organization writes many types of contracts every year and must process and mine their content, along with metrics to measure the risks and penalties.

Information Life-Cycle Management

Information life-cycle management is the practice of managing the life cycle of data across an enterprise, from its creation or acquisition to archival. The concept has existed as "records management" since the early days of computing, but the management of records meant archival and deletion, with extremely limited capability to reuse the same data when needed later.
Today, with the advancement of technology and the commoditization of infrastructure, managing data is no longer confined to fixed periods of time; it is an ongoing data management exercise. Why manage data? The answer lies in the fact that data is a corporate asset and needs to be treated as such. To manage this asset, you need to understand the needs of the enterprise with regard to the data life cycle, data security, compliance and regulatory requirements, auditability and traceability, storage and management, metadata and master data requirements, and data stewardship and ownership. This understanding will help you design and implement a robust data governance and management strategy.
Information life-cycle management forms one of the foundational pillars of data management within an enterprise. It is the platform on which the three pillars of data management are designed: the first pillar represents the process, the second the people, and the third the technology.

Goals

- Establish data management as an enterprise function.
- Improve the operational efficiency of systems and processes.
- Reduce total cost of ownership by streamlining the use of hardware and resources.
- Gain productivity by reducing errors and automating data management and the data life cycle.
- Implement an auditable system.
- Reduce the risk of system failure.
- Provide business continuity.
- Maintain the flexibility to add new requirements.

Information life-cycle management consists of the subcomponents shown in the figure below.

Figure: Information life-cycle management components as applied to the enterprise

Information Management Policies

Information management policies define the business rules for the data life cycle, from acquisition through cleansing, transformation, retention, and security:

Data acquisition policies define where data is acquired:
- Applications where data entry functions are performed
- Web and OLTP applications
- Data warehouse or data mart ETL or CDC processes
- Analytical database ETL processes

Data transformation policies are business rules to transform data from source to destination, including transformations of granularity levels, keys, hierarchies, metrics, and aggregations. Data quality policies are defined as part of the data transformation processes.

Data retention: traditionally, data retention policies have targeted managing database volumes across the systems of the enterprise in an efficient way, by developing business rules and processes to relocate data from online storage in the database to offline storage in files. The offline data can be stored at remote secure sites. The retention policy needs to consider the requirements of data that must support legal case management, compliance auditing, and electronic discovery. With Big Data and distributed storage on commodity hardware, the notion of offline storage is now more of a label: all data is considered active and accessible all the time. The goals of data retention shift to managing the compression and storage of data across the disk architecture; the challenge for the new generation will be finding the most efficient techniques of data management.

Data security policies are oriented toward securing data, both from an encryption and obfuscation perspective and from a user access perspective.

Governance

Information governance and program governance are two important aspects of managing information within an enterprise. Information governance deals with setting up governance models for data within the enterprise, and program governance deals with implementing the policies and processes set forth by information governance. Both tasks are fairly people-specific, as they involve both business users and technology teams. A governance process is a multilayered organization of people who play different roles in managing information.
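The retention and obfuscation policies described in the Information Management Policies section lend themselves to simple, rule-based automation. Below is a minimal sketch in Python of how such rules might be encoded and applied to a record; the seven-year retention window, the SSN field, and the masking format are invented for illustration, not prescribed by any particular standard:

```python
from datetime import date, timedelta

RETENTION_DAYS = 365 * 7  # hypothetical 7-year retention window

def is_retained(record_date, today):
    """Retention policy: keep records inside the retention window online."""
    return (today - record_date) <= timedelta(days=RETENTION_DAYS)

def mask_ssn(ssn):
    """Obfuscation policy: expose only the last four digits."""
    return "***-**-" + ssn[-4:]

# A hypothetical record flowing through the policy checks.
record = {"ssn": "123-45-6789", "created": date(2012, 6, 1)}
print(mask_ssn(record["ssn"]))                              # ***-**-6789
print(is_retained(record["created"], date(2013, 12, 12)))   # True
```

In practice these rules would live in a policy engine or metadata repository rather than in application code, so that the governance council can change them without redeploying systems.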
The hierarchy of the different bodies of the governance program is shown in the figure below, and their roles and responsibilities are outlined in the following subsections.
Figure: Data governance teams

Executive Governance Board
- Consists of stakeholders from the executive teams or their direct reports.
- Responsible for the overall direction and funding of the program.

Program Governance Council
- Consists of program owners who are director-level members of the executive organization. A small organization can have multiple representatives in one team, while a large organization can have multiple smaller teams that fold into a large charter team.
- Responsible for the overall direction of the program, management of the business and IT team owners, coordination of activities, management of the budget, and prioritization of tasks and programs.

Business Owners
- Represent the business teams in the council. These are program heads within the business units (marketing, finance, sales, etc.).
- Responsible for leading the business unit's program initiative and its implementation as stakeholders.

Business Teams
- Consist of the members of a particular business unit, for example, marketing, market research, or sales.
- Responsible for implementing the program and data governance policies in their projects, reporting to the council on issues and setbacks, and working with the council on resolution strategies.
IT Owners
- Consist of the IT project managers assigned to lead the implementation and support for a specific business unit.
- Responsible for leading the IT teams working on the initiative, project delivery, issue resolution, and conflict management, and for working with the council to solve any issue that could impact a wider audience.

IT Teams
- Consist of the members of the IT teams assigned to work with a particular business team to implement the technology layers and support the program.
- Responsible for implementing the program and data governance technologies and frameworks in the assigned projects, reporting to the council on issues and setbacks, and working with the council on resolution strategies.

Data Governance Council
- Consists of business and IT stakeholders from each unit in the enterprise. The members are SMEs who own the data for their business unit and are responsible for making the appropriate decisions to integrate that data into the enterprise architecture while maintaining their specific requirements within the same framework.
- Responsible for:
  - Data definitions
  - Data-quality rules
  - Metadata
  - Data access policies
  - Encryption requirements
  - Obfuscation requirements
  - Master data management policies
  - Issue and conflict resolution
  - Data retention policies

Master Data Management

- Is implemented as a standalone program.
- Is implemented in multiple cycles for customers and products.
- Is implemented for location, organization, and other smaller data sets as an add-on by the implementing organization.
- Is measured as the percentage of changes processed in every execution from the source systems.
- Is operationalized as business rules for key management across operational, transactional, warehouse, and analytical data.

Metadata

- Is implemented as a data definition process by business users.
- Has business-oriented definitions of data for each business unit; one central definition is regarded as the enterprise metadata view of the data.
- Has IT definitions for metadata related to data structures, data management programs, and semantic layers within the database.
- Has definitions for the semantic layers implemented for business intelligence and analytical applications.

All the technologies used in the processes described above have a database, a user interface for managing data, rules, and definitions, and reports on the processing of each component and its associated metrics. There are many books and conferences on the subjects of data governance and program governance; we recommend that readers peruse the available material for continued reading on implementing governance for a traditional data warehouse.

Benefits of Information Life-Cycle Management

- Increases process efficiency.
- Helps enterprises optimize data quality.
- Accelerates ROI.
- Helps reduce the total cost of ownership of data and infrastructure investments.

Data management strategies help manage data and holistically improve all related processes, including:
1. Predictable system availability
2. Optimized system performance
3. Improved reusability of resources
4. Improved management of metadata and master data
5. Improved systems life-cycle management
6. Streamlined operations management of the data life cycle
7. Compliance with legal and regulatory requirements
8. Metadata life-cycle management
9. Master data management
10. Optimized spending and costs
11. Reduced data-related risks
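As a closing illustration, the master data management metric noted earlier, the percentage of changes processed in each execution from the source systems, can be sketched in a few lines of Python; the customer keys, attributes, and comparison rule below are invented for illustration:

```python
def mdm_change_rate(previous, current):
    """Percentage of master records changed between two MDM executions.

    `previous` and `current` map a master key (e.g., a customer ID)
    to that record's attributes; new or modified records count as changes.
    """
    changed = sum(
        1 for key, attrs in current.items()
        if previous.get(key) != attrs
    )
    return 100.0 * changed / len(current)

# Hypothetical customer master data from two consecutive executions.
run_1 = {"C001": {"city": "Dallas"}, "C002": {"city": "Austin"}}
run_2 = {"C001": {"city": "Dallas"}, "C002": {"city": "Plano"},
         "C003": {"city": "Houston"}}  # C002 updated, C003 newly added

print(round(mdm_change_rate(run_1, run_2), 1))  # 66.7 (2 of 3 records)
```

A metric like this, tracked per execution, gives the data governance council a simple signal of how volatile the master data feeds from the source systems are.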