Information Stewardship: Moving From Big Data to Big Value




By John Burke, Principal Research Analyst, Nemertes Research

Executive Summary

Big data stresses tools, networks, and storage infrastructures. The amount of data, its growth rate, its performance requirements, and its need for protection against disaster all create challenges for existing systems. To survive the transition to big data, organizations should practice good information stewardship: define policy to guide every byte of information through acquisition and classification, its lifecycle in storage, its protection and insurance against disaster, and its ultimate disposition at the end of its life. Guided by the principles of good stewardship, and using some combination of new analytics and storage architectures plus a mix of public and private cloud resources, IT can provide an infrastructure that is up to the challenge of big data. IT needs to understand the goals of and inputs to all big data projects in order to make the best storage architecture choices to support them, and it needs to know the full costs of internal compute and storage in order to evaluate possible cloud options properly.

The Issue

About 29% of organizations say they already have a big data initiative; another 6% plan to start one before the end of 2013, and 14% are evaluating whether they have a big data problem and how to deal with it. But what is big data? At its core, it is about deriving new business value from the combination of previously isolated streams of data. (Please see Figure 1.)

Figure 1: Crossing the Streams Drives Big Data

Big data is typified by:
- Massive volumes of data (in relative, not absolute, terms)
- Rapid changes in data
- Multiple sources and formats of data

Big data projects often focus, for example, on combining data stored in traditional databases with unstructured data such as social media posts or video clips, and semi-structured data such as streams of environmental sensor readings. Since big data is, well, big (even small companies with big data initiatives average 2 petabytes of data, and large companies grapple with orders of magnitude more), it can stress both network and storage infrastructures and can create challenges for data management and analytics tools. And, because it brings together diverse types of information, it requires significant investment of staff time and potentially drives organizational change. All these challenges add up to investments in staff, potentially in any or all parts of the data center infrastructure, and even expansion into cloud compute and storage. To survive the transition, organizations should practice good information stewardship.

1 petabyte = 1,024 terabytes; 1 terabyte = 1,024 gigabytes. For comparison, 1 terabyte equals approximately 8 seasons of a 1-hour HD TV show like CSI, while 1 petabyte equals a bit more than 13 years of HD programming.
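The scale comparison above can be checked with back-of-the-envelope arithmetic. The bitrate figure below (roughly 9 GB per hour of HD video) is an assumption chosen to be consistent with the comparison in the text, not a number from the source:

```python
# Rough check on the storage comparisons above.
# Assumption (not from the source): ~9 GB per hour of HD video.
GB_PER_HOUR_HD = 9

TB_IN_GB = 1024          # 1 terabyte = 1,024 gigabytes
PB_IN_GB = 1024 * 1024   # 1 petabyte = 1,024 terabytes = 1,048,576 gigabytes

hours_per_tb = TB_IN_GB / GB_PER_HOUR_HD
hours_per_pb = PB_IN_GB / GB_PER_HOUR_HD
years_per_pb = hours_per_pb / (24 * 365.25)

print(f"1 TB holds ~{hours_per_tb:.0f} hours of HD video")
print(f"1 PB holds ~{hours_per_pb:,.0f} hours, or ~{years_per_pb:.1f} years of continuous HD programming")
```

At that assumed bitrate, a petabyte works out to roughly 13.3 years of around-the-clock HD programming, matching the "a bit more than 13 years" figure above.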

Information Stewardship

Information Stewardship (IS) is driven by the principle that for every byte of information entering the organization there be policy in place defining how that information is to be stored, managed, and protected throughout its life. IS therefore comprises several disciplines:
- Data quality management
- Information lifecycle management
- Information protection
- Disaster resilience
- Compliance management

Applying IS principles to big data is a path to sustainable success and the chance to wring big insights out of the data. Whether the goal of analysis is to improve productivity or decrease costs, raise ad revenues or boost sales through customer loyalty programs, improve patient health or raise graduation rates, detect financial fraud or discern terrorist plots, underneath big data analytics lies the data. Long-term success with big data analytics rests on proper management of that data.

A Lifecycle for Big Data

IS tells us to pay attention to the whole lifecycle of the data, starting with acquisition and its attendant issues of data quality management, and ending with the question of whether, how, and for how long to retain data once the initial burst of use is complete.

Figure 2: The Big Data Lifecycle

Acquisition and Classification

New streams of data bring new concerns about making sure things mean what you think they mean (the IS discipline of data quality management), as well as about data security and privacy (IS's information protection discipline) and compliance (another IS discipline). Many IT departments are used to thinking of critical data as living primarily inside databases, where data gets in through approved and controlled channels, generally with strict protections to guard consistency. When big data ramps up, data acquisition expands to encompass other, less-controlled sources and new formats (such as pictures, sound, or video), and those assumptions are subverted. If two payroll records in a database refer to Jane Doe, it is usually straightforward to determine whether they mean the same Jane Doe (do they point back to the same personnel record, for example?). When a personnel record and a video clip of a customer complaining both refer to John Doe, do they mean the same one? There is no way to prevent or resolve every ambiguity when disparate kinds of information are brought together, so it is critical to understand the limits of disambiguation in these cases. IS says it is also crucial to have defined ways of assessing whether data in different streams are consistent with each other, and whether inconsistencies will be a roadblock to use or simply a cautionary note.

Understanding what the data is also drives classification of the data for purposes of managing it throughout its lifecycle. Classification serves to:
- Drive how data are secured, in order to achieve and maintain compliance with various laws, rules, and industry standards.
- Guide how data are stored, in order to provide the necessary performance.
- Determine how disaster recovery procedures will provide for continuity of access or restoration of the data in the event of a major service disruption (the IS discipline of disaster resilience).
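The three roles of classification listed above can be sketched as a simple lookup: a classification tag drives the security, storage-tier, and disaster-recovery handling of the data it labels. The tag names and tier values below are illustrative assumptions, not a scheme from the source:

```python
# Illustrative sketch: one classification tag drives security,
# storage, and disaster-recovery handling. All names are assumed examples.
from dataclasses import dataclass

@dataclass
class HandlingPolicy:
    encrypt_at_rest: bool   # security/compliance requirement
    storage_tier: str       # performance requirement
    dr_strategy: str        # disaster-resilience requirement

# One policy record per classification level.
POLICIES = {
    "regulated": HandlingPolicy(True,  "high-performance", "sync-replicate"),
    "business":  HandlingPolicy(True,  "standard",         "async-replicate"),
    "low-value": HandlingPolicy(False, "archive",          "backup-only"),
}

def policy_for(classification: str) -> HandlingPolicy:
    """Look up the handling rules attached to a classification tag."""
    return POLICIES[classification]

print(policy_for("regulated").dr_strategy)  # -> sync-replicate
```

The point of the structure is that classification happens once, at acquisition, and every downstream decision (encryption, tiering, DR) is derived from it rather than made ad hoc.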
Storage, Management and Analysis, and Disposition

Once you have data, you want to use it, so organizations undertaking big data projects have to work through issues of data management and analysis. This has implications for storage infrastructure. (It also sometimes pushes existing data management and analytics tools past their limits, which can drive adoption of alternatives to traditional SQL-driven relational databases, such as Hadoop.) The IS discipline of information lifecycle management dictates that storage be tailored to the use cases it serves. To serve big data, the storage infrastructure must be able to deal with the stresses big data brings with it.

First and foremost, there is the volume of data to be managed, typically in the multiple-petabyte range, and the rate at which it grows, often at a much faster clip than was the case for other datasets. Beyond the volume of data, there are challenges in servicing enough analytical transactions per second for jobs to finish in a reasonable amount of time, and in the size of the individual data objects as well. In a big data environment, analysis tools or applications may need to address tens or hundreds of times as much data as existing data warehouse and business intelligence systems. Likewise, analysis may involve tossing around video or audio clips or other objects that are each much larger than typical database records. This can strain or overwhelm storage systems already in place.

Figure 3: Big Data Grows Faster... Much Faster

There are several solutions to these problems. One is to break the storage up across more storage nodes in a scale-out architecture. The most widely used data management platform for big data, Hadoop, handles breaking a dataset across all the participating analytics nodes, each of which typically has some local or direct-attached storage that it controls. Another solution is to add significant amounts of solid-state disk to the environment to speed up response times to storage requests. A third is to use cloud resources to move some or all of the problem out of the data center entirely, using cloud servers and storage for the big data environment, or perhaps simply using cloud storage for older data.

IT can also use cloud resources to help with disaster resilience. Making big data resilient is a special challenge thanks to the volume of data in question, as well as the speed with which it grows and the demands on storage performance created by analysis. These make continuous syncing of the full data set among data centers a problem and push IT managers to approach it differently. One solution is to fragment the data set across multiple data centers, each holding only part of the data, and to replicate the fragments rather than full copies. Another is, again, to use cloud resources for disaster resilience.

And, of course, IS guides organizations to deal with the end-of-life question for data: Does it have an expiration date? If not, it has to be flagged for retention, and provisions must be made for its long-term storage (this is another place cloud storage can come into the picture). If so, processes have to be in place to remove it from storage and to audit that process for future reference.

Conclusions and Recommendations

Big data stresses tools, networks, and storage infrastructures. The volume of data in question, its rate of growth, the level of responsiveness required to perform useful analysis, and the needs of business continuity all create challenges for existing systems. To survive the transition to a big data environment, organizations should practice good information stewardship: define policy to guide every byte of information through a process of classification, its lifecycle in storage, protection and insurance against disaster, and ultimate disposition at end of life. Guided by IS principles, some combination of new analytics and storage architectures and a mix of public and private cloud resources can provide a big data infrastructure that both delivers solid performance in the face of new and dramatically more challenging workloads and remains resilient in the face of disaster. Big data stewards:
- Understand the goals of any big data project, as well as the full spectrum of data sources involved and the data management and analysis tools under review, to make the best storage architecture choices for supporting them.
- Incorporate the full lifecycle of data in the planning for the big data
initiative, from acquisition and classification through archiving or disposal.
- Fully calculate the cost of internal storage, including staff and environmental operating costs per terabyte of usable space, to have a yardstick for measuring the attractiveness of cloud storage for various functions.
- Consider the costs of disaster resilience (replication tools, spare capacity, and bandwidth, plus the accompanying staff costs).
- Evaluate cloud storage and cloud compute as adjuncts to, or replacements for, in-house resources in any phase of the lifecycle where they make economic sense.
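The internal-cost yardstick recommended above can be sketched as a small model: amortize the hardware, add the monthly staff and facilities costs, and divide by usable (not raw) capacity. Every input figure below is a hypothetical placeholder, not a number from the source:

```python
# Rough fully-loaded cost per usable terabyte, for comparison against
# a cloud storage quote. All example values are assumptions.
def monthly_cost_per_tb(hardware_cost, amortization_months,
                        monthly_staff_cost, monthly_facilities_cost,
                        raw_tb, usable_fraction):
    """Return the fully loaded monthly cost per usable terabyte."""
    monthly_hardware = hardware_cost / amortization_months
    total_monthly = monthly_hardware + monthly_staff_cost + monthly_facilities_cost
    usable_tb = raw_tb * usable_fraction  # RAID/replication overhead shrinks raw capacity
    return total_monthly / usable_tb

# Hypothetical example: a $500k array amortized over 4 years, $8k/month
# of staff time, $2k/month power and cooling, 600 TB raw at 70% usable.
cost = monthly_cost_per_tb(500_000, 48, 8_000, 2_000, 600, 0.7)
print(f"~${cost:.2f} per usable TB per month")
```

Whatever the actual inputs, the key design point is dividing by usable rather than raw capacity, so the yardstick stays honest when comparing against cloud pricing that is quoted per stored terabyte.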

About Nemertes Research: Nemertes Research is a research-advisory and strategic-consulting firm that specializes in analyzing and quantifying the business value of emerging technologies. You can learn more about Nemertes Research at our website, www.nemertes.com, or contact us directly at research@nemertes.com.