
NAVAL POSTGRADUATE SCHOOL
MONTEREY, CALIFORNIA

THESIS

VISUALIZATION OF BIG DATA THROUGH SHIP MAINTENANCE METRICS ANALYSIS FOR FLEET MAINTENANCE AND REVITALIZATION

by

Isaac J. Donaldson

March 2014

Thesis Advisor: Thomas Housel
Second Reader: Johnathan Mun

Approved for public release; distribution is unlimited


REPORT DOCUMENTATION PAGE (Form Approved OMB No. 0704-0188)

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

1. AGENCY USE ONLY: (Leave blank)
2. REPORT DATE: March 2014
3. REPORT TYPE AND DATES COVERED: Master's Thesis
4. TITLE AND SUBTITLE: VISUALIZATION OF BIG DATA THROUGH SHIP MAINTENANCE METRICS ANALYSIS FOR FLEET MAINTENANCE AND REVITALIZATION
5. FUNDING NUMBERS:
6. AUTHOR(S): Isaac J. Donaldson
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Monterey, CA 93943-5000
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): N/A
10. SPONSORING/MONITORING AGENCY REPORT NUMBER:
11. SUPPLEMENTARY NOTES: The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. government. IRB protocol number: N/A.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited
12b. DISTRIBUTION CODE: A
13. ABSTRACT (maximum 200 words): There are between 150 and 200 parameters for measuring the performance of ship maintenance processes in the U.S. Navy. Despite this level of detail, budgets and timelines for performing maintenance on the Navy's fleet appear to be problematic. Making sense of what these parameters mean in terms of the overall performance of ship maintenance processes is clearly a big data problem. The current process for presenting data on the more than 150 parameters measuring ship maintenance performance costs and processes, containing billions of data points, is still done by static, cumbersome spreadsheets. The central goal of this thesis is to provide a means to aggregate voluminous maintenance data in such a way that the causal factors contributing to cost and schedule overruns can be better understood by ship maintenance leadership. Big data visualization software was examined to determine if visualization tools could improve the understanding of U.S. Navy ship maintenance by its leaders. This thesis concludes that the visualization of big data supports decision making by enabling leaders to quickly identify trends, develop a better understanding of the problem space, establish defensible baselines for monitoring activities, perform forecasting, and evaluate metrics for use.
14. SUBJECT TERMS: Big Data, Big Data Visualization, Visualization Software, 3D Printing, 3D Laser Scanning Technology, Collaborative Product Lifecycle Management
15. NUMBER OF PAGES: 133
16. PRICE CODE:
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT: UU

NSN 7540-01-280-5500 / Standard Form 298 (Rev. 2-89), prescribed by ANSI Std. 239-18


Approved for public release; distribution is unlimited

VISUALIZATION OF BIG DATA THROUGH SHIP MAINTENANCE METRICS ANALYSIS FOR FLEET MAINTENANCE AND REVITALIZATION

Isaac J. Donaldson
Lieutenant, United States Navy
B.S., Embry-Riddle Aeronautical University, 2005

Submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN NETWORK OPERATIONS

from the

NAVAL POSTGRADUATE SCHOOL
March 2014

Author: Isaac J. Donaldson

Approved by: Thomas Housel, Thesis Advisor
Johnathan Mun, Second Reader
Dan Boger, Chair, Department of Information Sciences


ABSTRACT

There are between 150 and 200 parameters for measuring the performance of ship maintenance processes in the U.S. Navy. Despite this level of detail, budgets and timelines for performing maintenance on the Navy's fleet appear to be problematic. Making sense of what these parameters mean in terms of the overall performance of ship maintenance processes is clearly a big data problem. The current process for presenting data on the more than 150 parameters measuring ship maintenance performance costs and processes, containing billions of data points, is still done by static, cumbersome spreadsheets. The central goal of this thesis is to provide a means to aggregate voluminous maintenance data in such a way that the causal factors contributing to cost and schedule overruns can be better understood by ship maintenance leadership. Big data visualization software was examined to determine if visualization tools could improve the understanding of U.S. Navy ship maintenance by its leaders. This thesis concludes that the visualization of big data supports decision making by enabling leaders to quickly identify trends, develop a better understanding of the problem space, establish defensible baselines for monitoring activities, perform forecasting, and evaluate metrics for use.


TABLE OF CONTENTS

I. INTRODUCTION...1
   A. OVERVIEW...1
II. LITERATURE REVIEW...3
   A. BIG DATA...3
   B. THE BIG DATA ECOSYSTEM...6
   C. BIG DATA TECHNOLOGIES AND TOOLS...10
   D. GOVERNMENT SPENDING ON BIG DATA...23
   E. BIG DATA PROJECTS IN GOVERNMENT...24
   F. GOVERNMENT BIG DATA CASE STUDIES...27
   G. LESSONS LEARNED...29
   H. BIG DATA IN THE U.S. NAVY...31
III. SHIP MAINTENANCE VIGNETTES...33
   A. INTRODUCTION...33
   B. MAINTENANCE AND MODERNIZATION SPENDING...33
   C. MAINTENANCE VIGNETTES...35
      1. New Work Vignette: USS Iwo Jima...36
      2. Deferred Maintenance Vignette: USS Bataan and USS Iwo Jima...37
      3. Modernizations Vignettes: USS Iwo Jima and USS Wasp...38
   D. SUMMARY...40
IV. SHIP MAINTENANCE SIMULATIONS...41
   A. OVERVIEW...41
   B. MAINTENANCE COST CATEGORIES...43
   C. DATA COLLECTION...44
   D. FINAL SIMULATION RESULTS INCORPORATING DIFFERENT COMBINATIONS OF TECHNOLOGIES INTO U.S. NAVY SHIP MAINTENANCE PROGRAMS...44
   E. VISUALIZATION SOFTWARE ANALYSIS OF U.S. NAVY SHIP MAINTENANCE...45
      1. Visualization Model...45
      2. Definitized Estimate, All Ships...48
      3. Definitized Estimates of the Top 5 Ships...51
      4. Definitized Estimates of Top 5 Ships by Expense Details...53
      5. Actual Costs of the Top 5 Ships by Type Expense...57
      6. Definitized Estimate versus Actual of the Top 5 Ships by Type Expense...60
      7. Definitized Estimate versus Actual of the Top 5 Ships by Type Expense...63
      8. Definitized Estimate versus Actual of the Top 5 Ships by Work...65
      9. Simulation 1 and 2: Introduction of 3DP and AM Radical...68
         a. Actual versus 3DP for the Top 5 Ships by Type Expense...69
         b. Actual versus 3DP of the Top 5 Ships by Type Expense, Work...71
         c. Actual versus AM Radical of the Top 5 Ships by Type Expense...73
         d. Actual versus AM Radical of the Top 5 Ships by Type Expense, Work...76
      10. Alternative Figures...79
         a. Definitized Estimate versus Actual of the Type Expense by Work...79
         b. Definitized Estimate versus Actual of the Work by Ship...82
         c. Actual versus AM Radical of the Work by Ship...85
      11. LOD and Availability Density Bubble Charts...87
         a. LOD versus Expense (Actual Cost)...87
         b. LOD versus Expense (Actual Cost) Highlighted...90
         c. Availability Density versus Expense (Actual Cost)...92
      12. Drill Down Spreadsheets...95
   F. SUMMARY...106
V. CONCLUSIONS AND RECOMMENDATIONS...107
   A. CONCLUSIONS...107
   B. RECOMMENDATIONS...109
APPENDIX. BIG DATA IMPLICATIONS FOR ENTERPRISE ARCHITECTURE...111
LIST OF REFERENCES...117
INITIAL DISTRIBUTION LIST...121

LIST OF FIGURES

Figure 1. The Digital Universe (from Gantz & Reinsel, 2012)...3
Figure 2. Big Data Revenue by Type (from Kelly et al., 2013)...4
Figure 3. Big Data Revenue by Component (from Kelly et al., 2013)...5
Figure 4. Big Data Market Projection by Segment (from Kelly et al., 2013)...5
Figure 5. Bar Chart (from Choy, Chawla, & Whitman, 2012)...16
Figure 6. Box Plot (from Choy et al., 2012)...17
Figure 7. Bubble Plot (from Choy et al., 2012)...17
Figure 8. Correlation Matrix (from Choy et al., 2012)...18
Figure 9. Cross-Tabulation Chart (from Choy et al., 2012)...18
Figure 10. Clustergram (from Manyika et al., 2011)...19
Figure 11. Geo Map (from Choy et al., 2012)...19
Figure 12. Heat Map (from Choy et al., 2012)...19
Figure 13. Histogram (from Choy et al., 2012)...20
Figure 14. History Flow (from Manyika et al., 2011)...20
Figure 15. Line Chart (from Choy et al., 2012)...21
Figure 16. Pareto Chart (from Choy et al., 2012)...21
Figure 17. Scatter Plot (from Choy et al., 2012)...22
Figure 18. Tag Cloud (from Manyika et al., 2011)...22
Figure 19. Tree Map (from Choy et al., 2012)...22
Figure 20. U.S. Government Spending on Big Data (from King, August, 2013)...23
Figure 21. Systems Supported by DOD Maintenance (from OASD[L&MR], 2011)...33
Figure 22. U.S. Navy Ship Maintenance Costs (from Department of the Navy, 2012)...34
Figure 23. Ship Maintenance Work Classifications...35
Figure 24. Vignette Overview...36
Figure 25. Project Phases...42
Figure 26. Visualization Model (from J. Kornitsky, personal communication, November, 2013)...47
Figure 27. Definitized Estimate, All Ships Solar Graph (from J. Kornitsky, personal communication, November, 2013)...50
Figure 28. Definitized Estimate, Top 5 Ships Solar Graph (from J. Kornitsky, personal communication, November, 2013)...52
Figure 29. Definitized Estimate, Top 5 Ships, Expense Detail Solar Graph (from J. Kornitsky, personal communication, November, 2013)...56
Figure 30. Actual Cost, Top 5 Ships, Type Expense Solar Graph (from J. Kornitsky, personal communication, November, 2013)...59
Figure 31. Definitized Estimate versus Actual, Top 5 Ships, Type Expense, Solar Graph Close-up (from J. Kornitsky, personal communication, November, 2013)...62
Figure 32. Definitized Estimate versus Actual, Top 5 Ships, Type Expense Solar Graph (from J. Kornitsky, personal communication, November, 2013)...64
Figure 33. Definitized Estimate versus Actual, Top 5 Ships, Work Solar Graph (from J. Kornitsky, personal communication, November, 2013)...67
Figure 34. Actual versus 3DP, Top 5 Ships, Type Expense Solar Graph (from J. Kornitsky, personal communication, November, 2013)...70
Figure 35. Actual versus 3DP, Top 5 Ships, Type Expense, Work Solar Graph (from J. Kornitsky, personal communication, November, 2013)...72
Figure 36. Actual versus AM Radical, Top 5 Ships, Type Expense Solar Graph (from J. Kornitsky, personal communication, November, 2013)...75
Figure 37. Actual versus AM Radical, Top 5 Ships, Type Expense, Work Solar Graph (from J. Kornitsky, personal communication, November, 2013)...78
Figure 38. Definitized Estimate versus Actual, Type Expense, Work Solar Graph (from J. Kornitsky, personal communication, November, 2013)...81
Figure 39. Definitized Estimate versus Actual, Work, Ship Solar Graph (from J. Kornitsky, personal communication, November, 2013)...84
Figure 40. Actual versus AM Radical, Work, Ship Solar Graph (from J. Kornitsky, personal communication, November, 2013)...86
Figure 41. LOD versus Expense (Actual Cost) Bubble Chart (from J. Kornitsky, personal communication, November, 2013)...89
Figure 42. LOD versus Expense (Actual Cost) - Highlighted Bubble Chart (from J. Kornitsky, personal communication, November, 2013)...91
Figure 43. Availability Density versus Expense (Actual Cost) Bubble Chart (from J. Kornitsky, personal communication, November, 2013)...94
Figure 44. Barry Drill Down, 3 Levels of Detail Drill Down Spreadsheet (from J. Kornitsky, personal communication, 2013)...96
Figure 45. Barry Drill Down, 4 Levels of Detail Drill Down Spreadsheet (from J. Kornitsky, personal communication, 2013)...97

LIST OF TABLES

Table 1. Big Data Vendors (from Kelly et al., 2013)...6
Table 2. Big Data Analyzing Techniques (from Manyika et al., 2011)...11
Table 3. Big Data Analysis Technologies (from Manyika et al., 2011)...14
Table 4. High Level Summary of Case Studies (from TechAmerica Foundation, 2012)...29
Table 5. Cost Comparison by Ship (after J. Kornitsky, personal communication, November, 2013)...45
Table 6. Cost Comparison by Work (after J. Kornitsky, personal communication, November, 2013)...45


LIST OF ACRONYMS AND ABBREVIATIONS

3D  three dimensional
3DP  three-dimensional printing
ADAMS  Anomaly Detection at Multiple Scales
AM  additive manufacturing
CANES  Consolidated Afloat Networks and Enterprise Services
CINDER  cyber insider
CMS  Centers for Medicare and Medicaid Services
CPLM  Collaborative Product Lifecycle Management
DARPA  Defense Advanced Research Projects Agency
DDG  guided missile destroyer
DECKPLATE  Decision Knowledge Programming for Logistics Analysis and Technical Evaluation
DHS  Department of Homeland Security
DLH  direct labor hours
DM  deferred maintenance
DOD  Department of Defense
DON  Department of Navy
EA  enterprise architecture
ERA  Electronic Records Archive
G  growth
HSE  Homeland Security Enterprise
IRS  Internal Revenue Service
IT  information technology
KVA  knowledge value added
L&MR  Logistics & Material Readiness
LOD  lost operating days
LST  laser scanning technology
MGI  McKinsey Global Institute
NARA  National Archive and Records Administration
NASA  National Aeronautics and Space Administration
NAVAIR  Naval Air Systems Command
NAVSEA  Naval Sea Systems Command
NG  new growth
NOAA  National Oceanic and Atmospheric Administration
NW  new work
NoSQL  not only structured query language
NSSA  Norfolk Ship Support Activity
NPS  Naval Postgraduate School
OASD  Office of the Assistant Secretary of Defense
OW  original work
PCD  project completion date
PEO  Program Executive Office
PROCEED  Programming Computation on Encrypted Data
RFI  request for information
RMC  regional maintenance centers
ROI  return on investment
SME  subject matter expert
ST1MS  Surface Team One Metrics System
TB  terabyte
USS  United States Ship
UT  ultrasonic testing
VIRAT  Video and Image Retrieval Analysis Tool

EXECUTIVE SUMMARY

The extraordinary demand placed on U.S. armed forces requires that the highest levels of readiness be maintained. The pressure to reduce costs, while maintaining the highest levels of readiness, compels each of our military services to periodically review internal processes to ensure responsible use of our nation's resources. One such process currently in review involves Department of Defense maintenance programs.

In FY2011, the U.S. Navy spent $682 million maintaining its destroyers, which represent only 22% of the 286 ships currently in the fleet. According to a 2012 Government Accountability Office report on ship readiness, by 2019, the U.S. Navy expects to have grown its fleet by another 14 ships to a total of 300. The size of the U.S. Navy's ship maintenance budget makes it a prime candidate for review.

Reviewing ship maintenance programs is a complex task. There are between 150 and 200 parameters for measuring the performance of ship maintenance processes in the U.S. Navy. Despite this level of detail, budgets and timelines for performing maintenance on the Navy's fleet appear to be problematic. Making sense of what these parameters mean is clearly a big data problem. Fortunately, the value of big data analysis has become evident, and many analysis solutions exist. Big data visualization was selected for closer examination, and a sample of U.S. Navy ship maintenance availabilities was used to explore the technique.

Big data visualization software was examined to determine if visualization tools could improve the understanding of U.S. Navy ship maintenance by its leaders. This thesis concludes that the visualization of big data supports decision making by enabling leaders to quickly identify trends, develop a better understanding of the problem space, establish defensible baselines for monitoring activities, perform forecasting, and evaluate metrics for use. U.S. Navy ship maintenance decision makers who want to improve the speed and accuracy of their decisions should consider the use of visualization software. To optimize the use of big data visualization, this thesis recommends the continued and expanded collection of data, identification of performance accounting software for tracking, and the use of forecasting once accurate ship maintenance performance baselines are established.

ACKNOWLEDGMENTS

Much appreciation and gratitude goes to my thesis advisor, Dr. Thomas J. Housel, for his invaluable academic guidance and constant reassurances that I would be able to complete this thesis and graduate on time. Without Dr. Housel's support and countless hours refining this mountain of words into a cohesive document, this thesis would not exist.

Much thanks and appreciation goes to John Kornitsky for the copious amounts of personal time he sacrificed to ensure I understood the outputs of the visualization software enough to write about it. Also, his review and suggestions for improvement were beneficial parts of the thesis revising process.

Thanks to David J. Furey, an employee of Norfolk Ship Support Activity, for the many hours he spent on the phone with me, helping me to understand the importance and impact of new maintenance, deferred maintenance, and modernizations.

Thanks to Sandra Hom for her professional support, without which this thesis would not have been completed. Your assistance is greatly appreciated.

Thank you to Trent Silkey for your help in the early days of this thesis. Together, we made sense of endless spreadsheets of ship maintenance data and performed many correlational analyses in an attempt to identify trends. I appreciate your help.

But, most of all, thank you to my wife, Reeannon. Thank you for your patience, not only during the thesis process, but also throughout our lives together. I hope to reciprocate all the love and support you've given me.


I. INTRODUCTION

A. OVERVIEW

There are between 150 and 200 parameters for measuring the performance of ship maintenance processes in the U.S. Navy. Despite this level of detail, budgets and timelines for performing maintenance on the Navy's fleet appear to be problematic. Making sense of what these parameters mean in terms of the overall performance of ship maintenance processes is clearly a big data problem. A team from the Naval Postgraduate School (NPS) was requested by Program Executive Office (PEO) SHIPS to work with naval ship maintenance metrics groups to provide additional options regarding how large data sets could be optimized. The current process for presenting data on the more than 150 parameters measuring ship maintenance performance costs and processes, containing billions of data points, is still done by static, cumbersome spreadsheets. The central goal of this thesis is to provide a means to aggregate voluminous maintenance data in such a way that the causal factors contributing to cost and schedule overruns can be better understood by ship maintenance leadership. By providing this kind of information in an intuitively visual form, leadership could be assisted in budget and scheduling decision making. The results of the project are in this report.

In the first section, we review the big data world by looking at the $11 billion industry in 2012. We examine the issues, components, technologies, and tools surrounding big data. The next section focuses on big data and the federal government, which spent approximately $5 billion in 2012 on national security and military applications. Included in this section are public sector big data projects, case studies, and lessons learned. Vignettes are presented in Section 3 to provide a framework for understanding ship maintenance activities in the U.S. Navy. Section 4 illustrates the power of big data visualization software, with data provided by naval ship maintenance metrics groups. It provides examples of how large data sets could be optimized with alternative presentation methods showing a ship's maintenance status, including all operational costs and schedule deviations from planned maintenance. It shows how visualization tools can dig deeper into numbers to improve how key information is summarized and ultimately used in making critical maintenance allocation decisions. Data were collected on 19 U.S. Navy guided missile destroyers (DDG) across 21 maintenance availabilities for those DDGs. Information that was collected included definitized estimates prepared by subject matter experts (SME) in the planning process, along with the actual cost and availability data on three maintenance categories. Two simulations were run testing the potential impact of incorporating select technologies on ship maintenance processes. Conclusions and recommendations are found in the final section.

II. LITERATURE REVIEW

A. BIG DATA

The world is exploding in digital data. IDC Corporation predicts that from 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 zettabytes. Moreover, the digital universe will roughly double every two years from now until 2020, a 50-fold growth in ten years, as seen in Figure 1 (Gantz & Reinsel, 2012). More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide, and 350 million tweets are sent per day (Kelly, Floyer, Vellante, & Miniman, 2013). Companies around the world are capturing trillions of bytes of information on customers, suppliers, and operations. The McKinsey Global Institute (MGI) estimates that global enterprises stored more than 7 exabytes of new data on disk drives in 2010, while consumers stored more than 6 exabytes of new data on devices such as PCs and notebooks (Manyika et al., 2011). The U.S. government produced 848 petabytes of data in 2009, and the data collected by the U.S. Library of Congress as of April 2011 totaled 235 TB.

Figure 1. The Digital Universe (from Gantz & Reinsel, 2012)
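As a quick consistency check on those IDC figures (the endpoints are IDC's; the arithmetic below is ours), the implied doubling time does come out to roughly two years. A minimal sketch:

```python
# Implied growth rate of the digital universe: 130 EB (2005) to 40,000 EB (2020).
import math

start_eb, end_eb, years = 130, 40_000, 2020 - 2005
factor = end_eb / start_eb                              # ~308, i.e., "a factor of 300"
doubling_time = years * math.log(2) / math.log(factor)  # ~1.8 years
print(f"growth factor: {factor:.0f}x, doubling time: {doubling_time:.1f} years")
```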

For the purposes of our research, we will use MGI's definition of big data as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze" (Manyika et al., 2011). There are many challenges with big data, including the ability to capture, store, curate, search, transfer, share, analyze, and visualize the data. This section focuses on the big data ecosystem. It begins with a discussion of the market size, then discusses some of the tools and technologies used in big data analysis, and looks at federal government initiatives involving big data.

The total big data market reached $11.59 billion in 2012, up 61% from 2011, and is projected to reach $18.1 billion in 2013, according to Wikibon (Kelly et al., 2013). Figure 2 shows revenue by type, while Figure 3 gives a breakdown by component. Big data requires the use of software, hardware, and services.

Figure 2. Big Data Revenue by Type (from Kelly et al., 2013)

Figure 3. Big Data Revenue by Component (from Kelly et al., 2013)

In addition, Wikibon predicts the big data market to exceed $47 billion by 2017, growing at a 31% compound annual growth rate over the five-year period from 2012 to 2017, as seen in Figure 4 (Kelly et al., 2013).

Figure 4. Big Data Market Projection by Segment (from Kelly et al., 2013)
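The compound annual growth rate behind that projection is straightforward to verify; a minimal sketch of the arithmetic, using Wikibon's endpoints:

```python
# CAGR implied by growth from $11.59B (2012) to ~$47B (2017).
start, end, years = 11.59, 47.0, 5
cagr = (end / start) ** (1 / years) - 1
print(f"CAGR: {cagr:.1%}")  # ~32%, in line with the cited 31%
```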

B. THE BIG DATA ECOSYSTEM

Fueling the growth in big data sales are several factors:

- increased awareness of the benefits of big data as applied to industries beyond the web, most notably financial services, pharmaceuticals, and retail;
- implementation of big data analysis requires software such as Hadoop, NoSQL (not only structured query language) data stores, in-memory analytic engines, and massively parallel processing analytic databases;
- increasingly sophisticated professional services practices that assist enterprises in practically applying the big data requirements of hardware and software to business use cases;
- increased investment in big data infrastructure by massive Web properties (most notably Google, Facebook, and Amazon) and government agencies for intelligence and counter-terrorism purposes. (Kelly et al., 2013, Growth Drivers and Adoption Barriers, para. 3)

Wikibon has been tracking the market size, following more than 60 vendors that include both big data pure-plays and others for whom big data is part of multiple revenue sources (Kelly et al., 2013). Table 1 is a current list of the vendors.

Table 1. Big Data Vendors (from Kelly et al., 2013)
2012 Worldwide Big Data Revenue by Vendor ($US millions)

Vendor | Big Data Revenue | Total Revenue | Big Data Revenue as % of Total | % Big Data Hardware Revenue | % Big Data Software Revenue | % Big Data Services Revenue
IBM | $1,306 | $103,930 | 1% | 19% | 31% | 50%
HP | $664 | $119,895 | 1% | 34% | 29% | 38%
Teradata | $435 | $2,665 | 16% | 31% | 28% | 41%
Dell | $425 | $59,878 | 1% | 83% | 0% | 17%
Oracle | $415 | $39,463 | 1% | 25% | 34% | 41%
SAP | $368 | $21,707 | 2% | 0% | 67% | 33%
EMC | $336 | $23,570 | 1% | 24% | 36% | 39%
Cisco Systems | $214 | $47,983 | 0% | 58% | 0% | 42%
PwC | $199 | $31,500 | 1% | 0% | 0% | 100%
Microsoft | $196 | $71,474 | 0% | 0% | 67% | 33%
Accenture | $194 | $29,770 | 1% | 0% | 0% | 100%
Palantir | $191 | $191 | 100% | 0% | 36% | 64%
Fusion-io | $190 | $439 | 43% | 71% | 0% | 29%
SAS Institute | $187 | $2,954 | 6% | 0% | 59% | 41%
Splunk | $186 | $186 | 100% | 0% | 71% | 29%
Deloitte | $183 | $31,300 | 1% | 0% | 0% | 100%
NetApp | $138 | $6,454 | 2% | 77% | 0% | 23%
Hitachi | $130 | $112,318 | 0% | 0% | 0% | 100%
Opera Solutions | $118 | $118 | 100% | 0% | 0% | 100%
CSC | $114 | $15,825 | 1% | 0% | 0% | 100%
Mu Sigma | $114 | $114 | 100% | 0% | 0% | 100%
Booz Allen Hamilton | $88 | $5,802 | 1% | 0% | 0% | 100%
Amazon | $85 | $56,825 | 0% | 0% | 0% | 100%
TCS | $82 | $10,170 | 1% | 0% | 0% | 100%
Intel | $76 | $53,341 | 0% | 83% | 0% | 17%
Capgemini | $72 | $14,020 | 0% | 0% | 0% | 100%
MarkLogic | $69 | $78 | 88% | 0% | 63% | 38%
Cloudera | $56 | $56 | 100% | 0% | 47% | 53%
Actian | $46 | $46 | 100% | 0% | 50% | 50%
SGI | $43 | $769 | 6% | 83% | 0% | 17%
GoodData | $38 | $38 | 100% | 0% | 0% | 100%
1010data | $37 | $37 | 100% | 0% | 0% | 100%
10gen | $36 | $36 | 100% | 0% | 42% | 58%
Google | $36 | $50,175 | 0% | 0% | 0% | 100%
Alteryx | $36 | $36 | 100% | 0% | 55% | 45%
Guavus | $35 | $35 | 100% | 0% | 57% | 43%
VMware | $32 | $3,676 | 1% | 0% | 71% | 29%
ParAccel | $24 | $24 | 100% | 0% | 44% | 56%
TIBCO Software | $24 | $1,024 | 2% | 0% | 53% | 47%
Informatica | $24 | $812 | 2% | 0% | 63% | 37%
MapR | $23 | $23 | 100% | 0% | 51% | 49%
Pervasive Software | $22 | $51 | 37% | 0% | 41% | 59%
Attivio | $21 | $26 | 80% | 0% | 62% | 38%
Fractal Analytics | $20 | $20 | 100% | 0% | 0% | 100%
Hortonworks | $18 | $18 | 100% | 0% | 50% | 50%
Rackspace | $18 | $1,300 | 1% | 0% | 0% | 100%
QlikTech | $16 | $321 | 5% | 0% | 74% | 26%
DataStax | $15 | $15 | 100% | 0% | 59% | 41%
Basho | $14 | $14 | 100% | 0% | 63% | 38%
Microstrategy | $13 | $595 | 2% | 0% | 59% | 41%
Tableau Software | $13 | $130 | 10% | 0% | 59% | 41%
Kognitio | $13 | $12 | 100% | 0% | 47% | 53%
Couchbase | $12 | $12 | 100% | 0% | 64% | 36%
Datameer | $10 | $10 | 100% | 0% | 80% | 20%
LucidWorks | $9 | $9 | 100% | 0% | 60% | 40%
Digital Reasoning | $10 | $10 | 100% | 0% | 51% | 49%
Aerospike | $9 | $9 | 100% | 0% | 80% | 20%
Neo Technology | $9 | $9 | 100% | 0% | 62% | 38%
Think Big Analytics | $8 | $8 | 100% | 0% | 0% | 100%
Calpont | $8 | $8 | 100% | 0% | 60% | 40%
RainStor | $8 | $8 | 100% | 0% | 67% | 33%
SiSense | $7 | $7 | 100% | 0% | 40% | 60%
Revolution Analytics | $7 | $13 | 56% | 0% | 55% | 45%
Talend | $6 | $51 | 12% | 0% | 80% | 20%
Jaspersoft | $6 | $31 | 20% | 0% | 62% | 38%
Juniper Networks | $6 | $4,365 | 0% | 70% | 0% | 30%
Pentaho | $6 | $31 | 19% | 0% | 62% | 38%
DDN | $4 | $278 | 2% | 63% | 0% | 38%
Actuate | $5 | $137 | 3% | 0% | 63% | 37%
Original Device Manufacturers | $2,375 | $100,000 | 2% | 100% | 0% | 0%
Other | $1,613 | $197,170 | 1% | 17% | 13% | 70%
Total | $11,565 | $1,244,602 | 1% | 37% | 19% | 44%

Big data is generated by a variety of sources. The sources from which big data originate include industry-specific transactions, machine/sensor indications, web applications, and text (Ferguson, 2013). Industry-specific transactions can include call records and geographic location data. Machines generate extremely large volumes of information every day and can range in complexity from simple temperature readings to the performance parameters of a gas-turbine engine.

Big data on the web also ranges in format from machine language to customer comments on social networks, and it too is produced in considerable volume. Text sources can include archived documents, external reports, or customer account information (Ferguson, 2013).

Because big data comes from a variety of sources, it also possesses characteristics that distinguish it from data in the traditional context. Common terms used to define the qualities of big data include volume, variety, velocity, and value (Dijcks, 2013). From the listing of sources above, one can understand that the volume of data generated on a daily basis is enormous. For example, Dijcks (2013) stated that just a single jet engine produces 10 terabytes of data in 30 minutes. Extrapolate that example to include all the aircraft currently airborne, and then include all the factory infrastructure around the globe collecting data on production, service life, and maintenance requirements, and the enormity of big data volumes begins to emerge. Another characteristic of big data, variety, follows directly from the variety of sources: each produces data in different formats, and those formats require additional consideration to ensure that all systems can share data. Velocity, which is related to volume, is the frequency with which big data is created. To illustrate velocity, consider the relative size of a single Twitter feed (140 characters) against the large number of feeds generated in a given time period (Dijcks, 2013). Finally, value, the usefulness of the data, is the characteristic of big data that matters most to any enterprise. Refer to the Appendix for a discussion of the implications of big data for enterprise architecture (EA).

C. BIG DATA TECHNOLOGIES AND TOOLS

Many techniques can be used to analyze data sets. These techniques, which often draw upon statistics, computer science, and data science, can be applied to big data to generate insights into large and diverse datasets, as well as smaller ones. Table 2 summarizes some of these techniques.

Table 2. Big Data Analyzing Techniques (from Manyika et al., 2011)

A/B testing: Technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e., changes) will improve a given objective variable. Big data enables huge numbers of tests to be executed and analyzed, ensuring that groups are of sufficient size to detect meaningful (i.e., statistically significant) differences between the control and treatment groups.

Association rule learning: Set of techniques for discovering interesting relationships, i.e., association rules, among variables in large databases. These techniques consist of a variety of algorithms to generate and test possible rules. An application is market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing (a commonly cited example is the discovery that many supermarket shoppers who buy diapers also tend to buy beer). Used for data mining.

Classification: Set of techniques to identify the categories in which new data points belong, based on a training set containing data points that have already been categorized. One application is the prediction of segment-specific customer behavior (e.g., buying decisions, churn rate, consumption rate) where there is a clear hypothesis or objective outcome. These techniques are often described as supervised learning because of the existence of a training set; they stand in contrast to cluster analysis, a type of unsupervised learning. Used for data mining.

Cluster analysis: Statistical method for classifying objects that splits a diverse group into smaller groups of similar objects, whose characteristics of similarity are not known in advance. An example of cluster analysis is segmenting consumers into self-similar groups for targeted marketing. This is a type of unsupervised learning because training data are not used. Used for data mining.

Crowdsourcing: Technique for collecting data submitted by a large group of people or community (i.e., the "crowd") through an open call, usually through networked media such as the Web. This is a type of mass collaboration and an instance of using Web 2.0.

Data fusion and data integration: Set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data. Signal processing techniques can be used to implement some types of data fusion. One example of an application is sensor data from the Internet of Things being combined to develop an integrated perspective on the performance of a complex distributed system such as an oil refinery. Data from social media, analyzed by natural language processing, can be combined with real-time sales data, in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behavior.

Data mining: Set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include association rule learning, cluster analysis, classification, and regression. Applications include mining customer data to determine segments most likely to respond to an offer, mining human resources data to identify characteristics of most successful employees, or market basket analysis to model the purchase behavior of customers.
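The first entry above, A/B testing, reduces to a two-sample significance test. A minimal sketch in Python using SciPy; the two groups of metric observations are invented for illustration:

```python
# Two-sample t-test comparing a control group and a treatment group.
from scipy import stats

control = [12.1, 11.8, 12.4, 11.9, 12.0, 12.3, 11.7, 12.2]
treatment = [12.6, 12.9, 12.4, 13.1, 12.8, 12.5, 13.0, 12.7]

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The treatment effect is statistically significant at the 5% level.")
```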

Ensemble learning: Using multiple predictive models (each developed using statistics and/or machine learning) to obtain better predictive performance than could be obtained from any of the constituent models. This is a type of supervised learning.

Genetic algorithms: Technique used for optimization that is inspired by the process of natural evolution or "survival of the fittest." Potential solutions are encoded as chromosomes that can combine and mutate. These individual chromosomes are selected for survival within a modeled environment that determines the fitness or performance of each individual in the population. Often described as a type of evolutionary algorithm, these algorithms are well-suited for solving nonlinear problems. Examples of applications include improving job scheduling in manufacturing and optimizing the performance of an investment portfolio.

Machine learning: Subspecialty of computer science (within a field historically called "artificial intelligence") concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. Natural language processing is an example of machine learning.

Natural language processing (NLP): Set of techniques from a subspecialty of computer science (within a field historically called "artificial intelligence") and linguistics that uses computer algorithms to analyze human (natural) language. Many NLP techniques are types of machine learning. One application of NLP is using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.

Neural networks: Computational models, inspired by the structure and workings of biological neural networks (i.e., the cells and connections within a brain), that find patterns in data. Neural networks are well-suited for finding nonlinear patterns. Can be used for pattern recognition and optimization. Some neural network applications involve supervised learning and others involve unsupervised learning. Examples of applications include identifying high-value customers that are at risk of leaving a particular company and identifying fraudulent insurance claims.

Network analysis: Set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed, e.g., how information travels, or who has the most influence over whom. Examples of applications include identifying key opinion leaders to target for marketing, and identifying bottlenecks in enterprise information flows.

Optimization: Portfolio of numerical techniques used to redesign complex systems and processes to improve their performance according to one or more objective measures (e.g., cost, speed, or reliability). Examples of applications include improving operational processes such as scheduling, routing, and floor layout, and making strategic decisions such as product range strategy, linked investment analysis, and R&D portfolio strategy. Genetic algorithms are an example of an optimization technique.

Pattern recognition: Set of machine learning techniques that assigns some sort of output value (or label) to a given input value (or instance) according to a specific algorithm. Classification techniques are an example.

Predictive modeling: A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome. An example of an application in customer relationship management is the use of predictive models to estimate the likelihood that a customer will "churn" (i.e., change providers) or the likelihood that a customer can be cross-sold another product. Regression is one example of the many predictive modeling techniques.
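To make the predictive-modeling entry concrete, here is a minimal churn sketch using scikit-learn's logistic regression; the customer features and labels are invented for illustration:

```python
# Logistic regression estimating the probability that a customer churns.
from sklearn.linear_model import LogisticRegression

# Each row: [months_as_customer, support_calls_last_quarter] (invented data)
X = [[2, 5], [3, 4], [4, 6], [24, 0], [36, 1], [48, 0], [12, 2], [6, 3]]
y = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)

# Estimated churn probability for a new customer profile.
print(model.predict_proba([[5, 4]])[0][1])
```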

Regression: Set of statistical techniques to determine how the value of the dependent variable changes when one or more independent variables is modified. Often used for forecasting or prediction. Examples of applications include forecasting sales volumes based on various market and economic variables or determining what measurable manufacturing parameters most influence customer satisfaction. Used for data mining.

Sentiment analysis: Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material. Key aspects of these analyses include identifying the feature, aspect, or product about which a sentiment is being expressed, and determining the type, polarity (i.e., positive, negative, or neutral) and the degree and strength of the sentiment. Examples of applications include companies applying sentiment analysis to analyze social media (e.g., blogs, microblogs, and social networks) to determine how different customer segments and stakeholders are reacting to their products and actions.

Signal processing: Set of techniques from electrical engineering and applied mathematics originally developed to analyze discrete and continuous signals, i.e., representations of analog physical quantities (even if represented digitally) such as radio signals, sounds, and images. This category includes techniques from signal detection theory, which quantifies the ability to discern between signal and noise. Sample applications include modeling for time series analysis or implementing data fusion to determine a more precise reading by combining data from a set of less precise data sources (i.e., extracting the signal from the noise).

Spatial analysis: Set of techniques, some applied from statistics, which analyze the topological, geometric, or geographic properties encoded in a data set. Often the data for spatial analysis come from geographic information systems (GIS) that capture data including location information, e.g., addresses or latitude/longitude coordinates. Examples of applications include the incorporation of spatial data into spatial regressions (e.g., how is consumer willingness to purchase a product correlated with location?) or simulations (e.g., how would a manufacturing supply chain network perform with sites in different locations?).

Statistics: Science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to make judgments about what relationships between variables could have occurred by chance (the "null hypothesis"), and what relationships between variables likely result from some kind of underlying causal relationship (i.e., that are "statistically significant"). Statistical techniques are also used to reduce the likelihood of Type I errors ("false positives") and Type II errors ("false negatives"). An example of an application is A/B testing to determine what types of marketing material will most increase revenue.

Supervised learning: Set of machine learning techniques that infer a function or relationship from a set of training data. Examples include classification and support vector machines.

Simulation: Modeling the behavior of complex systems, often used for forecasting, predicting, and scenario planning. Monte Carlo simulations, for example, are a class of algorithms that rely on repeated random sampling, i.e., running thousands of simulations, each based on different assumptions. The result is a histogram that gives a probability distribution of outcomes. One application is assessing the likelihood of meeting financial targets given uncertainties about the success of various initiatives.

Time series analysis: Set of techniques from both statistics and signal processing for analyzing sequences of data points, representing values at successive times, to extract meaningful characteristics from the data. Examples of time series analysis include the hourly value of a stock market index or the number of patients diagnosed with a given condition every day. Time series forecasting is the use of a model to predict future values of a time series based on known past values of the same or other series. Some of these techniques, e.g., structural modeling, decompose a series into trend, seasonal, and residual components, which can be useful for identifying cyclical patterns in the data. Examples of applications include forecasting sales figures, or predicting the number of people who will be diagnosed with an infectious disease.
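To make the simulation entry above concrete, here is a minimal Monte Carlo sketch; the cost distributions and budget threshold are invented for illustration:

```python
# Monte Carlo estimate of the probability of exceeding a budget.
import random

def one_trial():
    # Each cost is drawn from a triangular distribution: (low, high, mode).
    labor = random.triangular(80, 150, 100)
    material = random.triangular(40, 90, 50)
    rework = random.triangular(0, 40, 10)
    return labor + material + rework

trials = [one_trial() for _ in range(100_000)]
budget = 200
overrun = sum(t > budget for t in trials) / len(trials)
print(f"Estimated probability of exceeding the budget: {overrun:.1%}")
```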

Unsupervised learning: Set of machine learning techniques that find hidden structure in unlabeled data. Cluster analysis is an example of unsupervised learning (in contrast to supervised learning).

Visualization: Techniques used for creating images, diagrams, or animations to communicate, understand, and improve the results of big data analyses.

There are a growing number of technologies used to aggregate, manipulate, manage, and analyze big data. Some of the most widely used are summarized in Table 3.

Table 3. Big Data Analysis Technologies (from Manyika et al., 2011)

Big Table: Proprietary distributed database system built on the Google File System. Inspiration for HBase.

Business Intelligence: A type of application software designed to report, analyze, and present data. Often used to read data previously stored in a data warehouse or data mart. Also used to create standard reports that are generated on a periodic basis, or to display information on real-time management dashboards, i.e., integrated displays of metrics that measure the performance of a system.

Cassandra: An open source database management system designed to handle huge amounts of data on a distributed system. The system was originally developed at Facebook and is now managed as a project of the Apache Software Foundation.

Cloud Computing: A computing paradigm in which highly scalable computing resources, often configured as a distributed system, are provided as a service through a network.

Data mart: Subset of a data warehouse, used to provide data to users, usually through business intelligence tools.

Data warehouse: Specialized database optimized for reporting, often used for storing large amounts of structured data. Data are uploaded using ETL (extract, transform, and load) tools from operational data stores, and reports are often generated using business intelligence tools.

Distributed system: Multiple computers, communicating through a network, used to solve a common computational problem. The problem is divided into multiple tasks, each of which is solved by one or more computers working in parallel. Benefits of distributed systems include higher performance at a lower cost (i.e., because a cluster of lower-end computers can be less expensive than a single higher-end computer), higher reliability (i.e., because of a lack of a single point of failure), and more scalability (i.e., because increasing the power of a distributed system can be accomplished by simply adding more nodes rather than completely replacing a central computer).
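The distributed-system entry, and the MapReduce framework described in the next part of the table, are easiest to see in miniature. Below is a word-count sketch of the map, shuffle, and reduce phases, run locally in plain Python; an actual Hadoop job would distribute these phases across many machines.

```python
# Word count in the MapReduce style, executed locally for illustration.
from collections import defaultdict
from itertools import chain

documents = ["big data on ships", "ship maintenance data", "big ships"]

def mapper(doc):
    # Map phase: emit a (word, 1) pair for every word in a document.
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group emitted values by key (word).
groups = defaultdict(list)
for key, value in chain.from_iterable(mapper(d) for d in documents):
    groups[key].append(value)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # e.g., {'big': 2, 'data': 2, 'ships': 2, ...}
```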

Dynamo: Proprietary distributed data storage system developed by Amazon.

Extract, transform, and load (ETL) tools: Software tools used to transfer data from one location and integrate it into the data set.

Google File System: Proprietary distributed file system developed by Google; part of the inspiration for Hadoop.

Hadoop: Open source software framework for processing huge datasets on certain kinds of problems on a distributed system. Its development was inspired by Google's MapReduce and Google File System. It was originally developed at Yahoo! and is now managed as a project of the Apache Software Foundation.

HBase: Open source, distributed, non-relational database modeled on Google's Big Table. Originally developed by Powerset and now managed as a project of the Apache Software Foundation as part of Hadoop.

MapReduce: Software framework introduced by Google for processing huge datasets on certain kinds of problems on a distributed system. Also implemented in Hadoop.

Mashup: Application that uses and combines data presentation or functionality from two or more sources to create new services. Applications are often made available on the Web, and frequently use data accessed through open application programming interfaces or from open data sources.

Metadata: Data that describes the content and context of data files, e.g., means of creation, purpose, time and date of creation, and author.

Non-relational database: A database that does not store data in tables (rows and columns).

R: Open source programming language and software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians for developing statistical software and is widely used for statistical software development and data analysis.

Relational database: Database made up of a collection of tables (relations), i.e., data are stored in rows and columns. Relational database management systems (RDBMS) store a type of structured data. SQL is the most widely used language for managing relational databases.

Semi-structured data: Data that do not conform to fixed fields but contain tags and other markers to separate data elements. Examples include XML or HTML-tagged text.

SQL: Originally an acronym for "structured query language," SQL is a computer language designed for managing data in relational databases. The technique includes the ability to insert, query, update, and delete data, as well as manage data schema (database structures) and control access to data in the database.

Stream processing: Technologies designed to process large real-time streams of event data. Enables applications such as algorithmic trading in financial services, RFID event processing applications, fraud detection, process monitoring, and location-based services in telecommunications.

Structured data: Data that reside in fixed fields. Examples include relational databases or data in spreadsheets.

Unstructured data: Data that do not reside in fixed fields. Examples include free-form text (e.g., books, articles, body of e-mail messages), untagged audio, image, and video data.

Visualization: Technologies used for creating images, diagrams, or animations to communicate a message, often used to synthesize the results of big data analyses.

In working with massive amounts of data, the challenge of displaying data and choosing visualization methods is critical in finding connections and relevance among millions of parameters and variables, conveying linkages, hypotheses, and metrics, and projecting future outcomes. Taken one level further, interactive visualization moves visualization from static spreadsheets and graphics to images capable of drilling down for more detail, immediately changing how data are presented and processed. Examples of visualization methods include:

Bar charts are commonly used for comparing the quantities of different categories or groups.

Figure 5. Bar Chart (from Choy, Chawla, & Whitman, 2012)

Box plots represent a distribution of data values, displaying five summary statistics (minimum, lower quartile, median, upper quartile, and maximum) that summarize the distribution of a set of data. Extreme values are represented by whiskers extending from the edges of the box.

Figure 6. Box Plot (from Choy et al., 2012)

Bubble plots are variations of a scatter plot in which the data markers are replaced with bubbles, with each bubble representing an observation (or group of observations). They are useful for data sets with many values or when values differ by orders of magnitude.

Figure 7. Bubble Plot (from Choy et al., 2012)

Correlation matrices combine big data with fast response times to quickly identify which variables among millions or billions are related. They also show the relationship strength between variables.

Figure 8. Correlation Matrix (from Choy et al., 2012)

Cross-tabulation charts show frequency distributions or other aggregate statistics for the intersections of two or more category data items. Crosstabs enable examination of data for intersections of hierarchy nodes or category values.

Figure 9. Cross-Tabulation Chart (from Choy et al., 2012)

Clustergrams display how individual members of a dataset are assigned to clusters as the number of clusters increases.

Figure 10. Clustergram (from Manyika et al., 2011)

Geo maps display data as a bubble plot overlaid on a geographic map. Each bubble is located either at the center of a geographic region or at location coordinates.

Figure 11. Geo Map (from Choy et al., 2012)

Heat maps display the distribution of values for two data items using a table with colored cells. Colors are used to communicate relationships between data values.

Figure 12. Heat Map (from Choy et al., 2012)
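A correlation matrix is often rendered as a heat map, so the two methods pair naturally. Below is a minimal NumPy sketch that computes a correlation matrix over three synthetic "metrics"; the first two are constructed to be strongly related.

```python
# Correlation matrix over three synthetic metrics (invented data).
import numpy as np

rng = np.random.default_rng(0)
labor_hours = rng.normal(100, 15, size=50)
material_cost = labor_hours * 2.5 + rng.normal(0, 10, size=50)  # tracks labor
delay_days = rng.normal(5, 2, size=50)                          # independent

# Rows are variables, columns are observations.
data = np.vstack([labor_hours, material_cost, delay_days])
print(np.corrcoef(data).round(2))  # 3x3 matrix; labor vs. material near 1.0
```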

Histograms are variations of bar charts using rectangles to show the frequency of data items in successive numerical intervals of equal size. They are often used to quickly show the distribution of values in large data sets.

Figure 13. Histogram (from Choy et al., 2012)

History flow charts show the evolution of a document edited by multiple contributing authors. Time appears on the horizontal axis, while contributions to the text are on the vertical axis; each author has a different color code and the vertical length of a bar indicates the amount of text written by each author.

Figure 14. History Flow (from Manyika et al., 2011)

Line charts show the relationship of one variable to another by using a line that connects the data values. They are most often used to track changes or trends over time.

Figure 15. Line Chart (from Choy et al., 2012)

Pareto charts are a specialized type of vertical bar chart where values of the dependent variables are plotted in decreasing order of frequency from left to right. They are used to quickly identify when certain issues need attention.

Figure 16. Pareto Chart (from Choy et al., 2012)

Scatter plots are two-dimensional plots showing joint variation of two (or three) variables from a group of table rows. They are useful for examining the relationships, or correlations, between numeric data items.

Figure 17. Scatter Plot (from Choy et al., 2012)

Tag clouds are a weighted visual list in which words appearing most frequently are larger and words appearing less frequently, smaller.

Figure 18. Tag Cloud (from Manyika et al., 2011)

Tree maps are a variation of heat maps using rectangles (tiles) to represent data components. The largest rectangle represents the dominant division of the data and smaller rectangles represent subdivisions.

Figure 19. Tree Map (from Choy et al., 2012)
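Several of these chart types take only a few lines of matplotlib to produce. A minimal sketch follows; the ship names and maintenance-cost numbers are invented for illustration.

```python
# Bar chart and scatter plot of invented estimate-versus-actual cost data.
import matplotlib.pyplot as plt

ships = ["DDG A", "DDG B", "DDG C", "DDG D"]
estimate = [4.1, 3.6, 5.2, 2.9]  # definitized estimates ($M, invented)
actual = [4.8, 3.5, 6.0, 3.4]    # actual costs ($M, invented)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compare quantities across categories.
ax1.bar(ships, actual)
ax1.set_title("Actual Cost by Ship ($M)")

# Scatter plot: examine the relationship between two numeric items.
ax2.scatter(estimate, actual)
ax2.set_xlabel("Definitized Estimate ($M)")
ax2.set_ylabel("Actual Cost ($M)")
ax2.set_title("Estimate vs. Actual")

plt.tight_layout()
plt.show()
```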