From Big Data to Actionable Insight

Bob Palmer, Senior Director, SAP National Security Services (SAP NS2)
Dan Dorchinsky, Client Director, SAP National Security Services (SAP NS2)

August 2012
www.sapns2.com

About SAP National Security Services (SAP NS2)
SAP National Security Services (SAP NS2) offers a full suite of enterprise applications, analytics, database, cloud, and mobility software solutions from SAP and Sybase, with specialized levels of security and support to meet the unique mission requirements of US national security and critical infrastructure customers. SAP National Security Services and SAP NS2 are trademarks owned by SAP Government Support and Services (SAP GSS). For more information, visit www.sapns2.com.

Big Data has become a hot topic as information technology (IT) leaders in business and government struggle with how best to leverage the rising flood of data coming from myriad sources. The explosion of information is a double-edged sword. On one hand, the data can reveal new insights that would have previously remained hidden. On the other hand, the quantity of data brings challenges in capturing, storing, sharing, searching, and analyzing that data. The term Big Data has come to describe data sets that are so large and complex that they are too cumbersome to manage using traditional tools or processes. In this paper, we propose that national security IT customers tackle Big Data with a hybrid solution of open-source and commercial technologies. The hybrid solution reduces the operating costs of managing and storing Big Data while providing agility and real-time speed for high-performance analytics, all of which helps mission leaders and war fighters be more effective in carrying out their missions.
What is Big Data?

In today's world, we have access to unimaginably large volumes of information from a growing number of data sources. While the exponentially expanding stream of information has made it possible to accomplish more and to address problems differently (for example, spotting new business trends or drawing conclusions on theories that could never before be tested), the sheer volume quickly becomes overwhelming. (See Figure 1.) Data sources are increasing daily:

- There are 5.9 billion mobile phone subscriptions worldwide, equivalent to 87 percent of the world's population.1
- Wal-Mart handles more than 1 million customer transactions every hour.2
- More than 2 billion people are accessing the Internet on a regular basis, creating data with every click.3
- The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007. It is predicted that the amount of traffic flowing over the Internet will reach 667 exabytes annually by 2013, and the world's stores of digital information will increase tenfold every five years.4

1 International Telecommunication Union, "ICT Facts and Figures," ITU Telecom World (2011), 2, http://www.itu.int/itu-d/ict/facts/2011/material/ictfactsfigures2011.pdf.
2 "Data, data everywhere," The Economist, 25 February 2010, http://www.economist.com/node/15557443.
3 International Telecommunication Union, "ICT Facts and Figures," ITU Telecom World (2011), 1, http://www.itu.int/itu-d/ict/facts/2011/material/ictfactsfigures2011.pdf.
4 Martin Hilbert and Priscila Lopez, "The World's Technological Capacity to Store, Communicate, and Compute Information," Science 332, no. 6025 (1 April 2011): 60-65.
The explosion of information is a double-edged sword. On one hand, the data can reveal new insights that would have previously remained hidden. On the other hand, the quantity of data brings challenges in capturing, storing, sharing, searching, and analyzing the data. Thus, the term Big Data has become an industry catchphrase to describe data sets that are so large and complex that they are too cumbersome to manage using traditional tools or processes.

Before discussing specific technologies for harnessing Big Data, it is important to understand that there is no single solution or universal approach to Big Data. The platforms, specific functionalities, and tools chosen need to be based on the requirements of the mission. The platform will likely include a combination of open-source and commercial-grade enterprise software to create an end-to-end solution with data on one end and analytics on the other. This paper provides a notional architecture of this end-to-end platform and highlights how the Department of Defense (DoD) can use a combination of solutions to maximize the value of its data.

DoD Problem Statement

Addressing large volumes of data is not new, per se. However, there are key differences between addressing today's Big Data conundrum and the challenges of the past:

(1) The recent confluence of cloud, social media, and mobile computing trends has reshaped the approach to Big Data. The ability to collect and analyze the staggering volume of data these media generate has driven organizations in many different industries to seek cost-effective solutions.
(2) Data consumers (e.g., constituents, service members, and stakeholders) expect that data is served up instantaneously. In other words, they assume there is little to no data latency.

(3) Data consumers expect that applications running against the data have the ability to proactively identify trends, and they expect the data to allow for self-service discovery in a user-friendly, visually appealing manner.

Like the business and science communities, the Pentagon faces a huge Big Data challenge, especially when it comes to unstructured data that does not fit well into a relational table and may include text, sensor data, images, or other elements. According to recent news reports, the DoD has invested billions of dollars in new electronic systems that gather and store vast quantities of imagery and other data from the battlefield. However, the digital deluge is so vast that sifting through it manually to generate actionable information is not practical, sustainable, or cost-effective. According to Zach Lemnios, the Assistant Secretary of Defense for Research and Engineering, the DoD has progressed the quality of imagers and sensors to the point where "our limitation is no longer the front end collection tools, it's the back-end decision tools. How do I take a large data set and integrate it in a time-critical way?"5

Clearly, deriving new insights, recognizing relationships, and making increasingly accurate predictions are critically important to all of the defense and intelligence services. In addition to ingesting and digesting the sheer volume of data, the DoD faces an additional challenge: sharing information across different mission areas and partners. There is no standardized way of dealing with Big Data across such boundaries.

5 Jared Serbu, "DoD R&D prioritizes 'Big Data,'" Federal News Radio, 11 April 2012, http://www.federalnewsradio.com/?nid=885&sid=2824044.
Building the Solution Platform

Step 1: Capturing and Storing Huge Volumes of Unstructured Data

As stated previously, many of the functions behind Big Data (data warehousing, mining, analytics) are not new, but they have been out of reach of many organizations due to prohibitive storage and processing costs. Obviously, a lower-cost approach will always prove most favorable for adoption. Therefore, it is not a surprise that open-source software frameworks, such as Hadoop, have gained popularity for managing a major aspect of the Big Data platform (storing unstructured data), in large part because of the cost.6 For example, data stored on a Hadoop cluster costs less than one-tenth of what an equivalent relational database would cost. Hadoop clusters were also designed to better process unstructured data, which represents 95% of the data currently being created. Video, picture images, and voice files are not easily stored in relational databases. Lastly, Hadoop offers significant advantages with linear scalability.

Although this paper is not intended to focus on Hadoop, it is important to define it for uninitiated executives. Keep in mind that SAP NS2's position is not for or against any specific technology. Rather, we embrace and extend other technologies. Combining the best features of an open-source solution (such as Hadoop) with innovative, enterprise-ready technology will deliver the most value at the best cost for many organizations with Big Data requirements. Further, a Big Data solution should let an organization store data in its native object format and then enable users to pick and choose what items to bring forward for analysis.

Quite simply, Hadoop is a distributed file system, not a database. The Hadoop Distributed File System (HDFS) manages the splitting up and storage of large files of data across many inexpensive commodity servers, which are known as worker nodes and cost hundreds, not thousands, of dollars per terabyte. When Hadoop splits up a file, it puts redundant copies of the file's chunks on more than one disc drive, providing self-healing redundancy in case a low-cost commodity server fails.

6 Hadoop was chosen to represent the open-source file system in our notional platform due to its increasing popularity and market position as a distributed file store.
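To make the storage model concrete, here is a brief sketch (our own illustration, not part of the paper's notional architecture) that uses the standard Hadoop FileSystem Java API to copy a large raw file into HDFS and then list which worker nodes hold each replicated block. The cluster address and file paths are illustrative assumptions.

```java
// Minimal sketch: ingest a large raw file into HDFS and show how it is split into
// blocks that are replicated across worker nodes. The NameNode address and paths
// below are hypothetical; a real cluster supplies them through its own configuration.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.local:8020"); // hypothetical cluster

        FileSystem fs = FileSystem.get(conf);

        // Copy a large unstructured file (e.g., a raw sensor feed) into the cluster.
        Path local  = new Path("/data/local/sensor-feed-2012-08.log");
        Path remote = new Path("/bigdata/raw/sensor-feed-2012-08.log");
        fs.copyFromLocalFile(local, remote);

        // HDFS has already broken the file into 64MB/128MB blocks and placed redundant
        // copies on several nodes; list the hosts holding each block.
        FileStatus status = fs.getFileStatus(remote);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```

Block size and replication factor are cluster-level settings; the point is simply that a very large file lands as redundant chunks spread across inexpensive nodes, which is what the next step exploits for parallel processing.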
Hadoop also manages the distribution of scripts that perform business logic on the data files that are split up across those many server nodes. This splitting up of the business logic across the CPUs and RAM of many inexpensive worker nodes is what makes Hadoop work well on very large Big Data files. Analysis logic is performed in parallel on all of the server nodes at once, on each of the 64MB or 128MB chunks of the file. Hadoop software is written in Java and is licensed for free; it was developed as an open-source initiative of the Apache Foundation.

Assuming that the technological challenge of capturing and storing massive unstructured data has been addressed with a solution like Hadoop, two vitally important questions remain: 1) How can the organization make the data relevant to act upon in real time, and 2) How can the organization accomplish this cost-effectively?

Step 2: Adding an Analytical Data Warehouse for Real-Time Logic

Our approach combines the Hadoop distributed file system for storing large amounts of unstructured data with (1) an analytical data warehouse for processing real-time analysis logic, and (2) a self-service, web-based user interface for visualizing the data. This approach may irritate purists who advocate solely for the Free and Open-Source Software (FOSS) movement or for the more traditional model of commercial-off-the-shelf (COTS) software. But in looking at this problem, we have concluded that a combination of solutions brings the right tools to the job. The Hadoop system is a scalable approach to taking in and storing very large files of unstructured and semi-structured data inexpensively. The content of those files can then be sorted and processed in parallel as instructed by code written by data scientists using the MapReduce methodology.7

7 MapReduce was pioneered by Google and further enhanced by internet indexing entities such as Yahoo! and by companies, such as Amazon.com, that track massive amounts of click-stream data.
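For readers unfamiliar with the MapReduce methodology, the following minimal sketch shows the shape of such code. It is our own illustration, counting events per sensor across an entire HDFS file of raw log lines; the tab-separated input format and field positions are assumptions rather than anything prescribed by the solution.

```java
// Minimal MapReduce sketch: count events per sensor across an entire raw log file in HDFS.
// The log format (tab-separated, sensor ID in the first field) is a hypothetical assumption.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SensorEventCount {

    // Map phase: runs in parallel on each 64MB/128MB chunk; emits (sensorId, 1) per line.
    public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text sensorId = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                sensorId.set(fields[0]);
                context.write(sensorId, ONE);
            }
        }
    }

    // Reduce phase: sums the counts emitted for each sensor ID.
    public static class EventReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text sensorId, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(sensorId, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sensor event count");
        job.setJarByClass(SensorEventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(EventReducer.class);
        job.setReducerClass(EventReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/bigdata/raw/sensor-feed-2012-08.log"));
        FileOutputFormat.setOutputPath(job, new Path("/bigdata/results/sensor-event-counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Operationally, a job like this runs as a batch over the whole file and must be rewritten or re-run for each new question, which is exactly the limitation discussed next.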
However, Hadoop and MapReduce have limitations and cannot effectively address Big Data solely on their own. A scalability problem arises in a pure Hadoop environment, and it is not merely a problem of scaling the number of cheap server nodes or the disc space in the data center. Hadoop with MapReduce is essentially batch-oriented: a developer builds a MapReduce script to operate on the whole file, which may be multiple terabytes in size, and the script may run for 20 minutes, an hour, or even longer. There is no indexing or schema for the file system, nor is there the capability to create, update, or delete. Given appropriate time and enough skilled data scientists, any type of analysis (predictive, comparative, pattern recognition, text analysis, or time series) can be run against the data in a batch process using MapReduce. While this process is effective, the outcome creates a conundrum due to the iterative nature of data analytics. For example, the answer to the initial query may be accurate, but often another requirement will emerge to interrogate the data again, this time slightly differently, because analysis is inherently an iterative process. This creates another problem of scalability. In other words, how many hours will it take for highly skilled data scientists to write and rewrite analyses as driven by the shifting needs of mission specialists and war fighters?

Step 3: A Hybrid Solution Combining the Distributed File System with a Columnar-Structured Analytical Repository

The ultimate goal of a Big Data solution is to drive actionable insight from as much information as possible. A state-of-the-art approach empowers end users to query the data, with near-instant response time, in a much more self-service manner. This approach can be delivered by combining Hadoop with a high-performance data store in columnar format. The columnar data format is appropriate for this use case for three reasons:

- Because the data originated from unstructured sources, the columnar format avoids the rigorous data modeling needed to construct a traditional Ralph Kimball-type star schema or row-based relational table structure.
- The columnar data store obviates the need to build a preconceived indexing strategy for the data because, in a sense, in a columnar data store the data is the index.

- Finally, the columnar data store allows for extreme data compression through bit-mapping, because all of the attributes of the data are organized together in columns instead of being distributed across rows.

Certainly, data compression is appropriate for the large volume of data being operated upon in this Big Data scenario.

SAP has been a thought leader in columnar database development, and since the acquisition of Sybase in 2010, its expertise has deepened. For years, Sybase IQ was the pioneering columnar database in the market; and now, with more than 13 years of continuous development, it is widely deployed in mission-critical applications in government and the financial industry. Sybase IQ leverages the columnar approach and advanced compression techniques to rapidly produce insight across both structured and unstructured data sets.

On the cutting edge of database evolution, SAP now offers a columnar database and computation engine that is totally in-memory. SAP HANA is a flexible, multipurpose, in-memory data warehouse that combines SAP software components optimized on hardware provided and delivered by leading hardware vendors.8 SAP HANA is not merely a cache of a subset of relational database tables. Rather, the data warehouse is completely resident in RAM in columnar format, and the computational engine is resident in the same memory. Therefore, there is no need for disc storage, with the exception of backups and disaster recovery. The net result is speed. Simply put, SAP HANA is blazingly fast, up to 100,000 times faster than disc-based systems in many cases. To enable powerful real-time analysis, the in-memory data warehouse performs the slicing and dicing of data and runs rigorous predictive analysis without any I/O from data storage, in the same memory space as the data. It can process the in-memory data using numerous algorithms in the HANA Predictive Analysis Library (PAL), including K-Means, KNN, C4.5 Decision Tree, Linear Regression, Apriori, and Moving Averages, and it also makes available more than 3,500 open-source algorithms in the R programming language (another nexus between open source and our COTS solutions).

8 By working collaboratively with Intel Corporation, SAP has enabled HANA to leverage the full capabilities of the Intel Nehalem chip set, on hardware available from leading vendors that include the top manufacturers in the industry, such as Dell, IBM, Hewlett-Packard, Cisco, and Fujitsu.
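To make the columnar argument concrete, the toy sketch below (our own simplification, not how Sybase IQ or SAP HANA is implemented) contrasts a row-wise layout with a column-wise layout for the same records: an aggregate touches only the column it needs, and a low-cardinality column compresses naturally with a simple dictionary encoding.

```java
// Toy illustration of why a columnar layout favors analytics and compression.
// Not a real database engine; the data and encoding are simplified for clarity.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnarSketch {
    public static void main(String[] args) {
        // Row-oriented view: each record keeps all of its attributes together.
        String[][] rows = {
            {"sensor-17", "region-A", "412"},
            {"sensor-17", "region-A", "388"},
            {"sensor-22", "region-A", "501"},
            {"sensor-09", "region-B", "277"},
        };

        // Column-oriented view: each attribute is stored contiguously.
        String[] regionCol = {"region-A", "region-A", "region-A", "region-B"};
        long[]   valueCol  = {412, 388, 501, 277};

        // An aggregate against the row store must walk every full record.
        long totalFromRows = 0;
        for (String[] row : rows) {
            totalFromRows += Long.parseLong(row[2]);
        }

        // The same aggregate against the column store scans only the value column.
        long totalFromColumn = 0;
        for (long value : valueCol) {
            totalFromColumn += value;
        }
        System.out.println("total (row scan)    = " + totalFromRows);
        System.out.println("total (column scan) = " + totalFromColumn);

        // Low-cardinality columns compress well: store each distinct value once
        // plus a small integer code per row (a simplified dictionary encoding).
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> encodedRegion = new ArrayList<>();
        for (String region : regionCol) {
            encodedRegion.add(dictionary.computeIfAbsent(region, r -> dictionary.size()));
        }
        System.out.println("region dictionary      = " + dictionary);
        System.out.println("encoded region column  = " + encodedRegion);
    }
}
```

Production columnar engines apply far more sophisticated compression and vectorized, in-memory execution, but this access-pattern difference is the underlying reason the format suits ad hoc analytics over very large, wide data sets.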
SAP proposes that customers tackle Big Data with a hybrid solution including Hadoop and an Analytical Data Warehouse. Rather than asking data scientists to write detailed analyses using MapReduce to perform analytical processes on entire data files, an innovative approach would be to use the MapReduce code to deliver an ETL (Extract, Transform, and Load) of a large data domain to the in-memory, columnar data warehouse. The MapReduce process would sort and process the unstructured data, and then the data would be loaded into the in-memory data warehouse using SAP Data Services, a click-and-drag graphical user-interface tool for mapping data. At this point, the large data domain needed for fast, real-time analysis will be on hand in a structured data warehouse, in columnar format, with metadata context. Users can create powerful analytical views of the data in a self-service web client environment (discussed in the next section). This will reduce the number of iterative cycles imposed on the data scientists to tune and re-tune the analyses as required by end users. Ultimately, it will serve to reduce the time lag between the war fighters' analytical needs and their fulfillment.

The in-memory analytical data warehouse is uniquely qualified to be the target of the large result sets of the MapReduce process. It is optimized so that there is no bottleneck in reporting performance, even for extremely large data sets coming from a MapReduce job. The in-memory solution means that seek-time response is much shorter, even when analyzing the very large data sets that may result from the Hadoop MapReduce script. In fact, performance benchmarks routinely show sub-second response times, even for complex queries of millions of records. In sum, the combined solution reduces the operating costs of managing and storing Big Data while providing agility and speed for high-performance analytics.
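In the architecture above, the load itself is performed by SAP Data Services through its graphical mapping interface, so there is no code for that step in the paper. Purely as a stand-in illustration of the same idea, the sketch below bulk-loads structured MapReduce output (tab-separated keys and counts) into a hypothetical warehouse table over plain JDBC; the JDBC URL, credentials, table, and file path are assumptions.

```java
// Stand-in illustration only: load structured MapReduce output into an analytical
// warehouse table via plain JDBC batches. In the architecture described in the text,
// SAP Data Services performs this mapping and load through its graphical tooling.
// The JDBC URL, credentials, table, and input path below are hypothetical.
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class WarehouseLoadSketch {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:examplewarehouse://warehouse-host:30015/analytics"; // hypothetical
        String resultsFile = "/bigdata/results/sensor-event-counts/part-r-00000";  // MapReduce output

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "loader", "secret");
             BufferedReader reader = new BufferedReader(new FileReader(resultsFile))) {

            conn.setAutoCommit(false);
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO SENSOR_EVENT_COUNTS (SENSOR_ID, EVENT_COUNT) VALUES (?, ?)");

            int pending = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // MapReduce text output writes key<TAB>value per line.
                String[] fields = line.split("\t");
                if (fields.length != 2) {
                    continue; // skip malformed lines
                }
                insert.setString(1, fields[0]);
                insert.setLong(2, Long.parseLong(fields[1]));
                insert.addBatch();

                if (++pending % 10_000 == 0) {
                    insert.executeBatch(); // flush in chunks to keep memory bounded
                }
            }
            insert.executeBatch();
            conn.commit();
        }
    }
}
```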
Step 4: Add Business Analytics

Big Data is meaningless without the ability to pull out the right information, at the right time and in the right format, so that it materially impacts the mission. SAP BusinessObjects business intelligence (BI) solutions can operate directly against the data stored in the columnar or in-memory repository. BI solutions provide business users with both an analytics and a reporting framework for the data. They also allow end users to interface with existing applications and operational software such as Microsoft Office.

Powerful analytical capabilities can be exposed to end users through the SAP BusinessObjects web-based user interface. This interface is designed to empower subject matter experts who are not IT personnel and who have no coding or scripting skills. Importantly, these users do not need to understand the structure of the underlying data store to create effective ad hoc queries and visualizations. They will not have to burden the data scientists with running additional MapReduce queries, as the data will already reside in the in-memory data warehouse.

In traditional analytics, high data volumes require that assumptions be made during data modeling in order to reduce the data set to a manageable size. In classic Ralph Kimball-type data warehousing strategies, aggregations are often used to reduce the data set. But in national security missions, aggregation could mean that the transactional or atomic data element that is crucial for situational awareness is not present in the analytical repository. Such simplified models don't accurately reflect the multifaceted nature of operational data, often resulting in suboptimal forecasting, planning, trend analysis, or root cause analysis, all of which are critical to rapid response and situational awareness in the mission.

Combining the in-memory data warehouse with market-leading business intelligence tools provides a platform for extremely fast response times for highly complex ad hoc queries and analyses on very large data sets that were originally stored in a distributed file system like Hadoop.
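As an illustration of the kind of ad hoc, atomic-grain question a BI front end would generate on a user's behalf, the sketch below issues one such query over JDBC; the table, columns, and connection details are hypothetical, and in the described solution SAP BusinessObjects composes queries like this without the user writing any code.

```java
// Illustrative only: an ad hoc, atomic-grain analytical question of the kind a BI
// front end would generate. Table, columns, and JDBC URL are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AdHocQuerySketch {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:examplewarehouse://warehouse-host:30015/analytics"; // hypothetical

        // Aggregate on the fly from atomic records; nothing has been pre-summarized away.
        String sql =
            "SELECT REGION, SENSOR_ID, COUNT(*) AS READINGS, AVG(READING_VALUE) AS AVG_VALUE " +
            "FROM SENSOR_READINGS " +
            "WHERE READING_TS >= ? " +
            "GROUP BY REGION, SENSOR_ID " +
            "ORDER BY AVG_VALUE DESC";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "analyst", "secret");
             PreparedStatement query = conn.prepareStatement(sql)) {
            query.setTimestamp(1, java.sql.Timestamp.valueOf("2012-08-01 00:00:00"));
            try (ResultSet results = query.executeQuery()) {
                while (results.next()) {
                    System.out.printf("%s %s readings=%d avg=%.1f%n",
                            results.getString("REGION"), results.getString("SENSOR_ID"),
                            results.getLong("READINGS"), results.getDouble("AVG_VALUE"));
                }
            }
        }
    }
}
```

Because the warehouse retains the atomic records, the aggregation happens at query time; no earlier modeling decision has discarded the detail the analyst may need.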
Conclusion

The hybrid solution, combining both open-source and commercial technologies, can best solve the Big Data challenges faced by the defense and intelligence communities. End users, who are increasingly influenced by consumer applications, expect data to be provided at their fingertips with zero latency, and with the ability to proactively identify trends and conduct self-service discovery in a visually appealing environment. National security organizations will benefit from a hybrid solution that provides the ability to collect a staggering volume of data and analyze it with agility in real time, with lower costs and increased speed to insight.

But this is just the beginning. The in-memory computing platform for analyzing unstructured Big Data coming from Hadoop could also be the nexus for many other data sources in the DoD enterprise. Logistics, Order of Battle, Readiness, Force Generation, Human Resources, and Financial data all represent organizational functions that could be synthesized and powerfully analyzed using a similar in-memory approach, with real-time visibility.

Authors:

Bob Palmer
Senior Director, SAP National Security Services (SAP NS2)
bob.palmer@sapns2.com
301.641.7785

Dan Dorchinsky
Client Director, SAP National Security Services (SAP NS2)
dan.dorchinsky@sapns2.com
301.693.9000

2012. Copyright by SAP Government Support and Services. All rights reserved. May not be copied or redistributed without permission. SAP Government Support and Services, Inc. (SAP GSS), a Delaware corporation, is a wholly owned, independent US subsidiary of SAP and does business as SAP National Security Services (SAP NS2). SAP National Security Services and SAP NS2 are trademarks owned by SAP GSS. SAP NS2 offers the combined power of enterprise applications, analytics, database, cloud, and mobile software solutions from SAP and Sybase with specialized levels of security and support to meet the unique mission requirements of US national security and critical infrastructure customers. In addition to US national security customers, SAP NS2 also supports private companies such as defense contractors, telecom carriers, and major financial institutions that have specialized information assurance needs.

SAP, R/3, xApps, xApp, SAP NetWeaver, Duet, SAP Business ByDesign, ByDesign, PartnerEdge and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects S.A. in the United States and in several other countries. Business Objects is an SAP Company.
All other product and service names mentioned and associated logos displayed are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary. The information in this document is proprietary to SAP. This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments, and functionalities of the SAP product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this document is subject to change and may be changed by SAP at any time without notice. SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. SAP shall have no liability for damages of any kind, including without limitation direct, special, indirect, or consequential damages, that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence.