Leveraging People, Processes, and Technology

Generating the Business Value of Big Data: Analyzing Data to Make Better Decisions

Authors:
Rajesh Ramasubramanian, MBA, PMP, Program Manager, Catapult Technology
Roberto Berezdivin, Ph.D., Systems Architect, Catapult Technology

11 Canal Center Plaza, Floor 2
Alexandria, VA 22314
240-482-2100
www.catapulttechnology.com
Introduction

Big Data refers to large data sets whose size and disparity make it difficult, if not impossible, for relational database software tools to capture, store, manage, and analyze the data. Relational databases, which are designed for structured data, cannot handle the scale and agility challenges that face modern applications, nor were they built to take advantage of the relatively inexpensive, cloud-based storage and processing power that is now available. The platforms, tools, and software available to store, process, and analyze the large datasets of unstructured data prevalent today are collectively known as Big Data technologies.

As more and more companies incorporate efficient and scalable technology, data management and data storage are no longer the issue. Organizations generate data constantly through the use of the Internet, mobile applications, social media, internal documents, content, and automated processes. The solutions that the big Internet players use to process and analyze this voluminous data have been made publicly available by open-source software communities. Meanwhile, the advent of cloud-based solutions has dramatically lowered the cost of storage and processing. Virtual file systems, either open source or vendor-specific, have helped organizations transition from a managed infrastructure to a service-based approach. In addition, innovative designs for database management and cost-effective ways to support massively parallel processing have led to new products like NoSQL databases and the Apache Hadoop MapReduce platform. NoSQL databases were developed specifically to handle the massive data of today and to address the shortcomings of relational databases. Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on commodity hardware.
According to a recent study, 80-85 percent of global data that exists is unstructured, meaning that it has no pre-defined data model or is not organized in a pre-defined manner. It can come from such disparate sources as social media platforms (e.g., Facebook, Twitter); email; online purchases; online profiles; content management system footprint; and photos.
The large Internet players are already discovering great value in their data by identifying new customers, improving their products and service offerings, expanding their markets, and increasing profitability. The real questions for business now are: How do you put all this captured and stored data to valuable use? How do you analyze it to make better business decisions?

The 3 Vs

The 3 Vs that define Big Data are:

1. Volume
Currently there is exponential growth in data storage, as data is not just textual but comes in the form of videos, music, images, clickstream and blog content, often through social media channels. A recent projection by PayPal holds that every individual will generate over 20 petabytes of data over the course of his or her lifetime. (For context, a terabyte is 10^12 bytes of digital information; a petabyte is 10^15 bytes of digital information.) According to International Data Corporation (IDC), the digital universe will grow to 35 zettabytes (i.e., 35 trillion gigabytes, or 35 billion terabytes) globally by 2020. The point is, data is exploding. In response to this data boom and the ubiquity of the cloud, IT capital expenditure will decrease significantly as many organizations invest in data virtualization. At the same time, operating expenditure will increase as organizations move towards the use and exploitation of that data using cloud-based storage and processing solutions.

2. Velocity
The explosion of data is happening almost in real time, as people turn to social media for updates about what is occurring in the world around them. No one waits for news anymore; the delay before we are informed has shrunk to fractions of a second.
An interesting example was the 2011 earthquake southwest of Washington, D.C.; news of it reportedly reached some people via Twitter before they felt the tremor themselves.
As more and more data is produced, it must be collected in shorter timeframes. Therefore, organizations require tools and platforms for real-time processing of data in order to achieve, and maintain, a competitive advantage in the marketplace.

3. Variety
In the real world, data comes in different formats, from structured data (typically data contained in relational databases and spreadsheets with specific classification) to unstructured data, which can't be as neatly classified (e.g., videos, images, SMS, social media content, and PDFs).

Veracity (Value)
The accuracy, truthfulness, and quality of data are the most important aspects that fuel new insights into your organization and provide high value. The data that organizations collect is all about supporting the decisions that can have a major impact on the organization as a whole. Businesses are going to want as much quality information as possible to support the business case. Establishing trust in Big Data solutions probably presents the biggest challenge; but once overcome, it will provide a solid foundation for successful decision-making within your organization.

There is more data than ever from which business decisions can be made. According to a study done by Avanade, Inc., 46 percent of companies report they have made an inaccurate business decision as a result of bad or outdated data. In many cases, the data needed to make business decisions is not collected, and well-meaning managers end up guessing. It is therefore critical for organizations to address this issue and position themselves to react quickly to fast-changing business conditions.

For example, a user posts something on social media like, "I am interested in buying a new smart phone for my wife on her birthday." A smart phone manufacturer's data engineers who analyze this unstructured data can infer information about the shopper's interest, such as:

1. He is married;
2. He is looking for a smart phone; and
3. The phone will be used by his wife.
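The inference in the smart phone example above can be illustrated with a minimal rule-based sketch. This is not the method any particular manufacturer uses; the keyword patterns and the `infer_signals` helper are hypothetical, and a production pipeline would more likely rely on trained natural-language models than on hand-written rules.

```python
import re

# Hypothetical keyword rules, chosen only to illustrate the example post.
PRODUCT_PATTERN = re.compile(r"\b(smart\s?phone|tablet|laptop)\b", re.IGNORECASE)
INTENT_PATTERN = re.compile(
    r"\b(interested in buying|looking to buy|want to buy)\b", re.IGNORECASE
)
RECIPIENT_PATTERN = re.compile(
    r"\bfor my (wife|husband|son|daughter|mother|father)\b", re.IGNORECASE
)
SPOUSE_TERMS = {"wife", "husband"}


def infer_signals(post: str) -> dict:
    """Extract coarse purchase signals from one free-text post."""
    product = PRODUCT_PATTERN.search(post)
    intent = INTENT_PATTERN.search(post)
    recipient = RECIPIENT_PATTERN.search(post)
    recipient_word = recipient.group(1).lower() if recipient else None
    return {
        "purchase_intent": intent is not None,
        "product": product.group(1).lower() if product else None,
        "recipient": recipient_word,
        # Mentioning a spouse suggests, but does not prove, that the author is married.
        "likely_married": recipient_word in SPOUSE_TERMS,
    }


post = "I am interested in buying a new smart phone for my wife on her birthday"
signals = infer_signals(post)
print(signals)
```

Each key in the returned dictionary corresponds to one of the inferences listed above. A post that matches none of the patterns simply yields empty signals rather than an error.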
In addition, if he is a previous or current customer, the phone manufacturer can pull his profile and target him with options better suited to him than those of competitors. Harnessing this kind of unstructured data will help increase the phone manufacturer's sales and revenue and target the customer with better products. Imagine that this kind of information is posted by users across various social media. The volume of information available for organizations to analyze helps companies better target their customers and increase their market reach.

While organizations can acquire a negative online reputation, data can be leveraged as a corrective. For example, a passenger is traveling from one city to another by bus. The bus breaks down on the way to its destination. The passenger takes pictures of the incapacitated bus and tweets those images with complaints about the bus breaking down. Smart data mining by the bus line's data analysis team could provide this information to the customer service department in the form of alerts. Customer service can then return a tweet that apologizes for the inconvenience, assures the passenger of a fast repair, and promises better service. By offering a free ticket back from the trip's original destination, or some other accommodation, the bus line can also rebuild goodwill and fortify its customer retention strategy. The bus line's Big Data solution has mined unstructured data to return an actionable solution.

Currently, the challenge that businesses face is to transform raw data into meaningful information and provide actionable insights for better business decision-making. Basically, organizations that mine their data warehouses, transactional systems, and the social media footprints of their customers can benefit by discovering the preferences of their customers. They can establish a meaningful relationship between customer segments and product segments with a higher degree of correlation.
The diagram below encompasses Big Data Management, the technology used, and the benefits for an enterprise:

Technology Implementation: A Case Study

A Catapult customer implemented a new web portal and wanted to answer basic questions, such as "How many people visited the portal?" and "On average, how much time did people spend on the portal?" Catapult leveraged Apache Hadoop, an open-source platform for Big Data. "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage" (Apache Hadoop).

In order to answer the customer's business questions, Catapult leveraged web server access logs and an HTTP request log. Catapult used the Hadoop Distributed File System (HDFS), which splits data into smaller blocks distributed across the cluster for easier processing, and wrote a MapReduce program to identify the unique values (visitors) based on the IP address. (In MapReduce, the map job breaks the input down into individual data elements expressed as key/value pairs; the reduce job takes the output from a map as input and combines those pairs into a smaller data set.) The session ID plays a critical role in the mining of the web logs: the session information showed how long visitors spent on the portal. The MapReduce program computed this with the cluster running in fully distributed mode.
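The map and reduce steps described above can be sketched in plain Python. This is a single-process illustration of the programming model, not Catapult's actual program and not Hadoop API code. The log line layout (`<ip> <session_id> <unix_timestamp> <path>`), the sample data, and the function names are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log lines: "<ip> <session_id> <unix_timestamp> <path>".
LOG_LINES = [
    "10.0.0.1 s1 1000 /home",
    "10.0.0.1 s1 1090 /search",
    "10.0.0.2 s2 2000 /home",
    "10.0.0.2 s2 2030 /docs",
    "10.0.0.1 s3 5000 /home",
]


def map_phase(line):
    """Emit (key, value) pairs for one log line."""
    ip, session_id, timestamp, _path = line.split()
    yield ("ip", ip), 1
    yield ("session", session_id), int(timestamp)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _key, value in group]


def reduce_phase(key, values):
    """Combine all values for one key into a single, smaller result."""
    kind, _name = key
    if kind == "ip":
        return key, sum(values)  # number of requests from this IP
    # Session duration in seconds; a single-request session counts as 0.
    return key, max(values) - min(values)


mapped = [pair for line in LOG_LINES for pair in map_phase(line)]
reduced = dict(reduce_phase(key, values) for key, values in shuffle(mapped))

unique_visitors = sum(1 for kind, _name in reduced if kind == "ip")
durations = [value for (kind, _name), value in reduced.items() if kind == "session"]
average_seconds = sum(durations) / len(durations)

print(unique_visitors)   # 2
print(average_seconds)   # 40.0
```

On a real cluster the framework runs many map and reduce tasks in parallel and performs the grouping step itself; the `shuffle` function here only imitates that behavior so the example can run on one machine.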
Finally, the parsed log files were stored as text files in HDFS and loaded into a Hive data warehouse. (Built on top of Hadoop, Hive provides data summarization, query, and analysis.) By writing SQL-like Hive queries, Catapult got answers to the above business questions. With this information, the customer targeted the user base with appropriate application design and a better user experience, which led to more quality information.

Security and Privacy

The privacy of data is another huge concern, and one that increases in the context of Big Data. Organizations should understand that managing data privacy is both a technical and sociological issue and that a solution should be addressed from both perspectives. As enterprises become more and more dependent on data to drive business decisions (whether the data is available publicly or through internal collection processes), they face the risk of inaccurate, incomplete, and fraudulently manipulated data. In order to avoid these risks, organizations need to verify and validate all the data sources they analyze and use tools and processes to check for vulnerabilities. Enterprises should have a proper Big Data governance process in order to avoid misleading data and the additional unexpected costs associated with it. Implementing adequate controls through the governance process ensures that the information that businesses depend on is accurate, consistent, and of good quality. In addition, data governance must be measured at three distinct levels:

1. At the program level, at which the organization identifies and highlights the qualitative level and the impact the data governance process delivers;
2. At the operational level, at which the organization monitors how data is behaving against the company's set policy and baseline; and
3. At the quantitative level, at which the organization measures the effectiveness and efficiency of data management results, assessing quantitative business values like revenue growth, cost savings, risk reduction, internal processes, and customer retention.

For example, as part of a data analysis contract with the Department of Transportation (DOT) Pipeline and Hazardous Materials Safety Administration (PHMSA), Catapult provided data management activities
aligned to the agency's data management policy. The policy identifies roles and responsibilities for data owners, stewards, and managers, as well as rulemaking impacts on collected data sets and data management procedures. As part of the agency's data governance effort, Catapult contractors developed comprehensive data policies, standards, and procedures and monitored and enforced conformance with those data policies, standards, and architecture. In addition, Catapult contractors manage and resolve data-related issues and communicate and promote the value of data assets within the agency.

Conclusion

Through Big Data analytics, the potential has never been greater to optimize business processes, to drive product and service innovation, and to enable enterprise controls. By leveraging Big Data analytics, Catapult Technology can help your organization:

Measure the incremental cost of managing and analyzing unstructured data sets against the incremental benefits gained over and above what can be achieved using structured data sets.
Develop a data culture in which the management, employees, and strategic partners are active participants in managing a meaningful data lifecycle.
Harness new sources of information and take responsibility for accurate data creation, dissemination, data governance, quality, and maintenance.
Enable businesses to turn data from information into actionable insights.

Catapult's Big Data consultants are adept at:

Collecting, cleaning, and integrating unstructured data from multiple sources, while creating a road map that helps organizations realize their business value by deriving greater insights from their data.
Developing a migration strategy, creating prototypes, and engaging in full-fledged deployment of Big Data solutions.
Accommodating privacy, security, and data governance aspects of Big Data.
Translating Big Data analytical findings into appropriate risk management and marketing strategies that drive business value.
Hadoop/HDFS, MapReduce, HBase, Pig, NoSQL data stores (Cassandra, MongoDB).

New businesses are emerging that harvest Big Data and combine data and analytics services. Disruptive change is being implemented across industries both horizontally and vertically.

Contact Catapult Technology so we can help your organization take advantage of Big Data technologies and build a culture that infuses analytics everywhere! Call 240-482-2100 or email info@catapulttechnology.com

References:

McKinsey Global Institute: Big data: The next frontier for innovation, competition, and productivity
The White House Big Data and Privacy Review Report: http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
Daniel Austin, Principal Architect at PayPal: http://www.kdnuggets.com/2014/04/big-datainnovation-summit-2014-highlights-day1.html
Apache Hadoop: http://hadoop.apache.org/
Avanade Inc., Global Survey: The Business Impact of Big Data, 2010: http://www.avanade.com/en-us/approach/research/pages/big-data.aspx#
International Data Corporation (IDC): www.idc.com
09/02/14 QP1560-106