How Big Data Can Lie

Ensuring Accuracy in Predictive Analytics

A White Paper

WebFOCUS | iWay Software | Omni

Table of Contents

Introduction
Big Data, Big Problems
  Variety
  Velocity
  Volume
Big Data Analytics: The Wrong Approach
Taking the Risk Out of Big Data Analytics
Conclusion

Introduction

To exploit information assets, organizations need to do more than just analyze historical data. Decision-makers must be able to accurately anticipate future events, behaviors, and conditions. That's why the worldwide market for predictive analytics is expected to experience record growth in the near future. Valued at over $2.08 billion in 2012, the market is expected by Research and Markets to expand extremely strongly in the coming years, at 17.8 percent annually.1

Traditionally, predictive analytic solutions have been applied to small sample data sets, where success comes quickly because the information is easy to clean and control. The trend toward big data, however, is changing all that.

Big data is about more than just storing massive volumes of information. It also creates tremendous opportunities through the use of predictive analytics. According to Forrester's Mike Gualtieri, "With the rise of big data, the predictive analytics market has woken up; firms now understand the opportunity to use big data to increase their knowledge of their business, their competitors, and their customers. Firms can use predictive analytics models to reduce risks, make better decisions, and deliver more personal customer experiences."2

Big data can also pose many problems when it comes to predictive analytics. It's no secret that the accuracy of the results delivered by predictive analytic solutions relies heavily on the integrity of the underlying data. When predictive analytics are applied to big data that is dirty or corrupt, that big data won't tell the truth. The key challenge is the very nature of big data, which makes it highly prone to quality problems. To promote the highest levels of accuracy in big data environments, predictive analytics strategies must include dynamic data collection, integration, and quality management.
This type of decision management, where the end-to-end process of gathering, consolidating, cleansing, and analyzing data is fully automated, is vital for enabling effective decision-making in big data environments. In this white paper, we will highlight how combining advanced analytical solutions, such as predictive analytics, with data integration and integrity technologies can facilitate successful analytics and decision management in big data scenarios. We'll demonstrate how, together, these technologies can promote the integrity of analytic outputs, minimize the risk of invalid or inaccurate predictions and business decisions, and ensure that big data doesn't lie.

1 "Predictive Analytics Market," Transparency Market Research, November
2 Gualtieri, Mike. "Big Data Predictive Analytics Solutions," Forrester Research, Q

Big Data, Big Problems

Big data is often described using the three Vs: variety, velocity, and volume. However, these very properties can increase the probability of poor information integrity and hinder successful analytics and decision management. Alone, each V may have a small impact on data quality. But when combined, they create a multiplier effect that dramatically jeopardizes information validity and accuracy. This, in turn, negatively impacts the decisions that are based on the analytic results.

Variety

Big data is derived from a multitude of sources, all of which must be collected and consolidated into a single data set for analysis. Different levels of aggregation, or the different time periods covered by big data, can pose a problem; matching and merging this misaligned data can lead to analysis that delivers grossly inaccurate results. Unlike conventional data sources, big data often contains much unstructured information, such as consumer feedback, e-mails, spreadsheets, documents, and data collected from social media sites. How this data is integrated with structured data sets for analysis and decision-making is critical because of the ambiguity and input errors frequently encountered.

Full integration, combined with master data management (MDM) and data governance, is needed to overcome the problems created by variety. These technologies ensure that information is analyzed and understood from all aspects and perspectives, while preventing false correlations. Such correlation errors can lead to disastrous decisions; in medical data, for example, false correlations can lead to improper treatment recommendations and inaccurate estimates of treatment efficacy. In a recent Wired article based on his book Antifragile, author Nassim N. Taleb calls these false correlations the tragedy of big data. Falsity occurs because big data means anyone can find fake statistical relationships, since the spurious rises to the surface.
"This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). This is the tragedy of big data: The more variables, the more correlations that can show significance. Falsity also grows faster than information; it is nonlinear (convex) with respect to data. Noise is antifragile." Source: N.N. Taleb.
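Taleb's point is easy to reproduce: generate enough purely random variables, and some pairs will appear correlated by chance alone. The following sketch is illustrative only; the sample sizes and the 0.15 "looks interesting" threshold are arbitrary choices, not figures from the paper:

```python
import random

random.seed(7)

# 200 observations of 60 completely independent "variables" -- pure noise.
n_obs, n_vars = 200, 60
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def corr(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Count variable pairs whose sample correlation looks "strong" by chance.
spurious = sum(
    1
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
    if abs(corr(data[i], data[j])) > 0.15
)
print(f"{spurious} of {n_vars * (n_vars - 1) // 2} noise pairs look correlated")
```

Even though every variable here is pure noise, dozens of the 1,770 possible pairs clear the threshold, exactly the "spurious rising to the surface" the quote describes. Adding more variables grows the number of pairs quadratically, while the information content stays at zero.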

Velocity

Big data is often generated at a rapid pace and can change just as quickly. This can be largely attributed to the growing mobile trend, where smartphones and tablets now collect and transmit data on a near-continuous basis. This data, from mobile applications as well as from social networks and other sources, flows in real time, creating as many opportunities as it does risks.

Speed, when combined with the massive amounts of data being transmitted and processed, can lead to a high number of inaccuracies and errors. Even worse, this data can have a viral effect, as it spreads very quickly. Bad information can have a particularly detrimental impact if the inputs are used in calculations for derivative measures. Imagine the potential impact of common mistakes such as transcription errors, or transposition errors and typos, where numbers are entered in the wrong order. If the database a telecommunications company uses to calculate its service rates contains even a small error, hundreds of thousands of inaccurate statements and bills can be distributed to customers. Or, if an incorrect number of manufacturing parts or components is ordered, a company could experience significant production delays. But nowhere are such problems more prevalent, or more potentially harmful, than in the healthcare industry. In one example, simple data entry mistakes in the NHS database resulted in records indicating that there are 20,000 pregnant men in Britain.3

Only fully automated data quality management (DQM) can combat the problems caused by data velocity. When data is generated, modified, and shared at such incredible speed, information integrity must be managed proactively. Fast-moving, dirty data must be stopped before it enters an environment and pollutes other systems during data feed, update, and information delivery processes.

Volume

The larger the data set, the more likely analysts are to overlook mistakes.
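Catching mistakes like the NHS "pregnant men" records by hand does not scale, but the kind of proactive, automated screening described above can. A minimal sketch of validation rules run against every incoming record; the field names and rules here are hypothetical, for illustration only:

```python
# Each rule pairs a human-readable name with a predicate that flags bad records.
# These rules and field names are hypothetical examples, not a product API.
RULES = [
    ("impossible pregnancy", lambda r: r.get("sex") == "M" and r.get("pregnant")),
    ("negative age",         lambda r: r.get("age", 0) < 0),
    ("missing patient id",   lambda r: not r.get("patient_id")),
]

def screen(record):
    """Return the names of every rule the record violates; [] means clean."""
    return [name for name, violated in RULES if violated(record)]

records = [
    {"patient_id": "A1", "sex": "F", "age": 34, "pregnant": True},   # clean
    {"patient_id": "A2", "sex": "M", "age": 41, "pregnant": True},   # entry error
    {"patient_id": "",   "sex": "F", "age": -3, "pregnant": False},  # two errors
]

for rec in records:
    problems = screen(rec)
    if problems:
        print(rec["patient_id"] or "<no id>", "->", ", ".join(problems))
```

Run at the point of data entry or during the data feed, checks like these stop dirty records before they pollute downstream systems, which is exactly the proactive posture automated DQM takes.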
Volume makes it difficult to screen for errors in a traditional way. There is simply too much information; manually browsing records and eyeballing data points to locate and correct inaccuracies is nearly impossible. As data volumes grow, automated tools are needed to uncover the dirty needles in the massive data haystack. The errors caused by volume become even more harmful when analysts and statisticians apply sampling, because small data sets pulled from massive data stores make it easier to miss outliers, the unusual or extraordinary events or values that may indicate a problem.

These volume-related data quality issues can also cause massive disruptions to business, with the financial services industry being hit particularly hard in recent times. Last year, a glitch in Knight Capital Group's algorithmic trading software led to large price swings on the New York Stock Exchange; total losses to the company were estimated at nearly $440 million.4 An algorithm error was also to blame for a similar stock market crash of close to 1,000 points in May.

Also consider loans in a personal banking scenario. In an Institute of Mathematical Statistics paper, the authors explain, "banks accept those loan applicants whom they expect to repay the loans. For such people, the bank eventually discovers the true outcome (repay, do not repay), but for those rejected for a loan, the true outcome is unknown: it is a missing value. This poses difficulties when the bank wants to construct new predictive models (Hand and Henley, 1993; Hand, 2001). If a loan application asks for household income, replacing a missing value by a mean or even by a model-based imputation may lead to a highly optimistic assessment of risk."5

Because of the potential for problems created by the three Vs, ensuring the quality of big data with integration, DQM, MDM, and data governance technologies is critical. Larry English, developer of the Total Information Quality Management (TIQM) methodology, claims that poor data quality can cost companies 15 percent to 25 percent (or more) of their operating budget.6

3 Bates, Claire. "Why There Are 20,000 Pregnant Men in Britain," Associated Newspapers Ltd, April
4 Olds, Dan. "How One Bad Algorithm Cost Traders $440m," The Register, April
5 De Veaux, Richard D., and Hand, David J. "How to Lie With Bad Data," Statistical Science, August
6 English, Larry. "Data Quality Trends, With Expert Larry English"
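The loan example above is easy to demonstrate numerically. In the sketch below, applicants with low incomes are the ones who tend to leave the income field blank; filling those blanks with the mean of the reported values makes the portfolio look far safer than it is. All of the numbers and the toy risk model are hypothetical, chosen only to illustrate the mechanism:

```python
import random

random.seed(1)

# Hypothetical applicant pool: income in $000s. Lower earners are assumed
# to omit the income field, so the missingness is NOT at random.
applicants = [max(10, random.gauss(60, 20)) for _ in range(10_000)]
reported = [x if x > 40 else None for x in applicants]

observed = [x for x in reported if x is not None]
mean_income = sum(observed) / len(observed)

# Mean imputation: fill every blank with the average of the observed values.
imputed = [x if x is not None else mean_income for x in reported]

def default_risk(income):
    """Toy model: probability of default falls as income rises."""
    return min(1.0, 30 / income)

true_risk = sum(map(default_risk, applicants)) / len(applicants)
imputed_risk = sum(map(default_risk, imputed)) / len(imputed)
print(f"true portfolio risk:   {true_risk:.3f}")
print(f"mean-imputed estimate: {imputed_risk:.3f}")
```

Because the missing incomes are systematically low, the imputed data set understates risk for precisely the applicants most likely to default, the "highly optimistic assessment of risk" the statisticians warn about.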

Big Data Analytics: The Wrong Approach

According to Forrester's Mike Gualtieri, big data is not defined by how you measure data in terms of volume, velocity, and variety. Instead, he claims, in order to be actionable and to enable users to uncover the non-obvious nuggets of insight that can be used to improve business outcomes, big data must be all about how massive volumes of information are captured and stored, processed (cleansed, enriched, and analyzed), and accessed (retrieved, searched, integrated, and visualized).7

Many organizations mismanage their big data in a variety of ways, including:

Underutilization
In many big data scenarios, much of the data is hardly even used. It is either archived in Hadoop or a similar big data repository to reduce storage costs, or it is cast aside as "dark data" because organizations simply do not know what to do with it. Some experts have compared big data stores to closets in a house: items get shoved in there in case they are needed in the future, but are rarely ever taken out again. Even dark data has great potential to provide valuable insight, but companies shy away from analyzing it because they don't know where to start. An analytic strategy to tap into it and glean intelligence is important.

Rushing Analytic Strategies
Companies with big data want to jump quickly on the analytics bandwagon. Rushing to mine big data, however, can yield fool's gold, and false trends pose the most significant risk. Take Simpson's Paradox, for example. This anomaly in probability and statistics, according to Wikipedia, takes place when "a trend that appears in different groups of data disappears when the groups are combined, and the reverse trend appears for the aggregate data." "Simpson's Paradox is responsible for a vast quantity of misinformation," says Xiao-Li Meng, chairman of Harvard University's statistics department, in a recent Wall Street Journal article. "You can easily be fooled."
Meng also notes that examples of Simpson's Paradox are highly evident in medical studies and are only uncovered when someone digs deeper into the research.8 He cites a 1986 British study of kidney-stone treatments as an example. The authors indicated that traditional surgery successfully removed or substantially shrank the stone in 78 percent of cases, compared to less-invasive surgical procedures, which were successful 83 percent of the time. However, because the treatments were applied to different types of kidney stones, with the less-invasive approach used more often on smaller stones, which are easier to treat, the results were misleading.

When standalone data discovery tools are used, the risk of Simpson's Paradox, as well as many other problems, can increase exponentially. This is because the underlying data may be corrupt or invalid, with no means of identifying and correcting it.

7 Gualtieri, Mike. "The Pragmatic Definition of Big Data," Forrester Research, December
8 Tuna, Cari. "When Combined Data Reveal the Flaw of Averages," The Wall Street Journal, December
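The reversal in the kidney-stone study can be reproduced in a few lines, using the success counts commonly quoted from that 1986 study (per treatment and stone size); the figures below are those commonly cited, not taken from this paper:

```python
# (successes, total) per treatment and stone size, as commonly quoted
# from the 1986 British kidney-stone study.
open_surgery  = {"small": (81, 87),   "large": (192, 263)}
less_invasive = {"small": (234, 270), "large": (55, 80)}

def rate(successes, total):
    return successes / total

# Within each stone size, traditional open surgery has the HIGHER success rate.
for size in ("small", "large"):
    a, b = open_surgery[size], less_invasive[size]
    print(f"{size} stones: open {rate(*a):.0%} vs less invasive {rate(*b):.0%}")

# Aggregate over both stone sizes and the comparison flips: 78% vs 83%.
agg_open = tuple(map(sum, zip(*open_surgery.values())))    # (273, 350)
agg_less = tuple(map(sum, zip(*less_invasive.values())))   # (289, 350)
print(f"overall: open {rate(*agg_open):.0%} vs less invasive {rate(*agg_less):.0%}")
```

Open surgery wins within each subgroup, yet loses in the aggregate, because the less-invasive procedure was applied far more often to the easy (small-stone) cases. An analyst who only sees the combined totals draws exactly the wrong conclusion.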

Think back to Red Lobster's disastrous Endless Crab promotion. Thanks to false trends, local area managers vastly underestimated the number of helpings one customer could eat in a sitting. The company failed to take into account the weight of crab, which is far lighter than other foods the chain serves. Rising wholesale prices due to seasonal changes exacerbated the problem, making it nearly impossible to profit from the $22.99 all-you-can-eat price. In a New York Post article, company chairman Joe R. Lee said, "It wasn't the second helping, it was the third that hurt." The company lost millions due to less-than-precise calculations about how many times diners would come back to refill their plates. This dramatically reduced the stock value of Red Lobster's parent company, Darden Restaurants, and led to the replacement of company president Edna Morris.9

The flaw of averages is another potential issue. In his book of the same name, author Sam Savage claims that plans based on assumptions about average conditions usually go wrong. In a Harvard Business Review article, he cites the example of a pharmaceutical company selling a perishable antibiotic. Although demand for the drug varies, approximately 5,000 units are moved in the average month. If this number is used to set monthly inventory levels, without considering fluctuations, the company will incur excessive spoilage costs when monthly demand is less than the average, or extra air-freight expenses when demand is greater than the amount stocked.10

Furthermore, sampling, or analyzing just a small subset of the big data, may produce results that lack sufficient records or observations for important segments. Imagine a company with thousands of product categories: if it takes a random sample of 1,500 product records, the observations will be lacking in those categories from which few or no records were pulled.
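Savage's antibiotic example can be sketched as a short simulation. Plugging the 5,000-unit average month into the plan predicts zero extra cost, but averaging the cost over months that actually fluctuate does not, because cost is incurred on both sides of the average. The demand spread and the per-unit spoilage and air-freight costs below are hypothetical, chosen only to illustrate the asymmetry:

```python
import random

random.seed(42)

AVG_DEMAND = 5000   # units moved in an average month (from the example)
STOCK = AVG_DEMAND  # the naive plan: stock exactly the average
SPOILAGE = 3.0      # hypothetical $ lost per leftover (perished) unit
AIRFREIGHT = 5.0    # hypothetical $ per unit rushed in when demand exceeds stock

def monthly_cost(demand, stock=STOCK):
    if demand < stock:
        return (stock - demand) * SPOILAGE    # leftovers spoil
    return (demand - stock) * AIRFREIGHT      # shortfall is flown in

# Plan evaluated at the average month predicts no extra cost at all...
print("cost in the 'average' month:", monthly_cost(AVG_DEMAND))

# ...but the average cost over fluctuating months is decidedly nonzero.
months = [random.gauss(AVG_DEMAND, 1000) for _ in range(12_000)]
avg_cost = sum(monthly_cost(d) for d in months) / len(months)
print(f"average monthly cost with fluctuating demand: ${avg_cost:,.0f}")
```

The plan based on the average predicts zero cost, yet every month that deviates in either direction incurs a penalty, so the expected cost is strictly positive. This is the flaw of averages in miniature: the cost of the average scenario is not the average of the costs.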
Poor Data Collection, Consolidation, and Quality
All of the issues highlighted here are made even worse when data collection and data quality are not taken into account, as data quality problems exist in any type of information architecture. As highlighted earlier in this paper, however, these issues can be far greater and far more damaging in big data scenarios. A lack of automated integration in big data environments can leave the information available for analysis incomplete, while a lack of automated data quality management can hinder information accuracy, resulting in incorrect analytical outcomes.

9 Tharp, Paul. "Endless Crab Pigout Is End for Red Lobster Boss," The New York Post, September
10 Savage, Sam. "The Flaw of Averages," Harvard Business Review, November

Taking the Risk Out of Big Data Analytics

Organizations looking to successfully leverage big data analytics for effective decision management need to apply a combination of data integration, data integrity, and business analytics technologies. This will ensure the highest levels of quality and accessibility for BI and big data analytics applications and tools, and enable stakeholders at all levels to locate and exploit valuable insights in large sets of data, regardless of their source.

"In an age where every pixel can be tracked and measured, the challenge isn't having the data or accessing it, but making sense of it all. For companies interested in learning what's working and what's not, sorting through mountains of data in order to see those insights is a promise that's hard to ignore," says Uri Bar-Joseph of Search Engine Watch.11

To ensure big data is fit for use, an organization must determine which quality attributes (validity, variety, etc.) will be measured and what it considers an acceptable level of information accuracy, timeliness, and completeness. Furthermore, it must determine which approaches to quality management will be used, e.g., whether issues will be fixed at the source or located and cleansed downstream, and whether those methods will remain relevant in its big data scenario, such as when data flows from underlying processes.

Data governance tools, which enhance enterprise data management to improve its availability, usability, integrity, and security, can serve as the backbone of a sound data governance program. This program must also include a governing body or council, a defined set of procedures, and a plan to execute those practices and policies. An owner, often called a data custodian, must be identified to oversee the entire effort. Data quality management and master data management solutions are also important for enabling data synchronization, standardization, and cleansing.
This will eliminate existing database mistakes, such as missing, incomplete, or invalid entries, while enriching information to achieve optimum completeness, quality, and consistency. Many companies that rely on big data analytics have avoided this approach because of the time and cost involved. However, big data analytics doesn't have to be expensive and time-consuming. A robust analytics platform, complete with well-integrated tools for data integration, DQM, MDM, and data governance, can empower everyone to work together to overcome big data analytics challenges.

Information Builders offers a full suite of complete, fully integrated solutions for big data collection, cleansing and correlation, and usage:

Big Data Collection
The iWay Integration Suite can access and consolidate every kind of information, whether it's needed in real time or for historical purposes. Part of the iWay Information Asset Management Platform (IAMP), the Integration Suite's broad data reach includes unstructured data, such as customer feedback and social media streams; cloud-based data from web services or API queries; structured data from ERP, CRM, legacy, and other systems; and sensor data, such as RFID or UPC scans and utility gauge readings.

11 Bar-Joseph, Uri. "Big Data = Big Trouble: How to Avoid 5 Data Analysis Pitfalls," Search Engine Watch, August

Big Data Cleansing and Integration
Before big data can be used effectively, it must be clean. However, it's inefficient to clean data as it's being used or after it has already been moved into giant data stores. The iWay Data Quality Suite, also part of the iWay Information Asset Management Platform, creates a data quality firewall that preserves information accuracy and validity by proactively preventing dirty data from entering enterprise systems. iWay also provides master data management capabilities through the iWay Master Data Suite to help combine information from multiple system types and enable full data governance, so big data can be closely managed, even from mobile devices.

Big Data Usage
Once big data has been gathered, consolidated, and cleansed, the WebFOCUS business intelligence (BI) and analytics platform can deliver information to users in a way they can understand. From predictive analytics (WebFOCUS RStat) and sophisticated visualization (WebFOCUS Visual Discovery) for analysts and power users, to dashboards and InfoApps, simple apps that answer specific business questions or solve certain problems, Information Builders has a solution to meet the big data analysis needs of any type of user. Information Builders further supports decision management with WebFOCUS Magnify's enterprise search capabilities, so users can easily locate any type of big data, whether it's structured, unstructured, or existing reports and analyses.

Minimize Total Cost of Ownership
Best of all, Information Builders offers enterprise-level agreements, which dramatically lower total cost of ownership. InfoApps and other forms of self-service analytics can be made available to an unlimited number of users, inside or outside the organization, without any additional licensing costs.
These savings, combined with the money saved by eliminating the expenses that result from bad decisions based on erroneous big data, make Information Builders the most cost-effective BI and analytics solution available.

Conclusion

If not managed properly, big data will often lie. Variety, velocity, and volume, the three characteristics used most often to describe big data, put large volumes of enterprise information at serious risk for quality problems. Those issues, if not proactively addressed with the right tools and technologies, will lead to incorrect analyses that negatively impact decision-making.

Organizations that wish to leverage their big data with advanced analytics need to rectify quality problems to ensure peak accuracy in their analytic outputs and enable effective decision management. Information Builders solutions for data integration and data integrity promote the highest levels of completeness and correctness in big data environments. With Information Builders, big data can be dynamically collected, aggregated, cleansed, and analyzed to eliminate the risk of invalid predictions, forecasts, and other types of analyses.

Corporate Headquarters: Two Penn Plaza, New York, NY

Copyright 2014 by Information Builders. All rights reserved. All products and product names mentioned in this publication are trademarks or registered trademarks of their respective companies.