Big Data 101: Harvest Real Value & Avoid Hollow Hype
Executive Summary

Odds are you are hearing the growing hype around the potential for big data to revolutionize our ability to assimilate and act on information. It is equally probable that you are struggling with the challenges of crafting, and perhaps even executing, a strategy to capitalize on big data opportunities. As recently as 2000, only 25 percent of the world's information was digital; today, 98 percent of the world's information is digital [1]. With this ever-increasing diversity and abundance of data (1,200 exabytes' worth [2]) bursting from the digital age, your ability to harvest real value from big data and avoid the pitfalls of hollow hype will determine your organization's success.

The big data market is poised to reach $16.9 billion by 2015, and the broader market of business analytics solutions is forecast to reach $50.7 billion in 2016. Yet only four percent of the 400 global companies surveyed by Bain & Company in 2013 believed that they were converting their investments in big data tools into meaningful business insights that improve decision making and financial performance [3].

From Atigeo's customer implementation experience, we believe success depends on your approach. Big data requires adoption of revolutionary technology that evolves faster than most companies can keep pace with. However, many companies still attempt to use traditional IT planning, where migration to a new paradigm is slow and technology components are adopted in piecemeal fashion. This approach takes several years to complete and offers no guarantee of ROI until the new solution is in production. By that time, it is very difficult to iterate to improve results or even to change course.

This whitepaper provides suggestions on how to select big data analytic solutions for your enterprise, introduces Atigeo's xpatterns platform, and provides xpatterns deployment examples.

The U.S. will face a deficit of over 1.5 million data analysts [4] needed to help bridge the gaps.
This shortage is already triggering a cascade of failed attempts at big data analytics using traditional approaches. Meanwhile, data growth already outstrips the ability of people and 20th-century technology to make sense of it all.

Success in big data is no longer about data collection or data hoarding, which commodity storage makes easy for any enterprise to implement. The real return on any big data investment depends on analytical performance. This will determine how enterprises deliver differentiating, actionable insights and useful applications for end users (internal or external to the enterprise). Data itself is not useful unless it is applied correctly to solve real business problems. Designing a great product has not changed even though data availability has; it is still about knowing and understanding users' needs. Most important is correctly identifying when big data solutions are needed instead of conventional approaches.

[1] The Rise of Big Data, Foreign Affairs, June 2013
[2] The Rise of Big Data, Foreign Affairs, June 2013
[3] Big Data, Big Choices, Bain & Co., November 2013
[4] Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, May 2011
Here are three key questions enterprises should ask about their end users:

1. Does the end user understand the difference between BI and advanced analytics?
2. Does the end user need to control the output (expert knowledge) or let the data identify insights without expert knowledge?
3. Does the end user care about the exactness of the output as a trade-off against detail-level insights?

The figures below show how enterprises should determine the type of analytics that should be applied to their big data to satisfy end users' needs.

There are situations where optimizing an existing business intelligence solution has value, but a transition to a big data approach adds new layers of insight. For example, a growing concern in healthcare is population health management (PHM), which is concerned with the health of individuals in a group and with how health outcomes are distributed within that group. A hospital can use conventional BI techniques on insurance claims data to generate historical reports of health outcomes. However, a big data approach based on the same data can produce forward-looking predictive analytics as well as more detailed inferences. This type of result would be valuable in PHM and in many other enterprises, but it represents a big change from the conventional, historically focused approach. Thus, big data analytics greatly enhance BI, and enterprises should carefully examine the impact and potential of metric and reporting changes in order to adopt them.

At Atigeo, we recommend that enterprises build big data solutions through an iterative process, improving analytics models over time and providing compelling evidence of improvement in order to gain end-user trust. This is a key example of why Atigeo's xpatterns is an ideal platform for enterprises to build models that automatically learn over time.

Introducing Atigeo xpatterns

What is xpatterns?
As the only "end to end" big data analytics platform available today, xpatterns allows you to utilize your existing resources with a secure, enterprise-ready system that requires no datacenter build-out. With xpatterns you can seamlessly create a scalable, private virtual cloud, and through xpatterns' patented collection of intelligent algorithms you can access all of your data in real time, leading to measurably better and faster answers.
xpatterns has a novel architecture that integrates state-of-the-art components across three logical layers: Infrastructure, Analytics, and Applications. xpatterns can act as a virtual abstraction layer across any IT system, extracting value from both legacy and new technologies and immediately extending the life, value, and intelligence of an ecosystem.

The Infrastructure layer offers remarkably fast integration without requiring a costly data warehouse implementation. It can quickly adapt to new technologies, allowing you to leverage and extend existing IT investments. xpatterns delivers managed cloud services and safeguards the privacy of your data, satisfying a broad range of regulatory, financial and legal requirements.

The Analytics layer consists of a wealth of proprietary advanced analytics algorithms that automatically build the best model for the questions being asked. This unique ability is made possible through the xpatterns Cooperative Distributed Inferencing (CDI) engine. In addition, xpatterns learns over time, self-optimizing through a hybrid approach that optimizes hard rules and soft rules, both supervised and unsupervised. For data scientists who like to design their own models and run experiments, xpatterns provides easy-to-use analytics, automated experimentation and feature generation tools, and many other ready-to-use components to make modeling and experimenting in a distributed environment effortless.

The Application layer includes visualization tools that allow enterprises to immediately visualize their big data in the xpatterns platform and publish applications, all without integrating with any other software. The full workflow of building your own big data application can be done right from the xpatterns Management Console in the cloud.

Design Tenets

xpatterns is the fastest, best-performing, and lowest-risk big data intelligence platform available today:

All-inclusive: A complete platform for building applications and running advanced analytics on very large datasets. It provides integrated software across all three layers required for big data analytics: data ingestion, analytics and application development.
Cutting-edge analytics: Includes a wide range of advanced intelligence components that run the gamut from market-tested to beta to just-out-of-research. Components include machine learning, data mining, natural language understanding, dynamic ontologies, search, inference, and other analytics components. Intelligence technology R&D is Atigeo's prime directive, and we innovate continuously in this space.
Cloud-based: Delivered via the cloud, meaning no hardware needs to be installed. Storage and compute capacity are managed by the platform and can scale up and down easily.
Fastest-to-market: From ramp-up time for new adopters of an xpatterns solution to delivery time via the cloud, all xpatterns design decisions are made to help users bring their business results to market as fast as possible.
Enterprise-grade: Designed to build production-quality, line-of-business applications, the platform meets the following quality attributes: performance, scalability, high availability, reliability, security, manageability, extensibility, modularity, interoperability, testability, documentation, instrumentation and monitoring, backup and restore, disaster recovery and diagnostic tools.
Compliant: Security, privacy, compliance and audit are built into the platform. In addition to software compliance, Atigeo's procedures for managing the cloud, and our teams in charge of carrying them out, also adhere to a corresponding set of compliance requirements. We enable cloud applications for the highest compliance standards, including HIPAA.
Integrator: Includes a toolbox of choices for the infrastructure, analytics and application layers. Since different problems require different solutions, each customer leverages the subset of tools that best fits their needs. The toolbox includes open source, commercial and Atigeo-designed components.
Developer-ready: Currently, the APIs number over 100, covering the gamut from data ingestion to analytical processing, data updates, real-time queries and configuration. The APIs are authenticated over a secure channel, using standard Internet authentication protocols. The APIs are scalable, instrumented and monitored. API access is role-based, and roles can be configured for both developers and applications.
Fully-managed: Operates the cloud environment for you. Customers can rely on Atigeo's expertise to launch production applications quickly, at a known cost, without having to ramp up their IT, commit to long-term consulting engagements, or take risks on the readiness of new technology.

Who uses xpatterns?

Layer by Layer

Each of xpatterns' three layers was built around business objectives and aligns with the different roles and functions in your organization. Over time and across many customer projects, we have found that these three roles are required to build end-to-end, intelligent big data applications:

Data Analyst/ETL Analyst/Data Integration Engineer: Builds the quality and integration pipelines connecting many corporate systems to an xpatterns cloud. xpatterns tools support editing a data ingestion workflow, testing and scheduling data integration, and monitoring operations.
Data Scientist: An expert in statistics, machine learning and/or data mining, who uses the data products from the ETL Analyst to model, query and experiment on data. The tools include an integrated development environment (IDE) for creating rankers, classifiers, topics, queries and models.
Application Engineer: Builds user applications from the Data Scientist's data and models, using application-specific tools. For dashboard applications, xpatterns has a turnkey dashboard studio tool.

Today's evolving big data infrastructure has many other roles and tools, but we believe many of these will fade away as big data best practices mature. xpatterns abstracts away complexity for our clients by managing the cloud environment for them, and by orchestrating the software and tools according to these two principles:

Façade: Each person should see an optimized but minimal set of the tools, data and software required for their job. Anything more distracts and reduces productivity; under the hood, advanced tools are there for any users who want them.
Choice of tools: While xpatterns comes with pre-packaged tools, every role should be able to pick their own. For example, if a data scientist prefers SAS or R, they should be able to easily and securely install big data connectors for them within an xpatterns cloud.

xpatterns Deployment Examples

Infrastructure technology should not drive or constrain applications

A Fortune 500 company faced a predictive analytics challenge: make informed business decisions based on tens of terabytes of data from multiple sources and systems. The company had data assets in the range of 5-10 billion customer behavior records. Their existing technology infrastructure produced conventional BI results: historical charts, tables, and dashboards showing what customers had done in the past. Worse still, their predictive analytics could work on only a sample of the data, using only about 5 million records, or at most about 0.1% of the total set. The models the company applied to this sample had become standard for their industry and were based on academic research on even smaller datasets of 1,000 to about 100,000 records.
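The sample coverage quoted above follows from simple arithmetic. The short Python sketch below works it out for both ends of the 5-10 billion record range; the function name is illustrative only and is not part of any xpatterns API.

```python
def sample_fraction_percent(sample_records: int, total_records: int) -> float:
    """Return the share of the full dataset covered by a sample, in percent."""
    return 100.0 * sample_records / total_records


SAMPLE_RECORDS = 5_000_000  # "about 5 million records" from the example

# The full dataset is stated as a range of 5-10 billion records.
for total_records in (5_000_000_000, 10_000_000_000):
    share = sample_fraction_percent(SAMPLE_RECORDS, total_records)
    print(f"{total_records:,} total records: sample covers {share:.2f}%")
```

At the low end of the range the sample covers 0.1% of the records; at the high end it covers half that.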
Adding xpatterns to their existing technology took only a few days of engineering work, rather than the typical months of infrastructure planning, in-house expertise and custom integration required by traditional approaches. Most significantly, with xpatterns the company was able to develop new big data models that leverage their entire data set of 5-10 billion customer behavior records. This produced a 75% improvement over the best available academic model, creating an invaluable resource out of what had been a burdensome dataset that could only be sampled.

Advanced analytics and modeling quality should be bound by computing power, not manual labor by data scientists

Another major US-based data company was doing statistical modeling with software packages running on single machines with small data samples. The company's time to market was delayed as data analysts made hard choices about partitioning the data. Their products' data models changed, causing many further delays, and reliance on different data samples, with no visibility across the many sample datasets, also compromised the models' validity. With few data scientists, the company's incorrect data-based assumptions came at a high cost, and ROI could not be realized.

With the xpatterns analytics optimization engine, this company's data analysts could focus on designing models based on all the data and, most importantly, run multiple experiments in parallel. The company redesigned their data model and increased computing capacity for a one-week experiment, in which they applied the xpatterns optimization engine to the entire dataset, running hundreds of concurrent experiments and producing an optimal production model that would otherwise have taken months of manual labor to find. In addition, xpatterns easily allowed them to decommission existing computing clusters.

Additionally, with conventional approaches you are forced to clean the data as part of the ETL.
However, xpatterns easily handles noisy and dirty data and learns from data that is ingested as is. In this use case, the precision/recall curve below illustrates that with xpatterns the more data is used for training, the more accurate the results.

[Figure: precision/recall curves for models trained on 100%, 40%, 20%, and 5% of the data; one axis runs from 0 to 1, the other from 0.7 to 1.]
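For readers less familiar with precision/recall curves, the following minimal Python sketch shows how the points on such a curve are typically computed from a model's scores. It is a generic illustration under stated assumptions, not xpatterns code: the function name, labels, and scores are hypothetical toy values, and the figures from the deployment are not reproduced here.

```python
from typing import List, Tuple


def precision_recall_points(
    labels: List[int], scores: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each distinct score threshold.

    labels: 1 for a truly relevant record, 0 otherwise.
    scores: model confidence that each record is relevant.
    """
    total_positives = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A record is predicted relevant when its score reaches the threshold.
        predicted = [score >= threshold for score in scores]
        true_positives = sum(
            1 for is_pred, label in zip(predicted, labels) if is_pred and label == 1
        )
        predicted_positives = sum(predicted)
        precision = (
            true_positives / predicted_positives if predicted_positives else 1.0
        )
        recall = true_positives / total_positives if total_positives else 0.0
        points.append((threshold, precision, recall))
    return points


# Hypothetical toy data, invented purely for illustration.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for threshold, precision, recall in precision_recall_points(toy_labels, toy_scores):
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Plotting precision against recall for models trained on different fractions of the data gives one curve per training-set size; a curve that stays higher across the recall range indicates a more accurate model.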
These are two ways that the xpatterns platform increases the success of big data initiatives: the xpatterns analytics layer has tools to make data scientists as efficient as they can be, and the xpatterns capabilities automate and improve the production model without data scientist intervention.

Text analytics should learn: Semantics and learning make a difference

A major healthcare company was using slow legacy systems to process large amounts of unstructured text: files with no indication of the meaning, subjects, or categorization of their contents. Nearly all enterprises have unstructured data, which in healthcare includes physicians' encounter notes (doctors' notes taken during examinations). The company needed to speed up its long and costly unstructured data processing to improve its bottom line. In their line of business, this means decreasing insurance claims processing time by swiftly and accurately adding standardized medical codes for procedures and diagnoses.

Using xpatterns text analytics, the healthcare company was able to add correct medical codes in spite of differences in individual physicians' use of language and jargon, and in spite of different semantic contexts. Among many other semantic capabilities, xpatterns is able to discern negations ("the patient never broke her leg in childhood") and to distinguish the context of a phrase, such as family history versus physical exam. xpatterns capabilities also include the correct detection of conditional and hypothetical statements ("if lab results are positive, the diagnosis is kidney failure"). xpatterns also continually learned from what it found in the company's data, basing new inferences on that feedback. This means that xpatterns is able to continue operating successfully as new sources of unstructured data are encountered.

Conclusion

Companies in all sectors are increasingly realizing that they are effectively big data companies by virtue of their massive enterprise and customer data repositories.
While these companies are acutely aware of the critical need for analytical insight into both stored and streaming data, a number of factors impede progress toward surfacing intelligent solutions: the state of the data, lack of resources and expertise, lack of infrastructure, the state of tools and solutions, and the quality and evaluation of results.

If you have a tough data problem that is not easily solved by current methodologies, xpatterns can positively impact your organization in widely influential ways, as the fastest, highest-performing, and lowest-risk way of building intelligent big data applications. By uncovering more relevant connections in data at game-changing speed, workflow procedures are streamlined, development cycles are reduced, and customer and patient needs are anticipated more accurately. From healthcare to energy, security and beyond: if it requires information to do its job, xpatterns makes it intelligent.

As the only "end to end" big data analytics platform available today, xpatterns allows you to utilize your existing resources and seamlessly create a scalable, private virtual cloud. Through its patented collection of intelligent algorithms, xpatterns' advanced analytics gives you access to all of your data in real time and leads to better, faster answers.