White Paper

Discovering Big Data's Value with Graph Analytics

By Evan Quinn, Senior Principal Analyst

April 2013

This ESG White Paper was commissioned by YarcData and is distributed under license from ESG.
Contents

Executive Summary
Big Data's Value Proposition
    The Sixth V: Value
    Big Data Is Intuition's Partner and Helps to Discover the Unknown
The Power of Discovery
    Addressing the Generalized Question
    Discovery Roadblocks and Resolutions
    Big Data Still Wants To Be Secure, Credible, and Quick
    Example Graph Discovery Use Cases
Urika Offers Faster Time to Big Data Value
    An Appliance to Make Graph Analytics Easy
    A Complete Big Data Solution
The Bigger Truth

All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at
Executive Summary

With the seemingly boundless industry buzz around big data, and constant talk about the 3 Vs (variety, volume, and velocity), the ultimate reason organizations should invest in big data sometimes becomes lost. Organizations should look for business value from big data; big data means big value.

ESG believes that big data value will largely be realized through discovery analytics. Though discovery analytics may seem more abstract initially, the results often produce insights that carry more strategic impact than BI or basic business analytics. To reach those insights, the discovery analytics process typically involves more diverse and potentially voluminous data, and requires iterations of queries and searches for patterns in the data relationships that eventually yield strategic insight. Discovery implies a less concrete initial grasp of the outcome and the data that will drive the outcome. With BI and basic business analytics, the answer design and the underlying data are usually already well understood.

The main roadblock preventing organizations from realizing the value of discovery analytics is that they seldom possess a BI/analytics solution that works well for discovery purposes. Most BI and business analytics solutions do an insufficient job with discovery analytics. The relational database models so often used in data warehouses and for BI are poorly suited to discovery analytics. Moreover, advanced solution designs that support business analytics and even predictive analytics may do okay in a pinch for some discovery analytics purposes, but they too run out of gas under the strain of the multiple ad hoc queries, fast iteration, and morphing data sets demanded by discovery processes. Graph analytics solutions have emerged as the best approach for discovery because they are designed from the outset to deal with the quantities of relationships, data, and queries needed.
Graph solutions, which act like partners to the intuition of data analysts, enable organizations to rapidly sift through queries, redefine data sets, and chase possible fresh patterns of data and thinking without worrying about redefining schema. While graph solutions have existed in niches like cybersecurity and intelligence for many years, the increasingly networked nature of the world, including social media, is shifting the graph approach from niche toward mainstream.

The good news is that as more organizations discover the power of big data discovery analytics, they will also discover fresh graph solutions in the market. While graph analytics solutions are unique compared to other BI/analytics solutions in that (a) they should optimally use large amounts of memory to perform their relationship-centric queries and (b) they often require wide pipes for ingesting the potentially massive volumes of data they work with, they are no more difficult to use than other analytic solutions.

Of the newer graph solutions in the market, ESG is quite impressed with YarcData and its Urika appliance for discovery analytics. First, ESG believes the preconfigured appliance approach makes huge sense for graph solutions: The physical security combined with an open source, standards-based RDF graph database, plus the appropriate mix of memory and I/O optimization, eliminates privacy, repository, and configuration complexity from the customer's list of concerns. And Urika offers stunning memory and I/O optimization that will take on even the most daunting discovery challenges. Second, ESG believes YarcData's choice of open standards-based SPARQL as its query language helps organizations reuse SQL expertise; SPARQL is syntactically and functionally quite similar to SQL. SPARQL, however, was designed to operate effectively with many ad hoc queries that do not require constant manipulation of schema and database designs.

Some customers might ask an either/or question: do I use a graph solution like Urika instead of other BI/analytics solutions for big data? ESG recommends that both types of solutions be implemented, and that they co-exist, share data, and use each other's strengths to the organization's advantage. Urika already integrates nicely with several leading BI/analytics solutions.

The bigger truth is simple: If you really want to do big data, you must search for big insights. Discovery analytics offers that path to big insights, and YarcData's graph solution offers the path to discovery.
Big Data's Value Proposition

The Sixth V: Value

It seems the mass media currently applies the big data label to any form of business intelligence (BI) or analytics news that companies or IT vendors tout. While the affinity for such news raises general consciousness about business intelligence and analytics, the frenzy masks what is truly unique and special about big data. Similarly, much has been written about the three Vs of big data (volume, variety, and velocity), and certainly the fact that data continues growing in these dimensions contributes to the big data phenomenon. Organizations, however, need to consider three additional Vs in the pursuit of big data: veracity, visualization, and value. Veracity mitigates issues associated with garbage in, garbage out, and many organizations have stepped up data governance and cleansing practices in support of big data pursuits. And organizations increasingly use visualization tools to create rich BI and analytics graphical representations that most effectively communicate insights.

The last V, value, is clearly the most vital yet the most elusive of the six Vs. The value benefit of big data investments needs to be rethought because big data's value proposition is fundamentally different from the value realized through classic business intelligence. Let us compare.

Big Data Is Intuition's Partner and Helps to Discover the Unknown

ESG breaks down BI/analytics into two primary value pursuits, with the second more distinctly big data:

1. You Know Precisely What Questions to Ask and How to Find the Answers: The entire notion of schema and structured query language, and the previous generation of business intelligence and analytics, is based upon the notion that you know what you are looking for and that you possess a familiarity with the data. What is different in the big data era is that you are now dealing with the three Vs, so a different processing approach may be required to handle the more voluminous and complex data sets. Still, the analyst usually knows what the answer will look like after positing the question. Older BI/analytics solutions typically do a poor job digesting and processing data complexity; they were never designed to handle the first three Vs.

2. You Generally Know What You Are Looking for, but Don't Know Specifically How to Find It: You are looking for fresh perspectives buried in obscure data relationships; your intuition tells you that they are there. The problem you are trying to solve, however, is only roughly defined. You are operating in a discovery mode. Because you may not know initially what the answer will look like, you are not in a position to winnow down the data before performing the analysis, or to tune schemas, or to use MapReduce technologies requiring long cycles to generate a single result. You are looking for patterns snaking through massive and/or multiple data sets, and you need to iterate through theories, hypotheses, and temporary results rapidly in order to ultimately uncover the important big data insight.

Newer BI/analytics solutions, from Hadoop to commercial analytics solutions designed to perform massively parallel processing (MPP), perform very well at delivering the first value situation, where the problem domain is clear. Graph solutions excel, however, at helping analysts progress through the second case of analyses: discovery. Even though graph analytics solutions have been used in telecommunications and cybersecurity for many years, as organizations pursue big data, many will find a need for graph analytics. ESG believes organizations should increasingly invest in two analytics platforms, one to address more classic BI and analytics challenges, and one for the purposes of discovery in big data.
ESG also believes these are not mutually exclusive but rather mutually supportive solutions: they feed upon each other's strengths and data. Let's look deeper into the power of discovery and what makes the graph analytics approach different.
The Power of Discovery

Addressing the Generalized Question

Consider four questions:

1. A business intelligence question: What percentage of our salespeople did not meet quota during fiscal Q3 2012 in Eastern Europe?
2. A discovery question related to number one: Are any of my European salespeople committing fraud?
3. A big data question: What is the predicted risk-adjusted value of this mortgage-backed security?
4. A discovery question related to number three: Is today's economic news changing trading patterns?

Most business people, and any business data analyst, can easily conceive of how a BI solution could answer question number one; the data elements required are obvious, and while the sources for the data may vary, the underlying data structures are typically straightforward. Fewer business people, perhaps only bankers and/or data analysts with banking domain knowledge, might be able to tackle question three; mortgage-backed securities contain many data elements that could be risk associated, and data from third-party sources may be essential to construct predictive risk models. But still, the data sources, correlations, and pathways to the answer are quite clear.

How, though, do you begin to look for answers to the far more generalized, yet arguably more strategic, questions posited in numbers two and four? The path to the answers and the data required isn't clear. You will need to experiment, run multiple queries across wide ranges of varying data, store the pattern threads that look useful, compare them to fresh data, and so on. The creative and iterative process required to address questions two and four is what ESG refers to as a discovery analytics process. Note that since discovery analytics processes require more queries, and unpredictable queries, running across wide ranges and depths of data, most classic BI solutions or basic business analytics will simply not work for discovery processes.
Using MPP or Hadoop-based analytics solutions might get you there eventually, but they will struggle. Graph analytics solutions, however, are inherently designed to deal effectively with such abstract analytical endeavors, or more specifically:

Graph solutions enable the rapid process of discovery analytics to uncover answers to more generalized, albeit strategic, questions involving subtle and less formally defined data relationships.

Discovery Roadblocks and Resolutions

In computer science, graph data structures focus on the relationships between data objects. In relational databases, the equation is flipped: Relational databases are optimized to process the data objects (the rows or records), and deal sub-optimally with the relationships between the data objects. Think of a relational database and you think of many rows with few relationship fields or indices. While to the data analyst or data scientist a graph analytics solution works like any other analytics solution on the surface, and a graph database is as easy to use as a relational database, the limitations of relational databases and data warehouses in discovery analytics are considerable.

Why do graph solutions work so well for discovery analytics? Graph databases are precisely designed for discovering new or previously undetected patterns of relationships in the data. Graph analytics offer the ability to detect, follow, test, and model relationship patterns that exist beyond the predefined schema at far faster rates than other database/analytics combinations. Table 1 delves into some details regarding how graph solutions are designed to optimize discovery analytics.
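To make the relationship-centric idea concrete before turning to Table 1, the following is a minimal Python sketch of how data stored as subject-predicate-object triples can be queried ad hoc without defining a schema first. It is an illustration only, not YarcData's implementation or SPARQL itself; the data, the predicate names, and the helper functions `match` and `shared_targets` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
# Every name here is invented for illustration.
TRIPLES = [
    ("alice", "works_in", "sales_eu"),
    ("bob", "works_in", "sales_eu"),
    ("alice", "pays_into", "acct_17"),
    ("bob", "pays_into", "acct_42"),
    ("vendor_x", "pays_into", "acct_42"),
    ("vendor_x", "supplies", "sales_eu"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


def shared_targets(triples, predicate):
    """Find objects that more than one subject reaches via one predicate."""
    by_object = defaultdict(set)
    for s, _, o in match(triples, p=predicate):
        by_object[o].add(s)
    # Keep only the objects shared by two or more subjects.
    return {o: subjects for o, subjects in by_object.items() if len(subjects) > 1}


if __name__ == "__main__":
    # Ad hoc question: what is connected to the European sales team at all?
    print(match(TRIPLES, o="sales_eu"))
    # Follow-up question: which accounts are shared by more than one party?
    print(shared_targets(TRIPLES, "pays_into"))
```

In this sketch a new kind of relationship is just one more triple appended to the list, so the analyst can pose the next question without altering tables or writing joins. A production graph database adds indexing, scale, and a standard query language on top of this basic idea.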
Table 1. Graph Solutions Provide Resolutions to Discovery Roadblocks

Discovery Roadblock: Schema Nightmare and Too Many Joins
Relationship analytics using relational databases or classic data warehouses are problematic because discovery processes using the relational database model force the analyst to continuously extend schema. In order to overcome the lack of relationship-centric information and weak support for semi-structured data, the relational model ends up requiring joins and, even worse, nested joins to process discovery-style queries, and those process slowly.

Discovery Solution: Graph Model Designed for Diverse Data
Since discovery suggests that you don't know which data sources might be useful, the slow old method of ETL and schema definition for the building of a repository or data warehouse inhibits the discovery process. A graph solution optimizes the ingest of arbitrary sources of data, affording the analyst the freedom to quickly load data of all kinds, test hypotheses, and reconfigure the data set as applicable. This stepwise refinement of interrelated data and query allows the analyst to rapidly explore theories without being limited by schema.

Discovery Roadblock: Lack of Native Query and Result Set Reuse
SQL does not natively support relationship pattern searches or queries; they can be emulated, but only by paying a huge performance penalty and testing the limits of analysts' patience. And SQL is weak at saving and cycling through result sets that may require their own fresh schema and query, and is potentially slow for supporting related visualization.

Discovery Solution: Result Sets that SPARQL
Graph solutions allow the analyst to save multiple result sets from a long list of ad hoc queries, yielding more relevant combined data sets. Open standards-based SPARQL is precisely designed for this iterative process. Graph solutions can be used to populate other big data tools without SQL delay; graph analytics may act as preprocessors for other BI/analytics solutions like SAS, statistical development tools like open source R, and next-generation visualization tools like Tableau Software.

Discovery Roadblock: No Scalability
The lack of a native data model that maps to the relationship pattern discovery process, plus the use of workarounds to compensate for that lack, results in an inability to scale, which can seriously delay the iterative discovery process.

Discovery Solution: Memory = Fast Iteration + Highly Interactive
Graph solutions take full advantage of memory, meaning that complex discovery queries process in seconds, allowing analysts to quickly evolve their thinking. While some relational and many columnar databases benefit from in-memory processing, the unique nature of the graph approach makes best use of large amounts of memory. And the highly iterative and interactive graph approach for discovery is far faster than the batch cycles of some analytics solutions like Hadoop.

Source: Enterprise Strategy Group.

Big Data Still Wants To Be Secure, Credible, and Quick

Despite the promise of discovery analytics, organizations still want big data to meet enterprise-grade requirements. Figure 1 lists the top-ten significant challenges associated with big data in general, as reported by respondents to a recent ESG research survey. Major challenges organizations face with their largest data sets include security, quality, integration, cleansing, speed of analysis ("how long it takes a task to complete"), storage costs, and data set sizes.[1] Whether those challenges are met can often make or break the success of a big data undertaking. Discovery analytics delivered through purpose-built appliance graph solutions, however, directly address all of these challenges. ESG estimates that around 80% of an analyst's time is used in creating the proper data set and managing the infrastructure in order to actually begin doing the important part, and only 20% on analyses. This 80/20 ratio is not effective in the case of discovery analytics because of its iterative nature. Graph solutions favorably alter this ratio, and in particular give analysts a predictability of performance that enables them to optimize their own workflows.

[1] Source: ESG Research Report, The Convergence of Big Data Processing and Integrated Infrastructure, July 2012.

Figure 1. Primary Challenges of Big Data

What are the most significant data processing and analytics challenges your organization faces with its largest data set? (Percent of respondents, N=399, multiple responses accepted)

Data security: 34%
Data integration: 30%
Data quality: 29%
Data cleansing tasks: 28%
Business expectations on how long it takes for tasks to complete: 27%
Data synchronization: 25%
Storage costs: 24%
Master data management: 22%
Lack of skills: 21%
Data set sizes: 20%

Source: Enterprise Strategy Group.

Example Graph Discovery Use Cases

Does graph analytics sound good in theory? ESG has seen it in practice too, and is aware of a long list of use cases where graph databases and analytics have uncovered insights that otherwise would have been too slow or impossible to discover using other techniques. Here are a few examples that offer a sense of the range and depth of analytical challenges that graph solutions meet far more effectively than other options:

• Cyber Threats: Identify new threats before an intrusion occurs by exploring a wide variety of web-related inputs (DNS, Firewall, Netflow, etc.) to spot deviations in traffic patterns and correlations to known threats, focusing on unforeseen relationships to common data objects such as URLs.

• Healthcare Treatment Optimization: Obtain optimal treatment recommendations for each patient regardless of doctor by using evidence-based treatment efficacy, taking into account patient data such as history, genetics, test results, and demographically wider but related data (e.g., family, city, etc.).
• Disease Research: Develop richer and collaborative research models to, for example, better understand cancer cells by loading a wide swath of data from multiple researchers sourced from a wide set of biomedical and life-sciences publications.

• Financial Services: Analyze and identify patterns associated with fraudulent transactions or money laundering. Another example is counterparty risk analysis which, when combined with identity resolution, identifies the relationships between entities to assess credit quality and risk.
• Medicare Fraud Detection: Spotting relationships between health care beneficiary data, claims data, health care provider data, data from tests, and other relevant data allows analysts to detect complex patterns of relationships which could indicate collusion to commit fraud. Note that there are over six million healthcare providers, over a hundred million patients, and billions of claim records, and that it has been estimated that over 8% of Medicare's budget, or approximately $50 billion per year, is exposed to fraud.

ESG sees graph analytics already reaching a wide range of other markets, particularly energy, retail, consumer packaged goods, transportation, and e-commerce.

Urika Offers Faster Time to Big Data Value

An Appliance to Make Graph Analytics Easy

ESG tracks a long list of NoSQL ("Not Only SQL") databases, including graph databases. ESG also tracks a number of BI/analytics platforms, including several available via appliance. YarcData's Urika appliance offers organizations an extremely fast, effective means of implementing discovery analytics through a graph solution that includes a graph database. Though YarcData is barely a year old as a singular organization, and Urika is similarly aged as an offering, Urika and the intellectual property of YarcData actually reflect many years of R&D focused on graph databases and analytics under the original auspices of Cray Inc., the supercomputer manufacturer. YarcData, however, is specifically focused on delivering the premier graph analytics solution to the market. Table 2 highlights some of the key benefits of the Urika solution:

Table 2. Urika's Benefits

Feature: In-memory processing, up to 512 terabytes of real memory
Benefit: The massive amount of memory enables complex ad hoc queries to be processed across massive data sets in seconds.

Feature: Threadstorm processor architecture with 128 hardware threads
Benefit: Supports truly massively parallel queries, required for digging through complex relationship scenarios.

Feature: High-performance I/O subsystem, up to 350TB per hour
Benefit: Essential for rapid ingest and result set output of massive quantities of data.

Feature: Appliance
Benefit: Significantly reduces security risks associated with hosting or public cloud; minimal infrastructure configuration required; operational in hours to provide quick time to value.

Feature: Appliance Tuning and Management
Benefit: Urika's Graph Analytic Manager looks and operates like an enterprise-class systems management and DBA console, with similar features and control.

Feature: Native SPARQL support
Benefit: Specialized query language developed to work explicitly with the Resource Description Framework (RDF) graph data model, but syntactically and conceptually very similar to SQL, requiring nominal formal training for those with SQL skills.

Feature: Subscription Pricing
Benefit: Subscription pricing allows for OPEX recognition to spread cash flow of big data investment over time, with flexibility to allow for proof-of-concept trials; full support included.

Source: Enterprise Strategy Group.

A Complete Big Data Solution

ESG does not recommend YarcData's Urika for the purposes of basic business intelligence; it would be like using a jet airplane to fly a few miles. Similarly, ESG does not recommend Urika for basic business analytics that vendors such as SAS or QlikTech readily solve. However, ESG does recommend that organizations looking for true big data discovery analytics use an enterprise-ready graph solution like Urika. For an illustration of how BI, business analytics, and discovery analytics differ, see Table 3. ESG also suggests that using specific solutions for general BI/analytics and for discovery analytics is a best practice, but that ultimately organizations should view them as interconnected and codependent. For example, data already collected for BI may be repurposed for predictive and discovery analytics. Outputs from discovery analytics may certainly be part of the data mix for predictive analytics. And all three may work in unison with visualization tools that promote information sharing and further insight, which may feed back into the discovery analytics process.

Table 3. Six Dimensions of BI, Business Analytics, and Discovery Analytics

Dimension           Business Intelligence   Business Analytics             Discovery Analytics
Schema              Known                   Mainly known                   Little known
Data Mix            Mainly structured       Often structured               Most diverse
Analyst Iteration   Nominal                 Some                           Always
Algorithms          Arithmetic              Some regression                All of the above, plus graph
Time                Look backward           Some backwards, some forward   Varies, but must be super fast
Query               Simple                  Complex                        Abstract

Source: Enterprise Strategy Group, 2013.
The Bigger Truth

If you are a data scientist or data analyst, terms like machine learning, data mining, predictive analytics, MapReduce, decision models, regression trees, and vector machines are already part of your vocabulary. Some new terms associated with graph analytics should now be added to the big data vocabulary, such as nodes (not the nodes of a cluster), edges, vertices, and triple stores, along with analytic techniques such as clustering, shortest path, and common pattern identification. For the CIO, CFO, CMO, and business analyst practitioners, however, grasping the benefit and practice of discovery analytics is enough to help organizations realize the value of the graph approach.

When you want to spot critical trends and patterns buried in highly diverse and voluminous data sources, you will likely need discovery analytics, delivered through a graph solution, to do the job. Some of those trends and patterns will unlock otherwise undetectable security risks; cures to systemic diseases or improved treatment; a better understanding of complex financial market movements; or even sentiment analysis from various social media sources.

If your organization is serious about big data, ESG suggests you take a serious look at graph analytics. Fortunately, a handful of specialty vendors have tried to make graph analytics easy to acquire and use. YarcData has taken extra steps to include:

• A preconfigured memory and optimized I/O infrastructure designed for graph analytics via the Urika appliance.
• A high-performing graph database based on the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL.
• Full-featured administration tools to make it easy for IT and analysts to implement and maintain.
• A pricing model that matches the preference of many CIOs and CFOs.

Government organizations and enterprises that believe in the power and value of big data owe it to themselves to discover graph solutions, and give Urika a test drive.
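For readers new to the graph vocabulary mentioned in this section, the following minimal Python sketch shows nodes, edges, a shortest-path search, and a very simple form of clustering. It rests on several assumptions: the data and names are invented, the graph is treated as undirected, breadth-first search stands in for the shortest-path technique, and connected components stand in for clustering (production graph analytics typically use more sophisticated methods).

```python
from collections import deque

# Hypothetical edges: each pair links two nodes (vertices).
# All names are invented for illustration.
EDGES = [
    ("patient_a", "clinic_1"),
    ("clinic_1", "provider_x"),
    ("provider_x", "lab_9"),
    ("patient_b", "lab_9"),
    ("patient_c", "clinic_2"),
]


def build_adjacency(edges):
    """Map every node to the set of nodes it shares an edge with."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a shortest list of nodes, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the recorded predecessors.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def clusters(adjacency):
    """Group nodes into connected components (a very simple clustering)."""
    seen = set()
    result = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        result.append(component)
    return result


if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    print(shortest_path(graph, "patient_a", "patient_b"))
    print(clusters(graph))
```

Here the path between two patients surfaces the clinic, provider, and lab that connect them, which is the kind of indirect relationship the use cases in this paper rely on. The separate cluster shows a group of records with no link to the rest.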