Traditional Analytics: Not Designed to Excel at Graph Analytics




IDC TECHNOLOGY SPOTLIGHT

Finding High-Value Relationships in Big Data

May 2013

Adapted from Worldwide Data Intensive Focused HPC Server Systems 2011-2015 Forecast by Steve Conway, Earl Joseph, and Chirag Dekate, IDC #232572

Sponsored by YarcData

This paper examines the dynamics of the formative market for high-performance data analysis (HPDA). The emphasis is on challenges, such as fraud detection, cybersecurity, and insider threats, that are increasingly crucial for both government and commercial organizations to address. Tackling these problems requires moving beyond today's needle-in-a-haystack searches for items already known to exist in a database. The essential challenge presented by these problems is to discover hidden patterns and relationships (things you didn't know were there) in real time or near real time. This is typically a far more difficult undertaking, one that traditional technologies such as relational database management systems (RDBMSs) and clustered computers ("clusters") do not handle well. It entails using graph analysis, sometimes in combination with other methodologies such as semantic or statistical analysis.

The paper also looks at the role of YarcData, a vendor whose hardware-software solution is designed to excel at graph analytics and related methods. By offering a low-risk, subscription-based pricing model, YarcData is enabling organizations to make the transition from today's limited, static searches to the emerging era of dynamic discoveries of high-value patterns and relationships.

Cyberattacks, Insider Threats, and Fraud: The Need to Move from Search to Discovery

Historically, the challenge of civilizations has been to safeguard items of high value from external plunder and insider theft. Ancient cities were fortified sites built to protect stores of surplus food and other raw materials on which the lives of urban populations depended. The stored wealth of cities has always been subject to brute force attacks by hostile armies, Trojan horse-style intrusions, and betrayals to enemies by treacherous city residents.

These same threats have been extended to a newer type of stored value: the vast volumes of data that describe and help maintain critical infrastructures (national security systems, power and communications systems), along with private data on large government programs (e.g., social security, Medicare, and Medicaid) and data containing industrial trade secrets. In this Big Data context, the main threats fall into the following categories:

Cyberattack: An attempt by an outside party to gain unauthorized access to stored data or disrupt services, usually via network intrusion (hacking). Cyberattackers have breached the U.S. electrical grid, planting potentially harmful software in the process. In March 2013, attackers hijacked the servers of The Spamhaus Project, a nonprofit organization based in London and Geneva, to carry out the largest cyberattack in history, an attack so powerful and widespread that "it almost broke the Internet," according to the firm hired to subdue the attack. The head of the Pentagon's Cyber Command, General Keith Alexander, estimates that cyberattacks and intellectual property theft have cost U.S. companies $250 billion annually.

Insider threat: The unauthorized appropriation or exploitation of data by a person with access to the organization owning the data. Espionage is one insider threat category in government, as the following example from an FBI brochure confirms: "Greg Chung spied for China from 1979-2006. Chung stole trade secrets about the space shuttle, the Delta IV rocket and the C-17 military cargo jet for the benefit of the Chinese government... In February 2010 he was sentenced to over 15 years in prison." Insider theft is also a serious problem in the private sector. In one of many examples, in 2011, hedge fund founder Raj Rajaratnam was sentenced to 11 years in prison for insider trading that netted him $60 million.

Fraud: The deceptive exploitation or annotation of data for wrongful or illegal personal gain. For example, the FBI estimates that 10% of transactions in federal healthcare programs (Medicare, Medicaid, Veterans Affairs, and so forth) are fraudulent, costing more than $100 billion a year. Today, fraud is detected after the fact, and the government recovers only about $5 billion a year. The government is conducting tests to evaluate multiple proposed solutions to this Big Data problem. Estimates for annual losses to fraud by U.S. businesses run as high as $1 trillion. The banking and financial services industry accounts for 16% of all fraud cases in the private sector, more than any other industry, according to a recent report from the Association of Certified Fraud Examiners.

Detecting Cybercrime: Finding Relevant Patterns and Relationships in Time

Incidents of cybercrime are escalating quickly today. IDC believes most organizations in government, academia, and industry will be expected to do much more than they do today to protect themselves against this dangerous deluge. One crucial aspect of protection is the ability to detect cybercrime before it happens or soon afterward.
In the urban warfare example cited previously, it was easy enough to notice if the Visigoths were storming the city gates or if half the wheat in the city's storehouse was missing. In contrast, today's cybercriminals can copy data and steal away, leaving the original data in place. The FBI estimates, for example, that for every organization that's aware of being hacked, there are 100 others that don't know they have been attacked.

For this reason, typical needle-in-a-haystack searches are inadequate for detecting today's cybercrime, either before it occurs or quickly enough afterward to apprehend the responsible parties. The problem is, there is often no needle to be found, no single item that has gone missing or that would point definitively to the perpetrator. Instead, detecting cybersecurity breaches, insider threats, and fraud requires timely and relevant discovery of hidden patterns and relationships in the data. These patterns are dynamic: they can change at any time. That is why this challenge is sometimes described as the ability to detect patterns in shifting sands.

Unknown Unknowns

Adding considerably to the challenge of discovering hidden relationships in data is the fact that a substantial subset of these troubling patterns consists of brand new forms of subterfuge that have never been encountered before. Rather than "known unknowns" for which some form of defense has usually been developed, these especially dangerous relationship patterns are "unknown unknowns." They can be far more damaging, in financial and other terms, than their "known unknown" counterparts and far harder to detect at all, much less in time to react effectively.

The chief methodology for discovering the presence, nature, and strength of hidden and unexpected relationships and patterns is graph analytics (see the Definitions section). Graph analytics are applicable not only to cybercrime but also to many Big Data problems where finding hidden relationships is essential, such as genomics, cancer research, epidemiology, materials science, financial services, and others. Business-to-consumer (B2C) organizations are also investing in graph analytics solutions to enable more advanced affinity marketing by identifying not only the purchasing patterns of customers but also the relationships of customers to their networks of family, friends, and acquaintances. Discovering these extended, hidden patterns that influence purchasing decisions is of high value to online and brick-and-mortar retailers alike.

Traditional Analytics: Not Designed to Excel at Graph Analytics

Most Big Data problems today are needle-in-a-haystack, static searches that perform well on RDBMSs running on standard compute servers/clusters. But graph analytics problems are an important exception. Why? Because RDBMSs impose a specific, rigid schema on the data in advance, making it impractical to discover differently structured hidden relationships that may be crucially important. What is needed to extract the meaning and value from these relationships is a graph database that, without prejudice, can identify and lay out relationships among the data and then quickly move through (traverse) all those relationships ("edges") to assess the nature and importance of each one.

To that end, enriching a graph with many data sets enables the discovery of "unknown unknowns" that can capture emerging patterns and trends. Since the graph is stored in memory, results can often be generated in seconds to enable real-time threat detection. Furthermore, data can come from new sources and can change in real time. By contrast, schema extensions to RDBMSs or other table-based approaches may require a multiweek IT project.

A graph database cannot act alone. Its capabilities must be strongly supported by the hardware system on which it runs.
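
As an illustration of the traversal idea described above, the following minimal Python sketch holds a small graph in memory and walks its edges breadth-first to surface a chain of relationships linking two entities that no single source data set connects. The entity names, relation labels, and helper functions (build_graph, find_connection) are hypothetical and chosen only for illustration; the sketch does not describe how any vendor's graph database is implemented.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, relation, node) triples."""
    graph = {}
    for src, relation, dst in edges:
        graph.setdefault(src, []).append((relation, dst))
        graph.setdefault(dst, []).append((relation, src))
    return graph


def find_connection(graph, start, target):
    """Breadth-first traversal; returns the shortest chain of edges or None."""
    if start not in graph or target not in graph:
        return None
    queue = deque([start])
    parent = {start: None}  # node -> (previous node, relation), None for start
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the start to rebuild the chain.
            path = []
            while parent[node] is not None:
                prev, relation = parent[node]
                path.append((prev, relation, node))
                node = prev
            return list(reversed(path))
        for relation, neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = (node, relation)
                queue.append(neighbor)
    return None


# Hypothetical records merged from two unrelated sources (assumed data).
claims = [
    ("provider_A", "billed", "claim_1"),
    ("claim_1", "paid_to", "account_9"),
]
registry = [
    ("provider_B", "registered_at", "address_7"),
    ("account_9", "mailing_address", "address_7"),
]

graph = build_graph(claims + registry)
path = find_connection(graph, "provider_A", "provider_B")
for src, relation, dst in path:
    print(f"{src} --{relation}--> {dst}")
```

Neither data set alone links the two providers; the connection appears only once both are merged into one graph and traversed. Adding a new source here means appending more triples rather than redesigning a table schema, which is the contrast with RDBMSs that the text draws.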
Standard clusters aren't designed for effectively tackling graph analytics and other types of challenging relationship discovery problems (e.g., semantic analysis, climate knowledge discovery). For one thing, systemwide memory on standard clusters is limited in size and is not logically shared. There is no single memory space big enough to hold all the data for a sizable Big Data problem. Instead, the data has to be split into chunks small enough to fit into the system's logically distributed memory locations. Clusters work well enough for needle-in-a-haystack problems that can easily be subdivided into smaller problems, each of which can be run independently on one node of the cluster with its local memory. But relationship discovery problems that benefit from graph analytics don't work this way. They cannot easily be split up (partitioned) into smaller, independent problems.

The limited communications capabilities of standard clusters also severely restrict data set size and the performance achievable on graph analytics. Graphs are very highly interconnected, so attempts to partition them onto a cluster result in skyrocketing communications requirements as the graph grows in size. Past a certain graph size, adding additional cluster nodes can actually slow the time to solution because of increasing communications requirements. This is a major issue when results are needed in real time or near real time, as is often the case with detecting cybersecurity breaches, insider threats, fraud, terrorist activities, and epidemic patterns.

For these reasons, organizations needing to run the most demanding Big Data problems already are seeking computer systems with capabilities superior to those available on standard clusters, especially systems that are expressly designed to excel at graph analytics and related methods.

Definitions

Cluster. IDC defines clusters used in technical markets as a set of independent computers combined into a unified system through systems software and networking technologies. Thus, clusters are not based on new architectural concepts so much as new systems integration strategies. Clusters today are heavily based on standard technologies, such as x86-based processors, the Linux or Windows operating system, and the MPI message-passing protocol.
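
Because the nodes of a cluster are independent computers that cooperate by passing messages, every graph edge whose two endpoints are stored on different nodes implies network traffic whenever that edge is traversed. The following Python sketch gives a rough feel for the partitioning problem noted earlier by counting such cross-partition edges as the number of partitions grows. It rests on several assumptions that are not from the source: a randomly wired graph, a naive placement rule (node id modulo the number of partitions), and hypothetical function names. Production graph partitioners are considerably more sophisticated.

```python
import random


def random_graph(num_nodes, num_edges, seed=0):
    """Generate a reproducible set of distinct undirected edges (assumed model)."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)


def cut_edges(edges, num_partitions):
    """Count edges whose endpoints land on different partitions.

    Placement rule (an assumption for this sketch): node id modulo k.
    """
    return sum(1 for u, v in edges if u % num_partitions != v % num_partitions)


edges = random_graph(num_nodes=1000, num_edges=5000)
for k in (1, 2, 4, 8, 16):
    cut = cut_edges(edges, k)
    share = 100.0 * cut / len(edges)
    print(f"{k:2d} partitions: {cut} of {len(edges)} edges cross ({share:.0f}%)")
```

Under these assumptions, the chance that both endpoints of an edge share a partition is roughly 1/k, so the share of edges requiring communication climbs toward 100% as partitions are added. Real graphs and real partitioners will differ in the details, but the sketch is consistent with the observation that highly interconnected graphs resist being split into independent pieces.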

Graph analytics. IDC defines graph analytics as a computational method for identifying and visualizing relationships among items in a database and assessing the relative strengths and natures of the relationships based on their connecting lines (edges). Graphs generally exhibit irregular data patterns and cannot be easily subdivided (partitioned) for parceling out to the distributed memory locations of standard clusters. This methodology benefits from computer systems with large shared memories and strong I/O capabilities.

High-performance data analysis (HPDA). IDC coined the term high-performance data analysis to describe the convergence of established data-intensive HPC markets and the high-end government and commercial analytics markets that require similarly powerful computing resources.

Benefits

Graph analytics can discover high-value, hidden patterns and relationships, typically in real time or near real time. Standard servers/clusters cannot match this performance. IDC believes that IT organizations will increasingly be expected to augment today's static searches with such an ability, which is becoming crucial not only for detecting cybercrime before or shortly after it happens but also for a wide range of other Big Data problems in life sciences, materials science, finance, and other fields. Organizations that are facing these problems may falter in their missions and in their competitive standing if they do not acquire this ability.

Considering YarcData's Big Data Appliance

YarcData (Pleasanton, California), a Cray company, offers a Big Data appliance that is purpose built to excel at difficult graph analytics problems. The aptly named Urika solution includes a hardware platform that supports massive multithreading, graph analytics database software carefully designed to exploit the hardware platform effectively, and additional software including a user interface aimed at making the Urika appliance easy to administer and use.
YarcData reports that its solution typically can be up and running challenging problems in a week or two. The following are highlights of the YarcData graph analytics platform:

- A shared (global) memory space that can hold up to 512TB of data, avoiding the need to partition graphs
- A purpose-built graph acceleration processor, featuring massive multithreading technology, with hardware support for 128 concurrent hardware threads per processor (This is designed to enable the appliance to maintain peak performance even in the event of memory and network delays.)
- A scalable I/O subsystem, enabling Urika to dynamically add new data sources or streaming data to the in-memory graph as needed to explore a new hypothesis
- A highly tuned graph database, providing W3C standard interfaces to facilitate migrating to the platform and avoiding vendor lock-in
- Industry-standard interfaces and a comprehensive toolset for appliance and database management, permitting the reuse of existing IT skills and resources
- Integration with existing analytic environments, including data warehouses and Hadoop

YarcData makes Urika available on a subscription basis, avoiding the need for capital expenditures. The company has domain experts who can engage on a peer-to-peer basis to understand customers' problems and work to map those problems efficiently onto the Urika platform. Examples of existing customers and uses include the following:

- QinetiQ North America (QNA) will use YarcData's Urika graph analytics appliance to deliver actionable defense-related intelligence to QNA's international clientele of government and commercial customers.
- The U.S. Department of Energy's Oak Ridge National Laboratory (ORNL) will use the Urika appliance to support research in healthcare fraud and analytics for a leading healthcare payer. In addition, ORNL will apply the capabilities of the Urika graph analytics appliance to other areas of research where data discovery is vital. These potential use cases include healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and analysis of proteins and gene pathways.
- At the Institute for Systems Biology (ISB), Urika is supporting research into new drugs for cancer treatment. ISB is taking a systems biology approach, modeling the development of cancer at a cellular level. Data from sources such as the MEDLINE database of biomedical articles, The Cancer Genome Atlas, and several proteomic databases, together with ISB's own wet lab results, were combined into a graph to enable hypothesis validation on Urika. Major discoveries have already been made, validating this approach.
- The Pittsburgh Supercomputing Center, a National Science Foundation-funded facility, is using a Urika appliance for a wide range of scientific research projects in the life sciences and other domains.
- The Swiss National Supercomputing Centre (CSCS) is using a Urika appliance for large-scale parallel applications that benefit from pattern matching, scenario development, behavioral prediction, anomaly identification, and graph analytics. Targeted domains include material sciences, medicine genomics, high-energy physics, climate research, and astrophysics.

Challenges

The YarcData Urika graph analytics appliance does face market challenges, however. Standard servers/clusters represent the status quo in the marketplace. Even though these platforms are not designed to perform well on difficult graph analytics problems, many organizations do not yet recognize the high value attached to graph analytics or that a powerful hardware platform exists that is designed to extract this value.

Success stories from existing Urika users can help educate prospects. Also, IDC believes that problems needing graph analytics will gain visibility and importance in more organizations, making them more receptive to YarcData's solution. It helps that YarcData has been willing to engage with prospective customers in proof-of-concept projects to validate its approach.

Conclusion

Graph analytics is indispensable for a growing class of high-value Big Data problems that require the discovery of hidden, unexpected relationships. These economically and strategically important problems exist in government, commercial, and other environments. They include detecting cybersecurity breaches, insider threats, and fraud. The approaches needed to solve these problems effectively are relevant for many use cases, including national security, healthcare management, drug discovery, financial services, weather and climate research, manufacturing, and many more. RDBMSs and standard computing systems, such as clusters, are not well suited for solving these difficult problems.

YarcData's Urika graph analytics appliance is a solution designed expressly to excel on graph analytics problems. YarcData has added features designed to make this appliance easy to deploy, administer, and use. The Urika appliance is tackling a growing array of challenging graph analytics problems in customer environments across the globe.

IDC believes that graph analytics problems will grow quickly in economic importance and that the market for graph analytics solutions has substantial growth potential. As a differentiated player in this developing market, YarcData faces the challenge of educating customers about the potential of graph analytics but is well positioned to benefit from the market's anticipated growth.

ABOUT THIS PUBLICATION

This publication was produced by IDC Go-to-Market Services. The opinion, analysis, and research results presented herein are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor sponsorship is noted. IDC Go-to-Market Services makes IDC content available in a wide range of formats for distribution by various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.

COPYRIGHT AND RESTRICTIONS

Any IDC information or reference to IDC that is to be used in advertising, press releases, or promotional materials requires prior written approval from IDC. For permission requests, contact the GMS information line at 508-988-7610 or gms@idc.com. Translation and/or localization of this document requires an additional license from IDC. For more information on IDC, visit www.idc.com. For more information on IDC GMS, visit www.idc.com/gms.

Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com