Technical White Paper
Real-Time Discovery in Big Data Using the Urika-GD Appliance
October 2014 | www.cray.com

[Cover graphic: application areas include life sciences, sports analytics, fraud, scientific research, cybersecurity, government, telecommunications, customer insights and financial services.]

Table of Contents

Executive Summary
Discovery Through Human-Machine Collaboration
Using Graphs for Discovery Analytics
Introducing the Cray Urika-GD Graph Analytics Appliance
    Overview of the Multiprocessor, Shared Memory Architecture
    Addressing the Memory Wall Through Massive Multithreading
    Delivering High Bandwidth with a High Performance Interconnect
    Enabling Fine-Grained Parallelism with Word-Level Synchronization Hardware
    Delivering Scalable I/O to Handle Dynamic Graphs
    Comparison: The Urika-GD Appliance's Hardware and Commodity Hardware
The Urika-GD System Software Stack
    The Graph Analytics Database
    Enabling Ad-hoc Queries and Pattern-based Search with RDF and SPARQL
    Augmenting Relationships in the Graph Through Inferencing
    Benefits of the Urika-GD System's Software Architecture
The Benefits of an Appliance
Integrating the Urika-GD Appliance into an Existing Analytics Environment
    Building the Graph
    Visualization
    Integration with Other Analytics Packages
Conclusion

Executive Summary

Discovery, the often accidental revelations that have changed the world since Archimedes, is a vital component of the advancement of knowledge. The recognition of previously unknown linkages between occurrences, objects and facts underpins advances in such diverse areas as life sciences (cancer drug discovery, personalized medicine or understanding the spread of disease), financial services (counterparty credit risk analysis, fraud detection, identity resolution or anti-money laundering) and government operations (cybersecurity threat analysis, person-of-interest identification or counterterrorism threat detection). New discoveries often deliver very high value: consider the harm avoided through the proactive detection of fraud or counterterrorism operations, or the billions of dollars in revenue generated by a new cancer drug. The traditionally slow pace of discovery is being greatly accelerated by the advent of big data.

Discovery takes place when a researcher has a "Eureka!" moment, where a flash of insight leads to the formulation of a new theory, followed by a painstaking validation of that theory against observations in the real world. Big data can assist in both of these phases. Applying analytics and visualization to the huge volume of captured data stimulates insight, and the ability to test new theories electronically can speed validation a thousandfold, fulfilling the true promise of big data, as long as an organization's systems are up to the challenge.

Traditional data warehouses and business intelligence (BI) tools built on relational models are not well suited to discovery, however. BI tools are highly optimized to generate defined reports from operational systems or data warehouses. They require the development of a data model that is designed to answer specific business questions, but the model then limits the types of questions that can be asked. Discovery, in contrast, is an iterative process, where a line of inquiry may raise previously unanticipated questions, which may in turn require new sources of data to be loaded. Accommodating these will likely require time-consuming, error-prone and complex extensions to the data model, for which saturated IT professionals do not have time.

A new approach is needed. An approach that:

- Separates data from its representation, allowing new data sources and new relationships to be included without complex data model changes.
- Supports a wide range of ad-hoc analysis as needed to spark insight and validate new theories. Typically, these will take the form of searching for patterns of relationships, but other types of analytics and visualization will also be required.
- Operates in real time, supporting collaborative, iterative discovery in very large datasets.

Graph analytics are ideally suited to meet these challenges. Graphs represent entities and the relationships between them explicitly, greatly simplifying the addition of new relationships and new data sources, and they efficiently support ad-hoc querying and analysis. Real-time response to complex queries against multi-terabyte graphs is also achievable, given the appropriate platform. Cray's Urika-GD appliance is built to meet the challenging requirements of discovery.
With one of the world's most scalable shared memory architectures, the Urika-GD appliance employs graph analytics to surface unknown linkages and non-obvious patterns in big data, to do so with speed and simplicity, and to facilitate the kinds of breakthroughs that can give any organization a measurable advantage, in activities ranging from national security to fraud detection, medical and pharmaceutical research, financial services and even retail. The Urika-GD appliance complements existing data warehouses and Hadoop clusters by offloading challenging data discovery applications while still interoperating with the existing analytics workflow.

Discovery Through Human-Machine Collaboration

"All truths are easy to understand once they are discovered; the point is to discover them." (Galileo Galilei)

Discovery is the desired outcome of an investigative analytical process. Discovery in big data requires the collaboration of man and machine, where the guiding intellect (the ability to posit and infer) is human. In time, artificial intelligence may be able to make suppositions and draw conclusions but, for now, humans still have the advantage.

The process of discovery is iterative, as shown in Figure 1. The analyst must be able to test a hypothesis against all available data by posing a question that the technology answers in depth and then renders visually, shortening the time between results. This requires the ability to ask questions that were not anticipated by those who built the knowledge base, referred to as ad-hoc queries in the database world. In discovery, you don't know the next question until you get the first answer, and each iteration may require additional datasets for analysis. The addition of those datasets demands fast, flexible and powerful I/O. This cycle continues until the "Eureka!" moment, where the analyst makes a high-value breakthrough discovery. Example 1, on cancer drug discovery, illustrates this process.

Figure 1. The cycle of discovery: discovery through fast hypothesis validation.

With traditional analytics technologies, discovery is challenging because of several interrelated difficulties:

1. Predicting what data is needed. Discovery depends upon the ability to import and combine new datasets, ranging from structured (databases) to semistructured (XML, log files) to unstructured sources (text, audio, video), as needed to support new lines of inquiry. Traditional analytics solutions use a fixed data schema, and the addition of new types of data and relationships between data items involves complex, time-consuming schema extension, often requiring person-weeks or months of effort. Analysts using traditional technologies report spending up to 80 percent of their time on data import and schema manipulation.

2. Predicting what questions will be asked. Discovery depends on the ability to follow up on new lines of questioning, including questions about the relationships implied within the data. Traditional solutions depend upon optimizing data schemata for specific queries in order to deliver acceptable performance. Failure to do so results in nested table joins, which are very damaging to performance. IT groups have described these as "forbidden queries" for their tendency to bring the analytics infrastructure to a grinding halt until the queries are killed.

3. Delivering predictable, real-time performance as data sizes and query complexity grow. Discovery depends upon real-time results being delivered in response to queries. Traditional systems have difficulty achieving deterministic response times to ad-hoc queries, let alone real-time response as dataset sizes and query complexity grow. The result is that analysts cherry-pick their lines of reasoning, driven by system capability rather than investigating all the avenues desired, introducing bias from their own preconceptions.

The result of these challenges is an organizational unwillingness to experiment extemporaneously with data unless the value has been proven beyond a shadow of a doubt. This is a major constraint on innovation.

Example 1: Cancer Drug Discovery Using Graph Analytics

The Institute for Systems Biology (ISB) is approaching the challenge of cancer drug discovery using a systems biology approach, involving the modeling of the formation and growth of tumors at the molecular level. The objective is to understand the gene mutations and the biological processes that lead to cancer in order to discover highly targeted treatments. This is very challenging because the volume of published, relevant scholarly articles and of genomic and protein databases is beyond human ability to digest.

ISB tackled this problem by using natural language processing to extract the relationships contained in Medline articles, which provide journal citations and abstracts for biomedical literature from around the world. They combined these relationships with genomic and proteomic data on healthy and cancerous cells from the Cancer Genome Atlas and other databases, as well as their own experimental wet-lab results, into a very large graph comprising billions of relationships. New sources of data were continually added as their relevancy was determined. Researchers wrote complex, ad-hoc queries, effectively validating hypotheses in silico, in an iterative process where each new set of results suggested new lines of inquiry. Graph analytics served ISB very well for discovery.
They were able to quickly add new sources of data and new types of relationships as they were uncovered, and to write sophisticated, partially specified queries looking for patterns of relationships in the data. Visualization of the results enabled quick comprehension, while the ability to export large sets of results for statistical processing helped guide the discovery process and provided statistical rigor. "In the amount of time it took to validate one hypothesis, we can now validate 1,000 hypotheses, increasing our success rate significantly," remarked Dr. Ilya Shmulevich of the ISB.

This approach led to the discovery that many breast cancers have an increase in the expression of the ABCG2 gene, and that the HIV drug nelfinavir inhibits ABCG2. This drug is a strong candidate for repurposing to treat breast cancer, a discovery with considerable potential revenue opportunity. Repurposing is a very cost-effective way of bringing new drug therapies to market, and graph analytics are now a proven way to identify these opportunities.
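To make the idea of a partially specified, pattern-based query concrete, the sketch below shows the general shape such a SPARQL query might take: find drugs reported in the literature to inhibit genes that expression data shows to be overexpressed in a tumor type. The ex: prefix and the predicate names (ex:overexpressedIn, ex:inhibits) are hypothetical placeholders chosen for illustration; they are not ISB's actual vocabulary or queries.

    # Hypothetical vocabulary (ex:) for illustration only; not ISB's actual schema.
    PREFIX ex: <http://example.org/biograph#>

    SELECT DISTINCT ?drug ?gene
    WHERE {
      ?gene  ex:overexpressedIn  ex:BreastCancer .   # relationship derived from expression data
      ?drug  ex:inhibits         ?gene .             # relationship extracted from the literature
    }

Because the query constrains only the pattern of relationships, not specific genes or drugs, any new data source loaded into the graph is automatically included the next time the query runs.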

Using Graphs for Discovery Analytics

Using graphs in data analytics provides many advantages. "What differentiates Urika from the many graph databases available today is its ability to enable data discovery at scale and on an interactive basis." (Chartis Research, "Looking for Risk: Applying Graph Analytics to Risk Management," Peyman Mestchian)

A graph consists of nodes, representing data items, and edges, representing the relationships between nodes, as shown in Figure 2. Graphs represent the relationships between data explicitly, facilitating analysis of patterns of relationships, a key aspect of discovery. Contrast this with traditional tabular representations, where the focus is on processing data (the rows in the tables), and where relationships are second-class entities, represented indirectly by table column headings and indices.

Graphs address the challenges presented by traditional analytics:

1. Predicting what data is needed. Graphs provide a flexible data model, where new types of relationships are readily added, greatly simplifying the addition of new data sources. Relationships extracted from structured, semistructured or unstructured data can be readily represented in the same graph.

2. Predicting what questions will be asked. Graphs have no fixed schema constraining the universe of queries that can be posed. Relationships are not hidden: it's possible to write queries about the very types of relationships that exist in the data (illustrated in the query sketch following Figure 2). Graphs also enable advanced analytics techniques such as community detection, path analysis, clustering and others.

3. Delivering predictable, real-time performance as data sizes and query complexity grow. Graph analytics can deliver predictable, real-time performance, as long as the hardware and software are appropriate to the task. Cray developed the Urika-GD appliance specifically for this task, as described in the following sections.

These attributes enable graph analytics to deliver value incrementally. As understanding grows, new data sources and new relationships can be added, building an ever more potent and accurate model.

Figure 2. Graphs consist of nodes (data items) and edges (relationships).
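Because relationships are first-class data in the graph, a query can ask directly which kinds of relationships exist, with no schema knowledge required. The following minimal sketch uses only standard SPARQL 1.1 aggregation, with no product-specific assumptions, to list every relationship type present in a graph and how often it occurs.

    # List every relationship (predicate) type in the graph and its frequency.
    SELECT ?relationship (COUNT(*) AS ?occurrences)
    WHERE {
      ?subject ?relationship ?object .
    }
    GROUP BY ?relationship
    ORDER BY DESC(?occurrences)

A result set like this is often the first step of an iterative investigation: it tells the analyst what the graph can be asked before any specific hypothesis is posed.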

Introducing the Cray Urika-GD Graph Analytics Appliance

"Uncovering previously unknown patterns and relationships across increasingly large repositories of multistructured data represents one of the biggest opportunities to derive new sources of innovation, growth and productivity from analytics." (Gartner, "Cool Vendors in Content and Social Analytics," by Rita Salaam)

The Urika-GD appliance was introduced in recognition of the important role that graph analytics can play in discovery. A large governmental organization approached Cray about performing discovery analytics on a large and constantly growing graph. They had investigated several technologies, but none satisfied their needs. Analysis of this organization's needs led to a canonical list of hardware requirements for all graph analytics (the software requirements are discussed in a later section):

Discovery analytics requires real-time response: A multiprocessor solution is required for scale and performance. Many graph analytics solutions are single-computer implementations, very useful for small problems but unusable at scale.

Graphs are hard to partition: A large, shared memory is required to avoid the need to partition graphs. Analyzing graph relationships requires following the edges in the graph. Regardless of the scheme used, partitioning the graph across a cluster will result in edges spanning cluster nodes. In most cases, the number of edges crossing cluster nodes is so large that a time-consuming network transfer is required each time those edges are crossed. Compared to local memory, even a fast commodity network such as 10 Gigabit Ethernet is at least 100 times slower at transferring data. Given the highly interconnected nature of graphs, users gain a significant processing advantage if the entire graph is held in a sufficiently large shared memory.

Graphs are not predictable, and therefore cache-busting: A custom graph processor is needed to deal with the mismatch between processor and memory speeds. Analyzing relationships in large graphs requires the examination of multiple, competing alternatives. These memory accesses are highly data dependent and eliminate the ability to apply traditional performance improvement techniques such as pre-fetching and caching. Given that even RAM is 100 times slower than processors, and that graph analytics consists of exploring alternatives, the processor sits idle most of the time waiting for delivery of data. Cray developed hardware multithreading technology to help alleviate this problem. Threads can explore different alternatives, and each thread can have its own memory access in flight. As long as the processor supports a sufficient number of hardware threads, it can be kept busy. Given the highly nondeterministic nature of graphs, a massively multithreaded architecture provides a tremendous performance advantage.

Graphs are highly dynamic: A scalable, high performance I/O system is required for fast loading. Graph analytics for discovery involves examining the relationships and correlations between multiple datasets and, consequently, requires loading many large, constantly changing datasets into memory. The sluggish speed of I/O systems, roughly 1,000 times slower than the CPU, translates into graph load and modification times that can stretch into hours or days, far longer than the time required for running the analytics. In a dynamic enterprise with constantly changing data, a scalable I/O infrastructure provides a tremendous performance advantage for discovery.
These requirements drove the design of the Urika-GD system's hardware and resulted in a hardware platform proven to deliver real-time performance for complex data discovery applications.

Overview of the Multiprocessor, Shared Memory Architecture

Cray's Urika-GD appliance is a heterogeneous system consisting of Urika-GD appliance services nodes and graph accelerator nodes linked by a high performance interconnect fabric for data exchange (see Figure 3). Graph accelerator nodes ("accelerator nodes") use a purpose-built Threadstorm processor capable of delivering several orders of magnitude better performance on graph analytics applications than a conventional microprocessor. Accelerator nodes share memory and run a single instance of a UNIX-based, compute-optimized OS named the multithreaded kernel (MTK). Urika-GD appliance services nodes ("service nodes"), based on x86 processors, provide I/O, appliance management and database management. Service nodes may be added as desired, enabling connectivity and management functions to scale for larger Urika-GD appliances. Each service node runs a distinct instance of a fully featured Linux operating system.

The interconnect fabric is designed for high-speed access to memory anywhere in the system from any processor, as well as for scaling to large processor counts and memory capacities. The Urika-GD system architecture supports flexible scaling to 8,192 graph accelerator processors and 512 TB of shared memory. Urika-GD systems can be expanded incrementally to this maximum size as data analytics needs and dataset sizes grow.

Figure 3. Urika-GD system architecture: graph accelerator nodes (Threadstorm processors running MTK) and appliance services nodes (x86 processors running SUSE Linux), connected by the interconnect fabric to the network and RAID controllers.

Addressing the Memory Wall Through Massive Multithreading

"A DBMS designed for known relationships and anticipated requests runs badly if the relationships actually discovered are different, and if requests are continually adapted to what is learned." (Gartner, "Urika Shows Big Data Is More than Hadoop and Data Warehouses," by Carl Claunch)

"Memory wall" refers to the growing imbalance between CPU speeds and memory speeds. Starting in the early 1980s, CPU speed improved at an annual rate of 55 percent, while memory speed improved at a rate of only 10 percent. This imbalance has traditionally been addressed either by managing latency or by amortizing latency. Neither approach, however, is suitable for graph analytics.

Managing latency is achieved by creating a memory hierarchy (levels of hardware caches) and by software optimization to pre-fetch data. This approach is not suitable for graph analytics, where the workload is heavily dependent on pointer chasing (the following of edges between nodes in the graph), because the random access to memory results in frequent cache misses and in processors stalling while waiting for data to arrive.

Amortizing latency is achieved by fetching large blocks of data from memory. Vector processors and GPUs employ this technique to great advantage when all the data in the retrieved block is used in computation. This approach is also not suitable for graph analytics, where relatively little data is associated with each graph node other than pointers to other nodes.

In response to the ineffectiveness of managing or amortizing latency on graph problems, a new approach was developed: the use of massive multithreading to tolerate latency. The Threadstorm processor is massively multithreaded, with 128 independent hardware streams. Each stream has its own register set and executes one instruction thread. The fully pipelined Threadstorm processor switches context to a different hardware stream on each clock cycle. Up to eight memory references can be in flight for each thread, and each hardware stream is eligible to execute every 21 clock cycles if its memory dependencies are met. No caches are necessary or present anywhere in the system, since the fundamental premise is that at least some of the 128 threads will have the data required to execute at any given time. Effectively, Threadstorm performs multiple random, dynamic memory references simultaneously without pre-fetching or caching, turning the memory latency problem into a requirement for high bandwidth.

Delivering High Bandwidth with a High Performance Interconnect

The Urika-GD system uses a purpose-built high-speed network. This interconnect links nodes in a 3-D torus topology (Figure 3) to deliver the system's high communication bandwidth. The topology provides excellent cross-sectional bandwidth and scaling without layers of switches. Key performance data for the interconnection network:

- Sustained bidirectional injection bandwidth of more than 4 GB/s per processor and an aggregate bandwidth of almost 40 GB/s through each vertex in the 3-D torus
- Efficient support for Threadstorm remote memory access (RMA), as well as direct memory access (DMA), for rapid and efficient data transfer between the service nodes and accelerator nodes

The combination of a high-bandwidth, low-latency network and massively multithreaded processors makes the Urika-GD appliance ideally suited to handle the most challenging graph workloads.

Comparison: The Urika-GD Appliance's Hardware and Commodity Hardware

The Urika-GD appliance hardware provides a number of key advantages for discovery analytics over commodity cluster systems: a large, global shared memory; extreme processing power; a purpose-built, massively multithreaded graph acceleration processor; extreme memory bandwidth; and extreme tolerance for memory latency. The table below sums up these differentiators and the benefit each provides for graph analytics.

Enabling Fine-Grained Parallelism with Word-Level Synchronization Hardware

The benefits of massive parallelism are quickly lost if synchronization between threads involves serial processing, in accordance with Amdahl's Law. The Threadstorm processors and memory implement fine-grained synchronization to support asynchronous, data-driven parallelism, and they spread the synchronization load physically within the memory and interconnection network to avoid hot spots. Full/empty bits are provided on every 64-bit memory word in the entire global address space for fine-grained synchronization. This mechanism can be used across unrelated processes on a spin-wait basis. Within a process (which can have up to tens of thousands of active threads), multiple threads can also perform efficient blocking synchronization using mechanisms based on the full/empty bits. The Urika-GD system's software uses this mechanism directly, without OS intervention in the hot path.

Delivering Scalable I/O to Handle Dynamic Graphs

Any number of appliance services nodes can be plugged into the interconnect, allowing the appliance's I/O capabilities to be scaled independently of the graph processing engine. A Lustre parallel file system is used to provide scalable, high performance storage. Lustre is an open-source file system designed to scale to multiple exabytes of storage and to provide near-linear scaling in I/O performance as Lustre nodes are added.

Differentiator: Large global shared memory
Urika-GD system capability: Scales up to 512 TB
Significance to discovery analytics: Enables uniform, low-latency access to all the data, regardless of data partitioning, layout or access pattern. A large shared memory holds the entire graph, avoiding the need to partition it and enabling unknown linkages and non-obvious patterns in the data to be easily surfaced with no advance knowledge of the relationships in the dataset.

Differentiator: Extreme processing power
Urika-GD system capability: Scales up to 8,192 processors
Significance to discovery analytics: Achieving real-time performance requires employing as many processors as needed, all sharing the same memory. This scalability ensures interactive response on the most demanding workloads.

Differentiator: Massive multithreading
Urika-GD system capability: 128 hardware threads per processor
Significance to discovery analytics: Graph analytics involves random memory access patterns, which cause individual threads to stall. Processors can tolerate this latency if they have multiple concurrently executing hardware threads, so that there are always threads ready to execute when a memory stall occurs. The Urika-GD appliance is effectively investigating multiple changing hypotheses simultaneously, in real time, enabling it to deliver a two to four orders of magnitude improvement in performance.

Differentiator: Extreme memory performance
Urika-GD system capability: Memory bandwidth scales with the size of the appliance
Significance to discovery analytics: Traditional processors amortize memory latency (they make an inherent assumption that data will have locality, so they retrieve blocks of data into a complex hierarchy of caches). Graphs, and discovery applications generally, do not have locality, so this approach doesn't work. The Urika-GD platform's Threadstorm processors tolerate latency through massive multithreading. However, each thread can issue up to eight concurrent memory references, so massive memory bandwidth is required to keep the processors running at peak performance. Massive multithreading and extreme memory bandwidth go hand in hand to deliver the Urika-GD system's performance advantage.

Word-level memory synchronization hardware enables near-linear scaling to high thread and processor counts. Together, these optimizations deliver an appliance finely tuned to the requirements of discovery in big data.

The Urika-GD System Software Stack

The Urika-GD system's software stack was crafted with several goals in mind:

- Create a standards-based appliance for real-time data discovery using graph analytics
- Facilitate migration of existing graph workloads onto the Urika-GD appliance
- Allow users to easily fuse diverse datasets from structured, semistructured and unstructured sources without upfront modeling, schema design or partitioning considerations
- Enable ad-hoc queries and pattern-based searches across the entire dynamic graph database

The Urika-GD appliance software (Figure 4) is partitioned across the two types of nodes, service nodes and accelerator nodes, with each processing the workload for which it is best suited. The service nodes run discrete copies of Linux and are responsible for all interactions with the external world: database and appliance management, and database services, including a SPARQL endpoint for external query submission. The service nodes also perform network and file system I/O. A Lustre parallel file system enables near-linear scalability across multiple service nodes, allowing even the largest datasets to be loaded into memory in minutes. The accelerator nodes maintain the in-memory graph database, including loading and updating the graph, performing inferencing and responding to queries.

Figure 4. Urika-GD system software architecture: RDF, SPARQL, Java and visualization tools interface with graph analytics application services (database manager, database services and visualization services) on the services nodes running SUSE Linux 11, and with the graph analytics database on the accelerator nodes running the optimized multithreaded kernel.

The Graph Analytics Database

The graph analytics database provides an extensive set of capabilities for defining and querying graphs using the industry-standard RDF and SPARQL. These standards are widely used for storing graphs and performing analytics against them. Cray built the database and query engine from the ground up to take advantage of the massive multithreading of the Threadstorm processors. The standards-based approach and comprehensive feature set ensure that existing graph data and workloads can be migrated onto the Urika-GD platform with minimal or no changes to existing queries and application software. With the Urika-GD appliance, query results can be sent back to the user or written to the parallel file system. The latter capability can be very useful when the set of results is very large.

Figure 5. An example of identifying threat patterns (concept graph linking entities such as fertilizer purchases, transport and factories).
Goal: Proactively identify patterns of activity and threat candidates by aggregating intelligence and analysis.
Datasets: Reference data, people, places, things, organizations, communications...
Technical challenges: Volume and velocity of data; inaccurate, incomplete and falsified data.
Users: Intelligence analysts.
Usage model: Search for patterns of activity and graphically explore relationships between candidate behavior and activities.
Augmenting: Existing Hadoop cluster and multiple data appliances.
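The usage model in Figure 5, searching for patterns of activity across aggregated intelligence data, maps naturally onto SPARQL's graph pattern matching. The sketch below is purely illustrative: the ex: prefix and all predicate and class names are hypothetical and do not represent any actual intelligence schema or any Urika-GD-specific extension.

    # Hypothetical schema (ex:) for illustration only.
    PREFIX ex: <http://example.org/intel#>

    SELECT DISTINCT ?person
    WHERE {
      ?purchase  ex:purchasedBy   ?person ;
                 ex:itemCategory  ex:Fertilizer .   # bulk fertilizer purchase
      ?rental    ex:rentedBy      ?person ;
                 ex:vehicleType   ex:Truck .        # vehicle rental by the same person
      ?person    ex:visited       ?site .
      ?site      ex:siteType      ex:Factory .      # visit to a factory site
    }

As new intelligence sources are loaded into the graph, the same pattern query can be rerun unchanged, which is the behavior the iterative discovery cycle described earlier depends on.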


More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

The IBM Cognos Platform

The IBM Cognos Platform The IBM Cognos Platform Deliver complete, consistent, timely information to all your users, with cost-effective scale Highlights Reach all your information reliably and quickly Deliver a complete, consistent

More information

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS ..................................... PEPPERDATA IN MULTI-TENANT ENVIRONMENTS technical whitepaper June 2015 SUMMARY OF WHAT S WRITTEN IN THIS DOCUMENT If you are short on time and don t want to read the

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System By Jake Cornelius Senior Vice President of Products Pentaho June 1, 2012 Pentaho Delivers High-Performance

More information

Virtual Data Warehouse Appliances

Virtual Data Warehouse Appliances infrastructure (WX 2 and blade server Kognitio provides solutions to business problems that require acquisition, rationalization and analysis of large and/or complex data The Kognitio Technology and Data

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory) WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...

More information

The Next Wave of Data Management. Is Big Data The New Normal?

The Next Wave of Data Management. Is Big Data The New Normal? The Next Wave of Data Management Is Big Data The New Normal? Table of Contents Introduction 3 Separating Reality and Hype 3 Why Are Firms Making IT Investments In Big Data? 4 Trends In Data Management

More information

Cloudera Enterprise Data Hub in Telecom:

Cloudera Enterprise Data Hub in Telecom: Cloudera Enterprise Data Hub in Telecom: Three Customer Case Studies Version: 103 Table of Contents Introduction 3 Cloudera Enterprise Data Hub for Telcos 4 Cloudera Enterprise Data Hub in Telecom: Customer

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP selects SAP HANA to improve the speed of business analytics with IBM and SAP Founded in 1806, is a global consumer products company which sells nearly $17 billion annually in personal care, home care,

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information