Introduc)on to urika. Mul)threading. SPARQL Database. urika Appliance. XMT- 2 Programming. Use Cases



Similar documents
Introduction to urika. Multithreading. urika Appliance. SPARQL Database. Use Cases

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

excellent graph matching capabilities with global graph analytic operations, via an interface that researchers can use to plug in their own

Cray: Enabling Real-Time Discovery in Big Data

YarcData urika Technical White Paper

Run$me Query Op$miza$on

Big Data. The Big Picture. Our flexible and efficient Big Data solu9ons open the door to new opportuni9es and new business areas

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

Introduc8on to Apache Spark

DNS Big Data

Data Stream Algorithms in Storm and R. Radek Maciaszek

Six Days in the Network Security Trenches at SC14. A Cray Graph Analytics Case Study

Using RDBMS, NoSQL or Hadoop?

Chapter 3. Database Architectures and the Web Transparencies

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

AVOIDING SILOED DATA AND SILOED DATA MANAGEMENT

Inge Os Sales Consulting Manager Oracle Norway

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Cloudian The Storage Evolution to the Cloud.. Cloudian Inc. Pre Sales Engineering

Best Prac*ces for Deploying Oracle So6ware on Virtual Compute Appliance

Graph Database Performance: An Oracle Perspective

Big Data Are You Ready? Thomas Kyte

Technical White Paper. October Real-Time Discovery in Big Data Using the Urika-GD. Appliance G OVERN M ENT.

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ibis: Scaling Python Analy=cs on Hadoop and Impala

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

An Open Dynamic Big Data Driven Applica3on System Toolkit

An to Big Data, Apache Hadoop, and Cloudera

Stream Deployments in the Real World: Enhance Opera?onal Intelligence Across Applica?on Delivery, IT Ops, Security, and More

Manifest for Big Data Pig, Hive & Jaql

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

NERSC Data Efforts Update Prabhat Data and Analytics Group Lead February 23, 2015

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Bringing Big Data Modelling into the Hands of Domain Experts

Oracle Big Data SQL Technical Update

The Data Reservoir. 10 th September Mandy Chessell FREng CEng FBCS Dis4nguished Engineer, Master Inventor Chief Architect, Informa4on Solu4ons

OS/Run'me and Execu'on Time Produc'vity

Understanding the Value of In-Memory in the IT Landscape

SQream Technologies Ltd - Confiden7al

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Infrastructure Matters: POWER8 vs. Xeon x86

Privacy- Preserving P2P Data Sharing with OneSwarm. Presented by. Adnan Malik

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

The IBM Cognos Platform for Enterprise Business Intelligence

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays

Increasing Flash Throughput for Big Data Applications (Data Management Track)

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

BIG DATA AND INVESTIGATIVE ANALYTICS

bigdata Managing Scale in Ontological Systems

Scaling Out With Apache Spark. DTL Meeting Slides based on

Real-time Big Data Analytics with Storm

Big Data in medical image processing

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

NextGen Infrastructure for Big DATA Analytics.

Oracle Spatial and Graph. Jayant Sharma Director, Product Management

Integrating Open Sources and Relational Data with SPARQL

How To Use Splunk For Android (Windows) With A Mobile App On A Microsoft Tablet (Windows 8) For Free (Windows 7) For A Limited Time (Windows 10) For $99.99) For Two Years (Windows 9

Introducing Oracle Exalytics In-Memory Machine

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson

Data Center Evolu.on and the Cloud. Paul A. Strassmann George Mason University November 5, 2008, 7:20 to 10:00 PM

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Network Graph Databases, RDF, SPARQL, and SNA

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Advanced In-Database Analytics

CS 5150 So(ware Engineering System Architecture: Introduc<on

Protec'ng Communica'on Networks, Devices, and their Users: Technology and Psychology

How To Use A Webmail On A Pc Or Macodeo.Com

LDIF - Linked Data Integration Framework

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

Computer Networks. Examples of network applica3ons. Applica3on Layer

.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken

Return on Experience on Cloud Compu2ng Issues a stairway to clouds. Experts Workshop Nov. 21st, 2013

Beyond Watson: The Business Implications of Big Data

Transcription:

1

Introduc)on to urika Mul)threading SPARQL Database urika Appliance XMT- 2 Programming Use Cases 2

MTA- 1 1998 Gallium arsenide: Proof of concept First produc,on implementa,on of latency- tolerant mul,threading MTA- 2 2002 Transi5on the microprocessor to CMOS Inven,on of scaling memory in a mul,threading system XMT- 1 2008 Integrated MTA and main Cray product line XT First hybrid system to support X86 and mul,threading CPUs XMT- 2 2011 Made system ready for enterprise produc5on Largest shared memory machine in the world First installa,on CSCS, May 2011 YarcData urika 2012 Op5mized for Scalable Graph Analy5cs 3

Uses the same cabinets, boards, scalable interconnect, I/O and storage infrastructure, user environment, and administra)ve tools just changes the processor Cabinet 24 blades per cabinet Ver5cal airflow with op5onal liquid assist Compute blades 4 Threadstorm processors 16-64 GB per processor Cray XT service and I/O subsystem PCIe connec5ons to storage and networks Scalable Lustre global file system Cray XT high- speed 3D torus network Cray XT power and RAS systems Linux based user environment 4

Social Networking Intelligence/Security Telecom/Mobile Life Sciences/Biology Healthcare/Medicine Finance Internet/WWW Supply Chain 5

Big Data Structured Semi-structured Unstructured Scale- out Analy)cs Unstructured Data Highly Par))onable Datasets Keyword Search Example: Hadoop, column stores Data Warehouses BI Tools Structured Data OLAP Cubes Regular Queries, Known Variables Example: Oracle Exadata In- memory Analy)cs Structured and Unstructured Data Non- par))onable datasets Complex Queries ( PaYern Matching ) Example: YarcData urika 6

7 Graphs are hard to Partition Graphs are not Predictable Graphs are highly Dynamic? High cost to follow relationships that span Cluster Nodes High cost to follow multiple competing paths which cannot be prefetched/cached High cost to load multiple, constantly changing datasets into in-memory graph models Network is 100 times SLOWER than Memory* Memory is 100 times SLOWER than Processor* Storage I/O is 1000 times SLOWER than Memory I/O* *Source: Hennessy, J. and Patterson, D., Computer Architecture: A Quantitative Approach, 2012 edition 7

8 Graphs are hard to Partition Large Shared Memory Up to 512 TB Graphs are not Predictable Massively Mul5- threaded 128 threads/processor Graphs are highly Dynamic Highly Scalable I/O Up to 350 TB/hr Real-time, Interactive Analytics on Big Data Graph Problems 8

Cray XMT- 2 Hardware Engine Originally designed for deep analysis of large datasets Very large scalable shared memory Architecture can support 512TB shared memory Typical systems are 2 TB to 32 TB Mul5threading Unique highly mul5threaded architecture 128 hardware threads per processor Extreme parallelism, hides memory latency Mul)threaded Graph Database Highly parallel in- memory RDF / SPARQL quad store High performance inference engine High performance parallel I/O Industry Standard Front End Based on Jena open source seman5c DB All standard Linux infrastructure and languages Lustre parallel file system 9 9

1 0

Rela)ve latency to memory con)nues to increase Vector processors amor,ze memory latency Cache- based microprocessors reduce memory latency Mul5threaded processors tolerate memory latency Mul)threading is most effec)ve when: Parallelism is abundant Data locality is scarce Large graph problems perform well on this architecture Graph analysis Seman5c databases Social network analysis 1 1

Many threads per processor core; small thread state Thread- level context switch at every instruc)on cycle registers ALU stream conventional processor program counter multithreaded processor Slide 12 1 2

Conven)onal processor Mul)threaded processor memory memory When one or a few threads stall, memory/ network bandwidth become idle Although some threads stall, others keep issuing local/remote memory requests, keeping most precious resources busy network network Slide 1 3

To the programmer, a mul)ple processor XMT- 2 looks like a single processor, except that the number of threads is increased. Bokhari/Sauer 2003 1 4

1 5

RDF triples databases are inherently graphical Some researchers call seman)c databases seman)c graph databases Vancouver in-country Canada is-type country object Edmonton has-population 730,000 North America 1 6

Resource Descrip)on Framework (RDF) Designed to enable seman5c web searching and integra5on of disparate data sources W3C standard formats Every datum represented as subject/predicate/object Ideally with each of those expressed with a URI Standard ontologies in some domains e.g., Open Biological and Biomedical Ontologies (OBO) Examples: subject predicate object <ncbitax:ncbitaxon_840261> rdf:type owl:class <ncbitax:ncbitaxon_195644> rdfs:subclassof <ncbitax:ncbitaxon_185881> <ncbitax:ncbitaxon_816681> rdfs:label "Characiformes sp. BOLD:AAG5151"@en 1 7

SPARQL Protocol and RDF Query Language (SPARQL) Enables matching of graph paderns in the seman5c DB Reminiscent of SQL # Lehigh University BenchMark (LUBM) Query 9 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> SELECT?X,?Y,?Z WHERE {?X rdf:type ub:student.?y rdf:type ub:faculty.?z rdf:type ub:course.?x ub:advisor?y.?y ub:teacherof?z.?x ub:takescourse?z} PREFIX == shorthand for a URI variables to be returned from the query find sets of (X, Y, Z) with a subject X of type Student, a subject Y of type Faculty, and a subject Z of type Course, where X is an advisee of Y, Y teaches course Z, and X takes course Z 1 8

1 9

20 RDF SPARQL Java Linux Apache Tomcat/Jena Graph Analy)c Tools Management, Security, Data Pipeline Graph Analy)c Database RDF datastore and SPARQL engine Gadgets / Mashups Graph Analy)c Plagorm Shared- memory, Mul5- threaded, Scalable I/O graph- op5mized hardware Data Center sees another Linux server Applica)ons see industry standard interface RDF, SPARQL, Java, Gadgets Reuse Exis)ng Skillsets Java, OSGI, App Server, SOA, ESB, Web toolkit Standard languages All applica5ons and ar5facts built on urika can be run on other plalorms 2 0

SELECT?p?x WHERE { (?x type person) (?p sells DVD ) (?x shops-at?p) } API SPARQL Engine database BGP SPARQL Ingest Applica)on Jena/Fuseki HTTP Interface Send query Parse, interpret query Set up query engine ORDER DISTINCT OPTIONAL UNION interface to Web Receive results Translate, send results FILTER generate low-level query urika service nodes urika compute nodes 21 2 1

Data Warehouse Hadoop Other Big Data Appliances Existing Analytic Environment(s) Export Analy)c/ Rela)onship Results Import Graph Datasets User/App Visualiza)on 2 2

2 3

Automa)c Parallelizing C / C++ Compiler Compiler runs on Linux login node The executable runs on the compute nodes Provides automa5c parallelism if proper op5ons are included Paralleliza5on direc5ves are used to provide hints to the compiler Login Nodes run standard SuSe (SLES 11) Linux All usual Linux languages and applica5ons may be used A hybrid programming model is possible, and encouraged Lightweight user communica)on (LUC) interface Use LUC to build a client/server interface between the front- end and back- end Symmetric; RPC- style interface in either direc5on Lustre global file system High- performance parallel file system used across Cray line 2 4 Slide

Primary responsibility for machine virtualiza)on. Maps simple programming model (flat memory, infinite processors) to actual hardware. Performs general management of resources such as processors and hardware streams. Responsible for thread management Thread crea5on/destruc5on/blocking/unblocking. Work scheduling - automa5c intra- process load balancing. Handles hardware traps and OS- induced swaps. Provides a highly- op)mized parallel memory allocator. Provides support for auxiliary func)ons. Tracing and profiling. Client- side debugging. Slide 25 2 5

Appren)ce2 GUI applica5on for debugging performance problems Consists of one or more reports based on how your program was compiled and executed Canal Report Feedback from the compiler Insight on how latent parallelism was exploited Informa5on on expected resource u5liza5on and scheduling Tview Report Hardware counter plots Actual performance of your applica5on Run5me trap informa5on for detec5ng hotspots Bprof Report Profile tables in terms of instruc5ons issued and memory references 26 2 6

2 7

Discover new drugs and treatments The Challenge Mul5ple massive datasets describing biological network graphs in cancer cells from published literature and experimental data, constantly updated Un- par55onable, densely and irregularly connected graphs Medical research churns out massive quan55es of literature, but most researchers don t have 5me to read even their own specialty publica5ons Over 10,000 genomics publica5ons per month currently urika Solu)on urika holds un- par55oned genomic network graph in memory Contrast experimental models and theories with published results to discover previously unknown rela5onships Interac5ve, real 5me access by mul5ple researchers Confirmation of elevated VEGF levels by tissue microarray: 2 8

The Cancer Genome Atlas (TCGA) data combined with literature rela)onships from Medline Protein domain interac5ons are extracted from TCGA using Random Forest classifica5on A Normalized Medline Distance (NMD) between proteins is calculated from Medline 5tles and abstracts Protein data from Uniprot and Pfam are also included 1.2 billion entries that generate 6B triples New data sets constantly being added- muta5on data, pa5ent data, methyla5on data, sta5s5cal analysis results and more Current database technology does not have the performance for useful full- scale queries across this data Muta5on Methyla5on Genes (TCGA) Protein Data (UniProt) GRAPH Query Literature (Medline) Protein Families (PFAM) Correla5ons (Sta5s5cal) Pa5ents (Health Records) 2 9

Quickly adapt surveillance to new rules and paderns The Challenge Con5nuous stream of possible compliance events New compliance rules require constant updates of signals and paderns Current solu5ons require offline prepara5on and are based on rigid rules urika Solu)on En5re rela5onship graph in memory New Paderns/templates can be iden5fied and added in real- 5me Graphical interac5ve explora5on of rela5onships between people, places, things, organiza5ons, communica5ons, etc. Business Value A responsive, flexible event detec5on plalorm that adapts to new knowledge Figure Source: Graph-based technologies for intelligence analysis,t. Coffman, S. Greenblatt, S. Marcus, Commun. ACM, 47(3):45-47, 2004 3 0

Deploy beder trading strategies with enhanced Market Sensing The Challenge Trading strategies that can react quickly and con5nuously to market drivers Market drivers are news and other events which may impact current trading outlook, posi5ons Enhanced market sensing requires seman5c analysis of news and a scalable graph- based model to implement urika Solu)on Ability to hold en5re market model in- memory as a dependency graph Ability to filter news relevant to posi5ons based on seman5c analysis Interac5ve, real- 5me access from trader worksta5ons Business Value Beder trading strategies that can be based on augmented market data including news and other secondary events of relevance 3 1

Discover similar Pa5ents to op5mize treatment The Challenge Longitudinal, historical data spanning all events, symptoms, diagnoses, diseases, treatments, prescrip5ons, etc of 10M pa5ents including gene5cs and family history Ad- hoc, constantly changing defini5on of similarity based on thousands of parameters Interac5ve, real- 5me response during consulta5on urika Solu)on urika holds en5re rela5onship graph in memory Iden5fy similar pa5ents based on ad- hoc physician specified paderns Interac5ve, real- 5me access by en5re physician community Business Value Consistent selec5on of the most effec5ve treatment for each pa5ent by each doctor every 5me Blood pressure Prior myocardial infarction Hypertension Body mass index HDL Anti-hypertension meds Patient: Jean Generic 3 2

3 3

SPARQL by Example tutorial, by Lee Feigenbaum, hdp:// www.cambridgeseman5cs.com/2008/09/sparql- by- example/ Search RDF data with SPARQL, by Philip McCarthy http://www.ibm.com/developerworks/xml/library/j-sparql/ Seman5c Web for the Working Ontologist, by Dean Allemang and James Hendler, ISBN 978-0123859655 Learning SPARQL, by Bob DuCharme, O Reilly, ISBN 978-1- 449-30659- 5 RDF: www.w3.org/rdf/ SPARQL: www.w3.org/tr/rdf- sparql- query/ D2R: www4.wiwiss.fu- berlin.de/bizer/d2r- server 3 34 4