
bigdata: Flexible, Reliable, Affordable Web-scale computing. bigdata 1

Background. Requirement: fast analytic access to massive, heterogeneous data. Traditional approaches: relational databases, supercomputers, business intelligence tools, and semantic web or graph-based systems. New approach: an evolution led by internet companies; linear scale-out on commodity hardware; higher concurrency and uptime at lower cost. bigdata 2

The Information Factories (http://www.wired.com/wired/archive/14.10/cloudware.html). Google: massively concurrent reads and writes, with atomic row updates and petabytes of data; the basic data model is { application key, column name, timestamp } : { value }. Yahoo/Amazon: cloud computing and functional programming to distribute processing over clusters (Apache Hadoop, Amazon's elastic computing cloud). eBay: data partitioned horizontally and into transactional and non-transactional regimes. bigdata 3

Use Cases. Online services: search (Google), advertising (Google, AOL), recommendation systems (Amazon), auctions (eBay), logistics (Wal-Mart, FedEx). Federal government: OSINT, predict and interdict, information fusion. Bond trade analysis. Bioinformatics. bigdata 4

OSINT. Harvest OSINT to support the IC mission: many source languages; directed and broad harvests; long-standing and ad hoc analyses; entities of interest automatically tagged; knowledge captured and republished using a semantic web database, aligned to multiple ontologies. Challenges: scaling deployment from department to enterprise; continued data growth and an increasing growth rate; federated, policy-based information sharing at scale. bigdata 5

Connect the actors. A multi-agency database of names, events, and relationships. Data quality issues (alternative spellings, timelines, provenance). Policy-based data sharing. Key computation: preempt actors by linking entities and events to predict and interdict enemy operations. bigdata 6

Semantic Web Databases. They facilitate these dynamic demands, but not at the necessary scale: the best solutions offer about 15k triples per second, so loading billions of triples takes on the order of 20 hours, and they remain scale-up architectures. bigdata 7
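A quick check of the arithmetic behind that figure (a throwaway sketch, not part of bigdata):

```python
# Load time for a fixed corpus at a fixed triples-per-second rate.
def load_hours(triples: int, tps: float) -> float:
    """Hours needed to load `triples` statements at `tps` triples/second."""
    return triples / tps / 3600.0

hours = load_hours(1_000_000_000, 15_000)  # one billion triples at 15k tps
print(f"{hours:.1f} hours")                # ~18.5 hours, i.e. roughly the 20 cited
```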

Desired solution characteristics. Cost-effective, scalable access to and analysis of massive data: petabyte scale; linear scale-out; support for rapid innovation; low management costs. Very high parallelism, concurrency, resource utilization, and aggregate IO bandwidth. bigdata 8

bigdata architecture bigdata 9

Architecture layer cake. Services: search & indexing, events, data analysis, pub/sub, SLA reports. Data models: GOM (OODBMS), RDFS, sparse row store. bigdata storage and computing fabric: transaction manager, map/reduce, metadata service, data service. bigdata 10

Distributed Service Architecture. Centralized services: transaction manager, metadata service. Distributed services: data services, map/reduce services. A grid-enabled cluster: service discovery using Jini; data discovery using the metadata service; automatic failover for the centralized services. bigdata 11

Service Discovery. 1. Data services discover registrars and advertise themselves. 2. Metadata services discover registrars, advertise themselves, and monitor data service leave/join. 3. Clients discover registrars, look up the metadata service, and use it to locate data services. 4. Clients talk directly to data services. bigdata 12
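The four steps can be sketched with a toy in-memory registrar (the class and method names are hypothetical; the actual system uses Jini discovery):

```python
# Minimal in-memory sketch of the advertise/discover flow described above.
# Illustrative names only; this is not the Jini API.
class Registrar:
    def __init__(self):
        self._services = {}  # service kind -> list of endpoints

    def advertise(self, kind: str, endpoint: str) -> None:
        self._services.setdefault(kind, []).append(endpoint)

    def lookup(self, kind: str) -> list:
        return list(self._services.get(kind, []))

registrar = Registrar()
# 1. Data services advertise themselves.
registrar.advertise("data", "host-a:7000")
registrar.advertise("data", "host-b:7000")
# 2. The metadata service advertises itself and monitors the data services.
registrar.advertise("metadata", "host-m:7100")
# 3. A client looks up the metadata service, which locates data services.
mds = registrar.lookup("metadata")[0]
# 4. The client then talks directly to a data service endpoint.
data_endpoints = registrar.lookup("data")
```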

Dynamic Data Layer. Append-only journals and read-optimized index segments are the basic building blocks: clients write to a data service's journal, journals overflow into index segments, and reads are served from both. bigdata 13

Data Service API (low level). Batch B+Tree operations: insert, contains, remove, lookup. Range query: fast key-range scans with optional filters. Submit job: run a procedure in the local process; procedures are automatically mapped across the data. bigdata 14
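A rough sketch of what such a low-level API looks like, here as illustrative Python over a sorted list (the real service exposes batch B+Tree operations over partitioned indices; these names are invented):

```python
# Illustrative key-range index: batch writes, point lookups, and filtered
# range scans, mirroring the API shape described above.
import bisect

class KeyRangeIndex:
    def __init__(self):
        self._keys, self._vals = [], []

    def insert_batch(self, pairs):
        """Apply a batch of (key, value) writes in key order."""
        for k, v in sorted(pairs):
            i = bisect.bisect_left(self._keys, k)
            if i < len(self._keys) and self._keys[i] == k:
                self._vals[i] = v          # overwrite an existing key
            else:
                self._keys.insert(i, k)
                self._vals.insert(i, v)

    def contains(self, key) -> bool:
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def range_scan(self, lo, hi, keep=lambda kv: True):
        """Fast key-range scan over [lo, hi) with an optional filter."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return [kv for kv in zip(self._keys[i:j], self._vals[i:j]) if keep(kv)]

idx = KeyRangeIndex()
idx.insert_batch([("k3", 3), ("k1", 1), ("k2", 2)])
hits = idx.range_scan("k1", "k3")   # [('k1', 1), ('k2', 2)]
```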

Data Service Architecture. The journal is ~200M per host, and writes are pipelined for data redundancy. On overflow: a new journal is opened for new writes, and the old writes are migrated onto index segments. Periodically: index segments get a compacting merge; old journals and segments are released; index segments are split/joined to maintain 100-200M per segment. bigdata 15

Metadata Service. Index management: add, drop, provision. Index partition management: locate, split, join, compact. Resource reclamation. bigdata 16

Metadata Service Architecture. There is one metadata index per scale-out index; it maps application keys onto data service locators for the index partitions. Clients query the metadata service for locations, then go directly to the data services for read and write operations. bigdata 17

Managed Index Partitions. An index begins with just one partition (p0). Index partitions are split as they grow, and new partitions are distributed across the grid, so CPU, RAM, and IO bandwidth grow with index size. bigdata 18

Metadata Addressing. L0 alone can address 16 terabytes; L1 can address 8 exabytes per index. L0: a 128M metadata partition with 256-byte records. L1: 128M metadata partitions with 1024-byte records. Data partitions (p0, p1, ..., pn): 128M per application index partition. bigdata 19

bigdata Map / Reduce Processing on Distributed File Systems bigdata 20

Map / Reduce. A distributed computing paradigm. The big win: it is easy to write robust jobs that scale. Lots of uses: crawlers, indexers, reverse link graphs, etc. Implementations: Google, Hadoop, bigdata. Code is distributed to the data (map) for local IO; an aggregation phase (reduce) collects the intermediate outputs. A flexible basis for scalable enterprise computing services. http://labs.google.com/papers/mapreduce.html http://lucene.apache.org/hadoop/ bigdata 21
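The map and reduce phases can be illustrated with the canonical word-count example, here as a single-process Python sketch (a real framework runs map tasks next to the data and shuffles intermediate pairs to the reducers):

```python
# Minimal map/reduce sketch: map emits (key, value) pairs, a shuffle groups
# them by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(doc: str):
    for word in doc.split():
        yield word, 1                      # emit intermediate (key, value) pairs

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:               # shuffle: group values by key
        groups[key].append(value)
    return {key: sum(vals) for key, vals in groups.items()}

docs = ["big data", "fast data"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = reduce_phase(pairs)               # {'big': 1, 'data': 2, 'fast': 1}
```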

Distributed File System. Scale-out storage. Block-structured files with a 64M block size; partial blocks are stored compactly. Applications manage logical records but enforce block boundaries. Atomic operations on file blocks: read, write, append, delete. bigdata 22

Use Case: Telescopic to Microscopic Analysis. Begin with broad queries; the initial questions must be broad to bring in all relevant data. Ingest resources related to those queries; each query might select 10,000 documents. Sort, sift, and filter, looking for key entities and new relationships; irrelevant data is filtered or discarded to focus on the core problem. bigdata 23

Scale-out Entity Extraction. Control flow runs through the map/reduce services; data flows through the distributed file system (URLs, docs, RDF/XML). 1. Search web or intranet services, noting URLs of interest. 2. Harvest downloads the documents. 3. Extract pulls out the entities of interest. bigdata 24

bigdata Federation And Semantic Alignment bigdata 25

Semantic Web Technology. It has been applied to federate IC data and is well suited to the problems of federation and semantic alignment: dynamic, declarative mapping of classes, properties, and instances to one another via owl:equivalentClass, owl:equivalentProperty, and owl:sameAs. bigdata 26

RDF Data Model. Resources: URIs, e.g. http://www.systap.com. Literals: plain literals ("bigdata"), language-code literals ("guten tag"@de), datatype literals ("12.0"^^xsd:double), and XML literals. Blank nodes: anonymous resources that can be used to construct containers. bigdata 27

RDF Data Model. The general form is a statement, or assertion: { Subject, Predicate, Object }. For example: x:mike rdf:type x:terrorist; x:mike x:name "Mike". There are constraints on the types of terms that may appear in each position of a statement. The model theory licenses entailments (aka inferences). bigdata 28
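The example statements can be modeled directly as tuples, with a wildcard match standing in for a triple-store access path (an illustration only; the prefixed names abbreviate full URIs):

```python
# The { Subject, Predicate, Object } statements above as plain Python tuples.
statements = {
    ("x:mike", "rdf:type", "x:terrorist"),
    ("x:mike", "x:name", '"Mike"'),
}

def match(graph, s=None, p=None, o=None):
    """Return statements matching a (subject, predicate, object) pattern;
    None acts as a wildcard, as in a triple-store access path."""
    return {t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Everything asserted about x:mike:
about_mike = match(statements, s="x:mike")
```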

Semantic Alignment with RDFS. Two schemas for the same problem: in (A), x:terrorist is an rdfs:subClassOf x:person, and x:name has rdfs:domain x:person; in (B), y:terroristactor is an rdfs:subClassOf y:individual, and y:fullname has rdfs:domain y:individual. Sample instance data for each schema: (A) x:mike rdf:type x:terrorist, with the explicit property x:name "Mike"; (B) y:michael rdf:type y:terroristactor, with the explicit property y:fullname "Michael". The rdf:type links to x:person and y:individual are inferred. bigdata 29

Mapping ontologies together. Given the same two schemas (A) and (B), three assertions map them together: x:terrorist owl:equivalentClass y:terroristactor; x:name owl:equivalentProperty y:fullname; x:mike owl:sameAs y:michael. bigdata 30
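A toy illustration of what these mapping assertions buy (this is not bigdata's inference engine): rewriting each term to a canonical representative merges the (A) and (B) data onto one individual.

```python
# Equivalences from the owl:equivalentClass / owl:equivalentProperty /
# owl:sameAs assertions above, as a term-rewriting table.
equiv = {
    "y:terroristactor": "x:terrorist",
    "y:fullname": "x:name",
    "y:michael": "x:mike",
}

def canon(term):
    return equiv.get(term, term)

data_a = {("x:mike", "rdf:type", "x:terrorist"),
          ("x:mike", "x:name", '"Mike"')}
data_b = {("y:michael", "rdf:type", "y:terroristactor"),
          ("y:michael", "y:fullname", '"Michael"')}

# Canonicalize every term, then merge; duplicate facts collapse.
merged = {tuple(canon(t) for t in triple) for triple in data_a | data_b}
# After alignment, both names attach to the same individual:
names = {o for s, p, o in merged if s == "x:mike" and p == "x:name"}
```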

Semantically Aligned View. Once we assert that x:mike and y:michael are the same individual, the data from both sources are snapped together: the one individual has rdf:type x:terrorist and y:terroristactor (and, by inference, x:person and y:individual), with x:name "Mike" and y:fullname "Michael". bigdata 31

Semantic Web Database. Built on the bigdata architecture. A feature-complete triple store with RDFS inference, very competitive performance, and open source. bigdata 32

RDFS+ Semantics. RDF model theory: eager closure is implemented, with truth maintenance and declarative rules (easily extended); select rules are backward-chained to reduce data storage requirements. Simple OWL extensions support semantic alignment and federation. A magic sets solution is planned: it is more scalable and makes truth maintenance easier. bigdata 33
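Eager closure for a single RDFS rule can be sketched as forward chaining to a fixed point (an illustration of the idea only, not bigdata's rule engine):

```python
# Forward-chain one RDFS entailment rule to a fixed point:
# (?x rdf:type ?a) and (?a rdfs:subClassOf ?b) entail (?x rdf:type ?b).
def rdfs_closure(triples):
    closed = set(triples)
    changed = True
    while changed:                       # iterate until no new facts appear
        changed = False
        sub = {(s, o) for s, p, o in closed if p == "rdfs:subClassOf"}
        for s, p, o in list(closed):
            if p == "rdf:type":
                for a, b in sub:
                    if a == o and (s, "rdf:type", b) not in closed:
                        closed.add((s, "rdf:type", b))
                        changed = True
    return closed

facts = {("x:terrorist", "rdfs:subClassOf", "x:person"),
         ("x:mike", "rdf:type", "x:terrorist")}
entailed = rdfs_closure(facts)
# ("x:mike", "rdf:type", "x:person") is now entailed.
```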

Lexicon. The lexicon uses two indices. terms: assigns unique term identifiers, with full Unicode and datatype support; the multi-component key is [typecode] [language code | datatype URI]? [lexical form | datatype value], and the value is the 64-bit unique term identifier. ids: the key is the term id and the value is the URI, blank node, or literal; its primary use is to emit RDF. bigdata 34
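The two indices can be sketched as a pair of dictionaries (illustrative only; the ids here are a simple counter rather than the 64-bit identifiers described above, and the multi-component key is elided):

```python
# Sketch of the two lexicon indices: a forward "terms" index (term -> id)
# and a reverse "ids" index (id -> term) used when emitting RDF.
class Lexicon:
    def __init__(self):
        self._terms = {}   # term -> id   (the "terms" index)
        self._ids = {}     # id -> term   (the "ids" index)

    def add(self, term: str) -> int:
        """Assign a unique id to a term; duplicates get the same id."""
        if term not in self._terms:
            tid = len(self._terms) + 1
            self._terms[term] = tid
            self._ids[tid] = term
        return self._terms[term]

    def term(self, tid: int) -> str:
        return self._ids[tid]          # reverse lookup, used to emit RDF

lex = Lexicon()
mike = lex.add("x:mike")
same = lex.add("x:mike")               # same term, same id
```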

Perfect Statement Index Strategy. All access paths are covered (SPO, POS, OSP); the facts are duplicated in each index, so there is no bias in the access path. The key is the concatenation of the term ids; the value is unused. Fast range counts are available for choosing the join ordering. Provided by the bigdata B+Tree package. bigdata 35
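The idea can be sketched in a few lines: store each triple in three sorted orders, so any bound prefix of a triple pattern is a contiguous key range, and a range count is just two binary searches (illustrative Python with string terms, not the B+Tree package):

```python
# Three sorted copies of the same facts give every access path (SPO, POS, OSP).
import bisect

def build_indices(triples):
    spo = sorted(triples)
    pos = sorted((p, o, s) for s, p, o in triples)
    osp = sorted((o, s, p) for s, p, o in triples)
    return spo, pos, osp

def range_count(index, prefix):
    """Count triples whose key starts with `prefix`: the matches form one
    contiguous range, so two binary searches suffice (no scan)."""
    lo = bisect.bisect_left(index, prefix)
    # Pad the prefix with a sentinel above any term to bound the range.
    hi = bisect.bisect_right(index, prefix + ("\U0010FFFF",) * (3 - len(prefix)))
    return hi - lo

triples = {("x:mike", "rdf:type", "x:terrorist"),
           ("x:mike", "x:name", "Mike"),
           ("y:michael", "y:fullname", "Michael")}
spo, pos, osp = build_indices(triples)
```

Such range counts over each access path are exactly what a query optimizer needs to pick a join ordering cheaply.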

Data Load Strategy. The data are sorted for each index. The RDF parser (Rio) buffers ~100k statements per second; duplicate terms and statements are discarded; index writes are batched at ~100k statements per write, with the terms, ids, spo, pos, and osp indices written concurrently against the journal. The net load rate is ~20,000 tps. bigdata 36

bigdata bulk load rates

ontology          #triples    #terms     tps             seconds
wines.daml           1,155       523     14,807          0.1
sw_community         1,209       855     19,190          0.1
russiaa              1,613     1,018     26,016          0.1
Could_have_been      2,669     1,624     21,352          0.1
iptc-srs             8,502     2,849     36,333          0.2
hu                   8,251     5,063     22,002          0.4
core                 1,363       861     15,989          0.1
enzymes             33,124    24,855     22,082          1.5
alibaba_v41         45,655    18,275     24,975          1.8
wordnet nouns      273,681   223,169     20,014         13.7
cyc                247,060   156,944     21,254 (15,000) 11.6 (17)
ncioncology        464,841   289,844     25,168         18.5
Thesaurus        1,047,495   586,923     20,557         51.0
taxonomy         1,375,759   651,773     14,638         94.0
totals           3,512,377 1,964,576     21,741        193.1

bigdata 37

Comparison of Bulk Load Rates. Oracle: 300-500 tps (1). OpenLink: 3k-6k tps (2). Franz AllegroGraph (commercial best of breed): 15k-20k tps (3). bigdata: ~20k tps (4), with plenty of room for improvement through simple tuning. (1) Source: Oracle, http://forums.oracle.com/forums/message.jspa?messageid=1242528. (2) Source: http://www.openlinksw.com/weblog/oerling/?id=1010. (3) Source: http://www.franz.com/products/allegrograph/allegrograph.datasheet.pdf. (4) Dell Latitude D620 (laptop) using 1G of RAM and JRockit VR26. bigdata 38

Comparison to Franz AllegroGraph

Franz: AMD64 quad-processor machine with 16GB of RAM.

AllegroGraph                                                  Data Size       Load Rate
LUBM, Lehigh U. Benchmark - 1,000 files (50 universities?)    6.88M triples   13.9K tps
OpenCyc                                                       255K triples    15.0K tps

Bigdata: Xeon (32-bit) quad-processor machine with 4GB of RAM.

bigdata                                                       Data Size       Load Rate
LUBM, Lehigh U. Benchmark - 1,000 files (100 universities)    13.33M triples  22.0K tps
OpenCyc                                                       255K triples    21.3K tps

bigdata 39

Scale-out data load. Distributed architecture: range-partitioned indices; Jini for service discovery; both the client and the database are distributed. Test platform: Xeon (32-bit) 3GHz quad-processor machines, each with 4GB of RAM and 280G of disk. Data set: LUBM U1000 (133M triples). bigdata 40

Scale-out performance. There is a performance penalty for the distributed architecture, but all of it is regained by the 2nd machine; 10 machines would be 100k triples per second.

Configuration                           Load Rate
Single-host architecture                18.5K tps
Scale-out architecture, one server      11.5K tps
Scale-out architecture, two servers     25.0K tps

bigdata 41

Next Steps. Sesame 2.x integration: quad store; SPARQL. Inference and high-level query: backward chaining for owl:sameAs; distributed joins; replace justifications with SLD/magic sets. Scale-out: dynamic range partitioning and load balancing; service and data failover. bigdata 42

bigdata Timeline. The axis runs from when development began through today (Oct through Feb, spanning 07-08). Storage fabric milestones: journal, B+Tree, bulk index builds, partitioned indices, transactions, group commit & high concurrency, distributed indices, map/reduce, dynamic partitioning. RDF application milestones: forward closure, consistent data load, fast query & truth maintenance, massively parallel ingest, quad store. bigdata 43

Bryan Thompson, Chief Scientist, SYSTAP, LLC. bryan@systap.com, 202-462-9888. bigdata(TM): Flexible, Reliable, Affordable Web-scale computing. bigdata 44