bigdata
Flexible. Reliable. Affordable. Web-scale computing.
bigdata 1
Background
- Requirement: fast analytic access to massive, heterogeneous data.
- Traditional approaches: relational databases, supercomputers, business intelligence tools, semantic web / graph-based systems.
- New approach, an evolution led by internet companies: linear scale-out on commodity hardware; higher concurrency and uptime; lower costs.
The Information Factories (http://www.wired.com/wired/archive/14.10/cloudware.html)
- Google: massively concurrent reads and writes with atomic row updates and petabytes of data. The basic data model is { application key, column name, timestamp } : { value }.
- Yahoo/Amazon: cloud computing and functional programming to distribute processing over clusters (Apache Hadoop, Amazon Elastic Compute Cloud).
- eBay: partitioned data horizontally and into transactional and non-transactional regimes.
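The Google data model named above can be made concrete with a small sketch. This is a toy, single-process illustration only (the class and method names are mine, not from any real system): a sparse row store keyed by { application key, column name, timestamp }, with a coarse lock standing in for atomic row updates.

```python
import threading

class SparseRowStore:
    """Toy sketch of the { application key, column name, timestamp } : { value }
    data model, with atomic updates at row granularity."""

    def __init__(self):
        self._rows = {}                 # app key -> {column -> {timestamp -> value}}
        self._lock = threading.Lock()   # coarse stand-in for per-row atomicity

    def write(self, key, column, timestamp, value):
        with self._lock:                # the whole row update is atomic
            row = self._rows.setdefault(key, {})
            row.setdefault(column, {})[timestamp] = value

    def read(self, key, column):
        """Return the most recent (timestamp, value) for a cell, or None."""
        cell = self._rows.get(key, {}).get(column)
        if not cell:
            return None
        ts = max(cell)
        return ts, cell[ts]

store = SparseRowStore()
store.write("com.example", "anchor:text", 100, "old")
store.write("com.example", "anchor:text", 200, "new")
print(store.read("com.example", "anchor:text"))   # (200, 'new')
```

Timestamped cells are what make the model "sparse": absent columns cost nothing, and history is retained per cell rather than per row.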
Use Cases
- Online services: search (Google), advertising (Google, AOL), recommendation systems (Amazon), auctions (eBay), logistics (Wal-Mart, FedEx).
- Federal government: OSINT; predict and interdict; information fusion.
- Bond trade analysis.
- Bioinformatics.
OSINT
- Harvest OSINT to support the IC mission: many source languages; directed and broad harvests; long-standing and ad hoc analyses.
- Entities of interest are automatically tagged; knowledge is captured and republished using a semantic web database aligned to multiple ontologies.
- Challenges: scaling deployment from department to enterprise; continued data growth and an increasing growth rate; federated, policy-based information sharing at scale.
Connect the Actors
- Multi-agency database of names, events, and relationships.
- Data quality issues (alternative spellings, timelines, provenance).
- Policy-based data sharing.
- Key computation: preempt actors by linking entities and events to predict and interdict enemy operations.
Semantic Web Databases
- Facilitate these dynamic demands, but not at the necessary scale.
- The best current solutions offer ~15k triples per second; loading billions of triples takes 20+ hours at 15k per second.
- Scale-up architecture.
Desired Solution Characteristics
- Cost-effective, scalable access to and analysis of massive data: petabyte scale; linear scale-out.
- Support for rapid innovation.
- Low management costs.
- Very high parallelism, concurrency, resource utilization, and aggregate IO bandwidth.
bigdata architecture
Architecture Layer Cake
- Services: Search & Indexing, Events, Data Analysis, Pub/Sub, SLA, Reports.
- Data models: GOM (OODBMS), RDFS, Sparse Row Store.
- bigdata storage and computing fabric: Transaction Manager, Map/Reduce, Metadata Service, Data Service.
Distributed Service Architecture
- Centralized services: Transaction Manager, Metadata Service.
- Distributed services: Data Services, Map/Reduce Services.
- A grid-enabled cluster: service discovery uses Jini; data discovery uses the metadata service.
- Automatic failover for the centralized services.
Service Discovery
1. Data services discover registrars and advertise themselves.
2. Metadata services discover registrars, advertise themselves, and monitor data service leave/join.
3. Clients discover registrars, look up the metadata service, and use it to locate data services.
4. Clients talk directly to the data services.
Dynamic Data Layer
- Clients write to journals on the data services and read from both journals and index segments.
- Append-only journals and read-optimized index segments are the basic building blocks.
- On overflow, journal data migrates into index segments.
Data Service API (low level)
- Batch B+Tree operations: insert, contains, remove, lookup.
- Range query: fast key-range scans, with optional filters.
- Submit job: run a procedure in the local process; procedures are automatically mapped across the data.
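The batch and range operations above can be sketched in miniature. This is an illustrative stand-in, not the real bigdata API (the class and method names are hypothetical); a sorted key array stands in for the B+Tree.

```python
from bisect import bisect_left

class DataServiceIndex:
    """Toy sketch of a low-level data service index: batch B+Tree-style
    operations plus a key-range scan with an optional filter."""

    def __init__(self):
        self._keys, self._vals = [], []

    def batch_insert(self, pairs):
        for key, val in pairs:
            i = bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                self._vals[i] = val               # overwrite existing entry
            else:
                self._keys.insert(i, key)
                self._vals.insert(i, val)

    def batch_lookup(self, keys):
        out = []
        for key in keys:
            i = bisect_left(self._keys, key)
            hit = i < len(self._keys) and self._keys[i] == key
            out.append(self._vals[i] if hit else None)
        return out

    def range_scan(self, from_key, to_key, filter_fn=None):
        """Yield (key, value) for from_key <= key < to_key."""
        lo = bisect_left(self._keys, from_key)
        hi = bisect_left(self._keys, to_key)
        for i in range(lo, hi):
            if filter_fn is None or filter_fn(self._keys[i], self._vals[i]):
                yield self._keys[i], self._vals[i]

ix = DataServiceIndex()
ix.batch_insert([(b"b", 2), (b"a", 1), (b"c", 3)])
print(ix.batch_lookup([b"a", b"x"]))       # [1, None]
print(list(ix.range_scan(b"a", b"c")))     # [(b'a', 1), (b'b', 2)]
```

Batching matters here: one round trip carries many keys, which is what makes remote index operations affordable over a network.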
Data Service Architecture
- The journal is ~200M per host; writes are pipelined for data redundancy.
- On overflow: start a new journal for new writes and migrate the old journal's writes onto index segments.
- Periodically: run a compacting merge of the index segments; release old journals and segments; split/join index segments to maintain 100-200M per segment.
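The compacting merge step above can be sketched as a merge of sorted runs. This is a simplified illustration under my own conventions (the entry layout and the use of a generation number are assumptions): each segment is sorted by key, higher generation means newer data, and only the newest value per key survives the merge.

```python
import heapq

def compacting_merge(*segments):
    """Toy compacting merge: each segment is a list of (key, generation,
    value) sorted by key; the journal's migrated writes would be the
    newest segment. Produces one read-optimized segment keeping only the
    newest value per key; value None marks a deleted entry."""
    merged = heapq.merge(*segments, key=lambda e: (e[0], -e[1]))
    out, last_key = [], object()
    for key, gen, value in merged:
        if key != last_key:            # first (newest) entry wins per key
            if value is not None:      # drop tombstones during the merge
                out.append((key, gen, value))
            last_key = key
    return out

seg0 = [("a", 0, "old-a"), ("b", 0, "old-b")]
seg1 = [("a", 1, "new-a"), ("c", 1, None)]    # 'c' was deleted
print(compacting_merge(seg0, seg1))
# [('a', 1, 'new-a'), ('b', 0, 'old-b')]
```

Because every input is already sorted, the merge is a single sequential pass, which is why compacting merges are IO-friendly on read-optimized segments.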
Metadata Service
- Index management: add, drop, provision.
- Index partition management: locate, split, join, compact.
- Resource reclamation.
Metadata Service Architecture
- One metadata index per scale-out index.
- The metadata index maps application keys onto data service locators for index partitions.
- Clients query the metadata service for partition locations, then go directly to the data services for read and write operations.
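The locate step above is essentially a floor lookup in a sorted map of separator keys. A minimal sketch, with hypothetical names (the real metadata index is itself a B+Tree, not a Python list): each partition is described by its left separator key, the inclusive lower bound of its key range.

```python
from bisect import bisect_right

class MetadataIndex:
    """Toy metadata index: maps an application key onto the locator
    (partition id, data service) of the index partition spanning it."""

    def __init__(self):
        self._separators = []   # sorted left separator keys
        self._locators = []     # parallel list of locators

    def add_partition(self, left_separator, locator):
        i = bisect_right(self._separators, left_separator)
        self._separators.insert(i, left_separator)
        self._locators.insert(i, locator)

    def locate(self, app_key):
        """Floor search: the partition whose range contains app_key."""
        i = bisect_right(self._separators, app_key) - 1
        if i < 0:
            raise KeyError("key is below the first partition")
        return self._locators[i]

mdi = MetadataIndex()
mdi.add_partition(b"", ("p0", "dataService3"))    # p0 opens the key space
mdi.add_partition(b"m", ("p1", "dataService7"))   # p1 created by a split at b"m"
print(mdi.locate(b"apple"))   # ('p0', 'dataService3')
print(mdi.locate(b"zebra"))   # ('p1', 'dataService7')
```

Using an empty byte string as the first separator guarantees every possible key falls into some partition, so a split only inserts one new separator.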
Managed Index Partitions
- An index begins with just one partition (p0).
- Index partitions are split as they grow, and new partitions are distributed across the grid.
- CPU, RAM, and IO bandwidth therefore grow with index size.
Metadata Addressing
- L0 alone can address 16 terabytes; L1 can address 8 exabytes per index.
- L0: a 128M metadata partition with 256-byte records.
- L1: 128M metadata partitions with 1,024-byte records.
- Data partitions (p0, p1, ..., pn): 128M per application index partition.
bigdata Map / Reduce: Processing on Distributed File Systems
Map / Reduce
- A distributed computing paradigm; the big win is that robust jobs that scale are easy to write.
- Lots of uses: crawlers, indexers, reverse link graphs, etc.
- Implementations: Google, Hadoop, bigdata.
- Code is distributed to the data (map) for local IO; an aggregation phase (reduce) collects the intermediate outputs.
- A flexible basis for scalable enterprise computing services.
- http://labs.google.com/papers/mapreduce.html
- http://lucene.apache.org/hadoop/
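The map and reduce phases above can be shown in a minimal single-process sketch (no distribution, no fault tolerance; function names are mine): a map phase per input split, a shuffle that groups intermediate pairs by key, and a reduce phase that aggregates each group.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-process map/reduce skeleton: map each split, shuffle the
    intermediate (key, value) pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for split in inputs:                  # map: code goes to each data split
        for key, value in map_fn(split):
            groups[key].append(value)     # shuffle: group intermediates by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical map/reduce example.
def tokenize(doc):
    return [(word, 1) for word in doc.split()]

def total(word, counts):
    return sum(counts)

docs = ["big data big wins", "data scales"]
print(map_reduce(docs, tokenize, total))
# {'big': 2, 'data': 2, 'wins': 1, 'scales': 1}
```

The point of the pattern is that `map_fn` and `reduce_fn` are pure per-record functions, so the framework is free to run them on whichever hosts already hold the data.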
Distributed File System
- Scale-out storage.
- Block-structured files with a 64M block size; partial blocks are stored compactly.
- Applications manage logical records but must respect block boundaries.
- Atomic operations on file blocks: read, write, append, delete.
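A tiny sketch of the block discipline described above, under stated assumptions (the class is hypothetical and in-memory; a real DFS would replicate blocks across data services): the block is the unit of atomicity, records may not cross block boundaries, and a short final block is stored at its actual size rather than padded to 64M.

```python
BLOCK_SIZE = 64 * 2**20   # 64M blocks, as on the slide

class BlockFile:
    """Toy block-structured file with atomic block operations
    (read / append / delete)."""

    def __init__(self):
        self._blocks = []

    def append(self, data):
        """Atomically append one block; a record must fit in a block."""
        if len(data) > BLOCK_SIZE:
            raise ValueError("records must not cross block boundaries")
        self._blocks.append(data)      # partial blocks stored compactly
        return len(self._blocks) - 1   # the new block's id

    def read(self, block_id):
        return self._blocks[block_id]

    def delete(self, block_id):
        self._blocks[block_id] = b""

f = BlockFile()
bid = f.append(b"a partial block is stored compactly")
print(f.read(bid))
```

Making the block the atomic unit is what lets many writers append to a shared file without coordinating on byte offsets.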
Use Case: Telescopic to Microscopic Analysis
- Begin with broad queries: the initial questions must be broad to bring in all relevant data.
- Ingest the resources related to those queries; each query might select 10,000 documents.
- Sort, sift, and filter, looking for key entities and new relationships; irrelevant data is filtered or discarded to focus on the core problem.
Scale-out Entity Extraction
A map/reduce pipeline over the distributed file system (queries drive the control flow; URLs, documents, and RDF/XML flow through the file system):
1. Search web or intranet services, noting URLs of interest;
2. Harvest downloads the documents;
3. Extract pulls out the entities of interest, emitting RDF/XML.
bigdata Federation and Semantic Alignment
Semantic Web Technology
- Has been applied to federate IC data.
- Well suited to the problems of federation and semantic alignment.
- Dynamic, declarative mapping of classes, properties, and instances to one another: owl:equivalentClass, owl:equivalentProperty, owl:sameAs.
RDF Data Model
- Resources: URIs, e.g. http://www.systap.com.
- Literals: plain literals ("bigdata"); language-coded literals ("guten Tag"@de); datatype literals ("12.0"^^xsd:double); XML literals.
- Blank nodes: anonymous resources that can be used to construct containers.
RDF Data Model: Statements
- The general form is a statement, or assertion: { Subject, Predicate, Object }.
    x:mike rdf:type x:terrorist .
    x:mike x:name "Mike" .
- There are constraints on the types of terms that may appear in each position of a statement.
- The model theory licenses entailments (aka inferences).
Semantic Alignment with RDFS
Two schemas for the same problem:
- Schema (A): x:terrorist rdfs:subClassOf x:person; x:name rdfs:domain x:person.
- Schema (B): y:terroristActor rdfs:subClassOf y:individual; y:fullName rdfs:domain y:individual.
Sample instance data for each schema (explicit properties, with the RDFS-inferred types shown):
- (A): x:mike rdf:type x:terrorist; x:mike x:name "Mike"; inferred: x:mike rdf:type x:person.
- (B): y:michael rdf:type y:terroristActor; y:michael y:fullName "Michael"; inferred: y:michael rdf:type y:individual.
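The inferred types above follow from two standard RDFS entailment rules, which a naive forward-chaining sketch can demonstrate (this covers only the two rules used on the slide, not the full RDFS rule set, and is not the bigdata inference engine):

```python
def rdfs_closure(triples):
    """Naive fixpoint over two RDFS rules: rdf:type propagates up
    rdfs:subClassOf (rule rdfs9), and rdfs:domain types the subject of a
    property (rule rdfs2)."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in closure:
            if p == "rdf:type":
                for s2, p2, o2 in closure:
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        new.add((s, "rdf:type", o2))      # rdfs9
            else:
                for s2, p2, o2 in closure:
                    if s2 == p and p2 == "rdfs:domain":
                        new.add((s, "rdf:type", o2))      # rdfs2
        if not new <= closure:
            closure |= new
            changed = True
    return closure

schema_a = {("x:terrorist", "rdfs:subClassOf", "x:person"),
            ("x:name", "rdfs:domain", "x:person")}
data_a = {("x:mike", "rdf:type", "x:terrorist"),
          ("x:mike", "x:name", "Mike")}
inferred = rdfs_closure(schema_a | data_a)
print(("x:mike", "rdf:type", "x:person") in inferred)   # True
```

Both rules independently derive x:mike rdf:type x:person here, which is exactly the "inferred property" arrow in the slide's diagram.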
Mapping Ontologies Together
The same two schemas, plus assertions that map (A) and (B) together:
- x:terrorist owl:equivalentClass y:terroristActor .
- x:name owl:equivalentProperty y:fullName .
- x:mike owl:sameAs y:michael .
Semantically Aligned View
Once we assert that x:mike and y:michael are the same individual, the data from both sources snap together: the merged node carries the explicit types x:terrorist and y:terroristActor (and the inferred types x:person and y:individual), together with both properties, x:name "Mike" and y:fullName "Michael".
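The snapping-together effect can be sketched as a term-rewriting fixpoint. This is a simplification under stated assumptions (I treat all three OWL mapping properties as symmetric term equivalences and rewrite every triple position; full OWL semantics is considerably richer, and this is not the bigdata implementation):

```python
def align(triples, mappings):
    """Toy alignment: owl:equivalentClass / owl:equivalentProperty /
    owl:sameAs are read as symmetric term equivalences, and every triple
    is rewritten in all positions until a fixpoint is reached."""
    eq = {}
    for s, p, o in mappings:
        if p in ("owl:equivalentClass", "owl:equivalentProperty", "owl:sameAs"):
            eq.setdefault(s, set()).add(o)
            eq.setdefault(o, set()).add(s)

    closure = set(triples)
    frontier = set(triples)
    while frontier:
        s, p, o = frontier.pop()
        for s2 in eq.get(s, set()) | {s}:
            for p2 in eq.get(p, set()) | {p}:
                for o2 in eq.get(o, set()) | {o}:
                    t = (s2, p2, o2)
                    if t not in closure:
                        closure.add(t)
                        frontier.add(t)
    return closure

data = {("x:mike", "rdf:type", "x:terrorist"),
        ("x:mike", "x:name", "Mike"),
        ("y:michael", "y:fullName", "Michael")}
maps = {("x:terrorist", "owl:equivalentClass", "y:terroristActor"),
        ("x:name", "owl:equivalentProperty", "y:fullName"),
        ("x:mike", "owl:sameAs", "y:michael")}
view = align(data, maps)
print(("y:michael", "rdf:type", "y:terroristActor") in view)   # True
print(("x:mike", "x:name", "Michael") in view)                 # True
```

After the fixpoint, a query phrased entirely in schema (A) or entirely in schema (B) sees the union of both sources, which is the point of declarative alignment.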
Semantic Web Database
- Built on the bigdata architecture.
- A feature-complete triple store with RDFS inference.
- Very competitive performance.
- Open source.
RDFS+ Semantics
- RDF model theory: eager closure is implemented, with truth maintenance.
- Declarative rules (easily extended); backward chaining of select rules to reduce data storage requirements.
- Simple OWL extensions support semantic alignment and federation.
- A magic sets solution is planned: it is more scalable and makes truth maintenance easier.
Lexicon Uses Two Indices
- terms: assigns unique term identifiers. Full Unicode and datatype support. The multi-component key is [typeCode] [languageCode | datatypeURI]? [lexicalForm | datatypeValue]; the value is the 64-bit unique term identifier.
- ids: the key is the term identifier and the value is the URI, blank node, or literal. Its primary use is to emit RDF.
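The two-index lexicon above is a dictionary-encoding scheme, which a minimal sketch makes concrete (plain dicts stand in for the two B+Tree indices, and I skip the multi-component sort key):

```python
from itertools import count

class Lexicon:
    """Toy lexicon: 'terms' maps an RDF value to a unique term identifier;
    'ids' maps the identifier back to the value so query results can be
    re-materialized as RDF."""

    def __init__(self):
        self._next = count(1)
        self.terms = {}   # term -> id
        self.ids = {}     # id -> term

    def add(self, term):
        """Assign (or look up) the unique identifier for a term."""
        tid = self.terms.get(term)
        if tid is None:
            tid = next(self._next)
            self.terms[term] = tid
            self.ids[tid] = term
        return tid

lex = Lexicon()
tid = lex.add("x:mike")
again = lex.add("x:mike")
print(tid == again, lex.ids[tid])   # True x:mike
```

Every statement is then stored as three fixed-width identifiers rather than three variable-length strings, which is what makes the statement indices compact and their keys cheap to compare.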
Perfect Statement Index Strategy
- All access paths (SPO, POS, OSP); facts are duplicated in each index, so there is no bias in the access path.
- The key is the concatenation of the term identifiers; the value is unused.
- Fast range counts for choosing the join ordering.
- Provided by the bigdata B+Tree package.
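Why three index orderings suffice can be shown with a small sketch (sorted Python lists stand in for the B+Trees, and the selection logic is my illustration of the idea rather than bigdata's actual planner): whichever positions of a triple pattern are bound, one of SPO, POS, OSP puts them in a contiguous key prefix, so the matching statements form one key range.

```python
from bisect import bisect_left, bisect_right

INDICES = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}

def choose_index(s, p, o):
    """Pick the access path whose leading key positions are the bound
    ones (None means unbound)."""
    if s is not None:
        return "OSP" if (p is None and o is not None) else "SPO"
    if p is not None:
        return "POS"
    return "OSP" if o is not None else "SPO"   # fully unbound: any index

def scan(store, s, p, o):
    name = choose_index(s, p, o)
    order = INDICES[name]
    prefix = tuple(v for v in ((s, p, o)[i] for i in order) if v is not None)
    keys = store[name]                         # sorted reordered triples
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix + (float("inf"),) * (3 - len(prefix)))
    return name, keys[lo:hi]

triples = [(1, 10, 100), (1, 10, 200), (2, 10, 100)]
store = {name: sorted(tuple(t[i] for i in order) for t in triples)
         for name, order in INDICES.items()}
print(scan(store, 1, 10, None))    # ('SPO', [(1, 10, 100), (1, 10, 200)])
print(scan(store, None, 10, 100))  # ('POS', [(10, 100, 1), (10, 100, 2)])
```

The `hi - lo` gap is also the fast range count mentioned above, available without touching the matching tuples themselves, which is what the join planner uses to order joins.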
Data Load Strategy
- Sort the data for each index.
- The RDF parser (Rio) buffers statements at ~100k statements per second.
- Discard duplicate terms and statements.
- Batch ~100k statements per write; the batched writes against the terms, ids, spo, pos, and osp indices run concurrently on the journal.
- The net load rate is ~20,000 triples per second.
bigdata bulk load rates

ontology           #triples     #terms        tps            seconds
wines.daml            1,155        523     14,807                0.1
sw_community          1,209        855     19,190                0.1
russiaa               1,613      1,018     26,016                0.1
Could_have_been       2,669      1,624     21,352                0.1
iptc-srs              8,502      2,849     36,333                0.2
hu                    8,251      5,063     22,002                0.4
core                  1,363        861     15,989                0.1
enzymes              33,124     24,855     22,082                1.5
alibaba_v41          45,655     18,275     24,975                1.8
wordnet nouns       273,681    223,169     20,014               13.7
cyc                 247,060    156,944     21,254 (15,000)      11.6 (17)
ncioncology         464,841    289,844     25,168               18.5
Thesaurus         1,047,495    586,923     20,557               51.0
taxonomy          1,375,759    651,773     14,638               94.0
totals            3,512,377  1,964,576     21,741              193.1
Comparison of Bulk Load Rates
- Oracle: 300-500 tps (1)
- OpenLink: 3k-6k tps (2)
- Franz AllegroGraph (commercial best of breed): 15k-20k tps (3)
- bigdata: ~20k tps (4), with plenty of room for improvement through simple tuning.
(1) Source is Oracle: http://forums.oracle.com/forums/message.jspa?messageid=1242528
(2) Source is http://www.openlinksw.com/weblog/oerling/?id=1010
(3) Source is http://www.franz.com/products/allegrograph/allegrograph.datasheet.pdf
(4) Dell Latitude D620 (laptop) using 1G of RAM and JRockit VR26.
Comparison to Franz AllegroGraph

AllegroGraph (Franz: AMD64 quad-processor machine with 16GB of RAM)
Data                                                      Size            Load Rate
LUBM (Lehigh U. Benchmark), 1,000 files (50 universities?)  6.88M triples   13.9K tps
OpenCyc                                                   255K triples    15.0K tps

bigdata (Xeon 32-bit quad-processor machine with 4GB of RAM)
Data                                                      Size            Load Rate
LUBM (Lehigh U. Benchmark), 1,000 files (100 universities)  13.33M triples  22.0K tps
OpenCyc                                                   255K triples    21.3K tps
Scale-out Data Load
- Distributed architecture: range-partitioned indices; Jini for service discovery; both the clients and the database are distributed.
- Test platform: Xeon (32-bit) 3GHz quad-processor machines, each with 4GB of RAM and 280G of disk.
- Data set: LUBM U1000 (133M triples).
Scale-out Performance
There is a performance penalty for the distributed architecture, but all of it is regained by the 2nd machine; 10 machines would be ~100k triples per second.

Configuration                          Load Rate
Single-host architecture               18.5K tps
Scale-out architecture, one server     11.5K tps
Scale-out architecture, two servers    25.0K tps
Next Steps
- Sesame 2.x integration: quad store; SPARQL.
- Inference and high-level query: backward chaining for owl:sameAs; distributed JOINs; replace justifications with SLD / magic sets.
- Scale-out: dynamic range partitioning and load balancing; service and data failover.
bigdata Timeline
[Gantt-style timeline chart, months labeled Oct through Apr and Sep through Feb, years '07 and '08. Milestones include: development begins; journal; B+Tree; bulk index builds; partitioned indices; group commit and high concurrency; transactions; distributed indices; Map/Reduce; dynamic partitioning; and, for the RDF application: forward closure, consistent data load, fast query and truth maintenance, massively parallel ingest, and a quad store.]
Bryan Thompson
Chief Scientist, SYSTAP, LLC
bryan@systap.com
202-462-9888
bigdata™: Flexible. Reliable. Affordable. Web-scale computing.