Bigdata: Enabling the Semantic Web at Web Scale
Presentation outline
- What is big data?
- Bigdata Architecture
- Bigdata RDF Database
- Performance
- Roadmap
What is big data?
Big data is a new way of thinking about and processing massive data.
- Petabyte scale
- Commodity hardware
- Distributed processing
The origins of big data
- Google published several inspiring papers that have captured a huge mindshare: GFS, Map/Reduce, Bigtable.
- Competition has emerged among cloud-as-a-service providers: EC2, S3, GAE, BlueCloud, Cloudera, etc.
- An increasing number of open source efforts provide cloud computing frameworks: Hadoop, Bigdata, CouchDB, Hypertable, MG4J, Cassandra.
Who has big data?
- USG
- Finance
- Biomedical & Pharmaceutical
- Large corporations
- Major web players
- High energy physics

Sources:
- http://dataspora.com/blog/tipping-points-and-big-data/
- http://www.wired.com/wired/archive/14.10/cloudware.html
- http://radar.oreilly.com/2009/04/linkedin-chief-scientist-on-analytics-and-bigdata.html
- http://www.nature.com/nature/journal/v455/n7209/full/455001a.html
- http://queue.acm.org/detail.cfm?id=1563874
Technologies that go big
- Distributed file systems: GFS, S3, HDFS.
- Map/reduce
  - Lowers the bar for distributed computing.
  - Good for data locality in inputs; e.g., documents in, hash-partitioned full text index out.
- Sparse row stores
  - High read/write concurrency using atomic row operations.
  - Basic data model is { primary key, column name, timestamp } : { value } (sketched below).
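To make the sparse row store model concrete, here is a minimal in-memory sketch; the class and method names are illustrative and are not the API of any of the stores named above.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch (not any real store's API): a sparse row store keyed by
// { primary key, column name, timestamp } -> { value }.
public class SparseRowStoreSketch {

    // A cell address: primary key + column name + timestamp.
    record CellKey(String primaryKey, String columnName, long timestamp)
            implements Comparable<CellKey> {
        public int compareTo(CellKey o) {
            int c = primaryKey.compareTo(o.primaryKey);
            if (c != 0) return c;
            c = columnName.compareTo(o.columnName);
            if (c != 0) return c;
            return Long.compare(timestamp, o.timestamp);
        }
    }

    // Cells are kept in key order, mimicking a B+Tree index.
    private final NavigableMap<CellKey, byte[]> cells = new TreeMap<>();

    // An atomic "row write": all cells share the same primary key and timestamp.
    public synchronized void writeRow(String primaryKey, long timestamp,
                                      java.util.Map<String, byte[]> columns) {
        columns.forEach((col, val) ->
                cells.put(new CellKey(primaryKey, col, timestamp), val));
    }

    // Read the most recent value of one column at or before the given timestamp.
    public synchronized byte[] read(String primaryKey, String columnName, long asOf) {
        var e = cells.floorEntry(new CellKey(primaryKey, columnName, asOf));
        if (e == null) return null;
        CellKey k = e.getKey();
        return (k.primaryKey().equals(primaryKey) && k.columnName().equals(columnName))
                ? e.getValue() : null;
    }
}
```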
The killer big data application
Clouds + Open Data = Big Data Integration
Critical advantages:
- Fast integration cycle.
- Open standards.
- Integrates heterogeneous data, linked data, structured data.
- Opportunistic exploitation of data, including data which cannot be integrated quickly enough today to derive its business value.
Bigdata Architecture
- Petabyte scale
- Dynamic sharding
- Commodity hardware
- Open source, Java
- High performance
- High concurrency (MVCC)
- HA architecture
- Temporal database
- Semantic web database
Key Differentiators
- Dynamic sharding: incrementally scale from 10s, to 100s, to 1000s of nodes.
- Temporal database: fast access to historical database states.
- HA architecture: built-in design for high availability.
Bigdata Services
- Centralized services: Transaction Manager, Metadata Service, Load Balancer.
- Distributed services:
  - Data Services: index data, join processing.
  - Client Services: distributed job execution.
- Jini: service discovery.
- Zookeeper: configuration management, global locks, and master elections.
Service Discovery
1. Services discover service registrars (Jini) and advertise themselves.
2. Clients discover registrars, look up the metadata service, and use it to obtain locators spanning key ranges of interest for a scale-out index.
3. Clients resolve locators to data service identifiers, then look up the data services in the service registrar.
4. Clients talk directly to data services.
5. Client libraries encapsulate this for applications (sketched below).
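The five steps above can be sketched from the client's point of view. All of the interfaces below (ServiceRegistrar, MetadataService, DataService, PartitionLocator) are hypothetical stand-ins for the Jini and Bigdata APIs, used only to make the flow concrete.

```java
import java.util.List;

// Hypothetical interfaces standing in for the Jini registrar and Bigdata services.
interface ServiceRegistrar {
    <T> T lookup(Class<T> serviceInterface);         // find a service by type
    Object lookupById(java.util.UUID serviceId);     // resolve a service identifier
}

interface MetadataService {
    // Locators covering a key range of a scale-out index.
    List<PartitionLocator> locatorScan(String indexName, byte[] fromKey, byte[] toKey);
}

record PartitionLocator(int partitionId, java.util.UUID dataServiceId,
                        byte[] leftSeparatorKey, byte[] rightSeparatorKey) {}

interface DataService {
    byte[][] submitRead(String indexName, int partitionId, byte[][] keys);
}

public class ClientDiscoverySketch {
    public static void readKeyRange(ServiceRegistrar registrar, String indexName,
                                    byte[] fromKey, byte[] toKey, byte[][] keys) {
        // 2. Look up the (centralized) metadata service and ask for locators.
        MetadataService mds = registrar.lookup(MetadataService.class);
        List<PartitionLocator> locators = mds.locatorScan(indexName, fromKey, toKey);

        for (PartitionLocator loc : locators) {
            // 3. Resolve the locator's data service identifier via the registrar.
            DataService ds = (DataService) registrar.lookupById(loc.dataServiceId());
            // 4. Talk directly to the data service holding that shard.
            byte[][] values = ds.submitRead(indexName, loc.partitionId(), keys);
            // 5. A client library would assemble these per-shard results behind
            //    a single scale-out index view for the application.
        }
    }
}
```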
The Data Service
- Scattered writes are absorbed by append-only journals; on overflow, data migrates into read-optimized index segments.
- Gathered reads are served from the index segments.
- Append-only journals and read-optimized index segments are the basic building blocks.
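A rough sketch of that journal/segment split, assuming a simple size-based overflow trigger; the types and the single-threaded overflow service are illustrative, not the actual data service implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a data service's storage layers: an append-only
// journal absorbs writes; on overflow it is frozen and compacted into a
// read-optimized index segment in the background.
public class OverflowSketch {
    interface Journal {                 // append-only write store
        void append(byte[] key, byte[] value);
        long bytesOnDisk();
        IndexSegment buildSegment();    // read-optimized, immutable B+Tree file
    }
    interface IndexSegment {
        byte[] read(byte[] key);
    }

    private final long overflowThresholdBytes;
    private Journal liveJournal;
    private final List<IndexSegment> segments = new ArrayList<>();
    private final ExecutorService overflowService = Executors.newSingleThreadExecutor();
    private final java.util.function.Supplier<Journal> journalFactory;

    OverflowSketch(long thresholdBytes, java.util.function.Supplier<Journal> factory) {
        this.overflowThresholdBytes = thresholdBytes;
        this.journalFactory = factory;
        this.liveJournal = factory.get();
    }

    // Scattered writes land on the live journal.
    public synchronized void write(byte[] key, byte[] value) {
        liveJournal.append(key, value);
        if (liveJournal.bytesOnDisk() > overflowThresholdBytes) {
            Journal frozen = liveJournal;          // swap in a fresh journal
            liveJournal = journalFactory.get();
            overflowService.submit(() -> {         // build the segment asynchronously
                IndexSegment seg = frozen.buildSegment();
                synchronized (this) { segments.add(seg); }
            });
        }
    }

    // Gathered reads consult the read-optimized segments (the live journal and
    // version resolution are omitted in this sketch).
    public synchronized byte[] read(byte[] key) {
        for (IndexSegment seg : segments) {
            byte[] v = seg.read(key);
            if (v != null) return v;
        }
        return null;
    }
}
```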
Bigdata Indices
- Dynamically key-range partitioned B+Trees for indices.
- Index entries (tuples) map unsigned byte[] keys to byte[] values.
- Tuples also have a delete flag and a timestamp.
- Index partitions are distributed across data services on a cluster.
- Partitions are located by the centralized metadata service.
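A minimal sketch of the tuple shape and the unsigned byte[] key ordering just described; TupleSketch is illustrative, not a Bigdata class.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative tuple for a key-range-partitioned B+Tree: unsigned byte[] key,
// byte[] value, a delete marker, and a revision timestamp.
public record TupleSketch(byte[] key, byte[] value, boolean deleted, long timestamp) {

    // Keys are compared as unsigned byte strings, so key-range partitioning
    // ("separator keys") is well defined for any serialized value.
    public static final Comparator<byte[]> UNSIGNED_KEY_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);   // unsigned byte comparison
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;                     // shorter key sorts first
    };

    public static void main(String[] args) {
        byte[] k1 = {0x01, (byte) 0xFF};
        byte[] k2 = {0x02, 0x00};
        // Unsigned order: 0x01FF < 0x0200, even though (byte) 0xFF is negative in Java.
        System.out.println(UNSIGNED_KEY_ORDER.compare(k1, k2) < 0); // true
        System.out.println(Arrays.toString(k1));
    }
}
```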
Dynamic Key Range Partitioning
- Split (p0 -> p1, p2): splits break down the indices dynamically as the data scale increases.
- Join (p1, p2 -> p3): joins merge adjacent, underutilized partitions back together.
- Move (p3 -> p4): moves redistribute the data onto existing or new nodes in the cluster.
Dynamic Key Range Partitioning
Initial conditions place a single index partition (p0), covering the entire key range and representing the entire B+Tree, on an arbitrary host (Data Service 1).
Dynamic Key Range Partitioning
Writes cause the partition to grow. Eventually its size on disk will exceed a preconfigured threshold.
Dynamic Key Range Partitioning
Instead of a simple two-way split, the initial partition is scatter split (into p1..p9; nine data services in this example) so that all data services can start managing data.
Dynamic Key Range Partitioning
The newly created partitions are then moved to the various data services (p1 to Data Service 1, p2 to Data Service 2, ..., p9 to Data Service 9). Subsequent splits are two-way, and moves occur based on relative server load (decided by the load balancer service).
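Pulling the last few slides together, a hedged sketch of the per-shard decision made when a data service overflows; the action names, thresholds, and signature are illustrative rather than the actual Bigdata logic.

```java
// Illustrative shard maintenance decision, evaluated for each index partition
// at overflow time. Not the actual Bigdata implementation.
public class ShardMaintenanceSketch {

    enum Action { SCATTER_SPLIT, SPLIT, JOIN, MOVE, NOTHING }

    static final long MAX_SHARD_BYTES = 200_000_000L;  // ~200M per shard (from the slides)
    static final long MIN_SHARD_BYTES = MAX_SHARD_BYTES / 2;

    static Action decide(long bytesOnDisk,
                         boolean isOnlyPartitionOfIndex,
                         int dataServiceCount,
                         boolean thisHostOverloaded) {
        if (bytesOnDisk > MAX_SHARD_BYTES) {
            // First split of a new index is a scatter split so that all data
            // services start managing data immediately; later splits are two-way.
            return (isOnlyPartitionOfIndex && dataServiceCount > 1)
                    ? Action.SCATTER_SPLIT
                    : Action.SPLIT;
        }
        if (bytesOnDisk < MIN_SHARD_BYTES && !isOnlyPartitionOfIndex) {
            return Action.JOIN;     // merge with a sibling shard
        }
        if (thisHostOverloaded) {
            return Action.MOVE;     // load balancer picks an underutilized target host
        }
        return Action.NOTHING;
    }

    public static void main(String[] args) {
        // A brand new index whose single partition outgrew its threshold on a 9-node cluster:
        System.out.println(decide(250_000_000L, true, 9, false)); // SCATTER_SPLIT
    }
}
```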
Bigdata Scale-Out Math
- L0: 200M metadata partition with 256-byte records.
- L1: 200M metadata partitions with 1024-byte records.
- Index partitions (p0, p1, ..., pn): 200M per application index partition.
- L0 alone can address roughly 160 Terabytes.
- L1 can address roughly 30 Exabytes per index.
- Even larger address spaces if L0 > 200M.
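A back-of-the-envelope check of those figures, assuming "200M" means 200 MB throughout:

```latex
% L0 alone (256-byte locator records pointing at 200 MB data partitions):
\[
\frac{200\,\mathrm{MB}}{256\,\mathrm{B}} \times 200\,\mathrm{MB}
\;\approx\; 7.8\times10^{5} \times 200\,\mathrm{MB}
\;\approx\; 1.6\times10^{14}\,\mathrm{B} \;\approx\; 160\,\mathrm{TB}.
\]
% Adding the L1 level (1024-byte locator records):
\[
\frac{200\,\mathrm{MB}}{256\,\mathrm{B}} \times \frac{200\,\mathrm{MB}}{1024\,\mathrm{B}} \times 200\,\mathrm{MB}
\;\approx\; 7.8\times10^{5} \times 2.0\times10^{5} \times 200\,\mathrm{MB}
\;\approx\; 3\times10^{19}\,\mathrm{B} \;\approx\; 30\,\mathrm{EB}.
\]
```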
Bigdata RDF Database
Bigdata RDF Database
- Covering indices (à la YARS, etc.).
- Three database modes: triples, provenance, or quads.
- Very high data rates.
- High-level query (SPARQL).
Covering Indices
RDF Database Modes
- Triples: 2 lexicon indices, 3 statement indices. RDFS+ inference. All you need for lots of applications.
- Provenance: datum-level provenance; query for statement metadata using SPARQL; no complex reification; no new indices; RDFS+ inference.
- Quads: named graph support, useful for lots of things, including some provenance schemes; 6 statement indices, so nearly twice the disk footprint.
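To make the covering-index idea concrete: in triples mode each statement is written under three key orders (SPO, POS, OSP), and in quads mode under six, so every triple pattern can be answered by a key-range scan on some index. A toy Java sketch, assuming fixed 8-byte term identifiers already assigned by the lexicon (real keys are variable-length and compressed):

```java
import java.nio.ByteBuffer;

// Toy sketch of covering-index key formation for triples mode.
public class CoveringIndexKeys {

    enum SpoOrder { SPO, POS, OSP }   // quads mode uses six key orders instead

    // Encode one statement (s, p, o as long term ids) as the key for one index order.
    static byte[] key(SpoOrder order, long s, long p, long o) {
        long[] perm = switch (order) {
            case SPO -> new long[] { s, p, o };
            case POS -> new long[] { p, o, s };
            case OSP -> new long[] { o, s, p };
        };
        ByteBuffer buf = ByteBuffer.allocate(3 * Long.BYTES);
        for (long id : perm) buf.putLong(id);
        return buf.array();
    }

    public static void main(String[] args) {
        // With p and o bound (e.g. p=8, o=256 as in the later query plan), the POS
        // order places all matching statements in one contiguous key range.
        byte[] k = key(SpoOrder.POS, 0L, 8L, 256L);
        System.out.println("POS key length = " + k.length + " bytes");
    }
}
```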
Statement Level Provenance
Important to know where data came from in a mashup:
  <mike, memberof, SYSTAP>
  <http://www.systap.com, sourceof, ...>
But you CANNOT say that in RDF.
RDF Reification
Reification creates a model of the statement:
  <_s1, subject, mike>
  <_s1, predicate, memberof>
  <_s1, object, SYSTAP>
  <_s1, type, Statement>
Then you can say:
  <http://www.systap.com, sourceof, _s1>
Statement Identifiers (SIDs)
Statement identifiers let you do exactly what you want:
  <mike, memberof, SYSTAP, _s1>
  <http://www.systap.com, sourceof, _s1>
SIDs look just like blank nodes, and you can use them in SPARQL:
  construct { ?s <memberof> ?o . ?s1 ?p1 ?sid . }
  where { ?s1 ?p1 ?sid . GRAPH ?sid { ?s <memberof> ?o } }
Bulk Data Load
- Very high data load rates: 1B triples in under an hour (10 data nodes, 4 clients).
- Executed as a distributed job: read data from a file system, the web, HDFS, etc.
- Database remains available for query during load: reads run against historical commit points.
- Lots of work was required to get high throughput!
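Availability during load leans on the temporal/MVCC design mentioned earlier: readers bind to an immutable historical commit point while the load writes ahead of it. A hypothetical shape for such an API (not the actual Bigdata interfaces):

```java
import java.util.Iterator;

// Hypothetical interfaces: readers pin an immutable commit point, so a
// concurrent bulk load never blocks or perturbs their view of the data.
public class HistoricalReadSketch {

    interface IndexManager {
        long lastCommitTime();                                 // most recent commit point
        IndexView getIndexView(String name, long commitTime);  // read-only historical view
    }

    interface IndexView {
        Iterator<byte[]> rangeScan(byte[] fromKey, byte[] toKey);
    }

    static long countRange(IndexManager mgr, byte[] from, byte[] to) {
        // Pin the view as of the last commit; the ongoing load commits later.
        long asOf = mgr.lastCommitTime();
        IndexView spo = mgr.getIndexView("SPO", asOf);
        long n = 0;
        Iterator<byte[]> it = spo.rangeScan(from, to);
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }
}
```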
Identifying and Resolving Performance Bottlenecks
[Chart: load throughput (triples per second), January through June]
- Baseline on the cluster: 30k triples per second.
- Increased write service concurrency: 70k.
- Faster & smarter moves for shards.
- Eliminated machine and shard hot spots; asynchronous write API: 130k.
- Asynchronous writes for TERM2ID, reduced RAM demands, increased parser threads: 300,000 triples per second (less than one hour for LUBM 8000). 13B triples loaded.
Bigdata U8000 Data Load
[Chart: told triples loaded (billions, up to ~1.2B) vs. time (minutes); sustained rate of 310k triples per second; scatter splits and ID2TERM splits are annotated on the curve.]
Remaining Bottlenecks
- Index partition splits: tend to occur together; the fix is to schedule splits proactively.
- Indices: faster index segment builds; various hotspots (shared concurrent LRU).
- Clients: buffer index writes for the target host, not the target shard.
Can we double performance again?
Asynchronous index write API
- Shared, asynchronous write buffers.
- Decouples the client from the latency of write requests.
- Transparently handles index partition splits, moves, etc.
- Filters duplicates before RMI.
- Chunkier writes on indices.
- Much higher throughput.
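A simplified sketch of that shape: callers append to a shared buffer and return immediately, while a drainer thread groups writes per target shard, filters duplicate keys, and issues chunky remote writes. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative asynchronous write buffer: decouples the caller from write
// latency and turns many small writes into a few chunky calls per target.
public class AsyncWriteBufferSketch {

    record KV(String targetShard, String key, byte[] value) {}

    interface ShardWriter {                  // stand-in for the remote data service call
        void writeChunk(String targetShard, List<KV> chunk);
    }

    private final BlockingQueue<KV> queue = new LinkedBlockingQueue<>(100_000);
    private final ShardWriter writer;
    private final int chunkSize;

    public AsyncWriteBufferSketch(ShardWriter writer, int chunkSize) {
        this.writer = writer;
        this.chunkSize = chunkSize;
        Thread drainer = new Thread(this::drainLoop, "drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Callers return immediately; the write completes asynchronously.
    public void write(KV kv) throws InterruptedException {
        queue.put(kv);
    }

    private void drainLoop() {
        try {
            while (true) {
                List<KV> batch = new ArrayList<>();
                batch.add(queue.take());                 // block for the first element
                queue.drainTo(batch, chunkSize - 1);     // then grab whatever else is ready

                // Group by target shard and filter duplicate keys before the remote call.
                Map<String, Map<String, KV>> byShard = new HashMap<>();
                for (KV kv : batch) {
                    byShard.computeIfAbsent(kv.targetShard(), s -> new HashMap<>())
                           .put(kv.key(), kv);           // last write for a key wins
                }
                byShard.forEach((shard, kvs) ->
                        writer.writeChunk(shard, new ArrayList<>(kvs.values())));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();          // shut down the drainer
        }
    }
}
```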
Distributed Data Load
[Diagrams: a job master hands tasks from a task queue to clients; writes are buffered inside each client per index partition (P1..Pn); the client then scatters the buffered writes against the index shards (e.g., SPO#1, SPO#2, SPO#3) on the data services.]
Query evaluation
- Nested subquery: clients demand data from the shards and process joins locally; can generate a huge number of RMI requests.
- Pipeline joins: map binding sets over the shards, executing joins close to the data; 50x faster for distributed query (based on earlier data distribution patterns).
- New join algorithms: e.g., push statement patterns; latency and resource requirements; etc.
Preparing a query
Original query:
  SELECT ?x WHERE {
    ?x a ub:GraduateStudent ;
       ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
  }
Translated query:
  query :- (x 8 256) ^ (x 400 3048)
Query execution plan (access paths selected, joins reordered):
  query :- pos(x 400 3048) ^ spo(x 8 256)
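One plausible reading of "access paths selected, joins reordered", sketched below: pick the index whose key order lists the bound positions as a prefix, then order the patterns by their estimated range counts (most selective first). The range-count heuristic and all names are assumptions for illustration, not something the slide states; in the plan above the second pattern reads SPO, presumably because ?x is already bound when it executes.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of query preparation for triple patterns.
public class QueryPrepSketch {

    // A triple pattern; null means the position is an unbound variable.
    record Pattern(Long s, Long p, Long o) {}

    static String chooseAccessPath(Pattern t) {
        // Use the index whose key order puts the bound positions first.
        if (t.s() != null)                  return "SPO";
        if (t.p() != null && t.o() != null) return "POS";
        if (t.o() != null)                  return "OSP";
        if (t.p() != null)                  return "POS";
        return "SPO";                        // nothing bound: full scan on any index
    }

    interface RangeCounter {                 // stand-in for a fast B+Tree range count
        long rangeCount(String index, Pattern t);
    }

    static List<Pattern> reorder(List<Pattern> patterns, RangeCounter counts) {
        return patterns.stream()
                .sorted(Comparator.comparingLong(
                        (Pattern t) -> counts.rangeCount(chooseAccessPath(t), t)))
                .toList();
    }

    public static void main(String[] args) {
        // From the slide: (x 8 256) and (x 400 3048), both with p and o bound.
        Pattern typeGradStudent = new Pattern(null, 8L, 256L);
        Pattern takesCourse0    = new Pattern(null, 400L, 3048L);
        System.out.println(chooseAccessPath(takesCourse0));  // POS, as in the plan
        RangeCounter fake = (ix, t) -> t.p() == 400L ? 100 : 1_000_000;
        System.out.println(reorder(Arrays.asList(typeGradStudent, takesCourse0), fake));
    }
}
```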
Pipeline join execution
[Diagram: the client submits the query to a join master task; the plan pos(x 400 3048) ^ spo(x 8 256) is mapped over the index shards, with shard-local tasks such as pos#2(x,400,3048) and spo#1..3(x,8,256) running on the data services that hold POS#1..3 and SPO#1..3.]
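A much-simplified sketch of a pipeline join stage: intermediate binding sets are routed to the shards that can extend them, so the join work runs next to the data instead of pulling shard data back to the client. The interfaces are illustrative stand-ins for the remote shard tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative pipeline join: intermediate solutions ("binding sets") are
// mapped to the shards whose key ranges can match them.
public class PipelineJoinSketch {

    // A solution maps variable names (e.g. "x") to term identifiers.
    record BindingSet(Map<String, Long> vars) {}

    // Stand-in for a shard-local join task on a data service: given one input
    // solution, read the matching statements and emit the extended solutions.
    interface ShardJoinTask {
        List<BindingSet> join(BindingSet input);
    }

    // Stand-in for locating the shard(s) whose key range can match the
    // statement pattern once the input solution's bindings are substituted in.
    interface ShardRouter {
        List<ShardJoinTask> route(String statementPattern, BindingSet input);
    }

    // One pipeline stage, e.g. pos(x 400 3048) then spo(x 8 256): fan each
    // input solution out to the relevant shards and gather the extended
    // solutions for the next stage.
    static List<BindingSet> stage(String statementPattern,
                                  List<BindingSet> inputs, ShardRouter router) {
        List<BindingSet> outputs = new ArrayList<>();
        for (BindingSet input : inputs) {
            for (ShardJoinTask shard : router.route(statementPattern, input)) {
                outputs.addAll(shard.join(input));
            }
        }
        return outputs;
    }
}
```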
Query Performance (10/22/2009, #trials=10, #parallel=1)

Query     Time        Result#      delta-t        % change
query1    254         4            24             10%
query2    8,212,149   2,528        (10,227,868)   -55%
query3    194         6            (67)           -26%
query4    876         34           (422)          -33%
query5    1,932       719          (75)           -4%
query6    713,445     69,222,196   (2,477,634)    -78%
query7    838         61           (29)           -3%
query8    3,239       6,463        (2,539)        -44%
query9    2,851,182   1,379,952    (2,699,119)    -49%
query10   121         0            (11)           -8%
query11   261         0            (88)           -25%
query12   1,709       0            (227)          -12%
query13   47          0            9              24%
query14   646,517     63,400,587   (2,426,916)    -79%
Total     12,432,764  134,012,550  (17,834,962)   -59%

Parenthesized delta-t values are negative (time reductions). Cluster of 10 nodes; roughly 60% improvement in one week.
Bigdata Roadmap
- Parallel materialization of RDFS closure [1, 2]
- Distributed query optimization
- High availability architecture

[1] Jesse Weaver and James A. Hendler. Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples.
[2] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen (Department of Computer Science, Vrije Universiteit Amsterdam, the Netherlands). Scalable Distributed Reasoning using MapReduce.
Bryan Thompson
Chief Scientist, SYSTAP, LLC
bryan@systap.com

bigdata: Flexible. Reliable. Affordable. Web-scale computing.