bigdata (SYSTAP, LLC 2006-2012, All Rights Reserved). http://www.bigdata.com/blog. Presented at Graph Data Management 2012.


2 Presentation outline Bigdata and the Semantic Web Bigdata Architecture Indices and Dynamic Sharding Services, Discovery, and Dynamics Bulk Data Loader Bigdata RDF Database Provenance mode Vectored Query Engine Analytic Query mode Performance Roadmap Sample applications 2

3 SYSTAP Company Overview LLC, Small Business, Founded 2006, 100% Employee Owned. Customers & Use Cases: Intelligence Community (federation and semantic alignment at scale to facilitate rapid threat detection and analysis), Telecommunications (horizontal data integration across enterprise services), Health Care (data integration and analytics), Network Storage (embedded device monitoring and root cause analysis), Collaboration and Knowledge Portals, Bioinformatics, manufacturing, NGOs, etc., and OEM Resellers. Corporate Services & Product Offering: Semantic Web Consulting Services (system vision, design, and architecture; information architecture development; ontology development and inference planning; relational data mapping and migration; rapid prototyping; infrastructure planning; technology identification and assessment; benchmarking and performance tuning; feature development; training & support) and Bigdata, an open source, horizontally scaled, high performance RDF database with dual licensing (GPL, commercial). 3 We focus on the backend semantic web database architecture and offer support and other services around that. Reasons for working with SYSTAP include: you have a prototype and you want to get it to market; you want to embed a fast, high performance semantic web database into your application; you want to integrate, query, and analyze large amounts of information across the enterprise. 3

4 High level takeaways Fast, scalable, open source, standards compliant database. Single machine to 50B triples or quads. Scales horizontally on a cluster. Compatible with 3rd party inference and middle tier solutions. Reasons to be interested: You need to embed a high performance semantic web database. You need to integrate, query, and analyze large amounts of information. You need fast turnaround on support, new features, etc. 4

5 What is big data? Big data is a way of thinking about and processing massive data. Petabyte scale. Distributed processing. Commodity hardware. Open source. 5 And, with the emergence of big data analytics, many-core and massively parallel processing. 5

6 Different kinds of big systems Parallel File Systems and Blob Stores: GFS, S3, HDFS, etc. Map/reduce: good for data locality in inputs, e.g., documents in, hash partitioned full text index out. Row stores: high read/write concurrency using atomic row operations; the basic data model is { primary key, column name, timestamp } : { value }; useful for primary key access to large data sets. Parallel (clustered) databases: the Bigdata platform fits into this category. 6 There are many different approaches to massive data. I have listed a few here. Row stores: Row stores provide single key access to schema flexible data. The original inspiration was Google's bigtable system. Map reduce: Map/reduce decomposes a problem, performs a local computation, and then aggregates the outputs of the local computation. Map/reduce is great for problems with good input locality. The general map/aggregation pattern is an old one. The modern map/reduce systems are inspired by Google's map/reduce system. Main memory graph processing: These systems provide in-memory graph processing, but lack many of the features of a database architecture (isolation, durability, etc.). The intelligence community has a long history in this area. More recently, RAM clouds are emerging in the private sector. There are also custom hardware solutions in this space, such as the Cray XMT. Parallel databases: Parallel database platforms offer ACID (Atomic, Consistent, Isolated, Durable) data processing. This world can be divided up again into traditional relational databases, column stores, and, increasingly, semantic web database platforms.

7 The Semantic Web The Semantic Web is a stack of standards developed by the W3C for the interchange and query of metadata and graph structured data. Open data Linked data Multiple sources of authority Self describing data and rules Federation or aggregation And, increasingly, provenance 7 As a historical note, provenance was left out of the original semantic web architecture because it requires statements about statements which implies second order predicate calculus. At the Amsterdam 2000 W3C Conference, TBL stated that he was deliberately staying away from second order predicate calculus. Provenance mechanisms are slowly emerging for the semantic web due to their absolute necessity in many domains. The SGML / XML Topic Maps standardization was occurring more or less at the same time. It had a focus which provided for provenance and semantic alignment with very little support for reasoning. Hopefully these things will soon be reconciled as different profiles reflecting different application requirements.

8 The Standards or the Data? TBL, S. Bratt, images copyright W3C. See (TBL, 2000) and (S. Bratt, 2006). Provenance is a huge concern for many communities. However, support for provenance, other than digitally signed proofs, was explicitly off the table when the semantic web was introduced. TBL's reason for not tackling provenance was that it requires second order predicate calculus, while the semantic web is based on first order predicate calculus. However, it is possible to tackle provenance without using a highly expressive logic. The ISO and XML Topic Maps communities have led in this area and, recently, there has been increasing discussion about provenance within the W3C. 8

9 The data: it's about the data. 9 English: The [ ] diagram [above] visualizes the data sets in the LOD cloud as well as their interlinkage relationships. Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that RDF links exist between items in the two connected data sets. Heavier arcs roughly correspond to a greater number of links between two data sets, while bidirectional arcs indicate that outward links to the other exist in each data set. Image by Anja Jentzsch. The image file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

10 Knowledge Base of Biology (KaBOB). (Diagram: 17 databases of biomedical data and information, including Entrez Gene, UniProt, DIP, GOA, GAD, HGNC, and InterPro, plus 12 Open Biomedical Ontologies, including the Gene Ontology, Sequence Ontology, Cell Type Ontology, ChEBI, NCBI Taxonomy, and Protein Ontology, feeding biomedical knowledge and application data.) 10 This is one system which is using bigdata. There are more slides on KaBOB at the end of the deck. There are currently 8.7B statements drawn from 17 databases and stored in a single bigdata Journal. KaBOB differs from many efforts in its intent to create a high quality curated database over which high level reasoners may be run. In order to run high level reasoners, the data needs to be very carefully reviewed with respect to its semantics: for example, making sure that a gene, a gene expression, and an experiment which indicates some correlation between a gene and a gene expression are all carefully distinguished. Without that, a high level reasoner will produce garbage rather than interesting entailments.

11 Enabling trends: Open source movement. Cloud infrastructure. Price pressure on commodity hardware. Many core computing (CPUs, GPUs). Solid state disk. Semantic technologies and linked open data. Business pressure toward increasing data scale. Years ago everyone knew that the database was a frozen market; today, nothing could be further from the truth. This is a time of profound upheaval in the database world. There are a number of trends which are driving this. One is the continued pressure on commodity hardware prices and performance. Another is the growth in open source software. Today, all big data systems leverage open source software and many federal contracts require it. At the same time the speed of processors has hit the wall, so applications are forced to parallelize and use distributed architectures. SSD has also changed the playing field, virtually eliminating disk latency in critical systems. The central focus for the platform is a parallel semantic web database architecture. There are other distributed platforms, including those based on custom hardware solutions, for main memory graph processing. These tend to lack core features of a database architecture such as durability and isolation.

12 The killer big data app Clouds + Open Data = Big Data Integration Critical advantages Fast integration cycle Open standards Integrate heterogeneous data, linked data, structured data, and data at rest. Opportunistic exploitation of data, including data which can not be integrated quickly enough today to derive its business value. Maintain fine grained provenance of federated data. 12 This is our take on where this is all heading. We tend to focus on high data scale and low expressivity with rapid data exploitation cycles. 12

13 Petabyte scale Dynamic sharding Commodity hardware Open source, Java High performance High concurrency (MVCC) High availability Temporal database Semantic web database 13 13

14 Key Differentiators Scalable Single machine scaling to 50B triples/quads Cluster scales out incrementally using dynamic sharding Temporal database Fast access to historical database states. Light weight versioning. Point in time recovery. Highly available Built in design for high availability. Open source Active development team including OEMs and Fortune 500 companies. 14

15 Petabyte scale. Metadata service (MDS) used to locate shards. Maps a key range for an index onto a shard and the logical data service hosting that shard. MDS is heavily cached and replicated for failover. Petabyte scale (tera triples) is easily achieved. Exabyte scale is much harder; it breaks the machine barrier for the MDS.

MDS scale   Data scale   Triples
71 MB       terabyte     18,000,000,000
72 GB       petabyte     18,000,000,000,000
74 TB       exabyte      18,000,000,000,000,000
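The triples column follows from the ~56 bytes/triple steady-state footprint reported on the disk utilization slide later in this deck. A quick sanity check of that arithmetic (an illustration added here, not from the deck):

```java
public class TriplesPerScaleCheck {
    public static void main(String[] args) {
        long bytesPerTriple = 56; // steady-state figure from the disk utilization slide
        long[] dataScales = {
                1_000_000_000_000L,          // terabyte
                1_000_000_000_000_000L,      // petabyte
                1_000_000_000_000_000_000L}; // exabyte
        for (long bytes : dataScales) {
            // prints ~18 billion, ~18 trillion, ~18 quadrillion triples,
            // matching the rows of the table above
            System.out.printf("%,d bytes -> ~%,d triples%n", bytes, bytes / bytesPerTriple);
        }
    }
}
```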

16 Cloud Architecture. Hybrid shared nothing / shared disk architecture: a compute cluster (spin compute nodes up or down as required) plus a managed cloud storage layer (S3, OpenStack, parallel file system, etc.). 16 Large scale systems need to think about deployment architectures and managing the raw resources that they will consume. Bigdata has evolved over time into a hybrid architecture in which there is a separation of concerns between the compute and persistence layers. The architecture still leverages local computation (including local disk access) whenever possible. We do this now through caching shards on the compute nodes while leaving read-only files on long term durable storage. This architecture makes it possible to deploy to platforms like Amazon EC2 (compute) + Amazon S3 (high 9s durable storage).

17 Bigdata Services and dynamics 17

18 Bigdata is a federation of services. (Architecture diagram: application clients use a unified API for RDF data and SPARQL query, with SPARQL XML and JSON results and RDF/XML, Turtle, N-Triples, TriG, N-Quads, and RDF/JSON interchange formats; client services sit alongside the management functions (registrar, Zookeeper, shard locator, transaction manager, load balancer) above the data services, which provide distributed index management and query.) 18 Data services (DS) manage shards. Shard locator service (SLS) maps shards to their data services. Transaction service (TS) coordinates transactions. Load balancer service (LBS) centralizes load balancing decisions. Client services (CS) provide execution of distributed jobs. Jini lookup services (LUS) provide service discovery. Zookeeper quorum servers (ZK) provide for master election, configuration management, and global synchronous locks. 18

19 Typical Software Stack Java Application Layer Unified API Unified API Implementation API Frameworks (Spring, etc.) Sesame Framework SAIL RDF SPARQL Bigdata RDF Database Bigdata Component Services OS (Linux) Cluster and Storage Management HTTP Jini Zookeeper 19

20 Service Discovery (1/2) Service registration and discovery (a) Registrar (b) Shard Locator Client Service Data Service Data Service (a) Services discover registrars. Registrar discovery is configurable using either unicast or multicast protocols. Services advertise themselves and lookup other services (a). Clients use the shard locator to locate key range shards for scale out indices (b). Data Service RMI / NIO Bus 20

21 Service Discovery (2/2). Service lookup: Clients resolve shard locators to data service identifiers (a), then look up the data services in the service registrar (b). Data moves directly between clients and services (c). (Diagram: registrar, shard locator, client service, and data services connected by an RMI / NIO bus, showing control messages and data flow.) Service protocols are not limited to RMI; custom NIO protocols are used for high data throughput. Client libraries encapsulate this for applications, including caching of service lookup and shard resolution. 21

22 Persistence Store Mechanisms: journal, read/write store, index segments. Write Once, Read Many (WORM) store: append only, log structured store (aka the journal); used by the data services to absorb writes; plays an important role in the temporal database architecture. Read/Write (RW) store: efficiently recycles allocation slots on the backing file; used by services that need persistence without sharding; also used in the scale up single machine database (~50B triples). Index segment (SEG) store: read optimized B+Tree files; generated by bulk index build operations on a data service; plays an important role in dynamic sharding and analytic query. 22 There are three distinct persistence store mechanisms within bigdata: the Write Once, Read Many (WORM) store, which is an append-only, log-structured store and is used by the data services to absorb writes (this is often referred to as the journal); the Read/Write (RW) store, which is capable of reusing allocation slots on the backing file and is used for services where we want to avoid decomposing the data into managed shards; and the Index segment (SEG) store, which is a read-optimized B+Tree file generated by a bulk index build operation on a data service. Both the WORM and RW persistence store designs support write replication, record level checksums, low-level correction of bad reads, and resynchronization after a service joins a failover chain. The index segments are only present on the data service nodes and are part of the mechanisms for dynamic sharding. Once an index segment file has been generated, it is replicated to the other physical data service nodes for the same logical data service instance. Bigdata provides a strong guarantee of consistency for the nodes in a failover chain such that any physical service instance in the chain may service any request for any historical commit point which has not been released by the transaction manager. This allows us to use shard affinity among the failover set for a given logical data service to load balance read operations. Since the members of the failover group have consistent state at each commit point, it is possible for any node in the failover set to perform index segment builds. In general, it makes sense to perform the build on the node with the strongest affinity for that shard, as the shard's data is more likely to be cached in RAM. 22

23 RWStore. Read/write database architecture. Scales to ~50B triples. Fast. Manages a pool of allocation regions (1k, 2k, 4k, 8k, etc.). Reasonable locality on disk through management of allocation pools, but requires SCSI (or SSD) for fast write performance (SATA is not good enough). Tracks deleted records (B+Tree nodes and leaves) and releases them once the commit points from which they were visible have expired. A commit point with timestamp ts expires when no transactions are open against any commit point less than or equal to ts and (now - ts) is greater than or equal to the retention time. 23
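A minimal sketch of the two mechanisms described on this slide: rounding a record up to a fixed allocation slot, and deciding when a commit point may be released. This is illustrative only, not the actual RWStore code; all names are hypothetical.

```java
public class RWStoreSketch {

    // Fixed allocation pool sizes, per the slide (1k, 2k, 4k, 8k, ...).
    static final int[] SLOT_SIZES = {1024, 2048, 4096, 8192, 16384, 32768};

    /** Round a record size up to the smallest slot that can hold it. */
    static int slotFor(int recordBytes) {
        for (int slot : SLOT_SIZES) {
            if (recordBytes <= slot) return slot;
        }
        throw new IllegalArgumentException("record too large for the slot pools: " + recordBytes);
    }

    /**
     * A commit point with timestamp ts may be released (and the records deleted
     * under it recycled) when no transaction reads from a commit point <= ts
     * and the commit point is older than the configured retention time.
     */
    static boolean mayRelease(long ts, long now, long retentionMillis, long oldestOpenReaderTs) {
        return oldestOpenReaderTs > ts && (now - ts) >= retentionMillis;
    }

    public static void main(String[] args) {
        System.out.println(slotFor(3000));                                 // -> 4096
        System.out.println(mayRelease(1_000L, 100_000L, 60_000L, 5_000L)); // -> true
    }
}
```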

24 The Data Service (1/2). (Diagram: scattered writes from clients land on the data services' journals; journal overflow migrates the data into index segments, which support gathered reads.) Append only journals and read-optimized index segments are basic building blocks. 24 Once clients have discovered the metadata service, they can begin operations against data services. This diagram shows how writes originating with different clients get applied to a data service, how the data services use a write through pipeline to provide data replication, and how the writes appear on the append only journal and then migrate over time onto a read-optimized index segment. The concepts of the append only, fully-buffered journal and overflow onto read-optimized index segments are basic building blocks of the architecture. 24

25 The Data Service (2/2). Behind the scenes on a data service: journal, overflow, index segments. Journal: block append for ultra fast writes; target size on disk ~200MB. Overflow: fast synchronous overflow; asynchronous index builds; key to dynamic sharding. Index Segments: 98%+ of all data on disk; target shard size on disk ~200MB; bloom filters (fast rejection); at most one IO per leaf access; multi block IO for leaf scans. 25 If a bigdata federation is configured to retain large amounts of (or infinite) history, then the persistent state of the services must be stored on a parallel file system, SAN, or NAS. This is a shared disk architecture. While quorums are still used in the shared disk configuration, the persistent state is centralized on inherently robust managed storage. Unlike the shared nothing architecture, synchronization of a node is very fast as only the live journal must be replicated to a new node (see below). In order to rival the write performance of the shared nothing architecture, the shared disk architecture uses write replication for the live journal on the data services and stores the live journal on local disk on those nodes. Each time the live journal overflows, the data service opens a new live journal and closes out the old journal against further writes. Asynchronous overflow processing then replicates the old journal to shared disk and batch builds read-only B+Tree files (index segments) from the last commit point on the old journal.

26 Bigdata Indices Scale out B+Tree and Dynamic Sharding 26

27 Bigdata Indices. Dynamically key range partitioned B+Trees for indices. Index entries (tuples) map unsigned byte[] keys to byte[] values. A deleted flag and timestamp are used for MVCC and dynamic sharding. Index partitions are distributed across data services on a cluster and located by a centralized metadata service. (Diagram: a B+Tree whose nodes hold separator keys and child references, with leaves mapping keys to values.) 27
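Because index keys are unsigned byte[]s, key comparison must treat each byte as an unsigned value (0-255). A minimal illustration (not the bigdata implementation) of lexicographic unsigned comparison:

```java
public class UnsignedByteKeys {

    /** Lexicographic comparison treating each byte as unsigned (0..255). */
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF; // mask to the unsigned value
            int y = b[i] & 0xFF;
            if (x != y) return x < y ? -1 : 1;
        }
        return a.length - b.length; // on a shared prefix, the shorter key sorts first
    }

    public static void main(String[] args) {
        byte[] k1 = {0x01};
        byte[] k2 = {(byte) 0x80};
        // Signed byte comparison would put k2 (-128) before k1 (1);
        // unsigned comparison correctly orders k1 (0x01) before k2 (0x80).
        System.out.println(compare(k1, k2) < 0); // true
    }
}
```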

28 Dynamic Key Range Partitioning p0 split p1 p2 Splits break down the shards as the data scale increases. p1 p2 join p3 Joins merge shards when data scale decreases. p3 move p4 Moves redistribute shards onto existing or new nodes in the cluster

29 Dynamic Key Range Partitioning Initial conditions place the first shard on an arbitrary data service representing the entire index key range. Shard Locator Service ([], ) p0 DataService

30 Dynamic Key Range Partitioning Writes cause the shard to grow. Eventually its size on disk exceeds a preconfigured threshold. ([], ) Shard Locator Service p0 DataService 30

31 Dynamic Key Range Partitioning Instead of a simple two way split, the initial shard is scatter split so that all data services can start managing data. p1 Shard Locator Service ([], ) p4 p2 p8 p7 p5 p3 p9 p6 DataService

32 Dynamic Key Range Partitioning (1) (2) p1 p2 DataService1 The newly created shards are then dispersed around the cluster. Subsequent splits are two way and moves occur based on relative server load (decided by Load Balancer Service). Shard Locator Service ( ) DataService2 (9) p9 DataService
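Logically, the shard locator maps the left separator key of each index partition to the data service hosting it; a key is routed to the partition whose separator is the greatest one less than or equal to the key. A hedged sketch of that lookup (names and types are illustrative only; the real locator uses unsigned byte[] keys):

```java
import java.util.TreeMap;

public class ShardLocatorSketch {

    /** Illustrative locator: an index partition and the data service hosting it. */
    record Locator(String partitionId, String dataService) {}

    // Map from a shard's left separator key to its locator.
    private final TreeMap<String, Locator> locators = new TreeMap<>();

    void register(String leftSeparator, Locator loc) {
        locators.put(leftSeparator, loc);
    }

    /** Route a key to the shard whose key range contains it. */
    Locator locate(String key) {
        return locators.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        ShardLocatorSketch mds = new ShardLocatorSketch();
        mds.register("", new Locator("p0", "DataService1"));  // initial shard covers the whole key range
        mds.register("m", new Locator("p1", "DataService2")); // after a split at separator "m"
        System.out.println(mds.locate("apple")); // -> p0 on DataService1
        System.out.println(mds.locate("zebra")); // -> p1 on DataService2
    }
}
```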

33 Shard Evolution journal 0 build p0 Builds generate index segments from just the old journal. journal 0 p0 build p1 journal 0 p1 p0 build p2 journal 0 p2 p1 p0 merge p3 Merge compacts the shard view into a single index segment

34 Shard Evolution. (Timeline diagram: the initial journal on the data service absorbs writes; at each journal overflow a new journal is opened and index segments are built from the old one, so a shard's read view spans the live journal plus segments seg 0 .. seg n.) Incremental build of new segments for a shard with each journal overflow. Shard periodically undergoes compacting merge. Shard will split at 200MB. 34

35 Bulk Data Load High Throughput with Dynamic Sharding 35

36 Bulk Data Load Very high data load rates 1B triples in under an hour (better than 300,000 triples per second on a 16 node cluster). Executed as a distributed job Read data from a file system, the web, HDFS, etc. Database remains available for query during load Read from historical commit points with snapshot isolation

37 Batch Online or Offline The current batch load is online. Snapshot isolation is available for consistent read behind Readers advance to new global commit point after load Batch offline could also be achieved 1 st Map/Reduce pass Assign identifiers and compute key range partitions 2 nd Map/Reduce pass builds Covering indices (3 or 6) DESCRIBE cache (a key value cache for RDF subjects) Global cutover to new federation instance Offline bulk load would provide all the same query features and much higher load rates. 37

38 Distributed Bulk Data Loader. The application client identifies RDF data files to be loaded. One client service is elected as the job master and coordinates the job across the other clients. Clients read directly from shared storage. Writes are scattered across the data service nodes. 38 This is an incremental bulk load into an existing system with snapshot isolation for consistent read behind against the same cluster.

39 Input (2.5B triples). 16 node cluster loading at 180k triples per second. (Chart: told triples, in billions, over time in minutes.) 39 The bottleneck was the NFS mount from which the data were read.

40 Throughput (2.5B triples). 16 node cluster, 2.5B triples. (Chart: load throughput in triples per second, up to ~200,000, over time in minutes.) 40 The bottleneck was the NFS mount from which the data were read. 40

41 Disk Utilization (~56 bytes/triple). (Chart: bytes under management, in billions of bytes on disk, over time in minutes.) 41 Disk utilization under sustained write activity converges to a steady state after ~2B triples. These numbers are without block compression in the index segments. The initial peak and the peak and valley pattern are related to full GCs in the JVM. 41

42 Shards over time (2.5B triples). (Chart: total shard count on the cluster over time, in minutes, for 2.5B triples.) 42

43 Dynamic Sharding in Action. Shard builds, merges, and splits for 2.5B triples. (Chart: number of split, build, and merge operations over time in minutes.) 43 This graph shows the relative frequency of shard build, merge, and split events. Shard builds simply transfer the last commit point of an index onto a new index segment file. This action puts the data in key order on a read-only file. By building the index segment files, we can also release the journal on which the original writes were buffered (once the retention period for commit points on that journal has expired). Shard merges compact the total view of a shard into a single index segment file. Shard merges are executed using a priority queue based on the complexity of the shard view. Shard splits break a shard into two or more key ranges. A shard split is executed once the size on disk of a compact shard exceeds a threshold (200MB).

44 Scatter Splits in Action (Zoom on first 3 minutes) 44 This is an interesting view into the early life cycle of shards on a cluster. At the start of the timeline, each index (SPO, OSP, POS, T2ID, ID2T) is on one node. These initial splits are accelerated by a low threshold for the size of a full shard on the disk. As a result, those initial index partitions are split very quickly. The initial shard split is a scatter split. The initial shard of each index is split into N*2 shards, where N is the #of nodes in the cluster. Those shards are then scattered across the cluster. The top half of the graph shows the shard split events. The bottom half of the graph shows the shard move events, which is how the data are redistributed. Throughput accelerates rapidly as the system goes through these initial scatter splits.

45 Bigdata RDF Database 45

46 Bigdata 1.2 Scales to 50B triples/quads on a single machine. Scales horizontally on a cluster. Easy to embed in 3 rd party applications. SPARQL 1.1 support (except property paths) Query, Update, Federated Query, Service Description. Native RDFS+ inference. Analytic query. Three database modes: triples, provenance, or quads. 46 Bigdata 1.1 largely focused on query performance. Put all of our effort into something that really scales up. 46

47 Recent Features. Analytics package: Native memory manager scales up to 4TB of RAM and eliminates GC overhead. Still 100% Java. New hash index data structure for named solution sets and hash joins. New query model: Faster query plans. Explain a query. Embedded SERVICEs. Query extensions, custom indexing, monitor updates, etc. Query hints using virtual triples: Fine grained control over evaluation order, choice of operators, query plans, etc. Storage layer improvements: New index for BLOBs. Inline common vocabulary items in 2-3 bytes. 47

48 RDF Database Modes Triples 2 lexicon indices, 3 statement indices. RDFS+ inference. All you need for lots of applications. Provenance Datum level provenance. Query for statement metadata using SPARQL. No complex reification. No new indices. RDFS+ inference. Quads Named graph support. Useful for lots of things, including some provenance schemes. 6 statement indices, so nearly twice the footprint on the disk. 48

49 RDF Database Schema (1/2). Dynamically key range sharded indices: the lexicon relation (t2id, id2t, blobs) and the statement relation (spo, pos, osp). The lexicon relation maps RDF Values (terms) into compact internal identifiers. The statement relation provides fast indices for SPARQL query. Scale out indices give us a scale out RDF store. 49
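The statement indices are covering indices: any triple pattern can be answered by a single key range scan, because the bound positions form a prefix of one of the index orders. A hedged sketch of index selection by bound positions (the selection rules here are illustrative, not the exact bigdata logic):

```java
public class CoveringIndexSelection {

    /** Pick a statement index whose key prefix matches the bound positions of a triple pattern. */
    static String chooseIndex(boolean s, boolean p, boolean o) {
        if (s && p) return "SPO"; // spo and sp? patterns
        if (p && o) return "POS"; // ?po patterns
        if (s && o) return "OSP"; // s?o patterns, via the (o, s) prefix
        if (s)      return "SPO";
        if (p)      return "POS";
        if (o)      return "OSP";
        return "SPO";             // no constants bound: scan any index
    }

    public static void main(String[] args) {
        System.out.println(chooseIndex(true, false, false)); // <s> ?p ?o  -> SPO
        System.out.println(chooseIndex(false, true, true));  // ?s <p> <o> -> POS
        System.out.println(chooseIndex(false, false, true)); // ?s ?p <o>  -> OSP
    }
}
```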

50 RDF Database Schema (2/2) This is slightly dated. We now inline numeric data types into the statement indices for faster aggregation and use a BLOBS index for large literals. 50

51 Covering Indices (Quads) 51

52 Statement Level Provenance Important to know where data came from in a mashup <mike, memberof, SYSTAP> < sourceof,...> But you CAN NOT say that in RDF. 52

53 RDF Reification Creates a model of the statement. <_s1, subject, :mike> <_s1, predicate, :memberof> <_s1, object, :SYSTAP> <_s1, type, Statement> Then you can say, << :sourceof, _s1>> 53 53

54 Statement Identifiers (SIDs) Statement identifiers let you do exactly what you want: <:mike, :memberof, :SYSTAP, _s1> << :sourceof, _s1>> SIDs look just like blank nodes. And you can use them in SPARQL: construct {?s :memberof?o.?s1?p1?sid. } where {?s1?p1?sid. GRAPH?sid {?s :memberof?o } } 54

55 Quads with Provenance. Recasting the same idea to support quads with provenance. This would be a new feature; we do not currently support statement level provenance in the quads mode. Bigdata represents SIDs by inlining statements into statements: <<s,p,o>,p1,o1>. SPARQL extension functions support SIDs in any mode: boolean:issid(varorconst) is true iff the argument evaluates to a SID; sid(sid,s,p,o,[c]), where the args are variables or constants, unifies sid with <s,p,o,[c]> and is used to (de)compose statements about statements. 55 See

56 Statement Identifiers (SIDs) Using sid() in SPARQL. construct {?s :memberof?o.?s1?p1?sid. } where { sid(?sid,?s,memberof,?o,?g).?s1?p1?sid. GRAPH?g {?s :memberof?o } }»SPARQL functions might need to look a little different since SPARQL does not support unification (it can not bind function arguments). 56

57 Bigdata Vectored Query Engine 57

58 Bigdata Query Engine Vectored query engine High concurrency and vectored operator evaluation. Physical query plan (BOPs) Supports pipelined, blocked, and at once operators. Chunks queue up for each operator. Chunks transparently mapped across a cluster. Query engine instance runs on: Each data service; plus Each node providing a SPARQL endpoint. 58
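A much-simplified sketch of what vectored evaluation means here: operators consume and produce chunks (batches) of solutions rather than one solution at a time. The interface and names are illustrative only; the real BOP model is far richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class VectoredOperatorSketch {

    /** A solution is a set of variable bindings; operators work on chunks of them. */
    interface VectoredOp {
        List<Map<String, String>> apply(List<Map<String, String>> chunk);
    }

    public static void main(String[] args) {
        // A trivial vectored "filter" operator: keep solutions where ?type = "GraduateStudent".
        VectoredOp filter = chunk -> {
            List<Map<String, String>> out = new ArrayList<>();
            for (Map<String, String> solution : chunk) {
                if ("GraduateStudent".equals(solution.get("type"))) out.add(solution);
            }
            return out;
        };

        List<Map<String, String>> chunk = List.of(
                Map.of("x", ":alice", "type", "GraduateStudent"),
                Map.of("x", ":bob", "type", "Professor"));

        // Chunks queue up for each operator and flow left to right through the plan.
        System.out.println(filter.apply(chunk)); // one solution survives: ?x = :alice
    }
}
```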

59 Query Plan Improvements Analytic Query Mode Large intermediate results with ZERO GC. Sub Select, Sub Groups, and OPTIONAL groups Based on a hash join pattern, can be 100s of times faster. Automatically eliminates intermediate join variables which are not projected. Merge join pattern Used when a series of group graph patterns share a common join variable. Pipelined aggregation Does not materialize data unless necessary Named solution sets An Anzo extension. Faster default graph queries Query hints Control join order and many other things. 59

60 Analytic Package Bigdata Memory Manager 100% Native Java Application of the RWStore technology Manages up to 4TB of the native process heap No GC. Relieves heap pressure, freeing JVM performance. Extensible Hash Index (HTree) Fast, scalable hash index Handles key skew and bucket overflow gracefully Use with disk or the memory manager (zero GC cost up to 4TB!) Used for analytic join operators, named solution sets, caches, etc. Runtime Query Optimizer coming soon. 60 See Sven Helmer, Thomas Neumann, Guido Moerkotte: A Robust Scheme for Multilevel Extendible Hashing. ISCIS 2003: for the HTree algorithm.
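The HTree is an extensible (extendible) hash index. A hedged sketch of the core addressing idea only (not the HTree implementation): the directory is indexed by the leading bits of the hash code, and grows when a bucket at full local depth overflows.

```java
public class ExtendibleHashSketch {

    /** Directory slot for a key: the top globalDepth bits of its hash code. */
    static int directoryIndex(int hash, int globalDepth) {
        if (globalDepth == 0) return 0;                // a single slot
        return hash >>> (Integer.SIZE - globalDepth);  // unsigned shift keeps the most significant bits
    }

    public static void main(String[] args) {
        int globalDepth = 2; // directory has 2^2 = 4 slots
        int h = "mike".hashCode();
        System.out.println(directoryIndex(h, globalDepth));

        // When a bucket whose local depth equals the global depth overflows,
        // the directory doubles (globalDepth + 1) and only that bucket is split;
        // buckets with a smaller local depth are simply shared by more slots,
        // which is how key skew and bucket overflow are absorbed gracefully.
        System.out.println(directoryIndex(h, globalDepth + 1));
    }
}
```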

61 JVM Heap Pressure. JVMs provide fast evaluation (rivaling hand-coded C++) through sophisticated online compilation and auto-tuning. (Diagram: JVM resources are divided between the GC workload and the application workload; as GC grows, application throughput falls.) However, a non-linear interaction between the application workload (object creation and retention rate) and GC running time and cycle time can steal cycles and cause application throughput to plummet. 61 Go and get Sesame; it will fall over because of this. Several commercial and open source Java grid cache products exist, including Oracle's Coherence, Infinispan, and Hazelcast. However, all of these products share a common problem: they are unable to exploit large heaps due to an interaction between the Java Garbage Collector (GC) and the object creation and object retention rate of the cache. There is a non-linear interaction between the object creation rate, the object retention rate, and the GC running time and cycle time. For many applications, garbage collection is very efficient and Java can run as fast as C++. However, as the Java heap begins to fill, the garbage collector must run more and more frequently, and the application is locked out while it runs. This leads to long application level latencies that bring the cache to its knees and throttle throughput. Sun and Oracle have developed a variety of garbage collector strategies, but none of them are able to manage large heaps efficiently. A new technology is necessary in order to successfully deploy object caches that can take advantage of servers with large main memories. 61

62 Choose standard or analytic operators Easy to specify which URL query parameter or SPARQL query hint Java operators Use the managed Java heap. Can sometimes be faster or offer better concurrency E.g., distinct solutions is based on a concurrent hash map BUT The Java heap can not handle very large materialized data sets. GC overhead can steal your computer Analytic operators Scale up gracefully Zero GC overhead. 62

63 Query Hints Virtual triples Scope to query, graph pattern group, or join. Allow fine grained control over Analytic query mode Evaluation order Vector size Etc. SELECT?x?y?z WHERE { hint:query hint:vectorsize ?x a ub:graduatestudent.?y a ub:university.?z a ub:department.?x ub:memberof?z.?z ub:suborganizationof?y.?x ub:undergraduatedegreefrom?y. } 63

64 New Bigdata Join Operators Pipeline joins (fast, incremental). Map binding sets over the shards, executing joins close to the data. Faster for single machine and much faster for distributed query. First results appear with very low latency Hash joins against an access path Scan the access path, probing the hash table for joins. Hash joins on a cluster can saturate the disk transfer rate! Solution set hash joins Combine sets of solutions on shared join variables. Used for Sub Select and Group Graph Patterns (aka join groups) Merge joins Used when a series of group graph patterns share a common join variable Think OPTIONALs. Java : Must sort the solution sets first, then a single pass to do the join. HTree : Linear in the data (it does not need to sort the solutions). 64
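A hedged sketch of the solution set hash join idea (illustrative only, not the bigdata operator): build a hash index over one set of solutions keyed by the shared join variable, then probe it with the other set and merge compatible bindings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SolutionSetHashJoin {

    static List<Map<String, String>> hashJoin(List<Map<String, String>> left,
                                              List<Map<String, String>> right,
                                              String joinVar) {
        // Build: index the left solutions by their binding for the join variable.
        Map<String, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> sol : left) {
            index.computeIfAbsent(sol.get(joinVar), k -> new ArrayList<>()).add(sol);
        }
        // Probe: for each right solution, emit the merged bindings for every match.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> sol : right) {
            for (Map<String, String> match : index.getOrDefault(sol.get(joinVar), List.of())) {
                Map<String, String> merged = new HashMap<>(match);
                merged.putAll(sol);
                out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> products = List.of(Map.of("product", ":p1", "label", "Widget"));
        List<Map<String, String>> reviews  = List.of(Map.of("product", ":p1", "review", ":r1"),
                                                     Map.of("product", ":p2", "review", ":r2"));
        // One merged solution: product=:p1, label=Widget, review=:r1
        System.out.println(hashJoin(products, reviews, "product"));
    }
}
```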

65 Query Evaluation. Logical and physical query models: SPARQL Parser -> AST -> AST Optimizers -> Query Plan Generator -> BOPs -> Vectored Query Engine -> Solutions, with term resolution, fast range counts, and access paths against the database. 65 As part of our 1.1 release, we had to take over more control of query evaluation from the Sesame platform. We found that the Sesame operator tree was discarding too much information about the structure of the original SPARQL query. Also, bigdata query evaluation has always been vectored and operates on IVs rather than RDF Values. The Sesame visitor pattern was basically incompatible with bigdata query evaluation. Bigdata is now integrated with the Sesame platform directly at the SPARQL parser. All query evaluation from that point forward is handled by bigdata. Most of the work is now done using the Abstract Syntax Tree (AST). The AST is much closer to the SPARQL grammar. AST optimizers annotate and rewrite the AST. The AST is then translated to Bigdata Operators (BOPs), which run on the query engine. The AST ServiceNode provides for 3rd party operators and services. The bigdata query engine is vectored and highly concurrent. Most bigdata operators support concurrent evaluation and multiple instances of the same operator can be running in parallel. This allows us to fully load the disk queue. With the new memory manager and analytic query operators, having large solution sets materialized in memory no longer puts a burden on the JVM.

66 Data Flow Evaluation Left to right evaluation Physical query plan is operator sequence (not tree) op1 op2 op3 Parallelism Within operator Across operators Across queries Operator annotations used extensively. 66

67 Pipeline Joins Fast, vectored, indexed join. Very low latency to first solutions. Mapped across shards on a cluster. Joins execute close to the data. Faster for single machine Much faster for distributed query. 67

68 Preparing a query Original query: SELECT?x WHERE {?x a ub:graduatestudent ; ub:takescourse < } Translated query: query :- (x 8 256) ^ (x ) Query execution plan (access paths selected, joins reordered): query :- pos(x ) ^ spo(x 8 256) 68

69 Pipeline Join Execution 69

70 TEE. Solutions flow to both the default and alt sinks. UNION(A,B): default sink, TEE, A, B, alt. sink. Generalizes to n-ary UNION. 70

71 Conditional Flow Data flows to either the default sink or the alt sink, depending on the condition. default sink cond? yes no alt. sink Branching pattern depends on the query logic. 71

72 Sub Group, Sub Select, etc. Build a hash index from the incoming solutions (at once). Run the sub select, sub group, etc. Solutions from the hash index are vectored into the sub plan. Sub Select must hide variables which are not projected. Join against the hash index (at once). build index group hash index Normal Optional Exists Not Exists 72 The JVM and HTree hash join utility classes understand the following kinds of hash join: /** * A normal join. The output is the combination of the left and right hand * solutions. Only solutions which join are output. */ Normal, /** * An optional join. The output is the combination of the left and right * hand solutions. Solutions which join are output, plus any left solutions * which did not join. Constraints on the join ARE NOT applied to the * "optional" path. */ Optional, /** * A join where the left solution is output iff there exists at least one * right solution which joins with that left solution. For each left * solution, that solution is output either zero or one times. In order to * enforce this cardinality constraint, the hash join logic winds up * populating the "join set" and then outputs the join set once all * solutions which join have been identified. */ Exists, /** * A join where the left solution is output iff there is no right solution * which joins with that left solution. This basically an optional join * where the solutions which join are not output. * <p> * Note: This is also used for "MINUS" since the only difference between * "NotExists" and "MINUS" deals with the scope of the variables. */ NotExists, /** * A distinct filter (not a join). Only the distinct left solutions are * output. Various annotations pertaining to JOIN processing are ignored * when the hash index is used as a DISTINCT filter. */

73 Named Solution Sets Support solution set reuse Evaluation Run before main WHERE clause. Exogenous solutions fed into named subquery. Results written on hash index. Hash index INCLUDEd into query. Example is BSBM BI Q5 Select?country?product?nrOfReviews?avgPrice WITH { Select?country?product (count(?review) As?nrOfReviews) {?product a < bsbm:reviewfor?product.?review rev:reviewer?reviewer.?reviewer bsbm:country?country. } Group By?country?product } AS %namedset1 WHERE { { Select?country (max(?nrofreviews) As?maxReviews) { INCLUDE %namedset1 } Group By?country } { Select?product (avg(xsd:float(str(?price))) As?avgPrice) {?product a < bsbm:product?product.?offer bsbm:price?price. } Group By?product } INCLUDE %namedset1. FILTER(?nrOfReviews=?maxReviews) } Order By desc(?nrofreviews)?country?product 73 See

74 Bottom Up Evaluation SPARQL semantics specify as if bottom up evaluation For many queries, left to right evaluation produces the same results. Badly designed left joins must be lifted out and run first. SELECT * { :x1 :p?v. OPTIONAL { :x3 :q?w. OPTIONAL { :x2 :p?v } } } rewrite SELECT * { WITH { SELECT?w?v { :x3 :q?w. OPTIONAL { :x2 :p?v } } } AS %namedset1 :x1 :p?v. OPTIONAL { INCLUDE %namedset1 } } 74

75 Merge Join Pattern High performance n way join Linear in the data (on HTree). Series of group graph patterns must share a common join variable. Common pattern in SPARQL Select something and then fill in optional information about it through optional join groups. SELECT (SAMPLE(?var9) AS?var1)?var2?var3 { { SELECT DISTINCT?var3 WHERE {?var3 rdf:type pol:politician.?var3 pol:hasrole?var6.?var6 pol:party "Democrat". } } OPTIONAL {?var3 p1:name?var9 } OPTIONAL {?var10 p2:votedby?var3.?var10 rdfs:label?var2. }... } GROUP BY?var2?var3 75 Really common pattern in SPARQL select something and then fill in optional information about it through optional join groups. The critical pattern for a merge join is that we have a hash index against which we may join several other hash indices. To join, of course, the hash indices must all have the same join variables. The pattern recognized here is based on an initial INCLUDE (defining the first hash index) followed by several either required or OPTIONAL INCLUDEs, which are the additional hash indices. The merge join itself is the N-way solution set hash join and replaces the sequence of 2-way solution set hash joins which we would otherwise do for those solution sets. It is more efficient because the output of each 2-way join is fed into the next 2-way join, which blows up the input cardinality for the next 2-way hash join. While the merge join will consider the same combinations of solutions, it does so with a very efficient linear pass over the hash indices. (For the JVM Merge Join operator we have to do a SORT first, but the HTree imposes a consistent ordering by the hash bits so we can go directly to the linear pass. (It is actually log linear due to the tree structure of the HTree, but it is against main memory and it is a sequential scan of the index so it should be effectively linear.) This is based on query21 from our govtrack CI dataset. 75

76 Merge Joins (cont.) SELECT (SAMPLE(?var9) AS?var1)?var2?var3 { { SELECT DISTINCT?var3 WHERE {?var3 rdf:type pol:politician.?var3 pol:hasrole?var6.?var6 pol:party "Democrat". } } OPTIONAL {?var3 p1:name?var9 rewrite } OPTIONAL {?var10 p2:votedby?var3.?var10 rdfs:label?var2. }... } GROUP BY?var2?var3 SELECT WITH { } as %set1 WITH { } as %set2 WITH { } as %set3 WHERE { INCLUDE %set1. OPTIONAL { INCLUDE %set2. } OPTIONAL { INCLUDE %set3. } } Handled as one n way merge join. 76 Two implementations: Java : Sorts the solution sets first, then does a linear scan for the join. HTree : Linear in the data (htree imposes a consistent ordering).

77 Parallel Graph Query Plans Partly ordered scans Parallel AP reads against multiple shards Hash partitioned Distinct Order by Aggregation (when not pipelined) Default Graph Queries Global index view provides RDF Merge semantics. DISTINCT TRIPLES across the set of contexts in the default graph 77 We do only some of this today, but this is where we are heading with the clustered database. Other than the default graph access path, this is all pretty standard stuff. We are looking at alternative ways to handle default graph queries on a cluster.

78 Parallel Hash Joins for Unselective APs AP must read a lot of data with reasonable locality Shard view: journal + index segment(s). One IO per leaf on segments. 90%+ data in key order on disk Narrow stride (e.g., OSP or POS) Crossover once sequential IO dominates vectored nested index join Key range scans turn into sequential IO in the cluster Goal is to saturate the disk read bandwidth Strategy A Map intermediate solutions against target shards Build hash index on target nodes Parallel local AP scans, probing hash index for joins. Strategy B Hash partition intermediate solutions on join variables. Parallel local AP scans against spanned shard(s) Vectored hash partition of solutions read from APs. Vectored probe in distributed hash index partitions for joins. 78 The question is how to best leverage the fact that 99%+ of the data are in key order on the disk in the clustered database architecture. We are looking at two basic designs here. One would map the intermediate solutions to the shard against which they need to be joined. The other would scan the shards, hash partitioning both the intermediate solutions and the solutions from the shard APs. Either way, the goal is to saturate the local disk when reading on unselective access paths in support of fast analytic query. This bears some relationship to column stores, not in the database architecture but in the goal of maximizing the IO transfer rate through narrow strides (triples or quads are quite narrow) and doing sequential IO when scanning a column.

79 Integration Points 79

80 Sesame Sail & Repository APIs Java APIs for managing and querying RDF data Extension methods for bigdata: Non blocking readers RDFS+ truth maintenance Full text search Change log Etc. 80
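A minimal sketch of embedding the database through the Sesame Repository API. The openrdf calls are standard Sesame 2.x; the BigdataSail / BigdataSailRepository class names and the properties-file configuration are assumptions based on the bigdata distribution of this era and should be checked against the documentation.

```java
import java.io.FileInputStream;
import java.util.Properties;

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;

import com.bigdata.rdf.sail.BigdataSail;            // assumed class name
import com.bigdata.rdf.sail.BigdataSailRepository;  // assumed class name

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Journal location, database mode, etc. come from a properties file;
        // the exact property names are documented with the bigdata release.
        Properties props = new Properties();
        props.load(new FileInputStream("bigdata.properties"));

        Repository repo = new BigdataSailRepository(new BigdataSail(props));
        repo.initialize();

        RepositoryConnection cxn = repo.getConnection();
        try {
            TupleQueryResult result = cxn.prepareTupleQuery(QueryLanguage.SPARQL,
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10").evaluate();
            while (result.hasNext()) {
                System.out.println(result.next()); // one BindingSet per solution
            }
            result.close();
        } finally {
            cxn.close();
            repo.shutDown();
        }
    }
}
```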

81 NanoSparqlServer. Easy deployment: embedded or in a servlet container. High performance SPARQL end point: optimized for bigdata MVCC semantics (queries are non blocking); built in resource management; scalable! Simple REST API: SPARQL 1.1 Query and Update; simple and useful RESTful INSERT/UPDATE/DELETE methods; ESTCARD exposes super fast range counts for triple patterns; explain a query; monitor running queries. 81 Benefits over Sesame: trivial to deploy; optimized for bigdata MVCC semantics; built in resource management: controls the # of threads. 81
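Because the NanoSparqlServer speaks the standard SPARQL protocol over HTTP, any HTTP client can query it. A minimal sketch using only the JDK; the endpoint URL is an assumption, so substitute the address your server is actually listening on.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class NanoSparqlClient {
    public static void main(String[] args) throws Exception {
        // Endpoint URL is an assumption; use your NanoSparqlServer's actual address.
        String endpoint = "http://localhost:8080/sparql";
        String query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";

        // Standard SPARQL Protocol GET: the query goes in the 'query' parameter.
        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // SPARQL results serialized as JSON
            }
        }
    }
}
```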

82 SPARQL Query High level query language for graph data. Based on pattern matching and property paths. Advantages Standardization Database can optimize joins Versus navigation only APIs such as blueprints 82 SPARQL - Blueprints -

83 SPARQL Update Graph Management Create, Add, Copy, Move, Clear, Drop Graph Data Operations LOAD uri INSERT DATA, DELETE DATA DELETE/INSERT ( WITH IRIref )? ( ( DeleteClause InsertClause? ) InsertClause ) ( USING ( NAMED )? IRIref )* WHERE GroupGraphPattern Can be used as a RULES language, update procedures, etc. 83

84 Bigdata SPARQL UPDATE Language extension for SPARQL UPDATE Adds cached and durable solution sets. INSERT INTO %solutionset1 SELECT?product?reviewer WHERE { } Easy to slice result sets: SELECT... { INCLUDE %solutionset1 } OFFSET 0 LIMIT 1000 SELECT... { INCLUDE %solutionset1 } OFFSET 1000 LIMIT 1000 SELECT... { INCLUDE %solutionset1 } OFFSET 2000 LIMIT 1000 Re group or re order results: SELECT... { INCLUDE %solutionset1 } GROUP BY?x SELECT... { INCLUDE %solutionset1 } ORDER BY ASC(?x) Expensive JOINs are NOT recomputed. 84 See

85 Basic Federated Query Language extension for remote services SERVICE uri { graph pattern } Integrated into the vectored query engine Solutions vectored into, and out of, remote end points. Control evaluation order query hints: runfirst, runlast, runonce, etc. SERVICE {. } hint:prior hint:runfirst true. ServiceRegistry Configure service end point behavior 85 See See

86 Custom Services Examples Bigdata search index Ready made for 3 rd party integrations (Lucene, spatial, embedded prolog, etc.) Custom Services Run in the same JVM Can monitor database updates Invoked with the SERVICE graph pattern Solutions vectored in and out. 86

87 Low Latency Full Text Index 87

88 Full Text Index Built on top of standard B+Tree index architecture Enabled via a database property Each literal added to database indexed by full text engine High level SPARQL integration Java API Supports low latency paged result sets Supports whole word match and prefix match Embedded wildcard search (other than prefix match) not supported Hits are scored by traditional cosine relevance 88

89 Full Text Index. Tokens are extracted from the RDF Literal using Lucene analyzers based on the language code; custom tokenizers can be registered, e.g., for engineering data. Index fields: docid (the IV of the RDF Literal having that token), termfreq (how many times the token appears in that literal), localtermweight (the normalized term frequency data). 89
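A hedged sketch of cosine relevance scoring over these per-literal term weights (illustrative only, not the bigdata scoring code): the hit score is the cosine of the angle between the query's term vector and the literal's local term weights.

```java
import java.util.Map;

public class CosineRelevance {

    /** Cosine similarity between a query term vector and a literal's local term weights. */
    static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            dot += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
            qNorm += e.getValue() * e.getValue();
        }
        for (double w : doc.values()) {
            dNorm += w * w;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        Map<String, Double> query = Map.of("mike", 1.0, "personick", 1.0);
        Map<String, Double> lit1  = Map.of("mike", 0.7, "personick", 0.7);
        Map<String, Double> lit2  = Map.of("mike", 0.4, "smith", 0.9);
        System.out.println(cosine(query, lit1)); // ~1.0: both query terms match
        System.out.println(cosine(query, lit2)); // lower: only one query term matches
    }
}
```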

90 Free Text Search: Basic. Bigdata defines a number of magic predicates that can be used inside SPARQL queries. bd:search is used to bind a variable to terms in the full text index, which can then be used in joins: select?s?p?o where { # search for literals that use mike?o bd:search mike. # join to find statements that use those literals?s?p?o. } 90

91 Free Text Search: Advanced Other magic predicates provide more advanced features: select?s?p?o?score?rank where {?o bd:search mike personick.?o bd:matchallterms true.?o bd:minrelevance 0.25.?o bd:relevance?score.?o bd:maxrank 1000.?o bd:rank?rank.?s?p?o. } 91

92 Benchmarks 92

93 LUBM U50. Benchmark provides a mixture of selective, unselective, simple, and complex queries. Can be run at many data scales. Queries are not parameterized, so all runs are hot. (Table: per-query running time in milliseconds and result counts for the LUBM U50 query mix; total running time 9,721 ms.) 93

94 BSBM 100M. Graph shows a series of trials for the BSBM reduced query mix (w/o Q5). Metric is Query Mixes per Hour (QMpH); higher is better. The 8 client curve shows JVM and disk warm up effects; both are hot for the 16 client curve. Occasional low points are GC. Apple mini (4 cores, 16G RAM and SSD). Machine is CPU bound at 16 clients; no IO wait. (Chart: QMpH, up to ~60,000, over successive trials for 8 and 16 clients.) 94

95 Roadmap 95

96 Bigdata Roadmap 2012 Q1 SPARQL 1.1 Update SPARQL 1.1 Service Description SPARQL 1.1 Basic Federated Query Sesame Ganglia Integration 2012 Q2 SPARQL 1.1 Update Extensions (solution sets, cache control) SPARQL 1.1 Property paths New search index (subject centric) Runtime query optimizer (RTO) for analytic queries RDF GOM 2012 Q3 High Availability for the RWStore 2012 Q4 Scale Out enhancements Cloud enabled storage management 96 See

97 Runtime Query Optimization 97

98 RDF Join Order Optimization Typical approach Assign estimated cardinality to each triple pattern. Bigdata uses the fast range counts Start with the most selective triple pattern For each remaining join Propagate variable bindings Re estimate cardinality of remaining joins Cardinality estimation error Grows exponentially with join path length 98
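A hedged sketch of the static approach described above (illustrative only): sort the triple patterns by their fast range counts and start with the most selective. The real optimizer also propagates variable bindings and re-estimates the remaining joins after each choice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StaticJoinOrderSketch {

    /** A triple pattern with its estimated cardinality (from a fast range count). */
    record Pattern(String pattern, long rangeCount) {}

    static List<Pattern> order(List<Pattern> patterns) {
        List<Pattern> ordered = new ArrayList<>(patterns);
        // Greedy: most selective (smallest range count) first. A real optimizer would
        // re-estimate the remaining patterns after propagating the new bindings.
        ordered.sort(Comparator.comparingLong(Pattern::rangeCount));
        return ordered;
    }

    public static void main(String[] args) {
        List<Pattern> plan = order(List.of(
                new Pattern("?x a ub:GraduateStudent", 1_200),      // counts are made up for illustration
                new Pattern("?x ub:takesCourse <...>", 40),
                new Pattern("?x ub:memberOf ?z", 500_000)));
        plan.forEach(p -> System.out.println(p.rangeCount() + "\t" + p.pattern()));
    }
}
```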

99 Runtime Query Optimizer (RTO) Inspired by ROX (Runtime Optimization for XQuery) Online, incremental and adaptive dynamic programming. Breadth first estimation of actual cost of join paths. Cost function is cumulative intermediate cardinality. Extensions for SPARQL semantics. Query plans optimized in the data for each query. Never produces a bad query plan. Does not rely on statistical summaries. Recognizes correlations in the data for the specific query. Can improve the running time for some queries by 10x 1000x. Maximum payoff for high volume analytic queries. 99 See ROX: Run-Time Optimization of XQueries, by Riham Abdel Kader (PhD Thesis). See ROX: Runtime Optimization of XQueries by Riham Abdel Kader and Peter Boncz [ See ROX: The Robustness of a Run-time XQuery Optimizer Against Correlated Data [


More information

Liferay Performance Tuning

Liferay Performance Tuning Liferay Performance Tuning Tips, tricks, and best practices Michael C. Han Liferay, INC A Survey Why? Considering using Liferay, curious about performance. Currently implementing and thinking ahead. Running

More information

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election

More information

LDIF - Linked Data Integration Framework

LDIF - Linked Data Integration Framework LDIF - Linked Data Integration Framework Andreas Schultz 1, Andrea Matteini 2, Robert Isele 1, Christian Bizer 1, and Christian Becker 2 1. Web-based Systems Group, Freie Universität Berlin, Germany a.schultz@fu-berlin.de,

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store Oracle NoSQL Database A Distributed Key-Value Store Charles Lamb, Consulting MTS The following is intended to outline our general product direction. It is intended for information

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Jonathan Halstuch, COO, RackTop Systems JHalstuch@racktopsystems.com Big Data Invasion We hear so much on Big Data and

More information

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services Cognos8 Deployment Best Practices for Performance/Scalability Barnaby Cole Practice Lead, Technical Services Agenda > Cognos 8 Architecture Overview > Cognos 8 Components > Load Balancing > Deployment

More information

ZooKeeper. Table of contents

ZooKeeper. Table of contents by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals...2 1.2 Data model and the hierarchical namespace...3 1.3 Nodes and ephemeral nodes...

More information

How to Choose Between Hadoop, NoSQL and RDBMS

How to Choose Between Hadoop, NoSQL and RDBMS How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier

Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier Simon Law TimesTen Product Manager, Oracle Meet The Experts: Andy Yao TimesTen Product Manager, Oracle Gagan Singh Senior

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

Cosmos. Big Data and Big Challenges. Pat Helland July 2011

Cosmos. Big Data and Big Challenges. Pat Helland July 2011 Cosmos Big Data and Big Challenges Pat Helland July 2011 1 Outline Introduction Cosmos Overview The Structured s Project Some Other Exciting Projects Conclusion 2 What Is COSMOS? Petabyte Store and Computation

More information

Data Store Interface Design and Implementation

Data Store Interface Design and Implementation WDS'07 Proceedings of Contributed Papers, Part I, 110 115, 2007. ISBN 978-80-7378-023-4 MATFYZPRESS Web Storage Interface J. Tykal Charles University, Faculty of Mathematics and Physics, Prague, Czech

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Big data management with IBM General Parallel File System

Big data management with IBM General Parallel File System Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers

More information

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Mission-Critical Java. An Oracle White Paper Updated October 2008

Mission-Critical Java. An Oracle White Paper Updated October 2008 Mission-Critical Java An Oracle White Paper Updated October 2008 Mission-Critical Java The Oracle JRockit family of products is a comprehensive portfolio of Java runtime solutions that leverages the base

More information

Google File System. Web and scalability

Google File System. Web and scalability Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

SQL Anywhere 12 New Features Summary

SQL Anywhere 12 New Features Summary SQL Anywhere 12 WHITE PAPER www.sybase.com/sqlanywhere Contents: Introduction... 2 Out of Box Performance... 3 Automatic Tuning of Server Threads... 3 Column Statistics Management... 3 Improved Remote

More information

Development of nosql data storage for the ATLAS PanDA Monitoring System

Development of nosql data storage for the ATLAS PanDA Monitoring System Development of nosql data storage for the ATLAS PanDA Monitoring System M.Potekhin Brookhaven National Laboratory, Upton, NY11973, USA E-mail: potekhin@bnl.gov Abstract. For several years the PanDA Workload

More information

A Survey of Shared File Systems

A Survey of Shared File Systems Technical Paper A Survey of Shared File Systems Determining the Best Choice for your Distributed Applications A Survey of Shared File Systems A Survey of Shared File Systems Table of Contents Introduction...

More information

ORACLE DATABASE 10G ENTERPRISE EDITION

ORACLE DATABASE 10G ENTERPRISE EDITION ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

Mark Bennett. Search and the Virtual Machine

Mark Bennett. Search and the Virtual Machine Mark Bennett Search and the Virtual Machine Agenda Intro / Business Drivers What to do with Search + Virtual What Makes Search Fast (or Slow!) Virtual Platforms Test Results Trends / Wrap Up / Q & A Business

More information

Understanding Data Locality in VMware Virtual SAN

Understanding Data Locality in VMware Virtual SAN Understanding Data Locality in VMware Virtual SAN July 2014 Edition T E C H N I C A L M A R K E T I N G D O C U M E N T A T I O N Table of Contents Introduction... 2 Virtual SAN Design Goals... 3 Data

More information

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce Elastic Application Platform for Market Data Real-Time Analytics Can you deliver real-time pricing, on high-speed market data, for real-time critical for E-Commerce decisions? Market Data Analytics applications

More information

In-Memory BigData. Summer 2012, Technology Overview

In-Memory BigData. Summer 2012, Technology Overview In-Memory BigData Summer 2012, Technology Overview Company Vision In-Memory Data Processing Leader: > 5 years in production > 100s of customers > Starts every 10 secs worldwide > Over 10,000,000 starts

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni

OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni Agenda Database trends for the past 10 years Era of Big Data and Cloud Challenges and Options Upcoming database trends Q&A Scope

More information

Exploiting peer group concept for adaptive and highly available services

Exploiting peer group concept for adaptive and highly available services Exploiting peer group concept for adaptive and highly available services Muhammad Asif Jan Centre for European Nuclear Research (CERN) Switzerland Fahd Ali Zahid, Mohammad Moazam Fraz Foundation University,

More information

Big Data and Big Data Modeling

Big Data and Big Data Modeling Big Data and Big Data Modeling The Age of Disruption Robin Bloor The Bloor Group March 19, 2015 TP02 Presenter Bio Robin Bloor, Ph.D. Robin Bloor is Chief Analyst at The Bloor Group. He has been an industry

More information

SQL Server Virtualization

SQL Server Virtualization The Essential Guide to SQL Server Virtualization S p o n s o r e d b y Virtualization in the Enterprise Today most organizations understand the importance of implementing virtualization. Virtualization

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

MS SQL Performance (Tuning) Best Practices:

MS SQL Performance (Tuning) Best Practices: MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware

More information

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

low-level storage structures e.g. partitions underpinning the warehouse logical table structures DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures

More information

Building a Scalable News Feed Web Service in Clojure

Building a Scalable News Feed Web Service in Clojure Building a Scalable News Feed Web Service in Clojure This is a good time to be in software. The Internet has made communications between computers and people extremely affordable, even at scale. Cloud

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks

More information

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015 E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing

More information

11.1 inspectit. 11.1. inspectit

11.1 inspectit. 11.1. inspectit 11.1. inspectit Figure 11.1. Overview on the inspectit components [Siegl and Bouillet 2011] 11.1 inspectit The inspectit monitoring tool (website: http://www.inspectit.eu/) has been developed by NovaTec.

More information

Technology Insight Series

Technology Insight Series Evaluating Storage Technologies for Virtual Server Environments Russ Fellows June, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved Executive Summary

More information

SQL Server 2014 New Features/In- Memory Store. Juergen Thomas Microsoft Corporation

SQL Server 2014 New Features/In- Memory Store. Juergen Thomas Microsoft Corporation SQL Server 2014 New Features/In- Memory Store Juergen Thomas Microsoft Corporation AGENDA 1. SQL Server 2014 what and when 2. SQL Server 2014 In-Memory 3. SQL Server 2014 in IaaS scenarios 2 SQL Server

More information

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) ! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and

More information

Hypertable Goes Realtime at Baidu. Yang Dong yangdong01@baidu.com Sherlock Yang(http://weibo.com/u/2624357843)

Hypertable Goes Realtime at Baidu. Yang Dong yangdong01@baidu.com Sherlock Yang(http://weibo.com/u/2624357843) Hypertable Goes Realtime at Baidu Yang Dong yangdong01@baidu.com Sherlock Yang(http://weibo.com/u/2624357843) Agenda Motivation Related Work Model Design Evaluation Conclusion 2 Agenda Motivation Related

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

More information

SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK

SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK 3/2/2011 SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK Systems Group Dept. of Computer Science ETH Zürich, Switzerland SwissBox Humboldt University Dec. 2010 Systems Group = www.systems.ethz.ch

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information