PDQCollections: A Data-Parallel Programming Model and Library for Associative Containers
Maneesh Varshney, Vishwa Goudar
Technical Report #
Computer Science Department, University of California, Los Angeles
April 2013

ABSTRACT

Associative containers are content-addressable data structures, such as maps, ordered maps, multimaps and sets, that are widely employed in a variety of computational problems. In this paper, we explore a parallel programming paradigm for data-centric computations involving associative data. We present PDQCollections, a novel set of data structures, coupled with a computation model (which we refer to as the Split-Replicate-Merge model), that can efficiently exploit parallelism in multi-core as well as distributed environments, for in-memory as well as large datasets. The distinguishing characteristics of our programming model are data structures that inherently encapsulate parallelism, and a computation model that transforms the problem of parallelization into that of defining an addition operator for value data types. The PDQ design offers fundamental benefits over traditional data structures: with memory-bound workloads, PDQ avoids locks and other forms of synchronization, and significantly outperforms lock-based data structures; with larger disk-bound workloads, PDQ does not use caches and avoids random disk access, and thus significantly outperforms traditional disk-backed structures; and with distributed workloads, PDQ accesses remote data resources sequentially and only once, and significantly outperforms distributed data structures. We highlight the distinguishing capabilities of the PDQ library with several applications drawn from a variety of fields in Computer Science, including machine learning, data mining, graph processing, relational processing of structured data and incremental processing of log data.

1.
INTRODUCTION

Associative containers, also commonly referred to as hashes, hash tables, dictionaries or maps, are a cornerstone of programming template libraries. They provide standardized, robust and easy-to-use mechanisms to efficiently store and access key-value mappings. In this paper, we explore data-centric programming problems where either the input, or the output, or both are associative containers. Such computations are commonly employed in the fields of document processing, data mining, data analytics and machine learning, statistical analysis, log analysis, natural language processing, indexing and so on. In particular, we seek a data-parallel programming framework for associative data, where the parallelism can scale from multi-core to distributed environments, the data can scale from in-memory to disk-backed to distributed storage, and the programming paradigm is as close as possible to natural sequential programming patterns.

The problems of data parallelism with associative containers are unlike those with index-based data structures such as arrays and matrices. The most familiar parallel programming paradigm for the latter is the parallel for loop, as exemplified in OpenMP, Intel's Threading Building Blocks (TBB) and Microsoft's .NET Task Parallel Library (TPL). In this model, the index range of the for loop is partitioned and each partition of the range is assigned to a separate thread. However, this paradigm requires that the input to the computation (which has to be partitioned for parallelization) must be index-addressable. Secondly, as the outputs generated by each thread must be protected against concurrent modifications, each thread must write data in non-overlapping memory regions, or use critical sections or concurrent-safe containers. Maps are content-addressable and, consequently, can serve as neither input nor output in a parallel for computation.
Furthermore, in the data-parallel programming context, synchronization locks are often contended, leading to significant locking overhead, which we confirmed in our benchmarks. Finally, these libraries are typically meant for shared-memory, memory-bound workloads and offer little support, or perform poorly, with distributed systems and persistent data stores.

In our study of data-parallel programming paradigms, we did not discover any library or framework that can: (a) operate on associative containers, (b) execute in shared-memory multi-threaded as well as distributed contexts, (c) support data sizes that scale from in-memory to disk-backed, and (d) provide parallelization constructs that are as close as possible to the natural sequential and object-oriented style of programming. Towards this last point, we note that the widely acclaimed Map-Reduce model, owing to the functional nature of its programming framework, does not provide programmers the familiar Abstract Data Type of associative containers.

In this paper, we present PDQCollections (PDQ could stand for "Processes Data Quickly"), a novel set of data structures, coupled with a computation model, for exploiting data parallelism in associative data sets. PDQCollections is a comprehensive library of collection classes that implement the native associative container interfaces: map, multimap, ordered map, ordered multimap, set, sorted set and others. We have also proposed a computation model, which we refer to as the Split-Replicate-Merge (SRM) model, that transparently and efficiently supports and exploits parallelism in multi-core as well as distributed environments, over data scales that range from memory-bound to disk-backed. We have shown an equivalence between the SRM and the map-reduce (MR) models by mutual reducibility; that is, any problem that can be solved by one model can be solved by the other.

The distinguishing characteristic of our programming model is that it encapsulates parallelism within the data structure implementations (the PDQCollections classes) rather than in the program code. Object-oriented programming encourages encapsulating distinguishing characteristics within the implementation of objects, while providing familiar interfaces to the programmer. For example, Berkeley DB's StoredMap abstracts the knowledge that the data is backed on disk, while providing the familiar Map interface; GUI libraries hide platform-specific characteristics behind the implementation, while providing the same widget interface; and Remote Method Invocation hides the fact that objects are remotely located. Traditional approaches to data-parallel programming, however, have instead relied on modifying the computation code, either by extending the grammar of a language (e.g., the parallel for loop) or by enforcing alternate paradigms (e.g., functional programming in Map-Reduce). By encapsulating parallelism within the data structures, our programming model is able to cleanly separate the code for the actual computation from the code for parallelization. Consider the analogy of semaphores vs. monitors: semaphores require introducing the logic of concurrency within the code, whereas with monitors all such logic is captured separately within the monitor object. Similarly, with our programming model, the logic of parallelism is not interspersed within the code; rather, it is expressed separately.
As a concrete example, we shall discuss the Frequent Pattern (FP) growth algorithm [1] in Section 5.2, where the input data is processed to generate an FP tree data structure. Traditional methods of parallelism would require modifying this tree-generation code by embedding parallel constructs within it. With our model, the programmer only needs to separately define how to add two trees. We believe this design choice leads to a cleaner separation of functionality, resulting in a flexible and robust system architecture.

The SRM computation model employs the divide-and-conquer strategy for parallelization. We have proposed a novel shared-nothing strategy for merging associative containers, which obviates the need for any explicit or underlying synchronization, concurrent access or transactional processing. By avoiding locking overheads, PDQ containers significantly outperform locking-based data structures, such as ConcurrentMaps. The same strategy, when applied to disk-bound workloads, yields a cache-free design and ensures that data is always read from and written to disk sequentially. By avoiding cache-miss penalties and random disk seek overheads, PDQ containers significantly outperform traditional disk-backed data structures, such as BerkeleyDB.

The PDQ collection classes are versatile in managing data over a wide range of scale. Initially, the data is stored in-memory, but as the size of the container grows, the data is transparently and automatically spilled over to disk, where it is stored in a format most efficient for further processing. The disk-backed form of the container objects can store data on multiple disks, if present, to improve disk I/O speeds. They have a distributed implementation as well, where the data can be flexibly stored at a common location (e.g., SAN, NFS) or in distributed storage.
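To make the "define how to add" obligation concrete, the user-facing contract can be sketched as a small interface. This is a minimal sketch under our own naming; the names Adder and IntegerAdder are assumptions (the paper's word count example passes an integer adder to its map, but this exact API is not confirmed by the source):

```java
// Sketch of the user's only parallel-specific obligation: a commutative,
// associative "add" operator for the output value type. Names are
// illustrative, not the PDQ library's actual API.
public class Adders {
    interface Adder<V> {
        V add(V a, V b); // must be commutative and associative
    }

    // Adder for integer counts, e.g. for word-count outputs.
    static class IntegerAdder implements Adder<Integer> {
        public Integer add(Integer a, Integer b) { return a + b; }
    }
}
```

An adder for FP trees would, analogously, walk two trees and sum the counts of matching nodes; the framework never needs to know how the trees were built.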
The rest of the paper is organized as follows: in Section 2 we provide a formal description of the programming model and in Section 3 we discuss our implementation. We conduct a comparative design and performance analysis study in Section 4 against concurrent data structures (for memory-bound computation), BerkeleyDB's StoredMap (for disk-backed computation) and Hazelcast (for distributed computation), to highlight the salient characteristics of our programming model that explain how PDQ significantly outperforms these traditional data structures. We illustrate several applications of PDQ in different fields of Computer Science in Section 5 and conclude in Section 6.

2. DESIGN

In this section, we provide a formal description of the programming model, including the data model, which formalizes the problems that can be parallelized with our model; the Split-Replicate-Merge computation model, which describes the strategy for parallelism; and the programming APIs. We also show an equivalence with the Map-Reduce programming model.

2.1 Data Model: Mergeable Maps

This section formalizes the data-centric problems that can be parallelized using our model. We begin by defining an associative container as a mapping function from key space K to value space V:

A : K → V

Different variants of associative containers can be interpreted by adapting the semantics of the mapping function. For example, the container is a multimap if the value space V is a set of collections of values, or a set if V is a Boolean space. Similarly, the container is an array if the key space K is a range of natural numbers, or a matrix if K is a set of tuples. Other complex containers, such as graphs, can similarly be interpreted. Next, we define a computation as a function with input in the domain I (which may be an associative container) that produces as output an associative container A:

C_seq : I → A

We have explicitly formulated the computation to be a sequential function.
For the sake of simplicity, we have also assumed only one input and one output for this computation. The extension to the general case is trivial, and will be discussed later. Our objective is to develop a data-parallel programming paradigm for this computation that simultaneously satisfies the following requirements:

1. the computation can be parallelized and scaled across multiple cores and processors, in shared-memory concurrent as well as distributed processing contexts;

2. the parallel computation can handle input and output data that scales from memory-bound to disk-backed to distributed storage;
3. both forms of scaling (processing and data) can be achieved by using the unmodified sequential form of the computation.

We claim that the above requirements can be satisfied for the class of computation problems with inherent parallelism that can be formulated as follows. If I_1 ⊆ I and I_2 ⊆ I, such that I_1 ∩ I_2 = ∅, and C(I_1) → A_1, C(I_2) → A_2, then the following must be equivalent:

C(I_1 ∪ I_2) → A_1 ⊕_merge A_2

where the merge operator ⊕_merge is defined as:

(A_1 ⊕_merge A_2)(k) =
    A_1(k),           if k ∈ A_1, k ∉ A_2
    A_2(k),           if k ∉ A_1, k ∈ A_2
    A_1(k) ⊕ A_2(k),  otherwise

Here, ⊕ is some user-defined operator. It is required, however, that this add operator is both commutative and associative. Intuitively, the above formulation specifies that the inherent parallelism in the computation is such that if partial results were obtained by processing partial inputs, then the result of the combined inputs can be produced by merging the partial outputs. We shall later prove that this formulation of parallelism is equivalent to the map-reduce programming model; that is, any problem that can be solved by map-reduce can be formulated in this manner.

Let us now review some examples to illustrate this model of parallelism in action.

Word Count: Given a document, find the number of times each word appears in the document. The input is a list of words, and the output is a simple map of strings to integers. This problem can be parallelized since we can partition the input list into sublists, execute the word count program separately on each sublist, and merge the generated output maps by defining ⊕ as the integer addition operator.

Inverted Index: Given a map of keys to lists of words (the input), generate a map of words to the lists of keys to which they belonged (the output).
Although we have not yet discussed how the maps are partitioned, for now imagine that the key space is partitioned and sub-maps are created with the key-value mappings for each partition, and the program to invert indices is executed separately on each sub-map. If we define ⊕ as the list concatenation operator, the merged output will indeed be the inverted index of the original map.

Finding Average: Find the average of a list of numbers. Since the average of averages is not the overall average, we need to define a custom data type with two fields, sum and count. We also define the ⊕ operator for this data type to add the two fields. The output of the computation is a singleton container, which contains only one element. The computation, then, takes as input a sublist of numbers and computes the result into this data type. These partial results are added together to produce the final result.

2.2 Computation Model: Split, Replicate, Merge

In this section we describe our parallel computation model, called the Split-Replicate-Merge model, which exploits parallelism for the class of problems identified in the previous section. As before, for the sake of simplicity we assume a computation with a single input and a single output. The model executes as follows: the input is first split into M partitions. The number of splits is at least equal to the number of processing units available, although it can be greater if it is desired that each partition be of some manageable size (for example, if each partition has to be processed entirely within memory). Next, the output data structure is replicated M times. Each replica is an initially empty data structure of the same type as the output data container. These splits and replicas are paired together, and M instances of the computation are executed (one for each pair) concurrently. Finally, the replicas, which now contain the results of the computation from the respective splits, are merged to produce the final output.
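The merge step that concludes each round of this model can be sketched for plain java.util.Map instances. This is a minimal sketch under our own naming, not the library's merge code; the Adder interface stands in for the user-defined ⊕ operator:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the merge operator over plain maps: keys present in
// only one map keep their value; keys present in both are combined with
// the user-defined add operator. Illustrative, not PDQ library code.
public class MergeDemo {
    interface Adder<V> { V add(V a, V b); }

    static <K, V> Map<K, V> merge(Map<K, V> a1, Map<K, V> a2, Adder<V> op) {
        Map<K, V> out = new HashMap<K, V>(a1);
        for (Map.Entry<K, V> e : a2.entrySet()) {
            V prev = out.get(e.getKey());
            out.put(e.getKey(),
                    prev == null ? e.getValue() : op.add(prev, e.getValue()));
        }
        return out;
    }
}
```

For the word count example, op is integer addition; for the inverted index, list concatenation; for the average, pairwise addition of the (sum, count) fields.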
The following pseudocode outlines this process:

PDQRun(C_seq, input, output) {
    SPLIT(input) → {sp_i, 1 ≤ i ≤ M}
    REPLICATE(output) → {rp_i, 1 ≤ i ≤ M}
    parallel for (i : 1 to M) do
        C_seq(sp_i, rp_i)
    end
    output ← ⊕(i = 1..M) rp_i    // Merge
}

Listing 1: The Split-Replicate-Merge Computation Model

Note that there is a single synchronization point in the entire model, which is the implied barrier at the end of the parallel for loop. This model is easily extended to support multiple input and output parameters, as well as shared parameters. With multiple inputs, the implementation must ensure that each input generates an equal number of splits, and that the respective splits from each input share the same keyspace. Multiple outputs are replicated, and later merged, independently of each other. Shared arguments are neither split nor replicated; the same copy of the object is provided to all the computation instances. In Section 3, we shall discuss efficient implementations for splitting and merging. We show that associative containers can be split in O(1) time, and that the replicas can be merged in parallel as well. We shall also see how this computation model can be easily adapted to handle large volumes of data and to operate in distributed execution mode.

2.3 Programming Model

Whereas the computation model described previously is meant to be transparent to programmers, it is the programming model, described herein, that is visible to them. This model describes how the computation must be coded for parallelization. We describe the programming model by means of the Word Count program, outlined in Listing 2. Although the example shown is coded in Java SE 6, the model can be applied to any programming language that supports object polymorphism and function references (either as closures, function pointers or functors) and, of course, a multithreaded environment.

 1  @Parallel(name = "wordcount")
 2  public void wordcount(java.io.FileReader input,
 3                        java.util.Map<String, Integer> output) {
 4
 5      BufferedReader reader = new BufferedReader(input);
 6      String line;
 7      while ((line = reader.readLine()) != null) {
 8          for (String word : line.split(" "))
 9              if (!output.containsKey(word))
10                  output.put(word, 1);
11              else
12                  output.put(word, output.get(word) + 1);
13      }
14  }
15
16  public static void main(String[] args) {
17      PDQ.setConfig(args);
18      FileReader input = new PDQFileReader(filename);
19      Map<String, Integer> output =
20          new PDQMap<String, Integer>(IntegerAdder);
21
22      PDQ.run(this, "wordcount", input, output);
23  }

Listing 2: Word count using PDQCollections

The first segment of the code (lines 1 through 14) is the actual computation. Notice that the method arguments are the native data types, and the method body is entirely sequential. Java annotations indicate that the arguments are the split of the input and the replica of the output, respectively. The @Parallel annotation merely serves as a naming marker for the method, which we utilize to extract the method reference (java.lang.reflect.Method) using the Java reflection APIs. This is only a user-friendly way to name a method; other techniques, such as function objects, would have sufficed as well. Next, we instantiate the objects for the computation: a FileReader as input, and a Map of String to Integer as output. Here, instead of using the Java native classes, we create the input/output objects from the counterpart classes provided by the PDQCollections library. The library includes the typical container data types (PDQMap, PDQMultimap, PDQSortedMap, PDQSortedMultimap, PDQSet, PDQSortedSet, PDQList, etc.) that implement the respective Java interfaces, as well as file I/O classes, including PDQFileReader and PDQFileInputStream, that subclass the respective Java classes. The input object is instantiated at line 18, and the output object in lines 19-20. The PDQMap is also provided with the ⊕ operator, in this case an integer adder.
Observe this salient feature of our programming model: to parallelize a sequential program, the programmer only needs to replace the programming language's native input/output container, or file I/O, classes with the counterpart classes from the PDQ library, while the actual computation code itself remains unmodified. The parallel execution is invoked in line 22, where the reference to the computation method and the input/output objects are passed to the PDQ library. This library call executes the Split-Replicate-Merge computation on the provided arguments. The behavior of the computation model's execution can be controlled by the configuration parameters passed to PDQ.setConfig (line 17).

2.4 Equivalence with the Map-Reduce Model

Map-Reduce is a widely acclaimed programming model for processing large data sets. It is a functional-style model that processes data in two stages, the map and reduce phases. We show that this model is equivalent to the SRM model by proving that the two models are reducible to each other. Consider an arbitrary problem that can be parallelized with the map-reduce model. There exists, then, a map function that transforms the key-value mappings of the input into an intermediate list of mappings in a different domain:

MAP : (k1, v1) → list(k2, v2)

This intermediate output is processed to collect all values for a given key together, and the key space is sorted. This serves as the input to the reduce function, which generates the final result:

REDUCE : (k2, list(v2)) → list(k3, v3)

This problem can be formulated in the Split-Replicate-Merge model in two execution steps, as shown in Listing 3.
In the first step, the input map is processed to produce a SortedMultimap of intermediate values, which is further processed in the second step to generate the output:

@Parallel(name = "step1")
void step1(Map<K1, V1> input,
           SortedMultimap<K2, V2> intermediate) {
    for ((k1, v1) in input) {
        List<K2, V2> list = MAP(k1, v1);
        for ((k2, v2) in list)
            intermediate.add(k2, v2);
    }
}

@Parallel(name = "step2")
void step2(SortedMultimap<K2, V2> intermediate,
           SortedMultimap<K3, V3> output) {
    for ((k2, list_v2) in intermediate) {
        List<K3, V3> list = REDUCE(k2, list_v2);
        for ((k3, v3) in list)
            output.add(k3, v3);
    }
}

void main(Map<K1, V1> originalInput) {
    Map<K1, V1> input = new PDQMap<K1, V1>(originalInput);
    SortedMultimap<K2, V2> intermediate =
        new PDQSortedMultimap<K2, V2>();
    SortedMultimap<K3, V3> output =
        new PDQSortedMultimap<K3, V3>();
    PDQ.run(this, "step1", input, intermediate);
    PDQ.run(this, "step2", intermediate, output);
}

Listing 3: Reducing the Map-Reduce model to SRM

Conversely, consider an arbitrary problem that can be formulated with our data and computation model; that is, the following method is defined:

SRM(C_seq fn, Map<K1, V1> input, Map<K2, V2> output)

This problem can be reduced to the map-reduce model as follows:

class Mapper {
    Map<K1, V1> input;

    void map(K1 k, V1 v) {
        input.put(k, v);
    }

    void finalize() {
        Map<K2, V2> partialOutput = REPLICATE(output);
        fn(input, partialOutput);
        for ((k2, v2) in partialOutput)
            EMIT(k2, v2);
    }
}

class Reducer {
    void reduce(K2 k2, List<V2> list_v2) {
        V2 sum = fold(⊕, list_v2);
        EMIT(k2, sum);
    }
}

Listing 4: Reducing the SRM model to Map-Reduce
In the Mapper class, the key-value mappings are first collected into a Map. The output object is replicated, and the computation is invoked on these two objects. The mappings from the replica (which served as the output container for the chunk of input provided to this Mapper) are emitted to the reducer. At the reducer, the list of values for a key is aggregated using the ⊕ operator, and the result is emitted for the given key. We have thus shown that the two programming models can emulate each other and, consequently, are equivalent.

2.5 Related Work

Piccolo [2] is a data-centric programming model with a similar objective of parallel and distributed computing using associative containers. The system ensures execution locality by partitioning the input across machines with a user-specified partitioning function. While each process processes its partition locally to produce the corresponding output, it also has access to global state. In other words, it can attain a globally consistent view of the value corresponding to an output key. Piccolo achieves this by introducing a user-specified accumulation function akin to our add operator. When called by the user, this operator can accumulate the values for a specific key across all processes. This differentiates the PDQ programming model from Piccolo's. While Piccolo offers fine-grained access to global state, it requires user control over the merge procedure. On the other hand, PDQ executes the merge operation transparently to the user. PDQ also reduces the merge overhead by partitioning the output across a SuperMap. Finally, in contrast to Piccolo, which only supports memory-constrained outputs, PDQ transparently allows in-memory, out-of-core and distributed processing and output data structures.

3. IMPLEMENTATION

Our primary goal for the implementation is efficient processing of data by maximizing CPU utilization as well as disk I/O.
Our implementation targets data sizes up to a few terabytes, and is optimized for multi-core systems and small to medium sized clusters. Concerns such as large-scale cluster management and fault tolerance are not addressed. Our secondary goal is transparent parallelism; that is, the computation can be coded using the programming language's native data types and a sequential style of programming, and does not require any modifications to the language grammar. In the following, we describe our data structures, which can transparently switch between memory-bound, disk-backed and distributed operational behavior. We have implemented our programming model as a library in Java SE 6.

3.1 Data Structures

The PDQCollections library provides a suite of data structure implementations that are meant to be drop-in replacements for their counterpart native implementations. These include the typical container data types that implement the Map, Multimap, SortedMap, SortedMultimap, Set, SortedSet, List and other interfaces, as well as file I/O classes that inherit from the FileReader and FileInputStream classes. Recall the computation model outlined in Listing 1, where we discussed that the input containers are partitioned into splits, and that the output containers are first replicated and the replicas later merged back. This behavior is enforced by requiring the input and output containers to implement the Splittable and Mergeable interfaces, respectively. The signatures of these two interfaces are shown below:

interface Splittable {
    Iterator<Object> split();
}

interface Mergeable {
    Object replicate();
    void merge(Object[] partials);
}

The splits of an input container and the replicas of an output container are objects of the same abstract data type as the respective input and output containers. In addition, the replicated objects are in-memory data structures.
The collections classes support both interfaces, while the file reader classes only implement the Splittable interface.

3.1.1 PDQ Collections Classes

Recall that an associative container can be viewed as a mapping from key space to value space: A : K → V. If the execution environment has N threads, then the key space is exhaustively divided into N non-overlapping subspaces and a delegate associative container is created for each of these key subspaces. In other words, the PDQ container is a wrapper around N associative containers, each managing a portion of the keyspace, as defined below:

f_PDQ(k) =
    f_assoc^1(k)  if k ∈ K_1
    f_assoc^2(k)  if k ∈ K_2
    ...
    f_assoc^N(k)  if k ∈ K_N

where K_1, K_2, ..., K_N are the N divisions of the key space, and f_assoc^1, f_assoc^2, ..., f_assoc^N are the N delegates. As an example, in an execution environment with four threads, a PDQMap wraps four java.util.HashMap delegates. For an operation on a key k (for example, get(), put(), contains(), etc.), the key subspace K_i that contains this key is identified, and the operation is applied to the associated delegate f_assoc^i. One simple strategy for partitioning the key space is to partition the hash code range of the keys; that is, a key k will belong to subspace hash(k) modulo N. It must be ensured, though, that this hashing function is different from any function that may be used internally by the associative container (for example, HashMap).

There are several benefits to this wrapped-delegates design. First, the Splittable interface is trivially implemented: each delegate is a split! Notice that our approach is not to split the container after it is populated, but rather to create the splits beforehand and populate them incrementally. This is an easier as well as an essentially zero-cost operation. Secondly, this design allows for merging of the partial results in parallel. Each PDQ collection implements the Mergeable interface, where the replicas themselves are wrapped collections of delegates.
That is, for an execution environment with N threads, the PDQ collection wraps N delegates, as shown below: A P DQ := {D 1, D 2,..., D N where D i is the i th delegate of the PDQ collection. This
6 collection object will generate N replications, which themselves wraps N containers, as shown below: R 1 := {D R 1 1, D R 1 2,... D R 1 R 2 := {D R 2 1, D R 2 2,... D R 2 N... R N := {D R N 1, D R N 2,... D R N N N where R i is the i th replication, and D R i j is its j th delegate. Now the merge operation can be applied in parallel by shuffling the replication delegates and grouping them according to the key subspace they manage. That is: MERGE (A P DQ, [ R 1, R 2... R N ] ) : p a r a l l e l f o r ( i : 1 to N) : D i D R 1 i D R 2 i... D R N i Similar parallelization benefits are achieved when the container data is stored on disk, as we shall see in Section 3.3. In this case, the various activities such as sorting, serialization and writing to disk can be applied in parallel PDQ File I/O Classes The PDQ library provides replacements for the java.io file I/O classes. In particular, the PDQFileReader and PDQ- FileInputStream classes support reading characters and bytes, respectively, from files. Both implement the Splittable interface, and, therefore, can act as inputs to the computation. Similarly, the PDQFileWriter and PDQFileOutputStream classes support writing characters and bytes, respectively, and implement the Mergeable interface. The primary consideration in reading files is blocksize the size of the chunk of input that is processed by a single thread. The input file is partitioned into chunks of size blocksize each, and these chunks serve as input splits to the computation. If the file contains variable length records, the splits coincide with record boundaries. Recall our implementation goal which states that the computation for each input split must happen within memory. It is, therefore, necessary to ensure that the blocksize must not exceed the limit where the output generated (for that split) cannot be entirely contained within memory. Since the exact limit is dependent on the processing environment, this parameter is user-defined. 
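A blocksize-driven split for newline-delimited records can be sketched as follows. This is our own simplified sketch, not the library's splitter: it computes byte ranges only, advancing each boundary to the next record boundary so that no record straddles two splits:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Sketch: partition a file of newline-delimited records into chunks of
// roughly blockSize bytes. Each chunk boundary is advanced to the next
// newline so splits coincide with record boundaries. Illustrative only.
public class FileSplitter {
    static List<long[]> split(String path, long blockSize) throws IOException {
        List<long[]> splits = new ArrayList<long[]>();
        RandomAccessFile f = new RandomAccessFile(path, "r");
        try {
            long len = f.length(), start = 0;
            while (start < len) {
                long end = Math.min(start + blockSize, len);
                if (end < len) {                        // align to record boundary
                    f.seek(end);
                    while (end < len && f.read() != '\n') end++;
                    end++;                              // step past the newline
                }
                splits.add(new long[] { start, Math.min(end, len) });
                start = end;
            }
        } finally {
            f.close();
        }
        return splits;
    }
}
```

Each [start, end) pair would then be handed to one computation instance, which reads its range sequentially.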
The PDQ file output classes implement the Mergeable interface by generating replicas that are standard java.io output classes for files whose filename is appended with a part identifier (e.g., filename.part0000).

3.2 Thread Pool Execution Mode

This section describes the PDQ library's shared-memory parallel execution mode of the computation. As the name implies, concurrency is achieved by thread pools, and we make use of the java.util.concurrent.ThreadPoolExecutor service for thread pool creation and management. This execution mode is applicable when the output of the computation can be contained within memory. For a pool of N threads, the input is first split into N partitions. If the input is a file, then the blocksize is automatically computed to create N chunks. The output containers are replicated N times as well, and the N instances of the computation are assigned to the thread pool. At this point, the main thread invokes the thread pool and waits until all threads within the pool complete, which can be viewed as an implied barrier for the main thread. Subsequently, the main thread shuffles the delegates of the replicas for merging, in accordance with the process described previously, and assigns these merge tasks to the thread pool. There is a second implied barrier, at the end of which the partial results are merged back into the output containers. As the cost of splitting, replicating and shuffling/distributing delegates is essentially negligible, this execution mode effectively utilizes all the processing cores throughout the program execution.

[Figure 1: Sequence diagram for the disk-backed execution model for a pool of 2 threads]
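The split-run-barrier-merge sequence of this execution mode can be sketched with a plain ExecutorService. The interfaces below are our assumptions, not the PDQ API, and for brevity the merge is shown as one caller-supplied step rather than the library's parallel delegate shuffle:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the thread-pool execution mode: run the unmodified sequential
// computation on each (split, replica) pair, wait on the implied barrier,
// then merge the partial results. Illustrative, not the PDQ library's API.
public class SrmRunner {
    interface Computation<I, O> { void run(I split, O replica); }
    interface Merger<O>        { void mergeInto(O output, List<O> replicas); }

    static <I, O> void run(final Computation<I, O> c, List<I> splits,
                           List<O> replicas, O output, Merger<O> merger)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(splits.size());
        List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
        for (int i = 0; i < splits.size(); i++) {
            final I in = splits.get(i);
            final O out = replicas.get(i);
            tasks.add(new Callable<Void>() {
                public Void call() { c.run(in, out); return null; }
            });
        }
        pool.invokeAll(tasks);               // implied barrier
        pool.shutdown();
        merger.mergeInto(output, replicas);  // merge partial results
    }
}
```

Because each computation instance writes only to its own replica, no locking is needed during the parallel phase; all combining happens after the barrier.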
3.3 Disk-backed Computation

The previous execution mode is extended to handle cases where the output of the computation is too large to fit in memory, thus necessitating storing the output data on disk. In the following description, we refer to the process of storing the output on disk as externalization. The first difference from the memory-bound execution mode is that the number of input splits, M, can be greater than the number of processing cores, N. In this case the computation unfolds in multiple iterations. In the first iteration, the first N splits are processed, in accordance with the thread-pool execution mode described previously; in the second iteration, the next N splits are processed similarly and the results from this iteration are merged back into the previously accumulated results. This process continues until all M splits are processed, at the end of which the accumulated result (obtained by merging partial results from each iteration) is the final output of the computation. At some point, the accumulated results will grow large enough to overflow the memory. Before reaching this state, we enforce externalization of the memory contents to disk, thus freeing up memory for further processing. At the end of each iteration, the library determines whether the memory contents must be externalized to disk. In our implementation, we check the currently available memory in the JVM (via the freeMemory() method of the java.lang.Runtime class) and decide to externalize if it falls below a threshold. When externalizing the output container, each of its delegates is externalized in parallel.
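The externalization decision can be sketched as below. The 25% threshold is an illustrative choice, not the library's actual value.

```java
public class ExternalizeCheckSketch {
    // Externalize when the fraction of free heap falls below the threshold.
    static boolean shouldExternalize(long freeBytes, long totalBytes, double threshold) {
        return (double) freeBytes / totalBytes < threshold;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // e.g. spill the delegates to disk when less than 25% of the heap is free
        if (shouldExternalize(rt.freeMemory(), rt.totalMemory(), 0.25)) {
            System.out.println("externalize delegates to disk");
        }
    }
}
```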
The process involves sorting the key space of the delegate container, if it is not already sorted, and serializing the key-value pairs to a flat file. We use Java's native serialization framework for externalizing data. At the end of this externalization step there will be N files on disk, containing the memory contents of the N delegates respectively, with each file written in parallel. The next externalization step will write data into a new set of N files. When all M splits have been processed, these files are merged into a mapfile. A mapfile is a file that stores a sorted list of key-value mappings along with an index (typically kept in memory) that maps a subset of the keys to the offsets within the file where they are located. To retrieve the value for a key, the index is consulted to locate the offset of the closest key, and then the file is read sequentially from that offset until the said key is located. Since the externalized files are already sorted on disk, the O(N) disk merge algorithm is used to merge them into the mapfiles. These mapfiles subsequently serve as the delegates for the output container. Figure 1 illustrates this execution mode as a sequence diagram. The outer loop in the diagram represents the iterations, and the opt block represents the optional step of externalizing the delegates to disk. Observe that the actual computation, at each iteration, is entirely memory-bound. While the merged results may be externalized to disk, the output data is always within memory while the computation is being executed. This eliminates any need for disk-based data structures, which are substantially more costly than in-memory structures. This design also does not employ any form of caching of disk data within memory, and thus avoids caching-related scalability overheads. Finally, data is always read and written sequentially from the files, which also eliminates random disk seek overheads.
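The mapfile lookup described above can be sketched in memory as follows; this is a hypothetical miniature, with a list standing in for the sorted on-disk run and a list index standing in for the byte offset.

```java
import java.util.*;

public class MapfileSketch {
    final List<Map.Entry<String, String>> sorted;           // sorted key-value run
    final TreeMap<String, Integer> index = new TreeMap<>(); // sparse in-memory index

    // Index every k-th key of the sorted run.
    MapfileSketch(List<Map.Entry<String, String>> sortedRun, int every) {
        this.sorted = sortedRun;
        for (int i = 0; i < sortedRun.size(); i += every)
            index.put(sortedRun.get(i).getKey(), i);
    }

    // Consult the index for the closest preceding key, then read sequentially.
    String get(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) return null;
        for (int i = start.getValue(); i < sorted.size(); i++) {
            int cmp = sorted.get(i).getKey().compareTo(key);
            if (cmp == 0) return sorted.get(i).getValue();
            if (cmp > 0) return null; // passed the key: not present
        }
        return null;
    }
}
```

Only the sparse index is held in memory; every access to the run itself is a sequential scan from the indexed position.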
If multiple hard drives are available, disk I/O can be further improved by spreading the externalized files across these disks. In the thread pool environment with N threads, each externalization step generates N files, and at the end of the computation, there are N delegate mapfiles for the output container. These files can be spread across the hard drives, and given that the file I/O operations are parallelized in our implementation, it follows that data can be read or written simultaneously from all the hard drives.

3.4 Distributed Mode

In this final section, we describe our implementation strategy for distributed execution. As noted previously, our design goal caters to small to medium sized clusters. The implementation follows a master-slave architecture, where the master executes the main application and requests the workers to participate in the computation. The main application jar file is added to the classpath of the workers to ensure the visibility of the classes that define the computation. When the main program, executing at the master node, invokes the computation via the PDQ.run(...) method (see line 22 in Listing 2), the PDQ library contacts the worker daemon processes and prepares for the distributed execution. The master first notifies each worker of its unique rank, which is a number between 1 and the total number of workers. The workers use the rank to determine the portions of the input that they are responsible for processing. Subsequently, the master serializes and transmits over the wire the computation method and its input/output arguments. At this point, the workers have sufficient information to begin processing their share of the input. Each worker executes independently of the others, and after processing its input, serializes and transmits the results back to the master.
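One simple way a worker can derive its share of the input from its rank is a round-robin assignment over the splits. This is a hypothetical sketch of the idea, not the library's actual partitioning scheme.

```java
import java.util.*;

public class RankPartitionSketch {
    // Worker with the given rank (1..workers) processes splits
    // rank-1, rank-1+workers, rank-1+2*workers, ... so that all splits
    // are covered exactly once with no coordination between workers.
    static List<Integer> splitsForRank(int rank, int workers, int totalSplits) {
        List<Integer> mine = new ArrayList<>();
        for (int s = rank - 1; s < totalSplits; s += workers) mine.add(s);
        return mine;
    }
}
```

Because the assignment is a pure function of (rank, workers, totalSplits), the master only needs to transmit the rank; no per-split bookkeeping crosses the wire.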
The master may choose to merge the results gathered from the workers either locally (if the data is small enough), or employ the workers again for a distributed merge. A brief note on sharing the input and output data: in simple cases, the master and workers may exchange the entire data set over the wire. This simple approach does not require any cluster configuration; however, it is not the most efficient method of sharing data, as, for example, it adds serialization and deserialization overhead. Alternatively, the data may be stored on a shared file system, for example, the Network File System (NFS), or a distributed file system, for example, the Hadoop Distributed File System (HDFS). In this case, only a reference to the file needs to be shared.

4. DESIGN ANALYSIS

In this section, we compare the PDQ programming model against representative data structures for memory-bound, disk-bound and distributed computations. Our objective is to gain insight into the fundamental nature of the PDQ computation model and illustrate the impact of our design choices on the overall performance of data-centric computations. We are not interested in raw performance comparisons between two systems, as those statistics vary widely based on implementation parameters, programming style, and the efficiency of the libraries used (e.g., threading, serialization, networking). Instead, we wish to expose those salient characteristics of our model that make the PDQ library consistently outperform traditional designs, irrespective of the actual implementations under consideration. The summary of our findings is as follows: with memory-bound workloads, the PDQ library avoids locks and any other form of synchronization, and thus scales significantly better with the number of processing cores than lock-based data structures, such as ConcurrentMap. For large disk-bound workloads, the PDQ library does not employ any form of caching and always accesses the disk with sequential read and write requests.
In contrast, disk-backed data structures, such as BerkeleyDB and KyotoCabinet, are two-tiered, with an in-memory cache for the on-disk B-tree, where the efficacy of the cache diminishes with smaller cache ratios. By avoiding cache miss penalties and random disk seek overheads, the PDQ library scales significantly better with data size. Finally, with distributed workloads, the PDQ library accesses remote data resources sequentially and only once, and is thus able to significantly outperform distributed data structures. Single-host experiments were conducted on a machine hosting a dual hyper-threaded 6-core Intel Xeon 2.30GHz processor with 32GB of memory and a 500GB disk, running CentOS 6.3. For distributed experiments, we used a cluster of eight machines, each comprising two quad-core Intel Xeon 2.33GHz processors with no hyper-threading, 8GB of memory and a 200GB hard drive, running CentOS release 5.8. The cluster nodes are connected with InfiniBand switching as well as 100Mbps Ethernet. All machines have the Java HotSpot 1.6 runtime available.
Figure 2: Scalability Performance with respect to sequential processing

4.1 Multi-Threaded Data Structures

We begin our analysis with the thread pool execution mode (Section 3.2). Term frequency is one of the common applications of maps: it counts the number of occurrences of each unique term in a document. An implementation using the PDQ library was discussed earlier in Listing 2, which we compare against the ConcurrentHashMap class from the java.util.concurrent package, as well as a critical section (using Java's synchronized keyword) around a regular java.util.HashMap. Our dataset comprised a collection of web documents containing a total of 300 million words, from a set of 5.2 million unique words [3]. We investigate how efficiently these three implementations exploit varying levels of concurrency. We measure performance as the ratio of parallel to sequential execution time (the speedup factor), which is shown in Figure 2 for the three implementations, for the number of threads ranging from 1 through 12. Notice that, regardless of the type of data structure selected, each of the k threads processed only 1/k of the input data. This implies that the difference between the theoretical speedup factor of k and the observed speedup can be attributed to the overhead of maintaining consistency of the output data (although there are other factors in play as well, such as garbage collection, memory allocation and context switches, but we believe their contribution is relatively small). From the figure, we observe that the concurrent and synchronized maps are constrained by their lock-based approach to data consistency. This locking overhead is significant and in fact led to a runtime degradation in comparison to sequential data processing for this benchmark problem.
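For concreteness, the lock-based baseline in this comparison looks roughly like the following sketch (illustrative, not the exact benchmark code): all threads update one shared ConcurrentHashMap, so consistency is paid for on every insertion rather than once at merge time.

```java
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentCountSketch {
    static Map<String, Long> count(List<String> words, int threads) {
        ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (words.size() + threads - 1) / threads;
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            List<String> part = words.subList(Math.min(t * chunk, words.size()),
                                              Math.min((t + 1) * chunk, words.size()));
            tasks.add(() -> {
                // merge() is atomic per key on ConcurrentHashMap: every
                // insertion contends on the shared structure.
                for (String w : part) counts.merge(w, 1L, Long::sum);
                return null;
            });
        }
        try {
            pool.invokeAll(tasks); // wait for all updates to the shared map
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return counts;
    }
}
```

Contrast this with the Split-Replicate-Merge pattern, where each thread writes to a private replica and contention is eliminated by construction.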
In contrast, the divide-and-conquer approach of the PDQ library offers an alternative mechanism to achieve data consistency, which we have shown to outperform locking-based consistency for data-centric computations. Furthermore, PDQMaps do not exhibit the diminishing speedup with increasing number of threads that was observed with the concurrent and synchronized maps.

4.2 Disk-backed Data Structures

To analyze the merits of the PDQ library in processing large datasets (Section 3.3), we compared it with embedded disk-based database implementations from Berkeley DB Java Edition (BDB JE) and Kyoto Cabinet (KC). To focus exclusively on disk I/O performance, we limited the benchmarks to single-threaded execution. In any case, we have shown previously that locking-based data consistency (which is employed by BDB and KC) does not scale for data-centric computations. Reverse index is another common application of multimaps: it creates a mapping from values to their keys in the original index. We used the dataset from DBpedia [4], which lists the URLs of English Wikipedia pages and the URLs of other Wikipedia pages they point to. The reverse index is a mapping from a page URL to the URLs that link to it. The input is a 21GB text file, where each line represents one outgoing link and has the format "url of source page ... url of linked page". The input was read using the PDQFileReader class, which automatically partitions the input file, and the output was stored in a PDQMultimap object in the case of the PDQ library, a StoredMap object in the case of BDB, and the DB object for KC. Figure 3 (next page) is a time series of the disk read and write activities (top plot), CPU utilization (middle plot) and the rate at which the input data is processed (bottom plot), for both PDQMultimap (the left plots) and the BDB StoredMap.
In the case of PDQMultimap, we can clearly observe the cycles of processing input (when the disk read rate and data processing rates peak) and externalizing the intermediate results to disk (when the disk write rate peaks while the CPU waits, as no data processing occurs). Around the 300s mark, we observe this cycle end and the external merge begin. During the external merge, data is read and written in rapid succession while CPU utilization observes a marked drop. In contrast, the BDB StoredMap observes good performance only when the database is entirely contained in the memory cache. During the first 200 seconds, StoredMap utilizes the CPU well, as data is read from the disk and processed at a high rate. Once the database is too large to fit in memory, we observe a spike in the disk write rate as the contents of the cache are flushed to disk. Subsequently, the disk read and data processing rates drop significantly, and none of these performance markers improve over time. Once the database is too large to fit in memory, only portions of it can be stored in the cache at a time, and cache page swaps contribute substantially to disk I/O. This is verified by the high percentage of time that the CPU spends in the waiting state. To summarize, once the size of the output data exceeds the threshold of RAM, the StoredMap spends a lot of time bringing relevant data items from disk to cache and dumping stale cache entries to disk. There is a second fundamental limitation of StoredMap which is not apparent from the previous plots. As the size of data on disk increases, the effectiveness of the cache decreases (since a random entry is less likely to be found in cache) and, consequently, the performance further degrades. This is verified in Figure 4, which shows that the execution time increases exponentially with the input data size. PDQMultimap, in contrast, offered linear scalability.
4.3 Distributed Data Structures

We compared the distributed execution mode (Section 3.4) of the PDQ library against a distributed implementation of the Map container type from Hazelcast [5]. The Hazelcast map partitions its contents across multiple JVMs based on the hashcodes of the keys. The likelihood that a given key is present locally decreases in inverse proportion to the size of the cluster. Since the version of Hazelcast we used only supported in-memory data, we used the same term frequency benchmark problem that we considered for the thread pool execution mode.

Figure 3: Performance Comparison of PDQ and Berkeley DB (BDB) for Disk I/O Rate (top), CPU Utilization (middle) and Processing Rate (bottom)

Figure 4: Scalability Performance of PDQMultimap vs. Berkeley DB Map

Figure 5 shows the ratio of the execution time with Hazelcast to that with PDQMap for varying cluster sizes. In each case, the JVM at each node executed the program on a single thread. The PDQMap outperforms the Hazelcast map by a factor of 90, and the reason for this vast difference in performance turns out to be that Hazelcast is in fact limited by the effects of concurrent access. For a randomly distributed input, (n-1)/n of the accesses to a map entry (where n is the size of the cluster) will travel over the network, and the proportion of such accesses increases as more nodes participate in the computation.

Figure 5: Performance Comparison of PDQMap vs. Hazelcast

Notice that it is not just the amount of data transferred over the network that influences the performance; it is also the network access pattern. In the case of the Hazelcast map, remote data is accessed while the computation is being executed, which implies that the processor remains idle while waiting for the remote data to be fetched. In contrast, the PDQMap transfers data only twice: first, when the initial input workload is distributed by the master to the slaves, and second, when the slaves transmit their computed results back to the master. In between these two network accesses, the processors are fully occupied with computation and never wait for network activity. Figure 6 illustrates the breakdown of the execution times with PDQMap into four phases. As a summary, whereas distributed maps like
Hazelcast are latency-bound, the PDQMaps are throughput-bound. What makes this observation pressing is that it is easier to cater to network-throughput-bound operations: for example, the data can be compressed, faster network interconnects may be used in the cluster, network parameters may be tuned, and so on.

Figure 6: Breakdown of execution times for PDQMap by operation

5. APPLICATIONS

In this section, we illustrate several applications of our PDQ library across different fields of Computer Science. The k-Nearest Neighbor classification in machine learning and the FP tree growth algorithm in data mining illustrate how the PDQ programming model transforms the problem of parallelization into that of adding data structures. The PageRank algorithm for graph processing illustrates the efficient operation of PDQ data structures for large datasets in distributed environments. The PDQ programming model has a broad scope of applicability, which we illustrate with examples in the relational processing of structured data and the incremental processing of log data.

5.1 Machine Learning: k-Nearest Neighbor based Classification

We consider the problem of handwriting recognition, and specifically, recognizing handwritten numerical digits. Our training dataset consists of 42,000 images, where each image is 28x28 pixels and represents one handwritten digit [6]. Each of these images is correctly labeled with the represented digit. The machine learning objective is stated as: given a new image, ascertain the digit it represents.
We employ the k-nearest Neighbors based Classification approach, which employs the following steps: (a) represent each sample item as a vector, (b) define distance between two vectors (we used the Euclidean distance), (b) determine the k vectors from the training samples that are closest to test vector, and (d) take the majority vote of labels of these k selected samples. We defined a LimitedPriorityQueue that contains no more than k entries, that is, if a new entry is less than the great- Figure 7: Parallel Speedup of k-nearest Neighbor Computation est entry in queue, the former is inserted and the latter is removed. To execute the computation in parallel, we simply need to define the âăijaddâăi operator for the LimitedPriorityQueue objects, as shown below: def add ( f i r s t, second ) : r e s u l t = f i r s t. copy ( ) for d, l a b e l in second : r e s u l t. i n s e r t (d, l a b e l ) return r e s u l t We can now split the input training data, and apply the k-nearest computation for each split. Each computation would produce a LimitedPriorityQueue object, which we merge as shown above, which is the final result. The parallel speedup on a 12-core machine, for varying number of processing threads, is shown in the Figure Data Mining: Frequent Itemset Discovery Consider a retail store, where a customer purchases a set of items (a transaction). There are many such transactions, and we would like to determine which items are frequently bought together. The Frequent Pattern (FP) Tree growth algorithm is an efficient technique to discover the frequent itemsets. In the first phase of the algorithm, the input data is scanned and the FP tree is constructed, in the second phase the tree is mined for frequent item sets. We will focus on parallelization of first phase only. Our strategy for parallelizing FP tree construction would then be to figure out how to add two trees. We omit the details, and only give a brief sketch on a recursive strategy for addition. 
The recursive function takes as input two tree nodes, one from each tree, and is initially invoked with the root nodes of the two trees. For each child of the node from the first tree, we check whether it is present in the second tree. If present, we fuse these two nodes and recursively apply the function to them; if absent, we simply copy the subtree rooted at the child node. An example is shown in Figure 8 (the header table and node links are omitted for clarity). The left tree was constructed for the transactions {A, B, E}, {A, C, D} and {C, E}, while the transactions for the right tree were {A, D, E}, {E} and {C, E}.

Figure 8: Adding two FP Trees

We applied our implementation to a 1.4GB dataset of 1.7 million transactions with 5.2 million unique items, and an average of 177 items per transaction [3]. With this parallelization strategy we were able to achieve a speedup of 5X on an 8-core machine.

5.3 Graph Processing: PageRank Algorithm

If we represent the web as a directed graph, with webpages represented as nodes and links as edges, then the PageRank algorithm [7] can be used to generate a weight for each node, which measures the relative importance of that web page. We applied the algorithm to the Wikilinks dataset [4], which is a graph of all English Wikipedia pages along with internal page links (a 21GB text file). We processed this raw data and created a Multimap, which consumed about 8GB when serialized on disk. This dataset is too large to be completely processed in memory. With this example, we highlight the capability of our associative container implementations to automatically and transparently transition their behavior from in-memory to disk-backed as well as distributed execution without requiring any code changes to the program. We executed the algorithm on a distributed cluster of machines, the size of which ranged from 1 through 8. Since the computation is mostly I/O-bound, we spawned only one thread per machine. Table 1 shows the speedup with respect to the cluster size.

    Cluster Size    Time per Iteration (min)    Speedup
    1               20:--                       --
    4               5:--                        --X
    8               2:--                        --X

Table 1: Speedup of the PageRank algorithm on a cluster

5.4 Relational Processing of Structured Data

Let us consider a motivating example from the DBLP repository [8] of academic publications in Computer Science. We extracted two database tables: paper(paperId, authorId), which lists all published papers and their authors, and citations(paperId, citedPaperId), which lists the citations for each paper. Equivalently, we can view them as multimaps of paperId to authorId, and paperId to citedPaperId, respectively.
We chose multimaps since each paper may have multiple authors and may cite multiple papers. Sometimes authors cite their own papers, or those of their colleagues. Colleagues are the researchers that an author has co-authored some paper with, or equivalently:

    A is-colleague-of B  <=>  ∃p : (p, A) ∈ paper ∧ (p, B) ∈ paper

We would like to find out, for each author, the percentage of papers they cited that were their own or their colleagues' (which should give us some insight into how clustered the research community is!). The following is an intuitive (but unoptimized) SQL statement to count the number of citations, for each author, that were their own or from colleagues:

    SELECT author, count(paper)
    FROM (SELECT author, paper, group_concat(coauthors) AS C
          FROM
            -- for each author, the papers they cite
            (SELECT paper.uid AS author, citation.citedPaperId AS citedpaper
             FROM paper INNER JOIN cites ON author = citedpaper) AS Y1
          INNER JOIN
            -- for each paper, its coauthor set
            (SELECT P1.pid AS paper, X1.coauthor AS coauthors
             FROM paper AS P1 INNER JOIN
               -- for each author, the list of coauthors
               (SELECT P1.authorId AS author, P2.authorId AS coauthor
                FROM paper AS P1 INNER JOIN paper AS P2
                WHERE P1.paperId = P2.paperId
                GROUP BY P1.uid, P2.uid) AS X1
               ON P1.authorId = X1.author
             GROUP BY P1.pid, X1.coauthor) AS Y2
          ON Y1.citedpaper = Y2.paper
          GROUP BY author, paper) AS Z1
    WHERE (SELECT find_in_set(author, C)) > 0
    GROUP BY author

In the context of associative containers, we can imagine a data-flow programming paradigm where the data is transformed using a set of relational operators.
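Two such operators, a reverse that inverts the key-value orientation of a multimap and a mix that relates values sharing a key, can be sketched over plain Java multimaps (Map<K, Set<V>>). These are hypothetical helpers for illustration, not the library's parallel implementations.

```java
import java.util.*;

public class RelationalOpsSketch {
    // reverse(A): v -> k whenever k -> v in A
    static <K, V> Map<V, Set<K>> reverse(Map<K, Set<V>> a) {
        Map<V, Set<K>> out = new HashMap<>();
        a.forEach((k, vs) -> vs.forEach(
            v -> out.computeIfAbsent(v, x -> new HashSet<>()).add(k)));
        return out;
    }

    // mix(A, B): v -> t whenever some key k has k -> v in A and k -> t in B
    static <K, V, T> Map<V, Set<T>> mix(Map<K, Set<V>> a, Map<K, Set<T>> b) {
        Map<V, Set<T>> out = new HashMap<>();
        a.forEach((k, vs) -> {
            Set<T> ts = b.get(k);
            if (ts != null) vs.forEach(
                v -> out.computeIfAbsent(v, x -> new HashSet<>()).addAll(ts));
        });
        return out;
    }
}
```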
If A : K → V and B : K → T are two associative containers, then we have defined the following operations (only the subset relevant to this problem is listed here):

    Operation                      Definition
    reverse(A) : V → K             v → k ∈ reverse(A)  <=>  k → v ∈ A
    join(A, B) : K → <V, T>        k → <v, t> ∈ join(A, B)  <=>  k → v ∈ A ∧ k → t ∈ B
    mix(A, B) : V → T              v → t ∈ mix(A, B)  <=>  ∃k ∈ K : k → v ∈ A ∧ k → t ∈ B
    unique(A) : K → V              remove duplicate values for each key

We omit the proof and only claim that each of these operations can be parallelized by the PDQ library, both in multi-core and distributed environments, for memory-bound as well as disk-bound workloads. We can now describe a counterpart sequence of data-flow statements to compute the same result as the SQL statement, as shown in the following:

    # T1 = (Multimap from the papers table)
    # T2 = (Multimap from the citations table)
    T3 = reverse(T1)   # papers written by each author
    T4 = mix(T1, T1)   # coauthors for each author
    T5 = unique(T4)    # unique coauthors for each author
    T6 = mix(T3, T5)   # coauthor set for each paper
    T7 = unique(T6)
    T8 = mix(T2, T1)   # for each paper, the authors that cite it
    T9 = join(T8, T7)  # for each paper, the authors that cite it
                       # and the coauthor set of this paper
    PDQ.run(compute, T9, result)

    # where the compute function is defined as:
    def compute(T9, result):
        for key, (citing_authors, coauthors) in T9:
            for citing_author in citing_authors:
                if citing_author in coauthors:
                    result[citing_author] += 1

5.5 Incremental Log Processing

In several contexts, logs are generated continuously (e.g., web traffic) and several statistics rely on cumulative processing of the logs (e.g., top web pages visited). Processing the entire log dataset for these statistics, every time a report is requested, may not be practical. Alternatively, statistics can be generated incrementally; however, in traditional programming approaches, custom code has to be developed for this purpose. The PDQ library offers a native and efficient mechanism for incremental processing of data. Recall that our SRM computation model splits the inputs, creates replicas of the output, applies the computation to each split/replica pair, and finally merges the replicas back into the output container. So far we have assumed that the output container is initially empty. However, if the output container has some data to begin with, the replicas will be merged into the existing data. Since the PDQ data structures are natively Serializable, incremental processing is easily accomplished, as illustrated below:

    def first_report(input, output):
        PDQ.run(generate_report, input, output)
        saveFile(output, timestamp_1)

    def incremental_report(last_timestamp, input_since_last_timestamp):
        output = readFile(timestamp + last_timestamp)
        PDQ.run(generate_report, input_since_last_timestamp, output)
        saveFile(output, timestamp + (++last_timestamp))

After producing the first report, the output is saved to disk; for each subsequent report, the previously saved report is loaded from disk and used as the output. We used the WorldCup98 dataset [9], which contains website request logs over 92 days, from 30 servers in four countries. Each log entry is 20B and the total dataset size is 26GB. A report was generated each week, and Figure 9 shows the execution time of incrementally processing each week, compared with full processing of the data.

Figure 9: Comparison of incremental processing of data against full processing

6. CONCLUSION

We have presented PDQCollections, a library of associative containers that inherently encapsulate parallelism, and a computation model that transforms the problem of parallelization into that of defining an addition operator for value data types. We provided the Data Model, which describes what problems can be solved; the Split-Replicate-Merge Computation Model, our strategy for parallelism; and the programming APIs, and we have shown the equivalence with the MapReduce programming model. We have shown that the PDQ library can efficiently exploit parallelism in multi-core as well as distributed environments, for in-memory as well as large datasets.
We discussed our implementation, which included the data structure design; the thread pool execution mode for shared-memory, memory-bound workloads; the disk-backed execution mode that handles computations whose output data is too large to fit in memory; and finally, the distributed execution mode for small to medium sized clusters. Our comparative study with concurrent data structures (for memory-bound computation), BerkeleyDB's StoredMap (for disk-backed computation) and Hazelcast (for distributed computation) highlighted the salient characteristics of our model that explain how it significantly outperforms these traditional data structures. We also illustrated the application of our library in several fields of Computer Science.

7. REFERENCES

[1] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," SIGMOD.
[2] R. Power and J. Li, "Piccolo: building fast, distributed programs with partitioned tables," OSDI.
[3] C. Lucchese, S. Orlando, R. Perego, and F. Silvestri, "Webdocs: a real-life huge transactional dataset."
[4] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, R. Cyganiak, and Z. Ives, "DBpedia: A nucleus for a web of open data," ISWC/ASWC.
[5] Hazelcast.
[6] Y. LeCun and C. Cortes, "MNIST handwritten digit database."
[7] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," tech. rep., Stanford InfoLab.
[8] M. Ley, "DBLP - some lessons learned," PVLDB.
[9] M. Arlitt and T. Jin, "1998 World Cup web site access logs."
