# BigDansing: A System for Big Data Cleansing


(1) Logical layer. A major goal of BigDansing is to allow users to express a variety of data quality rules in a simple way. This means that users should only care about the logic of their quality rules, without worrying about how to make the code distributed. To this end, BigDansing provides five logical operators, namely Scope, Block, Iterate, Detect, and GenFix, to express a data quality rule: Scope defines the relevant data for the rule; Block defines the groups of data units among which a violation may occur; Iterate enumerates the candidate violations; Detect determines whether a candidate violation is indeed a violation; and GenFix generates a set of possible fixes for each violation. Users define these logical operators, as well as the sequence in which BigDansing has to run them, in their jobs. Alternatively, users provide a declarative rule and BigDansing translates it into a job containing these five logical operators. Notice that a job represents the logical plan of a given input quality rule.

Figure 1: BigDansing architecture

Among the many possible solutions to the cleansing problem, it is common to define a notion of minimality based on the cost of a repair. A popular cost function [5] for relational data is the following:

cost(Dr) = Σ_{t ∈ D, t′ ∈ Dr, A ∈ AR} disA(D(t[A]), D(t′[A])),

where t′ is the fix in the repair Dr for a specific tuple t, and disA(D(t[A]), D(t′[A])) is a distance between their values for attribute A (an exact match returns 0). This models the intuition that the higher the sum of the distances between the original values and their fixes, the more expensive the repair. Computing such minimum repairs is NP-hard, even with FDs only [5, 23]. Thus, data repair algorithms are mostly heuristics (see Section 5).

2.2 Architecture

We architected BigDansing as illustrated in Figure 1. BigDansing receives a data quality rule together with a dirty dataset from users (1) and outputs a clean dataset (7). BigDansing consists of two main components: the RuleEngine and the RepairAlgorithm.
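As an illustration of this cost function, a few lines suffice to compute it over a toy repair (a Java sketch; the 0/1 distance and the example tuples are our own, not from the paper):

```java
import java.util.*;

public class RepairCost {
    // Sum, over all tuples and attributes, of the distance between the
    // original value and the repaired value; an exact match contributes 0.
    public static int cost(Map<String, Map<String, String>> original,
                           Map<String, Map<String, String>> repair,
                           List<String> attrs) {
        int total = 0;
        for (String t : original.keySet())
            for (String a : attrs)
                // 0/1 distance: any changed value costs 1 (an assumption of this sketch)
                total += original.get(t).get(a).equals(repair.get(t).get(a)) ? 0 : 1;
        return total;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> d = new HashMap<>();
        d.put("t2", Map.of("zipcode", "921", "city", "LA"));
        d.put("t4", Map.of("zipcode", "921", "city", "SF"));
        Map<String, Map<String, String>> dr = new HashMap<>();
        dr.put("t2", Map.of("zipcode", "921", "city", "SF")); // fix: t2[city] = t4[city]
        dr.put("t4", Map.of("zipcode", "921", "city", "SF"));
        System.out.println(cost(d, dr, List.of("zipcode", "city"))); // 1
    }
}
```

A cheaper repair changes fewer (or closer) values, which is exactly the minimality criterion the repair algorithms of Section 5 try to approximate.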
The RuleEngine receives a quality rule either in a UDF-based form (a BigDansing job) or in a declarative form (a declarative rule). A job (i.e., a script) defines the user's operations as well as the sequence in which the user wants to run them (see Appendix A). A declarative rule is written using traditional integrity constraints such as FDs and DCs. In the latter case, the RuleEngine automatically translates the declarative rule into a job to be executed on a parallel data processing framework. This job outputs the set of violations and possible fixes for each violation. The RuleEngine has three layers: the logical, physical, and execution layers. This architecture allows BigDansing to (i) support a large variety of data quality rules by abstracting the rule specification process, (ii) achieve high efficiency when cleansing datasets by performing a number of physical optimizations, and (iii) scale to big datasets by fully leveraging the scalability of existing parallel data processing frameworks. Notice that, unlike a DBMS, the RuleEngine also has an execution abstraction, which allows BigDansing to run on top of general-purpose data processing frameworks ranging from MapReduce-like systems to databases. (2) Physical layer. In this layer, BigDansing receives a logical plan and transforms it into an optimized physical plan of physical operators. As in DBMSs, a physical plan specifies how a logical plan is implemented. For example, Block could be implemented by either hash-based or range-based methods. A physical operator in BigDansing also contains extra information, such as the input dataset and the degree of parallelism. Overall, BigDansing processes a logical plan through two main optimization steps, namely plan consolidation and data access optimization, through the use of specialized join and data access operators. (3) Execution layer.
In this layer, BigDansing determines how a physical plan will actually be executed on the underlying parallel data processing framework. It transforms a physical plan into an execution plan, which consists of a set of system-dependent operations, e.g., a Spark or MapReduce job. BigDansing runs the generated execution plan on the underlying system. Then, it collects all the violations and possible fixes produced by this execution. As a result, users get the benefits of parallel data processing frameworks by providing just a few lines of code for the logical operators. Once BigDansing has collected the set of violations and possible fixes, it proceeds to repair (cleanse) the input dirty dataset. At this point, the repair process is independent of the number of rules and their semantics, as the repair algorithm considers only the set of violations and their possible fixes. The way the final fixes are chosen among all the possible ones strongly depends on the repair algorithm itself. Instead of proposing a new repair algorithm, we present in Section 5 two different approaches to implement existing algorithms in a distributed setting. Correctness and termination properties of the original algorithms are preserved in our extensions. Alternatively, expert users can plug in their own repair algorithms. In the algorithms that we extend, each repair step greedily eliminates violations with possible fixes while minimizing the cost function in Section 2.1. The iterative process, i.e., detection and repair, terminates if there are no more violations or there are only violations with no corresponding possible fixes. The repair step may introduce new violations on previously fixed data units, and a new step may decide to update these units again. To ensure termination, the algorithm puts a special variable on such units after a fixed number of iterations (a user-defined parameter), thus eliminating the possibility of future violations on the same data.
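The detection-repair iteration just described can be outlined as follows (a Java sketch with made-up functional interfaces; the real system delegates detection to the RuleEngine and fix generation to GenFix):

```java
import java.util.*;
import java.util.function.*;

public class RepairLoop {
    // data: unit id -> value; detect returns the sets of unit ids in violation;
    // repair produces a new dataset given the current one and the violations.
    public static Map<String, String> cleanse(
            Map<String, String> data,
            Function<Map<String, String>, List<Set<String>>> detect,
            BiFunction<Map<String, String>, List<Set<String>>, Map<String, String>> repair,
            int maxIters) {
        Set<String> frozen = new HashSet<>();   // units no longer allowed to change
        for (int i = 0; ; i++) {
            List<Set<String>> violations = new ArrayList<>();
            for (Set<String> v : detect.apply(data))
                if (Collections.disjoint(v, frozen))  // ignore frozen units
                    violations.add(v);
            if (violations.isEmpty())
                return data;        // clean, or only unfixable violations remain
            if (i >= maxIters) {    // iteration budget reached: freeze and re-check
                for (Set<String> v : violations) frozen.addAll(v);
                continue;
            }
            data = repair.apply(data, violations);
        }
    }
}
```

Freezing repeatedly violated units after the iteration budget mirrors the termination guarantee: a frozen unit can no longer trigger (or receive) a fix, so the loop always ends.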

3. RULE SPECIFICATION

BigDansing's abstraction consists of five logical operators: Scope, Block, Iterate, Detect, and GenFix, which are powerful enough to express a large spectrum of cleansing tasks. While Detect and GenFix are the two general operators that model the data cleansing process, Scope, Block, and Iterate enable the efficient and scalable execution of that process. Generally speaking, Scope reduces the amount of data that has to be treated, Block reduces the search space for candidate violation generation, and Iterate efficiently traverses the reduced search space to generate all candidate violations. It is worth noting that these five operators do not model the repair process itself (i.e., a repair algorithm). The system translates these operators, together with a generated or user-provided BigDansing job, into a logical plan.

3.1 Logical Operators

As mentioned earlier, BigDansing defines the input data through data units Us on which the logical operators operate. While such a fine-granular model might seem to incur a high cost, as BigDansing calls an operator for each U, it in fact allows BigDansing to apply an operator in a highly parallel fashion. In the following, we define the five operators provided by BigDansing and illustrate them in Figure 2 using the dataset and rule FD φF (zipcode → city) of Example 1. Notice that the following listings are automatically generated by the system for declarative rules. Users can either modify this code or provide their own in the UDF case.

(1) Scope removes irrelevant data units from a dataset. For each data unit U, Scope outputs a set of filtered data units, which can be an empty set.

Scope(U) → list⟨U⟩

Notice that Scope outputs a list of Us as it allows data units to be replicated. This operator is important as it allows BigDansing to focus only on data units that are relevant to a given rule. For instance, in Figure 2, Scope projects on attributes zipcode and city. Listing 4 (Appendix B) shows the lines of code for Scope in rule φF.
(2) Block groups data units sharing the same blocking key. For each data unit U, Block outputs a blocking key.

Block(U) → key

The Block operator is crucial for BigDansing's scalability as it narrows down the number of data units on which a violation might occur. For example, in Figure 2, Block groups tuples on attribute zipcode, resulting in three blocks, each with a distinct zipcode. Violations might occur inside these blocks only, and not across blocks. See Listing 5 (Appendix B) for the single line of code required by this operator for rule φF.

(3) Iterate defines how to combine data units Us to generate candidate violations. This operator takes as input a list of lists of data units Us (because it might take the output of several previous operators) and outputs a single U, a pair of Us, or a list of Us.

Iterate(list⟨list⟨U⟩⟩) → U | ⟨Ui, Uj⟩ | list⟨U⟩

This operator allows avoiding the quadratic complexity of generating candidate violations. For instance, in Figure 2, Iterate passes each unique combination of two tuples inside each block, producing four pairs only (instead of 15): (t3, t5) from B1, and (t2, t4), (t2, t6), and (t4, t6) from B3. Listing 6 (Appendix B) shows the code required by this operator for rule φF.

Figure 2: Logical operators execution for rule FD φF on the dataset of Example 1 (t1: 11, NY; t2: 921, LA; t3: 661, CH; t4: 921, SF; t5: 661, CH; t6: 921, LA)

(4) Detect takes a single U, a U-pair, or a list of Us as input and outputs a list of violations, possibly empty.

Detect(U | ⟨Ui, Uj⟩ | list⟨U⟩) → {list⟨violation⟩}

Considering three types of inputs for Detect allows us to achieve better parallelization by distinguishing between different granularities of the input.
For example, having 1K Us as input, rather than a single list of Us, would allow us to run 1K parallel instances of Detect (instead of a single Detect instance). In Figure 2, Detect outputs two violations, v1 = (t2, t4) and v2 = (t6, t4), as they have different values for city; it requires the lines of code in Listing 1.

```java
public ArrayList<Violation> detect(TuplePair in) {
    ArrayList<Violation> lst = new ArrayList<Violation>();
    if (!in.getLeft().getCellValue(1).equals(in.getRight().getCellValue(1))) {
        Violation v = new Violation("zipcode => City");
        v.addTuple(in.getLeft());
        v.addTuple(in.getRight());
        lst.add(v);
    }
    return lst;
}
```

Listing 1: Code example for the Detect operator.

(5) GenFix computes a set of possible fixes for a given violation.

GenFix(violation) → {list⟨PossibleFixes⟩}

For instance, assuming that only right-hand side values can be modified, GenFix produces one possible repair for each detected violation (Figure 2): t2[city] = t4[city] and t6[city] = t4[city]. Listing 2 shows the code for this GenFix.

```java
public ArrayList<Fix> GenFix(Violation v) {
    ArrayList<Fix> result = new ArrayList<Fix>();
    Tuple t1 = v.getLeft();
    Tuple t2 = v.getRight();
    Cell c1 = new Cell(t1.getID(), "City", t1.getCellValue(1));
    Cell c2 = new Cell(t2.getID(), "City", t2.getCellValue(1));
    result.add(new Fix(c1, "=", c2));
    return result;
}
```

Listing 2: Code example for the GenFix operator.

Additionally, to better specify the data flow among the different operators, we introduce a label to stamp a data item and track how it is being processed. In contrast to DBMS operators, BigDansing's operators are UDFs, which allow users to plug in any logic. As a result, we are not restricted to a specific data model. However, for ease of explanation, all of our examples assume relational data.
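To make the end-to-end data flow concrete, the five operators for φF can be simulated sequentially on the example dataset (a self-contained Java sketch, not BigDansing's actual operator API):

```java
import java.util.*;

public class FDPipeline {
    // Returns "ti[city] = tj[city]" possible fixes for rule zipcode -> city.
    public static List<String> run(Map<String, String[]> tuples) {
        // Scope: keep only (zipcode, city); here tuples already hold just these.
        // Block: group tuple ids by their zipcode value.
        Map<String, List<String>> blocks = new TreeMap<>();
        for (Map.Entry<String, String[]> e : tuples.entrySet())
            blocks.computeIfAbsent(e.getValue()[0], k -> new ArrayList<>()).add(e.getKey());
        // Iterate: unique pairs inside each block only (4 pairs, not 15).
        List<String[]> pairs = new ArrayList<>();
        for (List<String> b : blocks.values()) {
            Collections.sort(b);
            for (int i = 0; i < b.size(); i++)
                for (int j = i + 1; j < b.size(); j++)
                    pairs.add(new String[]{b.get(i), b.get(j)});
        }
        // Detect: same zipcode but different city; GenFix: equate the cities.
        List<String> fixes = new ArrayList<>();
        for (String[] p : pairs)
            if (!tuples.get(p[0])[1].equals(tuples.get(p[1])[1]))
                fixes.add(p[0] + "[city] = " + p[1] + "[city]");
        return fixes;
    }

    public static void main(String[] args) {
        Map<String, String[]> t = new TreeMap<>();
        t.put("t1", new String[]{"11", "NY"});  t.put("t2", new String[]{"921", "LA"});
        t.put("t3", new String[]{"661", "CH"}); t.put("t4", new String[]{"921", "SF"});
        t.put("t5", new String[]{"661", "CH"}); t.put("t6", new String[]{"921", "LA"});
        System.out.println(run(t)); // [t2[city] = t4[city], t4[city] = t6[city]]
    }
}
```

The two emitted fixes correspond to the violations v1 = (t2, t4) and v2 = (t6, t4) of Figure 2.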

Figure 6: Example plans: (a) logical plan with non-consolidated operators; (b) physical plan with consolidated operators

Join conditions with ordering comparisons appear in rules such as DCs. Thus, BigDansing provides OCJoin, an efficient operator to deal with joins over ordering comparisons. It takes an input dataset D and applies a number of join conditions, returning a list of joining U-pairs. We formally define this operator as follows:

OCJoin(D) → list⟨Pair⟨U⟩⟩

Every time BigDansing recognizes join conditions defined with ordering comparisons in PDetect, e.g., for φD, it translates them into an OCJoin implementation. Then, it passes OCJoin's output to a PDetect operator (or to the next logical operator, if any). We discuss this operator in detail in Section 4.3. We report details on the transformation of the physical operators to two distributed platforms in Appendix G.

4.3 Fast Joins with Ordering Comparisons

Existing systems handle joins over ordering comparisons using a cross product and a post-selection predicate, leading to poor performance. BigDansing provides an efficient ad-hoc join operator, referred to as OCJoin, to handle these cases. The main goal of OCJoin is to increase the ability to process joins over ordering comparisons in parallel and to reduce their complexity by shrinking the algorithm's search space. In a nutshell, OCJoin first range-partitions a set of data units and sorts each of the resulting partitions in order to validate the inequality join conditions in a distributed fashion. OCJoin works in four main phases: partitioning, sorting, pruning, and joining (see Algorithm 2). Partitioning. OCJoin first selects the attribute, PartAtt, on which to partition the input D (line 1). We assume that all join conditions have the same output cardinality. This could be improved using cardinality estimation techniques [9, 27], but it is beyond the scope of the paper.
OCJoin chooses the first attribute involved in the first condition. For instance, considering again φD (Example 1), OCJoin sets PartAtt to the rate attribute. Then, OCJoin partitions the input dataset D into nbParts range partitions based on PartAtt (line 2). As part of this partitioning, OCJoin distributes the resulting partitions across all available computing nodes. Notice that OCJoin runs the range partitioning in parallel. Next, OCJoin forks a parallel process for each range partition k_i to run the three remaining phases (lines 3-14).

```
Algorithm 2: OCJoin
input : Dataset D, Condition conds[], Integer nbParts
output: List⟨TuplePair⟩ Tuples
// Partitioning phase
1   PartAtt ← getPrimaryAtt(conds[].getAttribute());
2   K ← RangePartition(D, PartAtt, nbParts);
3   foreach k_i ∈ K do
4       foreach c_j ∈ conds[] do              // Sorting
5           Sorts[j] ← sort(k_i, c_j.getAttribute());
6       foreach k_l ∈ {k_{i+1} ... k_|K|} do
7           if overlap(k_i, k_l, PartAtt) then // Pruning
8               tuples ← ∅;
9               foreach c_j ∈ conds[] do       // Joining
10                  tuples ← join(k_i, k_l, Sorts[j], tuples);
11                  if tuples == ∅ then
12                      break;
13              if tuples ≠ ∅ then
14                  Tuples.add(tuples);
```

Sorting. For each partition, OCJoin creates as many sorted lists (Sorts) as there are inequality conditions in the rule (lines 4-5). For example, OCJoin creates two sorted lists for φD: one sorted on rate and the other sorted on salary. Each list contains the values of the sorting attribute together with the tuple identifiers. Note that OCJoin only performs a local sort in this phase and hence does not require any data transfer across nodes. Since multiple copies of a partition may exist on multiple computing nodes, we apply sorting before the pruning and joining phases to ensure that each partition is sorted at most once.

Pruning. Once all partitions k_i ∈ K are internally sorted, OCJoin can start joining each of these partitions based on the inequality join conditions. However, this would require transferring large amounts of data from one node to another.
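Before continuing with the pruning details, the four phases can be condensed into a single-node sketch (Java; range partitioning, implicit sorting, min/max pruning, and a nested-loop join standing in for the distributed sort-merge join — all names and the two-condition rule are our own illustration):

```java
import java.util.*;

public class OCJoinSketch {
    /** Tuples over the two numeric attributes of a phi_D-like rule: rate and salary. */
    public static class Emp {
        public final String id; public final int rate, salary;
        public Emp(String id, int rate, int salary) {
            this.id = id; this.rate = rate; this.salary = salary;
        }
    }

    // Emits pairs (t1, t2) with t1.rate < t2.rate and t1.salary > t2.salary.
    public static List<Emp[]> ocjoin(List<Emp> d, int nbParts) {
        // Partitioning: range-partition on the first condition's attribute (rate).
        List<Emp> sorted = new ArrayList<>(d);
        sorted.sort(Comparator.comparingInt(e -> e.rate));
        int per = Math.max(1, (int) Math.ceil(sorted.size() / (double) nbParts));
        List<List<Emp>> parts = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += per)
            parts.add(sorted.subList(i, Math.min(i + per, sorted.size())));
        // Sorting is implicit here: each partition is already ordered on rate;
        // the real operator also keeps a second list sorted on salary.
        List<Emp[]> out = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++)
            for (int l = i; l < parts.size(); l++) {
                List<Emp> ki = parts.get(i), kl = parts.get(l);
                // Pruning: t1.rate < t2.rate is impossible if ki's min >= kl's max.
                if (ki.get(0).rate >= kl.get(kl.size() - 1).rate) continue;
                // Joining: validate every condition on the surviving partition pair.
                for (Emp t1 : ki)
                    for (Emp t2 : kl)
                        if (t1.rate < t2.rate && t1.salary > t2.salary)
                            out.add(new Emp[]{t1, t2});
            }
        return out;
    }
}
```

Because ranges are disjoint and ordered, only partition pairs (k_i, k_l) with i ≤ l can satisfy the "<" condition on the partitioning attribute, which is what the pruning test exploits.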
To circumvent such an overhead, OCJoin inspects the min and max values of each partition and avoids joining partitions whose [min, max] ranges do not overlap (the pruning phase, line 7). Non-overlapping partitions cannot produce any join result. If the selectivity values of the different inequality conditions are known, OCJoin can order the different joins accordingly. Joining. OCJoin finally proceeds to join the overlapping partitions and outputs the join results (lines 9-14). For this, it applies a distributed sort-merge join over the sorted lists, where some partitions are broadcast to other machines while the rest are kept local. Through pruning, OCJoin tells the underlying distributed processing platform which partitions to join. It is up to that platform to select the best approach to minimize the number of broadcast partitions.

5. DISTRIBUTED REPAIR ALGORITHMS

Most of the existing repair techniques [5-7, 11, 17, 23, 34] are centralized. We present two approaches to implement a repair algorithm in BigDansing. First, we show how our system can run a centralized data repair algorithm in parallel, without changing the algorithm. In other words, BigDansing treats that algorithm as a black box (Section 5.1). Second, we design a distributed version of the widely used equivalence class algorithm [5, 7, 11, 17] (Section 5.2).

5.1 Scaling Data Repair as a Black Box

Overall, we divide a repair task into independent smaller repair tasks. For this, we represent the violation graph as a hypergraph containing the violations and their possible fixes. The nodes represent the elements, and each hyperedge covers a set of elements that together violate a rule, along with its possible repairs. We then divide the hypergraph into smaller independent subgraphs, i.e., connected components,

and we pass each connected component to an independent data repair instance.

Figure 7: Data repair as a black box

Connected components. Given an input set of violations from at least one rule, BigDansing first creates the hypergraph representation of the violations and their possible fixes in a distributed manner. It then uses GraphX [37] to find all connected components in the hypergraph. GraphX, in turn, uses the Bulk Synchronous Parallel (BSP) graph processing model [25] to process the hypergraph in parallel. As a result, BigDansing gets a connected component ID for each hyperedge. It then groups hyperedges by connected component ID. Figure 7 shows an example of connected components in a hypergraph containing three violations v1, v2, and v3. Note that v1 and v2 can be caused by different rules. Violations v1 and v2 are grouped in a single connected component CC1 because they share element c2. In contrast, v3 is assigned to a different connected component CC2 because it does not share any element with v1 or v2. Independent data repair instance. Once all connected components are computed, BigDansing assigns each of them to an independent data repair instance and runs these repair instances in a distributed manner (right side of Figure 7). When all data repair instances have generated the required fixes, BigDansing updates the input dataset and passes it again to the RuleEngine to detect potential violations introduced by the RepairAlgorithm. The number of iterations required to fix all violations depends on the input rules, the dataset, and the repair algorithm. Dealing with big connected components. If a connected component does not fit in memory, BigDansing uses a k-way multilevel hypergraph partitioning algorithm [22] to divide it into k equal parts and runs them on distinct machines.
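Without GraphX, the same grouping can be obtained locally with a union-find over the elements that each hyperedge touches (a Java sketch; element names follow Figure 7):

```java
import java.util.*;

public class ConnectedFixes {
    private static String find(Map<String, String> parent, String x) {
        parent.putIfAbsent(x, x);
        while (!parent.get(x).equals(x)) {
            parent.put(x, parent.get(parent.get(x)));  // path halving
            x = parent.get(x);
        }
        return x;
    }

    // Each violation is the list of elements (cells) it covers; returns the
    // violation indices grouped into independent connected components.
    public static Collection<List<Integer>> components(List<List<String>> violations) {
        Map<String, String> parent = new HashMap<>();
        for (List<String> v : violations)
            for (int i = 1; i < v.size(); i++)
                parent.put(find(parent, v.get(0)), find(parent, v.get(i)));
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < violations.size(); i++)
            groups.computeIfAbsent(find(parent, violations.get(i).get(0)),
                                   k -> new ArrayList<>()).add(i);
        return groups.values();  // each group feeds one repair instance
    }

    public static void main(String[] args) {
        // v1 and v2 share c2 -> one component; v3 is independent (as in Figure 7)
        System.out.println(components(List.of(
            List.of("c1", "c2"), List.of("c2", "c3", "c4"), List.of("c5"))));
    }
}
```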
Unfortunately, naively performing this process can lead to inconsistencies and contradictory choices in the repair. Moreover, it can fix the same violation independently on two machines, thus introducing unnecessary changes to the repair. We illustrate this problem with an example. Example 2: Consider a relational schema D with three attributes A, B, C and two FDs A → B and C → B. Given the instance [t1](a1, b1, c1), [t2](a1, b2, c1), all data values are in violation: t1.A, t1.B, t2.A, t2.B for the first FD and t1.B, t1.C, t2.B, t2.C for the second one. Assuming the compute nodes have enough memory for only five values, we need to solve the violations by executing two instances of the algorithm on two different nodes. Regardless of the selected tuple, suppose the first compute node repairs a value on attribute A, and the second one a value on attribute C. When we put the two repairs together and check their consistency, the updated instance is a valid solution, but the repair is not minimal, because a single update on attribute B would have solved both violations. However, if the first node fixes t1.B by assigning value b2 and the second one fixes t2.B with b1, not only are there two changes, but the final instance is also inconsistent. We tackle this problem by assigning the role of master to one machine and the role of slave to the rest. Every machine applies its repairs in isolation, but we introduce an extra test in the union of the results. For the violations that are solved by the master, we mark its changes as immutable, which prevents an element from being changed more than once. If a change proposed by a slave contradicts a possible repair that involves a master's change, the slave repair is undone and a new iteration is triggered. As a result, the algorithm always reaches a fix point and produces a clean dataset, because an updated value cannot change in the following iterations.
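This reconciliation step can be sketched as follows (Java; fixes are modeled as element→value assignments, the master's elements become immutable, and all names are our own simplification):

```java
import java.util.*;

public class Reconcile {
    // Applies the master's fixes first and freezes the elements they touch;
    // any slave fix contradicting a frozen element is undone and returned
    // so that a new iteration can reconsider it.
    public static List<String[]> merge(Map<String, String> applied,
                                       Map<String, String> masterFixes,
                                       List<Map<String, String>> slaveFixes) {
        applied.putAll(masterFixes);
        Set<String> immutable = new HashSet<>(masterFixes.keySet());
        List<String[]> undone = new ArrayList<>();
        for (Map<String, String> fixes : slaveFixes)
            for (Map.Entry<String, String> f : fixes.entrySet()) {
                if (immutable.contains(f.getKey())
                        && !applied.get(f.getKey()).equals(f.getValue()))
                    undone.add(new String[]{f.getKey(), f.getValue()}); // contradicts master
                else
                    applied.putIfAbsent(f.getKey(), f.getValue());
            }
        return undone;
    }

    public static void main(String[] args) {
        Map<String, String> applied = new HashMap<>();
        // Example 2: master sets t1.B = b2; a slave proposes t1.B = b1 and t2.B = b2
        List<String[]> undone = merge(applied, Map.of("t1.B", "b2"),
                List.of(Map.of("t1.B", "b1", "t2.B", "b2")));
        System.out.println(applied);           // contains t1.B=b2 and t2.B=b2
        System.out.println(undone.get(0)[0]);  // t1.B
    }
}
```

Since every applied value is frozen before the next iteration, no element changes twice, which is what guarantees the fix point.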
BigDansing currently provides two repair algorithms using this approach: the equivalence class algorithm and a general hypergraph-based algorithm [6, 23]. Users can also implement their own repair algorithm if it is compatible with BigDansing's repair interface.

5.2 Scalable Equivalence Class Algorithm

The idea of the equivalence-class-based algorithm [5] is to first group all elements that should be equivalent, and then decide how to assign values to each group. An equivalence class consists of pairs of the form (t, A), where t is a data unit and A is an element. In a dataset D, each unit t and each element A in t have an associated equivalence class, denoted by eq(t, A). In a repair, a unique target value is assigned to each equivalence class E, denoted by targ(E). That is, for all (t, A) ∈ E, t[A] has the same value targ(E). The algorithm selects the target value for each equivalence class so as to obtain a repair with minimum overall cost. We extend the equivalence class algorithm to a distributed setting by modeling it as a distributed word-counting algorithm based on map and reduce functions. However, in contrast to a standard word count, we use two map-reduce sequences. The first map function maps the violations' possible fixes for each connected component into key-value pairs of the form ⟨(ccID, value), count⟩. The key (ccID, value) is a composite key containing the connected component ID and the element value of each possible fix. The value count represents the frequency of the element value, which we initialize to 1. The first reduce function sums the counts of the key-value pairs that share the same connected component ID and element value, outputting pairs of the same form ⟨(ccID, value), count⟩. Note that if an element appears in multiple fixes, we count its value only once. After this first map-reduce sequence, another map function takes the output of the reduce function and creates new key-value pairs of the form ⟨ccID, (value, count)⟩.
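A local stand-in for the two map-reduce sequences may help fix the idea (Java; possible fixes are modeled as (ccID, element, value) triples — an illustration, not the actual Spark implementation):

```java
import java.util.*;

public class EquivalenceClassRepair {
    // Picks, for each connected component, the value with the highest
    // frequency among its possible fixes (counting each element's value once).
    public static Map<Integer, String> pickValues(List<String[]> fixes) {
        // First "map-reduce": deduplicate (cc, element, value), then count per (cc, value)
        Set<List<String>> seen = new HashSet<>();
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (String[] f : fixes) {              // f = {ccId, element, value}
            if (!seen.add(Arrays.asList(f))) continue;
            counts.computeIfAbsent(Integer.valueOf(f[0]), k -> new HashMap<>())
                  .merge(f[2], 1, Integer::sum);
        }
        // Second "map-reduce": per component, keep the most frequent value
        Map<Integer, String> target = new HashMap<>();
        counts.forEach((cc, byValue) -> target.put(cc,
                Collections.max(byValue.entrySet(), Map.Entry.comparingByValue()).getKey()));
        return target;
    }

    public static void main(String[] args) {
        List<String[]> fixes = List.of(
                new String[]{"1", "t2.city", "LA"},
                new String[]{"1", "t4.city", "LA"},
                new String[]{"1", "t4.city", "SF"});
        System.out.println(pickValues(fixes));  // {1=LA}
    }
}
```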
The key is the connected component ID and the value is the frequency of each element value. The last reduce function selects the element value with the highest frequency and assigns it to all the elements in the connected component ccID.

6. EXPERIMENTAL STUDY

We evaluate BigDansing using both real and synthetic datasets with various rules (Section 6.1). We consider a variety of scenarios to evaluate the system and answer the following questions: (i) how well does it perform compared with baseline systems in a single-node setting (Section 6.2)? (ii) how well does it scale to different dataset sizes compared to state-of-the-art distributed systems (Section 6.3)? (iii) how well does it scale in terms of the number of nodes?

```java
// Input
JavaPairRDD<LongWritable, Text> inputData = sc.hadoopFile(inputPath,
    org.apache.hadoop.mapred.TextInputFormat.class,
    LongWritable.class, Text.class, minPartitions);
JavaRDD<Tuple> tupleData = inputData.map(new StringToTuples());
// Rule Engine
JavaRDD<Tuple> scopeData = tupleData.map(new FDScope());
JavaPairRDD<Key, Iterable<Tuple>> blockData = scopeData.groupByKey(new FDBlock());
JavaRDD<TuplePair> iterateData = blockData.map(new FDIterate());
JavaRDD<Violation> detectData = iterateData.map(new FDDetect());
JavaRDD<Fix> genFixData = detectData.map(new FDGenFix());
// Repair Algorithm
JavaRDD<Edge<Fix>> edges = genFixData.map(new ExtractEdges());
JavaRDD<EdgeTriplet<Object, Fix>> ccRDD = new RDDGraphBuilder(edges.rdd()).buildGraphRDD();
JavaPairRDD<Integer, Fix> groupedFix = ccRDD.mapToPair(new ExtractCC());
JavaPairRDD<Integer, Tuple4<Integer, Integer, String, String>> stringFixUniqueKeys =
    groupedFix.flatMapToPair(new ExtractStringFixUniqueKey());
JavaPairRDD<Integer, Tuple4<Integer, Integer, String, String>> countStringFixes =
    stringFixUniqueKeys.reduceByKey(new CountFixes());
JavaPairRDD<Integer, Tuple4<Integer, Integer, String, String>> newUniqueKeysStringFix =
    countStringFixes.mapToPair(new ExtractReducedCellValuesKey());
JavaPairRDD<Integer, Tuple4<Integer, Integer, String, String>> reducedStringFixes =
    newUniqueKeysStringFix.reduceByKey(new ReduceStringFixes());
JavaPairRDD<Integer, Fix> uniqueKeysFix = groupedFix.flatMapToPair(new ExtractFixUniqueKey());
JavaRDD<Fix> candidateFixes = uniqueKeysFix.join(reducedStringFixes)
    .values().flatMap(new GetFixValues());
// Apply results to input
JavaPairRDD<Long, Iterable<Fix>> fixRDD = candidateFixes.keyBy(new GetFixTupleId()).groupByKey();
JavaPairRDD<Long, Tuple> dataRDD = tupleData.keyBy(new GetTupleId());
JavaRDD<Tuple> newTupleData = dataRDD.leftOuterJoin(fixRDD).map(new ApplyFixes());
```

Listing 7: Spark code for rule φF

```scala
class RDDGraphBuilder(var edgeRDD: RDD[Edge[Fix]]) {
  def buildGraphRDD: JavaRDD[EdgeTriplet[VertexId, Fix]] = {
    val g: Graph[Integer, Fix] = Graph.fromEdges(edgeRDD, 0)
    new JavaRDD[EdgeTriplet[VertexId, Fix]](g.connectedComponents().triplets)
  }
}
```

Listing 8: Scala code for GraphX to find connected components of fixes

E. BUSHY PLAN EXAMPLE

We show a bushy plan example in Figure 16 based on the following two tables and DC rules from [6]:

Table Global (G): GID, FN, LN, Role, City, AC, ST, SAL
Table Local (L): LID, FN, LN, RNK, DO, Y, City, MID, SAL

(c1): ∀t1, t2 ∈ G, ¬(t1.City = t2.City ∧ t1.ST ≠ t2.ST)
(c2): ∀t1, t2 ∈ G, ¬(t1.Role = t2.Role ∧ t1.City = "NYC" ∧ t2.City ≠ "NYC" ∧ t2.SAL > t1.SAL)
(c3): ∀t1, t2 ∈ L, t3 ∈ G, ¬(t1.LID ≠ t2.LID ∧ t1.LID = t2.MID ∧ t1.FN ≈ t3.FN ∧ t1.LN ≈ t3.LN ∧ t1.City = t3.City ∧ t3.Role ≠ "M")

Figure 16: Example of a bushy plan

The plan starts by applying the Scope operator. Instead of calling Scope for each rule, we invoke it only once per relation. Next we apply the Block operator as follows: block on City for c1, on Role for c2, and on LID and MID for c3. Thereafter, for c1 and c2, we iterate over the candidate tuples (Iterate) and feed them to the Detect and GenFix operators, respectively. For c3, we iterate over all employees who are managers, combine them with data units from the global table G, and then feed them to the Detect and GenFix operators.
The key thing to note in the above bushy data cleansing plan is that, while each rule has its own Detect/GenFix operators, the plan shares many of the other operators in order to reduce: (1) the number of times data is read from the base relations, and (2) the number of duplicate data units generated and processed in the dataflow.

F. DATA STORAGE MANAGER

BigDansing applies three different data storage optimizations: (i) data partitioning, to avoid shuffling large amounts of data; (ii) data replication, to efficiently support a large variety of data quality rules; and (iii) data layouts, to improve I/O operations. We describe them below.

(1) Partitioning. Typically, distributed data storage systems split data files into smaller chunks based on size. In contrast, BigDansing partitions a dataset based on its content, i.e., based on attribute values. Such a logical partitioning allows co-locating data based on a given blocking key. As a result, BigDansing can push down the Block operator to the storage manager. This avoids having to co-locate datasets while detecting violations and hence significantly reduces network costs.

(2) Replication. A single data partitioning, however, might not be useful for multiple data cleansing tasks, which in practice do not all share the same blocking key. To handle such cases, we replicate a dataset in a heterogeneous manner: BigDansing logically partitions (i.e., based on values) each replica on a different attribute. As a result, we can again push down the Block operator for multiple data cleansing tasks.

(3) Layout. BigDansing converts a dataset to a binary format when storing it in the underlying data storage framework. This helps avoid expensive string parsing operations.
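The value-based partitioning can be sketched as follows (Java; a hash on the blocking-key value stands in for the storage manager's logical range partitioning — the method names are our own):

```java
import java.util.*;

public class ContentPartitioner {
    // All tuples sharing a blocking-key value land in the same partition,
    // so Block can later run without shuffling data across workers.
    public static List<List<Map<String, String>>> partition(
            List<Map<String, String>> tuples, String blockKey, int nWorkers) {
        List<List<Map<String, String>>> parts = new ArrayList<>();
        for (int i = 0; i < nWorkers; i++) parts.add(new ArrayList<>());
        for (Map<String, String> t : tuples)
            parts.get(Math.floorMod(t.get(blockKey).hashCode(), nWorkers)).add(t);
        return parts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
                Map.of("zipcode", "921", "city", "LA"),
                Map.of("zipcode", "921", "city", "SF"),
                Map.of("zipcode", "661", "city", "CH"));
        // The two 921-tuples are guaranteed to share a partition.
        System.out.println(partition(rows, "zipcode", 4));
    }
}
```

A replica partitioned on a different attribute would simply call `partition` with another `blockKey`, which is the replication idea in (2).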


### Microsoft SQL Server on Stratus ftserver Systems

W H I T E P A P E R Microsoft SQL Server on Stratus ftserver Systems Security, scalability and reliability at its best Uptime that approaches six nines Significant cost savings for your business Only from

### Enterprise Discovery Best Practices

Enterprise Discovery Best Practices 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

### Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

### Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

### Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

### Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

### Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

### Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets

### The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

### Database Tuning Advisor for Microsoft SQL Server 2005

Database Tuning Advisor for Microsoft SQL Server 2005 Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollar, Arun Marathe, Vivek Narasayya, Manoj Syamala Microsoft Corporation One Microsoft Way Redmond, WA 98052.

### Big Data looks Tiny from the Stratosphere

Volker Markl http://www.user.tu-berlin.de/marklv volker.markl@tu-berlin.de Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type

### Efficiently Identifying Inclusion Dependencies in RDBMS

Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany bauckmann@informatik.hu-berlin.de

### A Tool for Generating Partition Schedules of Multiprocessor Systems

A Tool for Generating Partition Schedules of Multiprocessor Systems Hans-Joachim Goltz and Norbert Pieth Fraunhofer FIRST, Berlin, Germany {hans-joachim.goltz,nobert.pieth}@first.fraunhofer.de Abstract.

### EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

### Improving Data Quality: Consistency and Accuracy

Improving Data Quality: Consistency and Accuracy Gao Cong 1 Wenfei Fan 2,3 Floris Geerts 2,4,5 Xibei Jia 2 Shuai Ma 2 1 Microsoft Research Asia 2 University of Edinburgh 4 Hasselt University 3 Bell Laboratories

### MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### Clean Answers over Dirty Databases: A Probabilistic Approach

Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University

### Using In-Memory Computing to Simplify Big Data Analytics

SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

### Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

### Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

### www.dotnetsparkles.wordpress.com

Database Design Considerations Designing a database requires an understanding of both the business functions you want to model and the database concepts and features used to represent those business functions.

### KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

### Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

### Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

### PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2

### Energy Efficient MapReduce

Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

### Efficient Computation of Multiple Group By Queries Zhimin Chen Vivek Narasayya

Efficient Computation of Multiple Group By Queries Zhimin Chen Vivek Narasayya Microsoft Research {zmchen, viveknar}@microsoft.com ABSTRACT Data analysts need to understand the quality of data in the warehouse.

### SQL Server 2005 Features Comparison

Page 1 of 10 Quick Links Home Worldwide Search Microsoft.com for: Go : Home Product Information How to Buy Editions Learning Downloads Support Partners Technologies Solutions Community Previous Versions

### Optimizing Performance. Training Division New Delhi

Optimizing Performance Training Division New Delhi Performance tuning : Goals Minimize the response time for each query Maximize the throughput of the entire database server by minimizing network traffic,

### SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

### Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects

### InfiniteGraph: The Distributed Graph Database

A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

### Integrate Master Data with Big Data using Oracle Table Access for Hadoop

Integrate Master Data with Big Data using Oracle Table Access for Hadoop Kuassi Mensah Oracle Corporation Redwood Shores, CA, USA Keywords: Hadoop, BigData, Hive SQL, Spark SQL, HCatalog, StorageHandler

### VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

### Using an In-Memory Data Grid for Near Real-Time Data Analysis

SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses

### Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%

openbench Labs Executive Briefing: April 19, 2013 Condusiv s Server Boosts Performance of SQL Server 2012 by 55% Optimizing I/O for Increased Throughput and Reduced Latency on Physical Servers 01 Executive

### Big Graph Processing: Some Background

Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

### Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

### Licenze Microsoft SQL Server 2005

Versione software Licenze Microsoft SQL Server 2005 Noleggio/mese senza assistenza sistemistica Noleggio/mese CON assistenza sistemistica SQL Server Express 0,00+Iva da preventivare SQL Server Workgroup

### Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

### CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

### MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

### A Fast and Efficient Method to Find the Conditional Functional Dependencies in Databases

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 3, Issue 5 (August 2012), PP. 56-61 A Fast and Efficient Method to Find the Conditional

### Benchmarking Cassandra on Violin

Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

### Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

### HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

### Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA

Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,

### Distributed Framework for Data Mining As a Service on Private Cloud

RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

### A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

### Index Selection Techniques in Data Warehouse Systems

Index Selection Techniques in Data Warehouse Systems Aliaksei Holubeu as a part of a Seminar Databases and Data Warehouses. Implementation and usage. Konstanz, June 3, 2005 2 Contents 1 DATA WAREHOUSES

### SOFTWARE as a service (SaaS) has attracted significant

ADAPTIVE DATABASE SCHEMA DESIGN FOR MULTI-TENANT DATA MANAGEMENT 1 ive Database Schema Design for Multi-Tenant Data Management Jiacai Ni, Guoliang Li, Lijun Wang, Jianhua Feng, Jun Zhang, Lei Li Abstract

### Efficient Algorithms for Masking and Finding Quasi-Identifiers

Efficient Algorithms for Masking and Finding Quasi-Identifiers Rajeev Motwani Stanford University rajeev@cs.stanford.edu Ying Xu Stanford University xuying@cs.stanford.edu ABSTRACT A quasi-identifier refers

### Graph Database Proof of Concept Report

Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

### Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

### Seeking Fast, Durable Data Management: A Database System and Persistent Storage Benchmark

Seeking Fast, Durable Data Management: A Database System and Persistent Storage Benchmark In-memory database systems (IMDSs) eliminate much of the performance latency associated with traditional on-disk

### Implementing Web-Based Computing Services To Improve Performance And Assist Telemedicine Database Management System

Implementing Web-Based Computing Services To Improve Performance And Assist Telemedicine Database Management System D. A. Vidhate 1, Ige Pranita 2, Kothari Pooja 3, Kshatriya Pooja 4 (Information Technology,

### Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

### Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Load Balancing for Entity Matching over Big Data using Sorted Neighborhood Yogesh Wattamwar

### Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Database Application Developer Tools Using Static Analysis and Dynamic Profiling Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala Microsoft Research {surajitc,viveknar,manojsy}@microsoft.com Abstract

### Understanding Data Locality in VMware Virtual SAN

Understanding Data Locality in VMware Virtual SAN July 2014 Edition T E C H N I C A L M A R K E T I N G D O C U M E N T A T I O N Table of Contents Introduction... 2 Virtual SAN Design Goals... 3 Data

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

### White Paper. Optimizing the Performance Of MySQL Cluster

White Paper Optimizing the Performance Of MySQL Cluster Table of Contents Introduction and Background Information... 2 Optimal Applications for MySQL Cluster... 3 Identifying the Performance Issues.....

### Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

### Key Attributes for Analytics in an IBM i environment

Key Attributes for Analytics in an IBM i environment Companies worldwide invest millions of dollars in operational applications to improve the way they conduct business. While these systems provide significant

### Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

### Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,

### Enterprise Edition Analytic Data Warehouse Technology White Paper

Enterprise Edition Analytic Data Warehouse Technology White Paper August 2008 Infobright 47 Colborne Lane, Suite 403 Toronto, Ontario M5E 1P8 Canada www.infobright.com info@infobright.com Table of Contents

### CHAPTER - 5 CONCLUSIONS / IMP. FINDINGS

CHAPTER - 5 CONCLUSIONS / IMP. FINDINGS In today's scenario data warehouse plays a crucial role in order to perform important operations. Different indexing techniques has been used and analyzed using

### Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012

Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

### Big Data Processing with Google s MapReduce. Alexandru Costan

1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

### This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

### EMC Unified Storage for Microsoft SQL Server 2008

EMC Unified Storage for Microsoft SQL Server 2008 Enabled by EMC CLARiiON and EMC FAST Cache Reference Copyright 2010 EMC Corporation. All rights reserved. Published October, 2010 EMC believes the information

### Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

### Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More