PAXQuery: A Massively Parallel XQuery Processor
|
|
|
- Melissa Goodman
- 10 years ago
- Views:
Transcription
1 PAXQuery: A Massively Parallel XQuery Processor Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu To cite this version: Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu. PAXQuery: A Massively Parallel XQuery Processor. DanaC 14, Jun 2014, Snowbird, UT, United States. 2014, < / >. <hal > HAL Id: hal Submitted on 25 Nov 2014 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
2 PAXQuery: A Massively Parallel XQuery Processor Jesús Camacho-Rodríguez Université Paris-Sud & Inria, France [email protected] Ioana Manolescu Inria & Université Paris-Sud, France [email protected] Dario Colazzo Université Paris-Dauphine, France [email protected] ABSTRACT We present a novel approach for parallelizing the execution of queries over XML documents, implemented within our system PAXQuery. We compile a rich subset of XQuery into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data. Categories and Subject Descriptors H.2.4 [Database Management]: Systems General Terms Design, Performance, Experimentation 1. INTRODUCTION Increasingly large data volumes have lead to the apparition of massively parallel processing frameworks, such as MapReduce [7]. Its main advantage is to simplify parallel data processing and handle task allocation and fault tolerance transparently from the application. While the simplicity of MapReduce is an advantage, it is also a limitation, since large data processing tasks are represented by complex programs consisting of many Map and Reduce tasks. In particular, since these tasks are conceptually very simple, one often needs to write programs comprising many successive tasks, which limits parallelism. To overcome this problem, more powerful abstractions have appeared to express massively parallel complex data processing, such as the Parallelization Contracts programming model [2] (or PACT, in short). In a nutshell, PACT generalizes MapReduce by (i) manipulating records with any number of fields, instead of (key, value) pairs, (ii) enabling the definition of custom parallel operators by means of second-order functions, and (iii) allowing one parallel operator to receive as input the outputs This work was partially done when the author was with Université Paris-Sud and Inria. c ACM, This is the author s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in DanaC 14, June , Snowbird, UT, USA. XML documents XQuery XML algebra expression PACT plan Stratosphere system XQuery Data Model (XDM) Extended XQuery Data Model (EXDM) PACT Data Model XQuery results Figure 1: PAXQuery architecture overview. of several other such operators. The PACT model is part of the open-source Stratosphere platform [18]. In this work, we present PAXQuery, a massively parallel XQuery processor. Given a very large collection of XML documents, evaluating a query that navigates over these documents and also joins results from different documents raises performance challenges, which may be addressed by parallelism. Inspired by other high-level data analytics languages that are compiled into parallel frameworks [3, 10, 16], PAX- Query translates XML queries into PACT plans. The main advantage of this approach is implicit parallelism: neither theapplicationnortheuserneedtopartitionthexmlinput or the query across nodes. This contrasts with prior work [4, 12]. Further, we can rely on the Stratosphere platform for the optimization of the PACT plan and its automatic transformation into a data flow that is evaluated in parallel on top of the Hadoop Distributed File System (HDFS); these steps are explained in [2]. In the sequel, Section 2 describes PAXQuery architecture and main features in detail, and provides a beginning-to-end query translation example. Section 3 describes the experimental evaluation of our system. Section 4 discusses related work and Section 5 concludes. 2. PAXQUERY ARCHITECTURE Our approach for implicit parallel evaluation of XML queries is to translate them into PACT plans as depicted in Figure 1. The central vertical stack traces the query translation steps from the top to the bottom, while at the right of each step we show the data models manipulated by that step. We present each step of the translation below. 2.1 Query language PAXQuery supports a representative subset of XQuery[19], the W3C s standard query language for XML data, which has been recently enhanced with strongly requested features geared towards XML analytics. In particular, our
3 goal was to cover (i) the main navigational mechanisms of XQuery, and (ii) its key constructs to express analytical style queries e.g. aggregation, explicit grouping, and rich comparison predicates. Example. The following XQuery extracts the name of users, and the items of their auctions (if any): let $pc := collection( people ), $cc := collection( closed auctions ) for $p in $pc/site/people/person, $i in $p/@id let $n := $p/name let $r := for $c in $cc//closed_auction, $b in $c/buyer/@person, $s in $c/seller/@person let $a := $c/itemref where $i = $b or $i = $s return $a return <res>{$n,$r}</res> The above query shows some of the main features of our XQuery fragment. It includes FLWR expressions, which are powerful enough to express complex operations like iterating over sequences, joining multiple documents, and performing grouping. XPath paths start from the root of each document in a collection of XML documents, or from the bindings of a previouslyintroducedvariable; wesupportthexpath {/,//,[]} fragment [15]. In addition, our subset supports rich predicates expressed in disjunctive normal form (DNF), supporting value- and node identity-based comparisons. 2.2 XML query algebra Our approach is based on the representation of the XQuery query as an equivalent algebraic expression, on which multiple optimizations can be applied. XQuery translation into different algebra formalisms and the optimization of resulting expressions have been extensively studied [5, 8, 17]. A significant source of XQuery complexity comes from nesting: an XQuery expression can be nested in almost any position within another. In particular, nested queries challenge the optimizer, as straightforward translation into nested plans leads to very poor performance. Effective optimization techniques translate nested XQuery into unnested plans relying on joining and grouping [8, 13, 14]. Depending on the query shape, such decorrelating joins may be nested and/or outer joins. PAXQuery uses the algebra presented in [13], and translation from XQuery to this algebra is outlined in [1]. However, we can easily adapt to any XML query algebra used by existing engines that satisfies the following two assumptions: (i) the algebra is tuple-oriented (potentially using nested tuples), and (ii) the algebra is rich enough to support decorrelated (unnested) plans even for nested XQuery; in particular we consider that the query plan has been unnested [14] before we start translating it into PACT. 2.3 Algebraic representation of XQuery We introduce our algebraic representation of XQuery by the following example (see [13] for details): Example (continuation). The algebra plan for the XQuery introduced above is shown in Figure 2. The schemas of the tuples produced by each operator are denoted by S i. We discuss the operators starting from the leaves. The XML scan operators take as input the people (respectively closed auctions ) collection of XML documents and create a tuple out of each document in the collection. XQuery may perform navigation, which, in a nutshell, binds variables to the result of path traversals. Navigation is commonly represented through tree patterns, whose nodes e1 $pc:* site people $p: person n: $t $n: name S 5:=($pc,$p,$i,$t{$n}, S 2:=($pc,$p,$i,$t{$n}) navigation e1 S 1:=($pc) buyer $r{$cc, $c, $b, $s, $u{$a}}) scan( people ) e2 construct L $cc:* $c: closed auction seller nojoin l $i=$b $i=$s navigation e2 S 3:=($cc) scan( closed auct. ) n: $u $a: itemref S 4:=($cc,$c,$b,$s,$u{$a}) Figure 2: Logical plan for the example query. carry the labels appearing in the paths, and where some target nodes are also annotated with names of variables to be bound, e.g.$pc, $i etc. The algebra we consider allows for consolidating as many navigation operations from the same query as possible within a single navigation tree pattern, and in particular navigation performed outside of the for clauses. Large navigation patterns lead to more efficient query execution, since patterns can be matched very efficiently against XML documents [6]. Our algebra uses a navigation operator parameterized by an extended tree pattern (ETP) supporting multiple returning nodes, child and descendant axis, and nested and optional edges. ConsidertheETPe 1 infigure2. Theoperatornavigation e1 concatenates each input tuple successively with attributes (variable $i) resulting from the embeddings of e 1 in the value bound to $pc. The node labeled $n:name is (i) optional and (ii) nested with respect to its parent node $p:person, sincebyxquerysemantics: (i)ifagiven$placksa name, $p still contributes to the query result; (ii) all names for a given $p are bound into a single sequence. Observe that all name elements (variable $n) are nested into variable $t, which did not appear in the original query; in fact, $t is created by the XQuery to algebra translation to hold the nested collection with all values bound to $n. The operator navigation e2 is generated in a similar fashion. Therefore, in the previous query, ETPs e 1 and e 2 correspond to the following fragment: for $p in $pc/site/people/person, $i in $p/@id let $n := $p/name let $r := for $c in $cc//closed_auction, $b in $c/buyer/@person, $s in $c/seller/@person let $a := $c/itemref Above the navigation operators in Figure 2, we have a nested left outer join (nojoin l ρ ) on a disjunctive predicate ρ, which brings together people and the auctions they participated in, either as buyers or sellers. Observe that as the join is outer, all people are kept in the output, even if they did not participate in any auction. Finally, the XML construction (construct L) is responsible for transforming a collection of tuples to XML. For each tuple in its input, construct L builds one XML tree for each construction tree pattern in the list L attached to the operator. In our example, L contains a single pattern that generates for each tuple an XML tree consisting of elements of the form <res>{$n,$r}</res>.
4 Figure 3: (a) Map, (b) Reduce, (c) Cross, (d) Match, and (e) CoGroup parallelization contracts PACT model The PACT model [2] is a generalization of MapReduce, based on the concept of parallel data processing operators. PACT plans are DAGs of implicit parallel operators, that are optimized and translated into explicit parallel data flows by Stratosphere. Data model. PACT plans manipulate nested records of the form r = ((f 1,f 2,...,f n),(i 1,i 2,...,i k )) where 1 k n. The first component (f 1,f 2,...,f n) is an ordered sequence offieldsf i; inturn, afieldf i iseitheranatomicvalue(string) or a ordered sequence (r 1,...,r m) of records. The second component (i 1,i 2,...,i k ) is an ordered, possibly empty, sequence of record positions in [1...n] indicating the key fields for the record. Each of the key fields must be an atomic value. The key of a record r is the concatenation of all the key fields f i1,f i2,...,f ik. Processing model. Data sources and sinks are, respectively, the starting and terminal nodes of a PACT plan. The input data is stored in files; a function parameterizing the data source specifies how to structure the data into records. In turn, data is output into files, with the destination and format similarly controlled by an output function. The rest of data processing nodes in a PACT plan are operators. An operator manipulates bags of records. Its semantics is defined by (i) a parallelization contract, which determines how input records are organized into groups; and (ii) a user function (or UF) that is executed independently over each bag (group) of records created by the parallelization contract (these executions can take place in parallel). Although the PACT model allows creating custom parallelization contracts, a set of them for the most common cases is built-in: Map, Reduce, Cross, Match, and CoGroup (see Figure 3). The Map contract forms an individual group for every input record. The Reduce contract forms a group for every unique value of the key attribute in the input data set, and the group contains all records with that key value. The Cross, Match, and CoGroup contracts are used to define binary operators. The Cross contract forms a group from every pair of records in its two inputs (essentially, it produces the Cartesian product of the two input bags). The Match contract forms a group from every pair of records in its two inputs, only if the records have the same value for the key attribute. Finally, the CoGroup contract forms a group for every value of the key attribute (from the domains of both inputs), and places each record in the appropriate group depending on the key value of the record. 2.5 From algebra expressions to PACT plans We describe now how PAXQuery translates XML algebra expressions into PACT plans. First, out of nested tuples containing XML data instances (trees with identity), we create PACT nested records. Second, we translate XQuery 1 Figure reproduced from [11]. Table 1: Algebra to PACT overview. Algebra operators PACT operators (#) Scan Source (1) Construct Sink (1) Navigation Map (1) Group-by Reduce (1) Flatten Map (1) Selection Map (1) Projection Map (1) Aggregation (on nested field) Map (1) Aggregation (on top-level field) Reduce (1) Duplicate elimination Reduce (1) Cartesian product Cross (1) Inner Match (1) Conjunctive equi-join Outer CoGroup (1) Nested outer CoGroup (1) Inner Match (n) Disjunctive equi-join Outer CoGroup (n) & Reduce (1) (n conjunctions) Nested outer CoGroup (n) & Reduce (1) Inner Cross (1) Theta-join Outer Cross (1) & Reduce (1) Nested outer Cross (1) & Reduce (1) K 1 := (#5) K 2 := (#5) K 1 := (#5) K 2 := (#7) ρ := #5=#5 or #5=#7 K 3 := (#0,#2,#4) XML sink L Reduce nopostl K 3 K 3 CoGroup nopnjoinl (ρ,0) CoGroup nopnjoinl (ρ,1) K K 2 1 K 1 K 2 Map navigation(e 1 ) Map navigation(e 2 ) XML source( people ) XML source( closed auct. ) Figure 4: PACT plan corresponding to the logical expression in Figure 2. algebraic expressions into PACT plans. Details about the former are omitted in this presentation, for space reason, while we focus in the latter. Table 1 depicts the supported algebra operators and the contracts used by the PACT operators resulting from our translation. First, observe that the scan, respectively construct, functionality is integrated into a source, respectively sink, in the PACT plan. In turn, unary operators use Map and Reduce contracts depending on their semantics; the implementation of their UFs is in most of the cases straightforward. Finally, the translation of the binary operators is more complex, as they have to deal efficiently with the nested and/or outer nature of some joins, which may result in multiple operators at the PACT level. We illustrate this more elaborated translation with the following example. Example (continuation). Consider the algebra plan in Figure 2. PAXQuery generates the PACT program shown infigure4; eachk i containsthepositionsofthekeyfieldsin records. The XML source operators scan (in parallel) the respective collections and transform each document into a
5 700 1 node, 34 GB nodes, 68 GB 4 nodes, 136 GB nodes, 272 GB q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 Execution time (s) 700 BaseX SaxonPE Qizx/open PAXQuery q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 Figure 5: PAXQuery scalability evaluation (left) and comparison with centralized processors (right). Execution time (s) record. Next, the Map operators apply the navigation UF in parallel on the records thus obtained, following the query s XPath expressions. The nested outer join is translated into two CoGroup operators and a post-processing Reduce. The core difficulty to address by our translation is to correctly express (i) the disjunction in the where clause of the query, and (ii) the outerjoin semantics (recall that in this example a <res> element must be output even for people with no auctions). The main feature of the nopnjoin l (ρ,k) UF associated to each CoGroup is to guarantee that no erroneous duplicates are generated when the parallel evaluation of more than one conjunctive predicate is true for a certain record. The Reduce operator groups all the results of the previous CoGroup operators having the same left hand-side record, and then the nopost l UF associated to it is applied to produce the final result for the join. Finally, the XML sink builds and returns XML results. Clearly, complex joins such as the one contained in the example could be translated using a single Cross operator instead of multiple CoGroup. However, this would be less efficient and scale poorly (number of comparisons quadratic in the input size), as our experiments demonstrate. 2.6 Optimization and execution by Stratosphere In the last step, PAXQuery sends the PACT plan to the Stratosphere platform, which optimizes it, and turns it into a data flow that is evaluated in parallel, as explained in [2]. 3. IMPLEMENTATION AND EXPERIMENTS PAXQuery has been implemented in Java 1.6, and it relies on the Stratosphere platform [18] for the execution. We present here some experimental results we obtained with it. PAXQuery scalability was studied by fixing a set of 14 queries, generating documents (34GB) per node, and varying the number of nodes from 1 to 2, 4, 8 respectively; the total dataset size increases accordingly in a linear fashion, up to 272GB. Figure 5 (left) shows the response times for each query. Our results show that the translation to PACT allows PAXQuery to parallelize every query execution step with no effort required to partition, redistribute data etc., and thus to scale out with the number of machines in a cluster. The only case where scale-up was not so good is q 14 where we used a Cross to translate an inequality join; this highlights the interest of the more efficient operators (especially CoGroup) used in the other plans. Figure 5 compares PAXQuery with XQuery processors in a centralized setting (one single node, 34 GB of data). None of the competing processors was able to evaluate any of our queries with joins across documents (q 9-q 14), as they run out of memory or out of time, highlighting the need for efficient parallel platforms for evaluating such queries on such large document collections. 4. RELATED WORK MRQL [9] proposes a simple SQL-like XML query language implemented through a few operators directly compilable into MapReduce. MRQL queries may be nested, however, its dialect does not allow for expressing the rich join flavours that we use. Further, the XML navigation supported by MRQL is limited to XPath, in contrast to our richer navigation based on tree patterns with multiple returning nodes, and nested and optional edges. ChuQL [12] is an XQuery extension that exposes the MapReduce framework to the developer in order to distribute computations among XQuery engines; this leaves the parallelization work to the programmer, in contrast with our implicit query parallelization approach. 5. CONCLUSION AND FUTURE WORK We have presented the PAXQuery approach for the implicit parallelization of XQuery through translation to PACT. Our experiments demonstrate the efficiency and scalability ofpaxquery. Itisourplantoopen-sourcethesysteminthe near future. In the future, we contemplate the integration of indexing techniques into PAXQuery, and reusing intermediary results in the PACT framework to enable efficient multiple-query processing. Acknowledgements. This work has been partially funded by the KIC EIT ICT Labs activity REFERENCES [1] A. Arion, V. Benzaken, I. Manolescu, Y. Papakonstantinou, and R. Vijay. Algebra-Based identification of tree patterns in XQuery. In FQAS, [2] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In SoCC, [3] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A Scripting Language for Large Scale Semistructured Data Analysis. PVLDB, [4] N. Bidoit, D. Colazzo, N. Malla, F. Ulliana, M. Nolè, and C. Sartiani. Processing XML queries and updates on map/reduce clusters. In EDBT, [5] P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In SIGMOD, [6] Y. Chen, S. B. Davidson, and Y. Zheng. An efficient XPath query processor for XML streams. In ICDE, [7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, [8] A. Deutsch, Y. Papakonstantinou, and Y. Xu. The NEXT Logical Framework for XQuery. In VLDB, [9] L. Fegaras, C. Li, U. Gupta, and J. Philip. XML Query Optimization in Map-Reduce. In WebDB, [10] A. Heise, A. Rheinländer, M. Leich, U. Leser, and F. Naumann. Meteor/Sopremo: An Extensible Query Language and Operator Model. In BIGDATA, [11] F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, [12] S. Khatchadourian, M. P. Consens, and J. Siméon. Having a ChuQL at XML on the cloud. In AMW, [13] I. Manolescu, Y. Papakonstantinou, and V. Vassalos. XML Tuple Algebra. In Encyclopedia of Database Systems [14] N. May, S. Helmer, and G. Moerkotte. Strategies for query unnesting in XML databases. TODS, [15] G. Miklau and D. Suciu. Containment and equivalence for an XPath fragment. In PODS, [16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, [17] C. Re, J. Siméon, and M. F. Fernández. A Complete and Efficient Algebraic Compiler for XQuery. In ICDE, [18] Stratosphere Platform. [19] XQuery 3.0: An XML Query Language, October 2013.
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
Mobility management and vertical handover decision making in heterogeneous wireless networks
Mobility management and vertical handover decision making in heterogeneous wireless networks Mariem Zekri To cite this version: Mariem Zekri. Mobility management and vertical handover decision making in
ibalance-abf: a Smartphone-Based Audio-Biofeedback Balance System
ibalance-abf: a Smartphone-Based Audio-Biofeedback Balance System Céline Franco, Anthony Fleury, Pierre-Yves Guméry, Bruno Diot, Jacques Demongeot, Nicolas Vuillerme To cite this version: Céline Franco,
QASM: a Q&A Social Media System Based on Social Semantics
QASM: a Q&A Social Media System Based on Social Semantics Zide Meng, Fabien Gandon, Catherine Faron-Zucker To cite this version: Zide Meng, Fabien Gandon, Catherine Faron-Zucker. QASM: a Q&A Social Media
A graph based framework for the definition of tools dealing with sparse and irregular distributed data-structures
A graph based framework for the definition of tools dealing with sparse and irregular distributed data-structures Serge Chaumette, Jean-Michel Lepine, Franck Rubi To cite this version: Serge Chaumette,
Big Data looks Tiny from the Stratosphere
Volker Markl http://www.user.tu-berlin.de/marklv [email protected] Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type
A usage coverage based approach for assessing product family design
A usage coverage based approach for assessing product family design Jiliang Wang To cite this version: Jiliang Wang. A usage coverage based approach for assessing product family design. Other. Ecole Centrale
VR4D: An Immersive and Collaborative Experience to Improve the Interior Design Process
VR4D: An Immersive and Collaborative Experience to Improve the Interior Design Process Amine Chellali, Frederic Jourdan, Cédric Dumas To cite this version: Amine Chellali, Frederic Jourdan, Cédric Dumas.
Additional mechanisms for rewriting on-the-fly SPARQL queries proxy
Additional mechanisms for rewriting on-the-fly SPARQL queries proxy Arthur Vaisse-Lesteven, Bruno Grilhères To cite this version: Arthur Vaisse-Lesteven, Bruno Grilhères. Additional mechanisms for rewriting
Online vehicle routing and scheduling with continuous vehicle tracking
Online vehicle routing and scheduling with continuous vehicle tracking Jean Respen, Nicolas Zufferey, Jean-Yves Potvin To cite this version: Jean Respen, Nicolas Zufferey, Jean-Yves Potvin. Online vehicle
Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
Study on Cloud Service Mode of Agricultural Information Institutions
Study on Cloud Service Mode of Agricultural Information Institutions Xiaorong Yang, Nengfu Xie, Dan Wang, Lihua Jiang To cite this version: Xiaorong Yang, Nengfu Xie, Dan Wang, Lihua Jiang. Study on Cloud
Minkowski Sum of Polytopes Defined by Their Vertices
Minkowski Sum of Polytopes Defined by Their Vertices Vincent Delos, Denis Teissandier To cite this version: Vincent Delos, Denis Teissandier. Minkowski Sum of Polytopes Defined by Their Vertices. Journal
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore [email protected] Abstract. Processing XML queries over
Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis
Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am
Faut-il des cyberarchivistes, et quel doit être leur profil professionnel?
Faut-il des cyberarchivistes, et quel doit être leur profil professionnel? Jean-Daniel Zeller To cite this version: Jean-Daniel Zeller. Faut-il des cyberarchivistes, et quel doit être leur profil professionnel?.
Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.
Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Global Identity Management of Virtual Machines Based on Remote Secure Elements
Global Identity Management of Virtual Machines Based on Remote Secure Elements Hassane Aissaoui, P. Urien, Guy Pujolle To cite this version: Hassane Aissaoui, P. Urien, Guy Pujolle. Global Identity Management
Enhancing Massive Data Analytics with the Hadoop Ecosystem
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3, Issue 11 November, 2014 Page No. 9061-9065 Enhancing Massive Data Analytics with the Hadoop Ecosystem Misha
Flauncher and DVMS Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically
Flauncher and Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically Daniel Balouek, Adrien Lèbre, Flavien Quesnel To cite this version: Daniel Balouek,
Ontology construction on a cloud computing platform
Ontology construction on a cloud computing platform Exposé for a Bachelor's thesis in Computer science - Knowledge management in bioinformatics Tobias Heintz 1 Motivation 1.1 Introduction PhenomicDB is
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R Jose Luis Lopez Pino [email protected] Database Systems and Information Management Technische Universität Berlin Supervised by Volker Markl Advised
GDS Resource Record: Generalization of the Delegation Signer Model
GDS Resource Record: Generalization of the Delegation Signer Model Gilles Guette, Bernard Cousin, David Fort To cite this version: Gilles Guette, Bernard Cousin, David Fort. GDS Resource Record: Generalization
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
ANIMATED PHASE PORTRAITS OF NONLINEAR AND CHAOTIC DYNAMICAL SYSTEMS
ANIMATED PHASE PORTRAITS OF NONLINEAR AND CHAOTIC DYNAMICAL SYSTEMS Jean-Marc Ginoux To cite this version: Jean-Marc Ginoux. ANIMATED PHASE PORTRAITS OF NONLINEAR AND CHAOTIC DYNAMICAL SYSTEMS. A.H. Siddiqi,
Aligning subjective tests using a low cost common set
Aligning subjective tests using a low cost common set Yohann Pitrey, Ulrich Engelke, Marcus Barkowsky, Romuald Pépion, Patrick Le Callet To cite this version: Yohann Pitrey, Ulrich Engelke, Marcus Barkowsky,
A model driven approach for bridging ILOG Rule Language and RIF
A model driven approach for bridging ILOG Rule Language and RIF Valerio Cosentino, Marcos Didonet del Fabro, Adil El Ghali To cite this version: Valerio Cosentino, Marcos Didonet del Fabro, Adil El Ghali.
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Cobi: Communitysourcing Large-Scale Conference Scheduling
Cobi: Communitysourcing Large-Scale Conference Scheduling Haoqi Zhang, Paul André, Lydia Chilton, Juho Kim, Steven Dow, Robert Miller, Wendy E. Mackay, Michel Beaudouin-Lafon To cite this version: Haoqi
Managing Risks at Runtime in VoIP Networks and Services
Managing Risks at Runtime in VoIP Networks and Services Oussema Dabbebi, Remi Badonnel, Olivier Festor To cite this version: Oussema Dabbebi, Remi Badonnel, Olivier Festor. Managing Risks at Runtime in
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
Expanding Renewable Energy by Implementing Demand Response
Expanding Renewable Energy by Implementing Demand Response Stéphanie Bouckaert, Vincent Mazauric, Nadia Maïzi To cite this version: Stéphanie Bouckaert, Vincent Mazauric, Nadia Maïzi. Expanding Renewable
Comparison of Distributed Data-Parallelization Patterns for Big Data Analysis: A Bioinformatics Case Study
Comparison of Distributed Data-Parallelization Patterns for Big Data Analysis: A Bioinformatics Case Study Jianwu Wang, Daniel Crawl, Ilkay Altintas San Diego Supercomputer Center University of California,
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Overview of model-building strategies in population PK/PD analyses: 2002-2004 literature survey.
Overview of model-building strategies in population PK/PD analyses: 2002-2004 literature survey. Céline Dartois, Karl Brendel, Emmanuelle Comets, Céline Laffont, Christian Laveille, Brigitte Tranchand,
An update on acoustics designs for HVAC (Engineering)
An update on acoustics designs for HVAC (Engineering) Ken MARRIOTT To cite this version: Ken MARRIOTT. An update on acoustics designs for HVAC (Engineering). Société Française d Acoustique. Acoustics 2012,
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation
Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation Rajasekar Krishnamurthy University of Wisconsin [email protected] Raghav Kaushik Microsoft Corporation [email protected]
Big Data Begets Big Database Theory
Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Data Management in the Cloud
Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
Block-o-Matic: a Web Page Segmentation Tool and its Evaluation
Block-o-Matic: a Web Page Segmentation Tool and its Evaluation Andrés Sanoja, Stéphane Gançarski To cite this version: Andrés Sanoja, Stéphane Gançarski. Block-o-Matic: a Web Page Segmentation Tool and
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Deferred node-copying scheme for XQuery processors
Deferred node-copying scheme for XQuery processors Jan Kurš and Jan Vraný Software Engineering Group, FIT ČVUT, Kolejn 550/2, 160 00, Prague, Czech Republic [email protected], [email protected] Abstract.
Big Data Analytics. Chances and Challenges. Volker Markl
Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD
IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs
IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs Thomas Durieux, Martin Monperrus To cite this version: Thomas Durieux, Martin Monperrus. IntroClassJava: A Benchmark of 297 Small and Buggy
The PigMix Benchmark on Pig, MapReduce, and HPCC Systems
2015 IEEE International Congress on Big Data The PigMix Benchmark on Pig, MapReduce, and HPCC Systems Keren Ouaknine 1, Michael Carey 2, Scott Kirkpatrick 1 {ouaknine}@cs.huji.ac.il 1 School of Engineering
The truck scheduling problem at cross-docking terminals
The truck scheduling problem at cross-docking terminals Lotte Berghman,, Roel Leus, Pierre Lopez To cite this version: Lotte Berghman,, Roel Leus, Pierre Lopez. The truck scheduling problem at cross-docking
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Use of tabletop exercise in industrial training disaster.
Use of tabletop exercise in industrial training disaster. Alexis Descatha, Thomas Loeb, François Dolveck, Nathalie-Sybille Goddet, Valerie Poirier, Michel Baer To cite this version: Alexis Descatha, Thomas
Towards Unified Tag Data Translation for the Internet of Things
Towards Unified Tag Data Translation for the Internet of Things Loïc Schmidt, Nathalie Mitton, David Simplot-Ryl To cite this version: Loïc Schmidt, Nathalie Mitton, David Simplot-Ryl. Towards Unified
Technologies for a CERIF XML based CRIS
Technologies for a CERIF XML based CRIS Stefan Bärisch GESIS-IZ, Bonn, Germany Abstract The use of XML as a primary storage format as opposed to data exchange raises a number of questions regarding the
Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich
Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop
The Stratosphere platform for big data analytics
The VLDB Journal DOI 10.1007/s00778-014-0357-y REGULAR PAPER The Stratosphere platform for big data analytics Alexander Alexandrov Rico Bergmann Stephan Ewen Johann-Christoph Freytag Fabian Hueske Arvid
Information Technology Education in the Sri Lankan School System: Challenges and Perspectives
Information Technology Education in the Sri Lankan School System: Challenges and Perspectives Chandima H. De Silva To cite this version: Chandima H. De Silva. Information Technology Education in the Sri
Hadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
ASTERIX: An Open Source System for Big Data Management and Analysis (Demo) :: Presenter :: Yassmeen Abu Hasson
ASTERIX: An Open Source System for Big Data Management and Analysis (Demo) :: Presenter :: Yassmeen Abu Hasson ASTERIX What is it? It s a next generation Parallel Database System to addressing today s
Principles of Distributed Data Management in 2020?
Principles of Distributed Data Management in 2020? Patrick Valduriez To cite this version: Patrick Valduriez. Principles of Distributed Data Management in 2020?. DEXA 11: International Conference on Databases
HadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Chapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
ControVol: A Framework for Controlled Schema Evolution in NoSQL Application Development
ControVol: A Framework for Controlled Schema Evolution in NoSQL Application Development Stefanie Scherzinger, Thomas Cerqueus, Eduardo Cunha de Almeida To cite this version: Stefanie Scherzinger, Thomas
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System
Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
JackHare: a framework for SQL to NoSQL translation using MapReduce
DOI 10.1007/s10515-013-0135-x JackHare: a framework for SQL to NoSQL translation using MapReduce Wu-Chun Chung Hung-Pin Lin Shih-Chang Chen Mon-Fong Jiang Yeh-Ching Chung Received: 15 December 2012 / Accepted:
Caching XML Data on Mobile Web Clients
Caching XML Data on Mobile Web Clients Stefan Böttcher, Adelhard Türling University of Paderborn, Faculty 5 (Computer Science, Electrical Engineering & Mathematics) Fürstenallee 11, D-33102 Paderborn,
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
Damia: Data Mashups for Intranet Applications
Damia: Data Mashups for Intranet Applications David E. Simmen, Mehmet Altinel, Volker Markl, Sriram Padmanabhan, Ashutosh Singh IBM Almaden Research Center 650 Harry Road San Jose, CA 95120, USA {simmen,
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Territorial Intelligence and Innovation for the Socio-Ecological Transition
Territorial Intelligence and Innovation for the Socio-Ecological Transition Jean-Jacques Girardot, Evelyne Brunau To cite this version: Jean-Jacques Girardot, Evelyne Brunau. Territorial Intelligence and
An Automatic Reversible Transformation from Composite to Visitor in Java
An Automatic Reversible Transformation from Composite to Visitor in Java Akram To cite this version: Akram. An Automatic Reversible Transformation from Composite to Visitor in Java. CIEL 2012, P. Collet,
P2Prec: A Social-Based P2P Recommendation System
P2Prec: A Social-Based P2P Recommendation System Fady Draidi, Esther Pacitti, Didier Parigot, Guillaume Verger To cite this version: Fady Draidi, Esther Pacitti, Didier Parigot, Guillaume Verger. P2Prec:
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. [email protected] Mrs.
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
GEO DISTRIBUTED PARALLELIZATION PACTS IN MAP REDUCE FRAMEWORK IN MULTIPLE DATACENTER
GEO DISTRIBUTED PARALLELIZATION PACTS IN MAP REDUCE FRAMEWORK IN MULTIPLE DATACENTER C.Kirubanantham 1, C.Rajavenkateswaran 2 1 Student, computer science engineering, Nandha College of Technology, India
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Recap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1
ecap Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo PC enables programmers to call functions in remote processes. IDL (Interface Definition Language)
Distributed High-Dimensional Index Creation using Hadoop, HDFS and C++
Distributed High-Dimensional Index Creation using Hadoop, HDFS and C++ Gylfi ór Gumundsson, Laurent Amsaleg, Björn ór Jónsson To cite this version: Gylfi ór Gumundsson, Laurent Amsaleg, Björn ór Jónsson.
