Uncertain Version Control in Open Collaborative Editing of Tree-Structured Documents




M. Lamine Ba
Institut Mines-Télécom; Télécom ParisTech; LTCI
Paris, France
mouhamadou.ba@telecom-paristech.fr

Talel Abdessalem
Institut Mines-Télécom; Télécom ParisTech; LTCI
Paris, France
talel.abdessalem@telecom-paristech.fr

Pierre Senellart
Télécom ParisTech & The University of Hong Kong
Paris, France & Hong Kong
pierre.senellart@telecom-paristech.fr

ABSTRACT
In order to ease content enrichment, exchange, and sharing, web-scale collaborative platforms such as Wikipedia or Google Docs enable unbounded interactions between a large number of contributors, without prior knowledge of their level of expertise and reliability. Version control is then essential for keeping track of the evolution of the shared content and its provenance. In such environments, uncertainty is ubiquitous due to the unreliability of the sources, the incompleteness and imprecision of the contributions, the possibility of malicious editing and vandalism acts, etc. To handle this uncertainty, we use a probabilistic XML model as a basic component of our version control framework. Each version of a shared document is represented by an XML tree and the whole document, together with its different versions, is modeled as a probabilistic XML document. Uncertainty is evaluated using the probabilistic model and the reliability measure associated to each source, each contributor, or each editing event, resulting in an uncertainty measure on each version and each part of the document. We show that standard version control operations can be implemented directly as operations on the probabilistic XML model; efficiency with respect to deterministic version control systems is demonstrated on real-world datasets.

Categories and Subject Descriptors
H.2.1 [Database Management]: Logical Design: Data models; I.7.1 [Document and Text Processing]: Document and Text Editing: Version control

Keywords
XML, collaborative work, uncertain data, version control

1. INTRODUCTION
Version Control in Open Environments.
In many collaborative editing systems, where several users can provide content, content management is based on version control. A version control system tracks the versions of the content as well as changes. Such a system enables fixing errors made in the revision process, querying past versions, and integration of content from different contributors. As surveyed in [12,26], much effort related to version control has been carried out both in research and in applications. The prime applications were collaborative document authoring processes, computer-aided design, and software development systems. Currently, powerful version control tools, such as Subversion [18] and Git [15], efficiently manage large source code repositories and shared filesystems. However, existing approaches leave no room for uncertainty handling, for instance, uncertain data resulting from conflicts. Conflicts are common in collaborative editing tasks, in particular in an open environment. They arise whenever concurrent edits attempt to change the same content. As a result, conflicts introduce some ambiguities in content change management. But sources of uncertainties in the version control process are not only due to conflicts.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. DocEng '13, September 10-13, 2013, Florence, Italy. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-1770-2/13/09...$15.00. http://dx.doi.org/10.1145/2494266.2494277.
Indeed, there are inherently uncertain applications using version control, such as web-scale collaborative platforms: platforms such as Wikipedia [6] or Google Docs [2] enable unbounded interactions between a large number of contributors, without prior knowledge of their level of expertise and reliability. In these systems, version control is used for keeping track of the evolution of the shared content and its provenance. In such environments, uncertainty is ubiquitous due to the unreliability of the sources, the incompleteness and imprecision of the contributions, the possibility of malicious editing and vandalism acts, etc. Therefore, a version control technique able to properly manipulate uncertain data may be very helpful in this kind of application. We detail application scenarios next.

Uncertainty in Wikipedia Versions. Some web-scale collaborative systems such as Wikipedia have no write-access restrictions over documents. As a result, multi-version documents include data from different users. As shown in [38], Wikipedia has seen exponential growth in the number of contributors and of edits per article. The open and free features lead to contributions with variable reliability and consistency depending both on the contributors' expertise (e.g., novice or expert) and the scope of the debated subjects. At the same time, edit wars, malicious contributions like spam, and vandalism acts can happen at any time during document evolution. Therefore, the integrity and the quality of each article may be strongly altered. Suggested solutions to these critical

issues are reviewing access policies for articles discussing hot topics, or quality-driven solutions based on the reputations of authors, statistics on frequency of content change, or the trust a given reader has in the information [10,20,29]. But restricting edits on Wikipedia articles to a certain group of privileged contributors does not suppress the necessity of representing and assessing uncertainties. Indeed, edits may be incomplete, imprecise or uncertain, showing partial views, misinformation or subjective opinions. The reputation of contributors or the confidence level on sources are useful information towards a quantitative evaluation of the quality of versions and even more of each atomic contribution. However, a prior efficient representation of uncertainty across document versions remains a prerequisite.

User Preference at Visualization Time. Filtering and visualizing content are also important features in collaborative environments. In Wikipedia, users are not only contributors, but also consumers, interested in searching and reading information on multi-version articles. Current systems constrain the users to visualize either the latest revision of a given article, even though it may not be the most relevant, or the version at a specific date. Users, especially in universal knowledge management platforms like Wikipedia, may want to easily access more relevant versions or those of authors whom they trust. Filtering unreliable content is one of the benefits of our approach. It can be achieved easily by hiding the contributions of the offending source, for instance when a vandalism act is detected, or at query time to fit user preferences and trust in the contributors. Alternatively, to deal with misinformation, it seems useful to provide versions to users with information about their amount of uncertainty and the uncertainty of each part of their content.
Last but not least, users at visualization time should be able to search for a document representing the outcome of combining parts (e.g., some of them might be incomplete, imprecise, and even uncertain taken apart) from different versions. We demonstrate in [7] an application of these new modes of interaction to Wikipedia revisions: an article is no longer considered as the last valid revision, but as a merge of all possible (uncertain) revisions.

Approach. Since version control is essential in uncertain web-scale collaborative systems, representing and evaluating uncertainties throughout data version management becomes crucial for enhancing collaboration and for overcoming problems such as conflict resolution and information reliability management. In this paper, we propose an uncertain XML version control model tailored to multi-version tree-structured documents in open collaborative editing contexts. Data, that is, office documents, HTML or XHTML documents, structured Wiki formats, etc., manipulated within the given application scenarios are tree-like or can be easily translated into this form; XML is a natural encoding for tree-structured data. Work related to XML version control has focused on change detection [17,21,27,32,39]. Only some, for instance [31,33,35], have proposed an extensive semi-structured data model aware of version control; see Section 6 for details. Uncertainty management in XML has received great attention in the probabilistic database community, especially for data integration purposes. A set of elaborate uncertain (probabilistic) XML data models [9,22,30,37], with several distinct semantics of probability distributions over data items, has been proposed. [9] and [22] follow a general probabilistic XML representation system defining the concept of probabilistic documents (abbr. p-documents) which generalizes previously proposed uncertain XML models. In our model, we handle uncertain data through a probabilistic XML model as a basic component of our version control framework. Each version of a shared document is represented by an XML tree.
At the abstract level, we consider a multi-version XML document with uncertain data based on random events, XML edit scripts attached to them and a directed acyclic graph of these events. For a concrete representation, the whole document, with its different versions, is modeled as a probabilistic XML document representing an XML tree whose edges are annotated by propositional formulas over random events. Each propositional formula models both the semantics of uncertain edits (insertion and deletion) performed over a given part of the document and its provenance in the version control process. Uncertainty is evaluated using the probabilistic model and the reliability measure associated to each source, each contributor, or each editing event, resulting in an uncertainty measure on each version and each part of the document. The directed acyclic graph of random events maintains the history of document evolution by keeping track of its different states and their derivation relationships. As the last major contribution of this paper, we show that standard version control operations, in particular the update operation, can be implemented directly as operations on the probabilistic XML model; efficiency with respect to deterministic version control systems like Git and Subversion is demonstrated on real-world datasets.

Outline. After some preliminaries in Section 2, we review the probabilistic XML model we use in Section 3. We detail the proposed probabilistic XML version control model and some strong properties thereof in Section 4. In Section 5, we demonstrate the efficiency of our model with respect to deterministic version control systems through measures on real-world datasets, and we describe some of the content filtering capabilities (cf. Section 5.2) of our approach. Finally, we review some related work in Section 6.
Initial ideas leading to this work were presented as a PhD workshop article in [13]; the description of the model, with translations of version control operations into operations on the probabilistic XML model, proofs of translation correctness, and experimental validation, are fully novel.

2. PRELIMINARIES
In this section, we present some basic version control notions and the semi-structured XML document model underlying our proposal. A multi-version document refers to a set of versions of the same document handled within a version control process. Each version of the document represents a given state (instance) of the evolution of this versioned document. A typical version control model is built on the following common notions.

Document version. A version is a conventional term that refers to a document copy in document-oriented version control systems. The different versions of a document are related by derivation operations. A derivation consists of creating a new version by first copying a previously existing one before performing modifications. Some versions, representing variants, are in a derivation relationship with the same origin. The variants (parallel versions) characterize a nonlinear editing history with several distinct branches of the same multi-version document. In this history, a branch is a

linear sequence of versions. Instead of storing the complete content of each version, most version control approaches only maintain diffs between states, together with meta-information on states. These states (or commits in the Git world [15]) model different sets of changes that are explicitly validated at distinct stages of the version control process. A state also comes with information about the context (e.g., author, date, comment) in which these modifications are done. As a consequence, each version depends on the complete history leading up to a given state. We will follow here the same approach for modeling the different versions of a document within our framework.

Version Space. Since the content of each version is not fully saved, there must be a way to retrieve it when needed. The version space represents the editing history over a versioned document (e.g., wiki version history as given in [34]). It maintains necessary information related to the versions and their derivations. As mentioned above, a derivation relationship implies at least one input version (several incoming versions for merge operations) and an output version. Based on this, we model, similarly to [15], the version space of any multi-version document as a directed acyclic graph.

Unordered XML Tree Documents. Our motivating applications handle mostly tree-structured data. As a result, we consider data as unordered XML trees. Note that the proposed model can be extended to ordered trees (this may require restricting the set of valid versions to those complying with a specific order; we leave the details for future work); we choose unordered trees for convenience of exposition given that in many cases order is unimportant. Let us assume a finite set $L$ of strings (i.e., labels or text data) and a finite set $I$ of identifiers such that $L \cap I = \emptyset$. In addition, let $\Phi$ and $\alpha$ be respectively a labeling function and an identifying function.
Formally, we define an XML document as an unordered, labeled tree $T$ over identifiers in $I$ with $\alpha$ and $\Phi$ mapping each node $x \in T$ respectively to a unique identifier $\alpha(x) \in I$ and to a string $\Phi(x) \in L$. The tree is unranked, i.e., the number of children of each node in $T$ is not assumed to be fixed. Given an XML tree $T$, we define $\Phi(T)$ and $\alpha(T)$ as respectively the set of its node strings and the set of its node identifiers. For simplicity, we will assume all trees have the same root node (same label, same identifier).

[Figure 1: Example XML tree T: Wikipedia article. Nodes, given as label [identifier]: article [1], title [2], para [3], sect [4], article-title [10], text 1 [11], title [12], para [13], sect-title [19], text 2 [20].]

Example 2.1. Figure 1 depicts an XML tree $T$ representing a typical Wikipedia article. The node identifiers are inside square brackets below node strings. The title of this article is given in node 10. The content of the document is structured in sections ("sect") with their titles and paragraphs ("para") containing the text data.

XML Edit Script. Based on unique identifiers, we consider two basic edit operations over the specified XML document model: node insertions and deletions. We denote an insertion by $\mathrm{ins}(i, x)$, whose semantics over any XML tree consists of inserting node $x$ (we suppose $x$ is not already in the tree) as a child of a certain node $y$ satisfying $\alpha(y) = i$. If such a node is not found in the tree, the operation does nothing. Note that an insertion can concern a subtree, and in this case we simply refer with $x$ to the root of this subtree. Similarly, we introduce a deletion as $\mathrm{del}(i)$ where $i$ is the identifier of the node to suppress. The delete operation removes the targeted node, if it exists, together with its descendants, from the XML tree. We conclude by defining an XML edit script, $\Delta = \langle u_1, u_2, \ldots, u_i \rangle$, as a sequence of a certain number of elementary edit operations $u_j$ (each $u_j$, with $1 \le j \le i$, being either an insertion or a deletion) to carry out one after the other on an XML document for producing a new one. Given a tree $T$, we denote the outcome of applying an edit script $\Delta$ over $T$ by $[T]^{\Delta}$.
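The edit-script semantics defined above can be illustrated with a minimal Python sketch. The flat dictionary encoding of a tree (identifier mapped to a label and a parent identifier), the function names, and the small example tree are assumptions made for illustration only; they are not the authors' implementation.

```python
from copy import deepcopy

# Assumed encoding: a tree is a dict mapping a node identifier to
# (label, parent identifier); the root has parent None.

def ins(tree, i, x, label):
    """ins(i, x): insert node x as a child of the node identified by i.
    Does nothing if i is not in the tree (or if x is already present)."""
    if i in tree and x not in tree:
        tree[x] = (label, i)

def delete(tree, i):
    """del(i): remove node i together with its descendants, if it exists."""
    if i not in tree:
        return
    children = [n for n, (_, parent) in tree.items() if parent == i]
    for child in children:
        delete(tree, child)
    del tree[i]

def apply_script(tree, script):
    """[T]^Delta: apply the operations of a script in order, on a copy of T."""
    result = deepcopy(tree)
    for op, *args in script:
        if op == "ins":
            ins(result, *args)
        elif op == "del":
            delete(result, *args)
    return result

# Hypothetical tree and script (identifiers chosen for this example only).
T = {1: ("article", None), 4: ("sect", 1), 13: ("para", 4), 20: ("text", 13)}
DELTA = [("ins", 4, 21, "para"),   # new paragraph under node 4
         ("del", 13),              # removes node 13 and its child 20
         ("ins", 99, 22, "para")]  # parent 99 does not exist: no effect
NEW_T = apply_script(T, DELTA)
```

Working on a copy keeps the input version intact, which mirrors the notion of derivation: a new version is produced from an existing one without destroying it.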
Even though in this work we rely on persistent identifiers on tree nodes to define edit operations, the semantics of these operations could be extended to updates expressed by queries, especially useful in distributed collaborative editing environments where identifiers may not be straightforward to share.

3. PROBABILISTIC XML
We briefly introduce in this section the probabilistic XML representation system we use as a basis of our uncertain version control system. For more details, see [9] for the general framework and [22] for the specific $\mathrm{PrXML}^{fie}$ model we used. These representation systems were originally intended for XML-based applications such as Web data integration and extraction. For instance, when integrating various semi-structured Web catalogs containing personal data, some problems such as overlapping or contradiction are frequent. Typically, one can find for the same person name two distinct affiliations in different catalogs. A probabilistic XML model is used to automatically integrate such data sources by enumerating all possibilities: (a) the system considers each incoming source; (b) it maps its data items to the existing items in the probabilistic repository to find correspondences; and (c) given that, it represents the matches as a set of possibilities. The resolution of conflicts is thus postponed to query time, where each query will return a set of possibilities together with their probabilities. The intuition is that resolving semantic issues before an effective integration is unfeasible in this situation. On one hand, it is often a tedious and error-prone resolution process. On the other hand, there might not be any certain knowledge about the reliability of the sources and data completeness.

p-documents. A probabilistic XML representation system is a compact way of representing probability distributions over possible XML documents; in the case of interest here, the probability distribution is finite.
Formally, a probabilistic XML distribution space, or px-space, $S$ over a collection of uncertain XML documents is a couple $(D, p)$ where $D$ is a nonempty finite set of documents and $p : D \to (0,1]$ is a probability function that maps each document $d$ in $D$ to a rational number $p(d) \in (0,1]$ such that $\sum_{d \in D} p(d) = 1$. A p-document, or probabilistic XML document, usually denoted $\widehat{P}$, defines a compact encoding of a px-space $S$.

$\mathrm{PrXML}^{fie}$: Syntax and Semantics. We consider in this paper one specific class of p-documents, $\mathrm{PrXML}^{fie}$ [22] (where fie stands for formula of independent events); restricting

to this particular class allows us to give a simplified presentation; see [9,22] for a more general setting. Assume a set of independent random Boolean variables, or event variables in short, $b_1, b_2, \ldots, b_m$ and their respective probabilities $P(b_1), P(b_2), \ldots, P(b_m)$ of existence. A $\mathrm{PrXML}^{fie}$ p-document is an unordered, unranked, and labeled tree where every node $x$ (except for the root) may be annotated with an arbitrary propositional formula $\mathit{fie}(x)$ over the event variables $b_1, b_2, \ldots, b_m$. Different formulas can share common events, i.e., there may be some correlation between formulas, and the number of event variables in the formulas may vary from one node to another. A valuation $\nu$ of the event variables $b_1 \ldots b_m$ induces over $\widehat{P}$ one particular XML document $\nu(\widehat{P})$: the document where only nodes annotated with formulas valuated to true by $\nu$ are kept (nodes whose formulas are valuated to false by $\nu$ are deleted from the tree, along with their descendants). Given a p-document $\widehat{P}$, the possible worlds of $\widehat{P}$, denoted as $\mathrm{pwd}(\widehat{P})$, is the set of all such XML documents. The probability of a given possible world $d$ of $\widehat{P}$ is defined as the sum of the probabilities of the valuations that yield $d$. The set of possible worlds, together with their probabilities, defines the semantics of $\widehat{P}$, the px-space $[\![\widehat{P}]\!]$ associated to $\widehat{P}$.

[Figure 2: (a) $\mathrm{PrXML}^{fie}$ p-document $\widehat{P}$; (b) Three possible worlds $d_1$, $d_2$ and $d_3$.]

Example 3.1. Figure 2 sketches on the left side a concrete $\mathrm{PrXML}^{fie}$ p-document $\widehat{P}$ and on the right side three possible worlds $d_1$, $d_2$ and $d_3$. Formulas annotating nodes are shown just above them: $b_1 \wedge \neg b_2$ and $b_2$ are bound to the two annotated nodes, the second of which is $p_2$. The three possible worlds $d_1$, $d_2$ and $d_3$ are obtained by setting the following valuations of $b_1$ and $b_2$: (a) true and false; (b) true and true (or false and true); (c) false and false. At each execution of the random process, the distributional node chooses exactly the nodes whose formulas evaluate to true given the valuation specified over event variables.
Assuming a probability distribution over events, for instance $P(b_1) = 0.4$ and $P(b_2) = 0.5$, we derive the probability of the possible world $d_1$ as $P(d_1) = P(b_1) \times (1 - P(b_2)) = 0.4 \times (1 - 0.5) = 0.2$. We can compute similarly the probabilities of all other possible worlds. With respect to other probabilistic XML representation systems [9], $\mathrm{PrXML}^{fie}$ is very succinct (since arbitrary propositional formulas can be used, involving arbitrary correlations among events), i.e., exponentially more succinct than the models of [30,37], and offers tractable insertions and deletions [22], one key requirement for our uncertain version control model. However, a non-negligible downside is that all non-trivial (tree-pattern) queries over this model are #P-hard to evaluate [23]. This is not necessarily an issue here, since we favor in our application efficient updates and retrieval of given possible worlds over arbitrary queries.

Data Provenance. Uncertain XML management based on the $\mathrm{PrXML}^{fie}$ model also takes advantage of the various possible semantics of event variables in terms of information description. Indeed, besides uncertainty management, the model also provides support for keeping information about data provenance (or lineage) based on the event variables. Data provenance is traceability information such as change semantics, responsible party, timestamp, etc., related to uncertain data. To do so, we only need to use the semantics of event variables as representing information about data provenance. As such, it is sometimes useful to use probabilistic XML representation systems even in the absence of reliable probability sources for individual events, in the sense that one can manipulate them as incomplete data models (i.e., we only care about possible worlds, not about their probabilities).

4. UNCERTAIN MULTI-VERSION XML
In this section we elaborate on our uncertain XML version control model for tree-structured documents edited in a collaborative manner. We build our model on three main concepts: version control events, a p-document, and a directed acyclic graph of events.
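Since p-documents are one of these three concepts, the possible-world semantics of Section 3 can be made concrete with a brute-force Python sketch based on Example 3.1. The tree shape, the node names "p1" and "t1", and the flat encoding are assumptions of this sketch (the figure itself is not available here); only the two formulas and the probabilities 0.4 and 0.5 come from the text.

```python
from itertools import product
from collections import defaultdict

PROB = {"b1": 0.4, "b2": 0.5}  # event probabilities of Example 3.1

# Assumed encoding: node -> (parent, formula). A formula is a function of the
# valuation; None means the node carries no annotation.
P_DOC = {
    "r":  (None, None),
    "s":  ("r", None),
    "p1": ("s", lambda v: v["b1"] and not v["b2"]),  # b1 AND NOT b2
    "t1": ("p1", None),
    "p2": ("s", lambda v: v["b2"]),                  # b2
    "t2": ("p2", None),
}

def world(p_doc, valuation):
    """Nodes kept under a valuation: a node survives if its own formula holds
    and its parent survives (deleted nodes take their descendants along)."""
    def alive(node):
        parent, formula = p_doc[node]
        if formula is not None and not formula(valuation):
            return False
        return parent is None or alive(parent)
    return frozenset(node for node in p_doc if alive(node))

def possible_worlds(p_doc, prob):
    """Enumerate every valuation and sum valuation probabilities per world."""
    names = sorted(prob)
    dist = defaultdict(float)
    for values in product([True, False], repeat=len(names)):
        valuation = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= prob[name] if valuation[name] else 1.0 - prob[name]
        dist[world(p_doc, valuation)] += weight
    return dict(dist)

WORLDS = possible_worlds(P_DOC, PROB)
```

The enumeration is exponential in the number of event variables, so it is only meant to make the definition tangible; the point of the p-document is precisely to avoid materializing this set.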
We start by formalizing a multi-version XML document through a formal definition of its graph of version space and its set of versions. Then, we formally introduce the proposed model.

4.1 Multi-Version XML Documents
Consider the infinite set $D$ of all XML documents with a given root label and identifier. Let $V$ be a set of version control events $e_1, \ldots, e_n$. These events represent the different states of a tree. We associate to events contextual information about revisions (authorship, timestamp, etc.). To each event $e_i$ is further associated an edit script $\Delta_i$. Based on this, we formalize the graph of version space and the set of versions of any versioned XML document as follows.

Graph of version space. The version space is a rooted directed acyclic graph (DAG) $G = (V \cup \{e_0\}, E)$ where: (i) the initial version control event $e_0 \notin V$, a special event representing the first state of any versioned XML tree, is the root of $G$; (ii) $E \subseteq V^2$, defining the edges of $G$, consists of a set of ordered couples of version control events. Each edge implicitly describes a directed derivation relationship between two versions. A branch of $G$ is a directed path that implies a start node $e_i$ and an end node $e_j$. The latter must be reachable from the former by traversing a set of ordered edges in $E$. We refer to this branch by $B_i^j$. A rooted branch is a branch that starts at the root of the graph.

XML versions. An XML version is the document in $D$ corresponding to a set of version control events, the set of events that made this version happen. In a deterministic version control system, this set always corresponds to a rooted branch in the version space graph. In our uncertain version control system, this set may be arbitrary. Let us consider the set $2^V$ comprising all subsets of $V$. The set of versions of a multi-version XML document is given by a mapping $\Omega : 2^V \to D$: to each set of events corresponds a given tree (these trees are typically not all distinct). The function $\Omega$

can be computed from edit scripts associated with events as follows: $\Omega(\emptyset)$ maps to the root-only XML tree of $D$. For all $i$, for all $F \in 2^{V \setminus \{e_i\}}$, $\Omega(\{e_i\} \cup F) = [\Omega(F)]^{\Delta_i}$. A multi-version XML document, $T_{mv}$, is now defined as a pair $(G, \Omega)$ where $G$ is a DAG of version control events, whereas $\Omega$ is a mapping function specifying the set of versions of the document. In the following we propose a more efficient way to compute the version corresponding to a set of events, using a p-document for storage.

4.2 Uncertain Multi-Version XML Documents
A multi-version document will be uncertain if the version control events, staged in a version control process, come with uncertainty as in open collaborative contexts. By version control events with uncertainty, we mean random events leading to uncertain versions and content. As a consequence, we will rely on a probability distribution over $2^V$ that will, together with the $\Omega$ mapping, imply a probability distribution over $D$.

Uncertainty modeling. We model uncertainty in events by further defining a version control event $e_i$ in $V$ as a conjunction of semantically unrelated random Boolean variables $b_1, \ldots, b_m$ with the following assumptions: (i) a Boolean variable models a given source of uncertainty (e.g., the contributor) in the version control environment; (ii) all Boolean variables in each $e_i$ are independent; (iii) a Boolean variable $b_j$ reused across events correlates different version control events; (iv) one particular Boolean revision variable $b^{(i)}$, representing more specifically the uncertainty in the contribution, is not shared across other version control events and appears positively in $e_i$.

Probability Computation. We assume given a probability distribution over the Boolean random variables $b_j$ (this typically comes from a trust estimation in a contributor, or in a contribution), which induces a probability distribution over propositional formulas over the $b_j$ in the usual manner [22].
We now obtain the probability of each (uncertain) version $d \in D$ as follows:

$$P(d) = P\Big(\bigvee_{F \subseteq V,\ \Omega(F) = d} F\Big)$$

with the probability of each set of events $F \subseteq V$ given by:

$$P(F) = P\Big(\bigwedge_{e_j \in F} e_j \;\wedge \bigwedge_{e_k \in V \setminus F} \neg e_k\Big). \qquad (1)$$

Example 4.1 Figure 3 sketches an uncertain multi-version XML document $T_{mv}$ with four staged version control events. On the left side, we have the version space $G$. The right side shows an example of four possible (uncertain) versions and their associated event sets. We suppose that $T_{mv}$ is initially a root-only document. The first three versions correspond to versions covered by deterministic version control systems, whereas the last one is generated by considering the changes performed at an intermediate version control event, here $e_2$, as incorrect. One feature of our model is to provide the possibility of viewing and modifying these kinds of uncertain versions, which represent virtual versions. Only edits performed at the specified version control events are taken into account in the process of producing a version: in $T_4$, the node $r$ and the subtrees rooted at $s_1$ and $s_3$, respectively introduced at $e_0$, $e_1$ and $e_3$, are present, while the subtree $p_3$ added at $e_3$ does not appear because its parent node $s_2$ cannot be found. Finally, given probabilities of version control events, we are able to measure the reliability of each uncertain version $T_i$, for each $1 \leq i \leq 4$, based on its corresponding event set $F_i$ (and all other event sets that map to the same tree).

We straightforwardly observe, for instance with the simple example in Figure 3, that the number of possible (uncertain) versions of any uncertain multi-version document may grow rapidly (indeed, exponentially in the number of events). As a result, the enumeration and the handling of all the possibilities with the function $\Omega$ may become tedious at a certain point. To address this issue, we propose an efficient method for encoding in a compact manner the possible versions together with their truth values. Intuitively, a PrXML$^{fie}$ p-document compactly models the set of possible versions of an uncertain multi-version XML document.
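The enumeration-based semantics above can be made concrete on a toy instance. The following Python sketch is only an illustration under stated assumptions: the edit scripts, node identifiers and probabilities are hypothetical (loosely inspired by Example 4.1), every event is simplified to a single independent Boolean variable rather than a conjunction of shared variables, and events are applied in index order. Under that independence assumption, Equation (1) reduces to a product over events.

```python
from itertools import combinations

# Hypothetical edit scripts: ("ins", parent_id, node_id) or ("del", node_id).
SCRIPTS = {
    "e1": [("ins", "r", "s1")],
    "e2": [("ins", "r", "s2")],
    "e3": [("ins", "r", "s3"), ("ins", "s2", "p3")],
    "e4": [("del", "s2")],
}
# Assumed independent probabilities that each event is valid.
PROB = {"e1": 0.9, "e2": 0.5, "e3": 0.8, "e4": 0.3}


def apply_script(tree, script):
    """Return a new tree (dict node -> parent) after applying an edit script."""
    tree = dict(tree)
    for op in script:
        if op[0] == "ins":
            _, parent, node = op
            if parent in tree:  # an insertion is skipped when its parent is missing
                tree[node] = parent
        else:
            _, node = op
            doomed = {node}  # the node and all of its descendants
            changed = True
            while changed:
                changed = False
                for n, p in tree.items():
                    if p in doomed and n not in doomed:
                        doomed.add(n)
                        changed = True
            for n in doomed:
                tree.pop(n, None)
    return tree


def omega(events):
    """Version for a set of events, starting from the root-only tree."""
    tree = {"r": None}
    for e in sorted(events):
        tree = apply_script(tree, SCRIPTS[e])
    return frozenset(tree.items())


def prob_of_event_set(events):
    """Equation (1) when every event is one independent Boolean variable."""
    p = 1.0
    for e, pe in PROB.items():
        p *= pe if e in events else 1.0 - pe
    return p


def version_distribution():
    """P(d) for every version d: sum of P(F) over the event sets F mapping to d."""
    dist = {}
    names = sorted(SCRIPTS)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            d = omega(subset)
            dist[d] = dist.get(d, 0.0) + prob_of_event_set(set(subset))
    return dist
```

The loop in `version_distribution` visits all $2^{|V|}$ event sets, which is exactly the cost that the compact encoding is meant to avoid. Several event sets may map to the same tree (here, for instance, `{"e1"}`, `{"e1", "e4"}` and `{"e1", "e2", "e4"}`), and their probabilities add up.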
As stressed in Section 3, a probabilistic tree based on propositional formulas provides interesting features for our setting. First, it describes well a distribution of truth values over a set of uncertain XML trees while providing a meaningful process to recover a given version and its probability. Second, it provides an update-efficient representation system, which is crucial in dynamic environments such as version-control based applications.

4.3 Probabilistic XML Encoding

We introduce a general uncertain XML version control representation framework, denoted by $\widehat{T}_{mv}$, as a couple $(G, \widehat{P})$ where (a) $G$ is, as before, a DAG of events representing the version space; (b) $\widehat{P}$ is a PrXML$^{fie}$ p-document with random Boolean variables $b_1, \ldots, b_m$ representing efficiently all possible (uncertain) XML tree versions and their corresponding truth values.

We now define the semantics of such an encoding as the uncertain multi-version document $(G, \Omega)$ where $G$ is the same and $\Omega$ is defined as follows. For all $F \subseteq V$, let $B^+$ be the set of all random variables occurring in one of the events of $F$ and $B^-$ be the set of all revision variables $b^{(i)}$ for $e_i$ not in $F$. Let $\nu$ be the valuation of $b_1, \ldots, b_m$ that sets the variables of $B^+$ to true, the variables of $B^-$ to false, and the other variables to an arbitrary value. We set $\Omega(F) := \nu(\widehat{P})$.

The following shows that this semantics is compatible with the px-space semantics of p-documents on the one hand, and the probability distribution defined by uncertain multi-version documents on the other hand.

Proposition 4.1 Let $(G, \widehat{P})$ be an uncertain version control representation framework and $(G, \Omega)$ its semantics as just defined. We further assume that all formulas occurring in $\widehat{P}$ can be expressed as formulas over the events of $V$ (i.e., we do not make use of the $b_j$ independently of version control events). Then the px-space $[\![\widehat{P}]\!]$ defines the same probability distribution over $D$ as $\Omega$.

The proof is straightforward and relies on Equation (1).

4.4 Updating Uncertain Multi-Version XML

We implement the semantics of standard update operations on top of our probabilistic XML representation system.
An update over an uncertain multi-version document corresponds to the evaluation of some uncertain edits on a given (uncertain) version. With the help of a triple $(\Delta, e, e')$, we refer to an update operation as $\mathrm{updOP}_{\Delta,e,e'}$ where $\Delta$ is an edit script, $e$ is an existing version control event pointing to the edited version and $e'$ is an incoming version control event evaluating the amount of uncertainty in this update. We formalize $\mathrm{updOP}_{\Delta,e,e'}$ over $T_{mv}$ as below.

$$\mathrm{updOP}_{\Delta,e,e'}(T_{mv}) := (G \cup (\{e'\}, \{(e, e')\}),\ \Omega').$$

An update operation thus results in the insertion of a new node and a new edge in $G$, and an extension of $\Omega$ with $\Omega'$ that we now define. For any subset $F \subseteq V'$ ($V'$ is the set of nodes in $G$ after the update), we have:

- if $e' \notin F$: $\Omega'(F) = \Omega(F)$;
- otherwise: $\Omega'(F) = [\Omega(F \setminus \{e'\})]^{\Delta}$.

[Figure 3: (a) Graph of Version Space; (b) Four versions and their corresponding truth-values. The version space contains the events $e_0$ to $e_4$; the four versions $T_1$ to $T_4$ are associated with the event sets $F_1 = \{e_1\}$, $F_2 = \{e_1, e_2\}$, $F_3 = \{e_1, e_2, e_3\}$ and $F_4 = \{e_1, e_3\}$.]

Algorithm 1: Update algorithm

    Input: $(G, \widehat{P})$, $\mathrm{updOP}_{\Delta,e,e'}$
    Output: updating $\widehat{T}_{mv}$ in $\widehat{T}'_{mv}$
    $G := G \cup (\{e'\}, \{(e, e')\})$;
    foreach ($u$ in $\Delta$) do
        if $u = \mathrm{ins}(i, x)$ then
            $y$ := findNodeById($\widehat{P}$, $i$);
            if matchIsFound($T_y$, $x$) then
                $fie_o(x)$ := getFieOfNode($x$);
                setFieOfNode($x$, $fie_o(x) \vee e'$);
            else
                updContent($\widehat{P}$, $\mathrm{ins}(i, x)$);
                setFieOfNode($x$, $e'$);
        else if $u = \mathrm{del}(i)$ then
            $x$ := findNodeById($\widehat{P}$, $i$);
            $fie_o(x)$ := getFieOfNode($x$);
            setFieOfNode($x$, $fie_o(x) \wedge \neg e'$);
    return $(G, \widehat{P})$;

What precedes gives a semantics to updates on uncertain multi-version documents; however, this semantics is not practical as it requires considering every subset $F \subseteq V'$. For a more usable solution, we perform updates directly on the p-document representation of the multi-version document. Algorithm 1 describes how such an update operation $\mathrm{updOP}_{\Delta,e,e'}$ is performed on top of an uncertain representation $(G, \widehat{P})$. First, the graph is updated as before. Then, for each operation $u$ in $\Delta$, the algorithm retrieves the targeted node in $\widehat{P}$ using findNodeById (typically a constant-time operation). According to the type of operation, there are two possibilities.

1. If $u$ is an insertion of a node $x$, the algorithm checks whether $x$ already occurs in $\widehat{P}$, for instance by looking for a node with the same label (the function matchIsFound searches for a match for $x$ in the subtree $T_y$ rooted at $y$). If such a match exists, getFieOfNode returns its current formula $fie_o(x)$ and the algorithm updates it to $fie_n(x) := fie_o(x) \vee e'$, specifying that $x$ appears when this update is valid. Otherwise, updContent and setFieOfNode respectively insert the node $x$ in $\widehat{P}$ and set its associated formula to $fie_n(x) = e'$.

2. If $u$ is a deletion of a node $x$, the algorithm gets its current formula $fie_o(x)$ and sets it to $fie_n(x) := fie_o(x) \wedge \neg e'$, specifying that $x$ must be removed from the possible worlds where this update is valid.

The rest of this section shows the correctness and efficiency of our approach. First, we establish that Algorithm 1 respects the semantics of updates. Second, we show that the behavior of deterministic version control systems can be simulated by considering only a specific kind of event set. Third, we characterize the complexity of the algorithm.

Proposition 4.2 Algorithm 1, when run on a probabilistic XML encoding $\widehat{T}_{mv} = (G, \widehat{P})$ of a multi-version document $T_{mv} = (G, \Omega)$, together with an update operation $\mathrm{updOP}_{\Delta,e,e'}$, computes a representation $\mathrm{updOP}_{\Delta,e,e'}(\widehat{T}_{mv})$ of the multi-version document $\mathrm{updOP}_{\Delta,e,e'}(T_{mv})$.

Proof. Let $\mathrm{updOP}_{\Delta,e,e'}(\widehat{T}_{mv}) = (G', \widehat{P}')$ and $\mathrm{updOP}_{\Delta,e,e'}(T_{mv}) = (G', \Omega')$ (it is clear that the version space DAG is the same in both cases). We need to show that $\Omega'$ corresponds to the semantics of $\widehat{P}'$; that is, if we denote the semantics of $(G', \widehat{P}')$ by $(G', \Omega'')$, we need to show that $\Omega' = \Omega''$.

By definition, for $F \subseteq V'$, $\Omega'(F) = \Omega(F)$ if $e' \notin F$, and $\Omega'(F) = [\Omega(F \setminus \{e'\})]^{\Delta}$ otherwise. Let us distinguish these two cases.

In the first scenario, involving subsets $F$ which do not contain $e'$, we have $\Omega'(F) = \Omega(F)$. Since $T_{mv}$ is the semantics of $\widehat{T}_{mv}$, we know that $\Omega(F) = \nu(\widehat{P})$ for a valuation $\nu$ that sets the special revision variable $b'$ corresponding to $e'$ to false. Now, let us look at the document $\nu(\widehat{P}')$.
By construction, the update algorithm does not delete any node from $\widehat{P}$ but just inserts new nodes and modifies some formulas. Suppose that there exists a node $x \in \nu(\widehat{P})$ such that $x \notin \nu(\widehat{P}')$. Since $x \in \nu(\widehat{P})$, $x$ cannot be a new node in $\widehat{P}'$. Thereby, its new formula $fie_n(x)$ after the update is either $fie_o(x) \vee e'$ or $fie_o(x) \wedge \neg e'$. In both cases, $\nu$ satisfies $fie_n(x)$, because $\nu$ satisfies $fie_o(x)$ and $\nu$ sets $b'$ (and therefore $e'$) to false. This leads to a contradiction and we can conclude that for every node $x \in \nu(\widehat{P})$, we have $x \in \nu(\widehat{P}')$. Similarly, if a node $x$ is in $\nu(\widehat{P}')$, then, because $\nu$ sets $e'$ to false, $x$ is also in $\nu(\widehat{P})$. Combining the two, $\Omega''(F) = \nu(\widehat{P}') = \nu(\widehat{P}) = \Omega(F)$.

The second scenario concerns subsets $F'$ in which $e'$ appears. We obtain a version $\Omega'(F')$ by updating $\Omega(F' \setminus \{e'\})$ with $\Delta$. Let us set $F = F' \setminus \{e'\}$. There exists a valuation $\nu$ such that $\nu(\widehat{P}) = \Omega(F)$ (and thus, $\Omega'(F') = [\nu(\widehat{P})]^{\Delta}$), with $\nu$ setting all variables of events in $F$ to true and making sure that all other events are set to false. Let $\nu'$ be the extension of $\nu$ where all variables of $e'$ are set to true. It suffices to prove that $[\nu(\widehat{P})]^{\Delta} = \nu'(\widehat{P}')$.

First, it is clear that the nodes in $\nu(\widehat{P})$ which are not modified by $\Delta$ are also in $\nu'(\widehat{P}')$. Indeed, their associated formulas do not change in $\widehat{P}'$, and hence the fact that $\nu$ satisfies them is sufficient for selecting them in $\widehat{P}'$ with the valuation $\nu'$. Suppose now an operation $u$ in $\Delta$ involving a node $x$: $u$ either adds $x$ as a child of a certain node $y$ or deletes $x$.

In the former case, if $y$ exists in $\nu(\widehat{P})$, then $\nu$ satisfies its formula and $x$ is added to the document when it does not already exist. With Algorithm 1, $u$ is interpreted in $\widehat{P}'$ by the existence of $x$ under $y$ with an attached formula being either $fie_n(x) = e'$ (newly added) or $fie_n(x) = fie_o(x) \vee e'$ (reverted node). As a consequence, $\nu'(\widehat{P}')$ selects $x$, since $\nu'$ satisfies both possible expressions of $fie_n(x)$.

Let us analyze the case where $u$ is a deletion of $x$. If $x$ is not present in $\nu(\widehat{P})$, then $u$ changes nothing in this document. Through Algorithm 1, $u$ results in a new associated formula $fie_n(x) = fie_o(x) \wedge \neg e'$ for the node $x$ in $\widehat{P}'$. We can see that $x$ will not be in $\nu'(\widehat{P}')$, because satisfying $fie_n(x)$ requires $e'$ to be false, a condition that does not hold for $F'$. Now, if $x$ is found in $\nu(\widehat{P})$, $u$ deletes the node, as well as its children, from the document. As a result, the outcome does not contain $x$, which conforms to the fact that $x \notin \nu'(\widehat{P}')$.

We have proved that every node $x$ in $[\nu(\widehat{P})]^{\Delta}$ is also in $\nu'(\widehat{P}')$. By similar arguments, we can show that the converse holds, i.e., every node $x$ in $\nu'(\widehat{P}')$ belongs to $[\nu(\widehat{P})]^{\Delta}$.
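Algorithm 1 and the two cases of the proof can be illustrated on a toy p-document. The Python sketch below is a simplified illustration, not the authors' implementation: each event is assumed to be a single Boolean variable, the match test is reduced to identifier equality under the same parent, the version space graph is not modeled, and the class and function names (`PDocument`, `holds`, `checkout`) are hypothetical.

```python
def holds(formula, true_events):
    """Evaluate a formula given as nested tuples under a valuation of events."""
    kind = formula[0]
    if kind == "true":
        return True
    if kind == "var":
        return formula[1] in true_events
    if kind == "not":
        return not holds(formula[1], true_events)
    left = holds(formula[1], true_events)
    right = holds(formula[2], true_events)
    return (left or right) if kind == "or" else (left and right)


class PDocument:
    """Toy p-document: every node keeps a parent and a formula over events."""

    def __init__(self):
        self.parent = {"r": None}
        self.fie = {"r": ("true",)}  # the root is always present

    def update(self, script, new_event):
        """Formula updates of Algorithm 1 for an edit script and a new event."""
        e = ("var", new_event)
        for op in script:
            if op[0] == "ins":
                _, parent, node = op
                if self.parent.get(node) == parent:  # simplified match test
                    self.fie[node] = ("or", self.fie[node], e)
                else:  # new node: insert it and attach the formula e'
                    self.parent[node] = parent
                    self.fie[node] = e
            else:  # deletion of a node assumed to exist in the p-document
                _, node = op
                self.fie[node] = ("and", self.fie[node], ("not", e))

    def checkout(self, events):
        """Nodes kept by the valuation that sets exactly `events` to true."""
        def alive(node):
            if node is None:
                return True
            return holds(self.fie[node], events) and alive(self.parent[node])
        return {n for n in self.parent if alive(n)}


doc = PDocument()
doc.update([("ins", "r", "s1")], "e1")
doc.update([("ins", "r", "s2")], "e2")
doc.update([("ins", "r", "s3"), ("ins", "s2", "p3")], "e3")
before = doc.checkout({"e1", "e2", "e3"})
doc.update([("del", "s2")], "e4")         # update with a new event e4
after_without_e4 = doc.checkout({"e1", "e2", "e3"})
after_with_e4 = doc.checkout({"e1", "e2", "e3", "e4"})
```

In this sketch, `before` and `after_without_e4` coincide, which mirrors the first case of the proof (event sets that do not contain the new event are unaffected), while `after_with_e4` no longer contains `s2` or its child `p3`, which mirrors the second case. No node is ever removed from the p-document itself; only formulas change.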
The semantics of update is therefore the same, whether stated on uncertain multi-version documents or implemented as in Algorithm 1. We now show that this semantics is compatible with the classical update operation of version control systems.

Proposition 4.3 The formal definition of updating in uncertain multi-version documents implements the semantics of the standard update operation in deterministic version control systems when sets of events are restricted to rooted branches.

Proof. (Sketch) The update in our model changes the version space $G$ similarly to a deterministic version control setting. As for its evaluation over the set of versions, we only need to show that the operation also produces a new version by updating the version mapped to $B_0^i$ (with $e$ the $i$-th version control event in $G$) with $\Delta$, as in a deterministic formalism. For building the resulting version set, the operation as given above is defined such that for every subset $F \subseteq V$ with $e \in F$, we carry out $\Delta$ on $\Omega(F)$ to produce a new version $\Omega'(F \cup \{e'\})$. Amongst all the subsets satisfying this condition, there is at least one which maps to $B_0^i$.

We conclude by showing that our algorithm is fully scalable:

Proposition 4.4 Algorithm 1 performs the update process over the representation of any uncertain multi-version XML document with a constant time complexity with respect to the size of the input document. The size of the output probabilistic tree grows linearly in the size of the update script.

Proof. The first part of the algorithm consists in updating $G$. This is clearly a constant-time operation, which results in a single new node and a single new edge in $G$ for every edit script. As for the second part of the algorithm, i.e., the evaluation of the update script over the probabilistic tree, let $|\widehat{P}|$ and $|\Delta|$ be respectively the size of the input probabilistic document $\widehat{P}$ and the length of $\Delta$. By implementing $\widehat{P}$ as an amortized hash table, we execute a lookup of nodes in $\widehat{P}$ based on findNodeById or matchIsFound in constant time.
(matchIsFound requires storing hashes of all subtrees of the tree, but this data structure can be maintained efficiently; we omit the details here.) The upper bound of Algorithm 1 is reached when $\Delta$ consists only of insertions. Since the functions getFieOfNode, updContent and setFieOfNode also have constant execution costs, we can state that the overall running time of Algorithm 1 is only a function of the number of operations in $\Delta$. As a result, we can conclude that the update algorithm performs in $O(1)$ with respect to the number of nodes in $\widehat{P}$ and $G$. At each execution, Algorithm 1 will increase the input probabilistic tree by a size bounded by a constant for each update operation, together with the size of all inserts. To sum up, the size increase is linear in the size of the original edit script.

5. EVALUATION OF THE MODEL

This section describes the experimental evaluation of the proposed model, based on real-world applications. We first present a comparative study of our model with two popular version control systems, Git and Subversion, in order to assess its efficiency. Then we describe the advances in terms of content filtering offered by our model. All times shown are CPU time, obtained by running in-memory tests, avoiding disk I/O costs by putting all accessed file systems in a RAM disk. Measures have been carried out using the same settings for all three systems.

5.1 Performance analysis

We measured the time needed for the execution of two main operations: the commit and the checkout of a version. The tests were conducted on Git, Subversion, and the implementation of our model (PrXML). The goal is to show the feasibility of our model rather than to prove that it is more efficient than the mentioned version control systems. We stress that, though for comparison purposes our system was tested in a deterministic setting, its main interest lies in the fact that it is able to represent uncertain multi-version documents, as we illustrate further in Section 5.2.

Datasets and Implementation.
As datasets, we used the history of the master branches of the Linux kernel development [4] and the Apache Cassandra project [1] for the tests. These data represent two large file systems and constitute two examples of tree-structured data shared in an open and collaborative environment. The Linux kernel development natively uses Git. We obtained a local copy of its history by cloning the master development branch. We kept our local copy up to date by pulling the latest changes from the original source every day. We followed a similar process with the Cassandra dataset (a Subversion repository).

[Figure 4: Measures of commit time over real-world datasets (logarithmic y-axis). Two panels: commits of the Linux kernel and commits of the Cassandra project; y-axis: commit time (ms); series: Subversion, Git, PrXML.]

In total, each local branch has more than ten thousand commits (or revisions). Each commit materializes a set of changes, to the content of files or to their hierarchy (the file system tree). In our experiments, we focused on the commits applied to the file system tree and ignored content changes. We determined the commits and the derivation relationships from the Git and Subversion logs. We represented the file system as an XML document and we transposed the atomic changes to the file system into edit operations on the XML tree. To each insertion, respectively deletion, of a file or a directory in the file system corresponds an insertion, respectively a deletion, of a node in the XML tree.

We implemented our version control model (PrXML) in Java. We used the Java APIs SVNKit [5] and JGit [3] to set up the standard operations of Subversion and Git. The purpose was to perform all the evaluations in the same conditions. Subversion uses a set of log files to track the changes applied to the file system at the different commits. Each log file contains a set of paths and the change operations associated to each path. As for Git, it handles several versions of a file system as a set of related Git tree objects represented by the hashes of their content. A Git tree object represents a snapshot of the file system at a given commit.

Cost analysis. Figures 4 and 6 compare the cost of the commit and the checkout operations in Subversion, Git, and PrXML. The commit time indicates the time needed by the system to create a version (commit), whereas the checkout time corresponds to the time necessary to compute and retrieve the sought version. The obtained results show that PrXML performs well with respect to Git and Subversion.
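The transposition of file-system changes into edit operations on the XML tree, described under Datasets and Implementation, can be sketched as follows. This is an illustrative Python sketch, not the Java implementation used in the experiments: the function name `to_edit_script`, the use of slash-separated paths as node identifiers and the tuple format of the operations are all assumptions made for the example.

```python
def to_edit_script(existing, added, deleted):
    """Translate the path-level changes of one commit into node-level edits.

    existing: paths already present in the tree (the root is the empty path "").
    added, deleted: paths of files or directories touched by the commit.
    Returns a list of ("ins", parent_path, path) and ("del", path) operations.
    """
    known = set(existing) | {""}
    script = []
    for path in sorted(added):
        parts = path.split("/")
        # One node insertion per missing path component, from the top down.
        for depth in range(1, len(parts) + 1):
            node = "/".join(parts[:depth])
            if node not in known:
                parent = "/".join(parts[:depth - 1])
                script.append(("ins", parent, node))
                known.add(node)
    for path in sorted(deleted):
        if path in known:
            # Deleting a directory node removes its whole subtree.
            script.append(("del", path))
    return script


script = to_edit_script(
    existing={"src"},
    added=["docs/api/index.xml", "docs/readme"],
    deleted=["src"],
)
```

With this representation, adding a new directory together with its content yields one insertion per file and per intermediate directory, so the length of the edit script grows with the size of the inserted subtree, whereas the deletion of a directory is a single operation.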
The experiments were done using the datasets obtained from the Linux kernel and Cassandra projects, as indicated above. For both datasets, we observe in Figure 4 that our model has in general a low commit cost (note that the y-axes are logarithmic in Figure 4). An in-depth analysis of the results shows that the commit costs in our model depend on the number of edit operations associated to the commits (see Figure 5), as implied by Proposition 4.4. However, PrXML remains efficient compared to the other systems, except for a few commits characterized by a large number of edits (at least one hundred edit operations). This can be explained by the fact that our model performs the edit operations over XML trees, whereas Git stores the hashes of the files indexed by the directory names, and Subversion logs the changes together with the targeted paths in flat files. An insertion of a subtree (a hierarchy of files and directories) in the file system can be treated as a simple operation in Git and Subversion, whereas it requires a series of node insertions in our model.

Footnote 1: Our measures of the commit time in PrXML include the computation cost of the edit scripts.

[Figure 5: Commit time vs number of edit operations (for edit scripts of length 5). x-axis: number of edit operations; y-axis: commit time (ms); series: Subversion, Git, PrXML.]

Our model is able to generate linear versions (corresponding to event sets that are rooted branches) as well as arbitrary ones. However, traditional version control systems are only able to produce linear versions. As a consequence, in this paper we focused our experiments on retrieving linear versions for comparison purposes. Figure 6 shows the measures obtained for the checkout of successive versions in PrXML, Git and Subversion. The x-axis represents version numbers. Retrieving version number $n$ requires the reconstruction of all previous versions ($1$ to $n - 1$). The results obtained show that our model is significantly more efficient than Subversion for both datasets (Linux kernel and Cassandra projects).
Compared to Git, PrXML has a lower checkout cost for initial versions, while it becomes less efficient in retrieving recent versions for the Cassandra dataset. Note that traditional version control models mostly use reversible diffs [33] in order to speed up the process of reconstructing the recent versions in a linear history.

[Figure 6: Measures of checkout time over real-world datasets (linear axes). Two panels: revisions of the Linux kernel and revisions of the Cassandra project; y-axis: checkout time (ms); series: Subversion, Git, PrXML.]

5.2 Filtering capabilities

Efficient evaluation of the uncertainty and automatic filtering of unreliable contents are two key issues for large-scale collaborative editing systems. Evaluation of uncertainty is needed because a shared document can result from contributions of different persons, who may have different levels of reliability. This reliability can be estimated in various ways, such as an indicator of the overall reputation of an author (possibly automatically derived from the content of contributions, cf. [10]) or the subjective trust a given reader has in the contributor. For popular collaborative platforms, like Wikipedia, an automatic management of conflicts is also necessary because the number of contributors is often very large. This is especially true for documents related to hot topics, where the number of conflicts and vandalism acts can evolve rapidly and compromise document integrity.

In our model, filtering unreliable contents can be done easily by setting to false the Boolean variables modeling the corresponding sources. This can be done automatically, for instance when a vandalism act is detected, or at query time to fit user preferences and opinions about the contributors. A shared document can also be regarded as the merge of all possible worlds modeled by the generated revisions. We demonstrate in [7] an application of these new filtering and interaction capabilities to Wikipedia revisions: an article is no longer considered as the last valid revision, but as a merge of all possible (uncertain) revisions. The overall uncertainty on a given part of the article is derived from the uncertainty of the revisions having affected it. Moreover, the user can view the state of a document at a given revision, removing the effect of a given revision or a given contributor, or focusing only on the effect of some chosen revisions or some reliable contributors.

We also tested the possibility for the users to handle more advanced operations over critical versions of articles such as vandalized versions. We chose the most vandalized Wikipedia articles (cf. Wikipedia:Most vandalized pages), and we used our model to study the impact of considering as reliable some versions affected by vandalism. We succeeded in reconstructing the chosen articles as if the vandalism had never been removed; obtaining this special version of the article is very efficient, since it consists in applying a given valuation to the probabilistic document, which is a checkout operation whose timing is comparable to what is shown in Figure 6. Note that in the current version of Wikipedia, the content of vandalized versions is systematically removed from the presented version of an article, even if some users may want to visualize them for various reasons. Our experiments have shown that we can detect vandalism as well as Wikipedia robots do, and automatically manage it in PrXML, keeping all uncertain versions available for checkout.

6. RELATED WORK

Our previous work. We present in [7, 13] initial studies towards the design of an uncertain XML version control system: [7] is a demonstration system focusing on Wikipedia revisions and showing the benefits of integrating an uncertain XML version control approach in web-scale collaborative platforms; [13] is a PhD workshop paper with early ideas behind modeling XML uncertain version control.

Version Control Systems. While a lot of work was carried out on version control in object-oriented systems (e.g., [8, 11, 14, 19]), recent research and tools focus on document-oriented models. Many products, seen as general-purpose systems, are used for version control over different kinds of documents. Subversion, ClearCase, Git, BitKeeper, and Bazaar are some examples. In general, the considered approaches do not take into account the semantics of the changes represented by the successive versions. The concern is the reconstruction of the committed versions, rather than the understanding of the evolution of the modeled world. In Subversion [18] and similar systems, version control is based on edit distance algorithms designed for flat text, whereas the Git family [15] of tools uses cryptographic approaches.
For XML and structured documents, both techniques are inadequate because the semantics of the changes is crucial in this case. A lot of work was done on change detection on XML documents, and different XML diff tools have been developed [17, 27, 32]. An in-depth analysis of the proposed approaches can be found in [16]. Besides that, XML version control models such as [33] and [35] store all versions in the same XML document, and extend the XML schema of the latter with some elements used for the identification of each version. However, the drawback of these approaches is the redundancy of the content shared between different versions and the cost of the update operations.

Probabilistic XML. Uncertainty handling in XML was originally associated to the problem of automatic Web data extraction and integration. In this context, uncertainty may have different origins: the extraction process, the unreliability of the data sources, the incompleteness of the data, etc. Several efforts have been made and some probabilistic approaches have been proposed (see [28] for a survey), especially the work of van Keulen et al. [36, 37]. Then a representation system that generalizes all the existing models was proposed in [9] and [22]; we refer to [25] for a survey of the probabilistic XML literature.

7. CONCLUSION

We presented in this paper an uncertain XML version control model tailored to multi-version tree-structured documents in open collaborative editing contexts. This is one of the first works focusing on concrete applications of the existing literature on probabilistic XML [9, 22–25, 30, 37]. The comparison of our model to the most popular version control systems, done on real-world data, shows its efficiency. Moreover, our model offers new filtering and interaction capabilities which are crucial in open collaborative environments, where the data sources, the contributors and the shared content are inherently uncertain.

The main direction for future developments is the support of more complex version control operations, notably merging. Similarly to insertions and deletions, it is possible to implement merging by directly modifying the p-document, leading to an efficient management of uncertain versions. Finally, the model could be extended to also support other kinds of edit operations, like moves of intermediate nodes in XML.

8. ACKNOWLEDGEMENTS

This work was partially supported by the Île-de-France regional DROD project, and the French government under the STIC-Asia program, CCIPX project. We would like to thank the anonymous reviewers for their valuable suggestions on improving this paper.

9. REFERENCES

[1] Cassandra Project. http://cassandra.apache.org/.
[2] Google Drive. https://drive.google.com/.
[3] Java Git. http://www.eclipse.org/jgit/.
[4] Linux Kernel. https://www.kernel.org/.
[5] [Sub]Versioning for Java. http://svnkit.com/.
[6] Wikipedia Platform. http://www.wikipedia.org/.
[7] T. Abdessalem, M. L. Ba, and P. Senellart. A probabilistic XML merging tool. In EDBT, 2011. Demonstration.
[8] T. Abdessalem and G. Jomier. VQL: A query language for multiversion databases. In DBPL, 1997.
[9] S. Abiteboul, B. Kimelfeld, Y. Sagiv, and P. Senellart. On the expressiveness of probabilistic XML models. VLDB Journal, 18(5), 2009.
[10] B. T. Adler and L. de Alfaro. A content-driven reputation system for the Wikipedia. In WWW, 2007.
[11] A. Al-Khudair, W. A. Gray, and J. C. Miles. Dynamic evolution and consistency of collaborative configurations in object-oriented databases. In Proc. TOOLS, 2001.
[12] K. Altmanninger, M. Seidl, and M. Wimmer. A survey on model versioning approaches. IJWIS, 5, 2009.
[13] M. L. Ba, T. Abdessalem, and P. Senellart. Towards a version control model with uncertain data. In PIKM, 2011.
[14] W. Cellary and G. Jomier. Consistency of versions in object-oriented databases. In VLDB, 1990.
[15] S. Chacon. Git Book. http://book.git-scm.com/.
[16] G. Cobéna and T. Abdessalem. A comparative study of XML change detection algorithms. In Services and Business Computing Solutions with XML: Applications for Quality Management and Best Processes. IGI Global, 2009.
[17] G. Cobéna, S. Abiteboul, and A. Marian. Detecting Changes in XML Documents. In ICDE, 2002.
[18] B. Collins-Sussman, B. W. Fitzpatrick, and C. M. Pilato. Version Control with Subversion. O'Reilly Media, 2008.
[19] R. Conradi and B. Westfechtel. Towards a uniform version model for software configuration management. In System Configuration Management, 1997.
[20] G. de la Calzada and A. Dekhtyar. On measuring the quality of Wikipedia articles. In WICOW, 2010.
[21] L. Khan, L. Wang, and Y. Rao. Change detection of XML documents using signatures. In Real World RDF and Semantic Web Applications, 2002.
[22] E. Kharlamov, W. Nutt, and P. Senellart. Updating Probabilistic XML. In Updates in XML, 2010.
[23] B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query evaluation over probabilistic XML. VLDB Journal, 18(5), 2009.
[24] B. Kimelfeld and Y. Sagiv. Modeling and querying probabilistic XML data. SIGMOD Rec., 37(4), 2009.
[25] B. Kimelfeld and P. Senellart. Probabilistic XML: Models and complexity. In Z. Ma and L. Yan, editors, Advances in Probabilistic Databases for Uncertain Information Management. Springer-Verlag, 2013.
[26] A. Koc and A. U. Tansel. A survey of version control systems. In ICEME, 2011.
[27] T. Lindholm, J. Kangasharju, and S. Tarkoma. Fast and simple XML tree differencing by sequence alignment. In DocEng, 2006.
[28] M. Magnani and D. Montesi. A survey on uncertainty management in data integration. J. Data and Information Quality, 2, 2010.
[29] S. Maniu, B. Cautis, and T. Abdessalem. Building a signed network from interactions in Wikipedia. In DBSocial, 2011.
[30] A. Nierman and H. V. Jagadish. ProTDB: probabilistic data in XML. In VLDB, 2002.
[31] S. Rönnau and U. Borghoff. Versioning XML-based office documents. Multimedia Tools and Applications, 43, 2009.
[32] S. Rönnau and U. Borghoff. XCC: change control of XML documents. CSRD, 2010.
[33] L. I. Rusu, W. Rahayu, and D. Taniar. Maintaining versions of dynamic XML documents. In WISE, 2005.
[34] M. Sabel. Structuring wiki revision history. In WikiSym, 2007.
[35] C. Thao and E. V. Munson. Version-aware XML documents. In DocEng, 2011.
[36] M. van Keulen and A. de Keijzer. Qualitative effects of knowledge rules and user feedback in probabilistic data integration. VLDB Journal, 18, 2009.
[37] M. van Keulen, A. de Keijzer, and W. Alink. A Probabilistic XML Approach to Data Integration. In ICDE, 2005.
[38] J. Voss. Measuring Wikipedia. In ISSI, 2005.
[39] Y. Wang, D. J. DeWitt, and J.-Y. Cai. X-Diff: An Effective Change Detection Algorithm for XML Documents. In ICDE, 2003.