Data provenance: the foundation of data quality
Peter Buneman, University of Edinburgh, Edinburgh, UK
Susan B. Davidson, University of Pennsylvania, Philadelphia, USA

September 1,

1 What is Provenance?

Provenance for data[1] is often defined by analogy with its use for the history of non-digital artifacts, typically works of art. While this provides a starting point for our understanding of the term, it is not adequate for at least two important reasons. First, an art historian seeing a copy of some artifact will regard its provenance as very different from that of the original. By contrast, when we make use of a digital artifact, we often use a copy, and copying in no sense destroys the provenance of that artifact. Second, digital artifacts are seldom raw data. They are created by a process of transformation or computation on other data sets, and we regard this process as a part of the provenance. For non-digital artifacts, provenance is usually traced back to the point of creation, but no further. These two issues, the copying and transformation of data, will figure prominently in our discussion of provenance, but rather than attempting a more elaborate definition, we shall start with some examples.

In September 2008, a news article concerning the 2002 bankruptcy of UAL, United Airlines' parent company, appeared on the list of top news stories from Google News. The (undated) article provoked investor panic, and UAL's share price dropped by 75% before it was realized that the article was six years out of date.

Look up some piece of demographic information on the Web and see if you can get an accurate account of its provenance. A recent Web search for "population Corfu" yielded 14 different values in the first 20 hits, ranging from 70,000 to 150,000, several of them given to the nearest person. While some of the values were dated, in only one case was any kind of attribution or citation given.
Most scientific data sets are not raw observations but have been generated by a complex analysis of the raw data with a mixture of scientific software and human input. Keeping a record of this process, the workflow, is all-important to a scientist who wants to use the data and be satisfied of its validity. Also, in science and scholarship generally, there has been a huge increase in the use of databases to replace the traditional reference works: encyclopedias, gazetteers, etc. These curated databases are constructed with a lot of manual effort, usually by copying from other sources. Few of these keep a satisfactory record of the source.

[1] The terms "pedigree" and "lineage" have also been used synonymously with "provenance".
The Department of Defense will be painfully aware of the 1999 bombing of the Chinese Embassy in Belgrade, which was variously attributed to an out-of-date map or a misunderstanding of the procedure intended to locate the intended target; in either case, a failure to have or to use provenance information.

All these examples clearly indicate that, for data provenance, we need to understand the issues of data creation, transformation, and copying. In the following sections we shall briefly summarize the progress to date in understanding them and then outline some of the major practical challenges that arise from this understanding. The topics treated here are, not surprisingly, related to the authors' interests. There are a number of surveys and tutorials on provenance that will give broader coverage [BT07, SPG05, DF08, MFF+08, CCT09, BCTV08, FKSS08].

2 Models of Provenance

As interest in data provenance has developed over the past few years, researchers and software developers have realized that it is not an isolated issue. Apart from being a fundamental issue in data quality, connections have developed with other areas of computer science:

- Query and update languages
- Probabilistic databases
- Data integration
- Data cleaning
- Debugging schema transformations
- File/data synchronization
- Program debugging (program slicing)
- Security and privacy

In some cases, such as data cleaning, data integration, and file synchronization, the connection is obvious: provenance information may help to determine what to select among conflicting pieces of information. In other cases the connection is less obvious but nevertheless important. For example, in program slicing the idea is to trace the flow of data through the execution of a program for debugging purposes. What has emerged as a result of these studies is the need to develop the right models of provenance.
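To illustrate the data-cleaning connection, here is a minimal sketch of how provenance annotations can drive conflict resolution. All names, sources, and figures below are invented for illustration; real systems are far more elaborate.

```python
# A minimal sketch, assuming a toy data model: each candidate value carries
# the source it came from (its provenance), and conflicts between sources
# are resolved by preferring the most trusted source. All names and figures
# below are invented for illustration.

def resolve(records, trust):
    """Pick one value per key from conflicting records, preferring
    the value whose recorded source has the highest trust score."""
    best = {}
    for key, value, source in records:
        if key not in best or trust[source] > trust[best[key][1]]:
            best[key] = (value, source)
    return best

records = [
    ("population:Corfu", 150_000, "travel-blog"),
    ("population:Corfu", 110_000, "national-census"),
]
trust = {"national-census": 0.9, "travel-blog": 0.2}

resolved = resolve(records, trust)
# The surviving value is kept together with its provenance, so a later
# reader can see not just the number but where it came from.
```

The point is that without the source annotation there is nothing principled to prefer; the provenance is what makes the cleaning decision defensible.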
Two general models have been developed for somewhat different purposes: workflow and data provenance, also called coarse- and fine-grain provenance. Both of these concern the provenance of data, but in different ways. Incidentally, we have used "data provenance" both as a general term and for fine-grain provenance. As we shall see, there is no hard boundary between coarse- and fine-grain provenance, and the context should indicate the sense in which we are using the term.

2.1 Workflow provenance

As indicated in our examples, scientific data sets are seldom raw. What is published is usually the result of a sophisticated workflow that takes the raw observations (e.g. from a sensor network or a microarray image) and subjects them to a complex set of data transformations that may involve informed input from a scientist. Keeping an accurate record of what was done is crucial if one wants to verify or, more importantly, to repeat an experiment.
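As a rough illustration, a workflow system might keep such a record by logging every module execution together with its inputs and outputs. The module names and data below are hypothetical, not taken from any real workflow system:

```python
# A minimal sketch of workflow provenance capture: every module execution
# is logged, in order, with its inputs and outputs, so that a run can later
# be inspected, verified, or repeated. Module names and data are invented.

provenance_log = []

def run(module_name, func, inputs):
    """Execute one workflow module and record the step in the log."""
    outputs = func(inputs)
    provenance_log.append({"module": module_name,
                           "inputs": inputs,
                           "outputs": outputs})
    return outputs

# A two-step toy workflow: normalize a SNP list, then score it.
snps = run("NormalizeSNPs", lambda xs: sorted(set(xs)), ["rs2", "rs1", "rs2"])
risk = run("ScoreRisk", lambda xs: len(xs) / 10, snps)

# provenance_log now holds the full trace, including intermediate data.
```

Because the log retains intermediate data, a reader can later verify each step, or re-run the workflow from any point.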
[Figure 1: Disease Susceptibility Workflow]

For example, the workflow in Figure 1 estimates disease susceptibility based on genome-wide SNP array data. The input to the workflow (indicated by the node or module labeled I) is a set of SNPs, ethnicity information, lifestyle, family history, and physical symptoms. The first module within the root of the workflow (the dotted box labeled W1), M1, determines a set of disorders the patient is genetically susceptible to, based on the input SNPs and ethnicity information. The second module, M2, refines the set of disorders the patient is at risk for, based on their lifestyle, family history, and physical symptoms. M1 and M2 are complex processes, as indicated by the τ expansions to the subworkflows labeled W2 and W3, respectively. To understand the prognosis for an individual (the output indicated by the module labeled O), it is important to be able to trace back through this complex workflow and see not only what processing steps were executed, but also to examine the intermediate data that are generated, in particular the results returned by the OMIM and PubMed queries in W4 (modules M8 and M9, respectively).

2.2 Data Provenance

This is concerned with the provenance of relatively small pieces of data. Suppose, for example, that on a visit to your doctor you find that your office telephone number has been incorrectly recorded. It is probable that it was either entered incorrectly or somehow incorrectly copied from some other data set. In either case, one would like to understand what happened, in case the incorrect number occurs in other medical records.
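The kind of trace one would like for such a copied value can be sketched as follows. The databases, fields, and copy records are invented for illustration:

```python
# A minimal sketch of where-provenance for copied values: every copy
# operation records its immediate source (and who performed it, and when),
# so an incorrect value can be traced back through the chain of copies to
# the point where it was first entered. All locations here are invented.

copies = {
    # target location        : (source location, actor, date)
    ("clinic_db", "phone")   : (("hospital_db", "phone"), "import-job", "2010-03-01"),
    ("hospital_db", "phone") : (("intake_form", "phone"), "clerk-17", "2009-11-20"),
}

def trace(location):
    """Follow copy records back to the original point of entry."""
    chain = [location]
    while location in copies:
        location = copies[location][0]
        chain.append(location)
    return chain

# Tracing the phone number seen in the clinic's database leads back,
# copy by copy, to the original intake form.
origin_chain = trace(("clinic_db", "phone"))
```

Note that nothing here requires a complete workflow for the whole database; each copy record is a purely local piece of information.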
Here we do not want a complete workflow for all medical records; such a thing is probably impossible to construct. What we are looking for is some simple explanation of how that telephone number arrived where you saw it, e.g. when and where it was entered and who copied it. More generally, as depicted in Figure 2, the idea is to extract an explanation of how a small piece of data evolved, when one is not interested in the whole system.

[Figure 2: An informal view of data provenance]

2.3 Convergence of models

It is obvious that there has to be some convergence of these two models. In the case of a telephone number, the operation of interest is almost certainly copying, and a local account of the source can be given. By contrast, in our workflow example, since we know nothing about the processing involved in M3 (Expand SNP Set), each resulting SNP in the output set can only be understood to depend on all input SNPs and ethnicity information, rather than on one particular SNP and ethnicity information. But most cases lie somewhere between these two extremes.

First, even if we stay within the confines of relational query languages, we may be interested in how a record was formed. For example, the incorrect telephone number may have found its way into your medical record because that record was formed by the incorrect join of two other records. In this case you would not only be interested in where the telephone number was copied from, but also in how that record was constructed. To this end, [GKT07] describes the provenance of a record by a simple algebraic term. This term can be thought of as a characterization of the mini-workflow that constructed the record.

At the other end of the spectrum, workflow specification languages such as Taverna and Kepler do not simply treat their component programs and data sets as black boxes. They are capable of executing instructions such as applying a procedure to each member of a set or selecting from a set each member with a certain property (see, for example, [TMG+07]). These are similar to operations of the relational algebra, so within these workflow specifications it is often possible to extract some finer-grain provenance information.

There is certainly hope [ABC+10] that further unification of models is possible, but whether there will ever be a single model that can be used for all kinds of provenance is debatable. There is already substantial variation in the models that have been developed for data provenance, and although there is a standardization effort for workflow provenance, it is not yet clear that there is one satisfactory provenance model for all varieties of workflow specification.

3 Capturing provenance

Creating the right models of provenance assumes that provenance is being captured to begin with. However, the current state of practice is largely manual population, i.e. humans must enter the information.
We summarize some of the advances being made in this area:

Schema development: Several groups are reaching agreement on what provenance information should be recorded, whether the capture be manual or automatic. For example, within scientific workflow systems the Open Provenance Model (see also [MFF+08]) has been developed to enable exchange between systems. Various government organizations are also defining the metadata that should be captured, e.g. PREMIS (Library of Congress), Upside (Navy Joint Air Missile Defense), and the Defense Discovery Metadata Specification (DDMS).

Automated capture: Several scientific workflow systems (e.g. Kepler, Taverna, and VisTrails) automatically capture processing steps as well as their input and output data. This log data is typically stored as an XML file, or put in a relational database, to enable users to view provenance information. Other projects focus on capturing provenance at the operating system level, e.g. the Provenance Aware Storage System (PASS) project at Harvard.

Interception: Provenance information can also be intercepted at observation points in a system. For example, copy-paste operations could be intercepted so that where a piece of data was copied from is recorded. An experimental system of this kind has been proposed in [BCC06]. Another strategy is to capture calls at multi-system coordination points (e.g. enterprise service buses).

3.1 Why capture is difficult and the evil of Ctrl-c Ctrl-v

One of the simplest forms of provenance to describe is that of manual copying of data. Much data, especially in curated databases, is manually copied from one database to another. While describing this process is relatively simple, and while it is a particularly important form of provenance to record, progress in realizing this has been slow. There are several reasons. First, the database into which a data element is being copied may have no field in which to record provenance information; and providing, for each data value, another value that describes provenance could double (or worse) the size of the database unless special techniques such as those suggested in [BCC06] are used. Second, the source database may have no accepted method of describing from where (from what location in the database) the data element was copied. This is related to the data citation problem. Third, if we are to assume that the provenance information is to be manually transferred or entered, we may be requiring too much of the database curators. Typically, and understandably, these people are more interested in extending the scope of their database and keeping it current than they are in the scholarly record.
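The interception idea can be sketched as follows. The clipboard API and the source path are invented for illustration; this is not the design of [BCC06] or of any real desktop environment:

```python
# A minimal sketch of a provenance-aware clipboard: copying stores the
# source location alongside the value, and pasting records both into the
# target, so the copy operation itself captures provenance. The API and
# the source path below are invented for illustration.

class ProvenanceClipboard:
    def __init__(self):
        self._value = None
        self._source = None

    def copy(self, value, source):
        # Remember not just the value, but where it was copied from.
        self._value, self._source = value, source

    def paste(self, target_db, target_field):
        # Store the value together with its recorded source.
        target_db[target_field] = {"value": self._value,
                                   "provenance": self._source}

clip = ProvenanceClipboard()
curated_db = {}
clip.copy(110_000, source="census-db/greece/corfu/population")
clip.paste(curated_db, "corfu_population")
# curated_db["corfu_population"] now carries both the value and its source.
```

The curator does nothing extra: the provenance rides along with the ordinary copy-paste gesture, which is precisely what makes interception attractive.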
This last point illustrates that one of the major challenges is to change the mindset of people who manipulate and publish data, and of the people who design the systems they use. For example, data items are often copied by a copy-paste operation, the unassuming Ctrl-c Ctrl-v keystrokes that are part of nearly every desktop environment. This is where much provenance gets lost. Altering the effect of this operation, and the attitude of the people (all of us) who use it, is probably the most important step to be made towards automatic recording of provenance. The foregoing illustrates that while, in many cases, our provenance models tell us what should be captured, our systems may require fundamental revision in order to do this.

4 Related topics

In this section we briefly describe a number of topics concerning the use and quality of data that are closely related to provenance.

4.1 Privacy

Although capturing complete provenance information is desirable, making it available to all users may raise privacy concerns. For example, intermediate data within a workflow execution may contain sensitive information, such as a social security number, a medical record, or financial information about an individual; in our running example, the set of potential disorders may be confidential information. Although certain users performing an analysis with a workflow may be allowed to see such confidential data, making it available through a workflow repository, even for scientific purposes, is an unacceptable breach of privacy. Beyond data privacy, a module itself may be proprietary, meaning that users should not be able to infer its behavior. While some systems may attempt to hide the meaning or behavior of a module by hiding its name and/or its source code, this does not work when provenance is revealed: allowing the user to see the inputs and outputs of the module over a large number of executions reveals its behavior and violates module privacy (see [DKPR10, DKRCB10] for more details). Finally, details of how certain modules in the workflow are connected may be proprietary, and therefore showing how data is passed between modules may reveal too much of the structure of the workflow; this can be thought of as structural privacy. There is therefore a tradeoff between the amount of provenance information that can be revealed and the privacy guarantees of the components involved.

4.2 Archiving, citation, and data currency

Here is a completely unscientific experiment that anyone can do. One of the most widely used sources of demographic data is the World Factbook[2]. Until recently it was published annually, but it is now published on-line and updated more frequently. The following table contains, for each of the last ten years, the Factbook estimate of the population of China (a 10-digit number) and the number of hits reported by Google on that number at the time of writing (August 2010):

Year    Population       Google hits
2010    1,338,612,968    15,...
2009    1,338,612,968    15,...
2008    1,330,044,544    6,...
2007    1,321,851,888    15,...
2006    1,313,973,713    5,...
2005    1,306,313,812    20,...
2004    1,298,847,624    4,...
2003    1,286,975,468    3,...
2002    1,284,303,705    1,...
2001    1,273,111,290    32,...
2000    1,261,832,...    ...

The number of hits that a random 10-digit number gets is usually less than 100, so we have some reason to think that most of the pages contain an intended copy of a Factbook figure for the population of China. The fluctuation is difficult to explain: it may have to do with release dates of the Factbook; it may also be the result of highly propagated shock stories of population explosion.
Does this mean that most Web references to the population of China are stale? Not at all. An article about the population of China written in 2002 should refer to the population of China in that year, not the population at the time someone is reading the article. However, a desultory examination of some of the pages that refer to old figures reveals that a substantial number of them are intended as current reference and should have been updated.

What this does show is the need for archiving evolving data sets. Not only should provenance for these figures be provided (it seldom is), but it should also be possible to verify that the provenance is correct. It is not at all clear who, if anyone, is responsible for archiving, and publishing archives of, an important resource like the Factbook. The problem is certainly aggravated by the fact that the Factbook is now continually updated rather than being published in annual releases. The story is the same for many curated data sets: even though tools have been developed for space-efficient archiving [BKTT04], few publishers of on-line data do a good job of keeping archives. What this means is that, even if people keep provenance information, the provenance trail can go dead.

[2] The Central Intelligence Agency World Factbook.
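The idea behind archiving an evolving value can be sketched as a toy year-keyed store; this is not the compressed scheme of [BKTT04], and the figures are illustrative Factbook estimates:

```python
# A minimal sketch of archiving an evolving value: each update is stored
# with the year it became current, so a citation such as "the Factbook
# figure for 2002" can still be resolved after the value has changed.
# This is a toy year-keyed store, not the space-efficient scheme of
# [BKTT04]; updates must be recorded in increasing year order.
import bisect

class Archive:
    def __init__(self):
        self._years = []
        self._values = []

    def record(self, year, value):
        self._years.append(year)
        self._values.append(value)

    def as_of(self, year):
        """Return the value that was current in the given year."""
        i = bisect.bisect_right(self._years, year) - 1
        if i < 0:
            raise KeyError(year)
        return self._values[i]

china = Archive()
china.record(2000, 1_261_832_482)
china.record(2002, 1_284_303_705)
china.record(2009, 1_338_612_968)

# A 2002 article should cite the figure that was current in 2002,
# even though the published value has since moved on.
figure_2002 = china.as_of(2002)
```

With such an archive in place, a provenance record can point at a dated version rather than at a value that may silently change or disappear.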
Along with archiving is the need to develop standards for data citation. There are well-developed standards, several of them, for traditional citations. The important observation is that a citation carries more than simple provenance information about where the relevant information originates; it also carries useful information such as authorship, a brief description or title, and other useful context. It is often useful to store this along with provenance information, or to treat it as part of provenance information. It is especially important to do this when, as we have just seen, the source data may no longer exist.

5 Conclusion

Provenance is fundamental to understanding data quality. While models for copy-paste provenance [BCC06], database-style provenance [CCT09, BCTV08], and workflow provenance [DF08, FKSS08] are starting to emerge, there are a number of problems that require further understanding, including: operating across heterogeneous models of provenance; capturing provenance; compressing provenance; securing and verifying provenance; efficiently searching and querying provenance; reducing provenance overload; respecting privacy while revealing provenance; and provenance for evolving data sets.

The study of provenance is also causing us to rethink established systems for information storage. In databases, provenance has shed new light on the semantics of updates; in ontologies it is having the even more profound effect of calling into question whether the three-column organization of RDF is adequate.

We have briefly discussed models of provenance in this paper. While much further research is needed in this area, it is already clear that the major challenges to the capture of provenance lie in engineering the next generation of programming environments and user interfaces, and in changing the mindset of the publishers of data to recognize the importance of provenance.
References

[ABC+10] Umut Acar, Peter Buneman, James Cheney, Jan Van den Bussche, Natalia Kwasnikowska, and Stijn Vansummeren. A graph model of data and workflow provenance. In Theory and Practice of Provenance (TaPP), 2010.

[BCC06] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2006.

[BCTV08] Peter Buneman, James Cheney, Wang-Chiew Tan, and Stijn Vansummeren. Curated databases. In PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1-12, New York, NY, USA, 2008. ACM.

[BKTT04] Peter Buneman, Sanjeev Khanna, Keishi Tajima, and Wang-Chiew Tan. Archiving scientific data. ACM Transactions on Database Systems, 29(1):2-42, 2004.

[BT07] Peter Buneman and Wang-Chiew Tan. Provenance in databases. In SIGMOD Conference, 2007.

[CCT09] James Cheney, Laura Chiticariu, and Wang-Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379-474, 2009.

[DF08] Susan B. Davidson and Juliana Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD Conference, 2008.

[DKPR10] Susan B. Davidson, Sanjeev Khanna, Debmalya Panigrahi, and Sudeepa Roy. Preserving module privacy in workflow provenance. Manuscript, 2010.

[DKRCB10] Susan B. Davidson, Sanjeev Khanna, Sudeepa Roy, and Sarah Cohen-Boulakia. Privacy issues in scientific workflow provenance. In Proceedings of the 1st International Workshop on Workflow Approaches for New Data-Centric Science, June 2010.

[FKSS08] Juliana Freire, David Koop, Emanuele Santos, and Cláudio T. Silva. Provenance for computational tasks: A survey. Computing in Science and Engineering, 10(3):11-21, 2008.

[GKT07] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, pages 31-40, 2007.

[MFF+08] Luc Moreau, Juliana Freire, Joe Futrelle, Robert E. McGrath, Jim Myers, and Patrick Paulson. The open provenance model: An overview. In IPAW, 2008.

[SPG05] Yogesh Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31-36, 2005.

[TMG+07] Daniele Turi, Paolo Missier, Carole A. Goble, David De Roure, and Tom Oinn. Taverna workflows: Syntax and semantics. In eScience, 2007.
Scarcity, Conflicts and Cooperation: Essays in Political and Institutional Economics of Development. Pranab Bardhan
Scarcity, Conflicts and Cooperation: Essays in Political and Institutional Economics of Development by Pranab Bardhan Table of Contents Preface Chapter 1: History,Institutions and Underdevelopment Appendix:
Test Data Management Concepts
Test Data Management Concepts BIZDATAX IS AN EKOBIT BRAND Executive Summary Test Data Management (TDM), as a part of the quality assurance (QA) process is more than ever in the focus among IT organizations
OpenAIRE Research Data Management Briefing paper
OpenAIRE Research Data Management Briefing paper Understanding Research Data Management February 2016 H2020-EINFRA-2014-1 Topic: e-infrastructure for Open Access Research & Innovation action Grant Agreement
Book Review of Rosenhouse, The Monty Hall Problem. Leslie Burkholder 1
Book Review of Rosenhouse, The Monty Hall Problem Leslie Burkholder 1 The Monty Hall Problem, Jason Rosenhouse, New York, Oxford University Press, 2009, xii, 195 pp, US $24.95, ISBN 978-0-19-5#6789-8 (Source
Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques
Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques Sean Thorpe 1, Indrajit Ray 2, and Tyrone Grandison 3 1 Faculty of Engineering and Computing,
EUR-Lex 2012 Data Extraction using Web Services
DOCUMENT HISTORY DOCUMENT HISTORY Version Release Date Description 0.01 24/01/2013 Initial draft 0.02 01/02/2013 Review 1.00 07/08/2013 Version 1.00 -v1.00.doc Page 2 of 17 TABLE OF CONTENTS 1 Introduction...
Nexus Professional Whitepaper. Repository Management: Stages of Adoption
Sonatype Nexus Professional Whitepaper Repository Management: Stages of Adoption Adopting Repository Management Best Practices SONATYPE www.sonatype.com [email protected] +1 301-684-8080 12501 Prosperity
Digital Assets Repository 3.0. PASIG User Group Conference Noha Adly Bibliotheca Alexandrina
Digital Assets Repository 3.0 PASIG User Group Conference Noha Adly Bibliotheca Alexandrina DAR 3.0 DAR manages the full lifecycle of a digital asset: its creation, ingestion, metadata management, storage,
Data Warehousing and OLAP Technology for Knowledge Discovery
542 Data Warehousing and OLAP Technology for Knowledge Discovery Aparajita Suman Abstract Since time immemorial, libraries have been generating services using the knowledge stored in various repositories
Motivating Clinical SAS Programmers
Motivating Clinical SAS Programmers Daniel Boisvert, Genzyme Corporation, Cambridge MA Andy Illidge, Genzyme Corporation, Cambridge MA ABSTRACT Almost every job advertisement requires programmers to be
System Requirements for Archiving Electronic Records PROS 99/007 Specification 1. Public Record Office Victoria
System Requirements for Archiving Electronic Records PROS 99/007 Specification 1 Public Record Office Victoria Version 1.0 April 2000 PROS 99/007 Specification 1: System Requirements for Archiving Electronic
