A STUDY IN USER-CENTRIC DATA INTEGRATION




Heiner Stuckenschmidt (1, 2), Jan Noessner (1), and Faraz Fallahi (3)
(1) School of Business Informatics and Mathematics, University of Mannheim, 68159 Mannheim, Germany
(2) Institute for Enterprise Systems (InES), L 15, 1-6, 68131 Mannheim, Germany
(3) ontoprise GmbH, An der RaumFabrik 33a, 76227 Karlsruhe, Germany

1. Motivation
Data integration maps different data sources to a consistent target structure.
The diagram shows three layers:
Target structure (ontology): an encompassing, consistent view of the data.
Data integration rules.
Data sources: direct extraction from different data sources.
Jan Noessner, Lehrstuhl für künstliche Intelligenz (Chair of Artificial Intelligence), University of Mannheim

Outline
1. Motivation
2. Related Work
3. User-Centric Mapping Assistant Approach
4. Study Design and Datasets
5. Experimental Results
6. Conclusion and Future Work

2. Related Work
Automatic data integration approaches are still error-prone and need to be supervised by human domain experts.
The problem of data integration has been studied intensively on a technical level in different areas of computer science. Researchers have investigated the automatic identification of semantic relations between different datasets (Euzenat and Shvaiko, 2007). A prominent line of research investigates the use of ontologies (formal representations of the conceptual structure of an application domain) as a basis for both identifying and using semantic relations.

2. Related Work
Existing work in user-centric data integration has investigated rather simple scenarios.
In a recent study, Gass and Maedche investigated the problem of data integration in the context of personal information management from a user-centric point of view (Gass and Maedche, 2011). The scenario addressed in their work, however, focuses on the integration of rather simple data schemas, in that case personal data, where the task is mainly to map properties describing a person (e.g. name or bank account number).

2. Related Work
Traditional user interfaces try to visualize integration rules.
Most approaches are based on advanced visualization of the models to be integrated and the mappings created by the user (Granitzer et al., 2010).

2. Related Work
Drawbacks are visualization limits in the number and complexity of integration rules.
Visualizations quickly reach their limits if:
Many integration rules exist (figure: presentation of data integration rules in AgreementMaker).
Very complex mapping rules exist, which are hard to visualize. Example of such a rule:

    @{c#ccmappingrule1305046217116} ?_SIID:highcomfortandlowsavetycar :-
        ?_SIID:<http://www.owl-ontologies.com/autos.owl#Cars>@<http://www.owl-ontologies.com/autos.owl>
        AND (?_SIID[<http://www.owl-ontologies.com/autos.owl#hasSafetyFeaturesRating>->?_VAR0]@<http://www.owl-ontologies.com/autos.owl> AND ?_VAR0 <= 2.0)
        AND (?_SIID[<http://www.owl-ontologies.com/autos.owl#hascomfortandconveniencerating>->?_VAR1]@<http://www.owl-ontologies.com/autos.owl> AND ?_VAR1 >= 3.5).

A high level of expert knowledge is needed to interpret the consequences of the mapping rules.
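To make the consequence of the example rule easier to see, here is a minimal Python sketch of what it appears to express: a car is classified as high-comfort and low-safety when its safety features rating is at most 2.0 and its comfort and convenience rating is at least 3.5. The `Car` class, the field names, and the sample cars are assumptions made for illustration; they are not part of the autos.owl dataset or of the rule language shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class Car:
    """Hypothetical stand-in for an instance of the Cars concept."""
    name: str
    safety_rating: float   # stands in for hasSafetyFeaturesRating
    comfort_rating: float  # stands in for hascomfortandconveniencerating


def is_high_comfort_low_safety(car: Car) -> bool:
    """Mirror the two threshold conditions of the example mapping rule."""
    return car.safety_rating <= 2.0 and car.comfort_rating >= 3.5


# Invented sample instances, chosen to hit both sides of each threshold.
cars = [
    Car("car_a", safety_rating=1.5, comfort_rating=4.0),
    Car("car_b", safety_rating=4.5, comfort_rating=4.0),
    Car("car_c", safety_rating=2.0, comfort_rating=3.5),
    Car("car_d", safety_rating=1.0, comfort_rating=2.0),
]

matches = [car.name for car in cars if is_high_comfort_low_safety(car)]
print(matches)
```

Listing the instances a rule selects, rather than the rule text itself, is the kind of instance-level view the following slides argue for.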

2. Related Work
The need for user-centric data integration has been recognized.
Recently, researchers in ontology and schema matching have recognized the need for user support in aligning complex conceptual models (Falconer, 2009; Falconer and Storey, 2007).

2. Related Work
The cognitive support model for data integration by Falconer and Noy (2011) emphasizes user interaction.

3. User-Centric Mapping Assistant Approach
Our modified cognitive support model is based on identifying wrong instances and asking questions in natural language.
Elements of the model diagram:
User inspection: decision which concept to examine.
User identifies instances that have been classified incorrectly.
Diagnostic algorithm generates the minimal number of user questions.
Questions are presented to the user as natural-language sentences in a to-do list.
User answers questions.

3. User-Centric Mapping Assistant Approach
Our interactive user interface enables users to investigate data on the instance level.

3. User-Centric Mapping Assistant Approach
In the Analysis and Decision Making phase, the user decides which concept to examine.
1. User decides which concept to examine.

3. User-Centric Mapping Assistant Approach
In the Interaction phase, the user identifies wrongly classified instances.
1. User decides which concept to examine.
2. User identifies instances that have been classified incorrectly.

3. User-Centric Mapping Assistant Approach
In the Analysis and Generation phase, the system generates the minimal number of user questions.
1. User decides which concept to examine.
2. User identifies instances that have been classified incorrectly.
3. Diagnostic algorithm generates the minimal number of user questions.

3. User-Centric Mapping Assistant Approach
In the Representation phase, the questions are presented to the user in natural language.
1. User decides which concept to examine.
2. User identifies instances that have been classified incorrectly.
3. Diagnostic algorithm generates the minimal number of user questions.
4. Questions are presented to the user as natural-language sentences in a to-do list.
Example question: Is MX5_Mieta an instance of HighPerformanceCar?
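The representation step (step 4) can be sketched as a simple template over instance and concept pairs. This is only an illustration of how such a to-do list might be rendered: the slides do not describe the diagnostic algorithm or the rendering code, so the function names below are hypothetical, and `SportsCar` is an invented concept used only to show a second question.

```python
from typing import Iterable, List, Tuple


def render_question(instance: str, concept: str) -> str:
    """Turn one (instance, concept) pair into a yes/no question."""
    return f"Is {instance} an instance of {concept}?"


def build_todo_list(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Render each distinct pair once, keeping the order in which pairs arrive."""
    seen = set()
    todo = []
    for instance, concept in pairs:
        if (instance, concept) in seen:
            continue  # do not ask the same question twice
        seen.add((instance, concept))
        todo.append(render_question(instance, concept))
    return todo


# The first pair is the example from the slide; the second is invented.
pairs = [
    ("MX5_Mieta", "HighPerformanceCar"),
    ("MX5_Mieta", "SportsCar"),
    ("MX5_Mieta", "HighPerformanceCar"),
]

todo_list = build_todo_list(pairs)
for question in todo_list:
    print(question)
```

Which pairs end up in the list is decided by the diagnostic algorithm of step 3, which this sketch treats as given input.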


4. Study Design and Datasets
The source dataset is an instructional dataset from the web. The target schema was created manually.
Source dataset: instructional dataset from the car-selling domain (http://gaia.isi.cnr.it/ straccia/download/teaching/si/2006/autos.owl).
The dataset contains:
324 data records (cars, car parts, etc.).
100 attributes (such as speed, fuel consumption, ...).
91 concepts organized in a concept hierarchy.
It is complex enough, but small enough to be handled in a user study.
Target schema: shown as a figure on the slide.

4. Study Design and Datasets
Ten integration rules were wrong and had to be identified by the subjects (dependent variable).
There were two datasets, each containing 10 wrong integration rules.
Type 1, easy mistakes. Example from the slide: Wheel mapped to Engine.
Type 2, complex mistakes. Example from the slide: AirCondition with the filter haszonenumber = 2, hasautomatic = false mapped to AutomaticOneZoneAirCondition.
The subjects had to find as many wrong integration rules as possible. The dependent variable is the number of errors the subjects found in the respective dataset.
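The Type 2 example can be made concrete with a small sketch that applies the faulty filter to a few records and compares the result with what the concept name suggests. The records are invented, and the "intended" filter (one zone, automatic) is an assumption read off the name AutomaticOneZoneAirCondition; the slides do not state the corrected rule.

```python
# Invented AirCondition records; attribute names follow the slide text.
records = [
    {"id": "ac_1", "haszonenumber": 1, "hasautomatic": True},
    {"id": "ac_2", "haszonenumber": 2, "hasautomatic": False},
    {"id": "ac_3", "haszonenumber": 1, "hasautomatic": False},
]


def faulty_rule(record: dict) -> bool:
    """Filter as shown on the slide for the wrong integration rule."""
    return record["haszonenumber"] == 2 and record["hasautomatic"] is False


def intended_rule(record: dict) -> bool:
    """Assumed intent, inferred only from the concept name."""
    return record["haszonenumber"] == 1 and record["hasautomatic"] is True


# Instances the faulty rule classifies into the concept although they should not be there.
wrongly_included = [r["id"] for r in records if faulty_rule(r) and not intended_rule(r)]
# Instances that should be in the concept but are left out by the faulty rule.
missed = [r["id"] for r in records if intended_rule(r) and not faulty_rule(r)]

print("wrongly included:", wrongly_included)
print("missed:", missed)
```

A subject inspecting the instances of AutomaticOneZoneAirCondition would see `ac_2`, a two-zone manual unit, which hints at the wrong filter without reading the rule itself.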

4. Study Design and Datasets
We compared the conventional approach with the MappingAssistant approach (independent variable).


4. Study Design and Datasets
To simulate background knowledge, the subjects were given an information sheet.

4. Study Design and Datasets
Both the order of tasks and the order of datasets were switched.

4. Study Design and Datasets
We performed the study with 22 subjects.
22 subjects participated in the user study; each performed both tasks on both datasets.
6 female, 16 male.
Average age: 27.8 years (min = 21, max > 50).
54% of the subjects were students.

5. Experimental Results
Precision, Recall, and F-Measure

\[ \text{Precision} = \frac{\text{number of errors correctly identified by a subject}}{\text{number of errors identified by a subject}} \]

\[ \text{Recall} = \frac{\text{number of errors correctly identified by a subject}}{\text{number of all existing errors } (10)} \]

\[ F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
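As a worked example of these measures, the following sketch computes them for one hypothetical subject. The rule identifiers and the subject's answers are invented, and the F-measure is assumed to be the usual harmonic mean of precision and recall; only the total of 10 existing errors comes from the study design.

```python
from typing import Set, Tuple


def evaluate(reported: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one subject.

    reported: rules the subject marked as wrong.
    actual:   rules that really are wrong.
    """
    correct = reported & actual
    precision = len(correct) / len(reported) if reported else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Ten wrong rules exist in the dataset (identifiers are invented).
actual_errors = {f"rule_{i}" for i in range(1, 11)}

# Hypothetical subject: reports 8 rules, 6 of which are really wrong.
reported_errors = {f"rule_{i}" for i in range(1, 7)} | {"rule_x", "rule_y"}

precision, recall, f_measure = evaluate(reported_errors, actual_errors)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
# prints: precision=0.75 recall=0.60 F=0.67
```

Here 6 of 8 reported errors are correct (precision 0.75) and 6 of the 10 existing errors are found (recall 0.60).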

5. Experimental Results
On average, the subjects' recall was one third higher with the MappingAssistant approach.

5. Experimental Results
Comparing performance at the subject level, 91% of the subjects found more mistakes with the MappingAssistant approach.

5. Experimental Results
In the standard approach, subjects with low technical knowledge reached lower F-scores.

5. Experimental Results
In the MappingAssistant approach, the F-score reached is independent of the level of knowledge.

5. Experimental Results
User feedback is better for the MappingAssistant approach than for the standard approach.
(Figure panels: Task 1, Task 2.)

6. Conclusion and Future Work
Conclusion
The goal of our research was to enable people with little or no knowledge of the underlying technologies to integrate their data. We presented a user-centric approach to data integration that is based on a cognitive support model. We presented the results of a user study demonstrating that our MappingAssistant approach empowers users to solve data integration problems more effectively and efficiently. In particular, we showed that users were able to find more errors in mapping rules in a given period of time. Further, we were able to show that, while conventional mapping technology requires a high level of expertise, the MappingAssistant approach significantly reduces the performance difference between experienced and inexperienced users.


6. Conclusion and Future Work
In future work we will focus on correcting the wrong integration rules.
Steps shown in the workflow diagram:
Select concept and mark wrong instances.
Updating the integration rule.
Feedback questions from the system to the user.
Selection of a correction suggestion.
Identification of the wrong integration rules.
Calculation of correction suggestions for the integration rule.
Selection of the integration rule and marking of wrong instances.

End
Thank you for your attention! If you have any questions, feel free to ask.