A STUDY IN USER-CENTRIC DATA INTEGRATION

Heiner Stuckenschmidt (1, 2), Jan Noessner (1), and Faraz Fallahi (3)

1 School of Business Informatics and Mathematics, University of Mannheim, 68159 Mannheim, Germany
2 Institute for Enterprise Systems (InES), L 15, 1-6, 68131 Mannheim, Germany
3 ontoprise GmbH, An der RaumFabrik 33a, 76227 Karlsruhe, Germany
Motivation

Data integration maps different data sources to a consistent target structure.

- Target Structure (Ontology): encompassing, consistent view of the data
- Data Integration Rules
- Data Sources: direct extraction out of different data sources

Jan Noessner - Lehrstuhl für künstliche Intelligenz, University of Mannheim
Outline

1. Motivation
2. Related Work
3. User-Centric Mapping Assistant Approach
4. Study Design and Datasets
5. Experimental Results
6. Conclusion and Future Work
Related Work

Automatic data integration approaches are still error-prone and need to be supervised by human domain experts.

The problem of data integration has been studied intensively on a technical level in different areas of computer science. Researchers have investigated the automatic identification of semantic relations between different datasets (Euzenat and Shvaiko, 2007). A prominent line of research investigates the use of ontologies (formal representations of the conceptual structure of an application domain) as a basis for both identifying and using semantic relations.
Existing work in user-centric data integration investigated rather simple scenarios.

In a recent study, Gass and Maedche have investigated the problem of data integration in the context of personal information management from a user-centric point of view (Gass and Maedche, 2011). The scenario addressed in their work, however, focuses on the integration of rather simple data schemas, in that case personal data where the task is mainly to map properties describing a person (e.g. name or bank account number).
Traditional user interfaces try to visualize integration rules.

Most approaches are based on advanced visualization of the models to be integrated and the mappings created by the user (Granitzer et al., 2010).
Drawbacks are visualization limits in the number and complexity of integration rules.

Visualizations quickly reach their limits if:
- many integration rules exist;
- very complex mapping rules exist, which are hard to visualize.

Figure: Presentation of data integration rules in AgreementMaker.

Example of a complex mapping rule:

    @{c#ccmappingrule1305046217116}
    ?_SIID:highcomfortandlowsavetycar :-
        ?_SIID:<http://www.owl-ontologies.com/autos.owl#Cars>@<http://www.owl-ontologies.com/autos.owl>
        AND (?_SIID[<http://www.owl-ontologies.com/autos.owl#hasSafetyFeaturesRating>->?_VAR0]@<http://www.owl-ontologies.com/autos.owl> AND ?_VAR0 <= 2.0)
        AND (?_SIID[<http://www.owl-ontologies.com/autos.owl#hascomfortandconveniencerating>->?_VAR1]@<http://www.owl-ontologies.com/autos.owl> AND ?_VAR1 >= 3.5).

A high level of expert knowledge is needed to interpret the consequences of the mapping rules.
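Read informally, the rule above classifies a car as highcomfortandlowsavetycar when its safety features rating is at most 2.0 and its comfort and convenience rating is at least 3.5. The following is a minimal Python sketch of that reading only. The `Car` record, its field names, the function name, and the sample values are illustrative assumptions; they are not part of the rule language or of the tooling used in the study.

```python
from dataclasses import dataclass


@dataclass
class Car:
    # Hypothetical record; the fields paraphrase the ontology properties
    # hasSafetyFeaturesRating and hascomfortandconveniencerating.
    name: str
    safety_features_rating: float
    comfort_and_convenience_rating: float


def is_high_comfort_and_low_safety_car(car: Car) -> bool:
    """Mirror the two threshold conditions of the mapping rule."""
    return (car.safety_features_rating <= 2.0
            and car.comfort_and_convenience_rating >= 3.5)


# Invented sample data, used only to exercise the predicate.
cars = [
    Car("example_car_a", 1.5, 4.0),
    Car("example_car_b", 4.5, 4.0),
]
matches = [c.name for c in cars if is_high_comfort_and_low_safety_car(c)]
print(matches)
```

Even in this simplified form, judging whether the thresholds are right requires looking at which concrete cars end up in the class, which is the kind of instance-level inspection the slides argue for.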
The need for user-centric data integration has been recognized.

Recently, researchers in ontology and schema matching have recognized the need for user support in aligning complex conceptual models (Falconer, 2009; Falconer and Storey, 2007).
The cognitive support model for data integration by Falconer and Noy (2011) emphasizes user interaction.
User-Centric Mapping Assistant Approach

Our modified cognitive support model is based on identifying wrong instances and asking questions in natural language.

- User inspection: the user decides which concept to examine.
- The user identifies instances that have been classified incorrectly.
- A diagnostic algorithm generates the minimal number of user questions.
- The questions are presented to the user as natural language sentences in a to-do list.
- The user answers the questions.
Our interactive user interface enables users to investigate data on the instance level.
The approach proceeds in four phases:

1. Analysis and Decision Making: the user decides which concept to examine.
2. Interaction: the user identifies instances that have been classified incorrectly.
3. Analysis and Generation: a diagnostic algorithm generates the minimal number of user questions.
4. Representation: the questions are presented to the user as natural language sentences in a to-do list, for example: "Is MX5_Mieta an instance of HighPerformanceCar?"
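The to-do list of step 4 can be pictured as a rendering of (instance, concept) pairs into questions of the form shown above. The sketch below covers only that rendering step. How the diagnostic algorithm selects the pairs is not described on the slides, so the input list is an invented placeholder and the function name is hypothetical.

```python
from typing import List, Tuple


def render_questions(pairs: List[Tuple[str, str]]) -> List[str]:
    """Turn (instance, concept) pairs into natural language questions."""
    return [f"Is {instance} an instance of {concept}?"
            for instance, concept in pairs]


# Placeholder input: in the described approach a diagnostic algorithm would
# supply these pairs. The second pair is made up for illustration.
pending = [("MX5_Mieta", "HighPerformanceCar"), ("ExampleCar", "Cars")]
todo_list = render_questions(pending)
for number, question in enumerate(todo_list, start=1):
    print(f"{number}. {question}")
```

Keeping the rendering separate from the selection logic means the question wording can change without touching whatever produces the pairs.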
Study Design and Datasets

The source dataset is an instructional dataset from the web. The target schema was created manually.

Source dataset: instructional dataset from the car-selling domain (http://gaia.isi.cnr.it/ straccia/down load/teaching/si/2006/autos.owl)

Target schema: manually created.

The dataset contains:
- 324 data records (cars, car parts, etc.)
- 100 attributes (such as speed, fuel consumption, ...)
- 91 concepts organized in a concept hierarchy

Complex enough, but small enough to be handled in a user study.
Ten integration rules were wrong and had to be identified by the subjects (dependent variable).

Two datasets, each containing 10 wrong integration rules:
- Type 1: easy mistakes (figure labels: Wheel, Engine)
- Type 2: complex mistakes (figure labels: AirCondition; filter: haszonenumber = 2, hasautomatic = false; AutomaticOneZoneAirCondition)

The subjects had to find as many wrong integration rules as possible. The dependent variable is the number of errors the subjects found in the respective dataset.
We compared the conventional approach with the MappingAssistant approach (independent variable).
To simulate background knowledge, the subjects were given an information sheet.
Both the order of tasks and the order of datasets were switched.
We performed the study with 22 subjects.

- 22 subjects participated in the user study; each performed both tasks on both datasets.
- 6 female, 16 male.
- Average age: 27.8 years (min = 21, max > 50).
- 54% of the subjects were students.
Experimental Results

Precision, Recall, and F-Measure

$$\text{Precision} = \frac{\text{number of errors correctly identified by a subject}}{\text{number of errors identified by a subject}}$$

$$\text{Recall} = \frac{\text{number of errors correctly identified by a subject}}{\text{number of all existing errors (10)}}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
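As a worked illustration of these measures, the sketch below computes them for one hypothetical subject. The counts are invented for the example and are not results from the study, the function name is an assumption, and the F-measure is taken to be the usual harmonic mean of precision and recall.

```python
def precision_recall_f(correctly_identified: int, identified: int,
                       existing: int = 10) -> tuple:
    """Return (precision, recall, F-measure) for one subject."""
    # Guard against division by zero when nothing was reported.
    precision = correctly_identified / identified if identified else 0.0
    recall = correctly_identified / existing if existing else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: a subject reports 8 errors, 6 of which are real,
# out of the 10 errors present in the dataset.
p, r, f = precision_recall_f(correctly_identified=6, identified=8)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

A subject who reports many rules indiscriminately raises recall at the cost of precision, which is why the combined F-measure is reported alongside both.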
Average performance of subjects: recall was one third higher with the MappingAssistant approach.
Comparing performance on the subject level: 91% of the subjects found more mistakes with the MappingAssistant approach.
In the standard approach, subjects with low technical knowledge reached lower F-scores.
In the MappingAssistant approach, the F-score reached is independent of the level of knowledge.
The user feedback is better for the MappingAssistant approach than for the standard approach.

Figure panels: Task 1, Task 2.
Conclusion and Future Work

Conclusion

The goal of our research was to enable people with little or no knowledge of the technologies to integrate their data. We presented a user-centric approach to data integration that is based on a cognitive support model. We presented the results of a user study demonstrating that our MappingAssistant approach empowers users to solve data integration problems more effectively and efficiently. In particular, we showed that users were able to find more errors in mapping rules in a given period of time. Further, we were able to show that, while conventional mapping technology requires a high level of expertise, the MappingAssistant approach significantly reduces the performance difference between experienced and inexperienced users.
In future work we will focus on correcting the wrong integration rules.

Steps shown in the figure:
- Select concept and mark wrong instance
- Updating the integration rule
- Feedback questions from the system to the user
- Selection of a correction suggestion
- Identification of the wrong integration rules
- Calculation of correction suggestions for the integration rule
- Selection of the integration rule and marking of wrong instances
End

Thank you for your attention! If you have any questions, feel free to ask.