Automating the Schema Matching Process for Heterogeneous Data Warehouses
Marko Banek 1, Boris Vrdoljak 1, A. Min Tjoa 2, and Zoran Skočir 1

1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia
{marko.banek,boris.vrdoljak,zoran.skocir}@fer.hr
2 Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040 Wien, Austria
[email protected]

I.Y. Song, J. Eder, and T.M. Nguyen (Eds.): DaWaK 2007, LNCS 4654, pp. 45-54. Springer-Verlag Berlin Heidelberg 2007

Abstract. A federated data warehouse is a logical integration of data warehouses, applicable when physical integration is impossible due to privacy policy or legal restrictions. In order to enable the translation of queries in a federated approach, the schemas of the federated and the local warehouses must be matched. In this paper we present a procedure that enables the matching process for schema structures specific to the multidimensional model of data warehouses: facts, measures, dimensions, aggregation levels and dimensional attributes. Similarities between warehouse-specific structures are computed by linguistic and structural comparison, and the calculated values are used to create the necessary mappings. We present restriction rules and recommendations for aggregation level matching, which is the most complex part of the process. A software implementation of the entire process is provided in order to verify it, as well as to determine the proper selection metric for mapping different multidimensional structures.

1 Introduction

Increasing competitiveness in business and permanent demands for greater efficiency (whether in business or in government and non-profit organizations) force independent organizations to integrate their data warehouses. Data warehouse integration enables a broader knowledge base for decision-support systems, knowledge discovery and data mining than each of the independent warehouses could offer separately. Large corporations integrate their separately developed regional warehouses, newly merged companies integrate their warehouses to enable the business to be run centrally, and independent organizations join their warehouses for the significant benefit of all participants and/or their customers.

Copying data from the source heterogeneous warehouses to a new central location would be the easiest way to perform the integration. However, the privacy policies of the organizations, as well as the legal protection of sensitive data, in the majority of cases prohibit the data from being copied outside the organizations where they are stored. Hence, a federated data warehouse is created, which implies that the integration must be performed from a logical point of view, using a common conceptual model, while
the heterogeneous source warehouses exist physically. The user of such a federated solution observes the whole federation as a single unit, i.e. she must not notice that several heterogeneous parts of the warehouse actually exist. The logical existence of the federated warehouse does not have any impact on the users of the local component warehouses [13]. In [3, 12], we developed the conceptual model of a federated data warehouse that unifies data warehouses of different health insurance organizations.

Queries on the federated data warehouse must be translated into sub-queries that correspond to the schemas of the local warehouses. Since the multidimensional data model prevails in data warehouse design, query translation requires discovering mappings between particular structures of the federated multidimensional conceptual model (facts, measures, dimensions, aggregation levels and dimensional attributes) and their matching counterparts in the multidimensional conceptual models of the local warehouses. In this paper, the process of creating the mappings between the matching warehouse-specific components is automated as much as possible in order to shorten the data warehouse integration process as a whole.

The contribution of our work is a match discovery procedure for data warehouse-specific structures, with particular emphasis on aggregation levels. We enhance the match discovery techniques for database schemas, enabling their application to data warehouse schemas. Software that automates the integration process is provided and the entire procedure is evaluated on an example.

The paper is structured as follows. Section 2 gives an overview of the related work. Basic strategies for matching multidimensional structures are presented in Section 3. Similarity functions for multidimensional structures are explained in Section 4, while the mapping strategies are shown in Section 5. Section 6 evaluates the performance of the algorithm that matches data warehouse schemas automatically. Conclusions are drawn in Section 7.

2 Related Work

There exist two basic approaches to the automated matching of database schemas, semi-structured data schemas and ontologies: schema-based and instance-based [10]. Schema-based matching considers only schema information, not instance data. The most prominent approaches are ARTEMIS/MOMIS [1], Cupid [7] and similarity flooding [8, 9]. ARTEMIS/MOMIS and Cupid are based on linguistic matching. Using a word thesaurus, structure names are compared and their similarity is expressed as a probability function. Complex structures are decomposed into substructures where linguistic comparison can be performed, and thereafter a formula is used to compute the similarity of the complex structures. The similarity flooding algorithm translates database tables or semi-structured sources into graphs whose structures are then compared. Names of the graph vertices are simply compared as strings and no semantic knowledge is required.

Instance-based approaches apply various learning and mining techniques to compare instance data, together with schema metadata (thus being able to outperform schema-based techniques). iMAP [5] is a comprehensive approach for semi-automatic
discovery of semantic matches between database schemas, which uses neural networks and text searching methods. An automated check of match compatibility between data warehouse dimensions is performed in [4], but match candidates must first be proposed manually by the integration designer.

To the best of our knowledge, none of the existing frameworks for automated schema matching can be successfully applied to match data warehouses, due to the specific features of the multidimensional conceptual model of data warehouses.

3 Matching Strategy for Solving Heterogeneities Among Multidimensional Structures

Our algorithm aims at automating the matching of data warehouse schemas in cases when access to the data warehouse content is prohibited (e.g. medical data warehouses). Hence, we develop a matching algorithm that is based exclusively on analyzing data warehouse schemas.

3.1 Classification of Schema Heterogeneities

Heterogeneities in data warehouse schemas arise either when (1) different structures (relational, multidimensional etc.) are used to represent the same information, or (2) different specifications (i.e. different interpretations) of the same structure exist. A survey of heterogeneities that arise in multidatabase systems, i.e. in the integration of different relational databases, is given in [6]. Attribute conflicts are due to different names and domains and can be divided into one-to-one conflicts (see Fig. 1: city in table INSURANT corresponding to municipality in table PATIENT is an example of a naming conflict, while the two country attributes with different data types are an example of a domain conflict) and one-to-many conflicts (street_name and street_number in table INSURANT corresponding to the single attribute address). Table conflicts occur when two tables in different databases describe the same information, but different names, data types or constraints are used to represent it (there is a table conflict between INSURANT and PATIENT).

[Fig. 1. Two compatible tables with attribute and table conflicts: INSURANT(first_name, last_name, street_name, street_number, city, country) and PATIENT(given_name, middle_name, family_name, address, municipality, country (character)).]

Heterogeneities that are specific to data warehouses and the multidimensional conceptual model are analyzed in [2]. Diverse aggregation hierarchies can manifest either as a lowest level conflict or an inner level conflict. In the first case, two semantically corresponding dimensions (e.g. the time dimensions in DWH1 and DWH2 in Fig. 2) have different lowest (i.e. basic) grain levels (hour and day), which means that the granularity of their facts is also different. In the second case there is a common aggregation level (not necessarily the lowest), but the aggregation at coarser grain levels takes different paths (levels month and week in DWH1 and DWH2).
The dimensionality conflict corresponds to a different number of dimensions associated with the same fact (the hour and time dimensions in DWH1 corresponding to a single time dimension in DWH2). Finally, schema-instance conflicts appear when some context of the fact in one data warehouse becomes the content (value) of dimensions in the other (insurance and patient are part of the measure names in DWH1, while being values of the cost_type dimension in DWH2).

[Fig. 2. Data warehouses with heterogeneities specific to the multidimensional model. DWH1: fact TREATMENT with measures cost_insurance and cost_patient, a time hierarchy day, month, year and a separate hour dimension. DWH2: fact TREATMENT with measure cost, a cost_type dimension and a single time hierarchy hour, day, week, year.]

3.2 The Matching Algorithm

The matching process determines that two multidimensional structures S1 and S2, belonging to data warehouses DWH1 and DWH2, respectively, are mutually equivalent. The two structures, S1 and S2, may be facts, measures, dimensions, aggregation levels or dimensional attributes, but both of them must belong to the same type of multidimensional structure. Mapping cardinalities can be one-to-one, one-to-many or many-to-many. One-to-many mappings are more to be expected for attributes and measures than for aggregation levels, dimensions or facts.

The algorithm that automatically matches heterogeneous data warehouse schemas consists of two basic phases: (1) comparison of match target structures and (2) creation of mappings. During the first phase the equivalence of multidimensional structures is determined as the value of a probability function called the similarity function. Similarity between multidimensional structures is calculated (using a heuristic algorithm given in Section 4) by comparing their names as well as their data types (for attributes and measures) or substructures (for aggregation levels, dimensions and facts). While the same basic idea is used by ARTEMIS/MOMIS and Cupid, these frameworks cannot solve the heterogeneities that specifically occur in the multidimensional model of data warehouses, especially the hierarchical organization of aggregation levels in dimensions. In the second phase, heuristic rules are applied to determine which structures should be mapped as equivalent, among a much larger number of possible matches. The result of the automated matching process must correspond as closely as possible to the solution that a data warehouse designer would produce manually.

Our matching algorithm recognizes four levels of matching: (1) facts, (2) dimensions and measures, constructing the facts, (3) aggregation levels, constructing the dimensions, and (4) dimensional attributes, constructing the aggregation levels. Similarity calculation, the first part of the matching algorithm, starts with the atomic structures, dimensional attributes and measures, as their similarity is computed from the similarity of their names and data types. Similarity between aggregation levels is calculated next, taking into account their names and the already calculated similarity between the attributes of which they consist. The process continues with dimension similarity, while similarity between facts is determined at the end.
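To make this ordering concrete, the following is a minimal sketch of the bottom-up similarity phase, assuming warehouses expose flat lists of their structures; the atom_sim and complex_sim callables stand for the formulas of Section 4, and all names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the bottom-up similarity phase; all names are
# illustrative assumptions, not the authors' implementation.

def similarity_tables(dwh1, dwh2, atom_sim, complex_sim):
    """Compute similarity dictionaries from atomic to complex structures."""
    sims = {}
    # (4) atomic structures first: dimensional attributes and measures
    sims["attribute"] = {(a1, a2): atom_sim(a1, a2)
                         for a1 in dwh1.attributes for a2 in dwh2.attributes}
    sims["measure"] = {(m1, m2): atom_sim(m1, m2)
                       for m1 in dwh1.measures for m2 in dwh2.measures}
    # (3) aggregation levels: names plus already computed attribute similarities
    sims["level"] = {(l1, l2): complex_sim(l1, l2, sims["attribute"])
                     for l1 in dwh1.levels for l2 in dwh2.levels}
    # (2) dimensions, from level similarities
    sims["dimension"] = {(d1, d2): complex_sim(d1, d2, sims["level"])
                         for d1 in dwh1.dimensions for d2 in dwh2.dimensions}
    # (1) facts last, from dimension and measure similarities
    sims["fact"] = {(f1, f2): complex_sim(f1, f2,
                                          {**sims["dimension"], **sims["measure"]})
                    for f1 in dwh1.facts for f2 in dwh2.facts}
    return sims
```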
On the other hand, we opine that mapping (the second phase of the automated matching process) must start with the most complex multidimensional structures: the facts. If we determine that two facts are compatible (i.e. that they match), we map their measures and dimensions in the next phase. The mapping candidates are only the dimensions and measures of the two facts, and no other measures and dimensions in the warehouse schema. Next, we proceed to aggregation levels and then, finally, to attributes.

3.3 Mapping Aggregation Levels

Mapping facts, measures, dimensions and dimensional attributes is similar to matching database structures. For instance, given two corresponding aggregation levels L1 and L2, an attribute A1i which belongs to L1 can be mapped to any attribute A2j belonging to L2. Each attribute is mapped to its most similar counterpart according to the value of the similarity function. The hierarchical structure, however, imposes several inherent limits on aggregation level matching, as the existing partial order must be preserved. The limitations and additional heuristic recommendations are presented in the rest of this section.

Prohibition of mappings that violate the partial order in hierarchies. Let D1 and D2 be two mapped dimensions, each of which (for reasons of simplicity) consists of a single hierarchy. Let L1i be an aggregation level in D1 and L2j an aggregation level in D2. Let L1i and L2j be equivalent, matching levels and let their mapping be already registered, as shown in the left part of Fig. 3. No level in D1 representing finer granularity than L1i can be mapped to a level in D2 representing coarser granularity than L2j. Similarly, no level in D1 representing coarser granularity than L1i can be mapped to a level in D2 representing finer granularity than L2j. In the figure, the mapping between L1i and L2j is represented by a solid line, and invalid mappings are shown as dashed lines. The coherence of the partial orders in dimensions has also been stated as a necessary condition for dimension compatibility in [4].

[Fig. 3. Preserving the partial order in hierarchies. Left: abstract levels L1(i-1), L1i, L1(i+1) in D1 and L2(j-1), L2j, L2(j+1) in D2, with the registered mapping L1i-L2j drawn solid and the forbidden crossing mappings dashed. Middle and right: example hierarchies built from day, week, month, quarter and year levels, illustrating the mappings discussed below.]

Mapping unmatched levels to the counterparts of their parent levels. This strategy is used to solve inner level conflicts. Let us suppose that the equivalent levels day1 and day2 have already been mapped (see the middle part of Fig. 3). The low value of the similarity function between week1 and month2 proves their incompatibility. Week records in D1 and month records in D2 are both aggregations obtained by summing the day records. Thus, week1 can be mapped to the counterpart of its parent (i.e. the nearest finer) level day1, which is day2. Similarly, month2 can be mapped to day1. When a query on D1 concerning week1 is translated in order to correspond to D2, records of the level day2 need to be summed.
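The first rule can be checked mechanically. Below is a small sketch, assuming each dimension is a single hierarchy given as a list of level names ordered from finest to coarsest; the function and variable names are ours, not the authors'.

```python
def violates_partial_order(l1, l2, mapped, h1, h2):
    """Return True if mapping level l1 (in hierarchy h1, ordered finest to
    coarsest) to l2 (in hierarchy h2) would cross a registered mapping."""
    i, j = h1.index(l1), h2.index(l2)
    for m1, m2 in mapped:
        k, n = h1.index(m1), h2.index(m2)
        # a finer level on one side must not map to a coarser one on the other
        if (i < k and j > n) or (i > k and j < n):
            return True
    return False

# With (day1, day2) already mapped, month1 -> hour2 crosses it and is
# rejected, while year1 -> year2 is allowed.
h1 = ["day1", "month1", "year1"]
h2 = ["hour2", "day2", "week2", "year2"]
mapped = [("day1", "day2")]
print(violates_partial_order("month1", "hour2", mapped, h1, h2))  # True
print(violates_partial_order("year1", "year2", mapped, h1, h2))   # False
```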
Prohibition of mapping unmatched levels to the counterparts of their descendants. This rule is not inherent to the multidimensional model like both previous rules (it would be possible to create valid mappings that are not in accordance with it), but is experience-based. Mapping L1i to L2(j+1) or L2(j+2), as well as L2j to L1(i+1) or L1(i+2) (see the left part of Fig. 4), should be forbidden. An explanation can be obtained by viewing the matching process from the other direction. The goal of the mapping process is to map every level to the coarsest possible counterpart that semantically corresponds to the target (in order to preserve the richness of the hierarchy). Supposing that L1(i+2) is equally similar to L2j and L2(j+2), we prefer mapping it to L2(j+2). Returning to the initial problem, mapping L2j to L1(i+2) can be viewed in the opposite direction, as mapping L1(i+2) to L2j (the right part of Fig. 4). L1(i+2) is already mapped to L2(j+2), which is coarser than L2j, so mapping it also to L2j would be redundant and useless.

[Fig. 4. Mapping unmatched levels to the counterparts of their descendants. Left: levels L1i, L1(i+1), L1(i+2) in D1 and L2j, L2(j+1), L2(j+2) in D2 with the forbidden mappings marked; right: the same situation with L1(i+2) mapped to L2(j+2).]

4 Similarity Functions for Multidimensional Structures

As stated in Section 3, the process of schema matching starts with calculating similarities between the various multidimensional structures and then applying a match selection algorithm based on those similarities. The basic idea of similarity calculation is that the similarity of complex structures is computed from the similarities of their less complex substructures, by means of some mathematical expression. Therefore, the similarity between any two complex multidimensional structures of the same kind can recursively be translated into a set of calculations at the atomic level: the level of dimensional attributes and measures.

We introduce a new formula for calculating the similarity between two attributes or two measures. Their similarity is obtained from the similarity of their names (nsim) and the data type compatibility coefficient tcoeff raised to the power determined by a non-negative real number texp (the latter enables us to calibrate the formula):

    sim_att(A1, A2) = nsim(A1, A2) * tcoeff(A1, A2)^texp,  texp ∈ [0, ∞).    (1)

Similarity of names is actually the semantic similarity of the words that constitute the name. Semantic similarity functions express the degree of relatedness of two input words (word senses), which is calculated by using WordNet [14], a large thesaurus of the English language, hand-crafted by psycholinguists. WordNet organizes terms according to a human native speaker's perception, providing lists of synonyms, antonyms and homonyms, as well as the subordination-superordination hierarchy and part-whole relations.

We use the semantic similarity calculation method presented by Yang and Powers [15] to calculate name similarity. Words (word senses) in WordNet are interpreted as graph vertices, connected by edges representing subordination-superordination and part-whole relations (each edge is given a weight according to the relation type). All possible paths between two target vertices are constructed and the weights are multiplied along each path. The highest weight product becomes the linguistic similarity of the two target words.

We assume a notion of data type compatibility such that attributes sharing their data type are more related than those that do not. We reduce the data types to four basic ones: numeric, string, datetime and boolean. Rarely appearing data types (datetime, boolean) are more indicative than the frequent string and numeric types, both when their compatibility indicates a higher similarity and when their incompatibility suggests that the attributes do not correspond. We empirically determine the following values of tcoeff for two compatible basic types: 1.0 for two datetime or two boolean attributes, and 0.9 for two numeric or two string attributes. Combinations of incompatible data types imply the following values of tcoeff: 0.8 for numeric-string, 0.7 for datetime-boolean and 0.75 for all other combinations.
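Formula (1) and the tcoeff values above translate directly into code. A sketch follows, where the Attribute type and the stand-in name-similarity function are our own illustrative assumptions; in the actual procedure, nsim would be the WordNet-based measure of Yang and Powers [15].

```python
from collections import namedtuple

# dtype is one of the four basic types: numeric, string, datetime, boolean
Attribute = namedtuple("Attribute", ["name", "dtype"])

# tcoeff values from the text; keys are unordered pairs of basic types
_TCOEFF = {
    frozenset(["datetime"]): 1.0,
    frozenset(["boolean"]): 1.0,
    frozenset(["numeric"]): 0.9,
    frozenset(["string"]): 0.9,
    frozenset(["numeric", "string"]): 0.8,
    frozenset(["datetime", "boolean"]): 0.7,
}

def tcoeff(t1, t2):
    """Data type compatibility coefficient; 0.75 for all other combinations."""
    return _TCOEFF.get(frozenset([t1, t2]), 0.75)

def sim_att(a1, a2, nsim, texp=1.0):
    """Formula (1): sim_att(A1, A2) = nsim(A1, A2) * tcoeff(A1, A2) ** texp."""
    return nsim(a1.name, a2.name) * tcoeff(a1.dtype, a2.dtype) ** texp

# Example with a trivial stand-in for the WordNet-based name similarity:
equal = lambda n1, n2: 1.0 if n1 == n2 else 0.5
print(sim_att(Attribute("city", "string"),
              Attribute("municipality", "string"), equal))  # 0.5 * 0.9 = 0.45
```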
Similarity between complex multidimensional structures S1 and S2 is expressed as a weighted sum (i.e. a linear combination) of their name similarity (nsim) and structural similarity (ssim), as stated by Madhavan et al. [7]:

    sim(S1, S2) = w_name * nsim(S1, S2) + (1 − w_name) * ssim(S1, S2),  w_name ∈ [0, 1].    (2)

In [7] structural similarity is calculated recursively, with some initial values being adjusted in several steps. We use a different approach, adapting the formula for calculating semantic similarity among entity classes of different ontologies [11]. Neighborhood similarity between two entity classes takes into account not only their names, but also the other classes surrounding them within a certain radius. We therefore make an analogy between aggregation levels (and dimensions and facts, respectively) and entity classes. The attributes (and aggregation levels and dimensions/measures, respectively) of which they consist correspond to neighborhood classes within radius one (each attribute is directly related to the target aggregation level).
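For illustration, here is formula (2) in code. The ssim below, the average best-match similarity between the two structures' direct substructures, is our simplification in the spirit of the radius-one neighborhood analogy, not the exact formula adapted from [11]; structures are assumed to expose a name and a list of substructures.

```python
def ssim(subs1, subs2, sub_sim):
    """Structural similarity sketch: average, over each side's substructures,
    of the best similarity to any substructure on the other side."""
    if not subs1 or not subs2:
        return 0.0
    a = sum(max(sub_sim[(x, y)] for y in subs2) for x in subs1) / len(subs1)
    b = sum(max(sub_sim[(x, y)] for x in subs1) for y in subs2) / len(subs2)
    return (a + b) / 2

def sim(s1, s2, nsim, sub_sim, w_name=0.5):
    """Formula (2): weighted sum of name and structural similarity,
    assuming structures expose .name and a list .subs of substructures."""
    return (w_name * nsim(s1.name, s2.name)
            + (1 - w_name) * ssim(s1.subs, s2.subs, sub_sim))
```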
5 Mapping Multidimensional Structures

Creation of mappings is the second main phase of the algorithm for automated warehouse schema matching. This process is determined by two basic factors: (1) constraints and (2) selection metrics. While there are no particular constraints on mapping facts, dimensions, measures and dimensional attributes, aggregation level mapping is restricted by the rules given in Section 3.3, in order to preserve the partial order in hierarchies.

A selection metric defines how the calculated similarity values are used to determine which mappings are better or best among all possible mappings. The selection metric issue has long been studied in graph theory as the problem of stable marriage in bipartite graphs. Bipartite graphs consist of two disjoint parts such that no edge connects two vertices in the same part (i.e. vertices in the same part are of the same sex, while an edge symbolizes a possible marriage and can thus exist only between vertices in different parts of the graph). Each edge is given a weighting coefficient that corresponds to the similarity between the multidimensional structures represented by its vertices. The task of a selection metric (also called a filter) is to select an appropriate subset of edges as mappings and eliminate all others. In the original version of the stable marriage problem, only 1:1 mappings (i.e. the monogamous ones) are allowed.

In [9] six different filters were tested. We implemented three of them: best sum, threshold and outer. The best sum filter maps the combinations with the highest possible total sum of edge similarities. For the graph given in the left part of Fig. 5, it would produce the combinations (a1,b2) and (a2,b1), as their sum, 1.20, is greater than the sum of the other possible combination (1.04). The threshold and outer filters allow each of the partners to make a choice of their own. First, relative similarities are computed as fractions of the absolute similarities of the best match candidates for any given element:

    sim_R(a_i, b_j) = sim(a_i, b_j) / max_k sim(a_i, b_k).    (3)

Vertices with the highest relative similarity choose first (if there are several such vertices, one of the two vertices adjacent to the edge with the highest absolute similarity will take the lead; in this particular case a1 or b1). The threshold filter creates a mapping if the relative similarities on both sides of the selected edge are greater than a threshold. Thus, only the mapping (a1,b1) is created (the right part of Fig. 5). The outer filter checks the threshold only on the side that proposed the match, mapping (a2,b2) as the second pair.

[Fig. 5. Mapping the same structures differently using the best sum, threshold and outer filters. Best sum: (a1,b1)+(a2,b2) = 1.04, (a1,b2)+(a2,b1) = 1.20. Threshold, thr = 0.4: (a1->b1) = 1 > 0.4 and (b1->a1) = 1 > 0.4, accepted; (a2->b2) = 0.5 > 0.4 but (b2->a2) = 0.33 < 0.4, rejected. Outer, thr = 0.4: (a1->b1) = 1 > 0.4 and (a2->b2) = 0.5 > 0.4, both accepted.]

Polygamous (i.e. 1:N and M:N) matches are allowed for multidimensional structures. The standard best sum, threshold and outer filters can be applied to map facts, dimensions, measures and attributes. The restriction rules for mapping aggregation levels create additional dynamic constraints. If we falsely map level quarter1 to day2 (see the right part of Fig. 3), we will not be able to produce the correct mapping between month1 and month2.
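A sketch of the three filters over a similarity table sim[(a, b)] follows, with the threshold and outer variants in a simplified greedy 1:1 form in which only one side proposes; the edge values in the example are reconstructed from the sums and relative similarities reported for Fig. 5 and should be treated as an assumption rather than the paper's exact figure.

```python
from itertools import permutations

def best_sum(A, B, sim):
    """Best sum filter: the 1:1 assignment with the highest total edge
    similarity (exhaustive search; fine for small structure sets)."""
    assert len(A) <= len(B)
    best, best_total = [], float("-inf")
    for perm in permutations(B, len(A)):
        pairs = list(zip(A, perm))
        total = sum(sim[p] for p in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best

def threshold_outer(A, B, sim, thr=0.4, outer=False):
    """Threshold/outer filter in a simplified greedy 1:1 form: the vertex
    whose best edge has the highest absolute similarity proposes first."""
    free_B, mappings = set(B), []
    for a in sorted(A, key=lambda x: max(sim[(x, y)] for y in B), reverse=True):
        avail = [y for y in B if y in free_B]
        if not avail:
            break
        b = max(avail, key=lambda y: sim[(a, y)])           # a proposes
        from_a = sim[(a, b)] / max(sim[(a, y)] for y in B)  # formula (3)
        from_b = sim[(a, b)] / max(sim[(x, b)] for x in A)
        if from_a > thr and (outer or from_b > thr):        # outer: one side only
            mappings.append((a, b))
            free_B.discard(b)
    return mappings

# Edge similarities consistent with the Fig. 5 numbers (an assumption):
sim = {("a1", "b1"): 0.80, ("a1", "b2"): 0.72,
       ("a2", "b1"): 0.48, ("a2", "b2"): 0.24}
A, B = ["a1", "a2"], ["b1", "b2"]
print(best_sum(A, B, sim))                     # [('a1', 'b2'), ('a2', 'b1')]
print(threshold_outer(A, B, sim))              # [('a1', 'b1')]
print(threshold_outer(A, B, sim, outer=True))  # [('a1', 'b1'), ('a2', 'b2')]
```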
6 Evaluation of Algorithm Performance

The quality of the automated matching process is measured by its accuracy, as defined in [8, 9]. The measure estimates how much effort it costs the user to modify the automatically proposed match result P = {(x1, y1), ..., (xn, yn)} into the intended result I = {(a1, b1), ..., (am, bm)}, i.e. how many additions and deletions of match pairs have to be made. Let c = |P ∩ I| be the number of correct suggestions. The difference (n − c) is the number of false positives to be removed from P, and (m − c) the number of missing matches that need to be added. Assuming (for reasons of simplicity) that deletions and additions of match pairs require the same amount of effort, the accuracy, i.e. the labor savings obtained by using our matching technique, is defined as:

    accuracy = 1 − [(n − c) + (m − c)] / m.    (4)

In a perfect match, n = m = c, resulting in an accuracy of 1. Negative accuracy (for c/n < 0.5, i.e. when more than half of the created matches are wrong) suggests that it would take the user more effort to correct the automatically proposed map than to perform the matching manually from scratch.
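Formula (4) translates directly into code; a minimal sketch, with an illustrative example of our own:

```python
def accuracy(P, I):
    """Formula (4): labor savings of a proposed match set P vs the intended set I."""
    P, I = set(P), set(I)
    n, m, c = len(P), len(I), len(P & I)
    return 1 - ((n - c) + (m - c)) / m

# Example: 2 of 3 proposals correct, 2 intended matches missing.
P = [("fact1", "factA"), ("dim1", "dimA"), ("dim2", "dimC")]
I = [("fact1", "factA"), ("dim1", "dimA"), ("dim2", "dimB"), ("m1", "mA")]
print(accuracy(P, I))  # 1 - (1 + 2)/4 = 0.25
```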
We evaluate the proposed algorithm by matching the schema of the federated data warehouse for health insurance organizations developed in [3, 12] to the local warehouse of one of the organizations. Both warehouses consist of three mutually compatible facts representing patient encounters, performed therapies and medicine prescriptions. In both cases the facts share dimensions (many, but not all, of which are compatible). The values of match accuracy for each of the three filters are given in Table 1. The outer and threshold filters perform best with relative similarity thresholds between 0.8 and 0.7. For the most critical part of the matching process, aggregation levels, the outer and best sum filters outperform threshold. Threshold outperforms the other filters when mapping the other multidimensional structures.

Table 1. Comparison of three mapping selection metrics (filters). [For each of the outer, threshold and best sum filters, the table reports n, m, c and accuracy for facts, measures, dimensions, aggregation levels and dimensional attributes; the numeric values were lost in extraction.]

7 Conclusion

This paper presents an approach to automating the schema matching process for heterogeneous data warehouses in order to shorten the warehouse integration process. The approach is based on analyzing warehouse schemas and can be used when access to the data content is restricted. We defined a match discovery procedure capable of solving heterogeneities among data warehouse-specific structures: facts, measures, dimensions, aggregation levels and dimensional attributes. Particular attention was given to aggregation level matching, as the partial order in dimension hierarchies must be preserved. A software implementation of the approach has been developed and different selection filters have been tested on a case study example. The best sum and outer filters performed better than threshold when mapping aggregation levels, while the latter was the best solution for the other multidimensional structures.

References

1. Bergamaschi, S., Castano, S., Vincini, M.: Semantic Integration of Semistructured and Structured Data Sources. SIGMOD Record 28 (1999)
2. Berger, S., Schrefl, M.: Analysing Multi-dimensional Data across Autonomous Data Warehouses. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2006. LNCS, vol. 4081. Springer, Heidelberg (2006)
3. Banek, M., Tjoa, A.M., Stolba, N.: Integrating Different Grain Levels in a Medical Data Warehouse Federation. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2006. LNCS, vol. 4081. Springer, Heidelberg (2006)
4. Cabibbo, L., Torlone, R.: Integrating Heterogeneous Multidimensional Databases. In: Proc. Int. Conf. on Scientific and Statistical Database Management 2005. IEEE Computer Society (2005)
5. Dhamankar, R., Lee, Y., Doan, A.-H., Halevy, A.Y., Domingos, P.: iMAP: Discovering Complex Mappings between Database Schemas. In: Proc. SIGMOD Conf. 2004. ACM Press, New York (2004)
6. Kim, W., Seo, J.: Classifying Semantic and Data Heterogeneity in Multidatabase Systems. IEEE Computer 24(12) (1991)
7. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic Schema Matching with Cupid. In: Proc. Int. Conf. on Very Large Data Bases 2001. Morgan Kaufmann, San Francisco (2001)
8. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In: Proc. Int. Conf. on Data Engineering 2002. IEEE Computer Society, Los Alamitos (2002)
9. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity Flooding: A Versatile Graph Matching Algorithm. Technical Report (2001)
10. Rahm, E., Bernstein, P.A.: A Survey of Approaches to Automatic Schema Matching. VLDB Journal 10 (2001)
11. Rodríguez, M.A., Egenhofer, M.J.: Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Trans. Knowl. Data Eng. 15 (2003)
12. Stolba, N., Banek, M., Tjoa, A.M.: The Security Issue of Federated Data Warehouses in the Area of Evidence-Based Medicine. In: Proc. Conf. on Availability, Reliability and Security 2006. IEEE Computer Society, Los Alamitos (2006)
13. Sheth, A.P., Larson, J.A.: Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys 22 (1990)
14. Princeton University Cognitive Science Laboratory: WordNet, a lexical database for the English language (last access: March 25, 2007)
15. Yang, D., Powers, D.M.W.: Measuring Semantic Similarity in the Taxonomy of WordNet. In: CRPIT, vol. 38. Australian Computer Society (2005)