A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems
Agusthiyar R. 1, Dr. K. Narashiman 2
1 Assistant Professor (Sr.G), Department of Computer Applications, Easwari Engineering College, Chennai, India
2 Professor & Director, AU TVS Centre for Quality Management, Anna University, Chennai, India

ABSTRACT: Data cleaning solutions are essential for organizations that handle large volumes of data. Data cleaning detects and removes errors and inconsistencies from data in order to improve its quality. A number of frameworks are available to handle noisy and inconsistent data. Traditional data integration techniques can deal with a single data source at the instance level, but data cleaning becomes especially important when heterogeneous data sources are integrated, and it should then be addressed together with schema-related data transformations. This paper proposes a framework that handles errors in heterogeneous data sources at the schema level; it detects and removes errors and inconsistencies in a simplified manner and improves the quality of data in the multiple data sources of a company with sources at different locations.

KEYWORDS: Data cleaning, Data quality, Attribute selection, Data warehouse.

I. INTRODUCTION

Data cleaning deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data quality problems are present in single data collections, such as files and databases, e.g., due to misspellings during data entry, missing information or other invalid data [1]. For a single data source, data cleaning can rely on attribute selection (feature selection) methods to detect and remove errors effectively. This yields quality data for end users and the business community working with homogeneous data sources.
When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems or global web-based information systems, data cleaning becomes crucial. Data warehouses require and provide extensive support for data cleaning: they load and continuously refresh huge amounts of data from a variety of sources, so the probability that some of the sources contain dirty data is high. Furthermore, data warehouses are used for decision making, so the correctness of their data is vital to avoid wrong conclusions. For instance, duplicated or missing information will produce incorrect or misleading statistics ("garbage in, garbage out"). Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered one of the biggest problems in data warehousing.

II. RELATED WORK

During data cleaning, multiple records representing the same real-life object are identified, assigned a single unique database identifier, and only one copy of exact duplicate records is retained. In a token-based algorithm for cleaning a data warehouse, the notion of token records was introduced for record comparison; such smart tokens are applicable to domain-independent data cleaning and can be used as warehouse identifiers to support incremental cleaning and refreshing of integrated data [4]. Data cleaning is the process of identifying expected problems when integrating data from different sources or from a single source. Many problems can occur while loading or integrating data into a data warehouse; the main one is noisy data. An attribute selection algorithm is applied before token formation: together, the attribute selection and token formation algorithms reduce the complexity of the data cleaning process and allow data to be cleaned flexibly and effortlessly, without confusion.
Every attribute value forms either a special token, such as a birth date, or an ordinary token, which can be alphabetic, numeric, or alphanumeric [5]. These tokens are sorted and used for record matching; they also form very good warehouse identifiers for faster incremental warehouse cleaning in the future. The idea of smart tokens is to derive tokens from the two most important fields by applying simple rules for defining numeric, alphabetic, and alphanumeric tokens. Database records then consist of smart token records, composed from the field tokens of the records. These smart token records are sorted on the two most important field tokens separately. The result is two sorted token tables, which are used to compare neighbouring records for a match. Duplicates are easily detected from these tables, and a warehouse identifier is generated for each set of duplicates by concatenating the tokens of its first record. These warehouse identifiers are later used for quick incremental record identification and refreshing [6].

III. MULTIPLE DATA SOURCE PROBLEMS

Each source may contain dirty data, and the data in the sources may be represented differently, overlap or contradict. This is because the sources are typically developed, deployed and maintained independently to serve specific needs. The result is a large degree of heterogeneity with respect to data management systems, data models, schema designs and the actual data. The main problems at the schema level are naming and structural conflicts. Naming conflicts arise when the same name is used for different objects (homonyms) or different names are used for the same object (synonyms). Structural conflicts occur in many variations and refer to different representations of the same object in different sources, e.g., attribute vs. table representation, different component structures, different data types, different integrity constraints, etc.
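As a rough sketch of the smart-token idea described above (not the cited authors' implementation; the record fields and splitting rules here are illustrative assumptions), field values can be normalized into sorted tokens and one sorted token table scanned for neighbouring matches:

```python
import re

def tokenize(value):
    """Split a field value and classify each piece as a numeric,
    alphabetic, or alphanumeric token; sort so word order is ignored."""
    tokens = []
    for part in re.split(r"[\s,/-]+", value.strip()):
        if not part:
            continue
        if part.isdigit():
            tokens.append(part)          # numeric token kept as-is
        else:
            tokens.append(part.lower())  # alphabetic/alphanumeric, lower-cased
    return " ".join(sorted(tokens))

def find_duplicates(records, key_field):
    """Sort token records on one key field's token (one sorted token
    table) and compare neighbouring entries for a match."""
    table = sorted(records, key=lambda r: tokenize(r[key_field]))
    duplicates = []
    for prev, cur in zip(table, table[1:]):
        if tokenize(prev[key_field]) == tokenize(cur[key_field]):
            duplicates.append((prev, cur))
    return duplicates

records = [
    {"name": "John Abraham", "dob": "01/01/1970"},
    {"name": "Abraham John", "dob": "1970/01/01"},  # same person, reordered name
    {"name": "Suresh Kumar", "dob": "12/05/1968"},
]
print(find_duplicates(records, "name"))
```

Because the tokens are sorted within each value, "John Abraham" and "Abraham John" produce the same token string, land next to each other in the sorted table, and are flagged as a duplicate pair.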
A main problem in cleaning data from multiple sources is identifying overlapping data, in particular matching records that refer to the same real-world entity (e.g., customer). This is also referred to as the object identity problem [3], duplicate elimination, or the merge/purge problem [2]. Frequently, the information is only partially redundant, and the sources may complement each other by providing additional information about an entity. Duplicate information should therefore be purged, and complementary information consolidated and merged, in order to achieve a consistent view of real-world entities.

A. DATA CLEANING IN MULTIPLE DATA SOURCES WITH A SOCIAL ASPECT

Data quality problems also arise in government-sector databases such as census data, voter ID information, driving licence databases and personal ID information. These databases contain errors such as missing information, invalid data, data entry mistakes, duplicated records and spelling errors. When the government decides to implement or distribute a welfare scheme, duplicate records or noisy data can lead to wrong decisions, or the scheme's benefits may not reach everyone. The proposed framework detects and removes noisy data in single- and multi-source databases, cleaning the data so that government sectors and other business organizations obtain quality data for good decision making.
Problem                          | Column name             | Noisy data                                | Solution                              | Clean data
Data entry problem               | Place                   | Ch/enn,ai; Cal&cat/ta; De<lhi             | Remove the noisy characters           | Chennai; Mumbai; Calcutta; Delhi
Name conflicts                   | Customer Name           | John Abraham S; Suresh kumar B; Hari babu M | Split the name                      | First Name: John Abraham, Suresh kumar, Hari babu; Last Name: S, B, M
Same name with different value   | Gender                  | Male; Female; M; F; 1; 0                  | Check inheritance; use a common format | Male; Female
Different date formats           | DOB                     | 1/January/1970; jan/01/1970               | Use a common date format              | 01/01/1970
Abbreviation and acronym         | Organization            | Govt; Pvt                                 | Expand the abbreviated terms          | Government; Private
Address conflicts (one column connecting different meanings) | Address | 133-A, Krishna Street, MGR Nagar, Chennai | Split the address column | Street: 133-A Krishna Street; Area: MGR Nagar; City: Chennai; Zip code
Column name with space or underscore | Person_detail; First name |                                       | Better to avoid spaces and underscores | Persondetail; Firstname

Table 1: An Example of Multiple Data Source Problems

B. AN EXAMPLE OF MULTIPLE DATA SOURCE PROBLEMS

Table 1 above shows multiple data source problems occurring in different situations. Errors are most frequently made during data entry; these are usually unconscious human mistakes and are easily rectified by this framework. Conflicts in the customer name lead to data quality errors at decision-making time in a business organization, and this type of error can also be detected and removed efficiently by the framework. The same attribute with different values, inconsistent date formats, and short terms are replaced by a common format or the appropriate expansion. Address conflicts are resolved by splitting the address into Street, Area, City and Zip code. The last row of the table shows a column-header problem with spaces and underscores; these are better avoided in favour of meaningful attribute headers. The table illustrates some of the problems that arise when two tables or data sources are integrated.
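The "remove the noisy characters" fix in the first row of Table 1 can be sketched with a simple regular-expression cleanup (an illustrative sketch, not the paper's implementation):

```python
import re

def strip_noise(value):
    """Remove non-alphanumeric noise characters such as / , & < >
    that were introduced during data entry, keeping spaces."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", value)
    return re.sub(r"\s+", " ", cleaned).strip()

for noisy in ["Ch/enn,ai", "Cal&cat/ta", "De<lhi"]:
    print(strip_noise(noisy))
```

This handles characters slipped in by typing errors; genuinely misspelled values (wrong letters rather than extra symbols) would still need a dictionary or similarity-based correction.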
IV. PROPOSED FRAMEWORK

Figure 1: A Framework for Data Cleaning in Multiple Data Sources
Figure 1 shows the implementation of the proposed framework. This step-by-step data cleaning methodology guides the user in detecting errors and inconsistencies and then cleaning the noisy data efficiently:

1. Select the tables from multiple data sources.
2. Select each attribute in the table and format the attributes.
3. Match each table column to the final table column.
4. Join all the tables.
5. Create tokens and find duplicates through the tokens.
6. Ranking: after finding duplicates, rank the columns according to the uniqueness of their values.
7. Remove duplicates according to the rank.
8. Finally, store the cleaned data in the database.

1. SELECTION OF TABLES FROM MULTIPLE DATA SOURCES

In this step, data tables are selected from various sources of a single domain. Data coming from different origins may have been created at different times, by different people, using different conventions. In this context, deciding which data refer to the same real object becomes crucial. A company may have information about its clients stored in different tables, because each client buys different services that are managed by distinct departments. Selecting tables from the multiple data sources is the core of the process.

2. SELECTING EACH ATTRIBUTE IN THE TABLE AND FORMATTING THE ATTRIBUTES

After selecting a table from the multiple data sources, each attribute of the table is selected for data cleaning and formatted. For example, the Contact Name attribute may be split into two columns (firstname and lastname), and the Gender attribute values may be changed from m to male and from f to female. This kind of attribute formatting is useful for the subsequent steps.

3. MATCH EACH TABLE COLUMN TO THE FINAL TABLE COLUMN

When multiple tables are selected, each table has different attributes under different names.
A company may have information about its clients stored in different tables, because each client buys different services that are managed by distinct departments. Once it is decided to build a unified repository of all the company's clients, the same customer may be referred to in different tables by slightly different but correct names. This kind of mismatch is called the object identity problem. It is solved here through a common table: each source table's attributes are matched to the common table's attributes, which maintains a unified repository over all the selected tables.

4. JOINING ALL THE TABLES

After mapping each table's attributes to the common table attributes, all the tables are combined, and the common table is saved as the final database. Once this unified repository of all the tables has been created, the tokens can be formed.

5. CREATE TOKENS AND FIND DUPLICATES THROUGH TOKENS

Creating and forming tokens is central to the data cleaning process, and the key attributes play a crucial role in it. This step uses the selected attribute field values to form a token. Tokens can be created from a single attribute field value or from combined attributes. For example, the contact name attribute may be selected for token creation: it is split into first name, middle name and last name, and the first and last names are combined to form the token. Tokens are formed from numeric, alphanumeric and alphabetic values. The field values are split, and unimportant elements such as title tokens (Mr., Dr. and so on) are removed [6]. This step eliminates the need for multiple passes over entire string records during duplicate identification.
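Steps 3 and 4, mapping each source table's columns onto the common table and then joining, could be sketched as follows (the source tables and column mappings here are hypothetical examples, not taken from the paper):

```python
# Hypothetical source tables with differently named columns.
sales = [{"Cust_Name": "John Abraham", "Ph": "9840012345"}]
support = [{"customer": "Suresh Kumar", "phone_no": "9840067890"}]

# Per-source mapping onto the common (final) table's column names.
mappings = {
    "sales": {"Cust_Name": "contact_name", "Ph": "phone"},
    "support": {"customer": "contact_name", "phone_no": "phone"},
}

def to_common(rows, mapping):
    """Rename each row's columns according to the common-table mapping."""
    return [{mapping[k]: v for k, v in row.items()} for row in rows]

# Step 4: union all mapped tables into the unified final repository.
final_table = (to_common(sales, mappings["sales"])
               + to_common(support, mappings["support"]))
print(final_table)
```

With every source normalized to the same column names, the union produces the single repository over which the tokens of step 5 are then formed.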
6. RANKING: RANK THE COLUMNS ACCORDING TO THE UNIQUENESS OF THEIR VALUES

Two or three fields that, in combination, most uniquely identify records are selected and ranked. The precondition for field selection and ranking is that the user is familiar with the problem domain and can rank fields according to their unique identifying power. For example, in a banking domain a user might select Birth, Name and Address and rank them in that order [6]. After detecting duplicates, the columns are ranked according to the uniqueness of their values, e.g., phone number, DOB, etc.

7. REMOVING DUPLICATES ACCORDING TO THE RANK

Data mining primarily works with large databases, and sorting large datasets and eliminating duplicates over them raises scalability problems. This framework addresses the problem by ranking the attributes according to attribute dependency.

8. GET THE CLEANED DATA AND STORE IT IN THE DATABASE

The step-by-step process of this framework yields cleaned, quality data for further use. This quality data is stored in the database and used for good decision making.

V. CONCLUSION

This framework simplifies the data cleaning process compared with previously used approaches, and its sequential steps make data cleaning and information retrieval easy to handle. The framework consists of eight steps: selection of tables, selection of attributes, matching the columns, joining all the tables, forming tokens, detecting duplicates, ranking and removing duplicates with a ranking algorithm, and merging into the database. It can serve as the basis for a powerful data cleaning tool that applies existing data cleaning techniques in sequential order.

ACKNOWLEDGEMENT

We would like to thank the reviewers for their comments and suggestions towards developing the framework further into a data cleaning tool.

REFERENCES
1. Erhard Rahm, Hong Hai Do: Data Cleaning: Problems and Current Approaches.
2. Hernandez, M.A.; Stolfo, S.J.: Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery 2(1): 9-37.
3. Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon: An Extensible Framework for Data Cleaning.
4. Timothy E. Ohanekwu, C.I. Ezeife: A Token-Based Data Cleaning Technique for Data Warehouse Systems.
5. J. Jebamalar Tamilselvi, Dr. V. Saravanan: Handling Noisy Data using Attribute Selection and Smart Tokens. Proceedings of the International Conference on Computer Science and Information Technology, 2008.
6. Christie I. Ezeife, Timothy E. Ohanekwu: Use of Smart Tokens in Cleaning Integrated Warehouse Data. International Journal of Data Warehousing & Mining (IJDWM), Vol. 1, No. 2, pp. 1-22, Idea Group Publishers, April-June.
7. J. Jebamalar Tamilselvi, Dr. V. Saravanan: A Unified Framework and Sequential Data Cleaning Approach for a Data Warehouse. IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 5, May.

BIOGRAPHY

Agusthiyar R. received his postgraduate MCA degree from Madurai Kamaraj University, Tamil Nadu, India in 2001, and his M.Phil degree from Periyar University, Tamil Nadu, India in 2008. He is pursuing his Ph.D in data mining at Anna University. He has been teaching since 2002 and is currently working as Assistant Professor (Sr.G) in the Department of Computer Applications. His research interests include data mining and artificial intelligence.

Prof. Dr. K. Narashiman is the Director of the AU TVS Centre for Quality Management, Anna University, Chennai, India. He is a well-known teacher, trainer, consultant and researcher in the area of quality management. He has been recognized by the state of Bremen, Germany, for implementing a quality management system. Trained on scholarship in Japan, he successfully propagates quality management to the industrial and academic community.

Copyright to IJIRSET
QOS Based Web Service Ranking Using Fuzzy C-means Clusters
Research Journal of Applied Sciences, Engineering and Technology 10(9): 1045-1050, 2015 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2015 Submitted: March 19, 2015 Accepted: April
Relational Database Basics Review
Relational Database Basics Review IT 4153 Advanced Database J.G. Zheng Spring 2012 Overview Database approach Database system Relational model Database development 2 File Processing Approaches Based on
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
Business Intelligence: Effective Decision Making
Business Intelligence: Effective Decision Making Bellevue College Linda Rumans IT Instructor, Business Division Bellevue College [email protected] Current Status What do I do??? How do I increase
Optimization of ETL Work Flow in Data Warehouse
Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. [email protected] P Srinivasu
Deriving Business Intelligence from Unstructured Data
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 9 (2013), pp. 971-976 International Research Publications House http://www. irphouse.com /ijict.htm Deriving
Understanding Data De-duplication. Data Quality Automation - Providing a cornerstone to your Master Data Management (MDM) strategy
Data Quality Automation - Providing a cornerstone to your Master Data Management (MDM) strategy Contents Introduction...1 Example of de-duplication...2 Successful de-duplication...3 Normalization... 3
Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad Email: [email protected]
96 Business Intelligence Journal January PREDICTION OF CHURN BEHAVIOR OF BANK CUSTOMERS USING DATA MINING TOOLS Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
A Framework for Data Warehouse Using Data Mining and Knowledge Discovery for a Network of Hospitals in Pakistan
, pp.217-222 http://dx.doi.org/10.14257/ijbsbt.2015.7.3.23 A Framework for Data Warehouse Using Data Mining and Knowledge Discovery for a Network of Hospitals in Pakistan Muhammad Arif 1,2, Asad Khatak
Big Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
Data Warehouse: Introduction
Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of base and data mining group,
Application of Data Mining Methods in Health Care Databases
6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Application of Data Mining Methods in Health Care Databases Ágnes Vathy-Fogarassy Department of Mathematics and
CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved
CHAPTER SIX DATA Business Intelligence 2011 The McGraw-Hill Companies, All Rights Reserved 2 CHAPTER OVERVIEW SECTION 6.1 Data, Information, Databases The Business Benefits of High-Quality Information
Data Cleaning. Helena Galhardas DEI IST
Data Cleaning Helena Galhardas DEI IST (based on the slides: A Survey of Data Quality Issues in Cooperative Information Systems, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference
Chapter 3 - Data Replication and Materialized Integration
Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 [email protected] Chapter 3 - Data Replication and Materialized Integration Motivation Replication:
www.ducenit.com Analance Data Integration Technical Whitepaper
Analance Data Integration Technical Whitepaper Executive Summary Business Intelligence is a thriving discipline in the marvelous era of computing in which we live. It s the process of analyzing and exploring
Text Classification Using Symbolic Data Analysis
Text Classification Using Symbolic Data Analysis Sangeetha N 1 Lecturer, Dept. of Computer Science and Applications, St Aloysius College (Autonomous), Mangalore, Karnataka, India. 1 ABSTRACT: In the real
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, [email protected]
Data Mining Techniques for Banking Applications
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 2, Issue 4, April 2015, PP 15-20 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Data
ISSN: 2321-7782 (Online) Volume 3, Issue 7, July 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 7, July 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
