A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
Suprativ Saha 1 and Avik Samanta 2

1 Department of Computer Science and Engineering, Global Institute of Management and Technology, Krishnagar City, West Bengal, India
[email protected]
2 Department of Computer Application, JIS College of Engineering, West Bengal, India
[email protected]

ABSTRACT

Storage capacity and data volume are growing in parallel, and at any instant the data may exceed the available storage. A good RDBMS should reduce redundancy as far as possible to maintain consistency and keep storage costs down; moreover, a huge database full of replicated copies wastes space that could be put to other uses. The first aim, therefore, should be to apply data de-duplication techniques in the RDBMS field, while checking access-time complexity alongside space complexity. This paper discusses several existing de-duplication approaches and, based on their drawbacks, theoretically illustrates a new approach involving the row id, column id, and domain-key constraint of an RDBMS. Although the model may appear tedious and unpromising at first sight, for a large database with many tables containing many lengthy fields it can be shown to reduce space complexity drastically while preserving access speed.

KEYWORDS

De-duplication, SQL, Chunk, Genetic Algorithm, Replicated Copy, Domain-Key Constraint

1.
INTRODUCTION

Recently, enterprise and digitized data volumes have been expanding rapidly along with the growth of information and multimedia files. According to the Enterprise Strategy Group, an industry analysis firm, 13 percent of midsized companies surveyed in 2004 used more than 10 terabytes of data storage, but by this year the figure had risen to between 40 and 50 percent. For digitized data, a study by IDC [9] predicts that the information added annually to the digital universe will increase more than six-fold, from 161 exabytes to 988 exabytes between 2006 and 2010, growing by 57% annually. It is clear that a large space is required to store these data. However, much duplicate information is stored from different sources, requiring still more space for the replicas. Bitter experience from several incidents has shown that data loss is common in a modern enterprise, so it is important to back up data periodically to a reliable place for the sake of data availability and integrity. Backup to an external storage system is, however, costly and power-consuming; according to surveys, although the price of disks has decreased,

Rupak Bhattacharyya et al. (Eds.): ACER 2013, pp , CS & IT-CSCP 2013. DOI : /csit
cost remains the main obstacle to deploying disk-based backup solutions. In fact, a huge amount of duplicate or redundant data exists in current storage systems [10], especially backup systems, so a great deal of space is needed for backup and replication. In a data warehouse, data cleaning is a very important task for improving data quality: it deals with detecting and removing inconsistencies and errors. Because conventions differ between the external sources and the target data warehouse, and because of a variety of errors, data from external sources may not conform to the warehouse's standards and requirements. Data therefore has to be transformed and cleaned before it is loaded into the warehouse, usually through an Extract-Transform-Load (ETL) process.

There are numerous techniques for reducing redundancy when data is stored or sent. To cope with the growth described above, organizations are turning to data de-duplication technology. Data de-duplication, also called Intelligent Compression or Single-Instance Storage, eliminates the redundancy caused by identical objects, which can be detected efficiently by comparing hashes of the objects' content [4]. Storage space requirements can be reduced by a factor of 10 to 20 or more when backup data is de-duplicated [8]. De-duplication has been realized by several methods, notably SQL-based methods, chunk-based methods, and models incorporating genetic algorithms. Besides space complexity, time complexity is also an important parameter for judging such a scheme.
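The hash-comparison idea behind single-instance storage can be sketched in a few lines of Python. This is an illustrative sketch, not any particular product's implementation; the store and the function names are invented for the example:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a content hash that serves as the object's identity."""
    return hashlib.sha256(data).hexdigest()

# A toy single-instance store: identical objects share one physical copy.
store = {}

def put(data: bytes) -> str:
    key = fingerprint(data)
    if key not in store:          # only the first copy is physically stored
        store[key] = data
    return key                    # callers keep the key, not the bytes

k1 = put(b"quarterly-report.pdf contents")
k2 = put(b"quarterly-report.pdf contents")   # duplicate: no new storage used
k3 = put(b"unrelated contents")

print(k1 == k2)     # duplicates map to the same stored object
print(len(store))   # physical copies held for the three logical objects
```

Three logical objects end up as two physical copies, which is exactly the effect the 10-to-20-fold reduction figures rely on when backup streams contain many repeats.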
With the previous de-duplication models, space complexity is reduced, which is the main concern in this research domain, but the time needed to access data from a traditional database or data warehouse must not be overlooked. Here a new model is proposed that establishes the de-duplication technique while keeping track of the access speed and the availability of data, in addition to shrinking the storage space. Section 2 discusses the current state of the art of de-duplication; Section 3 presents the new proposed model; Section 4 concludes.

2. CURRENT STATE OF THE ART

Data de-duplication is generally used to reduce storage cost and to increase the capability of accessing data from a traditional database or data warehouse. Data rearrangement, data cleaning, and data indexing are all incorporated in de-duplication, and several authors have used different methods to implement it. SQL-based models are discussed first; chunk-based and genetic-algorithm-based models are then briefly illustrated in turn.

2.1. SQL-Based Model for Data De-duplication

A large database or data warehouse may contain many duplicate records. These duplicates reduce the effective storage capacity, increase the time complexity of accessing data, and, from the DBMS perspective, may generate many inconsistencies; that is why normalization is applied. To obtain a well-formed database or data warehouse, removing duplicate data as far as possible is most important. In [14], many SQL-based techniques for eliminating duplicate records are discussed.
De-duplication by using sub-queries tactically

In this technique sub-queries are used to delete the duplicate records tactfully. Normally the innermost query is evaluated only once and the resulting values are substituted into the WHERE clause of the outer query; here, however, the outer query is evaluated on the basis of the result obtained from the inner sub-query. This approach can be used only if there is an identity column on the target table, or if one is willing to alter the target table to add an identity column, which requires ALTER TABLE permission.

De-duplication using a temporary table

In this approach the distinct records of the original table are fetched into a temporary table, so that the temporary table contains only unique entities. The original table is then truncated and the records from the temporary table are copied back into it.

BEGIN TRANSACTION
/* Pull the distinct records into the temporary table */
SELECT DISTINCT * INTO #T FROM TargetTable
/* Truncate the target table */
TRUNCATE TABLE TargetTable
/* Insert the distinct records from the temporary table back into the target table */
INSERT INTO TargetTable SELECT * FROM #T
/* Drop the temporary table */
IF OBJECT_ID('tempdb..#T') IS NOT NULL DROP TABLE #T
COMMIT TRANSACTION

The temporary table must be at least large enough to hold the distinct records. Because a temporary table is used to remove the duplicate data, there is a real chance of data loss. If a permanent table is used instead, the ultimate aim is not satisfied, since storage cost rises again to hold the permanent table. If a view were used in place of either, the problem could be solved. In this method the user also has to be involved in inserting data into the database or data warehouse.
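The temporary-table technique can be tried end-to-end with SQLite from Python. This is a sketch; the table and column names are illustrative, and SQLite's DELETE stands in for TRUNCATE, which SQLite does not have:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE target (name TEXT, city TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)",
                [("Avik", "KOLKATA"), ("Avik", "KOLKATA"), ("Rajeev", "MUMBAI")])

# 1. Pull the distinct records into a temporary table.
cur.execute("CREATE TEMP TABLE t AS SELECT DISTINCT * FROM target")
# 2. Empty the target table (SQLite's equivalent of TRUNCATE).
cur.execute("DELETE FROM target")
# 3. Copy the unique rows back, then drop the temporary table.
cur.execute("INSERT INTO target SELECT * FROM t")
cur.execute("DROP TABLE t")
con.commit()

rows = cur.execute("SELECT * FROM target ORDER BY name").fetchall()
print(rows)   # only the two distinct records remain
```

Note that between steps 2 and 3 the only copy of the data lives in the temporary table, which is precisely the data-loss window the critique above points at.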
It follows that none of the steps of this model can be hidden from a third-party user. Furthermore, the access complexity of the SQL-based technique is high, which is undesirable.

2.2. Chunk-Based Clustering Architecture

Data de-duplication, one of the most notable emerging technologies of recent times, reduces storage cost and increases the effective capacity of storage devices such as magnetic disks and tapes. From the DBMS perspective, put simply, it increases the efficiency and performance of a database or data warehouse, in which, in reality, there is a huge amount of duplicate data. In [12], a clustering architecture consisting of many nodes, built on a B+ tree data structure, is discussed in detail.
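At its core, chunk-based de-duplication stores a chunk only when no node in the cluster already holds one with the same fingerprint. A minimal sketch, with invented class names and a toy chunk size (real systems chunk at kilobyte granularity and keep compact summaries rather than querying peers directly):

```python
import hashlib

CHUNK_SIZE = 8   # tiny, for illustration only

def chunks(data: bytes):
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

class Node:
    """A storage node holding chunks keyed by fingerprint."""
    def __init__(self):
        self.chunks = {}                          # fingerprint -> chunk bytes

    def store(self, chunk: bytes, cluster):
        fp = hashlib.sha256(chunk).hexdigest()
        # Check locally first, then ask whether any other node
        # already holds an identical chunk before storing it.
        if fp not in self.chunks and not cluster.known_elsewhere(fp, self):
            self.chunks[fp] = chunk
        return fp

class Cluster:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def known_elsewhere(self, fp, asking_node):
        return any(fp in node.chunks for node in self.nodes
                   if node is not asking_node)

cluster = Cluster(2)
a, b = cluster.nodes
file1 = b"AAAAAAAABBBBBBBB"          # two chunks
file2 = b"BBBBBBBBCCCCCCCC"          # shares one chunk with file1
for c in chunks(file1):
    a.store(c, cluster)
for c in chunks(file2):
    b.store(c, cluster)

total = sum(len(n.chunks) for n in cluster.nodes)
print(total)   # unique chunks physically stored for four logical chunks
```

Four logical chunks collapse to three stored ones; the Fingerprint Summary described below exists so that the `known_elsewhere` check can be answered from local memory instead of a network round trip.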
There are multiple nodes in the storage system, and the chunks of a file may be stored on different storage nodes. To check whether a chunk is a duplicate, each node first checks its own database and then, if necessary, checks the other nodes. A technique called the Fingerprint Summary is used for this: it keeps a compact summary of the chunk fingerprints of every other node. If a chunk is new to a node, the Fingerprint Summary in that node's memory is queried to determine whether some other node has already stored an identical chunk; thus inter-node duplicate data can be removed and more storage space saved. As the chunks on each node grow dynamically, the fingerprint summaries need continuous refreshing to reflect the changes, and they can be refreshed on every node in real time. Whenever a new chunk is stored on some node, the fingerprint summaries of all other nodes must be refreshed at the same time, so as the number of new chunks increases, both the communication and the processing overhead increase quadratically.

2.3. Genetic Algorithm-Based Model

This approach [13] combines several different pieces of evidence extracted from the data content to produce a de-duplication function that is able to identify whether two or more entries in a repository are replicas. Since a repository may contain much duplicate data, the first aim is to develop such a function, which may combine various linear, non-linear, or probabilistic functions, so as to reduce processing time and maximize performance. The main advantage of the Genetic Programming (GP) approach [2] is that it provides a very suitable way to find duplicate data efficiently without searching the entire solution space, which may be very large.
So far GP has been applied in many areas of information handling, such as ranking-function discovery [3], document classification [5], content-based image retrieval [11], and content-targeted advertising [6]. On the whole, the aims of this GP-based approach to record de-duplication are that it:

1. Outperforms an existing state-of-the-art machine-learning method found in the literature.
2. Is user independent, in that it indirectly chooses the appropriate functions on its own and combines them according to the requirement.
3. Automatically chooses the replica-identification boundary value, freeing the user from choosing it.
4. Produces de-duplication functions that use the available evidence more efficiently.

At the development stage, the approaches can be classified into the following two categories.

Ad-hoc or Domain-Knowledge Approaches. This category includes approaches that usually depend on specific domain knowledge or on a specific string-distance measure.

Training-Based Approaches. This category includes all approaches that depend on some sort of training, supervised or semi-supervised, in order to identify the replicas.
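The kind of de-duplication function GP evolves can be pictured as an expression tree that combines per-field similarities. A minimal illustrative sketch; the character-overlap similarity, the field names, and the chosen tree are all invented for the example:

```python
def sim(a: str, b: str) -> float:
    """Toy per-field similarity: fraction of aligned characters that agree."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

# A candidate de-duplication function is an expression tree: leaves name
# record fields (the terminals); internal nodes are operators ('+' or '*').
def evaluate(tree, rec1, rec2):
    if isinstance(tree, str):                     # terminal: field similarity
        return sim(rec1[tree], rec2[tree])
    op, left, right = tree
    l, r = evaluate(left, rec1, rec2), evaluate(right, rec1, rec2)
    return l + r if op == '+' else l * r

r1 = {"name": "Suprativ Saha", "city": "KOLKATA"}
r2 = {"name": "Suprativ Saha", "city": "CALCUTTA"}

# One individual from a GP population: name-similarity plus the product
# of name- and city-similarity.
individual = ('+', 'name', ('*', 'name', 'city'))
score = evaluate(individual, r1, r2)
print(score)    # identical names dominate the combined evidence
```

GP's job is to search the space of such trees; the resulting score is then compared against a replica-identification boundary, which the approach also learns rather than asking the user to fix.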
Domain Knowledge Approaches

In this case [7] a matching algorithm is proposed in which, for a given record in a file (or repository), the algorithm first searches a reference file for another record that matches it according to a given similarity function. The matching reference records are selected on the basis of a user-defined minimum similarity threshold. A vector space model is used to compute the similarity between fields from different sources, and four distinct strategies are evaluated for assigning weights and combining the similarity scores of the individual fields. In this way evidence extracted from individual attributes improves the results of the replica-identification task.

Probabilistic Approaches

Initially, Bayesian inference (a probabilistic formulation) was used as the first approach to handle replicas automatically, but that implementation is obsolete in a practical context since it lacked a more elaborated statistical ground. Later, the model developed by Fellegi and Sunter [1] elaborated the statistical approach to the problem of combining evidence. Their method relies on the definition of two boundary values that are used to classify a pair of records as replicas or not. Most tools, such as Febrl [15], work with two boundaries as follows:

1. Positive identification boundary: if the similarity value lies above this boundary, the records are considered replicas.
2. Negative identification boundary: if the similarity value lies below this boundary, the records are considered non-replicas.

Genetic Operations

All the similarity functions in the GP approach are implemented as trees, in which the leaves, called terminals, are the inputs. The function set is the collection of operators, statements, and basic or user-defined functions that the GP evolutionary process can use to manipulate the terminal values.
These functions are placed in the internal nodes of the tree. The genetic operations apply several biological processes, namely reproduction, crossover, and mutation [2], successively spawning new trees with better solutions. Crossover makes random exchanges of nodes or subtrees between parents, generating a new tree with a better solution; reproduction is the inheritance of the genetic code of the previous generation (tree) by the next, keeping the good features intact; and mutation has the role of maintaining a minimum level of diversity among the individuals in the population, avoiding premature convergence. In most cases the trees generated by crossover also pass through the mutation operation.

3. A NEW APPROACH OF DATA DE-DUPLICATION

The approaches discussed above suffer from time complexity, even though the algorithms successfully reduce space complexity. If we concentrate on the RDBMS alone, where all data are represented in tabular form as rows and columns, then any particular value of a particular attribute can be represented as a pair (R, C), where R denotes the row id and C the serial number of the column. Let us explain the scenario with the execution flow chart shown in Figure 1.
[Figure 1: Flow Diagram of the Proposed Approach of the Data De-Duplication Technique. In outline: for each value to be inserted, if it belongs to a key column it is stored directly; otherwise its data type is identified and, if no two-column table for that type exists yet, one is created. The corresponding data-type table (along with the varchar table) is then searched: on success, the index number of the master copy is written into the field; on failure, the value is appended to the data-type table together with the (row, column) index of the field, comma-separated, and that index is written into the field. All data-type tables are sorted while the system is idle.]

In Table 1 there is much duplicate data as far as the attributes Place of Birth and Country are concerned. If only a single copy of each duplicated value is stored, and the replicated copies are replaced by a location identifier pointing at the original copy, the space complexity can be reduced substantially. To achieve this, a model has been developed based on row-column pairs.
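The insertion flow of Figure 1 can be sketched in Python. This is a rough sketch under simplifying assumptions: the type test is crude, the column layout is invented, and the type tables are plain dictionaries rather than sorted on-disk tables:

```python
# Sketch of the insertion flow: key columns are stored verbatim; every
# other value is replaced by a reference into a per-data-type table
# holding one master copy per distinct value.
type_tables = {}                      # data type name -> {value: "row,col"}

def type_of(value) -> str:
    """Crude stand-in for the domain-key based data-type identification."""
    return "N" if str(value).isdigit() else "V"

def insert_value(value, row, col, is_key):
    if is_key:                        # key columns never hold duplicates,
        return value                  # so they are stored directly
    table = type_tables.setdefault(type_of(value), {})
    if value not in table:            # search failed: store the master copy
        table[value] = f"{row},{col}"
    return table[value]               # the field stores only the (R,C) index

rows = []
for r, data in enumerate([["Avik Samanta", "KOLKATA", "INDIA"],
                          ["Suprativ Saha", "KOLKATA", "INDIA"]], start=1):
    rows.append([insert_value(v, r, c + 1, c == 0) for c, v in enumerate(data)])

print(rows[1])   # the duplicated values became references to row 1
```

The second record stores no copy of KOLKATA or INDIA at all, only the (R,C) references to where the first occurrence lives.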
Table 1: Example database before execution of the de-duplication technique

Name              Date Of Birth   Place Of Birth   Country   Serial No
Avik Samanta      12/03/1975      KOLKATA          INDIA     101
Suprativ Saha     12/03/1975      KOLKATA          INDIA     102
Robert Pattinson  23/06/1985      NEW YORK         USA       103
James Thomson     17/10/1990      CHICAGO          USA       101
Rajeev Malhotra   25/02/1970      MUMBAI           INDIA     105

At the first stage the data types and keys of the attributes are identified from the domain-key constraints. Then an individual table (Tables 2 to 5) is created for each data type, leaving out the data type of the key attribute, because a key attribute can never hold duplicate data. Each of these tables consists of two attributes, the data value and its position. For the example above, the tables created are:

Table 2: C (for character-type data)

Data              Index
Avik Samanta      1,1
Suprativ Saha     2,1
Robert Pattinson  3,1
James Thomson     4,1
Rajeev Malhotra   5,1

Table 3: D (for date-type data)

Data         Index
12/03/1975   1,2
23/06/1985   3,2
17/10/1990   4,2
25/02/1970   5,2

Table 4: V (for alphanumeric-type data)

Data       Index
KOLKATA    1,3
NEW YORK   3,3
CHICAGO    4,3
MUMBAI     5,3
INDIA      1,4
USA        3,4

Table 5: N (for number-type data)

Data   Index
101    1,5
102    2,5
103    3,5
105    5,5

The first column holds a single copy of each value from the original table, satisfying the domain-key constraints, and the second column gives the location of that value in the original table in (R, C) form, where the most significant part is the row id and the remainder is the column serial number. A comma separator between R and C keeps the row id and the serial number distinct.
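A quick byte count, following the paper's assumptions of one byte per character and 3-byte (R,C) references, shows when the scheme pays off. The helper names are invented for the illustration:

```python
def plain_bytes(value: str, copies: int) -> int:
    """Bytes used when every copy is stored verbatim."""
    return len(value) * copies

def dedup_bytes(value: str, copies: int, ref_len: int = 3) -> int:
    """Bytes used by the proposed scheme: one master copy plus its index
    entry in the type table, and a reference in every referencing field."""
    return len(value) + ref_len + copies * ref_len

v = "KOLKATA"
print(plain_bytes(v, 3), dedup_bytes(v, 3))   # three copies: the scheme wins
print(plain_bytes(v, 2), dedup_bytes(v, 2))   # two copies: not yet worthwhile
```

The break-even point depends on value length and replica count, which is why the approach is aimed at long fields with many duplicates.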
All these tables, created to hold different data categorically by data type, should be sorted during system idle time, so that the next time an individual table must be accessed, a binary search can easily be used to find the location of the master record in the corresponding table. The search complexity is thereby reduced to O(log2 n), compared with O(n) for a sequential search. After execution of the algorithm, the resultant table looks like Table 6. The de-duplicated entries in that table are the replicated copies, each reduced to a short two-part reference; this is how the space complexity is reduced. In reality the C table will not be created, since Name is a key column and there doesn't
exist any other character-type column, so it can be omitted. The algorithm is the more beneficial the longer the data values are.

Table 6: Example database after execution of the de-duplication technique

Name              Date Of Birth   Place Of Birth   Country   Serial No
Avik Samanta      1,2             1,3              1,4       1,5
Suprativ Saha     1,2             1,3              1,4       2,5
Robert Pattinson  3,2             3,3              3,4       3,5
James Thomson     4,2             4,3              3,4       1,5
Rajeev Malhotra   5,2             5,3              1,4       5,5

Let us examine how storage space is reduced. Take a value from the previous example, and suppose KOLKATA occurred three times in the Place of Birth attribute. Assuming one byte per character, KOLKATA takes 7 bytes, so three plain copies take 21 bytes. Under the proposed approach the same scenario takes only 19 bytes: each of the three fields in the original table stores a 3-byte reference (row id, comma separator, and column serial number), 3 x 3 = 9 bytes in all, while the index table stores KOLKATA once (7 bytes) plus its own 3-byte index entry, giving 9 + 7 + 3 = 19 bytes. If even this small scenario gives a saving, the approach will clearly give better results as the number of replicas grows.

4. CONCLUSION

Duplicate information stored from different sources requires extra space for the replicas, and bitter experience from several incidents has shown that data loss is common in a modern enterprise. Different types of approach have been applied to these problems. SQL-based de-duplication, the chunk-based clustering architecture, and the genetic-approach models are all rather tedious in their respective fields of implementation because of their heavy dependence on the file organization or file system. Moreover, all of them have poor time complexity, which is a matter of great concern.
In the case of the new proposed model, the larger the relational database and the greater its number of tables (the most anticipated situation), the better the model performs compared with a tiny database. The domain identifier, key identifier, row id, and column id play the key roles in the model, while the binary search and the index position in the type tables are its main optimisation factors. The length of a value in any particular field is then no longer a cause of wasted space; on the contrary, long values make the model all the more practical. Since the entire algorithm is based on the row-column approach, it can initially be implemented only in an RDBMS, although our goal is to extend the concept in a broader sense so that it covers all branches of DBMS.

REFERENCES

[1] I. P. Fellegi and A. B. Sunter, (1969) "A theory for record linkage", Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183-1210.
[2] J. R. Koza, (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press.
[3] W. Banzhaf, P. Nordin, R. E. Keller and F. D. Francone, (1998) Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and its Applications, Morgan Kaufmann Publishers.
[4] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur, (2000) "Single instance storage in Windows 2000", in Proceedings of the 4th USENIX Windows Systems Symposium, Seattle, WA.
[5] B. Zhang, Y. Chen, W. Fan, E. A. Fox, M. Gonçalves, M. Cristo and P. Calado, (2005) "Intelligent GP fusion from multiple sources for text classification", in CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, New York, NY, USA: ACM.
[6] A. Lacerda, M. Cristo, M. A. Gonçalves, W. Fan, N. Ziviani, and B. Ribeiro-Neto, (2006) "Learning to advertise", in SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA: ACM.
[7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, (2007) "Duplicate record detection: A survey", IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1.
[8] H. Biggar, (2007) "Experiencing data de-duplication: Improving efficiency and reducing capacity requirements", White Paper, the Enterprise Strategy Group.
[9] J. F. Gantz, (2007) "The expanding digital universe: A forecast of worldwide growth through 2010", IDC.
[10] Dave Reinsel, (2008) "Our expanding digital world: Can we contain it? Can we manage it?", Intelligent Storage Workshop (ISW 2008), UMN, MN.
[11] R. da S. Torres, A. X. Falcão, M. A. Gonçalves, J. P. Papa, B. Zhang, W. Fan, and E. A. Fox, (2009) "A genetic programming framework for content-based image retrieval", Pattern Recognition, vol. 42, no. 2.
[12] Guohua Wang, Yuelong Zhao, Xiaoling Xie, and Lin Liu, (2010) "Research on a clustering data de-duplication mechanism based on Bloom Filter", IEEE.
[13] Moisés G. de Carvalho, Alberto H. F. Laender, Marcos André Gonçalves (UFMG), and Altigran S. da Silva (UFAM), (2010) "A Genetic Programming Approach to Record Deduplication", IEEE Transactions on Knowledge and Data Engineering.
[14] Srivatsa Maddodi, Girija V.
Attigeri, Dr. Karunakar A. K., (2010) "Data Deduplication Techniques and Analysis", Third International Conference on Emerging Trends in Engineering and Technology, IEEE Computer Society.

AUTHORS

Suprativ Saha is associated with the Global Institute of Management and Technology, Krishnagar City, West Bengal, India, as an assistant professor. He received his M.E. degree from West Bengal University of Technology in 2012 and his B.Tech. degree from JIS College of Engineering. His research interests include database management systems, data mining, distributed databases, and optical networks. Mr. Saha has about two refereed international publications to his credit.

Avik Samanta received his M.C.A. degree from JIS College of Engineering in 2010 and his B.Sc. degree from the University of Calcutta. His research interests include database management systems, data mining, and distributed databases.
A basic create statement for a simple student table would look like the following.
Creating Tables A basic create statement for a simple student table would look like the following. create table Student (SID varchar(10), FirstName varchar(30), LastName varchar(30), EmailAddress varchar(30));
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
Detection and Elimination of Duplicate Data from Semantic Web Queries
Detection and Elimination of Duplicate Data from Semantic Web Queries Zakia S. Faisalabad Institute of Cardiology, Faisalabad-Pakistan Abstract Semantic Web adds semantics to World Wide Web by exploiting
14 Databases. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:
14 Databases 14.1 Source: Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define a database and a database management system (DBMS)
Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL
Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis
Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Rajesh Reddy Muley 1, Sravani Achanta 2, Prof.S.V.Achutha Rao 3 1 pursuing M.Tech(CSE), Vikas College of Engineering and
IT2304: Database Systems 1 (DBS 1)
: Database Systems 1 (DBS 1) (Compulsory) 1. OUTLINE OF SYLLABUS Topic Minimum number of hours Introduction to DBMS 07 Relational Data Model 03 Data manipulation using Relational Algebra 06 Data manipulation
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
DBMS Questions. 3.) For which two constraints are indexes created when the constraint is added?
DBMS Questions 1.) Which type of file is part of the Oracle database? A.) B.) C.) D.) Control file Password file Parameter files Archived log files 2.) Which statements are use to UNLOCK the user? A.)
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
MS SQL Performance (Tuning) Best Practices:
MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
PREDICTIVE DATA MINING ON WEB-BASED E-COMMERCE STORE
PREDICTIVE DATA MINING ON WEB-BASED E-COMMERCE STORE Jidi Zhao, Tianjin University of Commerce, [email protected] Huizhang Shen, Tianjin University of Commerce, [email protected] Duo Liu, Tianjin
Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner
24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India
ETL as a Necessity for Business Architectures
Database Systems Journal vol. IV, no. 2/2013 3 ETL as a Necessity for Business Architectures Aurelian TITIRISCA University of Economic Studies, Bucharest, Romania [email protected] Today, the
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
Chapter 1: Introduction
Chapter 1: Introduction Database System Concepts, 5th Ed. See www.db book.com for conditions on re use Chapter 1: Introduction Purpose of Database Systems View of Data Database Languages Relational Databases
Cassandra A Decentralized, Structured Storage System
Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922
IT2305 Database Systems I (Compulsory)
Database Systems I (Compulsory) INTRODUCTION This is one of the 4 modules designed for Semester 2 of Bachelor of Information Technology Degree program. CREDITS: 04 LEARNING OUTCOMES On completion of this
low-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
Database Migration over Network
Database Migration over Network Kanimozhi N 1, Thresphine J. R 2 1 M.TECH (Computer Science & Eng), PRIST University, Pondicherry India 2 Assistant Professor (Computer Science & Eng), PRIST University,
Performance Analysis of Decision Trees
Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India
How To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE S. Anupama Kumar 1 and Dr. Vijayalakshmi M.N 2 1 Research Scholar, PRIST University, 1 Assistant Professor, Dept of M.C.A. 2 Associate
Database Design and Normalization
Database Design and Normalization 3 CHAPTER IN THIS CHAPTER The Relational Design Theory 48 46 Database Design Unleashed PART I Access applications are database applications, an obvious statement that
2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
DYNAMIC QUERY FORMS WITH NoSQL
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH
A Parallel Processor for Distributed Genetic Algorithm with Redundant Binary Number
A Parallel Processor for Distributed Genetic Algorithm with Redundant Binary Number 1 Tomohiro KAMIMURA, 2 Akinori KANASUGI 1 Department of Electronics, Tokyo Denki University, [email protected]
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
ETL PROCESS IN DATA WAREHOUSE
ETL PROCESS IN DATA WAREHOUSE OUTLINE ETL : Extraction, Transformation, Loading Capture/Extract Scrub or data cleansing Transform Load and Index ETL OVERVIEW Extraction Transformation Loading ETL ETL is
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX
ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Data Hierarchy. Traditional File based Approach. Hierarchy of Data for a Computer-Based File
Management Information Systems Data and Knowledge Management Dr. Shankar Sundaresan (Adapted from Introduction to IS, Rainer and Turban) LEARNING OBJECTIVES Recognize the importance of data, issues involved
A Brief Introduction to MySQL
A Brief Introduction to MySQL by Derek Schuurman Introduction to Databases A database is a structured collection of logically related data. One common type of database is the relational database, a term
Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm
Journal of Al-Nahrain University Vol.15 (2), June, 2012, pp.161-168 Science Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Manal F. Younis Computer Department, College
Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
Data Mining Applications in Fund Raising
Data Mining Applications in Fund Raising Nafisseh Heiat Data mining tools make it possible to apply mathematical models to the historical data to manipulate and discover new information. In this study,
Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad Email: [email protected]
96 Business Intelligence Journal January PREDICTION OF CHURN BEHAVIOR OF BANK CUSTOMERS USING DATA MINING TOOLS Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad
Databases and Microsoft Access II
Databases and Microsoft Access II Northern New York Library Network Workshop Jim Crowley C3 - Crowley Computer Consulting 9148 State Highway 37 Ogdensburg NY 13669 315-394-7008 fax 315-394-7009 www.crowleycomputers.com
Big Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
Physical Data Organization
Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor
Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices
Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil
B.1 Database Design and Definition
Appendix B Database Design B.1 Database Design and Definition Throughout the SQL chapter we connected to and queried the IMDB database. This database was set up by IMDB and available for us to use. But
Filling the Semantic Gap: A Genetic Programming Framework for Content-Based Image Retrieval
INSTITUTE OF COMPUTING University of Campinas Filling the Semantic Gap: A Genetic Programming Framework for Content-Based Image Retrieval Ricardo da Silva Torres [email protected] www.ic.unicamp.br/~rtorres
LDA, the new family of Lortu Data Appliances
LDA, the new family of Lortu Data Appliances Based on Lortu Byte-Level Deduplication Technology February, 2011 Copyright Lortu Software, S.L. 2011 1 Index Executive Summary 3 Lortu deduplication technology
Foundations of Business Intelligence: Databases and Information Management
Foundations of Business Intelligence: Databases and Information Management Content Problems of managing data resources in a traditional file environment Capabilities and value of a database management
Hybrid Technique for Data Cleaning
Hybrid Technique for Data Cleaning Ashwini M. Save P.G. Student, Department of Computer Engineering, Thadomal Shahani Engineering College, Bandra, Mumbai, India Seema Kolkur Assistant Professor, Department
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
How To Improve Performance In A Database
Some issues on Conceptual Modeling and NoSQL/Big Data Tok Wang Ling National University of Singapore 1 Database Models File system - field, record, fixed length record Hierarchical Model (IMS) - fixed
Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification
Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 Outline More Complex SQL Retrieval Queries
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
Netezza Basics Class Outline
Netezza Basics Class Outline CoffingDW education has been customized for every customer for the past 20 years. Our classes can be taught either on site or remotely via the internet. Education Contact:
Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication
Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication Table of Contents Introduction... 3 Shortest Possible Backup Window... 3 Instant
GPSQL Miner: SQL-Grammar Genetic Programming in Data Mining
GPSQL Miner: SQL-Grammar Genetic Programming in Data Mining Celso Y. Ishida, Aurora T. R. Pozo Computer Science Department - Federal University of Paraná PO Box: 19081, Centro Politécnico - Jardim das
Original-page small file oriented EXT3 file storage system
Original-page small file oriented EXT3 file storage system Zhang Weizhe, Hui He, Zhang Qizhen School of Computer Science and Technology, Harbin Institute of Technology, Harbin E-mail: [email protected]
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
Converting Relational Database Into Xml Document
www.ijcsi.org 127 Converting Relational Database Into Xml Document Kanagaraj.S 1 and Dr.Sunitha Abburu 2 1 Master of Computer Application, Adhiyamaan College of Engineering Hosur, Tamilnadu-635109, India.
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Efficient Integration of Data Mining Techniques in Database Management Systems
Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France
Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
ISM 318: Database Systems. Objectives. Database. Dr. Hamid R. Nemati
ISM 318: Database Systems Dr. Hamid R. Nemati Department of Information Systems Operations Management Bryan School of Business Economics Objectives Underst the basics of data databases Underst characteristics
