An Efficient DNA Molecule Clustering using GCC Algorithm


GSTF Journal on Computing (JoC) Vol.4 No.2, March 2015
Faisal Alsaby and Kholood Alnowaiser
Received 31 Dec 2014; Accepted 27 Jan 2015

Abstract- Researchers in the biotechnology field have accomplished many achievements in the past century. They can now measure expression levels for thousands of genes, testing different conditions over varying periods of time. Analysis of these measurements is essential to understand gene patterns and to extract information about gene functions and biological roles. This paper describes a novel approach for clustering large-scale next-generation sequencing (NGS) data. It also facilitates the prediction of patterns and of the likelihood of mutations based on a semi-supervised clustering technique. The process builds on the previously developed FuzzyFind Dictionary construction, which utilizes the Golay Code for error correction. The introduced method is exceptional in that it has linear time complexity and requires only one pass through the file.

Keywords- DNA; RNA; Gene Clustering; Pattern Recognition; Golay Code; Big Data

I. INTRODUCTION

Researchers have generated overwhelming amounts of gene expression data. As time advances, the pace only gets faster and the data grows rapidly. Comprehending gene expression data is a fundamental step in understanding human ancestry, diseases, and their interaction with environmental conditions. It can also result in new medicines and treatments for disease. Although we have successfully accommodated, storage-wise, the pace at which gene expression data is generated, our human brains are not capable of understanding such amounts of raw data. Hence the need for clustering. Clustering is the process of subdividing a given data set into groups, named clusters, such that items within one cluster are similar and items located in different clusters are dissimilar. Clustering next-generation sequences is a computationally demanding problem in the study of large-scale DNA and RNA data. Gene expression clustering has many advantages: it allows scientists and researchers to study the data without studying each individual gene, it enables data visualization, and it helps scientists infer roles for unknown genes that fall in the same cluster as known ones, as well as reduce the redundancies in NGS data.

Many clustering algorithms have been applied to NGS data. Although these algorithms may apply similar ideas, their results can vary tremendously on real-world data. It is important to note that no single algorithm is suitable for every case; rather, each one optimizes one of the many algorithm-evaluation criteria [1]. Most of the available clustering algorithms are not sufficiently fast and scalable, and therefore cannot handle large-scale data with tens of millions of reads. Examples of methods capable of clustering data in the range of a million sequences are UCLUST, FreClu, SEED, and BLAST.

This paper presents a novel approach for clustering massive NGS streams by utilizing the Golay Code Clustering (GCC) algorithm. The approach uses a distance-based unsupervised clustering technique. The proposed method efficiently places sequences into clusters based on mismatch features that are examined by user-definable similarity templates. The presented algorithm tackles the similarity search problem, the most common challenge of the post-genome era, in which one sequence is compared against another, or against a database, to find similar regions between them. In other words, it is capable of recognizing exact matches as well as best matches. The algorithm outperforms other algorithms in the literature in terms of time complexity. This paper also presents a novel approach for predicting patterns and recognizing new patterns based on the clustering results.

In Section 2, a brief description of existing algorithms and clustering techniques is presented, and the sections after that explain the proposed technique. Section 3 gives a brief theoretical background on the clustered data and discusses the clustering process, including the transformation of data into vectors and the composition of cluster addresses. Section 4 illustrates the best-match search. Section 5 describes training the algorithm with labeled data and the pattern recognition approach. Finally, the authors provide their suggestions for future work.

II. RELATED WORK

A. Hierarchical clustering
Hierarchical clustering is favored by biologists because it renders a graphical representation of the results. Despite its graphical capabilities, it does not give the best results. A drawback of this method is that the data needs to be preprocessed, since the algorithm lacks the ability to recognize noise.

B. K-means clustering
K-means clustering works by choosing a number of clusters k and assigning an initial mean to each cluster. The distance between each pattern and each cluster mean is then calculated, and the pattern is assigned to the cluster with the closest mean. New means are calculated and patterns are reorganized into clusters after every iteration. This process repeats until the patterns are stable in their clusters and no further reorganization happens. Although this method is relatively fast (on the order of NK distance computations per iteration), it is not particularly suitable for clustering gene expression data. One reason is that it requires prior specification of the number of clusters k, which is difficult for most gene expression data. Another problem is that it forces every gene into a cluster, which may produce meaningless clusters that are difficult to analyze [2].

C. Self-Organizing Feature Maps (SOM)
SOM is similar to k-means clustering: it divides the input into groups with similar features (patterns), and it requires the user to predetermine the number of initial clusters. The output of this algorithm is a grid of clusters, each with similar features. It is preferable to k-means when dealing with a larger number of clusters [1].

The goal of this paper is to bring forward a simple and efficacious new tool for one of the most demanding operations of the Big Data methodology: clustering large-scale information in a data-stream mode. In addition, it aims to improve clustering and prediction outcomes. There is no optimal clustering technique suitable for all clustering problems; however, we claim that the proposed approach is superior in terms of time complexity and accuracy. The approach uses the information currently available about gene expression data as labeled training data in order to achieve more accurate results. In addition, the proposed algorithm is linear with respect to the input size; its time complexity is O(n).

III. CLUSTERING APPROACH

A. Brief background information
The importance of genomic studies arises from the importance of life itself. DNA molecules hold the code of human life. Scientists discovered the structure of deoxyribonucleic acid (DNA) in the 1950s, and ever since, huge governmental and private projects have been established to study this molecule. Those projects have paid off: among the accomplishments of the past 30 years, the human genome has been decoded. Scientists now know that a single-stranded DNA molecule is a string of characters over the set {A, C, G, T}, and they can identify this string for different molecules (sequencing) [3]. The next step after sequencing is to use those sequences to understand their biological functions, how they interact with each other, and how they interact with external factors. One important aspect of DNA studies is the study of genetic mutations. Things can go wrong when parents pass their genetic information to their offspring; they can also go wrong when a cell divides to produce more cells. Although cells have enzymes that proofread newly replicated DNA strands, errors can still occur. A genetic mutation is a change in the DNA base-pair sequence [4].

This paper presents a clustering technique based on a reversal of the traditional error-correction scheme of the perfect Golay code described in [5] and [6]. The presented clustering algorithm utilizes the Golay Code error-correction procedure to create hash indices. Traditionally, the Golay Code is used for data transmission: it accepts a 12-bit message and adds 11 parity bits to produce a 23-bit codeword, which is then sent through a communication channel. The error-correction process examines the received vector to recover the original transmitted codeword and hence the original message; the parity bits make it possible to detect and correct up to three errors that may occur during transmission. In our clustering technique, the error-correcting procedure is used to map any possible 23-bit combination onto a 12-bit data word, or index. Consequently, every vector corresponds uniquely to the nearest valid codeword and hence is associated with a 12-bit data word, and none of these codewords leads to ambiguous results. Based on this correspondence between the codeword space and the data-word space, we can therefore create a clustering system built on the special error-correcting relationship between the two spaces.

The idea of the proposed clustering system comes from a reversal of the traditional error-correction scheme of the perfect Golay code (23, 12, 7) as described in [5, 7, 8]. With the Golay Code (23, 12, 7), the whole set of 23-bit vectors is partitioned into 2^12 spheres of radius 3. Therefore, a transformation that maps a 23-bit string to the 12-bit center of its sphere is able to tolerate dissimilarity in a certain number of bit positions of the 23-bit strings.
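The following Python sketch illustrates this reverse mapping: it decodes an arbitrary 23-bit vector to the 12-bit message of its unique nearest codeword by brute force over the 2^12 codewords, using one of the standard generator polynomials of the binary Golay code. It is only meant to demonstrate the sphere-partition property; it is not the FuzzyFind construction of [5], which derives six data words per vector.

```python
# Minimal sketch: map any 23-bit vector to the 12-bit message of its nearest
# Golay (23,12,7) codeword by brute force. Non-systematic cyclic encoding is
# used, so the 12-bit result is a unique label rather than embedded message bits.
G_POLY = 0b110001110101  # x^11+x^10+x^6+x^5+x^4+x^2+1, a generator of the binary Golay code

def gf2_mul(a: int, b: int) -> int:
    """Carry-less (GF(2)) multiplication of two bit-packed polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

# 4096 codewords of 23 bits: codeword(x) = message(x) * g(x) over GF(2).
CODEBOOK = {gf2_mul(m, G_POLY): m for m in range(1 << 12)}

def golay_index(vector23: int) -> int:
    """Return the 12-bit index of the unique codeword within Hamming distance 3."""
    for codeword, message in CODEBOOK.items():
        if bin(vector23 ^ codeword).count("1") <= 3:   # Hamming distance <= 3
            return message
    raise AssertionError("unreachable: the (23,12,7) Golay code is perfect")
```

Because the code is perfect, the spheres of radius 3 around the 4,096 codewords cover every one of the 8,388,608 possible vectors exactly once, so the loop always terminates with a unique index.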

According to [5], reversing the Golay Code allows 86.5% of the 23-bit vectors to map onto their corresponding six data words. For our approach this feature is essential. Among the 2^23 = 8,388,608 possible 23-bit vectors, 13.5% do not generate the six data words required for clustering. A remedy is therefore introduced to overcome this limitation: one randomly chosen bit within each such 23-bit vector is flipped. Because the Golay Code can tolerate up to three mismatches, flipping one bit of each vector that does not generate the six data words does not alter the resulting clustering remarkably.

B. Vector representation
Clustering gene expression data with the Golay Code requires the data to be represented as 23-bit vectors, a size suitable for clustering DNA/RNA seeds. The original gene expression data is a series of the letters A, C, T, and G, and a DNA molecule can be encoded into a stream of bits using two bits per letter [3]. This generates a bit stream of even length. Since the approach requires 23-bit vectors, the stream is divided into blocks of 22 bits each, and the 23rd bit of each block is treated as a dummy bit. This step does not alter the results remarkably, thanks to the ability of the Golay Code to tolerate up to three errors, and can therefore be neglected. Another way of creating these vectors is discussed in subsection D below. The final result is a group of 23-bit vectors, which is the transformation of the gene expression data; each of these vectors is fed to the clustering algorithm as a codeword. Each vector V is processed by the Golay code clustering algorithm, and six different 12-bit indices are generated as a result. Another process pairs these indices to generate 15 addresses, and the vector is placed in the 15 clusters corresponding to these addresses. The algorithm depends on the Hamming distance between vectors: it places two vectors in the same cluster if the Hamming distance between them, i.e., the number of mismatching bits, does not exceed a certain number. This helps in identifying not only exact matches but also similar patterns up to a certain similarity threshold. As mentioned in [6], the algorithm identifies two vectors V1 and V2 as belonging to the same cluster if they are between 65.2% and 100% similar.
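A minimal sketch of this encoding step is shown below. The particular two-bit mapping and the zero value of the dummy bit are illustrative assumptions; the paper does not prescribe them.

```python
# Sketch of Section III.B's vector construction under assumed choices:
# the two-bit letter mapping and the zero dummy bit are illustrative only.
from typing import List

TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}  # assumed mapping

def dna_to_vectors(seq: str) -> List[int]:
    """Encode a DNA string two bits per letter, cut the stream into 22-bit
    blocks, and pad each block with a dummy 0 bit to reach 23 bits."""
    bits = "".join(TWO_BIT[base] for base in seq.upper())
    vectors = []
    for i in range(0, len(bits) - 21, 22):          # full 22-bit blocks only
        block = bits[i:i + 22] + "0"                # 23rd bit: dummy padding
        vectors.append(int(block, 2))
    return vectors

# An 11-letter seed (22 bits) yields exactly one 23-bit vector.
print(dna_to_vectors("AATTTTCTTAT"))
```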
C. Similarity measure
The proposed clustering methodology employs the Hamming distance to measure the similarity between vectors. The Hamming distance d(x, y) between two vectors x and y is the number of coordinates in which they differ; in other words, it is the number of bits that must be changed to convert one codeword into the other. For example, two vectors that differ in exactly four bit positions are at Hamming distance 4. This measure is considered one of the simplest, most efficient, and most accurate distance measures [7]. When using the Hamming distance, a codeword may have more than one nearest neighbor, i.e., the nearest neighbor is not always unique. The Hamming distance, a natural similarity measure on binary codes, requires only a few machine instructions per comparison. Moreover, in practice, exact nearest-neighbor search in Hamming space can be performed significantly faster than linear search, with sub-linear run times [6].

D. Metadata templates
Realization of this clustering algorithm requires templates of yes/no questions for each data item [8]. Each template consists of a set of questions, or some other form of dichotomic inquiry, that produces a 23-bit vector, and a group of 23-bit metadata templates suitable for a given problem is designed. Each question investigates the presence or absence of a property, for instance a symptom or a feature. In our case, however, because of the large size of the genome database, it is computationally expensive to compare each position in a query with each position in the database. Thus, as with BLAST and other systems, it is beneficial to use a short, contiguous sequence of letters as a seed: an exact match on a seed may lead to a longer match. Spaced seeds work similarly, except that only some positions are required to match exactly while others are left free [9]. Based on that, the templates we suggest can be developed in such a way that similar vectors are placed within similar clusters, allowing exact-match and best-match search operations. The answer to each question within a template is represented as a 0 or 1, and the answers together build a 23-bit vector that represents a specific block, or seed, of the gene (see Fig. 1) [9]. This procedure places the corresponding seeds in clusters in which each item differs from any other seed in no more than a limited number of bit positions of the corresponding 23-bit template.

Another way of creating these templates is as follows. Suppose that the purpose is to investigate mutations or to provide predictions based on the current status of the clusters; for instance, the chance of developing breast cancer can be computed by utilizing the bit-position feature. As mentioned above, two vectors can be in the same cluster if they are between 65.2% and 100% similar. Because the questions included in the templates investigate the presence of features, a vector whose neighbors have different features is more likely to develop them. For illustration, if two patients are clustered into one common cluster based on their symptoms, each one has a higher chance of developing the symptoms that the other has already developed. Thus, a template might be built to contain 23 questions that compare the sequence under assessment with a reference DNA sequence.
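One possible realization of such a template is sketched below: question i asks whether position i of a 23-letter seed matches the same position of a reference seed. This particular question set is an assumption made for illustration; the paper leaves the template design open.

```python
# Hedged sketch of one possible 23-question metadata template: question i asks
# "does position i of the seed match position i of the reference?". The choice
# of questions is illustrative; Section III.D leaves the template design open.
def template_vector(seed: str, reference: str) -> int:
    """Answer 23 yes/no questions (1 = match, 0 = mismatch) and pack the
    answers into a 23-bit vector, most significant bit first."""
    assert len(seed) == len(reference) == 23
    vector = 0
    for a, b in zip(seed, reference):
        vector = (vector << 1) | (1 if a == b else 0)
    return vector

# First 23 letters of the running example serve as the reference seed.
reference = "AATTTTCTTATATGGCTTGAGAT"
print(bin(template_vector("AATTTTCACCTATTGCTTGAGAT", reference)))
```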

For the sake of illustration, consider the following example. Suppose we have the original DNA sequence AATTTTCTTA TATGGCTTGA GATGTCTTAT and we would like to investigate a new DNA sequence by comparing it against this original sequence in order to find exact matches, detect mutations, and perform other data analysis operations. Assume that k templates, each containing 23 questions, have already been developed.

TABLE I. SEED REPRESENTATION AS VECTORS
DNA (Block) Sequence | Generated Vector | Hamming Distance from the Original
AATTTTCTTT TATGGCTTGA AAT
AATTTTCACC TATTGCTTGAGAT
TTAACCGCAT ATTGGCTTGCGAT

Specifically, the first template investigates the first 23 contiguous letters, the second template investigates the next 23 letters, and so on. Answering these questions produces vectors such as those in Table I. It is important to note that each template is associated with a clustering scheme, and vectors generated by answering the questions of a template are clustered in that template's scheme.

E. Cluster address composition
In order to cluster each generated vector within the specified scheme, appropriate cluster addresses have to be created. The 23-bit vector is transformed into a 12-bit index when the Golay Code is reversed (see [5] for details). This 12-bit index points into a hash table of addresses created for each clustering scheme. Because of the limited size of this table, imposed by the size of the Golay Code data-word space (2^12), a new way of creating addresses is introduced to expand the hash table: by concatenating each pair of the six data words, 15 paired indices are created. In more detail, applying the GCC to a certain vector generates six hash indices that map it to a set of 12-bit data words. According to [5], any two 23-bit codewords that differ by two bits, i.e., at Hamming distance two, have two hash indices in common. Therefore, all possible 2-bit distortions of a codeword share one of the 15 addresses with the undistorted vector. Hence, similar vectors are clustered into the same cluster based on their common indices.
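The pairing step can be sketched as follows. Here fuzzy_golay_indices is a hypothetical stand-in for the FuzzyFind procedure of [5] that returns the six 12-bit data words of a 23-bit vector, and packing each unordered pair into a 24-bit integer is one assumed way to realize the concatenation.

```python
# Sketch of the address-composition step of Section III.E: the six 12-bit data
# words of a vector are combined pairwise (6 choose 2 = 15) into cluster
# addresses. `fuzzy_golay_indices` is a hypothetical stand-in for the FuzzyFind
# procedure of [5] and is not implemented here.
from itertools import combinations
from typing import Callable, List, Set

def cluster_addresses(vector23: int,
                      fuzzy_golay_indices: Callable[[int], List[int]]) -> Set[int]:
    """Return the cluster addresses of a vector, one per unordered pair of its
    six 12-bit data words, each pair packed into a single 24-bit address."""
    words = fuzzy_golay_indices(vector23)             # six 12-bit data words
    return {(min(a, b) << 12) | max(a, b)             # order-independent packing
            for a, b in combinations(words, 2)}
```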
F. Cluster coverage and validation
When the GCC is applied to all possible 23-bit vectors (8,388,608 vectors), a total of 1,267,712 non-empty clusters is created. Each of the generated clusters contains either 139 or 70 codewords; for simplicity, we call these larger clusters (LC) and smaller clusters (SC), respectively. The maximum Hamming distance within each cluster is either 7 or 8. More importantly, the minimum number of bit positions that have common bit values within each cluster is either 15 or 16. This is a significant feature, since it represents the number of common attributes among the codewords of a cluster: it ensures that vectors representing a certain block differ in no more than 7 or 8 features. For example, the first and second vectors in Table I will be clustered into at least one common cluster, while the third will not be included in any cluster that contains the first two. Moreover, within each of the SCs, 98.55% of the vectors have at least 17 common features, while the remaining vectors have either 16 or 15. In the LCs, 86.25% of the codewords have at least 17 common features, while 13.75% share either 16 or 15.

IV. ENSEMBLE METHOD AND BEST-MATCH SEARCH

As mentioned above, realization of the proposed clustering algorithm requires k templates, each associated with a clustering scheme. Clustering a gene of large size therefore faces the limitation imposed by the Golay Code, which requires vectors of 23 bits. Hence, we propose dividing the large file into short chunks of size 23 and examining each chunk with the appropriate template, or examining the same original file, without segmentation, with multiple templates that investigate different features. Each resulting vector is placed within the appropriate clusters of the corresponding scheme, and thus the whole gene can be clustered using the proposed algorithm (Fig. 1). When performing an exact-match or best-match search, the system offers multiple answers following the fuzzy search algorithm discussed in [10]. Therefore, combining the results of the clustering schemes and assigning relevance in terms of the Hamming metric yields the high accuracy and efficiency needed to discover mutations, perform alignment, and carry out other analysis functions.
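A brief sketch of this ensemble step is given below. The helper addresses_for stands in for the whole per-scheme chain described above (template answers, 23-bit vector, six data words, 15 cluster addresses) and is supplied by the caller; it is an assumed interface, not part of the paper.

```python
# Sketch of the ensemble step of Section IV: a long sequence is cut into
# contiguous 23-letter seeds and each seed is routed through k clustering
# schemes. The `addresses_for` callables are assumed stand-ins for the
# per-scheme GCC addressing chain and are supplied by the caller.
from typing import Callable, Dict, List, Set

def chunk_into_seeds(sequence: str, width: int = 23) -> List[str]:
    """Split a DNA string into contiguous seeds of `width` letters; a shorter tail is dropped."""
    return [sequence[i:i + width] for i in range(0, len(sequence) - width + 1, width)]

def cluster_sequence(sequence: str,
                     schemes: Dict[int, Callable[[str], Set[int]]]) -> Dict[int, Dict[int, List[str]]]:
    """Place every seed into the clusters named by each scheme's 15 addresses."""
    tables: Dict[int, Dict[int, List[str]]] = {k: {} for k in schemes}
    for seed in chunk_into_seeds(sequence):
        for k, addresses_for in schemes.items():
            for address in addresses_for(seed):
                tables[k].setdefault(address, []).append(seed)
    return tables
```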

Figure 1. Ensemble Method Overview

V. TRAINING AND PATTERN RECOGNITION METHOD

If data points are contained in the same cluster, there is a high possibility that they are of the same class. This assumption does not necessarily mean that each class forms a single compact cluster; rather, it means that in practice we usually do not observe objects of two distinct classes in the same cluster [11]. Training the system is an important step that precedes pattern recognition, because unlabeled data forms a major challenge for machine learning and data mining systems. The training method uses a fully labeled dataset to achieve better results. For future work, the researchers suggest considering one of the public data sets provided by institutions and information centers such as the National Center for Biotechnology Information (NCBI).

Theoretically, assume there are N training vectors that are fully labeled. Those labels can vary depending on the type of samples the chosen dataset contains: tissue types, organism types, and known sequence names are all valid cluster labels [12]. These N vectors are referred to as centers in this paper and are used to put labels on vectors that have already been clustered. As a more detailed example, consider the Barrett's esophagus data set used and discussed in [13]. The data set has samples of four tissue types, three normal and one neoplastic, so the training vectors in this case are as follows:
V1 normal type 1
V2 normal type 2
V3 normal type 3
V4 neoplastic

The algorithm then iterates over all the clusters and calculates the Hamming distance between each clustered vector and V1, V2, V3, and V4 in order to label the clustered vectors. The label of a vector is the label of its nearest center, as long as the Hamming distance does not exceed a certain number [6]. The next step is to label each cluster according to the majority of the vector labels within it. For each cluster, the frequency of each label n, referred to as its weight Wn, is calculated:

Wn = Frequency[n]

The label with the highest weight W is chosen to label the cluster. This process is called the majority vote. To avoid the effect of noise on this process, the algorithm limits the ability to vote: a cluster is granted the right to vote only if the number of vectors in it is equal to or greater than a certain threshold. Based on [5], good results were achieved with a threshold of 10, with a labeling accuracy of 92.7%.
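A minimal sketch of this labeling step follows, assuming centers is a dictionary of labeled training vectors and members is the list of vectors in one cluster; both data structures, and the max_distance cut-off, are illustrative choices, while the vote threshold of 10 follows the text above.

```python
# Minimal sketch of the training/labeling step of Section V, under assumptions:
# `centers`, `members`, and `max_distance` are illustrative; the nearest center
# is accepted only within `max_distance` bits, as described in the text.
from collections import Counter
from typing import Dict, List, Optional

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def label_vector(v: int, centers: Dict[str, int], max_distance: int) -> Optional[str]:
    """Label a clustered vector with the label of its nearest center, if close enough."""
    label, center = min(centers.items(), key=lambda item: hamming(v, item[1]))
    return label if hamming(v, center) <= max_distance else None

def label_cluster(members: List[int], centers: Dict[str, int],
                  max_distance: int, vote_threshold: int = 10) -> Optional[str]:
    """Majority vote over the members' labels; clusters below the threshold do not vote."""
    if len(members) < vote_threshold:
        return None
    votes = Counter(lbl for v in members
                    if (lbl := label_vector(v, centers, max_distance)) is not None)
    return votes.most_common(1)[0][0] if votes else None
```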
A. Pattern prediction
The presented approach is suited not only to recognizing patterns with exact matches, but also to recognizing and predicting similar patterns, best matches, and patterns with mutations. Assume that a sequence of gene expression data S is divided into blocks B1, B2, ..., Bn. Each of these blocks is fed to the algorithm described in the previous sections, and a pointer to each block, and eventually to S, is placed in each of the k clusters. The majority vote in this case identifies the type of S. Another type of identification is recognizing possible future mutations based on the analysis of previously processed gene expression data using the fuzzy search method. If S is the expression in question, the algorithm follows a fuzzy search to find all other expressions E = {E1, E2, ..., En} that are neighbors of S within a Hamming distance less than or equal to h. By scanning through E, all the bit positions at which the neighbors mismatch S are collected in a group L. Since every two bits represent a letter, distortions in the sequence of bits represent mutations. The mutations in L are ranked by the frequency of their occurrence, and future mutations in S are predicted based on those ranks.
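A hedged sketch of this ranking step is given below. It assumes the neighbor set E has already been retrieved by the fuzzy search of [10], which is not implemented here, and that the vectors follow the illustrative two-bits-per-letter layout with a trailing dummy bit from Section III.B.

```python
# Hedged sketch of the mutation-ranking step of Section V.A. The neighbor set is
# assumed to come from the fuzzy search of [10]; the two-bits-per-letter layout
# with a trailing dummy bit follows the illustrative encoding of Section III.B.
from collections import Counter
from typing import Iterable, List, Tuple

def mismatch_positions(s: int, e: int, width: int = 23) -> List[int]:
    """Bit positions (0 = most significant) at which two vectors differ."""
    diff = s ^ e
    return [i for i in range(width) if diff & (1 << (width - 1 - i))]

def rank_mutations(s: int, neighbors: Iterable[int]) -> List[Tuple[int, int]]:
    """Map mismatching bit positions to letter positions (two bits per letter)
    and rank them by how often they occur across the neighbor set."""
    letter_hits = Counter(bit // 2
                          for e in neighbors
                          for bit in mismatch_positions(s, e)
                          if bit < 22)                # ignore the dummy 23rd bit
    return letter_hits.most_common()                  # most frequent mutation sites first
```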

VI. FUTURE WORK

This paper opens the door to many ideas. It presents a theoretical approach and claims that it runs with time complexity O(n). For future work, the authors suggest a practical implementation of the presented algorithm; experimental results can validate the claims and verify the algorithm's optimality. In fact, many questions can only be answered after the algorithm has been applied to data [14, 15, 16], such as:
- Developing a similarity matrix to combine the ensemble results.
- Do the results match our prior knowledge? (external validation)
- Do the clusters fit the data well? (internal validation)
- Are the results achieved by this algorithm better than those achieved by others?
Also, the authors have not addressed the technical details of mapping from the block clustering of B1, B2, ..., Bn to pattern recognition of the original sequence S. This is an important aspect of the implementation of this algorithm [17].

VII. CONCLUSION

In this paper, a novel approach for clustering NGS data has been introduced. The process is based on the GCC. The proposed methodology clusters diverse information items in a data-stream mode. The basic idea is to take an unsupervised clustering method and then use labeled data to train the system. The result is a group of clusters such that data points within each cluster are homogeneous and data points from different clusters are non-homogeneous. The technique followed in this paper improves the ability to extract knowledge and insights from large and complex collections of gene expression data by improving the classification accuracy. It also surpasses other methods by improving the time complexity to O(n). For these two reasons, the introduced approach is applicable to many classification problems and is specifically suitable for analyzing long DNA sequences and understanding their biological functions.

REFERENCES
[1] P. D'haeseleer, "How does gene expression clustering work?" Nat. Biotechnol., vol. 23, no. 12.
[2] M. Basu and T. K. Ho, Data Complexity in Pattern Recognition, London: Springer.
[3] S. Faro and T. Lecroq, "An efficient matching algorithm for encoded DNA sequences and binary strings," 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009).
[4] C. K. Omoto and P. F. Lurquin, Genes and DNA: A Beginner's Guide to Genetics and Its Applications, New York: Columbia University Press.
[5] F. Alsaby and S. Berkovich, "Realization of clustering with Golay code transformations," GSTF Journal on Computing (JoC), vol. 4, no. 1, 2014.
[6] F. Alsaby, K. Alnowaiser, and S. Berkovich, "Golay Code Transformations for Ensemble Clustering in Application to Medical Diagnostics," unpublished.
[7] S. Berkovich and E. El-Qawasmeh, "Reversing the error-correction scheme for a fault-tolerant indexing," The Computer Journal, vol. 43, no. 1, 2000.
[8] E. El-Qawasmeh and M. Safar, "Investigation of Golay Code (24, 12, 8) Structure in Improving Search Techniques," Association of Arab Universities, 2011.
[9] S. Berkovich and D. Liao, "On clusterization of big data streams," Proceedings of the 3rd International Conference on Computing for Geospatial Research and Applications, article no. 26, ACM Press, New York.
[10] M. Yammahi, K. Kowsari, C. Shen, and S. Berkovich, "An Efficient Technique for Searching Very Large Files with Fuzzy Criteria Using the Pigeonhole Principle," Computing for Geospatial Research and Application (COM.Geo), 2014 Fifth International Conference on, pp. 82-86, 4-6 Aug. 2014.
[11] O. Chapelle, B. Scholkopf, and A. Zien, Introduction to Semi-Supervised Learning, Cambridge, Massachusetts: The MIT Press, ch. 1, p. 6.
[12] K. Yeung, D. Haynor, and W. Ruzzo, "Validating clustering for gene expression data," Bioinformatics, vol. 17, no. 4.
[13] N. Grira, M. Crucianu, and N. Boujemaa, "Unsupervised and Semi-supervised Clustering: a Brief Survey," 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 9-16.
[14] Y. Hongjun, T. Jing, D. Chen, and S. Berkovich, "Golay Code Clustering for Mobility Behavior Similarity Classification in Pocket Switched Networks," Journal of Communication and Computer, USA.
[15] D. Greene, M. Parnas, and F. Yao, "Multi-index hashing for information retrieval," FOCS.
[16] M. Norouzi, A. Punjani, and D. Fleet, "Fast search in Hamming space with multi-index hashing," CVPR, 2012.
[17] U. Keich, M. Li, B. Ma, and J. Tromp, "On spaced seeds for similarity search," Discrete Applied Mathematics, vol. 138, no. 3, 15 April 2004.

AUTHORS PROFILE

Faisal Alsaby received his BSc degree in Computer Science and Information Systems from King Saud University, Saudi Arabia, and an MS degree in Computer Science from The George Washington University, USA. He is currently a Ph.D. candidate at GWU majoring in Computer Science. His research interests are big data clustering algorithms, machine learning, and pattern recognition.

Kholood Alnowaiser received her BSc degree in Computer Science from Dammam University and an MS degree in Computer Science from The George Washington University, USA. She is currently a lecturer at Dammam University.

This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.


More information

Personalized Information Management for Web Intelligence

Personalized Information Management for Web Intelligence Personalized Information Management for Web Intelligence Ah-Hwee Tan Kent Ridge Digital Labs 21 Heng Mui Keng Terrace, Singapore 119613 Email: [email protected] Abstract Web intelligence can be defined

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Journal of Al-Nahrain University Vol.15 (2), June, 2012, pp.161-168 Science Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Manal F. Younis Computer Department, College

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Web Usage Mining: Identification of Trends Followed by the user through Neural Network

Web Usage Mining: Identification of Trends Followed by the user through Neural Network International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 617-624 International Research Publications House http://www. irphouse.com /ijict.htm Web

More information

Identification algorithms for hybrid systems

Identification algorithms for hybrid systems Identification algorithms for hybrid systems Giancarlo Ferrari-Trecate Modeling paradigms Chemistry White box Thermodynamics System Mechanics... Drawbacks: Parameter values of components must be known

More information

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring 714 Evaluation of Feature election Methods for Predictive Modeling Using Neural Networks in Credits coring Raghavendra B. K. Dr. M.G.R. Educational and Research Institute, Chennai-95 Email: [email protected]

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information