DOI 10.7603/s40601-014-0011-y
GSTF Journal on Computing (JoC) Vol.4 No.2, March 2015

An Efficient DNA Molecule Clustering using GCC Algorithm

Faisal Alsaby and Kholood Alnowaiser

Received 31 Dec 2014  Accepted 27 Jan 2015

Abstract — Researchers in the biotechnology field have accomplished many achievements in the past century. They can now measure expression levels for thousands of genes, testing different conditions over varying periods of time. Analyzing these measurements is essential to understanding gene patterns and extracting information about gene functions and biological roles. This paper describes a novel approach for clustering large-scale next-generation sequences (NGS). It also facilitates predicting patterns and the likelihood of mutations based on a semi-supervised clustering technique. The process builds on the previously developed FuzzyFind Dictionary, which utilizes the Golay Code for error correction. The introduced method is exceptional in that it has linear time complexity and requires only a single pass through the file.

Keywords — DNA; RNA; gene clustering; pattern recognition; Golay Code; big data

I. INTRODUCTION

Researchers have generated overwhelming amounts of gene expression data, and as time advances the pace only quickens. Comprehending gene expression data is a fundamental step in understanding human ancestry, diseases, and their interaction with environmental conditions; it can also lead to new medicines and treatments for disease. Although we have successfully accommodated, storage-wise, the pace at which gene expression data is generated, human brains are not capable of understanding such amounts of raw data. Hence the need for clustering. Clustering is the process of subdividing a given data set into groups, named clusters, such that items within one cluster are similar and items in different clusters are dissimilar.
Clustering of next-generation sequences is a rather complicated computational problem that concerns the large-scale data of DNA and RNA molecules. Gene expression clustering has many advantages: it allows scientists and researchers to study data without studying each individual gene, it enables data visualization, and it helps scientists infer roles for unknown genes in the same cluster as well as reduce the redundancies in NGS data. Clustering algorithms may apply similar ideas, yet their results can vary tremendously on real-world data. It is important to note that no single algorithm is suitable for all problems; rather, each one optimizes one of many algorithm-evaluation criteria [1]. Most of the available clustering algorithms are not adequately fast and scalable; therefore, they are not capable of handling large-scale data with tens of millions of reads. Examples of methods capable of clustering data in the range of a million sequences are UCLUST, FreClu, SEED, and BLAST.

This paper presents a novel approach for clustering massive NGS streams by utilizing the Golay Code Clustering (GCC) algorithm. The approach uses a distance-based unsupervised clustering technique. The proposed method efficiently places sequences into clusters based on mismatch features that are examined by user-definable similarity templates. The presented algorithm tackles the similarity search problem, the most common challenge faced in the post-genome era, in which a sequence is compared against another sequence, or against a database, to find similar regions between them. In other words, it is capable of recognizing both exact matches and best matches. The algorithm outperforms other algorithms in the literature in terms of time complexity. This paper also presents a novel approach for predicting and recognizing new patterns based on the clustering results.
In Section 2, a brief description of already available algorithms and clustering techniques is presented. The sections after that explain the proposed technique. Section 3 gives brief theoretical background on the clustered data; it also discusses the process of clustering data, such as transforming data into vectors and composing cluster addresses. Section 4 illustrates the best-match search. Section 5 discusses training the algorithm with labeled data and the pattern recognition approach. Finally, the authors provide their suggestions for future work. There are many clustering algorithms that have been applied to NGS data.

DOI: 10.5176/2251-3043_4.2.324
II. RELATED WORK

A. Hierarchical clustering
The hierarchical clustering method is favored by biologists due to its ability to render a graphical representation of the results. Despite this algorithm's graphical capabilities, it does not give the best results. A drawback of this method is that the data needs to be preprocessed, owing to the algorithm's lack of noise-recognition abilities.

B. K-means clustering
The k-means clustering method works by choosing k, the number of clusters, and assigning an initial mean to each cluster. The distances between each pattern and the cluster means are then calculated, and each pattern is assigned to the cluster with the closest mean. New means are calculated and patterns are reorganized into clusters after every iteration. This process repeats until patterns are stable in their clusters and no further reorganization happens. Although this method is relatively fast (NK distance computations per iteration), it is not particularly suitable for clustering gene expression data. One reason is that it needs prior specification of the number of clusters k, which can be difficult for most gene expression data. Another problem is that it forces every gene into a cluster, which may produce meaningless clusters that are difficult to analyze [2].

C. Self-Organizing Feature Maps (SOM)
SOM is similar to k-means clustering: it divides the input into groups with similar features (patterns), and it requires the user to predetermine the number of initial clusters. The output of this algorithm is a grid of clusters, each with similar features. This algorithm is preferred over k-means when dealing with a larger number of clusters [1].

The goal of this paper is to bring forward a new, simple, and efficacious tool for one of the most demanding operations of the Big Data methodology: clustering of large-scale information in a data-stream mode. In addition, it aims at improving the clustering and prediction outcomes.
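The k-means procedure described in subsection B can be sketched as follows. This is a minimal, generic illustration on arbitrary numeric points, not the authors' implementation:

```python
import random

def kmeans(points, k, iterations=100):
    """Minimal k-means over points given as tuples of floats."""
    means = random.sample(points, k)          # initial cluster means
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # assign each pattern to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        # recompute means; an empty cluster keeps its old mean
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:                # stable: no further reorganization
            break
        means = new_means
    return clusters, means
```

As the text notes, both the choice of k and the forced assignment of every point to some cluster are fixed up front, which is precisely the drawback for gene expression data.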
There is no optimal clustering technique suitable for all clustering problems; however, we claim that the proposed approach is superior in terms of time complexity and accuracy. The approach uses the information currently available about gene expression data as labeled training data in order to achieve more accurate results. In addition, the proposed algorithm is linear with respect to the input size; its time complexity is in O(n).

III. CLUSTERING APPROACH

A. Brief background information
The importance of genomic studies rises from the importance of life itself. DNA molecules hold the code of human life. Scientists discovered the structure of deoxyribonucleic acid (DNA) in the 1950s. Ever since, huge governmental and private projects have been established to study this molecule. Those projects have paid off: much has been accomplished in the past 30 years, including the decoding of the human genome. Scientists now know that a single-stranded DNA molecule is a string of characters over the set {A, C, G, T}. They can also identify this string for different molecules (sequencing) [3]. The next step after sequencing is to use those sequences to understand their biological functions, how they interact with each other, and how they interact with external factors. One important aspect of DNA studies is the study of genetic mutations. Things can go wrong when parents pass their genetic information to their offspring; they can also go wrong when a cell divides to produce more cells. Although cells have enzymes to proofread newly replicated DNA strands, errors can still occur. A genetic mutation is a change in a DNA base-pair sequence [4].

This paper presents a clustering technique based on a reverse of the traditional error-correction scheme using the perfect Golay code described in [5] and [6]. The presented clustering algorithm utilizes the Golay Code error-correction algorithm to create hash indices.
Traditionally, the Golay Code is used for data transmission. It accepts a 12-bit message and adds 11 correction (parity) bits to produce a 23-bit codeword, which is then sent through a communication channel. The error-correction process examines the received vector to recover the original transmitted codeword and then the original message. The parity bits assist in detecting and correcting up to three errors that might occur during transmission. In our clustering technique, the error-correcting procedure is instead used to map any possible 23-bit combination into a 12-bit data word, or index. Consequently, every vector corresponds uniquely to the nearest valid codeword and hence is associated with a 12-bit data word. Note that none of these codewords leads to ambiguous results. Therefore, based on this correspondence between the codeword space and the data word space, we can create a clustering system built on the special error-correcting relationship between the two spaces. The idea of the proposed clustering system comes from a reverse of the traditional error-correction scheme using the perfect Golay code (23, 12, 7), as described in [5, 7, 8]. With the Golay Code (23, 12, 7), the whole set of 23-bit vectors is partitioned into 2^12 spheres of radius 3.
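The sphere partition described above can be made concrete with a small sketch. This is a brute-force illustration built from one standard generator polynomial of the (23, 12, 7) Golay code; a practical decoder would use syndrome tables rather than exhaustive nearest-codeword search:

```python
# Generator polynomial of the perfect binary (23, 12, 7) Golay code:
# g(x) = x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1
G_POLY = 0b110001110101

def golay_encode(msg12):
    """Systematic encoding: 12 data bits followed by 11 parity bits."""
    rem = msg12 << 11                     # msg(x) * x^11
    for i in range(22, 10, -1):           # reduce modulo g(x) over GF(2)
        if rem & (1 << i):
            rem ^= G_POLY << (i - 11)
    return (msg12 << 11) | rem            # parity bits occupy the low 11 bits

CODEWORDS = [golay_encode(m) for m in range(1 << 12)]   # the 2^12 sphere centers

def golay_decode(vec23):
    """Map any 23-bit vector to the 12-bit data word of its nearest codeword."""
    nearest = min(CODEWORDS, key=lambda c: bin(c ^ vec23).count("1"))
    return nearest >> 11
```

Because the radius-3 spheres around the 4,096 codewords tile the whole 23-bit space, flipping up to three bits of any codeword still decodes to the same 12-bit data word, which is exactly the tolerance the clustering scheme exploits.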
Therefore, a transformation that maps a 23-bit string onto the 12-bit centers of these spheres is able to tolerate a certain number of dissimilar bit positions between 23-bit strings. According to [5], reversing the Golay Code allows 86.5% of the codewords to map into their corresponding six data words. For our approach, this feature is essential. Among the possible 23-bit codewords (2^23 = 8,388,608), 13.5% do not generate the six data words required for clustering. Therefore, a remedy is introduced to overcome this limitation: randomly flipping one bit within each such 23-bit codeword. Because the Golay Code can tolerate up to three mismatches, flipping one bit of each vector that does not generate the six data words does not alter the resulting codeword remarkably.

B. Vector representation
Clustering gene expression data using the Golay Code requires data to be represented as a 23-bit vector. This size is suitable for clustering DNA/RNA seeds. The original gene expression data is a series of the letters A, C, T, and G. Encoding the DNA molecule into a stream of bits is possible using two bits to represent a single letter [3], which generates a bit stream of even length. Since the approach requires a 23-bit vector, the stream of bits is divided into blocks, each 22 bits long, and the 23rd bit is filled with dummy data. This step does not alter the results remarkably, due to the ability of the Golay Code to tolerate up to 3 errors; thus it can be neglected. Another way of creating these vectors is discussed in subsection D below. The final result is a group of 23-bit vectors, which is the transformation of the gene expression data. Each of these vectors is a codeword that is fed to the clustering algorithm.
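The two-bits-per-letter encoding just described can be sketched as follows. The specific letter-to-bits assignment is an illustrative choice, as the paper does not fix one:

```python
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}   # one possible 2-bit mapping

def sequence_to_vectors(seq):
    """Encode a DNA string as 23-bit vectors: 22 data bits plus one dummy bit."""
    bits = "".join(ENCODING[base] for base in seq)
    full = len(bits) - len(bits) % 22                     # keep full 22-bit blocks only
    return [bits[i:i + 22] + "0" for i in range(0, full, 22)]   # 23rd bit is padding
```

Since the Golay Code tolerates up to three mismatches, the single padding bit does not noticeably affect which cluster a block lands in.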
Each vector V is processed by the Golay Code clustering algorithm, and six different 12-bit indices are generated as a result. Another process performs pairwise combination over these indices to generate 15 addresses, and the vector is placed in the 15 clusters corresponding to these addresses. The algorithm depends on the Hamming distance between vectors; it places two vectors in the same cluster if the Hamming distance between them (i.e., the number of mismatching bits between the two vectors) does not exceed a certain number. This is helpful for identifying not only exact matches but also similar patterns up to a certain similarity threshold. As mentioned in [6], the algorithm identifies two vectors V1 and V2 as belonging to the same cluster if they are between 65.2% and 100% similar.

C. Similarity measure
The proposed clustering methodology employs the Hamming distance to measure the similarity between vectors. The Hamming distance d(x, y) between two vectors x and y is the number of coefficients in which they differ; in other words, it is the number of bits that must be changed to convert one codeword into the other. For example, the Hamming distance between the vectors 01101010 and 11011011 is 4. This measure is considered one of the most simple, efficient, and accurate distance measures [7]. When using the Hamming distance, it is possible for a codeword to have more than one nearest neighbor, i.e., the nearest neighbor is not always unique. The Hamming distance, a natural similarity measure on binary codes, requires only a few machine instructions per comparison. Moreover, in practice, exact nearest-neighbor search in Hamming space can be performed significantly faster than linear search, with sub-linear run times [6].

D. Metadata templates
Realization of this clustering algorithm requires using templates of yes/no questions for each data item [8].
Each template consists of a set of questions, or some other form of dichotomic inquiry, that produces a 23-bit vector. A group of 23-bit metadata templates suitable for a given problem is designed. Each question investigates the presence or absence of a property, for instance a symptom or a feature. In our case, however, because of the large size of the genome database, it is computationally expensive to compare each position in a query with each position in the database. Thus, as with BLAST and other systems, it is beneficial to use a short, contiguous sequence of letters as a seed; an exact seed match might then lead to a longer match. This resembles spaced seeds, which require exact letters in some positions while leaving others unconstrained [9]. Based on that, the templates we suggest can be developed in such a way that similar vectors are placed within similar clusters, allowing exact-match and best-match search operations. The answer to each question within a template is represented as 0 or 1, which eventually builds a 23-bit vector representing a specific block or seed of the gene (Fig. 1) [9]. This procedure results in the corresponding seeds residing in clusters where each item differs from another seed in no more than some limited number of bit-position mismatches of a corresponding 23-bit template. Another way of creating these templates is as follows: suppose the purpose is to investigate mutations or provide predictions based on the current status of the clusters; for instance, the chance of developing breast cancer can be computed by utilizing the bit-position feature. As mentioned above, two vectors can be in the same cluster if they are between 65.2% and 100% similar.
Based on the questions included in the templates, which investigate the presence of features, vectors whose neighbors have different features are more likely to have
them. To illustrate: if two patients are clustered by their symptoms into one common cluster, each one has a higher chance of developing the symptoms that the other has already developed. Thus, the template might be built so that it contains 23 questions that compare the sequence under assessment with a reference DNA sequence. Consider the following example. Suppose we have the original DNA sequence (AATTTTCTTA TATGGCTTGA GATGTCTTAT) and we would like to investigate a new DNA sequence and compare it to this original sequence in order to find exact matches, find mutations, and perform other data-analysis operations. Initially, assume that k templates, each containing 23 questions, have already been developed.

TABLE I. SEEDS REPRESENTATION IN VECTORS

DNA (Block) Sequence      | Generated Vector          | Hamming Distance from the Original
AATTTTCTTTTATGGCTTGAAAT   | 11111111101111111111011   | 2
AATTTTCACCTATTGCTTGAGAT   | 11111110001110111111111   | 4
TTAACCGCATATTGGCTTGCGAT   | 00000000000011111110111   | 13

Specifically, the first template investigates the first 23 contiguous letters, the second template investigates the second 23 letters, and so on. Answering these questions produces vectors as in Table I. It is important to note that each template is associated with a clustering scheme; vectors generated by answering the questions of that template are clustered in that scheme.

E. Cluster address composition
In order to cluster each generated vector within the specified scheme, appropriate clustering addresses have to be created. The 23-bit vector is transformed into a 12-bit index when the Golay Code is reversed (see [5] for more details). This 12-bit index points into a hash table of addresses created for each clustering scheme.
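Returning to Table I, the position-by-position comparison behind it can be sketched as follows. This is a minimal illustration of one template style (exact-match questions against a reference window); real templates may ask arbitrary yes/no questions:

```python
REFERENCE = "AATTTTCTTATATGGCTTGAGATGTCTTAT"   # the original sequence of the example

def match_vector(seed, reference=REFERENCE, offset=0):
    """Answer 23 yes/no questions: does position i of the seed match the reference?"""
    window = reference[offset:offset + 23]
    bits = "".join("1" if s == r else "0" for s, r in zip(seed, window))
    return bits, bits.count("0")    # the vector and its Hamming distance to the window
```

For the first row of Table I, `match_vector("AATTTTCTTTTATGGCTTGAAAT")` yields the vector 11111111101111111111011 at Hamming distance 2 from the reference window.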
Due to the limited size of this table, imposed by the size of the Golay Code data-word space, i.e., 2^12, a new way of creating addresses is presented to expand the hash table. By performing a special concatenation of each two of the six data words, 15 paired indices are created. Applying the aforementioned GCC to a certain vector generates six hash indices, mapping the vector to a set of 12-bit data words. According to [5], any two 23-bit codewords differing by two bits (i.e., at Hamming distance 2) will have two hash indices in common. Therefore, all possible 2-bit distortions of a codeword will share one of the 15 addresses with the undistorted vector. Hence, similar vectors are clustered into the same cluster based on their common indices.

F. Cluster coverage and validation
When applying the GCC to all possible 23-bit vectors (8,388,608 vectors), a total of 1,267,712 non-empty clusters were created. Each generated cluster contains either 139 or 70 codewords; for simplicity, we call these larger clusters (LC) and smaller clusters (SC), respectively. The maximum Hamming distance within each cluster is either 7 or 8. More importantly, the minimum total number of bit positions that have common bit values within each cluster is either 15 or 16. This is a particularly significant feature since it represents the total number of common attributes between codewords within a certain cluster; i.e., it ensures that vectors representing a certain block differ in no more than 7 or 8 features. For example, the first and second vectors in Table I will be clustered into at least one common cluster, while the third will not be included in any cluster that contains the first and second vectors. More importantly, within each of the SCs, 98.55% of the vectors have at least 17 common features, while the remaining vectors have either 16 or 15.
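The pairwise address composition described in subsection E can be sketched as follows. The exact concatenation scheme the authors use is not specified, so packing each ordered pair into a 24-bit integer is an illustrative choice:

```python
from itertools import combinations

def cluster_addresses(data_words):
    """Combine six 12-bit data words pairwise into 15 order-independent addresses."""
    return {(a << 12) | b for a, b in combinations(sorted(set(data_words)), 2)}
```

Because two codewords at Hamming distance 2 share two of their six data words, their address sets intersect in exactly one of the 15 addresses, which is what places similar vectors into a common cluster.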
On the other hand, 86.25% of the codewords in LCs have at least 17 common features, while 13.75% share either 16 or 15.

IV. ENSEMBLE METHOD AND BEST-MATCH SEARCH

As mentioned, realization of the proposed clustering algorithm requires utilizing k templates, each associated with a clustering scheme. In order to cluster a gene of large size, a limitation is imposed by the Golay Code, which requires vectors to be 23 bits long. Hence, we propose a novel approach: dividing the large file into short chunks of size 23 and examining them according to the appropriate templates, or examining the same original file without segmentation using multiple templates that investigate different features. Each resulting vector is placed within appropriate clusters in the corresponding scheme. Thus, the whole gene can be clustered using our efficient clustering algorithm (Fig. 1). When performing the exact- or best-match search, the system offers multiple answers following the fuzzy search algorithm discussed in [10]. Therefore, combining the results of the clustering schemes and assigning the relevance
in terms of the Hamming metric yields the high accuracy and efficiency needed to discover mutations, perform alignment, and carry out other analysis functionalities.

Figure 1. Ensemble Method Overview

V. TRAINING AND PATTERN RECOGNITION METHOD

If data points are contained in the same cluster, there is a high probability that they are of the same class. This assumption does not necessarily mean that each class forms a single compact cluster; however, it means that in practice we usually do not observe objects of two distinct classes in the same cluster [11]. Training the system is an important step that precedes the pattern recognition phase, because unlabeled data forms a major challenge for machine learning and data mining systems. The training method uses a fully labeled dataset to achieve better results. For future work, the researchers suggest considering one of the public data sets provided by many institutions and information centers, such as the National Center for Biotechnology Information (NCBI). Theoretically, assume there are N training vectors that are fully labeled. The labels can vary depending on the type of samples the chosen dataset contains: types of tissues, types of organisms, and known sequence names are all valid cluster labels [12]. Those N vectors will be referred to as centers in this paper and will be used to put labels on vectors that have already been clustered. For a more in-depth explanation, this paper considers the Barrett's esophagus data set used and discussed in [13]. The data set has samples of four tissue types: three normal and one neoplastic. The training vectors in this case are as follows:

V1 - normal type 1
V2 - normal type 2
V3 - normal type 3
V4 - neoplastic

After that, the algorithm iterates over all the clusters and calculates the Hamming distance between each clustered vector and V1, V2, V3, and V4.
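The nearest-center labeling step just described can be sketched as follows. The center values, labels, and distance threshold here are illustrative; the paper takes the actual threshold from [6]:

```python
def hamming(x, y):
    """Number of differing bit positions between two integers."""
    return bin(x ^ y).count("1")

def label_vectors(vectors, centers, max_distance=8):
    """Give each clustered vector the label of its nearest center, provided the
    Hamming distance to that center does not exceed max_distance."""
    labels = {}
    for v in vectors:
        label = min(centers, key=lambda name: hamming(v, centers[name]))
        if hamming(v, centers[label]) <= max_distance:
            labels[v] = label          # e.g. "normal type 1" ... "neoplastic"
    return labels
```

Vectors farther than the threshold from every center remain unlabeled, which keeps noisy vectors out of the subsequent voting step.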
The aim of these calculations is to label clustered vectors. The label of a vector is the exact label of its nearest center, as long as the Hamming distance does not exceed a certain number [6]. The next step is to label clusters depending on the majority of vector labels within each cluster. For each cluster, the frequency of each label is calculated; this is referred to as the weight W:

Wn = Frequency[n]

where Wn represents the weight of label n. The label with the highest weight W is chosen to label the cluster. This process is called the vote of the majority. To avoid the effect of noise on this process, the algorithm limits the ability to vote: a cluster is granted the right to vote only if the number of vectors in it is equal to or greater than a certain threshold. Based on [5], good results were achieved with threshold = 10, where the labeling process is 92.7% accurate.

A. Pattern prediction
The presented approach is suited not only to recognizing patterns with exact matches, but also to recognizing and predicting similar patterns, best matches, and patterns with mutations. Assume that a sequence of gene expression data S is divided into blocks B1, B2, ..., Bn. Each of these blocks is fed to the algorithm described in the previous sections. After that, a pointer to each block, and eventually to S, is placed in each of the k clusters. The vote of the majority in this case identifies the type of S. Another type of identification is to recognize possible future mutations based on the analysis of previously processed gene expression data using the fuzzy search method. If S is the expression in question, the algorithm follows a fuzzy search to find all other expressions E = {E1, E2, ..., En} that are neighbors of S within a certain Hamming distance h. By sequencing through E, all the bit positions of mismatches with S are collected in a group L. Since every two bits represent a letter, distortions in the sequence of bits represent mutations.
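The neighbor search and mismatch bookkeeping just described can be sketched as follows, with expressions represented as integers. The threshold h and the data are illustrative:

```python
from collections import Counter

def predict_mutation_positions(s, expressions, h):
    """Collect the mismatching bit positions between S and every neighbor within
    Hamming distance h (the group L of the text), ranked by frequency."""
    mismatches = Counter()
    for e in expressions:
        diff = s ^ e
        if bin(diff).count("1") <= h:    # e is a neighbor of S
            pos = 0
            while diff:
                if diff & 1:
                    mismatches[pos] += 1
                diff >>= 1
                pos += 1
    return mismatches.most_common()      # mutation positions ranked by frequency
```
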
Mutations in L are ranked by the frequency of their occurrences, and future mutations in S are predicted based on those ranks.

VI. FUTURE WORK

This paper opens the door to many ideas. It presents a theoretical approach and claims it runs with time complexity O(n). For future work, the authors suggest a practical implementation of the presented algorithm; experimental results can validate the claims and verify the algorithm's optimality. In fact, many questions can only be answered
after the application of the algorithm on data [14, 15, 16], such as:

- Developing a similarity matrix to combine the ensemble results.
- Do the results match our prior knowledge? (external validation)
- Do the clusters fit the data well? (internal validation)
- Are the results achieved by this algorithm better than those achieved by others?

Also, the authors have not addressed any technical details about the process of mapping from the block clustering of B1, B2, ..., Bn to pattern recognition of the original sequence S. This is an important aspect of the implementation of this algorithm [17].

[7] S. Berkovich and E. El-Qawasmeh, "Reversing the error-correction scheme for a fault-tolerant indexing," Computer Journal, England, vol. 43, no. 1, pp. 54-64, 2000.
[8] E. El-Qawasmeh and M. Safar, "Investigation of Golay code (24, 12, 8) structure in improving search techniques," Association of Arab Universities, 2011.
[9] S. Berkovich and D. Liao, "On clusterization of big data streams," Proceedings of the 3rd International Conference on Computing for Geospatial Research and Applications, article no. 26, ACM Press, New York, 2012.
[10] M. Yammahi, K. Kowsari, C. Shen, and S. Berkovich, "An efficient technique for searching very large files with fuzzy criteria using the pigeonhole principle," Computing for Geospatial Research and Application (COM.Geo), 2014 Fifth International Conference on, pp. 82-86, 4-6 Aug. 2014.
[11] O. Chapelle, B. Scholkopf, and A. Zien, Introduction to Semi-Supervised Learning, Cambridge, Massachusetts: The MIT Press, ch. 1, p. 6.
[12] K. Yeung, D. Haynor, and W. Ruzzo, "Validating clustering for gene expression data," Bioinformatics, vol. 17, no. 4, pp. 309-318, 2001.

VII. CONCLUSION

In this paper, a novel approach for clustering NGS has been introduced. The process is based on the GCC. The proposed methodology works by clustering diverse information items in a data-stream mode.
The basic idea is to take an unsupervised clustering method and then use labeled data to train the system. The result is a group of clusters such that data points within each cluster are homogeneous and data points from different clusters are not. The technique followed in this paper improves the ability to extract knowledge and insights from large and complex collections of gene expression data by improving classification accuracy. It also surpasses other methods by improving the time complexity to O(n). For these two reasons, the introduced approach is applicable to many classification problems, and is specifically suitable for analyzing long DNA sequences and understanding their biological functions.

[13] N. Grira, M. Crucianu, and N. Boujemaa, "Unsupervised and semi-supervised clustering: a brief survey," 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 9-16, 2005.
[14] Y. Hongjun, T. Jing, D. Chen, and S. Berkovich, "Golay code clustering for mobility behavior similarity classification in pocket switched networks," Journal of Communication and Computer, USA, 2012.
[15] D. Greene, M. Parnas, and F. Yao, "Multi-index hashing for information retrieval," FOCS, 1994.
[16] M. Norouzi, A. Punjani, and D. Fleet, "Fast search in Hamming space with multi-index hashing," CVPR, 2012.
[17] U. Keich, M. Li, B. Ma, and J. Tromp, "On spaced seeds for similarity search," Discrete Applied Mathematics, vol. 138, no. 3, pp. 253-263, 15 April 2004.

REFERENCES

[1] P. D'haeseleer, "How does gene expression clustering work?" Nat. Biotechnol., vol. 23, no. 12, pp. 1499-1501, 2005.
[2] M. Basu and T. K. Ho, Data Complexity in Pattern Recognition, London: Springer, 2006.
[3] S. Faro and T. Lecroq, "An efficient matching algorithm for encoded DNA sequences and binary strings," 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), 2009.
[4] C. K. Omoto and P. F.
Lurquin, Genes and DNA: A Beginner's Guide to Genetics and Its Applications, New York: Columbia University Press, 2004.
[5] F. Alsaby and S. Berkovich, "Realization of clustering with Golay code transformations," GSTF Journal on Computing (JoC), vol. 4, no. 1, 2014.
[6] F. Alsaby, K. Alnowaiser, and S. Berkovich, "Golay code transformations for ensemble clustering in application to medical diagnostics," unpublished.

AUTHORS PROFILE

Faisal Alsaby received his BSc degree in Computer Science and Information Systems from King Saud University, Saudi Arabia, in 2005. He received an MS degree in Computer Science from the George Washington University, USA, in 2012. He is currently a Ph.D. candidate at GWU majoring in Computer Science. His research interests are big-data clustering algorithms, machine learning, and pattern recognition.

Kholood Alnowaiser received her BSc degree in Computer Science from Dammam University. She received an MS degree in Computer Science from the George Washington University, USA, in 2015. She is currently a lecturer at Dammam University.

This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.