Detection of ORF Frames Using Data Mining

Detection of ORF Frames Using Data Mining 1 Nisha Rani, 2 Rajbir Singh, 3 Gaurav Arora 1,2,3 Dept. of Information Technology, Lala Lajpat Rai Institute of Engg. and Tech., Moga, Punjab, INDIA Abstract The biological data is available in different formats and is comparatively more complex. Knowledge discovery from these large and complex databases is the key problem of this era. Data mining and machine learning techniques are needed which can scale to the size of the problems and can be customized to the application of biology. The research in bioinformatics has accumulated large amount of data. As the hardware technology advancing, the cost of storing is decreasing. The biological data is available in different formats and is comparatively more complex. Knowledge discovery from these large and complex databases is the key problem of this era. Data mining and machine learning techniques are needed which can scale to the size of the problems and can be customized to the application of biology. In the present research work, Open Reading Frame is Detected with the help of data mining. Various consensus Sequences are gathered and Cluster algorithm is applied based on local alignment score calculation. The sequences having greater score will be more similar and this is basis for applying clustering algorithm. Another algorithm which uses the clusters created by previous algorithm is created for detecting the percentage match of the consensus sequence with the entered DNA sequence which further results for the Detection of Open reading Frame of entered sequence Index Terms DNA, RNA, amino acid, codons. information (at the other extreme, a single cluster containing all the records is a great summarization but does not contain enough specific information to be useful) [8]. Clustering analysis has received significant attention in the area of ORF Detection. It allows the identification of the structure of a data set, i.e. the identification of groups of similar objects in multidimensional space. Clustering procedures yield a data description in terms of clusters or groups of data points that possess strong internal similarities. Some possible methods of clustering are: 1. Divisive or Partitional Clustering These methods start with each point as a part of a random or guessed cluster and iteratively move points between clusters until some local minimum is found with respect to some distance metric between each point and the center of the cluster it belongs to. 2. Hierarchical Clustering These methods start with each point being considered a cluster and recursively combine pairs of clusters (subsequently updating the inter-cluster distances) until all points are part of one hierarchically constructed cluster. 3. Graph Theoretic Methods These methods are partitioning methods that partition the space into sub-graphs with respect to some geometric properties. Keywords Gene Expression, ORF, Splicing. I. Introduction A. Data mining The largest part of process known as knowledge discovery, Data mining is the process of extracting information from large volumes of data. This is achieved through the identification and analysis of relationships and trends within commercial databases. Data mining is used in areas as diverse as space exploration and medical research. The non trivial extraction of implicit, previously unknown, and potentially useful information from data variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. B. Cluster Analysis Technique of data mining Cluster analyzes the data objects without consulting a class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity. Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to in other clusters. Each cluster that is formed can be viewed as a class of objects from which rules can be derived [9]. Fig. 1 below shows When a hierarchy of clusters like this is created the user can determine the right number of clusters that adequately summarizes the data while still providing useful 90 International Journal of Computer Science and Technology Fig. 1 : Hierarchy of Clusters. C. Bioinformatics Bioinformatics is the integration of computer Science, Mathematical and Statistical methods to manage and analyze biological information. It is a field of science in which Life Sciences and Computational Sciences merge into a single discipline. The Oxford English Dictionary defines bioinformatics as The science of collecting and analyzing complex biological data such as genetic codes The National Center for Biotechnology Information defines bioinformatics as the field of science in which biology, computer science and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: The development of new algorithms and statistics with which to assess relationships among members of large data sets. The analysis and interpretation of various types of data

including nucleotide and amino acid sequences, protein domains, and protein structures. The development and implementation of tools that enable efficient access and management of different types of information.with the hardware technology advancing, the cost of storing is decreasing. Thus there is an urgent need for new techniques and tools that can intelligently and automatically assist us in transferring this data into useful knowledge. Different techniques of data mining are developed which are helpful for handling these large size databases. II. Open Reading Frames Translation by ribosomes starts at translation-initiation sites on RNA copies of gene and proceeds until a stop codon is encountered. Just as three codons of the genetic code reserved as stop codons, one triplet codon is always used as a start codon.specifically, the codon AUG is used both to code for the amino acid methionine as well as to mark the precise spot along an RNA molecules where translation begins in both prokaryotes and eukaryotes. Accurate translation can only occur when ribosomes examine codons in the phase or reading frames that is established by a gene s start codon. Unless a mistake involving some multiple of three nucleotides occurs, alterations of a gene s reading frames change every amino acid coded downstream of the alteration, and such alteration typically result in the production of a truncated version of the protein due to ribosomes encountering a premature stop codon. Most gene code for proteins that are hundreds of amino acids long. Since stop codons occur in a randomly generated sequence at about every 20th triplet codon, one of the reading frames of the RNA copies of most genes has unusually long runs of Codons in which no stop codons occur. These strings of Codons uninterrupted by stop codons are known as open reading frames (ORFs) and are a distinguishing feature of many prokaryotic and eukaryotic genes. A. Introns and Exons Introns are a common eucharistic event. Several features of interrupted genes are: The sequence order is the same as in the mrna The structure of an interrupted gene is identical in all tissues. Introns of nuclear genes have termination codons in all three reading frames. Exon : RNA sequences in the primary transcript that are found in the mrna. Intron : RNA sequences between exons that are removed by B. Splicing Introns have been found in eucharistic mrna, trna and rrna, as well as chloroplast, mitochondrial and a phase of E. coli. Eubacteria are the only species in which introns have not been found. In general, genes that are related by evolution have related organizations with conservation of the position at least some introns. Furthermore, conservation of introns is also detected between genes in related species. The amount and size of introns varies greatly. The mammalian DHFR has 6 exons that total about 2000 bases, yet the gene is 31,000 bases. Likewise, the alphacollagen has 50 exons that range from 45-249 bases and the gene is about 40,000 bases. Clearly two genes of the same size can have different number of introns, and introns that vary in size. Some species will have an intron in a gene, but another species may not have an intron in the same gene. An example is the cytochrome IJCST Vol. 2, Issue 3, September 2011 oxidase subunit II gene of plant mitochondria where some plant species have an intron in this gene and others do not (Rastogi & Mendiratta, 2004). 1. Splicing of RNA Most introns start from the sequence GU and end with the sequence AG (in the 5' to 3' direction). They are referred to as the splice donor and splice acceptor site, respectively. However, the sequences at the two sites are not sufficient to signal the presence of an intron. Another important sequence is called the branch site located 20-50 bases upstream of the acceptor site. The consensus sequence of the branch site is "CU (A/G) A(C/U)", where A is conserved in all genes. Fig. 2 : The Consensus Sequence for Splicing. Pu = A or G; Py = C or U. III. Choice of Sequence Format There are four basic types of molecules involved in life: (1) small molecules, (2) proteins, (3) DNA and (4) RNA. Proteins, DNA and RNA are known collectively as biological macromolecules. DNA is the main information carrier molecule in a cell. DNA may be single or double stranded. A single stranded DNA molecule, also called a polynucleotide, is a chain of small molecules, called nucleotides. There are four different nucleotides grouped into two types, purines: adenosine and guanine and pyrimidines: cytosine and thymine. They are usually referred to as bases (in fact bases are the only distinguishing element between different nucleotides, and denoted by their initial letters, A, C, G and T. This sequence is stored in text files. There is very long sequence (Chain of different combinations of ATCG) stored in the file. The main interest of biotechnologist is to search the sequence for similarity with other sequence. The only difference in DNA and RNA is that in RNA Thymine (T) is replaced with Uracil (U). There are different formats that are used to organize this sequence data. Types of different data formats [5] are given below: Plain Text FASTA Format Genbank Genetic Computer Group Format (GCG) After studying the different DNA format, plain text was chosen for the present problem. The plain text format is comparatively simple than the other formats available. Moreover the other details regarding DNA are not necessary for the present research work. IV. Methodology The methodology for the proposed work will involve the use of Data Mining techniques to analyze the given database and retrieve the desired information. Data collection for eukaryotes. Finding rules for intron predictions. Creation of Open Reading Frame. Different mining techniques and suggesting appropriate approach from the available data. Implementation. Verification. International Journal of Computer Science and Technology 91

This project handles the data of DNA analysis module. One data is related to DNA sequence and other is related to the Protein sequence. These are discussed below: A. DNA to pre-mrna Conversion DNA is the main building block of a living organism. A cell acts as biological catalysts called as enzymes. Information stored in DNA is used to make more transient, single-stranded polynucleotide called RNA (ribonucleic acid). The process of making an RNA copy of a gene is called transcription and is accomplished through the enzymatic activity of a RNA polymerase. DNA sequence is accepted in the form of Plain Text format and then the extra spaces or gaps are trimmed from it. There is a one to one correspondence between the nucleotide used to make the RNA (G, A, U and C where U is an abbreviation for Uracil). In the process of DNA to RNA, T is replaced by U, which results in RNA sequence. B. Pre-RNA to mrna Conversion The messenger RNA (mrna) copies of prokaryotic genes correspond perfectly to the DNA sequence present in the organism s genome with the exception that the nucleotide uracil (U) is used in place of thymine (T). In case, translation by ribosomes almost always begins while RNA polymerases are still actively transcribing a prokaryotic gene. Eukaryotic RNA polymerases also use uracil (U) in place of thymine (T), but much more striking differences are commonly found between the mrna molecules by ribosomes and the nucleotide sequence of the eukaryotic gene. In eukaryotic genome the two-step of genes expression are physically separated by the nuclear membrane, with transcription occurring within the nucleus and translation occurring after mrna has been exported to the cytoplasm. Pre-RNA to mrna conversion is computed The RNA sequence is accepted in the form of plain text format. The first eukaryotic genomic sequence is obtained, which contain genes intervening sequence (introns) that interrupt the coding regions. RNA splicing is a process that removes the non-coding part (introns) and joins exons in a primary transcript. The different kind of introns have been found in eukaryotic cells through only one of that type, the one that conform to a GU-AG rule is predominantly associated with eukaryotic protein coding genes. The GU-AG rule, the first two nucleotides at the 5 end of the RNA sequence of all of these introns are in variably 5 -GU- 3 and the last two at the 3 end of the intron are always 5 -AG-3. Additional nucleotides associated with these 5 and 3 splice junction as well as an internal branch site [10][14]. In the present work, introns are removed from pre-mrna based on GU-AG rule. The final sequence after skipping the intron part is m-rna, which contains exons only. V Results and Discussion The present research work is on predicting Open reading Frame for eukaryotes using data mining technique. The following processes associated with two sub processes. DNA to pre-mrna conversion Pre-mRNA to mrna conversion The main focus in this research work was applying appropriate data mining algorithm on the consensus data gathered from various sources and applying the clusters created through that data mining algorithm on entered DNA sequence to know the percentage match of consensus sequence with the entered DNA sequence and then come to the conclusion based on threshold value entered by the user that whether entered DNA is it is a junk DNA and if not Open Reading Frames of the entered DNA sequence is calculated and shown, consensus sequence match 92 International Journal of Computer Science and Technology and consensus class is shown. The user inputs the DNA sequence. The extra spaces present in the input sequence are also trimmed. Local Alignment score is calculated with the sequences having greatest similarity calculated though the data mining algorithm applied on the consensus sequences.if score >=8 then score of entered DNA sequence is matched with the items of the cluster having similarity>=8 and consensus sequence having greatest score is found out otherwise best similarity with the second cluster is calculated. Then the percentage match of the entered DNA sequence with the consensus sequence is calculated and if that match is greater than or equal to the entered threshold value then various outputs are displayed. The clustering algorithm that works at back end is similar to the single link technique. With this serial algorithm, items are iteratively merged into existing clusters that are closest(in terms of similarity). In this algorithm a threshold, threshold, is used to determine if items will be added to existing clusters or if a new cluster is created. The basis for making clusters is local alignment between the sequences, as larger the score more the sequences will be similar. So the sequences having score greater than or equal to the threshold value are entered into one cluster and rest of the sequences having score less than the given threshold are entered into second cluster. The complexity of this algorithm actually depends on the number of items. For each loop, items must be compared to each item already in a cluster. This is n in the worst case. Thus, the time complexity is O(n2).Space requirement is assumed to be also O(n2).That is same as nearest neighbor algorithm. Proposed algorithm is changed form of nearest neighbor algorithm. Changes are based on seeing the characteristics of input data. The objective of target identification is to identify a biological molecule that is essential for the survival or proliferation of a particular disease-causing agent, or pathogen. Once the target is identified, the objective of drug design is to develop a molecule that will bind to and inhibit the drug target. An understanding of the structure and function of proteins is an important component of drug development because proteins are the most common drug targets. Various types of RNA sequences (rrna, trna etc) can also be generated for a given DNA sequence. Fig. 3: First Display Screen of Data Mining Model for ORF Detection

IJCST Vol. 2, Issue 3, September 2011 Technology, Moga, for his invaluable advice, guidance, motivation and patience throughout the research work., without his support, this thesis would not have been completed.i am highly indebted to Dr.H.S Gulati Principal, Lala LajPat Rai Institute of Engg. and Technology, Moga, for providing opportunity to carry out the present thesis work. Last, but not the least I wish to thank my parents, husband and friends, who have given me moral support and their relentless advice throughout the completion of this work.i lack words to express my cordial thanks to the members of Departmental Research Committee (DRC) for their useful comments and constructive suggestions during all the phases of the present study as well as critically going through the manuscript. Fig. 4 : After entering consensus sequence data. VI. Conclusion The Data Mining Model constructed for Open Reading Frame outputs the Open Reading Frame, Consensus sequence class and consensus sequence present in the entered DNA sequence. This model can further be useful for determining the gene expression and further protein sequence for the input DNA sequence. The provision is also there to input the consensus sequence directly which results to working of the data mining algorithm at backend and clusters created are used to retrieve the results at the relevant stages. All the possible Open Reading Frames, consensus sequence and consensus sequence class are shown at the output. The present work will assist the researchers to detect the relevant outputs at the different stages of Open reading Frame. This can further assist to predict the gene expression for the given DNA sequence which can further lead to predict the type or class of the gene under consideration. Following data mining techniques were utilized for retrieving the results. Association Cluster Analysis VIII. Future Scope Of Work Following improvements regarding the developed model of bioinformatics can be made: The model can be extended to implement Gene expression for eukaryotes. Predicting ORF is intermediate step for finding the final amino acid sequence; different protein structures (primary, secondary and tertiary) can be predicted. This can further help in drug design. The overall process of drug design can be broken into two major components: Discovery and testing. The testing process, which involves preclinical and clinical trails, is generally not subject to significant enhancement using computational methods. The discovery process, however, is labor intensive and expensive and has provided a fertile ground for components, including target identification, lead discovery and optimization, toxicology and pharmacokinetics. IX. Acknowledgment I wish to express my sincere gratitude to my thesis advisor, Mr. Rajbir Singh Cheema, Asst Proff. Deptt. of Information References [1] Anne de Jong, Sacha A. F. T. van Hijum, Jetta J. E. Bijlsma, Jan Kok and Oscar P. Kuipers BAGEL: a web-based bacteriocin genome mining tool, Nucleic Acids Research, vol. 34, 2006 pp. 273-279. [2] Alvis, B., Helen, P. Array Express a public repository for micro array gene expression data at the EBI, Nucleic Acids Research, vol. 31, 2003 pp. 68-71. [3] Brunak, S., Engelbrecht, J., Knudsen, S., prediction of human mrna donor and acceptor sites from the DNA sequence, Journal of Molecular Biology, vol. 220, 1991 pp. 49-65. [4] Burge, C., Karlin, S., Prediction of complete gene structures in human genomic DNA, Journal of Molecular Biology, vol. 268, 1997, pp. 78-94. [5] Burge, C., Modeling dependencies in pre-mrna splicing signals, Computational Methods in Molecular Biology, 1998, pp. 127 163. [6] Batzoglou, S., Mesirov, J.P., Human and mouse gene structure Comparative analysis and application to exon prediction, Genome Res., vol. 10, 2000, pp. 950 958. [7] Burge, C., Karlin, S., Prediction of complete gene structures in human genomic DNA, Journal of Molecular Biology, vol. 268, 1997, pp. 78-94. [8] Brunak, S., Engelbrecht, J., Knudsen, S., prediction of human mrna donor and acceptor sites from the DNA sequence, Journal of Molecular Biology, vol. 220, 1991, pp. 49-65. [9] Carle-Urioste, J. C., Benito, C. H., In vivo analysis of intron processing using Splicing dependent reporter gene assays, Plant Journal of Molecular Biology, vol. 26, 1994, pp. 1785-795. [10] Carle-Urioste, J. C., Walbot, V., A combinatorial role for exon, intron and splice site sequences in splicing in maize, Plant Journal of Molecular Biology, vol. 11, 1997, pp. 1253-1263. [11] Chan K.C.C., Wai-H. Au, Mining Fuzzy Association Rules in a Database Containing Relational and Transactional Data, Data Mining and Computational Intelligence, vol. 15, 2001, pp. 95-114. [12] Fan, Jianhua, Li, Deyi, Overview of data mining and knowledge discovery, Journal of Computer Science and Technology, vol. 13 (4), 1998, pp. 348-368. [13] Fayyad, U., Paul, Data mining and KDD: Promise and challenges, Generation Computer Systems, vol. 13 (2-3), 1997, pp. 99-115. [14] Jiang, D. Tang, C., Zhang, A., Cluster Analysis for Gene Expression Data, IEEE Transactions on knowledge and data engineering, vol. 11, 2004, pp. 1370-1386. [15] Jiang, D., Pei, J., An Interactive Approach to Mining Gene Expression Data, IEEE Transactions on knowledge and Data Engineering, vol. 17, 2005, pp. 1363-1378. International Journal of Computer Science and Technology 93

[16] John Besemer, Mark Borodovsky, GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses, NucleicAcids Research, vol 33, 2005, pp.451-454. [17] Kitamura, Sumie, Takahashi, N., Distribution Analysis of 5 Splice Site-Like Sequences in Human and Mouse premrnas, Genome Informatics vol. 12, 2001, pp. 415 416. [18] Li, J.J, Huang, De-S., Characterizing Human Gene Splice Sites Using Evolved Regular Expressions, Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005, 2005, pp. 493-498. [19] Malhi, M.S. et al., Development of Data Mining model for bioinformatics system, M.Tech Thesis PAU, Ludhiana, October 2003,, pp. 1-76. [20] Masaru, T., Nobuyoshi, S., Introns and Reading frames: Correlation between splicing sites and their codon positions, The Society for Molecular Biology and Evolution, vol. 13(9), 1996, pp. 1219-1223. [21] Nurtdinov, R.N., Gelfand, M.S., Low conservation of alternative splicing patterns in the human and mouse genomes, Human Molecular Genetic, vol. 12, 2003, pp. 1313 1320. [22] Pertea, M., Lin, X.Y., Salzberg, S.L, Gene splice a new computational method for splice site prediction, Nucleic Acids Research, vol. 29, 2001, pp. 1185-1190. [23] Raghavan, V.V., Server, H., Introduction to Data Mining, Journal of the American Society for Information Science, vol. 49 (5), 1998, pp. 397-402. [24] Segal, E., Yelensky, R., Genome-Wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression, Bioinformatics, vol. 19, 2003, pp. 273-282. [25] Yi Peng, Gang Kou, Yong Shi, Zhengxin Chen, A Systemic Framework for the Field of Data Mining and Knowledge Discovery, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW 06). 94 International Journal of Computer Science and Technology