BIOINFORMATICS ANALYSIS OF SNPS IN MICRO-RNA AND ITS PROCESSING MACHINERY GENES


CHAPTER 6

6.1. INTRODUCTION

MicroRNAs (miRNAs) are encoded by post-transcriptional regulatory genes and are responsible for silencing mRNAs. They are a group of non-coding RNAs usually located in the intergenic non-coding regions of protein-coding genes (25%) or in the introns or exons of non-coding RNA genes (75%). They show ubiquitous expression in all cell types. Victor Ambros, Rhonda Feinbaum and Rosalind Lee discovered the first microRNA, lin-4, in the nematode C. elegans. lin-4 was reported to be a gene whose mutants showed aberrant cell lineages during development (Bartel 2004). Because it lacks an open reading frame (ORF), it could not generate a protein. Instead, the gene products were short RNA transcripts, called the precursor miRNA and the mature miRNA, of lengths 61 and 22 nucleotides, respectively (Moss and Tang 2003). The lin-4 miRNA (22 nt) repressed the translation of the lin-14 gene by hybridizing to complementary sites in the 3' UTR of the lin-14 mRNA. miRNA genes are transcribed from the genome as primary transcripts called pri-miRNAs, which have a stem-loop structure and act as the trigger for the miRNA silencing pathway.

MicroRNAs are functionally important endogenous molecules that play an important role in the regulation of biological processes, such as:
- Development
- Cell proliferation
- Cell differentiation
- Apoptosis
- Transposon silencing
- Antiviral defence

- DNA repair

Most mammalian miRNAs do not appear to be primarily involved at the upper levels of the gene regulatory cascades; instead, they appear to operate at many levels to regulate the expression of a diverse set of genes, many of which do not go on to directly influence the expression of other genes (Lewis et al 2003).

miRNAs and siRNAs share a central biogenesis and can perform interchangeable biochemical functions. Hence, these two classes of silencing RNAs cannot be distinguished by either their chemical composition or their mechanism of action. Nonetheless, important distinctions can be made, particularly based upon their origin, evolutionary conservation and the genes they target. First, miRNAs are derived from genomic loci distinct from other genes, whereas siRNAs often derive from transposons, viruses, mRNAs or heterochromatic DNA. Second, miRNAs are processed from transcripts that can form local RNA hairpin structures, whereas siRNAs are processed from long bimolecular RNA duplexes or extended hairpins. Third, a single miRNA-miRNA* duplex is generated from each miRNA precursor molecule, whereas a multitude of siRNA duplexes are generated from each siRNA precursor molecule, leading to many siRNAs accumulating from both strands of this extended double-stranded RNA. Fourth, miRNA sequences are nearly always conserved in related organisms, unlike siRNAs, which are seldom conserved. Fifth, and most strikingly, endogenous siRNAs typically specify auto-silencing, i.e. they specify the silencing of the same locus from which they originate, whereas miRNAs specify hetero-silencing in that they are produced from genes which silence very different genes.

The fifth distinction explains the greater sequence conservation seen for miRNAs. To the extent that siRNAs come from the same loci that they target, a mutational event that changes the sequence of the siRNA would also change the sequence of its regulatory target, and the siRNA regulation would be preserved. In contrast, a mutation in a miRNA is rarely accompanied by simultaneous compensatory changes at the loci of its targets, and thus selection pressure would preserve the miRNA sequence (Dykxhoorn et al 2003).

Single nucleotide polymorphisms

Identification of genetic variants underlying complex traits is a major goal in genetic studies. Herein, it is critical to focus on the genetic variants that are most likely to exert functional impacts (Ye et al 2001). A single-nucleotide polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, differ in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles.

Within a population, SNPs can be assigned a minor allele frequency (MAF): the lowest allele frequency at a locus that is observed in a particular population. For single-nucleotide polymorphisms this is simply the lesser of the two allele frequencies. There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another.

SNPs are important with regard to microRNA regulation, as single nucleotide changes occurring naturally in putative target sites are candidates for functional variation that may be of interest for biomedical applications and evolutionary studies (Saunders et al 2007). Beyond these functional aspects, SNPs are also associated with the alteration and modulation of primary miRNA processing. A large-scale analysis of the occurrence of SNPs in specific functional and non-functional regions of the microRNA has been performed in this study. Results have been obtained for the presently known miRNAs in Homo sapiens to establish the implications of the SNPs for structural stability, target binding and post-transcriptional regulation.

After obtaining the SNPs from the reference databases, we attempted to distinguish these SNPs based upon their locations within the five domains of the miRNA. Prime importance has been attributed to the SNPs located in the seed region of the miRNAs. Potential targets for these miRNAs have been predicted using the tools RNAhybrid and miRanda.
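The SNP and minor-allele-frequency definitions above can be made concrete with a minimal Python sketch. The function names and the toy genotype sample are illustrative assumptions; only the two fragments AAGCCTA and AAGCTTA come from the text.

```python
from collections import Counter


def snp_positions(seq_a, seq_b):
    """Return (0-based index, allele_a, allele_b) for each single-base difference."""
    if len(seq_a) != len(seq_b):
        raise ValueError("fragments must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def minor_allele_frequency(alleles):
    """MAF of a biallelic SNP: the lesser of the two allele frequencies."""
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("this sketch only handles biallelic SNPs")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


# The two fragments from the text differ at one position (C versus T).
differences = snp_positions("AAGCCTA", "AAGCTTA")

# Hypothetical sample of eight chromosomes observed at that locus.
sample = ["C", "C", "C", "T", "C", "C", "T", "C"]
maf = minor_allele_frequency(sample)
```

Because the MAF is computed per sample, the same SNP would give different values for samples drawn from different populations, which is the point made in the paragraph above.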

MicroRNA target prediction

RNAhybrid is a target prediction tool for finding the minimum free energy hybridisation of a long and a short RNA. The hybridisation is performed in a kind of domain mode, i.e. the short sequence is hybridised to the best-fitting part of the long one. The algorithmic core of RNAhybrid is a variation of classic RNA secondary structure prediction. Instead of folding a single sequence back onto itself in the energetically most favourable fashion, RNAhybrid determines the most favourable hybridization site between two sequences. Though in principle these two sequences can be arbitrarily long, for microRNA target prediction the target candidate will be rather long (hundreds to thousands of nucleotides) and the miRNA will be between 19 and 24 nt. Since microRNA/target interactions have not been reported to contain bifurcations (also called multi-loops), these are not considered by RNAhybrid, which considerably increases the speed of the algorithm. RNAhybrid does not use any RNA folding or pairwise sequence alignment code, but implements an algorithm that was specifically designed for RNA hybridization (Kruger and Rehmsmeier 2006).

miRanda is one of the earliest large-scale target prediction algorithms developed for vertebrates. The standard version of miRanda selects target genes based on three properties: sequence complementarity using a position-weighted local alignment algorithm, free energies of RNA-RNA duplexes using the Vienna RNA fold package, and conservation of targets in related genomes (John et al 2004). These features are weighted in decreasing order. Targets binding to the miRNA may fall into the category of True, Shuffled, False or Coding. The relative binding positions of the different categories of targets, as observed for miRanda and RNAhybrid, are shown in Fig 6.1.

miRanda has the highest specificity when tested on shuffled and coding sequences; RNAhybrid has the highest specificity when tested on the validated false target set. Moreover, miRanda and TargetScanS show similar patterns on different data sets: their specificity on the coding set drops by around 10% and 40-50% when compared to that on the shuffled and false ori sets, respectively. RNAhybrid, however, did not follow this pattern.

Fig. 6.1: Relative binding position for miRanda and RNAhybrid (Zhang and Verbeek 2010).

A possible reason for this is that miRanda and TargetScanS are sequence-based algorithms which respond similarly to different types of sequences, whereas RNAhybrid is energy-based. In general, all three exhibit relatively low specificity and/or sensitivity, indicating that their prediction accuracy cannot yet be considered satisfactory (Kruger and Rehmsmeier 2006).

The net change in the secondary structural stability of the miRNAs due to the occurrence of variation within them has been evaluated using Mfold, a tool for nucleic acid folding and hybridization prediction based upon thermodynamic methods. The core algorithm predicts a minimum free energy, ΔG, as well as minimum free energies for foldings that must contain any particular base pair. Any base pair, ri-rj, between the ith nucleotide and the jth nucleotide that is contained in a folding no more than ΔΔG from the minimum is plotted in a triangular plot called the energy dot plot. The base pair ri-rj is plotted in row i and column j of this matrix (Mathews et al 1999). The free energy increment, ΔΔG, is chosen a priori by the user, who selects a percent suboptimality, P. From this, ΔΔG is computed to be (P/100)·|ΔG|. Base pairs within this free energy increment are chosen either automatically or by the user, and foldings that contain the chosen base pair are computed. They have minimum free energy conditional on containing the chosen base pair (Zuker 2003).
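As a worked illustration of the suboptimality increment described above, the relation between the percent suboptimality P and the free energy increment can be written out as follows. The numerical values (P = 5, ΔG = -40 kcal/mol) are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
% Free energy increment from the percent suboptimality P
\[
  \Delta\Delta G = \frac{P}{100}\,\lvert \Delta G \rvert
\]
% Hypothetical example: P = 5 and \Delta G = -40 kcal/mol
\[
  \Delta\Delta G = \frac{5}{100} \times 40\ \text{kcal/mol} = 2\ \text{kcal/mol}
\]
% Foldings are then considered if their free energy E satisfies
\[
  E \le \Delta G + \Delta\Delta G = -40 + 2 = -38\ \text{kcal/mol}
\]
```

Under these assumed values, a base pair appears in the energy dot plot only if some folding containing it has a free energy of -38 kcal/mol or lower.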

6.2. MATERIALS AND METHODS

6.2.1. In silico analysis of SNPs in microRNA

Identification of SNPs in human microRNAs

The genomic co-ordinates (hg19; National Center for Biotechnology Information build 37.1) of all the available human pre-miRNAs (n=721) were taken from the miRBase database (Release 14) (Griffiths-Jones et al 2008). The miRBase database is subdivided into three parts, the miRBase Sequence, Registry and Targets databases, the details of which are given below:
1. The miRBase Sequence Database is a searchable database of published miRNA sequences with annotation. The data were previously provided by the miRNA Registry.
2. The miRBase Registry continues to provide gene hunters with unique names for novel miRNA genes prior to publication of results.
3. The miRBase Targets database is a new resource of predicted miRNA targets in animals.

Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript (termed mir in the database), with information on the location and sequence of the mature miRNA sequence (termed miR). Both hairpin and mature sequences are available for searching using BLAST (Altschul et al 1990) and SSEARCH (Smith-Waterman search algorithm) (Pearson 1990), and entries can also be retrieved by name, keyword, references and annotation. All sequence and annotation data are also available for download.

The SNPs within the miRNA genes were identified (dbSNP build 130) (Sherry et al 2001) using the application programming interface tool BioMart (Kasprzyk 2011) in the Ensembl database. SNPs in the flanking regions around each gene (upstream and downstream) were identified by querying the database in a similar manner. Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation of eukaryotic genomes.

Secondary structural analysis of microRNA using Mfold

The secondary structural analysis of the 180 microRNAs containing SNPs was performed by means of the Mfold web server (Zuker 2003). This analysis gave a structural insight into the effects of variations on the secondary structure of microRNAs. These variations may eventually affect structural stability and target binding, or may not have any specific consequences at all. The analysis is based purely upon the thermodynamic aspects of structural stability. The Gibbs free energy (ΔG) for each wild-type and mutant microRNA has been calculated. Various parameters are available, such as the temperature (37 °C), ionic conditions (1 M NaCl, no divalent ions), percent suboptimality (which controls the free energy increase), the upper bound on the number of foldings allowed (50), the window parameter and the maximum distance between paired bases. Default parameters have been used for miRNA structural prediction.

Domain classification of pre-miRNA

The SNPs within the gene were classified into five well-defined regions or domains of the microRNA (Fig. 6.2). MicroRNAs act as adaptors that employ a silencing complex to target mRNAs by selective base-pairing, primarily in the 3' UTR. Target interaction does not require perfect complementarity between microRNA and mRNA sequences, although near-perfect base-pairing in a small region at the 5' end (positions 2-7) of the microRNA (sometimes termed the "seed") appears to be one of the key determinants of target recognition. The SNPs located within the seed region of the microRNA were specifically used for target prediction.
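To illustrate how a seed-region SNP changes the sequence that is used for target prediction, the following Python sketch extracts the seed (positions 2-7, as stated above) and substitutes a variant allele. The mature sequence, the SNP position and the helper names are hypothetical and are not taken from the study.

```python
def seed(mature, start=2, end=7):
    """Return the seed of a mature miRNA (1-based, inclusive positions)."""
    return mature[start - 1:end]


def in_seed(position, start=2, end=7):
    """True if a 1-based position within the mature miRNA falls in the seed."""
    return start <= position <= end


def apply_snp(mature, position, ref, alt):
    """Return the mature sequence carrying the alternative allele."""
    if mature[position - 1] != ref:
        raise ValueError("reference allele does not match the sequence")
    return mature[:position - 1] + alt + mature[position:]


# Hypothetical 22-nt mature miRNA and a hypothetical C>U SNP at position 4.
wild_type = "AUGCCGUAACGGUUCAGGCAUC"
variant = apply_snp(wild_type, position=4, ref="C", alt="U")

wild_seed = seed(wild_type)
variant_seed = seed(variant)
```

The wild-type and variant sequences would then each be written to a FASTA file and submitted separately to the target prediction tools. Passing end=8 gives the 2-8 seed definition that is also used in the literature.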

Fig. 6.2: Five domain regions of microRNA

The 3' UTR sequences of the entire human genome have been retrieved from the UCSC Genome Browser. The UCSC Genome Browser is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the Center for Biomolecular Science and Engineering (CBSE) at the University of California Santa Cruz (UCSC).

MicroRNA target prediction

Two tools, namely RNAhybrid and miRanda, have been used in standalone mode to predict the targets of each microRNA having an SNP within its seed region, as these variations are crucial factors affecting miRNA-mRNA binding.

miRanda

miRanda computes optimal sequence complementarity between a set of mature microRNAs and a given mRNA using a weighted dynamic programming algorithm. The key extension to the Smith-Waterman algorithm is that the alignment score is a weighted sum of match and mismatch scores for base pairs (including G:U wobbles) and gap penalties. Weights are position-dependent and reflect the relative importance of the 5' and 3' regions in a finely adjustable way. The weight of each position can be optimized to reflect

experimental facts and physical principles. In addition, miRanda uses an estimate of the free energy of formation of the microRNA:mRNA duplex as a secondary filter. A natural consequence of the weighted alignment optimization is the inclusion of potential targets with some mismatches at the dominant 5' end of the microRNA, but with otherwise good complementarity to the target gene. To reflect the importance of the 5' region, base-pairing in positions 2-8 of the microRNA is given a higher weight when computing the microRNA:mRNA alignment score. This approach, as designed into the original miRanda algorithm, is congruent with experimentally validated targets that do not contain perfect seed matches and includes what other approaches have subsequently introduced as 3'-compensatory matches or combinations of seed and 3' match rules (Betel et al 2008).

The miRNA sequence given as the query is searched against the file containing the human 3' UTR sequences using the default parameters:

    miranda mir15.fasta 3_UTR.fasta

The number of targets without the SNP and after the incorporation of the SNP in the seed region has been analyzed, and the total number of hits and target sites has been listed in both cases. The results have been analyzed through a program written in the Perl scripting language (the program calculates the total number of genes giving hits, the number of genes not giving hits and the number of target sites within the genes giving hits).

RNAhybrid

RNAhybrid performs a thorough statistical analysis of the MFEs (minimum free energies). It normalizes the MFEs by the sequence lengths of the miRNAs and targets, and models such normalized MFEs as extreme value distributed. The parameters of these distributions are estimated specifically for every miRNA with a second program, RNAcalibrate, and are subsequently used to assign p-values to normalized MFEs. The significance of multiple binding sites in a single target is evaluated with a Poisson

statistics. For comparative studies on multiple organisms, such as different Drosophila species, it combines Poisson p-values from the orthologous targets using the effective number of sequences. This effective number respects the fact that related sequences cannot always be treated as statistically independent. Calculation of these effective numbers is miRNA- and target-specific and is accomplished by a third program, RNAeffective.

In this study, we have used the program RNAhybrid with the human 3' UTR file as the reference file. The miRNA containing the SNP within the seed region is given as the query, and the 3' UTRs of the entire human genome are given as the target. The command used for searching the query against the targets is:

    RNAhybrid -q <miRNA FASTA file> -t 3UTR_human > result

Here:
-q is the name of the file containing the FASTA sequence of the miRNA
-t is the name of the file containing the 3' UTR sequences of the human genome
> result writes the results into a file named result

The targets with the lowest p-value and minimum free energy have been considered the best hits. In this way the binding efficacy of the miRNA before and after the occurrence of variation has been assessed, and the SNPs showing a drastic effect on target binding have been highlighted.

The validated targets have been retrieved from the TarBase database, and the unknown chromosomal positions of the mRNA binding regions have been established. The NM reference numbers for these known targets were then obtained from the National Center for Biotechnology Information (NCBI) in order to identify the target genes, and their sequences were retrieved from the UCSC Table Browser for the respective miRNAs targeting them. As the targets for these microRNAs are already known, a specific program has been written to obtain the chromosomal positions which have not yet been reported.
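The selection of best hits described above (lowest p-value, then lowest minimum free energy) can be sketched in Python as follows. This is an illustrative sketch rather than the program used in the study: the record layout, gene names and values are hypothetical, and parsing of the RNAhybrid output file is deliberately left out.

```python
def top_hits(records, n=5):
    """Rank predicted targets by p-value, breaking ties by lower MFE."""
    return sorted(records, key=lambda r: (r["p_value"], r["mfe"]))[:n]


# Hypothetical predictions for one miRNA (MFE in kcal/mol).
predictions = [
    {"gene": "GENE_A", "mfe": -30.1, "p_value": 0.0010},
    {"gene": "GENE_B", "mfe": -28.0, "p_value": 0.0005},
    {"gene": "GENE_C", "mfe": -33.4, "p_value": 0.0010},
    {"gene": "GENE_D", "mfe": -25.2, "p_value": 0.0200},
]

best = [r["gene"] for r in top_hits(predictions, n=3)]
```

Running the same ranking on the wild-type and the SNP-containing miRNA gives two top-5 lists that can be compared gene by gene, as is done in Table 6.1.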

6.2.2. In silico analysis of SNPs in microRNA processing machinery genes

Data mining of SNPs

The various proteins involved in the microRNA biogenesis pathway were searched for in the literature database PubMed (NCBI). From 22 different scientific papers, 38 proteins, their domains and their roles in the biogenesis process were identified. Through extensive mining of the databases of the International HapMap Project and dbSNP, and using the Perl-interface program BioMart, the SNPs in the 38 genes of interest were retrieved and mapped to six different domains of the gene: (i) promoter region, (ii) 5' UTR, (iii) exon, (iv) intron, (v) 3' UTR, and (vi) 3' near-gene region. The SNP density in each of the six domains was calculated.

Sequence analysis

SNPs in the functionally important domains (promoter, exon and 3' UTR) of the microRNA biogenesis genes were analyzed using in silico tools.

SIFT

The exonic SNPs were categorized into non-synonymous and synonymous changes. The non-synonymous SNPs were submitted to a sequence homology-based tool called SIFT (Sorts Intolerant From Tolerant), which predicts whether an amino acid substitution in a protein will have a phenotypic effect. More details of SIFT are given elsewhere in this thesis.

MicroRNA target prediction

The potential target sites for the microRNAs in the 3' UTRs of the mRNAs were predicted using the stand-alone tool miRanda (version 3.3a). The 3' UTR sequences were retrieved from the NCBI nucleotide database, and human mature microRNA sequences were obtained from miRBase. The effect of polymorphisms in microRNA target sites was predicted using miRanda.

Transcription factor prediction

Flanking sequences of 100 bp for each SNP were obtained from dbSNP. The SNPs with a MAF of >1% were subjected to transcription factor analysis using Match, TRANSFAC (BIOBASE).

Structural analysis

The functional role of the nsSNPs was further analyzed at the level of protein structure.

Homology modelling

The sequences of all 38 microRNA biogenesis-related proteins were retrieved from UniProt (protein sequence database). Templates were searched for those proteins whose crystal structures have not been solved completely or are not available in the Protein Data Bank (PDB). The BLASTP program was used to search for protein templates against the PDB. The PHYRE2 and I-TASSER protein modelling servers were used to model the proteins because the BLAST search did not find sufficiently suitable templates. Results from both servers were subjected to structural validation, and the best validated structures were taken for further analysis.

Protein structure validation

The predicted protein structures were energy minimized using the Schrödinger Protein Preparation Wizard. To validate each modelled structure, a Ramachandran plot was drawn for the energy-minimized structure using PROCHECK.

Stability analysis

The PoPMuSiC program is a tool for evaluating the changes in stability of a given protein or peptide under single-site mutations, on the basis of the protein's structure. Three options are available:
- Single: PoPMuSiC predicts the stability change resulting from a given mutation.

- Systematic: PoPMuSiC evaluates the stability changes resulting from all possible mutations and returns a report containing a list of the most stabilizing or destabilizing mutations, or of the mutations that do not affect stability.
- File: PoPMuSiC predicts the stability changes resulting from a list of mutations specified by the user in an uploaded file.

In this study, we used the File option, where we uploaded the PDB structure of the protein to be analyzed and a text file with all the mutations of that protein. Results were analyzed using the predicted ΔΔG (in kcal/mol) of each mutation.

RESULTS AND DISCUSSION

General observations

721 microRNAs in Homo sapiens have been listed to date in miRBase. The number of microRNAs on each chromosome is shown in Fig 6.3. The largest numbers of microRNAs are found on Chr 19 (81) and Chr X (82). The smallest number of microRNAs is found on chromosome 21, with only 5 miRNAs. Three mitochondrial miRNAs have been predicted; however, these have not yet been validated, and in the current version these three miRNAs were removed from the database. Chr Y is devoid of any microRNA, for a reason yet to be identified.

The microRNAs were classified based upon their location within exonic, intronic and intergenic regions. The bar diagram (Fig. 6.4) shows the number of intronic miRNAs (mirtrons) to be 344, the number of microRNAs in intergenic regions to be 337, and the number within exons to be the least, only 56. This reinforces the fact that the majority of microRNAs are present in intronic and intergenic regions, as stated before.

Fig. 6.3: MicroRNA count within each chromosome

Fig. 6.4: Number of miRNAs in intronic, intergenic and exonic regions

Fig. 6.5: SNP count in validated human pre-miRNAs and flanking regions

Single nucleotide polymorphism in microRNA

SNPs in human microRNA genes were identified by querying the Single Nucleotide Polymorphism database (dbSNP) at the genomic co-ordinates of the 721 genes. A total of 257 SNPs (including in-del polymorphisms) were obtained within 180 miRNAs. For the purpose of comparison, dbSNP was also queried for the flanking regions around the pre-miRNAs. As the flanking regions around the pre-miRNAs are mostly intergenic in origin, these showed a higher density of SNPs, both in the upstream flanking region and in the downstream region (1250 SNPs) of the miRNA genes. Fig 6.5 shows the SNP count within and around the miRNA genes.

The pre-miRNA is composed of different domains with different functional significance. To gain insight into the potential functional importance of the identified polymorphisms, the SNPs were mapped to five different domains of the pre-miRNAs: (i) the seed region, (ii) the mature region excluding the seed region (MIR-seed), (iii) the stem region complementary to the MIR (MIR*), (iv) the stem region that is neither the MIR nor

MIR*, and (v) the loop region (Fig 6.2). The 14 SNPs identified within the seed region were given prime importance in the present study. Based upon their chromosomal positions, 22 SNPs were located in the loop region, 43 in the complementary (MIR*) domain, 121 within the stem and 55 within the MIR-seed domain. The bar diagram (Fig 6.6) displays the SNP distribution between the five domains of the miRNAs. The percentage of SNPs within each domain was also calculated and is shown in Fig 6.7.

Fig. 6.6: SNP distribution between the 5 domains of miRNA

Fig. 6.7: Percentage of SNPs within each domain
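The mapping of SNPs to the five pre-miRNA domains and the per-domain percentages can be sketched as follows. This is a minimal Python illustration, not the procedure used in the study: the hairpin coordinates are hypothetical, and the percentages are taken relative to the sum of the five counts listed above, which is an assumption about how the figure was derived.

```python
# Order matters: the seed lies inside the mature region, so it is tested first.
DOMAIN_ORDER = ["seed", "MIR-seed", "MIR*", "stem", "loop"]


def classify_snp(position, domains):
    """Return the domain containing a 1-based hairpin position, or None."""
    for name in DOMAIN_ORDER:
        for start, end in domains.get(name, []):
            if start <= position <= end:
                return name
    return None


def domain_percentages(counts):
    """Percentage of SNPs per domain, relative to the sum of the given counts."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical layout of one 80-nt hairpin (1-based, inclusive intervals).
example_domains = {
    "seed": [(11, 16)],
    "MIR-seed": [(10, 10), (17, 31)],
    "MIR*": [(50, 71)],
    "stem": [(1, 9), (32, 35), (46, 49), (72, 80)],
    "loop": [(36, 45)],
}

# Domain counts as listed in the text.
reported_counts = {"seed": 14, "MIR-seed": 55, "MIR*": 43, "stem": 121, "loop": 22}
percentages = domain_percentages(reported_counts)
```

With real data, the intervals for each hairpin would be derived from the miRBase hairpin and mature-sequence coordinates, and each SNP position would first be converted from chromosomal to hairpin coordinates.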

Overall, ~90% of human pre-miRNAs have no reported polymorphisms, and most observed polymorphisms were not present within the seed region (Fig 6.6, 6.7). This suggests a strong selective pressure on human pre-miRNAs.

Stability analysis of microRNA using Mfold

We recorded which SNPs brought about a favourable change in the secondary structural stability of the miRNA genes within which they occur. The secondary structures (Appendix 6.1) show the miRNAs within which the polymorphisms occurred, along with the free energy (ΔG) before and after the occurrence of variation, with the position of the SNP underlined. The mature region is shown in red and the polymorphism position in green. 41 SNPs showed a favourable decrease in the ΔG of the miRNA structure and hence seem to stabilize these miRNAs. 30 SNPs showed no net change in ΔG and did not affect the stability of the miRNA in any manner. The remaining 196 SNPs, which showed a considerable difference in ΔG, may be responsible for a drastic change in the target binding specificity of these miRNAs, as the changed conformation of the structure may attract a different set of targets (Appendix 6.1).

miRNA target prediction of seed domain SNPs

RNAhybrid

The 14 microRNAs containing SNPs in the seed region were evaluated using the standalone version of RNAhybrid. The top 5 targets with the lowest P-values and lowest energies are reported for the miRNA genes before and after polymorphism (the latter indicated by SNP in brackets in Table 6.1). Tables for 5 miRNAs are shown as examples. As is evident from Table 6.1, the top 5 targets binding to the miRNAs mir-499 and mir-513 differ in the presence and absence of the polymorphism. In the case of mir-124 and mir-219, only 1 target, i.e. C11orf74 and CCDC124, respectively, was retained in the presence of the SNP. These, however, bind with lower energies and lower P-values

after polymorphism. This indicates that these SNPs are favourable for the hybridization of the miRNA to the target in terms of structural stability.

Table 6.1: miRNA targets before and after polymorphism (columns: target gene, MFE, P-value)

miRNA           | Target genes (top 5)
mir-124         | NLRX, C11orf, ATPIF, ZNF, KRTAP
mir-124 (SNP)   | C11orf, GMPS, DOM3Z, C4orf, MATN
mir-125a        | DUX, LOC, LOC, LOC, LOC
mir-125a (SNP)  | RETN, SHH, DUX, LOC, LOC
mir-219         | SPATC, SLC1A, MAN2B, FKSG, CCDC
mir-219 (SNP)   | CCDC, C2orf, COX6A, DUX, LOC
mir-499         | EHMT, C9orf, CCS, ATP6AP1L, TAAR
mir-499 (SNP)   | LOXHD, GALR, WDR, FAM128B, WNT9A

mir-513         | RTL, CRYBA, PGAM, C4orf, C1orf
mir-513 (SNP)   | TUBGCP, RNF, TCEB, GABRQ, C9orf

*MFE values are in kcal/mol.

miRanda

An overall conclusion was drawn from the results obtained using the miRanda algorithm for target prediction. A graph was generated for each of the 14 miRNAs containing SNPs in the seed region, in order to identify the difference in the number of targets binding before and after variation. The results are as follows:

Fig. 6.8: hsa-mir-124 target prediction. Number of mRNAs not binding to the miRNA (No Hits), number of mRNAs binding (Hits) and number of target sites within the binding mRNAs (Target Sites), for mir-124 and its SNP variant.
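The three quantities plotted in these graphs (No Hits, Hits and Target Sites) can be tallied as in the Python sketch below. The study used a Perl program for this step; this is only an illustrative equivalent, and the input layout (a set of gene identifiers plus one record per predicted site) and all names and values are assumptions.

```python
def summarize_hits(all_genes, site_records):
    """Count genes with hits, genes without hits, and total target sites.

    all_genes    -- set of gene identifiers in the 3' UTR file
    site_records -- iterable of (gene, site_start) pairs, one per predicted site
    """
    sites = [(gene, start) for gene, start in site_records if gene in all_genes]
    genes_with_hits = {gene for gene, _ in sites}
    return {
        "hits": len(genes_with_hits),
        "no_hits": len(all_genes) - len(genes_with_hits),
        "target_sites": len(sites),
    }


# Hypothetical toy data: four 3' UTRs, predictions before and after the SNP.
genes = {"G1", "G2", "G3", "G4"}
wild_sites = [("G1", 10), ("G1", 250), ("G2", 40)]
snp_sites = [("G2", 40)]

wild_summary = summarize_hits(genes, wild_sites)
snp_summary = summarize_hits(genes, snp_sites)
hit_difference = wild_summary["hits"] - snp_summary["hits"]
```

Comparing the two summaries gives the before-and-after differences in hits and target sites that are discussed for each miRNA.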

The bar diagram (Fig. 6.8) indicates that the number of mRNA genes binding to hsa-mir-124 before variation is greater than the number binding after variation (2151), a clear difference of 4068 genes giving hits between the two cases. There was also a clear drop in the number of target sites after variation. Thus this SNP showed a drastic effect on the target binding of the miRNA.

The bar diagram (Fig. 6.9) shows a very small difference between the number of hits before and after variation. Strikingly, however, the number of hits and target sites binding to hsa-mir-125a before mutation (16,802) was less than the number after the occurrence of the SNP in the seed region (17,543).

Fig. 6.9: hsa-mir-125a target prediction (bars: No Hits, Hits, Target Sites; series: mir-125a, snp-125a)

Fig. 6.10: hsa-mir-499 target analysis (bars: No Hits, Hits, Target Sites; series: mir-499, SNP)

The bar diagram (Fig 6.10) for hsa-mir-499 target analysis also shows a similar effect of the SNP on mRNA binding to the miRNA. The number of genes retrieved as hits before SNP occurrence was 7672, which increased to 8514 after the incorporation of the SNP in the miRNA. hsa-mir-219 displayed a similar trend: as seen in Fig 6.11, the number of target binding sites and the number of hits showed a slight increase after the incorporation of the SNP within the microRNA. These are therefore examples of single nucleotide polymorphisms which are favourable for target binding.

Fig. 6.11: Target analysis for hsa-mir-219 (bars: No Hits, Hits, Target Sites; series: mir-219, snp-219)

In the following figures (Fig 6.12, 6.13, 6.14, 6.15), hsa-mir-513, hsa-mir-518d, hsa-mir-585 and hsa-mir-627 showed the same pattern of reduction in the number of targets binding to them after the single nucleotide polymorphism in the seed region. This highlights the importance of the seed region (positions 2-8) for the hybridizing potential and specificity of mRNA silencing via the 3' UTR in human beings. While mir-513 and mir-518d showed only a small difference in the number of hits, mir-627 and mir-585 showed a considerable reduction in target binding.

Fig. 6.12: Target analysis for hsa-mir-513 (bars: No Hits, Hits, Target Sites; series: mir-513, SNP)

Fig. 6.13: Target analysis for hsa-mir-518d (bars: No Hits, Hits, Target Sites; series: mir-518d, snp-518d)

Fig. 6.14: Target analysis for hsa-mir-585 (bars: No Hits, Hits, Target Sites; series: mir-585, SNP)

Fig. 6.15: Target analysis for hsa-mir-627 (bars: No Hits, Hits, Target Sites; series: mir-627, SNP)

Fig. 6.16: Target analysis for hsa-mir-662 (bars: No Hits, Hits, Target Sites; series: mir-662, SNP)

hsa-mir-662 (Fig 6.16) showed an interesting result: there was no net change in the number of targets binding to this miRNA, and hence in the number of hits, even after the occurrence of the SNP in the seed region. hsa-mir-1268 (Fig 6.19) likewise showed no net effect of the mutation on the number of targets, but differed in the number of target sites.

Fig. 6.17: Target analysis for hsa-mir-941 (bars: No Hits, Hits, Target Sites; series: mir-941, snp-941)

Fig. 6.18: Target analysis for hsa-mir-1236 (bars: No Hits, Hits, Target Sites; series: mir-1236, SNP)

Fig. 6.19: Target analysis for hsa-mir-1268 (bars: No Hits, Hits, Target Sites; series: mir-1268, snp-1268)

Fig. 6.20: Target analysis for hsa-mir-1276 (bars: No Hits, Hits, Target Sites; series: mir-1276, SNP)

Fig. 6.21: Target analysis for hsa-mir-1302 (bars: No Hits, Hits, Target Sites; series: mir-1302, SNP)

SNPs in validated target sites

The number of SNPs in the target sites was much lower than the number of SNPs in the seed region of the miRNAs. A total of only 7 SNPs were obtained within 139 target sites, i.e. 843 bases, thus showing the low rate of polymorphisms within the messenger RNAs

28 (Fig. 6.22). Thus we may suggest the importance of polymorphism in mirna seed region in the target silencing mechanism, as compared to the target site polymorphisms. Target site SNPs Target Sites SNP count 5% 95% Fig. 6.22: SNPs in validated target sites SNPs in microrna biogenesis proteins Literature search Till date, the analytical study of approximately ten biogenesis proteins has been carried out. Through literature survey, we found 38 proteins were involved directly or indirectly in the microrna biogenesis process. For most of the protein we were able to identify the domains (Table 6.2) and role of proteins in microrna biogenesis through literature. 116 P a g e

Table 6.2 MicroRNA biogenesis proteins, their domains and function.

Protein | Domains | Function
Dicer | RNase III, DEAD, PAZ, dsRBD | miRNA precursor processing
AGO3 | PAZ, PIWI | Short RNA binding
AGO4 | PAZ, PIWI | Short RNA binding
Gemin3 | DEAD | RNA helicase
Drosha | RNase III | Processing of primary miRNA transcript
Exportin-5 | NA | Nuclear export of miRNA precursors
Gemin4 | NA | Not investigated
TGF-β, SMADs | R-Smad, Co-Smad | Induces SMAD signaling to the mir-21 precursor and enhances its efficient processing by Drosha
TRBP | NA | Stabilizes Dicer
Importin-8 | NA | Required for cytoplasmic miRNA-guided gene silencing
ELAV1 | AU-rich element | Inhibits mir-122 repression of target sites
AGO1 | PAZ, PIWI | Short RNA binding
AGO2 | PAZ, PIWI | Short RNA binding
Dnd1 | U-rich region | Inhibits miRNA access to target mRNA
TNRC6B | RRM, GW repeats | miRNA-guided cleavage
DCP1a | NA | Not investigated
DCP2 | Nudix | Not investigated
MOV10 | DExH box | miRNA-guided cleavage
PRMT5 | Methyl-transferase | Not investigated
TNRC6A/GW182 | RRM, GW repeats | miRNA-guided cleavage, translational repression
TTP | Zn-finger | AU-rich mRNA destabilization
eIF4E | NA | Not investigated
Rck/p54 | DEAD box | miRNA-guided cleavage
PACT | dsRBD | Small RNA processing, RISC activity
FMRp | KH domain, RGG box | Not investigated
FXR1 | KH domain, RGG box | Not investigated
FXR2 | KH domain, RGG box | Not investigated
KIF17b | Kinesin motor | Not investigated
MVH | DEAD box | Not investigated
MAEL | HMG box | Not investigated

SNP density

The overall SNP density in the different domains of all the genes is shown in Fig. 6.23. It indicates that SNPs are usually most frequent in the 3' UTR. Since these proteins are directly or indirectly involved in the biogenesis process, we can hypothesize that polymorphisms create or change target sites for some microRNAs within a gene, which would eventually lead to negative feedback on the synthesis of a microRNA.

Fig. 6.23: Overall SNP density in the six domains (5' near gene, 5' UTR, exon, intron, 3' UTR, 3' near gene).
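The per-domain comparison can be sketched as follows. The text does not state how SNP density was normalized, so this sketch assumes SNPs per kilobase of each domain. The counts, lengths and function names are hypothetical and are not taken from the study.

```python
# Assumption: density = SNP count per kilobase of the domain.
REGIONS = ["5' near gene", "5' UTR", "exon", "intron", "3' UTR", "3' near gene"]


def snp_density(snp_counts, region_lengths_bp):
    """Return SNPs per kilobase for each of the six gene domains."""
    density = {}
    for region in REGIONS:
        length = region_lengths_bp[region]
        density[region] = 1000.0 * snp_counts[region] / length if length else 0.0
    return density


def density_share(density):
    """Express each domain's density as a percentage of the summed densities."""
    total = sum(density.values())
    if not total:
        return {region: 0.0 for region in density}
    return {region: 100.0 * value / total for region, value in density.items()}


# Hypothetical counts and lengths for a single gene.
counts = {"5' near gene": 4, "5' UTR": 1, "exon": 6,
          "intron": 90, "3' UTR": 12, "3' near gene": 2}
lengths = {"5' near gene": 2000, "5' UTR": 500, "exon": 3000,
           "intron": 60000, "3' UTR": 4000, "3' near gene": 2000}

density = snp_density(counts, lengths)
for region, share in density_share(density).items():
    print(f"{region}: {density[region]:.2f} SNPs/kb ({share:.0f}%)")
```

Normalizing by length matters here: in the hypothetical gene the intron carries by far the most SNPs in absolute terms, yet the 3' UTR has the highest density.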

Comparing the overall SNP density with that of individual genes such as Ago1 (Fig. 6.24) and Importin-8 (Fig. 6.25) shows that SNP density can be high in any domain of a gene, not just the 3' UTR. Polymorphisms in all the regulatory domains could therefore play an important role and need to be studied further.

Fig. 6.24: SNP density in Ago1 (pie chart over the six domains: 5' near gene, 5' UTR, exon, intron, 3' UTR, 3' near gene)

Fig. 6.25: SNP density in IPO8 (pie chart over the same six domains)

Non-synonymous SNPs

All the SNPs in the exonic region of each gene were categorized as synonymous or non-synonymous. The non-synonymous SNPs were further classified as missense or nonsense changes. A missense change alters an amino acid and, eventually, the protein product formed. A nonsense change results in a premature stop codon. Both kinds of change affect the function of the protein encoded by the gene. In this study we were interested in how such SNP-induced changes affect the protein.

SIFT

To identify the important non-synonymous SNPs in each gene we used SIFT. The variations predicted to damage the protein were considered important SNPs (Table 6.3).

Table 6.3 SIFT results for DCP2 protein.

SNP | AA change | Prediction (Homologs) | Score | Prediction (Homologs) | Score
rs33555* | L16F | TOLERATED | 1 | DAMAGING | 0
rs * | S71I | TOLERATED | 0.62 | DAMAGING | 0.02
rs | T213A | TOLERATED | 1 | TOLERATED | 0.34
rs | Q298K | TOLERATED | 1 | TOLERATED | 0.77

Note: * important SNP which may damage the protein

Promoter region SNPs: Transfac-MATCH

To analyze the significance of the SNPs in the promoter region, all the SNPs with a reported MAF of over 1% were selected and analyzed with the MATCH (Transfac) tool. The binding sites of various transcription factors were predicted in the promoter region of the genes, and MATCH was also used to analyze how polymorphisms in the promoter region alter transcription factor binding. Of the 38 genes, promoter SNPs in only 9 altered transcription factor binding (Fig. 6.26). The binding of 15 transcription factors was affected by SNPs; of these, PAX6 and Nkx2-5 were the two most frequently affected. PAX6, a transcription factor, has recently been reported as a tumor suppressor in glioblastoma and acts as an early differentiation marker for neuroendocrine cells. Its role in prostate cancer is also being investigated.

Nkx2-5 has been associated with congenital heart defects, and its role in prostate and colon cancer is being investigated.

Fig. 6.26: MATCH-Transfac results (total SNPs and significant SNPs for AGO1, ELAV1, GEMIN3, GEMIN4, MAEL, PACT, TNRC6A, SMAD2 and SMAD4)

The change in the transcription factor binding site of the PACT gene is shown in Fig. 6.27. The transcription factor Pax-6 binds to the wild-type sequence of the promoter region, whereas the polymorphism (highlighted in yellow) creates a binding site for Nkx2-5.

Fig. 6.27: Promoter sequence of PACT gene (wild-type and mutant).

MicroRNA target prediction of 3' UTR SNPs

To predict microRNA target sites in the 3' UTR of the genes, the stand-alone tool miRanda was used. Here we show the results for Ago4, a key protein involved in RISC. Fig. 6.28 shows the number of microRNA binding sites and the number of microRNAs binding in the 3' UTR of Ago4.

Fig. 6.28: miRanda results for Ago4 (sites and hits)

We observed that the wild-type sequence has 606 binding sites onto which 488 microRNAs could bind. In the case of the polymorphism (rs ) in the 3' UTR of Ago4, however, an additional microRNA binding site was created (Fig. 6.28).
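Once predictions exist for both the wild-type and the variant 3' UTR, finding created or destroyed sites reduces to a set comparison. The sketch below assumes each prediction has already been reduced to a (microRNA name, site start) pair. The names, positions and the function `compare_predictions` are hypothetical and do not reproduce miRanda's output format.

```python
def compare_predictions(wild_type_hits, variant_hits):
    """Return (gained, lost) binding sites between two prediction sets.

    Each argument is a set of (miRNA name, site start) pairs. A substitution
    SNP does not shift coordinates, so positions can be compared directly.
    """
    gained = sorted(variant_hits - wild_type_hits)
    lost = sorted(wild_type_hits - variant_hits)
    return gained, lost


# Hypothetical predictions, invented for this example.
wild_type_hits = {("miR-x", 120), ("miR-y", 480)}
variant_hits = wild_type_hits | {("miR-z", 300)}

gained, lost = compare_predictions(wild_type_hits, variant_hits)
print("gained:", gained)
print("lost:", lost)
```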

MicroRNA | SNP ID | SCORE | ENERGY | mirlength
hsa-mir-663b | rs

Fig. 6.29: miRanda output

In the output shown in Fig. 6.29, the region in green represents the seed sequence, and the polymorphism (G/C) that causes the binding of an additional microRNA is highlighted in yellow.

Structural analysis of nsSNPs

Blast results

Homology modelling, a template-based method, is one of the best computational methods for predicting protein structure. For the structural analysis of the effect of the nsSNPs, a BLASTP search was first carried out to retrieve suitable templates for homology modelling of the proteins. The BLAST results were filtered on three criteria: identity greater than 35%, E-value up to E-10, and significant query coverage. The E-value for all the templates was insignificant, and the query coverage for most of the templates was not more than 70 amino acids. Similar results were obtained for all the other proteins (Table 6.4). Proteins with low homology were modelled with PHYRE2 and I-TASSER.
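The template filter described above can be sketched as follows. The identity and E-value cutoffs come from the text. The coverage cutoff is not given there, so `min_coverage=0.5` is an assumed placeholder, and the `Hit` record, the template names and the query length are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_id: str
    identity: float    # percent identity
    evalue: float
    query_start: int   # 1-based, inclusive
    query_end: int     # 1-based, inclusive


def coverage(hit: Hit, query_length: int) -> float:
    """Fraction of the query sequence spanned by the alignment."""
    return (hit.query_end - hit.query_start + 1) / query_length


def usable_templates(hits, query_length,
                     min_identity=35.0,   # from the text: identity > 35%
                     max_evalue=1e-10,    # from the text: E-value up to E-10
                     min_coverage=0.5):   # assumed value, not from the text
    """Keep hits that pass the identity, E-value and coverage filters."""
    return [h for h in hits
            if h.identity > min_identity
            and h.evalue <= max_evalue
            and coverage(h, query_length) >= min_coverage]


# Hypothetical hits against a hypothetical 1000-residue query.
hits = [
    Hit("tmpA", 98.0, 1e-30, 1, 70),    # high identity, covers only 70 residues
    Hit("tmpB", 40.0, 1e-50, 1, 800),   # passes all three filters
    Hit("tmpC", 30.0, 1e-60, 1, 900),   # identity too low
    Hit("tmpD", 60.0, 1e-5, 1, 900),    # E-value too high
]
print([h.pdb_id for h in usable_templates(hits, query_length=1000)])
```

The first hypothetical hit mirrors the situation reported for the real templates: good identity over a short stretch is still rejected because it covers too little of the query.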

Table 6.4 Blast results for Dicer1 protein (columns: PDB ID, identity, positive, match, mismatch, gap, query start, query end, E-value, score; first template listed: 3c4b).

PHYRE2

The amino acid sequences of all the proteins were submitted to the homology-based modelling tool PHYRE2. SNP (rs ) causes the change S73T (highlighted in yellow), in which one polar amino acid (Ser) is replaced by another polar amino acid (Thr), so the change is not significant. SNP (rs ), however, causes the significant change D147N (highlighted in red), in which a negatively charged amino acid (Asp) is replaced by an uncharged polar amino acid (Asn).

I-TASSER

I-TASSER is an iterative threading-based protein modelling tool that models unaligned regions ab initio. All the protein sequences were submitted to I-TASSER, and the structures with the higher C-score were taken for analysis. The C-score is a confidence score for estimating the quality of the models predicted by I-TASSER. It is calculated from the significance of the threading template alignments and the convergence parameters of the structure assembly simulations.

Fig. 6.30: 3-D structure of eif4e protein generated by PHYRE2 (positions of the two SNPs marked).

The C-score typically lies in the range [-5 to 2], where a higher C-score signifies a model with higher confidence and vice versa. TM-score and RMSD are known standards for measuring the structural similarity between two structures, and are usually used to measure the accuracy of structure modelling when the native structure is known. For the Gemin4 protein, five models with different C-scores were obtained (Fig. 6.31). Model1: Model2: Model3: Model4: Model5: Since the C-score of model1 was the highest, it was selected for further analysis.

Fig. 6.31: 3D structure of Gemin-4 generated by I-TASSER

Energy minimization & validation

The modelled structures were energy minimized using SCHRODINGER (Fig. 6.32). A project table was created and the PDB structure was imported into it. Each protein was energy minimized with the OPLS2005 force field for 500 iterations. To validate the energy-minimized structure, a Ramachandran plot was generated with PROCHECK. This determines whether the alpha-helices, beta sheets and turn regions of the protein are present in a stable orientation.

Fig. 6.32: Ramachandran plot for Gemin4

In a Ramachandran plot, a protein with more than 90% of its residues in the most favoured regions is considered a good structure. In Gemin4, however, 87.6% of the residues are in the most favoured regions. Energy minimization and validation by Ramachandran plot were carried out in the same way for each protein. Modelled structures with a good Ramachandran plot were selected for further analysis.

POPMUSIC

PoPMuSiC predicts the change in stability of a given protein structure caused by a point mutation (Fig. 6.33).

Fig. 6.33: PoPMuSiC results for Gemin4 (ΔΔG value per SNP, including rs7813 at 0.96 and rs910925 at 1.16)

It predicts the ΔΔG value in kcal/mol: the lower the value, the more stable the structure of the protein, and vice versa. The PoPMuSiC results for the Gemin4 gene are shown in the graph. Most of the SNPs in Gemin4 destabilize the protein. SNP (rs ) makes the protein more unstable than any other mutation.

CONCLUSION

For the past two decades, only 6 major proteins were known to be involved in microRNA biogenesis, but in recent years researchers have reported several other proteins involved directly or indirectly in the process. In this study, we found through a literature survey that 38 proteins are involved in microRNA biogenesis. The expression of these proteins is also regulated in diseases such as cancer. Our interest was to know how microRNA expression is regulated through the biogenesis proteins and how mutations in them indirectly affect microRNA function. However, not enough mutational information is available for these proteins, leaving no choice but to analyze the effect of polymorphisms on them computationally. The SNPs of these proteins were mapped onto 6 domains of each gene. The expression of these genes was regulated by SNPs in the promoter or 3' UTR domains. To study the mutational effect on these genes, non-synonymous SNPs in the exon domain were analyzed. In the promoter region, 15 SNPs gave significant results, and PAX6 and Nkx2-5 were found to be important regulators of microRNA biogenesis. In the 3' UTR analysis, a few new microRNA target sites were created, a result that also suggests negative feedback on the microRNA. SIFT predicted that 28% of the missense mutations damage the protein. These sequence-based mutational analysis results suggested that most of the non-synonymous SNPs in these genes could affect the function of the protein, which led us to study the mutations at the structural level. To carry out the mutational analysis on the biogenesis protein structures, we modelled 35 proteins, since only 3 proteins were fully crystallized and deposited in PDB and the BLAST search did not find a good template. Very few templates had good sequence identity, and these failed to provide significant query coverage. A PSI-BLAST search also did not find any significant template. The BLAST results led us to model the proteins using PHYRE2 and I-TASSER. Structures from both modelling servers were subjected to energy minimization using the OPLS2005 force field for 500 iterations in SCHRODINGER. Ramachandran plots were generated to validate the modelled structures. The structures from both servers were validated, and those with the most residues falling in allowed regions were taken for the stability analysis.
PoPMuSiC was used to calculate the stability of the wild-type and mutant structures. Most of the mutations, it was observed, destabilized the proteins. The mutations with ΔΔG higher than 1 kcal/mol were considered the most important mutants. The significant SNPs in each domain of a gene will require experimental validation in future.
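The selection rule for important mutants can be written as a short sketch. It takes the cutoff of 1 kcal/mol from the text and assumes the sign convention described earlier, in which a positive ΔΔG means a less stable mutant. The SNP labels, values and function names are hypothetical.

```python
def classify_ddg(ddg_kcal_mol: float, important_cutoff: float = 1.0) -> str:
    """Classify a predicted stability change (positive = destabilizing)."""
    if ddg_kcal_mol > important_cutoff:
        return "destabilizing, important"
    if ddg_kcal_mol > 0:
        return "destabilizing"
    return "neutral or stabilizing"


def important_mutants(ddg_by_snp, important_cutoff: float = 1.0):
    """Return SNP labels whose ΔΔG exceeds the cutoff, most destabilizing first."""
    selected = [snp for snp, ddg in ddg_by_snp.items() if ddg > important_cutoff]
    return sorted(selected, key=lambda snp: ddg_by_snp[snp], reverse=True)


# Hypothetical predictions, invented for this example.
ddg_by_snp = {"snp_a": 1.9, "snp_b": 0.4, "snp_c": -0.1, "snp_d": 1.2}

for snp, ddg in ddg_by_snp.items():
    print(snp, ddg, classify_ddg(ddg))
print(important_mutants(ddg_by_snp))
```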

MicroRNAs function as endogenous translational repressors of protein-coding genes in animals by binding to target sites in the 3' UTRs of mRNAs. Because a single nucleotide change in the sequence of a target site or seed region could affect miRNA regulation, naturally occurring SNPs in target sites are candidates for functional variation that may be of interest for biomedical applications and evolutionary studies. However, little is known to date about variation among humans in miRNAs and their target sites. In this study, we analyzed publicly available SNP data in the context of miRNAs throughout the human genome, and we found a relatively low level of variation in functional regions of miRNAs. The stem and complementary strand of the miRNAs showed a very high count of SNPs. The seed region showed a very low polymorphism rate, but with considerable effect on target binding.

Regulation of microRNAs through biogenesis proteins is not well understood. In the present study we attempted to investigate the role of SNPs in the biogenesis proteins, and several interesting inferences emerged. SNPs in the promoter region regulated the binding of two transcription factors, PAX6 and Nkx2-5, which are also associated with many types of cancer. We also found that negative feedback inhibition of the microRNA is possible due to polymorphisms in the 3' UTR. Structural analysis suggested that most of the SNPs in the proteins could affect protein stability. The sequence and structural analyses in our study predicted that 5.7% of the SNPs significantly regulate the microRNA biogenesis proteins.