ABSTRACT
As the number of sequenced genomes is increasing at a high rate, there is a need for a gene prediction method that is quick, reliable, and inexpensive. Under these conditions, a computational tool can serve as an alternative to wet-lab methods, and the confidence level of its annotation can be enhanced by preparing exhaustive training data sets. The aim of this work was to develop a tool that reads a DNA sequence file in FASTA format and annotates it. The input data were retrieved from the Saccharomyces Genome Database (SGD), and the tool was developed in Perl. To increase the confidence level of annotation, the data were validated against multiple sources. Perl scripts were written to find promoter regions, repeats, transcription factor binding sites, base periodicity, and nucleotide frequency, and to identify poly(A) signals, CpG islands, and ARS. The tool annotates the DNA by predicting gene structure from the consensus sequences of important regulatory elements. The confidence level of annotation of the predicted genes, non-coding regions, ARS, repeats, etc. was checked by running a test dataset, which consisted of annotated data as reported by the genome database and by computational tools. Gene prediction on the regions reported as non-coding by SGD was performed with existing tools; the regions that these tools also identified as non-coding were then analyzed for the presence of repeats. BLAST was used to annotate on the basis of sequence similarity with already annotated genes, and GeneMark.hmm and FGENESH were used for gene prediction. To validate the predicted results, the output of the developed tool was compared with the annotations of the Saccharomyces cerevisiae genome from SGD and with the output of other computational tools, viz. EMBOSS CpGplot, PolyAh, REPFind, and Promoter 2.0 Prediction Server. The output generated was also used for validation and for checking the sensitivity of the tool. Such tools reduce the cost and time required for genome annotation and bridge the gap between sequenced and annotated data.
KEYWORDS
Genome Annotation, Gene structure, Gene prediction, Yeast genome, Saccharomyces cerevisiae, promoter recognition, Comparative genomics, computational tools, Bioinformatics.
INTRODUCTION
The hereditary material in living cells is DNA. It is a large molecule made up of smaller units called nucleotides. Each nucleotide has three parts: a sugar molecule, a phosphate group, and a nitrogenous base. The genetic information is carried in the sequence of nitrogenous bases. Nitrogenous bases are divided into two groups, purines and pyrimidines, on the basis of their structure: a pyrimidine has a single nitrogen-containing ring, whereas a purine has two fused nitrogen-containing rings. Cytosine, thymine and uracil are pyrimidines; adenine and guanine are purines. The four bases found in DNA are represented by their first letters, A, G, C and T (uracil, U, replaces thymine in RNA). DNA is a double-stranded molecule that forms a helical structure. The two strands are complementary to each other: an A on one strand always pairs with a T on the other, and a C always pairs with a G. DNA associates with proteins to form chromosomes.
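The base-pairing rule above (A with T, C with G) is what allows one strand to be derived from the other. A minimal sketch in Python follows; the function name is hypothetical, the input is assumed to contain only A, C, G and T, and the code is an illustration rather than part of the Perl tool described in this work.

```python
# Watson-Crick pairing table: A<->T and C<->G.
# Assumption of this sketch: the sequence contains only A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    # Complement every base, then reverse, because the two strands
    # of the double helix run in opposite directions.
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Scanning both a sequence and its reverse complement is the usual way to cover features on either strand.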
A genome is the complete content of genetic information in an organism. A eukaryotic genome is defined as a single, haploid set of chromosomes. The cells of a diploid organism carry two copies of this haploid set, with exceptions such as reproductive cells and red blood cells. Genome sequencing is the determination of the order of nucleotides (A, T, G, and C) within a DNA molecule. It supports numerous fields, including biological research, diagnostics, biotechnology, forensic biology, evolutionary biology and biological systematics.
The literal meaning of annotation is to add explanation. Genome annotation is accordingly the process of attaching biological information to genomic sequences, and it helps in identifying important gene functions. The identification of genomic elements such as intron-exon structure, coding regions and regulatory motifs is called structural genome annotation. The addition of biological information to these genomic elements is referred to as functional genome annotation.
Genome annotation has led to advances in several fields, including medicine, agriculture, biotechnology, chemistry and other basic sciences. It is widely used in genetic engineering to develop genetically engineered crops (drought resistant, insect resistant) and genetically modified organisms (GMOs). It is also used in molecular medicine for better diagnosis and early detection of diseases, gene therapies, etc. After genome annotation, the gene product of a particular sequence can be known and its biochemical functions can be established. Genome annotation is also used to reconstruct metabolic pathways; for example, the metabolic network of the yeast Saccharomyces cerevisiae was reconstructed using currently available genomic, biochemical, and physiological information, and transport reactions for transporter proteins can be constructed on the basis of an organism's genome annotation [1]. Genome annotation plays a role in food safety as well: if the genome of a pathogen or of a microorganism responsible for food spoilage is annotated, its gene regulatory sequences can be found, and the gene expression profile can be exploited to repress its growth and thereby increase the shelf life of the food. It also helps in phylogenetic studies, i.e. in understanding the course of evolution, and it has led to discoveries that are useful in energy production, toxic waste reduction and industrial processing.
Genome annotation consists of two phases: a computation phase and an annotation phase. In the computation phase, genetic elements such as introns, exons and proteins are computed, either by homology search or by prediction-based methods. Homology search relies on sequence similarity, aligning the query to known sequences such as mRNA-derived sequences (ESTs); prediction-based methods rely on algorithms designed to find genes and gene structures from nucleotide sequence and composition. In the annotation phase, the computed data are used to synthesize gene annotations, including functional annotation. A genome annotation pipeline starts by searching sequence databases (typically NCBI NR) for sequence similarity, usually with BLAST. This is followed by statistical prediction of protein-coding genes using methods such as GeneMark or Glimmer. Conserved domains are identified by searching specialized databases such as Pfam, SMART, CDD and COGs. Functional predictions are refined using metabolic databases such as KEGG [Fig. 1].
[Illustration not included in this excerpt]
Fig 1: Flow chart of genome annotation process: FB: feedback from gene identification for correction of sequencing errors, primarily frameshifts. General database search: usually using BLAST. Statistical gene prediction: GeneMark or Glimmer. Specialized database search: Pfam, SMART, CDD, COGs. Functional prediction: metabolic databases such as KEGG.
Prokaryotes have a high gene density (about 1 kb per gene on average) and short intergenic regions, and they lack introns. Eukaryotes, in contrast, have split genes with many introns and exons, a low gene density (1-200 kb per gene), and non-coding regions containing large stretches of repeats. Hence, genome annotation is much easier in prokaryotes than in eukaryotes [Fig. 2].
[Illustration not included in this excerpt]
Fig 2: Schematic representation of prokaryotic and eukaryotic gene structure and transcription units. TATA denotes one of the possible eukaryotic core promoter elements, and poly (A) denotes the posttranscriptional addition of a poly (A) tail. Black bars denote coding DNA, open bars denote transcribed but untranslated DNA, and thin lines within transcribed regions denote introns. [2]
Table 1: Prokaryotic and eukaryotic genome organization
[Illustration not included in this excerpt]
The non-coding region plays a role in the regulation of gene expression; these regulatory regions may also contain repetitive elements. Repeats can be divided into two types: tandem repeats and dispersed repeats. A tandem repeat is a pattern of one or more nucleotides present as consecutive copies along a DNA strand, e.g. satellites, minisatellites, and microsatellites. Repeats that are distributed throughout the genome are called dispersed repeats.
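Short tandem repeats as defined above can be located with a back-referencing regular expression. The following Python sketch is only an illustration of that idea, not the method used by the Perl tool or by REPFind; the function name and the thresholds (repeat unit of at most 6 bp, at least 3 consecutive copies) are assumptions chosen for the example.

```python
import re


def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for short units repeated back to back.

    max_unit and min_copies are illustrative thresholds, not values
    taken from the text.
    """
    # ([ACGT]{1,max_unit}?) captures the shortest candidate unit;
    # \1{min_copies-1,} requires that unit to repeat consecutively.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # Microsatellite-like (CA)n run embedded in flanking sequence.
    print(find_tandem_repeats("GGCACACACATT"))
```

Dispersed repeats cannot be found this way, since their copies are not adjacent; they are usually detected by comparing against a library of known repeat families.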
There are two approaches to gene prediction: ab initio [Fig. 3] and comparative. Ab initio gene prediction is based on gene content and signal detection, e.g. promoter and regulatory sequences that precede a gene, polyadenylation signals at the end of a gene, and CpG islands (stretches of DNA with high GC content). Ab initio methods can readily predict novel genes but are not effective in detecting alternatively spliced forms or interleaved or overlapping genes. They also have difficulty identifying exon/intron boundaries accurately.
[Illustration not included in this excerpt]
Fig 3: Flow chart of gene prediction process by HMM: Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow). These are learnt from examples of known gene models and provide the probability that a stretch of sequence is a gene.
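Of the signals used in ab initio prediction, CpG islands are simple to quantify. The Python sketch below computes the GC fraction and the observed/expected CpG ratio of a window and flags windows that exceed both thresholds. The window size, step and thresholds (200 bp, GC above 0.5, ratio above 0.6) are commonly quoted defaults and are assumptions here, not values taken from this work or from EMBOSS CpGplot; the function names are hypothetical.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    s = window.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")  # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N,
    # so observed/expected = CpG * N / (C * G).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_candidate_windows(seq, window=200, step=50, min_gc=0.5, min_obs_exp=0.6):
    """Return (start, end) of windows passing both thresholds (assumed values)."""
    hits = []
    for start in range(0, max(len(seq) - window + 1, 0), step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc > min_gc and ratio > min_obs_exp:
            hits.append((start, start + window))
    return hits
```

Overlapping windows that pass can then be merged into a single candidate island.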
Comparative methods use annotations from previously analyzed genomes, i.e. they compare genomic sequence data to gene, cDNA, EST and protein sequences already present in databases. For searching for genes in DNA sequences derived from eukaryotes, dbEST is particularly useful. This is a database of ESTs that have been generated by single-pass sequencing of random clones from cDNA libraries. Genomic sequences are compared directly to the contents of dbEST in order to identify potential ORFs [3].
REVIEW OF LITERATURE
Many genome annotation pipelines and tools are available, e.g. the NCBI Eukaryotic Genome Annotation Pipeline [4] and MAKER2 [5]. These pipelines use one or both of the two approaches, ab initio prediction and comparative search. BLAST is used for comparative search: it identifies sequences similar to the query in databases such as GenBank or Swiss-Prot. EST databases hold transcript sequences: ESTs are DNA sequences of expressed genes that are represented in a cDNA library [6]. Because they are derived from cDNA, they correspond to transcripts of functioning genes. Predicted genes are therefore also used as BLAST queries against an EST database. BLASTN queries a nucleotide sequence against a nucleotide database, BLASTX translates a nucleotide query in all six reading frames and searches a protein database, and BLASTP uses a protein sequence to search a protein database [7]. Numerous ab initio gene prediction methods have been developed [8].
Genes can be predicted using conserved regions of the genome. Regulatory elements such as promoter sequences, the polyadenylation signal and the 5’ capping signal can be used to predict genes. A promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located upstream of, and close to, the transcription start sites (TSS) of genes, on the same strand, and can be about 100–1000 base pairs long. Eukaryotic promoters can contain a TATA element (consensus sequence TATAAA) and a BRE (B recognition element), both typically located within 30 to 40 base pairs of the transcription start site. E-boxes (sequence CACGTG) are promoter regulatory sequences that bind transcription factor proteins to regulate transcription. The CAAT box, or CAT box, is a distinct pattern of nucleotides with the consensus sequence GGCCAATCT that occurs 75-80 bases upstream of the transcription initiation site. The GC box has the consensus sequence GGGCGG. CAAT and GC boxes are primarily located in the region 100-150 bp upstream of the TATA box. In Saccharomyces cerevisiae, the TSS has been reported to lie 45–120 bp downstream of a TATA element [9].
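The consensus sequences listed above lend themselves to a simple exact-match scan. The Python sketch below is a simplified illustration, not the Perl implementation of this work: it searches the forward strand only, allows no mismatches, and the function and dictionary names are hypothetical. Real promoter predictors typically use position weight matrices rather than exact strings.

```python
# Consensus sequences as given in the text above.
MOTIFS = {
    "TATA box": "TATAAA",
    "E-box": "CACGTG",
    "CAAT box": "GGCCAATCT",
    "GC box": "GGGCGG",
}


def scan_motifs(seq, motifs=MOTIFS):
    """Return sorted (position, motif name) pairs for exact consensus matches."""
    s = seq.upper()
    hits = []
    for name, consensus in motifs.items():
        start = s.find(consensus)
        while start != -1:
            hits.append((start, name))
            # Resume one base further on so overlapping matches are found.
            start = s.find(consensus, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    print(scan_motifs("GGGCGGTTTATAAACCCACGTG"))
```

The reported positions are 0-based; the distances between hits can then be checked against the spacing rules described above.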
The CAP site is the transcription initiation sequence, or start point, defined as +1, at which the transcription process actually starts. The consensus sequence is ATG. The poly(A) signal, with consensus AAUAAA, is found 10-30 nt upstream of the poly(A) site. When AATAAA is deleted from the DNA template, no mature mRNA is made. Both AAUAAA and GU-rich or U-rich signals are required for efficient formation of the mature 3' end. The AAUAAA signal is sufficient for polyadenylation if it is located at the appropriate distance from the end of a molecule. Poly(A) tails function in mRNA stabilization and in the initiation of translation.
Analysis of codon usage and base periodicity also helps in gene annotation, because both show marked differences between coding and non-coding regions [10]. Most exon sequences have a 3-base periodicity, while intron sequences do not have this feature. This periodicity in exons is determined by codon usage frequencies [11] and arises from the non-uniform distribution of the four nucleotides A, C, T and G in protein-coding regions. Introns instead tend to show a 2-base periodicity.
ORFs are expected to be shorter in DNA sequences with high AT (adenine + thymine) content (>50%), because the three stop codons (TAA, TAG and TGA) are rich in A and T. If A, C, G and T were equally likely at every position, three of the 64 possible codons would be stop codons, so a stop codon would occur with a probability of 3/64, or approximately one in twenty. Given three base pairs per codon, this corresponds to about one stop codon every sixty base pairs in a given reading frame. The frequency of stop codons may vary significantly depending on the local nucleotide composition. Long ORFs are therefore unlikely to arise by chance, and the probability of an ORF being a coding sequence increases with its size. Most proteins are larger than 100 codons (300 bp), and their ORFs are relatively easy to classify.
A segment whose GC content and CpG dinucleotide frequency are both much higher than the average may indicate a CpG island. Such islands are found at the 5′ end of genes.
UTRs are the sections of the mRNA before the start codon and after the stop codon that are not translated, termed the five prime untranslated region (5' UTR) and the three prime untranslated region (3' UTR), respectively. These regions are transcribed with the coding region and are therefore exonic, as they are present in the mature mRNA. UTRs have many roles in gene expression, including mRNA stability, mRNA localization, and translational efficiency.
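The ORF-length argument above can be turned into a simple filter: scan each reading frame for an ATG followed by an in-frame stop codon and keep only ORFs of at least 100 codons. The Python sketch below is a minimal illustration under stated assumptions: it scans the forward strand only, uses the first ATG after the previous stop as the start, and the function name is hypothetical. It does not reproduce GeneMark.hmm, FGENESH or the Perl tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and min_codons counts codons excluding the stop.
    """
    s = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(s) - 2, 3):
            codon = s[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCC" * 5 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

Running the same function on the reverse complement covers the other strand. In intron-containing genes the ORF is interrupted, so this filter is most informative for intron-poor genomes.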
UTRScan predicts functional elements by searching sequences for patterns located in 5' or 3' UTR sequences. A DNA sequence can also be examined to find the sites of all restriction enzymes that cut it. The recognition sites of these restriction enzymes may flank the gene and would thus be important in genetic engineering. A sequence-tagged site (STS) is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known. STSs can be easily detected by the polymerase chain reaction (PCR) using specific primers. The DNA sequence of an STS may contain repetitive elements. Eukaryotic genomes are characterized, and often dominated, by repetitive, non-genic DNA sequences [12].
Perl (Practical Extraction and Report Language) is a high-level, general-purpose, interpreted, dynamic programming language. It was originally developed by Larry Wall in 1987 as a general-purpose UNIX scripting language to make report processing easier. The language provides powerful text processing facilities that make it easy to manipulate text files. It is also used for graphics programming, system administration, network programming, applications that require database access, and CGI programming on the Web.
The tool developed here finds coding regions on the basis of the presence of a promoter region and transcription factor binding sites. The majority of the non-coding region contains repetitive elements, and the tool identifies repetitive DNA in the non-coding regions.
RATIONALE AND SCOPE OF PROJECT
This project aims to develop a computational tool for genome annotation. The number of genomes sequenced has increased exponentially in the past decade. Accurate gene prediction can significantly reduce the amount of experimental verification work, which is time-consuming, labor-intensive and expensive. Ab initio gene prediction plays a critical role because it predicts gene structures quickly, inexpensively, and in most cases reliably. The transcripts predicted by ab initio algorithms are normally complete, and ab initio prediction yields at least a partial prediction for about 95% of all genes, leaving fewer genes entirely missing [13].
SCOPE:
As the number of sequenced genomes is increasing at a high rate, there is a need for a gene prediction method that is quick, reliable, and inexpensive. Under these conditions, the tool will serve as an alternative to wet-lab methods. The confidence level of annotation by the tool can be enhanced by preparing exhaustive training data sets.
RESEARCH METHODOLOGY
The project aims to develop a tool that will read a DNA sequence file in FASTA format and annotate it, i.e. determine the important genomic elements, including exons, introns, promoter regions, repeats, TFBS, poly(A) signals, ARS, CpG islands and sequence-tagged sites. The tool will be implemented in Perl, and the results of the program will be analyzed. The sensitivity of the tool will be validated by running it on a test dataset consisting of annotated data as reported by the genome database and by computational tools, viz. RepeatMasker [14], LTR_finder [15], tRNAScan-SE [16] and UTRScan [17]. The transcription factor binding sites will be retrieved from YEASTRACT [18].
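The first step of the methodology, reading a FASTA file, can be sketched as follows. The example is written in Python purely for illustration (the tool itself is to be implemented in Perl); the function names are hypothetical, and the nucleotide-frequency helper is included only because nucleotide frequency is one of the quantities the tool reports.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def nucleotide_frequency(seq):
    """Return the fraction of each symbol in seq as a dict."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {base: counts[base] / total for base in sorted(counts)} if total else {}


if __name__ == "__main__":
    demo = [">chr1 demo record", "acgt", "AC", "", ">chr2", "GG"]
    for name, sequence in parse_fasta(demo):
        print(name, len(sequence), nucleotide_frequency(sequence))
```

Because parse_fasta accepts any iterable of lines, an open file handle can be passed to it directly, and each record can then be handed to the individual annotation steps.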
- Quote paper
- Renu Rawat (Author), 2014, Genome annotation and finding repetitive DNA elements, Munich, GRIN Verlag, https://www.grin.com/document/273971