Genomics
BIO 427
Instructors:
Louise Temple and  Jon Monroe
JMU Department of Biology

Copyright Jon Monroe and Louise Temple, 2/5/06
Exercise 3.   Finding genes in raw genomic DNA sequence

This exercise introduces some of the web-based tools for finding genes in raw DNA sequence.  In the first section, an Open Reading Frame (ORF) finding tool predicts protein-coding sequence based on the absence of stop codons, and BLAST is then used to determine whether each ORF matches any known gene.  In the second section, GeneID, GENSCAN and FGENESH are used to predict gene structure using a hidden Markov model trained on various eukaryotic genomes.
A.  Open Reading Frames

Open reading frames (ORFs) are defined as regions of DNA that have the potential to code for polypeptides, as indicated by the absence of stop codons.  In random sequence, stop codons are expected at a frequency of 3/64 codons (~1/20 codons, or ~1/60 bp).
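The 3/64 figure can be checked with a short sketch. It assumes the standard genetic code (stop codons TAA, TAG and TGA) and a random sequence in which all four bases are equally likely and independent; the function name `expected_stops` and the 600 bp example length are illustrative choices only.

```python
from itertools import product

# The three stop codons of the standard genetic code (assumed here).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Enumerate all 64 possible codons and count how many are stops.
all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
stop_fraction = sum(c in STOP_CODONS for c in all_codons) / len(all_codons)


def expected_stops(length_bp):
    """Expected number of stop codons in ONE reading frame of random DNA."""
    codons_in_frame = length_bp // 3
    return codons_in_frame * stop_fraction


print(stop_fraction)        # 0.046875, i.e. 3/64
print(1 / stop_fraction)    # about 21.3 codons between stops on average
print(expected_stops(600))  # 9.375 stops expected in 200 codons
```

Real genomes are not random (base composition is rarely 25% each), so these numbers are only a baseline against which unusually long stop-free stretches stand out.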

1. In a random nucleic acid sequence that is 1500 bp long, how many stop codons should occur in each reading frame?

2. If a typical protein is 500 amino acids long, how much longer is this open reading frame than the average distance between stop codons in the other two reading frames?  State the answer in terms of a ratio.

Dr. Kyle Seifert studies unusual serine-rich repeat (SRR) proteins found on the surface of Streptococcus species that can cause bacterial meningitis.  These proteins may be involved in attachment to host cells and thus may be important targets for vaccine development.  Dr. Seifert cloned and sequenced 24 kbp of the genome of this bacterium.  Your task is to find the open reading frame encoding the serine-rich repeat protein and to identify several other proteins in this DNA sequence using an Open Reading Frame finder and BLAST. 

Copy the 24 kbp sequence (FASTA format) and paste it into the sequence box at NCBI's ORF Finder.  Now click the "OrfFind" button.  On the left side of the output page is a map of the ORFs in each of the 6 reading frames: three read left to right (+ strand) shown above three read right to left (- strand).  On the right side is a list of the ORFs sorted by size.
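To make the six-frame idea concrete, here is a minimal sketch of a stop-to-stop scan over both strands. It is not NCBI's algorithm (ORF Finder also considers start codons, among other things); the function names, the `min_codons` cutoff and the demo sequence are all hypothetical, and coordinates for the - strand are reported relative to the reverse complement rather than the original sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the - strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_runs(seq, min_codons):
    """Yield (frame, start, end) for stop-free stretches in the 3 frames of seq.

    start and end are 0-based, end-exclusive positions within seq.
    """
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3
        # Close the final run at the last complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            yield frame, run_start, end


def six_frame_orfs(seq, min_codons=100):
    """List (strand, frame 1-3, start, end, length in codons), longest first."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in stop_free_runs(s, min_codons):
            orfs.append((strand, frame + 1, start, end, (end - start) // 3))
    return sorted(orfs, key=lambda orf: orf[4], reverse=True)


# Tiny made-up sequence; a real input would be the 24 kbp FASTA sequence
# with its header line removed.
demo = "ATG" + "GCT" * 5 + "TAA" + "C"
for orf in six_frame_orfs(demo, min_codons=6):
    print(orf)
```

Sorting by length mirrors the size-ordered list in the ORF Finder output. In a short demo sequence almost every frame is stop-free, which is why a length cutoff only becomes informative on sequences of several kbp.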

3.  Are the longest open reading frames transcribed in both directions on the DNA (located on the + and - strand) or are they transcribed in only one direction?

4.  In rapidly growing bacteria, replication and transcription usually occur simultaneously so that DNA and RNA polymerases compete for the same template strand.  How does the arrangement of genes in this DNA allow for an increased rate of growth?

Predicted sequences in ORF Finder can be obtained either by clicking on the ORF in the map, or by clicking on the small box in front of each ORF in the list to the right.  Click on each of the long ORFs and scroll down through the predicted sequence, looking for a protein that might be named a "serine-rich repeat protein".  (Hint: S, the one-letter code for serine, is very frequent in parts of the protein.)
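Scanning by eye works, but the hint can also be turned into a quick check. The sketch below computes the overall serine fraction of a predicted protein and finds its most serine-rich window; the function names, the 50-residue default window and the toy sequence are assumptions for illustration, not part of ORF Finder.

```python
def serine_fraction(protein):
    """Fraction of residues that are serine (one-letter code S)."""
    protein = protein.upper()
    return protein.count("S") / len(protein) if protein else 0.0


def richest_window(protein, window=50):
    """Return (start, fraction) for the most serine-rich window.

    start is 0-based; if the protein is shorter than the window,
    the whole protein is treated as a single window.
    """
    protein = protein.upper()
    if len(protein) <= window:
        return 0, serine_fraction(protein)
    best_start, best_fraction = 0, -1.0
    for start in range(len(protein) - window + 1):
        fraction = protein[start:start + window].count("S") / window
        if fraction > best_fraction:
            best_start, best_fraction = start, fraction
    return best_start, best_fraction


# Toy sequence (made up): a short repeat flanked by unrelated residues.
toy = "MKT" + "SASE" * 5 + "LLK"
print(serine_fraction(toy))
print(richest_window(toy, window=8))
```

Running both functions on the translation of each long ORF and comparing the results is one way to narrow down the candidates before confirming with BLAST.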

5.  Which ORF encodes the SRR protein?  List its nucleotide position from the table.

Once the SRR is identified, do a BLAST search to confirm that it is homologous to other SRR title "Streptococcus genomic DNA for BIO 427 exercise".  Use the latter one because it automatically uses the amino acid sequence below to do a BLASTp search against the protein sequence database.  After receiving the BLAST results you will notice that some entries are listed as unnamed or unknown proteins, and some have other descriptive names.  This simply reflects the fact that proteins are given names by different people interested in, and hence looking for, different things.  Finding homologous sequences from different organisms can sometimes lead to new ideas about a protein's function.

6.  List three different organisms in which SRRs have been identified and list three different names that have been given to SRRs (not including "unknown" or "unnamed" protein).  What kind of organisms are these?

7.  Identify four other ORFs in this sequence, listing for each ORF the most frequent name found in its BLAST output.  Because most of these gene names reveal little about function, use PubMed to find abstracts related to each gene and describe the role of each gene in one sentence.  Also list the citations in the following format:

Jilaveanu LB, Oliver D. 2006. SecA dimer cross-linked at its subunit interface is functional for protein translocation. J Bacteriol 188: 335-338.