Genomics BIO 427
Instructors: Louise Temple and Jon Monroe, JMU Department of Biology
Copyright Jon Monroe and Louise Temple, 1/15/06
Exercise 1. Exploring sequence databases
The purpose of this exercise is to guide you through your initial exploration of several sequence databases. Refer to Chapter 3 of the text for more detailed descriptions. You will discover what the databases contain, how to find sequences, and what is contained within sequence entries. You will also learn how to change a file format and look for membrane-spanning domains with the program TopPred. This is not meant to be an exhaustive exercise, but we hope it will make future use of these databases easier. Throughout the exercise there are numbered questions. Type your answers into a word processing file and turn them in at the start of the class period in which they are due. Do this exercise with a copy next to you of the paper by Jespersen et al. (1997), "From sequence analysis of three novel ascorbate peroxidases from Arabidopsis thaliana to structure, function and evolution of seven types of ascorbate peroxidase," Biochem. J. 326, 305-310.
A. NCBI Entrez
NCBI (the National Center for Biotechnology Information) maintains many linked databases, including databases of nucleotide and amino acid sequences, protein structures, taxonomy, genetic diseases, genomes, and published literature. A cross-database search page, Entrez, allows one to find entries in any of the databases simultaneously. Once you are within a page in one database, it will contain links to the others. Go to the main search page in a new window, keeping this window open. Notice that below the search box are two sets of databases; the top one includes various literature databases, including PubMed for abstracts, PubMed Central for free full-text journal articles, books, and several disease databases. The larger box below contains all of the molecular databases, and at the bottom are several miscellaneous databases. To see how these databases are linked, go to a diagram showing the Entrez databases and the connections between them (this requires Flash). Mouse over the colored dots to see how each database is linked to the others. (Pretty cool Flash animation, eh?)
1. How many links are there from the nucleotide database to the protein database, and vice versa? Why are these numbers different?
B. Searching Entrez starting with a gene name and organism
On the main search page type the word "peroxidase".
2. How many hits are there in the protein database?
Add "ascorbate" in front of peroxidase and do the search again.
3. Now how many protein hits are there?
Now add "Arabidopsis" to the list of search terms (their order doesn't matter).
4. How many protein hits are there?
Clearly, to find a particular sequence it pays to use multiple search terms to narrow the results. Also notice the hits in other databases. Feel free to explore, as some of these will become valuable to you in the weeks to come. Using the "Arabidopsis ascorbate peroxidase" search results, click on the protein database results to see links to the hits. For various reasons there are many more entries than the number of genes for this enzyme in the Arabidopsis genome. Some represent cDNAs sequenced by individual labs, while others for the same gene are entries from the genome sequencing program. Click on the first entry or BAA03334. Notice the various parts of this entry. At the top are pieces of information about the entry such as the accession number (also called the locus number), how many amino acids it includes, and the date it was entered into GenBank. There is also a link to the nucleotide sequence from which this amino acid sequence was derived (D14442.1). The entire taxonomic listing for the source organism is provided (Eukaryota; Viridiplantae; Streptophyta;...), followed by one or more citations that may be to published papers or to unpublished (direct submission) entries.
5. Which journals contain references listed for this entry?
6. Why isn't the last reference to a journal?
These references usually contain links to the abstracts in PubMed, and from there one can often obtain the full text of the paper. Below the references is information about the sequence, followed by the sequence itself.
7. Look at the first 5-10 amino acids of the sequence and compare them with the sequences in Figure 1 of the Jespersen paper. Which ascorbate peroxidase does this sequence represent (i.e. cs1, cs2, cm1...)?
Notice that the number of amino acids in this GenBank entry (250) matches the number at the end of this sequence! Always use as much supporting evidence as you can to confirm that you know what you are looking at!
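The narrowing searches in section B can also be written out as a query string. The sketch below is only an illustration, assuming NCBI's E-utilities esearch endpoint with its db and term parameters, and assuming that joining the terms with an explicit AND mirrors what the search page does when you type several words. The helper name esearch_url is ours, not NCBI's. It builds the URL only and never contacts NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(terms, db="protein"):
    """Build a search URL that requires every term to match (hypothetical helper)."""
    query = " AND ".join(terms)  # each added term narrows the result set
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# The three searches from section B, from broadest to narrowest.
for terms in (["peroxidase"],
              ["ascorbate", "peroxidase"],
              ["Arabidopsis", "ascorbate", "peroxidase"]):
    print(esearch_url(terms))
```

As on the web page, the order of the terms changes the text of the query but should not change which entries match.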
C. RefSeqs
Since many sequences are represented more than once in the databases, NCBI initiated the RefSeq (reference sequences) collection, which is a curated database of nonredundant sequences. Read more about these on pages 15-16 of your text. Go back to the protein sequence database results and find a RefSeq. These are easy to pick out because the accession numbers begin with two letters followed by an underscore (e.g. NT_... or XM_...). To limit your search results to just the RefSeqs, click on the Limits tab (left side of the tabs bar), choose RefSeq from the "only from" button (right side of page), then press Go.
8. How many NP_ entries for Arabidopsis ascorbate peroxidase are there?
9. How many Arabidopsis ascorbate peroxidases was Jespersen aware of in 1997?
This discrepancy illustrates the power of whole genome sequencing to identify genes of a particular type in a given genome.
D. Searching Entrez starting with a sequence number
Look on page 306 of the Jespersen paper under the subheading Three novel ATAs. The first cDNA they sequenced was number 109F7T7. This number originated from the Arabidopsis EST collection, and it appears in the databases 4 times. Search Entrez using that number, then click on the nucleotide database link. The second of the two entries (#T41879) is the original EST sequence entry from 1994. Recall that an EST is the nucleotide sequence from one sequencing run starting at one end of a unique cDNA. Typically, sequence near the primer is unambiguous, but farther from the primer it becomes more and more difficult to read the sequence, as you probably noticed in the manual sequence analysis exercise.
10. Whenever a sequence calling program fails to identify a base accurately, it inserts a letter representing any base and then continues until the proportion of ambiguous bases reaches some predetermined threshold. Ambiguous bases become more frequent near the 3' end of this sequence. What letter (other than a, c, g, or t) is used to represent any nucleotide?
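The stopping rule described in question 10 can be sketched in a few lines. This is our own illustration, not the code of any real base-calling program: the function name, the 5% threshold, and the 20-base minimum before the rule applies are all assumptions, and the toy read uses "?" as a stand-in so that we don't give away the answer.

```python
def keep_until_threshold(calls, threshold=0.05, min_length=20):
    """Return the part of a read kept before ambiguous calls exceed the threshold.

    Any character other than a, c, g, or t counts as ambiguous.
    threshold and min_length are illustrative values, not real defaults.
    """
    ambiguous = 0
    for position, base in enumerate(calls.lower(), start=1):
        if base not in "acgt":
            ambiguous += 1
        # Wait for min_length bases so one early bad call cannot end the read.
        if position >= min_length and ambiguous / position > threshold:
            return calls[:position - 1]  # drop the call that crossed the limit
    return calls


# Toy read (made up): clean near the primer, ambiguous toward the 3' end.
toy_read = "acgt" * 10 + "?c?g??t?"
kept = keep_until_threshold(toy_read)
print(len(toy_read), len(kept))
```

A lower threshold would end the read sooner, and a higher one would keep more of the uncertain 3' end.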
Go back to the nucleotide database results page and click on the first entry (#X98003). Notice who entered this sequence into the database in 1996: our old friend Hans Jespersen! This sequence contains 1127 nucleotides, which is more than the number mentioned in the paper, but notice that it contains the 24 "A" residues of the poly-A tail, which were not included in the number mentioned in the paper. The paper also mentions that it encodes 287 amino acids, but to find that number you will have to link to the amino acid sequence entry from the main search page. There will be only one entry (#CAA66640) and it will have 287 aa at the top.
E. Changing the sequence format for use in other programs
Some programs you will use require that you enter sequence in a particular format. One common format you will need to use is called FASTA. Just after the red CAA66640 heading at the top of the page is a link named Reports. Click on this link to pull down a number of options and let go on FASTA. The resulting page will have the sequence in all caps with the following first line:
>gi|1332439|emb|CAA66640.1| ascorbate peroxidase [Arabidopsis thaliana]
The FASTA format always begins with a (>) symbol followed by the sequence identifier. After the first space is information that humans can read to identify the sequence, but computer programs will ignore it. The actual sequence, usually at 60 characters per line, begins on the next line.
Now copy the FASTA formatted sequence (including the line starting with >) and paste it in the actual data box in the TopPred program. TopPred is a program for predicting the membrane topology of proteins. Membrane proteins contain one or more regions of ~20 hydrophobic amino acids that can be represented graphically in a hydrophobicity plot. Jespersen did this analysis and determined that a membrane spanning domain existed near the C-terminus of the protein. After pasting the sequence into the box, enter your email address in the top box.
Under Control options (scroll down) notice that the default organism is a prokaryote. Your sequence came from a plant, so check the box next to "Organism: eukaryot" and then press "Run toppred" near the top of the page. The results will not be sent to your email address; instead, the next window will show links to the results in four formats. Click on the first link, to a .png file, which contains a graphical output of the results.
11. How many regions of the protein reach the lower limit for predicting a membrane spanning domain?
Look at the other output files. By clicking "toppred.out" you will obtain data on the analysis. Near the bottom is the range of amino acids predicted to make up the transmembrane domain. Notice that the program predicts one such domain from amino acids 260 to 280, as indicated by Jespersen.
12. Copy and paste the .png file into your answer sheet.
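The two ideas in section E, the FASTA layout and the hydrophobicity plot, can be sketched together. This is a minimal illustration and not how TopPred itself works: we assume the Kyte-Doolittle hydropathy scale and a 19-residue window (TopPred may use a different scale and different settings), the function names are ours, and the toy record is made up rather than taken from GenBank.

```python
# Kyte-Doolittle hydropathy values (an assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records, ident, desc, chunks = [], None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                records.append((ident, desc, "".join(chunks)))
            # Identifier runs up to the first space; the rest is for humans.
            ident, _, desc = line[1:].partition(" ")
            chunks = []
        elif ident is not None:
            chunks.append(line.upper())  # sequence may span many lines
    if ident is not None:
        records.append((ident, desc, "".join(chunks)))
    return records


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window; raises KeyError on non-standard residues."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Made-up record: a hydrophobic stretch flanked by charged residues.
TOY_FASTA = """>demo|TOY0001| made-up protein [hypothetical organism]
DDDDDDDDDDLLLLLLLLLL
LLLLLLLLLLKKKKKKKKKK
"""

ident, desc, seq = parse_fasta(TOY_FASTA)[0]
profile = hydropathy_profile(seq)
print(ident, len(seq), round(max(profile), 2))
```

Plotting the profile against window position gives the kind of curve you see in the .png output, with peaks marking candidate membrane-spanning regions.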
Before you print your answer sheet to turn it in, be sure your name is on the top of the page. Also indicate how much time you spent on this tutorial.
For a more thorough tutorial, go to the Entrez Tutorial.