Genomics BIO 427
Instructors: Louise Temple and Jon Monroe, JMU Department of Biology

Copyright Jon Monroe and Louise Temple, 2/5/07

Exercise 5. Using BLAST to identify protein sequences

In this exercise you will learn how to use BLAST to identify protein sequences in various sets of organisms, and how to interpret BLAST results. In part A you will use an alpha-amylase sequence from Arabidopsis as an example. In part B you will use the same approaches to search for members of your gene family in various organisms. Later, these sequences will be aligned and used to generate phylogenetic trees.

A. Search for relatives of an Arabidopsis alpha-amylase using protein-protein BLAST (blastp)

Copy the FASTA-formatted sequence (including the first line, which begins with >gi|...) for amylase At4g25000 (NP_567714). Then go to the NCBI search page, select BLAST at the far right side of the header, select protein-protein BLAST (blastp), and paste the sequence into the box. Run the BLASTp search using all of the default parameters. Click on the Format! button and wait for the results. At peak times this may be slow.

The results page shows a color-coded graphic near the top indicating how close the best matches are to the query sequence, followed by a listing of the matches. The top three hits are all sequences of the same gene that you used as the query, so their E values are zero. The link on the left side of the first match (gi|...) takes you to the GenBank entry for this hit. The link on the right side under Score (Bits) takes you down the page to an alignment of the portion of each protein used in determining the score. To the right of some hits are links to entries in the UniGene database (U), Entrez Gene database (G), or Structure database (S). Click on both of the first two links mentioned above (gi|... and score) for the first hit to answer the following two questions.

1. How many amino acids are in the Arabidopsis AMY3 protein?
2. How many identities were found in the BLASTp comparison between the query and the subject?

The next two hits (as of 2/4/07) are to different entries in GenBank for the same protein. The first of these (the second hit) is an amino acid sequence deduced from a cDNA, and the third hit is a sequence deduced from the sequenced gene, in which a computer program was used to identify intron/exon boundaries. Look at both entries and both alignments.

3. How does the third hit differ from the first two in terms of numbers of amino acids?
4. Which of these latter two sequences do you think is more reliable? Explain why.

Lesson: Not every GenBank entry is accurate!

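Questions 1 and 3 both come down to counting residues in a sequence. If you would like to check a count yourself from the FASTA text you copied, the short Python sketch below shows one way to do it. The record in it is made up for illustration and is not a real GenBank entry, and the function name `read_fasta` is our own.

```python
def read_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must begin with a '>' header line")
    header = lines[0][1:]            # header line without the leading '>'
    sequence = "".join(lines[1:])    # sequence lines joined into one string
    return header, sequence


# Made-up record for illustration only; paste your own FASTA text here.
example = """>example|hypothetical protein
MKTAYIAKQR
QISFVKSHFS
"""

header, seq = read_fasta(example)
print(header)
print("number of amino acids:", len(seq))
```

Comparing the lengths of two entries for the same protein in this way is a quick first check on whether they really contain the same sequence.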
The fourth hit is to an alpha-amylase from Ipomoea nil (Japanese morning glory).

5. How many amino acids are in this protein?
6. How many amino acids were used by BLASTp? Notice that BLASTp did not use all of the amino acids...
7. What are the percent identities and similarities and the number of gaps for this BLASTp comparison?

Lesson: Because BLASTp reports local alignments and does not use entire protein sequences, you should not use these results in an alignment. BLAST results are primarily useful for finding related sequences. Because you searched the entire non-redundant database, and because the results are limited to the closest matches and are not sorted by type of organism, this type of search is not very useful if you want to find more distant matches or only those in a particular genome or group of related genomes.

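The numbers asked for in questions 5 to 7 are related by simple arithmetic. The Python sketch below illustrates this with invented counts rather than values from a real BLASTp report. The function name `alignment_summary` and its parameter names are hypothetical. It assumes, as in a BLASTp report, that the percentages are taken over the length of the alignment and that "Positives" is the count used for similarity.

```python
def alignment_summary(identities, positives, gaps, align_len,
                      query_start, query_end, query_len):
    """Summarize one BLASTp alignment from the counts in its header lines."""
    def pct(n):
        # Percentages are relative to the alignment length, not the protein.
        return round(100.0 * n / align_len, 1)

    # Number of query residues that fall inside the local alignment.
    query_span = query_end - query_start + 1
    return {
        "percent_identity": pct(identities),
        "percent_similarity": pct(positives),
        "percent_gaps": pct(gaps),
        "query_coverage": round(100.0 * query_span / query_len, 1),
    }


# Invented example: a 400-column alignment with 200 identities,
# 260 positives and 20 gaps, covering query residues 101-480 of a
# 500-residue protein.
summary = alignment_summary(200, 260, 20, 400, 101, 480, 500)
print(summary)
```

A query coverage well below 100% is the reason the lesson above warns against using the BLAST output itself as an alignment of the whole proteins.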
Search 2

Close the search results window and go back to the first BLASTp page, where you entered the AMY3 sequence. This time scroll down under Options and pull down the "or select from:" bar to reveal the various genomes or subsets of organisms that BLASTp can use in its search. Select Viridiplantae (green plants) and use the results to answer the following questions.

8. How many plant sequences were found? Many of these hits were seen in the first search.
9. Select an entry with a Bit score of about 40. This score is much lower than the Ipomoea nil score you looked at above and may indicate that the protein is not related to the Arabidopsis AMY3 sequence.
10. What are the percent identities and similarities for this BLASTp comparison?
11. How long is this sequence (click on the gi|... link)?
12. Despite the relatively high percent similarity, what effect does the length of some of these sequences have on the Bit score?

Lesson: The chance that two unrelated sequences share some identity becomes larger when the regions compared are short, so these results must be viewed with skepticism...

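One way to put a number on this skepticism is the E value, which estimates how many hits with at least a given score would be expected by chance. A commonly quoted approximation is E = m × n × 2^(−S'), where S' is the bit score, m is the query length, and n is the total length of the database. The Python sketch below applies that approximation. The query length and database size are invented, and BLAST itself uses corrected "effective" lengths, so do not expect these numbers to match a real report exactly.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


QUERY_LEN = 500            # hypothetical query length in residues
DB_LEN = 1_000_000_000     # hypothetical database size in residues

for score in (40, 100, 200):
    e_value = expected_chance_hits(score, QUERY_LEN, DB_LEN)
    print(f"bit score {score:>3}: E is roughly {e_value:.2e}")
```

With these made-up sizes, a bit score of 40 corresponds to an E value of a little under 0.5, meaning a hit that good is fairly likely to turn up by chance alone, whereas a bit score of 200 gives a vanishingly small E value. The same formula also shows why restricting the search to a smaller database lowers the E value for the same bit score.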
Search 3

Narrow the search further to just the plant RefSeq database (a pull-down menu in the first section of the BLAST page).

13. For which two plant species are there Reference Sequences in the database?

Look at the graphical representation at the top of this page and notice that there are about 12 close matches (red bars) and several poor matches with low scores (black bars). Look at some of these alignments. The scores are low because the matching regions are short and the identity is relatively low. Whether or not these poor matches represent evolutionarily related (homologous) sequences is difficult to know.

14. In which of these species do there appear to be more alpha-amylase genes?
15. The RefSeq collection is a good place to look for all of the members of a particular gene family (paralogs) in a sequenced genome. Is it a good place to look for related sequences (orthologs) from many species?
16. Find the names of the proteins in two of these weak matches. Do they appear to be alpha-amylases?

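The red and black bars in the graphic are simply bins of the bit score. The Python sketch below sorts a list of scores into those bins so you can count close and poor matches. The cut-offs are taken from the color key usually shown with the NCBI graphic and should be treated as an assumption. Check them against the key on your own results page. The list of scores is invented.

```python
from collections import Counter


def score_color(bit_score):
    """Map a bit score to a bar color (cut-offs assumed from the NCBI key)."""
    if bit_score >= 200:
        return "red"
    if bit_score >= 80:
        return "pink"
    if bit_score >= 50:
        return "green"
    if bit_score >= 40:
        return "blue"
    return "black"


# Invented bit scores for illustration only.
example_scores = [850, 412, 230, 95, 61, 44, 38, 35, 31]

tally = Counter(score_color(s) for s in example_scores)
for color in ("red", "pink", "green", "blue", "black"):
    print(f"{color:>5}: {tally[color]} hits")
```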
Search 4

Select metazoans (animals) as the target database and search again using the Arabidopsis AMY3 sequence. Notice that none of the hits have alignment scores of >200 (red bars in the graphic), indicating that these hits would not have been seen in your first search. Scroll down and notice all of the hits to various Drosophila species. Those fly gene sequencers have been busy! The closest animal match is to a sequence in the genome of Callosobruchus chinensis.

17. What is Callosobruchus chinensis? To find out, follow the links to the NCBI page for that sequence and look under source organism.
18. What type of animals do the top six hits come from? If you can't tell from the sequence page, click on the source organism link, then scroll down to the linkout options. There are some hits to species in other taxa but they are hard to find in this list.

Lesson: What you find in GenBank usually reflects the interests and activity of groups of people. Despite the importance of your favorite organism in the world, if no one has sequenced its genes you won't find them...

Search 5

Continue limiting your BLASTp searches to answer the following questions.

19. Which prokaryotes, Archaea or Bacteria, have alpha-amylases that match the plant amylase most closely, or are they about the same? Which measure did you use?
20. Assuming that alpha-amylases are similarly represented in Archaeal and Bacterial genomes, which one of these groups has been more extensively sequenced?
21. What are the best bit score and E value among the primate sequences? Despite these low scores, the names indicate that they are indeed alpha-amylases.

Search 6

Now copy the amino acid sequence of a primate salivary amylase and do a BLASTp search against other primate sequences.

22. How similar are the primate sequences to each other? Which measure did you use? What can explain this observation?

B. Use BLAST to learn about your gene family

Obtain the amino acid sequence associated with the member of your gene family for which you know the structure, and do BLASTp searches to identify a robust set of paralogs and orthologs. For the paralogs you will need to use a sequenced genome that contains a large number of homologous sequences. For the orthologs you will need to find the closest match in as many different genomes as possible. Look for sequences that are as distantly related as possible and avoid including pairs of closely related sequences (e.g. human and gorilla). Use what you learned above to design your search strategies.

Write a brief report describing your searches and what you found. In your report, identify the sequence you started with, then detail the number of hits you got with each search, including scores and source organisms. Which genome has the most paralogs? How many are there? Include the accession numbers from one genome for later use. List the species and accession numbers for the orthologs you identified.
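
If you record your hits in a simple table as you go, a few lines of Python can help with the bookkeeping for the report. The sketch below is only one possible approach. The accession numbers, species names and scores are placeholders, the function names are our own, and the score cut-off of 80 is an arbitrary choice for illustration. Keep in mind that the best-scoring hit in a genome is only a candidate ortholog, and a count of hits above a cut-off is only a rough estimate of the number of paralogs.

```python
from collections import defaultdict

# Placeholder records: (accession, source organism, bit score).
hits = [
    ("ACC_A1", "Species A", 310.0),
    ("ACC_A2", "Species A", 120.0),
    ("ACC_A3", "Species A", 42.0),
    ("ACC_B1", "Species B", 95.0),
    ("ACC_B2", "Species B", 88.0),
    ("ACC_C1", "Species C", 51.0),
]


def best_hit_per_species(records):
    """Keep the highest-scoring hit for each species (candidate orthologs)."""
    best = {}
    for accession, species, score in records:
        if species not in best or score > best[species][1]:
            best[species] = (accession, score)
    return best


def hits_per_species(records, min_score):
    """Count hits at or above min_score in each species (candidate paralogs)."""
    counts = defaultdict(int)
    for _, species, score in records:
        if score >= min_score:
            counts[species] += 1
    return dict(counts)


candidate_orthologs = best_hit_per_species(hits)
paralog_counts = hits_per_species(hits, min_score=80)  # arbitrary cut-off

for species, (accession, score) in candidate_orthologs.items():
    print(f"{species}: best hit {accession} (bit score {score})")
print("hits at or above the cut-off:", paralog_counts)
```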