Genomics
BIO 427
Instructors:
Louise Temple and Jon Monroe
JMU Department of Biology

Copyright Jon Monroe and Louise Temple, 2/5/07

Exercise 5.   Using BLAST to identify protein sequences


In this exercise you will learn how to use BLAST to identify protein sequences in various sets of organisms, and how to interpret BLAST results.  In part A you will use an alpha-amylase sequence from Arabidopsis as an example.  In part B you will use the same approaches to search for members of your gene family in various organisms.  Later, these sequences will be aligned and used to generate phylogenetic trees.
A.  Search for relatives of an Arabidopsis alpha-amylase using protein-protein BLAST (blastp)

Copy the FASTA-formatted sequence (including the first line, which begins with >gi|...) for amylase At4g25000 (NP_567714), then go to the NCBI search page, select BLAST at the far right side of the header, select protein-protein BLAST (blastp), and paste the sequence into the box.  Run the BLASTp search using all of the default parameters.  Click on the Format! button and wait for the results.  At peak times this may be slow.
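
For reference, a FASTA record is simply a header line beginning with ">" followed by one or more lines of sequence.  The short Python sketch below shows one way to separate the two parts and count the amino acids.  The record in it is made up for illustration (the gi number, accession, and residues are placeholders, not a real GenBank entry), and the function name parse_fasta is our own.

```python
def parse_fasta(text):
    """Split a single FASTA record into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with a '>' header line")
    header = lines[0][1:]            # drop the leading '>'
    sequence = "".join(lines[1:])    # sequence lines are simply concatenated
    return header, sequence


# Made-up record for illustration only; NOT a real GenBank entry.
example = """>gi|0000000|ref|XX_000000.0| hypothetical example protein
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
"""

header, sequence = parse_fasta(example)
print(header)
print(len(sequence), "amino acids")
```

If the header line is left out when pasting, BLAST still sees a raw sequence, but the results will not carry the descriptive name.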

The results page shows a color-coded graphic near the top indicating how close the best matches are to the query sequence, followed by a listing of the matches.  The top three hits are all sequences of the same gene that you used in the query, so their E values are zero.  The link on the left side of the first match (gi|...) takes you to the GenBank entry for this hit.  The link on the right side under Score (Bits) takes you down the page to an alignment of the portion of each protein that was used in determining the score.  To the right of some hits are links to entries in the UniGene database (U), Entrez Gene database (G), or Structure database (S).  Click on both of the first two links mentioned above (gi|... and Score) for the first hit to answer the following two questions.

1.  How many amino acids are in the Arabidopsis AMY3 protein?

2.  How many identities were found in the BLASTp comparison between the query and the subject?


The next two hits (as of 2/4/07) are to different entries in GenBank for the same protein.  The first of these (the second hit) is an amino acid sequence deduced from a cDNA, and the other (the third hit) is a sequence deduced from the sequenced gene, in which a computer program was used to identify intron/exon boundaries.  Look at both entries and both alignments.

3.  How does the third hit differ from the first two in terms of numbers of amino acids?

4.  Which of these latter two sequences do you think has the more reliable sequence?  Explain why. 

Lesson:  Not every GenBank entry is accurate!


The fourth hit is to an alpha-amylase from Ipomoea nil (Japanese morning glory).

5.  How many amino acids are in this protein?

6.  How many amino acids were used by BLASTp?  Notice that BLASTp did not use all of the amino acids...

7.  What are the percent identities and similarities and the number of gaps for this BLASTp comparison?

Lesson:  Because BLASTp does not use entire protein sequences, you should not use these results as an alignment.  BLAST results are primarily useful for finding related sequences.  Because you searched the entire non-redundant database, and because the results are limited to the closest matches and are not sorted by type of organism, this type of search is not very useful if you want to find more distant matches or only those in a particular genome or group of related genomes.
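
The identity and gap percentages that BLASTp reports are calculated over the aligned region only, not over the whole protein.  The sketch below reproduces that arithmetic for a made-up pair of aligned strings.  The sequences, the full query length of 30, and the function name alignment_stats are all hypothetical.  Percent similarity ("Positives") is left out because it depends on a scoring matrix.

```python
def alignment_stats(query_aln, subject_aln, query_length):
    """Summarize a pairwise alignment given as two equal-length gapped strings."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    columns = len(query_aln)
    pairs = list(zip(query_aln, subject_aln))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    query_residues = sum(1 for q in query_aln if q != "-")
    return {
        # BLAST divides by the number of alignment columns, gaps included.
        "percent_identity": 100 * identities / columns,
        "percent_gaps": 100 * gaps / columns,
        # Fraction of the full-length query that took part in the alignment.
        "query_coverage": 100 * query_residues / query_length,
    }


# Made-up alignment for illustration only.
stats = alignment_stats("MKT-AYIAKQ", "MRTGAYLA-Q", query_length=30)
print(stats)
```

A query coverage well below 100% is the reason a BLAST hit should be treated as a pointer to a related sequence rather than as a finished alignment.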


Search 2


Close the search results window and go back to the first BLASTp page, where you pasted the AMY3 sequence.  This time scroll down under Options and pull down the "or select from:" bar to reveal the various genomes or subsets of organisms that BLASTp can use in its search.  Select Viridiplantae (true plants) and use the results to answer the following questions.

8.  How many plant sequences were found?  Many of these hits were seen in the first search.

9.  Select an entry with a Bit score of about 40.  This score is much lower than the Ipomoea nil score you looked at above and may indicate that the protein is not related to the Arabidopsis AMY3 sequence.

10.  What are the percent identities and similarities for this BLASTp comparison?

11.  How long is this sequence (click on the gi|... link)?

12.  Despite the relatively high percent similarity, what effect does the length of some of these sequences have on the Bit score?

Lesson:  The chance that two unrelated sequences share some identity becomes larger when the regions compared are short, so these results must be viewed with skepticism...
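
To see why a Bit score of about 40 deserves skepticism, recall the standard relationship between a bit score S' and the expected number of chance hits: E = m x n / 2^S', where m is the query length and n is the total length of the database.  The sketch below evaluates it for two scores.  The lengths are round numbers chosen for illustration, not the values NCBI actually uses (BLAST applies corrected "effective" lengths), and the function name expected_hits is our own.

```python
def expected_hits(bit_score, query_length, database_length):
    """E value implied by a bit score: E = m * n / 2**S'."""
    return query_length * database_length / 2.0 ** bit_score


# Hypothetical search space: a 400-residue query against 1e9 database residues.
m, n = 400, 1e9

e_weak = expected_hits(40, m, n)     # a hit like the one in question 9
e_strong = expected_hits(200, m, n)  # a "red bar" hit

print(f"bit score  40 -> E = {e_weak:.2g}")
print(f"bit score 200 -> E = {e_strong:.2g}")
```

Each additional bit halves the E value.  A short matching region has few aligned positions with which to accumulate bits, so even a high percent similarity over a short stretch yields a score that chance alone could plausibly produce.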


Search 3


Narrow the search further to just the plant RefSeq database (a pull-down menu in the first section of the BLAST page).

13.  For which two plant species are there Reference Sequences in the database?

14.  In which of these species do there appear to be more alpha-amylase genes?

15.  The RefSeq collection is a good place to look for all of the members of a particular gene family (paralogs) in a sequenced genome.  Is it a good place to look for related sequences (orthologs) from many species?

Look at the graphical representation at the top of this page and notice that there are about 12 close matches (red bars) and several poor matches with low scores (black bars).  Look at some of these alignments.  The scores are low because the matching regions are short and the identity is relatively low.  Whether or not these poor matches represent evolutionarily related (homologous) sequences is difficult to know.

16.  Find the names of the proteins in two of these weak matches.  Do they appear to be alpha-amylases?  


Search 4

Select metazoans (animals) as the target database and search again using the Arabidopsis AMY3 sequence.  Notice that none of the hits has an alignment score of >200 (red bars in the graphic), indicating that these hits would not have been seen in your first search.  Scroll down and notice all of the hits to various Drosophila species.  Those fly gene sequencers have been busy!  The closest animal match is to a sequence in the genome of Callosobruchus chinensis.

17.  What is Callosobruchus chinensis?  To find out, follow the links to the NCBI page for that sequence and look under source organism.

18.  What type of animals account for the top six hits?  If you can't tell from the sequence page, click on the source organism link, then scroll down to the linkout options.  There are some hits to species in other taxa but they are hard to find in this list.

Lesson:  What you find in GenBank usually reflects the interests and activity of groups of people.  Despite the importance of your favorite organism in the world, if no one has sequenced its genes you won't find them...


Search 5

Continue limiting your BLASTp searches to answer the following questions.

19.  Which group of prokaryotes, Archaea or Bacteria, has alpha-amylases that match the plant amylase more closely, or are they about the same?  Which measure did you use?

20.  Assuming that alpha-amylases are similarly represented in Archaeal and Bacterial genomes, which one of these groups has been more extensively sequenced?

21.  What are the best bit score and E value among the primate sequences?  Despite these low scores, the names indicate that they are indeed alpha-amylases.


Search 6

Now copy the amino acid sequence of a primate salivary amylase and do a BLASTp search against other primate sequences.

22.  How similar are the primate sequences to each other?  Which measure did you use?  What can explain this observation?


 
B.  Use BLAST to learn about your gene family

Obtain the amino acid sequence associated with the member of your gene family for which you know the structure and do BLASTp searches to identify a robust set of paralogs and orthologs.  For the paralogs you will need to use a sequenced genome that contains a large number of homologous sequences.  For the orthologs you will need to find the closest match in as many different genomes as possible.  Look for sequences that are as distantly related as possible and avoid including pairs of closely related sequences (e.g. human and gorilla).  Use what you learned above to design your search strategies.

Write a brief report describing your searches and what you found.  In your report, identify the sequence you started with, then detail the number of hits you got with each search, including scores and source organisms.  Which genome has the most paralogs?  How many are there?  Include the accession numbers from one genome for later use.  List the species and accession numbers for the orthologs you identified.
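
If you record your hits in a simple list, the tallies for the report can be produced with a few lines of Python.  The sketch below is one possible way to do it.  Every accession, species name, and score in it is a placeholder, the cutoff of 100 bits is an arbitrary example rather than a recommended threshold, and the function names are our own.

```python
from collections import Counter

# Placeholder records: (accession, source organism, bit score).
# These are NOT real GenBank entries; substitute your own results.
hits = [
    ("ACC_0001", "Species A", 310.0),
    ("ACC_0002", "Species A", 255.0),
    ("ACC_0003", "Species A", 41.0),
    ("ACC_0004", "Species B", 198.0),
    ("ACC_0005", "Species B", 120.0),
    ("ACC_0006", "Species C", 88.0),
]


def hits_per_genome(records, min_bit_score):
    """Count hits per organism that score at or above the chosen cutoff."""
    return Counter(org for _, org, score in records if score >= min_bit_score)


def best_hit_per_genome(records):
    """Keep the highest-scoring hit from each organism (a candidate ortholog)."""
    best = {}
    for accession, org, score in records:
        if org not in best or score > best[org][1]:
            best[org] = (accession, score)
    return best


counts = hits_per_genome(hits, min_bit_score=100)   # arbitrary example cutoff
best = best_hit_per_genome(hits)

print(counts.most_common())
for org, (accession, score) in sorted(best.items()):
    print(org, accession, score)
```

The best hit in a genome is only a candidate ortholog, so check the alignment and the protein name before including it in your list.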