Genomics
BIO 427
Instructors:
Louise Temple and  Jon Monroe
JMU Department of Biology

Copyright Jon Monroe and Louise Temple, 1/24/06
Exercise 2.   Repetitive DNA and Sequencing Genomes

Sequencing entire genomes is accomplished by either or both of two strategies: whole-genome shotgun sequencing and clone-by-clone sequencing.  In the former, the entire genome is sheared into 0.5-20 kb fragments which are then cloned into vectors for end-sequencing.  In the latter, the genome is initially broken into ~100 kb fragments which are first mapped onto the genome and then sequenced.  In both approaches overlapping regions of the individual sequences are assembled into contiguous sequences and then gaps are filled and ambiguities resolved by resequencing subclones or PCR products in a process called finishing. 
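The assembly step described above can be sketched in a few lines of Python. This is only a toy illustration, not the algorithm used by any real assembler: the two reads are invented, and the function names and the minimum overlap of 3 bases are assumptions made for the example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one contiguous sequence if they overlap, else return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep the shared bases once, then append the rest of the right-hand read.
    return left + right[size:]


# Two invented reads that share the bases TACCT.
read_a = "ATGCGTACCT"
read_b = "TACCTGGAAC"
contig = merge_reads(read_a, read_b)
print(contig)
```

Real assemblers must also handle sequencing errors, reverse-complement reads, and millions of reads at once, none of which this sketch attempts.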

Any sequencing project will encounter various problems that make assembly or finishing difficult in certain regions of the genome.  For eukaryotic genomes one significant problem is the existence of repetitive DNA.  The problem is simple.  If two or more sequences from randomly isolated clones are identical, one must determine if they came from one locus or several different repetitive loci.  In this exercise you will explore several types of repetitive DNA in genomes.  Make use of the NCBI Genome Glossary if terms are unfamiliar.
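To see why repeats are a problem, consider the short Python sketch below. It builds an invented toy "genome" containing three copies of a made-up repeat and asks where a read could have come from. The sequences and the helper name `find_all` are assumptions for illustration only.

```python
from typing import List


def find_all(genome: str, read: str) -> List[int]:
    """Return every start position at which `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Invented sequences: one repeat unit inserted at three loci of a toy genome.
REPEAT = "GGCCGGGCGC"
genome = "ATTACA" + REPEAT + "TTGACCA" + REPEAT + "CATGGT" + REPEAT + "AAT"

unique_read = "TTGACCA"   # lies entirely in single-copy DNA
repeat_read = REPEAT      # lies entirely inside the repeat

print("unique read matches at:", find_all(genome, unique_read))
print("repeat read matches at:", find_all(genome, repeat_read))
```

The unique read has a single possible origin, while the repeat read matches several loci equally well, so the sequence alone cannot tell you where it belongs.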
A.  What types of repetitive DNA are found in the human genome?

Repetitive DNA makes up a very large part of large eukaryotic genomes.  One simple way to view it is pictured on the right.  While 25% of the human genome is composed of genes (exons + introns), nearly all of that DNA is introns that are spliced out of pre-mRNAs.  Of the remaining DNA, most is repetitive: 45% of the genome is composed of transposons and 8% consists of simple repeats or large duplications.

Repetitive DNA is also characterized by size and abundance.  There are four types of transposable elements comprising almost half of the human genome.  These range in size from about 300 bp to over 10 kb and are repeated hundreds of thousands of times throughout the genome (data from Lewin B., Essential Genes, 2006).

Element                                Length (kb)   Human number   Genome fraction
Retroviruses/Retroposons               1-11          450,000        8%
LINEs (long interspersed elements)     6-8           850,000        17%
SINEs (short interspersed elements)    ~0.3          1,500,000      15%
Transposons                            2-3           300,000        3%
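A quick calculation helps put the numbers in the table in perspective. The Python sketch below adds up the genome fractions and estimates the mean length of a copy of each element type. The haploid genome size of about 3.2 billion bp is an assumption that does not come from the table, and the function name is invented for this example.

```python
GENOME_BP = 3_200_000_000  # ASSUMED haploid human genome size; not taken from the table

# (element, full length range in kb, copy number, percent of genome), from the table
ELEMENTS = [
    ("Retroviruses/Retroposons", "1-11", 450_000, 8),
    ("LINEs", "6-8", 850_000, 17),
    ("SINEs", "~0.3", 1_500_000, 15),
    ("Transposons", "2-3", 300_000, 3),
]


def implied_mean_length_bp(copies, percent, genome_bp=GENOME_BP):
    """Mean copy length implied by a copy number and a genome fraction."""
    return genome_bp * percent / 100 / copies


total_percent = sum(percent for _, _, _, percent in ELEMENTS)
print(f"All four types together: {total_percent}% of the genome")

for name, full_kb, copies, percent in ELEMENTS:
    mean_bp = implied_mean_length_bp(copies, percent)
    print(f"{name}: full length {full_kb} kb, implied mean copy about {mean_bp:.0f} bp")
```

Under this assumption the SINE estimate comes out near the listed ~0.3 kb, while the estimates for the other three types fall well below their full lengths, which suggests that many of those copies are incomplete.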

Many of these elements are remnants of virus-like sequences that once hopped around our genome.  Of these four groups of repetitive sequences, all but the SINEs carry genes, such as transposase, that are responsible for this hopping behavior.  Most copies, however, have suffered deletions that damaged the genes required for transposition, so they are no longer mobile.

The most abundant SINE family is the Alu repeats, with over 1 million copies comprising 10% of the genome.  Alu repeats are named for AluI, the restriction enzyme that cuts them and that was used in their identification.



B.  Exploring the human genome for Alu repeats, LINEs and SINEs.

First search for some normal, functional genes to see how abundant they are.  Go to the main Entrez search page, then click on the Map Viewer section of the top navigation bar.  To search just the human genome, pull down the Select Group or Genome bar to Homo sapiens, then type "amylase" in the box to the right and press Go.  You will then see a map of all of the human chromosomes with several red marks indicating the locations of amylase genes.

1. How many amylase genes are there and on which chromosome are they clustered?  Notice that each gene is listed twice, once from the reference sequences and once from the Celera sequences.

On the same page search for "hemoglobin".

2.  On how many chromosomes are hemoglobin genes found?

Now search for "Alu" and watch for the red marks.  Notice that just under the map it indicates that only the first 1500 hits out of 1,146,042 found are shown!

3.  Is there a chromosome without Alu repeats?

Below the map is a long table with lines reading:      Alu    7SLRNA      REPEAT     Repeats
Click on the first Repeat link to see a map of the region in which this Alu repeat is located.
 
4.  On which arm (long or short) of which chromosome is it located?
(Use the chromosome numbers across the top and the Ideogram map on the left.)

This default view is the highest resolution possible in this viewer.  The vertical line represents only about 700 bp.  Any other identified regions are listed to the right.  Your Alu element is in the middle.  To step back and see a larger view of this chromosome use the zoom box on the left. Start with the smallest box at the bottom.  As you mouse over the bottom bar it will indicate:
show 1/10,000th of chromosome
Now you will see a map of 19,200 bp centered on the Alu repeat. 
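The zoom setting also gives a rough chromosome length, as this short Python calculation shows. It uses only the numbers quoted above and assumes the viewer's "1/10,000th" is exact, so treat the result as an estimate.

```python
WINDOW_BP = 19_200            # width of the zoomed-out view, from the text
WINDOWS_PER_CHROMOSOME = 10_000  # "show 1/10,000th of chromosome"
DEFAULT_VIEW_BP = 700         # approximate width of the default view, from the text

# If 19,200 bp is 1/10,000th of the chromosome, the whole chromosome is:
chromosome_bp = WINDOW_BP * WINDOWS_PER_CHROMOSOME
print(f"Estimated chromosome length: {chromosome_bp / 1e6:.0f} Mb")

# How much wider is this view than the default one?
zoom_out_factor = WINDOW_BP / DEFAULT_VIEW_BP
print(f"This view is about {zoom_out_factor:.0f} times wider than the default view")
```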

5.  What other types of repetitive DNA are found in this region?

Below the Alu repeat are two elements labelled (TA)n  Simple_repeat.  Follow the line from the lower (TA)n repeat to the gray box on the chromosome with your mouse, then click once to pop up a more detailed Map Viewer zoom box.  Click on Show Sequence at the bottom.

6. Describe this sequence element.

Look at the (TG)n Simple_repeat element above the Alu repeat.

7.  Describe this sequence element.

These two sequences are quite short and are likely to be found within a single sequence read of ~400 bp, but longer repeats will cause problems in the assembly phase of genome sequencing.
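Simple repeats such as (TA)n and (TG)n are easy to find by computer. The Python sketch below is a hedged illustration using an invented test sequence, not the method the Map Viewer uses to annotate repeats. The function name and the minimum of 4 copies are assumptions. It reports each run of a repeat unit and checks whether the run would fit inside a single ~400 bp read.

```python
import re

READ_LENGTH_BP = 400  # approximate length of one sequence read, from the text


def find_simple_repeats(sequence, unit, min_copies=4):
    """Return (start, length) for each run of at least `min_copies` tandem copies of `unit`."""
    pattern = re.compile("(?:" + re.escape(unit) + "){" + str(min_copies) + ",}")
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(sequence.upper())]


# Invented sequence containing 12 copies of TA and 9 copies of TG.
sequence = "GATTACA" + "TA" * 12 + "CCGGA" + "TG" * 9 + "ACGT"

for unit in ("TA", "TG"):
    for start, length in find_simple_repeats(sequence, unit):
        fits = "fits" if length < READ_LENGTH_BP else "does not fit"
        print(f"({unit})n run at position {start}, {length} bp, {fits} within one read")
```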



C.  Repetitive Elements in Bacterial Genomes.

Repeated sequences, including transposons, can wreak havoc in a genome.  We don’t really know the effect of the high number of repeats in the human genome, but we can look at bacterial genomes to see some dramatic effects of so-called “Insertion Sequence” (IS) elements.  Below are a table, a figure, and two paragraphs from a paper comparing the chromosomal sequences of three closely related species of Bordetella, which are pathogens of mammals.  In the figure, the chromosomes are shown as horizontal lines, with red bands between them where there is a high degree of nucleotide similarity (over 85% identical).  Between the bottom two, you see that most of the genome is organized identically, although certain regions are rearranged and there is some difference in the overall size.  However, comparing the top two lines (B. pertussis, the human pathogen, and B. bronchiseptica, the dog and pig pathogen), you can see that B. pertussis is quite different.  Read the two paragraphs about these observations and look at the table as well.  (The entire article is posted on Blackboard and linked below.)

8.  In the table, how many “pseudogenes” are present in each of the three genomes? 

9.  How many IS elements are present in each of the three genomes?

10.  What is the mechanism by which the authors believe the differences between the top two genomes were derived?

11.  It turns out that B. pertussis, in contrast to the other two Bordetella species in this paper and others not shown here, exists in a very narrow ecological niche:  the human ciliated tracheal epithelium.  Based on what you see and read here, can you suggest a relationship between this observation and the presence of the numerous IS elements?


From Parkhill et al., Nature Genetics  35, 32 - 40 (2003), "Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica."  PDF and full text

"Genomic comparison of B. bronchiseptica and B. pertussis (Fig. 2) showed that rearrangements and deletions in B. pertussis were similar to those in B. parapertussis but more extreme, with short blocks of almost perfect conservation broken up by nearly 150 individual rearrangements, 88% of which are bounded by ISEs (primarily IS481). Again, the rearrangements do not follow the usual reciprocal pattern around the origin or terminus of replication. The large degree of rearrangement of overall genome structure is unprecedented for congeneric bacteria.


"The B. pertussis genome is considerably smaller than that of B. bronchiseptica, and, although individual events are difficult to trace, much of this loss is probably due to ISE-mediated deletion events. At some point in its recent evolution, B. pertussis seems to have undergone a massive expansion of one family of ISEs (IS481; Table 1), and subsequent recombination between these perfect DNA repeats caused a large amount of rearrangement and deletion in the chromosome. Comparison of the genetic maps of several B. pertussis strains showed that inversions of large sections of the genome are frequent, presumably owing to such recombination12. Such ISE expansions have previously been seen in rarely recombining organisms whose effective population size was greatly reduced by an evolutionary bottleneck13. In any population, ISEs transpose to novel sites at a certain rate, but most of these novel insertions are lethal or carry a selective disadvantage, and bacteria carrying them are therefore competed out of the population. When the population size is very small, however, the degree of intraspecies competition is reduced, and bacteria carrying such non-lethal mutations are less likely to be competed out, leading to an increase in ISE accumulation in the population."  (see the original paper for the citations)

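The effect of recombination between repeated IS copies can be sketched with a toy model in Python. This is not taken from the paper. It relies on the general rule that recombination between two copies in the same orientation deletes the DNA between them, while recombination between copies in opposite orientations inverts it. The block names and the function are invented for illustration.

```python
def recombine(chromosome, i, j):
    """Recombine the two IS copies at indexes i < j of a list of (name, strand) blocks."""
    (_, strand_i), (_, strand_j) = chromosome[i], chromosome[j]
    if strand_i == strand_j:
        # Direct repeats: the blocks in between, plus one IS copy, are deleted.
        return chromosome[: i + 1] + chromosome[j + 1 :]
    # Inverted repeats: the blocks in between are reversed and change strand.
    between = chromosome[i + 1 : j]
    flipped = [(name, "-" if strand == "+" else "+") for name, strand in reversed(between)]
    return chromosome[: i + 1] + flipped + chromosome[j:]


# Invented toy chromosomes: four gene blocks A-D and two IS copies.
direct = [("A", "+"), ("IS", "+"), ("B", "+"), ("C", "+"), ("IS", "+"), ("D", "+")]
inverted = [("A", "+"), ("IS", "+"), ("B", "+"), ("C", "+"), ("IS", "-"), ("D", "+")]

print("deletion: ", recombine(direct, 1, 4))
print("inversion:", recombine(inverted, 1, 4))
```

With many IS copies scattered around a chromosome, repeated events of both kinds could produce the sort of shuffling and shrinkage described in the quoted paragraphs.
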
Table 1



Figure 2.  Linear genomic comparison of B. pertussis, B. bronchiseptica and B. parapertussis.  The gray bars represent the forward and reverse strands. Top, B. pertussis. Black triangles represent ISEs. Center, B. bronchiseptica. Pink boxes represent prophage. Bottom, B. parapertussis. Black triangles represent ISEs. The red lines between the genomes represent DNA:DNA similarities (BLASTN matches) between the two sequences.




Go back to the Map Viewer page, find the genome you will work with all semester and start exploring!