Dr. Jonathan D. Monroe, Department of Biology, James Madison University, Harrisonburg, VA, monroejd@jmu.edu
Office of Health Policy and Clinical Outcomes, Jefferson Medical College, Thomas Jefferson University, Philadelphia, PA
Background Information
A. Introduction to DNA, Genes and Chromosomes
If you would like more background information than is provided here, please follow the link to a Primer on Molecular Genetics written by Denise Casey and read the Introductory sections, or look at Access Excellence's Graphics Gallery. Most of the links on this page are to images stored on the Access Excellence web site.
DNA (deoxyribonucleic acid) is usually composed of two very long helical polymers of nucleotides. The nucleotides are composed of a sugar (deoxyribose), phosphate, and one of 4 nitrogenous bases, adenine (A), cytosine (C), guanine (G), or thymine (T). The two polymers are linked to each other by hydrogen bonds to form a double helix. In a double helix A is always bonded to T and C is always bonded to G. These are the AT and CG base pairs. Generally speaking, the only thing that makes DNA from a human different from the DNA from a dog, an oak tree, a fungus or a bacterium is the order of the four nucleotides.
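The pairing rule above means that one strand fully determines the other. The short Python sketch below illustrates this; the function names are chosen only for this example, and it assumes the input contains nothing but the letters a, c, g and t. Because the two strands of a double helix run in opposite directions, the partner strand is conventionally written reversed, which is what `reverse_complement` shows.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
PAIRS = {"a": "t", "t": "a", "c": "g", "g": "c"}


def complement(strand: str) -> str:
    """Return the pairing partner of each nucleotide, in the same order."""
    return "".join(PAIRS[base] for base in strand.lower())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction."""
    return complement(strand)[::-1]


top = "cggctggacg"  # the first ten letters of the example sequence in this section
print(complement(top))          # gccgacctgc
print(reverse_complement(top))  # cgtccagccg
```

Applying `complement` twice returns the original strand, which is another way of saying that each strand carries the same information.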
DNA can be illustrated in a variety of ways including a simple line with various types of labels on it like a map (the letters refer to sites where enzymes can cut the DNA),
or as a list of letters representing one strand of the double helix,
cggctggacg tgatgatgga gactgagaac cgcctccact tcacgatcaa agatccagct
aacaggcgct acgaggtgcc cttggagacc ccgcgtgtcc acagccgggc accgtcccca
ctctacagcg tggagttctc cgaggagccc ttcggggtga tcgtgcaccg gcagctggac
ggccgcgtgc tgctgaacac gacggtggcg cccctgttct ttgcggacca gttccttcag
ctgtccacct cgctgccctc gcagtatatc acaggcctcg ccgagcacct cagtcccctg
atgctcagca ccagctggac caggatcacc ctgtggaacc gggaccttgc gcccacgccc
or as a three dimensional model of sticks representing atoms and bonds. In this model carbon is gray, nitrogen is blue, oxygen is red, and phosphorus is orange.

These illustrations each provide different information used for different reasons but each one represents DNA. In this tutorial you will mostly see lists of letters.
B. What is a gene?
Genes are regions of DNA that serve some sort of function, usually via the proteins they encode. Genes were initially defined by their function: a gene could be identified when a mutation in it caused an observable phenotype. Now we can locate genes on chromosomes simply by their sequence. Deciphering the function of a gene identified in this way involves comparing its sequence to other known genes and observing the phenotype of the organism in which that gene has been disrupted.
When a gene is "expressed", the sequence of nucleotides in the DNA is used to determine the sequence of amino acids in a protein in a 2-step process. First, in a process called transcription, the enzyme RNA polymerase uses one strand of the DNA as a template to synthesize a complementary strand of messenger RNA (ribonucleic acid), called mRNA. RNA is chemically very similar to DNA except that in RNA T is replaced with U (for uracil), and the sugar in each RNA nucleotide carries a hydroxyl group at a position (the 2' carbon) where the sugar in DNA does not (hence DNA is "deoxy"). Also, unlike DNA, RNA usually exists as a single-stranded molecule. Second, in a process called translation, ribosomes use the mRNA for a particular gene as a template to synthesize the protein. The sequence of nucleotides in a gene determines the sequence of nucleotides in the mRNA, and the sequence of nucleotides in the mRNA determines the sequence of amino acids in the protein. Each amino acid is represented by a codon composed of three nucleotides in the DNA or RNA. This flow of information from DNA to RNA to protein is called the Central Dogma of molecular biology.
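The two steps can be sketched in a few lines of Python. This is a simplified illustration, not a model of the real enzymes: the 15-letter template strand is invented for the example, it is assumed to be written in the 3'-to-5' direction so that the mRNA can be built base by base, and translation is assumed to start at the first nucleotide and stop at the first stop codon. The codon table is the standard genetic code, written in the usual compact U, C, A, G order.

```python
from itertools import product

# Standard genetic code: codons enumerated in U, C, A, G order, "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# DNA template base -> complementary RNA base (U is used in place of T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA complementary to a DNA template strand written 3'-to-5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


def translate(mrna: str) -> str:
    """Read the mRNA three nucleotides at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


template = "TACCGACCTGCAATT"  # invented example sequence, not a real gene
mrna = transcribe(template)
print(mrna)             # AUGGCUGGACGUUAA
print(translate(mrna))  # MAGR
```

The output uses one-letter amino acid codes, so `MAGR` stands for methionine, alanine, glycine and arginine.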
Genes are usually thousands of nucleotides long and consist of two parts: a promoter that helps to determine when, where, and how much the gene is expressed, and a coding region that determines what the protein is. The coding regions of genes from eukaryotes (organisms composed of cells with nuclei, e.g. plants, animals and fungi) are usually divided into 2 or more exons separated by introns. After a gene is transcribed, the introns are removed from the mRNA and the adjacent exons are spliced together in the nucleus prior to translation in the cytosol (outside the nucleus). These concepts are important for interpreting genomic information because any given human gene may contain many long introns and be spread out over a very long region of DNA.
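Splicing can be pictured as cutting out the introns and joining what remains. The sketch below assumes the exon positions are already known and given as (start, end) pairs; the toy transcript is invented for illustration, with exons in upper case and introns in lower case. In a real cell the splicing machinery has to recognize the exon boundaries from signals in the sequence, which this example does not attempt.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments, given as 0-based (start, end) pairs with end excluded."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript (invented): three exons separated by two introns.
pre_mrna = "AUGGCU" + "guaagucuuag" + "GGACGU" + "guaccuag" + "UAA"
exons = [(0, 6), (17, 23), (31, 34)]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)                      # AUGGCUGGACGUUAA
print(len(pre_mrna), len(mature_mrna))  # 34 15
```

Even in this tiny example more than half of the transcript is intron, which hints at why a gene can occupy far more DNA than its coding sequence alone would need.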
C. Genes, ESTs, and genomes.
For many years after DNA sequencing became commonplace but before anyone tackled a whole genome, most of the sequences that were deposited in Internet databases such as GenBank were full length (or nearly full length) genes that had a known function. These sequenced genes are of two types depending on how they were obtained. If someone isolates and sequences a fragment of a genome containing a gene of interest (GOI) it is called a genomic clone. Coding sequences in these clones will contain introns and the clones may also contain sequences adjacent to the GOI. For genomes that contain long, seemingly random stretches of DNA between the genes, finding a particular gene among the many fragments of a genome can be quite a challenge - like finding the proverbial needle in a haystack.
With the discovery of an enzyme called reverse transcriptase (RT) that uses an mRNA template to synthesize a strand of DNA (the reverse of transcription) a new approach was possible. Instead of starting with genomic DNA one could isolate mRNA from a tissue in which the GOI was known to be expressed. The pool of mRNAs could then be converted to DNA (called cDNA because it is complementary to mRNA) using RT. A cDNA from the GOI could then be isolated and sequenced. These cDNA clones do not contain introns because the introns had already been spliced out of the mature mRNA.
In the early 1990s the development of automated sequencers and computer programs capable of analyzing lots of DNA made possible two new approaches to obtaining sequence information. In the first, instead of going specifically after a gene of interest, people created rich cDNA libraries (containing many of the expressed genes of an organism), picked cDNA clones randomly, and rapidly determined some of the sequence of nucleotides from the end of each clone. These expressed sequence tags or ESTs could then be compared to all known sequences using a program called BLAST. An exact match to a sequenced gene meant that the gene encoding that EST was already known. If the match was close but not exact one could conclude that the EST was derived from a gene with a function similar to that of the known gene. These are highly valuable ESTs. Interestingly, many of the ESTs were completely unlike any known gene! The EST sequences with their putative identification were then deposited in GenBank and the clones from which they were derived were kept in a freezer for later use if someone wants them. The combined number of ESTs from all organisms now in the NCBI's database of ESTs or dbEST (as of August 2002) is over 13.4 million!
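The three outcomes described above (exact match, close match, no match) can be illustrated with a very simple comparison. This sketch is not how BLAST works: BLAST searches for local alignments and scores them statistically, whereas the code below only counts identical letters at the same positions in two sequences of equal length. The function names and the 80% cutoff are assumptions made for this example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if not a or len(a) != len(b):
        raise ValueError("this sketch needs two non-empty sequences of equal length")
    matches = sum(x == y for x, y in zip(a.lower(), b.lower()))
    return 100.0 * matches / len(a)


def classify(est: str, known: str, similar_cutoff: float = 80.0) -> str:
    """Sort a comparison into the three outcomes; the cutoff is an arbitrary assumption."""
    identity = percent_identity(est, known)
    if identity == 100.0:
        return "exact match: gene already known"
    if identity >= similar_cutoff:
        return "close match: possibly a gene with a similar function"
    return "no clear match"


known = "tgatgatggagactgagaac"      # twenty letters taken from the sequence in part A
est_close = "tgatcatggagactgtgaac"  # the same letters with two changes
est_unlike = "c" * 20               # an artificial, unrelated sequence

print(percent_identity(est_close, known))  # 90.0
print(classify(known, known))
print(classify(est_close, known))
print(classify(est_unlike, known))
```

A real search must also handle sequences of different lengths, insertions and deletions, and a database of millions of entries, which is why dedicated programs are used.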
Lastly, faster automated sequencers and faster computers have now made it possible to sequence entire genomes. The first genome sequenced was that of Haemophilus influenzae, reported in 1995. This bacterial genome contains 1,830,137 base pairs or 1.83 mega base pairs (Mbp) of DNA and 1738 genes. Other genomes soon followed and now an entire bacterial genome can be sequenced in a matter of weeks or days. Links to dozens of completed microbial genomes can be found in the TIGR Comprehensive Microbial Resource. The first eukaryotic genome sequenced was that of yeast (Saccharomyces cerevisiae), completed in 1996. The yeast genome contains 12.5 Mbp of DNA and approximately 5,800 genes. The first plant genome to be completely sequenced was that of Arabidopsis thaliana in 2000 (~140 Mbp and ~25,000 genes). In 2001 the nearly completed human genome, containing about 3,160 Mbp (3.16 billion base pairs) and perhaps 35,000 genes, was announced by Celera and the Human Genome Project.
Proceed to Tutorial Part 1 - Finding known genes
10/9/07 Copyright (C) 2007, Jonathan Monroe, monroejd@jmu.edu. All rights reserved.
URL: http://csm.jmu.edu/biology/monroejd/amcp/genome2.html