Group Title: BMC Evolutionary Biology
Title: Phylogenomic approaches to common problems encountered in the analysis of low copy repeats : The sulfotransferase 1A gene family example
 Material Information
Title: Phylogenomic approaches to common problems encountered in the analysis of low copy repeats : The sulfotransferase 1A gene family example
Physical Description: Book
Language: English
Creator: Bradley, Michael
Benner, Steven
Publisher: BMC Evolutionary Biology
Publication Date: 2005
Abstract: BACKGROUND: Blocks of duplicated genomic DNA sequence longer than 1000 base pairs are known as low copy repeats (LCRs). Identified by their sequence similarity, LCRs are abundant in the human genome, and are interesting because they may represent recent adaptive events, or potential future adaptive opportunities within the human lineage. Sequence analysis tools are needed, however, to decide whether these interpretations are likely, whether a particular set of LCRs represents nearly neutral drift creating junk DNA, or whether the appearance of LCRs reflects assembly error. Here we investigate an LCR family containing the sulfotransferase (SULT) 1A genes involved in drug metabolism, cancer, hormone regulation, and neurotransmitter biology as a first step for defining the problems that those tools must manage. RESULTS: Sequence analysis here identified a fourth sulfotransferase gene, which may be transcriptionally active, located on human chromosome 16. Four regions of genomic sequence containing the four human SULT1A paralogs defined a new LCR family. The stem hominoid SULT1A progenitor locus was identified by comparative genomics involving complete human and rodent genomes, and a draft chimpanzee genome. SULT1A expansion in hominoid genomes was followed by positive selection acting on specific protein sites. This episode of adaptive evolution appears to be responsible for the dopamine sulfonation function of some SULT enzymes. Each of the conclusions that this bioinformatic analysis generated using data that has uncertain reliability (such as that from the chimpanzee genome sequencing project) has been confirmed experimentally or by a "finished" chromosome 16 assembly, both of which were published after the submission of this manuscript. CONCLUSION: SULT1A genes expanded from one to four copies in hominoids during intra-chromosomal LCR duplications, including (apparently) one after the divergence of chimpanzees and humans.
Thus, LCRs may provide a means for amplifying genes (and other genetic elements) that are adaptively useful. Being located on and among LCRs, however, could make the human SULT1A genes susceptible to further duplications or deletions resulting in 'genomic diseases' for some individuals. Pharmacogenomic studies of SULT1A single nucleotide polymorphisms, therefore, should also consider examining SULT1A copy number variability when searching for genotype-phenotype associations. The latest duplication is, however, only a substantiated hypothesis; an alternative explanation, disfavored by the majority of evidence, is that the duplication is an artifact of incorrect genome assembly.
General Note: Start page 22
General Note: M3: 10.1186/1471-2148-5-22
 Record Information
Bibliographic ID: UF00100023
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: Open Access:
Resource Identifier: issn - 1471-2148



BMC Evolutionary Biology

Research article

Phylogenomic approaches to common problems encountered in
the analysis of low copy repeats: The sulfotransferase 1A gene
family example
Michael E Bradley and Steven A Benner*

Address: Department of Chemistry, University of Florida P.O. Box 117200, Gainesville, FL 32611-7200, USA
Email: Michael E Bradley; Steven A Benner*
* Corresponding author

BMC Evolutionary Biology 2005, 5:22 doi:10.1186/1471-2148-5-22
Received: 07 April 2004
Accepted: 07 March 2005
Published: 07 March 2005
This article is available from:
© 2005 Bradley and Benner; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background: Blocks of duplicated genomic DNA sequence longer than 1000 base pairs are known as low
copy repeats (LCRs). Identified by their sequence similarity, LCRs are abundant in the human genome, and
are interesting because they may represent recent adaptive events, or potential future adaptive
opportunities within the human lineage. Sequence analysis tools are needed, however, to decide whether
these interpretations are likely, whether a particular set of LCRs represents nearly neutral drift creating
junk DNA, or whether the appearance of LCRs reflects assembly error. Here we investigate an LCR family
containing the sulfotransferase (SULT) 1A genes involved in drug metabolism, cancer, hormone regulation,
and neurotransmitter biology as a first step for defining the problems that those tools must manage.
Results: Sequence analysis here identified a fourth sulfotransferase gene, which may be transcriptionally
active, located on human chromosome 16. Four regions of genomic sequence containing the four human
SULT1A paralogs defined a new LCR family. The stem hominoid SULT1A progenitor locus was identified
by comparative genomics involving complete human and rodent genomes, and a draft chimpanzee genome.
SULT1A expansion in hominoid genomes was followed by positive selection acting on specific protein
sites. This episode of adaptive evolution appears to be responsible for the dopamine sulfonation function
of some SULT enzymes. Each of the conclusions that this bioinformatic analysis generated using data that
has uncertain reliability (such as that from the chimpanzee genome sequencing project) has been
confirmed experimentally or by a "finished" chromosome 16 assembly, both of which were published after
the submission of this manuscript.
Conclusion: SULT1A genes expanded from one to four copies in hominoids during intra-chromosomal
LCR duplications, including (apparently) one after the divergence of chimpanzees and humans. Thus, LCRs
may provide a means for amplifying genes (and other genetic elements) that are adaptively useful. Being
located on and among LCRs, however, could make the human SULT1A genes susceptible to further
duplications or deletions resulting in 'genomic diseases' for some individuals. Pharmacogenomic studies of
SULT1A single nucleotide polymorphisms, therefore, should also consider examining SULT1A copy
number variability when searching for genotype-phenotype associations. The latest duplication is,
however, only a substantiated hypothesis; an alternative explanation, disfavored by the majority of
evidence, is that the duplication is an artifact of incorrect genome assembly.



Background
Experimental and computational results estimate that 5-10%
of the human genome has recently duplicated [1-4].
These estimates represent the total proportion of low-
copy repeats (LCRs), which are defined as homologous
blocks of sequence from two distinct genomic locations
(non-allelic) >1000 base pairs in length. LCRs, which are
also referred to in the literature as recent segmental dupli-
cations, may contain all of the various sequence elements,
such as genes, pseudogenes, and high-copy repeats. A set
of homologous LCRs make up an LCR family. Non-allelic
homologous recombination between members of an LCR
family can cause chromosomal rearrangements with
health-related consequences [5-7]. While data are not yet
available to understand the mechanistic basis of LCR
duplication, mechanisms will emerge through the study
of individual cases [8].

At the same time, the appearance of LCR duplicates may
be an artifact arising from one of a number of problems in
the assembly of a genome of interest. Especially when
classical repetitive sequences are involved, it is conceiva-
ble that mistaken assembly of sequencing contigs might
create in a draft sequence of a genome a repeat where
none exists. In the post-genomic world, rules have not yet
become accepted in the community to decide when the
burden of proof favors one interpretation (a true repeat)
over another (an artifact of assembly). Again, these rules
will emerge over time through the study of individual cases.

Through the assembly of many case studies, more general
features of duplication and evolutionary processes that
retain duplicates should emerge. Although each LCR fam-
ily originates from one progenitor locus, no universal fea-
tures explain why the particular current progenitor loci
have been duplicated instead of other genomic regions.
From an evolutionary perspective, duplicated material is
central to creating new function, and to speciation. One
intriguing hypothesis is that genes whose duplication and
recruitment have been useful to meet current Darwinian
challenges find themselves in regions of the chromosome
that favor the generation of LCRs.

Browsing a naturally organized database of biological
sequences, we identified human cytosolic sulfotransferase
(SULT) 1A as a recently expanded gene family with bio-
medically related functions. SULT1A enzymes conjugate
sulfuryl groups to hydroxyl or amino groups on exoge-
nous substrates (sulfonation), which typically facilitates
elimination of the xenobiotic by the excretory system [9].
Sulfonation, however, also bioactivates certain pro-muta-
genic and pro-carcinogenic molecules encountered in the
diet and air, making it of interest to cancer epidemiologists
[10,11]. These enzymes also function physiologically

by sulfonating a range of endogenous molecules, such as
steroid and thyroid hormones, neurotransmitters, bile
salts, and cholesterol [9].

Three human SULT1A genes have been reported [12,13].
The human SULT1A1 and 1A2 enzymes are ~98% identi-
cal and recognize many different phenolic compounds
such as p-nitrophenol and α-naphthol [14-19]. The
human SULT1A3 enzyme is ~93% identical to SULT1A1
and 1A2, but preferentially recognizes dopamine and
other catecholamines over other phenolic compounds
[19-23]. High resolution crystal structures of SULT1A1
and 1A3 enzymes have been solved [24-26]. Amino acid
differences that contribute to the phenolic and dopamine
substrate preferences of the SULT1A1 and 1A3 enzymes,
respectively, have been localized to the active site [27-30].

Polymorphic alleles of SULT1A1, 1A2, and 1A3 exist in
the human population [31-33]. An allele known as
SULT1A1*2 contains a non-synonymous polymorphism,
displays only ~15% of wild type sulfonation activity in
platelets, and is found in ~30% of individuals in some
populations [31]. Numerous studies comparing SULT1A1
genotypes in cancer versus control cohorts demonstrate
that the low-activity SULT1A1*2 allele is a cancer risk factor
[34-36], although other studies have failed to find an
association [12]. Ironically, the protection from carcinogens
conferred by the high-activity SULT1A1*1 allele is
counterbalanced by risks associated with its activation of
pro-carcinogens. For example, SULT1A enzymes bioactivate
the pro-carcinogen 2-amino-α-carboline found in
cooked food, cigarette smoke and diesel exhaust [37]. The
sulfate conjugates of aromatic parent compounds convert
to reactive electrophiles by losing electron-withdrawing
sulfate groups. The resulting electrophilic cations form
mutagenic DNA adducts leading to cancer.

Recently, it has become widely understood that placing a
complex biomolecular system within an evolutionary
model helps generate hypotheses concerning function.
This process has been termed "phylogenomics" [38].
Through our bioinformatic and phylogenomic efforts on
the sulfotransferase 1A system, we detected a previously
unidentified human gene that is very similar to SULT1A3,
transcriptionally active, and not found in the chimpanzee.
In addition, we report that all four human SULT1A genes
are located on LCRs in a region of chromosome 16 replete
with other LCRs. A model of SULT1A gene family expan-
sion in the hominoid lineage (humans and great apes) is
presented, complete with date estimates of three pre-
served duplication events and identification of the pro-
genitor locus. Positively selected protein sites were
identified that might have been central in adapting the
SULT1A3 and 1A4 enzymes to their role in sulfonating
catecholamines such as dopamine and other structurally
related drugs.

Results and Discussion
Four human SULT1A genes on chromosome 16 LCRs
The human SULT1A1 and 1A2 genes are tandemly
arranged 10 kilobase pairs (kbp) apart in the pericentro-
meric region of chromosome 16, while the SULT1A3 gene
is located ~1.7 million base pairs (Mbp) away (Figure 1B
and 1C). In addition to the three known SULT1A genes,
we found a fourth gene, SULT1A4, by searching the
human genome with the BLAST-like alignment tool [39].
SULT1A4 was located midway between the SULT1A1/1A2
gene cluster and the SULT1A3 gene (Figure 1B and 1C).

The SULT1A4 gene resided on 148 kbp of sequence that
was highly identical to 148 kbp of sequence surrounding
the SULT1A3 gene (Figure 1A and Table 1). The high
sequence identity between the SULT1A3 and 1A4
genomic regions suggested that they were part of a low
copy repeat (LCR) family. This suspicion was confirmed
by mining the Recent Segmental Duplication Database of
human LCR families [40]. In addition to the four-member
SULT1A LCR family, the 148 kbp SULT1A3 LCR was
related to 27 other LCRs (Figure 1A and Table 1). Many of
the SULT1A3-related LCRs are members of the previously
identified LCR16a family [41,42]. The SULT1A3-related
LCRs mapping to chromosome 16 collectively amounted
to 1.4 Mbp of sequence or 1.5% of chromosome 16.

To determine if other genes in the SULT superfamily were
also recently duplicated during LCR expansions, we
searched the Segmental Duplication Database [4] for
human reference genes located on LCRs. No other com-
plete cytosolic SULT genes were located on LCRs, but 25%
of the SULT2A1 open reading frame (ORF) was located on
an LCR (Table 2).

The steroid sulfatase gene, which encodes an enzyme that
removes sulfate groups from the same biomolecules rec-
ognized and sulfonated by SULT enzymes, is frequently
deleted in patients with scaly skin (X-linked ichthyosis) due
to nonallelic homologous recombination between LCRs
on chromosome X [43,44]. As demonstrated by the X-linked
ichthyosis example, SULT1A copy number or activity
in the human population could be modified with
health-related consequences by nonallelic homologous
recombination between LCRs on chromosome 16.

SULT1A4: genomic and transcriptional evidence
The sequence of the SULT1A4 gene region from the
human reference genome was so similar to that of the
SULT1A3 region (>99% identity) that the differences were
near those that might arise from sequencing error or
allelic variation. It was conceivable, therefore, that some

combination of sequence error, allelic variation, and/or
faulty genome construction generated the appearance of a
SULT1A4 gene that does not actually exist. We therefore
searched for additional evidence that the SULT1A4 gene
was material.

We asked whether any evidence was consistent with the
hypothesis of an artificial SULT1A4 LCR from erroneous
genome assembly, as opposed to the existence of a true
duplicate region. Here, the quality of the genomic
sequencing is important. The junction regions at the ends
of the SULT1A4 LCR were sufficiently covered; at least
nine sequencing contigs overlapped either junction
boundary (Figure 1C). This amount of evidence has been
used in other studies to judge the genomic placement of
LCRs [45].

As another line of evidence, we compared the nucleotide
sequences of the SULT1A4 and 1A3 genomic regions
(Table 3). Among the 876 coding positions the only dif-
ference was at position 105, where SULT1A4 possessed
adenine (A) and SULT1A3 possessed guanine (G). Thus, if
two genes do exist, they differ by one silent transition at
the third position of codon 35. The untranslated regions,
however, contained thirteen nucleotide differences while
the introns contained seven additional differences (Table
3). These 21 differences between the SULT1A4 and 1A3
genomic regions disfavor the hypothesis that sequencing
errors played a role in the correct/incorrect placement of
these LCRs.

The SULT1A4 gene was located near the junction of two
LCRs (Figure 1C). For this reason, it was not clear whether
SULT1A4 had a functional promoter. We took a bioinfor-
matic approach to address this question. Expressed
sequences ascribed to SULT1A3 were downloaded from
the NCBI UniGene website [46]. Each sequence was
aligned to SULT1A3 and SULT1A4 genomic regions.
Based on the A/G polymorphism at the third position of
codon 35, five expressed sequences were assigned to
SULT1A4 and nine to SULT1A3 (Table 4). Other
expressed sequences were unclassified because they did
not overlap codon 35. If the SULT1A4 gene does exist, there is
ample evidence from expressed sequences to draw conclusions
about its transcriptional activity.
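
The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: the function names are hypothetical, and the sketch assumes each expressed sequence has already been aligned without gaps to the coding sequence so that the coding position of its first base is known.

```python
def codon_of(cds_position):
    """Map a 1-based coding position to (codon number, position within codon)."""
    if cds_position < 1:
        raise ValueError("coding positions are 1-based")
    codon = (cds_position - 1) // 3 + 1
    within = (cds_position - 1) % 3 + 1
    return codon, within


def classify_transcript(aligned_seq, first_cds_position, diagnostic_position=105):
    """Assign a transcript by the base at the diagnostic coding position.

    aligned_seq: transcript bases aligned gap-free to the coding sequence
                 (hypothetical input format)
    first_cds_position: 1-based coding position of aligned_seq[0]
    Returns "SULT1A4" for A, "SULT1A3" for G, otherwise "unclassified".
    """
    index = diagnostic_position - first_cds_position
    if index < 0 or index >= len(aligned_seq):
        return "unclassified"  # transcript does not overlap the site
    base = aligned_seq[index].upper()
    return {"A": "SULT1A4", "G": "SULT1A3"}.get(base, "unclassified")


# Position 105 is the third position of codon 35.
print(codon_of(105))  # (35, 3)
```

A transcript that does not reach position 105 falls through to "unclassified", mirroring the treatment of expressed sequences that did not overlap codon 35.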

The codon 35 A/G polymorphism was reported as allelic
variation in SULT1A3 by Thomae et al. [33]. It is conceiv-
able that Thomae et al. sequenced both SULT1A3 and
SULT1A4 because of the identical sequences surrounding
them. In their study, 89% of CAA (1A4) and 11% of CAG
(1A3) codon 35 alleles were detected in one population.
Why were the frequencies not more equal, as would be
expected if SULT1A4 is always CAA and SULT1A3 is CAG?
One hypothesis is that SULT1A3 is indeed CAG/CAA


Figure 1
Genomic organization of the SULT1A LCR family. (A) 30 LCRs (red) aligned to the SULT1A3 LCR (blue). Core sequences of
SULT1A and LCR16a families are shown between dashed lines. (B) Chromosome 16 positions of 29 SULT1A3-related LCRs.
(C) Known genes, bacterial sequencing contigs, and LCRs (outlined in boxes) in three 350 kbp regions of chromosome 16.


Table 1: SULT1A3-related LCRs
[Table body not recovered. Columns: LCR Name*, Chromosome, % Identity†]
*LCR names are as in Figure 1. †Percent identity is relative to the 1A3 LCR.

Table 2: Duplication Status of SULT Genes
[Table body only partly recovered. Column headers: ORF Length, ORF Duplicated. Recovered row labels: SULT1A1, phenol; SULT1A2, phenol; SULT1A3, dopamine]

Table 3: SULT1A4 and SULT1A3 Genomic Region Differences
[Table body not recovered. Column headers: SULT1A4 Region, SULT1A3 Region]
*21 alignment positions are shown where the nucleotide/gapping (-) of the SULT1A4 region differed from that of the SULT1A3 region. Exon and
intron names of the SULT1A3 gene are according to [33]. All nucleotides are numbered relative to the first nucleotide of the start codon, which has
a value of +1. There was no position 'zero'. The last nucleotide of the coding sequence occurs at position +3,188. Approximately 3 kb of upstream
(5' UTR) and downstream (3' UTR) nucleotides were included in the comparison.

Table 4: Evidence of SULT1A4 Expression
[Table body only partly recovered. Column header: Pos. 105. Accessions: [Genbank:CB147451], [Genbank:AA349131]. Tissue descriptions: fetal heart; fetal heart; pancreas, epithelioid carcinoma; infant brain; optic nerve; large cell carcinoma; fetal adrenal gland]
*Gene classifications made according to the nucleotide at position 105 as described in the text. †Tissue descriptions were taken from GenBank

polymorphic as reported, while SULT1A4 is always CAA.
Interestingly, in both the chimpanzee and gorilla, codon
35 of SULT1A3 is CAA. This implies that the ancestral
SULT1A3 gene (prior to duplication) likely had a CAA
codon. An A to G transition might have been fixed in a
fraction of SULT1A3 genes after the divergence of humans
and great apes. If this scenario is true, some transcripts
assigned to SULT1A4 on the basis of codon 35 may actually
be from individuals expressing the ancestral CAA version
of SULT1A3.
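
The arithmetic behind this hypothesis can be made explicit. The sketch below is only an illustration under a stated assumption that is not in the text: a genotyping assay that cannot distinguish the two loci samples SULT1A3 and SULT1A4 copies equally. The function names are hypothetical. If SULT1A4 is always CAA and SULT1A3 carries CAA at frequency f, the pooled CAA fraction is (1 + f)/2, so a pooled value of 89% would imply f of about 0.78.

```python
def pooled_caa_fraction(caa_freq_1a3, caa_freq_1a4=1.0):
    """Pooled CAA fraction when both loci are sampled equally (assumption)."""
    return (caa_freq_1a3 + caa_freq_1a4) / 2.0


def implied_1a3_caa_frequency(pooled_caa, caa_freq_1a4=1.0):
    """Invert the pooling formula to get the CAA frequency at SULT1A3."""
    freq = 2.0 * pooled_caa - caa_freq_1a4
    if not 0.0 <= freq <= 1.0:
        raise ValueError("pooled fraction is incompatible with the assumptions")
    return freq


# If SULT1A3 were always CAG, the pooled CAA fraction would be 0.5.
print(pooled_caa_fraction(0.0))
# Pooled CAA fraction of 0.89, as reported for one population.
print(round(implied_1a3_caa_frequency(0.89), 2))
```

Under these assumptions a pooled fraction below 0.5 would be impossible, which is why the inverse function rejects such inputs.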

Table 3 region labels for the 21 positions, in order: 5' UTR (5 positions); Intron 1B (4); Intron 1A (1); Exon 2 (1); Intron 4 (2); Exon 8 (4); 3' UTR (4).
3' UTR


[Figure 2: gene tree. Recovered tip labels: Human 1A4, Human 1A3, Chimp 1A3, Gorilla 1A3, Human 1A2, Chimp 1A2, Gorilla 1A2, Human 1A1, Chimp 1A1, Gorilla 1A1, Macaca, Mouse, Rat, Pig, Ox, Dog. The hominoid SULT1A ancestor node is labeled. Branch values are not recoverable.]

Figure 2
SULT1A gene tree. TREx upper-limit date estimates of hominoid SULT1A duplications are shown as Ma in red. KA/KS values
estimated by PAML are shown above branches. Infinity (∞) indicates a non-reliable KA/KS value greater than 100. The 1A3/1A4
branch is dashed. NCBI accession numbers of sequences used: chimpanzee 1A1 [Genbank:BK004887], chimpanzee 1A2 [Genbank:BK004888],
chimpanzee 1A3 [Genbank:BK004889], ox [Genbank:U34753], dog [Genbank:AY069922], gorilla 1A1 [Genbank:BK004890],
gorilla 1A2 [Genbank:BK004891], gorilla 1A3 [Genbank:BK004892], human 1A1 [Genbank:L19999], human
1A2 [Genbank:U34804], human 1A3 [Genbank:L25275], human 1A4 [Genbank:BK004132], macaque [Genbank:D85514],
mouse [Genbank:L02331], pig [Genbank:AY193893], platypus [Genbank:AY044182], rabbit [Genbank:AF360872], rat


SULT1A progenitor locus
We aligned the coding sequences of all available SULT1A
genes and used various nucleotide distance metrics and
tree-building algorithms to infer the gene tree without
constraints. The unconstrained topology placed platypus
as the outgroup, with the placental mammals ordered
(ox,(pig,(dog,(rodents)),(rabbit,(primates)))). This dif-
fered from the topology inferred while constraining for
the most likely relationships among mammalian orders
(platypus,((dog, (ox,pig)), ((rabbit,(rodents)),
primates))) [47]. We considered both trees, and found
that the conclusions drawn throughout the paper were
robust with regard to these different topologies. Therefore,
only the tree inferred while constraining for most likely
relationships among mammalian orders is discussed (Fig-
ure 2).

Using the transition redundant exchange (TREx) molecu-
lar dating tool [48], we placed upper-limit date estimates
at the SULT1A duplication nodes (Figure 2). The SULT1A
gene family appears to have expanded ~32, 25, and 3 million
years ago (Ma). Therefore, the SULT1A duplications
likely occurred after the divergence of hominoids and Old
World monkeys, with the most recent duplication occurring
even after the divergence of humans and great apes.

Mouse, rat, and dog genomes each contained a single
SULT1A gene. The simplest evolutionary model, therefore,
predicted that one of the four hominoid SULT1A loci
was orthologous to the rodent SULT1A1 gene. Syntenic
regions have conserved order of genetic elements along a
chromosomal segment and evidence of synteny between
homologous regions is useful for establishing relation-
ships of orthology and paralogy. Human SULT1A1 is
most like rodent Sult1a1 in sequence and function and
before the advent of whole genome sequencing it was
assumed that they were syntenic and therefore ortholo-
gous [49]. Complete genome sequences have since
emerged and alignments between them are available in
the visualization tool for alignments (VISTA) database of
human-rodent genome alignments [50,51]. The VISTA
database contains mouse-human pairwise alignments
and mouse-rat-human multiple alignments. The multiple
alignments were found to be more sensitive for predicting
true orthologous regions between rodent and human
genomes [51]. We searched the VISTA database for evi-
dence of any human-rodent syntenic regions involving
the four SULT1A loci. The more sensitive multiple alignments
failed to record any human-rodent syntenic regions
involving the SULT1A1, SULT1A2, or SULT1A4 loci but
detected synteny involving the SULT1A3 locus and both
rodent genomes (Figure 3). These results are indicative of
a hominoid-specific SULT1A family expansion from a progenitor
locus corresponding to the genomic region that
now contains SULT1A3. The results from the VISTA database
were not as clear when the less sensitive alignment
method was employed (Figure 3).

SULT1A3 and 1A4 LCRs were 99.1% identical overall
(Table 1). More careful inspection revealed that the
SULT1A3 and 1A4 LCRs were 99.8% identical over the
first 120 kbp, but only 98.0% identical over the last 28
kbp (data not shown). This 10-fold difference in sequence
divergence (0.2% vs. 2.0%) suggested that the SULT1A4-containing
LCR was produced by two independent duplications.
The chimpanzee draft genome assembly aligned
with the human genome [52] provides evidence in sup-
port of this hypothesis. There is conserved synteny
between human and chimp genomes over the last 28 kbp
of the 1A4-containing LCR, but no synteny over the first
120 kbp where the SULT1A4 gene is located (data not
shown). This finding and the TREx date estimate for the
SULT1A3/1A4 duplication event at ~3 Ma indicate that
SULT1A4 is a human invention not shared by chimpanzees,
our closest living relatives.
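
The 10-fold figure follows from converting percent identity to percent divergence for the two parts of the LCR. A minimal sketch of that arithmetic, using the two identity values quoted above (the function name is hypothetical):

```python
def percent_divergence(percent_identity):
    """Percent of aligned positions that differ, given percent identity."""
    return 100.0 - percent_identity


first_120_kbp = percent_divergence(99.8)  # about 0.2
last_28_kbp = percent_divergence(98.0)    # 2.0
fold_difference = last_28_kbp / first_120_kbp

print(round(fold_difference, 1))
```

Comparing divergence rather than identity is what makes the contrast visible: the identities differ by less than two percentage points, while the divergences differ by an order of magnitude.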

It should be noted that the chimpanzee genome assembly
is less reliable than the assembly of the human genome.
The coverage is significantly lower, and the methods used
for assembly are viewed by many as being less reliable, in
part because they relied on the human assembly. Other
possibilities, less supported by the available evidence, should
be considered, including deletion of the chimpanzee
SULT1A4 gene since the human-chimp divergence, or fail-
ure of the draft chimpanzee genome assembly to detect
the 120 kbp segment on which the SULT1A4 gene resides.

Adaptive evolution in hominoids
From an analysis of gene sequence change over time,
molecular evolutionary theory can generate hypotheses
about whether duplication has led to functional redun-
dancy, or whether the duplicates have adopted separate
functional roles. If the latter, molecular evolutionary the-
ory can suggest how different the functional roles might
be by seeking evidence for positive (adaptive) selection
for mutant forms of the native proteins better able to con-
tribute to fitness.

Positive selection of protein function can best be hypothesized
when the ratio of non-synonymous (replacement)
to synonymous (silent) changes, normalized to the
number of non-synonymous and synonymous sites
throughout the entire gene sequence (KA/KS), is greater
than unity. Various models of evolutionary sequence
change can be used to calculate these ratios. The simplest
assumes a single KA/KS ratio over the entire tree (one-ratio).
More complex models assume an independent
ratio for each lineage (free-ratios), variable ratios for specific
classes of sequence sites (site-specific), or variable

Figure 3
Synteny plots demonstrating SULT1A3 is the progenitor locus of the hominoid SULT1A family. Each box shows a VISTA percent
identity plot between a section of the human genome and a section of a rodent genome. Different rodent genomes and
alignment methods are indicated as follows: 1 = mouse (Oct. 2003 build) multiple alignment method (MLAGAN); 2 = rat (June
2003 build) multiple alignment method (MLAGAN); 3 = mouse (October 2003 build) pairwise alignment method (LAGAN).
Human gene locations are shown above and human chromosome 16 coordinates below.


Table 5: Likelihood Values and Parameter Estimates for SULT1A Genes

Model               Log L        Parameter Estimates†
One-ratio           -5,047.81    KA/KS = 0.15
Free-ratios         -5,005.18    KA/KS ratios for each branch shown in Figure 2
Neutral             -5,021.14    p0 = 0.48
Selection           -4,884.89    p0 = 0.41; (p2 = 0.46), KA/KS2 = 0.19
Discrete (k = 2)    -4,931.05    p0 = 0.68, KA/KS0 = 0.06
Discrete (k = 3)    -4,880.78    p0 = 0.59, KA/KS0 = 0.02; (p2 = 0.08), KA/KS2 = 1.24
Beta                -4,884.27    p = 0.27
Beta+selection      -4,879.97    p = 0.30; p0 = 0.98; KA/KS > 2.0
Branch-site specific
Model A             -5,013.29    p0 = 0.48, KA/KS0 = 0; (p2 = 0.03), KA/KS2 > 2.0
Model B             -4,886.52    p0 = 0.68, KA/KS0 = 0.04; (p2 = 0.02), KA/KS2 > 2.0

*f.p. is the number of free parameters in each model. †Evidence for positive selection is shown in boldface. Proportions of sites in each KA/KS class,
p0, p1, and p2, were not free parameters when in parentheses. Neutral site-specific model assumes two site classes having fixed KA/KS ratios of 0 and
1, with the proportion of sites in each class estimated as free parameters. Selection site-specific model assumes a third proportion of sites with
KA/KS estimated from the data. Discrete model assumes 2 or 3 site classes (k) with the proportion of sites, and KA/KS ratios for each proportion,
estimated as free parameters. Beta model assumes a beta distribution of sites, where the distribution is shaped by the parameters p and q.
Beta+selection model assumes an additional class of sites having a KA/KS ratio estimated from the data. Model A, an extension of the neutral model,
assumes a third site class on the 1A3/1A4 branch with KA/KS estimated from the data. Model B, an extension of the discrete model with two site
classes (k = 2), also assumes a third site class on the 1A3/1A4 branch with KA/KS estimated from the data.

ratios for specific classes of sequence sites along specified
branches (branch-site specific) [53-57].
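
As a rough illustration of the ratio defined above, the following sketch computes an uncorrected KA/KS value from counts of changes and sites. It is a simplification under stated assumptions and not the maximum likelihood estimation performed by PAML: there is no correction for multiple substitutions at a site, the function names are hypothetical, and the example counts are invented for illustration only.

```python
def ka_ks_ratio(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Uncorrected ratio of per-site non-synonymous to synonymous change."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    ka = nonsyn_changes / nonsyn_sites  # non-synonymous changes per site
    ks = syn_changes / syn_sites        # synonymous changes per site
    if ks == 0:
        return float("inf")             # ratio undefined; treat as unreliable
    return ka / ks


def interpret(ratio):
    """Conventional reading of the ratio relative to unity."""
    if ratio > 1:
        return "consistent with positive selection"
    if ratio < 1:
        return "consistent with purifying selection"
    return "consistent with neutral change"


# Invented counts: 10 changes over 600 replacement sites, 20 over 200 silent sites.
ratio = ka_ks_ratio(10, 600, 20, 200)
print(round(ratio, 3), interpret(ratio))
```

Normalizing by the number of sites matters because a typical coding sequence has several times more non-synonymous sites than synonymous ones, so raw counts of changes would overstate the replacement rate.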

Estimating the free parameters in each of these models by
the maximum likelihood method [58] enables testing two
nested evolutionary models as competing hypotheses,
where one model is a special case of another model. The
likelihood ratio test (LRT) statistic, which is twice the log
likelihood difference between the nested models, is comparable
to a χ² distribution with degrees of freedom equal
to the difference in free parameters between the models
[59]. Evidence for adaptive evolution typically requires a
KA/Ks ratio >1 and a statistically significant LRT [60].
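
The test can be sketched with the standard library alone. The example below uses the log likelihood pairs listed for two of the branch-site comparisons (Model A vs. Neutral, Model B vs. Discrete with k = 2). The degrees of freedom are an assumption here (2 for each comparison), since the free-parameter counts are not reproduced in this text, and the function names are hypothetical. The closed-form tail probability used is valid only for even degrees of freedom.

```python
import math


def lrt_statistic(log_l_alt, log_l_null):
    """Twice the log likelihood difference between nested models."""
    return 2.0 * (log_l_alt - log_l_null)


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, which is exact for even df.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# (label, log L of alternative, log L of null, assumed degrees of freedom)
comparisons = [
    ("Model A vs. Neutral", -5013.29, -5021.14, 2),
    ("Model B vs. Discrete (k = 2)", -4886.52, -4931.05, 2),
]

for label, log_l1, log_l0, df in comparisons:
    stat = lrt_statistic(log_l1, log_l0)
    p_value = chi2_sf_even_df(stat, df)
    print(f"{label}: 2*dlogL = {stat:.2f}, P = {p_value:.2e}")
```

With these assumed degrees of freedom, both comparisons give P below 0.001, in line with the significance levels listed for them in Table 6.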

We estimated KA/KS ratios for each branch in the 1A gene
tree by maximum likelihood with the PAML program
[61]. A typical branch in the SULT1A gene tree had a ratio
of 0.16, and the ratio was 0.23 on the branch separating
extant SULT1A3/1A4 genes from the single SULT1A gene
in the last common ancestor of hominoids (Figure 2).
Thus, the KA/KS ratio estimated as an average over all sites
did not suggest adaptive evolution along the 1A3/1A4 branch.

We then implemented three site-specific and two branch-
site evolutionary models that allow KA/Ks ratios to vary
among sites. Four of the five models estimated that a pro-
portion of sites (2-8%) had KA/Ks >1 (Table 5). Each
model was statistically better at the 99 or 95% confidence
level than the appropriate null model as determined using
the LRT statistic (Table 6). Table 6 lists the specific sites
that various analyses identified as being potentially
involved in positive selection and a subset of these sites
that are changing along the SULT1A3/1A4 branch.

A hypothesis of adaptive change that is based on the use
of KA/KS values can be strengthened by joining the molecular
evolutionary analysis to an analysis based on structural
biology [62,63]. Here, we ask whether the sites
possibly involved in an episode of sequence evolution are,
or are not, randomly distributed in the three dimensional
structure. To ask this question, we mapped the sites to the
SULT1A structure (Figure 4). Sites holding amino acids
whose codons had suffered synonymous replacements
were evenly distributed throughout the three-dimen-
sional structure of the enzyme, as expected for silent
changes that have no impact on the protein structure and



(p, = 0.52)
KA/KS i= I
p, = 0.13
KA/KS i= I
p, = 0.32
KA/KS I = 0.77
pi = 0.33
KA/KS I = 0.31
q = 1.07
q = 1.33
p, = 0.02

pi = 0.49
p, = 0.30
KA/KsI = 0.56

Discrete (k = 2)

Discrete (k = 3)


Model A

Model B

BMC Evolutionary Biology 2005, 5:22

Table 6: Likelihood Ratio Tests for the SULT1A Genes

Selection vs. Neutral Discrete (k = 3) vs. One-ratio Beta+selection vs. Beta Model A vs. Neutral Model B vs. Discrete (k = 2)

- 4,880.78
- 5,047.81

Log L1
Log L0
2ΔLog L

0.01 < P < 0.05

89 (0.88)

222 (0.58)

245 (0.99)

- 5,013.29
- 5,021.14
P < 0.001

- 4,886.52
- 4,931.05
P < 0.001



105 (0.72)
107 (0.82)
132 (0.87)
143 (0.51)
146 (0.80)


105 (0.53)
107 (0.75)
132 (0.78)


*In parentheses for each positively selected site is the posterior probability that the site belongs to the class with KA/KS > 1. Posterior probabilities
>90% are bold-face. Positively selected sites also experiencing non-synonymous change on the 1A3/1A4 branch are underlined.

therefore cannot be selected for or against at the protein
level (Figure 4). In contrast, sites experiencing non-synon-
ymous replacements during the episode following the
duplication that created the new hominoid gene are
clustered on the side of the protein near the substrate
binding site and the channel through which the substrate
gains access to the active site (Figure 4 and Table 7). This
strengthens the hypothesis that replacements at the sites
are indeed adaptive. The approach employed here based
on structural biology does not lend itself easily to evalua-
tion using statistical metrics. Rather, the results are valua-
ble based on the visual impression that they give, and the
hypotheses that they generate.

We then examined literature where amino acids had been
exchanged between SULT1A1 and SULT1A3. One of the
sites, at position 146, identified as being involved in adap-
tive change, is known to control substrate specificity in
SULT1A1 and 1A3 [27-30]. The remaining sites identified
are nearby.

An interesting question in post-genomic science asks how
to create biological hypotheses from various drafts of
whole genome sequences. In generating these hypotheses,
it is important to remember that a genomic sequence is
itself a hypothesis, about the chemical structure of a small
number of DNA molecules. In many cases, biologists wish
to move from the genomic sequence, as a hypothesis, to
create hypotheses about biological function, without first
"proving" the genome sequence hypothesis.

This type of process, building hypotheses upon unproven
hypotheses, is actually common in science. In fact, very lit-
tle of what we believe as fact is actually "proven"; formal
proof is virtually unknown in science that involves obser-
vation, theory, and experiment. Rather, scientists gener-
ally accumulate data until a burden of proof is met, with
the standards for that burden being determined by experi-
ence within a culture. In general, scientists have an idea in
an area as to what level of validation is sufficient to avoid


- 4,884.89
- 5,021.14
P < 0.001

P < 0.001
Positively selected sites*
3 (0.86)
7 (0.63)
35 (0.73)
71 (0.88)


236 (0.53)
245 (0.99)
261 (0.90)
275 (0.70)
288 (0.89)
290 (0.95)
293 (0.72)



Figure 4
Non-synonymous changes along the 1A3/1A4 branch cluster
on the SULT1A1 enzyme structure [PDB: 1LS6] [26]. Red
sites experienced non-synonymous changes, green sites
experienced synonymous changes. The PAPS donor substrate
and p-nitrophenol acceptor substrates are shown in
blue. Labels in the image mark the active site opening and
the substrate. Image was generated using Chimera [86].

making mistakes an unacceptable fraction of the time,
and proceed to that level in their ongoing work, until they
encounter a situation where they make a mistake
(indicating that a higher standard is needed), or encoun-
ter enough examples where a lower standard works, and
therefore come to accept a lower standard routinely [64].

Genomics has not yet accumulated enough examples for
the culture to define the standards for a burden of proof.
In the example discussed here, several lines of reasoning
would be applied to analyze the sulfotransferase gene
family. First, the fact that the draft genome for
chimpanzee contains three paralogs, while the draft
genome for human contains four, would normally be
interpreted (as it is here) as evidence that an additional
duplication occurred in the time since chimpanzee and
humans diverged. It would also, however, be consistent
with the loss of one of four hypothetical genes present in
the common ancestor of chimpanzee and humans in the
lineage leading to chimpanzee. Another possibility is that
the finishing stages of the chimpanzee genome project
will uncover a SULT1A4 gene.

Normally, one would resolve this question using an out
group taxon, a species that diverged from the lineage lead-
ing to chimpanzee and human before chimpanzee and
human themselves diverged. The nearest taxa that might
serve as an out group today are, however, rat and mouse.
As noted above, they diverged so long ago (ca. 150 MY
separates contemporary rodents from contemporary pri-
mates) that the comparison provides no information. And
no closer out group taxon (e.g., orangutan) has had its
genome completely sequenced.

Here, the two hypotheses (duplication versus loss after the
chimpanzee-human speciation) are distinguished (to
favor post-speciation duplication) based on an analysis of
the silent nucleotide substitutions using the TREx metric.
The very small number of nucleotide differences separat-
ing the SULT1A3 and SULT1A4 coding regions favors the
generation of the two paralogs after chimpanzee and
human diverged.

This comparison, however, potentially suffers from the
statistics of small numbers. The number of differences in
the coding region (exactly one) is small. By considering
~10 kbp of non-coding sequence, however, additional dif-
ferences were found. It is possible that in the assembly of
the human genome, a mistake was made that led to the
generation of a SULT1A4 region that does not actually
exist. In this hypothesis, the ~20 nucleotide differences
between the SULT1A3 and SULT1A4 paralogs must be the
consequence of allelic polymorphism in the only gene
that exists. This is indeed how some of the data were ini-
tially interpreted.

Does the preponderance of evidence favor the hypothesis
of a very recent duplication to generate a pair of paralogs
(SULT1A3 and SULT1A4)? Or does the evidence favor the
hypothesis that the SULT1A4 gene is an illusion arising
from gene assembly error coupled to sequencing errors
and/or allelic variation at ca. 20 sites? The culture does not
yet have a standard of assigning the burden of proof here,
although a choice of hypothesis based simply on the
count of the number of mistakes that would need to have
been made to generate each hypothesis (none for the first,
at least three for the second) would favor the former over
the latter. Thus, perhaps naively, the burden of proof now
favors the former, and we may proceed to generate the
biological hypothesis on top of the genomic hypothesis.

Here, the hypothesis has immediate pharmacogenomic
and genomic disease implications due to the specific
functional behaviors of SULT1A enzymes. LCR-mediated
genomic rearrangements could disrupt or amplify human
SULT1A gene copy number. Given our current environ-
mental exposure to many forms of carcinogens and pro-
carcinogens that are either eliminated or activated by


Table 7: Non-synonymous Changes on the 1A3/1A4 Branch

Site* Nucleotide

Hominoid SULT1A Ancestor

Hominoid SULT1A3 Ancestor

Residue PP† Physicochemical Properties

tiny polar
non-polar aromatic positive
non-polar aromatic
non-polar aromatic
small non-polar aliphatic
non-polar aliphatic
tiny non-polar
non-polar aliphatic
tiny polar
tiny non-polar
non-polar aromatic positive
tiny non-polar
small non-polar aliphatic
non-polar aliphatic

Residue PP Physicochemical Properties


small polar
small polar
small non-polar aliphatic
small non-polar aliphatic
small polar
small polar negative
polar negative
non-polar aliphatic
non-polar aliphatic
tiny polar
non-polar aromatic positive
polar positive
polar negative
tiny non-polar
non-polar aromatic

* Sites underlined were identified as being positively selected using the branch-site specific models. †Posterior probabilities that the ancestral
residues are correct, conditional on the model of sequence evolution used.

SULT enzymes, respectively, it is plain to see how SULT1A
copy number variability in the human population could
underlie cancer susceptibilities and drug or food allergies.

The majority of evidence indicates that a new transcrip-
tionally active human gene, which we refer to as
SULT1A4, was created when 120 kbp of chromosome 16
duplicated after humans diverged from great apes. Thus,
SULT1A4, or possibly another gene in this region, is likely
to contribute to distinguishing humans from their closest
living relatives. It is also conceivable that an advantage in
gene regulation, as opposed to an advantage from gene
duplication, was the driving force behind the duplication
of this 120 kbp segment. While cause and effect are diffi-
cult to separate, the examples presented here support the
hypothesis that genes whose duplication and recruitment
are useful to meet current Darwinian challenges find
themselves located on LCRs.

The SULT1A4 gene is currently the most obvious feature
of the duplicated region and has been preserved for ~3 MY
without significant divergence of its coding sequence. One
suggestion for the usefulness of SULT1A4 is that it
expanded sulfonating enzymes to new tissues. The
SULT1A4 gene is located only 10 kbp upstream from the
junction boundary of its LCR and 700 kbp away from the
SULT1A3 locus. It is possible, therefore, that promoter
elements from the new genomic context of the SULT1A4
gene would drive its expression in tissues where SULT1A3
is not expressed, a hypothesis testable by more careful
transcriptional profiling.

Multiple SULT1A genes were apparently useful inventions
by our stem hominoid ancestor. Following the duplica-
tion of an ancestral primate SULT1A gene ~32 Ma, posi-
tive selection acted on a small proportion of sites in one
of the duplicates to create the dopamine sulfonating
SULT1A3 enzyme. In the example presented here, the evi-
dence of adaptive change at certain sites is corroborated
by the ad hoc observation that the sites cluster near the
active site of the protein. The well known substrate bind-
ing differences at the active sites of SULT1A1/1A2 and
SULT1A3 (and now SULT1A4) substantiate these

When studying well-characterized proteins as we have
done here, episodes of functional change can be identified
by piecing together several lines of evidence. It is not
immediately possible, unfortunately, to assemble as
much evidence for the majority of proteins in the bio-
sphere. Thus, an important goal in bioinformatics is to
recognize the signal of functional change from a restricted
amount of evidence. Of the three lines of evidence
employed here (codon-based metrics, structural biology,
and experimental), structural biology, with its obvious
connections to protein function and impending growth


from structural genomics initiatives, will probably be the
most serviceable source of information for most protein
families. This should be especially true for protein fami-
lies not amenable to experimental manipulation, or with
deep evolutionary branches where codon-based metrics
are unhelpful. If we are to exploit the incontrovertible link
between structure and function, however, new structural
bioinformatic tools and databases relating protein struc-
ture to sequence changes occurring on individual
branches are much needed.

This bioinformatic study makes several clear predictions.
First, a PCR experiment targeted against the variation
between the hypothetical SULT1A3 and SULT1A4 human
genes should establish the existence of the two separate
genes. Second, a reverse transcription-PCR experiment
would be expected to uncover transcriptional activity for
the SULT1A3 and SULT1A4 human genes. Since this
paper was submitted, these experiments have been done,
and indeed confirm our predictions made without the
experimental information [65]. Further, after this manu-
script and its computationally-based predictions were
submitted for publication, a largely finished sequence for
chromosome 16 has emerged [66] that confirms our anal-
ysis here in every respect.

SULT1A LCR family organization in the human genome
The July 2003 human reference genome (based on NCBI
build 34) was queried with the SULT1A3 coding region
using the BLAST-like alignment tool [39], and search
results were visualized in the UCSC genome browser [67].
Two distinct locations on chromosome 16 were identified
as equally probable. One location was recognized by
NCBI Map Viewer [68] as the SULT1A3 locus. The other
locus was dubbed SULT1A4 following conventional nam-
ing for this family. The coding sequence and genomic
location of SULT1A4, as well as expressed sequences
derived from SULT1A4, have been deposited with the
GenBank Third Party Annotation database under acces-
sion [Genbank:BK004132].

To determine the extent of homology between the
SULT1A3 and 1A4 genomic locations, ~500 kbp of
sequence surrounding SULT1A3 and ~500 kbp of
sequence surrounding SULT1A4 were downloaded from
NCBI and compared using PIPMAKER [69,70]. Before
submitting to PIPMAKER, high-copy repeats in one of the
sequences were masked with REPEATMASKER [71].

The Human Recent Segmental Duplication Page [72] was
consulted to identify other LCRs related to the SULT1A3-containing
LCR. Chromosomal coordinates of 30
SULT1A3-related LCRs were arranged in GFF format and
submitted to the UCSC genome browser as a custom
track. Sequences corresponding to the chromosomal coordinates
of the 30 LCRs were then downloaded from the
UCSC genome browser and parsed into separate files.
Each LCR was aligned with the SULT1A3-containing LCR
using MULTIPIPMAKER [73]. The Segmental Duplication
Database [74] was used to examine the duplication status
of each gene in the cytosolic SULT super family.
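Arranging coordinates in GFF format for a custom track can be sketched as follows. The coordinates, region names, and the source and feature labels below are hypothetical placeholders, not the 30 LCR coordinates used in this study; the sketch only shows the nine tab-separated GFF columns preceded by a track line.

```python
def gff_line(chrom, start, end, name, source="LCR", feature="duplication"):
    """Format one 9-column, tab-separated GFF record (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("GFF coordinates must satisfy 1 <= start <= end")
    # Columns: seqname, source, feature, start, end, score, strand, frame, group
    fields = [chrom, source, feature, str(start), str(end), ".", ".", ".", name]
    return "\t".join(fields)

def custom_track(regions, track_name="SULT1A3_related_LCRs"):
    """Build the text of a custom track: one track line, then GFF records."""
    header = f'track name="{track_name}" description="LCRs related to the SULT1A3 LCR"'
    return "\n".join([header] + [gff_line(*region) for region in regions])

# Hypothetical coordinates, for illustration only.
regions = [
    ("chr16", 1000, 2000, "LCR_example_1"),
    ("chr16", 5000, 9000, "LCR_example_2"),
]
print(custom_track(regions))
```

Records that share the same value in the last (group) column are drawn together as one item, so giving each LCR its own name keeps the regions separate in the display.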

The bacterial artificial chromosome contigs supporting
each member of the SULT1A LCR family, and the known
genes within each LCR, were inspected with the UCSC
genome browser [75]. The DNA sequences of nine bacterial
artificial chromosome contigs supporting the
SULT1A4 genomic region [NCBI Clone Registry: CTC-446K24,
CTC-529P19, CTC-576G12, CTD-2253D5, CTD-2324H19,
CTD-2383K24, CTD-2523J12, CTD-3191G16,
RP11-28A6] and seven contigs supporting the SULT1A3
region [NCBI Clone Registry: CTD-2548B1, RP11-69O13,
RP11-164O24, RP11-455F5, RP11-612G2, RP11-787F23,
RP11-828J20] were downloaded from the UCSC genome
browser website.

The MASTERCATALOG was used for performing initial
inspections of the SULT gene family and for delivering a
non-redundant collection of SULT1A genes. Additional
SULT1A ORFs were extracted from gorilla working draft
contigs [Genbank:AC145177] (SULT1A1 and 1A2) and
[Genbank:AC145040] (SULT1A3) and chimpanzee
whole genome shotgun sequences [Genbank:AACZ01082721]
(SULT1A1), [Genbank:AADA01101065] (SULT1A2), and
[Genbank:AACZ01241716] (SULT1A3) using PIPMAKER exon
analysis. These new SULT1A genes have been deposited
with the GenBank Third Party Annotation database under
accession numbers [Genbank:BK004887-BK004892].
DNA sequences were aligned with CLUSTAL W [76]. The
multiple sequence alignment used in all phylogenetic
analyses is presented as supplementary data [see Additional
file 1]. Pairwise distances were estimated under various
distance metrics (Jukes-Cantor, Kimura 2-parameter,
and Tamura-Nei) that account for among-site rate variation
using the gamma distribution [77]. Phylogenies were
inferred using both neighbor-joining and minimum evolution
tree-building algorithms under the following constraints
((((primates), rodents), (artiodactyls,
carnivores)), platypus). Phylogenetic analyses were conducted
using the MEGA2 v2.1 [78] and PAUP* v4.0 [79]
software packages.
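As an illustration of the simplest of the distance metrics named above, the sketch below computes the Jukes-Cantor distance from the proportion of differing sites, with an optional gamma correction for among-site rate variation. It is not the MEGA2 or PAUP* implementation; the function names and the shape parameter value used in the example are assumptions, since the shape parameter used in the analysis is not stated here.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with gaps or ambiguity codes."""
    valid = "ACGT"
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in valid and b in valid]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p, alpha=None):
    """Jukes-Cantor distance; if alpha is given, rates follow a gamma distribution."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the distance to be defined")
    base = 1.0 - 4.0 * p / 3.0
    if alpha is None:
        return -0.75 * math.log(base)
    return 0.75 * alpha * (base ** (-1.0 / alpha) - 1.0)

# Toy aligned sequences, for illustration only.
p = p_distance("ACGTACGTAC", "ACGTACGTAT")
print(p, jukes_cantor(p), jukes_cantor(p, alpha=1.0))
```

For the same observed proportion of differences, the gamma-corrected distance is larger than the uncorrected one, because rate variation among sites hides more multiple substitutions at the fast-evolving sites.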

Parameter estimates of site class proportions, KA/KS ratios,
base frequencies, codon frequencies, branch lengths, and
the transition/transversion bias were determined by the
maximum likelihood method with the PAML v3.14 program
[61]. Positively selected sites, posterior probabilities,


and marginal reconstructions of ancestral sequences were
also determined using PAML. Sites experiencing synony-
mous changes along the 1A3/1A4 branch were recorded
by hand from an ancestral sequence alignment.
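For readers unfamiliar with PAML, the sketch below writes a control file of the kind read by its codeml program. It is an assumption-laden illustration, not the settings used in this study: the file names are hypothetical, and the codon frequency model, starting values, and number of categories are placeholder choices. In codeml, site-specific models are selected with NSsites while model = 0, and the branch-site models A and B are selected with model = 2 together with NSsites = 2 or 3 on a tree in which the branch of interest is labelled.

```python
def codeml_ctl(seqfile, treefile, outfile, model, nssites, ncatg=3):
    """Return the text of a codeml control file for the requested models."""
    settings = [
        ("seqfile", seqfile),
        ("treefile", treefile),
        ("outfile", outfile),
        ("runmode", 0),        # evaluate the user-supplied tree
        ("seqtype", 1),        # codon sequences
        ("CodonFreq", 2),      # F3x4 codon frequencies (placeholder choice)
        ("model", model),      # 0: one ratio, 1: free ratios, 2: labelled branches
        ("NSsites", " ".join(str(m) for m in nssites)),
        ("icode", 0),          # universal genetic code
        ("fix_kappa", 0),      # estimate the transition/transversion ratio
        ("kappa", 2),          # starting value (placeholder)
        ("fix_omega", 0),      # estimate the KA/KS ratio
        ("omega", 0.4),        # starting value (placeholder)
        ("ncatG", ncatg),      # number of site classes for discrete models
    ]
    return "\n".join(f"{key:>10} = {value}" for key, value in settings) + "\n"

# Hypothetical file names; site models first, then branch-site model A.
site_models = codeml_ctl("sult1a.nuc", "sult1a.tre", "sites.out", 0, [0, 1, 2, 3, 7, 8])
model_a = codeml_ctl("sult1a.nuc", "sult1a_labelled.tre", "modelA.out", 2, [2])
print(model_a)
```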

Molecular dating
Starting with aligned DNA sequences, the number (n) of
two-fold redundant codons (Lys, Glu, Gln, Cys, Asp, Phe,
His, Asn, Tyr) where the amino acid had been conserved
in pairs of aligned sequences, and the number of these
codons where the third position was identically conserved
(c), were counted by the DARWIN bioinformatics platform
[80,81]. The pairwise matrix of n and c values for all
SULT1A genes is presented as supplementary data [see
Additional file 2]. The c/n quotient equals the fraction of
identities ($f_2$) in this system, or the transition redundant
exchange (TREx) value [48]. TREx values were converted
to TREx distances (kt values) by the following equation:
$kt = -\ln[(f_2 - \mathrm{Eq.}) / (1 - \mathrm{Eq.})]$, where k is the rate constant of
nucleotide substitution, t is the time separating the two
sequences, and Eq. is the equilibrium state of the TREx
value [48]. The equilibrium state of the TREx value was
estimated as 0.54 for primates, and the rate constant at
two-fold redundant sites where the amino acid was conserved
(k) was estimated as $3.0 \times 10^{-9}$ changes/site/year
for placental mammals (T. Li, D. Caraco, E. Gaucher, D.
Liberles, M. Thomson, and S.A.B., unpublished data).
These estimates were determined by sampling all pairs of
mouse:rat and mouse:human orthologs in the public
databases and following accepted placental mammal phy-
logenies and divergence times [82,83]. Therefore, the date
estimates reported are based on the contentious assump-
tions that (i) rates are constant at the third position of
two-fold redundant codons across the genome, (ii) the
fossil calibration points are correct, and (iii) the mamma-
lian phylogeny used is correct. Branch lengths were
obtained for the constrained tree topology from the pair-
wise matrix of TREx distances using PAUP* v4.0. Upper-
limit date estimates for nodes corresponding to SULT1A
duplication events were obtained by summing the longest
path of branches leading to a node and dividing that value
by k.
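The counting and conversion steps described above can be sketched as follows. This is an illustrative reimplementation, not the DARWIN code: the function names and the toy sequences are hypothetical, and the conversion assumes the TREx distance is the negative logarithm of the normalized excess of identities over equilibrium, with Eq. = 0.54 and k = 3.0 × 10⁻⁹ changes/site/year as given in the text.

```python
import math

# Codons of the nine two-fold redundant amino acids named in the text.
TWO_FOLD = {
    "AAA": "Lys", "AAG": "Lys", "GAA": "Glu", "GAG": "Glu",
    "CAA": "Gln", "CAG": "Gln", "TGT": "Cys", "TGC": "Cys",
    "GAT": "Asp", "GAC": "Asp", "TTT": "Phe", "TTC": "Phe",
    "CAT": "His", "CAC": "His", "AAT": "Asn", "AAC": "Asn",
    "TAT": "Tyr", "TAC": "Tyr",
}

def count_n_c(seq_a, seq_b):
    """Return (n, c) for two aligned, in-frame coding sequences."""
    n = c = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        aa_a, aa_b = TWO_FOLD.get(codon_a), TWO_FOLD.get(codon_b)
        if aa_a is not None and aa_a == aa_b:   # conserved two-fold amino acid
            n += 1
            if codon_a[2] == codon_b[2]:        # third position also conserved
                c += 1
    return n, c

def trex_distance(n, c, eq=0.54):
    """Convert counts to a TREx distance kt = -ln[(f2 - Eq.) / (1 - Eq.)]."""
    if n <= 0:
        raise ValueError("n must be positive")
    f2 = c / n
    if f2 <= eq:
        raise ValueError("f2 is at or below equilibrium; distance is undefined")
    return -math.log((f2 - eq) / (1.0 - eq))

def years_from_distance(kt, k=3.0e-9):
    """Divide a TREx distance (or a summed path of branch lengths) by k."""
    return kt / k

# Toy example: 77 of 100 conserved two-fold codons share the third position.
kt = trex_distance(100, 77)
print(kt, years_from_distance(kt))
```

When the fraction of identities is close to 1 and n is small, as for SULT1A3 and SULT1A4, a single silent difference changes the estimate substantially, which is why the small-number caveat raised in the discussion matters.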

Comparative genomics
Human-chimpanzee genome alignments were inspected
at the UCSC Genome Browser. Human-rodent genome
alignments were examined with the VISTA Genome
Browser [50,51,84]. VISTA default parameters were used
for drawing curves. Alignments constructed using both
the pairwise method (LAGAN) and the multiple align-
ment method (MLAGAN) between the human genome
builds frozen on April 2003 or July 2003 and both rodent
genomes were inspected.

Transcriptional profiling
All expressed sequences ascribed to SULT1A3 were downloaded
from NCBI UniGene [85] and aligned with
SULT1A3 and SULT1A4 genomic regions using PIPMAKER.
Alignments were inspected for the polymorphism
in codon 35, as well as any other potential patterns,
to determine whether they were derived from SULT1A3 or
SULT1A4.

kbp (kilobase pairs); LCR (low copy repeat); Mbp (mil-
lion base pairs); Ma (million years ago); ORF (open
reading frame); SULT (sulfotransferase); TREx (transition
redundant exchange); VISTA (visualization tool for

Authors' contributions
M.E.B carried out the study and drafted the manuscript.
S.A.B participated in designing the study and preparing
the manuscript.

Additional material

Additional File 1
Multiple sequence alignment of SULT1A genes. Multiple sequence alignment
of SULT1A genes used in all phylogenetic analyses. Characters conserved
in all sequences are indicated with asterisks.

Additional File 2
Pairwise n and c values for SULT1A genes. Pairwise n and c values
between SULT1A genes. The names of the sequences are the row-headers
and the column-headers. Lower triangular matrix contains n values, and
upper triangular matrix contains c values.

We thank the Foundation for Applied Molecular Evolution for providing
computational resources. This work was supported by grant DOD 6402-
202-LO-G from the USF Center for Biological Defense, and by an NIH post-
doctoral fellowship to M.E.B.

1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J,
Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris
K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P,
McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J,
Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-
Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sul-
ston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N,
Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin
R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt
A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S,
Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S,


Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA,
Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL,
Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB,
Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T,
Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett
N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M,
Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley
KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS,
Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T,
Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T,
Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T,
Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L,
Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer
M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G,
Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA,
Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood
J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S,
Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser
J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia
N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bai-
ley JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge
CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T,
Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hay-
ashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS,
Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin
EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T,
Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J,
Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-
Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe
KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A,
Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J, de Jong P,
Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial
sequencing and analysis of the human genome. Nature 2001,
2. Cheung VG, Nowak N, Jang W, Kirsch IR, Zhao S, Chen XN, Furey
TS, Kim UJ, Kuo WL, Olivier M, Conroy J, Kasprzyk A, Massa H,
Yonescu R, Sait S, Thoreen C, Snijders A, Lemyre E, Bailey JA, Bruzel
A, Burrill WD, Clegg SM, Collins S, Dhami P, Friedman C, Han CS,
Herrick S, Lee J, Ligon AH, Lowry S, Morley M, Narasimhan S, Osoe-
gawa K, Peng Z, Plajzer-Frick I, Quade BJ, Scott D, Sirotkin K, Thorpe
AA, Gray JW, Hudson J, Pinkel D, Ried T, Rowen L, Shen-Ong GL,
Strausberg RL, Birney E, Callen DF, Cheng JF, Cox DR, Doggett NA,
Carter NP, Eichler EE, Haussler D, Korenberg JR, Morton CC,
Albertson D, Schuler G, de Jong PJ, Trask BJ: Integration of cytogenetic
landmarks into the draft sequence of the human
genome. Nature 2001, 409:953-958.
3. Eichler EE: Segmental duplications: what's missing, misas-
signed, and misassembled--and should we care? Genome Res
2001, 11:653-656.
4. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE: Segmental
duplications: organization and impact within the current
human genome project assembly. Genome Res 2001,
5. Stankiewicz P, Lupski JR: Genome architecture, rearrangements
and genomic disorders. Trends Genet 2002, 18:74-82.
6. Emanuel BS, Shaikh TH: Segmental duplications: an 'expanding'
role in genomic instability and disease. Nat Rev Genet 2001,
7. Singleton AB, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J,
Hulihan M, Peuralinna T, Dutra A, Nussbaum R, Lincoln S, Crawley A,
Hanson M, Maraganore D, Adler C, Cookson MR, Muenter M,
Baptista M, Miller D, Blancato J, Hardy J, Gwinn-Hardy K: alpha-
Synuclein locus triplication causes Parkinson's disease. Sci-
ence 2003, 302:841.
8. Horvath JE, Gulden CL, Bailey JA, Yohn C, McPherson JD, Prescott A,
Roe BA, de Jong PJ, Ventura M, Misceo D, Archidiacono N, Zhao S,
Schwartz S, Rocchi M, Eichler EE: Using a pericentromeric inter-
spersed repeat to recapitulate the phylogeny and expansion
of human centromeric segmental duplications. Mol Biol Evol
2003, 20:1463-1479.
9. Falany CN: Enzymology of human cytosolic sulfotransferases.
FASEB J 1997, 11:206-216.
10. Miller JA: Sulfonation in chemical carcinogenesis--history and
present status. Chem Biol Interact 1994, 92:329-341.
11. Glatt H: Sulfation and sulfotransferases 4: bioactivation of
mutagens via sulfation. FASEB J 1997, 11:314-321.
12. Coughtrie MW: Sulfation through the looking glass--recent
advances in sulfotransferase research for the curious.
Pharmacogenomics J 2002, 2:297-308.
13. Freimuth RR, Wiepert M, Chute CG, Wieben ED, Weinshilboum RM:
Human cytosolic sulfotransferase database mining: identification
of seven novel genes and pseudogenes. Pharmacogenomics J
2004, 4:54-65.
14. Zhu X, Veronese ME, Sansom LN, McManus ME: Molecular char-
acterisation of a human aryl sulfotransferase cDNA. Biochem
Biophys Res Commun 1993, 192:671-676.
15. Wilborn TW, Comer KA, Dooley TP, Reardon IM, Heinrikson RL,
Falany CN: Sequence analysis and expression of the cDNA for
the phenol-sulfating form of human liver phenol
sulfotransferase. Mol Pharmacol 1993, 43:70-77.
16. Hwang SR, Kohn AB, Hook VY: Molecular cloning of an isoform
of phenol sulfotransferase from human brain hippocampus.
Biochem Biophys Res Commun 1995, 207:701-707.
17. Ozawa S, Nagata K, Shimada M, Ueda M, Tsuzuki T, Yamazoe Y, Kato
R: Primary structures and properties of two related forms of
aryl sulfotransferases in human liver. Pharmacogenetics 1995,
5:S135-40.
18. Zhu X, Veronese ME, Iocco P, McManus ME: cDNA cloning and
expression of a new form of human aryl sulfotransferase. Int
J Biochem Cell Biol 1996, 28:565-571.
19. Reiter C, Mwaluko G, Dunnette J, Van Loon J, Weinshilboum R:
Thermolabile and thermostable human platelet phenol sul-
fotransferase. Substrate specificity and physical separation.
Naunyn Schmiedebergs Arch Pharmacol 1983, 324:140-147.
20. Bernier F, Lopez Solache I, Labrie F, Luu-The V: Cloning and
expression of cDNA encoding human placental estrogen
sulfotransferase. Mol Cell Endocrinol 1994, 99:R11-5.
21. Zhu X, Veronese ME, Bernard CC, Sansom LN, McManus ME: Iden-
tification of two human brain aryl sulfotransferase cDNAs.
Biochem Biophys Res Commun 1993, 195:120-127.
22. Wood TC, Aksoy IA, Aksoy S, Weinshilboum RM: Human liver
thermolabile phenol sulfotransferase: cDNA cloning, expres-
sion and characterization. Biochem Biophys Res Commun 1994,
23. Jones AL, Hagen M, Coughtrie MW, Roberts RC, Glatt H: Human
platelet phenolsulfotransferases: cDNA cloning, stable
expression in V79 cells and identification of a novel allelic
variant of the phenol-sulfating form. Biochem Biophys Res
Commun 1995, 208:855-862.
24. Bidwell LM, McManus ME, Gaedigk A, Kakuta Y, Negishi M, Pedersen
L, Martin JL: Crystal structure of human catecholamine
sulfotransferase. J Mol Biol 1999, 293:521-530.
25. Dajani R, Cleasby A, Neu M, Wonacott AJ, Jhoti H, Hood AM, Modi
S, Hersey A, Taskinen J, Cooke RM, Manchee GR, Coughtrie MW:
X-ray crystal structure of human dopamine sulfotransferase,
SULT1A3. Molecular modeling and quantitative structure-activity
relationship analysis demonstrate a molecular basis
for sulfotransferase substrate specificity. J Biol Chem 1999,
26. Gamage NU, Duggleby RG, Barnett AC, Tresillian M, Latham CF,
Liyou NE, McManus ME, Martin JL: Structure of a human
carcinogen-converting enzyme, SULT1A1. Structural and
kinetic implications of substrate inhibition. J Biol Chem 2003,
27. Dajani R, Hood AM, Coughtrie MW: A single amino acid, glu 146,
governs the substrate specificity of a human dopamine
sulfotransferase, SULT1A3. Mol Pharmacol 1998, 54:942-948.
28. Brix LA, Barnett AC, Duggleby RG, Leggett B, McManus ME: Analysis
of the substrate specificity of human sulfotransferases
SULT1A1 and SULT1A3: site-directed mutagenesis and
kinetic studies. Biochemistry 1999, 38:10474-10479.
29. Brix LA, Duggleby RG, Gaedigk A, McManus ME: Structural characterization
of human aryl sulphotransferases. Biochem J 1999,
30. Brix LA, Nicoll R, Zhu X, McManus ME: Structural and functional
characterisation of human sulfotransferases. Chem Biol Interact
1998, 109:123-127.
31. Raftogianis RB, Wood TC, Otterness DM, Van Loon JA, Weinshilboum
RM: Phenol sulfotransferase pharmacogenetics in
humans: association of common SULT1A1 alleles with TS
PST phenotype. Biochem Biophys Res Commun 1997, 239:298-304.


32. Raftogianis RB, Wood TC, Weinshilboum RM: Human phenol
sulfotransferases SULT1A2 and SULT1A1: genetic polymorphisms,
allozyme properties, and human liver genotype-phenotype
correlations. Biochem Pharmacol 1999, 58:605-616.
33. Thomae BA, Rifki OF, Theobald MA, Eckloff BW, Wieben ED,
Weinshilboum RM: Human catecholamine sulfotransferase
(SULT1A3) pharmacogenetics: functional genetic
polymorphism. J Neurochem 2003, 87:809-819.
34. Saintot M, Malaveille C, Hautefeuille A, Gerber M: Interactions
between genetic polymorphism of cytochrome P450-1B1,
sulfotransferase 1A1, catechol-o-methyltransferase and
tobacco exposure in breast cancer risk. Int J Cancer 2003,
35. Wu MT, Wang YT, Ho CK, Wu DC, Lee YC, Hsu HK, Kao EL, Lee
JM: SULT1A1 polymorphism and esophageal cancer in
males. Int J Cancer 2003, 103:101-104.
36. Zheng W, Xie D, Cerhan JR, Sellers TA, Wen W, Folsom AR:
Sulfotransferase 1A1 polymorphism, endogenous estrogen
exposure, well-done meat intake, and breast cancer risk.
Cancer Epidemiol Biomarkers Prev 2001, 10:89-94.
37. King RS, Teitel CH, Kadlubar FF: In vitro bioactivation of N-
hydroxy-2-amino-alpha-carboline. Carcinogenesis 2000,
38. Eisen JA, Fraser CM: Phylogenomics: intersection of evolution
and genomics. Science 2003, 300:1706-1707.
39. Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res
2002, 12:656-664.
40. Cheung J, Estivill X, Khaja R, MacDonald JR, Lau K, Tsui LC, Scherer
SW: Genome-wide detection of segmental duplications and
potential assembly errors in the human genome sequence.
Genome Biol 2003, 4:R25.
41. Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, Rocchi
M, Eichler EE: Positive selection of a gene family during the
emergence of humans and African apes. Nature 2001,
42. Eichler EE, Johnson ME, Alkan C, Tuzun E, Sahinalp C, Misceo D,
Archidiacono N, Rocchi M: Divergent origins and concerted
expansion of two segmental duplications on chromosome
16. J Hered 2001, 92:462-468.
43. Bonifas JM, Morley BJ, Oakey RE, Kan YW, Epstein EHJ: Cloning of
a cDNA for steroid sulfatase: frequent occurrence of gene
deletions in patients with recessive X chromosome-linked
ichthyosis. Proc Natl Acad Sci U S A 1987, 84:9248-9251.
44. Yen PH, Li XM, Tsai SP, Johnson C, Mohandas T, Shapiro LJ: Fre-
quent deletions of the human X chromosome distal short
arm result from recombination between low copy repetitive
elements. Cell 1990, 61:603-610.
45. Loftus BJ, Kim UJ, Sneddon VP, Kalush F, Brandon R, Fuhrmann J,
Mason T, Crosby ML, Barnstead M, Cronin L, Deslattes Mays A, Cao
Y, Xu RX, Kang HL, Mitchell S, Eichler EE, Harris PC, Venter JC,
Adams MD: Genome duplications and other features in 12 Mb
of DNA sequence from human chromosome 16p and 16q.
Genomics 1999, 60:295-308.
46. Pontius JU, Wagner L, Schuler GD: UniGene: a unified view of the
transcriptome. In The NCBI Handbook Bethesda, National Center for
Biotechnology Information; 2003:21.1-21.12.
47. Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ,
Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS: Res-
olution of the early placental mammal radiation using Baye-
sian phylogenetics. Science 2001, 294:2348-2351.
48. Benner SA: Interpretive proteomics--finding biological mean-
ing in genome and proteome databases. Adv Enzyme Regul 2003,
49. Weinshilboum RM, Otterness DM, Aksoy IA, Wood TC, Her C,
Raftogianis RB: Sulfation and sulfotransferases 1: Sulfotransferase
molecular biology: cDNAs and genes. Faseb J 1997,
11:3-14.
50. Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E,
Pachter L, Dubchak I: Strategies and tools for whole-genome
alignments. Genome Res 2003, 13:73-80.
51. Brudno M, Poliakov A, Salamov A, Cooper GM, Sidow A, Rubin EM,
Solovyev V, Batzoglou S, Dubchak I: Automated whole-genome
multiple alignment of rat, mouse, and human. Genome Res
2004, 14:685-692.
52. Chimpanzee Genome Project [

53. Goldman N, Yang Z: A codon-based model of nucleotide sub-
stitution for protein-coding DNA sequences. Mol Biol Evol 1994,
54. Muse SV, Gaut BS: A likelihood approach for comparing synon-
ymous and nonsynonymous nucleotide substitution rates,
with application to the chloroplast genome. Mol Biol Evol 1994,
55. Yang Z, Nielsen R: Codon-substitution models for detecting
molecular adaptation at individual sites along specific
lineages. Mol Biol Evol 2002, 19:908-917.
56. Nielsen R, Yang Z: Likelihood models for detecting positively
selected amino acid sites and applications to the HIV-1 envelope
gene. Genetics 1998, 148:929-936.
57. Yang Z, Nielsen R, Goldman N, Pedersen AM: Codon-substitution
models for heterogeneous selection pressure at amino acid
sites. Genetics 2000, 155:431-449.
58. Felsenstein J: Evolutionary trees from DNA sequences: a max-
imum likelihood approach. J Mol Evol 1981, 17:368-376.
59. Huelsenbeck JP, Rannala B: Phylogenetic methods come of age:
testing hypotheses in an evolutionary context. Science 1997,
60. Bielawski JP, Yang Z: Maximum likelihood methods for detect-
ing adaptive evolution after gene duplication. J Struct Funct
Genomics 2003, 3:201-212.
61. Yang Z: PAML: a program package for phylogenetic analysis
by maximum likelihood. Comput Appl Biosci 1997, 13:555-556.
62. Gaucher EA, Das UK, Miyamoto MM, Benner SA: The crystal structure
of eEF1A refines the functional predictions of an evolutionary
analysis of rate changes among elongation factors.
Mol Biol Evol 2002, 19:569-573.
63. Gaucher EA, Miyamoto MM, Benner SA: Evolutionary, structural
and biochemical evidence for a new interaction site of the
leptin obesity protein. Genetics 2003, 163:1549-1553.
64. Galison PL: How experiments end. Chicago, University of Chicago
Press; 1987:xii, 330 p.
65. Hildebrandt MA, Salavaggione OE, Martin YN, Flynn HC, Jalal S, Wieben
ED, Weinshilboum RM: Human SULT1A3 pharmacogenetics:
gene duplication and functional genomic studies. Biochem
Biophys Res Commun 2004, 321:870-878.
66. Finishing the euchromatic sequence of the human genome.
Nature 2004, 431:931-945.
67. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM,
Haussler D: The human genome browser at UCSC. Genome Res
2002, 12:996-1006.
68. NCBI MapViewer []
69. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R,
Hardison R, Miller W: PipMaker--a web server for aligning two
genomic DNA sequences. Genome Res 2000, 10:577-586.
70. PipMaker and MultiPipMaker []
71. Repeat Masker []
72. Human Recent Segmental Duplication Page [http://]
73. Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green
ED, Hardison RC, Miller W: MultiPipMaker and supporting
tools: Alignments and analysis of multiple genomic DNA
sequences. Nucleic Acids Res 2003, 31:3518-3524.
74. Segmental Duplication Database [http://humanparal]
75. UCSC Genome Bioinformatics []
76. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties
and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680.
77. Yang Z: Among-site rate variation and its impact on phylogenetic
analyses. Trends in Ecology and Evolution 1996, 11:367-372.
78. Kumar S, Tamura K, Jakobsen IB, Nei M: MEGA2: molecular evo-
lutionary genetics analysis software. Bioinformatics 2001,
79. Swofford DL: PAUP 4.0 Phylogenetic Analysis Using Parsi-
mony (And Other Methods). Sunderland, MA, Sinauer Associates;
80. Gonnet GH, Benner SA: Computational Biochemistry Research
at ETH. In Technical Report 154 Department Informatik Zurich, Swiss
Federal Institute of Technology; 1991.
81. DARWIN's Homepage [

82. Liu FG, Miyamoto MM, Freire NP, Ong PQ, Tennant MR, Young TS,
Gugel KF: Molecular and morphological supertrees for euthe-
rian (placental) mammals. Science 2001, 291:1786-1789.
83. Springer MS, Murphy WJ, Eizirik E, O'Brien SJ: Placental mammal
diversification and the Cretaceous-Tertiary boundary. Proc
Natl Acad Sci U S A 2003, 100:1056-1061.
84. VISTA Genome Browser []
85. NCBI UniGene []
86. Huang CC, Couch GS, Pettersen EF, Ferrin TE: Chimera: an exten-
sible molecular modeling application constructed using
standard components. Pacific Symposium on Biocomputing 1996,
