Small subunit ribosomal RNA sequences from Ginkgo biloba and phylogenetic inferences from seed plant small subunit riboso...


Material Information

Title:
Small subunit ribosomal RNA sequences from Ginkgo biloba and phylogenetic inferences from seed plant small subunit ribosomal RNA data
Physical Description:
vi, 119 leaves : ill., photos ; 29 cm.
Language:
English
Creator:
Nairn, Campbell Joseph, 1955-
Publication Date:

Subjects

Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1990.
Bibliography:
Includes bibliographical references (leaves 110-118).
Statement of Responsibility:
by Campbell Joseph Nairn III.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001619335
notis - AHP3857
oclc - 23725701
System ID:
AA00003755:00001

Full Text











SMALL SUBUNIT RIBOSOMAL RNA SEQUENCES FROM GINKGO BILOBA
AND PHYLOGENETIC INFERENCES FROM SEED PLANT SMALL SUBUNIT
RIBOSOMAL RNA DATA

















BY

CAMPBELL JOSEPH NAIRN III


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE
UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1990












This dissertation is dedicated to my brother David

Johns Nairn, First Lieutenant United States Marine Corps,

for his choice to serve and defend the principles of

freedom which allow us to pursue our quest for knowledge.

June 17, 1960-October 23, 1983.













ACKNOWLEDGMENTS


I would like to sincerely thank Dr. Robert Ferl for

his interest, guidance and support during this research

project. I would also like to thank Dr. Michael Miyamoto

and Dr. Walter Judd for their assistance in methods of

phylogenetic analysis. In addition, I would like to thank

Dr. Curt Hannah and Dr. Norris Williams for serving as

committee members and for their discussions and reviews of

this research. Financial support provided through the

Department of Botany, the Department of Vegetable Crops and

the Graduate School is gratefully acknowledged. Finally, I

would like to thank my wife Calene, my daughter Amberlee

and my parents for their support throughout my graduate

studies.
















TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 REVIEW OF LITERATURE

    Molecular Sequences in Plant Systematics
    Small and Large Subunit rDNAs

3 MATERIALS AND METHODS

    Cloning and Sequencing of ss rRNA Genes from Ginkgo biloba
    Gene Copy Number Analysis
    Analysis of DNA Sequences

4 PHYLOGENETIC ANALYSIS OF PLANT ss rRNA GENE SEQUENCES

    Experimental Results
    Discussion

5 RIBOSOMAL SEQUENCES IN THE GINKGO BILOBA GENOME

    Experimental Results
    Discussion

6 SUMMARY AND CONCLUSIONS

    Phylogenetic Analysis
    Small Subunit rRNA Sequences of Ginkgo biloba

REFERENCES

BIOGRAPHICAL SKETCH













Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy


SMALL SUBUNIT RIBOSOMAL RNA SEQUENCES FROM GINKGO BILOBA
AND PHYLOGENETIC INFERENCES FROM SEED PLANT SMALL SUBUNIT
RIBOSOMAL RNA DATA


By


Campbell Joseph Nairn III


August 1990


Chairman: Dr. Robert J. Ferl
Major Department: Department of Botany


Ginkgo biloba is a taxonomically isolated seed plant

frequently placed in an evolutionary grade group referred

to as gymnosperms. Precise phylogenetic relationships

between Ginkgo and other gymnosperm groups remain

uncertain. Small subunit ribosomal DNA sequences were

isolated from a Ginkgo genomic library. One group of

clones represents the small subunit rRNA genes from the

major ribosomal repeat of Ginkgo, which are present in

approximately 16,000 copies in the diploid Ginkgo genome.

A representative, Gbr-1000, from this group was subcloned

and sequenced. A representative, Gbr-6700, from a second

group of clones was also sequenced and contains a ss rRNA-










like sequence that is interrupted by a 1.1 kb insert 700 bp

into the ss rRNA-like region. These ss rRNA-like sequences

are present in approximately 3,400 copies in the Ginkgo

genome. The Gbr-1000 sequence was compared with homologous

sequences from other green plant taxa. Phylogenetic

relationships of these taxa were inferred using a cladistic

parsimony analysis of the combined sequence data. A single

most parsimonious tree was found which supports a

monophyletic grouping of Ginkgo and the cycad Zamia pumila.

The Gbr-6700 sequence was compared with that of Gbr-1000.

The homologous regions of the two Ginkgo sequences are 86%

similar compared with 92%-97% sequence similarities

observed between ss rRNA sequences of available seed plant

taxa. The Gbr-6700 sequence represents a ss rRNA

pseudogene which is present in multiple copies in the

Ginkgo genome. These sequences appear to be undergoing

concerted evolution within the group, similar to that

observed for eukaryotic rRNA gene families.













CHAPTER 1
INTRODUCTION


Ginkgo biloba is a taxonomically isolated seed plant.

It is frequently placed in an evolutionary grade group

referred to as gymnosperms along with cycads, conifers and

gnetopsids. All eukaryotic green plants have historically

been placed in the kingdom Plantae (Foster and Gifford

1974). More recently, based on cladistic studies, a

grouping at the subkingdom level, the Chlorobionta, has been

proposed (Bremer 1985). In the classification proposed by

Bremer, Ginkgo is placed in the subclass Ginkgoidae within

the class Spermatopsida. Spermatopsida contains all extant

seed plants including the remaining gymnosperms in the

subclasses Cycadidae (cycads), Pinidae (conifers), and

Gnetidae (gnetopsids) with all flowering plants

(angiosperms) in the subclass Magnoliidae. Each of these

subclasses forms a cohesive group that is well defined

morphologically and anatomically. The Gnetidae are widely

regarded as the most probable extant sister group to the

angiosperms. However, relationships among the cycads,

Ginkgo, and the conifers remain poorly resolved in plant

systematic studies.








Comparative morphology has historically provided the

basis for systematic study and largely defines the

relationships of plant taxa as biologists currently

perceive them. In the past decade investigators have

significantly increased the morphological and anatomical

data sets. New information, re-examination of many

characters and a more complete treatment of fossil taxa

have refined plant evolutionary theory (Mishler and

Churchill 1985, Dahlgren and Bremer 1985, Crane 1985a,

1985b, Doyle and Donoghue 1986, 1987). These studies also

highlight those relationships that are weakly supported by

synapomorphic (shared derived) characters. The

relationships of cycads, Ginkgo and conifers to each other

and other seed plants remain unclear and are weakly

resolved in systematic studies.

Molecular sequences of homologous macromolecules, both

at the nucleotide and amino acid level, can be used to

infer phylogenetic relationships between extant organisms

(Zuckerkandl and Pauling 1965). These molecular data

provide comparable characters that can be used across a

broad range of organisms, including the Chlorobionta.

Within the eukaryotic genome different types of DNA

evolve at very different rates. Non-coding regions are

typically more variable than those regions coding for

polypeptides or functional RNAs, and different rates of

nucleotide substitution exist within these types of DNA as

well. Different regions of the DNA are useful for









systematic studies at various taxonomic levels. Amino acid

sequences of functional polypeptides were among the first

utilized in molecular evolutionary studies and have been

used to examine relationships across a wide range of taxa

including the green plants (Fitch and Margoliash 1967,

Fitch 1976, Lumsden and Hall 1975, Martin and Dowd 1986).

DNA sequences that code for functional RNAs (tRNA and

rRNA) have also been used in phylogenetic studies. These

sequences are widely distributed in eukaryotes and are

highly conserved even at higher taxonomic levels. The 5S

ribosomal RNA (rRNA) is highly repeated in most eukaryotic

nuclear genomes including the green plants. The 5S rRNA

sequences have been determined for many species including

28 green plants (Hori and Osawa 1979, Hori et al. 1985).

Chloroplast 4.5S rRNA sequences have also been examined for

purposes of plant phylogenetic reconstruction (Bobrova et

al. 1987).

Nuclear sequences for the small and large subunit

ribosomal RNAs are present in the genomes of all eukaryotes

and have also been used for phylogenetic studies. These

rRNA sequences are highly conserved both in primary

sequence and at proposed secondary structural levels. The

small subunit (ca. 1800 base pairs [bp]) and large subunit

(ca. 3000 bp) rRNA sequences provide a tenfold larger

sample size than most other macromolecules thus far

employed in plant molecular evolutionary studies.







The coding regions for these rRNAs are much more

highly conserved than the surrounding spacer regions.

Within the coding region different areas, or subdomains,

exhibit varying levels of sequence conservation, which

in turn can frequently be correlated with evolutionarily

conserved functional domains of proposed secondary

structure models (Brimacombe and Stiege 1985, Gutell et al.

1985). This range of variation in sequence conservation

makes these molecules particularly useful in studies

comparing relatively closely related taxa as well as more

distantly related taxa.

Small subunit rRNA sequences are now available for

many eukaryotes including several diverse protist groups.

The wide diversity at the phenotypic level in the various

protist groups causes frequent problems when trying to

gather comparable character information for systematic

study. Studies using small subunit rRNA sequences have

been extremely useful in examining phylogenetic

relationships among the various protist groups and their

affinities to other eukaryotic (Sogin et al. 1986, Gunderson

et al. 1987) and prokaryotic taxa (Lake 1989, Cedergren et

al. 1988).

Among the green plants the evolutionary relationships

of the gymnosperm groups remain unclear. Phylogenies

inferred from phenotypic characters differ as to which, if

any, of these gymnosperm taxa together form monophyletic

groups (clades) and which taxa are separately derived









within the seed plant lineage. Only a few molecular

studies have addressed relationships of these gymnosperm

taxa to each other and the angiosperms.

Complete small subunit rRNA sequences have been

published for only a few green plant taxa. These include

three angiosperms, a dicot and two monocots (both grasses),

a cycad and representatives from the Chlorophycophyta

(green algae). This study examines seed plant phylogeny

based on ribosomal RNA sequences. Several clones were

isolated from a Ginkgo genomic library which contain full

length small subunit ribosomal RNA (ss rRNA) coding regions

as indicated by separate 5' and 3' ss rRNA probes. Upon

restriction analysis the majority of these clones yield a

map consistent with conserved restriction sites in plant ss

rRNA coding regions. Restriction maps from a second group

of clones, however, were inconsistent with those expected

for a plant ss rRNA sequence in the size of the inferred ss

rRNA coding region and in restriction pattern.

The nucleotide sequence was determined for a

representative clone from each of these two groups, Gbr-

1000 and Gbr-6700, respectively. These sequences were

compared with each other and sequences from other plant ss

rRNA genes. The comparisons indicate that the ss rRNA

coding region contained in Gbr-1000 is 1811 nucleotides in

length and shares high sequence similarity with other

published plant ss rRNA sequences. The second sequence,








Gbr-6700, contains a ss rRNA-like sequence which is more

divergent in sequence composition and is interrupted by an

1100 nucleotide insert approximately 700 nucleotides into

the ss rRNA-like coding region. The Gbr-1000 sequence from

Ginkgo was aligned with other published plant ss rRNA

sequences and these data used to examine phylogenetic

relationships of these taxa using a cladistic parsimony

approach. The Gbr-6700 sequence representing a ss rRNA

pseudogene has been compared with the sequence from Gbr-

1000 and its structure characterized.













CHAPTER 2
REVIEW OF LITERATURE


Molecular Sequences in Plant Systematics


Sequences from biological macromolecules have become

available in recent years and present a new type of

character for use in comparative biology. Molecular

evolutionary studies seek to reconstruct genealogies for DNA

sequences either directly from nucleotide sequences or from

amino acid sequences of proteins. Phylogenetic

relationships can be inferred from systematic studies using

these sequences.

Within a few years of the discovery of the double

helix structure of DNA (Watson and Crick 1953) and the

semiconservative nature of its replication (Meselson and

Stahl 1958), the use of molecular sequences for

evolutionary studies was proposed (Zuckerkandl and Pauling

1965). Initially, accumulation of homologous sequences

from various taxa was difficult. However, by the mid

1960's amino acid sequences for cytochrome c had been

determined for a number of taxa. Early studies utilized

these and amino acid sequences from other proteins.

Availability of nucleotide sequences followed shortly

thereafter with the development of biochemical methods for







their determination (Maxam and Gilbert 1977, Sanger et al.

1977).

The amino acid sequences from cytochrome c were among

the first molecules used to examine evolutionary

relationships of higher plants (Boulter et al. 1972). An

analysis of cytochrome c sequences using maximum parsimony

(Baba et al. 1981) included 26 plant taxa. This work

showed that phylogenies inferred from the lowest nucleotide

replacement score were inconsistent with phylogenies based

on phenotypic data for plant taxa. However, when other

parameters were considered (i.e., gene duplication and gene

expression) (Goodman 1979), a phylogeny more consistent

with phenotypic data was inferred.
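
The parsimony criterion referred to above can be illustrated with a short sketch. It is not the procedure of Baba et al. (1981); it only counts the minimum number of replacements that a set of aligned positions requires on one fixed, rooted, bifurcating tree, using Fitch's set-intersection rule. The tree encoding (nested pairs of taxon names), the function names fitch_steps and tree_length, and the toy taxa A-D are assumptions made for illustration.

```python
def fitch_steps(tree, states):
    """Return the minimum number of state changes for one character.

    tree   -- nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping each leaf name to its observed state
    """
    def visit(node):
        # A leaf contributes its observed state and no changes.
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        lset, lsteps = visit(left)
        rset, rsteps = visit(right)
        common = lset & rset
        if common:
            # The two subtrees can share a state: no extra change here.
            return common, lsteps + rsteps
        # Otherwise at least one change is required below this node.
        return lset | rset, lsteps + rsteps + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum the step counts over every column of an alignment.

    alignment -- dict mapping taxon name to an aligned sequence string
    """
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )
```

Under this criterion the preferred tree is the one with the smallest total length; a full analysis would repeat the count over all candidate topologies, which this sketch does not attempt.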

Partial amino acid sequences of plastocyanin have been

used to examine familial relationships among angiosperms

(Boulter et al. 1979). These sequences evolve at a faster

rate than those of cytochrome c and are therefore

applicable at the familial level rather than higher

taxonomic levels. Sequences for 40 members of ten families

were examined and results of the study showed that members

of a family group together with one exception.

Representatives from two genera of the Fabaceae group

together but are separated from the remaining seven genera

of the family suggesting limits to resolution obtained from

these data. Further work has been done using plastocyanin

amino acid sequences to examine relationships among genera

of the Ranunculaceae and its relationship to other plant








families (Grund et al. 1981). These data grouped the five

genera of the Ranunculaceae but inferences of relationships

between families were not clear, possibly due to the number

of taxa that were not represented in the study.

Sequences for cytochrome c and plastocyanin continued

to accumulate and amino acid sequences from other proteins

were determined as well. The amino acid sequence for the

small subunit of ribulose biphosphate carboxylase (RBC SSU)

was determined for several plant taxa and its use in

phylogenetic reconstruction examined (Martin et al. 1983).

In addition to these, sequences for 5S rRNA were beginning

to accumulate (Dyer 1982).

Martin et al. (1985) examined angiosperm phylogeny

using amino acid sequences from four proteins (cytochrome

c, RBC SSU, plastocyanin, and ferredoxin) and the 5S

rRNA. Amino acid sequences were converted to inferred

nucleotide sequences, and a parsimony method was employed.

In addition to using unweighted data, a weighting scheme

was applied to the data. The ratios of observed/expected

incompatibilities were determined for nucleotide positions

and used to weight characters so that positions not showing

parallelism were favored. Majority rule consensus trees

presented indicate that genera within a family are grouped

together and inferred relationships between families were

largely consistent with phylogenies based on phenotypic

data. Although the grouping of Fabaceae and Brassicaceae







inferred from the data is not supported by phenotypic

evidence, the authors discussed the need for more complete

data sets and noted that detailed comparisons of published

phylogenies could not be made.

Further investigations of plant relationships using

partial amino acid sequences from RBC SSU were conducted.

Sequences for 15 monocot genera and four gymnosperm taxa

were determined (Martin and Dowd 1986). This analysis

showed once again that genera within a family grouped

together. In addition to this, the monocot taxa grouped

together relative to dicot taxa and gymnosperm taxa.

Limits on resolution were apparent at higher levels, and

the precise joining of some internodes was not clear.

Although members of the Liliaceae grouped together, precise

relationships within the family were not resolved. The

node joining gymnosperms, dicots and monocots was also

ambiguous and was therefore left unresolved in the presented

phylogeny. The inferred phylogeny did show that of the

gymnosperms included, Ephedra grouped closer to the

angiosperms than cycad and conifer taxa. Also, of the

monocots included, members of the Alismatidae grouped

closest to the dicots. The authors noted that these

sequences are useful in deriving approximately accurate

phylogenies but that sequences from additional

macromolecules would be needed to resolve relationships

where internodes are short.







By 1989 the RBC SSU data set had been expanded to

include representative genera from 124 families. These

sequences were analyzed and again genera within a family

usually grouped together (Martin and Dowd 1989). Family

nodes were derived when members of a family grouped

dichotomously and then used to examine familial

relationships as well as those above the family level. Of

the 124 families represented, 102 were placed in 24 groups

and 22 families remained ungrouped reflecting taxonomic

uncertainty. This analysis closely grouped the four

included gymnosperm taxa and the node was used to represent

gymnosperms in subsequent analyses of angiosperm taxa. The

family Schisandraceae always grouped closest to the

gymnosperms. Other major features of inferred phylogenies

showed that monocots grouped together and that these were

placed closest to the dicot families Piperaceae and

Nelumbonaceae. Criteria used in the analysis were shown to

affect the results significantly. Using a majority rule

consensus, 80% of the families grouped dichotomously

whereas using a strict consensus method the success rate

was reduced to approximately 44%. Based on evaluation

using independent taxonomic criteria the authors noted that

the intra-group results appear to be approximately correct

and that strict consensus trees may be too strict for use

with these data.

Sequences from organellar genomes can

also be utilized in comparative studies. Sequences for the








large subunit of ribulose biphosphate carboxylase/oxygenase

(rbc L) from the chloroplast genome have been determined

for five angiosperms and a green alga. Sequences were also

available for three prokaryotes. Higher plant

representatives for ATP synthase (atp B), non-coding

sequences for a tRNA intron (trn VI) and a 5' leader of rbc

L (Ritland and Clegg 1987) were also utilized. The

analysis of these data was carried out using a maximum

likelihood approach to examine inferred tree topologies.

All 15 possible topologies for five plant taxa were examined

using the rbc L sequences. Comparisons were also made

between phylogenies inferred from each of the three

different codon positions. A node grouping the two

monocots was strongly supported but resolution of the dicot

taxa was less certain. A bootstrap resampling method was

employed to examine the strength of the inferred

topologies. This indicated that a node joining tobacco and

pea with spinach placed intermediate relative to the

monocots was weakly supported over alternative topologies.

Consistency between topologies inferred from the four

different types of sequences was examined. This indicated

that each of the three codon positions from rbc L and atp B

as well as the two noncoding sequences support the same

topology. For this topology the monocots are grouped

dichotomously at a node and the dicots are grouped together

in an unresolved trichotomy. Dendrograms presented based







on a UPGM (unweighted pair group method) analysis agree

with other results in grouping the monocots together and

the dicots together. However, this approach supports a

grouping of spinach and tobacco with pea as intermediate in

contrast to the spinach intermediate inferred from maximum

likelihood analysis.
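
The figure of 15 topologies for five taxa matches the standard count of unrooted, fully bifurcating trees, assuming that is the sense intended. A minimal sketch of the count follows; the function name count_unrooted_trees is an assumption for illustration. Each new taxon can be attached to any branch of the existing tree, and an unrooted tree with k taxa has 2k - 3 branches.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # A tree with k taxa has 2k - 3 branches, so attaching taxon k + 1
    # multiplies the number of possible trees by 2k - 3.
    for k in range(3, n_taxa):
        count *= 2 * k - 3
    return count
```

The count grows quickly (3 trees for four taxa, 15 for five, 105 for six), which is why exhaustive evaluation of every topology is practical only for small numbers of taxa.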

One possible reason for the discrepancy is that UPGM

assumes equal evolutionary rates while maximum likelihood

does not. The applicability of chloroplast sequences to

phylogenetic study is supported by the consistency of the

results using different sequences. This is apparent

despite unequal rates between codon positions, different

types of sequences and different lineages.
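
The dependence of the pair group method on equal rates can be seen in a small sketch of the clustering procedure. This is an illustrative implementation under stated assumptions, not the program used in the study cited: the function name upgma, the input format (a dictionary of pairwise distances keyed by name pairs) and the toy distances are all hypothetical.

```python
def upgma(names, dist):
    """Cluster taxa by the unweighted pair group method.

    names -- list of taxon names
    dist  -- dict mapping (a, b) name pairs (either order) to a distance
    Returns a nested tuple (left, right, height) describing the dendrogram.
    """
    def lookup(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    # Active clusters: id -> (subtree, number of taxa in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): lookup(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest distance.
        i, j = min(d, key=d.get)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Equal-rate assumption: both lineages get the same depth.
        height = d[(i, j)] / 2.0
        new = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Distance to the new cluster is the size-weighted average.
            new[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        d.update(new)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

As a hypothetical case, take a tree in which A and B are sister taxa but A has evolved much faster, giving distances A-B = 6, A-C = 8 and B-C = 4. The sketch joins B and C first, because they are the closest pair, and so fails to recover the A-B grouping. A method that does not assume equal rates is not bound to make this error.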

Sequences for 5S rRNA have also been utilized in

phylogenetic studies (Kimura and Ohta 1973, Hori 1975,

Schwartz and Dayhoff 1976). These molecules are 116-120

nucleotides in length and are present in a wide range of

organisms. Hori and Osawa (1979) presented a study based

on 5S rRNA sequences from 54 species which included five

angiosperms and a green alga. The rate of nucleotide

substitution Knuc, representing the number of replacements

per site per year (Hori 1975), was calculated for all pairs

of 5S rRNA sequences. Phylogenetic trees were inferred

using a matrix method. The phylogenetic tree presented

grouped the green plants together with the green algal

sequence basally divergent within this group. Later, a

study focussing on plant phylogeny was presented which








included sequences from 28 plant species (Hori et al.

1985). The phylogeny presented generally reflects plant

evolutionary theory in the relative order of divergence of

higher taxonomic groups. However, precise relationships

among some higher taxonomic groups (e.g., bryophytes, ferns

and gymnosperms) are not consistent with phenotypic

studies. The green algae are basal in the proposed

phylogeny with the charophyte Nitella intermediate between

them and land plants. The ferns and fern allies group

together as do the bryophytes. However, these two groups

share a node after splitting from other higher plants which

is not supported by phenotypic data. The gymnosperm taxa

represented (not including gnetopsids) also form a group.

This has been proposed based on phenotypic data but is

neither strongly supported nor widely accepted by plant

systematists. The angiosperms group together relative to

other plants but small Knuc values did not provide

resolution within the group.

Chloroplast 4.5S rRNA sequences have also been

examined for purposes of plant phylogenetic study. A study

by Bobrova et al. (1987) included sequences from ten plant

taxa, two bryophytes, a fern, four monocots and three

dicots. This analysis infers a grouping of the two

bryophytes as well as a grouping of the angiosperm taxa.

Within the angiosperms, the dicots (Nicotiana, Ligularia,

Spinacia) group together but a grouping of all monocot taxa







is not well supported. The two grasses (Zea and Triticum)

group together but the positions of Acorus and Spirodela

are not well resolved. Another result inconsistent with

phenotypic data is the position of the bryophytes

intermediate between that of the fern and the angiosperms.

Poor resolution can at least in part be attributed to the

slow rate at which these sequences evolve. For example,

Nicotiana and Ligularia have identical 4.5S rRNA sequences

but belong to separate orders within the Asteridae

(Cronquist 1981).

Some of the most extensive studies of plant

chloroplast DNA (cpDNA) evolution have been undertaken by

Palmer and coworkers (Palmer and Thompson 1982, Palmer

1987, Palmer et al. 1988). Several methods of analyzing

cpDNA have been examined and are applicable at various

taxonomic levels. These include the study of the

arrangement and structure of the chloroplast genome,

restriction enzyme analysis, and direct DNA sequencing.

Restriction analysis was the earliest method used and has

been applied to phylogenetic problems from the inter-

species level to the intra-familial level. At lower

taxonomic levels data can be based on presence or absence

of restriction fragments. At higher levels restriction

analysis often requires complete mapping of the cpDNA.

From the restriction data, nucleotide substitutions are

inferred and used to construct phylogenies. Results of

these types of analysis have been particularly useful in








grouping species of a genus, genera within families, and

have provided some information on relationships between

families. Additionally, some information at the intra-

specific level is derived from this type of analysis, but

the slow rate of cpDNA evolution somewhat limits the

usefulness of restriction data at this level. To date

investigations of higher order relationships of plants have

been limited. However, initial studies using cpDNA

sequences (particularly from rbc L genes) from limited taxa

have indicated that this approach should be useful at

higher taxonomic levels. Additionally, the sequence

approach is useful within more diverse families where

restriction data alone does not provide sufficient

resolution.

The applicability of molecular data to plant

phylogenetic studies is viewed as promising by many

investigators although problems and limits of some

molecular data are recognized. Other investigators are

more critical and have chosen not to include molecular data

in analyses on which plant classifications are based.

Bremer et al. (1987) examined 5S rRNA sequences and

conducted a cladistic parsimony re-analysis of the data.

In this analysis 57 most parsimonious trees were found.

Only one example of these was presented but it was highly

incongruent with phenotypic data. The authors then

excluded the 5S rRNA data from the analysis citing







extensive homoplasy at levels higher than those found in

morphological data sets.

Steele et al. (1988) commented on the quick dismissal

of the 5S rRNA data in plant systematic studies. The

authors discussed several problems in the use of these

sequences and offered suggestions on how some of these

might be dealt with. Methodological problems exist in

analyzing molecular data. Many studies, especially early

ones, utilized phenetic approaches in molecular

systematics. These are viewed less favorably than

cladistic parsimony methods by many systematists. However,

more recently cladistic studies of molecular data have

become more common. Other problems such as homoplasy,

constraints imposed by secondary structure of the molecule

and non-random substitution have also been discussed.

One major concern when using structural RNA sequences

is the effect of stem pairing regions on substitutional

rates. Wheeler and Honeycutt (1988) examined rates of

change in non-paired (loop) and stem regions. They

observed a higher rate of sequence differences that

maintain base pairing than expected based on a neutral

model. They suggest that there is positive selection for

the second substitution in stem regions which restores base

pairing. These investigators also examined phylogenetic

reconstructions using stem regions only, loop regions only,

and the combined data. Their results indicated that

phylogenies inferred from stem positions and the combined








data were not congruent with those inferred from non-

pairing positions and that the latter were more consistent

with phenotypic data.

Plant molecular data sets for cytochrome c, RBC SSU,

plastocyanin, and ferredoxin (Martin and Dowd 1986, Boulter

et al. 1979) have recently been re-examined using a

cladistic parsimony approach (Bremer 1988). Amino acid

sequences were converted to inferred nucleotide sequences.

The combined data for all four proteins contain potentially

informative positions for nine angiosperm families (usually

represented by more than one species). The data matrix was

analyzed using PAUP (Phylogenetic Analysis Using Parsimony)

developed by Swofford (1985). Two most parsimonious trees

are inferred from this data matrix which require 161 steps

(nucleotide replacements) with a consistency index of

0.689. This indicated that 50 steps were a result of

parallelisms and reversals and therefore homoplasy is

common in the data. Additionally one tree at 162 steps and

three at 163 steps were found. A strict-consensus tree for

these shortest trees found showed that there were no

monophyletic groups common to all five trees (161-163

steps). Although resolution is limited, these sequences

provided information on some relationships. They have been

useful at intra-familial levels and often have shown that

among families two families are more closely related to

each other than either is to other families. Once again it







was noted that additional sequences were needed before the

usefulness of this type of sequence data in plant

systematics could be assessed.
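
The arithmetic behind the figure of 50 homoplastic steps can be checked directly. The consistency index is the minimum number of steps the data could require divided by the number of steps observed on the tree, so a tree of 161 steps with an index of 0.689 implies a minimum of about 111 steps and 50 extra steps. The function names below are assumptions for illustration.

```python
def consistency_index(min_steps, observed_steps):
    """Ratio of the minimum possible number of steps to the observed tree length."""
    return min_steps / observed_steps


def extra_steps(observed_steps, ci):
    """Steps attributable to homoplasy, given tree length and consistency index."""
    # The minimum step count is recovered by rounding, since the
    # published index is itself rounded to three decimal places.
    min_steps = round(observed_steps * ci)
    return observed_steps - min_steps
```

An index of 1.0 would mean that every character changed only as often as strictly necessary; lower values indicate more parallelisms and reversals.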

Mishler et al. (1988) also reviewed some of the

problems in using molecular data for phylogenetic studies.

In sequences of structural RNAs, compensating substitutions

in stem forming regions may not represent independent

characters in an analysis. One solution to this problem

may be to eliminate one of two pairing positions from the

data matrix (Steele et al. 1988). A weighting system

which gives less weight to one of two positions in a paired

region may also be useful in dealing with the problem.

However, caution should be exercised as many secondary

structure models have not been tested biochemically, but

instead are inferred from structural models of homologous

sequences from other taxa.

Mishler et al. (1988) also addressed the

transition/transversion bias reported in some molecular

data sets. Again a weighting system can be applied to the

data if ratios of these types of changes are known. This

can itself present a problem when both types of change have

occurred in the same character position of a data matrix.

There has not yet been a computer parsimony program

available that can apply weights within a character, but a

forthcoming version of PAUP (Swofford 1985) will reportedly

contain such an option.
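
A simple form of the weighting described above can be sketched for a pair of aligned sequences. Substitutions between two purines (A, G) or two pyrimidines (C, T) are transitions; all others are transversions. The weights used here (1 for transitions, 2 for transversions) are purely hypothetical, as are the function names; the sketch does not reproduce the within-character weighting of any particular parsimony program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of aligned bases; return None for gaps or ambiguity codes."""
    # RNA sequences are normalised so that U is treated as T.
    x, y = x.upper().replace("U", "T"), y.upper().replace("U", "T")
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        return None
    if x == y:
        return "identical"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def weighted_difference(seq1, seq2, ts_weight=1.0, tv_weight=2.0):
    """Sum hypothetical weights over all differing positions of two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for x, y in zip(seq1, seq2):
        kind = substitution_type(x, y)
        if kind == "transition":
            total += ts_weight
        elif kind == "transversion":
            total += tv_weight
    return total
```

In practice the relative weights would be chosen from the observed ratio of the two kinds of change in the data set under study.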







Another problem discussed by Mishler et al. (1988) is

the different rates at which various sequences evolve. If

substitution rates are too rapid for homologous sequences

at a given taxonomic level, information will be lost due to

multiple substitutions at individual sites. The problem

could be addressed by determining which molecules evolve at

rates useful for the taxonomic level being studied. It is

therefore not a question of whether molecular sequences are

appropriate for phylogenetic study, but rather which

sequences are useful at which taxonomic levels.

Homoplasy also presents a problem in molecular data.

In morphological data considerable character analysis is

performed prior to an analysis. This has not been the case

for most molecular data. An advantage of molecular data is

the potential to utilize very large data sets. Therefore,

it may be possible to acquire a large number of

historically informative characters despite the homoplasy

present in the data. Homoplasy is not restricted to plant

sequences. Miyamoto et al. (1987) encountered the problem

in primate sequences but have still been able to make sound

phylogenetic inferences by recognizing the problems and

incorporating methods of dealing with them into their

analysis. Although homoplasy is a problem in all data it

may be more prevalent in plant sequences than in animal

sequences. Plant and animal phylogenies have been

constructed based on cytochrome c sequences. The amount of

homoplasy due to convergence, parallelism and reversals was








also examined (Syvanen et al. 1989). These investigators

found that homoplasy was more common in the plant data than

in the animal data for the same protein sequence. The data

sets for 26 plant species and 27 animal species contain 84

and 85 characters, respectively. For unrooted phylogenies,

the consistency index for plant sequences was 0.50 compared

with 0.68 for animal sequences.

One of the greatest advantages of molecular sequences

is the potentially large number of characters they can

provide. However, most plant studies to date have used

relatively small sequences (<150 nucleotides) because of

the availability of homologous sequences from a number of

taxa. This is changing as more rapid sequencing methods

are becoming available.


Small and Large Subunit rRNAs


The coding regions for the small subunit ribosomal RNA

(ss rRNA) and the large subunit ribosomal RNA (ls rRNA) are

approximately 1800 and 3000 nucleotides, respectively.

These are present in all prokaryotes and eukaryotes. In

most eukaryotes and in all plant taxa examined these are

arranged in tandemly repeated arrays and contain the 5.8S

rRNA between the ss rRNA and ls rRNA. The coding regions

for the rRNAs are separated by spacer DNA (internal

transcribed spacer or ITS) which is transcribed as part of

the transcription unit. Intergenic spacer (IGS) DNA is







also present between the tandemly arrayed units and was

originally termed nontranscribed spacer or NTS (Jorgensen

and Cluster 1988). However, portions of this at the 5' end

of the ss rRNA and the 3' end of the ls rRNA are

transcribed as part of the transcription unit (Long and

Dawid 1980, Dvorak and Appels 1982). The large

transcription unit is subsequently processed in the nucleus

to yield mature rRNA molecules.

Plant genomes contain a large number of ribosomal

repeat units ranging from 200 to 22,000 copies per haploid

genome for species examined (Rogers and Bendich 1987).

Within the IGS region of plant species examined there are

subrepeats of 100 to 300 nucleotides. The number of these

subrepeats has been shown to vary between individuals of a

species, within individuals (Appels and Dvorak 1982,

Jorgensen et al. 1987), and between neighboring gene

repeats on a chromosome (Rogers and Bendich 1987).

Ribosomal repeats may be present on more than one

chromosome and in some cases where this occurs, individual

size classes related to subrepeat numbers have been

correlated to different chromosomes (Appels and Honeycutt

1986). Inbred lines of maize have been examined in which

there is no detectable size variation in rRNA repeats but

some non-inbred lines do show variability among individuals

in the number of subrepetitive elements (Rogers and Bendich

1987).







The function of these subrepetitive elements has not

been experimentally determined for plants. However,

because of similarities with other systems inferences have

been made from animal and protist studies. Evidence has

suggested that these subrepetitive regions are hot spots

for recombination and provide a mechanism for change in

copy number of rRNA repeats. These subrepeats have been

called enhancers and may bind factors involved in RNA

polymerase I attachment thus affecting transcriptional

rates. Another possible role of these elements is that

they function as RNA processing sites and/or transcription

termination sites (Rogers and Bendich 1987). These sites

have been shown to retain polymerase instead of allowing

dissociation at the 3' end of the ls rRNA gene. In some

systems these elements have been shown to be necessary for

transcription of the adjacent gene of a repeated array

(McStay and Reeder 1986). It has also been suggested

(Rogers and Bendich 1987) that if these elements do in fact

contain terminators they may play a role in preventing

unnecessary transcription in genomes containing large

numbers of rRNA repeats.

Although some types of heterogeneity exist among rRNA

repeat units, these units have been shown to be largely

homogeneous within an individual genome. The homogeneity

is presumably due to concerted evolution of the repeat rRNA

units (Arnheim et al. 1980). Variation in an individual's

rRNA sequences has been observed in length of repeat units,








nucleotide sequence, copy number and base modification (i.e.,

methylation of cytosine residues). These different types

of variation can be useful in examining phylogenetic

relationships at various taxonomic levels (Jorgensen and

Cluster 1988).

The most variable portion of the rRNA repeat has been

found within the region of the IGS that contains the

subrepetitive elements. The next most common variation

reported is in other regions of the IGS. Spacer regions

within the repeat (ITS) between rRNA coding regions are

more highly conserved than the IGS but are much more

variable than the ss rRNA, ls rRNA and the 5.8S rRNA coding

regions. These spacer regions have been useful in

examining plant relationships from the intra-populational

level to the inter-generic level (Schaal et al. 1987,

Schaal and Learn 1988, Hamby and Zimmer 1988).

The most highly conserved portions of the ribosomal

repeat are the regions coding for mature rRNA. Different

rates of change are also observed for areas within these

regions. Generally the ss rRNA is more conserved overall

than is the ls rRNA but the more variable regions within

the ss rRNA are not as highly conserved as some regions of

the ls rRNA. Overall the 3' regions of each of these are

more conserved than their respective 5' regions and this

difference is more pronounced in the ls rRNA sequence.

Length heterogeneity has not been detected in these coding







regions from restriction analysis of plant taxa (Jorgensen

and Cluster 1988). However, length differences of one to

six nucleotides are indicated from comparisons of ss rRNA

available for plant species (Eckenrode et al. 1985, Nairn

and Ferl 1988). The different rates of substitution

observed within these coding regions make these sequences

potentially useful from the intergeneric level to higher

taxonomic levels including comparisons between kingdoms of

major groups of organisms (Jorgensen and Cluster 1988,

Hasegawa et al. 1985, Gouy and Li 1989, Lake 1989). These

and other studies (Wolters and Erdmann 1986, Gunderson et

al. 1987) using the limited numbers of sequences available

have been consistent with a monophyletic grouping of plants

including the green algae and excluding other algae from

this group.

Few complete sequences of ss rRNA genes are available

for plant species but the rate of accumulation of these is

increasing due to more rapid sequencing methods. The most

extensive rRNA studies for plants to date have utilized

partial sequences from both ss rRNA and ls rRNA. Direct

sequencing methods for RNA have allowed data accumulation

for a large number of plant taxa.

A study of nine species from the Poaceae (grass

family) has been conducted using the fern ally Psilotum as

an outgroup (Hamby and Zimmer 1988). A parsimony approach

using PAUP was used. The data contained 1648 positions.







Of these 244 positions were variable for all taxa, 119 were

variable within the seed plant taxa, and 85 were

potentially phylogenetically informative.

A single most parsimonious tree was inferred from the

analysis which required 161 steps. Additionally 59 next

most parsimonious trees were reported ranging from 162 -

167 steps. The results of the analysis were compared with

a classification proposed for the Poaceae (Watson et al.

1985). The most parsimonious tree grouped the members of

the subfamily Panicoideae (Zea, Tripsacum, Sorghum, and

Saccharum) monophyletically. Members of the subfamily

Pooideae also formed a monophyletic group (Triticum,

Hordeum, and Avena). The two members of the Bambusoideae,

Oryza and Arundinaria, however, did not group

monophyletically in the rRNA analysis. The classification

of Watson et al. (1985) placed Hordeum and Triticum in the

supertribe Triticanae while Avena is placed in the

supertribe Poanae. The rRNA analysis did not agree with

the classification at this level. For the most

parsimonious tree (161 steps) and the next most

parsimonious trees (162-167 steps) found, the rRNA data

indicated that Hordeum and Avena are more closely related

than either is to Triticum. However, the authors noted

that the data contained only 22 informative characters for

members within the Pooideae and that accumulation of more

sequence data may improve the resolution of these groups.







A study of higher level plant relationships has also

been conducted using partial ss rRNA and ls rRNA sequences

(Zimmer et al. 1989). About 1700 nucleotides for 39 plant

species were determined. Of these 514 were variable, 350

of which were potentially phylogenetically informative.

Two equally most parsimonious trees were inferred from PAUP

analysis of the data which required 1296 steps. These

differed only in the placement of two grass species. A

tree presented therein showing major groups of most

parsimonious trees reflects higher level relationships of

vascular plants included in the study. The analysis

indicated that the seed plants are monophyletic. The

gymnosperms were resolved as paraphyletic but cycads,

Ginkgo and conifers form a monophyletic group. The

angiosperms were also grouped monophyletically. Nymphaea

and Cabomba group together and are reflected as the sister

group to the other dicots included. The Arales and the

grasses also grouped together in this analysis but two

other monocots not represented in the tree were reported to

have grouped within the dicots. Another feature of this

phylogeny is that the cycad-Ginkgo-conifer group was

resolved as the sister group to the angiosperms in contrast

to the widely supported view that the Gnetales are the

extant sister group to the angiosperms.

Like many other molecular data sets, a

transition/transversion bias has been reported for this

plant rRNA data. The data were also analyzed using the








evolutionary parsimony method developed by Lake (1987)

which utilizes primarily transversional differences in the

sequences. The resolution of taxa involved was reduced

using this method but the monophyletic grouping of

angiosperms was retained and the Gnetales were resolved as

the sister group to the angiosperms as supported by

phenotypic data sets. Further examination of the total

data was conducted using PAUP and it was found that it

required only three additional steps over the most

parsimonious trees to place the Gnetales as the sister

group to the angiosperms.

It is widely proposed in molecular systematic studies

that more nucleotide positions as well as sequences from

additional taxa are needed to address plant phylogeny with

molecular data. Another problem facing molecular

systematists is the availability of rigorous computer

methods for dealing with the data. Many investigators have

remarked on the need for more careful analysis of molecular

data sets (Patterson 1987, Bremer 1988, Steele et al. 1988,

Mishler et al. 1988, Humphries 1989). Caution has also

been urged in accepting phylogenetic systems based solely

on rRNA sequences (Rothschild et al. 1986) but few, if any,

molecular systematic studies have advocated such an extreme

approach. It is important that all available biological

data be considered when addressing phylogenetic

relationships.













CHAPTER 3
MATERIALS AND METHODS


Cloning and Sequencing of ss rRNA Genes from Ginkgo biloba


Genomic DNA for Ginkgo biloba was isolated from fresh

young leaves as described by Rivin et al. (1982). Lambda

vector DNA, EMBL3 (Frischauf et al. 1983), was grown using

K803 host cells in NZCYM media and isolated by the CsCl

procedure as described by Maniatis et al. (1982). The

plasmid pUC 19 (Messing and Vieira 1982) and subclones

therein were grown in host cell line TG1 (Gibson 1984) in

YT media (Miller 1972). Plasmid DNA was isolated by the

alkali lysis method (Birnboim and Doly 1979) and purified

over two successive CsCl gradients. M13 sequencing

vectors, MP 18 and MP 19, and subclones therein were

cultured using TG1 host cells in YT media and isolated by a

PEG (polyethylene glycol) precipitation method (Messing

1983).

A lambda genomic library for Ginkgo biloba was made

using EMBL3. Bam HI compatible lambda arms were prepared

and genomic DNA was partially digested with Bam HI.

Genomic digestions were optimized to generate fragments in

the 15-20 kb range. Genomic DNA was ligated into EMBL3

arms at a 1:1 molar ratio overnight at 14° C. Recombinant







phage DNA was packaged using Packagene extract (Promega

Biotech) according to the manufacturer's protocol. After

titering, the libraries were plated on 150 mm petri plates.

Duplicate blots of each plate were lifted on

nitrocellulose then denatured, neutralized and dried

(Benton and Davis 1977). Library lifts were probed using

plant small subunit ribosomal clones from Zamia pumila,

pZpr-1300 and pZpr-1400, which contain the 5' and 3'

portions of the gene. Inserts from these two clones

released by restriction digests, 550 bp and 1.3 kb

respectively, were isolated on dialysis membrane and 3MM

paper (Maniatis et al. 1982) after electrophoresis on 1.0%

agarose gels in TBE buffer (0.089 M Tris-borate, 0.089 M

boric acid, 0.002 M EDTA [ethylenediamine-tetraacetic

acid]). Insert DNA was labelled by nick translation (Rigby

et al. 1977). Prehybridization was carried out in 5X SSC

(0.75 M NaCl, 0.06 M NaH2PO4, 0.005 M EDTA, pH 7.0), 5X

Denhardt's solution (Maniatis et al. 1982), 0.1% SDS (sodium

dodecyl sulfate) and 100 ug/ml herring sperm DNA for 4-6

hours at 65° C. Probe was added and hybridization

continued overnight. Blots were washed in 3X SSC with 0.5%

SDS once at room temperature for 5 minutes then twice at

65° C for 30 minutes each (Maniatis et al. 1982). Library

lifts were then blotted lightly between 3MM paper and

wrapped in PVC plastic (Fisher Scientific). Exposure was

overnight on Kodak XAR-5 film with X-Omatic intensifying

screens at -70° C. Positive plaques found on both of the








duplicate lifts were picked and purified through three

successive screenings using the Zamia ribosomal probes.

Recombinant lambda clones were mapped using a variety

of restriction enzymes. DNAs were separated by

electrophoresis on 1.0% agarose gels in TBE buffer. Gels

were blotted on nitrocellulose filters (Southern 1975).

Hybridization was carried out as above for library lifts.

A Sal I restriction fragment, ca. 6.1 kb, from lambda

Gb-01 containing the entire ss rRNA coding region was

subcloned into the Sal I site of pUC 19 to become pGbr-

1000. The Sal I insert of pGbr-1000 contains a single Eco

RI site. pGbr-1000 was digested with Sal I and Eco RI and

the two resulting fragments were subcloned into the Sal I-

Eco RI sites of the M13 sequencing vectors MP 18 and MP 19

(mGbr-1011,1012,1021,1022).

The genomic clone lambda Gb-06 contained a single Bam

HI fragment, ca. 12 kb, which hybridized to both 5' and 3'

ribosomal probes. Restriction fragments from the Bam HI

insert hybridizing to these probes were subcloned into M13

sequencing vectors MP18 and MP19.

Twelve primers (rsp-734,735,778-787), each 15 bp in

length were synthesized corresponding to highly conserved

regions of eukaryotic small subunit ribosomal DNA

approximately 300 bp apart for both strands. Sequencing

reactions were carried out on pGbr-1000 and derived

subclones in M13 using these twelve primers and the -40








universal sequencing primer. Subclones in M13 derived from

pGbr-6000 and lambda Gb-06 were sequenced using the -40

universal sequencing primer.

Initial dideoxy sequencing reactions (Sanger et al.

1977) were performed using the KB sequencing kit (Bethesda

Research Laboratories [BRL]). Duplicate reactions were

performed where needed to resolve regions of secondary

structure. These were accomplished using the Sequenase kit

(United States Biochemical), the M13 dideoxy kit (New

England Biolabs) and the KB kit (BRL) including

substitutions of deaza-7-GTP and dITP for standard dGTP.

Reactions were also modified by running them at higher

temperatures ranging from 46-50° C. Reactions were run on

gels by two or three successive loadings spaced 4-6 hours

apart. Initial gels were poured with 5% acrylamide and 7M

urea in 1X TBE buffer using 65 cm X 33 cm plates and wedge

spacers from 0.4 mm at the top to 0.8 mm at the bottom.

These gels were run at 1800-2400 volts to maintain a gel

surface temperature of 45-50° C. Duplicate gels were from

4-8% acrylamide and 8M urea in 1X TBE buffer. These were

run at 2000-3000 volts to maintain a gel surface

temperature of 55-62° C. Sequencing gels were exposed on

Kodak XRP-1 film at -70° C. Sequences for Gbr-1000 were

also generated by automated methods using the Genesis 2000

DNA analysis system (Dupont) according to the manufacturer's

protocol.








Gene Copy Number Analysis


Restriction digests of plasmid, pGbr-1000 and pGbr-

6000 (a plasmid subclone containing a ~3.5 kb Hind III

fragment from Gbr-6700), and genomic DNA for copy number

analysis were run on 1.0% agarose gels in TBE buffer. DNA

samples were quantitated using a Beckman DU-68 micro-

spectrophotometer. Gels were blotted overnight (Southern

1975) on Gene Screen (New England Nuclear) using 0.025 M

Na2HPO4/NaH2PO4 (pH 6.5). Blots were exposed to

ultraviolet light for 6 minutes to crosslink DNA to the

nylon membrane then prehybridized in 0.5 M NaPO4, 1% BSA

(bovine serum albumin), and 7% SDS for 6 hours at 68° C.

Fresh hybridization solution was added with the nick

translated probe and incubation continued overnight at 68°

C. Blots were washed in 40 mM NaCl, 40 mM NaPO4, 1 mM

EDTA, and 1% SDS once at room temperature for 5 minutes

then twice at 68° C for 30 minutes each. Blots were

exposed to Kodak XAR-5 film. The optical densities of

autoradiograph band areas were determined using a laser

densitometer (Molecular Dynamics model 300A) according to

the manufacturer's protocol.


Analysis of DNA Sequences


The Ginkgo sequences were entered on an IBM PC-XT

computer using the Microgenie software and a Gel Mate 1000

sonic digitizer (Beckman Corp.). The Microgenie software








was used to overlap and merge gel sequences. The completed

sequences were then examined for base composition and

restriction sites using the Analysis programs of

Microgenie.

The Gbr-1000 coding region and ss rRNA sequences for

corn, Zea mays (Messing et al. 1984), rice, Oryza sativa

(Takaiwa et al. 1984), soybean, Glycine max (Eckenrode et

al. 1985), a cycad, Zamia pumila (Nairn and Ferl 1988), and

the green alga Chlamydomonas reinhardtii (Gunderson et al.

1987) were entered and pairwise alignments were generated

for all six plant taxa. These alignments maximize

similarity of the two sequences involved using default

parameters for gaps and mismatches (Queen and Korn 1984).

Sequence similarities for pairwise alignments were

calculated as the number of matching nucleotides divided by

the length of the alignment (Table 1).
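
The similarity measure described above (matching nucleotides divided by alignment length) can be sketched in a few lines of Python. This is only an illustration, assuming the two sequences are already aligned and that gaps are written as "-"; the function name and the example sequences are hypothetical and are not part of the Microgenie software.

```python
def pairwise_similarity(aligned_a, aligned_b, gap="-"):
    """Fraction of alignment columns in which both sequences carry the same nucleotide."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the bases agree and neither is a gap.
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != gap
    )
    # The denominator is the full alignment length, gaps included.
    return matches / len(aligned_a)


# Hypothetical 10-column alignment, not taken from the thesis data.
example = pairwise_similarity("ACGTTGCA-T", "ACGATGCAGT")
```

Because gap columns stay in the denominator, introduced gaps lower the similarity value, which appears consistent with dividing by the length of the alignment rather than by the length of either sequence.
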

A multiple alignment of all six plant ss rRNA

sequences was constructed. Computer generated pairwise

alignments were used to first align the five seed plant

sequences, then incorporate the Chlamydomonas sequence.

The alignment was formatted as an input file for the

Reducseq program and the output from it subsequently used

for analysis on PAUP (Phylogenetic Analysis Using

Parsimony) software (Swofford 1985).

In these analyses transitions and transversions are

given equal weight and gap positions inferred from the

alignment are treated as missing data. The alltrees option









was employed to conduct an exhaustive search of unrooted

networks for the five seed plant taxa treating all

characters as unordered. The Chlamydomonas sequence was

then restored to the data matrix and designated as the

outgroup. Exhaustive searches of five plant taxa rooted

networks were conducted using global outgroup rooting.

The total character set (95 positions) was divided into

subsets for further analysis. Transition character (57)

and transversion character (38) subsets were analyzed

separately. Several character positions contain both

classes of substitution. However, only one class of

substitution was phylogenetically informative and each of

these was included in the corresponding subset. The Ginkgo

sequence was compared with the secondary structure model

for the Zea mays ss rRNA sequence (Gutell et al. 1985) and

characters were separated into three subsets. Characters

from unpaired loop forming regions (38), paired stem

forming regions (29) and those from a region (ca. +650-850

bp) of undetermined secondary structure were analyzed

separately using methods described for the total data set.
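
For readers unfamiliar with how a parsimony score is obtained for unordered characters, the following Python sketch counts the minimum number of steps one character requires on one fixed tree, using the standard Fitch (1971) set-intersection procedure. It is an illustration only and is not the PAUP implementation; the nested-tuple tree encoding, the function name, and the example character are assumptions made for this sketch. Missing data and gaps are treated as compatible with any state, mirroring the treatment of gap positions as missing data.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one unordered character.

    tree   -- binary tree as nested 2-tuples of taxon names (hypothetical encoding)
    states -- dict mapping taxon name to 'A', 'C', 'G', 'T', or '?'/'-' for missing
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            state = states[node]
            # Missing data can take any state without cost.
            return set("ACGT") if state in "?-" else {state}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # no shared state: one change is required at this node
        return left | right

    visit(tree)
    return steps


# Hypothetical character, not taken from the thesis data matrix.
character = {"Zea": "A", "Oryza": "A", "Glycine": "A", "Ginkgo": "G", "Zamia": "G"}
tree_one = (("Zea", "Oryza"), ("Glycine", ("Ginkgo", "Zamia")))
tree_two = (("Zea", "Ginkgo"), ("Glycine", ("Oryza", "Zamia")))
steps_one = fitch_steps(tree_one, character)
steps_two = fitch_steps(tree_two, character)
```

Summing such counts over all characters gives the length of a tree, and an exhaustive search simply repeats this for every possible topology. For unordered characters the count does not depend on where the tree is rooted, so the same routine serves for unrooted networks.
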

Observed frequencies for the twelve possible types of

nucleotide substitution were estimated for the green plant

ss rRNA sequences. The 313 variable positions were

correlated with tree I (Fig. 5) and the minimum number of

observed substitutions tabulated. Observed substitution







frequencies between the two Ginkgo sequences Gbr-1000 and

Gbr-6700 were also estimated.

The Gbr-6700 sequence was aligned with that of Gbr-

1000 and sequence similarities examined. The region of the

Gbr-6700 sequence representing the 1100 nucleotide insert

in the ss rRNA-like coding region was compared with

sequences from the Genbank sequence files using the search

programs of Microgenie.













CHAPTER 4
PHYLOGENETIC ANALYSIS OF PLANT SS RRNA GENE SEQUENCES


Experimental Results


Approximately 100,000 recombinant lambda clones were

screened on duplicate library lifts using the Zamia

ribosomal probes. The Sal I fragment, ca. 6.8 kb, from

lambda Gbr-01 subcloned in pUC 19, pGbr-1000, included the

complete ss rRNA coding region as indicated by

hybridization with separate 5' and 3' ribosomal probes

(Fig. 1). A 2000 bp region spanning the coding region was

sequenced. Boundaries for the ss rRNA coding region were

inferred by consensus with other eukaryotic ss rRNA

sequences. The coding region for the Ginkgo ss rRNA

inferred from this gene sequence is 1811 nucleotides in

length. The sequence similarities for pairwise alignments

of the five seed plant sequences range from 92% to 97% and

these are 87% to 88% similar to the sequence for the green

alga Chlamydomonas (Table 1).

An alignment of the six ss rRNA sequences was

constructed and is 1825 positions in length including gaps

introduced to facilitate alignment (Fig. 2). The majority

of positions for these sequences are readily aligned due to

the relatively high sequence similarities and the similar







lengths of the plant small subunit rRNA sequences.

However, alignment of some positions, primarily in the

Chlamydomonas sequence, with the other sequences remains

ambiguous. These positions are typically observed in

regions with both length variation due to insertional-

deletional events and low sequence similarity of positions

adjacent to the introduced gaps. These positions were

deleted from the data matrix and subsequently treated as

missing data in phylogenetic analyses.

The alignment was formatted as an input file and run

through the sequence reduction program (Reducseq) developed

for use with PAUP (Swofford 1985). This removes invariant

and uninformative variable positions from the data set. Of

the 1825 positions, 313 showed nucleotide variation and of

these, 95 positions were potentially phylogenetically

informative (Fig. 2). These were used to infer

phylogenetic trees using PAUP.
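
The reduction step described above can be illustrated with a short Python sketch. It is not the Reducseq program itself; it assumes the usual parsimony criterion that a position is potentially informative only when at least two different nucleotides each occur in at least two taxa, and it ignores gaps and missing data when counting. The function name and the toy alignment are hypothetical.

```python
from collections import Counter

MISSING = {"-", "?", "N"}  # symbols treated as missing data in this sketch


def classify_columns(alignment):
    """Return (variable, informative) lists of 0-based column indices.

    alignment -- list of equal-length aligned sequences, one per taxon
    """
    variable, informative = [], []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(base for base in column if base not in MISSING)
        if len(counts) < 2:
            continue  # invariant position
        variable.append(index)
        # Informative: at least two states, each shared by two or more taxa.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative.append(index)
    return variable, informative


# Hypothetical six-taxon alignment, not taken from Figure 2.
toy = ["ACGTA", "ACGTA", "ATGCA", "ATGCG", "ATG-A", "GTGCA"]
variable_cols, informative_cols = classify_columns(toy)
```

In the toy alignment, positions where a single taxon differs from all others are variable but uninformative, since every tree explains them with one step.
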

An exhaustive search of possible unrooted networks for

the five seed plant sequences was conducted. A single most

parsimonious network was found (Fig. 3) which required 120

steps (character state changes) with a consistency index

value of 0.867. The distribution of scores for the 15

possible network topologies was examined (Fig. 4). The

next best competitor requires an additional 23 steps at 143

with remaining network scores ranging to 189 steps. The

network at 143 steps splits the two grasses grouping

soybean and corn, with rice in an intermediate position.








In the network at 147 steps corn and rice switch and the

next two at 152 and 153 steps split Ginkgo and the cycad.
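
The figure of 15 possible networks for five taxa follows from the standard count of unrooted, fully bifurcating topologies, (2n-5)!! for n taxa. The Python sketch below is included only to make that arithmetic explicit; the function names are hypothetical.

```python
def unrooted_topologies(n_taxa):
    """Number of unrooted, fully bifurcating networks for n_taxa: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("at least three taxa are required")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def branches_in_unrooted_network(n_taxa):
    """Number of branches in an unrooted bifurcating network: 2n-3."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are required")
    return 2 * n_taxa - 3


five_taxon_networks = unrooted_topologies(5)
five_taxon_branches = branches_in_unrooted_network(5)
```

A five-taxon network has seven branches, so a single outgroup can attach to it in seven places; this is why a given network has seven possible rootings.
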

A statistical analysis was performed to correlate

character state changes with network A (120) and network B

(143) (Table 2). Of the 95 characters examined, 33 favor

one topology over the other. The most parsimonious network

(A) requires one less substitution each for 28 of these

positions. Network B is supported by only five of these

characters. To examine the significance of these values a

two tailed sign test (Sokal and Rohlf 1969) was used.

Confidence coefficients were inferred from a standard

statistical table of confidence limits for percentages.

For a sample size (n) of 33 with five positions favoring

network B (Y=5) the table indicates that the results are

significant at the 95% interval.
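
The sign test above was evaluated from a published table of confidence limits. As a cross-check, the same comparison can be sketched as an exact binomial calculation in Python. This is an alternative way of reaching the conclusion, not the procedure used in the analysis; it assumes that each character favoring one topology is an independent trial with probability 0.5 under the null hypothesis, and the function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(n, k):
    """Exact two-tailed sign test p-value.

    n -- number of characters favoring either topology
    k -- number of characters favoring one of the two topologies
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    k = min(k, n - k)  # work with the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 33 characters discriminate between networks A and B; 5 favor network B.
p_networks = sign_test_two_tailed(33, 5)
significant_at_95 = p_networks < 0.05
```

Under these assumptions the p-value for 5 of 33 falls far below 0.05, in agreement with the result read from the table.
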

The outgroup sequence from Chlamydomonas was then

included and an exhaustive search for rooted trees was

conducted. A single most parsimonious tree was found that

requires 149 steps with a consistency index value of 0.78

(Fig. 5). This tree roots network A along the branch

between the gymnosperm node and the angiosperm node. The

next shortest tree requires an additional five steps at 154

and places the root along the branch between Zamia and the

gymnosperm node. Remaining tree scores range from 157 to

236 (Fig. 6). The five shortest trees (149-164) represent







five of seven possible rootings of the most parsimonious

network found.

A statistical analysis was performed to correlate

minimal substitutions with the most parsimonious tree, I,

and the next best competitor, II (Table 3). The sign test

was employed to examine the confidence limits for these two

phylogenies. Thirteen characters favor one topology over

the other with nine requiring one less step each for tree I

and four favoring tree II by one less step each. Again the

table for confidence limits of percentages was used. This

indicates that the number of characters supporting one

topology over the other is not significant at the 95%

confidence interval.

Five character subsets were used to infer unrooted

networks for the ingroup taxa and rooted phylogenies using

the green alga outgroup sequence. Each of these subset

analyses infers a single most parsimonious network (Fig. 3)

for the ingroup taxa identical in topology to the most

parsimonious network (A) found in the analysis using all

characters. For each of these subset analyses the next

most parsimonious network requires at least six additional

steps over the most parsimonious network found and overall

patterns of network distributions were similar (Fig. 4).

The number of characters in the data subsets ranged from 28

to 57. For each of the subset analyses, network scores

were also converted to percentages to compare network

distributions from analyses containing different numbers of









characters. Distributions of network percentages (Fig. 7)

were used for comparisons but are not intended to represent

a measure of resolution provided by the character subsets.

These distributions reveal that for all subsets examined

the next most parsimonious networks range from

approximately 10 to 25 percent longer than the most

parsimonious network found.

Rooted phylogenies were also inferred using the

character subsets. The transversion subset infers two

equally most parsimonious trees, tree I and tree IV, from

the data (Figs. 5 and 6). Transition characters infer a

single most parsimonious tree, I, and the next best

competitor requires an additional four steps. Analysis of

loop positions from unpaired regions produces two equally

most parsimonious trees, I and II. Stem positions from

paired regions produce a single most parsimonious tree, I,

in the analysis but the next best competitor, tree IV,

requires only one additional step. Remaining characters

from regions of undetermined secondary structure also

produce a single most parsimonious solution, tree I, in

this analysis with the next most parsimonious tree, II,

requiring two additional steps.

Observed frequencies of the twelve possible nucleotide

substitutions for the plant ss rRNA sequences were

examined. These substitution frequencies estimate the

minimum number of substitutions for the 313 variable







positions required to correlate the data with tree I found

in the analysis. For this analysis, transitional

substitutions are observed at a higher frequency than

transversions (Table 4). Substitutions between C and T are

most frequent, 18-20% of the total, and are observed at

approximately twice the frequency of substitutions between

A and G which represent 8.5-10% of total substitutions.

The eight possible transversion types are observed at

similar frequencies ranging from approximately 4-7% of the

total number of substitutions observed.
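
The division of the twelve substitution types into transitions and transversions can be made explicit with a small Python sketch. It assumes DNA-style base symbols (A, C, G, T) and a list of inferred changes given as (from, to) pairs; the function names and the example list are hypothetical and do not reproduce the counts in Table 4.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(source, target):
    """Classify a single nucleotide change as 'transition' or 'transversion'."""
    bases = PURINES | PYRIMIDINES
    if source not in bases or target not in bases:
        raise ValueError("bases must be one of A, C, G, T")
    if source == target:
        raise ValueError("no substitution: bases are identical")
    pair = {source, target}
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def tally_substitutions(changes):
    """Return (counts by directional type, counts by class) for (from, to) pairs."""
    by_type = Counter(changes)
    by_class = Counter(substitution_class(a, b) for a, b in changes)
    return by_type, by_class


# Hypothetical list of inferred changes, for illustration only.
example_changes = [("C", "T"), ("T", "C"), ("A", "G"), ("A", "C"), ("G", "T")]
type_counts, class_counts = tally_substitutions(example_changes)
```

Of the twelve directional types, four are transitions and eight are transversions, so equal frequencies for every type would yield twice as many transversions as transitions; an excess of transitions therefore indicates a bias.
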

Congruence of seed plant phylogenies inferred from ss

rRNA sequences and those inferred from 5S rRNA sequences

was examined. Six plant 5S rRNA sequences were selected

from those available which represent the same taxonomic

groups as those available for ss rRNA. The 5S sequences

are available for Ginkgo and the outgroup Chlamydomonas.

The 5S rRNA sequences for the remaining four species in the

ss rRNA analysis are not available. Therefore the 5S

sequences for Cycas revoluta, Vicia faba, Secale cereale,

and Triticum aestivum (Hori et al. 1985) were used in place

of Zamia pumila, Glycine max, Oryza sativa, and Zea mays.

The cladistic parsimony methods described for the ss

rRNA analysis were applied to these 5S rRNA data. The

alignment of the 5S rRNA sequences is 120 nucleotide

positions in length and of these, twelve positions are

potentially phylogenetically informative. A single most

parsimonious network was inferred from the 5S rRNA data









which required 13 steps (consistency 1.0) and is the same

topology as network A inferred from the ss rRNA data. The

two next best competitors each require an additional five

steps, for a total of 18. Of the remaining networks, two

require 19 steps and ten require 25 steps (Fig. 8).

The most parsimonious rooted tree using the Chlamydomonas

sequence as the outgroup requires 14 steps and is identical

in topology to tree IV inferred from ss rRNA data at 160

steps (+11). This tree roots network A between the node

for the dicot sequence and the node for the two monocots.

The next best competitor, at 15 steps, requires only one

additional step and is the same topology as the most

parsimonious rooted tree, I, found in the ss rRNA analysis.

The 5S rRNA sequence from a more closely related

outgroup is available for the fern Dryopteris acuminata.

The root placement using this sequence in place of the

algal sequence infers a single most parsimonious tree with

the same topology as tree I (Fig. 5) found in this analysis

of ss rRNA sequences.









































Figure 2. An alignment of ss rRNA gene sequences from five
seed plant species and a green alga. Numbers to
the left of the alignment indicate the species
included as follows: 1) Chlamydomonas
reinhardtii; 2) Ginkgo biloba; 3) Zamia pumila;
4) Glycine max; 5) Oryza sativa; 6) Zea mays.
Positions of the alignment are indicated above
the alignment by numbers every 100 bp and
vertical lines every 50 bp. Positions of the
alignment that were deleted a priori from the
data matrix are underlined. Character positions
are indicated by an asterisk (*) and synapomorphic
insertions by a A above the alignment. Gaps
introduced in individual sequences to facilitate
alignment are indicated within the sequence by a
hyphen (-).








| 100
1 ..................................................... ... ......... A....GC..-. .... ...... ... ....
2 TACCTGGTTGATCCTGCCAGTAGTCATATGCTTGCTCAAAGATTAAGCACATGCATGTGTAAGTATGAACTCTTTCAGACTGTGAAACTGCGAATGGCTC
3 ..................................... ............. ........ C............A...TG...G...................
4 .................................................................... AA...........................
5 ...................... ..... ............ ..................... ...........AA...GA......................
6 ............................................................C...........AA...GA ....................


A *
1 ..................A..........-.C......................................-G..C.A.................T
2 ATTAAATCAGTTATAGTTTCTTTGATGGTACCTTACTACTCGGATAACCGTAGTAATTCTAGAGCTAATACGTGCACCAAATCCCGACTTCTGGAAGGGA
3 .............................. TC.G.......................................................T.T.......
4 .................... ...... ...-TC........................................ .... .................
5 ................G....G ..........G.G................... ............ ......... .. .. ... ..... C..G....G
6 ...................G..........G.G.......................................... ..........C..G....G


** *300
1 ..T ...................-.GC.......T.....-AC..G ...... A ...........TC...A ...T.T..-G..C..C...A...T.T....
2 CGCATTTATTAGATAAAGGCCGACGCGGGCTC-GCCCGCTGCTTCGGTGATTCATGATAACTCGACGGATCGCACGGCCCTGGTGCCGGCGACGCTTCA
3 ..... C ................T........TT..... G.CG..T..... A ........C..T..T....T...T......C.A...............
4 T..................T.A..A.A.....T...T.T.....T.A.................T............T.T.............A...
5 .....................T................... A.C..A..................................C ............A...
6 .....................T...........T......C.A.C..A................................T.C................


I 400
1 ......................................................A...........................................
2 TTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGAGGCCTACCATGGTGGTGACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCT
3 ............................... ...... ........... .... ..... ......... .............................C
4 ......................................T
5 ...................................G...........................................................
6 .........................................................................


I *
1 ....G.T...................................... ..... C..................................C....G.-T.
2 GAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCTGACACGGGGAGGTAGTGACAATAA-ATAACAATACTGGGCTC-AT
3 ....................................... ....................................................
4 ......................................... ........ .. ...................... ....... ......
5 .............................................................................T.........C....G.-T.
6 ........................................................................................C...G.GT.


1 600
1 ..C...............................................................................................
2 CGAGTCTGGTAATTGGAATGAGTACAATCTAAATCCCTTAACGA-GGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAAT
3 .................................................................................................
4 T........................................... ....... ....................
5 A.T.......................................... .......................................................
6 A.T......................................... .......... ... .....................................


A* *** *** *
1 ........................................... C .....T-G.. .G.TG..........C.---.........T.CT.TG-CT.CA..
2 AGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGC-CGGGTCGGCCGGTCCGCCTTTT-CGGTGTGCACCGGCCGC-TCCGTCC
3 ..................................................A-...CC.......... T.. ...TG .............T-.T...
4 ..............................................T-T......AT........ C.--................G-CT.....
5 ..........................................C.....G...CCG............CA--...CAG......A..TG-CT..A..
6 ............................................C.... CG..CCG..TGCCG.....GTA--.....A....A....GTCT......


** ** I ** 800
1 T.C.......G.A..G........G...C.C..T.....-A.T...AGT.....AG..........GT.........T....................
2 CTTCTGCCGGCGGCGCGCTCCTGGCCTTAATTGGCTGGG-CCCTGCCGGCGCCGTTACTTTGAAAAAATTAGAGTGCTCAAAGCAAGCCTACGCTCT
3 ...T..TT..........A...........C..T.............A........................................T..T.....
4 ............AT.........T......C....C...-...T.C......T............5................................
5 ............AT............... C....C...T...T.C.................... G........................AT......
6 ............ AT...............C....C...-...T.C.....................G........................AT.....








** I 900
1 ......................A........C........-....C......T..GT....................G....GTA...............
2 GAATACATTAGCATGGAATAACGCGATAGGAGTCTGGTCCTATTGTGTTGGCCTTCGGGACCGGAGTAATGATTAATAGGGACGGTCGGGGGCATTCGTA
3 .......................GT.... ...................... ................................................
4 .T..............G.....A.C.C....T....A....................... ............... ... A...............
5 .G..............G.....ATC...............................................A................
6 .G.............. G....ATC......T..C........................ ................ ......A...............


I 1000
1 ..C.G..........................C.G..........AT.......................AC.......G.....................
2 TTTCATTGTCAGAGGTGAAATTCTTGGATTTATGAAAGACGAACCACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGG
3 ............................................. ...................................................
4 ......A.................................... A.....................................................
5 ......A................................................................ ..........................
6 ......A..................................... ................ ...................... ....................


1* 1100
1 ............T..........G................................T...A......CT...G.T.....T...A.-......A.....
2 CTCGAAGACGATCAGATACCGTCCTAGTCTCAACCATAAACGATGCCGACTAGGGATCGGCGGATGTTGCTTTAAGGACTCCGCCGG-CACCTTGTGAGA
3 ................................................. ................ .....................A.....
4 ..................................................C.. ..A ........... ....... ..T.. ......A.....
5 ............................................... C.. .................... AT...................A.....
6 ..................................................C....... .....A..AAT.....C....T..C...T .....


I* 1200
1 .................................................................... .C............
2 AATCAAAGTTTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATT
3 ................... ..............................................................................
4 .........C............................. .....................................................
5 .........C............................. ................................... C ..................
6 .........C.................................................................C..................


|* 1300
1 ..................................CG.G................................... G........................
2 TGACTCAACACGGGGAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTT
3 .....................................C............................................ .............
4 .................................................... .......................... .........
5 .................................... .... .......... .C.... ....... ......... .......... ................
6 .....................................C............ ......................................T.......


*** ** 1400
1 .........GTT.CC.....A.... G ..............................A...-.CA..ATC.CAC.TG.GGT.-C...GA........
2 AGTTGGTGGAGCGATTTGTCTGGTTAATTCCGTTAACGAACGAGACCTCAGCCTGCTAACTAGCTATGCGGAGGTTCGCCTTCGTGGCCAGCTTCTTAGA
3 ............................................. ... ................ .. .....G.TT.T....................
4 .................................... ....................... ...... .. ... AAC...C.AC...............
5 ..........................................................CCA.............CCATC...C..CA..T .......
6 .........................................................................CCATC...C...A.TT ..........


I *
1 .......T....-G..T..C.A........ A......G ....................... ........................... C.CGAC..
2 GGGACTA-TGGCCCTTCAGGCCATGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGTATTCA
3 .......-....G..T......................... ....................................................
4 .......-.....GC.T......C ................ ......................................................
5 .......-.....G..T...... C........................................................................C..
6 .......-.....G..T..... GC-.............. ....................... ................................. C..








Figure 2--continued







** I *
1 .....C....--...T................T....TG-T...CCG.G.-.................T............AG....C ..........G.
2 ACGAGTCTATAACCTGGGCCGAGAGGCCCGGGAAATCTGCCGAAATTTCAT-CGTGTGGGGGATAGATATTGCAATTATTGATCTTAAACGAGGAATTC
3 .................... G... ............. ........... ......... ........................ C............
4 ...........G...T......C...T.....T.....TT-..............GT................... G...G....C.........
5 ......A....G.....T....C.........T......TGG.......... .................... AC...G...G...C........ G.
6 ......A....G...T......C.....--..T.....TGG.......................................G...G...C..........G.

I 1700
1 ....................G...........T..............................................GG..TG.T..............
2 CTAGTAAGCGCGAGTCATCAACTCGCGTTGACTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGAATGATGGTGAAGTGTTCGG
3 ................................................................................................
4 ..................G............................... ..........................G...............
5 ..................G................. ........................................G...............
6 ...................G............................................................G...........

1800
1 ..T.A..--TT.G.T.G..--.AA.-.T...CT-T.--.T-T.............A....C.CC..CC.................................
2 ATCGCGCCGACGACGGCGGTTCGCCGCCGGCGA-CGT-CG-CGAGAAGTTCATTGAACCTTATCATTTAGAGGAAGGA TCGTAACAAGGTTTCCGT
3 ....T....................T.GGCA...-.........................................................
4 ..T...G.....TGA.........T...C....-...-T.-T.......C..C..............................................
5 ...... G.....GG...............CC.......-........ ...............................................
6 .G.T..G.CG.AC---..............CC...C...C..C........C..................................................

1825
1 .........................
2 AGGTGAACCTGCGGAAGGATCATTG
3 .........................
4 ........................
5 .........................
6 ........................


Figure 2--continued








Figure 3. The fifteen possible topologies for unrooted
networks of the five seed plant taxa. The
number of steps (character state changes)
required for each network is indicated below
the network for: a) all characters; b)
transversion characters; c) transition
characters; d) loop characters; e) stem
characters; f) undetermined secondary
structure position characters. Taxa included
are: Ginkgo biloba (G.b.); Zamia pumila
(Z.p.); Glycine max (G.m.); Oryza sativa
(O.s.); and Zea mays (Z.m.).

[Network diagrams A through O.]

Network     a.     b.     c.     d.     e.     f.

A          120     45     75     50     34     36
B          143     58     85     61     40     42
C          147     59     88     63     41     43
D          152     59     93     61     45     46
E          153     58     95     63     45     45
F          180     72    108     74     54     52
G          181     72    109     75     54     52
H          184     74    110     77     53     54
I          184     74    110     77     53     54
J          184     75    109     76     54     54
K          184     75    109     76     54     54
L          186     74    112     77     54     55
M          187     74    113     78     54     55
N          188     75    113     78     55     55
O          189     75    114     79     55     55



























Figure 4. The distributions of network scores (Fig. 3).
The horizontal axis represents the number of
steps required for each network and the
vertical axis indicates the number of networks
found requiring that number of steps. The data
set used for each analysis and the number of
characters in the data set are listed to the
right of the histograms.

[Histograms a through f; horizontal axis 40 to 200 steps.]

a. All Positions, 95 Characters
b. Transversions, 38 Characters
c. Transitions, 57 Characters
d. Loop Positions, 38 Characters
e. Stem Positions, 29 Characters
f. Undetermined Structure, 28 Characters


























Figure 5. The most parsimonious tree (I) and the three
next most parsimonious trees (II-IV) found in
the analysis using the green algal sequence as
the outgroup. Numbers next to trees indicate
the number of steps for each internode
inferred from the total data set. Represented
taxa are: Ginkgo biloba (G.b.); Zamia pumila
(Z.p.); Glycine max (G.m.); Oryza sativa
(O.s.); Zea mays (Z.m.); and Chlamydomonas
reinhardtii (C.r.). Numbers listed below the
trees indicate the number of steps required
for each of the trees from: a) total data; b)
transversions; c) transitions; d) loop
positions; e) stem positions; and f)
undetermined secondary structure positions.

[Tree diagrams I through IV.]

Tree      a.     b.     c.     d.     e.     f.

I        149     60     89     59     44     46
II       154     61     93     59     47     48
III      157     63     94     61     47     49
IV       160     60    100*    65     43     52



























Figure 6. Distribution of scores for the 105 possible
topologies for six taxa trees. The horizontal
axis represents the number of steps required
for trees and the vertical axis indicates the
number of trees found for each score. Numbers
within the open boxes at the top of the
histograms indicate the percentage of steps
required for each tree over the most
parsimonious tree found for each character
set. The character set used for each analysis
is indicated to the right of the histograms.

[Histograms a through f.]

a. All Positions, 95 Characters
b. Transversions, 38 Characters
c. Transitions, 57 Characters
d. Loop Positions, 38 Characters
e. Stem Positions, 29 Characters
f. Undetermined Structure, 28 Characters



























Figure 7. The distribution of network percentages for each
of six character subsets. Percentages were
calculated for network scores by dividing the
number of additional steps required for each
network over the most parsimonious network by
the number of steps required for the most
parsimonious network. The horizontal axis
represents these percentages and the vertical
axis indicates the number of trees found within
each percentage interval. The character set
used for each analysis is shown to the right of
the histograms.

[Histograms a through f; horizontal axis 0% to 70%.]

a. All Positions, 95 Characters
b. Transversions, 38 Characters
c. Transitions, 57 Characters
d. Loop Positions, 38 Characters
e. Stem Positions, 29 Characters
f. Undetermined Structure, 28 Characters




























Figure 8. Distributions for unrooted networks and rooted
phylogenetic trees inferred from this
re-analysis of 5S rRNA sequences from selected
taxa. The horizontal axis represents the
number of character state changes for each
network or tree and the vertical axis indicates
the number of these found in the analysis. The
histograms represent scores for: a) unrooted
five taxa networks; and rooted trees using b)
the green alga sequence and c) the fern
sequence as the outgroup.

[Histograms a through c.]

a. Unrooted Networks
b. Alga Root
c. Fern Root
























Table 1. Sequence Similarities for Plant ss rRNA Genes


          G.b.    Z.p.    G.m.    O.s.    Z.m.


C.r.      88.4    87.3    86.8    87.3    86.6

G.b.              95.9    94.0    94.0    92.4

Z.p.                      92.3    92.3    91.0

G.m.                              94.5    93.5

O.s.                                      96.9


Taxa included are: Chlamydomonas reinhardtii (C.r.);
Ginkgo biloba (G.b.); Zamia pumila (Z.p.); Glycine max
(G.m.); Oryza sativa (O.s.); and Zea mays (Z.m.).







Table 2. Characters for Statistical Analysis of Networks


CHARACTER    CHARACTER STATE         NUMBER OF          SIGN TEST
NUMBER                               SUBSTITUTIONS      SCORE

             Gb  Zp  Gm  Os  Zm      NTW A   NTW B      NTW A   NTW B

    1        T   T   T   C   C         1       2         +1
    4        G   G   G   A   A         1       2         +1
    8        T   T   T   C   C         1       2         +1
    9        A   A   A   G   G         1       2         +1
   10        A   A   A   G   G         1       2         +1
   11        C   C   C   T   T         1       2         +1
   12        C   G   C   A   A         2       3         +1
   13        T   T   T   C   C         1       2         +1
   19        C   C   T   C   T         2       1                 -1
   23        T   T   T   G   G         1       2         +1
   24        A   A   A   T   T         1       2         +1
   26        A   A   A   T   T         1       2         +1
   31        C   C   C   G   G         1       2         +1
   32        C   C   T   C   T         2       1                 -1
   33        G   G   G   A   A         1       2         +1
   37        T   T   T   A   A         1       2         +1
   46        T   T   T   A   A         1       2         +1
   47        A   A   A   T   T         1       2         +1
   51        C   G   C   T   T         2       3         +1
   53        T   T   T   C   C         1       2         +1
   60        G   G   A   G   A         2       1                 -1
   61        T   T   T   A   A         1       2         +1
   63        C   C   T   C   T         2       1                 -1
   65        A   A   A   C   C         1       2         +1
   69        G   G   G   C   C         1       2         +1
   70        T   G   T   C   C         2       3         +1
   74        G   G   G   A   A         1       2         +1
   75        C   C   C   T   T         1       2         +1
   77        T   T   T   C   C         1       2         +1
   78        C   C   C   A   A         1       2         +1
   80        G   G   T   G   T         2       1                 -1
   88        T   T   T   G   G         1       2         +1
   94        G   A   G   C   C         2       3         +1

Number of Characters Supporting Each Network             +28     -5

Total Non-zero Scores                                     33

Character states for positions favoring one topology
over the other for networks A and B (Fig. 3). A standard
statistical table was used to interpolate confidence
intervals for the sample size (n) of 33 with five (Y)
positions supporting the alternative network B over the
most parsimonious network A.

















Table 3. Characters for Statistical Analysis of Phylogenies


CHARACTER    CHARACTER STATE             NUMBER OF           SIGN TEST
NUMBER                                   SUBSTITUTIONS       SCORE

             Cr  Gb  Zp  Gm  Os  Zm      TREE I   TREE II    TREE I   TREE II

   16        A   T   A   T   T   T         2        1                   -1
   17        T   C   T   C   C   T         3        2                   -1
   18        T   C   T   C   C   C         2        1                   -1
   22        C   T   T   C   C   C         1        2          +1
   36        C   T   T   C   C   C         1        2          +1
   38        A   G   G   A   A   A         1        2          +1
   40        T   G   T   G   G   G         2        1                   -1
   50        A   G   G   A   A   A         1        2          +1
   56        A   G   G   A   A   A         1        2          +1
   58        A   C   C   A   A   A         1        2          +1
   80        T   G   G   T   G   T         2        3          +1
   83        T   A   A   T   T   T         1        2          +1
   87        G   A   A   G   G   G         1        2          +1


Number of Characters Supporting Each Tree                      +9       -4


Total Non-zero Scores                                          13


Character states for positions favoring one topology
over the other for trees I and II (Fig. 5). A standard
statistical table was used to interpolate confidence
intervals for the sample size (n) of 13 with four (Y)
positions favoring the alternative phylogeny II over the
most parsimonious phylogeny I.























Table 4. Substitution Frequencies for Plant ss rRNA Genes


Subst.    No. of    % of          Subst.    No. of    % of
Type      Subst.    Subst.        Type      Subst.    Subst.


A > C                 3.7         C > A                 4.2

A > T                 4.4         T > A                 4.9

C > G                 6.1         G > C                 7.4

G > T                 5.7         T > G                 6.9

A > G                 8.6         G > A       42       10.3

C > T                20.1         T > C       72       17.7


The observed frequencies for nucleotide substitution
types were estimated from variable positions of the plant
ss rRNA gene alignment (Fig. 2). Substitutions were
correlated with the most parsimonious tree (Fig. 5) found
in the analysis.







Discussion


Introduction to Discussion


The Ginkgo small subunit rRNA coding region contained in

Gbr-1000 is 1811 nucleotides in length, as determined by

consensus with other eukaryotic sequences. This is within

the range reported for other seed plants, from 1807

nucleotides in soybean to 1813 nucleotides in Zamia.

Length conservation coupled with

high sequence identity makes alignment of most nucleotide

positions straightforward. Only a few positions which

surround gaps in the alignment and show nucleotide

variation are not so easily aligned (Fig. 2).

A basic assumption in a phylogenetic analysis is that

the characters being compared are homologous (share a

common evolutionary origin). In a cladistic study using

molecular sequences, homology of a nucleotide position is

defined by the alignment. For some nucleotide positions

where more than one alignment is possible, misalignment

will result in comparing non-homologous characters in the

analysis. To minimize this, several nucleotide positions

were removed from the data matrix (Fig. 2), mostly in the

Chlamydomonas sequence, where alignment is ambiguous and

consequently homology is uncertain for those characters.

The remaining nucleotide positions include 95 potentially

phylogenetically informative positions.
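
A minimal Python sketch of how a potentially informative position can be recognized is given below. It assumes the usual parsimony criterion, under which a position is informative when at least two different nucleotides are each present in at least two taxa; the exact criterion applied in this study is not stated, and the function name `is_informative` is hypothetical.

```python
from collections import Counter


def is_informative(column):
    """Return True if an alignment column is parsimony informative.

    `column` holds one character per taxon. Gaps and other symbols
    outside A, C, G, T are ignored (an assumption of this sketch).
    """
    counts = Counter(base for base in column if base in "ACGT")
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return shared_states >= 2


print(is_informative("TTTCC"))  # two states, each in two or more taxa
print(is_informative("TTTTC"))  # a single unique difference
print(is_informative("AAAAA"))  # an invariant position
```

Under this criterion a difference confined to a single taxon is uninformative, because it requires one step on every topology and therefore cannot favor one network over another.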







For this study we adopt a cladistic approach using

parsimony. Eukaryotic relationships have been examined

based on ss rRNA sequences using a variety of approaches

(reviewed in Felsenstein 1988); however, the cladistic

parsimony approach has become widely used by plant

systematists.


Ingroup Analysis of Seed Plant ss rRNA Sequences


Ingroup analysis of the five seed plant taxa infers a

single most parsimonious network (Fig. 3). This network,

A, is 120 steps in length and has a consistency index of

0.87. Network A groups the two grasses together with the

soybean sequence intermediate between the grass node and

the node grouping Ginkgo and Zamia. The next shortest

network, B, requires 23 additional steps at 143 and splits

the two grasses placing soybean with corn and placing rice

intermediate between these and the gymnosperms. The four

next most parsimonious networks (B, C, D and E), requiring

between 143 and 153 steps, split either the grass node (B and

C) or the gymnosperm node (D and E), grouping soybean with

one member of the group and placing the other member of the

group in an intermediate position in the network.

Remaining arrangements of the five ingroup sequences (F

through O) require between 180 and 189 steps (Fig. 4).

A statistical estimate of the significance of

character support for the most parsimonious network (A)

over the next best competitor (B) was calculated. An








estimate of the minimum number of substitutions was made by

correlating each character position with network A and with

network B (Table 2). Of the 95 characters, 33 favor one

network over the other by one less substitution each.

Characters favoring network A totalled 28 with only five

favoring network B.
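
The per-character step counts compared here can be reproduced with a short Python sketch. It assumes Fitch's algorithm for unordered characters, which may differ from the counting procedure actually used in this study, and the names `fitch_steps`, `NETWORK_A`, `NETWORK_B` and `score` are hypothetical. Each unrooted network is written with an arbitrary root, which does not change the count.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a tree.

    `tree` is a nested 2-tuple of taxon names; `states` maps each
    taxon name to its nucleotide at the character position.
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):            # leaf: a taxon name
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1                           # disjoint sets force one change
        return left | right

    visit(tree)
    return steps


TAXA = ("Gb", "Zp", "Gm", "Os", "Zm")
# Network A: the two grasses together, soybean intermediate.
NETWORK_A = (("Gb", "Zp"), ("Gm", ("Os", "Zm")))
# Network B: soybean with corn, rice intermediate.
NETWORK_B = (("Gb", "Zp"), ("Os", ("Gm", "Zm")))


def score(pattern):
    """Steps required on networks A and B for one character pattern."""
    states = dict(zip(TAXA, pattern))
    return fitch_steps(NETWORK_A, states), fitch_steps(NETWORK_B, states)


print(score("TTTCC"))  # character 1 of Table 2: (1, 2)
print(score("CCTCT"))  # character 19 of Table 2: (2, 1)
```

Character 1 therefore favors network A by one step and character 19 favors network B by one step, in agreement with the entries in Table 2.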

The nonparametric sign test (Sokal and Rohlf 1969) was

used to examine confidence limits for percentages.

Calculations were made for a sample size (n) of 33 with

only five characters (Y) supporting the alternative network

topology using a two tailed test. Results indicate that at

the 5% significance level it is unlikely that the numbers of

characters supporting the two alternatives are equal, as

postulated by the null hypothesis.
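
The test can also be illustrated with an exact binomial calculation. This is a substitute for the tabulated confidence limits used in this study, not a reproduction of them, and the function name `sign_test_two_tailed` is hypothetical. Under the null hypothesis each non-zero score is equally likely to favor either network.

```python
from math import comb


def sign_test_two_tailed(n, y):
    """Exact two-tailed sign test probability for y of n scores.

    n is the number of non-zero scores and y the number favoring
    one alternative; the null hypothesis is a 50:50 split.
    """
    k = min(y, n - y)
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


p = sign_test_two_tailed(33, 5)
print(f"n = 33, Y = 5: p = {p:.2g}")
print("reject equal support at the 5% level:", p < 0.05)
```

For 33 non-zero scores with five favoring the alternative, the probability falls well below 0.05, consistent with the conclusion drawn from the statistical table. The same function applies to any pair of counts.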

The most parsimonious network, A, inferred from the ss

rRNA sequences is highly congruent with the phenotypic

data. Other possible topologies (B-O) for these taxa are

not plausible alternatives given the morphological and

anatomical characters for these seed plants. The strength

of support for network A from these data is therefore

consistent with the assumption that ss rRNA sequences

contain phylogenetically informative characters applicable

to study at these taxonomic levels.








Analysis of Rooted Phylogenetic Trees


Rooted phylogenetic trees were inferred from the data

using the Chlamydomonas sequence as the outgroup (Fig. 5).

The green algae are widely supported as a basally divergent

group in the green plant lineage and, among the ss rRNA

sequences currently available, Chlamydomonas is the closest

relative of the higher seed plants. An exhaustive search of

rooted phylogenies

was conducted and inferred tree distributions were examined

(Fig. 6).
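
The size of the exhaustive search follows from the standard count of bifurcating topologies, (2n - 5)!! unrooted trees for n taxa. A minimal Python sketch of this arithmetic is given below; the function names are hypothetical.

```python
def count_unrooted(n_taxa):
    """Number of unrooted bifurcating trees for n taxa: (2n - 5)!!."""
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


def count_branches(n_taxa):
    """Number of branches in an unrooted bifurcating tree: 2n - 3."""
    return 2 * n_taxa - 3


print(count_unrooted(5))   # unrooted networks for the five ingroup taxa
print(count_unrooted(6))   # topologies once the outgroup is added
print(count_branches(5))   # possible root positions on one network
```

The results are 15 networks for five taxa (Fig. 3), 105 topologies for six taxa (Fig. 6), and seven branches on which the outgroup can root any one five-taxon network.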

A single most parsimonious tree was found which

requires 149 steps and has a consistency index value of

0.78 (Fig. 5). In tree I the algal sequence roots network

A (Fig. 3) along the internode between the two gymnosperms

and the angiosperm node. Therefore a monophyletic grouping

of Ginkgo and Zamia is inferred from the most parsimonious

rooting of seed plant networks.

The next two shortest trees found, tree II and tree

III, require an additional five and eight steps

respectively (Fig. 5). These support a polyphyletic origin

for these two gymnosperm taxa. Tree II, which requires 154

steps, roots network A along the branch leading to Zamia

suggesting that Ginkgo is more closely related to the

angiosperms than the cycads. Tree III (Fig. 5), at 157

steps, roots the network along the Ginkgo branch and this

placement of the root suggests that Ginkgo diverged prior

to the divergence of the cycad lineage from that of the








angiosperms. The five shortest trees found in this

analysis (Fig. 6) represent five of seven possible rootings

of the most parsimonious network, A, found in the ingroup

analysis.

A statistical analysis was performed by calculating

the minimum number of nucleotide substitutions for each

character position required to correlate the sequence data

with the most parsimonious tree, I, and the next best

competitor, tree II. Of the 95 character positions in the data,

13 favor one phylogeny over the other by one less

substitution each (Table 3). Nine of these support tree I

and four support tree II (Fig. 5). The sign test was

employed to examine the significance of these values. The

test indicates that support for one phylogeny over the

other is not significant at the 95% confidence level.

Therefore, although a monophyletic group is inferred by

using a parsimony criterion, the ss rRNA data using the

algal outgroup support the grouping rather weakly.

Resolution for rooted phylogenies is not as clear as that

obtained in the ingroup analysis. These results suggest

that rates of evolution in ss rRNA sequences may be too

rapid to provide informative characters for such distantly

related taxa (e.g., green algae and seed plants),

particularly if time between divergence of ingroup taxa

(e.g., Ginkgo and cycads) is brief. Therefore, sequences







from more closely related outgroup taxa are needed in order

to root seed plant phylogenies inferred from ss rRNA

sequences.


Comparison of Character Subsets


Character weighting has frequently been used in

phylogenetic analyses. Weighting based on transition and

transversion types of substitutions can be examined for

molecular characters. In rRNA sequences the position of

the character in either an unpaired, loop region or a

paired, stem forming region can also be used as a basis for

character weighting. Compensating mutations in stem

forming regions occur at high frequency and may be under

positive selection pressure. Studies using 5S and 5.8S

rRNA sequences have suggested that it may be desirable to

give less weight to compensating mutations or eliminate

them from the analysis (Wheeler and Honeycutt 1988).

Wheeler and Honeycutt (1988) examined phylogenies

inferred from subsets of characters based on stem or loop

regions of the rRNA molecule and from their total data.

This approach was used to examine subsets of characters

from the ss rRNA data. The 95 characters were separated

into subsets representing different types of characters and

phylogenetic inferences made using these subsets.

Characters were first separated based on nucleotide

substitution classes into subsets containing transversions

(38 characters) and transitions (57 characters). The data








were also divided based on the position within a secondary

structure model into loop region characters (38), stem

region characters (29) and remaining characters (28) from

regions where secondary structure models have not been

determined.

The ingroup analysis from each subset supports a

single most parsimonious network identical in topology to

the most parsimonious network found, network A, using the total

data set (Fig. 3). Direct comparison of these analyses is

difficult because of the variation in the number of

characters each subset contains. Network distributions

(Fig. 4) for each subset analysis were converted into

percentages reflecting the number of additional steps

required for alternative networks over the most

parsimonious network (Fig. 7). These percentage

distributions are used to compare subset analyses but are

not intended as a quantitative measure of distribution

significance.
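
The conversion to percentages can be sketched in Python as follows. The step counts are taken from Figure 3 for network A and the next shortest network in each data set; the names `percent_excess` and `SCORES` are hypothetical.

```python
def percent_excess(best, alternative):
    """Additional steps of an alternative network as a percentage
    of the steps required by the most parsimonious network."""
    return 100.0 * (alternative - best) / best


# (steps for network A, steps for the next most parsimonious network)
SCORES = {
    "all characters": (120, 143),
    "transversions": (45, 58),
    "transitions": (75, 85),
}

for name, (best, next_best) in SCORES.items():
    print(f"{name}: {percent_excess(best, next_best):.0f}%")
```

Expressing the difference relative to the shortest network allows subsets with different numbers of characters to be compared on a common scale.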

The analysis of transversion characters supports a

single most parsimonious network (A) which requires 45

steps for 38 characters (Fig. 3). The consistency index is

0.96 and the next most parsimonious networks (B and E) each

require an additional 13 steps. For transition data the

most parsimonious solution requires 75 steps for 57

characters (network A). The consistency index is 0.81 for

the network and the next best competitor (network B)







requires an additional 10 steps. Consistency index values

for most parsimonious solutions suggest that transversion

characters contain less homoplasy than the transition

characters and may therefore provide better resolution than

transitions or total characters when both types of

substitutions are given equal weight. Network percentage

distributions (Fig. 7) indicate that for all characters the

next most parsimonious network is 19% longer than the most

parsimonious network. For the transversion analysis the

next competitor is 29% longer than the most parsimonious

solution and the transition subset supports a next most

parsimonious network 13% longer than the most parsimonious

found. These results suggest that giving more weight to

transversion substitutions may reduce the amount of

homoplasy in the ss rRNA sequence data and increase

resolution. However, results also indicate that transition

substitutions provide informative characters and therefore

should be included at some level in phylogenetic analyses.

The analyses of character subsets based on secondary

structure of the ss rRNA molecule reveal less variation in

consistency index values and network distributions (Fig.

4). Loop position characters infer a most parsimonious

network (A) requiring 50 steps for 38 characters. Next

most parsimonious alternatives (B and D) each require an

additional 11 steps. The consistency index for network A

from loop positions is 0.86 compared with 0.85 obtained for

stem positions. Therefore this analysis indicates that








these two types of characters contain similar levels of

homoplasy. The most parsimonious network for 29 stem

position characters requires 34 steps with the next best

competitor requiring an additional six steps. Next most

parsimonious solutions for these data subsets are 22% and

18% longer than most parsimonious networks for loop

characters and stem characters respectively (Fig. 7).

Remaining positions from a region of undetermined secondary

structure also produce a single most parsimonious network

(A) at 36 steps for 28 characters. The consistency index

for the network is 0.89, slightly higher than those from

loop and stem character subsets.
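
The percentages quoted above for Fig. 7 appear to be the additional steps required by the next most parsimonious network, expressed relative to the length of the most parsimonious network. That reading is inferred from the step counts given in the text, not from a stated formula. A minimal sketch under that assumption (the function and variable names are illustrative only):

```python
# Step counts quoted in the text: (steps for the most parsimonious network,
# additional steps required by the next most parsimonious network).
SUBSETS = {
    "transversions": (45, 13),
    "transitions": (75, 10),
    "loop positions": (50, 11),
    "stem positions": (34, 6),
}


def percent_longer(best_steps: int, extra_steps: int) -> int:
    """Extra steps as a whole-number percentage of the shortest network."""
    return round(100 * extra_steps / best_steps)


for name, (best, extra) in SUBSETS.items():
    print(f"{name}: next best network is {percent_longer(best, extra)}% longer")
```

Under this assumption the four subsets give 29%, 13%, 22% and 18%, which matches the values reported in the text.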

Analyses of phylogenies using the algal sequence as

the outgroup root and the character subsets listed above

were also carried out. From the transversion subset two

equally parsimonious trees, I and IV, are found at 60 steps

each with a consistency index of 0.78 (Fig. 5). Rooting

the phylogeny with the algal sequence therefore cannot

distinguish between tree I, supported by the total data,

and tree IV which splits the angiosperms and is highly

incongruent with morphological and anatomical data. The

transition subset does infer a single most parsimonious

solution at 89 steps with a consistency index of 0.78 (Fig.

5). Homoplasy and tree distributions (Fig. 6) are similar

to those observed for the total data set.







Analysis of the loop position subset also infers two

equally parsimonious trees, I and II. The consistency

index value for these trees is 0.81 compared to 0.74 for a

single most parsimonious tree (IV) inferred from stem

characters (Fig. 5). The next most parsimonious solution

(tree I) requires only one additional step for the stem

character subset. Characters from regions of undetermined

secondary structure infer a single most parsimonious tree

(I) which is identical in topology to that obtained from

the total data set. This tree requires 46 steps and has a

consistency index of 0.78.

None of the data subsets examined appear to improve

the resolution of alternative rooted phylogenies inferred

from the total data. Transversion and loop character

subsets, which might be expected to contain less homoplasy,

fail to produce a single most parsimonious tree from their

respective analyses. A phylogeny which is highly

incongruent with phenotypic data (IV) is the most

parsimonious found using stem characters and is among the

two equally most parsimonious found using transversion

characters. Results of this analysis suggest that the

rates of substitution for all types of characters examined

are too rapid to use the green alga sequence as an outgroup

root.

For the ingroup analysis where the total ss rRNA data

set provides good resolution, some differences are observed

in the results from character subset analysis. The








transversion subset appears to contain less homoplasy than

transition characters. Observed frequencies for the twelve

possible substitutions were estimated by correlating the

313 variable positions of the data set with tree I (Fig.

5). The frequency of transitions is twice that of

transversions only when the number of possible substitution

types is taken into account: four for transitions and eight

for transversions (Table 4). Overall,

transitions represent 57% of the total substitutions

estimated from these data. Frequencies of substitution

types also vary approximately twofold among different

transversion types. A twofold difference in frequencies is

also observed for transition types with substitutions

between C and T observed at twice the frequency of those

between A and G.
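
The count of twelve possible substitutions, four transitions and eight transversions, follows from the purine and pyrimidine classes of the four bases. A short sketch that enumerates them (the function name is illustrative, not part of the original analysis):

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(old: str, new: str) -> str:
    """Return 'transition' or 'transversion' for a base substitution."""
    if old == new:
        raise ValueError("not a substitution")
    # A transition stays within purines or within pyrimidines.
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


counts = {"transition": 0, "transversion": 0}
for old, new in permutations("ACGT", 2):  # the 12 ordered base pairs
    counts[substitution_class(old, new)] += 1

print(counts)  # {'transition': 4, 'transversion': 8}
```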

The differences in the frequencies of various

substitution types suggest that individual types of

substitutions, in addition to the overall ratio of

transitions to transversions, might be considered when

developing character weighting schemes. Williams and Fitch

(1990) have developed a method that, in addition to

weighting of character positions, allows character state

transformation weighting within the character position.

Therefore, this method allows different weights to be given

to the various nucleotide substitution types.







Congruence of Phylogenies from ss rRNA Sequences and
Phenotypic Characters


The most parsimonious tree and the next two shortest

alternatives found in this analysis (Fig. 5) represent the

most plausible phylogenies for the seed plant taxa

considered. Phylogenies consistent with each of these have

been proposed based on phenotypic evidence (Beck 1966,

1985, Meyen 1984, 1986, Rothwell 1985). More recent

cladistic studies examining seed plant phylogeny have

reviewed the evidence supporting these alternatives. These

investigations include a broad range of morphological and

anatomical characters and address most seed plant taxa

including fossil groups as well as extant gymnosperms

(Crane 1985a, 1985b, Doyle and Donoghue 1986, 1987).

Phylogenies consistent with each of the three shortest

trees found in this analysis are found among equally or

nearly as parsimonious alternatives in these phenotypic

studies.

Phylogenies consistent with a clade containing Ginkgo

and cycads (tree I) have been proposed based on phenotypic

data. Extant seed plant taxa have been examined using a

cladistic approach and have included ferns and fern allies

as outgroups (Hill and Crane 1982). Three equally

parsimonious cladograms are presented, all of which link

Ginkgo with the cycads but vary in their placement of

conifers and gnetopsids. In portions of another study

(Doyle and Donoghue 1987) limited to extant seed plant








taxa, a clade containing Ginkgo, cycads and conifers is

found in one of two equally parsimonious alternatives. The

second separates Ginkgo (plus conifers) from the cycads.

Other nearly as parsimonious alternatives found also

separate Ginkgo from the cycads as in trees II and III

(Fig. 5) found in this analysis. Furthermore, in analyses

by these investigators which include fossil taxa,

phylogenies linking Ginkgo with cycads become less

parsimonious than those separating these two groups (Doyle

and Donoghue 1986, 1987). Crane (1985a, 1985b) has also

addressed extant and fossil taxa. A consensus cladogram,

one of two presented therein, supports a clade containing

Ginkgo and cycads (plus conifers). However, the second

cladogram retains a Ginkgo-conifer clade but it is

separated from the cycads. The latter arrangement is also

supported in an analysis of extant taxa which includes

other green plant groups (Bremer 1985).

The most parsimonious network, A, inferred from ss

rRNA sequences is strongly supported over the alternatives.

This network is highly congruent with phenotypic evidence

for these taxa. Alternative networks for these taxa are

implausible given the morphological and anatomical evidence

contradicting them. Inferences of the evolutionary

relationships between Ginkgo and cycads and their

relationships to the angiosperms require rooting this

network with an outgroup sequence. Placement of the root







using the algal sequence is relatively weak compared with

the resolution obtained in the ingroup analysis. Use of

sequences from more closely related outgroups (e.g., ferns)

will improve the ability to root networks inferred from ss

rRNA sequences.

Homoplasy is a common problem in plant systematic

studies at higher taxonomic levels. This problem persists

in these ss rRNA data, as it does in other molecular and

phenotypic data sets for higher plant taxa. These problems

with the ss rRNA data, like results of studies based on

phenotypic data, suggest that trees II and III (Fig. 5)

present plausible alternatives to the most parsimonious

tree, I, found in this analysis.


Additional Data from rRNA Sequences


The characters used in this PAUP analysis included

only base substitutions observed in the sequence alignment.

Differences in ribosomal sequences caused by insertional or

deletional events occur at a frequency at least an order of

magnitude lower than substitutional events (Zimmer et al.

1989). These were not incorporated into the data matrix

because of uncertainty on how to treat them relative to

substitutions. The majority of the insertion/deletion

events observed are apomorphic changes in individual

sequences. However, two synapomorphic insertional events

are inferred from our alignment (Fig. 2). One supports the

node for the two grasses, corn and rice. The other one








supports the node for Ginkgo and the cycad. Given the low

frequency observed for insertions, it would appear less

likely that this is the result of either independent

insertional events in the two lineages or an

insertion/deletion event at the same position, as would be

required to reconcile the data with trees II or III. Therefore, the

monophyletic grouping of Ginkgo and cycads inferred from

nucleotide substitutions is also supported by this

insertion.

Other studies using rRNA sequences also appear to

provide support for a clade containing Ginkgo and the

cycads. The first is a study using 5S rRNA sequences from

28 species of green plants (Hori et al. 1985) which

includes Ginkgo, a cycad (Cycas) and a conifer

(Metasequoia). Phylogenies were inferred by a distance

matrix method rather than cladistic parsimony. The

phylogeny presented therein does, however, suggest a

grouping of Ginkgo, cycads and conifers.

The 5S sequences for a limited number of taxa were re-

analyzed using the methods described here for ss rRNA

sequences. Like the ss rRNA analysis, the 5S rRNA analysis

infers a single most parsimonious network that is superior

to its competitors under a parsimony criterion. This re-

analysis of 5S rRNA sequences also parallels the ss rRNA

analysis in the weakness of the root placement using the

algal sequence from Chlamydomonas.







Unlike the ss rRNA data, a closer outgroup sequence is

available for the 5S rRNA data. The Chlamydomonas sequence

was replaced with that from a fern, Dryopteris acuminata.

The placement of the root using this sequence infers a

single most parsimonious tree identical in topology to that

found in the ss rRNA analysis, tree I. Tree distributions

found using the fern outgroup (Fig. 8) more closely

parallel those found for the ss rRNA analysis. Therefore,

for this analysis of limited taxa, the most parsimonious

tree inferred from 5S rRNA data using the fern outgroup is

congruent with that from ss rRNA sequences (Fig. 5).

Another study, a cladistic one using PAUP, utilizes

partial sequences from both small and large subunit rRNA

and includes approximately 1700 nucleotide positions

(Zimmer et al. 1989). This study includes Ginkgo, cycad and

conifer representatives. A tree presented therein

representing "major features" of the most parsimonious

found also appears to support a clade containing Ginkgo,

cycads, and conifers.

The relationships of Ginkgo, cycads and conifers

inferred from phylogenetic studies have generally been

supported by relatively few characters. Alternative

arrangements of these taxa are found in equally or nearly

as parsimonious trees. There is considerable variation in

the number of taxa as well as the number and type of

characters used in these studies. Therefore, when studies

support different most parsimonious phylogenies, it is








difficult to directly compare the strength of support for

most parsimonious solutions over alternative topologies.

Recent cladistic studies using morphological and anatomical

data appear to favor a paraphyletic arrangement of Ginkgo

and cycads but grouping these taxa monophyletically is

equally or nearly as parsimonious. In contrast, molecular

studies favor a monophyletic grouping of these taxa over

paraphyletic arrangements, but the latter are found as next

most parsimonious alternatives. Therefore, both

monophyletic and paraphyletic relationships of these taxa

should be considered plausible alternative phylogenies.

The complete phylogeny of seed plants cannot be

addressed with molecular studies due to the large number of

fossil taxa that are involved in their evolutionary

history. However, further examination of sequences from

extant taxa, including conifers and the gnetopsids, should

aid in clarifying relationships of these groups and may

provide clues as to the polarity of phenotypic characters.

Sequences from more closely related outgroups (i.e.,

pteridophytes) will aid in the rooting of seed plant

networks.













CHAPTER 5
RIBOSOMAL RNA SEQUENCES IN THE GINKGO BILOBA GENOME


Experimental Results


A group of clones was identified from the Ginkgo

genomic library in lambda that contained a ca. 12 kb Bam HI

fragment which hybridized to both 5' and 3' ss ribosomal

probes. Restriction analysis of these clones yielded

fragment sizes inconsistent with those expected based on

conserved sites within plant ss rRNA coding regions. A

restriction map (Fig. 9) indicated that the inferred ss

rRNA coding region spanned approximately three kilobases of

DNA, longer than those reported for eukaryotic ss rRNA

sequences.

A 3770 base region covering the inferred ss rRNA

coding region of one of these clones, Gbr-6700, was

sequenced. The sequenced region of Gbr-6700 was compared

with that of Gbr-1000 (Fig. 10). The Gbr-6700 sequence

contains a region that is evolutionarily homologous to the

ss rRNA coding region of Gbr-1000 as inferred from the

sequence similarities. This region is interrupted by a 1.1

kb insert approximately 700 bases from the 5' border of the

ss rRNA-like coding region. The ~800 nucleotide region 5'

to the 1.1 kb insert and the 1200 nucleotides 3' to the







insert were aligned with the ss rRNA sequence from Gbr-

1000. For the pairwise alignment of regions homologous to

the ss rRNA coding region, the two Ginkgo sequences share

86% sequence similarity.
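
Sequence similarity for a pairwise alignment can be expressed as the percentage of aligned positions carrying identical nucleotides. The text does not say how gapped positions were scored, so the sketch below assumes they are excluded; the function name and the toy alignment are hypothetical and are not taken from Fig. 10.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identical positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not scored
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100 * matches / compared


# Hypothetical toy alignment: 9 scored columns, 8 of them identical.
similarity = percent_identity("ACGT-ACGTA", "ACGTTACCTA")
print(f"{similarity:.1f}% identity")
```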

The number of copies of each of the sequences, Gbr-

1000 and Gbr-6700, in the Ginkgo genome was estimated. A

~1440 bp Xba I-Eco RI fragment (Fig. 1) from pGbr-1000 and a ~1440

bp Hind III-Stu I fragment from pGbr-6000 (Fig. 9) were

isolated and used as probes. Xba I-Eco RI digests of

genomic and pGbr-1000 DNA and Hind III-Stu I digests of

genomic and pGbr-6000 DNA were separated by

electrophoresis, blotted and probed using the respective

isolated DNAs. Blot autoradiograms (Fig. 11) were scanned

by laser densitometry (Table 5). The number of copies of

each sequence contained in the genomic lanes was estimated

by interpolation of values obtained for known quantities of

plasmid DNAs. This value was divided by the number of

genomic equivalents in the genomic digest lanes. Values

obtained indicate that the diploid Ginkgo genome contains

an estimated 16,000 copies of the Gbr-1000 sequence and

3,400 copies of the Gbr-6700 sequence.
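
The copy number estimate described above has two steps: interpolate the genomic lane signal against the plasmid standards, then divide by the number of genome equivalents loaded. The sketch below assumes linear interpolation between neighbouring standards, which the text does not specify. All numbers are hypothetical placeholders, not values from Table 5, and the function names are illustrative.

```python
def interpolate_copies(signal, standards):
    """Linearly interpolate copy number from (signal, copies) standards."""
    points = sorted(standards)
    for (s0, c0), (s1, c1) in zip(points, points[1:]):
        if s0 <= signal <= s1:
            return c0 + (c1 - c0) * (signal - s0) / (s1 - s0)
    raise ValueError("signal lies outside the range of the standards")


def copies_per_genome(signal, standards, genome_equivalents):
    """Copies in the genomic lane divided by genome equivalents loaded."""
    return interpolate_copies(signal, standards) / genome_equivalents


# Hypothetical densitometry readings for known plasmid copy numbers.
plasmid_standards = [(10.0, 1e9), (20.0, 2e9), (40.0, 4e9)]
estimate = copies_per_genome(30.0, plasmid_standards, genome_equivalents=2e5)
print(f"{estimate:.0f} copies per genome")
```

Raising an error for signals outside the standard curve is a design choice here, since extrapolating beyond the measured plasmid quantities would be less reliable than interpolating within them.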

The nucleotide compositions of Gbr-1000 and Gbr-6700

were determined and regions homologous to the ss rRNA

coding region compared with available plant sequences

(Table 6). The Gbr-1000 sequence contains 49.5% A+T

nucleotides and 50.5% G+C nucleotides. The 3770 base

sequence from Gbr-6700 is A-T rich, containing








58.6% of these nucleotides. The 1814-nucleotide region

homologous to the ss rRNA coding region, as inferred by

sequence similarity, is also A-T rich. This region

contains 61.3% A-T nucleotides.
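
Nucleotide composition of this kind is a simple count of A and T relative to all unambiguous bases. A minimal sketch, assuming that gaps and ambiguity codes are ignored (the function name and the toy sequence are hypothetical):

```python
from collections import Counter


def at_percent(sequence: str) -> float:
    """Percentage of A+T among the unambiguous bases A, C, G and T."""
    counts = Counter(sequence.upper())
    at = counts["A"] + counts["T"]
    total = at + counts["G"] + counts["C"]
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100 * at / total


# Hypothetical toy sequence; '-' and 'N' are not counted.
toy_at = at_percent("AATTGC-N")
print(f"{toy_at:.1f}% A+T")
```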

The minimum number of observed nucleotide

substitutions between the ss rRNA coding region of Gbr-1000

and the homologous region of Gbr-6700 was estimated (Table

7). Substitutions replacing G and C nucleotides with A and

T nucleotides are observed at a much higher frequency than

other substitutions and represent 91% of the total

substitutions. Transitions of G>A and C>T are the most

common, representing 75% of the total observed

substitutions.
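
A tally of this kind can be made by walking a pairwise alignment and recording each mismatch by type. The sketch below treats the first sequence as the reference, which here would correspond to Gbr-1000; that direction is an assumption inferred from the wording above. The toy alignment and function names are hypothetical.

```python
from collections import Counter


def substitution_counts(reference: str, variant: str, gap: str = "-") -> Counter:
    """Count observed substitutions, keyed as 'X>Y' (reference > variant)."""
    if len(reference) != len(variant):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for ref, var in zip(reference.upper(), variant.upper()):
        if gap in (ref, var) or ref == var:
            continue  # skip gapped columns and identical positions
        counts[f"{ref}>{var}"] += 1
    return counts


def gc_to_at_fraction(counts: Counter) -> float:
    """Fraction of substitutions replacing G or C with A or T."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no substitutions observed")
    hits = sum(n for key, n in counts.items()
               if key[0] in "GC" and key[-1] in "AT")
    return hits / total


# Hypothetical toy alignment, not taken from Fig. 10.
toy_counts = substitution_counts("GGCCAT-G", "AACTATTC")
print(dict(toy_counts), gc_to_at_fraction(toy_counts))
```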

A search was conducted of GenBank data files using the

sequence of the 1100 nucleotide insert region of Gbr-6700.

No large regions (>20 bp) sharing significant sequence

similarity were found using an 80% minimum match search

parameter. One region at the 3' end of the insert was

found to share >80% sequence similarity with 15 to 20 base

segments from a wide variety of sequences. However,

examination of alignments generated for these sequences

reveals that the high degree of similarity can be

attributed to homonucleotide A runs contained within the

sequences. Therefore, the origin of the insert in Gbr-6700

cannot be determined at present.
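
Spurious matches of the kind described above can be flagged by checking candidate segments for long homonucleotide runs before interpreting a similarity score. A minimal sketch; the function name and the toy segment are hypothetical, and the run length treated as significant would be a judgment call.

```python
from itertools import groupby


def longest_run(sequence: str, base: str = "A") -> int:
    """Length of the longest uninterrupted run of `base` in `sequence`."""
    runs = (len(list(group))
            for symbol, group in groupby(sequence.upper())
            if symbol == base.upper())
    return max(runs, default=0)


# Hypothetical 18-base segment dominated by two runs of A.
segment = "GT" + "A" * 7 + "T" + "A" * 6 + "TC"
print(longest_run(segment))  # 7
```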









































Figure 10.


The sequence from Gbr-6700 (top sequence) is
aligned with the sequence from Gbr-1000
(bottom sequence). The Gbr-6700 sequence is
numbered approximately every 100 bp with dots
(.) marking every 10 bp from position one.
The Gbr-1000 sequence is numbered below the
alignment with +1 indicating the first
nucleotide position of the ss rRNA coding
region. Positions which contain identical
nucleotides in both sequences are indicated by
a vertical line between the two sequences.










100
TCTAGAGTCGATCAAAGTTAGAACATAGGTGTAAAGGGTATCGTACGGGATACTCCTCCGAGGGTCCTGTGAACCAAAAGATGACATCGTCCTTCTCCM

200
CTAGTTTTTGTGCATCGCCCCCCCTCCCTCAAATATGTCCCTCCACCATATCAGATAACTCCCTCCCCTCAGGGTTATGAATTTAAACTTGTTCCCTCTC

..300
TCTTAGACCATTAGTCCACCTCCTCCTCAAGTTTGCTCAAGTCTTAAGGTTAATGAGGTATCAATCCCCTTGCTCTTTATTATTGAACATCGCCCATAGT

.. 400
ATGTTCAT TTT TTTACCATCTCTACGAGCCCGCCTTCACCTATATCCACCCTCATCTTTCTTCACTGTCAGAATTTATGTTCCATCACCCGCAAGAATA

.. 500
TAICCCACTGCGCTCCTTGAATATATTCATCCCTTCCTCTGCTTTATCGTTGAATCTCTCTAAGAGAGAGTTTGTCAAGTCAAATGAAAAGAGATCTCCT
111111 IIII III IIIIIII IIII
TTTGTCGAGTCGGATGCGAAGAGATGTCCT
-98v

597
CCC-ACTATTTTGATGGGTGTCTTGTCTTCTATT -ATAGCGAGTGGGGTGCCCAACGTTCAAGCATGCTACCTATTGATCATGCTAGTAGTCATATGC
11 I I II I I II I I I I I I11111I III IIII IIIIIII I I11 IIII I111III III IIIIII IIIII I
CCCAACCAGTTCGACGGGCGCCTCGTCTCCTCTGCACAGCGAG-CGGGCGCCCGACGTTCAGGGATGCTACCTGGTTGATCCTGCCAGTAGTCATATGC
-50A +1"

697
TTGCCTCAAAGAGTAAGCCATGCATGTGTAAGTATGAACTCTTTCAAACTATGAAACTGTGAATGGATCATTAAATCAGTTATAGTCTATTTGATATTAC
111 I1IIIII lll IIIIIIIIIIIIIIIIIIIIIIIIIIIII III 11111111 11111 IIIIIII II llIIIII I I111 1 III
TTGTCTCAAAGATTAAGCCATGCATGTGTAAGTATGAACTCTTTCAGACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTCTTTGATGGTAC
100"

797
CTTACTACTCAGATAAGCGTAGTAATTCTAGAGATAATATGTGCACCAAATCCTAACTTCTAGAAGGGAAACATTTATTAGATAAAAGGTTGACGCGGGC
II IIIIII Il ill IIIIIIIl IIIIIIII H ill IIIllllIIIIIIII IllI l 11111 1 IIIIII IIIlIIIIIII IIIIIIIII
CTTACTACTCGGATAACCGTAGTAATTCTAGAGCTAATACGTGCACCAAATCCCGACTTCTGGAAGGGACGCATTTATTAGATAAAAGGCCGACGCGGGC
200^

S897
TCGCTTGCTACTTTAGTGATTCATGATAATTCAACATATACCATTGCCCTGGTGCTAGCGATGCTTCATTCAAATTTCTATCCTATCAACTTTTTGTT
illl Ill Ill IIIll lll llll l l II I I I I I I I llllllllll l Il l ll llllll llIII IllIllIlll l lll I
TCGCCCGCTGCTTCGGTGATTCATGATAACTCGACGGATCGCACGGCCCTGGTGCCGGCGACGCTTCATTCAAATTTCTGCCCTATCAACTTTCGATGGT
300^

997
AGGATAGAGGCCTACCATGGTGGTGATGAGTGATAGATAATTAGGGTTCAATTTTAGAGAGGGAGCTTGAGAAACAACTACCATATCTGAGGAAGGAAAC
IIIIIIIIIIlll IIIIllII IIIII I IIll II IllIIIIII ll III IIIllIIII I lIIIII IIllil III 111111 I I
AGGATAGAGGCCTACCATGGGTGTGACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGCTACCACATCCAGGAAGGCAGC
400"

..1097
AGGCGACAAATTATCCAATCTTGACATGGGGAGGTATGATAATAATAAACAATATTGGGCTCATCGATTCTGGTAATTGGATGAGTATAATCTAAT
111111 1111 1 IIIIII 11Il I11111111 III IIIIIIIllIIIIII I IIII IIII IIIllII III IIIIIII I11111111
AGGCGCGCAAATTACCCAATCCTGACACGGGGAGGTAGTGACAATAAATAACAATACTGGGCTCATCGAGTCTGGTAATTGGAATGAGTACAATCTAAAT
500"









1197
CCCTTAATGAGAATTCATTGGAGGGAAAGTCTGGTGCCAACAACCACAGTAATTCTATCTCTAATAGTGTATGATTAAGTTTTTGCAGTTAAAAATATCA
1111111I II 11 1111111111l II 1111111 11 11 1111111 I 11 11111 1111 1111111 1111111111111 II
CCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCACCGCCGGGTAATTCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCG
600"

1297
TAGTTGGATATTGGGTCGGGTTGGTTGGTCTACCTTTTTGGTTTGCACCGATCAGTCTATCCCTTCTTCCTTATTGCATTGTTCAATGGAGGGTAGAGAG
IIIIIIII 11111 11111 11 111 111111 III 1111111 I II 11111111 11
TAGTTGGATCTTGGGCCGGGTCGGCCGGTCCGCCTTTTCGGTGTGCACCGGCCGCTCCGTCCTTTCTGC
700^

1397
CCAGAGGATTGTTTAAGTATAGAGGCTTGAAGACTTAGATTGTATACCTAGTCAACCTCCCTAGTGAGTATAGAGTTGGGGAGAAGTGGAGTTGGGTCCA

S 1497
TCGCACAAAATCATGGGTTGCGAAGCATGCTTCGCAATGTCAGAATCAGGGGTTCTTGGTGCAAGAGGTAGCTTGGGTTGCATGGGAGTCAAAGCACTCT

S 1597
ATTGAGGCTTTGCAATTTTACAAAGATTTTAATTTAGTTGATTCAAGGGATTTTTTTAAGAGAGCAATGAGGTATGAGGATGTGTTCCCTGATAAAGTTT

1697
TCTTACTGCTCTTGGCTAGGGTTGGGGGGTGTCCATTAGTTAAGAGAGTTGTTATTGTGGGTAGTGGATCATAGGTTAGATTTAGGTGTTCGCTCTGT

S 1797
AATGAAGAAGGAGGTGATTCTCTTTTTCATGTCTTGATTAAATTCCCAAGCTTGGTTCATATTTGGAGTAGGGTTGAAG GTCAAGAGAGTCTTGTGTCAA

1897
ACCTTTTACTGACCAGTGATGTTATCGATGATAGGTTTTTACTCGCATTACTCCTTGGTGGGGAGATAGGCAAGAGGCTCCGTAATTGGTTATGTGGTCG

1997
ATCGTAGTGGAGGGGCACCATAGGCTCTGTACATGTAGTGAACCTCCTTGCTGAGGTCATCCACTATATAACTAAGCCCTAAGGGATTGGGATGGGTCT

2097
CAACCCATCCTAGTTGAGGGGCCTTTAAGAAGTGATACCACCATTAAACCMAATGCCTATAAGGGTAGGGATGGCCTTTCATCAAGGCACATAGTGTCAA

2197
GGTAATGAGTGTTTACCTAGGTTAGAGTAACACCTAGGTTGGTTGCTAATAAGTAGGGTGTGACACTAGGTGAAGCACACTAGATTTGTTGGCGATCAAC

S 2297
CTAAGTGGGTATACTTGAACCTTTGGAGAATTCCCCAAGTCATTGGGACTGAGAGAAAAAGTTCTCAAAGAGTGAGTATCCTCCAAGCCATGATATAACA

S 2397
TGGGAGCCTAATAAGATCGAGGTGTGCTTTTAGCCTTGGCTTTGCCCATCTCAAATGAAAAAAATAAAAAATCTGTTAGTGACT CCTGGCCTTAAT
I I 1 IIIIIIIII lIIll
GGCGGCGCCTCCTGGCCTTAAT
702^
S 2497
TGGGTAGGTCGCGACTCTAGCATCGTTACTTTAAAAAAAATTGAGTGCTCAAAGCAAGCCTATGCTCTGAATACATTAGCATGGAATAACATGATAAGAG
l I 1 111111 I1l II 111111111 11 l I IIIIIIIIIIIlIIIII IIIIlllIIIllIllIIllIIIIIIIII lill III
TGGCTGGGTCGCGGCTCCGGCGCCGTTACTTTGAAAAAATTAGAGTGCTCAAAGCAAGCCTACGCTCTGAATACATTAGCATGGAAAAACGCGATAAG
800^


Figure 10--continued










2597
TCTGGTCCTATTGTGTTGGCCTTTAGGACAGAAGTAATGATTAATAGAGATGGTTGAGGGATTCATATTTCATTGTTAGAGGTGAMTTCTTGGATTTA
IIIIIIIIIIIIIIIIIIIIIII 1111 1 IIIIIIIIIIIIIII 11 111 1 III liii Ill1ll1ll1l IIIIIIIIIIIIII11tIIIII
TCTGG TCCTATTGTGTTGGCCTTGGCGATACTATAGAGTGGGATGAT TTCATGTeGAGTGAAATCTTGATT
900A

2697
TGAGATGACCCGC MAATTCAGCATTTTATATAGATGAMGif TTGGGACTAMATGATTAGTACAMTGTT
1111111 Il1l1ll1ll liii liii II1lI1tIIIIIIIIIIIIIIIIIII Ill~ll~ll~ l III 11 1111 1111111 Il1ll1ll1ll
TGAAAGACGAA CCACTGCGAAAATGCAAGTTATACAAACG; GTTGGGGCTCAAACATACATACTCTAGTT
10ooo

2796
ACCATAAATG TGCCGACTAGGGACAGAGTGTTAGATC TCAGACTTTGGAATAAGTTTGGTTCC -AGGATT

ACCATAAACGATGCCGACTAGGGATCGGCGGATGTTGCTTTAAGG&CTCCGCCGGCACCTTGTGAGAATCA TTTTTGGGTT=GG GAGTATGG
1100A

2895
TC-CAAGGATAAAACTTMAGGAATTGATGGAAGGGAATCGCCAAGAGTGGAGCCTACGACTTAATTTAACCCAACATGGAGAAACTTACCAGGTCCAGA
11 11111 1 11111111111111111 1111111 1 1 111 11111111111 11 11111111 11 Hill1 11 1111111111111111111
TCGCAAGGCTGAAACTTAMGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGA
1200^

2995
CATAGTMGGATTGACAAATTGAGACCTCTTTCATGATTCTATGGGTGGTGGTGCATGGTCGTTCTTAGTTGGTGAGCGGTTATCTGGTTATTTTTT
IIIIIIIIIIIIIIIII 1111111 1111111 IIIIIIIIIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIIIII II Il1ll1ll1ll
CATAGTAAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTAATTCCGT
1300^

3095
TAAAGAATGAGACCTCAACCTACTAACTAGCTATACGTGTTCCTGGCAATCTGGGATTTCCTAGCTGAGTA
III III 111111111 III 111111111111 1111111 11111 1111111 IIIIIIIIIIIIIIIIII 1111111 IIlIlIIIIIIII I
TAACGAACGAGACCTCAGCCTGCTAACTAGCTATGCGGAGGTTCGCCTTCGTGGCCAGCTTCTTAGAGGGACTATGGCCCTTCAGGCCATGGMGTTTGA
1400^

3195
GGCAATAATAGGTTTAACTTGTTAGGTTGTAGTGACTGATGTTTTCAACMGTTATAA TGGCGGGGCCTAGT
11111111 l1lt 111111 111111111 11 1111 1 111i I 1111111111 1l1llIIIIIII 1lIIIII 11111111 Ii i
GGCMTMCAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGTATTCAACGAGTCTATAACCTGGGCCGAGAGGCCGGGAA
1500^

*3295
ATTTAATTATC ATGATGGGAA GATCTTAAATATTATCTAACAAGAATCTATMCGTAGTCTCAACAATTACT
IIII I IlIIIIIII IIIIIIIIIIIIItIIII IIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIII 111111111 11 1 1111111
ATCTGCCGAATTACTAGGAAGATCTTGCAATT TGATCTAACGGGAATCTGTA GGAGTCATCA TCGCTTAC
1600A

3395
GTCCATGCCCTTTGTACACAATTCCCATCCCTCCTACCMTGAGTCGGATGTGACTTACAGAATCCATGGC
liii II1tIIIIIIIIIII 111 11 11111111 Ill1ll1l1l Ill1l1l1 ll111111 111111 I11 1111 11 1111111
GTCCCTGCCCTTTGTACACACCGCCGCCCTCCATATACCGGATTCGGTCGCCCGAGAGCGTTGCGGCGAC
1700A

.3494
TTGCGAGAAGTTTATT-AATGTTATCATTTAGAGGAAGGAGAGTCATAGCAAGGTTTTCGTAGGTrGCCTGcGMGATCATTGATCATCCTGATGT
I IIIIIIIIII III II rIIIIIIIIIIIIIIIIIIIIIII II 11111111 IIIIIIIII1lIIII IIIIIIIIIIIIII 1111 1 II
TCGCGAGAAGTTCATTGCCTTA TTTAG 181ATCATT1A CGT
181 JA


Figure 10--continued










3593
TTGTAAAAAACAGAACACACTCAAGGGCACMACACATTTTGCG-TTU ACCCTGATGGUGGTAATATATT
IIIII 1 11 111 1 111 1 II 1 11111 111 111
CCGTA- -MGGAACGGGGCMTGGCGAACT-GTGAGGATCGTGGGATGGACCCGTMCCTCGTCGGCCGTCGTTGCCCTC
1900W

3693
GTAGTGCAATTGTCTGACTGGTAGGACGCACCCTTCATCCTATTATCTAGGGATATCGGGGTGACCCACCTTCCATCTGACCGTGTTACMTCTGAC

3770
CGAATAGCTATGTGATGGTGTTGCCAGCATGTTCATTATGTGGGTAAGGACATACCCTTAGTMACCCTCCAAGTGC


Figure 10--continued