Journal of Undergraduate Research
Volume 9, Issue 3 - Spring 2008
When a Penguin Is a Bird: An On-Line Study of Category Co-Reference
Maia A. Petee and H. Wind Cowles
When you hear the word "bird," what image comes most readily to mind? Is it that of a small, feathered,
flighty animal, similar to a robin or a bluebird? Or is it that of a much larger, sleek, oblong creature that never
takes to the air but prefers diving among ice floes instead? Chances are, if you haven't grown up in the Antarctic,
the first description far more closely meshes with your experience of birds, of their appearance and actions. Since
a robin so well exemplifies all the basic characteristics of a bird, represents the unbidden and instant "mental
picture" of one, it can be said that it is a typical bird, and that a penguin, described by the second set
of characteristics, is an atypical bird (speaking formally, an atypical exemplar of the category "bird").
Taking this into consideration, what happens during normal language use when we are forced to mentally
associate "bird" and "penguin"? This is exactly what happens when we use a category anaphor (e.g., bird) to
refer back to the referent of an antecedent expression (e.g., penguin). Do we simply focus on the representation
of "penguin" that we already have in mind, or do we shift our focus to more bird-like features of penguins? That
is, does referring to a penguin as a "bird" cause us to think of the penguin differently? In this paper, we explored
this question in an eye-tracking study that examined how actions or characteristics of atypical antecedents
are perceived when they follow an anaphor. Properties that were either typical of the category, typical of
the exemplar, or equally typical for both were tested and analyzed. Each stimulus was presented in a short
two-sentence passage in which the first sentence introduced the antecedent:
"A penguin was sitting on the ice"
and the second referred back to it with a category anaphor, then going on to assert one of three possible
properties of the antecedent:
1) Antecedent-Specific: "The bird suddenly dove into the water."
2) Category-Specific: "The bird suddenly preened its feathers."
3) Antecedent-Incompatible: "The bird suddenly flew into the air."
In example (3), the predicate given (flying) is very typical of birds, but impossible for penguins. Either a typical
bird or a penguin could be engaging in the preening described in (2), but diving into the water, (1), is not
usually associated with a typical kind of bird, like a bluejay or a robin; it is associated only with water birds, such as penguins.
When a reader is first faced with a subject/predicate combination that strikes him or her as impossible or
unlikely, this may disrupt ordinary reading processes, causing longer reading times, and he or she may
possibly retrace his or her eye movements to try to make some more sense of the sentence.
What is not yet known is which combinations will register with the participant as being semantically odd and
which will be processed with no problem. Simply put, immediately after reading "the bird," will the reader so
strongly expect to see something typical of birds that he will accept an action impossible for a penguin, or will
the connection that this "bird" is a penguin be made in time? If the first is true, sentences (3) and (2) will cause
no problems, but sentence (1), because of its resulting semantic oddity, will cause a delay in reading and
processing time. Conversely, if the antecedent information is still most prevalent, (1) and (2) should be read
without difficulty, but (3) will take much longer to process.
The typicality of an antecedent and its category anaphor plays an important role in how quickly anaphoric reference
is processed (e.g., Garrod & Sanford, 1977). Garrod and Sanford (1977) were the first to examine whether
the typicality of an exemplar antecedent affected reading time for a following sentence containing a category
anaphor. In terms of our "bird/penguin" example, their experiment tested whether a pair of sentences like,
"The robin was bright-eyed and alert. The bird preened its feathers."
"The penguin was bright-eyed and alert. The bird preened its feathers."
would be processed any differently; specifically, whether the second sentence would be read more
quickly after "robin," a typical antecedent, than after "penguin," an atypical one.
The results of their study did indeed show a significant typicality effect: the sentence containing the anaphor
was read more quickly after a typical antecedent. To explore this further, Garrod & Sanford (1977) conducted a
similar experiment that inserted an intervening sentence between the sentences containing the antecedent and
the anaphor, and found that the effect still took place.
To more completely understand the comprehension process that Garrod and Sanford were testing, it is necessary
to understand that each time a person encounters a discourse entity while reading, his mind automatically checks
all previous entities to identify possible antecedents, beginning from his current center of discourse and working back.
Having proven that readers do check and realize when an antecedent and anaphor refer back to the same object
in the text, Garrod and Sanford next turned their attention to discovering whether the same checking
procedure occurred when the antecedent and anaphor actually refer to different objects (Garrod & Sanford,
1977). They found that the effect does still take place, and any difference noted in the processing times between
co-referring noun-phrase pairs and unrelated noun-phrase pairs is negligible; the checking procedure takes about
the same length of time regardless of whether the nouns are actually co-referring.
Much research has been conducted on anaphoric processing since Garrod and Sanford's (1977) influential paper,
and their cornerstone finding of a typicality effect has been upheld. However, very little has been learned about
what happens after the anaphor - that is, what effect naming an atypical exemplar by its category has on
the representation of that exemplar in the mental model of the discourse that readers develop over time. After
all, using a full semantic label like "bird" is different from simply using an anaphor with very little semantic
content such as a pronoun like "it" or even repeating the same name again ("penguin").
One model of anaphoric processing (Garnham & Cowles, in press) has argued that one function of using
category anaphors is to place additional focus on category-specific features. Thus, using "bird" to refer to
"penguin" may be a signal that the text will go on to highlight the more bird-like aspects of the penguin. If this is
the case, then bird-specific features may become more available after encountering an anaphor like "bird."
To recap, we return to an example set of stimuli that will be presented to participants. Words in italics are those
that are being looked back to (the antecedents); underlined information is that which is consistent with either
the antecedent referent, the anaphor, or both; text in bold is the anaphor.
"A penguin was sitting on the ice. The bird suddenly dove into the water."
"A penguin was sitting on the ice. The bird suddenly preened its feathers."
"A penguin was sitting on the ice. The bird suddenly flew into the air."
In all three of these stimuli, at the moment the participant encounters bird, he immediately runs a check to
search for possible antecedents to this noun; all three of these stimuli give him the same result, penguin.
The question we are seeking to explore is, upon encountering bird and automatically drafting a potential set
of features that will be relevant in the upcoming discourse, does bird influence the representation of its
referent penguin? Thus, will future information most salient to penguins or most salient to birds be encountered
more naturally, with the least amount of resistance? If antecedent-specific features are still most relevant, then
a predicate such as "dove into the water" would be accepted most naturally; conversely, if anaphor-specific
features become more relevant, then a predicate featuring a flying bird would be considered far more natural
than one featuring a diving bird.
PREPARATION AND METHOD
Stimuli gathered for use in this study were generated by consulting several resources, including Battig &
Montague (1969), who published a study of category norms in which semantic associations between categories
and exemplars were examined. The data were based on questionnaires given to participants to determine what
they perceived to be typical exemplars of a variety of categories. Exemplars were not suggested, but were
generated by the participants themselves during sessions in which they wrote down all that came to mind in
thirty seconds. The results were expansive, ranging over 56 categories and including at least ten exemplars for
each of these, and were presented in a very clean, list-like format that made internal comparisons easily
accessible. The goal for the current experiment was to have pairs of category/atypical exemplar nouns that
shared one or two basic characteristics or actions, but which were still different enough that there was very
little overlap. To each category and to each of its exemplars, three possible predicates were given: an
exemplar-biased predicate (X-Bias), a neutral predicate (Neutral), such as "is preening," and a category-
biased predicate, C-Bias. A representation of the range of possible items for one category/exemplar pair is given
in Table 1 below, where "bird" is the category and "penguin" is here the exemplar.
Both the subjects and the predicates were run through CELEX frequency databases to ensure that they were
neither too frequent nor too infrequent in the English language. All were fairly uniform in frequency.
However, frequency alone is not sufficient to determine true usability, and so the subject/predicate pairs were
also evaluated by participants before the experiment was run.
Table 1. Antecedent-Anaphor Feature Sets
Antecedent/Anaphor   Feature        Bias
Bird                 is flying      C-Bias
Bird                 is preening    Neutral
Bird                 is diving      X-Bias
Penguin              is flying      C-Bias
Penguin              is preening    Neutral
Penguin              is diving      X-Bias
To ensure that the subject/predicate pairs chosen would be widely viewed, typicality-wise, as they had been
intended to be, a "pre-experiment" was prepared. Three versions of a 98-question questionnaire were drafted,
each with 98 different subject/predicate pairings whose order had been randomized by an online random-
number generator. For each item, participants were shown a subject in a column marked "Thing" and
a corresponding action or characteristic (e.g., "is breathing" or "is salty") under the heading "Action/Property,"
then prompted to rate them on a scale from 1 (Very Typical) to 7 (Very Atypical). An example of the format of
the pre-test is given in Table 2.
A number of measures were taken to prevent each stimulus from interfering with the interpretation of those
around it. First, all of the stimuli taken from the same item were divided evenly among the three non-
overlapping "lists," so that the same stimulus was not seen twice. Further, all stimuli with semantically
related subjects or predicates (e.g. "tree" and "cypress") were spaced, at the closest, no fewer than ten items apart.
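The spacing constraint described above can be checked mechanically. The sketch below is illustrative only: the function name, the sample stimulus labels, and the reading of "ten items apart" as a difference of at least ten list positions are assumptions, not details reported for the study.

```python
def spacing_ok(order, related_groups, min_gap=10):
    """Return True if related stimuli are at least min_gap positions apart.

    order: list of unique stimulus labels in presentation order (hypothetical).
    related_groups: iterable of sets of semantically related labels.
    """
    position = {stimulus: index for index, stimulus in enumerate(order)}
    for group in related_groups:
        # Positions of the group's members that actually occur in this list.
        present = sorted(position[s] for s in group if s in position)
        # Comparing neighbours in sorted order covers the closest pairs.
        for earlier, later in zip(present, present[1:]):
            if later - earlier < min_gap:
                return False
    return True


# Hypothetical orderings: "tree" and "cypress" ten positions apart, then adjacent.
fillers = [f"filler{i}" for i in range(9)]
good_order = ["tree"] + fillers + ["cypress"]
bad_order = ["tree", "cypress"] + fillers
related = [{"tree", "cypress"}]
```

A randomized order that fails this check could simply be reshuffled until it passes.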
Participants were 18 undergraduate students at the University of Florida. Ten were females, and eight were
males. Their ages ranged from 18 to 23, with mean age being 19.78. Participants were compensated at an
hourly rate, and all consented to the experiment when briefed on what it would involve. Testing was done in a
quiet, plainly decorated room free of distraction. Participants were instructed not to begin until they had
thoroughly read a set of instructions detailing the meaning and specific context of terms such as "typical"
and "atypical" and had been presented sample items in order to familiarize them with the format in Table 2 below.
Table 2. Example of the Pre-Test Format
     Thing    Action/Property    Rating
1    turkey   is flying          1 2 3 4 5 6
2    ship     is submerged       1 2 3 4 5 6
Data were compiled and organized by arranging the average responses for each item in a matrix that grouped
them vertically by item number and horizontally by whether they were categories or exemplars and what "bias"
their predicate contained: category bias, exemplar bias, or neutral. These averages were then used to see
how successful the stimuli had been in distinguishing their intended typicality.
Stimuli whose typicality was rated either significantly less or significantly more than what the experimenters
intended were evaluated and removed if necessary. For "neutral" conditions, if the "category" and
"exemplar" subjects yielded a significant difference, they were removed, because there should be no difference in
the rated typicalities for category-subjects versus exemplar-subjects. Similarly, when evaluating responses
for category-biased predicates, a much higher overall typicality was desired for category-subjects
than for exemplar-subjects, and in responses for exemplar-biased predicates, a higher typicality for
exemplar-subjects. In either of these cases, the more marked the difference, the better. There was no
absolute cutoff in any condition; however, any item showing a discrepancy in directionality, such as a Category
C-Bias sentence-set being rated as less typical than a Category X-Bias sentence-set, was immediately disqualified. This
screening resulted in the elimination of seven of the remaining 30 statistically sound stimuli, leaving 23 usable
items out of an original 49. Of these, t-tests of the pairs showed that category and exemplar properties were
rated as significantly different (p < .0001) in their association with categories and exemplars, while there was
no difference in this rating for the neutral properties (p = .76).
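The directionality screen described above can be expressed as a few comparisons on an item's mean ratings. This is a minimal sketch under stated assumptions: the function name, the dictionary layout, and the sample ratings are hypothetical, lower ratings mean "more typical" as on the pre-test scale, and the significance-based screening of neutral predicates is not modeled.

```python
def passes_direction_check(mean_rating):
    """mean_rating maps (subject_type, bias) to a mean rating; 1 = very typical."""
    category_c = mean_rating[("category", "C-Bias")]
    category_x = mean_rating[("category", "X-Bias")]
    exemplar_c = mean_rating[("exemplar", "C-Bias")]
    exemplar_x = mean_rating[("exemplar", "X-Bias")]
    return (
        category_c < exemplar_c      # "bird is flying" beats "penguin is flying"
        and exemplar_x < category_x  # "penguin is diving" beats "bird is diving"
        and category_c < category_x  # "bird is flying" beats "bird is diving"
    )


# Hypothetical mean ratings for one item (not data from the study).
well_behaved_item = {
    ("category", "C-Bias"): 1.4,
    ("category", "X-Bias"): 4.9,
    ("exemplar", "C-Bias"): 6.5,
    ("exemplar", "X-Bias"): 1.8,
}
# Same item with the category ratings swapped, reversing the intended direction.
reversed_item = dict(well_behaved_item)
reversed_item[("category", "C-Bias")] = 4.9
reversed_item[("category", "X-Bias")] = 1.4
```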
Preparation and Method: Final Test
To leave as little room for error as possible in the final test, some additional preparation of the stimuli was
needed. The design of the experiment required that the number of stimuli be a multiple of three, and so the 23
items were pared down to 21 before they were formalized. The final 21 items were then divided into three lists
such that each list contained equal numbers of items from each condition and every item appeared in each list.
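One common way to meet both requirements is to rotate conditions across lists, as in a Latin square. The sketch below is an assumption about how such lists could be built, not a description of the procedure actually used; the function name and zero-based item numbering are hypothetical.

```python
from collections import Counter

CONDITIONS = ("Category", "Exemplar", "Neutral")


def build_lists(n_items=21):
    """Rotate conditions across lists so each item appears once per list."""
    n_conditions = len(CONDITIONS)
    return [
        # In list number `shift`, item i is shown in condition (i + shift) mod 3.
        {item: CONDITIONS[(item + shift) % n_conditions] for item in range(n_items)}
        for shift in range(n_conditions)
    ]


lists = build_lists()
# Number of items per condition within each list.
counts_per_list = [Counter(assignment.values()) for assignment in lists]
```

With 21 items and three conditions, each list holds seven items per condition, and each item appears in a different condition in each list, which is why the item count had to be a multiple of three.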
To prevent participants from becoming aware of the repetitive grammatical structure of the stimuli and
anticipating the anaphoric sentence before they read it, 39 "filler" stimuli were created for interspersion among
the true stimuli. The formats of these were carefully controlled: each took a salient feature of the true
stimuli and adjusted it in some way that "violated" the stimuli's strict format. For example, in eighteen percent of
the fillers the NP antecedent was not referred to again; instead, the anaphor in the second sentence
referred back to a second entity that had been introduced. Another eighteen percent adopted the same
structure, but made the NP anaphor a pronoun instead. Examples of such filler stimuli can be seen in Table 3.
As there was no semantic overlap to take into question, the same filler items appeared in each of the three lists.
Also drafted at this time were comprehension questions; for each stimulus and filler item, a short yes-or-no
question was written that would test the participant's understanding of the stimulus. These were written mainly
for verification purposes: to assure the experimenters that the participant truly understood the stimuli as
intended, and that the recorded reading times were based on this understanding.
The stimuli and filler were presented in as large a typeface as would permit each to fit on one line, and were saved
as image files, parts of which could be tagged and labeled (as S-Subj., T-Subj., Predicate, and other salient parts)
to allow the eye-tracker to superimpose its findings onto the image and identify the main measure of interest:
the total time readers spent looking at the predicate of the sentence. This was visually represented by the image
file of the stimulus being overlaid with a series of small circles whose breadth was dependent on the amount of
time that word was dwelt on, marked with specific times (in milliseconds). Data gathering was only
somewhat automatic, as it was necessary for the experimenter to go into the program to retrieve and compile,
as well as analyze, these recorded numbers for each participant.
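The main measure, total time spent looking at the predicate, amounts to summing the durations of all fixations that land inside the tagged predicate region. The sketch below illustrates that computation with hypothetical data structures and pixel coordinates; it does not use or reproduce the eye-tracker software's own file format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    x: float            # horizontal screen position in pixels (hypothetical)
    y: float            # vertical screen position in pixels (hypothetical)
    duration_ms: int    # how long the eye rested at this position


@dataclass(frozen=True)
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, fixation):
        return (self.left <= fixation.x <= self.right
                and self.top <= fixation.y <= self.bottom)


def total_dwell_time(fixations, region):
    """Sum the durations of every fixation inside the region, including re-reads."""
    return sum(f.duration_ms for f in fixations if region.contains(f))


# Hypothetical trial: two fixations on the predicate, one elsewhere.
predicate = Region(left=600, top=380, right=900, bottom=420)
trial = [Fixation(120, 400, 210), Fixation(650, 400, 240), Fixation(820, 405, 180)]
```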
Participants and Equipment
The 22 participants in this experiment were members of the University of Florida community, 11 females and
11 males, with ages ranging from 19 to 31 and a mean age of 22. All but one were compensated at the rate of
$7.50/hr.; that one received course credit for participation.
The eye-tracker used was the EyeLink II model from SR Research in Canada, and it was run with the SR
Research Experiment Builder software v. 1.4.55.RC.
Table 3. Examples of Filler Stimuli
"A monk paced the halls of the cloister. The pathways were
"Allen picked up a Kleenex and blew his nose. It was cherry-red
from the cold."
"A marker rolled off the desk and bounced on the ground. It was
"A boy casually munched a poppyseed bagel. He slowly savored
"A laptop hummed softly on the café table. The computer rested,
"A policeman spoke into his radio. The official was afflicted with a
"A young girl jumped around in the snow. The weather delighted
"A group of boys was playing frisbee on the lawn. After a while, it
The participant was seated in a chair centered in front of a monitor, and was briefed via a set of instructions on
the monitor about what he or she would be required to do and the form that the stimuli and comprehension
questions would take. The participant was given a video-game controller with two shoulder buttons, and used
this throughout the experiment to either forward the instructions and sentences on the screen or to answer yes
(right shoulder button) or no (left shoulder button) to the comprehension questions. After the participant read
the instructions and indicated that he or she was ready, the experimenter fitted him or her with the eye-
tracking device: a light headset with two tiny cameras to monitor eye movements. The most time-consuming part
of the setup was positioning and focusing each camera (one for each eye) precisely for height, distance,
horizontal and vertical angles, and horizontal position. Head motion was monitored and corrected for by a signal
sent from the headset to sensors on each corner of the monitor. This set-up generally took five to ten minutes.
After the cameras were properly positioned and the participant had found a comfortable place he or she could sit
for an extended period of time without moving, the machine was calibrated by having the subject look steadily at
a small circle as it jumped around the screen. The results of this test were validated a second time.
The experimenter was stationed at a nearby monitor that showed a real-time readout of the participant's
progress, rotated so the participant could not see or be distracted by it, and, after prompting the participant, began
the experiment. The participant pushed a shoulder button when he or she was done reading the stimulus to bring up
the comprehension question, and after the participant had answered it and refocused on the center of the
screen (marked with a small circle), the experimenter manually forwarded the program to display the next
stimulus. This portion of the experiment took at most twenty minutes, as there were sixty stimuli (21 true items
and 39 filler).
Results and Analysis
There were several complicating factors involving the equipment or participant understanding that resulted in
some data being omitted from further analysis. Sometimes head movements on the part of the participant during
a trial would cause the eye-tracker to show eye movement "drift" that could not be corrected during data
analysis. One or two trials were lost for each participant due to this. Several participants had to have their entire
data set excluded from analysis because of low accuracy in answering the comprehension questions. Intended as
a control to ensure full reading, these were simple reading-comprehension questions, and should not have
been answered incorrectly if the participant was reading normally. A participant's data were excluded if he or
she answered more than four out of twenty-one questions incorrectly; that is, if the percentage answered
correctly was less than 81%. Nine participants were disqualified in this manner, with the worst of these
answering only 62% of the questions correctly.
Another participant's data were lost due to experimenter error, bringing the total of unused participants up to ten.
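The exclusion rule is simple arithmetic on the twenty-one questions. The sketch below uses hypothetical function names. It shows that four errors correspond to about 81% correct, and that the reported worst score of 62% matches eight errors.

```python
N_QUESTIONS = 21
MAX_ERRORS = 4


def percent_correct(n_incorrect, n_questions=N_QUESTIONS):
    """Percentage of comprehension questions answered correctly."""
    return 100 * (n_questions - n_incorrect) / n_questions


def is_excluded(n_incorrect, max_errors=MAX_ERRORS):
    """A participant is excluded after more than max_errors incorrect answers."""
    return n_incorrect > max_errors


# Four errors is the most a retained participant could make.
borderline = round(percent_correct(4))   # 17 of 21 correct
# Eight errors corresponds to the lowest reported score.
worst = round(percent_correct(8))        # 13 of 21 correct
```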
The data of the twelve remaining participants were analyzed. Average dwell times (in milliseconds) on the targets
of each of the three bias conditions (Category, Exemplar, and Neutral) were calculated. These data, reproduced
in Table 4, represent the total amount of time that participants spent looking at the target region of the text.
To recap, "Category" refers to those two-sentence pairs whose targets (or predicates) were semantically
biased toward the category anaphor as opposed to the exemplar antecedent. Returning to the "penguin/
bird" example, "Category" would be the sentence set ending "The bird flew into the air," and "Exemplar" the
set ending "The bird dove into the water," with "flew" and "dove" being the targets, respectively.
Looking at these data, a clear effect can be seen immediately. On average, the target reading times for
category-biased items were over 50 milliseconds longer than reading times for exemplar-biased items, suggesting
that participants had more difficulty processing the category-biased items despite the proximity of the category
anaphor to the target.
Table 4. Average target dwell times (in milliseconds) by condition: Category-Bias, Exemplar-Bias, Neutral
This pattern of results was confirmed by paired t-tests across conditions. As expected, the longer
category-biased reading times were significantly different from the neutral control reading times (t(11) = 2.25,
p < .046), meaning that a difference of this size would be unlikely to arise by chance alone.
Further, the difference between the exemplar-biased reading times and the neutral times did not
approach significance (t < 1). This suggests that there was basically no increase in target-processing time for
exemplar-biased items, and therefore no added difficulty.
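For readers unfamiliar with the paired t-test, the statistic is the mean of the per-participant differences divided by its standard error, with degrees of freedom equal to the number of participants minus one (11 for the twelve participants here). The sketch below uses only the Python standard library. The function name and the sample values are hypothetical and are not the study's data.

```python
import math
from statistics import mean, stdev


def paired_t(condition_a, condition_b):
    """Return (t, degrees_of_freedom) for paired observations."""
    if len(condition_a) != len(condition_b):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(differences)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical dwell times (ms) for three participants in two conditions.
category_bias = [500, 520, 540]
neutral = [490, 500, 510]
t_value, df = paired_t(category_bias, neutral)
```

The resulting t value is then compared against the t distribution with the given degrees of freedom to obtain a p value.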
"A penguin was sitting on the ice. The bird suddenly dove into the water."
"A penguin was sitting on the ice. The bird suddenly flew into the air."
These findings suggest that readers still retain the semantic information from the antecedent and rank
that information highly: more highly, even, than information from a more recent reference to the
antecedent. Practically speaking, "bird" does seem to take its meaning from "penguin," but not so much so that
it disrupts the processing of its atypical antecedent.
The wealth of data gathered in this experiment provides opportunity for many more, and much more in-
depth, analyses. The analysis presented here focuses on only one factor: target dwell-times, showing a
basic directionality correspondence between NP processing and typicality. Exploration of dwell times for any of
the other five "critical areas" of the passages, as well as regression paths and times, could bring more
rounded insight. However, current data go far enough to suggest that Garnham & Cowles' prediction was not
borne out: their model predicted that "bird" would entirely disrupt semantic memory of "penguin,"
but no such effect was found. This suggests that a penguin remains a penguin in all circumstances, even
when it is also a bird.
1. Almor, A. "Noun-Phrase Anaphora and Focus: The Informational Load Hypothesis." Psychological Review 106
2. Battig, W.F., and W.E. Montague. "Category Norms for Verbal Items in 56 Categories: A Replication and Extension
of the Connecticut Norms." Journal of Experimental Psychology 80 (1969): 1-46.
3. Casey, Paul J. "A Reexamination of the Roles of Typicality and Category Dominance in Verifying
Category Membership." Journal of Experimental Psychology 18.4 (1992): 823-834.
4. CELEX English database (Release E25) [On-line]. 1993. Available: Nijmegen: Centre for Lexical
Information [Producer and Distributor].
5. Garnham, Alan, and Wind Cowles. "Looking Both Ways: The JANUS Model of Noun Phrase Anaphor
Processing." Reference and Reference Processing (to appear): 1-45.
6. Garrod, Simon, and Anthony Sanford. "Interpreting Anaphoric Relations: The Integration of Semantic
Information while Reading." Journal of Verbal Learning and Verbal Behavior 16 (1977): 77-90.
7. Garrod, Simon, Daniel Freudenthal, and Elizabeth Boyle. "The Role of Different Types of Anaphor in the On-
Line Resolution of Sentences in a Discourse." Journal of Memory and Language 33 (1994): 39-68.
8. Grosz, B., A. Joshi, and S. Weinstein. "Centering: A Framework for Modelling the Local Coherence of
Discourse." Computational Linguistics 21 (1995): 203-226.
9. Murphy, Gregory L., and Mary E. Lassaline. "Hierarchical Structure in Concepts and the Basic Level
of Categorization." Knowledge, Concepts, and Categories. MIT Press: Cambridge, (1997): 93-131.
10. Rips, Lance J., Edward J. Shoben, and Edward E. Smith. "Semantic Distance and the Verification of
Semantic Relations." Journal of Verbal Learning and Verbal Behavior 12 (1973): 1-20.
11. Vanoverberghe, Veerle, and Gert Storms. "Feature Importance in Feature Generation and Typicality
Rating." European Journal of Cognitive Psychology 15 (2002): 1-18.
12. Van Overschelde, James P., Katherine A. Rawson, and John Dunlosky. "Category Norms: An Updated and
Expanded Version of the Battig and Montague (1969) Norms." Journal of Memory and Language 50 (2004): 289-335.