When a Penguin Is a Bird: An On-Line Study of Category Co-Reference

Petee, Maia A.
Cowles, H. Wind
Place of Publication:
Gainesville, Fla.
University of Florida

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.



Journal of Undergraduate Research

Volume 9, Issue 3 - Spring 2008

When a Penguin Is a Bird: An On-Line Study of Category Co-Reference

Maia A. Petee and H. Wind Cowles


When you hear the word "bird," what image comes most readily to mind? Is it that of a small, feathered,

flighty animal, similar to a robin or a bluebird? Or is it that of a much larger, sleek, oblong creature that never

takes to the air but prefers diving among ice floes instead? Chances are, if you haven't grown up in the Antarctic,

the first description far more closely meshes with your experience of birds, of their appearance and actions. Since

a robin so well exemplifies all the basic characteristics of a bird, represents the unbidden and instant "mental

picture" of one, it can be said that it is a typical bird, and that a penguin, described by the second set

of characteristics, is an atypical bird (speaking formally, an atypical exemplar of the category "bird").

Taking this into consideration, what happens during normal language use when we are forced to mentally

associate "bird" and "penguin"? This is exactly what happens when we use a "category anaphor" (e.g., bird) to

refer back to the referent of an antecedent expression (e.g., penguin). Do we simply focus on the representation

of "penguin" that we already have in mind, or do we shift our focus to more bird-like features of penguins? That

is, does referring to a penguin as a "bird" cause us to think of the penguin differently? In this paper, we explored

this question in an eye-tracking study that examined how actions or characteristics of atypical antecedents

are perceived when they follow an anaphor. Properties that were either typical of the category, typical of

the exemplar, or equally typical for both were tested and analyzed. Each stimulus was presented in a short

two-sentence passage in which the first sentence introduced the antecedent:

"A penguin was sitting on the ice"

and the second referred back to it with a category anaphor, then going on to assert one of three possible

properties of the antecedent:

1) Antecedent-Specific: "The bird suddenly dove into the water."

2) Category-Specific: "The bird suddenly preened its feathers."

3) Antecedent-Incompatible: "The bird suddenly flew into the air."

In example (3), the predicate given (flying) is very typical of birds, but impossible for penguins. Either a typical

bird or a penguin could be engaging in the preening described in (2), but diving into the water, (1), is not

usually associated with a typical kind of bird, like a bluejay or a robin; it fits only water birds, such as penguins.

When a reader is first faced with a subject/predicate combination that strikes him or her as impossible or

unlikely, this may disrupt ordinary reading processes, causing longer reading times, and he or she may

possibly retrace his or her eye movements to try to make some more sense of the sentence.

What is not yet known is which combinations will register with the participant as being semantically odd and

which will be processed with no problem. Simply put, immediately after reading "the bird," will the reader so

strongly expect to see something typical of birds that he will accept an action impossible for a penguin, or will

the connection that this "bird" is a penguin be made in time? If the first is true, sentences (3) and (2) will cause

no problems, but sentence (1), because of its resulting semantic oddity, will cause a delay in reading and

processing time. Conversely, if the antecedent information is still most prevalent, (1) and (2) should be read

without difficulty, but (3) will take much longer to process.


The typicality of an antecedent and its category anaphor plays an important role in how quickly anaphoric reference

is processed (e.g., Garrod & Sanford, 1977). Garrod and Sanford (1977) were the first to examine whether

typicality of an exemplar antecedent affected reading time for a companion sentence, containing a category

anaphor. The design of this experiment was simple enough: it sought to determine whether typicality of

an antecedent had any effect on reading time for the sentence containing its anaphor. In terms of our "bird/

penguin" example, Garrod and Sanford's experiment tested whether a pair of sentences like,

"The robin was bright-eyed and alert. The bird preened its feathers."

"The penguin was bright-eyed and alert. The bird preened its feathers."

would be processed any differently; specifically, whether the second sentence would be read more

quickly after a typical antecedent ("robin") than after an atypical one ("penguin").

The results of their study did indeed show a significant effect with typical antecedents. Wanting to further

explore this, Garrod & Sanford (1977) conducted a similar experiment that incorporated the extra variable

of inserting an intervening sentence between the sentences containing the antecedent and the anaphor and

tested whether the effect still took place, finding that it did.

To more completely understand the comprehension process that Garrod and Sanford were testing, it is necessary

to understand that each time a person encounters a discourse entity while reading, his mind automatically checks

all previous entities to identify possible antecedents, beginning from his current center of discourse and working back.

Having proven that readers do check and realize when an antecedent and anaphor refer back to the same object

in the text, Garrod and Sanford next turned their attention to discovering whether the same checking

procedure occurred when the antecedent and anaphor actually refer to different objects (Garrod & Sanford,

1977). They found that the effect does still take place, and any difference noted in the processing times between

co-referring noun-phrase pairs and unrelated noun-phrase pairs is negligible; the checking procedure takes about

the same length of time regardless of whether the nouns are actually co-referring.

Much research has been conducted on anaphoric processing since Garrod and Sanford's (1977) influential paper,

and their cornerstone finding of a typicality effect has been upheld. However, very little has been learned about

what happens after the anaphor - that is, what effect naming an atypical exemplar by its category has on

the representation of that exemplar in the mental model of the discourse that readers develop over time. After

all, using a full semantic label like "bird" is different from simply using an anaphor with very little semantic

content such as a pronoun like "it" or even repeating the same name again ("penguin").

One model of anaphoric processing (Garnham & Cowles, in press) has argued that one function of using

category anaphors is to place additional focus on category-specific features. Thus, using "bird" to refer to

"penguin" may be a signal that the text will go on to highlight the more bird-like aspects of the penguin. If this is

the case, then bird-specific features may become more available after encountering an anaphor like "bird."

To recap, we return to an example set of stimuli that will be presented to participants. Words in italics are those

that are being looked back to (the antecedents); underlined information is that which is consistent with either

the antecedent referent, the anaphor, or both; text in bold is the anaphor.

"A penguin was sitting on the ice. The bird suddenly dove into the water."

"A penguin was sitting on the ice. The bird suddenly preened its feathers."

"A penguin was sitting on the ice. The bird suddenly flew into the air."

In all three of these stimuli, at the moment the participant encounters bird, he immediately runs a check to

search for possible antecedents to this noun; all three of these stimuli give him the same result, penguin.

The question we are seeking to explore is, upon encountering bird and automatically drafting a potential set

of features that will be relevant in the upcoming discourse, does bird influence the representation of its

referent penguin? Thus, will future information most salient to penguins or most salient to birds be encountered

more naturally, with the least amount of resistance? If antecedent-specific features are still most relevant, then

a predicate such as "dove into the water" would be accepted most naturally; conversely, if anaphor-specific

features become more relevant, then a predicate featuring a flying bird would be considered far more natural

than one featuring a diving bird.


Stimuli gathered for use in this study were generated by consulting several resources, including Battig &

Montague (1969), who published a study of category norms in which semantic associations between categories

and exemplars were examined. The data were based on questionnaires given to participants to determine what

they perceived to be typical exemplars of a variety of categories. Exemplars were not suggested, but were

generated by the participants themselves during sessions in which they wrote down all that came to mind in

thirty seconds. The results were expansive, ranging over 56 categories and including at least ten exemplars for

each of these, and were presented in a very clean, list-like format that made internal comparisons easily

accessible. The goal for the current experiment was to have pairs of category/atypical exemplar nouns that

shared one or two basic characteristics or actions, but which were still different enough that there was very

little overlap. To each category and to each of its exemplars, three possible predicates were given: an

exemplar-biased predicate (X-Bias), a neutral predicate (Neutral), such as "is preening," and a category-

biased predicate, C-Bias. A representation of the range of possible items for one category/exemplar pair is given

in Table 1 below, where "bird" is the category and "penguin" is here the exemplar.

Both the subjects and the predicates were run through CELEX frequency databases to ensure that they were

neither too frequent nor too infrequent in the English language. All were fairly uniform in frequency.

However, frequency alone is not sufficient to determine true usability, and so the subject/predicate pairs were

also evaluated by participants before the experiment was run.

Table 1.

Antecedent-Anaphor Feature Sets

Antecedent/Anaphor     Feature

Bird                   is flying (C-Bias)
Bird                   is preening (Neutral)
Bird                   is diving (X-Bias)
Penguin                is flying (C-Bias)
Penguin                is preening (Neutral)
Penguin                is diving (X-Bias)


To ensure that the subject/predicate pairs chosen would be widely viewed, typicality-wise, as they had been

intended to be, a "pre-experiment" was prepared. Three versions of a 98-question questionnaire were drafted,

each with 98 different subject/predicate pairings whose order had been randomized by an online random-

number generator. For each item, participants were shown a subject in a column marked "Thing" and

a corresponding action or characteristic (e.g., "is breathing" or "is salty") under the heading "Action/Property,"

then prompted to rate them on a scale from 1 (Very Typical) to 7 (Very Atypical). An example of the format of

the pre-test is given in Table 2.

A number of measures were taken to prevent each stimulus from interfering with the interpretation of those

around it. First, all of the stimuli taken from the same item were divided evenly among the three non-

overlapping "lists," so that the same stimulus was not seen twice. Further, all stimuli with semantically

related subjects or predicates (e.g., "tree" and "cypress") were spaced at least ten items apart.

Participants were 18 undergraduate students at the University of Florida. Ten were females, and eight were

males. Their ages ranged from 18 to 23, with mean age being 19.78. Participants were compensated at an

hourly rate, and all consented to the experiment when briefed on what it would involve. Testing was done in a

quiet, plainly decorated room free of distraction. Participants were instructed not to begin until they had

thoroughly read a set of instructions detailing the meaning and specific context of terms such as "typical"

and "atypical" and had been presented sample items in order to familiarize them with the format in Table 2 below.

Table 2.

Pre-Test Format



     Thing     Action / Property     Very Typical      Very Atypical

1    turkey    is flying             1     2     3     4     5     6

2    ship      is submerged          1     2     3     4     5     6

Data were compiled and organized by arranging the average responses for each item in a matrix that grouped

them vertically by item number and horizontally by whether they were categories or exemplars and what "bias"

their predicate contained: category bias, exemplar bias, or neutral. These averages were then used to see

how successful the stimuli had been in distinguishing their intended typicality.

Stimuli whose typicality was rated either significantly less or significantly more than what the experimenters

intended were evaluated and removed if necessary. For "neutral" conditions, if the "category" and

"exemplar" subjects yielded a significant difference, they were removed, because there should be no difference in

the rated typicalities for category-subjects versus exemplar-subjects. Similarly, for category-biased

predicates, a much higher typicality rating was desired for category-subjects than for exemplar-subjects,

and for exemplar-biased predicates, a higher typicality rating was desired for exemplar-subjects. In either

case, the more marked the difference, the better. There was no absolute cutoff in any condition; however,

any item showing a discrepancy in directionality, such as a Category C-Bias sentence-set being rated as

less typical than a Category X-Bias sentence-set, was immediately disqualified. This screening resulted in

the elimination of seven of the remaining 30 statistically sound stimuli, leaving 23 usable items out of an

original 49. For these, t-tests of the pairs showed that category and exemplar properties were rated as

significantly different (p < .0001) in their association with categories and exemplars, while there was no

difference in this rating for the neutral properties (p = .76).
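
The directional screening described above can be sketched in code. This is only a minimal illustration, not the procedure the authors actually ran: the function name, the dictionary layout, and the mean ratings are all hypothetical. Lower ratings mean "more typical," following the 1 (Very Typical) to 7 (Very Atypical) scale of the pre-test.

```python
def passes_direction_check(means):
    """Return True if an item's mean ratings point in the intended directions.

    `means` maps (subject, bias) to a mean rating; lower = more typical.
    """
    cat_c = means[("category", "C-Bias")]
    cat_x = means[("category", "X-Bias")]
    ex_c = means[("exemplar", "C-Bias")]
    ex_x = means[("exemplar", "X-Bias")]
    return (
        cat_c < ex_c       # category-biased predicate fits the category better
        and ex_x < cat_x   # exemplar-biased predicate fits the exemplar better
        and cat_c < cat_x  # the category prefers its own predicate
    )


# Hypothetical mean ratings for one item (values invented for illustration).
good_item = {
    ("category", "C-Bias"): 1.5,  # "bird is flying"
    ("exemplar", "C-Bias"): 6.5,  # "penguin is flying"
    ("category", "X-Bias"): 4.0,  # "bird is diving"
    ("exemplar", "X-Bias"): 1.3,  # "penguin is diving"
}
# Same item, but the category is rated as atypical with its own predicate.
bad_item = dict(good_item)
bad_item[("category", "C-Bias")] = 5.0

print(passes_direction_check(good_item))
print(passes_direction_check(bad_item))
```

The third comparison mirrors the disqualifying case named in the text: a Category C-Bias pairing rated as less typical than the Category X-Bias pairing. The significance tests reported for the surviving items are a separate step and are not reproduced here.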

Preparation and Method: Final Test

To leave as little room for error as possible in the final test, some additional preparation of the stimuli was

needed. The design of the experiment required that the number of stimuli be a multiple of three, and so the 23

items were pared down to 21 before they were formalized. The final 21 items were then divided into three lists

such that each list contained equal numbers of items from each condition and every item was in each list

exactly once.
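
One common way to satisfy these constraints is to rotate conditions across lists, in the manner of a Latin square. The sketch below assumes that scheme; the text does not say how the authors actually assigned items, and the function name and condition labels are illustrative.

```python
from collections import Counter

CONDITIONS = ["X-Bias", "Neutral", "C-Bias"]


def build_lists(n_items=21, conditions=CONDITIONS):
    """Rotate conditions across lists so each item appears once per list."""
    n_lists = len(conditions)
    lists = [[] for _ in range(n_lists)]
    for item in range(n_items):
        for list_idx in range(n_lists):
            # Shifting by the list index gives each list a different
            # condition for the same item.
            condition = conditions[(item + list_idx) % n_lists]
            lists[list_idx].append((item + 1, condition))
    return lists


lists = build_lists()
for idx, stimuli in enumerate(lists, start=1):
    counts = Counter(cond for _, cond in stimuli)
    print(f"List {idx}: {dict(counts)}")
```

With 21 items and three conditions, each list receives seven items per condition, which is why the item count had to be a multiple of three.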

To prevent participants from becoming aware of the repetitive grammatical structure of the stimuli and

anticipating the anaphoric sentence before they read it, 39 "filler" stimuli were created for interspersion among

the true stimuli. The formats of these were carefully controlled using each salient feature of the true

stimuli, adjusting each in some way that "violated" the stimuli's strict format. For example, eighteen percent of

the filler had the NP antecedent not being referred to again, but had the anaphor in the second sentence

referring back to a second entity that had been introduced. Another eighteen percent adopted the same

structure, but made the NP anaphor a pronoun instead. Examples of such filler stimuli can be seen in Table 3.

As there was no semantic overlap to take into question, the same filler items appeared in each of the three lists.

Also drafted at this time were comprehension questions; for each stimulus and filler item, a short yes-or-no

question was written that would test the participant's understanding of the stimulus. These were written mainly

for verification purposes: to assure the experimenters that participants understood the stimuli as

intended, and that the recorded reading times were based on this understanding.

The stimuli and filler were presented in as large a typeface as would permit each to fit on one line, and were saved

as image files, parts of which could be tagged and labeled (as S-Subj., T-Subj., Predicate, and other salient parts)

to allow the eye-tracker to superimpose its findings onto the image and identify the main measure of interest:

the total time readers spent looking at the predicate of the sentence. This was visually represented by the image

file of the stimulus being overlaid with a series of small circles whose breadth was dependent on the amount of

time that word was dwelt on, marked with specific times (in milliseconds). Data gathering was only

somewhat automatic, as it was necessary for the experimenter to go into the program to retrieve and compile,

as well as analyze, these recorded numbers for each participant.
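
The main measure, total time spent looking at the predicate, amounts to summing the durations of all fixations that land inside the tagged region. The following sketch shows that computation under simplifying assumptions: regions are treated as horizontal pixel ranges on a single line of text, and the class names, coordinates, and durations are hypothetical rather than taken from the eye-tracker's own output format.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    x: float          # horizontal gaze position in pixels
    duration_ms: int  # how long the eye rested there


@dataclass
class Region:
    label: str
    x_start: float
    x_end: float

    def contains(self, fixation: Fixation) -> bool:
        return self.x_start <= fixation.x < self.x_end


def total_dwell_time(fixations, region):
    """Sum the durations of every fixation inside the region, re-reads included."""
    return sum(f.duration_ms for f in fixations if region.contains(f))


# Hypothetical trial: all values are invented for illustration only.
predicate = Region("Predicate", x_start=600, x_end=900)
trial = [
    Fixation(x=120, duration_ms=210),
    Fixation(x=650, duration_ms=180),  # first pass over the predicate
    Fixation(x=400, duration_ms=150),  # regression to earlier text
    Fixation(x=700, duration_ms=230),  # re-reading the predicate
]
print(total_dwell_time(trial, predicate))  # 410
```

Because the sum includes fixations made after a regression, this measure captures re-reading as well as first-pass reading.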

Participants and Equipment

The 22 participants in this experiment were members of the University of Florida community, 11 females and

11 males, with ages ranging from 19 to 31 and a mean age of 22. All but one were compensated at the rate of

$7.50/hr.; that one received course credit for participation.

The eye-tracker used was the EyeLink II model from SR Research in Canada, and was run with the SR

Research Experiment Builder software v. 1.4.55.RC.

Table 3.
Examples of Filler Stimuli


"A monk paced the halls of the cloister. The pathways were
completely silent."

"Allen picked up a Kleenex and blew his nose. It was cherry-red
from the cold."

"A marker rolled off the desk and bounced on the ground. It was

"A boy casually munched a poppyseed bagel. He slowly savored
each bite."

"A laptop hummed softly on the café table. The computer rested,

"A policeman spoke into his radio. The official was afflicted with a
bad cold."

"A young girl jumped around in the snow. The weather delighted
the child."

"A group of boys was playing frisbee on the lawn. After a while, it
got dark."


The participant was seated in a chair centered in front of a monitor, and was briefed via a set of instructions on

the monitor about what he or she would be required to do and the form that the stimuli and comprehension

questions would take. The participant was given a video-game controller with two shoulder buttons, and used

this throughout the experiment to either forward the instructions and sentences on the screen or to answer yes

(right shoulder button) or no (left shoulder button) to the comprehension questions. After the participant read

the instructions and indicated that he or she was ready, the experimenter fitted him or her with the eye-

tracking device: a light headset with two tiny cameras to monitor eye movements. The most time-consuming part

of the setup was positioning and focusing each camera (one for each eye) precisely for height, distance,

horizontal and vertical angles, and horizontal position. Head motion was monitored and corrected for by a signal

sent from the headset to sensors on each corner of the monitor. This set-up generally took five to ten minutes.

After the cameras were properly positioned and the participant had found a comfortable place he or she could sit

for an extended period of time without moving, the machine was calibrated by having the subject look steadily at

a small circle as it jumped around the screen. The results of this test were validated a second time.

The experimenter was stationed at a nearby monitor that showed a real-time readout of the participant's

progress, rotated so the participant could not see or be distracted by it, and, after prompting him, began

the experiment. The participant pushed a shoulder button when he was done reading the stimulus to bring up

the comprehension question, and after he had answered it and refocused his eyes on the center of the

screen (marked with a small circle), the experimenter manually forwarded the program to display the next

stimulus. This portion of the experiment took at most twenty minutes, as there were sixty stimuli (21 true items

and 39 filler).

Results and Analysis

There were several complicating factors involving the equipment or participant understanding that resulted in

some data being omitted from further analysis. Sometimes head movements on the part of the participant during

a trial would cause the eye-tracker to show eye movement "drift" that could not be corrected during data

analysis. One or two trials were lost for each participant due to this. Several participants had to have their entire

data set excluded from analysis because of low accuracy in answering the comprehension questions. Intended as

a control to ensure full reading, these were simple reading-comprehension questions, and should not have

been answered incorrectly if the participant was reading normally. A participant's data were excluded if he or

she answered more than four out of twenty-one questions incorrectly; that is, if the percentage answered

correctly was less than 81%. Nine participants were disqualified in this manner, with the worst of these

answering only 62% of the questions correctly.
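
The arithmetic behind this cutoff can be checked with a short sketch. The function names are illustrative; the only figures taken from the text are the twenty-one questions and the limit of four incorrect answers.

```python
def accuracy(n_correct, n_questions=21):
    """Proportion of comprehension questions answered correctly."""
    return n_correct / n_questions


def is_excluded(n_incorrect, max_incorrect=4):
    """Exclude a participant who misses more than max_incorrect questions."""
    return n_incorrect > max_incorrect


for n_incorrect in (4, 5, 8):
    pct = 100 * accuracy(21 - n_incorrect)
    print(f"{n_incorrect} incorrect -> {pct:.0f}% correct, "
          f"excluded: {is_excluded(n_incorrect)}")
```

Four incorrect answers leave 17 of 21 correct, which rounds to 81%, so the stated threshold corresponds to the last passing score. A score of 62% is what 13 of 21 correct rounds to.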

Another participant's data were lost due to experimenter error, bringing the total of unused participants up to ten.

The data of the twelve remaining participants were analyzed. Average dwell times (in milliseconds) on the targets

of each of the three bias conditions (Category, Exemplar, and Neutral) were calculated. These data, reproduced

in Table 4, represent the total amount of time that participants spent looking at the target region of the text.

To recap, "Category" refers to those two-sentence pairs whose targets (or predicates) were semantically

biased toward the category anaphor as opposed to the exemplar antecedent. Returning to the "penguin/

bird" example, "Category" would be the sentence set ending "The bird flew into the air," and "Exemplar" the

set ending "The bird dove into the water," with "flew" and "dove" being the targets, respectively.

Looking at these data, a clear effect can be seen immediately. On average, the target reading times for

category-biased items were over 50 milliseconds longer than reading times for exemplar-biased items, showing

that participants had more difficulty processing the category-biased items despite the nearness of the categories

to the targets.

Table 4.

Target Dwell-Times

Reading Category-Bias Exemplar-Bias Neutral

Total 411.75

This pattern of results was confirmed by paired t-tests between conditions. As expected, the longer

category-biased reading times were significantly different from the neutral control reading times (t(11) = 2.25,

p < .046), meaning that a difference of this size would be unlikely to arise by chance alone.

Further, the difference between the exemplar-biased reading times and the neutral times did not

approach significance (t < 1). This shows that there was basically no increase in target-processing time for

exemplar-biased items, and therefore no added difficulty.
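
For readers unfamiliar with the test, a paired t statistic is the mean of the per-participant differences divided by its standard error, with degrees of freedom equal to the number of participants minus one (11 for twelve participants). The sketch below computes it with the standard library only. The dwell times are invented for illustration and are not the study's data, so the printed value will not match the reported statistic.

```python
import math
import statistics


def paired_t(a, b):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Invented per-participant mean dwell times (ms); NOT the study's data.
category_bias = [420, 390, 455, 410, 380, 430, 445, 400, 415, 425, 395, 440]
neutral = [380, 395, 400, 370, 385, 390, 410, 405, 360, 400, 375, 395]

t, df = paired_t(category_bias, neutral)
print(f"t({df}) = {t:.2f}")
```

Pairing matters here because each participant contributes a reading time to every condition, so the test works on within-participant differences rather than on two independent groups.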

"A penguin was sitting on the ice. The bird suddenly dove into the water."

"A penguin was sitting on the ice. The bird suddenly flew into the air."

These findings suggest that readers still retain the semantic information from the antecedent and rank

that information highly: more highly, even, than information from a more recent reference to the

antecedent. Practically speaking, "bird" does seem to take its meaning from "penguin," but not so much so that

it disrupts the processing of its atypical antecedent.


The wealth of data gathered in this experiment provides opportunity for many more, and much more in-

depth, analyses. The analysis presented here focuses on only one factor: target dwell-times, showing a

basic directionality correspondence between NP processing and typicality. Exploration of dwell times for any of

the other five "critical areas" of the passages, as well as regression paths and times, could bring more

rounded insight. However, current data go far enough to suggest that Garnham & Cowles' prediction was not

borne out: the model predicted that "bird" would entirely disrupt semantic memory of "penguin," but

no such effect was found. This suggests that a penguin remains a penguin in all circumstances, even

when it is also a bird.


1. Almor, A. "Noun-Phrase Anaphora and Focus: The Informational Load Hypothesis." Psychological Review 106

(1999): 748-765.

2. Battig, W.F., and W.E. Montague. "Category Norms for Verbal Items in 56 Categories: A Replication and Extension

of the Connecticut Norms." Journal of Experimental Psychology 80 (1969): 1-46.

3. Casey, Paul J. "A Reexamination of the Roles of Typicality and Category Dominance in Verifying

Category Membership." Journal of Experimental Psychology 18.4 (1992): 823-834.



4. CELEX English database (Release E25) [On-line]. 1993. Available: Nijmegen: Centre for Lexical

Information [Producer and Distributor].

5. Garnham, Alan, and Wind Cowles. "Looking Both Ways: The JANUS Model of Noun Phrase Anaphor

Processing." Reference and Reference Processing (to appear): 1-45.

6. Garrod, Simon, and Anthony Sanford. "Interpreting Anaphoric Relations: The Integration of Semantic

Information while Reading." Journal of Verbal Learning and Verbal Behavior 16 (1977): 77-90.

7. Garrod, Simon, Daniel Freudenthal, and Elizabeth Boyle. "The Role of Different Types of Anaphor in the On-

Line Resolution of Sentences in a Discourse." Journal of Memory and Language 33 (1994): 39-68.

8. Grosz, B., A. Joshi, and S. Weinstein. "Centering: A Framework for Modelling the Local Coherence of

Discourse." Computational Linguistics 21 (1995): 203-226.

9. Murphy, Gregory L., and Mary E. Lassaline. "Hierarchical Structure in Concepts and the Basic Level

of Categorization." Knowledge, Concepts, and Categories. MIT Press: Cambridge, (1997): 93-131.

10. Rips, Lance J., Edward J. Shoben, and Edward E. Smith. "Semantic Distance and the Verification of

Semantic Relations." Journal of Verbal Learning and Verbal Behavior 12 (1973): 1-20.

11. Vanoverberghe, Veerle, and Gert Storms. "Feature Importance in Feature Generation and Typicality

Rating." European Journal of Cognitive Psychology 15 (2002): 1-18.

12. Van Overschelde, James P., Katherine A. Rawson, and John Dunlosky. "Category Norms: An Updated and

Expanded Version of the Battig and Montague (1969) Norms." Journal of Memory and Language 50 (2004): 289-335.


Back to the Journal of Undergraduate Research
