An Ontology Framework for Digital Humanities Collections : Grant Proposal for NEH's Digging into Data Program using the ...

Material Information

Title:
An Ontology Framework for Digital Humanities Collections : Grant Proposal for NEH's Digging into Data Program using the Baldwin Library of Historical Children's Literature
Physical Description:
Book
Language:
English
Creator:
Beck, Howard
Taylor, Laurie
Publisher:
George A. Smathers Libraries, University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:
Copyright Date:
2009

Subjects

Subjects / Keywords:
Grant proposal
Genre:

Notes

Abstract:
Intellectual significance of the project: The project, An Ontology Framework for Digital Humanities Collections, is an opportunity to develop a framework for organizing humanities collections by focusing on the organization and interpretation of digitized materials. The problem to be addressed is how to archive entire collections (all ontology objects) in a way that the collection objects can be interpreted and accessed for general purposes. Currently, digital libraries afford access to static texts, and provide little user interactivity or input. An example of this is the Archive of Indigenous Languages of Latin America (AILLA) at the University of Texas, Austin (http://www.ailla.utexas.org), which contains a valuable collection of indigenous language materials but does not provide tools to allow users to mine the data for specific research purposes. Another example is the University of Florida’s (UF) Baldwin Library of Historical Children’s Literature (http://www.library.ufl.edu/baldwin/). Currently this archive permits full-text search over digitized text, but conventional full-text search engines do not allow humanities scholars to search for concepts. Such engines can only retrieve documents containing the search terms, and they cannot disambiguate those terms. Therefore, scholars must study each retrieved document to see if it meets their requirements. The existing Jaqi archive (University of Florida Digital Collections, 2009; Beck et al., 2007) is currently only a static collection of data objects. It must be searched by manual navigation, and currently does not have a search engine.
Likewise, interpretation and inclusion of expert knowledge about the collection must also be done manually. There are opportunities for automatically discovering relationships among these data objects (also leading to concept-based searching) that need to be exploited in order to enrich interpretation of the collection. Advances in natural language processing and machine learning can lead to improvements in interpretation and access to these collections. The focus of this proposal is to incorporate such capabilities within a framework for managing digital humanities collections.
General Note:
Ontology Framework for Digital Humanities Collections is a collaborative grant proposal for the National Endowment for the Humanities (NEH), Digging into Data Program using the Baldwin Library of Historical Children's Literature. The proposal was not funded.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
System ID:
UF00094671:00001

Table of Contents
    Division of Sponsored Research Form
    Main
Full Text







Office of Research
Division of Sponsored Research
PO Box 115500 / 219 Grinter Hall
Gainesville, FL 32611-5500
Phone: (352) 392-1582
Fax: (352) 392-9605


Principal Investigator: Taylor, Laurie

DSR-1

Sponsored Projects
Approval Form

Multiple PI Project: Yes / No
Department: Digital Library Center
College: UF Libraries
Current UPN: (DSR Completes)
Project Title: A Data Base Framework for Digital Humanities Collections
if Known:
Funding Agency: National Endowment for the Humanities
PeopleSoft Proposal #:
PeopleSoft Project #:

Type: New / Renewal / Continuation / Supplemental / Revised / Change of PI / Change Dept
Category: Research / Training / Extension / Clinical Trial / Other (fellowships, patient services, public service, conference, etc.)
UF/Dept Person to discuss Application (name/phone/email): Bess de Farber, 273-2519, bdefarber@ufl.edu
Application Deadline: Postmark / Receipt / None
Date:

Check all that apply (Yes / No / Pending):
    *Human Subjects (IRB)
    *Animal Subjects (IACUC)
    Recombinant DNA/RNA
    Biohazards
*If yes, attach the IRB and/or the IACUC approval letter.

Application Mailing Instructions: Grants.gov / Other electronic system / Mail Original and Copies to / FedEx / Other Overnight / First Class Mail / Fax to / Email PD / Release back to PI / Internal Only (no mailing)

Cost Sharing: Yes / No. If yes, complete the following:
    Mandatory: $ (attach the required cost share letter and agency guidelines)
    Voluntary Committed: $ (attach the Dean's Approval Letter)

(DSR Use) DSR Staff:    Received:    Action Date:    (FedEx Account Number)

Multiple Principal Investigator Projects: For those projects designated as a Multiple PI Project, the listed PIs share the responsibility for directing and managing the project in accordance with University and Sponsor policies and procedures. The Contact PI will be responsible for relaying communications between all of the PIs, University Officials and the Sponsor.
Principal Investigator Endorsement: By signing below you agree to perform the work and manage the project in accordance with University and Sponsor policies and procedures.
Investigator(s) Assurance Statement as Required by Federal Regulation: Investigator(s), by signing this DSR-1 form, further certify that (1) the information submitted within the application is true, complete and accurate to the best of their knowledge; (2) that any false, fictitious, or fraudulent statements or claims may subject the Investigator(s) to criminal, civil, or administrative penalties; and (3) that the Principal Investigator(s) agree to accept responsibility for the scientific conduct of the project and to provide the required progress reports and the final report if a grant is awarded as a result of the application.
University Endorsement: This project has been reviewed by the officials whose signatures appear below as they relate to their areas and are satisfied that all faculty involved in the project have agreed to participate and that all obligations and commitments described herein are acceptable.
Indirect Cost Distributions: Upon receipt of DSR's Notice of Award, Principal Investigator(s) are instructed to use the Office of Research web-based F&A Manager to declare how the indirect costs collected under the award shall be distributed. The return of indirect costs generally occurs in the Fall of each year and is based upon the indirect costs collected from grants and contracts during the preceding fiscal year (July 1 - June 30).


Principal Investigator (check here if Contact PI):
    NAME: Taylor, Laurie
    TITLE: Interim Director
    UFID #: 9221-6290
    TELEPHONE #: 273-2902
    DEPARTMENT: Digital Library Center

Department Chair:
    DEPARTMENT: Digital Library Center
    COLLEGE: UF Libraries

Co-Principal Investigator:

Other Endorsement (Where Needed):

Vice President for Research:
    Division of Sponsored Research

Please add additional signature sheets as needed.

UNIVERSITY of FLORIDA








Office of Research
Division of Sponsored Research
PO Box 115500 / 219 Grinter Hall
Gainesville, FL 32611-5500
Phone: (352) 392-1582
Fax: (352) 392-9605

UNIVERSITY of FLORIDA

DSR-1A

Sponsored Projects
Approval Form


Additional Signature Sheet


Multiple Principal Investigator Projects: For those projects designated as a Multiple PI Project, the listed PIs share the responsibility for directing and managing the project in accordance with University and Sponsor policies and procedures. The Contact PI will be responsible for relaying communications between all of the PIs, University Officials and the Sponsor.
Principal Investigator Endorsement: By signing below you agree to perform the work and manage the project in accordance with University and Sponsor policies and procedures.
Investigator(s) Assurance Statement as Required by Federal Regulation: Investigator(s), by signing this DSR-1 form, further certify that (1) the information submitted within the application is true, complete and accurate to the best of their knowledge; (2) that any false, fictitious, or fraudulent statements or claims may subject the Investigator(s) to criminal, civil, or administrative penalties; and (3) that the Principal Investigator(s) agree to accept responsibility for the scientific conduct of the project and to provide the required progress reports and the final report if a grant is awarded as a result of the application.
University Endorsement: This project has been reviewed by the officials whose signatures appear below as they relate to their areas and are satisfied that all faculty involved in the project have agreed to participate and that all obligations and commitments described herein are acceptable.
Indirect Cost Distributions: Upon receipt of DSR's Notice of Award, Principal Investigator(s) are instructed to use the Office of Research web-based F&A Manager to declare how the indirect costs collected under the award shall be distributed. The return of indirect costs generally occurs in the Fall of each year and is based upon the indirect costs collected from grants and contracts during the preceding fiscal year (July 1 - June 30).


Principal Investigator (check here if Contact PI):

Co-Principal Investigator:
    NAME: Howard Beck
    TITLE: Professor
    UFID #: 2769-8090
    DATE: 9/1/09
    TELEPHONE #: 392-3797
    DEPARTMENT: Ag & Bio Engineering

Department Chair:
    NAME: Dorota Z.
    DEPARTMENT: Ag & Bio Engineering

Other Endorsement (Where Needed):

College Dean:






University of Florida Digital Library Center
An Ontology Framework for Digital Humanities Collections

Narrative

1. Intellectual significance of the project

The project, An Ontology Framework for Digital Humanities Collections, is an opportunity to develop a framework for organizing humanities collections by focusing on the organization and interpretation of digitized materials. The problem to be addressed is how to archive entire collections (all ontology objects) in a way that the collection objects can be interpreted and accessed for general purposes. Currently, digital libraries afford access to static texts, and provide little user interactivity or input. An example of this is the Archive of Indigenous Languages of Latin America (AILLA) at the University of Texas, Austin (http://www.ailla.utexas.org), which contains a valuable collection of indigenous language materials but does not provide tools to allow users to mine the data for specific research purposes [1]. Another example is the University of Florida's (UF) Baldwin Library of Historical Children's Literature (http://www.uflib.ufl.edu/spec/baldwin/baldwin.html). Currently this archive permits full-text search over digitized text, but conventional full-text search engines do not allow humanities scholars to search for concepts. Such engines can only retrieve documents containing the search terms, and they cannot disambiguate those terms. Therefore, scholars must study each retrieved document to see if it meets their requirements. The existing Jaqi archive (University of Florida Digital Collections, 2009; Beck et al., 2007) is currently only a static collection of data objects. It must be searched by manual navigation, and currently does not have a search engine. Likewise, interpretation and inclusion of expert knowledge about the collection must also be done manually. There are opportunities for automatically discovering relationships among these data objects (also leading to concept-based searching) that need to be exploited in order to enrich interpretation of the collection. Advances in natural language processing and machine learning can lead to improvements in interpretation and access to these collections. The focus of this proposal is to incorporate such capabilities within a framework for managing digital humanities collections.

[1] We are working with AILLA on procedures for archiving the Jaqi collection.

2. The use of digital technologies

The principal research issue is to determine how to organize a humanities collection in a way that preserves the collection while making the knowledge that it contains easily accessible. This proposal addresses how a collection can be structured in a new and interesting way. The general approach will be to use new ontology-based technologies for organizing the collection. The Lyra system developed by Dr. Howard Beck, proposed fellow at UF, codifies information and knowledge about a subject field and provides integrated access to the materials (Beck, 2008). The specific emphasis in this proposal is on data visualization tools and machine learning methodologies that can 1) discover new data structures within the collection, and 2) show how those structures can assist in analyzing language and in organizing and accessing digital humanities collections.

Dr. Beck has a long history of interdisciplinary research, with emphasis on computer science and digital collections. His interest in the humanities began during an undergraduate program at the University of Illinois that combined a dual major in electrical engineering and philosophy. The electrical engineering program explored cybernetics and biological computations. Dr. Beck has always had a strong interest in applying this work to environmental problems, leading ultimately to his faculty position in the Agricultural and Biological Engineering Department, involving current work on soil water and nutrient modeling (Beck et al., 2008) and virtual environments for visualizing forest ecosystems (Beck, 2009). His philosophy studies have been in the area of theories of knowledge and philosophy of language. His work in computer science has been to combine artificial intelligence and database management in order to build a platform for organizing digital collections in many interdisciplinary domains, leading to development of the Lyra ontology management system described in this proposal. Recent work in the humanities is highlighted by his involvement in the Jaqi language project (Beck et al., 2007), currently funded by NSF, in which Lyra is used to organize and archive linguistic data structures for several languages. Dr. Beck is now actively exploring the application of this work as a general framework for organizing humanities collections.

The technology is based on ontologies that organize information as categories that reflect current psychological, philosophical, and computational theories of category formation (Wittgenstein, 1958; Ziff, 1960; Gopnik and Meltzoff, 1997; Cimiano et al., 2004). A particular collection is structured using ontology objects (a taxonomy of classes, instances, and relationships). This permits a more accurate modeling of the complex data structures needed to represent humanities knowledge, including linguistics. Ontology-based systems contrast with traditional relational database systems, in which information is organized as tables, with no attempt to model categories and with little ability to deal with structured information. Lyra is an ontology management system that maintains many of the advantages of traditional database systems, such as efficient management of large amounts of data, query processing, security, and integrity maintenance, while adding the richer modeling capabilities of ontologies.
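The contrast between ontology objects and flat tables can be made concrete with a small sketch. The code below is an illustration only and does not reflect Lyra's actual data model or API; the class names (`OntClass`, `Instance`) and the sample objects are hypothetical. It shows the three kinds of ontology object named above: classes arranged in a taxonomy, instances of those classes, and named relationships between instances.

```python
from dataclasses import dataclass, field


@dataclass
class OntClass:
    """A category in the taxonomy (hypothetical structure, not Lyra's)."""
    name: str
    parent: "OntClass | None" = None
    properties: dict = field(default_factory=dict)

    def is_a(self, other):
        # Walk up the taxonomy until we meet `other` or run out of parents.
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def lookup(self, prop):
        # Properties are inherited: the nearest definition wins.
        node = self
        while node is not None:
            if prop in node.properties:
                return node.properties[prop]
            node = node.parent
        return None


@dataclass
class Instance:
    """An individual object belonging to a class."""
    name: str
    cls: OntClass
    relations: list = field(default_factory=list)  # (relation name, target instance)

    def relate(self, relation, target):
        self.relations.append((relation, target))


# Invented sample objects for illustration.
resource = OntClass("CulturalResource", properties={"archived": True})
recording = OntClass("SoundRecording", parent=resource, properties={"medium": "audio"})
word = OntClass("Word")

tape = Instance("field_tape_01", recording)
entry = Instance("lexical_entry_01", word)
entry.relate("attested_in", tape)
```

A relational table could store the same rows, but the category structure (a sound recording is a kind of cultural resource and inherits its properties) would have to be reconstructed by the application rather than being part of the model.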

Ontologies model concepts and relationships in a domain. Concepts in any domain are not simple entities; a complex category structure is required to adequately represent even basic concepts. The connection between language and cognitive categories is also integral to this approach, and likewise language has a central role in humanities collections. Ontology categories can incorporate raw field data and cultural resources (notebooks, sound recordings, images), linguistic knowledge (words, grammars, concept semantics), expert knowledge, and interpretive information. Finally, automated ontology reasoners can compare and contrast concepts, leading to self-organized concept clusters and concept-based query processing.

Lyra is an open-source software system. In addition, all data stored in the Lyra management system are exportable in XML format. To the extent possible, these data are published in standards that comply with existing language archiving techniques. Project collaborators are active participants in language archiving communities and workshops (Thieberger et al., 2007), in order to contribute to and keep informed of developing standards and to ensure interoperability with other systems.

Specific machine learning techniques

A main focus of this proposal is to incorporate machine learning tools within the ontology management system. This will facilitate discovery of new relationships within and across humanities collections, and form the basis for providing access to knowledge in the collection. The machine learning techniques examine linguistic and conceptual data structures. Linguistic data structures at various levels (morphological, grammatical, semantic, and discourse) will be studied. Cross-collection studies, in particular studies of similarities and differences among the three Jaqi languages, will be particularly interesting. Finally, access to information in the collection at all levels (from original field notebooks and sound recordings to expert knowledge) will be enhanced through query processing techniques based on ontology reasoners, which are an integral part of machine learning. A collaborative relationship has been established with the University of Illinois Urbana-Champaign (UIUC), Department of Linguistics (Dr. Corina Girju) to co-develop the machine learning techniques for integration into the humanities framework.

Analyze and discover morpheme behavioral rules

Morphemes are the smallest units of meaningful utterance, and are at the base of a language processing system. They play a crucial role in conveying speaker intention, social context, and of course the inflectional and derivational information that forms the basis of the grammar of the languages, and thus the obligatory perceptual categories. In the Jaqi languages, morphemes carry most of the syntactic information, such that word order is optional at the syntactic level and is obligatory only in a few phrase structures (such as N+N, making the first N an adjective). Study of morphology in these languages can lead to valuable insights that can be applied to other languages in other collections.

The basic approach to studying morpheme behavior is to observe individual morphemes over a range of occurrences. This can be done by extracting a concordance for each base morpheme, and studying how and when that base morpheme is transformed into allomorphs. The highly annotated nature of the Jaqi ontology facilitates such an analysis. Rules for morpheme behavior can be induced by looking at the influence of neighboring morphemes, and at the influence exerted by other words within the overall syntactic structure of the phrase.
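The concordance step can be sketched in a few lines. This is a minimal illustration under stated assumptions: words are assumed to be already segmented into (base morpheme, surface form) pairs, which the annotated ontology is said to support, and every form in the sample corpus is invented rather than taken from the Jaqi data. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def concordance(corpus, base, window=1):
    """Collect every occurrence of `base` with its neighboring surface forms.

    corpus: list of words; each word is a list of (base_form, surface_form) pairs.
    """
    hits = []
    for word in corpus:
        for i, (b, surface) in enumerate(word):
            if b == base:
                left = tuple(s for _, s in word[max(0, i - window):i])
                right = tuple(s for _, s in word[i + 1:i + 1 + window])
                hits.append((left, surface, right))
    return hits


def allomorphs_by_right_neighbor(hits):
    """Tally which surface form appears before each following morpheme."""
    table = defaultdict(Counter)
    for _left, surface, right in hits:
        neighbor = right[0] if right else "#"  # "#" marks the word boundary
        table[neighbor][surface] += 1
    return table


# Invented forms for illustration only.
corpus = [
    [("ROOT1", "nima"), ("SFX_A", "ka"), ("SFX_B", "mo")],
    [("ROOT2", "tolu"), ("SFX_A", "k"), ("SFX_C", "su")],
    [("ROOT1", "nima"), ("SFX_A", "ka")],
    [("ROOT3", "peri"), ("SFX_A", "k"), ("SFX_C", "su")],
]
hits = concordance(corpus, "SFX_A")
table = allomorphs_by_right_neighbor(hits)
```

In this toy corpus the tally shows the short form `k` only before `su`, which is the kind of regularity a candidate rule would be built from. A real analysis would also condition on the left neighbor and on the wider phrase, as the paragraph above notes.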

Learn grammar for parsing phrases and sentences

Relatively good parsers for automatic analysis of syntax are available for a wide variety of languages. For example, the Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) has been integrated into the Lyra OMS, and includes dictionaries and general grammars for English, Chinese, Arabic, and German. However, it has limitations that can only be overcome by further refinement of the grammar rules and phrase patterns needed to drive the parsing process. The research objective here is to use the corpus of the humanities collection to improve parsing by automatically discovering new grammatical relationships within the corpus.

For example, low-level phrase patterns can be induced on subsets of the languages. The Appendix shows a study of a set of phrases from the Aymara collection. These phrases are used in a training exercise, and illustrate a general pattern with variations. The general pattern was induced by studying similarities and differences among the phrases in the set. The induced pattern incorporates low-level morphological and semantic influences, and thus can do a better job of parsing than systems based only on abstract grammars.
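One simple way to induce a general pattern with variations from a set of phrases is to align the phrases position by position, keep the tokens that never change, and turn the positions that do change into slots. The sketch below is an assumption about how such induction could work, not the procedure used for the Appendix study; it handles only equal-length token sequences and ignores the morphological and semantic influences mentioned above. The English phrases are invented stand-ins.

```python
def induce_pattern(phrases):
    """Generalize equal-length token sequences into constants and slots.

    Returns a list in which a str is a token shared by all phrases and a set
    is a slot holding every token observed at that position.
    """
    if len({len(p) for p in phrases}) != 1:
        raise ValueError("this sketch only handles phrases of equal length")
    pattern = []
    for column in zip(*phrases):
        values = set(column)
        pattern.append(column[0] if len(values) == 1 else values)
    return pattern


def matches(pattern, phrase):
    """Check whether a phrase fits the induced pattern."""
    return len(pattern) == len(phrase) and all(
        tok == p if isinstance(p, str) else tok in p
        for p, tok in zip(pattern, phrase)
    )


# Invented training phrases for illustration.
training = [
    ["the", "doll", "is", "here"],
    ["the", "house", "is", "here"],
    ["the", "doll", "is", "there"],
]
pattern = induce_pattern(training)
```

The induced pattern accepts combinations that never occurred in the training set (for example "the house is there"), which is the sense in which it generalizes beyond the phrases it was built from.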

Discover ontology concepts automatically

A variety of techniques can be applied to topic analysis and ontology construction. The goal is to discover concepts and relationships among concepts, which may be expressed by a word, by different words, or by phrases appearing in the corpus. By building on syntactic analysis of arguments (subject/object) and modifiers (adjectives, adverbs), the roles of concepts can be identified. For example, in the Baldwin digital collection, by studying a concordance for "doll", it was possible to identify features such as what sorts of things a doll can possess (clothes, house), who can own dolls, and what dolls can do, based on evidence from the corpus (see example in Appendix).
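A keyword-in-context concordance of this kind can be approximated as follows. This is a rough sketch, not the analysis reported in the Appendix: the sample text is invented, the function names are hypothetical, and possession is detected naively by taking the word that immediately follows the possessive form, where a real system would rely on the syntactic parse.

```python
import re
from collections import Counter


def kwic(text, keyword, width=4):
    """Return (left context, hit, right context) rows for a keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    forms = (keyword, keyword + "s", keyword + "'s")
    rows = []
    for i, tok in enumerate(tokens):
        if tok in forms:
            rows.append((tokens[max(0, i - width):i], tok, tokens[i + 1:i + 1 + width]))
    return rows


def possessions(rows, keyword):
    """Count the word directly after the possessive form (a naive heuristic)."""
    counts = Counter()
    for _left, tok, right in rows:
        if tok == keyword + "'s" and right:
            counts[right[0]] += 1
    return counts


# Invented sample text for illustration.
sample = (
    "Mary dressed her doll. The doll's clothes were new. "
    "She put the doll's house by the window. The doll smiled."
)
rows = kwic(sample, "doll")
owned = possessions(rows, "doll")
```

Here `owned` records "clothes" and "house" as things a doll can possess. The same rows could be scanned on the left for owners and on the right for verbs to suggest what dolls can do.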

Study cross-language similarities and differences

All of the preceding analyses can be run on subsets of a corpus, or similarly on corpora from different collections, in order to make comparisons across subsets. One example dimension is time. The Baldwin collection spans nearly two centuries of children's literature, and can be segmented by decades (or any other suitable division) to study how language changes or how approaches to issues such as morality or gender roles change over time. In the Jaqi collection, three sister languages are related by geography and history, and similar linguistic data structures appear in each of the languages. Studying similarities and differences among these structures leads to understanding the origin and evolution of the languages and their diversification over geographic boundaries.
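Segmenting a corpus by decade and comparing a term's relative frequency across segments might look like the sketch below. It assumes each document carries a publication year and plain text; the function names and the three tiny sample documents are invented for illustration and say nothing about the actual Baldwin texts.

```python
import re
from collections import Counter


def decade(year):
    """Map a year to the first year of its decade, e.g. 1853 -> 1850."""
    return year - year % 10


def relative_frequency_by_decade(documents, term):
    """Occurrences of `term` per 1,000 tokens, grouped by decade.

    documents: iterable of (publication_year, text) pairs.
    """
    hits, totals = Counter(), Counter()
    for year, text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        d = decade(year)
        totals[d] += len(tokens)
        hits[d] += tokens.count(term)
    return {d: 1000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}


# Invented sample documents for illustration.
docs = [
    (1853, "duty duty play"),
    (1858, "play play play"),
    (1902, "duty play play play"),
]
trend = relative_frequency_by_decade(docs, "duty")
```

Normalizing by the number of tokens in each decade matters because the segments will differ greatly in size; raw counts would mostly reflect how much text survives from each period.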

Simple measures of statistical frequency can give insights into which types of words and morphemes are shared among the languages and how often. It may also be beneficial to perform clustering here, to get concise lists of similarities and differences.
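As one example of a simple overlap measure, the Jaccard index compares two inventories by dividing the number of shared items by the number of distinct items overall. The sketch below applies it pairwise to three morpheme inventories. The choice of Jaccard is an assumption made for illustration (the proposal does not name a specific measure), and the language labels and morpheme identifiers are placeholders, not real Jaqi data.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared items divided by all distinct items; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(inventories):
    """Jaccard index for every pair of named inventories."""
    return {
        (x, y): jaccard(inventories[x], inventories[y])
        for x, y in combinations(sorted(inventories), 2)
    }


def shared_by_all(inventories):
    """Items present in every inventory."""
    sets = [set(v) for v in inventories.values()]
    return set.intersection(*sets) if sets else set()


# Placeholder inventories for illustration only.
inventories = {
    "lang_A": {"m1", "m2", "m3"},
    "lang_B": {"m2", "m3", "m4"},
    "lang_C": {"m3", "m5"},
}
overlap = pairwise_overlap(inventories)
common = shared_by_all(inventories)
```

The resulting pairwise scores form a small similarity matrix, which is also the usual input to a clustering step when more than a handful of subsets are being compared.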

Develop concept-based search techniques to assist scholars in analysis of humanities collections

The analysis of humanities collections can be facilitated by concept-based searching. Currently humanities scholars have limited access to collections. Digital collections mainly utilize classic full-text search engines (currently Baldwin has such a search engine), which are limited to finding documents in which the user's search terms appear most frequently. These systems suffer from poor precision and recall, and of course users must manually examine each retrieved document to determine if it is relevant to the problem being studied.
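Precision and recall have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with invented document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents returned, three actually relevant.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
```

In this example half of what was returned is relevant (precision 0.5) and one relevant document was missed (recall 2/3). Low precision is what forces scholars to read through irrelevant hits, and low recall means relevant material is never seen at all.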

The framework presented here has the potential for concept-based searching that attempts to understand, in some sense, both the content of the collection and the user's search interest or queries, resulting in better search and retrieval. The process of searching should be viewed as the same problem as automatically parsing and extracting information from the text of the collection. Ultimately users could express queries in natural language, such as "find examples of sentient dolls," or "find examples in Aymara where zero-complement-verb patterns occur," or "discover which nouns go with which shape verbs." These could be parsed and analyzed in the same way as the text in the collection. Concept matching techniques would match the user's query with ontology content to access information in a way that is much more precise than classic full-text search.
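The difference between keyword matching and concept matching can be illustrated with a toy query expansion. The sketch below is not Lyra's query processor: it assumes a hypothetical mapping from each concept to its narrower concepts and to the words that express it, and all concept names and documents are invented. A query for a concept is expanded to every term beneath it in the taxonomy before the documents are scanned.

```python
def expand(concept, narrower, labels):
    """Collect the surface terms for a concept and everything beneath it."""
    terms, stack, seen = set(), [concept], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # A concept with no listed labels is expressed by its own name.
        terms |= labels.get(current, {current})
        stack.extend(narrower.get(current, ()))
    return terms


def concept_search(concept, documents, narrower, labels):
    """Return ids of documents containing any term that expresses the concept."""
    terms = expand(concept, narrower, labels)
    return sorted(
        doc_id for doc_id, text in documents.items()
        if terms & set(text.lower().split())
    )


# Invented taxonomy fragment and documents for illustration.
narrower = {"toy": ["doll", "hoop"]}
labels = {"doll": {"doll", "dolly"}}
documents = {
    "d1": "the dolly sat still",
    "d2": "a hoop rolled away",
    "d3": "the kettle sang",
}
found = concept_search("toy", documents, narrower, labels)
```

A plain full-text search for "toy" would return nothing here, because the word never appears, while the concept search returns the two documents that mention kinds of toys.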

The Pellet reasoner (Pellet, 2008) has been implemented within the Lyra framework to provide automatic classification of concepts. This deductive reasoning technique can automatically classify new objects within an existing (static) taxonomy. It can also be used for query processing. This proposal will extend the capabilities of reasoning to include inductive generation of new classes through conceptual clustering (Cimiano et al., 2004), in order to dynamically expand the taxonomy.
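Deductive classification into a static taxonomy can be pictured with a much simplified stand-in for a description logic reasoner. The sketch below does not use Pellet or its API and covers only a tiny fragment of what such a reasoner does: each class is defined by a set of required features, and a new object is placed in the most specific classes whose requirements it meets. The class definitions and feature names are invented.

```python
def classify(features, definitions):
    """Return the most specific classes whose required features are all present.

    definitions: {class_name: frozenset of required features}
    """
    features = set(features)
    satisfied = {name for name, required in definitions.items() if required <= features}
    # A class is dropped when another satisfied class strictly extends its definition.
    return sorted(
        c for c in satisfied
        if not any(definitions[c] < definitions[other] for other in satisfied)
    )


# Invented class definitions for illustration.
definitions = {
    "Toy": frozenset({"plaything"}),
    "Doll": frozenset({"plaything", "humanlike"}),
    "SentientDoll": frozenset({"plaything", "humanlike", "speaks"}),
}
talking = classify({"plaything", "humanlike", "speaks", "wooden"}, definitions)
silent = classify({"plaything", "humanlike"}, definitions)
```

The taxonomy itself never changes in this procedure. The inductive extension proposed above would go further by creating new classes when clusters of objects share features that no existing definition captures.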

Specific Examples

The techniques described above will be applied to the following digital collections and the particular problems they present (these will be the deliverables):

Example: Tools for visualization and organization of the Jaqi collection

Working closely with scholars Dr. M. J. Hardman and Dr. Sue Legg at the UF Center for Latin American Studies (with assistance from Dimas Bautista Iturrizaga, a Jaqi language expert who has worked closely on the Jaqi project), the task will be to make extensions to the existing Jaqi system. These extensions will 1) enhance on-line visualization tools to facilitate collaboration on database development by language experts, and 2) implement machine learning techniques, such as those described above, to automate organization of the database.

Collaboration tools will expand the authoring system into a more complete "wiki" environment so that local language experts in Peru and Bolivia can make contributions to the database. Up until now, only a few experts have been given access to these tools. Involving more people will require better security, so that submitted material will go through a process of review and evaluation prior to publishing in the database.

Tools for automatically organizing the collection are badly needed, as this has been a

manual process so far. Automatic parsing will assist experts in analysis of syntax at the phrase

level, as well as the morphological level (the Jaqi languages are heavily influenced by a

complex system of suffixes). Cross-language comparisons are also an important area of study,

as the three sister languages in the Jaqi family contain important similarities and differences that

can be discovered through such an analysis.

Example: Dissertation index on African Studies

In this example, the problem is to identify which among the 1,000+ theses and dissertations,

produced each year by graduate and professional students at UF, are relevant to African

Studies. The Center for African Studies (CAS) at UF employs the results and analysis of the

index as an essential element in its understanding of campus research activities relating to

Africa, reporting and communicating developments internally to the College of Liberal Arts and

Sciences (CLAS), as well as to similar academic programs nationally and internationally, and to

the US Department of Education, under whose Title VI program CAS has held National Resource

Center status for African Area Studies continuously since 1981.

Electronic dissertations and theses (EDTs submitted in digital formats) are archived

online by the Florida Center for Library Automation (FCLA), while the UF Digital Library Center

(DLC) also scans and archives paper theses and dissertations in the UF Institutional Repository.

Indexing has been done manually since 1995 by Dr. Daniel Reboussin, assistant librarian for

the Africana Collection & Anthropology. Dr. Beck will collaborate closely with Dr. Reboussin to

develop a system to automate this process. This is a classic problem of categorization. It will

require an ontology describing domain concepts including super-national regions (North Africa,

East Africa, etc.), country names (with historical antecedents such as colonial names), regional

names, cities (many of which have been renamed since Independence) and, possibly, city







neighborhoods (as with township names in South Africa, for example: Sharpeville, Soweto, etc.)

as well as languages, dialects, ethnic groups (and possibly flora and fauna endemic to African

areas) along with other aspects of "things relating to Africa." Natural language techniques will

extract concepts from dissertation abstracts (and the full text if available) for classification within

the Africa ontology. Dr. Reboussin already has a preliminary index of relevant terms. This will

be formalized into an ontology, using tools that can be provided directly to Dr. Reboussin.
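
The categorization step described above can be sketched as a lookup of ontology terms in an
abstract. This is a minimal sketch, assuming a tiny hand-built ontology fragment; the
dictionary layout, the function names, and the sample entries are hypothetical and stand in
for the index of terms that would come from Dr. Reboussin.

```python
import re

# Hypothetical ontology fragment: each concept lists its surface forms,
# including historical or colonial names.
AFRICA_ONTOLOGY = {
    "Zimbabwe": {"type": "country",
                 "names": ["Zimbabwe", "Rhodesia", "Southern Rhodesia"]},
    "Soweto": {"type": "township", "names": ["Soweto"]},
    "Swahili": {"type": "language", "names": ["Swahili", "Kiswahili"]},
}


def build_matcher(ontology):
    """Compile one regex over all surface forms and map each form to its concept."""
    lookup = {}
    for concept, entry in ontology.items():
        for name in entry["names"]:
            lookup[name.lower()] = concept
    # Longest names first so that multi-word names win over their substrings.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def extract_concepts(text, ontology):
    """Return the sorted set of ontology concepts mentioned in the text."""
    pattern, lookup = build_matcher(ontology)
    return sorted({lookup[m.group(1).lower()] for m in pattern.finditer(text)})


def is_africa_related(text, ontology, min_concepts=1):
    """Flag a document when it mentions at least min_concepts ontology concepts."""
    return len(extract_concepts(text, ontology)) >= min_concepts


abstract = "A study of maize markets in Southern Rhodesia and of Kiswahili radio."
print(extract_concepts(abstract, AFRICA_ONTOLOGY))  # ['Swahili', 'Zimbabwe']
```

Plain string matching of this kind cannot disambiguate terms; the natural language
techniques proposed above are meant to address exactly that limitation.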

Example: Concept analysis in the Baldwin Collection of Children's Literature

Some preliminary studies on the Baldwin digital collection are presented in the Appendix. This

will be explored further, with the goal of studying individual concepts appearing in the text

corpus of the Baldwin collection. Dr. Beck will work with Rita Smith, Curator of the Baldwin

Collection, to identify problems of interest to humanities scholars currently studying the Baldwin.

The text corpus of all 5,785 digitized volumes is 1GB in size. The basic analysis is to study

individual words in context. For a given word (e.g. "good", or "doll") a concordance is created by

locating every phrase where that word appears in the corpus. Next, syntactic and semantic

analyses are applied to each phrase. The syntactic analysis results in a parse tree, and the

semantic analysis results in a structured representation of the meaning of the word as used in

the phrase. Conceptual clustering can then be used to categorize the individual cases in the

concordance, and produce groupings in which similar usages occur in the same category. This

tells scholars how the word is being used. It can also be applied across time (the collection can

be segmented into time-periods) to show how word use changes over time.
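
The first step of this analysis, building a concordance and segmenting it by time period,
might look like the sketch below. It is illustrative only: the sample sentences are invented,
the tokenizer is a simple regular expression, and the function names and the 25-year period
are assumptions. The parsing, semantic analysis, and clustering steps are not shown.

```python
import re
from collections import defaultdict


def concordance(text, keyword, width=4):
    """Return (left context, keyword, right context) for each use of keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def concordance_by_period(volumes, keyword, period=25, width=4):
    """Group concordance lines by period; volumes is an iterable of (year, text)."""
    buckets = defaultdict(list)
    for year, text in volumes:
        start = year - year % period  # first year of the period containing `year`
        buckets[start].extend(concordance(text, keyword, width))
    return dict(buckets)


# Invented sample data standing in for digitized volumes.
VOLUMES = [
    (1855, "The good child obeyed her mother. A good doll never cries."),
    (1898, "It was a good game, and the doll won."),
]

for start, lines in sorted(concordance_by_period(VOLUMES, "good", width=1).items()):
    print(start, lines)
```

Each concordance line would then be parsed and clustered, so that the groupings within each
period can be compared across periods.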

3. History, scope, and duration of the project

As a system, Lyra has been evolving since the early 1990s, as part of an interdisciplinary

research and development effort, initially within the UF Institute of Food and Agricultural

Sciences (IFAS). Lyra has been used successfully to construct a variety of applications

including EDIS (Extension Digital Information System), a digital library of over 10,000

agricultural extension publications, FAWN (Florida Automated Weather Network), DISC







(Decision Information Systems for Citrus), Northern and Southern Trees, SPDN (Southern Plant

Diagnostics Network), CBC (Crop Biosecurity Curriculum), the Conserve Florida Water

Conservation Clearinghouse, and the NUMAPS water and nutrient management system. It has

also been used as the basis for the ExtensionU eLearning platform.

Lyra was built with funding from a variety of sources, including the US Department of

Agriculture, NSF, and the Florida Department of Environmental Protection. Internal funding and

support for the system are maintained by IFAS. As Lyra is an integral part of IFAS Information

Technologies, its continued support is assured for the foreseeable future.

In 2004, the Aymara on the Internet project, funded by the US Department of Education,

was created both as an eLearning platform and as an archive of linguistic knowledge about the

Aymara language. This project was expanded to cover the Jaqi languages, including Aymara,

Jaqaru, and Kawki, with funding for endangered language preservation provided by NSF. Lyra

has provided support for all these activities, including the database archive of linguistic

structures (including morphology, grammar, phrases, dialogs, and associated cultural

resources), and the database is used to automatically generate eLearning web sites. Lyra

authoring tools are used directly by language experts to build the database.

Overall, web sites supported by Lyra receive nearly 2 million visitors per month.

Visitors to the Jaqi project (based on statistics on the UF Digital Library collection) number on the

order of 100 per month (web site information is being expanded and not yet widely publicized).

The Baldwin Library of Historical Children's Literature is part of the Department of

Special and Area Studies Collections at UF and contains approximately 103,000 books

published for children in the US and Great Britain from 1656 through 2009. The Baldwin Digital

Collection contains materials from 1850 through 1904. It began in 2000 with under 600 titles

and now features 5,792. Within this collection, there are 185 editions of Robinson Crusoe, 28

editions of Pilgrim's Progress, 225 items with "alphabet" in the full citation (which includes

subject keywords), 129 items with "biography," 311 with "fairy tales," 273 with "tract," and 48 with







"American Sunday School Union." The Library is of international significance and supports

research in areas such as education and upbringing; family and gender roles; civic values;

racial, religious, and moral attitudes; literary style and format; textual criticism; and the arts of

illustration and book design. Scholars, students, and researchers from UF and worldwide

heavily use this collection. Baldwin has an average of 100,000 users each month so far for

2009. UFDC as a whole has an average of 400,000 users per month. These statistics are

cleaned to remove robots and other non-human accesses, and the user counts reflect only usage of

collection items, excluding users who only browse or only search without clicking on a result item.

4. Collaboration of fellow, center, and staff

If this proposal is awarded, Dr. Beck will formally request a one year sabbatical (concept has

been approved by the department chair through submission of this proposal) to work full time on

this project. This will leave a salary savings from his department position that will be used to

hire an additional graduate research assistant (GRA) (one GRA is proposed in this NEH budget)

to assist Dr. Beck with this project. Dr. Beck will work directly with these two GRAs to conduct

the research needed on this project. Together, as a team, they will collaborate with DLC center

staff at several levels-- directly with the humanities and libraries scholars involved in the

example applications to be addressed, coordinated around DLC collections; and with the DLC

technical staff involved in all matters related to computer databases and other software. Finally,

Dr. Beck will be involved in presenting seminars and other interactions with the entire DLC staff

to keep the Center apprised of the ongoing research efforts.

Dr. Beck will continue developing relationships with the Department of Linguistics at

UIUC in order to collaborate on integration of natural language processing and machine learning

techniques into the humanities collection framework. Meetings with UIUC faculty during the

course of the year are planned.









5. Plan of work

The purpose of this grant would be to enhance existing work, and where needed develop new

components. These tasks will be performed mostly in parallel. Dr. Beck, supported by two GRAs

along with DLC staff, will collaborate to:

* Continue integration of machine learning techniques into Lyra, including morphological

analyzers, parsers, concept induction algorithms, and concept searching engines.

* Enhance authoring tools for on-line development and data visualization for the Jaqi project.

* Incorporate tools for automatically organizing the Jaqi collection. Examine cross-language

analysis opportunities by looking for similarities and differences between Aymara, Kawki,

and Jaqaru syntactic structures.

* Develop automatic classification system for dissertations on African Studies.

* Further develop concept clustering techniques for analysis of the Baldwin corpus.

* Share results within the DLC and the UF Libraries through seminars and interactions with

humanities scholars, DLC technical staff, and Libraries staff as a whole. Publish results in

journals and conference proceedings.

6. Final product and dissemination

The project will be disseminated through various media (presentation of articles, participation in

professional meetings, electronic media). In particular, presentations will be made at the annual

conferences of the School of Oriental and African Studies (SOAS), London, at E-MELD

workshops, and at Universities in Peru and Bolivia. In the case of digital products, provisions

will be made for their long-term maintenance and interoperability with other resources through

the support of the Digital Libraries of UF and UIUC.








Description of the host digital humanities center

The mission of the University of Florida Digital Library Center is to provide a forward-thinking

framework for expanding the UF Libraries in the information age. To meet current and future

needs, the Digital Library Center advances collaborative interdisciplinary research by creating

digital content; implementing and integrating multiple interoperable standards to ensure optimal

access and preservation; and developing additional tools to enhance digital content and extend research

possibilities.

The Digital Library Center facilitates and focuses the Libraries' development and

integration of digital programs and services within and extending from the University of

Florida. The Digital Library Center was established in 1999 to support ongoing research into

preserving and enhancing access to materials. Given the University of Florida's role as the

primary preservation partner for Florida and the Caribbean and the incredible need in the

region, the Digital Library Center quickly expanded from the exploration of digitization for

preservation into a large-scale digitization facility.

To meet the ongoing preservation and access needs, the Digital Library Center also

provides the infrastructural base for many collaborative projects through the UF Digital

Collections (UFDC) System. Because of the infrastructure costs for digital preservation and

online open access, the Digital Library Center leveraged the robust infrastructure of the UFDC

System to support all internal and collaborative projects.

The UFDC System features a robust standards-reliant infrastructure that allows for the

automatic translation among multiple metadata standards (MODS/METS, MARC, DC) for

maximized interoperability and allows for customized interfaces and views depending on the

institution contributing the materials, the collection or project, and the material type. The Digital

Library Center provides technical support and training for all partners to digitize their materials,








and the Digital Library Center continues to digitize materials as the primary contributor to all of

the collections in and hosted by the UFDC System, adding unique materials regularly.

Because digital content and collections are incomplete without context, the Digital

Library Center undertakes collaborative scholarly research initiatives to create the necessary

contextual supports through interdisciplinary research. One example of this is the Ephemeral

Cities project. The Ephemeral Cities project allows users to browse through cities spatially,

showing one new method for accessing materials in relation to each other geographically and

in relation to the cities themselves. Allowing users to see and use materials in new ways

creates new information, new types of information, and new avenues for research.

The Digital Library Center's core areas of scholarly focus and digital expertise thus lie within:

* Historical children's literature, as found in the Baldwin Library collection. These materials

are supported in coordination with the Baldwin Library curator, Rita Smith, and the faculty in

children's literature studies, including Kenneth Kidd and John Cech.

* Materials from and about the Caribbean and Latin America, as digitized by the University of

Florida and partners for inclusion in the Digital Library of the Caribbean. These materials

are supported in coordination with Richard Phillips, the Curator for the Latin American

Collection in the UF Libraries, faculty in the Center for Latin American Studies, and partners

across the Caribbean and Latin America.

* Florida Newspapers in the Florida Digital Newspaper Library, with over 804,000 pages of

historic through current newspapers. These materials are supported in coordination with

James Cusick, the Curator for the P.K. Yonge Library of Florida History, Carl Van Ness, the

Curator for the Manuscript Collections in the UF Libraries, and various others in the

Libraries and the UF teaching faculty.

* Technologies of digitization and digital collection creation, support, and extension, including

the infrastructure to support large multi-lingual and multi-national projects.








Given the core areas of scholarly focus and digital expertise, along with the mission to preserve,

make accessible, and enhance resources, the majority of projects in the Digital Library Center

are ongoing. For instance, the Digital Library of the Caribbean was funded through a grant from

the Department of Education for 2004-2009. While the grant will end by October, the project

was developed for an ongoing need and it continues to grow sustainably. The Ephemeral Cities

project also ended; however, current work has already begun to recreate the project using

Keyhole Markup Language (KML) so that the project can be explored using a Google Maps

interface instead of the ArcGIS interface, which is more difficult for the majority of users. Plans

are also underway to extend the Ephemeral Cities project using Encoded Archival Context

(EAC) to develop authorities for cities, places, and people, and then allow users to explore the

historical world through digitized documents and through the people populating the world.

The Center's projects and mission require faculty collaborations both within and beyond

the Libraries and University of Florida. Partners include the Florida Museum of Natural History,

the Matheson Historical Trust, Florida's State Library and Archives, other state university

libraries and university libraries across Florida, and university and special libraries beyond

Florida. The Digital Library Center has not yet hosted a research fellow; however, the Digital

Library Center has supported visiting scholars and fellows hosted by the Baldwin Library, the

P.K. Yonge Library, the Center for Latin American Studies, and other units in the Libraries and

other colleges. As a unit within the University of Florida Libraries, the Digital Library Center is

funded and supported by the UF Libraries. Past projects have been funded by: the Department

of Education; National Historical Publications and Records Commission; National Endowment

for the Humanities; Andrew W. Mellon Foundation; Institute of Museum and Library Services;

Florida Humanities Council; and Florida's Library Services and Technology Act program.








Statement of significance and impact:
The project, An Ontology Framework for Digital Humanities Collections, provides a
mutually beneficial collaborative opportunity for Dr. Howard Beck, the proposed fellow,
and the University of Florida (UF) Digital Library Center (DLC). Dr. Beck and a
graduate research assistant will partner with DLC staff and four humanities scholars
from UF, and one from the University of Illinois at Urbana-Champaign (UIUC) to
create a general framework for organizing, interpreting, and providing accessibility to
digitized humanities collections. The framework will be based on the Lyra ontology
management system, developed by Dr. Beck at UF. In this proposal, Lyra will be
expanded in several directions, including further development of data visualization tools
and enhancements in machine learning capabilities for automatically organizing and
searching humanities collections to support 1) cross-language comparison, and 2)
structural ontology development.

Initially, this project will utilize Jaqi language materials stored within the UF Digital
Collections (UFDC) system. The UFDC System also stores the Baldwin Library of
Historical Children's Literature Digital Collections with over 849,000 pages of children's
literature, and the Africana Collections with archival resources as well as theses and
dissertations. Because of the robust nature of both the UFDC System and the Lyra
ontology management system, the work to integrate and extend the two will be applied
to the Jaqi Collection, the Baldwin, and Africana Collections during the grant period. In
doing so, all of the resulting semantic analysis results will be used to enhance the
UFDC System by defining and implementing hierarchical and ontological relationships
in the concepts found in the materials and collections. These hierarchical and
ontological relationships will be used to create paratextual materials (indices, thesauri,
dictionaries) that will be accessible to users and that will be integrated into the search
indexing and ranking system to improve search results based on the newly created
conceptual mappings.

Dr. Howard Beck, a professor in the Agricultural and Biological Engineering Department, is an
expert in information systems that integrate artificial intelligence techniques with
database management. His interdisciplinary expertise has been sought by several
colleges at UF, including extensive humanities work in linguistics with the Center for
Latin American Studies. He has been working closely with the UF Digital Library
Center on many of these projects. The fellowship will allow him to focus on research
and development efforts in close collaboration with experts working on collections
managed by the DLC, to interact directly with DLC technical staff, and to share
research results with UF and UIUC colleagues and the broader humanities
community.











Bibliography

Badal, R., S. Kim, J. Owens and H. Beck. 2006. An Integrated Database Approach for Managing
Educational Resources in Agricultural and Biological Engineering. IJEE 22(6): 1210-1218.

Baldridge, J., S. Chatterjee, A. Palmer, and B. Wing. 2007. VisCCG: Wiki and Programming Paradigms for
Grammar Engineering. Grammar Engineering across Frameworks 2007. Stanford University.

Beck, H. 2006. The role of ontologies in E-Learning. Educational Technology 46(1):32-39.

Beck, H. 2008. Evolution of Database Designs for Knowledge Management in Agriculture and Natural
Resources. Journal of Information Technology in Agriculture. 3(1). 23 pages.

Beck, H. 2009. Lyra Virtual World Environment (Lyra VWE): Educational Applications in Agriculture and
Natural Resources. World Congress on Computers in Agriculture. Reno, NV.

Beck, H. W., S. Legg, E. Lowe, and M. Hardman. 2007. Aymara on the Internet A Step Towards
Interoperability and User Access. In Austin P., O. Bond, and D. Nathan (eds.) Proceedings of Conference
on Language Documentation & Linguistics Theory. SOAS, London.

Beck, H., K. Morgan, Y. Jung, J. Wu, S. Grunwald, and H. Kwon. 2008. Ontology-based Simulation
Applied to Soil Water and Nutrient Management. In Papajorgji, P. ed. Advances in Modeling Agricultural
Systems. Springer.

Cimiano, P., A. Hotho and S. Staab. 2004. Comparing Conceptual, Divisive and Agglomerative Clustering
for Learning Taxonomies from Text. In Proceedings ECAI 2004. R. López de Mántaras and L. Saitta (Eds).
IOS Press. pp. 435-444.

Gopnik, A. and A. Meltzoff. 1997. Words, Thoughts, and Theories. The MIT Press. Cambridge, MA.

Pellet. 2008. http://pellet.owldl.com.

Thieberger, N., E. Hinrichs, M. Cysouw, H. Sloetjes, H. Yi, L. Veselinova, D.T. Langendoen, H. Beck, and
D. Anderson. 2007. Report from TILR working group 6: Standards and Data Models. In E-MELD
Workshop: Toward the Interoperability of Language Resources. http://emeld.mseag.org/wiki

University of Florida Digital Collections. 2009. Jaqi Collection (http://www.uflib.ufl.edu/ufdc/?s=avmara).

W3 Consortium. 2004. Web Ontology Language. http://www.w3.org/2004/OWL

Wittgenstein, Ludwig. 1958. The Blue and Brown Books: Preliminary Studies for the Philosophical
Investigations. Harper. New York.

Ziemba, Lukasz, Camilo Cornejo and Howard W. Beck. 2009. A Water Conservation Digital Library Using
Ontologies. Proceedings 3rd International Conference on Metadata and Semantics Research. Milan, Italy.


Ziff, Paul. 1960. Semantic Analysis. Cornell University Press. Ithaca. 255pp.














National Endowment for the Humanities
Budget Form
Project Title: An Ontology Framework for Digital Humanities Collection

Section A-Year #1
Budget detail for the period from: June 1, 2010 through May 30, 2011


1. Salaries & Wages

Name/Title of Position                       Computation Method     NEH Funds (a)   Cost Share (b)   Total (c)
Taylor, Laurie (Asst Univ Librarian)         5% of $55,045          $      -        $  2,752         $   2,752
Benson, Dina (Lib Assoc 2)                   5% of $36,069          $      -        $  1,803         $   1,803
Beck, Howard (Professor of Ag & Bio Engin)                          $ 50,400        $ 33,021         $  83,421
TBA (Graduate Research Asst)                 9 months @ $12.63/hr   $ 19,700        $      -         $  19,700
Subtotal Salaries & Wages                                           $ 70,100        $ 37,577         $ 107,677


2. Fringe Benefits

Rate      Salary Base    NEH Funds (a)   Cost Share (b)   Total (c)
27.8%     $  2,752       $      -        $    765         $    765
33.1%     $  1,803       $      -        $    597         $    597
42.1%
11.6%     $ 19,700       $  2,285        $      -         $  2,285
0.5%
2.1%
Subtotal Fringe Benefits $  2,285        $  1,362         $  3,647


3. Consultant Fees

Name or Type of Consultant   No. of days in project   Daily rate of compensation   NEH Funds (a)   Cost Share (b)   Total (c)
SUB-TOTAL                                                                          $ -             $ -              $ -


4. Travel

From/To   Subsistence Costs + Transportation Costs =   NEH Funds (a)   Cost Share (b)   Total (c)
SUB-TOTAL                                              $ -             $ -              $ -


5. Supplies and Materials

Item   Computation Method   NEH Funds (a)   Cost Share (b)   Total (c)
SUB-TOTAL                   $ -             $ -              $ -


6. Services

SUB-TOTAL                   $ -             $ -              $ -


7. Other Costs

Item   Basis/Method of Cost Computation   NEH Funds (a)   Cost Share (b)   Total (c)
SUB-TOTAL                                 $ -             $ -              $ -


                            NEH Funds (a)   Cost Share (b)   Total (c)
8. Total Direct Costs       $ 72,385        $ 38,939         $ 111,324











9. Indirect Cost Computation

This budget item applies only to institutional applicants. If indirect costs are to be charged to this project,
CHECK THE APPROPRIATE BOX BELOW and provide the information requested. Refer to the budget
instructions for explanations of these options.


X Current indirect cost rate(s) has/have been negotiated with federal agency. (Complete items A and B.)

o Indirect cost proposal has been submitted to a federal agency, but not yet negotiated. (Indicate the
name of the agency in Item A and show proposed rate(s) and base(s) and the amount(s) of indirect
costs in item B.)

o Indirect cost proposal will be sent to NEH if application is funded. (Provide in Item B an estimate of
the rate that will be used and indicate the base against which it will be charged and the amount of
indirect costs.)

o Applicant chooses to use a rate not to exceed 10% of direct costs, less distorting items, up to a
maximum charge of $5,000 per year. (Under Item B, enter the proposed rate, the base against which
the rate will be charged, and the computation of indirect costs or $5,000 per year, whichever value is
less.)

o For Public Program projects only: Applicant is a sponsorship (umbrella) organization and chooses to
charge an administrative fee of 5% of total direct costs. (Complete Item B.)


Item A.   Name of federal agency: DHHS
          Date of agreement: 7/6/2006

Item B.   Rate(s)    Base(s)      NEH Funds (a)   Cost Share (b)   Total (c)
          33.60%     $ 72,385     $ 24,321        $      -         $  24,321

TOTAL INDIRECT PROJECT COSTS      $ 24,321        $      -         $  24,321

10. Total Project Costs           $ 96,707        $ 38,939         $ 135,646
(Direct and Indirect) for budget period










National Endowment for the Humanities
Summary Budget
Project Title: An Ontology Framework for Digital Humanities Collection


                            First Year    Second Year   TOTAL COSTS FOR ENTIRE GRANT PERIOD
1. Salaries & wages         $ 107,677     $ -           $ 107,677
2. Fringe benefits          $   3,647     $ -           $   3,647
3. Consultant fees          $       -     $ -           $       -
4. Travel                   $       -     $ -           $       -
5. Supplies & materials     $       -     $ -           $       -
6. Services                 $       -     $ -           $       -
7. Other costs              $       -     $ -           $       -
8. Total direct costs       $ 111,324     $ -           $ 111,324
9. Indirect costs           $  24,321     $ -           $  24,321
10. Total project costs     $ 135,646     $ -           $ 135,646


1. REQUESTED FROM NEH

Outright                    $  96,707
Federal Matching            $       -
TOTAL NEH FUNDING           $  96,707


2. COST SHARING

Applicant's contributions   $  38,939
Third-party contributions   $       -
Project income              $       -
Other federal agencies      $       -
TOTAL COST SHARING          $  38,939


3. TOTAL PROJECT FUNDING (Total NEH Funding + Total Cost Sharing): $ 135,646










George A. Smathers Libraries
Office of the Dean of University Libraries


535 Library West
PO Box 117000
Gainesville, FL 32611-7000
352-273-2505
352-392-7251 Fax
www.uflib.ufl.edu


September 11, 2009

Mr. Brian Prindle
Associate Director of Research
University of Florida
PO Box 115500
213 Grinter Hall
Gainesville, FL 32611

Dear Mr. Prindle,

The George A. Smathers Libraries, as the lead applicant for NEH Fellowship for Digital
Humanities Centers titled A Data Base Framework for Digital Humanities Collections, is
voluntarily contributing $5,917. The project will take one year to complete (June 1, 2010
through May 30, 2011) and the cost share will be allocated to staff time for supporting the work
of proposed fellow, Dr. Howard Beck. I agree to the cost share as outlined in the attached
budget.

Sincerely,


Judith C. Russell
Dean of University Libraries


The Foundation for The Gator Nation
An Equal Opportunity Institution


UF | UNIVERSITY of FLORIDA


Institute of Food and Agricultural Sciences
IFAS Sponsored Programs
McCarty Hall
Gainesville, FL 32611-0110


Cost Sharing Commitment


TO: IFAS Sponsored Programs, Division of Sponsored Research

FROM: PI: Howard W. Beck

DEPT: Agricultural and Biological Engineering

SUBJECT: Proposal Titled A Data Base Framework for Digital Humanities Colle

Sponsor: NEH

[Justification, partly illegible in the source scan] ... is rarely awarded, but cost
sharing is not required. This cost sharing will make up the ...

This is to advise you that our Unit has reviewed the above referenced proposal. In the event the Sponsor makes
an award to the University of Florida as a result of this proposal's acceptance, we agree to commit the
following as Cost Sharing for the award.

[Cost sharing detail, signature block, and totals for personnel, other costs, and third parties are illegible in the source scan.]

Mark McLellan

Howard Beck Biographical Sketch
Professor, Agricultural and Biological Engineering Department, University of Florida
(60% Extension, 30% Research, 10% teaching)

Education:

B.A. Philosophy University of Illinois 1976
B.S. Electrical Engineering University of Illinois 1976
M.S. Electrical Engineering University of Illinois 1977
Ph.D. Computer and Information Sciences University of Florida 1990

Appointments:

Professor, Agricultural Engineering Department, University of Florida (Tenured) 7/1/02-Present
Associate Professor, Agricultural Engineering Department, University of Florida, (Tenured),
7/1/95-6/30/02
Assistant Professor, Agricultural Engineering Department, University of Florida, (Non-Tenured),
5/1/90-6/30/95
University of Florida, Graduate Research Associate, 8/1/89-4/30/90
Entomology and Nematology Department, University of Florida, Associate In, 8/1/77-7/31/89

Expertise:

Application of information technologies in interdisciplinary environments, with emphasis on
database management, data modeling, natural language processing, and decision support systems.

Publications Related to Project:

Beck, H. 2008. Evolution of Database Designs for Knowledge Management in Agriculture and
Natural Resources. Journal of Information Technology in Agriculture. 3(1). 23 pages.
Beck, H., K. Morgan, Y. Jung, J. Wu, S. Grunwald, and H. Kwon. 2008. Ontology-based
Simulation Applied to Soil Water and Nutrient Management. In Papajorgji, P. ed. Advances in
Modeling Agricultural Systems. Springer.
Xuelian X., J. DePree, S. Degwekar, S. Su, and H. Beck. 2008. Integrated Specification and
Processing of Knowledge and Process for Achieving Knowledge Sharing among Collaborating
Organizations. ICISTM-08, Dubai.
Beck, H. W., S. Legg, E. Lowe, and M. Hardman. 2007. Aymara on the Internet: A Step Towards
Interoperability and User Access. In Austin P., O. Bond, and D. Nathan (eds.) Proceedings of
Conference on Language Documentation & Linguistics Theory. SOAS, London.
Beck, H., R. Badal, and Y. Jung. 2007. Ontology-Based Simulation in Agriculture and Natural
Resources. In Handbook of Dynamic System Modeling. P. Fishwick (ed.). CRC Press.
Oliverio, J., Y. R. Masakowski, H. Beck, and R. Appuswamy. 2007. ISAS: A Human-Centric
Digital Media Interface to Empower Real-Time Decision-Making Across Distributed Systems.
In Proceedings of the Twelfth International Conference on 3D Web Technology (Perugia, Italy,
April 15-18, 2007). Web3D '07. ACM, New York, NY, 81-87. DOI=
http://doi.acm.org/10.1145/1229390.1229403








Badal, R., S. Kim, J. Owens and H. Beck. 2006. An Integrated Database Approach for Managing
Educational Resources in Agricultural and Biological Engineering. IJEE 22(6):1210-1218.
Kim, S., and H. Beck. 2006. A Practical Comparison Between Thesaurus and Ontology
Techniques as a Basis for Search Improvement. Journal of Agricultural & Food Information.
7(4):23-42.
Beck, H. 2006. The role of ontologies in E-Learning. Educational Technology, Vol 46, No. 1, 32-39.
Papajorgji, P.P., H.W. Beck, and J. L. Braga. 2004. An Architecture for Developing Service-
Oriented and Component-Based Environmental Models. Ecological Modeling; 179:61-67.
Badal, R., S. Kim, and H.W. Beck. 2004. Educational Simulation: A Database Integrated
Approach for Disseminating Research Information. International Conference on Environmental
Systems (ICES). Paper 2004-01-2421.
Beck, H.W. 2003. Integrating Ontologies, Object Databases, and XML for Educational Content
Management. Proceedings of the World Conference on E-Learning in Corporate, Government,
Healthcare, and Higher Education. Association for the Advancement of Computing in
Education.


Recent Grants and Contracts:
"An Accessible Linguistic Research Database for the Endangered Jaqaru and Kawki Languages",
M.J. Hardman (PI), E. Lowe, H. Beck, and S. Legg. $110,000. National Science Foundation.
7/1/08-6/30/11.
"Processing Dynamic Event Data and Multifaceted Knowledge in a Collaboration Federation",
Stanley Su (P.I.), H. Beck. $559,864.00, National Science Foundation, 8/1/06-8/1/09.
"Implementation and Grower Evaluation of a Web-based Nutrient Management Plan Support
(NUMAPS) System for Florida Crops". H. Beck (PI), K. Morgan, S. Grunwald. Florida
Department of Environmental Protection. 8/15/02-8/15/09. $600,000.
"Educational Outreach". H. Beck (PI). NASA. 10/1/04-12/30/06. $70,000.
"Creation of a National Training Program in Crop Biosecurity for First Detectors". G. Holmes
(PI), H. Beck, and K. Wright. USDA. 6/1/04-4/30/07. $350,000.
"Southern Regional Plant Diagnostics Center Laboratory". G. Wisler (PI) R. McGovern, M.
Momol, P. Roberts, H. Beck. 6/1/04-5/31/10. $2.8M.
"Northern Trees Expert System". E. Gilman (PI), 8/15/04-8/15/06. $70,000.
"Conserve Florida Clearinghouse". J. Heaney (PI), D. H. Beck, D. Haman. Florida Water
Management Districts. 5/1/06-12/30/06. $150,000.
"The Integrated Situational Awareness System (ISAS)" J. Oliverio (PI) H. Beck, and R. Lind.
$50,000. UF Opportunity Grant. Digital Worlds Institute. 5/1/07-4/30/07.
"The Aymara E-Learning Project: Using the Internet to Preserve and Promote an Indigenous
Language" Elizabeth Lowe (PI) et al. $468,141. U.S. Department of Education International
Research and Studies Program (Title VI). 10/1/04-9/30/07.









Dina Benson
1206 NE 9th St.
Gainesville, FL 32601
dinabenson@gmail.com


Employment

Institutional Repository Coordinator January 2008-present
Digital Library Center
Smathers Libraries, University of Florida

Develops goals, policies, and procedures related to the development of the
University of Florida's Institutional Repository collection, including selection of serial
titles and harvesting individual items from all educational units at the university
Works collaboratively with campus units to assure appropriate materials are added to
the online collection in a timely manner
Hires, trains, and supervises student assistants in all processes completed within the
department
Provides training throughout the university community on the contents of the
collection and potential for material use and contribution
Analyzes collection building and usage data to submit reports as requested
Participates in usability studies to determine patron needs and creates subsequent
resources for increasing accessibility and visibility

Assistant Coordinator of Public and Access Services August 2005-January 2008
Interim Rare Book Librarian
Department of Special & Area Studies Collections
Smathers Libraries, University of Florida

Supervised the Collections' reading room during daily shifts, attending to the needs
of patrons via research and retrieval of items from closed stacks
Maintained awareness of departmental needs and priorities and exercised initiative
to be as productive as possible at all times
Served as coordinator in the absence of the coordinator of Public and Support
Services
Managed the rare book collection, including oversight of incoming collections and
cataloging of unprocessed material backlog
Oversaw an annual budget of $12,000 plus grant and endowment funds as
applicable
Answered reference questions in person, on phone, and via email
Addressed classes interested in using departmental resources
Assisted with various ongoing projects or activities as assigned by the department
chair, such as collection moves and barcoding of collections
Provided computer and technical support for department









Education


Master of Science in Library and Information Studies August 2003-April 2005
Florida State University, Tallahassee, FL
Included coursework in communities of practice and information technology

Bachelor of Arts in Philosophy and Classics August 1999-May 2003
University of Florida, Gainesville, FL
Included coursework in the rationalists and Latin language


Special projects

Project in lieu of thesis digitization program launched to offer an alternative for graduate
students in the College of Fine Arts and College of Design, Construction and Planning whose
atypical terminal projects are excluded from the university's electronic thesis and dissertation
program, beginning Spring 2009. Gathered documents and copyright permissions from current
graduates and alumni, integrated items into the Institutional Repository collection, created
documentation and best practices for the ongoing project, and created an online access point
highlighting the collection.

Aymara language display in the Irene Zimmerman Memorial Display Case in conjunction with
Smathers Libraries' Latin American Collection, supplementing the 56th Conference of the
University of Florida Center for Latin American Studies, February 2007. Collaborated with Latin
American Collection librarians to identify, retrieve, and arrange materials for viewing by
university patrons and conference attendees.

John D. MacDonald Collection Processing Plan minigrant to devise a strategy for the
organization and marketing of the author's collection of more than 369 linear feet of
manuscripts, correspondence, galleys, photos, books, and magazines currently housed in
Special Collections stacks, awarded October 2006. Trained and supervised three temporary
employees in the organization, classification, and conservation of materials culminating in a
comprehensive finding aid and the opening of the collection to the public.










CORINA ROXANA GIRJU Biographical Sketch


Department of Linguistics and Beckman Institute,
University of Illinois at Urbana-Champaign
Foreign Language Bldg., room 4016B
707 S. Mathews Ave, Urbana, IL 61801
Email: girju@illinois.edu


Professional Preparation:
Ph.D., Computer Science, University of Texas at Dallas, 2002
M.S., Computer Science, Southern Methodist University, Dallas, Texas, 2000
B.A., International Economic Transactions, Academy of Economic Studies, Bucharest, Romania, 1997
B.S., Computer Science, "Politechnica" University of Bucharest, 1995

Academic Employment
Assistant Professor of Linguistics and Beckman Institute, UIUC (2005-present)
Affiliate Assist. Professor of Computer Science, UIUC (2006-present)
Affiliate Assist. Professor of Spanish, Italian, and Portuguese, UIUC (2006-present)
Assistant Professor of Computer Science, Baylor University, Texas (2002-2005)
Visiting Research Assistant Professor of Computer Science, UIUC (2004-2005)

Relevant Publications
R. Girju: The Syntax and Semantics of Prepositions in the Task of Automatic Interpretation of Nominal
Phrases and Compounds: a Cross-linguistic Study. In Computational Linguistics 35(2) Special Issue on
Prepositions in Applications, A. Villavicencio, V. Kordoni, and T. Baldwin (eds.), 2009.
M. Paul and R. Girju. Cross-cultural Analysis of Blogs and Forums with Mixed-collection Topic Models.
The Empirical Methods in Natural Language Processing Conference (EMNLP), 2009.
M. Paul and R. Girju. Topic Modeling of Research Fields: An Interdisciplinary Perspective. The
International Conference on Recent Advances in Natural Language Processing (RANLP), 2009.
R. Girju. Out of context Noun Phrase Interpretation with Cross-linguistic Evidence. The ACM 15th
Conference on Information and Knowledge Management (CIKM), Washington D.C., 2006.
D. Moldovan, S. Harabagiu, R. Girju, F. Lacatusu, A. Novischi, A. Badulescu, and O. Bolohan. LCC
Tools for Question Answering. The Text REtrieval Conference Question Answering Track (TREC-QA),
2002.

Other Publications
S. Harabagiu, D. Moldovan, M. Pasca, M. Surdeanu, R. Mihalcea, R. Girju, V. Rus, F. Lacatusu,
P. Morarescu, R. Bunescu. Answering Complex, List and Context Questions with LCC's Question-
Answering Server. The Text REtrieval Conference for Question Answering (TREC 10), 2001.
S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, and
P. Morarescu. The Role of Lexico-Semantic Feedbacks in Open-Domain Textual Question Answering.
The 39th Annual Meeting of the Association for Computational Linguistics (ACL-2001), Toulouse, France,
2001.
D. Moldovan and R. Girju. An Interactive Tool For The Rapid Development of Knowledge Bases.
International Journal on Artificial Intelligence Tools (IJAIT), 2001.
S. Harabagiu, D. Moldovan, P. Morarescu, F. Lacatusu, R. Mihalcea, V. Rus, R. Girju. GISTexter: A
System for Summarizing Text Documents. ACM SIGIR'01 Workshop on Text Summarization, 2001.
R. Girju. Answer Fusion with On-Line Ontology Development. The North American Chapter of the
Association for Computational Linguistics (NAACL) SR Workshop, 2001.

Synergistic Activities
Co-organizer [with P. Batoma and E. Lowe (UIUC), P. Minacori (U. of Paris)] of the Panel "Preparing
translators for the current technology landscape" at the Machine Translation Summit, Ottawa, Canada,
Aug. 2009
Co-organizer [with J. Hockenmaier] of the 2009 North American Computational Linguistics Olympiad
(NACLO) UIUC site, Feb. 2009
Organizer of the roundtable on Tools and Technologies in Digital Humanities: Bridging Communities
and Fostering Interdisciplinary Research at the Chicago Colloquium on Digital Humanities and
Computer Science Pre-colloquium, University of Chicago, October 31-November 2, 2008
Invited participation NSF panel, 2008
Co-organizer [with R. Sproat] of the 2008 North American Computational Linguistics Olympiad
(NACLO) UIUC site, Feb. 2008
Co-organizer [with P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, D. Yuret] of SemEval-2009 Task 4:
Classification of Semantic Relations between Nominals (ACL), Prague.
Co-organizer [with D. Moldovan] of the Workshop on Computational Lexical Semantics, Human
Language Technology (HLT/NAACL), Boston, MA.
Co-organized [with D. Moldovan] a Knowledge Discovery from Text Tutorial at the 41st Conference of
the Association for Computational Linguistics (ACL), Sapporo, Japan, July 2003
Organized a Special Track on Recent Advances in Natural Language Processing, the 16th International
Florida Artificial Intelligence Research Society Conference (FLAIRS-2003), in cooperation with AAAI,
St. Augustine, Florida, May 2003

Collaborators in the last 24 months
Dan Roth (CS, UIUC), Julia Hockenmaier (CS, UIUC), Peter Lasersohn (UIUC: semantics/pragmatics),
Silvina Montrul (UIUC: psycholinguistics), Rakesh Bhatt (UIUC: sociolinguistics), Jerry Morgan (UIUC:
pragmatics), Peter Nardulli (UIUC: political science), Richard Sproat (Oregon Health & Science Univ.),
Preslav Nakov (National University of Singapore), Stan Szpakowicz (University of Ottawa), Deniz
Yuret (Koc University, Turkey), Peter Turney (National Research Council of Canada).

Advising of Graduate Students at UIUC
B. Beamer (Undergrad. in CS), PhD in Linguistics: expected 2012
A. Fister (Undergrad. in CS), PhD in Linguistics: expected 2012
C. Li (Master in CS), PhD in Linguistics: expected 2013
M. Riaz PhD in Computer Science: expected 2015









M.J. Hardman Biographical Sketch


Professional Preparation:

B.A. University of Utah, 1955.
M.A. University of New Mexico, 1956.
Ph.D. Stanford University, 1962.

Appointments:

Professor, Department of Anthropology and Department of Linguistics, University of Florida,
College of Arts and Sciences (Tenured).
Affiliate Faculty, Center for Latin American Studies.

P.I., The Aymara e-Learning Project. Title VI, U.S. Department of Education (2004-2007)
P.I., An Ontology-Based Linguistic Research Database for Jaqaru and Kawki (2008-2011)

Research and Teaching Specializations: Reconstruction of proto-Jaqi (Andean pre-history),
Language and gender; Language and Culture; Comparative Study of the Jaqi languages;
Aymara, Language Change

Field Research: Peru, Bolivia, Chile

Selected Publications:

2000 2001 Aymara Munich, Germany: Lincom Europa.

2000 Hearing Many Voices, with A. Taylor. Cresskill, N.J.: Hampton Press.

2001 Jaqaru. Munich, Germany: Lincom Europa.

1995 "Jaqi Onomastics." Namenforschung Name Studies Les Noms Propres. Berlin: Walter
de Gruyter, 970-974.

1994 "And if We Lose our Name, Then What About our Land? Or, What Price Development?"
In Differences That Make a Difference, Examining the Assumptions in Gender
Research. Lynn H. Turner and Helen M. Stark, eds. Westport & London: Bergin &
Garvey, 151-162.

1988 Aymara: Compendio de Estructura Fonológica y Gramatical, with Hardman et
al., translation by Chavez, final revision Briggs. La Paz, Bolivia: Editorial ILCA.

1988 "Jaqi Aru: La Lengua Humana." In Raices de America: El Mundo Aymara, Xavier Albo,
ed. Madrid: Alianza Editorial, 155-205.

Synergistic Activities:

Dr. M.J. Hardman is Professor of Linguistics and Anthropology, currently in Linguistics. Within
the University of Florida, she has participated in the building of both Linguistics and Women's
Studies and has been associated with and taught courses within Anthropology, Linguistics,
Women's Studies, Latin American Studies and also Honors. Her principal area of research has
been the Jaqi languages (Jaqaru, Kawki, Aymara) for which she has written grammar, cultural
studies and applied materials. She is currently the world authority in this language family. For
twenty-one years the Aymara language was taught as a regular foreign language on the
campus of the University of Florida, as part of the Aymara Language Materials Program, with
grants under the U.S. Department of Education Title VI program. In addition, and growing out
of work related to the Jaqi languages, she has recently done substantial research in the areas of
language and culture, gender and violence, for which she has received several awards. In
addition to the current work for Aymara on the Web, she is preparing with a colleague from
George Mason University a DVD on language, gender and violence for classroom use. Most
recently she has been named Consultant to the Proyecto de Educacion Rural Bilingüe by a
Resolution of the Provincial Government of Yauyos, Peru.

Collaborators and Other Affiliations:

Dr. Hardman's work on Jaqaru began while she was a Fulbright Scholar in Peru in 1958. The
contacts with the scholars at San Marcos University at the time have continued to this day. Dr.
Hardman, together with Dr. Julia Elena Fortún, founded the Instituto Nacional de Estudios
Linguisticos (INEL) in 1965, while Hardman was in Bolivia as a Fulbright-Hays Professor. The
first class of INEL graduated 25 linguists which made linguistics part of the Bolivian academic
environment. ILCA (Instituto de Lengua y Cultura Aymara), through which Aymara speaking
teachers learned reading and writing Aymara, grew out of INEL. Over the years Dr. Hardman
has lectured at, worked with, and signed agreements with a number of universities throughout
Latin America, especially in Peru, Bolivia and Chile. Within the United States she is part of the
Wise Women's Council of the Organization for the Study of Communication, Language and
Gender and has conducted workshops on language, gender and violence at numerous national
meetings. She has received awards for intercultural work, scholarship on the Jaqi languages
and for mentoring on a national level. She consults on publications in both English and Spanish
and on matters regarding the Jaqi languages, as well as on gender, language and violence.

Recent Honors

Honorary Doctorate, Doctora Honoris Causa, for pioneering work in Linguistics, Anthropology
and History, Universidad Nacional Mayor de San Marcos, Lima, Perú, declared 24 November
2008, conferred 14 July 2009.






Principal Investigator/Program Director (Last, first, middle)


BIOGRAPHICAL SKETCH


NAME
Sue M. Legg, PhD


POSITION TITLE
Director Emeritus, Office of Instructional Resources
And Center for Instructional and Research
Computing Activities (CIRCA)


EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing,
and include postdoctoral training.)

INSTITUTION AND LOCATION                      DEGREE (if applicable)   YEAR(s)   FIELD OF STUDY

University of California, Berkeley            BA                       1960      Political Science
University of California, Berkeley            Teaching Certificate     1962      Education
University of Florida, Gainesville, Florida   MA                       1976      Research, Measurement, Evaluation
University of Florida, Gainesville, Florida   PhD                      1978      Research, Measurement, Evaluation; Minor: Statistics


A. Positions and Honors.
Positions and Employment

1962-1965 Teacher, Richmond Public Schools, Richmond, California
1974-1979 Statistician, University of Florida Office of Instructional Resources
1978-1980 Faculty Associate In, University of Florida Office of Instructional Resources
1980-1994 Associate Director, University of Florida Office of Instructional Resources
1995-2001 Director, University of Florida Office of Instructional Resources (Now, Academic Technologies)
1997-2001 Director, University of Florida, Center for Instructional Research and Computing Activities
1998-2001 University Distance Education Coordinator, University of Florida
2001-2004 Executive Director, Partnership in Global Learning, University of Florida

Honors
1978 Phi Kappa Phi National Honor Society
1978 Kappa Delta Pi Education Honor Society
1978 Pi Lambda Theta Honor Society

Committee Appointments (Selected)


1990-1991 Book Review Editor, Journal of Educational Measurement
1992-1993 Review Editor, Journal of Educational Measurement
1991-1992 President, Measurement Services Association
1991-1992 President, Florida Institutional Research Association
1992-94,97 Field Grant Reviewer, U.S. Department of Education
1995-1997 Chair, SAT National Advisory Board of the College Board
1995-1997 Subcommittee Member, Articulation Coordinating Committee, Florida Board of Regents
1995-2000 Assessment Technical Committees, Florida Department of Education
1998-1999 Chair, Nominating Committee, American Educational Research Association








2001-2002 Chair, Learning Technology Consortium
1993-2009 Measurement Consultant, Florida Bar Board Certification Examinations

B. Selected peer-reviewed publications

Legg, S. Book Review: Teacher Certification Testing. In Journal of Educational Measurement,

Legg, S. and Algina, J. (Eds.) Cognitive Assessment of Language and Math Outcomes. Ablex Publishing
Corporation, Norwood, New Jersey. 1990. (Book)

Legg, S. and Buhr, D. The Effect of a Computerized Adaptive Test on Different Examinee Groups.
Educational Measurement: Issues and Practice. Summer, 1992.

Miller, M.D. and Legg, S. Assessment in a High Stakes Environment. Educational Measurement: Issues and
Practice. Summer, 1993.

Legg, S. Review of Hay Aptitude Test Battery. In Buros Mental Measurement Yearbook. AN-10072567.
1993.

Wolcott, W. with Legg, S. Overview of Writing Assessment: Theory, Research and Practice. NCTE, Urbana
Illinois. 1998. (Book)

Schaub, D., Legg, S., Svoronos, S., Coopman, B., Sherman, B. Applying TQM in an Interdisciplinary Engineering
Course. In Journal of Engineering Education. 1999.
Brown, D., assisted by Legg, S. et al. Teaching with Technology. Anker Publishing, Boston, Mass. 2000.
(Book in collaboration with the Learning Technology Consortium)

C. Research Support
Ongoing Research Support
No current research support in my name. PGL support includes:


Completed Research Support (Selected)
Project Manager: N.S.F. grant: An Accessible Linguistic Research Database of the Endangered Jaqaru
and Kawki Languages. 2008-2011
U.S. DOE. Title Six grant: Development of Online Courses in the Aymara Language 2004-2007
Verizon Corporation-University of Tampa subcontract:
Knowledge Connection-Florida, Brazil and Mexico K-12 schools 2004-2005.
Principal Investigator: Sustainability grant for the Partnership in Global Learning. William and Flora Hewlett
Foundation. 2002-2004. $300,000.

Principal Investigator with Martin Vala and Marvel Townsend. Cost Effective Uses of Technology in Teaching.
Grant from the Mellon Foundation, 2000-2003. $270,000.

Principal Investigator. Florida Teacher Certification Examination. Florida State Department of Education.
Responsibilities include administration, test form development, development, scoring, reporting and test
analysis. 2001. Current Award: $4,200,000.

Principal Investigator. College Level Academic Skills Program. Florida State Department of Education.
Responsible for form development, scoring, reporting and test analysis. 1982-2001. Annual Award:
$700,000.

Principal Investigator. Development of Licensure Examination Items. Florida Department of Business and
Professional Regulation. 1997. Award: $280,971.











Daniel A. Reboussin
Assistant Bibliographer, Africana Collection and Selector for Anthropology
Special & Area Studies Collections Department
University of Florida Libraries
Gainesville, Florida

Education
B.A. in anthropology, Grinnell College, Iowa, 1983
M.A. in anthropology, University of Florida, 1986
Ph.D. in anthropology, University of Florida, 1995

Research Focus

Dr. Reboussin received his Ph.D. in anthropology from the University of Florida in 1995.
His dissertation From Affiniam-Boutem to Dakar: migration from the Casamance, life in
the urban environment of Dakar, and the resulting evolutionary changes in local Diola
organizations presented a "case study of rural-urban migration among the women of
Affiniam-Boutem." Funding for his field work came from a Fulbright Scholarship and a
Foreign Language Area Studies Fellowship. His scholarly interest in West Africa and his
knowledge of anthropology and its methodologies are integral to his current position in
the University of Florida libraries. Within his duties as an Assistant Bibliographer, he is
solely responsible for the acquisitions of West African monographic materials. In
collaboration with the Africana Bibliographer, he provides a full range of research and
reference services including assisting in the teaching of the for-credit class on Africana
Bibliography which is required for the graduate Certificate in African Studies.
As the Anthropology selector, Dr. Reboussin serves the faculty and students of one of
the largest, most respected and diverse departments at UF. Ranked 11th among all U.S.
university anthropology departments in the last National Research Council ratings, the
department includes thirty-five faculty members including anthropologists appointed to
the Colleges of Medicine, Nursing, the Florida Museum of Natural History, Centers for
African, European and Latin American Studies and other campus units. His most recent
article "Migrant Labor: Africa" was published in the Oxford Encyclopedia of the Modern
World, v.5, edited by Peter N. Stearns, 2008. He is an active contributor to both African
and library focused newsletters on campus, and publishers frequently request his critical
assessment of new books.

Digital Resource Creation

In collaboration with the Digital Library Center, Dr. Reboussin is working on the creation
of the African Studies General Collections
http://www.uflib.ufl.edu/ufdc/?c=africal&m=hhs which is being developed and
managed to support the past, ongoing and future needs of University of Florida's Center
for African Studies. This Center is the only United States Department of Education Title
VI Center for African Studies in the American southeast and is one of the most active
and well regarded African study centers in the U.S.
With financial sponsorship from multiple sources, Dr. Reboussin's digital projects
include the Martin Rikli Photographs, 1935-36 http://www.uflib.ufl.edu/ufdc/?s=fotoaf
documenting his Ethiopian (Abessinien) expedition from 1935-36, which coincided with
the second Italo-Abyssinian War. This project and others in the planning stages offer
tremendous research potential to humanities scholars across the globe. By carefully
integrating traditional humanities research tools such as finding aids with digital
expressions, Dr. Reboussin expects that scholars will discern new ways of analyzing
content leading to new research discoveries.









Rita J. Smith
Associate Librarian and Curator
The Baldwin Library of Historical Children's Literature
Department of Special and Area Studies Collections
University of Florida Libraries, Gainesville, FL 32611

Date June 17, 2009 [Note: Rita Smith may retire April, 2010]

Recent Work Experience:

January 1994-Present:
Curator, Baldwin Library of Historical Children's Literature, University of Florida
Library
June 2006-June 2007 Chair, Department of Special and Area Studies Collections
June 1992-December 1993:
General Humanities Cataloger, University of Florida Library
October 1989-May 1992
Project Cataloger, University of Florida Library, on U.S. Department of Education
Title IIC grant to catalogue books from the Baldwin Library of Historical
Children's Literature
Education:

BA in English, Goshen College, May 1967
MA in Library Science, University of Michigan, June, 1972

Publications/Exhibits:

Exhibit: "Pop-up, Spin, Pull, Fold: Toy Books from the Baldwin Library," an exhibit of
50 items from the Baldwin Library, Smathers Library Exhibit Area, September 2-
October 31, 2008.
Exhibit: "Alice Ever After," Various editions of Alice's Adventures in Wonderland,
Harn Museum of Art, August-November 2008
Exhibit: "The Afterlife of Alice In Wonderland," an exhibit on cultural reincarnations
and uses of Lewis Carroll's Alice's Adventures In Wonderland, Library East Exhibit
Area, October 15, 2007-January 15, 2008.
"The Baldwin Library of Historical Children's Literature," in Journal of Children's
Literature, vol. 31, no. 1, Spring, 2005. pp. 48-53.
"Life Is Short, Art Is Long: Randolph Caldecott, 1846-1886," The Newbery and
Caldecott Awards: A Guide to the Medal and Honor Books, p. 11-17. Chicago:
American Library Association, 2000.
"Recess!" Over 160 essays written for Recess!, a 3-minute program recorded at the
University of Florida and aired nationwide over National Public Radio, September
1999-August 2007
"Caught Up in the Whirlwind: Ruth Baldwin," The Lion and the Unicorn, p. 289-302,
Vol. 22, No. 3, September 1998.










Papers, Speeches, Presentations:


"Randolph Caldecott and the Caldecott Award," Presentation for the Conversations In
Children's Literature monthly meeting, February 17, 2009
"The First Alice," Presentation at the Harn Museum of Art, Gainesville, Florida, on first
editions of Alice's Adventures in Wonderland, October 12, 2008
"Claiming New Territory: Louise Seaman Bechtel and the Establishment of Juvenile
Departments in American Publishing Houses," A Paper Presented at the Children's
Literature Association Annual Conference, Newport News, Virginia, June 15, 2007.
"The Quest for the Quotidian," a paper presented as part of a panel entitled Culture of
Comics: The Sol and Penny Davidson Special Collection at the University of Florida.
Popular Culture Association Annual Conference, April 13-15, 2006, Atlanta.
"The History of the Baldwin Library." NEFLIN Workshop, University of Florida, March
17, 2006
"Collecting the Everyday: Popular Culture, the Academic Library and the Scholar,"
Paper presented at the Conference on Comics and Childhood, University of Florida,
February 24, 2006
"Children's Science Books to 1900," A talk and visual presentation on the history of
children's science books. "Transforming Encounters II: Children and Science,
Imagination and Inquiry," a colloquium, at the University of Florida, February 18-19,
2005.

Grants:
October, 2007. Principal Investigator. 21 month National Endowment for the Humanities
grant to catalog items from the Baldwin Library published from 1890 through 1905
and to digitize and make available through the internet those items with color
illustrations. $285,000
March, 2004. Co-Principal Investigator. Two year National Endowment for the
Humanities grant, to catalogue Baldwin Library holdings dated 1870-1889 and to
digitize and make available through the internet those items from that time period
which contain color illustrations. $298,185
May, 2000. Co-Principal Investigator. Two year National Endowment for the
Humanities grant, to catalogue and microfilm Baldwin Library holdings from 1850-
1869 and to digitize and make available through the internet those items with color
illustrations. $381,220

University Service
Associate Director, Center for the Study of Children's Literature and Culture, an
interdisciplinary center housed in the UF English Department. June, 1997-Present.

National Service:
American Library Association, Association of Library Service to Children, Bechtel
Fellowship Award Committee member, 1998-Present
American Library Association, Association of Library Service to Children, 2005
Caldecott Award Selection Committee, 2003-2005, Appointed, Member










Digital Library Center 352.273.2900
UF Libraries, PO Box 117003 marsull@uflib.ufl.edu


Mark Vincent Sullivan




Experience 2005-2009 Digital Library Center, UF Gainesville, FL
Programmer and Systems Architect
Designer, architect, developer, and programmer for suite of production
tools for the UF Digital Library Center and partners, including all
partners in the Digital Library of the Caribbean (dLOC)
Tool suite is the "DLC Toolbox" for production line installations with
multiple simultaneous workflows and is the "dLOC Toolkit" for
single user workflows. These offer interfaces in English, Spanish,
and French
DLC Toolbox and dLOC Toolkit are Open Source and currently
support digitization by over two dozen institutions across the
US and the Caribbean
Designer, architect, developer, and programmer for all aspects of the
SobekCM system which:
Uses ASP.NET to harness the abilities of the Greenstone Digital
Library System, enterprise-level full text indexing and searching
through Lucene, and an MS SQL database, and to integrate them
into a robust and dynamic digital library and content management
system
Powers the University of Florida Digital Collections (UFDC), which
have over 203,000 volumes with over 4 million pages of books,
archival materials, maps and other large format items,
photographs, audio and video, newspapers, objects, etc. UFDC
also includes materials from over 24 languages, which required
implementing intensive indexing optimization
Designed customized supports based on user needs (multi-lingual
interface support, automatic customized interfaces for all
partners), material type needs (zooming, objects in rotation), and
internal user needs (usage statistics, search engine optimization
for external engines to search the UFDC materials)
Developed and maintain documentation on all tools

2004 2005 Digital Library Center, UF Gainesville, FL
Systems Architect and Programmer, Ephemeral Cities Project
Designed and implemented software and database for the Ephemeral
Cities Project, a grant to create geographic interfaces to browse
through maps, documents, museum objects, and photographs for
three Florida cities from 1884-1903.
Designed, created, and maintained workflow applications and
databases in .NET, C#, MS SQL.
Automated image manipulation and creation of metadata for image
class items prior to web mounting.

2001 2004 Digital Library Center, UF Gainesville, FL










Internet Server Manager and Database Developer
* Prepared and manage electronic collections of digitized images.
* Developed automation techniques, programming in C# and Visual
Basic.
* Designed databases and manage information workflows for current
projects in both MS Access and MS SQL.
* Created user interfaces to access the databases and assist students
entering data.

1999 2001 MCI Worldcom [MCIW] Tampa, FL
Implementation Consultant
* Responsible for the PriceWaterhouseCoopers [PwC] account's
installation processes, from design and pricing assistance to solving
any technical issues and configuration of routers and PBX's during
activations with the customer.
* Managed projects increasing bandwidth of PwC's WAN, raising total
revenue from $16M to $42M annually.
* Aided the customer and MCIW in troubleshooting of all service and
technical issues.
* Partnered with PwC, as well as Home Shopping Network, to sell, price,
and provide both off-the-shelf and custom data and voice solutions.

1997- 1999 MCI Worldcom San Francisco, CA
Global Service Consultant
Worked on the Bank of America account team with responsibilities for
data and voice network implementation
* Assisted with general project management and customer notifications
* Provided seminars for the customer to educate on MCIWs products
and processes

1994 1996 Preservation Dept, UF Libraries Gainesville, FL
Administrative Assistant
* Aided in the preservation of brittle books


Education 2004 2009 University of Florida Gainesville, FL
Computer Engineering, BA


Selected Publications & Presentations


* "Developing an Open Access, Multi Institutional, International Digital
Library," in Resource Sharing & Information Networks; by Brooke
Wooldridge, Mark Sullivan, and Laurie Taylor, forthcoming 2009
* "Digital Library of the Caribbean : a User-centric Model for Technology
Development in Collaborative Digitization Projects," Invited paper to a
Special issue of OCLC Systems & Services: International Digital
Library Perspectives; by Marilyn Ochoa and Mark Sullivan, forthcoming
2009
* Digital Library of the Caribbean (dLOC) Training; US Embassy in Haiti








SHORT CURRICULUM VITAE
Laurie N. Taylor
Interim Director, Digital Library Center
University of Florida Libraries

ADDRESS: Digital Library Center TEL: (352) 273-2900
Smathers Library FAX: (352) 846-3702
P.O. Box 117003 EMAIL: Laurien@ufl.edu
University of Florida
Gainesville, FL 32611-7003

EDUCATION:
Ph.D. 2006 University of Florida
(English/Digital Media)
M.A. 2002 University of Florida
(English/Digital Media)
B.A. 1999 Jacksonville University
(English)

RECENT POSITIONS HELD
2008 Interim Director, Digital Library Center, George A. Smathers Libraries,
University of Florida
2007 2008 Digital Projects Librarian, Digital Library Center, George A. Smathers Libraries,
University of Florida
2006 2007 Associate Director, Flexible Learning, Division of Continuing Education,
University of Florida
2000 2006 Instructor, College of Liberal Arts & Sciences, University of Florida

PROFESSIONAL AFFILIATIONS
Editorial Board, International Journal of Gaming and Computer-Mediated Simulations
Modern Language Association
American Library Association
Library & Information Technology Association

GRANTS

Caribbean Newspaper Digital Library (Department of Education; 2009-2014)
Florida Aerial Photographs / From the Air: the Photographic Record of Florida's Lands,
Phase III (Library Services and Technology Act, 2009-2010)
America's Swamp: the Historical Everglades (National Historic Publications and Records
Commissions, 2009-2011)

PUBLICATIONS

Selected Refereed Publications
"Snow White in the City: Teaching Fables, Nursery Rhymes, and Revisions in Graphic
Novels," in Approaches to Teaching the Graphic Novel. Ed. Stephen E. Tabachnick. New
York: MLA, forthcoming 2009.
Playing the Past: Video Games, History, and Memory, co-edited with Zach Whalen.
Nashville, TN: Vanderbilt University Press, 2008.
"Bioactive," in Gaming in Academic Libraries Casebook, co-authored with Sara Russell
Gonzalez, Valrie Davis, Carrie Newsom, Chelsea Dinsmore, Cynthia Frey, and Kathryn
Kennedy. Ed. Amy Harris and Scott Rice. ACRL, 2008.
"Gaming Ethics, Rules, Etiquette and Learning." Handbook of Research on Effective
Electronic Gaming in Education. Ed. Richard E. Ferdig. Information Science Reference,
2008.
"Making Nightmares into New Fairytales: Goth Comics as Children's Literature," in The
Gothic in Children's Literature: Haunting the Borders. Eds. Anna Jackson, Karen Coats,
and Roderick McGillis. New York: Routledge, 2008: 195-208.
"Console Wars: Console and Computer Games," in The Player's Realm: Studies on the
Culture of Video Games and Gaming. Eds. J. Patrick Williams and Jonas Heide Smith.
Jefferson, NC: McFarland Press, 2007: 223-237.
"Cameras, Radios, and Butterflies: the Influence and Importance of Fan Networks for
Game Studies." Fibreculture Journal 8 (2006):
http://journal.fibreculture.org/issue8/issue8_taylor.html.
"Playing in Neverland: Peter Pan Video Game Revisions," collaboratively written with
Cathlena Martin, in J. M. Barrie's Peter Pan In and Out of Time: A Children's Classic at
100. Eds. Carole Anita Tarr and Donna White. Scarecrow Press, 2006.
"Positive Features of Video Games," in Handbook of Children, Culture, and Violence.
Eds. Nancy E. Dowd, Dorothy G. Singer, and Robin Fretwell Wilson. Thousand Oaks,
CA: Sage, 2005. 247-265.
"Gaming's Non-Digital Predecessors," collaboratively written with Cathlena Martin, in
The International Digital Media & Arts Association Journal 2.1 (Spring 2005): 25-29.
"Practicing What We Teach: Collaborative Writing and Teaching Teachers to Blog,"
co-authored with Cathlena Martin, in Lore: an E-Journal for Teachers of Writing (Fall
2004): http://www.bedfordstmartins.com/lore/digressions/content.htm?disl2.
"Open Source and Academia," co-authored with Brendan Riley, in Computers and
Composition Online (Spring 2004): http://www.bgsu.edu/cconline/tayloriley/intro.html.
"When Seams Fall Apart: Video Game Space and the Player," in Game Studies: the
International Journal of Computer Game Research 3.2 (Dec. 2003):
http://www.gamestudies.org/0302/taylor/.

SELECTED PRESENTATIONS

"Practical Steps Towards Your Local and/or Regional Digitalisation Project," at the
Seminar for Libraries of the Dutch Caribbean Curaçao, University of the Netherlands
Antilles. Willemstad, Curaçao: September 25-6, 2008.
"Bioactive: A Game for Library Instruction" at the ALA Annual Conference. Anaheim,
CA: June 30, 2008.
"The Digital Library of the Caribbean (dLOC)" in the "Microfilm to Digitization
Roadshow: Hidden Treasures in the Vault" hosted by the OCLC Preservation Service
Centers at the ALA Annual Conference, Anaheim, CA: June 29, 2008.
"Choices for Building Digital Libraries" at the College of the Bahamas' Virtual Library
Committee at the College of the Bahamas, Nassau, Bahamas; Mar. 3, 2008.










Preliminary Analysis

1. Concordance Analysis Through Conceptual Clustering: Experiments on the Baldwin Library of
Historical Children's Literature

2. Phrase Pattern Induction

3. Topic Analysis












Concordance Analysis Through Conceptual Clustering
Experiments on the Baldwin Library of Historical Children's Literature

Howard Beck
University of Florida

July, 2009

Abstract: A conceptual clustering algorithm is applied to the analysis of words appearing in a corpus of
children's literature (the Baldwin Collection). The algorithm leads to a representation of word meaning
that can be used for natural language analysis of text and that can help answer questions that
humanities scholars have about the literature collection. The first stage of the clustering algorithm looks
at syntax. A concordance for a particular word (like "good" or "doll") is first extracted from the corpus.
Each case in the concordance is parsed using the Stanford Parser, which creates a parse tree with nodes
labeled by parts of speech based on the Penn Treebank POS labels. The clustering algorithm begins
by doing a pair-wise intersection of parse trees for all combinations of cases in the concordance. The
intersection shows what parts of two parse trees are similar. The resulting intersection forms a class
with the two cases as instances. A classification technique based on subsumption of these classes
results in a taxonomy that groups together similar cases in the concordance at various levels of
abstraction. All cases in the concordance are covered by this cluster. Future versions of this algorithm
will incorporate semantic analysis.

Justification

The long term goal is to produce a story analysis system that can read and represent the text in the
literature collection. Achieving this goal will, among other things, require a rich ontology of the
concepts appearing in the literature, and an extensive lexicon of terms appearing in the text. The first
steps in obtaining such facilities can be implemented relatively easily through concordance analysis.
Concordance analysis attempts to build lexical entries and ontology concepts that are rich in syntax and
semantics. These can then be used to represent word and concept meaning, as well as facilitate text
analysis. That is, the lexical entry is associated with a concept in the ontology, and also contains a
broad range of grammatical patterns and semantic roles that can be used for understanding how a word
is used in a particular situation. These can be built up through concordance analysis.

The language usage theory of word meaning claims that words do not get meaning from abstract
definitions but over the many different situations and contexts in which they are used. Concordance
analysis over a large corpus provides the basis for building word meanings and usage patterns that
would satisfy a language usage theory. The goal of concordance analysis is to group concordance cases
that are similar, and thus be able to relate a particular usage of a word with existing similar cases. One
way to do this is to use conceptual clustering in which the similarities of cases are compared. Here
similarity is defined as the extent to which graph representations of case properties overlap. No
numerical analysis is involved (thus far), so conceptual clustering stands in contrast with statistical,
numerical clustering methods.

Concordance analysis is the most common type of computational analysis in the digital humanities.
However, the traditional presentation techniques, such as keyword in context (KWIC) views, still require
careful, time-consuming manual analysis to discover patterns, repetitions, and changes in word use over
various segments of a collection (e.g. over time, between genders, across authors). Given the scale of
the Baldwin collection, manual concordance analysis is not practical. Therefore, automatic techniques
based on structural analysis of syntax and semantics must be used. An elementary conceptual clustering
algorithm is presented here to demonstrate the process.

Conceptual Clustering Algorithm

Step 1: Concordance Building. A concordance for a particular word (here "good" and "doll") was created by
extracting all sentences containing the word from the Baldwin corpus. Raw text of the Baldwin corpus
was provided by the UF Library. Simple string matching was used to search the corpus for a particular
word. Once located, the sentence containing the word was extracted by looking for period boundaries.
The extracted sentence is called a case (Figure 1). All the cases were saved for processing by Step 2.
The Baldwin corpus contained 250,000 cases for "good" and 10,000 cases for "doll". Sample cases for
"doll" are presented here using the KWIC view [Luhn, 1960]:

I will give away my doll and all my play things, if she may come up .
she had just received a present of a beautiful doll so delicate and lovely it seemed as if it could not
be born to sorrow and care .
but the little girl 's brothers, had carried off the doll set her on the branch of a tall tree in the garden
and run away.
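
Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names build_concordance and kwic are hypothetical, the sample text is made up from the cases above, and a word-boundary regular expression stands in for the simple string matching described in the text.

```python
import re

def build_concordance(text, word):
    """Return every sentence (case) in `text` that contains `word`."""
    # Sentences are cut at period boundaries, as described in Step 1.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Assumption: match whole words only, so "doll" does not match "dolls".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s + " ." for s in sentences if pattern.search(s)]

def kwic(case, word, width=30):
    """Format one case as a keyword-in-context line around `word`."""
    match = re.search(r"\b" + re.escape(word) + r"\b", case, re.IGNORECASE)
    left = case[:match.start()][-width:]    # context before the keyword
    right = case[match.end():][:width]      # context after the keyword
    return f"{left:>{width}}{match.group(0)}{right}"

sample = ("I will give away my doll and all my play things. "
          "The sun was warm. She went to own the doll.")
cases = build_concordance(sample, "doll")
for case in cases:
    print(kwic(case, "doll", width=20))
```

Right-aligning the left context keeps the keyword in the same column on every line, which is what makes a KWIC listing easy to scan.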

Step 2: Syntactic Analysis. Each sentence in the case file was parsed using the Stanford Parser
(http://nlp.stanford.edu/software/lex-parser.shtml). This created a parse tree that identified all the
sub-phrases within the sentence, down to the individual words. Each node in the tree is labeled by a
part-of-speech (POS). The POS labels come from the Penn Treebank
(http://bulba.sdsu.edu/jeanette/thesis/PennTags.html). These parse trees were forwarded for
processing in Step 3. Each entry consisted of the original phrase followed by the parse tree. Here is a
sample output from the Stanford Parser:

119: the doll and her friends .
number of tokens in phrase: 6
ROOT
NP
NP
NP
DT the
NN doll<====
CC and
NP
PRP$ her
NNS friends

Step 3: Pair-wise Intersection. For every pair of parse trees (for any 2 cases), an "intersection" tree is
built to identify how the two parse trees are similar. For 1000 cases there are 499,500 unique pairs of
parse trees. The intersection tree begins by comparing the two nodes in the two parse trees
corresponding to the concordance term (e.g., the node for "doll" in one tree is matched to the node for
"doll" in the other). If the POS tags for those two nodes do not match, the resulting intersection is null.
If they do match, the intersection tree is initialized with a single node containing the matching POS.










Next, the parents of the matched nodes in the two parse trees are compared. If they match they are
added to the intersection graph, likewise for the children. The intersection graph propagates via
matches within the two parse trees until no more matches can be found. A new class is created
containing the intersection tree.
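
The intersection step can be sketched as follows. This is a simplified reading of Step 3, not the implementation used for the results: the Node class, the greedy child matching, and the bracket layout produced by to_list are all assumptions made for illustration, and the three small parse trees are hand-built rather than produced by the Stanford Parser.

```python
from itertools import combinations

class Node:
    """Parse-tree node: a POS or phrase label, an optional word, and children."""
    def __init__(self, label, children=(), word=None):
        self.label, self.word = label, word
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def find(node, word):
    """Return the first node (depth-first) that carries the concordance term."""
    if node.word == word:
        return node
    for child in node.children:
        hit = find(child, word)
        if hit is not None:
            return hit
    return None

def intersect_down(a, b):
    """Overlap below two nodes that already share a label, as (label, children)."""
    kids, used = [], set()
    for ca in a.children:
        for i, cb in enumerate(b.children):
            if i not in used and ca.label == cb.label:
                used.add(i)
                kids.append(intersect_down(ca, cb))
                break
    return (a.label, kids)

def intersect(tree_a, tree_b, word):
    """Grow an intersection tree outward from the node for `word`."""
    a, b = find(tree_a, word), find(tree_b, word)
    if a is None or b is None or a.label != b.label:
        return None  # POS tags differ: the intersection is null
    result = (a.label + " " + word, [])
    while (a.parent is not None and b.parent is not None
           and a.parent.label == b.parent.label):
        kids, used = [], {id(b)}
        for ca in a.parent.children:
            if ca is a:
                kids.append(result)  # keep the part matched so far
                continue
            for cb in b.parent.children:
                if id(cb) not in used and ca.label == cb.label:
                    used.add(id(cb))
                    kids.append(intersect_down(ca, cb))
                    break
        result = (a.parent.label, kids)
        a, b = a.parent, b.parent
    return result

def to_list(tree):
    """Collapse an intersection tree into single-line list notation."""
    label, kids = tree
    return "(" + label + "".join(to_list(k) for k in kids) + ")"

def leaf(tag, word):
    return Node(tag, word=word)

def possessed_doll(prep, place):
    """Hand-built tree for 'her doll <prep> the <place>'."""
    return Node("NP", [
        Node("NP", [leaf("PRP$", "her"), leaf("NN", "doll")]),
        Node("PP", [leaf("IN", prep),
                    Node("NP", [leaf("DT", "the"), leaf("NN", place)])])])

trees = [possessed_doll("on", "floor"), possessed_doll("in", "cradle"),
         Node("NP", [leaf("DT", "a"), leaf("JJ", "good"), leaf("NN", "doll")])]
classes = {(i, j): to_list(intersect(trees[i], trees[j], "doll"))
           for i, j in combinations(range(len(trees)), 2)}
pairs_for_1000_cases = sum(1 for _ in combinations(range(1000), 2))
```

The first two trees share their whole structure, so their intersection keeps the prepositional phrase; either of them intersected with "a good doll" keeps only the noun phrase around "doll".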

Step 4: Classify all classes and cases. A taxonomy is created from all classes induced in Step 3 based on
class subsumption. Class A subsumes Class B if the intersection tree for class A is contained entirely
within the intersection tree for Class B. An initial taxonomy is built from a single "universal" class
containing a null intersection graph (this class subsumes all other classes). Each class is added to the
taxonomy one at a time by doing a breadth-first search of the tree, descending as low as possible as long as
the subsumption relationship holds true between an existing class and the new class. This finds all the
most specific classes that subsume the new class, and the new class is added below those in the
taxonomy. The same process is then used to classify all cases within the taxonomy.
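
A minimal sketch of the subsumption test and the taxonomy insertion in Step 4 is given here. It rests on several assumptions: intersection trees are represented as nested (label, children) tuples, "contained entirely within" is read as an embedding of one tree somewhere inside the other, and classes are inserted from smallest to largest so that no existing class ever needs to be re-parented. The names ClassNode, subsumes, and add_class are hypothetical.

```python
def embeds(pattern, tree):
    """True if `pattern` matches at the root of `tree`: equal labels, and every
    child of the pattern embeds in a distinct child of the tree."""
    if pattern[0] != tree[0]:
        return False

    def assign(wanted, available):
        if not wanted:
            return True
        first, rest = wanted[0], wanted[1:]
        return any(embeds(first, cand)
                   and assign(rest, available[:i] + available[i + 1:])
                   for i, cand in enumerate(available))

    return assign(tuple(pattern[1]), tuple(tree[1]))

def subsumes(a, b):
    """Class a subsumes class b if a's tree occurs somewhere inside b's tree.
    None stands for the null tree of the "universal" class."""
    if a is None:
        return True
    if b is None:
        return False
    return embeds(a, b) or any(subsumes(a, kid) for kid in b[1])

class ClassNode:
    def __init__(self, tree):
        self.tree = tree
        self.children = []

def most_specific_subsumers(root, tree):
    """Breadth-first descent to the lowest classes that still subsume `tree`."""
    found, queue, seen = [], [root], set()
    while queue:
        node = queue.pop(0)
        if id(node) in seen:
            continue
        seen.add(id(node))
        lower = [c for c in node.children if subsumes(c.tree, tree)]
        if lower:
            queue.extend(lower)   # keep descending
        else:
            found.append(node)    # nothing below subsumes the new class
    return found

def add_class(root, tree):
    new = ClassNode(tree)
    for parent in most_specific_subsumers(root, tree):
        parent.children.append(new)
    return new

def size(tree):
    return 1 + sum(size(kid) for kid in tree[1])

adjp = ("ADJP", (("JJ good", ()), ("PP", (("IN", ()), ("NP", ())))))
vp = ("VP", (("VB", ()), adjp))
np = ("NP", (("DT", ()), ("JJ good", ()), ("NN", ())))

universal = ClassNode(None)
for tree in sorted([vp, np, adjp], key=size):  # general classes first
    add_class(universal, tree)
```

Cases can be classified with the same add_class call, passing the full parse tree of the case, so that each case ends up as a leaf below its most specific classes.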

Results

The resulting cluster contains all cases as well as a large number of classes induced from intersection.
The taxonomy has the most abstract classes at the top, and more specific classes ranked in order below.
The leaf nodes of the cluster are the cases, and interior nodes are all classes. Degree of similarity is
defined by proximity in the cluster. Cases appearing as leaves of the same node are syntactically similar.
Cases in sibling class nodes are also similar, but less similar than cases appearing in the same node.

Below are some interesting clusters, taken as branches from the complete cluster. The classes are
shown as tree structures, similar to Figure 2, but with the tree collapsed into a single line using list
notation ( ()) rather than indents to show the nesting. The phrases following the classes are the
instances from the concordance that are classified within that class.

The first few examples are from the "good" concordance. This first example clusters verb phrase (VP) -
adjective phrase (ADJP) - prepositional phrase (PP) patterns. For instance, "taste good for luncheon"
matches with "make good to you in any way" because "taste" matches with "make" (verb), and "for
luncheon" matches with "in any way" (prepositional phrases). In this example a class-subclass
relationship is also shown: (ADJP(JJ good)(PP(IN(NP)))) subsumes its subclass (VP(VB(ADJP(JJ
good)(PP(IN(NP)))))) because the former phrase pattern is contained within the latter.

(ADJP(JJ good)(PP(IN(NP)))) lyra.nlp.ClusterNode@18a7efd instances:0
(VP(VB(ADJP(JJ good)(PP(IN(NP)))))) lyra.nlp.ClusterNode@1971afc instances:3
he thought; it might taste good for luncheon .
i could n't make good to you in any way.
iving in a contempt of heaven, when be good at the hand of maker, I but is unc
(ADJP(JJ good)(PP(IN(NP(NP(DT)(PP(IN(NP)))))))) lyra.nlp.ClusterNode@16cd7d5 instances:2
teps of d avid and did that which was good and right in the eyes of the go d .


These cases have the general structure "good" as adjective - noun - prepositional phrase. Thus it finds
"good thing", "good laugh", and "good channel" as the adjective-noun pairs (good modifies the noun), and
"towards the...", "at him", and "for a mile" as the prepositional phrases.


(NP(DT(JJ good(NN)))(PP(IN(NP)))) lyra.nlp.ClusterNode@lef9fld instances:6










d spring up in the dreary wastes; some good thing towards the; should be found
ver backward into the mud, and i had a good laugh at him afterwards.
often they followed a good channel for a mile, only to have i
it was then i called to mind the good advice of my father how and was a m
maintain himself and his colony in the good grace s of white hall and the board
in this, the good advice of my father came to my mind

This is possibly the most useful cluster so far, because it shows conjunctions involving good ("good and
right", "good and evil", "good and heavy", "good and careful") whose second terms happen to be synonyms
and antonyms (but the clustering algorithm doesn't know that yet because it is only looking at syntax).

(JJ good(CC(JJ))) lyra.nlp.ClusterNode@fe748f instances:9
teps of d avid, and did that which was good and right in the eyes of the go d .
vein s, and his reign was a mixture of good and evil, resembling that of his p
it is a / i time, and if the crop is a good and heavy one, it is a time .
a great deal for him, and as she is a good and careful reader, he derives muc
ngs till you out fitting for the 55 get good and ready .
pulled good and hard, the b arb bar b caught i
; you need to be good and strong before i tell you what i think of that.
e d he i 4 good and religious, that lie thought of
remember that every good and perfect gift proceeds from go d

The analysis of "doll" was a bit messy, due to various errors, but here are a few interesting clusters.
This first group found possessive pronoun - doll - prepositional phrase patterns: "her doll on the floor"
vs. "her doll in the cradle". Eventually these would help figure out who typically owns dolls and the
places where dolls can be (that requires semantic analysis).

(NP(PRP$(NN doll))(PP(IN(NP(DT(NN)))))) lyra.nlp.ClusterNode@8916a2 instances:1
shame,; she began aloud, tossing her doll on the floor; and her very hands t
(NP(NP(PRP$(NN doll))(PP(IN(NP(DT(NN))))))) lyra.nlp.ClusterNode@2ce908 instances:2
ntact with the chair, on which had her doll in the cradle .

This group is doll possessive noun, so "doll's bonnet", "doll's dress", and "doll's house". Eventually
these would be typical things a doll can own.

(NP(NP(DT(NN doll(POS)))(NN))) lyra.nlp.ClusterNode@1ef9157 instances:3
ow blame from herself, she held up the doll 's bonnet, and ex claimed, it is
into my room to show me a basket and a doll 's dress had given her.
what, is s i it a doll 's house; / no; it is something alive .

Finally these are infinitive doll (doll as object) patterns, "to have the doll", "to buy a doll", and "to
own the doll".

(S(VP(TO(VP(VB(NP(DT(NN doll)))))))) lyra.nlp.ClusterNode@1e97f9f instances:3
said but you like to have the doll .
our money until you had enough to buy a doll;; i know i did, mother; but i d
so she went to own the doll .












Cluster for "good"

The complete cluster for about 400 cases of "good", including over 800 induced classes, is available as
an ASCII text file from:

http://lyra.ifas.ufl.edu/temp/GoodCluster.txt

The structure of the cluster is a taxonomy rooted in "universal". Subclasses appear indented below that,
with each successive level indented further. The universal class has just 3 immediate subclasses:
(JJ good), (RB good), and (NN good). Each class shows the intersection tree in list format in a single line.
Instances in each class follow immediately below the class at the same level of indentation.










Phrase Pattern Induction


Below is a set of phrases from a training exercise from the Aymara collection. The objective is to induce
a phrase pattern capturing similarities and differences among the phrases. A phrase pattern covering
these phrases is shown at the bottom, including categories (, ). and
constraints placed on constituents in the phrase.

Is this your ?
Yes, it is my _
Jis, naya\n ch'uqi\ja\wa .
Jis, jupa\n asukara\pa\wa .
Jis, juma\n up"isina\ma\wa.
Jis, naya\n p"ina\ja\wa.
Jis, jupa\n kisu\pa\wa .
Jis, juma\n yapu\ma\wa .
Jis, naya\n kanasta\ja\wa.
Jis, jupa\n apilla\pa\wa .
Jis, juma\n t'ant'a\ma\wa .
Jis, naya\n kanasta\ja\wa.
Jis, jupa\n misa\pa\wa .
Jis, juma\n ch'uqi\ma\wa .

Pattern:
Jis, n \\ .

Constraints:
= question.,
.person = question..inverse person,
.person=.person,
.target="personal knowledge"
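
One way such a pattern could be induced from the training phrases is sketched here. This is a hedged illustration, not the method used in the project: it assumes that "\" marks morpheme boundaries, that all phrases align position by position, and that a constraint can be approximated as one slot determining another. The function names and the slot numbering are hypothetical, and only a subset of the phrases above is used.

```python
import re

PHRASES = [
    r"Jis, naya\n ch'uqi\ja\wa .",
    r"Jis, jupa\n kisu\pa\wa .",
    r"Jis, juma\n yapu\ma\wa .",
    r"Jis, naya\n kanasta\ja\wa .",
    r"Jis, jupa\n apilla\pa\wa .",
    r"Jis, juma\n ch'uqi\ma\wa .",
]

def tokenize(phrase):
    """Split a phrase into words and morphemes on whitespace and backslashes."""
    return [tok for tok in re.split(r"[\s\\]+", phrase) if tok]

def induce_pattern(phrases):
    """Keep positions that never vary; turn varying positions into numbered slots."""
    rows = [tokenize(p) for p in phrases]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("phrases do not align position by position")
    return [column[0] if len(set(column)) == 1 else f"<{i}>"
            for i, column in enumerate(zip(*rows))]

def dependencies(phrases):
    """Find slot pairs (i, j) where the filler of slot i determines slot j."""
    rows = [tokenize(p) for p in phrases]
    slots = [i for i, col in enumerate(zip(*rows)) if len(set(col)) > 1]
    found = {}
    for i in slots:
        for j in slots:
            if i == j:
                continue
            mapping = {}
            # setdefault returns the stored value, so a clash shows up as inequality
            if all(mapping.setdefault(row[i], row[j]) == row[j] for row in rows):
                found[(i, j)] = mapping
    return found

pattern = induce_pattern(PHRASES)
constraints = dependencies(PHRASES)
```

On these six phrases the pronoun slot and the possessive suffix slot determine each other (naya with ja, jupa with pa, juma with ma), while the noun slot is free, which is the kind of agreement a person constraint would capture.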










Topic Analysis

These are clusters generated from analysis of annotated phrases from the Jaqi collection by using
Cross-collection latent Dirichlet allocation (ccLDA). Clusters automatically group related
words. For example, Topic 1 contains words describing people and places. Topic 3 contains
words describing time.
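
The P(x=0) values reported with each topic can be read as mixing weights. As a hedged sketch of the usual ccLDA formulation (an assumption here, since the model is not spelled out in this document), a word in a given topic is drawn from the shared "Common" distribution with probability P(x=0) and from the collection-specific distribution otherwise. The function name is hypothetical, and the two probabilities for "sutipax" are taken from the Topic 0 listing on the assumption that its first two word blocks are the Common and Aymara columns.

```python
def word_probability(word, p_common, common, specific):
    """Probability of `word` in one topic for one collection, mixing the
    shared distribution (weight p_common) with the collection-specific one."""
    return (p_common * common.get(word, 0.0)
            + (1.0 - p_common) * specific.get(word, 0.0))

# Values read from the Topic 0 listing (block assignment is an assumption).
common_topic0 = {"sutipax": 0.012}
aymara_topic0 = {"sutipax": 0.015}
p = word_probability("sutipax", 0.27, common_topic0, aymara_topic0)
```

With P(x=0) = 0.27, roughly a quarter of the probability mass for a word comes from the shared distribution and the rest from the collection's own distribution.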


Topic 0 - Common    Aymara P(x=0) = 0.27    Jaqaru P(x=0) = 0.15    Kawki P(x=0) = 0.16

Common:
uliwyan 0.017
rusintita 0.013
sutipax 0.012
qantuta 0.011
tiwanakunkiriwa. 0.009
mamani 0.009
kuna 0.009
suxta 0.009
nayan 0.008
mamanin 0.008
sirwisamp 0.007
ya. 0.007
nuwy 0.007
yuqapaw 0.007
ruwirtu 0.007
uliwya 0.006
niya 0.006
uliwyamp 0.006
sisku 0.005
tintat 0.005

Aymara:
jupax 0.115
niy 0.072
r. 0.025
utji. 0.023
markan 0.022
mamapax 0.019
sataw 0.015
sutipax 0.015
laja 0.014
rusintita 0.014
uliwya 0.013
sataw. 0.012
uliwyan 0.012
akat 0.011
tuktur 0.010
mamani 0.010
jinaru 0.009
nuwy 0.008
nayatak 0.008
ruwirtu 0.007

\u. 0.078
\na. 0.040
\?ta 0.040
\u? 0.040

Jaqaru:
C=vs 0.244
antz 0.024
C=simpre 0.018
qa 0.015
aka 0.015
manha 0.012
C=nada 0.012
ni 0.011
C=kulijyu 0.010
C=padri 0.009
C=n 0.009
novecientos 0.007
a 0.007
illk"kajttxi 0.007
k"uw 0.007
pakawshu. 0.007
C=todo 0.006
C=se 0.006
C=ladu 0.006
C=dr. 0.006

\taki. 0.071
\kam 0.041
\sana 0.041
\ru. 0.041
\ja 0.033
\ma> 0.003
\sn 0.003
\kasa 0.003
\yuy 0.003
\rqaya 0.003
\isn. 0.003
\n. 0.003
\nhna 0.003
\ata. 0.003
\yatx 0.003
\ush 0.003
\kama. 0.003
\shqa 0.003
\ill 0.003
\jam 0.003

Kawki:
C=spa 0.407
C=bal 0.014
kachuy 0.014
C=este 0.013
C=para 0.012
q 0.011
C=aa 0.010
w. 0.010
C=asensyon 0.009
C=komunero 0.009
C=ke 0.008
C=ser 0.008
C=mm 0.008
akisha 0.007
C=kaden 0.007
C=gobyern 0.007
C=fyest 0.006
C=contar 0.006
C=ador 0.006
C=ni 0.006

\psa. 0.039
\t"a. 0.039
\kamaya 0.039


Topic 1 - Common    Aymara P(x=0) = 0.23    Jaqaru P(x=0) = 0.13    Kawki P(x=0) = 0.17

Common:
suxta 0.012
patakan 0.011
m?pit 0.009
pusi 0.006
niya 0.006
waywaykaris 0.005
ruwirtur 0.005
maran 0.004
university 0.004
Ilatun 0.004
qillqa? 0.004
k"itirus 0.003
utji? 0.003
rusintitan 0.003
mamita 0.003
jina. 0.003
kimsaqallq 0.003
churi. 0.003
wuliwy 0.003

Aymara:
p? 0.105
tunka 0.083
tunk 0.072
waranq 0.066
kimsa 0.062
p"isqa 0.038
paqallq 0.025
patak 0.022
Ilatunk 0.015
t'ant 0.010
suxta 0.010
maran 0.010
1. 0.008
of 0.007
qallt?na. 0.006
mama. 0.004
k"itinakas 0.004
pusi 0.004
iyaw 0.004
kunas 0.003

\mp 0.066
\itasp 0.040
\p? 0.040
\na. 0.024

Jaqaru:
C=vs 0.275
C=a 0.197
C=spa 0.070
C=sino 0.017
C=padri 0.016
w 0.015
yacxi 0.008
manha 0.006
watqa 0.005
ps 0.005
C=unibirsidad 0.005
C=mm 0.004
ipi 0.003
C=tiligrama 0.003
novecientos 0.003
C=ba?a 0.003
C=sanmaraksu 0.003
C=veces 0.003
C=ii 0.003
eso 0.003

\ja. 0.063
\kam 0.041
\utm. 0.041
\sana 0.028
\pmina 0.028
\mama. 0.028
\sn 0.003
\ma> 0.003
\rqaya 0.003
\kasa 0.003
\kama. 0.003
\yuy 0.003
\ill 0.003
\isn. 0.003
\chi 0.003
\nhna 0.003
\n. 0.003
\yatx 0.003
\shqa 0.003
\ush 0.003

Kawki:
C=spa 0.062
C=vs 0.061
C=error 0.056
may 0.052
C=barios 0.034
C=kada 0.029
kachuy 0.026
C=mm 0.022
C=estamos 0.019
wa. 0.012
C=fundasy?n 0.010
C=ni 0.010
C=familyari 0.010
kastro 0.010
C=no 0.010
C=funda 0.009
C=para 0.009
C=presidenti 0.009
C=komuni 0.008
C=juyshu 0.008

\imama 0.065
\psa. 0.065
\jam 0.065











Topic 2 - Common    Aymara P(x=0) = 0.27    Jaqaru P(x=0) = 0.16    Kawki P(x=0) = 0.15

Common:
mama. 0.042
ya 0.015
ya. 0.010
c. 0.007
purini. 0.005
k"itis 0.005
mama 0.005
tukuykamax 0.004
alantati? 0.004
ukatx 0.004
sirwisamp 0.004
karmilu 0.004
maruj 0.003
kasiru. 0.003
0.003
t 0.003
janipuniw 0.003
awissraa 0.003
nayraqatamana. 0.003
sarakimay 0.003

Aymara:
aka 0.098
ch'iyar 0.045
ya 0.041
j. 0.037
akax 0.032
c. 0.029
y. 0.026
mama. 0.022
jis 0.018
qawq"a 0.017
tunka 0.015
mam 0.015
t'ant 0.012
tiyu. 0.011
k"itis 0.009
jumatak 0.008
tiyas 0.007
winus 0.007
1. 0.007
tata 0.007

\cha 0.041

Jaqaru:
C=spa 0.053
C=mas 0.042
C=vs 0.025
C=simpre 0.024
C=y 0.019
C=n 0.015
k"uw 0.015
antz 0.014
kapr 0.013
illk"kajttxi 0.013
q 0.012
wanwan 0.010
irina 0.010
sillitu?q. 0.010
C=sino 0.010
C=unibirsidad 0.010
uk 0.009
C=primir 0.009
mayqa 0.009
ya 0.008

\wshqa. 0.045
\sn 0.041
\pmina 0.028
\wshqa 0.028
\mama. 0.028
\ra. 0.024
\shqa 0.024
\war 0.003
\ma> 0.003
\rqaya 0.003
\kasa 0.003
\yuy 0.003
\sna 0.003
\ill 0.003
\isn. 0.003
\n. 0.003
\nhna 0.003
\yatx 0.003
\ush 0.003
\kama. 0.003

Kawki:
C=spa 0.430
C=que 0.029
C=masa 0.027
C=hay 0.016
C=a?o 0.015
C=no 0.011
C=pueblo 0.008
q 0.008
C=ser 0.008
irq 0.008
C=solamente 0.007
C=ador 0.006
?imaj 0.006
C=trabaj 0.006
C=nada 0.006
C=kaden 0.006
awt 0.006
C=mallas 0.006
C=katoliko 0.006
C=borracho 0.005

\n 0.064
\wa. 0.064
\p"a 0.050
\ish 0.047
\? 0.047
\iri 0.035
\uk" 0.020
\yaq 0.020
\yaq" 0.020


Topic 3 - Common    Aymara P(x=0) = 0.57    Jaqaru P(x=0) = 0.12    Kawki P(x=0) = 0.15

Common:
jani 0.746
iya 0.034
ampara 0.011
q"ipuru 0.006
waliki 0.005
maymara 0.004
anch"ita 0.003
masayp'u 0.003
sawaru 0.002
ch"armirja 0.002
ch'illa 0.002
manuyla 0.002
jinchu 0.001
wirnis 0.001
las-uchu 0.001
las-tusi 0.001
juywis 0.001
jayp'u 0.001
juywisa 0.001
siw. 0.001

Aymara:
waliki 0.116
j. 0.087
wirnisa 0.086
alay 0.073
pasir 0.052
inas 0.045
sawaru 0.022
jinchu 0.017
iya 0.017
anch"ita 0.008
1. 0.007
masayp'u 0.006
nayat 0.006
ch'ux?a 0.006
tunka 0.006
wimis 0.004
q"ipuru 0.003
sawar 0.003
maymara 0.003
mamanin 0.003

\w 0.971
\kama 0.009
rakii 0.006
\w. 0.002
\x? 0.002
\wjita 0.002

Jaqaru:
C=spa 0.225
C=vs 0.099
C=sino 0.025
C=mas 0.015
aka 0.014
k"uw 0.011
C=con 0.010
C=krus 0.008
C=primir 0.007
Iluqall 0.007
C=unibirsidad 0.006
C=dr. 0.006
ajtz'a 0.005
C=ut 0.005
C=plaswila 0.005
C=tupina. 0.005
bandida. 0.005
upaq 0.005
C=k"umuda 0.005
C=unibirsidadi 0.004

\w 0.238
\utm. 0.057
\ru. 0.052
\mama. 0.033
\rqaya 0.002
\sn 0.002
\ma> 0.002
\kasa 0.002
\yuy 0.002
\cxunhka 0.002
\kama. 0.002
\ill 0.002
\isn. 0.002
\n. 0.002
\nhna 0.002
\yatx 0.002
\chi 0.002
\shqa 0.002
\ush 0.002
\jam 0.002

Kawki:
C=spa 0.128
C=estamos 0.074
C=ashi 0.055
C=eks 0.048
C=eksprop 0.029
C=tam 0.021
C=propyando. 0.019
spa 0.015
q 0.013
t"a. 0.012
wa. 0.012
C=trabaj 0.011
C=fyest 0.011
C=que 0.010
erasmo 0.010
C=sinkoa?s 0.009
C=redusidu 0.009
am? 0.009
C=no 0.008
C=ultimaora 0.008

\t"a. 0.065
\imama 0.065
\psa. 0.038
\watx 0.038
\kasa 0.038


Topic 4 - Common    Aymara P(x=0) = 0.26    Jaqaru P(x=0) = 0.15    Kawki P(x=0) = 0.19

Common:
mark 0.112
qullq 0.087
uk 0.065
wan 0.023
kanast 0.019
qull 0.017
awt 0.017
making 0.017
t'ant' 0.016
kart 0.016
p"iry 0.016
wak 0.015
las 0.014
kustal 0.013
tint 0.013
nin 0.013
lapis 0.013
sirwis 0.011
kuchill 0.009
jum 0.008

Aymara:
ch'uq 0.184
iwis 0.045
yap 0.043
lijwan 0.037
kis 0.037
tunt 0.035
um 0.029
qullq 0.028
aymar 0.026
ut 0.025
nay 0.025
k'awn 0.025
away 0.024
ch'u? 0.022
sirwis 0.022
t'ant' 0.022
liwr 0.021
is 0.018
tunq 0.018
lapis 0.014

\? 0.927
\tuq 0.021
\t 0.019
\cha? 0.017
\st 0.005
\sti 0.002
\x 0.002

Jaqaru:
C=a 0.194
C=spa 0.123
ranh 0.039
C=mand 0.019
mata 0.009
manha 0.009
C=pues 0.009
C=salud 0.008
C=m 0.008
C=padri 0.008
C=unibirsidad 0.007
C=su 0.007
C=cuando 0.007
C=unclear 0.007
C=asta 0.007
C=nombre 0.006
C=tiligrama 0.006
C=negro 0.006
C=istadusunidus 0.005
C=masa 0.005

\ps 0.077
\ra 0.063
\m. 0.041
\ps. 0.021
\rqaya 0.003
\sn 0.003
\kama. 0.003
\ma> 0.003
\yuy 0.003
\kasa 0.003
\isn. 0.003
\nhna 0.003
\ata. 0.003
\n. 0.003
\yatx 0.003
\shqa 0.003
\ush 0.003
\ill 0.003
\cxunhka 0.003
\sana 0.003

Kawki:
qa 0.031
C=error 0.028
C=banjeliko 0.025
C=defenda 0.025
C=presidenti 0.023
C=vs 0.021
C=ke 0.017
C=eksprop 0.017
C=fyest 0.013
wa. 0.013
C=trabaj 0.012
C=mil 0.012
C=no 0.012
q 0.012
putinsa. 0.011
C=a? 0.011
C=komunidadi 0.011
q. 0.011
C=ya. 0.011
C=fundadora 0.009

\cha. 0.089
\? 0.064
\tna. 0.064
\psa. 0.047
\imama 0.047


Topic 5 - Common    Aymara P(x=0) = 0.26    Jaqaru P(x=0) = 0.13    Kawki P(x=0) = 0.18

Common:
ya. 0.019
alasi. 0.013
mamax 0.012
mamarux 0.012
apustapxatayna. 0.008
sari 0.008
kutininiw. 0.007
miliku 0.006
si 0.006
ala? 0.006
words 0.006
0.006
jan 0.005
sari. 0.005
with 0.005
satayna 0.005
jisk'a 0.005
munta> 0.004
munt> 0.004
qutu 0.004

Aymara:
ukat 0.088
uka 0.047
sasaw 0.046
siwa. 0.043
tatax 0.036
kuns 0.026
aljir 0.023
mamax 0.015
p"iryan 0.011
r. 0.009
p. 0.008
sari. 0.008
y. 0.008
tuktur 0.008
tourist 0.008
ak"am 0.008
1. 0.008
ch'uqimp 0.007
ala? 0.007
0.007

\na. 0.041
\p? 0.041

Jaqaru:
C=spa 0.196
C=error 0.118
C=vs 0.109
C=i 0.025
con 0.011
C=mand 0.010
antz 0.009
las 0.006
C=mak 0.006
a 0.005
ajtz'a 0.005
wijchiwt"a 0.004
janhq'u 0.004
C=padri 0.004
C=mi 0.004
C=unibirsidadi 0.004
illk"kajttxi 0.004
watqa 0.004
ru. 0.004
C=nada 0.004

\n 0.041
\p"a. 0.041
\taki. 0.041
\ch 0.041
\pmina 0.041
\ma> 0.003
\sn 0.003
\kasa 0.003
\kama. 0.003
\yuy 0.003
\rqaya 0.003
\isn. 0.003
\nhna 0.003
\ata. 0.003
\n. 0.003
\yatx 0.003
\shqa 0.003
\ush 0.003
\ill 0.003
\cxunhka 0.003

Kawki:
akish 0.122
C=vs 0.060
C=barios 0.030
C=pueblo 0.025
pachi 0.021
kanchna 0.017
akisha 0.016
C=a?o 0.015
C=que 0.015
C=i 0.014
C=defenda 0.014
C=para 0.014
C=famiri 0.013
kamsh 0.010
am? 0.009
kapill 0.009
C=a? 0.009
erasmo 0.008
C=atx 0.008
C=kaden 0.008

\qa 0.295
\kama 0.235
\kamaya. 0.079
\ya. 0.079
\kamya 0.055
\wa 0.055
\jam 0.023
\j. 0.012
\mama 0.011
\tna. 0.011
\kamaya 0.011


Topic 6 Ayarama P(x=0) =
oc 6 Arma Jaqaru P(x=0) = 0.15 Kawki P(x=0) = 0.18
Common 0.25
quyx 0.016 kunarus \cha C=spa 0.166 \na. may 0.107 \t". 0.065
kisu 0.008 0.107 0.065 q 0.045 0.063 C=error 0.086 \isna
imilla 0.008 aparap?ma \cha? aka 0.033 \p"a. C=spa 0.085 0.065
jach'a 0.008 0.076 0.062 ni 0.027 0.052 C=trabaj 0.026
misa 0.008 (___ x \?ani. C=masa \pmina C=masa 0.023
kuka 0.006 0.051 0.032 0.016 0.041 na. 0.019
qalltat 0.006 kawkirus \xa C=error \qa. C=gana 0.014













ch'uqx 0.005
uka? 0.004
mamita 0.004
k'awnx 0.004
jiwasaw 0.003
chachax 0.003
rusintitan 0.003
mayt'awayita.
0.003
j. 0.003
uk 0.003
waywaykarisam
pin 0.003
k"itirak 0.003
panqara 0.003


0.044
chur?ma
0.038
aka 0.031
t. 0.020
kanasta
0.019
p. 0.017
jisa 0.015
kisx 0.014
t'ant 0.012
aparap?ma?
0.011
qawq"a
0.009
uka 0.009
iyaw 0.008
uk"amax
0.007
quyx 0.006
k'awnx
0.006
jupatak
0.005


0.032
\wa
0.027
\na.
0.027


0.015
irina 0.013
C=ya 0.011
C=padri
0.010
C=mas 0.009
C=su 0.008
C=siga 0.007
C=iskuyl
0.006
ya. 0.006
C=n 0.006
illk"kajttxi
0.005
C=mand
0.005
C=dimas
0.005
C=unibirsidad 0.005
C=negro
0.005


0.041
\ma.
0.027
\nh
0.027
\taki.
0.027
\nushu.
0.027
\mn
0.027
\sn
0.015
\p"l
0.015
\ill 0.002
\kama.
0.002
\cxunhka 0.002
\shqa
0.002
\isn.
0.002
\yuy
0.002
\ush
0.002
\ma>
0.002
\rqaya
0.002


C=pedasu 0.014
C=fyest 0.011
erasmo 0.011
C=a?o 0.011
C=mas 0.010
C=kashuypsa 0.009
C=made 0.009
C=desaparese
0.009
C=mm 0.008
C=del 0.008
C=asensyon 0.008
C=ke 0.008
"ama 0.008


Topic 7 - Common
Ayarama P(x=0) = 0.27, Jaqaru P(x=0) = 0.14, Kawki P(x=0) = 0.16
kunj?ms 0.014 taqi 0.054 \qat C=i 0.170 \na. 0.063 C=spa 0.339 \tna. 0.105
walja 0.012 mama 0.102 C=vs 0.152 \mn 0.041 qa 0.030 \kisa 0.105
ukanakxat 0.012 0.030 \w; may 0.044 \shqa C=que 0.025 \isna 0.046
jaqix 0.011 ukax 0.029 0.052 C=spa 0.041 C=ser 0.017 \t"a. 0.046
um 0.009 jaqix 0.029 \sp 0.025 \nushu. C=hay 0.017 \mama 0.046
kuna 0.008 yatxati. 0.040 ni 0.021 0.041 uw 0.014 \j. 0.046
sarnaqawxata 0.028 \it?tu C=ya 0.015 \rqaya C=no 0.011
0.007 ukaw 0.027 C=simpre 0.003 kachuy. 0.011
itnulujiyax 0.006 0.024 \ist?sta 0.012 \sn 0.003 C=pero 0.009
kawkits 0.006 kunas .0.027 wanwan \kasa C=token 0.008













u??spach?na
0.005
waw?kans 0.005
jan 0.005
yatxatiw 0.005
with 0.004
q"ar?rux 0.004
arunx 0.004
sissnaw 0.004
janiraki. 0.004
tiknulujixata
0.004
rilijiyunxata
0.004


0.024
nayra
0.021
jan 0.021
ukxat
0.018
yatxata??kis 0.015
ukanakxat
0.013
ar 0.013
kunjams
0.011
tata 0.010
amay
0.008
jaqin 0.006
jupan
0.006
yati??kis
0.005
itnulujiyax
0.005


\cha?
0.027


0.011
C=m 0.010
C=se 0.009
w 0.008
C=venir
0.007
C=no 0.007
C=a 0.007
C=siga
0.007
ridund
0.006
antis 0.006
C=salud
0.006
irina 0.005
C=kulijyu
0.005


0.003
\yuy 0.003
\war 0.003
\kama.
0.003
\ma>
0.003
\ill 0.003
\n. 0.003
\nhna
0.003
\yatx
0.003
\ush 0.003
\isn. 0.003
\chi 0.003
\cxunhka
0.003
\jam 0.003


C=atx 0.008
C=porke 0.008
C=a?o 0.008
C=la 0.008
erasmo 0.008
C=ni 0.008
C=may 0.007
C=eks 0.007
C=toda 0.007
marka 0.006


Topic 8 - Common
Ayarama P(x=0) = 0.26, Jaqaru P(x=0) = 0.15, Kawki P(x=0) = 0.15
ya 0.019 m. 0.172 \ut C=spa 0.318 \j. C=spa 0.436 \p"a
kat 0.011 tata 0.051 0.271 trump 0.021 0.391 C=a?o 0.027 0.318
niya 0.011 a. 0.040 \utu. estadosunidos \ja. C=katoliko 0.023 \yaq"
nayan 0.010 t. 0.037 0.226 0.019 0.248 wa. 0.020 0.107
us 0.008 v. 0.035 \u? wanwan 0.016 \na. C=krey 0.013 \uk"
tata. 0.008 mama 0.174 q 0.014 0.046 C=a?os 0.013 0.107
suma 0.008 0.029 \tam? C=du?a 0.012 \kas C=dyosi 0.012 \yaq
t'uk 0.008 us 0.029 0.094 ps 0.011 0.016 C=defenda 0.010 0.107
kimsa 0.007 uk"amax \ur C=unclear \"a. spa 0.009 \txi
t'ant'x 0.005 0.014 0.076 0.010 0.016 C=pueblo 0.009 0.043
k"a 0.005 t'ant 0.012 \xat kilikill 0.010 \mn C=obras 0.009 \t"a.
aparapita 0.005 m?pit 0.017 C=unibirsidadi 0.013 C=milnobesyentos 0.019
mama? 0.005 0.010 \p? 0.009 \nushu. 0.008 \jam
apustapxatayna. winus 0.007 C=a= 0.008 0.012 C=kambya 0.008 0.019
0.004 0.010 C=mas 0.007 C=no 0.008
taykax 0.004 ch'iyar C=manda ?imaj 0.007
sar?. 0.004 0.009 0.007 C=domini 0.007
.0.004 kat 0.009 patarwayll C=todo 0.007
markar 0.003 k"itis 0.006 w 0.006













janipuniw 0.003
kawkits 0.003


0.006
tiyas 0.006
t'uk 0.005
jich"ast
0.004
tata. 0.004
suma
0.004
linkwistik
0.004


C=ladu 0.006
C=yatx 0.006
C=isti 0.005
midy 0.005
shurur 0.005
C=abandon
0.005


C=ni 0.006
C=ser 0.006


Topic 9 - Common
Ayarama P(x=0) = 0.25, Jaqaru P(x=0) = 0.14, Kawki P(x=0) = 0.14


jum 0.012
parlasipxi 0.008
ukch'akiw 0.008
k? 0.005
wal 0.005
niy 0.005
waywaykaris
0.005
taykax 0.004
utapar 0.004
anak?ta. 0.004
munta. 0.004
arunakxat 0.004
sasina. 0.004
jilapan 0.004
qutu 0.004
misa 0.003
aymarat 0.003
fn 0.003
mayinini? 0.003
tumati 0.003


jis 0.346
qawq"a
0.032
jumar
0.018
m. 0.015
tuktur
0.010
jinaru
0.010
y. 0.008
mama
0.008
tata 0.007
janit
0.007
kuna
0.006
janiw
0.005
jupar
0.005
jiwasax
0.004
sasaw
0.004
p. 0.004
miliku
0.004
niy?
0.004
juwana


\p?
0.041
\?ta
0.041
\cha
0.028
\cha?
0.028
\sp
0.028


may 0.057
C=m 0.039
C=ya 0.038
C=vs 0.026
kapr 0.021
C=midya
0.013
wanwan 0.012
C=a= 0.012
ru. 0.011
antz 0.010
C=unibirsidad
0.008
C=du?a 0.008
purumut 0.008
nistram 0.008
watqa 0.008
C=algo 0.007
C=a?u 0.007
tiya. 0.007
tiya 0.007
C=al 0.007


\ra.
0.104
\taki.
0.066
\nha
0.052
\na.
0.046
\"a.
0.032
\wa.
0.027
\psa
0.027
\n
0.017
\sana
0.017
\sn
0.002
\nhna
0.002
\rqaya
0.002
\yuy
0.002
\kasa
0.002
\ill
0.002
\shqa
0.002


C=spa 0.455
qa 0.041
may 0.023
C=kada 0.021
C=a?o 0.013
C=ke 0.011
C=unido 0.010
C=all? 0.009
C=juysh 0.009
irq 0.009
C=entusyasmastu
0.008
C=apm 0.008
C=todo 0.008
C=nuestra 0.008
sa. 0.008
C=nada 0.007
uk"am 0.006
C=jwisyo 0.006
C=fyest 0.006
C=funda 0.006


\tna. 0.064
\kamaya
0.047
\isna
0.047
\? 0.047
\taki.
0.047













0.003 \yatx
p"iry 0.002
0.003 \n.
0.002
\kama.
0.002
\ush
0.002


Topic 10 - Common
Ayarama P(x=0) = 0.22, Jaqaru P(x=0) = 0.14, Kawki P(x=0) = 0.15


wirnisaw 0.016
kunapachas
0.008
yatiqapxa?apa.
0.007
utapar 0.006
qutu 0.005
qala 0.005
miliku 0.005
sistati? 0.004
susanat 0.004
k"itirus 0.004
aliq 0.004
k"ay 0.004
tintan 0.004
churani? 0.004
jiwaya?ataki.
0.003
apam. 0.003
qull 0.003
( )r 0.003
t"uqu? 0.003
sasina. 0.003


jupar 0.178 \kam
jumax 0.070
0.156 \ss
jan 0.110 0.040
jiwasax \p?
0.033 0.033
kunapachas \na.
0.024 0.033
jupaw \spa.
0.009 0.033
ch'uq 0.008
1. 0.007
sasaw 0.007
nayatak
0.006
suxta 0.005
akan 0.005
utapar
0.005
churani?
0.004
susanar
0.004
tiyu. 0.003
jumatak
0.003
jupatak
0.003
jiwasat
0.003


C=error
0.104
C=spa 0.101
aka 0.034
C=unclear
0.016
C=vs 0.015
ary 0.013
C=ya 0.012
C=doctor
0.011
C=sino
0.011
manha 0.011
C=con 0.011
C=mas
0.011
C=kunfsi
0.007
C=a?u 0.007
C=salud
0.007
C=tupina.
0.006
C=masa
0.006
C=se 0.006
C=dimas
0.006
C=manda
0.006


\wshqa.
0.141
\q 0.041
\ra. 0.041
\na. 0.039
\m. 0.032
\taki.
0.032
\kas
0.023
\psa
0.021
\p"a.
0.021
\sn 0.021
\rqaya
0.002
\kasa
0.002
\yuy
0.002
\isn.
0.002
\yatx
0.002
\n. 0.002
\nhna
0.002
\ush
0.002
\shqa
0.002
\ill 0.002


C=spa 0.333
C=este 0.043
C=barios 0.036
C=vs 0.030
pachi 0.022
t"a. 0.019
C=lleno 0.015
C=awtoridadi
0.014
C=mm 0.011
C=trabaj 0.010
C=aburridos
0.010
wa. 0.009
C=presidenti
0.009
C=ador 0.008
C=pueblo 0.008
C=gobyem 0.007
C=porke 0.007
C=no 0.007
C=karmaja 0.006
sa. 0.006


\kamaya
0.124
\psa. 0.065










Ayarama P(x=0) = 0.11, Jaqaru P(x=0) = 0.27, Kawki P(x=0) = 0.17


waka 0.027
uta 0.027
aka 0.027
jaqi 0.026
mama 0.022
ujt 0.018
wanwani 0.016
sipsa 0.016
yak 0.016
papa 0.014
upa 0.014
um 0.013
jak 0.012
marka 0.010
qayll 0.009
ap 0.008
shutx 0.007
ir 0.007
ant' 0.007
wat 0.006


jis 0.342
wali 0.052
arux 0.014
a. 0.012
jaqi 0.010
y?mas
0.010
awtut
0.009
uk"am
0.008
jumaw
0.008
1. 0.007
aruk 0.007
suxta 0.005
sarasin
0.005
tata. 0.005
kunas
0.004
sutiyawit
0.004
puryikturux 0.004
maystru
0.004
u. 0.004
ch'amawa.
0.004


\ur 0.052
\tam? 0.045
\na. 0.040
\sp 0.032
\ista. 0.032
\tayna 0.027
\xat 0.024
\itasp 0.024


sa 0.071
uk 0.058
ut 0.046
uka 0.041
ma 0.040
na 0.033
isha 0.030
may 0.021
ill 0.015
ak 0.013
ik" 0.012
manh 0.012
yatx 0.012
amru 0.011
ary 0.011
jaq 0.010
pur 0.010
tata 0.009
mark 0.009
nur 0.008


\w 0.111
\k 0.072
\? 0.055
\qa 0.054
\q 0.050
\n 0.040
\i 0.034
\t" 0.027
\nh 0.023
\na 0.020
\wa 0.020
\t"a 0.017
\shu 0.016
\p" 0.014
\aj 0.014
\q" 0.014
\s 0.012
\cha 0.011
\jal 0.011
\r 0.010


C=i 0.086
C=unclear 0.077
C=masa 0.045
C=para 0.035
marka 0.026
C=planu 0.025
C=seysi 0.023
C=estarem 0.019
pachi 0.019
C=sinkoa?s 0.018
C=esta 0.018
kachuy 0.016
C=estamos 0.016
w 0.014
C=iste 0.014
C=lleno 0.014
C=komo 0.014


C=ekspropya...pag...es
tes 0.014
pachi. 0.012
C=barios 0.012


Topic 12 - Common
Ayarama P(x=0) = 0.25, Jaqaru P(x=0) = 0.13, Kawki P(x=0) = 0.12
ya. 0.009 k"? 0.227 \cha? C=spa \qa. 0.041 C=de 0.156 \isna 0.066
umar 0.008 jan 0.071 0.095 0.164 \ma> C=a 0.104
jan 0.007 uk"am \p? C=error 0.041 C=si 0.065
nayan 0.006 0.037 0.074 0.145 \kama. C=me 0.057
ach'a 0.006 uka 0.033 \kam C=vs 0.044 0.028 C=puede 0.057
tatanak 0.005 v. 0.024 0.051 aka 0.021 \p"a 0.028 C=del 0.042
mama? 0.005 kunas \xat irina 0.016 \sn 0.028 C=ver 0.039
manuyl 0.004 0.011 0.044 kapr 0.016 \yuy 0.003 C=la 0.039
mama 0.004 a 0.010 \sp C=m 0.015 \ush 0.003 C=contar 0.028
chachax 0.004 k. 0.009 0.032 wanwan \yatx C=cosechas 0.020


Topic 11 - Common


\t". 0.123
\kamaya 0.064
\txi 0.064
\imama 0.064













pamparu. 0.004
ch'iyar-imilla
0.004
k"a 0.004
jisa. 0.004
yuqapaw 0.003
qawq"asa? 0.003
jutapxan 0.003
jinchu 0.003
arusamp 0.003
wiyaja 0.003


jupaw
0.008
arus
0.007
t'ant'
0.005
sum
0.005
jis 0.005
nayan
0.005
ch'ux?a
0.004
ak"am
0.004
puri
0.004
juwana
0.004
jich"ast
0.004
jupan
0.003


\wa.
0.032
\itasp
0.027
\ss
0.023


0.012
C=mas
0.010
C=manda
0.008
dimas 0.007
C=a?u
0.007
nha 0.007
C=mayr
0.007
C=dimas
0.007
C=uka
0.006
uka 0.006
C=unclear
0.006
nilla 0.006
C=cut
0.005


0.003
\cxunhka
0.003
\jam 0.003
\kasa
0.003
\rqaya
0.003
\isn. 0.003
\ill 0.003
\nhna
0.003
\n. 0.003
\shqa
0.003
\al 0.003
\sa 0.003
\sana
0.003


C=aqui. 0.020
C=las 0.020
C=y 0.016
C=contrar 0.016
C=ah 0.015
C=se 0.014
C=asensyon 0.014
C=(aa) 0.009
C=kreasyon 0.009
w. 0.007


Topic 13 - Common
Ayarama P(x=0) = 0.27, Jaqaru P(x=0) = 0.14, Kawki P(x=0) = 0.15
ar 0.008 aymar \? 0.077 ni 0.149 \wshqa. C=este 0.217 \qa. 0.330
paqallq 0.007 0.131 \ss C=spa 0.077 C=mas 0.090 \watx 0.241
isa. 0.007 jisa 0.082 0.065 0.112 \na. 0.063 antis 0.057 \am? 0.104
jupaw 0.006 jich"ax \nuk C=vs 0.078 \ja. 0.063 am? 0.049 \war 0.052
uk"am 0.006 0.050 0.040 C=error \"a. 0.063 kachuy 0.044 \"maya
ukax 0.006 p. 0.022 \p? 0.072 \mn. 0.041 p" 0.041 0.052
imill 0.005 ch'ux?a 0.027 C=kasara \kasa war 0.025 \uk 0.052
qillqa? 0.005 0.013 \itasp 0.010 0.003 C=del 0.010 \kasa 0.034
jaqaru 0.005 v. 0.011 0.027 ru. 0.009 \rqaya C=banjeliko \psa. 0.015
mama? 0.005 k. 0.011 C=kuyd 0.003 0.010 \kisa 0.013
uk 0.005 kunas 0.009 \yuy 0.003 sant 0.010 \p" 0.011
linkwistik 0.005 0.009 ranh 0.009 \nhna C=bal 0.008 \wa. 0.011
yatiqa?ani. 0.005 iyaw C=manda 0.003 putinsa. 0.008 \taki. 0.009
the 0.004 0.009 0.007 \cxunhka uk" 0.008
at 0.004 qawq"a C=y 0.007 0.003 jiwsa 0.008
parltan. 0.004 0.008 a 0.006 \kama. C=asensyon
wali 0.004 winus C=negro 0.003 0.008
waywaykarisam 0.007 0.006 \sn 0.003 t"a. 0.008
pin 0.004 tiyas jir 0.006 \ill 0.003 C=enemigos













yatiqapxa?apa.
0.004
sutimaxa? 0.003


0.007
ch'iyar
0.006
uk"am
0.005
m. 0.005
jum 0.005
jiwasar
0.005
p"iry
0.005
jupaw
0.005
nayaw
0.005


C=mak
0.006
C=nada
0.005
C=dimas
0.005
purumut
0.005
sarasara
0.005
pajsh 0.005
janhq'u
0.005


\shqa
0.003
\ush 0.003
\n. 0.003
\isn. 0.003
\yatx 0.003
\jam 0.003
\ma>
0.003


0.006
qa 0.006
C=ay 0.006
C=eksprop 0.006


Topic 14 - Common
Ayarama P(x=0) = 0.26, Jaqaru P(x=0) = 0.15, Kawki P(x=0) = 0.14


uk"amax 0.030
janit 0.019
tata. 0.008
t 0.007
niya 0.007
jumar 0.006
s. 0.006
q"ar?rux 0.005
t'ant' 0.005
t"uq 0.005
janjaw 0.005
k"itinakas 0.005
k"a 0.004
1. 0.004
asukarampx
0.004
university 0.004
pirqa 0.004
p"iriyus 0.004
alj anirapit?taw>
0.004
yaq"a 0.004


t. 0.136
w. 0.099
tata 0.084
a 0.055
ach 0.022
mama
0.020
jich"ax
0.015
jiwasax
0.011
k. 0.010
qawq"a
0.010
arus 0.009
walikiw
0.009
walikiw.
0.008
uk"amax
0.008
jich"ast
0.006
kuns 0.006
ak"am
0.006
jisa 0.005


\u 0.549
\u. 0.212


C=spa 0.101 \wshqa.
C=a 0.074 0.078
C=error \p"a.
0.065 0.045
C=m 0.036 \ru. 0.041
may 0.017 \pmina
C=ya 0.015 0.024
wanwan \utm.
0.014 0.024
C=doctor \sn 0.003
0.011 \ma>
C=dimas 0.003
0.009 \rqaya
C=a?u 0.009 0.003
a 0.008 \yuy
illk"kajttxi 0.003
0.008 \kasa
C=primir 0.003
0.008 \ill 0.003
kapr 0.008 \ush
C=llama 0.003
0.007 \n. 0.003
pajsh 0.006 \nhna
C=manda 0.003
0.006 \yatx
cxunhka 0.003
0.006 \kama.


C=spa 0.533
C=masa 0.019
wa. 0.012
C=que 0.010
t"a. 0.008
q 0.008
C=nwestra 0.008
C=porke 0.008
uw 0.006
C=atx 0.006
sirio 0.006
C=deskap 0.006
C=pero 0.006
C=ratu 0.004
C=ba 0.004
q. 0.004
C=propyando. 0.004
C=ayt 0.004
C=dyosi 0.004
C=ombre 0.004


\t". 0.124
\wa 0.065













t"uq 0.005 yacxi 0.006 0.003
p"iry 0.004 C=simpre \shqa
0.006 0.003
\isn.
0.003
\jam
0.003
\sana
0.003


Topic 15 - Common
Ayarama P(x=0) = 0.25, Jaqaru P(x=0) = 0.13, Kawki P(x=0) = 0.16


kuna 0.018
jisk'a 0.006
mama 0.005
alasi. 0.005
satayna 0.005
sisku 0.005
jum 0.005
ruwirtur 0.005
ya. 0.005
sark? 0.004
juwana 0.004
k"itits 0.004
kast 0.004
Ilatun 0.004
u?ja? 0.004
antrupuluj?ya
0.004
qalltat 0.003
k"itirus 0.003
puri 0.003
sapa 0.003


m? 0.267 \wa.
rat 0.056 0.068
t'ant 0.022 \sp
r. 0.017 0.049
k"itis 0.015 \cha?
v. 0.011 0.040
niy 0.010 \na.
jis 0.010 0.040
uk"am \itasp
0.008 0.032
p. 0.007 \?ani.
jich"ax 0.019
0.007
nayan
0.007
linkwistax
0.005
iskinan
0.005
jumaw
0.005
uka 0.005
paqallq
0.005
nayaw
0.005
pusi 0.004
m?pit
0.004


C=spa
0.359
antz 0.019
irina 0.016
C=ya 0.013
k"uw 0.013
kapr 0.011
manha
0.010
C=sino
0.008
purumut
0.008
C=a= 0.008
C=error
0.008
C=unclear
0.007
C=klas
0.007
C=kunfsi
0.006
C=cosas
0.006
C=y 0.006
C=mil
0.005
ajtz' 0.005
paqawshu.
0.005
yacxi 0.005


\psa 0.078
\p"a 0.053
\nushu.
0.028
\rqaya
0.003
\kasa
0.003
\yuy 0.003
\cxunhka
0.003
\kama.
0.003
\sn 0.003
\ma>
0.003
\n. 0.003
\nhna
0.003
\ata. 0.003
\yatx
0.003
\ush 0.003
\ill 0.003
\isn. 0.003
\shqa
0.003
\jam 0.003
\shq 0.003


C=spa 0.190
C=kada 0.128
uk" 0.039
qa 0.029
awt 0.027
C=vs 0.019
C=fyest 0.017
C=biyenunidu
0.014
kapill 0.012
q 0.011
C=juyshu 0.010
yaq 0.010
C=kuynt 0.010
C=trabaj 0.009
C=ador 0.009
C=presidenti
0.008
alkil 0.008
este 0.008
C=bastan 0.008
C=defenda 0.007


\iri 0.296
\p" 0.167
\isna 0.098
\taki. 0.061
\kamaya
0.034
\t". 0.019
\kisa 0.019
\jam 0.019
\t"a. 0.019
\j. 0.019










Ayarama P(x=0) = 0.23, Jaqaru P(x=0) = 0.11, Kawki P(x=0) = 0.12


uka 0.282
ch 0.018
wal 0.009
w 0.009
k" 0.007
masa 0.006
pampa 0.005
kawki 0.005
jak 0.005
C=bariosa?u
0.004
sh 0.004
jayllta 0.004
kachyu 0.004
C=komunero
0.004
C=pedaso 0.004
C=sino 0.003
nir 0.003
taki 0.003
C=pur 0.003
C=komunidad
0.003


jis 0.403
mama
0.042
t'ant 0.011
sum 0.009
uk"am
0.009
k"itis
0.008
k. 0.007
kuna 0.006
tata. 0.005
jumaw
0.004
wali 0.004
aka 0.004
sarasin
0.003
jaqin 0.003
taykax
0.003
rusintitan
0.003
k"itinakas
0.003
walikiw.
0.003
pataka
0.003
kunatakis
0.003


\xay 0.345
\aka 0.119
\p? 0.084
\na? 0.038
\ss 0.022
\pin 0.016
\xa? 0.014
\spa? 0.014
\itasp 0.014
\kam 0.014


C=de 0.144
C=si 0.078
C=que
0.066
C=ver
0.062
C=cuentas
0.044
C=me
0.044
C=las 0.044
C=ahora
0.038
C=la 0.034
C=como
0.029
C=lo 0.028
C=nos
0.022
C=todo
0.017
C=puedes
0.014
C=a 0.013
C=y 0.012
C=se 0.009
C=ese
0.008
C=a?os
0.006
C=del
0.006


\nushu.
0.080
\utm.
0.041
\m. 0.041
\sn 0.003
\rqaya
0.003
\yuy 0.003
\cxunhka
0.003
\kama.
0.003
\ma>
0.003
\kasa
0.003
\shqa
0.003
\nhna
0.003
\ata. 0.003
\n. 0.003
\yatx
0.003
\isn. 0.003
\ush 0.003
\ill 0.003
\jam 0.003
\sana
0.003


uk 0.097
asha 0.036
ut 0.033
isha 0.028
sa 0.028
marka 0.026
aka 0.025
ak 0.023
na 0.021
jaqi 0.020
nur 0.020
ma 0.019
maya 0.018
mark 0.017
al 0.017
uwa 0.016
n 0.015
s 0.015
qallya 0.014
ni 0.014


Topic 17 - Common
Ayarama P(x=0) = 0.52, Jaqaru P(x=0) = 0.11, Kawki P(x=0) = 0.16
jupa 0.126 juma \x 0.105 C=spa 0.263 \wshqa. C=error 0.184 \mama
naya 0.088 0.084 \s 0.055 C=vs 0.074 0.042 pachi 0.072 0.247
jiwasa 0.039 uka 0.040 \t 0.047 k"uw 0.016 \sn C=lleno 0.054 \mam
apa 0.035 kuna \n 0.047 C=kuyd 0.013 0.004 atxma 0.034 0.154
ala 0.032 0.027 \r 0.046 C=como \ma> C=masa 0.032 \w 0.154
sara 0.028 aka 0.026 \w 0.011 0.004 C=spa 0.021 \q. 0.108
k"iti 0.024 mama 0.038 janhq'u 0.011 \rqaya pachi. 0.020 \wa. 0.028
sar 0.021 0.024 \ni kapr 0.011 0.004 C=porke 0.016 \kun 0.024


Topic 16 - Common


\w 0.135
\k 0.094
\qa 0.078
\i 0.070
\q 0.061
\wa 0.041
\n 0.037
\tna 0.032
\t 0.027
\? 0.024
\t" 0.023
\uk 0.020
\cha 0.019
\tn 0.015
\ata 0.013
\s 0.012
\q" 0.012
\na 0.010
\kas 0.010
\sha 0.010













suti 0.021
kunapacha 0.021
ap 0.020
s 0.019
alja 0.018
chur 0.016
jila 0.012
yapu 0.011
mun 0.011
kawk"a 0.011
uta 0.010
mayi 0.009


utj 0.022
chura
0.021
kun
0.016
yati
0.015
uta 0.015
alj 0.015
wawa
0.014
tata 0.014
sa 0.012
jich"a
0.012
ch'uqi
0.012
juta
0.010
sis 0.009
tinta
0.009
sara
0.009


0.034
\k 0.031
\i 0.029
\? 0.026
\ta
0.022
\ti 0.021
\wa
0.017
\wa.
0.016
\ru
0.016
\ka
0.016
\p 0.015
\rap
0.015
\pa
0.015
\ja
0.013


a 0.010
q 0.009
C=simpre
0.009
C=primir
0.007
C=gwadalupi
0.007
novecientos
0.006
C=olvid?
0.006
C=dimas
0.006
C=a?u 0.006
C=unibirsidadi 0.006
manha 0.006
C=shimprn
0.005
ru. 0.005


\kasa
0.004
\kama.
0.004
\yuy
0.004
\ill
0.004
\isn.
0.004
\nhna
0.004
\ata.
0.004
\u?a
0.004
\n.
0.004
\yatx
0.004
\shqa
0.004
\ush
0.004
\cxunhka 0.004
\jam
0.004
\sa
0.004
\sana
0.004


C=awtoridadi
0.015
pach 0.015
q. 0.014
atx 0.012
C=banjeliko 0.011
C=k 0.010
C=ver 0.010
wa. 0.010
C=ayt 0.010
spa 0.009
C=made 0.009
?imaj 0.009


Topic 18 - Common
Ayarama P(x=0) = 0.25, Jaqaru P(x=0) = 0.14, Kawki P(x=0) = 0.16
wiyaja 0.023 janiw 0.300 \?ta? C=tu 0.057 \ch 0.041 C=spa 0.468 \kisa 0.066
umar 0.018 nayax 0.139 0.17 C=con \nushu. C=error 0.033
um 0.016 nayar 0.098 7 0.053 0.041 qa 0.027
jupat 0.011 jumat 0.035 \ss C=cuando \utm. ?imaj 0.012
utan 0.010 t'ant' 0.016 0.04 0.047 0.041 akisha 0.010
iskinan 0.008 iskinan 0 C=pap? \ja. 0.041 uw 0.010
tuktur 0.006 0.016 \cha 0.044 \ra. 0.041 C=presidenti 0.010
uka 0.006 jumax 0.011 ? C=no 0.030 \rqaya C=defenda 0.010
sara?axaw. 0.006 awtut 0.011 0.02 C=a= 0.022 0.003 tx 0.009
ch'iyar 0.005 utan 0.007 3 C=le?a. \yuy 0.003 C=para 0.008


\q" 0.019
\jam 0.010
\kat 0.010
\qa 0.010
\kam 0.010
\ta 0.010
\imna
0.010
\m 0.010













p"iryan 0.005
akar 0.005
nayan 0.005
kuna 0.004
sapa 0.004
apkit?tati. 0.004
tunt 0.004
jach'a 0.004
jisus 0.004
jisk'a 0.004


jumatak
0.006
marsilan
0.005
walikiw
0.004
p"iryan
0.003
justinar
0.002
jupan 0.002
jiskt'kit?tati.
0.002
yatiqapxa?apa. 0.002
qullq 0.002
mam 0.002
aparapita
0.002


\nuk
0.02
3
\st
0.01
9
\y
0.01
9
\x?
0.01
9


0.019
C=cargan
0.016
C=famoso
0.016
C=salud)
0.015
C=hizo
0.015
C=(salud
0.015
C=quer?as
0.015
C=castigo
0.014
C=ese
0.012
C=te 0.012
C=eras
0.011
antz' 0.010
cxunhka
0.008
ajtz' 0.008


\cxunhka
0.003
\kama.
0.003
\sn 0.003
\kasa
0.003
\isn. 0.003
\n. 0.003
\nhna
0.003
\yatx
0.003
\ush 0.003
\ill 0.003
\shqa
0.003
\ma>
0.003
\shq 0.003


C=porke 0.008
C=uno 0.008
C=(aa) 0.008
C=mil 0.006
C=mallas 0.006
C=mayu 0.006
C=bajo 0.006
ishaw 0.006
C=funda 0.006
C=may 0.005


Topic 19 - Common
Ayarama P(x=0) = 0.26, Jaqaru P(x=0) = 0.14, Kawki P(x=0) = 0.17
. 0.012 a. 0.127 \? 0.062 C=spa 0.238 \pmina may 0.079 \cha.
nayaw 0.011 nayat \sp C=error 0.072 C=spa 0.069 0.364
alj?ta? 0.006 0.077 0.040 0.097 \utm. C=este 0.037 \p" 0.061
akan 0.006 jumar \?ta. wanwan 0.041 karmaja 0.031 \tna. 0.048
asukara 0.005 0.074 0.040 0.012 \wshqa. uk" 0.023 \t"a. 0.041
maystru 0.005 m. 0.033 \itasp C=ya 0.012 0.041 C=error 0.018 \taki.
jupat 0.005 k. 0.021 0.040 manha 0.010 \ra. C=i 0.016 0.035
jich"ast 0.005 nayaw C=simpre 0.033 C=timpo 0.015 \sa 0.029
qillqa? 0.004 0.016 0.010 \sn C=desaparese \kisa
tintan 0.004 y. 0.014 sarasara 0.003 0.015 0.022
k"uri 0.004 jich"ax 0.009 \ma> C=komo 0.013 \jam 0.022
apustapxatayna. 0.013 C=mil 0.009 0.003 C=mil 0.013
0.004 awtut C=a?u 0.008 \kama. spa 0.012
manuyl 0.004 0.011 C=unibirsida 0.003 C=unclear 0.012
ar 0.004 r. 0.011 d 0.007 \rqaya C=apm 0.012
qawq"asa? 0.004 ch'iyar C=trimindu 0.003 C=presidenti 0.012
nanakax 0.004 0.010 0.007 \kasa C=defenda 0.012
susanar 0.004 k"itis C=he 0.006 0.003 C=wan 0.010













antrupuluj?ya
0.004
waywaykarisam
p 0.003
p"iriyus 0.003


0.010
kuns
0.007
qala
0.005
jinaru
0.005
sum
0.005
t'ant'
0.005
kuna
0.005
jumaw
0.005
jupaw
0.004


C=voy 0.006
antz' 0.006
C=masa
0.005
yacxi 0.005
aka 0.005
C=negro
0.005
w 0.005
C=mand
0.005


\ill 0.003
\yuy
0.003
\shqa
0.003
\nhna
0.003
\ata.
0.003
\n. 0.003
\yatx
0.003
\cxunhka 0.003
\ush
0.003
\isn.
0.003
\shq
0.003


C=fundasy?n 0.009
C=gana 0.009
wa 0.009