
Improving the Automatic Diacritization of Arabic using a Web-Based Bootstrapping Algorithm

Permanent Link: http://ufdc.ufl.edu/UFE0041352/00001

Material Information

Title: Improving the Automatic Diacritization of Arabic using a Web-Based Bootstrapping Algorithm
Physical Description: 1 online resource (61 p.)
Language: english
Creator: Hettick, Christian
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: arabic, automatic, bootstrapping, computational, corpus, diacritics, diacritization, disambiguation, machine, natural, tashkeel, web
Linguistics -- Dissertations, Academic -- UF
Genre: Linguistics thesis, M.A.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The availability of large bodies of electronically-stored natural language data, or "corpora," is important for Natural Language Processing (NLP) and Corpus Linguistics. Natural Language Processing develops tools like machine translation, speech recognition, and text-to-speech synthesis that depend on a probabilistic model of language calculated from a corpus. Corpus Linguistics uses annotated corpora to identify or verify linguistic phenomena. The development of useful Arabic NLP tools and research into Arabic using Corpus Linguistics methodology is hindered by the lack of diacritics, or tashkeel, in written Arabic. This lack of diacritics creates ambiguity in cases of poly-diacritizable word forms, that is, word forms that can be diacritized two or more ways. Ambiguity affects the performance of Arabic NLP systems. Thus the insertion of diacritics, itself a Natural Language Processing task, is a desirable goal for facilitating higher-level NLP applications. My study presents an algorithm that is simple yet accurate in adding diacritics to Arabic text. It uses available diacritized corpora to search the Web for texts to increase the size of the training set for an automatic diacritizer. This "bootstrapping" increases the size of the training corpus by adding contexts that have a strong association with a specific diacritization of an ambiguous word form. Results from a test of this algorithm using the Linguistic Data Consortium's Arabic Tree Bank are promising, showing improved precision and recall over previous algorithms. These results demonstrate the utility of constructing Web-scale training corpora for diacritization. In addition, an examination of the corpus of bootstrapped texts identifies some interesting linguistic phenomena: patterns in the diacritization practices of Arabic writers.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Christian Hettick.
Thesis: Thesis (M.A.)--University of Florida, 2009.
Local: Adviser: Filip, Hana.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0041352:00001

Full Text

IMPROVING THE AUTOMATIC DIACRITIZATION OF ARABIC USING A WEB-BASED BOOTSTRAPPING ALGORITHM

By

CHRISTIAN HETTICK

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS

UNIVERSITY OF FLORIDA

2009

© 2009 Christian Hettick

To Erin, with love

ACKNOWLEDGMENTS

I would like to thank all of my teachers at the University of Florida for challenging me and opening my mind to the fascinating world of linguistics. Special thanks are due to each member of my committee. Galia Hatav graciously gave her time for discussions of the intricacies of Arabic grammar. Khaled Elghamry provided a wealth of encouragement, ideas, and incalculable help all along. Hana Filip took me on as an advisee and shepherded me through this project. I owe a debt of gratitude to my parents, Steve and Kate, for introducing me to the wide world of language and culture, and to my brother Nathan and my sister Ingrid for being great fellow explorers. Finally, I thank my wife Erin and my son Evan for being such gracious and graceful sources of encouragement and inspiration.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 BACKGROUND
  2.1 The Arabic Language
    2.1.1 Arabic Syntax
    2.1.2 Arabic Morphology
    2.1.3 The Arabic Writing System
  2.2 Natural Language Processing and the Automatic Diacritization Problem
    2.2.1 Natural Language Processing Uses, Challenges, and Approaches
    2.2.2 The Automatic Diacritization Problem
  2.3 Literature Review
    2.3.1 Common Features
    2.3.2 Limitations of Previous Work
      2.3.2.1 Performance
      2.3.2.2 Reliance on preprocessing
      2.3.2.3 Sparseness of data
3 EXPERIMENT
  3.1 Supervised Machine Learning Experimentation
    3.1.1 Algorithm Design
    3.1.2 Corpus Partitioning and Algorithm Implementation
    3.1.3 Performance Evaluation
  3.2 Bootstrapping Algorithm and Experiment
    3.2.1 Foundational Assumptions of the Current Approach
    3.2.2 The Algorithm
    3.2.3 Example
    3.2.4 Implementation of the Algorithm
    3.2.5 Evaluation
  3.3 Error Analysis
4 CONCLUSION
  4.1 Discussion and Potential Limitations
    4.1.1 Implications of the Current Work
    4.1.2 Potential Limitations of the Current Work
  4.2 Future Directions: Extensions of the Current Work
    4.2.1 Algorithm Modification
    4.2.2 Iterative Bootstrapping
    4.2.3 Dialectal Arabic on the Web
    4.2.4 Bootstrapping from the Web as a Methodology for other Natural Language Processing Applications
  4.3 Future Directions: Interesting Data
    4.3.1 Spotlight Diacritization
    4.3.2 Explicit Mention Diacritization
    4.3.3 Harnessing these Phenomena for further Natural Language Processing Applications

APPENDIX
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table 2-1. Previous diacritization research
Table 3-1. Diacritization results for algorithm and baselines

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts

IMPROVING THE AUTOMATIC DIACRITIZATION OF ARABIC USING A WEB-BASED BOOTSTRAPPING ALGORITHM

By Christian Hettick

December 2009

Chair: Hana Filip
Major: Linguistics

The availability of large bodies of electronically stored natural language data, or corpora, is important for Natural Language Processing (NLP) and Corpus Linguistics. Natural Language Processing develops tools like machine translation, speech recognition, and text-to-speech synthesis that depend on a probabilistic model of language calculated from a corpus. Corpus Linguistics uses annotated corpora to identify or verify linguistic phenomena. The development of useful Arabic NLP tools and research into Arabic using Corpus Linguistics methodology is hindered by the lack of diacritics, or tashkeel, in written Arabic. This lack of diacritics creates ambiguity in cases of poly-diacritizable word forms, that is, word forms that can be diacritized two or more ways. Ambiguity affects the performance of Arabic NLP systems. Thus the insertion of diacritics, itself a Natural Language Processing task, is a desirable goal for facilitating higher-level NLP applications.

My study presents an algorithm that is simple yet accurate in adding diacritics to Arabic text. It uses available diacritized corpora to search the Web for texts to increase the size of the training set for an automatic diacritizer. This bootstrapping increases the size of the training corpus by adding contexts that have a strong association with a specific diacritization of an ambiguous word form. Results from a test of this algorithm using the Linguistic Data Consortium's Arabic Tree Bank are promising, showing improved precision and recall over previous algorithms. These results demonstrate the utility of constructing Web-scale training corpora for diacritization. In addition, an examination of the corpus of bootstrapped texts identifies some interesting linguistic phenomena: patterns in the diacritization practices of Arabic writers.

CHAPTER 1
INTRODUCTION

Insight into natural human language and linguistic behavior is a central goal of linguistic research. The ability to represent, manipulate, and generate natural language is a central goal of computational linguistics and Natural Language Processing. This paper describes a novel approach to a Natural Language Processing problem, the automatic diacritization of Arabic, which led to some interesting insights into the diacritization patterns of human writers. In the process it illustrates the reciprocal nature of corpus development, Natural Language Processing, and linguistic insight.

The lack of diacritics in most written Arabic presents a problem of ambiguity for a range of Natural Language Processing (NLP) tasks. Restoring these diacritics, or automatic diacritization, is a trivial task when there is only one possible diacritization for a given word form. However, for word forms that have two or more possible diacritizations (poly-diacritizable forms), automatic diacritization is a nontrivial task. Automatic diacritization is a necessary part of higher-level Arabic NLP tasks such as speech synthesis and speech recognition, as well as of any machine learning system that uses a corpus as training data. Because of the crucial and challenging nature of the automatic diacritization problem, there exists a range of recent work that approaches diacritization from a number of angles, but all of it falls firmly within the probabilistic machine learning approach, in which a decision-making algorithm for diacritic restoration is trained on a large text, or corpus, of diacritized text. However, mainly because of the limited size of extant diacritized Arabic corpora, these works also use complicated mathematical models and make use of data preprocessing.

My study uses an algorithm that gathers novel contexts from the Web to enlarge an existing training corpus for a supervised machine learning diacritizer. The probabilistic model on which the diacritizer was trained was kept simple to demonstrate the utility of this process of bootstrapping to create a Web-scale corpus that acts as a robust model of the language. The results of the diacritizer that was trained on the bootstrapped corpus not only outstripped the results for the same diacritizer trained on the original training corpus, but are better than the results in the current literature. This demonstrates that the bootstrapping approach using a Web-querying algorithm is a profitable one. The results of the bootstrapping algorithm contained linguistic artifacts, portions of documents from the Web. These new texts proved interesting in the sense that they shed light on the linguistic phenomenon of diacritization as practiced by writers of Arabic.

Chapter 2 contains background information. This includes some notes on the Arabic language, especially morphology and the Arabic writing system. Chapter 2 also features a brief discussion of NLP and the way in which this discipline depends on and generates corpora, as well as how NLP and linguistics interact. Finally, there is a discussion of the current work on Arabic Natural Language Processing, focusing on the problem of automatic diacritization.

Chapter 3 details my study's novel approach to automatic diacritization, with an outline first of the methodology of NLP experimentation, and then a presentation of the algorithm and a description of the experiment testing the algorithm, along with examples of how the algorithm functioned on real data. The experiment demonstrates that the bootstrapping approach to building a Web-scale training corpus is very fruitful. The results of the experiment and some remarks on observed errors conclude this chapter.

Chapter 4 discusses some implications that can be drawn from the outcome of the experiment. There is also a presentation of future extensions of this specific algorithm and of the general approach of bootstrapping corpora with Web-query algorithms. Finally, there is a discussion of the diacritization practices of Arabic writers based on the data taken from the corpus that was bootstrapped for the experiment.

CHAPTER 2
BACKGROUND

2.1 The Arabic Language

Arabic is a member of the Semitic branch of the Afro-Asiatic language family, spoken in North Africa and the Middle East. Arabic can be considered a language family in itself, with a number of spoken dialects which can be grouped geographically. Brustad (2000) documented some of the major differences between four broad groupings of the dialects, which she identified as Maghreb, Egyptian, Levantine, and Gulf, although these are not the only possible groupings of Arabic dialects. Versteegh (1997) identifies a usual classification among five groups: Arabian peninsula, Mesopotamian, Syro-Lebanese, Egyptian, and Maghreb.

However, the dialects cannot be considered the whole picture of the Arabic language. Adding complexity to the linguistic situation is the prevalence of other forms of Arabic in many contexts throughout the Arabic-speaking world, including Classical Arabic (CA) in religious and literary settings and Modern Standard Arabic (MSA) in educational, media, literary, and governmental environments. The exact status of these forms of Arabic, which lack a population of native speakers, is difficult to specify. In fact, Kaye (1970) claims that it is much easier for the linguist to claim what MSA is not than what it is, eventually categorizing it as an ill-defined system, as opposed to the well-defined system of any and all of the colloquials. Versteegh (1997) claims that MSA is a modified modern variety of CA. The relationship between CA and MSA is difficult to define, although grammatical simplification and lexical differences are typically cited as playing a role.

In Ryding's Reference Grammar of Modern Standard Arabic (2005), MSA is described as a written norm and a major medium of communication for public speaking and broadcasting, as well as functioning as a lingua franca in interactions between speakers of different dialects. Modern Standard Arabic is also, by far, the most commonly written version of Arabic. Along with Classical Arabic, which is found primarily in the Quran and ancient poetry, MSA provides the bulk of written Arabic data. It is taught in school, and used in media, government, and business throughout the Arabic-speaking world. The spoken dialects differ syntactically, lexically, morphologically, and phonologically among themselves (Ryding 2005, Brustad 2000), and these differences, primarily lexical, can cause regional variation in MSA (Versteegh 1997). All native speakers of Arabic speak a dialect as a first language; no one speaks MSA as a mother tongue, although the educational situation, majority religion, and ubiquitous nature of governmental and media texts in Arabic-speaking countries reinforce some knowledge of MSA among all speakers. This situation, identified as diglossia in Kaye (1970, following Ferguson 1959), is not at all unique to Arabic, and is found in German-speaking Switzerland, Greece, and a host of other places. Although the dialects present a number of interesting phenomena, and extensions of the current work to spoken dialects will be considered in Chapter 4, the bulk of this paper will focus on MSA. Specifically, because the diacritization problem is a syntactic, morphological, and lexical issue in written representations of Arabic, relevant aspects of these subfields are explored below.

2.1.1 Arabic Syntax

Arabic (henceforth referring to MSA, unless otherwise noted) is a predominantly head-initial (left-headed) language, and thus Verb-Subject-Object (VSO) order is unmarked.

However, Subject-Verb-Object (SVO) order is also possible, and is seen more in spoken Arabic, due to a range of pragmatic factors (Dahlgren 1998). Another result of this left-headedness is that adjectives typically follow the nouns they modify, although this too can vary, especially with comparatives or superlatives. Many pertinent features of Arabic syntax are marked morphologically, and thus will be treated more fully below. Arabic has three cases (nominative, accusative, and genitive), marked with suffixation on nouns. There are two voices (active and passive), marked with internal vowel variation on verbs. Three moods occur (indicative, imperative, and jussive), marked with suffixation on verbs.

Arabic also has two verbal forms which are variously argued as being tenses or aspects. Tense views a situation from a certain reference point in time; aspect concerns different ways of viewing the internal temporal constituency of a situation (Comrie 1976). Scholarly opinions are varied and nuanced on the classification of the Arabic verbal system as one of tense or aspect, and I will not attempt to resolve the issue here. Brustad (2000) advances the view that aspect is primary in spoken Arabic verb forms but that time reference (related to but distinct from tense) is a function of whole sentences. In addition, various strategies exist in the dialects for shifting time reference (temporal verbs, participles) and combining these references with aspect (Brustad 2000). Brustad (2000) also cites Comrie (1976) in defining the verbal system of CA as aspectual, making a distinction between perfective and imperfective. Ryding states the differences between aspect and tense and then claims that these two grammatical categories "have in practice blended into one" in MSA (Ryding 2005). I refer to aspect in my study; in MSA, aspect is marked with suffixation and internal vowel variation.

The construct state, or idhafa, is used productively in complex Noun Phrases (NPs) to show possession, sometimes in long chains of possession, for example مكتب مدير المدرسة (maktabi mudiri Almadrasap: the office of the director of the school; gloss: office director the.school).[1] The idhafa precludes the use of the definite article /Al/ on all but the last nominal element in the chain (which can itself be a complex NP: /Almadrasap/ above could be replaced with المدرسة الفرنسية (Almadrasap Alfaransiyap: the French school), giving the phrase "the office of the director of the French school").

2.1.2 Arabic Morphology

Morphology in Arabic follows the root-and-pattern system characteristic of Semitic languages. This means that most content words are formed from a root composed of consonants: typically three, but 2-, 4-, and even 5-consonant roots exist. Given a set of 28 consonant phonemes, the possible combinations are numerous. Darwish (2002) states that there are about 10,000 roots, although the set of commonly used roots is smaller. According to Ryding (2005), these roots are discontinuous morphemes that carry lexical meaning by denoting some real-world semantic field. Ryding also states that roots act as "a nucleus or core around which are constellated a wide array of potential meanings, depending on which pattern is keyed into the root" (Ryding 2005). The specific meanings are instantiated when a discontinuous root is interlocked with one of a set of discontinuous patterns specific to nouns, verbs, and adjectives.

[1] Throughout my study, examples are given using the Arabic script, a transliteration using the Buckwalter system (Habash, Soudi, and Buckwalter, 2007), and the English equivalent; when relevant, a word-for-word gloss is provided. The Appendix contains the correspondences between the Arabic alphabet, the Buckwalter scheme, and the International Phonetic Alphabet. When mentioning individual particles in the discussion of Arabic grammar, only phonological representations are used for the sake of simplicity.

The patterns may involve prefixes, suffixes, and root-internal vowels. Function words in Arabic typically do not contain a root, and are much shorter than content words. Vowel length in Arabic is phonemic among the three vowels /i/, /u/, and /a/, and the short vowels are commonly used as the root-internal patterns as well as suffixes for nouns and verbs.

Nouns are marked for case, gender, number, and definiteness. Case marking is accomplished by suffixing one of the short vowels /i/ (genitive), /u/ (nominative), and /a/ (accusative). Masculine gender is typically unmarked, while feminine is marked with a suffixed /-a/ or /-at/. Number is also marked on nouns. Singular is unmarked, dual is marked with a suffix /-a:n/, and plural is marked regularly with a suffix /-u:n/ or irregularly with a different vowel pattern. Definiteness is marked with a prefixed or cliticized morpheme /al-/ or /l-/, while indefinite nouns lack this morpheme and have a suffix /-n/ after the case suffix. Non-final nouns in the construct state (idhafa) are marked with the genitive case marker, but are understood as definite.

Verbal roots in Arabic are marked for mood, voice, and aspect. Mood marking on verbs is accomplished by suffixing either /u/ (indicative), /a/ (jussive), or a zero morpheme (imperative). The passive voice is marked with alternations in the vowel pattern of the verb, which can vary. Following the aspect analysis for MSA verbs, perfective verbs are marked by vowel patterns and suffixation depending on person, number, and gender. Imperfective verbs are marked with vowel patterns, prefixes, and suffixes that vary based on person, number, and gender. A particle can be prefixed to imperfective verbs (/sa-/) or occur as a free morpheme (/sawf/); these developed to indicate future events within the overall aspectual verbal system of Arabic. Other participles can be used in combination with the basic verbal forms for complex expressions of aspect and time reference (such as the future perfect). In addition, there are many productive verbal patterns that involve morphological processes like gemination, infixing, prefixing, and suffixing. Derivational morphology can also produce deverbal nouns: an agentive noun can be formed with either a prefix /mu-/ or a variation in root-internal vowels, and a patient noun (the recipient of an action) can be formed with a prefix /mu-/ or /ma-/.

2.1.3 The Arabic Writing System

All computational linguistics, by virtue of the use of computers, involves reducing a language to a written, symbolic form. Even speech recognition or speech synthesis requires a program or system to represent the sounds of the language using orthographic characters. The writing system of Arabic poses a unique problem for this representation, and thus is outlined below.

The writing system that developed for Arabic is most likely a descendant of Egyptian hieroglyphics by way of early Semitic, Phoenician, and Aramaic scripts (Rogers 2005). Arabic, the descendant of these early systems, is an abjad, meaning that only consonants are written. In addition to symbols for consonants and long vowels (which are always written), Arabic has a set of optional diacritics to represent short vowels and other phonological features, which can be written above and below the consonants. The Arabic alphabet, along with the diacritics, is included in the Appendix.

Diacritics, also called tashkeel, are typically not used in Arabic texts. They are used consistently only in religious and pedagogical materials. The set of diacritics includes the short vowels fatha [a], dhamma [u], and kasra [i]; sukkun (the lack of a short vowel); shadda (gemination); and tanween, the indefinite marker combined with a word-final fatha, dhamma, or kasra: [an], [un], [in].
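For concreteness, this inventory can be written down as data. The following minimal sketch (mine, not the thesis's) pairs each diacritic with its Unicode combining character and its Buckwalter symbol; it mirrors the kind of correspondence table the Appendix describes.

    # The tashkeel inventory as data: Unicode combining marks and their
    # Buckwalter transliteration symbols. (Illustrative; not from the thesis.)
    TASHKEEL = {
        "fatha":     {"unicode": "\u064E", "buckwalter": "a"},  # short [a]
        "dhamma":    {"unicode": "\u064F", "buckwalter": "u"},  # short [u]
        "kasra":     {"unicode": "\u0650", "buckwalter": "i"},  # short [i]
        "sukkun":    {"unicode": "\u0652", "buckwalter": "o"},  # no short vowel
        "shadda":    {"unicode": "\u0651", "buckwalter": "~"},  # gemination
        "fathatan":  {"unicode": "\u064B", "buckwalter": "F"},  # tanween [an]
        "dhammatan": {"unicode": "\u064C", "buckwalter": "N"},  # tanween [un]
        "kasratan":  {"unicode": "\u064D", "buckwalter": "K"},  # tanween [in]
    }

    # The set of bare combining marks, useful for stripping diacritics later.
    DIACRITIC_CHARS = {d["unicode"] for d in TASHKEEL.values()}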

The diacritics are distributed unevenly in MSA texts: shadda is the most commonly found, even in otherwise-undiacritized texts. In addition, sometimes only one short vowel will be used, in an effort to disambiguate a word. This phenomenon will be discussed in Chapters 3 and 4. Rogers (2005) notes that some speculate that this type of writing system is ideally suited for representing Semitic languages, and may have been motivated by root-and-pattern morphology. As we have seen above, diacritics, though typically unwritten, are used in a number of ways in inflectional and derivational morphology. Therefore, the lack of written diacritics in a text will lead to both lexemic (derivational) and inflectional ambiguity.

Diacritics that distinguish between lexemes are typically found root-internally. We see an example using the root /ktb/, which denotes the semantic field of WRITING. When the internal vowel pattern is fatha-fatha, the form that results is a past tense verb: كَتَب (katab: to write.PAST.3SG). When the pattern is dhamma-dhamma, the form is a plural noun: كُتُب (kutub: book.PL). Lexemic differentiation can also be caused by the causative, which is typically formed by gemination of the middle consonant, often accompanied by vowel changes. Causativization in Arabic is classified as derivational morphology by Benmamoun (1999) and Hallman (2006, unpublished paper), and it follows from this analysis that the result for many Arabic verbs is a unique lexical item with different argument structure and translation equivalents, which brings to bear the same difficulties of ambiguity for Natural Language Processing (NLP). For example, /Elm/ is a root that carries the concept of thinking, knowing, and cognition. The form عَلِمَ (Ealima: he knew) is a transitive verb. Applying gemination gives a ditransitive causative: عَلَّمَ (Eal~ama: he taught).
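The interlocking of roots and patterns lends itself to a short illustration. The sketch below is not from the thesis; it writes patterns in Buckwalter symbols, with "C" standing for a root-consonant slot, and reproduces the /ktb/ and /Elm/ examples above.

    def interdigitate(root, pattern):
        """Interlock a consonantal root with a vowel pattern.

        'C' in the pattern marks a root-consonant slot; all other characters
        (Buckwalter vowels, ~ for shadda) are copied through unchanged.
        """
        consonants = iter(root)
        return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

    # The root /ktb/ (WRITING) keyed into two different patterns:
    assert interdigitate("ktb", "CaCaC") == "katab"   # past tense verb: 'wrote'
    assert interdigitate("ktb", "CuCuC") == "kutub"   # plural noun: 'books'

    # Gemination of the middle consonant (shadda, ~) yields the causative:
    assert interdigitate("Elm", "CaCiCa") == "Ealima"    # 'he knew'
    assert interdigitate("Elm", "CaC~aCa") == "Eal~ama"  # causative: 'he taught'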

Another derivational morphological alternation that occurs commonly in Arabic is between the active and passive voices of verbs. An example of this alternation is given by the pair كَتَبَ (kataba: he wrote) and كُتِبَ (kutiba: it was written). The word-internal diacritics distinguish two lexical items with different argument structures.

Inflectional ambiguity typically arises when a difference in word-final diacritics would change a part of verbal inflection, voice, or nominal case. As the nominal cases are all marked with a suffixed short vowel, the absence of these vowels can lead to ambiguity, although not ambiguity in the choice of lexical item. More serious is the loss of word-final diacritics on a verb. Because person and gender are often differentiated only by one short vowel, this is another type of ambiguity that can lead to indecision when choosing a lexical item. For example, the word form /ktbt/, without diacritics, could be كَتَبْتُ (katabtu: to write.PERF.1SG), كَتَبْتَ (katabta: to write.PERF.2SG.MASC), كَتَبْتِ (katabti: to write.PERF.2SG.FEM), or كَتَبَتْ (katabat: to write.PERF.3SG.FEM).

All of these ambiguous forms pose huge problems for any NLP application or subtask. For example, a machine translation program choosing blindly among the above four possible verb forms has only a 25% chance of accuracy. In order to increase the chance that the correct form is chosen, context must be measured and used to disambiguate (the sketch below makes the baseline concrete). This greatly increases the computational load the system must bear, as well as increasing the amount of knowledge the system must be fed in order to function. Finally, Arabic is a pro-drop language as well as being VSO. This means that in some VSO sentences it may be impossible to assign person and gender to the verb.
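To make the problem concrete, the sketch below (illustrative only; the data structure is mine) represents the /ktbt/ ambiguity as a candidate table and computes the 25% blind-guess baseline noted above.

    # Candidate diacritizations for the undiacritized form /ktbt/,
    # in the thesis's transliteration (illustrative data structure).
    CANDIDATES = {
        "ktbt": ["katabtu",   # to write.PERF.1SG
                 "katabta",   # to write.PERF.2SG.MASC
                 "katabti",   # to write.PERF.2SG.FEM
                 "katabat"],  # to write.PERF.3SG.FEM
    }

    def blind_guess_accuracy(form):
        # With no context, every candidate is equally likely.
        return 1.0 / len(CANDIDATES[form])

    print(blind_guess_accuracy("ktbt"))  # 0.25 -- the 25% chance noted above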

2.2 Natural Language Processing and the Automatic Diacritization Problem

Automatic diacritization of Arabic is a nontrivial computational task that is part of a number of Natural Language Processing (NLP) applications for Arabic. Section 2.2.1 gives a brief overview of NLP goals and difficulties. Section 2.2.2 delves into the specific problem of automatic diacritization.

2.2.1 Natural Language Processing Uses, Challenges, and Approaches

Natural Language Processing (NLP) and Computational Linguistics (CL) are labels for two emphases within a broad and dynamic interdisciplinary field of study that integrates knowledge and techniques from computer science and linguistics, among other areas. NLP and CL are overlapping domains that interact with each other but differ in important ways. Broadly speaking, Natural Language Processing involves the application of computers to tasks involving human language. These tasks include machine translation, speech synthesis, speech recognition, question answering, information retrieval, and other useful functions. Computational Linguistics, on the other hand, attempts to understand the nature of human language through computer models. Corpus Linguistics, a subfield of Computational Linguistics, uses large bodies of electronically stored linguistic data to analyze natural language across a host of different parameters. Computational linguistics can aid theoretical linguistic research by providing researchers with large amounts of data that can be analyzed quickly along a variety of criteria. Theoretical linguistic research, and the insights into the structure of human language gained from it, is in turn used to add knowledge to an NLP system.

There exists an iterative, mutually feeding relationship between corpus development and NLP. Any corpus involves the use of some NLP, even if the NLP system is just electronic storage of linguistic knowledge that has been hand-annotated by native speaker consultants. More typically, the annotation of a large corpus is done automatically, using NLP tools to identify part of speech, phrasal constituency, affix identity, and a range of other linguistic data.

Most current work in NLP, in turn, relies on corpora for the probabilistic models of language that are created to make the decisions an NLP system confronts. These NLP applications can then, in turn, be used to develop corpora that are more sophisticated (storing more linguistic information), more accurate (ever-shrinking error rates), and larger, which, as we have seen, can be used by both the academy and industry. In keeping with the nature of the field at this time, the development of better corpora is a crucial component of progress in both NLP and theoretical linguistic research. Arabic corpora currently do not exist in the abundance, size, or depth of English corpora, which provide a measuring stick for what may be achieved in the field. Diacritics are crucial for establishing meaning in written Arabic (see Section 2.1.3). Thus the recovery or insertion of diacritics, especially short vowels and the gemination marker, is critical for developing useful Arabic language corpora.

2.2.2 The Automatic Diacritization Problem

As was shown above in Section 2.1.3, loss of diacritics introduces ambiguity across both lexical and inflectional categories. Any corpus that lacks diacritics will not be able to provide disambiguating statistics to enable a program to choose between two or more possible word forms. Disambiguation is one of the central tasks of NLP. For example, a text-to-speech program would take electronic Arabic text and produce a simulated audible speech signal. This signal would be produced by storing audio values for the various written Arabic letters. Obviously, a lack of diacritics totally incapacitates such a system. Another example would be a corpus that functions, along with a bilingual dictionary, as the back end of a machine translation system. The system would look up the word to be translated, but if there are multiple entries, a decision has to be made.

The corpus exists to provide statistics about which word form is the likely choice, given some context, typically measured in an n-gram preceding or succeeding the target word. For this, a diacritized corpus is necessary. Speech recognition also requires a means of deciding between ambiguous analyses of an audio input. In most current work a corpus is used to determine the correct choice between ambiguous forms, as seen in Section 2.3.

Traditionally, as was suggested in Section 2.2.1, development of corpora and NLP tools has operated in an iterative fashion. Automatic diacritization is a type of NLP. The most promising work currently done on automatic diacritization, namely, the recent papers which report low error rates, depends on access to diacritized corpora. Accordingly, as available diacritized corpora are enlarged and refined, the accuracy of automatic diacritization programs will be improved. In order to make progress on Arabic NLP applications, automatic diacritization must be addressed in two areas: techniques and training data (corpora). Focusing on these two areas, current work in the field is reviewed in Section 2.3.

2.3 Literature Review

2.3.1 Common Features

Most of the previous work done on automatic diacritization of Arabic shares several main features, including the use of similar types of techniques and training with the same corpora. Much of the previous work involves a probabilistic approach, with varying amounts of linguistic knowledge added to the system. Often these are computationally expensive techniques, such as support vector machines (e.g., Roth et al. 2008) and maximum entropy (e.g., Zitouni, Sorensen, and Sarikaya 2006). The programs introduced in these two papers tend to outperform other models and could be viewed as the current state of the art.

The issue of how the automatic diacritization problem is addressed, either through more complex techniques like these or through improved training data, will be taken up in Section 2.3.2. Another commonality is the nearly uniform use of one particular corpus of MSA for calculating probabilities and making decisions: one of the three parts of the Arabic Treebank (ATB) produced and disseminated by the Linguistic Data Consortium (Maamouri et al. 2004).

Another commonly occurring feature of much of the recent work is that the correct decision is often made and assessed on a character level; that is, the choice of a diacritic is determined by the preceding string of characters, rather than words, and the error rate is defined as the number of correct diacritics inserted or restored, compared with the total number of letters. For most applications, this has the strong potential to give a false sense of accuracy to a system. An example would be a word form with three consonants and the need to recover three short vowel diacritics, one for each consonant. If the correct lexical item has the pattern fatha-fatha-fatha and a system applies the pattern dhamma-kasra-fatha, a character-based error rate would be 66%, or two wrong decisions out of three decisions made. However, an error rate based on words, rather than characters, would produce an error rate of 100% for this item, or one wrong decision for one decision made. The sketch below makes this difference concrete. This situation is potentially very common in Arabic, as the following word pair shows. The verb كَتَبَ (kataba: to write.PERF.3SG) is the active voice past tense form glossed as "he wrote." The verb كُتِبَ (kutiba: to write.PERF.3SG) is the passive voice past tense form glossed as "it was written." The difference between these two vocalization patterns changes the lexical item(s), and to regard an incorrect decision as partially right has the potential to overstate the accuracy of a system.
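The gap between the two scoring schemes can be computed directly. A minimal sketch using the three-vowel example above (an illustration, not any reviewed paper's evaluation code):

    def diacritic_error_rate(gold, predicted):
        """Fraction of per-character diacritic decisions that are wrong."""
        wrong = sum(g != p for g, p in zip(gold, predicted))
        return wrong / len(gold)

    def word_error_rate(gold, predicted):
        """A word counts as wrong if any one of its diacritics is wrong."""
        return 0.0 if gold == predicted else 1.0

    gold = ["fatha", "fatha", "fatha"]    # correct pattern for the word
    pred = ["dhamma", "kasra", "fatha"]   # hypothetical system output

    print(diacritic_error_rate(gold, pred))  # 0.666... -- two of three wrong
    print(word_error_rate(gold, pred))       # 1.0 -- the whole word is wrong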

Schlippe, Nguyen, and Vogel (2008) and Alghamdi and Muzafar (2007) are papers that typify this approach.

Table 2-1 does not attempt to chart all the recent work done on automatic diacritization. However, it is a representative sample, and lists the author(s), year, computational technique used, and corpus used for training and testing the diacritizer.

Table 2-1. Previous diacritization research

Authors                                   | Year | Method                                                                  | Corpus
Roth et al.                               | 2008 | Support Vector Machine (SVM)                                            | Arabic Treebank (ATB)
Schlippe, Nguyen, and Vogel               | 2008 | Statistical machine translation at word, character, and word+character level | ATB and AppTek
Kubler and Mohamed                        | 2008 | Memory-based learning                                                   | ATB
Alghamdi and Muzafar                      | 2007 | Letter-level quad-grams                                                 | Proprietary
Habash and Rambow                         | 2007 | SVM                                                                     | ATB
Elshafei et al.                           | 2006 | Probabilistic uni-, bi-, and trigram models                             | Quran
Maamouri, Bies, and Kulick                | 2006 | Parsing                                                                 | ATB
Zitouni, Sorensen, and Sarikaya           | 2006 | Maximum entropy model using lexical features, part-of-speech tags, segment-based features; sequence classification | ATB
Nelken and Shieber                        | 2005 | Finite State Transducer                                                 | ATB
Ananthakrishnan, Narayanan, and Bangalore | 2005 | Maximum entropy, word and character level                               | ATB
Habash and Rambow                         | 2005 | SVM                                                                     | ATB
Vergyri and Kirchhoff                     | 2004 | Expectation maximization, simple trigram                                | Foreign Broadcast Information Service, CallHome

2.3.2 Limitations of Previous Work

There are several limitations of the extant work in automatic diacritization: 1) the need for better performance, coupled with varying performance measures; 2) the need for preprocessing, and differing levels of preprocessing; and 3) sparseness of data, coupled with a lack of specification regarding which exact portions of standard corpora are used to test and train diacritization algorithms.

2.3.2.1 Performance

Regarding the first limitation, it is the nature of NLP to seek increased accuracy from existing programs by tweaking probability weighting or improving the training data used, or by creating new algorithms to attack the problem. Also fundamental to NLP is that errors in low-level tasks are multiplied at higher levels. For instance, an error of diacritization leads to an incorrect word form, which gives an erroneous list of possible part-of-speech (POS) tags. If the wrong tag is applied, the syntactic structure of the sentence is altered, other words in the same sentence are incorrectly tagged, and the chunking task, which groups part-of-speech-tagged words into small phrase-sized units, is deleteriously affected. POS tagging and chunking are important subtasks for machine translation. Thus, for low-level tasks such as diacritization, accuracy less than 98-99% will always be a cause for concern, and error rates less than 2% have not been achieved by the algorithms reviewed here. Word error rates in the works listed above range from 4.6% in Roth et al. (2008) to 13.5% in Ananthakrishnan, Narayanan, and Bangalore (2005).

A related problem is that the measures of performance are not uniform: papers frequently list word error rates (WER) along with diacritic or character error rates (DER/CER). Sometimes the word-final diacritics are restored; at other times they are ignored (being typically determinative of case, rather than changing the lexical item). All these differences serve to occlude the exact nature of the results and make comparison between algorithms difficult and opaque.

2.3.2.2 Reliance on preprocessing

In relation to the second limitation listed above, many projects use some level of preprocessing, typically tokenizing or POS tagging. The output of these processes is the input for the diacritizer, and any errors in the preprocessing will be multiplied when diacritizing. As an example, Zitouni, Sorensen, and Sarikaya (2006) tokenize and POS-tag the input data for the diacritizer in order to calculate and use probabilities of diacritic occurrence based on sequences of preceding and following parts of speech, as well as the position of the undiacritized character in the word. The tokenizer, which separates affixes from roots, directly affects this latter measure, character position in the word, and also affects the POS tagger, so problems in these nontrivial tasks will carry over to the diacritization task. The fact that many diacritization algorithms involve some level of preprocessing indicates that there may be as yet unexplored ways in which errors in these lower-level preprocessing steps are negatively affecting the accuracy of the diacritizer. An additional difficulty along these lines is that not all papers present their exact methods of preprocessing, and not all use preprocessing, once again posing problems for comparison across different approaches.

2.3.2.3 Sparseness of data

The third limitation noted here, which consistently obtains for all the papers reviewed and is directly related to the central thrust of the current work, is that the data used for training diacritization algorithms are sparse; that is, the corpora used for training and testing are too limited in size and scope to provide robust probabilistic modeling for the diacritization problem. Almost all of the papers listed in Table 2-1 use part of the ATB, mostly the third, and most recent, release: a collection of some 300,000 words of MSA from various newswire services.

This corpus is then typically segmented into training and testing sets, usually 80% for training and 20% for testing. While a useful tool, the limitations of the ATB are not insignificant.

First is the matter of size. A corpus of 300,000 tokens is too small for accurate language modeling. By way of comparison, two freely available English corpora are the British National Corpus (BNC Consortium 2007) and the Corpus of Contemporary American English (Davies 2008). The British National Corpus (BNC) has 100 million words. The Corpus of Contemporary American English has 400 million words (and growing). By comparison, the Arabic corpora available are small, including the ATB. Context is what enables NLP programmers to disambiguate forms, and the number of disambiguating contexts in a corpus the size of the ATB will necessarily be small.

A further limitation of the ATB is that it is restricted to a specific genre, newswire text in MSA. The aforementioned English corpora each collect data from a variety of sources, including newswire, transcriptions of spoken conversations, academic literature, and fiction. The benefit of using newswire is that large amounts of electronic Arabic text are regularly stored and are more or less readily available. However, a diacritization algorithm should be able to handle texts of various genres, and training a diacritizer on newswire text alone will not achieve this functionality.

In addition to, and probably related to, the single-genre problem is the problematic type-token ratio. The current paper's analysis of the third part of the ATB revealed only 65,000 types, or unique word forms, in the corpus of 300,000 words, or tokens. This gives a type-token ratio of 21.67%, which seems to indicate a very limited vocabulary. Goweder and De Roeck (2001) discuss the differences between English and Arabic in the area of sparseness, concluding that Arabic is more sparse than English, in the sense that a given Arabic word will occur less frequently in an Arabic corpus than a given English word in an English corpus, leading to an increased risk of null probabilities when dealing with Arabic corpora.

However, the type-token ratio calculated for the ATB3 seems to indicate the opposite: the ATB3 is sparse in the sense that its number of types is too limited, not necessarily in the sense that all the unique words in the corpus (types) are infrequent (few tokens for each type). In fact, it appears that with only 65,000 types in a 300,000-token corpus, each word would have to appear relatively frequently. This increases the likelihood that a diacritizer using the ATB as training data will encounter out-of-vocabulary word forms and not have an extensive list of disambiguating contexts for the word forms it does contain. In addition, this figure for word types was calculated before any stemming was performed, and thus the vocabulary of the corpus would appear to be even more limited, as the 65,000 word forms include person, number, gender, and case differences. Moreover, of these 65,000 types, only 875 potentially ambiguous word forms were found (forms that had two or more possible diacritizations, not including word-final diacritics). This number, too, seems low.
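The type-token figures above come from a simple count. A sketch of the computation, assuming the corpus is available as a whitespace-tokenized text file (the file handling is mine; the thesis does not specify it):

    from collections import Counter

    def type_token_ratio(path):
        """Count unique word forms (types) against running words (tokens)."""
        with open(path, encoding="utf-8") as f:
            tokens = f.read().split()
        types = Counter(tokens)
        return len(types), len(tokens), len(types) / len(tokens)

    # For a corpus like the ATB3: 65,000 types over 300,000 tokens gives
    # 65000 / 300000 = 0.2167, the 21.67% ratio cited above.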

A final indicator that the ATB alone should not remain the sole training corpus for diacritization problems is that errors can be found in the diacritization of the ATB itself. That these errors exist is in itself not surprising, even though the ATB was annotated (including diacritics) by hand. For example, consider the noun اليَمَن (Alyaman: Yemen), which occurs 11 times in the third part of the ATB. Only 5 of these 11 occurrences were correctly diacritized with the pattern fatha-fatha. Three occurrences of this form were diacritized incorrectly as اليُمْن (Alyumn: prosperity, success), and three were not diacritized at all. Obviously, mistakes like these affect the decisions made by diacritization algorithms trained on these data; exactly what effects obtain will be explored in Section 3.3.

Finally, there is a lack of specificity and transparency about the exact sections of the ATB that are used. The portions of the ATB chosen for training and testing are not specified, save in one paper we found, that of Zitouni, Sorensen, and Sarikaya (2006), which explicitly mentioned the portions of the ATB corpus that were used. Across the field the need is plain for consistent use of the same sections of the same corpora for testing and training. Especially given the small size of the ATB, training or testing with different portions of the same corpus could produce quite different results.

CHAPTER 3
EXPERIMENT

My study seeks to ascertain how the performance of an automatic diacritizer can be improved over current efforts by utilizing the World Wide Web to bootstrap a much larger and more robust corpus of disambiguating contexts for training. This primarily addresses the problem of sparseness of data in the ATB alone as a training corpus, as well as discovering what errors and limitations of vocabulary or genre in the ATB are interfering with accurate diacritization. The increase in performance via bootstrapping a large training corpus will be demonstrated, along with the necessity of close error examination to refine and improve the algorithm. This approach differs from those detailed in Section 2.3 in that the role and efficacy of an improved training corpus is featured, rather than that of the probabilistic model.

To those ends, several interlocking experiments are presented. First I present a bootstrapping experiment that greatly increases the size of a training corpus by acquiring novel disambiguation contexts from the World Wide Web and clustering them around specific diacritizations of ambiguous word forms. Then the utility of this bootstrapped corpus is demonstrated as it is used to train a diacritizer, which is then tested on a portion of the ATB. Finally, I analyze the data to ascertain whether specific types of syntactic or semantic context can be linked with ambiguity or errors in diacritization. I also analyze the errors both in the current diacritization algorithm and in the ATB.

The chapter is organized as follows. Subsection 1 reviews the steps necessary for the setup and testing of a supervised machine learning experiment. Subsection 2 presents the current work's experiment: the bootstrapping algorithm, implementation of the algorithm for context acquisition and clustering, testing of this bootstrapped corpus as a training model for an automatic diacritizer, and evaluation.

Subsection 3 presents an error analysis, leading to insights into the ATB data and the nature of the newly acquired Web-based data.

3.1 Supervised Machine Learning Experimentation

3.1.1 Algorithm Design

An algorithm consists of a series of discrete steps designed to solve a problem, and is formalized by specifying these steps, along with the input that is provided to the algorithm and the output of the steps. For supervised machine learning, the input consists of a corpus with the desired outcome, that is, a corpus that has, either by previous NLP efforts or by human annotation, achieved the desired NLP outcome of the algorithm in question. In the case of supervised machine learning of automatic diacritization, the input is a diacritized corpus. This corpus functions as the training set for the algorithm. The nature of this training for diacritization on the word level is as follows: first the possible diacritizations Da...Dn of each word form are counted in the training corpus, then the contexts Ca...Cn in which each diacritization occurs are measured, and the probabilities are calculated that the correct diacritization is some diacritization Db, given some context Cb. The simplest way of measuring context is the exact words in some window around the target word to be diacritized; however, context can also be abstracted away from the exact words using parts of speech or some other categorization. The output of the algorithm is a decision based on meeting some threshold of probability.
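A minimal sketch of this word-level training scheme, assuming contexts are the two undiacritized words preceding the target and using an illustrative probability threshold; this is my reconstruction of the description above, not the experiment's actual code:

    from collections import defaultdict

    def train(corpus_sentences, strip):
        """Count P(diacritization | context) from a diacritized corpus.

        `corpus_sentences` is a list of lists of diacritized words;
        `strip` maps a diacritized word to its undiacritized form.
        Context here is the two undiacritized words preceding the target.
        """
        counts = defaultdict(lambda: defaultdict(int))
        for sent in corpus_sentences:
            bare = [strip(w) for w in sent]
            for i, word in enumerate(sent):
                context = tuple(bare[max(0, i - 2):i])  # (W-2, W-1)
                counts[(context, bare[i])][word] += 1
        return counts

    def diacritize(counts, context, bare_word, threshold=0.75):
        """Return the most probable diacritization Db given context Cb,
        but only if it clears the probability threshold; else abstain."""
        options = counts.get((context, bare_word))
        if not options:
            return None   # no decision: hurts recall, not precision
        total = sum(options.values())
        best = max(options, key=options.get)
        return best if options[best] / total >= threshold else None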

3.1.2 Corpus Partitioning and Algorithm Implementation

Typically in computational linguistics, the corpus available for the experiment is divided into two or three sets: a training set and a testing set, with an optional development set. The partitioning is as follows: 90% training, 10% testing; or 80% training, 10% testing, 10% development. In supervised learning for automatic diacritization, the test corpus is first stripped of diacritics in order to simulate raw data for the diacritizer to work on; after the diacritizer has been run on this stripped test set, the results are compared with the original test set, with intact diacritics, to evaluate the performance of the algorithm. The development corpus, when used, is the only section of the original corpus that the researcher can examine. The purpose of looking at the development corpus is to figure out what errors are likely to be found after testing, or to ascertain the cause of errors obtained after testing. The supervised machine learning experiment is run when the steps of the algorithm have been executed and the results evaluated; minimally, this involves the computation of probabilities from the training data (forming the model), using this model to make decisions about the stripped test data, and comparing these decisions to the original test set. This comparison forms the basis for performance evaluation.
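A sketch of the partitioning and diacritic-stripping steps, assuming the corpus arrives as a list of sentence strings and using the Unicode tashkeel range to strip; this is illustrative only (the experiment's own implementation was in Perl):

    import re

    # Arabic combining diacritics: tanween, short vowels, shadda, sukun.
    TASHKEEL_RE = re.compile("[\u064B-\u0652]")

    def strip_diacritics(text):
        """Remove tashkeel to simulate raw, undiacritized input."""
        return TASHKEEL_RE.sub("", text)

    def partition(sentences, train_frac=0.9):
        """Split a diacritized corpus 90/10; the stripped copy of the
        test set is what the diacritizer actually sees."""
        cut = int(len(sentences) * train_frac)
        train, test_gold = sentences[:cut], sentences[cut:]
        test_stripped = [strip_diacritics(s) for s in test_gold]
        return train, test_gold, test_stripped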


3.1.3 Performance Evaluation

Performance in Natural Language Processing (NLP) is calculated using two measures: recall and precision. Recall is the ratio of the number of decisions that were made to the number of decisions that should have been made. It measures the coverage of the algorithm. Precision is the ratio of the number of correct decisions made to the total number of decisions made. It measures the similarity of the decisions made by the algorithm to the decisions made by the human annotators of the test set. Precision can be gained at the expense of recall, by limiting the number of decisions made to only those with high chances of correctness. Likewise, recall can be improved at the expense of precision, by making decisions everywhere, even with more limited chances of making the right decision. Thus a commonly used performance measure combines the two into one metric, the F-Measure. It is the harmonic mean of precision and recall, given by Equation 3-1:

F-Measure = (2 × Precision × Recall) / (Precision + Recall)    (3-1)

Precision is given by Equation 3-3:

Precision = correct outcomes / total outcomes    (3-3)

In the case of automatic diacritization, Precision is equal to the number of words correctly diacritized divided by the total number of words diacritized. Recall is given by Equation 3-4:

Recall = outcomes / potential outcomes    (3-4)

In the case of automatic diacritization, Recall is equal to the number of words diacritized divided by the number of words needing to be diacritized (total words in the test corpus).
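The three measures can be computed directly from three counts, as in the sketch below; the example call uses invented counts chosen only to approximate the bootstrapped-corpus row of Table 3-1 (Section 3.2.5).

    def evaluate(correct, decided, needed):
        # correct: words correctly diacritized
        # decided: words the system diacritized at all
        # needed:  words needing diacritization (all words in the test corpus)
        precision = correct / decided if decided else 0.0
        recall = decided / needed if needed else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        return precision, recall, f

    # evaluate(9890, 10000, 10787) -> approximately (0.989, 0.927, 0.957)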


3.2 Bootstrapping Algorithm and Experiment

3.2.1 Foundational Assumptions of the Current Approach

A general assumption for this experiment (and one that this experiment serves to validate) is that as the size of a training corpus for an automatic diacritizer increases, the overall performance increases as well. However, the increase must be in good contexts, that is, more disambiguating contexts for ambiguous word forms that predict the correct diacritizations. With a greater number of disambiguating contexts, the coverage improves, and if the contexts are good ones, greater precision results. The focus of this algorithm as a Web-based context gatherer is built on the additional assumption that a potentially ambiguous word form that has some diacritization Di in context Ci will have the same diacritization Di if found in a different context Cj in the same Web document. In essence, I assume that within the same document, an ambiguous undiacritized word form will be the same lexical item even when the context varies. In fact, this algorithm depends on identically diacritized word forms appearing in the same Web document in different contexts, for this is the means of growing a larger training corpus.

In order to narrow the scope of the current work, only word-internal diacritization ambiguities are examined. As stated in Section 2.1.4 above, although there are ambiguities that can result from the lack of word-final diacritics, generally these are not ambiguities between multiple lexical items, but are most commonly related to grammatical categories such as nominal case or verbal mood, and this paper focuses on disambiguating poly-diacritizable lexical items. In addition to limiting the scope of the algorithm to word-internal (that is, lexemic) diacritics, this algorithm uses the whole word in training and evaluation; that is, I do not strip or stem affixes. Stemming and other morphological processing is an interesting NLP task, and potentially adds valuable knowledge to a model of the language. However, the downside of using morphological preprocessing is that any errors in that task (a non-trivial task, just like diacritization) are carried over into the training corpus.


These errors are compounded when the diacritization algorithm essentially takes corrupted data as input. Also, I measure performance based on the diacritization of the whole word, rather than examining individual diacritics on a character-by-character basis. Thus just one incorrect diacritic in a word form causes the whole word to be counted as incorrectly diacritized.

3.2.2 The Algorithm

The algorithm was coded in Perl by my collaborator Dr. Khaled Elghamry, for Elghamry and Hettick (2009). It consists of the following steps (a sketch of the Web-query step, Step 3, follows the list):

1.) Segment the diacritized corpus into training and test sets
1a.) take 90% of the original corpus as the training corpus and 10% of the corpus as the test corpus
1b.) make a copy of the test corpus and remove all diacritics, while keeping the test corpus hidden

2.) Extract all poly-diacritizable words with their trigram contexts from the training set
2a.) make a list of all words with at least 2 potential word-internal diacritizations, as realized in the training corpus
2b.) for each possible diacritization of these poly-diacritizable word forms, extract seed contexts, defined here as the two words preceding the ambiguous word W (W-2 W-1) along with the target word
2c.) remove diacritics from these seed contexts

3.) Gather other disambiguating contexts from the Web
3a.) for all the seed contexts associated in the training corpus with word W with diacritization Di, run a Web query with a search string composed of the seed context and the target word with a wildcard operator (W-2 W-1 W + * * W), which returns snippets of the first 100 documents retrieved, consisting of a Web address for each document on which the search string hit, the original seed context, and a set of trigrams consisting of the target word immediately preceded by any other two words
3b.) extract from the search results the novel trigram context(s)
3c.) group these contexts with the original poly-diacritizable word form W of diacritization Di

4.) Combine these novel disambiguating contexts with the initial seed contexts into a bootstrapped training corpus

5.) Train the automatic diacritizer by computing transitional probabilities for diacritizations of ambiguous word forms, Equation 3-5:

P(W_Di | W-2 W-1) = Freq(W-2 W-1 W_Di) / Freq(W-2 W-1)    (3-5)

6.) Apply diacritics to words in the test set based on these transitional probabilities
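The following is a minimal sketch of Step 3, not the Perl implementation: it builds the wildcard query from a seed trigram, scans returned snippets for new trigrams ending in the target word, and clusters them with the seed's diacritization. The search_snippets parameter is a stand-in for the search-engine interface (the experiment used www.alltheweb.com), and the exact query syntax is an assumption.

    def bootstrap_contexts(seeds, search_snippets, max_docs=100):
        # seeds: dict mapping an undiacritized seed trigram (w2, w1, w)
        #        to the diacritized form of W it attests in the training set.
        # Returns: diacritized form -> set of novel (W-2, W-1) contexts.
        clusters = {}
        for (w2, w1, w), diacritized in seeds.items():
            query = f'"{w2} {w1} {w}" "* * {w}"'  # seed trigram + wildcard trigram
            for snippet in search_snippets(query, max_docs):
                tokens = snippet.split()
                # harvest every trigram in the snippet that ends in the target word
                for i in range(2, len(tokens)):
                    if tokens[i] == w and (tokens[i-2], tokens[i-1]) != (w2, w1):
                        clusters.setdefault(diacritized, set()).add(
                            (tokens[i-2], tokens[i-1]))
        return clusters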


3.2.3 Example

To demonstrate the bootstrapping of the training corpus, which is the core of the algorithm presented here, a brief example is given from the implementation. In extracting a list of poly-diacritizable word forms from the training corpus of the Arabic Treebank Part 3 (ATB3), the word form (Al$Er) is obtained. This form occurs 28 times in the ATB3 and can be diacritized as either of the following: (Al$iEor: poetry), 26 tokens; or (Al$aEor: hair), 2 tokens. In the training corpus of the ATB3 the trigram (fan kitaAbap Al$iEor: the art of writing poetry) is realized and, along with 7 other trigram contexts for this diacritization, is extracted. This forms the first half of the Web query W-2 W-1 W + * * W, where W is the target word, W-2 is (fan: art), W-1 is (kitaAbap: writing), and * is any word. This Web query returns documents that contain the original trigram along with trigrams containing new contexts for the target word, such as (mn rwAd Al$Er: of the pioneers of poetry) and (fn ...).

These examples uphold the assumption this paper makes, that similar contexts for a word form will contain that word form with the same diacritization (where similar contexts are those that occur in the same document). Because the original seed context is linked to a specific diacritization found in the ATB3 (Al$iEor: poetry), these new trigram contexts are clustered with that diacritized word form in the expanding (diacritized) training corpus.

3.2.4 Implementation of the Algorithm

The testing of this algorithm commenced when the diacritized LDC Arabic Treebank Part 3 (Maamouri et al. 2004) was divided into a training corpus (the first 90% of the original) and a test corpus (the remaining 10%). From this training corpus I generated a list of poly-diacritizable word forms, excluding poly-diacritizable function words. The list of poly-diacritizable word forms contained 875 unique word forms with at least 2 possible diacritizations in 8626 unique trigram contexts. These data indicate that on average, there are 10 unique trigram contexts for each potentially ambiguous diacritized word form. These 8626 trigrams formed the Web queries as stated in Step 3 above. The algorithm accessed the Web using www.alltheweb.com (using the Yahoo Web Index), and returned 20616 novel trigrams, for a total of 29242 unique disambiguating trigram contexts. Thus the average number of disambiguating contexts for each diacritization of a poly-diacritizable word form increased 3-fold. These contexts (along with the rest of the original training corpus) comprised the bootstrapped training corpus. The increase in size of the training corpus is due entirely to trigram contexts that are clustered with specific diacritized word forms.

The automatic diacritizer was trained by computing the transitional probability that word W should be given diacritization Di after the two immediately preceding words W-2 W-1, using the transitional probability equation in Step 5.
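A sketch of this training step, assuming the bootstrapped corpus has been reduced to (W-2, W-1, diacritized W) trigrams, is given below; it implements Equation 3-5 and the most-frequent-in-context decision rule described next, abstaining when a context was never seen (a Recall cost).

    from collections import Counter

    def train_transitional(trigrams):
        # trigrams: iterable of (w2, w1, diacritized_w) from the bootstrapped corpus
        tri = Counter(trigrams)
        bi = Counter((w2, w1) for w2, w1, _ in trigrams)

        def p(w2, w1, diacritized_w):
            # Equation 3-5: Freq(W-2 W-1 W_Di) / Freq(W-2 W-1)
            return tri[(w2, w1, diacritized_w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0

        return p

    def choose(p, w2, w1, candidates):
        # Apply the most probable diacritization in this context;
        # abstain (None) when no candidate was seen in the context.
        best = max(candidates, key=lambda d: p(w2, w1, d))
        return best if p(w2, w1, best) > 0 else None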


If the poly-diacritizable target word had more than one diacritization in the same trigram context, the diacritization that occurred most frequently in that context in the bootstrapped training corpus was applied to that word form in that context in the test corpus. The final step in the algorithm was to use this trained automatic diacritizer on the test set (the final 10% of the ATB3).

To measure how much the bootstrapping algorithm improves performance over other methods, I measured performance, as given by the F-Measure above, for two baselines: Baseline 1, giving the target word its most frequently occurring diacritization in the training corpus, without measuring or accounting for context; and Baseline 2, calculating the conditional probabilities from the original training corpus (pre-bootstrapping) that the trigram context Ci (W-2 W-1 W) indicates a certain diacritization Di for an ambiguous poly-diacritizable word form W in the test set. The impact of the bootstrapping algorithm is seen in the improvement over these two baselines.

3.2.5 Evaluation

The Precision, Recall, and F-Measure of both baselines and the bootstrapping algorithm were calculated with respect to the original diacritization of the test set. The results of both baselines and the bootstrapping algorithm are listed in Table 3-1.

Table 3-1. Diacritization results for algorithm and baselines

Training Model        Precision   Recall   F-Measure
Baseline 1            0.758       1.0      0.862
Baseline 2            0.967       0.217    0.354
Bootstrapped corpus   0.989       0.927    0.957

The results of the experiment reveal several points. First, the application of the most frequently occurring diacritics to ambiguous word forms leads to a perfect score for Recall: a decision is made for every poly-diacritizable word. However, as expected, this baseline training model is too inelegant to produce a very high score for Precision.


The low (0.758) score for Precision indicates that on average, 1 out of every 4 instances of a poly-diacritizable word has a diacritization that differs from the most common diacritization. Predictably, when the trigram context is measured and used to calculate transitional probability, the Precision increases. The downside of this gain in Precision is a severe drop in Recall, showing that the trigram contexts for ambiguous words in the test corpus do not commonly occur in the training corpus. Hence the lower F-Measure for Baseline 2 compared with Baseline 1, despite the increase in Precision.

The bootstrapped corpus couples a high Recall with very good Precision, for an F-Measure higher than either baseline by a significant 9.5 points. The Recall score of 0.927 shows the increase in unique disambiguating contexts that the algorithm had seen in the bootstrapped training corpus compared to the original, ATB3-only training corpus. It remains to be seen whether Recall, the lowest component of the F-Measure, can be improved by further expanding the corpus, a possibility that will be explored further in Chapter 4. The score for Precision, a very high 0.989, indicates that these new contexts are reliable for indicating which diacritization should be applied.

This high Precision score is especially encouraging compared with the results for Word Error Rate in the extant literature. As stated in Section 2.3.2.1, the lowest Word Error Rate (highest precision) reported is 4.6%, or about 0.95 Precision, in Roth et al. (2008). The results obtained here surpass those of the highest-performing algorithm in the literature. This encouraging result signals that bootstrapping a training corpus using Web queries is a promising methodology that should be explored further.


3.3 Error Analysis

An examination of the errors made by the bootstrapping algorithm reveals two broad categories of error: errors of precision or recall caused by the algorithm applying the wrong diacritics or none at all, and errors caused by diacritization errors within the training corpus. The precision errors caused by the algorithm were due to the whole-word nature of this approach to diacritization. Recall errors show that although the bootstrapping resulted in a dramatically larger set of disambiguating contexts, the test corpus still contained some contexts the diacritizer had not seen before, and given these contexts, the diacritizer was unable to make a decision.

The errors in the training data are of most interest, for they indicate some of the limitations of the Arabic Treebank (ATB) as the primary source for training and testing data in Arabic Natural Language Processing (NLP). Although the ATB, with its rich hand-annotated linguistic information, is the gold standard in the field and a valuable resource, these errors show that it cannot be used uncritically. Although human annotation is the gold standard, this does not make it a flawless standard. In fact, errors in training data in a bootstrapping algorithm such as the one presented here are especially problematic, for they bring a degree of unreliability into the results by grouping potentially good disambiguation contexts around the wrong diacritization of an ambiguous word form. The poly-diacritizable word form in the test corpus in these contexts will be mis-diacritized, not because the algorithm did not function properly, but because the seed it used to grow new contexts was flawed. For example, /wAlymn/ occurs 8 times in the training corpus and can be diacritized as either (wAlyaman: and Yemen) or (wAlyumn: and the prosperity).


In 3 of the 8 occurrences, the word form is mis-diacritized as (wAlyumn: and the prosperity) when it is clear from the contexts that the word form should have the other diacritization (wAlyaman: and Yemen). The 3 contexts are as follows: (AlErbyp AlmtHdp wAlymn: the United Arabic and Yemen), (fy twns wAlymn: in Tunis and Yemen), and (AlsEwdyp wmSr wAlymn: Saudi and Egypt and Yemen). These mis-diacritizations mean that these contexts, and any gained through the Web-query bootstrapping portion of the algorithm, will be grouped with the wrong diacritization (wAlyumn: and the prosperity), which will be given to any word forms in the test corpus found in these seed contexts or those acquired and mis-clustered from the Web.

Errors found in the seed contexts extracted from the training corpus indicate that there may be errors in the test corpus as well, which could have led to correct diacritization decisions by the algorithm being counted as precision errors and negatively affecting the algorithm's performance. Further analysis is needed to ascertain whether this did in fact occur.


CHAPTER 4
CONCLUSION

In this chapter I discuss the implications of the experiment detailed above. In addition, I explore some potentially fruitful extensions of the current work, and discuss some linguistically interesting features of the data that were collected from the Web using the algorithm.

4.1 Discussion and Potential Limitations

4.1.1 Implications of the Current Work

The success of the algorithm at improving the performance of an automatic diacritizer for Arabic indicates that the probabilistic model of a language can be simplified as a larger data set is used for training the learner. This work was able to effectively exploit the Web to bootstrap such a large training corpus from the relatively small Arabic Treebank Part 3 (ATB3). In addition, the focus of the algorithm on gathering disambiguation contexts for ambiguous poly-diacritizable words made it possible for all of the growth of the training corpus to come from reliable trigram contexts. The resultant F-Measure of over 95%, with high measures for both Precision and Recall, demonstrates the effectiveness of this approach and compares favorably with other extant work in automatic diacritization of Arabic.

The efficiency of this algorithm, compared to other works that rely on more complex language models and complicated preprocessing, supports one of the foundational assumptions of this paper: specifically, the larger the data set for modeling natural language, the simpler the model can be for accurate results. Even though they were gathered from the Web, a data source potentially filled with the noise of inaccurate data, the contexts obtained by bootstrapping were valuable for the diacritization task.


The results presented in Section 3.2.5 also support the other assumptions on which the work was based, namely, that occurrence in the same document is a good measure of topic similarity and thus of diacritization identity, and that focusing on the word-internal diacritics is an efficient way to approach the diacritization task. My study supports the utility of focusing on improvements to the training corpus for Natural Language Processing (NLP) applications, and specifically demonstrates the practicality of using the Web as a source for these improvements.

4.1.2 Potential Limitations of the Current Work

Four problems or limitations could potentially affect the performance of the automatic diacritizer. The first two have to do with the nature of Arabic-script documents on the Web; the second two have to do with the nature of the algorithm as Web-based bootstrapping.

The first two problems, relating to Arabic script on the Web, appeared while examining the results of the algorithm. First, searching the Web using an Arabic-script search string has the potential to return languages that are written in the Arabic script but that are not Arabic, such as Farsi or Urdu. Documents in these languages could possibly be retrieved by the algorithm's Web search string, but might contain unique trigrams that were not good disambiguating contexts for Arabic diacritization. The second potential pitfall that surfaced is that some Arabic-script, Arabic-language documents that are posted on the Web are machine-translated materials posted, for instance, by companies publishing technical manuals for products or the like.


In fact, these problems did not seem to deleteriously affect the performance of the algorithm, and the reason for this resiliency seems to lie within the form of the query itself. The algorithm searches for documents that contain three words in order from the ATB3. This effectively pares down the potentially returned documents, which drastically reduces the chance that a non-Arabic-language document or even a machine-translated document would be retrieved by the Web search step of the algorithm. The trigram nature of the query mandates that that exact trigram be found in any retrieved Web documents. These trigrams were first written by native Arabic-speaking writers for the news services from which the ATB3 was culled. Therefore trigrams occurring in the ATB3 are instances of natural Modern Standard Arabic (MSA) phrases. It is unlikely that these entire MSA phrases would occur in non-Arabic-language documents. Additionally, if machine-translated documents are returned by a search that includes natural MSA trigrams, it indicates that these documents, although machine-translated, are reflective of natural MSA, and can thus potentially contribute to the acquisition of novel disambiguation contexts.

If the search had been conducted with singleton Arabic-script word forms, the bootstrapped contexts might have been more polluted by bad contexts. For instance, one word form in Arabic script might show up in Farsi even if the words are two different words, and a single Arabic-script word form would be more likely to occur in a poorly translated text, which would not reflect natural MSA usage. These retrievals from the Web would be considered bad contexts, as the trigram context surrounding the target word would not be likely to disambiguate the correct diacritization; in fact, it might not be MSA at all. Aside from reliance on the trigram query, a simple filter might screen out documents that are written in non-Arabic languages by searching for any instances of the characters that languages like Farsi and Urdu use which are not in the Arabic alphabet; a sketch of such a filter follows.
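The listed code points below (Persian pe, che, zhe, gaf and several Urdu letters) are an illustrative, non-exhaustive set of characters used by Farsi and Urdu but absent from the Arabic alphabet.

    # Letters used by Farsi/Urdu but not part of the Arabic alphabet.
    NON_ARABIC_LETTERS = set(
        '\u067E\u0686\u0698\u06AF'          # Persian: pe, che, zhe, gaf
        '\u0679\u0688\u0691\u06BA\u06D2'    # Urdu: tte, ddal, rre, noon ghunna, barree ye
    )

    def probably_not_arabic(document_text):
        # Flag documents containing letters foreign to the Arabic alphabet.
        return any(ch in NON_ARABIC_LETTERS for ch in document_text)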


The second two problems have to do with the nature of the algorithm itself. First, the algorithm's means of searching the Web was a search engine. Automated searches returned the first 100 documents, as ranked in terms of relevance by the search engine. These ranked documents are the source of the bootstrapped texts that were added to the training data for the automatic diacritizer. However, there is no guarantee that these texts provide the most relevant data for the diacritization task. Document retrieval and ranking is itself an NLP task. The algorithms for document ranking that search engines use are secret, and thus, using them to bootstrap the training corpus places a degree of trust in the efficacy of the ranking algorithms, which could be misplaced if the search engines return documents that contain poor contexts for diacritization, or leave out documents (rank them below 100) that contain good contexts.

Second, the algorithm uses bootstrapping to gain the novel contexts that grow the training set for the diacritizer. Bootstrapping leads to contexts that are further and further away from the original seed, in this case, the trigram contexts that occur in the ATB3. This drift from the original disambiguating context might lead to contexts that do not correctly disambiguate the poly-diacritizable word form.

These latter two potential limitations might be overcome by using more sophisticated probability measures that weight the contexts that were bootstrapped from the Web. This weighting could be accomplished by measuring the usefulness of specific Web documents returned by the query.


It would be fruitful to categorize the usefulness of different Web genres (news pages, religious pages, blogs, etc.) and even individual sites based on accuracy, as defined by improvement to the diacritizer through the addition of novel disambiguation contexts. The possibility of extending my study to use more involved probability measures is discussed in Section 4.2.1.

4.2 Future Directions: Extensions of the Current Work

There are a number of directions in which the current work could be extended. Because the core of the algorithm is the method of rapidly and efficiently expanding the training set for a supervised machine learning automatic diacritizer, extensions could come as the result of modifying the algorithm or implementation at Step 1 (corpus choice), Step 2 (context specification), or Step 5 (probabilistic modeling methodology). The concept of bootstrapping a large training corpus could be taken a step further by iteratively bootstrapping, using all the contexts acquired here to bootstrap again, potentially growing the training set even more. A more challenging extension would be to incorporate dialectal Arabic sources into the training set for the diacritizer. Finally, the methodology of bootstrapping a training corpus using the Web could be exported to other (Arabic) NLP applications. The success of bootstrapping for this task indicates that the quality of Arabic linguistic data on the Web may be good.

4.2.1 Algorithm Modification

The ATB3 was chosen because of its status as the gold standard of human-annotated Arabic corpora, and thus its status as the training set of choice for most automatic diacritization projects. However, the efficacy of the algorithm presented here could be measured by bootstrapping a large training set from disambiguating seed contexts taken from another corpus; in essence, modifying the implementation at Step 1 of the algorithm by varying the input.


Step 2 of the algorithm could also be varied in the size and location of the context window relative to the target poly-diacritizable word (see the sketch at the end of this subsection). In place of the trigram consisting of the target word and the two words directly preceding it, the trigram context could be the target word and the following two words (W W+1 W+2). Alternatively, the disambiguating context could be the trigram of the target word with one word on either side (W-1 W W+1). Additionally, the context could be a bigram or quadrigram or any other context size. However, enlarging the context window is a double-edged sword. The number of unique disambiguating seed contexts in the training data would probably rise, which would presumably increase the number of unique n-gram disambiguation contexts retrieved from Web documents. However, the probability threshold for the algorithm to decide on some diacritization Di given some context Ci would be more difficult to calculate with a simple probability measure such as transitional probability, because each context would appear less frequently in the bootstrapped training corpus.

Other probability measures exist, but switching from the transitional probability used here entails changing the algorithm at another step. The algorithm could also be modified at Step 5 by modeling the diacritization in the bootstrapped training set with another probability measure. This modification would be interesting to explore, but does not appear pressing. The primary reason to use more exotic measures of probability to model a feature of natural language is that the available training data does not give a very strong basis on which to determine the strength of association between some known context and some choice of potential outcomes. As seen above in the results obtained for the algorithm, the data in the bootstrapped training corpus do in fact give a solid basis for measuring strength of association.
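The sketch below illustrates the Step-2 variation discussed above: a context extractor parameterized by window shape and size. The parameter names and the exact slicing policy are illustrative assumptions.

    def extract_context(tokens, i, window='preceding', n=3):
        # Return the n-gram context around the target word tokens[i].
        #   window='preceding' -> (W-2, W-1, W), the shape used in this study
        #   window='following' -> (W, W+1, W+2)
        #   window='symmetric' -> (W-1, W, W+1)
        if window == 'preceding':
            ctx = tokens[max(0, i - (n - 1)):i + 1]
        elif window == 'following':
            ctx = tokens[i:i + n]
        else:  # symmetric
            half = (n - 1) // 2
            ctx = tokens[max(0, i - half):i + half + 1]
        return tuple(ctx)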


4.2.2 Iterative Bootstrapping

Another way to extend the work presented here is iterative bootstrapping, or re-bootstrapping, to collect an even larger training corpus. Because of the success of the initial bootstrapping, the notion of gathering more disambiguating contexts holds promise for increasing Recall even further and improving the model of Arabic diacritization simply by adding more data. This fits the overall perspective of the current work, that larger amounts of data should produce higher-performing NLP applications. However, this idea is not without peril. Iterative bootstrapping would use all the trigram contexts acquired in this run of the algorithm as seed contexts for another round of Web searching. The problem with this, as in simple bootstrapping (discussed in Section 4.1.2), is that it moves farther away from the original data, increasing the possibility of bringing bad contexts into the training data. This can be mitigated in one of two ways: adding a step to the algorithm to filter the returned snippets, or using a higher-level association measure for training the diacritizer (the sketch below illustrates the iterative loop with such a filter). Iterative bootstrapping, along with one or both of these strategies, may prove fruitful, but it moves away from the generalization this paper seeks to support, that is, that a large enough data set can function as a sufficient model for natural language phenomena.
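In the loop below, acquire stands in for one round of Web acquisition (such as the Step-3 routine sketched earlier) and keep for a drift filter; both are assumptions of this sketch, not components of the implemented system.

    def iterative_bootstrap(seed_contexts, acquire, keep, rounds=2):
        # acquire: maps a set of seed contexts to a set of novel contexts.
        # keep:    predicate guarding against drift from the original seeds,
        #          e.g. an association-score threshold.
        corpus = set(seed_contexts)
        seeds = set(seed_contexts)
        for _ in range(rounds):
            novel = {c for c in acquire(seeds) if keep(c)} - corpus
            if not novel:        # stop early if a round adds nothing new
                break
            corpus |= novel
            seeds = novel        # the next round searches only from new contexts
        return corpus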


4.2.3 Dialectal Arabic on the Web

The question of extending current NLP applications in MSA to dialectal Arabic has not been thoroughly researched. The vast majority of extant NLP research in Arabic is done with MSA. In fact, of the 12 papers on diacritization reviewed above in Section 2.3, only one (Vergyri and Kirchhoff 2004) uses any dialectal Arabic as training data (Egyptian Colloquial Arabic, ECA). In Habash and Rambow (2006) the authors present a morphological analyzer for Arabic dialects; however, they cite the lack of resources for the dialects as the reason they use mostly MSA for evaluation (Habash and Rambow 2006).

The prevalence of dialectal Arabic makes it important to develop NLP applications that focus on these language varieties. The preponderance of NLP research uses a probabilistic approach, and as outlined in Section 2.2.1, these probabilistic approaches depend on access to large natural language corpora. Therefore there is a need for corpora of dialectal Arabic. Dialectal Arabic is (almost) by definition a completely spoken, rather than written, language. This makes corpus production a much more challenging task, involving speech recognition (sometimes the very task a corpus is being developed to improve!) or transcription. Some corpora exist, produced by the Linguistic Data Consortium for Gulf, Iraqi, Levantine, and Egyptian Arabic; however, these are all transcriptions of telephone conversations. In addition to being limited in discourse type, they show that expansion of the set and size of these types of dialectal corpora will only come with great expenditure of time and money.

However, there are sources of written dialectal Arabic. This Arabic often utilizes a writing system known by a number of labels, including "chat room Arabic," "Arabizi" (Yaghan 2008), "ASCII-ized Arabic" (Palfreyman and Khalil 2003), "Romanized ECA" (Warschauer, El Said, and Zohry 2002), and "Arablish." These different labels identify a fluid correspondence system that makes it possible to represent Arabic using the Roman alphabet and numerals. I will use Yaghan's term "Arabizi," which is itself a Roman-character representation of a combination of the Arabic adjectives (Earabiyap: Arabic) and (injliyziyap: English).


Arabizi is commonly used between native Arabic speakers using Computer Mediated Communication (CMC), such as email, instant messaging, text messaging, and forum posting. Although a written modality, a number of researchers have documented the proximity of CMC to spoken language (Androutsopoulos 2006, Danet and Herring 2007). Thus a corpus of Arabizi texts has the potential to provide or augment a training corpus of dialectal Arabic to model natural language phenomena for NLP.

Difficulties abound in this approach. The correspondences from Arabic letters to Arabizi graphs are not one-to-one and involve some ambiguity. In addition, because Arabizi represents dialectal Arabic spoken in diglossic settings, frequent code-switching occurs. Diglossia occurs throughout the Arabic-speaking world, between the dialects, MSA, and even the former colonial languages English and French, and thus an Arabizi corpus would contain frequent representations of non-dialectal MSA words as well as English and/or French words. This, however, may not pose such a problem, as any Arabic dialect NLP would need to handle the code-switching that is a feature of the dialects.

Of greatest import to the current work in diacritization is the potential for the explicit representation of tashkeel in Arabizi texts. In fact, I conducted a study in which I took a sample of 60 Arabizi texts from instant message conversations on Facebook. The number of short vowels realized in the text by Roman characters was divided by the total number of short vowels in the Arabic words that were being represented. It was determined that 86% of the short vowel diacritics were explicitly represented.


In addition, most of these were word-internal, which, as shown in Chapter 3 above, are the most crucial diacritics for determining the choice of lexical item. The nature of the Web as a linguistic artifact, and the prevalence of the various forms of CMC listed above, indicate that a large corpus of Arabizi could be collected. If this corpus could be cleaned as necessary, it would be a valuable resource for dialectal NLP.

4.2.4 Bootstrapping from the Web as a Methodology for other Natural Language Processing Applications

Finally, the use of bootstrapping to grow a Web-scale corpus of natural language could be used in NLP tasks other than diacritization. The success of the bootstrapping algorithm, given above, at not only expanding the training corpus, but expanding it with useful data, implies that the same approach could be used beyond the present application to automatic diacritization.

4.3 Future Directions: Interesting Data

While examining some data in the bootstrapped corpus, several intriguing phenomena appeared. They could hold potential for extensions of the current work in automatic diacritization, but they are also interesting linguistic phenomena in themselves and as such are reviewed here. They are a window into the decisions made by native Arabic speakers writing in Arabic. The notable feature of some of the data returned from the Web queries is that words were sometimes diacritized in an otherwise-undiacritized text. There were two ways of representing this information about the diacritics. The first is typing one or more diacritic marks in a word in the midst of an undiacritized text. This method I refer to as "spotlight diacritization." The other method is to include a short phrase after the poly-diacritizable word that indicates some diacritic mark, typically one of the three short vowels fatha, kasra, or dhamma, with the letter to be diacritized. This method I label "explicit mention."


4.3.1 Spotlight Diacritization

The vast majority of diacritized Arabic on the Web is contained in documents that are completely diacritized, that is, religious or pedagogical documents. Of more import to the current study, because more illuminative of the linguistic habits of Arabic writers, are the cases where a word has diacritic marks in an otherwise undiacritized document. Even more specifically, it is the short vowels appearing in documents that otherwise lack them that constitute the relevant phenomenon. The inclusion of the shadda or the tanween can be conveniently achieved by modern Arabic computer keyboards. However, the short vowels are not so easily added, which raises the question as to why they would be added at all. These cases may represent poly-diacritizable words in contexts that the authors did not think would be sufficient for the readers to disambiguate to the correct form. Thus the diacritic marks are added to aid the reader in the disambiguation task. A complicating factor is that spotlight diacritization is unsystematic; sometimes only one short vowel is added, sometimes two, sometimes all of them for a given word. Because of this lack of systematicity and the prevalence of fully diacritized documents that were returned with the Web search, a full analysis of where spotlight diacritization occurs has not been accomplished; the sketch below shows how candidate instances could be detected automatically.
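The minimal sketch below flags tokens carrying any diacritic inside a document whose overall diacritic density is low. The density threshold is an assumption of the sketch, not a figure from this study.

    import re

    HARAKAT = re.compile('[\u064B-\u0652]')  # tanween, short vowels, shadda, sukun

    def spotlight_words(tokens, max_density=0.05):
        # Return diacritized tokens from an otherwise undiacritized document;
        # return nothing for fully (or heavily) diacritized documents.
        marked = [t for t in tokens if HARAKAT.search(t)]
        density = len(marked) / len(tokens) if tokens else 0.0
        return marked if density < max_density else []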


4.3.2 Explicit Mention Diacritization

Explicit mention diacritization is a more intriguing phenomenon. An example of this would be one of two phrases following the word form /Aljd/, which can be diacritized as (Aljid: seriousness) or (Aljad: grandfather). To indicate one of these two possible diacritizations, the phrase (bikasr Aljiym: with kasra on the jiym) or (bifatH Aljiym: with fatha on the jiym) would be added after the undiacritized word form.

A preliminary analysis of explicit mention diacritization indicates that a categorization of the writers' different goals is possible. Some of the occurrences in the corpus are prescriptive indications of the linguistically "pure" pronunciation variant. Prescriptivism can also be seen in indicating a variant of a proper name. However, proper names can also be the subject of explicit mention diacritization because the name is rare and a pronunciation guide (rather than a prescriptivist rule) is given. For example, the word (elzT) with the phrase (biDam~ Alzay: with dhamma on the zay) indicates the only possible diacritization of the word, but it is a rare word referring to a Sudanese tribe, used metaphorically for giants. Sometimes the diacritization is mentioned explicitly because the words are semantically similar, but one specific form is intended by the writer and is not clear from the context. For example: (AltmAm bifatH Altay, Altamam: full moon) versus (AltmAm bikasr Altay, Altimam: longest night of the year). A final category is illustrated by the example already given to illustrate the phenomenon in general, that of the forms (Aljid: seriousness) and (Aljad: grandfather), which are not semantically related.

Due to the preliminary nature of the investigation of this phenomenon, the relative distribution of the categories is not available. However, the relevant fact about all of these categories is that in each case the writer is making a decision based on the belief that the reader will benefit from help in disambiguating ambiguous forms. Sometimes this need derives from the infrequency of the word, rendering it rare and likely to be mispronounced or even unrecognized.


Other times the need is to distinguish shades of meaning between semantically related but different lexical items, or to distinguish between two unrelated lexical items that share the root consonants in an example of Arabic homography. In these latter two cases, the assumption the writer is making is that context is insufficient for disambiguation, even by an assumedly educated reader. What I did not see, nor do I expect to find, is explicit mention diacritization for the purpose of specifying case. Written Arabic tends to only mark case for indefinites, and indefinite case is carried on long vowels that are easily typed (see the note above on the common typing of tanween).

4.3.3 Harnessing these Phenomena for further Natural Language Processing Applications

In most cases of both spotlight diacritization and explicit mention diacritization, the writer is either uncertain that the context will enable a reader to disambiguate the forms or certain that the context will be insufficient for this task. In essence, this is a statement that the context is not strongly associated with the desired diacritization, or that the diacritization is rare even given the context. A corpus of texts using these diacritization techniques can easily be constructed. This corpus would contain a special class of disambiguating contexts for the diacritization task: contexts that are strong support for rare diacritizations. With some targeted searching, the trigrams from this corpus containing the target poly-diacritizable words could be added to the bootstrapped training corpus as developed in my study. The difficulty is that these are not the kind of contexts that were acquired and used for training the diacritizer in the algorithm presented in Section 3.2.2. Those contexts had a strong association with specific diacritizations. These contexts may not be strongly associated with any diacritization, or may be associated with the wrong diacritization, at least in the minds of the writers of these documents, or the overt diacritization would not be used.


The benefit of using these texts would come in cases when the system must diacritize ambiguous words that occur in contexts either weakly associated with the correct diacritizations or strongly associated with rare diacritizations. A sketch of how such explicit-mention contexts might be harvested follows.
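The sketch below harvests explicit-mention cues with a regular expression over Arabic-script text. The pattern covers only the three bare cue words bifatH, bikasr, and biDam (بفتح, بكسر, بضم) and would need extending for orthographic variants, so it is an illustrative assumption rather than a tested extractor.

    import re

    # A word form followed by a cue phrase such as "bikasr Aljiym":
    # (target word) (bifatH|bikasr|biDam) (letter name beginning with Al-).
    EXPLICIT_MENTION = re.compile(r'(\S+)\s+(بفتح|بكسر|بضم)\s+(ال\S+)')

    def harvest_explicit_mentions(text):
        # Return (word form, vowel cue, letter name) triples; each triple
        # pairs a poly-diacritizable target with its author-supplied vowel.
        return EXPLICIT_MENTION.findall(text)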


APPENDIX
BUCKWALTER TRANSLITERATION SYSTEM

Arabic   Buckwalter   IPA          Arabic   Buckwalter   IPA
ا        A            a:           ك        k            k
ب        b            b            ل        l            l
ت        t            t            م        m            m
ث        v            θ            ن        n            n
ج        j            dʒ           ه        h            h
ح        H            ħ            و        w            w
خ        x            x            ي        y            y
د        d            d            ً         F            a:n
ذ        *            ð            ٌ         N            u:n
ر        r            r            ٍ         K            i:n
ز        z            z            َ         a            a
س        s            s            ُ         u            u
ش        $            ʃ            ِ         i            i
ص        S            sˤ           ّ         ~            (doubling)
ض        D            dˤ           ْ         o            (no vowel)
ط        T            tˤ           ة        p            t
ظ        Z            ðˤ
ع        E            ʕ
غ        g            ɣ
ف        f            f
ق        q            q


LIST OF REFERENCES

Alghamdi, Mansour and Zeeshan Muzafar. 2007. KACST Arabic diacritizer. In Proceedings of the 1st International Symposium on Computers and Arabic Language & Exhibition, Riyadh, November. King Abdulaziz City for Science and Technology, Riyadh.

Ananthakrishnan, Sankaranarayanan, Shrikanth S. Narayanan and Srinivas Bangalore. 2005. Automatic diacritization of Arabic transcripts for automatic speech recognition. In Proceedings of the International Conference on Natural Language Processing, pages 47-54, Kanpur.

Androutsopoulos, Jannis. 2006. Introduction: Sociolinguistics and computer-mediated communication. Journal of Sociolinguistics 10(4): 419-438.

Benmamoun, Elabbas. 1999. Arabic morphology: the central role of the imperfective. Lingua 108(2-3): 175-201.

BNC Consortium. 2007. The British National Corpus, version 3 (BNC XML Edition). Retrieved Nov. 2, 2009, from http://www.natcorp.ox.ac.uk/

Brustad, Kristen. 2000. The Syntax of Spoken Arabic: A comprehensive study of Moroccan, Egyptian, Syrian, and Kuwaiti dialects. Georgetown University Press, Washington, D.C.

Buckwalter, Tim. 2000. Buckwalter Arabic Morphological Analyzer Version 1.0. LDC Catalog No.: LDC2002L49. Linguistic Data Consortium, Philadelphia, PA.

Comrie, Bernard. 1976. Aspect. Cambridge University Press, Cambridge.

Dahlgren, Sven-Olof. 1998. Word Order in Arabic. Acta Universitatis Gothoburgensis, Goteborg.

Danet, Brenda and Susan C. Herring. 2007. Multilingualism on the Internet. In Marlis Hellinger and Anne Pauwels, editors, Handbook of Language and Communication: Diversity and Change. Mouton de Gruyter, Berlin, pages 553-594.

Darwish, Kareem. 2002. Building a shallow Arabic morphological analyser in one day. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 22-29, Philadelphia, PA.

Davies, Mark. 2008. The Corpus of Contemporary American English (COCA): 400+ million words, 1990-present. Available online at http://www.americancorpus.org


Elghamry, Khaled and Christian Hettick. 2009. Improving the automatic diacritization of Arabic using Web-based acquisition and clustering of discriminating contexts. Paper presented at the 23rd Arabic Linguistics Symposium, Milwaukee, WI, April 2009.

Elshafei, Moustafa, Husni Al-Muhtaseb and Mansour Alghamdi. 2006. Statistical methods for automatic diacritization of Arabic text. In Proceedings of the 18th Saudi National Computer Conference (NCC18), Riyadh, March. The Saudi Computer Society, Riyadh.

Ferguson, Charles A. 1959. Diglossia. Word 15: 325-340.

Goweder, Abduelbaset and Anne De Roeck. 2001. Assessment of a significant Arabic corpus. In Proceedings of the Arabic NLP Workshop at ACL/EACL 2001, pages 73-79, Toulouse.

Habash, Nizar and Owen Rambow. 2007. Arabic diacritization through full morphological tagging. In Proceedings of NAACL HLT, Companion Volume, pages 53-56, Rochester, NY.

Habash, Nizar and Owen Rambow. 2006. MAGEAD: a morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics, pages 681-688, Sydney.

Habash, Nizar and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the ACL, pages 573-580, Ann Arbor, MI.

Habash, Nizar, Abdelhadi Soudi and Timothy Buckwalter. 2007. On Arabic transliteration. In A. Soudi, A. van den Bosch and G. Neumann, editors, Arabic Computational Morphology: Knowledge-Based and Empirical Methods. Springer, Dordrecht, pages 15-22.

Kaye, Alan S. 1970. Modern Standard Arabic and the colloquials. Lingua 24(1): 374-391.

Kubler, Sandra and Emad Mohamed. 2008. Memory-based vocalization of Arabic. In Proceedings of the LREC Workshop on HLT and NLP within the Arabic World, Marrakesh, May. European Language Resources Association, Paris.

Maamouri, Mohamed, Ann Bies and Seth Kulick. 2006. Diacritization: a challenge to Arabic treebank annotation and parsing. In Proceedings of the Conference of the Machine Translation SIG of the British Computer Society, pages 35-47, London.

Maamouri, Mohamed, Ann Bies, Tim Buckwalter and Hubert Jin. 2004. Arabic Treebank: Part 3 v 1.0. LDC Catalog No.: LDC2004T11. Linguistic Data Consortium, Philadelphia, PA.


Nelken, Rani and Stuart M. Shieber. 2005. Arabic diacritization using weighted finite-state transducers. In Proceedings of the 2005 ACL Workshop on Computational Approaches to Semitic Languages, pages 79-86, Ann Arbor, MI.

Palfreyman, David and Muhamed al Khalil. 2003. "A funky language for teenzz to use": representing Gulf Arabic in instant messaging. Journal of Computer-Mediated Communication 9(1): 1-21.

Rogers, Henry. 2005. Writing Systems: A Linguistic Approach. Blackwell Publishing, Malden, MA.

Roth, Ryan, Owen Rambow, Nizar Habash, Mona Diab and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL:HLT), pages 117-120, Columbus, OH.

Ryding, Karin C. 2005. A Reference Grammar of Modern Standard Arabic. Cambridge University Press, Cambridge.

Schlippe, Tim, Thuy-Linh Nguyen and Stephan Vogel. 2008. Diacritization as a machine translation problem and as a sequence labeling problem. In Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas, pages 270-278, Waikiki, HI.

Vergyri, Dimitra and Katrin Kirchhoff. 2004. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In Proceedings of the COLING Workshop on Arabic-Script-Based Languages, pages 66-73, Geneva.

Versteegh, Kees. 1997. The Arabic Language. Edinburgh University Press, Edinburgh.

Warschauer, Mark, Ghada R. El Said and Ayman Zohry. 2002. Language choice online: globalization and identity in Egypt. Journal of Computer-Mediated Communication 7(4): 20-40.

Yaghan, Mohammad Ali. 2008. "Arabizi": A contemporary style of Arabic slang. Design Issues 24(2): 39-52.

Zitouni, Imed, Jeffrey S. Sorensen and Ruhi Sarikaya. 2006. Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 577-584, Sydney.


BIOGRAPHICAL SKETCH

Christian Hettick matriculated to Washington State University, graduating in 2003 with a Bachelor of Arts in business administration and a concentration in management information systems. While attending Washington State University he minored in Chinese and spent a summer studying in Beijing. His undergraduate studies culminated in a Senior Honors Thesis on the effects of Sino-U.S. relations on multinational corporations operating in China. Following his undergraduate studies, he lived and worked in Tunis, Tunisia, where he continued his language studies in Modern Standard Arabic and Tunisian Dialectal Arabic. His graduate studies in linguistics at the University of Florida have included theoretical linguistics, computational linguistics, field methods, writing systems, and Arabic.