WO2009042861A1

WO2009042861A1 - Methods, systems, and media for partially diacritizing text

Info

Publication number: WO2009042861A1
Application number: PCT/US2008/077849
Authority: WO
Inventors: Mona Talat Diab
Original assignee: The Trustees Of Columbia University In The City Of New York
Priority date: 2007-09-26
Filing date: 2008-09-26
Publication date: 2009-04-02

Abstract

Methods, systems, and media for partially diacritizing text are provided. In accordance with some embodiments, methods for partially diacritizing text are provided, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.

Description

METHODS, SYSTEMS, AND MEDIA FOR PARTIALLY DIACRITIZING TEXT

Cross Reference to Related Application

[0001] This application claims the benefit of United States Provisional Patent Application No. 60/975,482, filed September 26, 2007, and United States Provisional Patent Application No. 60/975,783, filed September 27, 2007, both of which are hereby incorporated by reference herein in their entireties.

Technical Field

[0002] The disclosed subject matter relates to partially diacritizing text. More particularly, the disclosed subject matter relates to methods, systems, and media for identifying the optimal level of diacritization in written Modern Standard Arabic (MSA) text and/or in written dialectal data.

Background

[0003] Arabic script consists of two classes of symbols: letters and diacritical marks (sometimes referred to herein as "diacritics"). While letters are always written, diacritics - i.e., marks inserted above or below particular letters that are used to express short vowels, lack of vowels, and consonantal gemination (letter doubling) - are optional. These diacritics can be used to aid the reader in disambiguating the text (e.g., distinguishing between different meanings of the word) or articulating the text correctly. However, almost all documents in Modern Standard Arabic, especially those texts which are not poetic or religious, are written using consonants only - i.e., without diacritics. For example, in the Penn Arabic Treebank (ATB) III, version 2, it has been estimated that only 1.6% of all words have some diacritics explicitly marked in the text. This significantly affects the readability and comprehension of written Arabic text. Compounding this problem is the diglossic situation: the native spoken dialects are quite different from the formal Modern Standard Arabic, the common written and academic language. For example, the different dialects provide a multitude of different pronunciations.

[0004] Various approaches attempt to solve this problem by fully restoring all of the diacritics in non-diacritized Arabic text. However, fully specifying all diacritics can hinder readability. Thus, full diacritization encodes too much information into the non-diacritized Arabic text. In addition, full diacritization has been shown to be less than optimal for several natural language processing applications, such as statistical machine translation.

[0005] There is therefore a need in the art for approaches that provide a partial diacritization for Arabic text. Accordingly, it is desirable to provide methods and systems that overcome these and other deficiencies of the prior art.

Summary

[0006] Methods, systems, and media for partially diacritizing text are provided. In accordance with some embodiments, methods for partially diacritizing text are provided, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.

[0007] In some embodiments, systems for partially diacritizing text are provided, the system comprising: means for receiving text; means for selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; means for applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and means for providing the annotated text.

[0008] In some embodiments, systems for partially diacritizing text are provided, the system comprising: a processor that: receives text; selects a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applies the selected partial diacritization scheme to annotate the received text with at least one diacritic; and provides the annotated text.

[0009] In some embodiments, computer-readable media storing computer-executable instructions that, when executed by a processor, cause the processor to perform methods for partially diacritizing text are provided. The method comprises: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.

Brief Description of the Drawings

[0010] FIG. 1 is a diagram of a mechanism for partially diacritizing Arabic text in accordance with some embodiments.

[0011] FIG. 2 is a diagram of a mechanism for extracting partial diacritizations of Arabic text and training partial diacritization schemes in accordance with some embodiments.

[0012] FIG. 3 is a schematic diagram of an illustrative system suitable for implementation of an application that partially diacritizes Arabic text in accordance with some embodiments.

[0013] FIG. 4 is a detailed example of the server and one of the workstations of FIG. 3 that can be used in accordance with some embodiments.

Detailed Description

[0014] In accordance with various embodiments, mechanisms for partially diacritizing text are provided. In some embodiments, these mechanisms can receive text. In response to receiving the text, a partial diacritization scheme is selected from multiple partial diacritization schemes based on the received text and a performance score. The selected partial diacritization scheme is applied to the text to annotate the text with at least one diacritic. The partially diacritized text can then be provided.

[0015] These mechanisms can be used in a variety of applications. For example, these mechanisms can be incorporated into an automatic speech recognition (ASR) system, where partially diacritized Arabic text or Arabic words can be transmitted to a text-to-speech component for articulating the partially diacritized Arabic text. Since the optimal or enhanced level of diacritization renders an optimal or enhanced level of variation, such diacritized text can be used effectively in language modeling for ASR systems and, more particularly, in the pronunciation dictionary component. In addition, such diacritized text can be used effectively in language modeling for Optical Character Recognition (OCR) systems and, more particularly, in the orthographic dictionary component. Combined with lattice decoding, such optimal or enhanced partial diacritizations allow for better performance of such natural language processing (NLP) applications.

[0016] In another example, these mechanisms can be incorporated into an Internet web browser such that the partial diacritization scheme can be automatically applied to Arabic text displayed to a user, thereby facilitating the readability and comprehension of the Arabic text. In yet another example, these mechanisms can be incorporated into Web servers so that the partial diacritization scheme can be automatically applied when a Web page is stored on the server or delivered to a user. In yet another example, the diacritized Arabic text can be transmitted to a statistical machine translator (SMT) that translates the partially diacritized Arabic text into another language.

[0017] Alternatively, these mechanisms can also be incorporated into a word processing application, a publishing application, and/or any other suitable document or publishing applications for producing reading materials, thereby setting a more comprehensible reading standard for Arabic texts.

[0018] It should be noted that although these embodiments are primarily described using Modern Standard Arabic (MSA), this is only illustrative. These mechanisms can be applied to any suitable Arabic script-based languages, such as Arabic, Farsi, Kurdish, Malay, Persian, Urdu, etc. For example, partial diacritization schemes can be generated for text written in a Perso-Arabic script. Alternatively, these mechanisms can be applied to any suitable non-Latin alphabet that uses diacritics, marks, or characters to alter pronunciation, differentiate between similar words, etc. For example, in the Greek alphabet, diacritics (e.g., acute accent, grave accent, circumflex accent, etc.) are used to indicate different pronunciations, pitch accents, and/or breathings. In another example, in Hebrew orthography, niqqud is the set of diacritics used to represent vowels, distinguish between different pronunciations, etc. In yet another example, in the Vietnamese alphabet, hook and horn diacritics are used over vowels. These mechanisms for partial diacritization can be applied to any suitable language that uses diacritics.

[0019] As used herein, a diacritic is a mark or symbol inserted above or below a particular letter and can be used to indicate a short vowel, a lack of a vowel, a consonantal gemination (letter doubling), etc. Generally speaking, there are three types of diacritics in Arabic: a vowel, a nunation, and a shadda (gemination).

[0020] Vowel diacritics represent three short vowels and a diacritic indicating the absence of any vowel. The following are the four vowel diacritics exemplified in conjunction with the letter b (ب):

بَ ba (fatha), بُ bu (damma), بِ bi (kasra), and بْ bo (no vowel).

It should be noted that the absence of a vowel is sometimes referred to as the sukuun diacritic. The sukuun diacritic marks the boundaries between syllables.

[0021] Nunation diacritics occur in the final position of a word in nominals (e.g., nouns, adjectives, and adverbs). Nunations indicate a short vowel followed by an unwritten n sound, for example:

بً bF, بٌ bN, and بٍ bK.

These nunations are generally an indicator of nominal indefiniteness.

[0022] The shadda diacritic is a consonant doubling diacritic, for example:

بّ b~ (bb).

In another example, the shadda can be combined with vowel or nunation diacritics, such as:

[0023] It should be noted that, in some embodiments, other diacritics can also be included. The Arabic mark hamza can be placed in connection with a number of letters. For example:

Note that Arabic encodings generally do not count the hamza as a diacritic, but rather as a part of the letter (e.g., like the dot on a lower-case Roman letter "i" or under the Arabic letter "b").

[0024] In some embodiments, diacritics can include lexical diacritics. Lexical diacritics are generally used to distinguish between two lexemes (alternatively expressed as lemmas, citation forms, or derivational forms) or an abstraction over inflected word forms which group together the word forms that differ only in terms of one of the inflectional morphological categories, such as, for example, number, gender, aspect, voice, etc. Arabic lexeme citation forms are third masculine singular perfective for verbs and masculine singular (or feminine singular if the masculine form is not possible) for nouns and adjectives. For example, the diacritization difference between the lexemes كاتِب kAtib (writer) and كاتَب kAtab (to correspond) distinguishes between the meanings of the word (lexical disambiguation) rather than their inflections. Diacritics can be used to mark lexical variation. A common example with the shadda (gemination) diacritic is the distinction between Form I and Form II of Arabic verb derivations. Form II indicates, in most cases, added causativity to the Form I meaning. Form II is marked by doubling the second radical of the root used in Form I, for example:

اكَل Akal 'ate' versus اكَّل Ak~al 'fed'.

[0025] In some embodiments, diacritics can include inflectional diacritics. Inflectional diacritics are generally used to distinguish different inflected forms of the same lexeme. For example, the final diacritics in كِتابُ kitAbu 'book [nominative]' and كِتابَ kitAba 'book [accusative]' distinguish the morpho-syntactic case of "book" (e.g., whether the word is the subject or the object of the verb). It should be noted that, in some embodiments, additional inflectional features (e.g., voice, mood, definiteness, etc.) can be shown with other inflectional diacritics.

[0026] In accordance with some embodiments, a partial diacritization scheme can be selected and performed on text (e.g., Arabic text) using a process 100 as illustrated in FIG. 1. As shown, text is received at 102. The text may be received from any suitable source. For example, in some embodiments, these mechanisms detect that a file has been accessed and determine that the file contains text that is written in Arabic or any other suitable language that uses diacritics. In some embodiments, these mechanisms receive Arabic text from a word processing, publishing, or any other suitable application. In some embodiments, these mechanisms receive Arabic text from a web server.

[0027] It should be noted that, in some embodiments, the Arabic text is extracted from the received text. For example, in response to a user accessing a web page that includes both Arabic text and English text, the Arabic text can be extracted from the web page for applying a partial diacritization scheme.

[0028] At 104, a partial diacritization scheme is selected from multiple partial diacritization schemes. These partial diacritization schemes can draw on linguistic specifications of different types of diacritization. For example, partial diacritization schemes can include inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, any other suitable partial diacritization scheme, or any suitable combination thereof.
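The extraction of Arabic text from mixed-language input described in paragraph [0027] can be illustrated with a minimal sketch. It is not taken from the disclosure: the function name `extract_arabic_spans` is hypothetical, and the sketch assumes that "Arabic text" can be approximated as runs of code points in the basic Unicode Arabic block (U+0600 to U+06FF), optionally joined by whitespace.

```python
import re

# Assumption: Arabic text is approximated as runs of characters from the
# basic Unicode Arabic block (U+0600 to U+06FF), optionally joined by spaces.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*")


def extract_arabic_spans(text):
    """Return (start, end, substring) for each run of Arabic-script text."""
    return [(m.start(), m.end(), m.group()) for m in ARABIC_RUN.finditer(text)]


# Hypothetical mixed Arabic/English input, such as text taken from a web page.
page = "Title: كتاب جديد (new book)"
for start, end, span in extract_arabic_spans(page):
    print(start, end, span)
```

Returning character offsets along with each span would let a caller write the partially diacritized text back into its original position in the page.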

[0029] The lexical diacritization schemes include one or more rules to mark or annotate Arabic words with the idiosyncratic types of diacritics that are typically expressed in the form of derivational morphology, such as the letter doubling (shadda), absence of a vowel diacritic marking syllable boundaries (sukuun), among others. For example, as shown below in Table 1, GEM is a lexical diacritization scheme that marks the words in the data with the shadda diacritic (~). Words that have a gemination diacritic in the underlying lemma form are explicitly marked with the shadda diacritic (~). As also shown in Table 1, SUK is a lexical diacritization scheme that marks words in the data with the sukuun diacritic (o) (the marker signifying the absence of a vowel). Words that have a sukuun in the underlying lemma form are marked with the sukuun (o) diacritic.

[0030] In some embodiments, one suitable lexical diacritization scheme can include a combination of lexical diacritization schemes. For example, a lexical diacritization scheme can have rules for marking Arabic words with the shadda diacritic and the sukuun diacritic. In another example, a lexical diacritization scheme can have rules for marking Arabic words with any suitable combination of lexical diacritics.

[0031] The inflectional diacritization schemes include one or more rules to mark or annotate Arabic words with the predictable forms of diacritics, such as case, mood, passivation, definiteness, etc. Inflectional diacritics are generally used to distinguish different inflected forms of the same lexeme. For example, as shown in Table 1, PASS is an inflectional diacritization scheme that marks the verb passivation or damma diacritic. As also shown in Table 1, C-M or CASE-MOOD is an inflectional diacritization scheme that marks words with diacritics representing case (e.g., the a, i, u, F, K, and N marks on nominals) and/or mood (e.g., the a, o, and u marks on subjunctive, jussive, and indicative mood of verbs).

Table 1

[0032] Table 1 shows an example of the contrasting diacritization schemes (e.g., no diacritization, multiple partial diacritizations, and full diacritization) applied to the sentence "the walls will be restored." It should be noted that the diacritization scheme NONE shows the Arabic sentence without diacritics, while the diacritization scheme FULL shows the Arabic sentence with all of the possible diacritics.

[0033] As shown in Table 1, the PASS diacritization scheme annotates the Arabic sentence with the damma (u) diacritic or verb passivation (e.g., sturmm). The C-M diacritization scheme annotates the verb in the Arabic sentence with a diacritic to show the indicative mood of the verb. In particular, a damma is added to the end of the verb (e.g., strmmu) to indicate that the word means "will be restored." In addition, the C-M diacritization scheme marks the noun with a nominative case diacritic to indicate that the word is the subject of the sentence (e.g., AljdrAnu).

[0034] As also shown in Table 1, the GEM diacritization scheme inserts the gemination or shadda diacritic (~) into the verb (e.g., strm~m). The SUK diacritization scheme inserts the sukuun diacritic (o) in the noun to indicate the absence of a vowel (e.g., AljdorAn).

[0035] In some embodiments, a frequency-based partial diacritization scheme can be used, where the frequency-based partial diacritization scheme can identify the distinctive diacritization to associate with a word given the context. It should be noted that the frequency-based partial diacritization scheme is a combination of an inflectional diacritization scheme and a lexical diacritization scheme. This frequency-based partial diacritization scheme can be calculated over several corpora based on frequencies of occurrences of the distinct diacritics in a large collection of fully diacritized texts, such as Arabic Gigaword or any other suitable collection of text. It should be noted that the frequency-based diacritization scheme marks the least frequent cases to distinguish them from the frequent readings, especially in the case of minimal pairs.

[0036] In some embodiments, statistical information (e.g., distributions, etc.) from one or more corpora can be used for frequency-based partial diacritization schemes. For example, in the Penn Arabic Treebank (ATB) III, version 2, 1.6% of the words have some diacritics. Among these, the most common diacritics are the nunation diacritics (e.g., F, K, and N), accounting for 73.4% of the naturally occurring diacritics in the ATB. A majority of these nunation diacritics are used inflectionally to mark nominals (nouns, adjectives, proper nouns, etc.), indicating case assignment together with indefiniteness. For example, the F diacritic marks the accusative case, the K diacritic marks the genitive case, and the N diacritic marks the nominative case. The next most frequent diacritic is the shadda diacritic. The shadda diacritic (gemination) accounts for 20.8% of the naturally occurring diacritics in the ATB and occurs 56.7% of the time with verbs. The third most frequent diacritic is the damma, which accounts for 3% of the naturally occurring diacritics in the ATB. The majority of the usage of the damma diacritic is to indicate the passive form of verbs. These statistics and distributions of naturally occurring diacritics can be used to train the partial diacritization scheme.
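Statistics of this kind (the fraction of words carrying any diacritic, and each diacritic's share of all diacritics) can be gathered with a simple counting pass. The following is a minimal sketch under stated assumptions, not the procedure used to obtain the ATB figures: tokens are assumed to be in Buckwalter transliteration, where the characters a, i, u, o, F, K, N, and ~ are diacritics, and the function name `diacritic_stats` and the sample tokens are made up for illustration.

```python
from collections import Counter

# Assumption: Buckwalter transliteration, in which these characters are
# diacritics and every other character is a letter.
DIACRITICS = set("aiuoFKN~")


def diacritic_stats(tokens):
    """Return (fraction of tokens with a diacritic, share of each diacritic)."""
    marked_words = 0
    counts = Counter()
    for tok in tokens:
        marks = [ch for ch in tok if ch in DIACRITICS]
        if marks:
            marked_words += 1
            counts.update(marks)
    total = sum(counts.values())
    shares = {d: n / total for d, n in counts.items()} if total else {}
    word_rate = marked_words / len(tokens) if tokens else 0.0
    return word_rate, shares


# Made-up sample tokens for illustration only (not ATB data).
sample = ["ktAbAF", "mn", "strm~m", "fy", "bytK", "AljdrAn", "drsN", "El~m"]
word_rate, shares = diacritic_stats(sample)
print(word_rate, shares)
```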

[0037] It should be noted that statistics from any suitable source can be used to train the partial diacritization schemes. For example, in some embodiments, statistics and distributions from Arabic words that are diacritized using one or more partial diacritization schemes can be used to train and update the partial diacritization schemes. In another example, statistics and distributions from Arabic Gigaword can be used to train the partial diacritization schemes.

[0038] In some embodiments, if a word has one unique diacritization associated with it, the partial diacritization scheme can determine to not mark the word. That is, the frequency-based partial diacritization scheme can determine to not annotate the word with diacritics because the most frequent reading of the word is the valid reading.

[0039] For example, the word "mn" in Arabic is typically written without a diacritic. This sequence of letters "mn" can be used to mean the most frequent rendering of the preposition "of," which can be expressed with the diacritic "i" as "min." This sequence of letters can also be used to mean the less frequent reading of "who," which is expressed with the diacritic "a" as "man." In addition, this sequence of letters can be used to mean the even less frequent reading of the verb "to bestow" expressed with the diacritic "a" and the shadda or letter doubling diacritic as "mann." Accordingly, in some embodiments, the words "who" and "bestow" can be rendered with diacritics, while the most frequently occurring word "of" can be rendered without diacritics.

[0040] In some embodiments, the partial diacritization scheme may determine which diacritics to insert based on frequency. For example, the fully disambiguated form of the Arabic verb "to bestow" is expressed with the diacritic "a" and the shadda or letter doubling diacritic. However, the partial diacritization scheme can determine to only mark the Arabic word with the letter doubling diacritic as it is the most distinctive feature.
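The "mn" example can be sketched as follows. This is one possible heuristic, not necessarily the disclosed implementation: the most frequent reading is left bare, and each less frequent reading keeps the smallest subset of its diacritics that yields a form not already in use. The function names, the Buckwalter-style spellings (min, man, man~), and the counts are assumptions made for illustration.

```python
from itertools import combinations

# Assumption: Buckwalter transliteration, in which these are diacritics.
DIACRITICS = set("aiuoFKN~")


def render(full, keep_positions):
    """Drop every diacritic of `full` except those at `keep_positions`."""
    return "".join(ch for i, ch in enumerate(full)
                   if ch not in DIACRITICS or i in keep_positions)


def frequency_based_forms(readings):
    """Map each fully diacritized reading to a partially diacritized form.

    `readings` maps a fully diacritized form to its (hypothetical) count.
    """
    ordered = sorted(readings, key=readings.get, reverse=True)
    used, result = set(), {}
    for rank, full in enumerate(ordered):
        positions = [i for i, ch in enumerate(full) if ch in DIACRITICS]
        if rank == 0:
            form = render(full, set())  # most frequent reading stays bare
        else:
            form = full  # fall back to full marking if nothing smaller is free
            for size in range(1, len(positions) + 1):
                free = [f for f in (render(full, set(c))
                                    for c in combinations(positions, size))
                        if f not in used]
                if free:
                    form = free[0]
                    break
        used.add(form)
        result[full] = form
    return result


# Hypothetical counts; only their relative order matters here.
print(frequency_based_forms({"min": 900, "man": 90, "man~": 10}))
```

With these assumed counts, "of" is rendered as the bare "mn", "who" keeps its "a", and "to bestow" keeps only the shadda, which matches the behavior described in paragraphs [0039] and [0040].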

[0041] In accordance with some embodiments, these mechanisms can derive each of the partial diacritization schemes (e.g., inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, etc.) from the fully disambiguated form of the word.

[0042] As shown in FIG. 2, in response to receiving text from any suitable source (e.g., from a text-to-speech application) at 202, a full diacritization can be conducted on the received text to generate a fully disambiguated form of the received text. For example, in some embodiments, the received text can be transmitted to a full diacritization application, such as the Morphological Analysis and Disambiguation of Arabic (MADA) system. The MADA system tags morphologically rich languages by using a set of taggers that are trained for individual linguistic features (e.g., core part-of-speech, tense, number, etc.). As shown previously in Table 1, the FULL diacritization scheme specifies all of the diacritics in the sentence "the walls will be restored" using the MADA system.

[0043] It should be noted that any suitable approach for fully diacritizing words (e.g., Arabic words or words in any alphabet that uses diacritics) can be used. For example, the received text can be transmitted to a morphological analysis system, such as the Buckwalter Arabic Morphological Analysis (BAMA) system.

[0044] At 206, partial diacritizations of the received text can be extracted from the fully diacritized text. In some embodiments, particular diacritics can be selectively removed from the fully disambiguated form based on the partial diacritization scheme. The lexical and inflectional partial diacritizations can be extracted directly from the fully disambiguated form. For example, as described previously, the GEM diacritization can selectively remove all diacritics from the Arabic words except for the shadda diacritics on verbs. In another example, the SUK diacritization scheme can selectively remove from the fully disambiguated form all diacritics except for the sukuun diacritic in nouns.

[0045] Accordingly, based on the specifications of each partial diacritization scheme, particular diacritics are selectively removed from each word to extract the partial diacritization.
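The selective removal described above can be sketched as a filter over a fully diacritized form. This is an illustration under stated assumptions rather than the disclosed implementation: the input is Buckwalter transliteration, the fully diacritized sentence `saturam~amu AljudorAnu` is assumed for the example (the source text does not reproduce the FULL row of Table 1), C-M is approximated as "keep only a word-final diacritic", and PASS is omitted because it requires knowing which damma marks passive voice.

```python
# Assumption: Buckwalter transliteration, in which these are diacritics.
DIACRITICS = set("aiuoFKN~")

# Each scheme is reduced to the diacritics it keeps everywhere, plus whether
# a word-final diacritic (case or mood marker) is kept. This is a
# simplification for illustration.
SCHEMES = {
    "NONE": {"keep": set(), "keep_final": False},
    "GEM": {"keep": {"~"}, "keep_final": False},
    "SUK": {"keep": {"o"}, "keep_final": False},
    "C-M": {"keep": set(), "keep_final": True},
    "FULL": {"keep": DIACRITICS, "keep_final": True},
}


def extract_partial(word, keep, keep_final):
    """Remove every diacritic from `word` that the scheme does not keep."""
    last = len(word) - 1
    return "".join(
        ch for i, ch in enumerate(word)
        if ch not in DIACRITICS or ch in keep or (keep_final and i == last)
    )


def apply_scheme(sentence, scheme):
    """Apply a named scheme to each whitespace-separated word."""
    spec = SCHEMES[scheme]
    return " ".join(extract_partial(w, **spec) for w in sentence.split())


# Assumed fully diacritized form of "the walls will be restored".
full = "saturam~amu AljudorAnu"
for name in SCHEMES:
    print(name, apply_scheme(full, name))
```

Under the assumed full form, the GEM, SUK, and C-M outputs agree with the example forms given in paragraphs [0033] and [0034] (strm~m, AljdorAn, strmmu, and AljdrAnu).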

[0046] In some embodiments, the partial diacritization of particular words can be used to train the partial diacritization scheme at 208. For example, the mechanisms can be trained on data in a specific partial diacritization scheme. In response to receiving new Arabic text, the new Arabic text can be annotated with diacritics using the selected partial diacritization scheme.

[0047] Referring back to FIG. 1, a partial diacritization scheme is selected from multiple partial diacritization schemes (e.g., inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, any other suitable partial diacritization scheme, or any suitable combination thereof).

[0048] The partial diacritization scheme can be selected based on any suitable criteria. For example, in some embodiments, the partial diacritization scheme can be selected based on genre and domain information. In another example, in some embodiments, the partial diacritization scheme can be selected based on a performance score or a performance measurement of each partial diacritization scheme. In yet another example, the selected partial diacritization scheme can be the optimal partial diacritization scheme (e.g., the partial diacritization scheme with the best or highest performance score).

[0049] In some embodiments, a performance score calculated by the MADA system can be used to select the partial diacritization scheme. To measure the degree of undergeneration or overgeneration of the system using a particular partial diacritization scheme, the precision, the recall, and the F-score of the statistical machine translation (SMT) system or any other suitable system can be measured. The precision measurement, which is the ratio of true positives over the total number of true and false positives, measures to what extent overgeneration of diacritics occurs. The recall measurement, which is the ratio of true positives over the sum of true positives and false negatives, measures the amount of undergeneration. The F-score measurement is the harmonic mean of precision and recall. As shown below in Table 2, the F-score shows the performance score or quality measurement of the specific diacritization scheme.
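The precision, recall, and F-score definitions above can be made concrete with a short sketch. It is illustrative only: it assumes Buckwalter transliteration, treats each diacritic as a (letter index, mark) pair, and compares a predicted word against a gold word; the helper names and the two example words are hypothetical and do not come from Table 2.

```python
# Assumption: Buckwalter transliteration, in which these are diacritics.
DIACRITICS = set("aiuoFKN~")


def diacritic_marks(word):
    """Return the set of (index of preceding letter, diacritic) pairs."""
    marks, base = set(), -1
    for ch in word:
        if ch in DIACRITICS:
            marks.add((base, ch))
        else:
            base += 1
    return marks


def precision_recall_f(predicted, gold):
    """Score predicted diacritics against gold diacritics for one word."""
    pred, ref = diacritic_marks(predicted), diacritic_marks(gold)
    tp = len(pred & ref)   # diacritics correctly produced
    fp = len(pred - ref)   # overgenerated diacritics
    fn = len(ref - pred)   # undergenerated (missed) diacritics
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f


# Hypothetical prediction and gold forms, for illustration only.
print(precision_recall_f("sturm~m", "strm~mu"))
```

In this made-up pair, the shadda is a true positive, the extra damma after "t" is a false positive, and the missing final damma is a false negative, so precision, recall, and F-score are all 0.5.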

Table 2

[0050] It should be noted that any suitable criteria can be used to select the partial diacritization scheme. For example, in some embodiments, partial diacritization schemes can be applied from different scientific perspectives (e.g., psycholinguistic and neurological studies coupled with computational modeling in the context of natural language processing machinery). In another example, the partial diacritization scheme can be selected in response to determining the partial diacritization scheme with the highest MADA F-score (or any other suitable performance score determined by any suitable system that assigns full diacritizations to Arabic text). In yet another example, the optimal partial diacritization scheme can be selected in response to determining the partial diacritization scheme with the highest F-score and determining the partial diacritization scheme with rules that best fit the psycholinguistics associated with the received Arabic text.

[0051] It should also be noted that the selected partial diacritization scheme can be determined based on a combination of criteria. For example, a multidimensional table can be used to account for different combinations of criteria (e.g., performance score, domain information, genre information, frequency, etc.). The determination of the selected partial diacritization scheme to use on particular text can be calculated by assigning priorities to each of the criteria (e.g., the performance score is the most important criterion and the frequency is the least important criterion).
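One way to read "assigning priorities to each of the criteria" is a lexicographic comparison, in which a lower-priority criterion only breaks ties among schemes that are equal on every higher-priority criterion. The sketch below illustrates that reading. It is an assumption rather than the disclosed method, and the function name, criterion names, and scores are invented for the example.

```python
def select_scheme(candidates, priority):
    """Pick the scheme that is best under a lexicographic priority order.

    `candidates` maps a scheme name to {criterion: score}, higher is better.
    `priority` lists criteria from most to least important.
    """
    def key(name):
        # Missing criteria are treated as the lowest score (0.0).
        return tuple(candidates[name].get(c, 0.0) for c in priority)

    return max(candidates, key=key)


# Invented scores for illustration; they are not results from the disclosure.
candidates = {
    "GEM": {"f_score": 0.9, "genre_fit": 0.2},
    "SUK": {"f_score": 0.9, "genre_fit": 0.7},
    "C-M": {"f_score": 0.8, "genre_fit": 1.0},
}
print(select_scheme(candidates, ["f_score", "genre_fit"]))
print(select_scheme(candidates, ["genre_fit", "f_score"]))
```

A weighted sum over the criteria would be an alternative design if lower-priority criteria should be able to outweigh small differences in the performance score.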

[0052] Referring back to FIG. 1, in response to selecting the partial diacritization scheme from multiple partial diacritization schemes based on particular criteria at 104, the partial diacritization scheme is applied to annotate the received text with one or more diacritics at 106. At 108, the annotated text is provided. For example, these mechanisms can be incorporated into an automatic speech recognition (ASR) system, where the annotated text can be transmitted to a text-to-speech component for articulating the partially diacritized Arabic text. In another example, the annotated text can be transmitted to a statistical machine translator (SMT) that translates the partially diacritized Arabic text into another language.

[0053] FIG. 3 is a schematic diagram of an illustrative system 300 suitable for implementation of an application that partially diacritizes Arabic text in accordance with some embodiments. As illustrated, system 300 can include one or more workstations 302. Workstations 302 can be local to each other or remote from each other, and are connected by one or more communications links 304 to a communications network 306 that is linked via a communications link 308 to a server 310.

[0054] In system 300, server 310 can be any suitable server for executing the application, such as a processor, a computer, a data processing device, or a combination of such devices. Communications network 306 can be any suitable computer network including the Internet, an intranet, a wide-area network (WAN), a local-area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), or any combination of any of the same. Communications links 304 and 308 can be any communications links suitable for communicating data between workstations 302 and server 310, such as network links, dial-up links, wireless links, hard-wired links, etc. Workstations 302 can be publishing media devices, personal computers, laptop computers, mainframe computers, dumb terminals, data displays, Internet browsers, personal digital assistants (PDAs), two-way pagers, wireless terminals, portable telephones, etc., or any combination of the same. Workstations 302 and server 310 can be located at any suitable location. In one embodiment, workstations 302 and server 310 can be located within an organization. Alternatively, workstations 302 and server 310 can be distributed between multiple organizations.

[0055] The server and one of the workstations, which are depicted in FIG. 3, are illustrated in more detail in FIG. 4. Referring to FIG. 4, workstation 302 can include processor 402, display 404, input device 406, and memory 408, which can be interconnected. In a preferred embodiment, memory 408 contains a storage device for storing a workstation program for controlling processor 402. Memory 408 can also contain an application for partially diacritizing Arabic text in accordance with some embodiments. In some embodiments, the application can be resident in the memory of workstation 302 or server 310.

[0056] In one particular embodiment, the application can include client-side software, hardware, or both. For example, the application can encompass one or more Web-pages or Web-page portions (e.g., via any suitable encoding, such as HyperText Markup Language (HTML), Dynamic HyperText Markup Language (DHTML), Extensible Markup Language (XML), JavaServer Pages (JSP), Active Server Pages (ASP), Cold Fusion, or any other suitable approaches).

[0057] Although the application is described herein as being implemented on a workstation, this is only illustrative. The application can be implemented on any suitable platform (e.g., a personal computer (PC), a mainframe computer, a dumb terminal, a data display, a two-way pager, a wireless terminal, a portable telephone, a portable computer, a palmtop computer, a H/PC, an automobile PC, a laptop computer, a personal digital assistant (PDA), a combined cellular phone and PDA, etc.) to provide such features.

[0058] Processor 402 can use the workstation program to present on display 404 the application and the data received through communication link 304 and commands and values transmitted by a user of workstation 302. It should also be noted that data received through communication link 304 or any other communications links can be received from any suitable source, such as web services. Input device 406 can be a computer keyboard, a mouse, a touch-sensitive screen, a cursor-controller, a dial, a switchbank, a lever, or any other suitable input device as would be used by a designer of input systems or process control systems.

[0059] Server 310 can include processor 420, display 422, input device 424, and memory 426, which can be interconnected. In a preferred embodiment, memory 426 contains a storage device for storing data received through communication link 308 or through other links, and also receives commands and values transmitted by one or more users. The storage device further contains a server program for controlling processor 420. In some embodiments, memory 426 can also contain an application for partially diacritizing Arabic text. For example, server 310 can use the application to partially diacritize text (e.g., Arabic text) that passes through server 310.
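By way of illustration only, the following Python sketch shows one way an application resident on server 310 might handle text that passes through it: determine whether the text contains Arabic characters, obtain a fully diacritized form, and derive a partial diacritization by selectively removing diacritics. The scheme names, the `SCHEMES` table, and the functions `contains_arabic`, `apply_scheme`, and `handle_text` are hypothetical and are not taken from the disclosure; the full diacritizer is assumed to be supplied by the caller.

```python
from typing import Callable

# Arabic diacritical marks (Unicode combining characters).
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"  # gemination mark, absence-of-vowel mark

ALL_DIACRITICS = frozenset(
    {FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SHADDA, SUKUN}
)

# Hypothetical schemes for illustration: each name maps to the diacritics it keeps.
SCHEMES = {
    "none": frozenset(),
    "gemination_only": frozenset({SHADDA}),
    "sukun_only": frozenset({SUKUN}),
    "full": ALL_DIACRITICS,
}


def contains_arabic(text: str) -> bool:
    """Return True if any character lies in the basic Arabic block U+0600..U+06FF."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def apply_scheme(fully_diacritized: str, scheme: str) -> str:
    """Drop every diacritic that the named scheme does not keep."""
    keep = SCHEMES[scheme]
    return "".join(
        ch for ch in fully_diacritized
        if ch not in ALL_DIACRITICS or ch in keep
    )


def handle_text(text: str, fully_diacritize: Callable[[str], str], scheme: str) -> str:
    """Pass non-Arabic text through unchanged; otherwise partially diacritize it.

    `fully_diacritize` is an assumed, caller-supplied full diacritizer.
    """
    if not contains_arabic(text):
        return text
    return apply_scheme(fully_diacritize(text), scheme)
```

In this sketch a scheme is simply the set of marks it retains, so that adding a scheme amounts to adding one entry to the table; schemes that depend on word-level or morphological information would need a richer representation.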

[0060] Accordingly, methods, systems, and media for partially diacritizing text are provided.

[0061] Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention, which is only limited by the claims which follow. Features of the disclosed embodiments can be combined and rearranged in various ways.

Claims

What is claimed is:

1. A method for partially diacritizing text, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.

2. The method of claim 1, further comprising determining that at least a portion of the received text is in a non-Latin alphabet.

3. The method of claim 1, further comprising determining that at least a portion of the received text is in Arabic.

4. The method of claim 1, further comprising generating a disambiguated form of the received text by conducting a full diacritization on the received text.

5. The method of claim 4, further comprising extracting a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.

6. The method of claim 1, further comprising using the annotated text to train the selected partial diacritization scheme.

7. The method of claim 1, further comprising training each partial diacritization scheme with statistics corresponding to a corpus.

8. The method of claim 1, wherein selecting the partial diacritization scheme from a plurality of partial diacritization schemes further comprises: determining the performance score of each of the plurality of partial diacritization schemes on the received text; and determining the partial diacritization scheme with the highest performance score.

9. The method of claim 1, wherein selecting the partial diacritization scheme from a plurality of partial diacritization schemes further comprises: receiving a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; assigning priorities to each of the plurality of criteria; generating a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and determining the partial diacritization scheme from the multidimensional table.

10. The method of claim 1, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.

11. The method of claim 1, further comprising determining whether a distinctive diacritization is associated with a word of the received text.

12. The method of claim 11, wherein the distinctive diacritization is based on a frequent rendering in a corpus.

13. A system for partially diacritizing text, the system comprising: means for receiving text; means for selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; means for applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and means for providing the annotated text.

14. The system of claim 13, further comprising means for determining that at least a portion of the received text is in a non-Latin alphabet.

15. The system of claim 13, further comprising means for determining that at least a portion of the received text is in Arabic.

16. The system of claim 13, further comprising means for generating a disambiguated form of the received text by conducting a full diacritization on the received text.

17. The system of claim 16, further comprising means for extracting a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.

18. The system of claim 13, further comprising means for using the annotated text to train the selected partial diacritization scheme.

19. The system of claim 13, further comprising means for training each partial diacritization scheme with statistics corresponding to a corpus.

20. The system of claim 13, further comprising: means for determining the performance score of each of the plurality of partial diacritization schemes on the received text; and means for determining the partial diacritization scheme with the highest performance score.

21. The system of claim 13, further comprising: means for receiving a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; means for assigning priorities to each of the plurality of criteria; means for generating a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and means for determining the partial diacritization scheme from the multidimensional table.

22. The system of claim 13, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.

23. The system of claim 13, further comprising means for determining whether a distinctive diacritization is associated with a word of the received text.

24. The system of claim 23, wherein the distinctive diacritization is based on a frequent rendering in a corpus.

25. A system for partially diacritizing text, the system comprising: a processor that: receives text; selects a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applies the selected partial diacritization scheme to annotate the received text with at least one diacritic; and provides the annotated text.

26. The system of claim 25, wherein the processor is further configured to determine that at least a portion of the received text is in a non-Latin alphabet.

27. The system of claim 25, wherein the processor is further configured to determine that at least a portion of the received text is in Arabic.

28. The system of claim 25, wherein the processor is further configured to generate a disambiguated form of the received text by conducting a full diacritization on the received text.

29. The system of claim 28, wherein the processor is further configured to extract a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.

30. The system of claim 25, wherein the processor is further configured to use the annotated text to train the selected partial diacritization scheme.

31. The system of claim 25, wherein the processor is further configured to train each partial diacritization scheme with statistics corresponding to a corpus.

32. The system of claim 25, wherein the processor is further configured to: determine the performance score of each of the plurality of partial diacritization schemes on the received text; and determine the partial diacritization scheme with the highest performance score.

33. The system of claim 25, wherein the processor is further configured to: receive a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; assign priorities to each of the plurality of criteria; generate a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and determine the partial diacritization scheme from the multidimensional table.

34. The system of claim 25, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.

35. The system of claim 25, wherein the processor is further configured to determine whether a distinctive diacritization is associated with a word of the received text.

36. The system of claim 35, wherein the distinctive diacritization is based on a frequent rendering in a corpus.

37. A computer-readable medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for partially diacritizing text, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.

38. The computer-readable medium of claim 37, wherein the method further comprises determining that at least a portion of the received text is in a non-Latin alphabet.

39. The computer-readable medium of claim 37, wherein the method further comprises determining that at least a portion of the received text is in Arabic.

40. The computer-readable medium of claim 37, wherein the method further comprises generating a disambiguated form of the received text by conducting a full diacritization on the received text.

41. The computer-readable medium of claim 40, wherein the method further comprises extracting a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.

42. The computer-readable medium of claim 37, wherein the method further comprises using the annotated text to train the selected partial diacritization scheme.

43. The computer-readable medium of claim 37, wherein the method further comprises training each partial diacritization scheme with statistics corresponding to a corpus.

44. The computer-readable medium of claim 37, wherein the method further comprises: determining the performance score of each of the plurality of partial diacritization schemes on the received text; and determining the partial diacritization scheme with the highest performance score.

45. The computer-readable medium of claim 37, wherein the method further comprises: receiving a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; assigning priorities to each of the plurality of criteria; generating a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and determining the partial diacritization scheme from the multidimensional table.

46. The computer-readable medium of claim 37, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.

47. The computer-readable medium of claim 37, wherein the method further comprises determining whether a distinctive diacritization is associated with a word of the received text.

48. The computer-readable medium of claim 47, wherein the distinctive diacritization is based on a frequent rendering in a corpus.
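By way of illustration only, and without limiting the claims, the scheme selection recited above (choosing the scheme with the highest performance score, or determining a scheme from a table of prioritized criteria) might be sketched in Python as follows. The function names, the numeric encoding of every criterion, and the lexicographic use of priorities (priority 1 compared first, lower priorities used only to break ties) are assumptions made for this sketch; the disclosure does not fix how the scheme is determined from the table.

```python
from typing import Callable, Dict, Sequence


def select_by_score(
    text: str,
    schemes: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the scheme whose performance score on `text` is highest.

    `score(scheme, text)` is an assumed, caller-supplied scoring function.
    """
    return max(schemes, key=lambda scheme: score(scheme, text))


def select_from_table(
    table: Dict[str, Dict[str, float]],
    priorities: Dict[str, int],
) -> str:
    """Pick a scheme from a table of criteria values using prioritized criteria.

    `table[scheme][criterion]` holds a numeric value (higher is better), and
    `priorities[criterion]` holds a rank where 1 is the most important.
    Criteria are compared in priority order (an assumed interpretation).
    """
    ordered = sorted(priorities, key=priorities.get)
    return max(
        table,
        key=lambda scheme: tuple(table[scheme][criterion] for criterion in ordered),
    )
```

Under these assumptions, non-numeric criteria such as genre or domain information would first have to be mapped to numbers (for example, a match score against the genre of the received text) before being placed in the table.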