WO2009042861A1 - Methods, systems, and media for partially diacritizing text - Google Patents

Methods, systems, and media for partially diacritizing text

Info

Publication number
WO2009042861A1
Authority
WO
WIPO (PCT)
Prior art keywords
diacritization
partial
scheme
text
schemes
Prior art date
Application number
PCT/US2008/077849
Other languages
French (fr)
Inventor
Mona Talat Diab
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of Columbia University In The City Of New York
Publication of WO2009042861A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/129 Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • the disclosed subject matter relates to partially diacritizing text. More particularly, the disclosed subject matter relates to methods, systems, and media for identifying the optimal level of diacritization in written Modern Standard Arabic (MSA) text and/or in written dialectal data.
  • MSA Modern Standard Arabic
  • Arabic script consists of two classes of symbols: letters and diacritical marks (sometimes referred to herein as "diacritics"). While letters are always written, diacritics - i.e., marks inserted above or below particular letters that are used to express short vowels, lack of vowels, and consonantal gemination (letter doubling) - are optional. These diacritics can be used to aid the reader in disambiguating the text (e.g., distinguishing between different meanings of the word) or articulating the text correctly. However, almost all documents in Modern Standard Arabic, especially those texts which are not poetic or religious, are written using consonants only - i.e., without diacritics.
  • Methods, systems, and media for partially diacritizing text are provided.
  • methods for partially diacritizing text are provided, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.
  • systems for partially diacritizing text comprising: means for receiving text; means for selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; means for applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and means for providing the annotated text.
  • systems for partially diacritizing text comprising: a processor that: receives text; selects a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applies the selected partial diacritization scheme to annotate the received text with at least one diacritic; and provides the annotated text.
  • computer-readable media storing computer-executable instructions that, when executed by a processor, cause the processor to perform methods for partially diacritizing text are provided.
  • the method comprises:
  • FIG. 1 is a diagram of a mechanism for partially diacritizing Arabic text in accordance with some embodiments.
  • FIG. 2 is a diagram of a mechanism for extracting partial diacritizations of Arabic text and training partial diacritization schemes in accordance with some embodiments.
  • FIG. 3 is a schematic diagram of an illustrative system suitable for implementation of an application that partially diacritizes Arabic text in accordance with some embodiments.
  • FIG. 4 is a detailed example of the server and one of the workstations of FIG. 3 that can be used in accordance with some embodiments.
  • mechanisms for partially diacritizing text are provided.
  • these mechanisms can receive text.
  • a partial diacritization scheme is selected from multiple partial diacritization schemes based on the received text and a performance score.
  • the selected partial diacritization scheme is applied to the text to annotate the text with at least one diacritic.
  • the partially diacritized text can then be provided.
  • ASR automatic speech recognition
  • such diacritized text can be used effectively in language modeling for automatic speech recognition (ASR) systems and, more particularly, in the pronunciation dictionary component.
  • OCR Optical Character Recognition
  • NLP natural language processing
  • the partial diacritization scheme can be automatically applied to Arabic text displayed to a user, thereby facilitating the readability and comprehension of the Arabic text.
  • these mechanisms can be incorporated into Web servers so that the partial diacritization scheme can be automatically applied when a Web page is stored on the server or delivered to a user.
  • the diacritized Arabic text can be transmitted to a statistical machine translator (SMT) that translates the partially diacritized Arabic text into another language.
  • SMT statistical machine translator
  • these mechanisms can also be incorporated into a word processing application, a publishing application, and/or any other suitable document or publishing applications for producing reading materials, thereby setting a more comprehensible reading standard for Arabic texts.
  • niqqud is the set of diacritics used to represent vowels, distinguish between different pronunciations, etc.
  • hook and horn diacritics are used over vowels.
  • a diacritic is a mark or symbol inserted above or below a particular letter and can be used to indicate a short vowel, a lack of a vowel, a consonantal gemination (letter doubling), etc.
  • there are three types of diacritics in Arabic: vowel, nunation, and shadda (gemination).
  • Vowel diacritics represent three short vowels and a diacritic indicating the absence of any vowel.
  • the following are the four vowel diacritics exemplified in conjunction with the letter b (ب):
  • the absence of a vowel is sometimes referred to as the sukuun diacritic.
  • the sukuun diacritic marks the boundaries between syllables.
  • Nunation diacritics occur in the final position of a word in nominals
  • the shadda diacritic is a consonant doubling diacritic, for example:
  • the shadda can be combined with vowel or nunation diacritics, such as:
  • Arabic mark hamza can be placed in connection with a number of letters. For example:
  • Arabic encodings generally do not count the hamza as a diacritic, but rather as a part of the letter (e.g., like the dot on a lower-case Roman letter “i” or under the Arabic letter "b").
  • diacritics can include lexical diacritics.
  • Lexical diacritics are generally used to distinguish between two lexemes (alternatively expressed as lemmas, citation forms, or derivational forms) or an abstraction over inflected word forms which group together the word forms that differ only in terms of one of the inflectional morphological categories, such as, for example, number, gender, aspect, voice, etc.
  • Arabic lexeme citation forms are third masculine singular perfective for verbs and masculine singular (or feminine singular if the masculine form is not possible) for nouns and adjectives. For example, the diacritization
  • diacritics can include inflectional diacritics.
  • Inflectional diacritics are generally used to distinguish different inflected forms of the same lexeme.
  • the final diacritics in kitAbu 'book [nominative]' and kitAba 'book [accusative]' distinguish the morpho-syntactic case of "book" (e.g., whether the word is the subject or the object of the verb).
  • additional inflectional features e.g., voice, mood, definiteness, etc.
  • a partial diacritization scheme can be selected and performed on text (e.g., Arabic text) using a process 100 as illustrated in FIG. 1.
  • text is received at 102.
  • the text may be received from any suitable source.
  • these mechanisms detect that a file has been accessed and determine that the file contains text that is written in Arabic or any other suitable language that uses diacritics.
  • these mechanisms receive Arabic text from a word processing, publishing, or any other suitable application.
  • these mechanisms receive Arabic text from a web server.
  • the Arabic text is extracted from the received text.
  • the Arabic text in response to a user accessing a web page that includes both Arabic text and English text, the Arabic text can be extracted from the web page for applying a partial diacritization scheme.
  • a partial diacritization scheme is selected from multiple partial diacritization schemes. These partial diacritization schemes can draw on linguistic specifications of different types of diacritization.
  • partial diacritization schemes can include inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, any other suitable partial diacritization scheme, or any suitable combination thereof.
  • the lexical diacritization schemes include one or more rules to mark or annotate Arabic words with the idiosyncratic types of diacritics that are typically expressed in the form of derivational morphology, such as the letter doubling (shadda), absence of a vowel diacritic marking syllable boundaries (sukuun), among others.
  • GEM is a lexical diacritization scheme that marks the words in the data with the shadda diacritic (~). Words that have a gemination diacritic in the underlying lemma form are explicitly marked with the shadda diacritic (~).
  • SUK is a lexical diacritization scheme that marks words in the data with the sukuun diacritic (o) (the marker signifying the absence of a vowel). Words that have a sukuun in the underlying lemma form are marked with the sukuun (o) diacritic.
  • one suitable lexical diacritization scheme can include a combination of lexical diacritization schemes.
  • a lexical diacritization scheme can have rules for marking Arabic words with the shadda diacritic and the sukuun diacritic.
  • a lexical diacritization scheme can have rules for marking Arabic words with any suitable combination of lexical diacritics.
  • the inflectional diacritization schemes include one or more rules to mark or annotate Arabic words with the predictable forms of diacritics, such as case, mood, passivation, definiteness, etc. Inflectional diacritics are generally used to distinguish different inflected forms of the same lexeme.
  • PASS is an inflectional diacritization scheme that marks the verb passivation or damma diacritic.
  • C-M or CASE-MOOD is an inflectional diacritization scheme that marks words with diacritics representing case (e.g., the a, i, u, F, K, and N marks on nominals) and/or mood (e.g., the a, o, and u marks on the subjunctive, jussive, and indicative moods of verbs).
  • Table 1 shows an example of the contrasting diacritization schemes
  • the PASS diacritization scheme annotates the PASS diacritization scheme
  • the C-M diacritization scheme annotates the verb in the Arabic sentence with a diacritic to show the indicative mood of the verb.
  • a damma is added to the end of the verb (e.g., strmmu) to indicate that the word means "will be restored.”
  • the C-M diacritization scheme marks the noun with a nominative case diacritic to indicate that the word is the subject of the sentence (e.g., AljdrAnu).
  • the GEM diacritization scheme inserts the gemination or shadda diacritic (~) into the verb (e.g., strm~m).
  • the SUK diacritization scheme inserts the sukuun diacritic (o) in the noun to indicate the absence of a vowel (e.g., AljdorAn).
  • a frequency-based partial diacritization scheme can be used, where the frequency-based partial diacritization scheme can identify the distinctive diacritization to associate with a word given the context.
  • the frequency-based partial diacritization scheme is a combination of an inflectional diacritization scheme and a lexical diacritization scheme.
  • This frequency-based partial diacritization scheme can be calculated over several corpora based on frequencies of occurrences of the distinct diacritics in a large collection of fully diacritized texts, such as Arabic Gigaword or any other suitable collection of text. It should be noted that the frequency-based diacritization scheme marks the least frequent cases to distinguish them from the frequent readings, especially in the case of minimal pairs.
  • statistical information e.g., distributions, etc.
  • For example, in the Penn Arabic Treebank (ATB) III, version 2, 1.6% of the words have some diacritics. Among these, the most common diacritics are the nunation diacritics (e.g., F, K, and N), accounting for 73.4% of the naturally occurring diacritics in the ATB. A majority of these nunation diacritics are used inflectionally to mark nominals (nouns, adjectives, proper nouns, etc.), indicating case assignment together with indefiniteness.
  • the F diacritic marks the accusative case
  • the K diacritic marks the genitive case
  • the N diacritic marks the nominative case.
  • the next most frequent diacritic is the shadda diacritic.
  • the shadda diacritic (gemination) accounts for 20.8% of the naturally occurring diacritics in the ATB and occurs 56.7% of the time with verbs.
  • the third most frequent diacritic is the damma, which accounts for 3% of the naturally occurring diacritics in the ATB.
  • the majority of the usage of the damma diacritic is to indicate the passive form of verbs.
  • statistics from any suitable source can be used to train the partial diacritization schemes.
  • statistics and distributions from Arabic words that are diacritized using one or more partial diacritization schemes can be used to train and update the partial diacritization schemes.
  • statistics and distributions from Arabic Gigaword can be used to train the partial diacritization schemes.
  • the partial diacritization scheme can determine to not mark the word. That is, the frequency-based partial diacritization scheme can determine to not annotate the word with diacritics because the most frequent reading of the word is the valid reading.
  • the word "mn" in Arabic is typically written without a diacritic.
  • This sequence of letters "mn" can be used to mean the most frequent rendering, the preposition "of," which can be expressed with the diacritic "i" as "min." This sequence of letters can also be used to mean the less frequent reading "who," which is expressed with the diacritic "a" as "man." In addition, this sequence of letters can be used to mean the even less frequent reading of the verb "to bestow," expressed with the diacritic "a" and the shadda or letter doubling diacritic as "mann." Accordingly, in some embodiments, the words "who" and "bestow" can be rendered with diacritics, while the most frequently occurring word "of" can be rendered without diacritics.
  • the partial diacritization scheme may determine which diacritics to insert based on frequency. For example, the fully disambiguated form of the Arabic verb "to bestow” is expressed with the diacritic "a” and the shadda or letter doubling diacritic. However, the partial diacritization scheme can determine to only mark the Arabic word with the letter doubling diacritic as it is the most distinctive feature.
  • these mechanisms can derive each of the partial diacritization schemes (e.g., inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, etc.) from the fully disambiguated form of the word.
  • a full diacritization can be conducted on the received text to generate a fully disambiguated form of the received text.
  • the received text can be transmitted to a full diacritization application, such as the Morphological Analysis and Disambiguation of Arabic (MADA) system.
  • MADA Morphological Analysis and Disambiguation of Arabic
  • the MADA system tags morphologically rich languages by using a set of taggers that are trained for individual linguistic features (e.g., core part- of-speech, tense, number, etc.).
  • the FULL diacritization scheme specifies all of the diacritics in the sentence "the walls will be restored" using the MADA system.
  • the received text can be transmitted to a morphological analysis system, such as the Buckwalter Arabic Morphological Analysis (BAMA) system.
  • BAMA Buckwalter Arabic Morphological Analysis
  • partial diacritizations of the received text can be extracted from the fully diacritized text.
  • particular diacritics can be selectively removed from the fully disambiguated form based on the partial diacritization scheme.
  • the lexical and inflectional partial diacritizations can be extracted directly from the fully disambiguated form. For example, as described previously, the GEM diacritization can selectively remove all diacritics from the
  • the SUK diacritization scheme can selectively remove from the fully disambiguated form all diacritics except for the sukuun diacritic in nouns.
  • the partial diacritization of particular words can be used to train the partial diacritization scheme at 208.
  • the mechanisms can be trained on data in a specific partial diacritization scheme.
  • the new Arabic text can be annotated with diacritics using the selected partial diacritization scheme.
  • a partial diacritization scheme is selected from multiple partial diacritization schemes (e.g., inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, any other suitable partial diacritization scheme, or any suitable combination thereof).
  • the partial diacritization scheme can be selected based on any suitable criteria.
  • the partial diacritization scheme can be selected based on genre and domain information.
  • the partial diacritization scheme can be selected based on a performance score or a performance measurement of each partial diacritization scheme.
  • the selected partial diacritization scheme can be the optimal partial diacritization scheme (e.g., the partial diacritization scheme with the best or highest performance score).
  • a performance score calculated by the MADA system can be used to select the partial diacritization scheme.
  • the precision, the recall, and the F-score of the statistical machine translation (SMT) system or any other suitable system can be measured.
  • the precision measurement, which is the ratio of true positives over the total number of true and false positives, measures to what extent overgeneration of diacritics occurs.
  • the recall measurement, which is the ratio of true positives over the sum of true positives and false negatives, measures the amount of undergeneration.
  • the F-score measurement is the harmonic mean of precision and recall. As shown below in Table 2, the F-score shows the performance score or quality measurement of the specific diacritization scheme.
  • partial diacritization schemes can be applied from different scientific perspectives (e.g., psycholinguistic and neurological studies coupled with computational modeling in the context of natural language processing machinery).
  • the partial diacritization scheme can be selected in response to determining the partial diacritization scheme with the highest MADA F-score (or any other suitable performance score determined by any suitable system that assigns full diacritizations to Arabic text).
  • the optimal partial diacritization scheme can be selected in response to determining the partial diacritization scheme with the highest F-score and determining the partial diacritization scheme with rules that best fit the psycholinguistics associated with the received Arabic text.
  • the selected partial diacritization scheme can be determined based on a combination of criteria.
  • a multidimensional table can be used to account for different combinations of criteria (e.g., performance score, domain information, genre information, frequency, etc.).
  • the determination of the selected partial diacritization scheme to use on particular text can be calculated by assigning priorities to each of the criteria (e.g., the performance score is the most important criterion and the frequency is the least important criterion).
  • the partial diacritization scheme is applied to annotate the received text with one or more diacritics at 106.
  • the annotated text is provided.
  • these mechanisms can be incorporated into an automatic speech recognition (ASR) system, where the annotated text can be transmitted to a text-to-speech component for articulating the partially diacritized Arabic text.
  • the annotated text can be transmitted to a statistical machine translator (SMT) that translates the partially diacritized Arabic text into another language.
  • system 300 can include one or more workstations 302.
  • Workstations 302 can be local to each other or remote from each other, and are connected by one or more communications links 304 to a communications network 306 that is linked via a communications link 308 to a server 310.
  • server 310 can be any suitable server for executing the application, such as a processor, a computer, a data processing device, or a combination of such devices.
  • Communications network 306 can be any suitable computer network including the Internet, an intranet, a wide-area network (WAN), a local-area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), or any combination of any of the same.
  • Communications links 304 and 308 can be any communications links suitable for communicating data between workstations 302 and server 310, such as network links, dial-up links, wireless links, hard-wired links, etc.
  • Workstations 302 can be publishing media devices, personal computers, laptop computers, mainframe computers, dumb terminals, data displays, Internet browsers, personal digital assistants (PDAs), two-way pagers, wireless terminals, portable telephones, etc., or any combination of the same.
  • Workstations 302 and server 310 can be located at any suitable location. In one embodiment, workstations 302 and server 310 can be located within an organization. Alternatively, workstations 302 and server 310 can be distributed between multiple organizations.
  • workstation 302 can include processor 402, display 404, input device 406, and memory 408, which can be interconnected.
  • memory 408 contains a storage device for storing a workstation program for controlling processor 402.
  • Memory 408 can also contain an application for partially diacritizing Arabic text in accordance with some embodiments.
  • the application can be resident in the memory of workstation 302 or server 310.
  • the application can include client-side software, hardware, or both.
  • the application can encompass one or more Web-pages or Web-page portions (e.g., via any suitable encoding, such as Hyper Text Markup Language (HTML), Dynamic HyperText Markup Language (DHTML), Extensible Markup Language (XML), JavaServer Pages (JSP), Active Server Pages (ASP), Cold Fusion, or any other suitable approaches).
  • HTML Hyper Text Markup Language
  • DHTML Dynamic HyperText Markup Language
  • XML Extensible Markup Language
  • JSP JavaServer Pages
  • ASP Active Server Pages
  • Cold Fusion or any other suitable approaches.
  • the application is described herein as being implemented on a workstation, this is only illustrative.
  • the application can be implemented on any suitable platform (e.g., a personal computer (PC), a mainframe computer, a dumb terminal, a data display, a two-way pager, a wireless terminal, a portable telephone, a portable computer, a palmtop computer, a H/PC, an automobile PC, a laptop computer, a personal digital assistant (PDA), a combined cellular phone and PDA, etc.) to provide such features.
  • PC personal computer
  • PDA personal digital assistant
  • Processor 402 can use the workstation program to present on display
  • Server 310 can include processor 420, display 422, input device 424, and memory 426, which can be interconnected.
  • memory 426 contains a storage device for storing data received through communication link 308 or through other links, and also receives commands and values transmitted by one or more users.
  • the storage device further contains a server program for controlling processor 420.
  • memory 426 can also contain an application for partially diacritizing Arabic text in accordance with some embodiments.
  • server 310 can use the application to partially diacritize text (e.g., Arabic text) that passes through server 310.
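The extraction step described in the definitions above, in which particular diacritics are selectively removed from the fully disambiguated form, can be sketched as follows. This is a minimal illustration, not the disclosed implementation. It assumes Buckwalter-style transliteration symbols (`a`, `i`, `u`, `o`, `F`, `K`, `N`, `~`), and the two fully diacritized input strings are hypothetical forms chosen only so that the outputs match the GEM and SUK examples given in this document. The helper names are likewise illustrative, and the sketch ignores part-of-speech restrictions such as limiting SUK to nouns.

```python
# Buckwalter-style diacritic symbols (an assumed, non-exhaustive inventory).
DIACRITICS = frozenset("aiuoFKN~")


def keep_only(fully_diacritized: str, keep: str) -> str:
    """Remove every diacritic symbol except those listed in `keep`."""
    return "".join(
        ch for ch in fully_diacritized
        if ch not in DIACRITICS or ch in keep
    )


def gem(word: str) -> str:
    """GEM-like extraction: keep only the shadda (~)."""
    return keep_only(word, "~")


def suk(word: str) -> str:
    """SUK-like extraction: keep only the sukuun (o)."""
    return keep_only(word, "o")


if __name__ == "__main__":
    # Hypothetical fully diacritized forms, used for illustration only.
    print(gem("saturam~amu"))
    print(suk("AljudorAnu"))
```

Filtering by symbol works here because none of the assumed diacritic symbols doubles as a Buckwalter letter. A scheme such as C-M, which depends on the position of the diacritic in the word, would need more than a character filter.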

Abstract

Methods, systems, and media for partially diacritizing text are provided. In accordance with some embodiments, methods for partially diacritizing text are provided, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.
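The performance score mentioned in the abstract is described in the definitions above in terms of precision, recall, and F-score. A minimal sketch of that arithmetic follows. The function name and the counts in the usage example are assumptions made for illustration, not figures from this document.

```python
def precision_recall_f(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F-score) from raw counts.

    precision = TP / (TP + FP)   -> penalizes overgeneration of diacritics
    recall    = TP / (TP + FN)   -> penalizes undergeneration
    F-score   = harmonic mean of precision and recall
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(precision_recall_f(true_pos=8, false_pos=2, false_neg=2))
```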

Description

METHODS, SYSTEMS, AND MEDIA FOR PARTIALLY DIACRITIZING TEXT
Cross Reference to Related Application
[0001] This application claims the benefit of United States Provisional Patent Application No. 60/975,482, filed September 26, 2007, and United States Provisional Patent Application No. 60/975,783, filed September 27, 2007, both of which are hereby incorporated by reference herein in their entireties.
Technical Field
[0002] The disclosed subject matter relates to partially diacritizing text. More particularly, the disclosed subject matter relates to methods, systems, and media for identifying the optimal level of diacritization in written Modern Standard Arabic (MSA) text and/or in written dialectal data.
Background
[0003] Arabic script consists of two classes of symbols: letters and diacritical marks (sometimes referred to herein as "diacritics"). While letters are always written, diacritics - i.e., marks inserted above or below particular letters that are used to express short vowels, lack of vowels, and consonantal gemination (letter doubling) - are optional. These diacritics can be used to aid the reader in disambiguating the text (e.g., distinguishing between different meanings of the word) or articulating the text correctly. However, almost all documents in Modern Standard Arabic, especially those texts which are not poetic or religious, are written using consonants only - i.e., without diacritics. For example, in the Penn Arabic Treebank (ATB) III, version 2, it has been estimated that only 1.6% of all words have some diacritics explicitly marked in the text. This significantly affects the readability and comprehension of written Arabic text. Compounding this problem is the diglossic situation: the native spoken dialects are quite different from the formal Modern Standard Arabic, the common written and academic language. For example, the different dialects provide a multitude of different pronunciations.
[0004] Various approaches attempt to solve this problem by fully restoring all of the diacritics in non-diacritized Arabic text. However, fully specifying all diacritics can hinder readability. Thus, full diacritization encodes too much information into the non-diacritized Arabic text. In addition, full diacritization has been shown to be less than optimal for several natural language processing applications, such as statistical machine translation.
[0005] There is therefore a need in the art for approaches that provide a partial diacritization for Arabic text. Accordingly, it is desirable to provide methods and systems that overcome these and other deficiencies of the prior art.
Summary
[0006] Methods, systems, and media for partially diacritizing text are provided. In accordance with some embodiments, methods for partially diacritizing text are provided, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.
[0007] In some embodiments, systems for partially diacritizing text are provided, the system comprising: means for receiving text; means for selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; means for applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and means for providing the annotated text.
[0008] In some embodiments, systems for partially diacritizing text are provided, the system comprising: a processor that: receives text; selects a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applies the selected partial diacritization scheme to annotate the received text with at least one diacritic; and provides the annotated text.
[0009] In some embodiments, computer-readable media storing computer-executable instructions that, when executed by a processor, cause the processor to perform methods for partially diacritizing text are provided. The method comprises: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.
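The four steps recited above (receive, select, apply, provide) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the claimed implementation. Each scheme is modeled as a lookup table from undiacritized tokens to annotated tokens, the performance scores are placeholder numbers, and selection simply takes the highest score. The dependence on the received text is therefore reduced to which tokens the chosen table happens to cover. All function and variable names are hypothetical.

```python
from typing import Callable, Dict, Tuple

# A scheme is modeled as a function from undiacritized text to annotated text.
Scheme = Callable[[str], str]


def make_lookup_scheme(table: Dict[str, str]) -> Scheme:
    """Build a toy scheme: annotate known tokens, leave the rest unchanged."""
    def apply(text: str) -> str:
        return " ".join(table.get(token, token) for token in text.split())
    return apply


def select_scheme(scores: Dict[str, float]) -> str:
    """Return the name of the scheme with the highest performance score."""
    return max(scores, key=scores.get)


def partially_diacritize(
    text: str, schemes: Dict[str, Scheme], scores: Dict[str, float]
) -> Tuple[str, str]:
    """Select a scheme by score, apply it, and return (scheme name, text)."""
    name = select_scheme({n: scores[n] for n in schemes})
    return name, schemes[name](text)


# Toy tables built from the transliterated examples in this document.
SCHEMES: Dict[str, Scheme] = {
    "GEM": make_lookup_scheme({"strmm": "strm~m"}),
    "SUK": make_lookup_scheme({"AljdrAn": "AljdorAn"}),
}
# Placeholder scores, not measured results.
SCORES: Dict[str, float] = {"GEM": 0.60, "SUK": 0.70}

if __name__ == "__main__":
    print(partially_diacritize("strmm AljdrAn", SCHEMES, SCORES))
```

With these placeholder scores the SUK table is chosen, so only the noun is annotated and the verb passes through unchanged.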
Brief Description of the Drawings
[0010] FIG. 1 is a diagram of a mechanism for partially diacritizing Arabic text in accordance with some embodiments.
[0011] FIG. 2 is a diagram of a mechanism for extracting partial diacritizations of Arabic text and training partial diacritization schemes in accordance with some embodiments.
[0012] FIG. 3 is a schematic diagram of an illustrative system suitable for implementation of an application that partially diacritizes Arabic text in accordance with some embodiments.
[0013] FIG. 4 is a detailed example of the server and one of the workstations of FIG. 3 that can be used in accordance with some embodiments.
Detailed Description
[0014] In accordance with various embodiments, mechanisms for partially diacritizing text are provided. In some embodiments, these mechanisms can receive text. In response to receiving the text, a partial diacritization scheme is selected from multiple partial diacritization schemes based on the received text and a performance score. The selected partial diacritization scheme is applied to the text to annotate the text with at least one diacritic. The partially diacritized text can then be provided.
[0015] These mechanisms can be used in a variety of applications. For example, these mechanisms can be incorporated into an automatic speech recognition (ASR) system, where partially diacritized Arabic text or Arabic words can be transmitted to a text-to-speech component for articulating the partially diacritized Arabic text. Since the optimal or enhanced level of diacritization renders an optimal or enhanced level of variation, such diacritized text can be used effectively in language modeling for automatic speech recognition (ASR) systems and, more particularly, in the pronunciation dictionary component. In addition, such diacritized text can be used effectively in language modeling for Optical Character Recognition (OCR) systems and, more particularly, in the orthographic dictionary component. Combined with lattice decoding, such optimal or enhanced partial diacritizations allow for better performance of such natural language processing (NLP) applications.
[0016] In another example, these mechanisms can be incorporated into an Internet web browser such that the partial diacritization scheme can be automatically applied to Arabic text displayed to a user, thereby facilitating the readability and comprehension of the Arabic text. In yet another example, these mechanisms can be incorporated into Web servers so that the partial diacritization scheme can be automatically applied when a Web page is stored on the server or delivered to a user. In yet another example, the diacritized Arabic text can be transmitted to a statistical machine translator (SMT) that translates the partially diacritized Arabic text into another language.
[0017] Alternatively, these mechanisms can also be incorporated into a word processing application, a publishing application, and/or any other suitable document or publishing applications for producing reading materials, thereby setting a more comprehensible reading standard for Arabic texts.
[0018] It should be noted that although these embodiments are primarily described using Modern Standard Arabic (MSA), this is only illustrative. These mechanisms can be applied to any suitable Arabic script-based languages, such as Arabic, Farsi, Kurdish, Malay, Persian, Urdu, etc. For example, partial diacritization schemes can be generated for text written in a Perso-Arabic script. Alternatively, these mechanisms can be applied to any suitable non-Latin alphabet that uses diacritics, marks, or characters to alter pronunciation, differentiate between similar words, etc. For example, in the Greek alphabet, diacritics (e.g., acute accent, grave accent, circumflex accent, etc.) are used to indicate different pronunciations, pitch accents, and/or breathings. In another example, in Hebrew orthography, niqqud is the set of diacritics used to represent vowels, distinguish between different pronunciations, etc. In yet another example, in the Vietnamese alphabet, hook and horn diacritics are used over vowels. These mechanisms for partial diacritization can be applied to any suitable language that uses diacritics.
[0019] As used herein, a diacritic is a mark or symbol inserted above or below a particular letter and can be used to indicate a short vowel, a lack of a vowel, a consonantal gemination (letter doubling), etc. Generally speaking, there are three types of diacritics in Arabic: a vowel, a nunation, and a shadda (gemination).
[0020] Vowel diacritics represent three short vowels and a diacritic indicating the absence of any vowel. The following are the four vowel diacritics exemplified in conjunction with the letter b (ب):
ba (fatha), bu (damma), bi (kasra), and bo (no vowel).
It should be noted that the absence of a vowel is sometimes referred to as the sukuun diacritic. The sukuun diacritic marks the boundaries between syllables.
[0021] Nunation diacritics occur in the final position of a word in nominals (e.g., nouns, adjectives, and adverbs). Nunations indicate a short vowel followed by an unwritten n sound, for example:
bAF, bN, and bK.
These nunations are generally an indicator of nominal indefiniteness.
[0022] The shadda diacritic is a consonant doubling diacritic, for example:
b~ (/bb/).
In another example, the shadda can be combined with vowel or nunation diacritics, such as:
Figure imgf000006_0001
[0023] It should be noted that, in some embodiments, other diacritics can also be included. The Arabic mark hamza can be placed in connection with a number of letters. For example:
Note that Arabic encodings generally do not count the hamza as a diacritic, but rather as a part of the letter (e.g., like the dot on a lower-case Roman letter "i" or under the Arabic letter "b").
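For readers working with Unicode text rather than transliteration, the vowel, nunation, and shadda diacritics described above can be related to Unicode combining marks. The following is a minimal sketch, assuming that the Latin symbols used in this description (a, u, i, o, F, N, K, and ~) follow the Buckwalter transliteration convention; the names `BUCKWALTER_TO_UNICODE` and `strip_arabic_diacritics` are hypothetical and are not part of the disclosed subject matter.

```python
import unicodedata

# Assumed mapping: transliteration symbol -> Unicode combining mark (Arabic block).
BUCKWALTER_TO_UNICODE = {
    "F": "\u064B",  # fathatan (nunation)
    "N": "\u064C",  # dammatan (nunation)
    "K": "\u064D",  # kasratan (nunation)
    "a": "\u064E",  # fatha
    "u": "\u064F",  # damma
    "i": "\u0650",  # kasra
    "~": "\u0651",  # shadda (gemination)
    "o": "\u0652",  # sukuun (no vowel)
}
ARABIC_DIACRITICS = set(BUCKWALTER_TO_UNICODE.values())


def strip_arabic_diacritics(text):
    """Remove the eight diacritics listed above from Arabic-script text."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


if __name__ == "__main__":
    # Print each symbol with its code point and official Unicode name.
    for symbol, mark in BUCKWALTER_TO_UNICODE.items():
        print(symbol, f"U+{ord(mark):04X}", unicodedata.name(mark))
```

The hamza is deliberately absent from the mapping, consistent with the note above that encodings generally treat it as part of the letter.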
[0024] In some embodiments, diacritics can include lexical diacritics. Lexical diacritics are generally used to distinguish between two lexemes (alternatively expressed as lemmas, citation forms, or derivational forms) or an abstraction over inflected word forms which group together the word forms that differ only in terms of one of the inflectional morphological categories, such as, for example, number, gender, aspect, voice, etc. Arabic lexeme citation forms are third masculine singular perfective for verbs and masculine singular (or feminine singular if the masculine form is not possible) for nouns and adjectives. For example, the diacritization difference between the lexemes kAtib (writer) and kAtab (to correspond) distinguishes between the meanings of the word (lexical disambiguation) rather than their inflections. Diacritics can be used to mark lexical variation. A common example with the shadda (gemination) diacritic is the distinction between Form I and Form II of Arabic verb derivations. Form II indicates, in most cases, added causativity to the Form I meaning. Form II is marked by doubling the second radical of the root used in Form I, for example:
Akal 'ate' versus Ak~al 'fed'.
[0025] In some embodiments, diacritics can include inflectional diacritics. Inflectional diacritics are generally used to distinguish different inflected forms of the same lexeme. For example, the final diacritics in kitAbu 'book [nominative]' and kitAba 'book [accusative]' distinguish the morpho-syntactic case of "book" (e.g., whether the word is the subject or the object of the verb). It should be noted that, in some embodiments, additional inflectional features (e.g., voice, mood, definiteness, etc.) can be shown with other inflectional diacritics.
[0026] In accordance with some embodiments, a partial diacritization scheme can be selected and performed on text (e.g., Arabic text) using a process 100 as illustrated in FIG. 1. As shown, text is received at 102. The text may be received from any suitable source. For example, in some embodiments, these mechanisms detect that a file has been accessed and determine that the file contains text that is written in Arabic or any other suitable language that uses diacritics. In some embodiments, these mechanisms receive Arabic text from a word processing, publishing, or any other suitable application. In some embodiments, these mechanisms receive Arabic text from a web server.
[0027] It should be noted that, in some embodiments, the Arabic text is extracted from the received text. For example, in response to a user accessing a web page that includes both Arabic text and English text, the Arabic text can be extracted from the web page for applying a partial diacritization scheme.
[0028] At 104, a partial diacritization scheme is selected from multiple partial diacritization schemes. These partial diacritization schemes can draw on linguistic specifications of different types of diacritization. For example, partial diacritization schemes can include inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, any other suitable partial diacritization scheme, or any suitable combination thereof.
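The extraction of Arabic text from mixed-language input, as described for the web page example, could be approximated as follows. This is a minimal sketch under the assumption that membership in the Unicode Arabic block (U+0600 to U+06FF) is a sufficient test; the function name `extract_arabic` is hypothetical, and a real implementation would likely also need to handle Arabic punctuation, digits, and the supplementary Arabic blocks.

```python
import re

# One or more Arabic-block characters, optionally continued by
# whitespace-separated Arabic words, so that phrases stay together.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*")


def extract_arabic(text):
    """Return the list of contiguous Arabic-script runs found in text."""
    return ARABIC_RUN.findall(text)


if __name__ == "__main__":
    mixed = "The headline reads \u0633\u062a\u0631\u0645\u0645 \u0627\u0644\u062c\u062f\u0631\u0627\u0646 in Arabic."
    for run in extract_arabic(mixed):
        print(run)
```

Each extracted run can then be passed to the scheme selection at 104, while the non-Arabic remainder is left unchanged.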
[0029] The lexical diacritization schemes include one or more rules to mark or annotate Arabic words with the idiosyncratic types of diacritics that are typically expressed in the form of derivational morphology, such as the letter doubling (shadda), absence of a vowel diacritic marking syllable boundaries (sukuun), among others. For example, as shown below in Table 1, GEM is a lexical diacritization scheme that marks the words in the data with the shadda diacritic (~). Words that have a gemination diacritic in the underlying lemma form are explicitly marked with the shadda diacritic (~). As also shown in Table 1, SUK is a lexical diacritization scheme that marks words in the data with the sukuun diacritic (o) (the marker signifying the absence of a vowel). Words that have a sukuun in the underlying lemma form are marked with the sukuun (o) diacritic.
[0030] In some embodiments, one suitable lexical diacritization scheme can include a combination of lexical diacritization schemes. For example, a lexical diacritization scheme can have rules for marking Arabic words with the shadda diacritic and the sukuun diacritic. In another example, a lexical diacritization scheme can have rules for marking Arabic words with any suitable combination of lexical diacritics.
[0031] The inflectional diacritization schemes include one or more rules to mark or annotate Arabic words with the predictable forms of diacritics, such as case, mood, passivation, definiteness, etc. Inflectional diacritics are generally used to distinguish different inflected forms of the same lexeme. For example, as shown in Table 1, PASS is an inflectional diacritization scheme that marks the verb passivation or damma diacritic. As also shown in Table 1, C-M or CASE-MOOD is an inflectional diacritization scheme that marks words with diacritics representing case (e.g., the a, i, u, F, K, and N marks on nominals) and/or mood (e.g., the a, o, and u marks on the subjunctive, jussive, and indicative moods of verbs).
Figure imgf000009_0001
Table 1
[0032] Table 1 shows an example of the contrasting diacritization schemes (e.g., no diacritization, multiple partial diacritizations, and full diacritization) applied to the sentence "the walls will be restored." It should be noted that the diacritization scheme NONE shows the Arabic sentence without diacritics, while the diacritization scheme FULL shows the Arabic sentence with all of the possible diacritics.
[0033] As shown in Table 1, the PASS diacritization scheme annotates the Arabic sentence with the damma (u) diacritic or verb passivation (e.g., sturmm). The C-M diacritization scheme annotates the verb in the Arabic sentence with a diacritic to show the indicative mood of the verb. In particular, a damma is added to the end of the verb (e.g., strmmu) to indicate that the word means "will be restored." In addition, the C-M diacritization scheme marks the noun with a nominative case diacritic to indicate that the word is the subject of the sentence (e.g., AljdrAnu).
[0034] As also shown in Table 1, the GEM diacritization scheme inserts the gemination or shadda diacritic (~) into the verb (e.g., strm~m). The SUK diacritization scheme inserts the sukuun diacritic (o) in the noun to indicate the absence of a vowel (e.g., AljdorAn).
[0035] In some embodiments, a frequency-based partial diacritization scheme can be used, where the frequency-based partial diacritization scheme can identify the distinctive diacritization to associate with a word given the context. It should be noted that the frequency-based partial diacritization scheme is a combination of an inflectional diacritization scheme and a lexical diacritization scheme. This frequency-based partial diacritization scheme can be calculated over several corpora based on frequencies of occurrences of the distinct diacritics in a large collection of fully diacritized texts, such as Arabic Gigaword or any other suitable collection of text. It should be noted that the frequency-based diacritization scheme marks the least frequent of cases to distinguish them from the frequent readings, especially in the case of minimal pairs.
[0036] In some embodiments, statistical information (e.g., distributions, etc.) from one or more corpora can be used for frequency-based partial diacritization schemes. For example, in the Penn Arabic Treebank (ATB) III, version 2, 1.6% of the words have some diacritics. Among these, the most common diacritics are the nunation diacritics (e.g., F, K, and N), accounting for 73.4% of the naturally occurring diacritics in the ATB. A majority of these nunation diacritics are used inflectionally to mark nominals (nouns, adjectives, proper nouns, etc.), indicating case assignment together with indefiniteness. For example, the F diacritic marks the accusative case, the K diacritic marks the genitive case, and the N diacritic marks the nominative case. The next most frequent diacritic is the shadda diacritic. The shadda diacritic (gemination) accounts for 20.8% of the naturally occurring diacritics in the ATB and occurs 56.7% of the time with verbs. The third most frequent diacritic is the damma, which accounts for 3% of the naturally occurring diacritics in the ATB. The majority of the usage of the damma diacritic is to indicate the passive form of verbs. These statistics and distributions of naturally occurring diacritics can be used to train the partial diacritization scheme.
[0037] It should be noted that statistics from any suitable source can be used to train the partial diacritization schemes. For example, in some embodiments, statistics and distributions from Arabic words that are diacritized using one or more partial diacritization schemes can be used to train and update the partial diacritization schemes. In another example, statistics and distributions from Arabic Gigaword can be used to train the partial diacritization schemes.
[0038] In some embodiments, if a word has one unique diacritization associated with it, the partial diacritization scheme can determine to not mark the word. That is, the frequency-based partial diacritization scheme can determine to not annotate the word with diacritics because the most frequent reading of the word is the valid reading.
[0039] For example, the word "mn" in Arabic is typically written without a diacritic. This sequence of letters "mn" can be used to mean the most frequent rendering of the preposition "of," which can be expressed with the diacritic "i" as "min." This sequence of letters can also be used to mean the less frequent reading of "who," which is expressed with the diacritic "a" as "man." In addition, this sequence of letters can be used to mean the even less frequent reading of the verb "to bestow" expressed with the diacritic "a" and the shadda or letter doubling diacritic as "mann." Accordingly, in some embodiments, the words "who" and "bestow" can be rendered with diacritics, while the most frequently occurring word "of" can be rendered without diacritics.
[0040] In some embodiments, the partial diacritization scheme may determine which diacritics to insert based on frequency. For example, the fully disambiguated form of the Arabic verb "to bestow" is expressed with the diacritic "a" and the shadda or letter doubling diacritic. However, the partial diacritization scheme can determine to only mark the Arabic word with the letter doubling diacritic as it is the most distinctive feature.
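The frequency-based behavior described for "mn" can be illustrated with a short sketch. It assumes transliterated input in which the characters a, i, u, o, F, K, N, and ~ are the only diacritics, and it uses invented placeholder counts rather than corpus statistics; the function names are hypothetical. For simplicity, the sketch keeps all diacritics on the less frequent readings and does not attempt to keep only the most distinctive diacritic.

```python
from collections import Counter

# Assumed diacritic symbols in the transliteration used in this description.
DIACRITICS = set("aiuoFKN~")


def strip_diacritics(word):
    """Return the undiacritized letter sequence of a transliterated word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def build_frequency_table(diacritized_tokens):
    """Count how often each diacritized reading occurs per bare form."""
    table = {}
    for token in diacritized_tokens:
        table.setdefault(strip_diacritics(token), Counter())[token] += 1
    return table


def frequency_based(word, table):
    """Leave the most frequent reading unmarked; keep diacritics otherwise."""
    bare = strip_diacritics(word)
    readings = table.get(bare)
    if not readings:
        return bare  # unseen form: nothing to contrast against
    most_frequent, _ = readings.most_common(1)[0]
    return bare if word == most_frequent else word


if __name__ == "__main__":
    # Placeholder counts, chosen only so that "min" is the most frequent reading.
    corpus = ["min"] * 5 + ["man"] * 2 + ["man~"]
    table = build_frequency_table(corpus)
    for reading in ("min", "man", "man~"):
        print(reading, "->", frequency_based(reading, table))
```

A word with a single reading in the table is always returned bare, which corresponds to the case of a word that has one unique diacritization.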
[0041] In accordance with some embodiments, these mechanisms can derive each of the partial diacritization schemes (e.g., inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, etc.) from the fully disambiguated form of the word.
[0042] As shown in FIG. 2, in response to receiving text from any suitable source (e.g., from a text-to-speech application) at 202, a full diacritization can be conducted on the received text to generate a fully disambiguated form of the received text. For example, in some embodiments, the received text can be transmitted to a full diacritization application, such as the Morphological Analysis and Disambiguation of Arabic (MADA) system. The MADA system tags morphologically rich languages by using a set of taggers that are trained for individual linguistic features (e.g., core part-of-speech, tense, number, etc.). As shown previously in Table 1, the FULL diacritization scheme specifies all of the diacritics in the sentence "the walls will be restored" using the MADA system.
[0043] It should be noted that any suitable approach for fully diacritizing words (e.g., Arabic words or words in any alphabet that uses diacritics) can be used. For example, the received text can be transmitted to a morphological analysis system, such as the Buckwalter Arabic Morphological Analysis (BAMA) system.
[0044] At 206, partial diacritizations of the received text can be extracted from the fully diacritized text. In some embodiments, particular diacritics can be selectively removed from the fully disambiguated form based on the partial diacritization scheme. The lexical and inflectional partial diacritizations can be extracted directly from the fully disambiguated form. For example, as described previously, the GEM diacritization can selectively remove all diacritics from the Arabic words except for the shadda diacritics on verbs. In another example, the SUK diacritization scheme can selectively remove from the fully disambiguated form all diacritics except for the sukuun diacritic in nouns.
[0045] Accordingly, based on the specifications of each partial diacritization scheme, particular diacritics are selectively removed from each word to extract the partial diacritization.
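The selective removal described above can be sketched as simple filters over a fully diacritized, transliterated sentence. The fully diacritized form used here is an assumption chosen to be consistent with the partial forms quoted in connection with Table 1 (strm~m, AljdorAn, strmmu, AljdrAnu); it is not taken from the table itself. The scheme functions are hypothetical, ignore part-of-speech restrictions, and approximate C-M by keeping only a word-final diacritic. The PASS scheme is omitted because identifying the passivation damma requires morphological analysis.

```python
# Assumed diacritic symbols in the transliteration used in this description.
DIACRITICS = set("aiuoFKN~")


def keep_only(word, keep):
    """Drop every diacritic from word except those listed in keep."""
    return "".join(ch for ch in word if ch not in DIACRITICS or ch in keep)


def case_mood(word):
    """Approximate C-M: keep only a word-final diacritic, if present."""
    bare = keep_only(word, set())
    return bare + word[-1] if word and word[-1] in DIACRITICS else bare


# Each scheme maps a fully diacritized word to its partial form.
SCHEMES = {
    "NONE": lambda w: keep_only(w, set()),
    "GEM": lambda w: keep_only(w, {"~"}),
    "SUK": lambda w: keep_only(w, {"o"}),
    "C-M": case_mood,
    "FULL": lambda w: w,
}


def extract_partial(full_sentence, scheme):
    """Apply one scheme to every word of a fully diacritized sentence."""
    return " ".join(SCHEMES[scheme](w) for w in full_sentence.split())


if __name__ == "__main__":
    full = "saturam~amu AljudoraAnu"  # assumed full form, see lead-in
    for name in SCHEMES:
        print(name, extract_partial(full, name))
```

Because every scheme is derived from the same fully disambiguated form, adding a scheme only requires adding one more filter.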
[0046] In some embodiments, the partial diacritization of particular words can be used to train the partial diacritization scheme at 208. For example, the mechanisms can be trained on data in a specific partial diacritization scheme. In response to receiving new Arabic text, the new Arabic text can be annotated with diacritics using the selected partial diacritization scheme.
[0047] Referring back to FIG. 1, a partial diacritization scheme is selected from multiple partial diacritization schemes (e.g., inflectional diacritization schemes, lexical diacritization schemes, frequency-based diacritization schemes, any other suitable partial diacritization scheme, or any suitable combination thereof).
[0048] The partial diacritization scheme can be selected based on any suitable criteria. For example, in some embodiments, the partial diacritization scheme can be selected based on genre and domain information. In another example, in some embodiments, the partial diacritization scheme can be selected based on a performance score or a performance measurement of each partial diacritization scheme. In yet another example, the selected partial diacritization scheme can be the optimal partial diacritization scheme (e.g., the partial diacritization scheme with the best or highest performance score).
[0049] In some embodiments, a performance score calculated by the MADA system can be used to select the partial diacritization scheme. To measure the degree of undergeneration or overgeneration of the system using a particular partial diacritization scheme, the precision, the recall, and the F-score of the statistical machine translation (SMT) system or any other suitable system can be measured. The precision measurement, which is the ratio of true positives over the total number of true and false positives, measures to what extent overgeneration of diacritics occurs. The recall measurement, which is the ratio of true positives over the sum of true positives and false negatives, measures the amount of undergeneration. The F-score measurement is the harmonic mean of precision and recall. As shown below in Table 2, the F-score shows the performance score or quality measurement of the specific diacritization scheme.
Figure imgf000013_0001
Table 2
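The precision, recall, and F-score measurements defined above can be computed from counts of true positives, false positives, and false negatives. The following is a minimal sketch; the function name and the example counts are hypothetical and do not reproduce any value from Table 2.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-score) from raw counts.

    Precision falls as diacritics are overgenerated (false positives);
    recall falls as diacritics are undergenerated (false negatives).
    """
    predicted = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    total = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Hypothetical counts for one scheme.
    p, r, f = precision_recall_f(true_pos=6, false_pos=2, false_neg=6)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```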
[0050] It should be noted that any suitable criteria can be used to select the partial diacritization scheme. For example, in some embodiments, partial diacritization schemes can be applied from different scientific perspectives (e.g., psycholinguistic and neurological studies coupled with computational modeling in the context of natural language processing machinery). In another example, the partial diacritization scheme can be selected in response to determining the partial diacritization scheme with the highest MADA F-score (or any other suitable performance score determined by any suitable system that assigns full diacritizations to Arabic text). In yet another example, the optimal partial diacritization scheme can be selected in response to determining the partial diacritization scheme with the highest F-score and determining the partial diacritization scheme with rules that best fit the psycholinguistics associated with the received Arabic text.
[0051] It should also be noted that the selected partial diacritization scheme can be determined based on a combination of criteria. For example, a multidimensional table can be used to account for different combinations of criteria (e.g., performance score, domain information, genre information, frequency, etc.). The determination of the selected partial diacritization scheme to use on particular text can be calculated by assigning priorities to each of the criteria (e.g., the performance score is the most important criterion and the frequency is the least important criterion).
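One way to read the prioritized, multi-criteria selection described above is as a lexicographic comparison: schemes are compared on the highest-priority criterion first, and lower-priority criteria only break ties. The sketch below illustrates this reading; it is an assumption, since the description does not fix how priorities are combined, and a weighted sum would be an equally plausible alternative. The table layout, the function name `select_scheme`, and all numeric values are hypothetical placeholders.

```python
def select_scheme(table, priorities):
    """Pick the scheme that wins a lexicographic comparison.

    table: {scheme name: {criterion name: value}}, higher values are better.
    priorities: criterion names, most important first.
    """
    return max(table, key=lambda name: tuple(table[name][c] for c in priorities))


if __name__ == "__main__":
    # Placeholder values only; these are not measured results.
    table = {
        "GEM": {"performance": 0.9, "genre_fit": 0.2, "frequency": 0.1},
        "SUK": {"performance": 0.9, "genre_fit": 0.2, "frequency": 0.4},
        "PASS": {"performance": 0.7, "genre_fit": 0.8, "frequency": 0.9},
    }
    # Performance score is most important, frequency least important.
    print(select_scheme(table, ["performance", "genre_fit", "frequency"]))
```

With a single criterion in `priorities`, the function reduces to selecting the scheme with the highest performance score.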
[0052] Referring back to FIG. 1, in response to selecting the partial diacritization scheme from multiple partial diacritization schemes based on particular criteria at 104, the partial diacritization scheme is applied to annotate the received text with one or more diacritics at 106. At 108, the annotated text is provided. For example, these mechanisms can be incorporated into an automatic speech recognition (ASR) system, where the annotated text can be transmitted to a text-to-speech component for articulating the partially diacritized Arabic text. In another example, the annotated text can be transmitted to a statistical machine translator (SMT) that translates the partially diacritized Arabic text into another language.
[0053] FIG. 3 is a schematic diagram of an illustrative system 300 suitable for implementation of an application that partially diacritizes Arabic text in accordance with some embodiments. As illustrated, system 300 can include one or more workstations 302. Workstations 302 can be local to each other or remote from each other, and are connected by one or more communications links 304 to a communications network 306 that is linked via a communications link 308 to a server 310.
[0054] In system 300, server 310 can be any suitable server for executing the application, such as a processor, a computer, a data processing device, or a combination of such devices. Communications network 306 can be any suitable computer network including the Internet, an intranet, a wide-area network (WAN), a local-area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), or any combination of any of the same. Communications links 304 and 308 can be any communications links suitable for communicating data between workstations 302 and server 310, such as network links, dial-up links, wireless links, hard-wired links, etc. Workstations 302 can be publishing media devices, personal computers, laptop computers, mainframe computers, dumb terminals, data displays, Internet browsers, personal digital assistants (PDAs), two-way pagers, wireless terminals, portable telephones, etc., or any combination of the same. Workstations 302 and server 310 can be located at any suitable location. In one embodiment, workstations 302 and server 310 can be located within an organization. Alternatively, workstations 302 and server 310 can be distributed between multiple organizations.
[0055] The server and one of the workstations, which are depicted in FIG. 3, are illustrated in more detail in FIG. 4. Referring to FIG. 4, workstation 302 can include processor 402, display 404, input device 406, and memory 408, which can be interconnected. In a preferred embodiment, memory 408 contains a storage device for storing a workstation program for controlling processor 402. Memory 408 can also contain an application for partially diacritizing Arabic text in accordance with some embodiments. In some embodiments, the application can be resident in the memory of workstation 302 or server 310.
[0056] In one particular embodiment, the application can include client-side software, hardware, or both. For example, the application can encompass one or more Web-pages or Web-page portions (e.g., via any suitable encoding, such as Hyper Text Markup Language (HTML), Dynamic HyperText Markup Language (DHTML), Extensible Markup Language (XML), JavaServer Pages (JSP), Active Server Pages (ASP), Cold Fusion, or any other suitable approaches).
[0057] Although the application is described herein as being implemented on a workstation, this is only illustrative. The application can be implemented on any suitable platform (e.g., a personal computer (PC), a mainframe computer, a dumb terminal, a data display, a two-way pager, a wireless terminal, a portable telephone, a portable computer, a palmtop computer, a H/PC, an automobile PC, a laptop computer, a personal digital assistant (PDA), a combined cellular phone and PDA, etc.) to provide such features.
[0058] Processor 402 can use the workstation program to present on display 404 the application and the data received through communication link 304 and commands and values transmitted by a user of workstation 302. It should also be noted that data received through communication link 304 or any other communications links can be received from any suitable source, such as web services. Input device 406 can be a computer keyboard, a mouse, a touch-sensitive screen, a cursor-controller, a dial, a switchbank, lever, or any other suitable input device as would be used by a designer of input systems or process control systems.
[0059] Server 310 can include processor 420, display 422, input device 424, and memory 426, which can be interconnected. In a preferred embodiment, memory 426 contains a storage device for storing data received through communication link 308 or through other links, and also receives commands and values transmitted by one or more users. The storage device further contains a server program for controlling processor 420. In some embodiments, memory 426 can also contain an application for partially diacritizing Arabic text. For example, server 310 can use the application to partially diacritize text (e.g., Arabic text) that passes through server 310.
[0060] Accordingly, methods, systems, and media for partially diacritizing text are provided.
[0061] Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention, which is only limited by the claims which follow. Features of the disclosed embodiments can be combined and rearranged in various ways.

Claims

What is claimed is:
1. A method for partially diacritizing text, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.
2. The method of claim 1, further comprising determining that at least a portion of the received text is in a non-Latin alphabet.
3. The method of claim 1, further comprising determining that at least a portion of the received text is in Arabic.
4. The method of claim 1, further comprising generating a disambiguated form of the received text by conducting a full diacritization on the received text.
5. The method of claim 4, further comprising extracting a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.
6. The method of claim 1, further comprising using the annotated text to train the selected partial diacritization scheme.
7. The method of claim 1, further comprising training each partial diacritization scheme with statistics corresponding to a corpus.
8. The method of claim 1, wherein selecting the partial diacritization scheme from a plurality of partial diacritization schemes further comprises: determining the performance score of each of the plurality of partial diacritization schemes on the received text; and determining the partial diacritization scheme with the highest performance score.
9. The method of claim 1, wherein selecting the partial diacritization scheme from a plurality of partial diacritization schemes further comprises: receiving a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; assigning priorities to each of the plurality of criteria; generating a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and determining the partial diacritization scheme from the multidimensional table.
10. The method of claim 1, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.
11. The method of claim 1, further comprising determining whether a distinctive diacritization is associated with a word of the received text.
12. The method of claim 11, wherein the distinctive diacritization is based on a frequent rendering in a corpus.
13. A system for partially diacritizing text, the system comprising: means for receiving text; means for selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; means for applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and means for providing the annotated text.
14. The system of claim 13, further comprising means for determining that at least a portion of the received text is in a non-Latin alphabet.
15. The system of claim 13, further comprising means for determining that at least a portion of the received text is in Arabic.
16. The system of claim 13, further comprising means for generating a disambiguated form of the received text by conducting a full diacritization on the received text.
17. The system of claim 16, further comprising means for extracting a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.
18. The system of claim 13, further comprising means for using the annotated text to train the selected partial diacritization scheme.
19. The system of claim 13, further comprising means for training each partial diacritization scheme with statistics corresponding to a corpus.
20. The system of claim 13, further comprising: means for determining the performance score of each of the plurality of partial diacritization schemes on the received text; and means for determining the partial diacritization scheme with the highest performance score.
21. The system of claim 13, further comprising: means for receiving a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; means for assigning priorities to each of the plurality of criteria; means for generating a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and means for determining the partial diacritization scheme from the multidimensional table.
22. The system of claim 13, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.
23. The system of claim 13, further comprising means for determining whether a distinctive diacritization is associated with a word of the received text.
24. The system of claim 23, wherein the distinctive diacritization is based on a frequent rendering in a corpus.
25. A system for partially diacritizing text, the system comprising: a processor that: receives text; selects a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applies the selected partial diacritization scheme to annotate the received text with at least one diacritic; and provides the annotated text.
26. The system of claim 25, wherein the processor is further configured to determine that at least a portion of the received text is in a non-Latin alphabet.
27. The system of claim 25, wherein the processor is further configured to determine that at least a portion of the received text is in Arabic.
28. The system of claim 25, wherein the processor is further configured to generate a disambiguated form of the received text by conducting a full diacritization on the received text.
29. The system of claim 28, wherein the processor is further configured to extract a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.
30. The system of claim 25, wherein the processor is further configured to use the annotated text to train the selected partial diacritization scheme.
31. The system of claim 25, wherein the processor is further configured to train each partial diacritization scheme with statistics corresponding to a corpus.
32. The system of claim 25, wherein the processor is further configured to: determine the performance score of each of the plurality of partial diacritization schemes on the received text; and determine the partial diacritization scheme with the highest performance score.
33. The system of claim 25, wherein the processor is further configured to: receive a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; assign priorities to each of the plurality of criteria; generate a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and determine the partial diacritization scheme from the multidimensional table.
34. The system of claim 25, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.
35. The system of claim 25, wherein the processor is further configured to determine whether a distinctive diacritization is associated with a word of the received text.
36. The system of claim 35, wherein the distinctive diacritization is based on a frequent rendering in a corpus.
37. A computer-readable medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for partially diacritizing text, the method comprising: receiving text; selecting a partial diacritization scheme from a plurality of partial diacritization schemes based on the received text and a performance score; applying the selected partial diacritization scheme to annotate the received text with at least one diacritic; and providing the annotated text.
38. The computer-readable medium of claim 37, wherein the method further comprises determining that at least a portion of the received text is in a non-Latin alphabet.
39. The computer-readable medium of claim 37, wherein the method further comprises determining that at least a portion of the received text is in Arabic.
40. The computer-readable medium of claim 37, wherein the method further comprises generating a disambiguated form of the received text by conducting a full diacritization on the received text.
41. The computer-readable medium of claim 40, wherein the method further comprises extracting a plurality of partial diacritizations of the received text by selectively removing diacritics from the disambiguated form based on a partial diacritization scheme.
42. The computer-readable medium of claim 37, wherein the method further comprises using the annotated text to train the selected partial diacritization scheme.
43. The computer-readable medium of claim 37, wherein the method further comprises training each partial diacritization scheme with statistics corresponding to a corpus.
44. The computer-readable medium of claim 37, wherein the method further comprises: determining the performance score of each of the plurality of partial diacritization schemes on the received text; and determining the partial diacritization scheme with the highest performance score.
45. The computer-readable medium of claim 37, wherein the method further comprises: receiving a plurality of criteria for determining the selected partial diacritization scheme, wherein the plurality of criteria includes one or more of: a performance score, a frequency, genre information, and domain information; assigning priorities to each of the plurality of criteria; generating a multidimensional table that includes each partial diacritization scheme along with the assigned priorities and the associated plurality of criteria; and determining the partial diacritization scheme from the multidimensional table.
46. The computer-readable medium of claim 37, wherein the plurality of partial diacritization schemes includes one or more lexical diacritization schemes, one or more inflectional diacritization schemes, and one or more frequency-based diacritization schemes.
47. The computer-readable medium of claim 37, wherein the method further comprises determining whether a distinctive diacritization is associated with a word of the received text.
48. The computer-readable medium of claim 47, wherein the distinctive diacritization is based on a frequent rendering in a corpus.
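As an informal illustration of the extraction step (claims 17, 29, and 41) and the highest-score selection step (claims 8, 20, 32, and 44), the following Python sketch derives several partial diacritizations from a fully diacritized word by selectively removing diacritics, then picks the scheme whose output scores highest. It is a minimal sketch under stated assumptions, not the method of the application: the scheme names, the `SCHEMES` table, and the `toy_score` function are hypothetical, and a real performance score would presumably come from a downstream task or evaluation.

```python
# Arabic diacritic code points (standard Unicode values).
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"
ALL_DIACRITICS = frozenset(
    {FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SHADDA, SUKUN}
)

# Hypothetical schemes: each name maps to the set of diacritics it keeps.
SCHEMES = {
    "none": frozenset(),
    "shadda_only": frozenset({SHADDA}),
    "no_sukun": ALL_DIACRITICS - {SUKUN},
    "full": ALL_DIACRITICS,
}


def extract_partial(full_text, keep):
    """Drop every Arabic diacritic that is not in `keep`; letters are untouched."""
    return "".join(
        ch for ch in full_text if ch not in ALL_DIACRITICS or ch in keep
    )


def select_scheme(full_text, score_fn):
    """Return (scheme name, partially diacritized text) with the highest score."""
    candidates = {
        name: extract_partial(full_text, keep) for name, keep in SCHEMES.items()
    }
    best = max(candidates, key=lambda name: score_fn(candidates[name]))
    return best, candidates[best]


def toy_score(text):
    """Placeholder score (an assumption): reward a shadda, penalize each diacritic."""
    n_diacritics = sum(ch in ALL_DIACRITICS for ch in text)
    return (1.0 if SHADDA in text else 0.0) - 0.1 * n_diacritics


# Fully diacritized "darrasa": dal+fatha, ra+shadda+fatha, sin+fatha.
WORD = "\u062F\u064E\u0631\u0651\u064E\u0633\u064E"
BEST_NAME, BEST_TEXT = select_scheme(WORD, toy_score)
```

With the toy score above, the shadda-only rendering wins because it keeps the gemination mark while dropping the short vowels. Swapping in a different `score_fn` changes the selected scheme without touching the extraction logic.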
PCT/US2008/077849 2007-09-26 2008-09-26 Methods, systems, and media for partially diacritizing text WO2009042861A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US97548207P 2007-09-26 2007-09-26
US60/975,482 2007-09-26
US97578307P 2007-09-27 2007-09-27
US60/975,783 2007-09-27

Publications (1)

Publication Number Publication Date
WO2009042861A1 (en) 2009-04-02

Family

ID=40511874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/077849 WO2009042861A1 (en) 2007-09-26 2008-09-26 Methods, systems, and media for partially diacritizing text

Country Status (1)

Country Link
WO (1) WO2009042861A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8612206B2 (en) 2009-12-08 2013-12-17 Microsoft Corporation Transliterating semitic languages including diacritics
US9864745B2 (en) 2011-07-29 2018-01-09 Reginald Dalce Universal language translator
US20220188515A1 (en) * 2019-03-27 2022-06-16 Qatar Foundation For Education, Science And Community Development Method and system for diacritizing arabic text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5787452A (en) * 1996-05-21 1998-07-28 Sybase, Inc. Client/server database system with methods for multi-threaded data processing in a heterogeneous language environment
US20070156748A1 (en) * 2005-12-21 2007-07-05 Ossama Emam Method and System for Automatically Generating Multilingual Electronic Content from Unstructured Data
US20070168178A1 (en) * 2006-01-13 2007-07-19 Vadim Fux Handheld electronic device and method for disambiguation of compound text input and for prioritizing compound language solutions according to quantity of text components
US7260519B2 (en) * 2003-03-13 2007-08-21 Fuji Xerox Co., Ltd. Systems and methods for dynamically determining the attitude of a natural language speaker

Similar Documents

Publication Publication Date Title
Habash et al. Conventional orthography for dialectal Arabic.
Protopapas et al. A comparative quantitative analysis of Greek orthographic transparency
Saad et al. Arabic morphological tools for text mining
Scannell Statistical unicodification of African languages
Bakr et al. A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic
Alghamdi et al. Automatic restoration of arabic diacritics: a simple, purely statistical approach
Maamouri et al. Diacritization: A challenge to Arabic treebank annotation and parsing
Hamed et al. A survey and comparative study of Arabic diacritization tools
Hino et al. The nature of orthographic–phonological and orthographic–semantic relationships for Japanese kana and kanji words
Faili et al. Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language
Adafre Part of speech tagging for Amharic using conditional random fields
Schiff et al. Vowel representation in written Hebrew: Phonological, orthographic and morphological contexts
Tufis Using a Large Set of EAGLES-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging.
Argaw et al. An Amharic stemmer: Reducing words to their citation forms
JP2008250651A (en) Information processor, information processing method, and program
Htay et al. Myanmar word segmentation using syllable level longest matching
WO2009042861A1 (en) Methods, systems, and media for partially diacritizing text
Sporleder et al. Automatic paragraph identification: A study across languages and domains
Dash The process of designing a multidisciplinary monolingual sample corpus
Sahala et al. Automated phonological transcription of Akkadian cuneiform text
Alghamdi et al. KACST Arabic diacritizer
Megerdoomian Developing a Persian part of speech tagger
Asahiah Development of a Standard Yorùbá digital text automatic diacritic restoration system
Black et al. Syntactic annotation: linguistic aspects of grammatical tagging and skeleton parsing
Ablimit et al. Partly supervised Uyghur morpheme segmentation

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 08833798; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 08833798; Country of ref document: EP; Kind code of ref document: A1)