WO1994022097A1 - Language-sensitive searching system - Google Patents

Language-sensitive searching system

Info

Publication number
WO1994022097A1
WO1994022097A1 PCT/US1994/000014 US9400014W WO9422097A1 WO 1994022097 A1 WO1994022097 A1 WO 1994022097A1 US 9400014 W US9400014 W US 9400014W WO 9422097 A1 WO9422097 A1 WO 9422097A1
Authority
WO
WIPO (PCT)
Prior art keywords
recited
text
search
tertiary
language
Application number
PCT/US1994/000014
Other languages
French (fr)
Inventor
Mark Edward Davis
Judy Lin
Original Assignee
Taligent, Inc.
Application filed by Taligent, Inc.
Priority to AU61206/94A (AU6120694A)
Priority to JP6521007A (JPH08508124A)
Publication of WO1994022097A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/02 Comparing digital values
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/02 Indexing scheme relating to groups G06F7/02 - G06F7/026
    • G06F2207/025 String search, i.e. pattern matching, e.g. find identical word or best match in a string
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99937 Sorting

Definitions

  • This invention generally relates to improvements in computer systems and more particularly to language-sensitive text search.
  • Searching text has evolved from early systems where specific fields in a text were searchable to today's computer systems which facilitate full text searches of enormous databases of information.
  • A deficiency that exists even today in search systems is the inability to perform language-sensitive matches of information. For example, various spellings in a particular language should all match in a search. Applicant is unaware of any prior art reference that provides the solution present in the subject invention.
  • A primary objective of the present invention is to provide a language-sensitive text search.
  • An innovative system and method for performing the search is presented that performs text comparison of any Unicode strings. For any language an ordering is defined based on features of the language. Then, a search operation is performed which uses a fast, language-sensitive search of a text pattern within a larger text string. The text string is examined and a match is performed based on a predefined character precedence to determine if a language-sensitive match has been located.
  • FIG. 1 is a block diagram of a personal computer system in accordance with the subject invention
  • Figure 2 illustrates the logical composition of the UnicodeOrder in accordance with the subject invention
  • Figure 3 illustrates UnicodeOrders for English in accordance with the subject invention
  • Figure 4 illustrates an example of Unicode structures in accordance with the subject invention
  • Figure 5 represents a data structure for string comparison in accordance with the subject invention
  • FIG. 6 illustrates the flow of control and grouping in accordance with the subject invention
  • Figure 7 illustrates a UnicodeOrder based on the last UnicodeOrder in accordance with the subject invention
  • Figure 8 is a flowchart of the detailed logic in accordance with the subject invention.
  • Figure 9 is an example of a display in accordance with the subject invention.
  • FIG. 1 illustrates a typical hardware configuration of a workstation in accordance with the subject invention having a central processing unit 10, such as a conventional microprocessor, and a number of other units interconnected via a system bus 12.
  • The workstation shown in Figure 1 includes a Random Access Memory (RAM) 14, Read Only Memory (ROM) 16, an I/O adapter 18 for connecting peripheral devices such as disk units 20 to the bus, a user interface adapter 22 for connecting a keyboard 24, a mouse 26, a speaker 28, a microphone 32, and/or other user interface devices such as a touch screen device (not shown) to the bus, a communication adapter 34 for connecting the workstation to a data processing network, and a display adapter 36 for connecting the bus to a display device 38.
  • The workstation has resident thereon an operating system such as the Apple System/7 ® operating system.
  • Text collation classes include provisions for correctly collating a wide variety of natural languages, and for correct natural language searching for those languages.
  • Some languages require primary, secondary and tertiary ordering. For example, in Czech, case differences are a tertiary difference (A vs a), accent differences are a secondary difference (e vs an accented e such as é), and different base letters are a primary difference (A vs B). For these languages, if there are no primary or secondary differences in the strings, the first tertiary difference in the strings will determine the resultant order.
  • TTextOrder is an abstract base class that defines the protocol for comparing two text objects. Its subclasses provide a primitive mechanism useful for sorting or searching text objects. A TTextOrder is a required field in the user's locale.
  • The tertiary comparison results (“kSourceTertiaryLess” or “kSourceTertiaryGreater”) are returned when there are no primary or secondary differences in the strings, but there are tertiary differences in the strings (i.e., a case difference, as in 'a' versus 'A').
  • The secondary comparison results (“kSourceSecondaryLess” and “kSourceSecondaryGreater”) are returned when there is a secondary difference (i.e., an accent difference, as in a vs. an accented a such as ä).
  • Character ordering: The following constants are used to denote the ordering strength of a character: kPrimaryDifference, kSecondaryDifference, kTertiaryDifference, and kNoDifference.
  • Primary difference means that one character is strongly greater than another (e.g., 'b' and 'a'); secondary difference means that the character is "weakly greater" (such as an accent difference, e.g., an accented 'A' and 'A'); tertiary difference means that the character is "very weakly greater" (such as a case difference, 'A' and 'a'). Two characters are considered "no different" when they have equivalent Unicode encoding.
  • The caller can choose to ignore secondary, tertiary and ignorable differences by calling SetOnlyUsePrimaryDifference(). For example, one would set this flag to FALSE when doing case-sensitive matching in English. And the caller can ignore tertiary difference only by calling:
  • TTableBasedTextOrder derives from TTextOrder. It uses a table driven approach for language-sensitive text comparison.
  • the table consists of a list of TTextOrderValue objects indexed by Unicode characters.
  • a TTextOrderValue encapsulates the four natural language collation features described above. It contains an ordering value for the character, and optionally, expansion and contraction information.
  • virtual void AddUnicodeOrdering(const TBaseText& characters, EOrderStrength strength, const TBaseText& expandedCharacters);
  • A TTableBasedTextOrder object does not include the capability for dictionary-based collation, which may be required when the collation order is not deducible from the characters in the text.
  • the abbreviation St. is ambiguous, and may be sorted either as Saint, St. or Street. This behavior can be provided through subclassing: no dictionary-based collation is planned for Pink 1.0.
  • the Macintosh ® collation system provides essentially primary and secondary ordering in a similar way. However, the collation system does not supply the additional characteristics, nor provide a modular table-based mechanism for accessing this information.
  • The La Bonte process (see "Quand « Z » vient-il avant « a » ? Algorithme de tri respectant langues et cultures", Alain La Bonte, Gouvernement du Québec, Bibliothèque nationale du Québec, ISBN 2-550-21180) provides for many of the features of this ordering (such as French accents), but it requires conversion of the entire string, does not provide a table-based mechanism that can also be used in searching, nor does it provide information for determining where in two strings a weak-identity check fails. Neither one provides straightforward methods for construction, nor do they provide methods for merging.
  • a TUnicodeOrdering contains the UnicodeOrder (UO) information corresponding to a character in the string.
  • This information consists of the fields shown in Figure 2 (i.e., the logical composition of the UnicodeOrder — depending on the machine, the fields can be packed into a small amount of information).
  • the primary field 210 indicates the basic, strongest sorting order of the character.
  • the secondary order 220 is only used if the primary orders of the characters in a string are the same. For many European languages such as French, this corresponds to the difference between accents.
  • the tertiary order 230 is only used if the primary and secondary orders are the same. For most European languages, this corresponds to a case difference.
  • Unmapped Characters: For Unicode there are 65,534 possible primary UnicodeOrders. However, many times a comparison does not include values for all possible Unicodes. For example, a French comparison might not include values for sorting Hebrew characters. Any character x outside of the comparison's domain maps to 65,536 + Unicode(x). The primary values for characters that are covered by the comparison can be assigned to either the low range (2..65,535) or to the high range (131,072..196,607). This allows for a comparison to have all unmapped characters treated as before the mapped characters, or after, or any point in the middle. For example, to put all unmapped characters between a and b, a Unicode structure as shown in Figure 4 would be employed.
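The range arithmetic above can be sketched in a few lines. This is an illustrative Python sketch, not the Taligent implementation: the function name `primary_order`, the dictionary `mapped`, and the specific values chosen for "a" and "b" are assumptions. The numeric ranges are the ones stated in the text, and the sketch assumes 16-bit Unicode values as the text does.

```python
UNMAPPED_BASE = 65_536                 # unmapped character x maps to 65,536 + Unicode(x)
LOW_RANGE = range(2, 65_536)           # 2..65,535: sorts before all unmapped characters
HIGH_RANGE = range(131_072, 196_608)   # 131,072..196,607: sorts after all unmapped characters


def primary_order(ch, mapped):
    """Return the primary order of ch.

    `mapped` is a hypothetical dict holding the explicit primary values
    that a particular comparison defines.
    """
    if ch in mapped:
        return mapped[ch]
    if ord(ch) > 0xFFFF:
        raise ValueError("this sketch assumes 16-bit Unicode values")
    return UNMAPPED_BASE + ord(ch)


# Put every unmapped character between "a" and "b":
# "a" takes a low-range value, "b" a high-range value.
mapped = {"a": 2, "b": 131_072}
hebrew_alef = "\u05d0"  # not covered by this comparison
```

With this table, `primary_order("a", mapped) < primary_order(hebrew_alef, mapped) < primary_order("b", mapped)` holds, because every unmapped value falls in 65,536..131,071.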
  • French-style UnicodeOrder (isSecondaryBackward or isTertiaryBackward) can be set.
  • a single character can map to a sequence of UnicodeOrders (called a split character)
  • Example of a split character: in terms of primary ordering, ä in German sorts as ae.
  • A sequence of characters can map to a single UnicodeOrder (called a grouped character).
  • the Taligent collation process supports all cases where a sequence of one (or more) characters can map to a sequence of one (or more) UnicodeOrders, which is a combination of grouped & split characters.
  • the resulting UnicodeOrders can be rearranged in sequence.
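Grouped and split characters can be illustrated with a small lookup routine. The following Python sketch is an assumption-laden illustration, not the patent's data structure: the tables `SIMPLE`, `GROUPED` and `SPLIT`, their numeric values, and the function `primary_orders` are all hypothetical. It handles only two-character groups and does no rearrangement.

```python
# Hypothetical primary orders for a handful of letters.
SIMPLE = {"a": 1, "b": 2, "c": 3, "d": 5, "e": 6, "h": 9, "x": 24}
# Grouped character: the sequence "ch" maps to ONE order, between c and d.
GROUPED = {"ch": 4}
# Split character: a-diaeresis maps to the orders of the sequence "ae".
SPLIT = {"\u00e4": "ae"}


def primary_orders(text):
    """Map text to a list of primary orders, honoring grouped and split characters."""
    orders = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in GROUPED:
            orders.append(GROUPED[pair])      # two characters, one order
            i += 2
        elif text[i] in SPLIT:
            orders.extend(SIMPLE[c] for c in SPLIT[text[i]])  # one character, several orders
            i += 1
        else:
            orders.append(SIMPLE[text[i]])    # simple 1-1 lookup
            i += 1
    return orders
```

Sorting with `key=primary_orders` places "chx" between "cx" and "dx", and gives "\u00e4" the same primary orders as "ae".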
  • the iterator uses the Comparison to map characters to UnicodeOrders. For a simple 1-1 match, the character is matched in a dictionary. This processing permits quick access for most characters. Whenever there are grouped or split characters, a second mechanism is used to facilitate a complicated access. For example, suppose we have the following ordering:
  • This ordering is represented by the data structure appearing in Figure 5.
  • At label 500, when the characters ´, -, a, b, d, or e are accessed, the mapping is direct (the acute accent and hyphen are ignorable characters).
  • the character ⁇ is split, and two pieces of information are returned. The first is the UnicodeOrders of the start of the sequence, and the second is a sequence of one or more additional characters.
  • The character a is also split.
  • the information stored in the table can be preprocessed to present a list of UnicodeOrders. This is done by looking up the UnicodeOrders that correspond to the remaining characters.
  • This optimization can be done under two conditions: the sequence of characters can contain no reordering accents, and the Text Comparison must be complete (all the characters must have corresponding UnicodeOrders). In this case a has a
  • non-spacing marks can occur in a different order in a string, but have the same interpretation if they do not interact typographically. For example, a + underdot + circumflex is equivalent to a + circumflex + underdot.
  • Every Unicode non-spacing mark has an associated non-spacing priority (spacing marks have a null priority). Whenever a character is encountered that has a non-null priority, a reordering process is invoked. Essentially, any sequence of non-null priority marks is sorted, and their UnicodeOrders are returned in that sorted sequence. If the iterator is asked for the string position, then the position before the first unreturned UnicodeOrder is returned.
  • underdot has a larger non-spacing priority than circumflex, the iterator will return the UnicodeOrder for a, then for diaeresis, then for underdot. However, since diaeresis and breve have the same non-spacing priority (because they interact typographically), they do not rearrange. "--JE" means "does not map to”.
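The reordering of non-spacing marks can be sketched with Python's standard `unicodedata` module. This sketch substitutes the canonical combining class of present-day Unicode for the "non-spacing priority" described above; that substitution is an assumption, and the numeric classes (220 for underdot, 230 for circumflex, diaeresis and breve) are Unicode's, not values taken from this document. The function name `reorder_marks` is hypothetical.

```python
import unicodedata


def reorder_marks(text):
    """Sort each run of non-spacing marks by priority, keeping ties in place."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):   # non-null priority: collect into the current run
            run.append(ch)
        else:
            # sorted() is stable, so marks with equal priority do not rearrange
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


UNDERDOT, CIRCUMFLEX = "\u0323", "\u0302"
DIAERESIS, BREVE = "\u0308", "\u0306"
```

Here a + underdot + circumflex and a + circumflex + underdot come out identical, while diaeresis and breve share a priority and therefore keep their original relative order.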
  • This flow of control expresses the logical process: there are a number of optimizations that can also be performed depending on the machine architecture. For example, if the UnicodeOrder is properly constructed, then the primary, secondary and tertiary equality check can be done with one machine instruction.
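One way to picture the single-instruction equality check is to pack the three fields into one integer so that a masked comparison covers several levels at once. The Python sketch below is only an illustration: the field widths (8 bits for secondary, 6 for tertiary) and the names `pack`, `PRIMARY_MASK`, `PRIMARY_SECONDARY_MASK` and `same_up_to` are assumptions, since the actual packing depends on the machine.

```python
SECONDARY_BITS = 8   # assumed width of the secondary field
TERTIARY_BITS = 6    # assumed width of the tertiary field


def pack(primary, secondary, tertiary):
    """Pack (primary, secondary, tertiary) into one integer, primary highest."""
    return (primary << (SECONDARY_BITS + TERTIARY_BITS)) | (secondary << TERTIARY_BITS) | tertiary


# Masks that blank out the weaker fields.
PRIMARY_MASK = ~((1 << (SECONDARY_BITS + TERTIARY_BITS)) - 1)
PRIMARY_SECONDARY_MASK = ~((1 << TERTIARY_BITS) - 1)


def same_up_to(a, b, mask):
    """True if a and b agree on every field kept by mask (one XOR, one AND)."""
    return ((a ^ b) & mask) == 0
```

Full three-level equality is then a plain integer comparison, and because the primary field occupies the high bits, ordinary integer ordering of packed values respects the primary, then secondary, then tertiary precedence for a single character.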
  • Whenever a mapping is added, the strength of the relation between that character and the last one in the comparison must be specified: equal, primary and secondary equal, primary equal, or strictly greater. (If the mapping is the first in the comparison, then the "last" mapping is assumed to be <ignorable, 0, 0>.) Each of these produces a UnicodeOrder based on the last UnicodeOrder in the Text Comparison (in the following, primary, secondary and tertiary are abbreviated p, s, and t, respectively), as shown in Figure 7.
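Figure 7 is not reproduced in this text, so the following Python sketch shows one plausible rule for deriving each new order from the last one; it is an assumption, not the figure's content. The strength labels, the function names `next_order` and `build`, and the use of 0 as the ignorable primary are all hypothetical.

```python
def next_order(last, strength):
    """Derive a new (p, s, t) order from the last one in the comparison.

    Assumed rule: a stronger difference bumps its own field and resets
    every weaker field to zero.
    """
    p, s, t = last
    if strength == "equal":        # identical order
        return (p, s, t)
    if strength == "tertiary":     # primary and secondary equal
        return (p, s, t + 1)
    if strength == "secondary":    # primary equal
        return (p, s + 1, 0)
    if strength == "primary":      # strictly greater
        return (p + 1, 0, 0)
    raise ValueError(f"unknown strength: {strength}")


def build(mappings):
    """Build a table from (character, strength) pairs listed in ascending order."""
    table = {}
    last = (0, 0, 0)               # stands in for <ignorable, 0, 0>
    for ch, strength in mappings:
        last = next_order(last, strength)
        table[ch] = last
    return table


table = build([("a", "primary"), ("A", "tertiary"), ("\u00e1", "secondary"), ("b", "primary")])
```

Under this assumed rule the resulting table gives a = (1, 0, 0), A = (1, 0, 1), á = (1, 1, 0) and b = (2, 0, 0).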
  • A new character mapping can be added to the table by adding one of a number of alternatives:
    a. a single character x1
    b. a grouped character x1..xn
    c. a split character x1/y1..yn
  • the data in it can be retrieved by iterating through from the first element to the last.
  • a second text comparison can be merged into the first so that all mappings (except unmapped characters) in the first are maintained, and as many of the new mappings from the second are maintained as possible.
  • An example of this is to merge a French Text Comparison into an Arabic Text Comparison. All of the relationships among the Arabic characters (including characters common to both Text Comparisons such as punctuation) should be preserved; relationships among new characters (e.g. Latin) that are not covered by the Arabic Text Comparison will be added.
  • FIG. 8 is a flowchart of the detailed logic in accordance with the subject invention. Processing commences at function block 200 where the temporary result is initialized to a predetermined value. Then, at input block 202, the next source key and the next target key are obtained. A test is performed at decision block 204 to determine if the source primary has the same value as the target primary. If the source primary is not equal, then another test is performed at decision block 214 to determine if the source primary is ignorable. If so, then another test is performed at decision block 220 to determine if the search key should include a match of the primary only or some additional secondary information.
  • If only a primary match is desired, then a match has been completed and control is passed to input block 260 to obtain the next source key and subsequently to decision block 204. If a secondary match is also desired as detected at decision block 220, then a test is performed at decision block 230 to determine if the source secondary is ignorable. If the source secondary is not ignorable, then the temporary result is updated with the source position, target position and secondary position set equal to GREATER. Then control is passed to input block 260 to obtain the next source key and subsequently to decision block 204. If the source secondary is ignorable as detected at decision block 230, then another test is performed at decision block 232 to determine if a secondary match is only desired or if tertiary information has been saved.
  • control is passed to input block 260 to obtain the next source key and subsequently to decision block 204. If not, then the temporary result is updated with the source position, target position and secondary position set equal to GREATER. Then, control is passed to input block 260 to obtain the next source key and subsequently to decision block 204.
  • If the source primary is not ignorable at decision block 214, then another test is performed at decision block 216 to determine if the target primary is ignorable. If so, then another test is performed at decision block 222 to determine if the search key should include a match of the primary only or some additional secondary information. If only a primary match is desired, then a match has been completed and control is passed to input block 262 to obtain the next target key and subsequently to decision block 204. If not, then another test is performed at decision block 234 to determine if the target secondary is ignorable. If so, then the temporary result is updated with the source position, target position and secondary comparison set equal to LESS. Then, control is passed to input block 262 to obtain the next target key and subsequently to decision block 204.
  • control is passed to input block 262 to obtain the next target key and subsequently to decision block 204. If not, then the temporary result is updated with the source position, target position and secondary comparison set equal to LESS. Then, control is passed to input block 262 to obtain the next target key and subsequently to decision block 204.
  • the temporary result is updated with the source position, target position and primary comparison is set equal to the primary comparison result and control is passed to decision block 210.
  • If the source secondary is equal to the target secondary as detected at decision block 210, then another test is performed at decision block 212 to determine if the source tertiary equals the target tertiary. If so, then control is passed to input block 202 to obtain the next source and target key. If not, then a test is performed at decision block 226 to determine if source and target information has been saved or tertiary information. If so, then control is passed to input block 202 to obtain the next source and target key. If not, then the temporary result is set equal to the source position, target position, and the tertiary comparison result in function block 246 and control is passed to input block 202 to obtain the next source and target key.
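The overall shape of this comparison loop can be sketched compactly. The Python below is a simplified illustration and not a transcription of Figure 8: it omits the primary-only option and position tracking, represents each key as a hypothetical (primary, secondary, tertiary) tuple with primary 0 meaning ignorable, and uses assumed names (`compare_keys`, `RANK`, `IGNORABLE`). It keeps the essential behavior: a primary difference ends the comparison at once, an ignorable key on one side is skipped and recorded as a secondary difference (GREATER for the source, LESS for the target), and weaker differences are only remembered.

```python
IGNORABLE = 0
RANK = {"equal": 0, "tertiary": 1, "secondary": 2}


def compare_keys(source, target):
    """Compare two lists of (primary, secondary, tertiary) keys.

    Returns (level, sign): level is "primary", "secondary", "tertiary" or
    "equal"; sign is -1 (source less), 0, or 1 (source greater).
    """
    saved = ("equal", 0)            # the temporary result

    def remember(level, sign):
        nonlocal saved
        if RANK[level] > RANK[saved[0]]:   # keep the first difference at the strongest level seen
            saved = (level, sign)

    i = j = 0
    while i < len(source) or j < len(target):
        s = source[i] if i < len(source) else None
        t = target[j] if j < len(target) else None
        if s is not None and t is not None and s[0] == t[0]:
            if s[1] != t[1]:
                remember("secondary", -1 if s[1] < t[1] else 1)
            elif s[2] != t[2]:
                remember("tertiary", -1 if s[2] < t[2] else 1)
            i += 1
            j += 1
        elif s is not None and s[0] == IGNORABLE:
            remember("secondary", 1)       # extra ignorable in the source
            i += 1
        elif t is not None and t[0] == IGNORABLE:
            remember("secondary", -1)      # extra ignorable in the target
            j += 1
        elif s is None:
            return ("primary", -1)         # source ran out first
        elif t is None:
            return ("primary", 1)          # target ran out first
        else:
            return ("primary", -1 if s[0] < t[0] else 1)
    return saved


# Hypothetical keys for a few characters.
a, A, b, x, hyphen = (1, 0, 0), (1, 0, 1), (2, 0, 0), (3, 0, 0), (0, 1, 0)
```

For example, comparing the keys of "ax" with those of "Ax" yields a tertiary result, and comparing "a-x" with "ax" yields a secondary result with the source greater.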
  • the display corresponds to a system which allows language attributes to be associated with a
  • A user can choose the preferred text comparison for any particular language and associate the text comparison with unmarked text (text without any language attribute).
  • A user can also create a new text comparison or modify an existing one.
  • When editing, the user is presented with a table, as depicted in Figure 9, listing mappings in the comparison in ascending order.
  • The user can select one or more mappings with the mouse.
  • The selected items can be deleted, cut, copied, or moved by dragging.
  • A new mapping can also be inserted at any point.
  • The left-most column 900 indicates the relationship of the current mapping to the previous (above) mapping. Clicking on the column produces a pop-up menu with a choice of symbols indicating primary-greater, secondary-greater, tertiary-greater, or equal, and an orthogonal set indicating French-secondary and/or French-tertiary.
  • There is one special mapping, the unmapped-characters mapping, which contains a symbol for indicating that any unmapped characters go at this point. Since there is always exactly one such location in the text comparison, this mapping is handled specially. If it is deleted, then it will appear at the end of the mapping. If another one is pasted in, then any previously-existing unmapped-characters mapping will be removed.
  • the center column 910 contains the main character(s) in the mapping; the right-most column contains the expansion characters (if any). These can be edited just like any other text in the system.
  • BM Boyer-Moore process
  • baed the difference between exploding and imploding characters is important, but for searching it is not generally significant. That is, with a secondary-strength match, whether the text comparison has a ⁇ a /e or ae ⁇ a is not important: baed and bad match in either case. The one case where this is important is at the end of a pattern. That is, ba should be found in baed, but should not be found in baad.
  • mappings must also be able to process a string in reverse order: in particular, retrieve imploding and exploding mappings in reverse order.
  • Example: oo doesn't need to be included, since it corresponds to the product of o-, which is the same length. However, ae does need to be included, since it corresponds to a, which is shorter.
  • Goal Search for a pattern string within a target string.
  • Input pattern string, target string, strength (primary, secondary, tertiary, bitwise), Text Comparison
  • the first step is to preprocess the pattern string with the Text Comparison to produce an index table (see the next section for details).
  • the pattern string is successively shifted through the target.
  • process the pattern string from the end looking for matches.
  • the Text Comparison is used to process the target string in reverse order, looking up Unicode Orderings (UO). If a match fails, then an index table is employed to shift the pattern string by a specified amount.
  • MTML minimum trailing match length
  • the Boyer-Moore process uses one table indexed by position, and one table indexed by character. In this variant, the latter corresponds to indexing by Unicode Ordering. Build the index tables for the pattern string by traversing the list of Unicode Orderings from back to front as in Boyer-Moore, making the following changes:
  • the index at any position shows how far to shift the processed string at that position if a match has failed against the Unicode Ordering at that position.
  • the index value should be the minimal amount to shift (using the MTML table) such that the current trailing substring could next be found in the pattern.
  • the shift value is 5 because the rightmost occurrence of obae occurs after a trailing sequence with length 5 ( ⁇ ).
  • The index table indexed by Unicode Ordering consists of a small array of integers (e.g., 256 integers).
  • the Unicode Ordering is hashed into an integer within the range covered by the array.
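The hashed shift table can be sketched with the Horspool simplification of Boyer-Moore, which keeps only the table indexed by the ordering under the end of the window. This is a swapped-in, simpler technique chosen for illustration; it does not reproduce the MTML table or the position-indexed table described above. The function names, the 256-entry size, and the crude `primary_orders` stand-in (strip accents and hyphens, fold case) are all assumptions.

```python
import unicodedata

TABLE_SIZE = 256  # small array of integers, indexed by a hash of the ordering


def primary_orders(text):
    """Hypothetical stand-in for a Text Comparison at primary strength."""
    orders = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) or ch == "-":
            continue                      # accents and hyphens are ignorable here
        orders.append(ord(ch.lower()[0]))
    return orders


def build_shift_table(pattern):
    """Horspool shift table, hashed into TABLE_SIZE slots."""
    m = len(pattern)
    table = [m] * TABLE_SIZE
    for k, order in enumerate(pattern[:-1]):
        slot = hash(order) % TABLE_SIZE
        # On a hash collision keep the smaller shift, which is always safe.
        table[slot] = min(table[slot], m - 1 - k)
    return table


def search(pattern, target):
    """Return the index in `target` (a list of orders) of the first match, or -1."""
    m, n = len(pattern), len(target)
    if m == 0:
        return 0
    table = build_shift_table(pattern)
    pos = 0
    while pos + m <= n:
        k = m - 1
        while k >= 0 and pattern[k] == target[pos + k]:   # compare from the end
            k -= 1
        if k < 0:
            return pos
        pos += table[hash(target[pos + m - 1]) % TABLE_SIZE]
    return -1
```

Note that the returned index refers to the sequence of orderings, not to the original string; a fuller implementation would map it back to a string position, which differs whenever ignorable, split or grouped characters precede the match.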
  • ba should be found in baed, but should not be found in baad. (This can be made a user option for more control.)
  • Another method for constructing a language-sensitive searcher is to produce a state machine that will recognize each of the various forms (baed and bad), and also disregard any ignorable characters.
  • this technique does not perform as well as the sub-linear methods, such as the method discussed in Gonnet, G.H. and Baeza-Yates, R. Handbook of Algorithms and Data Structures — In Pascal and C. Second ed. Addison-Wesley, Wokingham, UK 1991.
  • a key question here is based on the number of comparisons required in each method and the lookup time per character in the state table vs. in the Text Comparison.
  • The lookup time is quite small as long as the character is not exploding or imploding, so the performance is dependent on the proportion of such characters in the target text, which is generally quite small.

Abstract

A method and system for providing a language-sensitive text search. An innovative system and method for performing the search is presented that performs text comparison of any Unicode strings. For any language an ordering is defined based on features of the language. Then, an interactive compare function is performed to determine the relationship of a pair of strings. The string is examined and a compare is performed one or more characters at a time based on a predefined character precedence.

Description

LANGUAGE-SENSITIVE SEARCHING SYSTEM
COPYRIGHT NOTIFICATION
Portions of this patent application contain materials that are subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
CROSS-REFERENCE TO RELATED PATENT APPLICATIONS
This patent application is related to the patent application entitled Object Oriented Framework System, by Debra L. Orton, David B. Goldsmith, Christopher P. Moeller, and Andrew G. Heninger, filed 12/23/92, and assigned to Taligent, the disclosure of which is hereby incorporated by reference.
Field of the Invention
This invention generally relates to improvements in computer systems and more particularly to language-sensitive text search.
Background of the Invention
Among developers of workstation software, it is increasingly important to provide a flexible software environment while maintaining consistency in the user's interface. An early attempt at providing this type of an operating environment is disclosed in US Patent 4,686,522 to Hernandez et al. This patent discusses a combined graphic and text processing system in which a user can invoke a dynamic menu at the location of the cursor and invoke any of a variety of functions from the menu. This type of natural interaction with a user improves the user interface and makes the application much more intuitive.
Searching text has evolved from early systems where specific fields in a text were searchable to today's computer systems which facilitate full text searches of enormous databases of information. A deficiency that exists even today in search systems is the inability to perform language-sensitive matches of information. For example, various spellings in a particular language should all match in a search. Applicant is unaware of any prior art reference that provides the solution present in the subject invention.
Summary of the Invention
Accordingly, it is a primary objective of the present invention to provide a language-sensitive text search. An innovative system and method for performing the search is presented that performs text comparison of any Unicode strings. For any language an ordering is defined based on features of the language. Then, a search operation is performed which uses a fast, language-sensitive search of a text pattern within a larger text string. The text string is examined and a match is performed based on a predefined character precedence to determine if a language-sensitive match has been located.
Brief Description of the Drawings
Figure 1 is a block diagram of a personal computer system in accordance with the subject invention;
Figure 2 illustrates the logical composition of the UnicodeOrder in accordance with the subject invention;
Figure 3 illustrates UnicodeOrders for English in accordance with the subject invention;
Figure 4 illustrates an example of Unicode structures in accordance with the subject invention;
Figure 5 represents a data structure for string comparison in accordance with the subject invention;
Figure 6 illustrates the flow of control and grouping in accordance with the subject invention;
Figure 7 illustrates a UnicodeOrder based on the last UnicodeOrder in accordance with the subject invention;
Figure 8 is a flowchart of the detailed logic in accordance with the subject invention; and
Figure 9 is an example of a display in accordance with the subject invention.
Detailed Description Of The Invention
The invention is preferably practiced in the context of an operating system resident on a personal computer such as the IBM ® PS/2 ® or Apple ® Macintosh ® computer. A representative hardware environment is depicted in Figure 1, which illustrates a typical hardware configuration of a workstation in accordance with the subject invention having a central processing unit 10, such as a conventional microprocessor, and a number of other units interconnected via a system bus 12. The workstation shown in Figure 1 includes a Random Access Memory (RAM) 14, Read Only Memory (ROM) 16, an I/O adapter 18 for connecting peripheral devices such as disk units 20 to the bus, a user interface adapter 22 for connecting a keyboard 24, a mouse 26, a speaker 28, a microphone 32, and/or other user interface devices such as a touch screen device (not shown) to the bus, a communication adapter 34 for connecting the workstation to a data processing network, and a display adapter 36 for connecting the bus to a display device 38. The workstation has resident thereon an operating system such as the Apple System/7 ® operating system.
No character encoding contains enough information to provide good alphabetical ordering for any natural language: in the Macintosh, for example, simple byte-wise comparison incorrectly yields: "A" < "Z" < "a" < "z" < "N" < "0" < "A".
Text collation classes include provisions for correctly collating a wide variety of natural languages, and for correct natural language searching for those languages.
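The problem with comparing raw character codes is easy to reproduce. A minimal Python illustration follows; the word list is an assumption chosen for the demonstration, and Python compares strings by Unicode code point rather than by Macintosh byte values, but the effect is of the same kind.

```python
# Sorting by raw code point puts every uppercase letter before every
# lowercase letter, and accented letters after "z".
words = ["zebra", "apple", "Zurich", "Apple", "\u00e4rger"]  # "\u00e4" is a-diaeresis

by_code_point = sorted(words)
```

The result places "Zurich" before "apple" and pushes the word beginning with ä to the very end, which no natural-language alphabetization would do.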
Natural Language Collation Features
Correct comparison and sorting of natural language text requires the following capabilities. These capabilities are of paramount importance to programmers who are building comparison objects; a set of pre-defined comparison objects for different languages will also be available.
A. Ordering Priorities
The first primary difference in a string will determine the resultant order, no matter what the other characters are.
Example: cat < dog
If there are no primary differences in the string, then the first secondary difference in the strings will determine the resultant order.
Example: ax < Ax < Ax < axx
Example: china < China < chinas
Some languages require primary, secondary and tertiary ordering. For example, in Czech, case differences are a tertiary difference (A vs a), accent differences are a secondary difference (e vs an accented e such as é), and different base letters are a primary difference (A vs B). For these languages, if there are no primary or secondary differences in the strings, the first tertiary difference in the strings will determine the resultant order.
Example: ax < Ax
Example: ax < Ax
B. Grouped characters
In collating some languages, a sequence of characters is treated as though it were a single letter of the alphabet.
Example: cx < chx < dx
C. Expanding characters
In some languages, a single character is treated as though it were a sequence of letters of the alphabet.
Example: aex < æx < aexx
D. Ignored characters
Certain characters are ignored when collating. That is, they are not significant unless there are no other differences in the remainder of the string.
Example: ax < a-x < a-xx
Example: blackbird < black-bird < blackbirds
The specific characters that have these behaviors are dependent on the specific language: "a" < "ä" is a weak ordering in German, but not in Swedish; "ch" is a grouped character in Spanish, but not in English, etc. Orderings can also differ within a language: users may want a modified German ordering, for example, to get the alternate standard where "ä" is treated as an expanding character.
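The ordering priorities of feature A can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Key` struct, the `compareKeys` and `keysFor` names, and the signed result encoding (chosen to resemble the EComparisonResult values given later in the class description) are hypothetical, and grouped, expanding and ignored characters are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical three-level key for one character (illustrative names).
struct Key {
    int primary;    // base letter
    int secondary;  // accent
    int tertiary;   // case
};

// Returns -3/3 for a primary difference, -2/2 for secondary,
// -1/1 for tertiary, and 0 when no difference is found.
int compareKeys(const std::vector<Key>& source, const std::vector<Key>& target) {
    int weak = 0;  // first secondary or tertiary difference seen so far
    std::size_t n = source.size() < target.size() ? source.size() : target.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Key& s = source[i];
        const Key& t = target[i];
        if (s.primary != t.primary) {
            return s.primary < t.primary ? -3 : 3;  // first primary difference decides
        }
        if (s.secondary != t.secondary) {
            // The first secondary difference outranks any earlier tertiary one.
            if (weak > -2 && weak < 2) {
                weak = s.secondary < t.secondary ? -2 : 2;
            }
        } else if (s.tertiary != t.tertiary && weak == 0) {
            weak = s.tertiary < t.tertiary ? -1 : 1;
        }
    }
    // Additional characters in the longer string count as a primary difference.
    if (source.size() != target.size()) {
        return source.size() < target.size() ? -3 : 3;
    }
    return weak;
}

// Helper for the examples: ASCII letters only, lowercase before uppercase.
std::vector<Key> keysFor(const std::string& text) {
    std::vector<Key> keys;
    for (char c : text) {
        if (c >= 'A' && c <= 'Z') {
            keys.push_back(Key{c - 'A', 0, 1});
        } else {
            keys.push_back(Key{c - 'a', 0, 0});
        }
    }
    return keys;
}
```

With these assumptions, `compareKeys(keysFor("china"), keysFor("China"))` reports a tertiary difference, while `"China"` against `"chinas"` reports a primary one because of the extra letter.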
CLASS DESCRIPTION
The following classes are provided for text comparison and searching:
• TTextOrder
• TTableBasedTextOrder
1. TTextOrder
TTextOrder is an abstract base class that defines the protocol for comparing two text objects. Its subclasses provide a primitive mechanism useful for sorting or searching text objects. A TTextOrder is a required field in the user's locale.
Comparison result: Comparing two text objects can return the following results: kSourcePrimaryLess, kSourceSecondaryLess, kSourceTertiaryLess, kSourceEqual, kSourceTertiaryGreater, kSourceSecondaryGreater or kSourcePrimaryGreater. Two objects are equal only where the strings are bit-for-bit equal, or there are equivalent Unicode sequences for a given letter. For example, "ü" can either be expressed with the "ü" character, or with the sequence "u" + "¨".
The tertiary comparison results ("kSourceTertiaryLess" or "kSourceTertiaryGreater") are returned when there are no primary or secondary differences in the strings, but there are tertiary differences in the strings (i.e., a case difference, as in 'a' versus 'A'). The secondary comparison results ("kSourceSecondaryLess" and "kSourceSecondaryGreater") are returned when there is a secondary difference (i.e., an accent difference, as in a vs. an accented a).
The primary comparison results ("kSourcePrimaryLess" and "kSourcePrimaryGreater") are returned when there is a primary difference in the string (i.e., character differences as in a vs. b). This also includes the case where up to the end of one of the strings there are no primary differences, but the other string contains additional, non-ignorable characters.
Character ordering: The following constants are used to denote the ordering strength of a character: kPrimaryDifference, kSecondaryDifference, kTertiaryDifference, and kNoDifference. Primary difference means that one character is strongly greater than another (e.g., 'b' and 'a'); secondary difference means that the character is "weakly greater" (such as an accent difference, an accented 'A' and 'A'); tertiary difference means that the character is "very weakly greater" (such as a case difference, 'A' and 'a'). Two characters are considered "no different" when they have equivalent Unicode encoding. The caller can choose to ignore secondary, tertiary and ignorable differences by calling SetOnlyUsePrimaryDifference(). For example, one would set this flag to FALSE when doing case-sensitive matching in English. The caller can ignore tertiary differences only by calling SetOnlyUsePrimaryAndSecondaryDifference().

Public Methods

enum EOrderStrength {
    kPrimaryDifference,
    kSecondaryDifference,
    kTertiaryDifference,
    kNoDifference
};

enum EComparisonResult {
    kSourcePrimaryLess = -3,
    kSourceSecondaryLess = -2,
    kSourceTertiaryLess = -1,
    kSourceEqual = 0,
    kSourceTertiaryGreater = 1,
    kSourceSecondaryGreater = 2,
    kSourcePrimaryGreater = 3
};
//==============================================================
// Compares two TText objects, returns the comparison result as well
// as the number of characters matched. Result is always relative to
// the sourceText, i.e., 'kSourcePrimaryLess' means sourceText is primarily
// less than targetText.
//
virtual EComparisonResult Compare(const TBaseText& sourceText,
                                  const TBaseText& targetText) const = 0;
virtual EComparisonResult Compare(const TBaseText& sourceText,
                                  const TBaseText& targetText,
                                  unsigned long& sourceCharactersMatched,
                                  unsigned long& targetCharactersMatched) const = 0;
//==============================================================
// Switch to ignore all but primary difference, which
// is case-insensitive matching if tertiary ordering is not used.
// Default is FALSE.
//
virtual void SetOnlyUsePrimaryDifference(Boolean flag);

//==============================================================
// Switch to ignore tertiary difference, which is case-insensitive matching
// if tertiary ordering is used. Default is FALSE.
//
virtual void SetOnlyUsePrimaryAndSecondaryDifference(Boolean flag);

//==============================================================
// Flag to indicate whether we should use backward Secondary Ordering and
// backward Tertiary Ordering or not. The default value is set to FALSE.
// For example, in French, secondary ordering is counted from back to front.
// Assuming á > a (secondary greater), if SetBackwardSecondaryOrdering() is
// set to TRUE, áta < atá (secondary less) because both have the same primary
// ordering and the secondary ordering is being looked at from the back,
// with the third character "a" of áta less than the third character "á" of atá.
// Default is set to FALSE.
virtual void SetBackwardSecondaryOrdering(Boolean flag);
virtual Boolean GetBackwardSecondaryOrdering() const;
virtual void SetBackwardTertiaryOrdering(Boolean flag);
virtual Boolean GetBackwardTertiaryOrdering() const;

//==============================================================
// Additional comparison method for convenience. Calls Compare().
// Subclass: Should not override. Override Compare instead.
//
// If 'OnlyUsePrimaryDifference', returns TRUE if Compare()
// returns 'kSourceEqual', 'kSourceSecondaryLess',
// 'kSourceSecondaryGreater', 'kSourceTertiaryLess',
// or 'kSourceTertiaryGreater';
// If 'OnlyUsePrimaryAndSecondaryDifference', returns TRUE if Compare()
// returns 'kSourceEqual', 'kSourceTertiaryLess', or
// 'kSourceTertiaryGreater';
// else returns TRUE for 'kSourceEqual' only.
Boolean TextIsEqual(const TBaseText& sourceText, const TBaseText& targetText) const;

//==============================================================
// If 'OnlyUsePrimaryDifference', returns TRUE if Compare()
// returns 'kSourcePrimaryGreater';
// If 'OnlyUsePrimaryAndSecondaryDifference', returns TRUE if Compare()
// returns 'kSourceSecondaryGreater' or 'kSourcePrimaryGreater';
// else returns TRUE for 'kSourceTertiaryGreater', 'kSourceSecondaryGreater'
// or 'kSourcePrimaryGreater'.
Boolean TextIsGreaterThan(const TBaseText& sourceText, const TBaseText& targetText) const;

//==============================================================
// If 'OnlyUsePrimaryDifference', returns TRUE if Compare()
// returns 'kSourcePrimaryLess';
// If 'OnlyUsePrimaryAndSecondaryDifference', returns TRUE if Compare()
// returns 'kSourceSecondaryLess' or 'kSourcePrimaryLess';
// else returns TRUE for 'kSourceTertiaryLess', 'kSourceSecondaryLess'
// or 'kSourcePrimaryLess'.
Boolean TextIsLessThan(const TBaseText& sourceText, const TBaseText& targetText) const;

//==============================================================
// Getter/setter to determine if this text order contains "grouped" or
// "expanding" characters.
//
Boolean HasSpecialCharacters() const;
virtual void SetHasSpecialCharacters(Boolean flag);

//==============================================================
// Get and set the name of this object
//
virtual void GetName(TLocaleName& name) const;
virtual void SetName(const TLocaleName& name);

Protected Methods

Boolean OnlyUsePrimaryDifference() const;
Boolean OnlyUsePrimaryAndSecondaryDifference() const;
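The comments on TextIsEqual, TextIsGreaterThan and TextIsLessThan can be read as a threshold on the magnitude of the Compare() result. The following sketch illustrates that reading; the `MatchFlags` struct and the free functions are hypothetical stand-ins for the class members, and only the enum values are taken from the declaration above.

```cpp
// Values copied from the EComparisonResult declaration.
enum EComparisonResult {
    kSourcePrimaryLess = -3,
    kSourceSecondaryLess = -2,
    kSourceTertiaryLess = -1,
    kSourceEqual = 0,
    kSourceTertiaryGreater = 1,
    kSourceSecondaryGreater = 2,
    kSourcePrimaryGreater = 3
};

// Hypothetical stand-in for the two switches on a text order object.
struct MatchFlags {
    bool onlyPrimary = false;              // SetOnlyUsePrimaryDifference(TRUE)
    bool onlyPrimaryAndSecondary = false;  // SetOnlyUsePrimaryAndSecondaryDifference(TRUE)
};

// Smallest result magnitude that still counts as a difference.
int threshold(const MatchFlags& flags) {
    if (flags.onlyPrimary) return 3;
    if (flags.onlyPrimaryAndSecondary) return 2;
    return 1;
}

bool textIsEqual(EComparisonResult r, const MatchFlags& flags) {
    return r > -threshold(flags) && r < threshold(flags);
}

bool textIsGreaterThan(EComparisonResult r, const MatchFlags& flags) {
    return r >= threshold(flags);
}

bool textIsLessThan(EComparisonResult r, const MatchFlags& flags) {
    return r <= -threshold(flags);
}
```

Under this reading, a result of kSourceSecondaryGreater counts as "equal" when only primary differences are used, and as "greater" otherwise.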
2. TTableBasedTextOrder
TTableBasedTextOrder derives from TTextOrder. It uses a table driven approach for language-sensitive text comparison. The table consists of a list of TTextOrderValue objects indexed by Unicode characters. A TTextOrderValue encapsulates the four natural language collation features described above. It contains an ordering value for the character, and optionally, expansion and contraction information.
Constructing the table: Currently, the table is constructed based on a text specification. In the future, there will be a TUserInterface object that is responsible for displaying and editing the table data. A series of characters in increasing sorting order can be added programmatically to the table by successively calling AddComparisonValue().

Public Methods

//==============================================================
// Constructor to create an ordering object from the table specified
// by the contents of "file". This is temporary until we have an editor
// to construct tables.
//
TTableBasedTextOrder(const TFile& tableSpecification);

//==============================================================
// TTextOrder overrides. Uses the table to implement comparison.
//
virtual EComparisonResult Compare(const TBaseText& sourceText,
                                  const TBaseText& targetText) const;
virtual EComparisonResult Compare(const TBaseText& sourceText,
                                  const TBaseText& targetText,
                                  unsigned long& sourceCharactersMatched,
                                  unsigned long& targetCharactersMatched) const;

//==============================================================
// Given key, which is one or more characters (it is always one except for
// cases like 'ch' which sorts as a single character), and the order strength,
// construct the value and add it as the greatest value currently in the table
// (i.e., add to the end). These methods automatically set "HasSpecialCharacters".
//
virtual void AddUnicodeOrdering(const TBaseText& key,
                                EOrderStrength strength);
virtual void AddUnicodeOrdering(const TBaseText& characters,
                                EOrderStrength strength,
                                const TBaseText& expandedCharacters);
/* 'ExpandedCharacters' are those that should be part of the expansion, i.e., "e" when the key is "æ". */
Dictionary-Based Collation
A TTableBasedTextOrder object does not include the capability for dictionary-based collation, which may be required when the collation order is not deducible from the characters in the text. For example, the abbreviation St. is ambiguous, and may be sorted either as Saint, St. or Street. This behavior can be provided through subclassing: no dictionary-based collation is planned for Pink 1.0.
EXAMPLES IN ACCORDANCE WITH A PREFERRED EMBODIMENT OF THE SUBJECT INVENTION

1. Comparing two text objects:

void Compare()
{
    // compare two text objects using the text order in the current user's locale.
    TLocale *locale = TLocale::GetDefaultLocale();
    TTextOrder *order = locale->GetTextOrder();
    TText sourceText("text object1");
    TText targetText("text object2");
    if (order->TextIsEqual(sourceText, targetText)) {
        // the two text objects are equal
    }
}
TEXT COMPARISON INTERNALS
This section describes the internals of the process used to do language-sensitive text comparison. The Taligent comparison process allows comparison of any Unicode™ strings. Unicode is a trademark of Unicode, Inc. For details about Unicode, see The Unicode Standard: Worldwide Character Encoding, Version 2.0, Volumes 1,2 by Unicode, Inc., Addison-Wesley Publishing Company, Inc. ISBN 0-201-56788-1, ISBN 0-201-60845-6. The process can also be adapted to more limited character sets. The information presented below describes the logical process of comparison.
The Macintosh ® collation system provides essentially primary and secondary ordering in a similar way. However, the collation system does not supply the additional characteristics, nor provide a modular table-based mechanism for accessing this information. The La Bonte process (See "Quand « Z » vient-il avant « a » ? algorithme de tri respectant langues et cultures", Alain La Bonte, Gouvernement du Quebec, Bibliotheque nationale du Quebec, ISBN 2-550-21180) provides for many of the features of this ordering (such as French accents), but it requires conversion of the entire string, does not provide a table-based mechanism that can also be used in searching, nor does it provide information for determining where in two strings a weak-identity check fails. Neither one provides straightforward methods for construction, nor do they provide methods for merging.
Orderings
A TUnicodeOrdering contains the UnicodeOrder (UO) information corresponding to a character in the string. This information consists of the fields shown in Figure 2 (i.e., the logical composition of the UnicodeOrder — depending on the machine, the fields can be packed into a small amount of information).
TUnicodeOrdering
The primary field 210 indicates the basic, strongest sorting order of the character. The secondary order 220 is only used if the primary orders of the characters in a string are the same. For many European languages such as French, this corresponds to the difference between accents. The tertiary order 230 is only used if the primary and secondary orders are the same. For most European languages, this corresponds to a case difference. For example, Figure 3 illustrates UnicodeOrders for English. When two strings x and y have a primary difference between them based upon the text comparison, and x is less than y at the first primary difference, we say that x is primary-less than y and write x <<< y; similarly, x can be secondary-less than y (x << y), or tertiary-less than y (x < y). If there are no primary, secondary, or tertiary differences between the strings, then they are equivalent (x = y) according to the Text Comparison.
Ignorable Characters
There are cases where characters should be ignored in terms of primary differences. For example, in English words a hyphen is ignorable: blackbird << black-bird <<< blackbirds << black-birds. This is distinguished by using the value ignorable (= 2) as the primary value. An ignorable UnicodeOrder counts as a secondary difference when the secondary is non-zero; otherwise as a tertiary difference when the tertiary value is non-zero; otherwise the UnicodeOrder is completely ignorable (the comparison proceeds as if the UnicodeOrder were absent).
Unmapped Characters
For Unicode, there are 65,534 possible primary UnicodeOrders. However, a comparison often does not include values for all possible Unicode characters. For example, a French comparison might not include values for sorting Hebrew characters. Any character x outside of the comparison's domain maps to 65,536 + Unicode(x). The primary values for characters that are covered by the comparison can be assigned to either the low range (2..65,535) or to the high range (131,072..196,607). This allows a comparison to have all unmapped characters treated as before the mapped characters, or after, or at any point in the middle. For example, to put all unmapped characters between a and b, a Unicode structure as shown in Figure 4 would be employed.
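The range arithmetic for unmapped characters can be illustrated with a small sketch. The constants 65,536 and 131,072 come from the paragraph above; the `PrimaryTable` type, the function names, and the two table entries are hypothetical and only mimic the "unmapped characters between a and b" case.

```cpp
#include <cstdint>
#include <map>

const std::uint32_t kUnmappedBase = 65536;  // unmapped: 65,536 + Unicode(x)
const std::uint32_t kHighBase = 131072;     // first value of the high range

// Hypothetical table from a 16-bit code unit to its assigned primary value.
using PrimaryTable = std::map<char16_t, std::uint32_t>;

// Looks up the primary value, falling back to the unmapped range.
std::uint32_t primaryFor(const PrimaryTable& table, char16_t ch) {
    auto it = table.find(ch);
    if (it != table.end()) {
        return it->second;
    }
    return kUnmappedBase + static_cast<std::uint32_t>(ch);
}

// Illustrative table: 'a' in the low range, 'b' in the high range,
// so every unmapped character falls between them.
PrimaryTable makeExampleTable() {
    PrimaryTable table;
    table[u'a'] = 2;
    table[u'b'] = kHighBase;
    return table;
}
```

With this table, `primaryFor(table, u'x')` lies strictly between the values for `a` and `b`, because `x` has no entry of its own.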
Orientation
In French, the accent ordering works in a peculiar way. Accents are only significant if the primary characters are identical, so they have a secondary difference. However, unlike difference in primary character or in case, it is the last accent difference that determines the order of the two strings. For example, the following strings are in order in French (note the second character in each string: the difference between e and e is not counted in the first string because there is a later accent difference.): peche peche pecher pecher
French Ordering
Any time there is a secondary or tertiary difference, French-style UnicodeOrder (isSecondaryBackward or isTertiaryBackward) can be set. When comparing two UnicodeOrders, if either one is set backward, then the comparison of those two UnicodeOrders overrides previous UnicodeOrders of that class (secondary or tertiary).
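The override rule can be sketched as follows. This is a simplified illustration, not the comparison routine itself: it assumes the primaries are already known to be equal, takes one secondary weight per character, and uses the commonly cited French sequence cote < côte < coté < côté with hypothetical weights (0 for no accent, 1 for acute, 2 for circumflex).

```cpp
#include <cstddef>
#include <vector>

// Compares two lists of secondary weights; returns -1, 0 or 1.
// With backward == true, a later difference overrides an earlier one,
// so the last accent difference decides.
int compareSecondary(const std::vector<int>& a, const std::vector<int>& b, bool backward) {
    int result = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        if (a[i] == b[i]) {
            continue;
        }
        int diff = a[i] < b[i] ? -1 : 1;
        if (backward) {
            result = diff;       // overrides any earlier secondary difference
        } else if (result == 0) {
            result = diff;       // forward: the first difference wins
        }
    }
    return result;
}
```

For côte `{0,2,0,0}` against coté `{0,0,0,1}`, the backward comparison yields -1 (côte first), while the forward comparison yields 1.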
Multiple Mappings
The situation is actually more complex than the above description indicates, since:
• a single character can map to a sequence of UnicodeOrders (called a split character)
Example: in terms of primary ordering, ä in German sorts as ae
• a sequence of characters can map to a single UnicodeOrder (called a grouped character)
Example: in terms of primary ordering, ch in Spanish sorts as a single character between c and d
In general, the Taligent collation process supports all cases where a sequence of one (or more) characters can map to a sequence of one (or more) UnicodeOrders, which is a combination of grouped & split characters.
• depending on the characters, the resulting UnicodeOrders can be rearranged in sequence.
Example: a with overdot and underdot = a + overdot + underdot = a + underdot + overdot
This last feature is attributable to certain ignorable characters (such as accents) that appear in different orders. In certain scripts (such as Thai), the letters are written in a different order than they are pronounced or collated. The base text comparison process does not provide for more complex sequences such as those found in Thai, but does provide a framework for subclassing to allow more sophisticated, dictionary-based UnicodeOrders that can be used to handle such languages.
UnicodeOrder Iteration
Logically speaking, whenever two strings are compared, they are each mapped into a sequence of UnicodeOrders. This processing is accomplished by using a CompareTextIterator, which is created from a comparison and a string. Each time the Next method is called, the next UnicodeOrder is retrieved from the string. When the string is exhausted, a UnicodeOrder is returned whose primary value is EOF. It is important to know where the significant difference occurred in comparing two strings. The CompareTextIterator can be queried to retrieve the current zero-based string offset (the offset at the start of the string is zero). This is the last offset in the text just before the UnicodeOrder that was just retrieved. For example, for the string "achu", suppose that a Spanish CompareTextIterator is called alternately to retrieve the string offset and to get a comparison order. The following results will be obtained (where UO(x) is the UnicodeOrder corresponding to x):
0, UO(a), 1, UO(ch), 3, UO(u), 4
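The alternation of offsets and UnicodeOrders can be mimicked with a toy iterator. The class below is hypothetical: it hard-codes "ch" as the only grouped character and returns the matched substring in place of a real UnicodeOrder, with an empty string standing in for EOF.

```cpp
#include <cstddef>
#include <string>

// Toy iterator: yields one collation element per call and tracks offsets.
class SpanishLikeIterator {
public:
    explicit SpanishLikeIterator(const std::string& text) : text_(text), offset_(0) {}

    // Zero-based offset just before the next element to be returned.
    std::size_t offset() const { return offset_; }

    // Returns the text covered by the next element, or "" at end of text.
    std::string next() {
        if (offset_ >= text_.size()) {
            return std::string();
        }
        std::size_t length = 1;
        if (text_.compare(offset_, 2, "ch") == 0) {
            length = 2;  // grouped character: "ch" is a single element
        }
        std::string element = text_.substr(offset_, length);
        offset_ += length;
        return element;
    }

private:
    std::string text_;
    std::size_t offset_;
};
```

Calling `offset()` and `next()` alternately on "achu" gives 0, "a", 1, "ch", 3, "u", 4, matching the sequence listed above.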
Internally, the iterator uses the Comparison to map characters to UnicodeOrders. For a simple 1-1 match, the character is matched in a dictionary. This processing permits quick access for most characters. Whenever there are grouped or split characters, a second mechanism is used to handle the more complicated access. For example, suppose we have the following ordering:
´ < - <<< a = á/´ < æ/e <<< b <<< c <<< ch <<< cch <<< d <<< e
This ordering is represented by the data structure appearing in Figure 5. At label 500, when ´, -, a, b, d, or e is accessed, the mapping is direct (the acute and hyphen are ignorable characters). At label 510 the character á is split, and two pieces of information are returned. The first is the UnicodeOrder of the start of the sequence, and the second is a sequence of one or more additional characters. At label 520, the character æ is also split. However, the information stored in the table can be preprocessed to present a list of UnicodeOrders. This is done by looking up the UnicodeOrders that correspond to the remaining characters. This optimization can be done under two conditions: the sequence of characters can contain no reordering accents, and the Text Comparison must be complete (all the characters must have corresponding UnicodeOrders). In this case æ has a UnicodeOrder which is tertiary-greater than á, followed by the UnicodeOrder for an e.
In the case labeled 550, the additional characters cannot be optimized as in 520, because z has not yet been mapped.
At label 530, when cch is accessed, first the c is checked, finding a pointer to a second dictionary. The second dictionary is checked for a c, finding a pointer to a third dictionary. The third dictionary has an h, so it matches and returns the UnicodeOrder <7,0,0>. At label 540, if the string had cco, then the last match would fail, and the sequence of UnicodeOrders corresponding to cc would be returned. Note that the failure case always contains the sequence that would have resulted if the sequence had not existed, so no backup is necessary. Finally, at label 560, when an unmatched character x is encountered, its value is <64K + Unicode(x), 0, 0>. The resulting UnicodeOrders are cached internally and returned one at a time.
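The grouped-character lookup with its failure case can be approximated as below. Note that this sketch swaps the chained dictionaries of Figure 5 for a flat longest-prefix search over a single map, which gives the same results for this example but is not the structure described above. The primary values are hypothetical except for cch = 7, and only primaries are returned.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical primaries for the example ordering; only cch = 7 is from the text.
std::map<std::string, int> makeTable() {
    std::map<std::string, int> table;
    table["a"] = 3;
    table["b"] = 4;
    table["c"] = 5;
    table["ch"] = 6;
    table["cch"] = 7;
    table["d"] = 8;
    table["e"] = 9;
    return table;
}

// Maps text to a sequence of primary values, preferring the longest key.
std::vector<int> primaries(const std::map<std::string, int>& table, const std::string& text) {
    const std::size_t kMaxKeyLength = 3;
    std::vector<int> out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        for (std::size_t len = kMaxKeyLength; len >= 1; --len) {
            if (i + len > text.size()) {
                continue;  // candidate would run past the end of the text
            }
            auto it = table.find(text.substr(i, len));
            if (it != table.end()) {
                out.push_back(it->second);
                i += len;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // Unmatched character: 64K + its code, as at label 560.
            out.push_back(65536 + static_cast<unsigned char>(text[i]));
            ++i;
        }
    }
    return out;
}
```

For "cco" the lookup of "cch" fails, so the result is the two values for c followed by the unmapped value for o, which corresponds to the failure case at label 540.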
Reordering Accents
Certain non-spacing marks (accents) can occur in a different order in a string, but have the same interpretation if they do not interact typographically. For example, a + underdot + circumflex is equivalent to a + circumflex + underdot. Every Unicode non-spacing mark has an associated non-spacing priority (spacing marks have a null priority). Whenever a character is encountered that has a non-null priority, a reordering process is invoked. Essentially, any sequence of non-null priority marks is sorted, and their UnicodeOrders are returned in that sorted sequence. If the iterator is asked for the string position, then the position before the first unreturned UnicodeOrder is returned.
For example:
a + underdot + diaeresis -> a + diaeresis + underdot
a + diaeresis + underdot -> a + diaeresis + underdot
Since underdot has a larger non-spacing priority than diaeresis, the iterator will return the UnicodeOrder for a, then for diaeresis, then for underdot. However, since diaeresis and breve have the same non-spacing priority (because they interact typographically), they do not rearrange ("-/->" means "does not map to"):
a + breve + diaeresis -/-> a + diaeresis + breve
a + diaeresis + breve -/-> a + breve + diaeresis
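The reordering step amounts to a stable sort of each run of marks by priority. The sketch below assumes a hypothetical `Mark` struct and made-up priority numbers (0 for a spacing character, 1 for diaeresis and breve, 2 for underdot); only their relative order follows the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One character with its non-spacing priority (0 stands in for "null").
struct Mark {
    char32_t code;
    int priority;
};

const char32_t kBreve = 0x0306;      // combining breve
const char32_t kDiaeresis = 0x0308;  // combining diaeresis
const char32_t kUnderdot = 0x0323;   // combining dot below

// Sorts every run of marks with non-zero priority, keeping the original
// order of marks that share a priority (stable sort).
void reorderMarks(std::vector<Mark>& text) {
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i].priority == 0) {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < text.size() && text[j].priority != 0) {
            ++j;
        }
        std::stable_sort(text.begin() + static_cast<std::ptrdiff_t>(i),
                         text.begin() + static_cast<std::ptrdiff_t>(j),
                         [](const Mark& a, const Mark& b) { return a.priority < b.priority; });
        i = j;
    }
}
```

Because the sort is stable, breve followed by diaeresis stays in that order, while underdot followed by diaeresis is rearranged.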
In terms of flow of control, the grouping is done after splitting and reordering. Therefore, if a is a grouped character (as in Swedish), then the grouping as illustrated in Figure 6 results.
Flow of Control
There are two very common cases when comparing strings: the UnicodeOrders are completely equal (primary, secondary and tertiary), or completely different (primaries different). In the former case, the main left-hand column is followed from top to bottom, while in the latter case the second column on the left is followed. In those typical cases, the number of operations is quite small.
This flow of control expresses the logical process: there are a number of optimizations that can also be performed depending on the machine architecture. For example, if the UnicodeOrder is properly constructed, then the primary, secondary and tertiary equality check can be done with one machine instruction.
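One way to get the single-instruction equality check is to pack the three fields into one integer. The layout below is an assumption made for illustration (primary above bit 16, eight bits each for secondary and tertiary); the text does not specify one.

```cpp
#include <cstdint>

// Packs <primary, secondary, tertiary> into one 64-bit value.
// Assumed layout: tertiary in bits 0-7, secondary in bits 8-15,
// primary in the bits above that.
std::uint64_t pack(std::uint64_t primary, std::uint64_t secondary, std::uint64_t tertiary) {
    return (primary << 16) | ((secondary & 0xFF) << 8) | (tertiary & 0xFF);
}

// All three fields equal: a single integer comparison.
bool fullyEqual(std::uint64_t a, std::uint64_t b) {
    return a == b;
}

// Only the primary fields equal: compare after dropping the low 16 bits.
bool primaryEqual(std::uint64_t a, std::uint64_t b) {
    return (a >> 16) == (b >> 16);
}
```

A side effect of this layout is that plain integer ordering of packed values agrees with the primary, then secondary, then tertiary ordering of the fields.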
The user can specify options for this process, depending on the degree of strength desired for the comparison. For example, the userMatchStrength parameter can be set to normal; to primaryAndSecondaryOnly (where the tertiary fields don't have to match, e.g., so a = A); or to primaryMatchOnly (where strings only have to match in their primary field, so that a also matches A and an accented a).
Constructing a Text Comparison
Whenever a mapping is added, the strength of the relation between that character and the last one in the comparison must be specified: equal, primary and secondary equal, primary equal, or strictly greater. (If the mapping is the first in the comparison, then the "last" mapping is assumed to be <ignorable, 0, 0>.) Each of these produces a UnicodeOrder based on the last UnicodeOrder in the Text Comparison (in the following, primary, secondary and tertiary are abbreviated p, s, and t, respectively) as shown in Figure 7.
Next CharacterOrder
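Since Figure 7 is not reproduced in the text, the following is only a plausible reading of how each relation could derive the next UnicodeOrder from the last one: keep the fields that are stated to be equal, increment the first field that differs, and reset the weaker fields. The `Order` struct, the `Strength` enum and the exact increments are assumptions.

```cpp
// Hypothetical <p, s, t> triple.
struct Order {
    int p;
    int s;
    int t;
};

// The four relations named in the paragraph above.
enum class Strength {
    Equal,
    PrimarySecondaryEqual,  // differs in tertiary only
    PrimaryEqual,           // differs in secondary
    StrictlyGreater         // differs in primary
};

// Derives the order of a new mapping from the last one (assumed rule).
Order nextOrder(const Order& last, Strength strength) {
    switch (strength) {
        case Strength::Equal:
            return last;
        case Strength::PrimarySecondaryEqual:
            return Order{last.p, last.s, last.t + 1};
        case Strength::PrimaryEqual:
            return Order{last.p, last.s + 1, 0};
        case Strength::StrictlyGreater:
            return Order{last.p + 1, 0, 0};
    }
    return last;  // not reached; keeps compilers quiet
}
```

Starting from <ignorable, 0, 0> with ignorable = 2, a first strictly greater mapping would receive <3, 0, 0> under this rule.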
The orientation of the tertiary or secondary differences must also be specified: normal or French differences. Given that, a new character mapping can be added to the table by adding one of a number of alternatives:
a. a single character x1
b. a grouped character x1..xn
c. a split character x1/y1..yn (x1 expands to > last + y1..yn)
d. a grouped/split character x1..xn/y1..yn (x1..xn expands to > last + y1..yn)
e. unmapped (unmapped characters here)
In the above, whenever x2..xn or y1..yn occur, the comparison is not complete until they are defined. For example, when x/y1..yn is added, x gets a new UnicodeOrder according to the above table, but the other y's are placed on hold until their UnicodeOrders are defined. Once they are, then x maps to UO(x) + UO(y1) + ... + UO(yn).
Once a text comparison is formed, then the data in it can be retrieved by iterating through from the first element to the last.
Merging Text Comparisons
Once a text comparison is formed, then it imposes an ordering on characters. A second text comparison can be merged into the first so that all mappings (except unmapped characters) in the first are maintained, and as many of the new mappings from the second are maintained as possible. An example of this is to merge a French Text Comparison into an Arabic Text Comparison. All of the relationships among the Arabic characters (including characters common to both Text Comparisons such as punctuation) should be preserved; relationships among new characters (e.g. Latin) that are not covered by the Arabic Text Comparison will be added.
Merging Process
Produce a third text comparison TC3 by iterating through TC2 in the following way, adding each new mapping as follows. For each new character b, remember the relationship to the previous mapping mostRecent in TC2.
1. If b is already in TC1, skip it, and reset mostRecent to be b.
2. If some character or substring of characters from b is already in TC1, skip it.
3. Otherwise, add b as "close to" mostRecent as possible, and reset mostRecent to b. That is, if b = mostRecent, add it immediately afterward. If b > mostRecent, then add b immediately before the first element that is at least tertiary-greater than mostRecent. If b >> mostRecent, then add it immediately before the first element that is at least secondary-greater than mostRecent. If b >>> mostRecent, then add it before the first element that is at least primary-greater than mostRecent.
Example: Suppose that the text comparison contains the following:
TC1 := ... u = v < w << x <<< z ...

TC2 map      TC3 result
u = b        u = b = v < w << x <<< z
u < b        u = v < b < w << x <<< z
u << b       u = v < w << b << x <<< z
u <<< b      u = v < w << x <<< b <<< z
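Step 3 of the merging process, applied to this example, can be sketched as below. The representation is an assumption: each entry stores its relation to the previous entry as a number (0 for =, 1 for <, 2 for <<, 3 for <<<), and `insertNear` and `render` are hypothetical helpers. Steps 1 and 2 (skipping characters already present) are omitted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One mapping plus the strength of its relation to the previous entry:
// 0 "=", 1 "<" (tertiary), 2 "<<" (secondary), 3 "<<<" (primary).
struct Entry {
    std::string key;
    int relation;
};

// Inserts key after tc[mostRecent], immediately before the first later
// entry that is at least `strength` greater. Returns the new index.
std::size_t insertNear(std::vector<Entry>& tc, std::size_t mostRecent,
                       const std::string& key, int strength) {
    std::size_t j = mostRecent + 1;
    while (j < tc.size() && tc[j].relation < strength) {
        ++j;  // still within the same class as mostRecent at this strength
    }
    tc.insert(tc.begin() + static_cast<std::ptrdiff_t>(j), Entry{key, strength});
    return j;
}

// Renders the comparison in the notation used by the example table.
std::string render(const std::vector<Entry>& tc) {
    static const char* const kOperators[] = {" = ", " < ", " << ", " <<< "};
    std::string out;
    for (std::size_t i = 0; i < tc.size(); ++i) {
        if (i > 0) {
            out += kOperators[tc[i].relation];
        }
        out += tc[i].key;
    }
    return out;
}

// TC1 from the example: u = v < w << x <<< z.
std::vector<Entry> exampleTc1() {
    std::vector<Entry> tc;
    tc.push_back(Entry{"u", 3});
    tc.push_back(Entry{"v", 0});
    tc.push_back(Entry{"w", 1});
    tc.push_back(Entry{"x", 2});
    tc.push_back(Entry{"z", 3});
    return tc;
}
```

Inserting b after u with each of the four strengths reproduces the four TC3 rows of the table.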
Flowchart of the Logic
Figure 8 is a flowchart of the detailed logic in accordance with the subject invention. Processing commences at function block 200 where the temporary result is initialized to a predetermined value. Then, at input block 202, the next source key and the next target key are obtained. A test is performed at decision block 204 to determine if the source primary has the same value as the target primary. If the source primary is not equal, then another test is performed at decision block 214 to determine if the source primary is ignorable. If so, then another test is performed at decision block 220 to determine if the search key should include a match of the primary only or some additional secondary information.
If only a primary match is desired, then a match has been completed and control is passed to input block 260 to obtain the next source key and subsequently to decision block 204. If a secondary match is also desired as detected at decision block 220, then a test is performed at decision block 230 to determine if the source secondary is ignorable. If the source secondary is not ignorable, then the temporary result is updated with the source position, target position and secondary position set equal to GREATER. Then control is passed to input block 260 to obtain the next source key and subsequently to decision block 204. If the source secondary is ignorable as detected at decision block 230, then another test is performed at decision block 232 to determine if a secondary match is only desired or if tertiary information has been saved. If so, then control is passed to input block 260 to obtain the next source key and subsequently to decision block 204. If not, then the temporary result is updated with the source position, target position and secondary position set equal to GREATER. Then, control is passed to input block 260 to obtain the next source key and subsequently to decision block 204.
If the source primary is not ignorable at decision block 214, then another test is performed at decision block 216 to determine if the target primary is ignorable. If so, then another test is performed at decision block 222 to determine if the search key should include a match of the primary only or some additional secondary information. If only a primary match is desired, then a match has been completed and control is passed to input block 262 to obtain the next target key and subsequently to decision block 204. If not, then another test is performed at decision block 234 to determine if the target secondary is ignorable. If so, then the temporary result is updated with the source position, target position and secondary comparison set equal to LESS. Then, control is passed to input block 262 to obtain the next target key and subsequently to decision block 204. If the target secondary is not ignorable as detected at decision block 234, then another test is performed at decision block 236 to determine if a secondary match is desired or if saved tertiary or source tertiary equal to ignorable. If so, then control is passed to input block 262 to obtain the next target key and subsequently to decision block 204. If not, then the temporary result is updated with the source position, target position and secondary comparison set equal to LESS. Then, control is passed to input block 262 to obtain the next target key and subsequently to decision block 204.
If the target primary is not ignorable at decision block 216, then the temporary result is updated with the source position and target position, the primary comparison is set equal to the primary comparison result, and control is passed to decision block 210.
If the source primary equals the target primary as detected at decision block 204, then another test is performed at decision block 206 to determine if the source primary is equal to an End Of File (EOF) character. If it is, then the temporary result is returned at output terminal 208. If not, then control is passed to decision block 210 to determine if the source secondary is equal to the target secondary. If not, then another test is performed at decision block 224 to determine if source and target information has been saved. If so, then control is passed to input block 202 to obtain the next source and target key. If not, then the temporary result is set equal to the source position, target position, and the secondary comparison result in function block 246 and control is passed to input block 202 to obtain the next source and target key.
If the source secondary is equal to the target secondary as detected at decision block 210, then another test is performed at decision block 212 to determine if the source tertiary equals the target tertiary. If so, then control is passed to input block 202 to obtain the next source and target key. If not, then a test is performed at decision block 226 to determine if source and target information has been saved or tertiary information. If so, then control is passed to input block 202 to obtain the next source and target key. If not, then the temporary result is set equal to the source position, target position, and the tertiary comparison result in function block 246 and control is passed to input block 202 to obtain the next source and target key.
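A much simplified version of this loop is sketched below to show how the temporary result and the ignorable tests interact. It is not a transcription of Figure 8: match-strength options, saved positions and French ordering are left out, a primary difference returns immediately, and the `UO` struct, the helper names and the EOF value of 0 are assumptions. The ignorable value of 2 and the rule for how an ignorable counts follow the Ignorable Characters section.

```cpp
#include <cstddef>
#include <vector>

const int kEof = 0;        // assumed sentinel primary marking end of text
const int kIgnorable = 2;  // ignorable primary value given in the text

struct UO {
    int p;
    int s;
    int t;
};

// Saves a weak difference (level 2 = secondary, 1 = tertiary) unless one
// of equal or greater strength was saved earlier. Level 0 saves nothing.
void note(int& saved, int level, int sign) {
    int have = saved < 0 ? -saved : saved;
    if (level > have) {
        saved = sign * level;
    }
}

// Returns -3..3 with the same meaning as EComparisonResult.
int compareOrders(const std::vector<UO>& source, const std::vector<UO>& target) {
    const UO eof = {kEof, 0, 0};
    std::size_t i = 0;
    std::size_t j = 0;
    int saved = 0;  // the "temporary result"
    for (;;) {
        UO s = i < source.size() ? source[i] : eof;
        UO t = j < target.size() ? target[j] : eof;
        if (s.p == t.p) {
            if (s.p == kEof) {
                return saved;  // both texts exhausted
            }
            if (s.s != t.s) {
                note(saved, 2, s.s < t.s ? -1 : 1);
            } else if (s.t != t.t) {
                note(saved, 1, s.t < t.t ? -1 : 1);
            }
            ++i;
            ++j;
        } else if (s.p == kIgnorable) {
            // Extra ignorable in the source: source is weakly greater.
            note(saved, s.s != 0 ? 2 : (s.t != 0 ? 1 : 0), 1);
            ++i;
        } else if (t.p == kIgnorable) {
            // Extra ignorable in the target: source is weakly less.
            note(saved, t.s != 0 ? 2 : (t.t != 0 ? 1 : 0), -1);
            ++j;
        } else {
            return s.p < t.p ? -3 : 3;  // genuine primary difference
        }
    }
}
```

With a hyphen encoded as `{2, 1, 0}`, comparing the orders for "ax" against "a-x" gives a secondary-less result, and "a-x" against "a-xx" gives a primary-less result.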
Figure 9 is an example of a display in accordance with the subject invention. The display corresponds to a system which allows language attributes to be associated with text. With the system, a user can choose the preferred text comparison for any particular language and associate the text comparison with unmarked text (text without any language attributes). In addition, a user can also create a new text comparison or modify an existing one. When editing, the user is presented with a table, as depicted in Figure 9, listing mappings in the text comparison in ascending order. As with standard table editing, the user can select one or more mappings with the mouse. The selected items can be deleted, cut, copied, or moved by dragging. A new mapping can also be inserted at any point.
The left-most column 900 indicates the relationship of the current mapping to the previous (above) mapping. Clicking on the column produces a pop-up menu with a choice of symbols: indicating primary-greater, secondary-greater, tertiary-greater, or equal; and an orthogonal set indicating French-secondary and/or French-tertiary.
There is one special mapping, the unmapped-characters mapping, which contains a symbol for indicating that any unmapped characters go at this point. Since there is always exactly one such location in the text comparison, this mapping is handled specially. If it is deleted, then it will appear at the end of the mapping. If another one is pasted in, then any previously-existing unmapped-characters mapping will be removed. The center column 910 contains the main character(s) in the mapping; the right-most column contains the expansion characters (if any). These can be edited just like any other text in the system.
FAST LANGUAGE-SENSITIVE SEARCHING
Introduction
Just as imploding, exploding, ignorable, and primary/secondary/tertiary issues are relevant to collation, they are also relevant to searching. For example, when searching in Danish, aa needs to be identified with å. There are a number of processes for fast (sub-linear) searching of text. However, none of these processes handles these language-sensitive requirements. In order to deal with these issues, a preferred embodiment employs a variation of the Boyer-Moore process (BM) which is both sub-linear and language-sensitive. The BM process is disclosed in detail in Boyer, R. & Moore, S.; A Fast String Searching Algorithm, Commun. ACM 20, pp. 762-772 (1977); which is hereby incorporated in its entirety by reference.
The preferred embodiment can also use the same data that is produced in a Text Comparison, so that searching and collation are kept in sync. This implies that the same modifications a user employs for making a new Text Comparison will also suffice for producing a correct language-sensitive search. There are two additional pieces of derived information needed for searching, beyond what is necessary for comparison. These requirements are discussed below.
Some fast search processes don't process in reverse; instead, they check the first character after the string if a match fails (e.g., Sunday, D.M., A very fast substring search algorithm, Commun. ACM 33, 8, pp. 132-142 (Aug. 1990)). This technique does not work well with language-sensitive comparisons: since two strings of different lengths can match, it cannot be efficiently determined where the "end" of the string in the target that matches the "end" of the pattern is located.
In the following examples, we will use a simple artificial Text Comparison that has English order for letters, plus the following:
^ (non-spacing circumflex: ignorable)
a < ä/e
A < Ä/e
o < o-/o
O < O-/o
z <<< å < aa < Å < Aa < AA <<< ø < oe < Ø < Oe < OE
Note that for collation purposes the difference between exploding and imploding characters is important, but for searching it is not generally significant. That is, with a secondary-strength match, whether the text comparison has a < ä/e or ae < ä is not important: baed and bäd match in either case. The one case where this is important is at the end of a pattern. That is, ba should be found in baed, but should not be found in baad.
Text Comparison Enhancements
The following enhancements are made to the Text Comparison.
a. The database of mappings must also be able to process a string in reverse order: in particular, retrieve imploding and exploding mappings in reverse order.
Example: looking up the Unicode order for e then o (progressing backwards) will produce ø.
b. Any sequence of characters that could correspond to a smaller exploded character must be included in the database, with a mapping to the smaller width.
Example: oo doesn't need to be included, since it corresponds to the product of o-, which is the same length. However, ae does need to be included, since it corresponds to ä, which is shorter.
c. The Text Comparison must be able to return the cumulative minimal match length (see below) as a string is processed.
Example: when processing either e then a (remember it's processing backwards) or ä, the minimal match length will be 1.
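Enhancements (a) and (c) can be illustrated with a small sketch. It assumes a hypothetical table `REVERSED_MAPPINGS` whose keys are two-character sequences stored in reverse, and a hypothetical helper `reverse_units` that walks the text from its end, reports each logical unit, and counts one unit of minimal match length per unit. A real Text Comparison would return Unicode Orderings rather than characters.

```python
# Hypothetical multi-character mappings, keyed by the sequence read backwards:
# "e" then "o" yields "ø", and "a" then "a" yields "å".
REVERSED_MAPPINGS = {"eo": "ø", "aa": "å"}

def reverse_units(text):
    """Yield (unit, start_index, cumulative_min_length) scanning backwards."""
    i = len(text)
    min_length = 0
    while i > 0:
        # Look at the current character followed by the one before it.
        pair = text[i - 1] + text[i - 2] if i >= 2 else ""
        if pair in REVERSED_MAPPINGS:
            # Two characters collapse into one unit, which a single
            # character could match, so the minimal length grows by 1.
            min_length += 1
            yield REVERSED_MAPPINGS[pair], i - 2, min_length
            i -= 2
        else:
            min_length += 1
            yield text[i - 1], i - 1, min_length
            i -= 1
```

For instance, scanning "boed" backwards yields d, then ø (from e then o), then b, with cumulative minimal match lengths 1, 2, and 3.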
Overview of the Process
The process has the following basic structure:
Goal: Search for a pattern string within a target string.
Input: pattern string, target string, strength (primary, secondary, tertiary, bitwise), Text Comparison
Process:
As with Boyer-Moore, the first step is to preprocess the pattern string with the Text Comparison to produce an index table (see the next section for details). In the main loop, the pattern string is successively shifted through the target. At each new location, process the pattern string from the end, looking for matches. The Text Comparison is used to process the target string in reverse order, looking up Unicode Orderings (UO). If a match fails, then an index table is employed to shift the pattern string by a specified amount.
Pattern Preprocessing
The following describes the process for preprocessing the pattern string, which does most of the work. The description is the logical sequence, and optimizations are omitted for clarity. A principal optimization is the creation of two index tables, since a preferred embodiment supports both forward and backward searching. For constructing the tables and patterns for backwards searching, appropriate changes are made in the processes.
1. Retrieve the Unicode orderings for the string (this will normalize imploding and exploding characters, and reordering accents such as overdot and underdot).
2. Reset the orderings according to the input strength:
• if input strength is secondary, zero out the tertiary values
• if it is primary, zero out the secondary and tertiary values.
3. Remove all ignorable Unicode Orderings with null differences (after having done step 2).
Note:
• if the input strength is tertiary, this will remove all ignorables with null differences (e.g. Right-Left Mark)
• if the input strength is secondary, this will remove all ignorables with tertiary or null differences.
• if the input strength is primary, this will remove all ignorables with secondary, tertiary or null differences (e.g. non-spacing marks).
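Steps 2 and 3 can be sketched as follows. The representation is an assumption made for illustration: each Unicode Ordering is a (primary, secondary, tertiary) tuple, an ignorable is an ordering whose primary value is 0, and `normalize` is a hypothetical name.

```python
def normalize(orderings, strength):
    """Reset weaker levels for the given strength, then drop null ignorables.

    orderings: list of (primary, secondary, tertiary) tuples.
    strength: "primary", "secondary", or "tertiary".
    """
    result = []
    for primary, secondary, tertiary in orderings:
        # Step 2: zero out the levels weaker than the input strength.
        if strength == "primary":
            secondary, tertiary = 0, 0
        elif strength == "secondary":
            tertiary = 0
        # Step 3: an ignorable (primary 0) with no remaining difference
        # contributes nothing to the match and is removed.
        if (primary, secondary, tertiary) == (0, 0, 0):
            continue
        result.append((primary, secondary, tertiary))
    return result
```

Because step 3 runs after step 2, an ignorable that carries only a secondary difference survives a secondary or tertiary search but disappears from a primary one, which matches the three cases in the note above.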
4. From the information in the Text Comparison, compute the minimum trailing match length (MTML) for each position in the pattern, which is the minimum length that any string matching the trailing elements at the end of the pattern can have.
Example: With the sample ordering, the pattern corresponding to "baedåf" has a minimum length of 5 (matching bädåf). [It could match strings up to 7 characters in length (e.g. baedaaf) without ignorables; with ignorables, it could match indefinitely long strings (b^a^e^d^^^^a^af).]
The table of MTMLs for this pattern would be (with the letters standing for the corresponding Unicode Orderings):
position:  1  2  3  4  5  6
UO:        b  a  e  d  å  f
MTML:      5  4  4  3  2  1
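One way to read the table above is as a running sum taken from the end of the pattern. The sketch below assumes that the Text Comparison can supply, for each Unicode Ordering in the pattern, the minimum number of target characters that ordering adds to a match; the function name and the `min_widths` input are hypothetical. The first ordering of the a/e pair is given a width of 0 because the pair can be matched by a single, shorter character.

```python
def minimum_trailing_match_lengths(min_widths):
    """Return the MTML for each pattern position as a suffix sum.

    min_widths[i] is the assumed minimum number of target characters
    that the ordering at position i adds to a match.
    """
    mtml = [0] * len(min_widths)
    running = 0
    # Walk from the last position to the first, accumulating widths.
    for i in range(len(min_widths) - 1, -1, -1):
        running += min_widths[i]
        mtml[i] = running
    return mtml

# Orderings b, a, e, d, a, f from the table; the a/e pair can collapse
# into one character, so its first element contributes 0.
EXAMPLE_WIDTHS = [1, 0, 1, 1, 1, 1]
```

With these assumed widths the function reproduces the MTML row of the table: 5, 4, 4, 3, 2, 1.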
5. The Boyer-Moore process uses one table indexed by position, and one table indexed by character. In this variant, the latter corresponds to indexing by Unicode Ordering. Build the index tables for the pattern string by traversing the list of Unicode Orderings from back to front as in Boyer-Moore, making the following changes:
The index at any position shows how far to shift the processed string at that position if a match has failed against the Unicode Ordering at that position. The index value should be the minimal amount to shift (using the MTML table) such that the current trailing substring could next be found in the pattern.
Example: given the pattern obaexydbae, the shift table is as follows (with the letters standing for the corresponding Unicode Orderings):
[Figure imgf000027_0001: shift table for the pattern obaexydbae]
Suppose that there is a mismatch at the d against an o in the target text ("). The shift value is 5 because the rightmost occurrence of obae occurs after a trailing sequence with length 5 (≠).
There may be a large number of Unicode Orderings, not just 256 as in the 8-bit ASCII version of Boyer-Moore. For speed and storage reasons, the index table indexed by Unicode Ordering consists of a small array of integers (e.g., 256 integers). When adding a shift value to the table for a given Unicode Ordering, the Unicode Ordering is hashed into an integer within the range covered by the array.
When only the primary strength is desired, this is done by using a modulo of the array size (if this is chosen to be a power of two, the modulo is simply a masking). When more than just the primary is required, then the secondary, or both the secondary and tertiary, are XORed in before applying the modulo.
When there are multiple shift values (because of modulo collisions) the minimum of the values is recorded. This may end up being less than the optimum value, but in practice does not affect the overall performance significantly.
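The hashed index table can be sketched as follows. The names `hash_ordering` and `build_shift_table`, the tuple representation of a Unicode Ordering, and the way shift values are passed in are all assumptions made for illustration; only the masking, the XOR of the weaker levels, and the keep-the-minimum rule come from the description above.

```python
TABLE_SIZE = 256          # a power of two, so the modulo is a simple mask
MASK = TABLE_SIZE - 1

def hash_ordering(ordering, strength):
    """Hash a (primary, secondary, tertiary) ordering into a table slot."""
    primary, secondary, tertiary = ordering
    value = primary
    # XOR in the weaker levels only when the search strength uses them.
    if strength in ("secondary", "tertiary"):
        value ^= secondary
    if strength == "tertiary":
        value ^= tertiary
    return value & MASK

def build_shift_table(orderings, shifts, default_shift, strength):
    """Fill the table, keeping the smallest shift when slots collide."""
    table = [default_shift] * TABLE_SIZE
    for ordering, shift in zip(orderings, shifts):
        slot = hash_ordering(ordering, strength)
        # A smaller shift is always safe; a larger one could skip a match.
        table[slot] = min(table[slot], shift)
    return table
```

Keeping the minimum on a collision trades some skipping distance for correctness, since a shift that is too short only costs extra comparisons.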
Searching with the Processed Pattern
When matching a target string against a processed pattern, the standard Boyer-Moore process can be employed with the following changes, as reflected in the preferred embodiment:
1. When retrieving characters from the target text, convert them to Unicode Orderings (this is done when each character is accessed: the entire target text does not need to be converted!). The iteration through the text occurs in reverse order. The same normalization (Pattern Preprocessing items 1-3 above) is used to reset strengths and remove ignorables as was used when creating the processed pattern.
2. Explicitly test each Unicode Ordering derived from the text with the pattern. If there is a mismatch, use the index tables to find the shift value.
3. After finding a match, the process must also check the text after the end of the matched string. If any of the following conditions occurs, then the match fails, and the process shifts by 1 and continues searching.
• If there are any ignorables after the end which are stronger than the input strength. Example: finding ba in xxba^xx: where the input strength distinguishes between a and â, this match should not succeed.
• If there is an imploding character spanning the end of the string. Example: ba should be found in baed, but should not be found in baad. (This can be made a user option for more control.)
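The two end-of-match checks can be sketched as below. Everything named here is hypothetical: `IMPLODING` lists two-character sequences that act as one unit, `IGNORABLE_LEVEL` records the level at which an ignorable makes a difference (the circumflex is assumed to be a secondary difference), and `end_is_acceptable` takes the index just past the candidate match.

```python
# Hypothetical data for the sample ordering.
IMPLODING = {"aa", "oe"}            # sequences that behave as one unit
IGNORABLE_LEVEL = {"^": 2}          # assumed: circumflex differs at level 2
STRENGTH_LEVEL = {"primary": 1, "secondary": 2, "tertiary": 3}

def end_is_acceptable(text, end, strength):
    """Return True if a match ending just before index `end` may stand."""
    # Reject if an imploding sequence straddles the end of the match.
    if 0 < end < len(text) and text[end - 1:end + 1] in IMPLODING:
        return False
    # Reject if a following ignorable is significant at this strength.
    i = end
    while i < len(text) and text[i] in IGNORABLE_LEVEL:
        if IGNORABLE_LEVEL[text[i]] <= STRENGTH_LEVEL[strength]:
            return False
        i += 1
    return True
```

Under these assumptions, ba is accepted in baed, rejected in baad, and rejected in xxba^xx unless the search is at primary strength.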
State-Table Methods
Another method for constructing a language-sensitive searcher is to produce a state machine that will recognize each of the various forms (baed and bad), and also disregard any ignorable characters. However, in general, this technique does not perform as well as the sub-linear methods, such as the method discussed in Gonnet, G.H. and Baeza-Yates, R., Handbook of Algorithms and Data Structures: In Pascal and C, Second ed., Addison-Wesley, Wokingham, UK, 1991. A key question here is the number of comparisons required in each method and the lookup time per character in the state table vs. in the Text Comparison. In the Text Comparison, the lookup time is quite small as long as the character is not exploding or imploding, so the performance is dependent on the proportion of such characters in the target text, which is generally quite small. By storing a flag in the Text Comparison as to whether the target language has a large proportion of such characters, a choice can be made at runtime to select which technique to employ.
While the invention has been described in terms of a preferred embodiment in a specific system environment, those skilled in the art recognize that the invention can be practiced, with modification, in other and different hardware and software environments within the spirit and scope of the appended claims.

Claims

Having thus described our invention, what we claim as new, and desire to secure by Letters Patent is:
1. A system for searching text based on language-sensitivity, comprising: (a) processor means for defining a match based on features of a language; (b) processor means for performing a search to locate a match for a first text string in a second text string.
2. The system as recited in claim 1, including processing means for initiating the search operation through an iconic operation.
3. The system as recited in claim 2, including processing means for initiating the search operation by double-clicking on an icon.
4. The system as recited in claim 2, including processing means for drop-launching the search operation.
5. The system as recited in claim 1, including storage means for storing a table based description of the matching and ordering relationships based on features of a language.
6. The system as recited in claim 5, including table-based descriptions of primary, secondary, tertiary, expanding, grouped, ignorable-secondary, ignorable-tertiary, French-secondary and French-tertiary orderings.
7. The system as recited in claim 1, including processing means for supporting UNICODE text.
8. The system as recited in claim 7, including processing means for UNICODE accent equivalencies.
9. The system as recited in claim 1, including processing means to perform a partial text search.
10. The system as recited in claim 1, including processing means for performing language-sensitive sublinear searches.
11. The system as recited in claim 1, including processing means for merging a first text search into a second text search.
12. The system as recited in claim 6, including a spreadsheet-like display for defining the behavior of the compare operation by modifying the table-based descriptions.
13. The system as recited in claim 1, including processing means for determining a strength of matches between primary, secondary or tertiary strengths.
14. The system as recited in claim 1, including a hash mechanism for efficiently hashing Unicode orderings for language sensitive searching.
15. A method for searching text based on language-sensitivity, comprising the steps of: (a) defining an ordering based on features of a language; and (b) performing a search to locate a match for a first text string in a second text string.
16. The method as recited in claim 15, including the step of initiating the search operation through an iconic operation.
17. The method as recited in claim 16, including processing means for initiating the search operation by double-clicking on an icon.
18. The method as recited in claim 17, including the step of drop-launching the search operation.
19. The method as recited in claim 15, including the step of storing a table based description of the matching and ordering relationships based on features of a language.
20. The method as recited in claim 19, including the step of processing table based descriptions of primary, secondary, expanding, grouped, tertiary, ignorable-secondary, ignorable-tertiary, French-secondary and French-tertiary.
21. The method as recited in claim 15, including the step of supporting UNICODE text.
22. The method as recited in claim 21, including the step of performing UNICODE accent equivalencies.
23. The method as recited in claim 15, including the step of performing a partial text search.
24. The method as recited in claim 15, including the step of performing a sublinear search.
25. The method as recited in claim 15, including the step of merging a first text search into a second text search.
26. The method as recited in claim 15, including the step of invoking the search operation via a spreadsheet-like display.
27. The method as recited in claim 15, including the step of determining a strength of match between primary, secondary, tertiary matches and terminal checks.
28. The method as recited in claim 15, including the step of hashing Unicode orderings.
PCT/US1994/000014 1993-03-25 1994-01-03 Language-sensitive searching system WO1994022097A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU61206/94A AU6120694A (en) 1993-03-25 1994-01-03 Language-sensitive searching system
JP6521007A JPH08508124A (en) 1993-03-25 1994-01-03 Language recognition collation system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/036,785 1993-03-25
US08/036,785 US5485373A (en) 1993-03-25 1993-03-25 Language-sensitive text searching system with modified Boyer-Moore process

Publications (1)

Publication Number Publication Date
WO1994022097A1 true WO1994022097A1 (en) 1994-09-29

Family

ID=21890642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1994/000014 WO1994022097A1 (en) 1993-03-25 1994-01-03 Language-sensitive searching system

Country Status (4)

Country Link
US (1) US5485373A (en)
JP (1) JPH08508124A (en)
AU (1) AU6120694A (en)
WO (1) WO1994022097A1 (en)

US9805772B1 (en) 2016-10-20 2017-10-31 Micron Technology, Inc. Apparatuses and methods to selectively perform logical operations
CN207637499U (en) 2016-11-08 2018-07-20 Micron Technology, Inc. Apparatus for forming a computation component above a memory cell array
US10423353B2 (en) 2016-11-11 2019-09-24 Micron Technology, Inc. Apparatuses and methods for memory alignment
US9761300B1 (en) 2016-11-22 2017-09-12 Micron Technology, Inc. Data shift apparatuses and methods
US10402340B2 (en) 2017-02-21 2019-09-03 Micron Technology, Inc. Memory array page table walk
US10268389B2 (en) 2017-02-22 2019-04-23 Micron Technology, Inc. Apparatuses and methods for in-memory operations
US10403352B2 (en) 2017-02-22 2019-09-03 Micron Technology, Inc. Apparatuses and methods for compute in data path
US10838899B2 (en) 2017-03-21 2020-11-17 Micron Technology, Inc. Apparatuses and methods for in-memory data switching networks
US10185674B2 (en) 2017-03-22 2019-01-22 Micron Technology, Inc. Apparatus and methods for in data path compute operations
US11222260B2 (en) 2017-03-22 2022-01-11 Micron Technology, Inc. Apparatuses and methods for operating neural networks
US10049721B1 (en) 2017-03-27 2018-08-14 Micron Technology, Inc. Apparatuses and methods for in-memory operations
US10043570B1 (en) 2017-04-17 2018-08-07 Micron Technology, Inc. Signed element compare in memory
US10147467B2 (en) 2017-04-17 2018-12-04 Micron Technology, Inc. Element value comparison in memory
US9997212B1 (en) 2017-04-24 2018-06-12 Micron Technology, Inc. Accessing data in memory
US10942843B2 (en) 2017-04-25 2021-03-09 Micron Technology, Inc. Storing data elements of different lengths in respective adjacent rows or columns according to memory shapes
US10236038B2 (en) 2017-05-15 2019-03-19 Micron Technology, Inc. Bank to bank data transfer
US10068664B1 (en) 2017-05-19 2018-09-04 Micron Technology, Inc. Column repair in memory
US10013197B1 (en) 2017-06-01 2018-07-03 Micron Technology, Inc. Shift skip
US10262701B2 (en) 2017-06-07 2019-04-16 Micron Technology, Inc. Data transfer between subarrays in memory
US10152271B1 (en) 2017-06-07 2018-12-11 Micron Technology, Inc. Data replication
US10318168B2 (en) 2017-06-19 2019-06-11 Micron Technology, Inc. Apparatuses and methods for simultaneous in data path compute operations
US10162005B1 (en) 2017-08-09 2018-12-25 Micron Technology, Inc. Scan chain operations
US10534553B2 (en) 2017-08-30 2020-01-14 Micron Technology, Inc. Memory array accessibility
US10346092B2 (en) 2017-08-31 2019-07-09 Micron Technology, Inc. Apparatuses and methods for in-memory operations using timing circuitry
US10741239B2 (en) 2017-08-31 2020-08-11 Micron Technology, Inc. Processing in memory device including a row address strobe manager
US10416927B2 (en) 2017-08-31 2019-09-17 Micron Technology, Inc. Processing in memory
US10409739B2 (en) 2017-10-24 2019-09-10 Micron Technology, Inc. Command selection policy
US10522210B2 (en) 2017-12-14 2019-12-31 Micron Technology, Inc. Apparatuses and methods for subarray addressing
US10332586B1 (en) 2017-12-19 2019-06-25 Micron Technology, Inc. Apparatuses and methods for subrow addressing
US10614875B2 (en) 2018-01-30 2020-04-07 Micron Technology, Inc. Logical operations using memory cells
US11194477B2 (en) 2018-01-31 2021-12-07 Micron Technology, Inc. Determination of a match between data values stored by three or more arrays
US10437557B2 (en) 2018-01-31 2019-10-08 Micron Technology, Inc. Determination of a match between data values stored by several arrays
US10725696B2 (en) 2018-04-12 2020-07-28 Micron Technology, Inc. Command selection policy with read priority
US10440341B1 (en) 2018-06-07 2019-10-08 Micron Technology, Inc. Image processor formed in an array of memory cells
US11175915B2 (en) 2018-10-10 2021-11-16 Micron Technology, Inc. Vector registers implemented in memory
US10769071B2 (en) 2018-10-10 2020-09-08 Micron Technology, Inc. Coherent memory access
US10483978B1 (en) 2018-10-16 2019-11-19 Micron Technology, Inc. Memory device processing
US11184446B2 (en) 2018-12-05 2021-11-23 Micron Technology, Inc. Methods and apparatus for incentivizing participation in fog networks
US10867655B1 (en) 2019-07-08 2020-12-15 Micron Technology, Inc. Methods and apparatus for dynamically adjusting performance of partitioned memory
US11360768B2 (en) 2019-08-14 2022-06-14 Micron Technology, Inc. Bit string operations in memory
US11449577B2 (en) 2019-11-20 2022-09-20 Micron Technology, Inc. Methods and apparatus for performing video processing matrix operations within a memory array
US11853385B2 (en) 2019-12-05 2023-12-26 Micron Technology, Inc. Methods and apparatus for performing diversity matrix operations within a memory array
US11227641B1 (en) 2020-07-21 2022-01-18 Micron Technology, Inc. Arithmetic operations in memory

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0192927A2 (en) * 1985-02-19 1986-09-03 International Business Machines Corporation Method of editing graphic objects in an interactive draw graphic system using implicit editing actions
EP0294950A2 (en) * 1987-06-11 1988-12-14 Nortel Networks Corporation A method of facilitating computer sorting
EP0310283A2 (en) * 1987-09-28 1989-04-05 Nortel Networks Corporation A multilingual ordered data retrieval system

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1184305A (en) * 1980-12-08 1985-03-19 Russell J. Campbell Error correcting code decoder
US4615002A (en) * 1983-03-30 1986-09-30 International Business Machines Corp. Concurrent multi-lingual use in data processing system
US4821220A (en) * 1986-07-25 1989-04-11 Tektronix, Inc. System for animating program operation and displaying time-based relationships
US4885717A (en) * 1986-09-25 1989-12-05 Tektronix, Inc. System for graphically representing operation of object-oriented programs
US5060146A (en) * 1988-04-08 1991-10-22 International Business Machines Corporation Multilingual indexing system for alphabetically sorting by comparing character weights and ASCII codes
US4891630A (en) * 1988-04-22 1990-01-02 Friedman Mark B Computer vision system with improved object orientation technique
EP0347162A3 (en) * 1988-06-14 1990-09-12 Tektronix, Inc. Apparatus and methods for controlling data flow processes by generated instruction sequences
US5041992A (en) * 1988-10-24 1991-08-20 University Of Pittsburgh Interactive method of developing software interfaces
US5133075A (en) * 1988-12-19 1992-07-21 Hewlett-Packard Company Method of monitoring changes in attribute values of object in an object-oriented database
US5050090A (en) * 1989-03-30 1991-09-17 R. J. Reynolds Tobacco Company Object placement method and apparatus
US4991094A (en) * 1989-04-26 1991-02-05 International Business Machines Corporation Method for language-independent text tokenization using a character categorization
US5060276A (en) * 1989-05-31 1991-10-22 At&T Bell Laboratories Technique for object orientation detection using a feed-forward neural network
US5125091A (en) * 1989-06-08 1992-06-23 Hazox Corporation Object oriented control of real-time processing
US5181162A (en) * 1989-12-06 1993-01-19 Eastman Kodak Company Document management and production system
US5093914A (en) * 1989-12-15 1992-03-03 At&T Bell Laboratories Method of controlling the execution of object-oriented programs
US5075848A (en) * 1989-12-22 1991-12-24 Intel Corporation Object lifetime control in an object-oriented memory protection mechanism
US5151987A (en) * 1990-10-23 1992-09-29 International Business Machines Corporation Recovery objects in an object oriented computing environment
US5119475A (en) * 1991-03-13 1992-06-02 Schlumberger Technology Corporation Object-oriented framework for menu definition
US5440482A (en) * 1993-03-25 1995-08-08 Taligent, Inc. Forward and reverse Boyer-Moore string searching of multilingual text having a defined collation order
US5387042A (en) * 1993-06-04 1995-02-07 Brown; Carl W. Multilingual keyboard system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6963876B2 (en) 2000-06-05 2005-11-08 International Business Machines Corporation System and method for searching extended regular expressions

Also Published As

Publication number Publication date
JPH08508124A (en) 1996-08-27
AU6120694A (en) 1994-10-11
US5485373A (en) 1996-01-16

Similar Documents

Publication Publication Date Title
US5440482A (en) Forward and reverse Boyer-Moore string searching of multilingual text having a defined collation order
US5485373A (en) Language-sensitive text searching system with modified Boyer-Moore process
US6785687B2 (en) System for and method of efficient, expandable storage and retrieval of small datasets
US7634470B2 (en) Efficient searching techniques
US5333317A (en) Name resolution in a directory database
US6651052B1 (en) System and method for data storage and retrieval
US6112207A (en) Apparatus and method which features linearizing attributes of an information object into a string of bytes for object representation and storage in a database system
KR100414236B1 (en) A search system and method for retrieval of data
US6263333B1 (en) Method for searching non-tokenized text and tokenized text for matches against a keyword data structure
US20070198566A1 (en) Method and apparatus for efficient storage of hierarchical signal names
US20080215579A1 (en) Including annotation data with disparate relational data
JPH02271468A (en) Data processing method
JPH06324877A (en) Method and equipment for conducting object type of object for application and obtaining object type attribute value of efferent object type
US20020046208A1 (en) Objects in a computer system
US6430557B1 (en) Identifying a group of words using modified query words obtained from successive suffix relationships
Sadakane et al. Indexing huge genome sequences for solving various problems
JP3459053B2 (en) Document search method and apparatus
US7487165B2 (en) Computer implemented method for retrieving hit count data from a data base system and according computer program product
US20050055331A1 (en) Computer implemented method for retrieving data from a data storage system and according computer program product and data storage system
US6469643B1 (en) Information processing system
EP0649106B1 (en) Compactly stored word groups
Ejendibia et al. String searching with DFA-based algorithm
Li et al. Matching spatial relations using db-tree for image retrieval
WO2022103502A1 (en) Mask-augmented inverted index
CA2841027C (en) Fast identification of complex strings in a data stream

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AT AU BB BG BR BY CA CH CZ DE DK ES FI GB HU JP KP KR KZ LK LU LV MG MN MW NL NO NZ PL PT RO RU SD SE SK UA UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
121 EP: The EPO has been informed by WIPO that EP was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: CA