WO2006061899A1 - Character string matching device and character string matching program - Google Patents
Character string matching device and character string matching program
- Publication number
- WO2006061899A1 (PCT/JP2004/018348)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- state
- transition
- character
- nfa
- state transition
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Definitions
- the present invention relates to a technique for collating a pattern designated by a regular expression with text in a sentence.
- a regular expression, as described in Non-Patent Document 1, is a notation for expressing a class of languages called regular languages.
- DFA: Deterministic Finite Automaton
- the DFA character string matching method is based on a model of a state transition machine (automaton).
- the state transition machine has a state and a state transition function inside.
- the state transition function determines the next state for the current state and input characters.
- the input text is read one character at a time, and the state transitions to the next state obtained by applying the state transition function to the current state and input character pair.
- collation can be performed by scanning the text once without going back, and high-speed character string collation becomes possible.
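The single-pass scan described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table encodes a hypothetical DFA for the regular expression `ab*c`, and the `DEAD` sink state is an assumption used for pairs that have no table entry.

```python
# Hypothetical DFA for the regular expression "ab*c" (not taken from the patent figures).
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}
DEAD = -1  # assumed sink state for (state, character) pairs with no entry


def dfa_match(text: str) -> bool:
    """Scan the text once, left to right, applying the transition function."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch), DEAD)
        if state == DEAD:
            return False
    return state in ACCEPTING
```

Each input character costs exactly one table lookup, which is where the speed of the DFA method comes from.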
- a finite automaton with output (Moore machine) that extends the DFA and defines the output for each state is also used to distinguish the conditions that succeeded.
- the state transition function of DFA is determined by the regular expression that is the matching condition.
- the regular expression is first converted to an NFA (Non-deterministic Finite Automaton), and the NFA is then converted to a DFA.
- NFA: Non-deterministic Finite Automaton
- the DFA character string matching method has the advantage of high speed, but on the other hand, it has the disadvantage that the state transition table for realizing the DFA state transition function tends to be very large.
- the collation condition of FIG. 52 disclosed in Patent Document 3 is taken as an example.
- Fig. 53 shows the state transition table and failure function generated from the matching conditions of Fig. 52 in the conventional finite automaton with output. In this way, it is necessary to generate a state transition table that holds 90 combinations for 18 states and 5 character types.
- Patent Document 1 and Patent Document 2 include AC (AC
- Patent Document 3 discloses a method for reducing state transition tables by defining a failure function in DFA.
- however, the transition may fail again after a transition is made by the failure function; that is, transition failures may occur in a chain. In such a case, the failure function must be referenced repeatedly, which lowers the collating speed.
- FIG. 54 shows the state transition table and failure function, as disclosed in Patent Document 3, generated from the matching condition of FIG. 52.
- the state is first initialized to state 1.
- the first character “a” is read, and the state transitions to state 3 indicated in the column of input character “a” in the state 1 row of the state transition table.
- the second character “a” is read and the transition from state 3 to state 6 is performed in the same manner.
- the state transitions from state 6 to state 10 by reading the third character “c”.
- when the fourth character “a” appears next, there is no transition destination that corresponds to the character “a” in state 10; therefore, the state first transitions to state 5, which is the transition destination at the time of failure in state 10.
- next, a transition is made to state 2, which is the transition destination when state 5 fails.
- then a transition is made to state 1, which is the transition destination when state 2 fails.
- finally, a transition is made to state 3, because state 1 has the transition destination state 3 corresponding to the character “a”.
- the state transition table is thus referenced, and a state transition made, 4 times in total for the 4th input character, so 7 state transitions are required for the 4 characters as a whole.
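The chained failures in this walkthrough can be reproduced with a small sketch. The tables below contain only the entries named in the text above (the full tables of FIG. 54 are not reproduced here), and falling back to the initial state for states with no listed failure link is an assumption made so the function is total.

```python
# Partial tables: only the entries mentioned in the walkthrough above.
GOTO = {(1, "a"): 3, (3, "a"): 6, (6, "c"): 10}
FAILURE = {10: 5, 5: 2, 2: 1}
INITIAL = 1


def run_with_failure(text: str) -> tuple[int, int]:
    """Return (final state, number of state transitions performed)."""
    state, transitions = INITIAL, 0
    for ch in text:
        # Follow failure links until some state has an entry for ch.
        while (state, ch) not in GOTO and state != INITIAL:
            # Assumption: states without a listed failure link fall back to INITIAL.
            state = FAILURE.get(state, INITIAL)
            transitions += 1
        if (state, ch) in GOTO:
            state = GOTO[(state, ch)]
            transitions += 1
    return state, transitions
```

For the input `aaca`, the first three characters cost one transition each and the fourth costs four, giving the seven transitions counted in the text.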
- Non-Patent Document 1: J. E. Hopcroft, J. D. Ullman, "Formal Languages and their Relation to Automata", Addison-Wesley (1969)
- Patent Document 1: JP 2004-103035 A
- Patent Document 2: JP 2004-103034 A
- Patent Document 3: Japanese Patent No. 2994926
- the present invention has been made to solve the above-described problems, and aims to reduce the storage capacity required to store a state transition table for character string matching that uses a regular expression as the matching condition.
- a further aim is to limit the number of state transition table references caused by a failed transition to 2 or fewer per character, preventing the performance deterioration caused by repeated transition failures and enabling high-speed character string matching.
- a character string matching device according to the present invention includes a state transition table generating unit that generates a state transition table based on a matching condition described by a regular expression, and an automaton that transitions based on the state transition table generated by the state transition table generating unit. When the state transition table generated from the collation condition contains no next transition destination state for the pair of the current state and the input character, the automaton transitions to the initial state without reading the input character.
- because the device includes these two units, and the automaton transitions to the initial state without reading the input character whenever no next transition destination state exists for the pair of the current state and the input character, the storage capacity required to store the state transition table can be reduced.
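The fallback rule above bounds the work per character: one lookup from the current state and, on a miss, one more from the initial state. The sketch below illustrates that bound under stated assumptions: the sparse table is a plain dictionary, the function name is hypothetical, and the behaviour when the initial state has no entry either (stay in the initial state) is an assumption not spelled out in this passage.

```python
def sdfa_scan(table, initial, text):
    """Return (final state, table lookups made) for a sparse transition table."""
    state, lookups = initial, 0
    for ch in text:
        lookups += 1
        nxt = table.get((state, ch))
        if nxt is None and state != initial:
            # No entry: move to the initial state without consuming ch,
            # then look ch up once more from there.
            lookups += 1
            nxt = table.get((initial, ch))
        # Assumption: if the initial state has no entry either, stay there.
        state = initial if nxt is None else nxt
    return state, lookups
```

With the hypothetical table `{(0, "a"): 1, (1, "b"): 2}` and the text `aab`, the second `a` misses in state 1 and is retried from state 0, so the scan uses four lookups for three characters and never more than two for any single character.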
- FIG. 1 is an explanatory diagram showing a configuration of a character string matching device.
- FIG. 2 is an explanatory diagram showing a configuration of matching condition 2;
- FIG. 3 is an explanatory diagram showing a configuration of conditional expression 17
- FIG. 4 is an explanatory diagram showing a configuration of a state transition table generation unit 3.
- FIG. 5 is an explanatory diagram showing a configuration of a state transition table storage unit 4.
- FIG. 6 is an explanatory diagram showing a configuration of the output table storage unit 5.
- FIG. 7 is an explanatory diagram showing a configuration of verification result 10.
- FIG. 8 is a flowchart showing the operation of the character string matching device.
- FIG. 13 is a flowchart showing a procedure for removing non-deterministic transitions.
- FIG. 14 is a flowchart showing the procedure for initializing the state set.
- FIG.15 Flowchart showing the procedure for removing non-deterministic transitions related to trap
- FIG. 19 is a flowchart showing the procedure for removing unused states.
- FIG. 20 is a flowchart showing a procedure for removing a redundant state.
- FIG. 43 is an explanatory diagram for explaining the merging of redundant states.
- FIG. 47 is an explanatory diagram showing an example of operation.
- FIG. 49 is an explanatory diagram showing the configuration of the state transition table
- FIG. 50 is an explanatory diagram showing the structure of the output table.
- FIG. 51 is an explanatory diagram showing an operation example.
- FIG. 52 shows the collation conditions disclosed in Patent Document 3.
- FIG. 1 is a block diagram showing a character string collating apparatus according to the present invention.
- the character string matching device 1 performs character string matching using a regular expression according to the present invention, and determines whether the input document 6 contains text that satisfies the matching condition 2.
- Collation condition 2 is a condition describing the conditions for character string collation, and is input to character string collation device 1.
- the state transition table generation unit 3 generates the state transition 11 and the output description 12 from the matching condition 2 and passes them to the state transition table storage unit 4 and the output table storage unit 5, respectively.
- the state transition table storage unit holds a set of state transitions 11.
- the output table storage unit 5 holds the output description 12.
- Input document 6 is a document to be verified.
- the input character reading unit 7 extracts characters included in the input document 6 one by one and sends them to the SDFA automaton 8 as input characters 14.
- the SDFA automaton 8 stores the current state 13 in the internal state storage unit 9, receives the input character 14 from the input character reading unit 7, refers to the state transition table storage unit 4 and the output table storage unit 5 to update the current state 13 stored in the state storage unit 9, and outputs the collation result 10.
- the state storage unit 9 stores the state held in the SDFA automaton 8. 10 is the matching result. 11 is a state transition, which is a set of the current state 13, the input character 14, and the next state 15. 12 is an output description, which is a set of the current state 13 and the condition number 16. 13 is the current state, 14 is the input character, 15 is the next state, and 16 is a condition number.
- FIG. 2 is a diagram showing a configuration of the collation condition 2 in the present invention.
- the conditional expression 17 is an individual condition that constitutes the matching condition 2, and one or more conditional expressions 17 are included in the matching condition 2.
- FIG. 3 is a diagram showing a configuration of conditional expression 17 in the present invention.
- Conditional expression 17 consists of condition number 16 and condition description 18.
- Condition number 16 is a number for uniquely distinguishing conditional expressions 17.
- condition description 18 is a matching condition described by a regular expression.
- FIG. 4 is a diagram showing a configuration of the state transition table generating unit 3 in the present invention.
- the state transition table generation control unit 21 controls the operation procedure for generating the state transition table.
- the NFA state set 22, NFA state transition set 23, NFA output description set 24, state set 25, state transition set 26, and output description set 27 are data referred to by the state transition table generation control unit 21.
- FIG. 5 is a diagram showing an example of the configuration of the state transition table storage unit 4 in the present invention.
- 31 is a hash value calculation unit, which calculates a hash value 32 from the current state 13 and the input character 14.
- the hash value 32 is a hash value calculated by the hash value calculation unit 31.
- the state transition hash pointer 33 is a table that stores a plurality of pointers to state transition hash chains 34.
- the state transition hash chain 34 is an entry consisting of a pointer to another state transition hash chain 34, a current state 13, an input character 14, and a next state 15.
- the state transition hash table 35 is a data structure including a state transition hash pointer 33 and a state transition hash chain 34.
- the comparison unit 36 compares the set of the current state 13a and the input character 14a input from the outside with the current state 13b and the input character 14b stored in the state transition hash table 35, and outputs the next state 15.
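A chained hash table of this shape can be sketched as below. The class name, bucket count, and hash function are hypothetical (the passage does not specify how the hash value 32 is calculated); the sketch only shows how a hash pointer table, linked chain entries, and a comparison step fit together.

```python
class TransitionHashTable:
    """Chained hash table keyed by (current state, input character)."""

    def __init__(self, buckets: int = 64):
        # Each slot plays the role of a hash pointer to the head of a chain.
        self.slots = [None] * buckets

    def _hash(self, state: int, ch: str) -> int:
        # Hypothetical hash function; the source does not specify one.
        return (state * 31 + ord(ch)) % len(self.slots)

    def add(self, state: int, ch: str, nxt: int) -> None:
        h = self._hash(state, ch)
        # Chain entry: (current state, input character, next state, link).
        self.slots[h] = (state, ch, nxt, self.slots[h])

    def lookup(self, state: int, ch: str):
        entry = self.slots[self._hash(state, ch)]
        while entry is not None:
            s, c, nxt, link = entry
            if s == state and c == ch:  # comparison step
                return nxt
            entry = link
        return None  # no next state stored for this pair
```

Only pairs that actually have a next state occupy memory, which is the point of storing the table sparsely instead of as a full states-by-characters array.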
- FIG. 6 is a diagram showing a configuration of the output table storage unit 5 in the present invention.
- Condition number index 41 stores a plurality of pointers to the condition number chain.
- the condition number chain 42 is a combination of a pointer to the condition number chain 42 and the condition number 16.
- FIG. 7 is a diagram showing an example of the collation result 10 in the present invention.
- Matching result 10 contains the condition numbers 16 of the conditions that input document 6 matched successfully.
- as described in Non-Patent Document 1 and elsewhere, a conventional deterministic finite automaton with output is given by the tuple (Q, Σ, Δ, δ, λ, q0), where Q is the set of states,
- Σ is the input alphabet and contains the empty character ε,
- Δ is the output alphabet, δ is the transition function (Q × Σ → Q), λ is the output function (Q → Δ), and q0 is the initial state.
- the SDFA automaton 8 of the present invention is given by a set of (Q, ⁇ , ⁇ , ⁇ , e, q).
- Q is the state set 25, and is the state set Q of the conventional finite automaton with output.
- ⁇ is an output alphabet, which is a set of sets of condition number 16 in this embodiment.
- ⁇ is a state transition function realized by the state transition table storage unit 4, and hereinafter, the current state 1
- q is the initial state, the meaning of which is known as a conventional force deterministic finite auto with output
- ⁇ is an arbitrary character ⁇ , divisor for the input alphabet ⁇ of a conventional finite automaton with output.
- the set of state transitions 11 for which the next state 15 exists for the pair of the current state 13 and the input character 14 is called the state transition set 26 and is denoted as T.
- a state transition t is written trans(q_s, q_d, σ), where q_s is called the starting point, q_d the end point, and σ the transition character.
- the function that gives the starting point of a state transition t is called Source, the function that gives the end point is called Destination, and the function that gives the transition character is called Char.
- a set of output descriptions 12 whose output alphabet r is not empty for the current state 13 is called an output description set 27 and is denoted as D.
- the function that gives the output state of output description d is called State, and the function that gives the output result is called Result.
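The accessor functions named above map naturally onto small record types. The sketch below is one possible encoding, assuming integer state identifiers and single-character strings; the class names `Trans` and `Desc` are illustrative, while `Source`, `Destination`, `Char`, `State`, and `Result` follow the names used in the text.

```python
from typing import NamedTuple


class Trans(NamedTuple):
    source: int       # starting point
    destination: int  # end point
    char: str         # transition character


class Desc(NamedTuple):
    state: int   # output state
    result: int  # output result (a condition number)


def Source(t: Trans) -> int:
    return t.source


def Destination(t: Trans) -> int:
    return t.destination


def Char(t: Trans) -> str:
    return t.char


def State(d: Desc) -> int:
    return d.state


def Result(d: Desc) -> int:
    return d.result
```

With these types, the state transition set T and the output description set D are simply Python sets of `Trans` and `Desc` values.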
- State set 25 is a set of NFA state sets.
- ⁇ is the NFA state transition function, where the current NFA state 13 is q and the input character 14 is s (NFA) (NFA)
- ⁇ is the output function of the NFA, or when the current NFA state 13 is q, the output alpha s (NFA) (NFA) That the bet is r E ⁇
- ⁇ is a set of condition numbers 16, and ⁇ is an extended input alphabet, meaning the present invention s s
- NFA state transition set T
- the function that gives the starting point is called Source, the function that gives the end point is called Destination, and the function that gives the transition character is called Char.
- a set of the NFA state q and the output alphabet r is called an NFA output description.
- NFA output description set 24 A set of NFA output descriptions whose output alphabet p is not empty for NFA state 13 is called an NFA output description set 24 and is denoted as D.
- FIG. 8 shows the operation of the character string matching device 1 of the present invention.
- the character string matching device 1 of the present invention first receives the matching condition 2 and generates the state transition 11 and the output description 12 by the state transition table generation unit 3, that is, executes the procedure for compiling the matching condition. (Step S51).
- next, the procedure for outputting the collation result 10 is sequentially executed by the input character reading unit 7 and the SDFA automaton 8 while referring to the state transition 11 and the output description 12 (step S52).
- in compiling the matching condition, an NFA including ε transitions is first generated from the regular expression by the procedure of “Generate NFA including ε transition” (step S101).
- the ε transitions are then removed by the procedure of “removing ε transition” (step S102).
- a transition to the initial state for the case where the verification fails is added by the procedure “addition of transition to initial state” (step S103).
- non-deterministic transitions are removed by the procedure of “removing non-deterministic transitions” (step S104).
- the states made unnecessary by the previous procedures are removed (step S105).
- the redundant states and redundant state transitions are removed by the procedure “reduction of the number of states” (step S106).
- the state transition table and the output table are generated from the state set by the procedure of "generation of state transition table and output table" (step S107).
- for the procedure of “Generation of NFA including ε transition” in step S101, a known procedure shown in Non-Patent Document 1 or the like can be used.
- the metacharacter representing the character set other than the specific character set included in the regular expression is replaced with ⁇ , and the state transition from the corresponding state to the initial state q is changed to other O (NFA).
- the procedure of "removing ε transition" in step S102 can be realized by replacing each ε transition (transition by the empty character) with transitions to its set of transition destinations, using a known procedure shown in Non-Patent Document 1 and the like.
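One standard way to remove ε transitions is through ε-closures: every state inherits the character transitions of all states it can reach by ε alone. The sketch below shows that textbook technique, not necessarily the exact variant the patent relies on; transitions are `(source, destination, char)` tuples, and `None` is an assumed marker for the empty character. Output descriptions would need the same propagation and are omitted here.

```python
EPS = None  # marker used here for the empty character


def remove_epsilon(transitions):
    """transitions: set of (source, destination, char); returns an ε-free set."""
    def closure(q):
        # All states reachable from q through ε transitions only (including q).
        seen, stack = {q}, [q]
        while stack:
            s = stack.pop()
            for (src, dst, ch) in transitions:
                if src == s and ch is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    states = {s for (s, _, _) in transitions} | {d for (_, d, _) in transitions}
    result = set()
    for q in states:
        for p in closure(q):
            for (src, dst, ch) in transitions:
                if src == p and ch is not EPS:
                    result.add((q, dst, ch))
    return result
```

For example, with an ε transition from state 0 to state 1 and a transition from 1 to 2 on `a`, state 0 gains a direct transition to 2 on `a`.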
- FIG. 10 shows the procedure of “add failure transition to initial state” in step S103.
- step S202 If not, the process proceeds to step S202. Otherwise, go to step S203 (step S201).
- step S202 The power that divides the process according to the process
- step S203 The process of step S202 can be substituted by step S203.
- the process of step S202 is limited to the applicable range force S “There is a transition destination by ⁇ from the initial state”, but the state transition any
- the procedure “add failure transition to initial state” can be performed.
- step S202 “add failure transition to initial state (from initial state to ⁇ ⁇
- Figure 28 shows an example of regular expression (a
- a b
- step S202 “addition of failed transition to initial state (from initial state to ⁇
- step S302 step S301.
- ⁇ is set to Char (t), and the process proceeds to step S304 (step S303).
- Step S306 Step S305.
- step S307 step S306
- Step S308 Let t be the first NFA state transition starting from, and go to Step S308 (Step S308).
- Step S309 Step S308
- step S309) If the condition for deviation does not hold, go to step S311 (step S309)
- NFA state transition trans (q, Destination ⁇ ), ⁇ ) is included in NFA state transition set ⁇
- step S310 If it is not rare, in addition to T, the process proceeds to step S311 (step S310).
- step S306 if NFA state q force ⁇ FA initial state q, or step
- step S203 “Addition of failed transition to initial state (from initial state to ⁇ ⁇
- step S352 step S351.
- Step S353 Step S352.
- ⁇ is set to Char (t), and the process proceeds to step S354 (step S353).
- Step S356 Step S355.
- step S356 If the NFA state q is the initial state q, the process proceeds to step S356. Otherwise
- step S306 Advances to step S357 (step S306).
- the NFA state transition trans (q, Destination ⁇ ), ⁇ ) becomes the NFA state transition set ⁇
- the procedure of “removing non-deterministic transition” in step S104 removes the non-deterministic transitions contained in the NFA and generates deterministic transitions.
- the state transition end point q by the transition character a includes the non-deterministic transition character a.
- the procedure of “removing non-deterministic transition” in step S104 will be described.
- this procedure uses the variable Retry.
- the variable Retry can take either TRUE or FALSE.
- the procedure of “initial state of state set” is performed, and the process proceeds to step S402 (step S401).
- variable Retry is initialized to FALSE, and the process proceeds to step S401 (step S402).
- the procedure “removal of non-deterministic transition” can be performed according to the above procedure.
- the procedure of “initial state of state set” in step S401 will be described. Its purpose is to initialize the state set needed to generate the DFA states: for every NFA state q, it generates the DFA state {q} together with the associated state transitions.
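The initialization can be pictured as wrapping each NFA state in a singleton set. The sketch below assumes that transitions are `(source, destination, char)` tuples and output descriptions are `(state, condition number)` pairs; the function name is hypothetical.

```python
def init_state_set(nfa_states, nfa_trans, nfa_descs):
    """Build Q, T, D whose states are singleton sets of NFA states."""
    # Every NFA state q becomes the DFA state {q}.
    Q = {frozenset({q}) for q in nfa_states}
    # Transitions and output descriptions are carried over unchanged,
    # except that their states are wrapped in the same way.
    T = {(frozenset({s}), frozenset({d}), ch) for (s, d, ch) in nfa_trans}
    D = {(frozenset({q}), r) for (q, r) in nfa_descs}
    return Q, T, D
```

Using `frozenset` makes states hashable, so merged states created later (unions of several NFA states) can live in the same sets.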
- step S502 is the first NFA state included in the NFA state set Q, and step S502 is performed.
- Step S506 proceeds to Step S506. In cases other than that described here, process flow proceeds to Step S504 (Step S503).
- NFA state set (ie, DFA state) ⁇ q ⁇ is added to state set Q, and step S505 is added.
- step S506 If (NFA) (NFA) is performed, the state transition set T is emptied, and the process proceeds to step S507 (step S506).
- Step S511 proceeds to Step S511. In cases other than that described here, process flow proceeds to Step S509 (Step S508).
- step S510 proceeds to step S510 (step S509).
- step S512 Empties the output description set D and proceeds to step S512 (step S511).
- step S508 step S512
- the procedure of “removing non-deterministic transition related to bag” in step S403 will be described.
- in this procedure, as shown in Fig. 32 and Fig. 34, when there are multiple transition destinations for one transition character ⁇, those transitions are replaced with a transition to a new state, so that the transition destination by ⁇ is uniquely determined.
- the procedure of “removing non-deterministic transitions related to sputum” in step S403 will be described.
- this procedure uses the variable Found.
- the variable Retry can take either TRUE or FALSE.
- step S603 step S602.
- if all the states q in the state set Q have been processed, the process proceeds to step S616. Otherwise, the process proceeds to step S604 (step S603).
- if all characters ⁇ in the input alphabet ⁇ have been processed, the process proceeds to step S610. Otherwise, the process proceeds to step S606 (step S605).
- if the transition is non-deterministic, the process proceeds to step S607. Otherwise, the process proceeds to step S609 (step S606).
- Step S60 7 (Step S60 7 ).
- the variable Found is set to TRUE, and the process proceeds to step S609 (step S608).
- let ⁇ be the next character in the input alphabet ⁇, and proceed to step S605 (step S609).
- when all characters ⁇ in the input alphabet ⁇ have been processed in step S605, t is set to the first state transition starting from state q, and the process proceeds to step S611 (step S610).
- if all the state transitions t starting from state q have been processed, the process proceeds to step S615. Otherwise, the process proceeds to step S612 (step S611).
- step S614 step S612.
- transition character of t is replaced with ⁇ , that is, t is replaced with other other by (Source (t), Destination (t), ⁇ ), and the process proceeds to step S614 (step S613).
- when all the state transitions t starting from state q have been processed in step S611, q is set to the next state in state set Q, and the process proceeds to step S603 (step S615).
- variable Found can take either TRUE or FALSE values.
- variable Counter is used.
- Counter can take an integer value of 0 or greater.
- let q be the first state in state set Q, and proceed to step S703 (step S702).
- if all the states q in the state set Q have been processed, the process proceeds to step S714. Otherwise, the process proceeds to step S701 (step S703).
- Step S705 Step S704.
- the first state transition starting from q is set in t, and the process proceeds to step S706 (step S705).
- if all the state transitions t starting from q have been processed, the process proceeds to step S710. Otherwise, the process proceeds to step S707 (step S706).
- step S708 If the transition character Char (t) of t is ⁇ , the process proceeds to step S708. Otherwise
- step S709 step S707
- step S705 step S709
- if the value of Counter is 2 or more, the process proceeds to step S711. Otherwise, the process proceeds to step S713 (step S710).
- step S2 step S1
- TRUE is set in the variable Found and in the variable Retry, and the process proceeds to step S713 (step S712).
- let q be the next state in state set Q, and proceed to step S703 (step S713).
- step S703 When all the states q in state set Q are processed in step S703, the variable Found is set.
- the procedure “Generate New State” in step S607 and step S711 will be described.
- the procedure “Generate New State” takes q and ⁇ as parameters.
- the set of states that can be reached from the starting state q by ⁇ is obtained, the union of their NFA states is computed, and the result is taken as the new state q_n. If q_n is included in the state transition set T, the process proceeds to step S817. Otherwise, the process proceeds to step S802 (step S801).
- the state q_n is added to the state set Q, and the process proceeds to step S803 (step S802).
- step S804 step S803
- step S817 If all state transitions t starting from q have been processed, the process proceeds to step S817. It source
- step S805 step S804.
- step S816 step S805.
- Step S807 Step S807
- step S808 step S807
- step S809 step S808
- if all output descriptions d whose output state is Destination(t) have been processed, the process proceeds to step S816. Otherwise, the process proceeds to step S814 (step S813).
- if the output description desc(q_n, Result(d)) for the new state q_n is not included in D, that is, desc(q_n, Result(d)) ∉ D, it is added to D, and the process proceeds to step S815 (step S814).
- step S801 If q is included in the state transition set T in step S801, and if all state transitions t starting from q are processed in step S804, t starts from q And first The process proceeds to step S818 (step S817).
- step S822 If all the state transitions t starting from q have been processed, the process proceeds to step S822. It source
- step S819 step S818).
- the state transition t is deleted from the state transition set T, and the process proceeds to step S821 (step S820).
- let t be the next state transition starting from q_source, and proceed to step S818 (step S821).
- when all state transitions t starting from q_source have been processed in step S818, if the state transition trans(q_source, q_n, σ_t) is not included in T, trans(q_source, q_n, σ_t) is added to T and the procedure ends (step S822).
- the procedure “Generate New State” can be performed by the above procedure.
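The core of this procedure can be sketched compactly: take all destinations reachable from a state by one character, merge them into one state, let the merged state inherit their transitions and outputs, and redirect the original transitions. The sketch below is a simplified reading of the steps above, not a step-by-step transcription of the flowchart; states are `frozenset`s of NFA state numbers, and the function name is hypothetical.

```python
def create_new_state(Q, T, D, q_source, ch):
    """Merge all destinations of (q_source, ch) into one state; edits Q, T, D in place."""
    dests = {d for (s, d, c) in T if s == q_source and c == ch}
    q_new = frozenset().union(*dests)  # union of the underlying NFA states
    if q_new not in Q:
        Q.add(q_new)
        for dest in dests:
            # The new state inherits the outgoing transitions of each member ...
            for (s, d, c) in list(T):
                if s == dest:
                    T.add((q_new, d, c))
            # ... and their output descriptions.
            for (state, result) in list(D):
                if state == dest:
                    D.add((q_new, result))
    # Replace the old transitions on ch with a single one to q_new.
    for dest in dests:
        T.discard((q_source, dest, ch))
    T.add((q_source, q_new, ch))
    return q_new
```

The merged state may itself have several destinations for one character, which is why the surrounding procedure repeats until no non-deterministic transition is found.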
- step S811 the procedure “correction of state transition by ⁇ ” in step S811 will be described.
- transition characters b, c, and ⁇ are q U q, q U q , q U q and other 3 6 4 5 4 6, respectively.
- step S811 The procedure "Correction of state transition by ⁇ " in step S811 is explained with reference to FIG.
- This procedure uses CharSet, a set of extended input alphabet ⁇ E ⁇ .
- CharSet is initialized to empty, and the process proceeds to step S902 (step S901).
- Char (t) is included in the input alphabet ⁇ , that is, Char (t) ⁇ ⁇ and Char (t) ⁇ any
- Step S906 If ⁇ , proceed to step S906. Otherwise, go to Step S905 (Step other S904).
- step S903 If all state transitions starting from Source (t) are processed in step S903, t is set to other.
- step S907 The first state transition starting from state q is set, and the process proceeds to step S908 (step S907).
- Step S909 Step S908
- step S916 (step S909).
- step S912 step S911.
- Char (t) is included in the input alphabet ⁇ , ie Char (t) ⁇ ⁇ and Char (t)
- Step S915 Step otner
- step S914 step S913.
- step S908 The next state transition is started from state q, and the process proceeds to step S908 (step S916).
- the procedure “correction of state transition by ⁇ 2” can be performed by the above procedure.
- the procedure of "removing unused state" in step S105 will be described with reference to FIG. 19. This procedure deletes every state that is not the end point of any state transition produced by the processing so far, that is, every state that can never be reached for any input.
- this procedure uses the variable Found, which can take either TRUE or FALSE.
- the variable Found is set to FALSE, and the process proceeds to step S1002 (step S1001).
- let q be the first state included in the state set Q, and proceed to step S1003 (step S1002).
- if all the states q included in the state set Q have been processed, the process proceeds to step S1017. Otherwise, the process proceeds to step S1004 (step S1003).
- if q is the initial state, the process proceeds to step S1016. Otherwise, the process proceeds to step S1005 (step S1004).
- if there is a state transition with q as the end point, the process proceeds to step S1016. Otherwise, the process proceeds to step S1006 (step S1005).
- the variable Found is set to TRUE, and the process proceeds to step S1007 (step S1006).
- let t be the first state transition starting from q, and proceed to step S1008 (step S1007).
- if all state transitions t starting from q have been processed, the process proceeds to step S1011. Otherwise, the process proceeds to step S1009 (step S1008).
- the state transition t is removed from the state transition set T, and the process proceeds to step S1010 (step S1009).
- let t be the next state transition starting from q, and proceed to step S1008 (step S1010).
- when all state transitions t starting from q have been processed in step S1008, d is set to the first output description whose output state is q, and the process proceeds to step S1012 (step S1011).
- if all output descriptions d whose output state is q have been processed, the process proceeds to step S1015. Otherwise, the process proceeds to step S1013 (step S1012).
- let d be the next output description whose output state is q, and proceed to step S1012 (step S1014).
- when all output descriptions d whose output state is q have been processed in step S1012, state q is deleted from the state set Q, and the process proceeds to step S1012 (step S1015).
- let q be the next state included in the state set Q, and proceed to step S1003 (step S1016).
- the procedure “removal of unused state” can be performed by the above procedure.
- the procedure “removing unused state” is intended to reduce the memory capacity necessary to store the state transition table. Therefore, even if this procedure is omitted, the procedure “collation of input document” can be executed, and by omitting this procedure, the time required for “compilation of collation conditions” can be shortened.
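The effect of this procedure can be sketched in a few lines. The sketch assumes transitions are `(source, destination, char)` tuples and output descriptions are `(state, condition number)` pairs; repeating the pass until nothing more is removed is an assumption suggested by the variable Found, since removing one state can leave another without incoming transitions.

```python
def remove_unused_states(Q, T, D, q0):
    """Drop states that are never the end point of a transition (q0 is kept)."""
    found = True
    while found:  # assumption: repeat until a full pass removes nothing
        found = False
        for q in list(Q):
            if q == q0 or any(dst == q for (_, dst, _) in T):
                continue
            found = True
            # Remove the state's outgoing transitions and output descriptions,
            # then the state itself.
            T -= {t for t in T if t[0] == q}
            D -= {d for d in D if d[0] == q}
            Q.discard(q)
    return Q, T, D
```

Skipping this step changes only the table size, not the matching result, which is consistent with the remark that it may be omitted to shorten compilation.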
- the procedure of “redundant state removal” in step S106 will be described.
- this procedure removes two kinds of redundancy.
- the first case is a state transition with the same end point state as the state other transition due to ⁇ .
- Figure 42 shows an example.
- the SDFA automaton works even if the state transition with q as the end point due to the transition character b is deleted.
- the second case is the merging of multiple states with the same end point of state transition for all transition characters.
- Figure 43 shows an example.
- states q and q are the states for all transition characters.
- this procedure uses the variables StateRemoved and TransitionRemoved, each of which can take either TRUE or FALSE.
- TRUE is set in StateRemoved, and the process proceeds to step S1102 (step S1101).
- step S1105 step S1104.
- The procedure “redundant state removal” can be performed by the above steps.
- The procedure “redundant state removal” is intended to reduce the memory capacity needed to store the state transition table. The procedure “collation of input document” can therefore still be executed if it is omitted, and omitting it shortens the time required for “compilation of collation conditions”.
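The two cases of redundancy can be sketched in Python as follows. This is only an illustration under stated assumptions: the table layout (a dict of per-state rows), the `OTHER` sentinel standing in for the “other” transition character, and the loop structure are all hypothetical, since the original step-by-step control flow is only partly preserved here. The merging step uses a simple grouping of identical rows, which is weaker than the known minimization procedure the text refers to, and it additionally requires identical outputs before merging.

```python
OTHER = "<other>"  # hypothetical sentinel for the "other" transition character

def delete_redundant_transitions(trans):
    """Case 1: drop (q, c) when it leads to the same state as (q, OTHER)."""
    removed = False
    for row in trans.values():
        if OTHER not in row:
            continue
        default = row[OTHER]
        for c in [c for c, dst in row.items() if c != OTHER and dst == default]:
            del row[c]
            removed = True
    return removed

def merge_equivalent_states(trans, outputs, initial):
    """Case 2: merge states whose rows (and outputs, by assumption) are identical."""
    groups = {}
    for q in sorted(trans):
        key = (frozenset(trans[q].items()), frozenset(outputs.get(q, ())))
        groups.setdefault(key, []).append(q)
    rename = {}
    for members in groups.values():
        keep = initial if initial in members else members[0]
        for q in members:
            if q != keep:
                rename[q] = keep
    for q in rename:
        del trans[q]
        outputs.pop(q, None)
    # Redirect transitions that pointed at a merged-away state.
    for row in trans.values():
        for c, dst in row.items():
            row[c] = rename.get(dst, dst)
    return bool(rename)

def remove_redundancy(trans, outputs, initial=0):
    state_removed = True  # plays the role of StateRemoved
    while state_removed:
        delete_redundant_transitions(trans)  # its result plays the role of TransitionRemoved
        state_removed = merge_equivalent_states(trans, outputs, initial)
    return trans, outputs
```

Repeating the loop while states are still being merged matters because redirecting transitions after a merge can make further transitions redundant.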
- Next, the procedure “Deletion of redundant state transition” in step S1102 will be described. This procedure corresponds to the first case of “redundant state removal” and is based on the “other” transition character ⁇ as shown in FIG.
- Step S 1202 Step S1201
- t is set to the first state transition included in the state transition set T, and the process proceeds to step S1203 (step S1202).
- Step S1204 Step S1203
- Step S1206 Step S1206
- Step S1207 Step S1206
- step S1208 Destination (t)
- The state transition t is deleted from the state transition set T, and the process proceeds to step S1209 (step S1208).
- t is set to the next state transition included in the state transition set T, and the process proceeds to step S1203 (step S1210).
- the procedure “Deletion of redundant state transition” can be performed by the above procedure.
- Step S1104 can be executed by a known procedure, for example the one described in Non-Patent Document 1. This procedure sets the variable StateRemoved to TRUE when one or more states are merged. Otherwise variable
- Step S107: generation of the state transition table and output table.
- The state transitions 11 and the output descriptions 12 are extracted from the state set 25, the state transition set 26, and the output description set 27, and stored in the state transition table storage unit 4 and the output table storage unit 5, respectively.
- The first state transition t is extracted from the state transition set T (step S1302).
- After all the state transitions t included in the state transition set T have been processed, the process proceeds to step S1307.
- The hash value calculation unit 31 calculates a hash value for the pair of the current state Source(t) and the input character Char(t).
- step S1305 step S1314
- the first output description d is taken out from the output description set (step S1307).
- Result (d) is added as condition number chain 42 to condition number index 41 corresponding to state number StateId (State (d)) of State (d), and the process proceeds to step S1310 (step S1309).
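The table generation described above can be pictured with a short Python sketch. It is an approximation under assumptions: a built-in `dict` stands in for the hash value calculation unit 31 and the state transition hash table, a plain list stands in for the condition number chain 42, and the record types `Transition` and `OutputDescription` are hypothetical names whose fields mirror Source(t), Char(t), Destination(t), State(d) and Result(d).

```python
from collections import namedtuple

# Hypothetical record types for the elements of the two sets.
Transition = namedtuple("Transition", "source char destination")
OutputDescription = namedtuple("OutputDescription", "state result")

def build_tables(transition_set, output_set):
    # State transition table: one entry per defined transition,
    # hashed on the (Source(t), Char(t)) pair.
    transition_table = {}
    for t in transition_set:
        transition_table[(t.source, t.char)] = t.destination

    # Output table: for each state number, the chain of condition numbers.
    output_table = {}
    for d in output_set:
        output_table.setdefault(d.state, []).append(d.result)
    return transition_table, output_table
```

Because only defined transitions get an entry, undefined (state, character) combinations take no space in this layout.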
- State q is set to the initial state, and the flow proceeds to step S2002 (step S2001).
- The output table storage unit 5 is searched, and the condition numbers 16 related to q are output. This is done by following, in order, the pointers of the condition number chain 42 that corresponds, in the condition number index 41, to the state number StateId(q) of the current state 13. When the whole condition number chain 42 has been searched, the process proceeds to step S2003 (step S2002).
- step S2004 step S2003
- step S2005 step S2004
- The state transition table storage unit 4 is searched to determine whether there is a transition destination q_d from state q by the transition character ⁇.
- If the transition destination q_d exists, the process proceeds to step S2007. Otherwise, step
- If q_d does not exist in step S2006, it is determined whether there is a transition destination q_d from state q by the “other” transition character ⁇, that is, whether a q_d satisfying trans(q, q_d, ⁇) exists.
- step S2009 step S2008.
- If the transition destination q_d exists, the process proceeds to step S2007. Otherwise, step
- the procedure “verification of input document” can be performed by the above procedure.
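The matching loop can be sketched in Python as below. This is a minimal illustration, not the procedure as claimed: the dict-based tables, the `OTHER` sentinel for the “other” transition character, and the `(position, condition number)` result format are assumptions. The fall-back to the initial state when neither lookup succeeds follows the behaviour described for the worked examples in this description. The four sample transitions are the ones named in the FIG. 47 walk-through; the rest of that table is not reproduced here.

```python
OTHER = "<other>"  # hypothetical sentinel for the "other" transition character

def match(text, transition_table, output_table, initial=0):
    """Scan `text` once and collect (position, condition_number) pairs."""
    q = initial
    results = [(0, n) for n in output_table.get(q, ())]
    for pos, ch in enumerate(text, start=1):
        nxt = transition_table.get((q, ch))          # explicit transition
        if nxt is None:
            nxt = transition_table.get((q, OTHER))   # "other" transition
        if nxt is None:
            nxt = initial                            # neither is defined
        q = nxt
        results.extend((pos, n) for n in output_table.get(q, ()))
    return q, results

# Transitions taken from the FIG. 47 walk-through (0 -a-> 2 -a-> 5 -c-> 9 -a-> 12).
fig47_fragment = {(0, "a"): 2, (2, "a"): 5, (5, "c"): 9, (9, "a"): 12}
final_state, _ = match("aaca", fig47_fragment, {})
```

With this fragment, `final_state` is 12, matching the walk-through. Comparing with `is None` rather than truthiness matters because state 0 is a valid destination.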
- The operation of the character string collating apparatus of this example will be described below, using collation condition 2 shown in FIG. 44 and the input character string shown in FIG. 54 as an example.
- The collation conditions in FIG. 44 are the collation conditions of FIG. 52, shown in Patent Document 3, rewritten in the format of this example, and are logically equivalent to the collation conditions of FIG.
- The state transition table in FIG. 45 represents, as a table, the next state 15 for each pair of current state 13 and input character 14 stored in the state transition table storage unit 4.
- For example, when the state number of the current state is 6 and the input character is “d”, the next state is 10.
- “1” indicates that the next state 15 does not exist; the state transition table storage unit 4 consumes no memory for such combinations.
- A conventional finite automaton with output must store all 90 combinations, whereas the example of FIG. 45 needs to store information for only 46 state transitions, which reduces the memory required for storage.
- The output table of FIG. 46 represents the set of condition numbers 16 corresponding to each current state 13 stored in the output table storage unit 5. For example, an entry listing condition number 0 for a current state indicates that condition number 0 is output in that state.
- FIG. 47 shows the operation when the input character string “aaca” is input.
- state q is set to the initial state (state 0).
- The first character “a” is read, and a transition is made from state 0, the initial state, to state 2, the transition destination of the character “a”. Since the transition destination by the character “a” is defined, it is not necessary to refer to the “other” transition character ⁇; this is indicated as (unnecessary) in the figure.
- the second character “a” is read and a transition is made to state 5 which is the transition destination of character “a” in state 2.
- the third character “c” is read, and a transition is made to state 9 which is the transition destination of character “c” in state 5.
- the fourth character “a” is read, and a transition is made to state 12, which is the transition destination of character “a” in state 9. The input ends here.
- The collation condition in FIG. 44 is presented for comparison with the technique described in Patent Document 3, but it has the special property that a transition from the initial state is possible for every character.
- The purpose of the collation condition in FIG. 44 is to detect the collation target character string “abcd” even when one of its characters has been changed. Without being converted to the matching conditions shown in FIG. In such a case, the memory required for the state transitions can be reduced further, and the effects of the present embodiment become more pronounced.
- FIG. 51 shows the operation for input “xabxd” with respect to the matching condition 2 of FIG. In state 0, the transition destination of the first character “x” is not defined.
- The transition by the “other” transition character ⁇ is not defined either, so the next state is the initial state, that is, state 0.
- At this time, the state transition table is referenced twice.
- The second character “a” causes a transition to state 3, and the third character “b” a transition to state 6. For the fourth character “x”, the transition destination of “x” is not defined, so the next state 1 is obtained by referring to the “other” transition character ⁇. At this time, the state transition table is referenced twice.
- In state 2, number 0 is present as the output condition number 16, so it is output. In this case, the state transition table is referenced 7 times in total.
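The count of table references in this walk-through can be reproduced with a small Python sketch. Only the transitions named in the text are included, and one of them is an assumption: the text that links state 1 to state 2 is truncated, so the entry `(1, "d") -> 2` is inferred from the input “xabxd” and should be treated as hypothetical. The `OTHER` sentinel and the function name are also illustrative.

```python
OTHER = "<other>"  # hypothetical sentinel for the "other" transition character

def count_table_references(text, table, initial=0):
    """Return (final_state, number_of_table_lookups) for one scan of `text`."""
    q, lookups = initial, 0
    for ch in text:
        lookups += 1                      # first lookup: explicit transition
        nxt = table.get((q, ch))
        if nxt is None:
            lookups += 1                  # second lookup: "other" transition
            nxt = table.get((q, OTHER), initial)  # fall back to the initial state
        q = nxt
    return q, lookups

fragment = {
    (0, "a"): 3,
    (3, "b"): 6,
    (6, OTHER): 1,
    (1, "d"): 2,   # assumption: inferred, not stated in the surviving text
}
final_state, lookups = count_table_references("xabxd", fragment)
```

Under these assumptions the scan ends in state 2 after 7 lookups: two each for the two “x” characters and one each for “a”, “b” and “d”.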
- A general DFA can be regarded as a special Moore machine whose output alphabet consists of two kinds of information, “accepted” and “not accepted”. In this example too, a character string collating device that outputs the two kinds of information “accepted” and “not accepted” can be configured simply by determining whether a condition number exists in the collation result.
- In this example, the input and collation target is assumed to be a “character”.
- A “character” is not limited to an element of a human-readable character string; the invention can be applied to any symbol string or data string.
- For example, the present invention may be applied to the identification of gene sequences or of data measured by sensors.
- In this example, the state transition hash table 35 is used for the state transition table storage unit 4, but the unit may be realized by any data structure that can logically express a two-dimensional table, such as an array or a tree structure.
- Data structures with high access speed, such as arrays, may be used for frequently used states such as the initial state, in combination with memory-efficient data structures, such as tree structures, for infrequently used states.
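A combined layout of this kind might look like the following Python sketch. It is illustrative only: the class name, the choice of which states count as frequently used, and the fixed alphabet are assumptions, and a hash map (`dict`) is used for the infrequently used states in place of the tree structure mentioned in the text, because Python has no built-in tree map.

```python
class HybridTransitionTable:
    """Dense array rows for frequently used states, a hash map for the rest."""

    def __init__(self, transitions, hot_states, alphabet):
        # Column index of each character in a dense row.
        self._index = {ch: i for i, ch in enumerate(alphabet)}
        # One fixed-size row per frequently used state (fast, memory hungry).
        self._dense = {q: [None] * len(alphabet) for q in hot_states}
        # Only defined transitions are stored for all other states.
        self._sparse = {}
        for (q, ch), dst in transitions.items():
            if q in self._dense:
                self._dense[q][self._index[ch]] = dst
            else:
                self._sparse[(q, ch)] = dst

    def lookup(self, q, ch):
        """Return the next state, or None when the transition is undefined."""
        row = self._dense.get(q)
        if row is not None:
            i = self._index.get(ch)
            return None if i is None else row[i]
        return self._sparse.get((q, ch))
```

For example, `HybridTransitionTable({(0, "a"): 2, (2, "a"): 5}, hot_states=[0], alphabet="abcd")` keeps state 0 in a four-slot row and state 2 in the hash map, while `lookup` hides the difference from the caller.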
- the present invention can be applied to a character string collating apparatus.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
BRPI0419214-1A BRPI0419214B1 (pt) | 2004-12-09 | Sistema e método de correspondência de sequência | |
CNB2004800445705A CN100524301C (zh) | 2004-12-09 | 2004-12-09 | 字符串对照装置 |
PCT/JP2004/018348 WO2006061899A1 (ja) | 2004-12-09 | 2004-12-09 | 文字列照合装置および文字列照合プログラム |
JP2007531511A JP4535130B2 (ja) | 2004-12-09 | 2004-12-09 | 文字列照合装置および文字列照合プログラム |
US11/792,564 US8032479B2 (en) | 2004-12-09 | 2004-12-09 | String matching system and program therefor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2004/018348 WO2006061899A1 (ja) | 2004-12-09 | 2004-12-09 | 文字列照合装置および文字列照合プログラム |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2006061899A1 (ja) | 2006-06-15 |
WO2006061899A8 (ja) | 2007-08-30 |
Family
ID=36577729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2004/018348 WO2006061899A1 (ja) | 2004-12-09 | 2004-12-09 | 文字列照合装置および文字列照合プログラム |
Country Status (4)
Country | Link |
---|---|
US (1) | US8032479B2 (ja) |
JP (1) | JP4535130B2 (ja) |
CN (1) | CN100524301C (ja) |
WO (1) | WO2006061899A1 (ja) |
Families Citing this family (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070226362A1 (en) * | 2006-03-21 | 2007-09-27 | At&T Corp. | Monitoring regular expressions on out-of-order streams |
US8903840B2 (en) * | 2006-08-31 | 2014-12-02 | International Business Machines Corporation | System and method for launching a specific program from a simple click on a string of characters |
US7630982B2 (en) | 2007-02-24 | 2009-12-08 | Trend Micro Incorporated | Fast identification of complex strings in a data stream |
US20090006316A1 (en) * | 2007-06-29 | 2009-01-01 | Wenfei Fan | Methods and Apparatus for Rewriting Regular XPath Queries on XML Views |
FR2939535B1 (fr) * | 2008-12-10 | 2013-08-16 | Canon Kk | Procede et systeme de traitement pour la configuration d'un processseur exi |
US8862603B1 (en) * | 2010-11-03 | 2014-10-14 | Netlogic Microsystems, Inc. | Minimizing state lists for non-deterministic finite state automatons |
US9858051B2 (en) * | 2011-06-24 | 2018-01-02 | Cavium, Inc. | Regex compiler |
US8990259B2 (en) | 2011-06-24 | 2015-03-24 | Cavium, Inc. | Anchored patterns |
CN103858386B (zh) | 2011-08-02 | 2017-08-25 | 凯为公司 | 用于通过优化的决策树执行包分类的方法和装置 |
JP5554304B2 (ja) * | 2011-09-16 | 2014-07-23 | 株式会社東芝 | オートマトン決定化方法、オートマトン決定化装置およびオートマトン決定化プログラム |
US8818783B2 (en) * | 2011-09-27 | 2014-08-26 | International Business Machines Corporation | Representing state transitions |
US8775393B2 (en) * | 2011-10-03 | 2014-07-08 | Polytechniq Institute of New York University | Updating a perfect hash data structure, such as a multi-dimensional perfect hash data structure, used for high-speed string matching |
CN102542038A (zh) * | 2011-12-27 | 2012-07-04 | 浪潮通信信息系统有限公司 | 一种通用可配置的标准局数据入库方法 |
US9336194B2 (en) | 2012-03-13 | 2016-05-10 | Hewlett Packard Enterprises Development LP | Submatch extraction |
US9558299B2 (en) | 2012-04-30 | 2017-01-31 | Hewlett Packard Enterprise Development Lp | Submatch extraction |
US8725749B2 (en) * | 2012-07-24 | 2014-05-13 | Hewlett-Packard Development Company, L.P. | Matching regular expressions including word boundary symbols |
US8793251B2 (en) * | 2012-07-31 | 2014-07-29 | Hewlett-Packard Development Company, L.P. | Input partitioning and minimization for automaton implementations of capturing group regular expressions |
US8938454B2 (en) * | 2012-10-10 | 2015-01-20 | Polytechnic Institute Of New York University | Using a tunable finite automaton for regular expression matching |
US9268881B2 (en) | 2012-10-19 | 2016-02-23 | Intel Corporation | Child state pre-fetch in NFAs |
US9117170B2 (en) | 2012-11-19 | 2015-08-25 | Intel Corporation | Complex NFA state matching method that matches input symbols against character classes (CCLs), and compares sequence CCLs in parallel |
US9665664B2 (en) | 2012-11-26 | 2017-05-30 | Intel Corporation | DFA-NFA hybrid |
US9304768B2 (en) | 2012-12-18 | 2016-04-05 | Intel Corporation | Cache prefetch for deterministic finite automaton instructions |
US9251440B2 (en) * | 2012-12-18 | 2016-02-02 | Intel Corporation | Multiple step non-deterministic finite automaton matching |
US9268570B2 (en) | 2013-01-23 | 2016-02-23 | Intel Corporation | DFA compression and execution |
US9931785B2 (en) | 2013-03-15 | 2018-04-03 | 3D Systems, Inc. | Chute for laser sintering systems |
US9086688B2 (en) * | 2013-07-09 | 2015-07-21 | Fisher-Rosemount Systems, Inc. | State machine function block with user-definable actions on a transition between states |
WO2015084360A1 (en) * | 2013-12-05 | 2015-06-11 | Hewlett-Packard Development Company, L.P. | Regular expression matching |
US9275336B2 (en) | 2013-12-31 | 2016-03-01 | Cavium, Inc. | Method and system for skipping over group(s) of rules based on skip group rule |
US9544402B2 (en) | 2013-12-31 | 2017-01-10 | Cavium, Inc. | Multi-rule approach to encoding a group of rules |
US9667446B2 (en) | 2014-01-08 | 2017-05-30 | Cavium, Inc. | Condition code approach for comparing rule and packet data that are provided in portions |
US11782983B1 (en) * | 2020-11-27 | 2023-10-10 | Amazon Technologies, Inc. | Expanded character encoding to enhance regular expression filter capabilities |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10105576A (ja) * | 1996-06-27 | 1998-04-24 | Fujitsu Ltd | スパースな状態遷移表に基づく複数記号列の照合装置および方法 |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4764863A (en) * | 1985-05-09 | 1988-08-16 | The United States Of America As Represented By The Secretary Of Commerce | Hardware interpreter for finite state automata |
JP2702927B2 (ja) * | 1987-06-15 | 1998-01-26 | 株式会社日立製作所 | 文字列検索装置 |
US5309358A (en) * | 1992-02-18 | 1994-05-03 | International Business Machines Corporation | Method for interchange code conversion of multi-byte character string characters |
JP2994926B2 (ja) | 1993-10-29 | 1999-12-27 | 松下電器産業株式会社 | 有限状態機械作成方法とパターン照合機械作成方法とこれらを変形する方法および駆動方法 |
JP4021832B2 (ja) | 1996-06-27 | 2007-12-12 | 富士通株式会社 | スパースな状態遷移表に基づく複数記号列の照合装置および方法 |
JP4056962B2 (ja) | 1996-06-27 | 2008-03-05 | 富士通株式会社 | スパースな状態遷移表に基づく複数記号列の照合装置および方法 |
US5995963A (en) * | 1996-06-27 | 1999-11-30 | Fujitsu Limited | Apparatus and method of multi-string matching based on sparse state transition list |
JP3231673B2 (ja) | 1996-11-21 | 2001-11-26 | シャープ株式会社 | 文字,文字列検索方法及び該方法に用いる記録媒体 |
EP1436936A4 (en) * | 2001-09-12 | 2006-08-02 | Safenet Inc | RECOGNITION OF FORMS OF HIGH-SPEED DATA FLOW |
ATE373846T1 (de) * | 2001-09-12 | 2007-10-15 | Safenet Inc | Verfahren zur generierung eines dfa-automaten, wobei übergänge zwecks speichereinsparung in klassen gruppiert werden |
US7346511B2 (en) * | 2002-12-13 | 2008-03-18 | Xerox Corporation | Method and apparatus for recognizing multiword expressions |
US7552051B2 (en) * | 2002-12-13 | 2009-06-23 | Xerox Corporation | Method and apparatus for mapping multiword expressions to identifiers using finite-state networks |
US7305391B2 (en) * | 2003-02-07 | 2007-12-04 | Safenet, Inc. | System and method for determining the start of a match of a regular expression |
WO2004107404A2 (en) * | 2003-05-23 | 2004-12-09 | Sensory Networks, Inc. | Apparatus and method for large hardware finite state machine with embedded equivalence classes |
US20050273450A1 (en) * | 2004-05-21 | 2005-12-08 | Mcmillen Robert J | Regular expression acceleration engine and processing model |
US7539681B2 (en) * | 2004-07-26 | 2009-05-26 | Sourcefire, Inc. | Methods and systems for multi-pattern searching |
US7356663B2 (en) * | 2004-11-08 | 2008-04-08 | Intruguard Devices, Inc. | Layered memory architecture for deterministic finite automaton based string matching useful in network intrusion detection and prevention systems and apparatuses |
-
2004
- 2004-12-09 JP JP2007531511A patent/JP4535130B2/ja active Active
- 2004-12-09 WO PCT/JP2004/018348 patent/WO2006061899A1/ja not_active Application Discontinuation
- 2004-12-09 US US11/792,564 patent/US8032479B2/en active Active
- 2004-12-09 CN CNB2004800445705A patent/CN100524301C/zh active Active
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007034777A (ja) * | 2005-07-28 | 2007-02-08 | Nec Corp | データ検索装置及び方法、並びにコンピュータ・プログラム |
WO2008084594A1 (ja) * | 2007-01-12 | 2008-07-17 | Nec Corporation | パターンマッチング装置及び方法 |
JP5120263B2 (ja) * | 2007-01-12 | 2013-01-16 | 日本電気株式会社 | パターンマッチング装置及び方法 |
US8626688B2 (en) | 2007-01-12 | 2014-01-07 | Nec Corporation | Pattern matching device and method using non-deterministic finite automaton |
JP2010026689A (ja) * | 2008-07-17 | 2010-02-04 | Internatl Business Mach Corp <Ibm> | 情報処理装置、情報処理方法およびプログラム |
Also Published As
Publication number | Publication date |
---|---|
CN100524301C (zh) | 2009-08-05 |
JPWO2006061899A1 (ja) | 2009-09-03 |
CN101076798A (zh) | 2007-11-21 |
BRPI0419214A (pt) | 2008-04-15 |
JP4535130B2 (ja) | 2010-09-01 |
WO2006061899A8 (ja) | 2007-08-30 |
US20080109431A1 (en) | 2008-05-08 |
US8032479B2 (en) | 2011-10-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 2007531511 Country of ref document: JP |
|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 11792564 Country of ref document: US Ref document number: 1020077012822 Country of ref document: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 200480044570.5 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 04822547 Country of ref document: EP Kind code of ref document: A1 |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 4822547 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: PI0419214 Country of ref document: BR |
|
WWP | Wipo information: published in national office |
Ref document number: 11792564 Country of ref document: US |