CN104077358A

CN104077358A - Automaton method for information discovery in massive short texts

Info

Publication number: CN104077358A
Application number: CN201410243718.3A
Authority: CN
Inventors: 王崇骏; 杨骏元; 彭岳; 杨骏; 谢俊元
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2014-06-03
Filing date: 2014-06-03
Publication date: 2014-10-01

Abstract

The invention provides an automaton method for information discovery in massive short texts. The method comprises the following steps: 1) automaton construction: a. storing all keywords in a Trie structure, in which tree nodes are regarded as states and tree edges as state transitions; b. building a hash table for the transition information of each node; c. adding a suffix (end-of-word) mark and whole-word information to each suffix state; d. adding a fail transition to each node; e. ending; 2) automaton running: a. reading the text character by character and performing a transition according to the transition information of the current state and the character read; b. whenever a suffix state is reached, outputting the whole-word information stored in that state; c. ending. While ensuring correctness, the method greatly improves the efficiency of information discovery in massive short texts; using hash tables reduces the time spent looking up the jump table during automaton transitions; the whole process is automatic and user-friendly.

Description

Automaton method for information discovery in massive short texts

Technical field

The present invention relates to the field of multi-pattern matching, and in particular to an automaton method for information discovery in massive short texts.

Background technology

With the arrival of the big data era, everyday life constantly produces information such as microblog posts, SMS messages, and QQ chat records. Such information shares several features: the content is short (a microblog post has at most 140 characters, an SMS message 80 characters) and the volume is large (in 2012 alone, 897.31 billion SMS messages were sent nationwide, according to data from the Ministry of Industry and Information Technology). How to process this large volume of text is one of the key problems in analyzing and processing big data.

Multi-keyword matching means sniffing a given text to determine whether it contains any keyword, and thereby deciding whether the text passes detection. Multi-keyword matching technology is already widely used; one of its best-known applications is China's GFW system, whose main principle is to block websites containing keywords at the router level. Although the technology used by that system is multi-keyword filtering, the principle of multi-keyword filtering is very similar to that of multi-keyword matching: both work by sniffing for keywords. They can be regarded as two opposite applications of the same method.

In the present system, massive numbers of SMS messages need to be analyzed. Specifically, it is necessary to find out whether each message contains the names of certain sensitive organizations, and from this to determine certain social attributes of the message recipient. This requirement calls for a multi-text, multi-keyword filtering technique.

Summary of the invention

The problem addressed by the present invention is information discovery in massive short texts, with the main concerns being efficiency, accuracy, and automation. To solve this problem, the steps of the automaton method of the present invention for information discovery in massive short texts are as follows:

1) Automaton construction:

a. store all keywords in a Trie structure, in which tree nodes are regarded as states and tree edges as state transitions;

b. build a hash table for the transition information of each node;

c. add an end-of-word mark and whole-word information to each end-of-word state (the state reached by the last character of a keyword);

d. add a fail transition to each node;

e. end;

2) Automaton running:

a. read the text character by character, and perform a transition according to the transition information of the current state and the character read;

b. whenever an end-of-word state is reached, output the whole-word information stored in that state;

c. end.

In step 1)-a of the present invention, the Trie is a tree structure whose advantage is that it uses the common prefixes of strings to reduce query time. The Trie built in this step forms the skeleton of the automaton.

In step 1)-b of the present invention, the purpose of building a hash table for the transition information of each node is to convert lookup into computation, reducing the cost of querying the jump table. The hash table here is a dynamic hash table with an initial capacity of 16 and a fixed load factor of 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, in which the entries of the table are re-inserted into the new table.
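The resizing rule described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the class name `DynamicHashTable`, the use of separate chaining for collisions, and the use of Python's built-in `hash` are all assumptions made for the example; only the initial capacity of 16, the load factor of 0.75, and the doubling with rehash come from the text.

```python
class DynamicHashTable:
    """Hypothetical chained hash table that doubles once entries exceed capacity * load factor."""

    def __init__(self, initial_capacity=16, load_factor=0.75):
        self.capacity = initial_capacity
        self.load_factor = load_factor
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    def _index(self, key, capacity):
        # Built-in hash() is an assumption; any hash function could be used here.
        return hash(key) % capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key, self.capacity)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        # "Greater than capacity times load factor" triggers the doubling.
        if self.size > self.capacity * self.load_factor:
            self._resize()

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key, self.capacity)]:
            if k == key:
                return v
        return default

    def _resize(self):
        # Double the capacity and re-insert every entry into the new table (rehash).
        new_capacity = self.capacity * 2
        new_buckets = [[] for _ in range(new_capacity)]
        for bucket in self.buckets:
            for k, v in bucket:
                new_buckets[self._index(k, new_capacity)].append((k, v))
        self.capacity = new_capacity
        self.buckets = new_buckets
```

With these parameters the table holds 12 entries at capacity 16 and grows to capacity 32 when the 13th entry is added, since 13 is the first count greater than 16 × 0.75 = 12.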

In step 1)-c of the present invention, the end-of-word mark is used, while the automaton is running, to indicate whether the current state has recognized a keyword. The purpose of adding whole-word information is to output the keyword that has been recognized.

In step 1)-d of the present invention, the fail transition is the default next state used when the input character cannot be found in the current jump table. The fail transition of a state points to the state for the longest proper suffix of that state's input string that can be reached by transitions. Here, the input string of a state is the shortest string that must be input to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself.

In step 2)-a of the present invention, note that the fail transition is executed immediately: when inputting a character X at state A causes a fail transition to state B, the character X must then be input to state B, and so on, until the resulting state is the initial state or a state that has a transition for X.

The beneficial effects of the present invention are as follows: while ensuring correctness, the automaton method greatly improves the efficiency of information discovery in massive texts; using hash tables reduces the time spent looking up the jump table during automaton transitions; the whole process is automated and user-friendly.

Brief description of the drawings

Fig. 1 is a flowchart of the automaton method of the present invention for information discovery in massive short texts.

Embodiment

To better explain the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawing.

Fig. 1 is a flowchart of the automaton method for information discovery in massive short texts according to an embodiment of the present invention. The method comprises two stages: the automaton construction stage and the automaton execution stage. The specific steps are as follows:

Step 1: store all keywords in a Trie structure. The Trie is built as follows:

1) Create the root node, i.e. the initial state, and make it the current state.

2) Take the first character of a keyword to be added. If the current state has no transition for this character, create a new state, make the current state's transition for this character point to it, and make the new state the current state. If the current state already has a transition for this character, follow it and make the target state the current state.

3) Repeat step 2) for each subsequent character until the end of the keyword.

4) Repeat steps 2) and 3), starting again from the initial state for each keyword, until all keywords have been added.
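The four construction steps above can be sketched as follows. This is an illustrative sketch only: the function name `build_trie` and the representation of states as a list of Python dicts (each dict acting as that state's hash-based jump table) are assumptions, not details given in the text.

```python
def build_trie(keywords):
    """Return a list of states; states[i] maps a character to the index of the next state."""
    states = [{}]  # 1) state 0 is the root, i.e. the initial state
    for word in keywords:  # 4) repeat for every keyword
        current = 0  # each keyword starts again from the initial state
        for ch in word:  # 2) and 3) consume the keyword character by character
            if ch not in states[current]:
                states.append({})  # no transition yet: create a new state
                states[current][ch] = len(states) - 1
            current = states[current][ch]  # follow the (possibly new) transition
    return states
```

For the keywords `he`, `she`, `his`, and `hers`, the shared prefix `h` is stored once, so the sketch produces 10 states instead of the 13 that separate chains would need.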

Step 2: build a hash table for the transition information of each node. The hash algorithm used here is a base-31 algorithm; for a string of length n+1, its hash value is:
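The formula itself does not survive in this text. One plausible reading, assumed here rather than confirmed by the source, is the conventional base-31 polynomial hash h = s_0·31^n + s_1·31^(n-1) + ... + s_n, where s_i is the character code of the i-th character of the string (the same form as Java's `String.hashCode`). Under that assumption, the value can be computed with Horner's rule; the function name `hash31` is hypothetical.

```python
def hash31(s):
    """Base-31 polynomial hash: s[0]*31**n + s[1]*31**(n-1) + ... + s[n], for len(s) == n + 1."""
    h = 0
    for ch in s:
        # Horner's rule: multiply the running value by the base, then add the next character code.
        h = h * 31 + ord(ch)
    return h


def bucket_index(s, capacity=16):
    """Map the hash value to a slot of a table with the given capacity (modulo is an assumption)."""
    return hash31(s) % capacity
```

Python integers do not overflow, so no truncation is applied here; an implementation with fixed-width integers would wrap around instead.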

The hash table here is a dynamic hash table with an initial capacity of 16 and a fixed load factor of 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, in which the entries of the table are re-inserted into the new table.

Step 3: add an end-of-word mark and whole-word information to each end-of-word node (the node reached by the last character of a keyword). The end-of-word mark is used, while the automaton is running, to indicate whether the current state has recognized a keyword. The purpose of adding whole-word information is to output the keyword that has been recognized.

Step 4: add a fail transition to each node. The fail transition is the default next state used when the input character cannot be found in the current jump table. The fail transition of a state should point to the state for the longest proper suffix of that state's input string that can be reached by transitions. Here, the input string of a state is the shortest string that must be input to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself. The steps for adding a fail transition to a node are as follows:

1) Obtain the input string of this state.

2) Search the proper suffixes of this input string in turn, from the longest to the shortest. When a suffix is found that is the input string of some state X, stop searching and set the fail transition of this state to state X.
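The two steps above can be sketched directly from the definition. This is a deliberately naive illustration: the names `build_trie`, `find_state`, and `fail_target` and the list-of-dicts state representation are assumptions, and treating the empty suffix as the input string of the initial state (so that the search always terminates there) is an interpretation of the text rather than something it states.

```python
def build_trie(keywords):
    """Hypothetical helper: states[i] maps a character to the index of the next state."""
    states = [{}]
    for word in keywords:
        current = 0
        for ch in word:
            if ch not in states[current]:
                states.append({})
                states[current][ch] = len(states) - 1
            current = states[current][ch]
    return states


def find_state(states, text):
    """Return the state whose input string is `text`, or None if no such state exists."""
    current = 0
    for ch in text:
        current = states[current].get(ch)
        if current is None:
            return None
    return current


def fail_target(states, input_string):
    """Try the proper suffixes of `input_string` from longest to shortest; return the first state found."""
    for start in range(1, len(input_string) + 1):
        target = find_state(states, input_string[start:])
        if target is not None:
            return target  # the empty suffix always matches the initial state (0)
    return 0  # the initial state itself has no proper suffix
```

With the keywords `he`, `she`, `his`, and `hers`, the state for `she` fails to the state for `he`, and the state for `his` fails to the state for `s`, because `is` is not the input string of any state.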

Step 5: starting from the initial state, read the text character by character and perform state transitions according to the jump table. Fail transitions must be executed immediately: when inputting a character X at state A causes a fail transition to state B, X must then be input to state B. If B still has no transition for X, a further fail transition to the next state is needed, and so on, until the fail transition reaches the initial state or a state that has a transition for X.

Step 6: whenever a state carrying an end-of-word mark is reached, output the whole-word information stored in that state.
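Steps 1 to 6 can be put together in a short sketch. Several choices here are assumptions and not taken from the text: the class name `Automaton`, the use of Python dicts as the per-state hash tables, and in particular the breadth-first computation of fail transitions, which is the standard Aho-Corasick construction swapped in for the suffix-by-suffix search of step 4 (it is intended to yield the same fail targets with less work).

```python
from collections import deque


class Automaton:
    """Hypothetical multi-keyword matcher following the build and run stages described above."""

    def __init__(self, keywords):
        self.goto = [{}]    # per-state jump table; a dict is a hash table
        self.word = [None]  # whole-word information; None means no end-of-word mark
        self.fail = [0]     # fail transition of each state
        for keyword in keywords:
            if keyword:  # ignore empty keywords
                self._add(keyword)
        self._build_fail()

    def _add(self, keyword):
        state = 0
        for ch in keyword:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.word.append(None)
                self.fail.append(0)
                self.goto[state][ch] = nxt
            state = nxt
        self.word[state] = keyword  # end-of-word mark plus whole-word information

    def _build_fail(self):
        # Breadth-first order: a state's fail target is always shallower, so it is already final.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the initial state
        while queue:
            state = queue.popleft()
            for ch, child in self.goto[state].items():
                f = self.fail[state]
                while f != 0 and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                queue.append(child)

    def search(self, text):
        """Return (end position, keyword) for every end-of-word state reached while reading text."""
        hits = []
        state = 0
        for pos, ch in enumerate(text):
            # Fail transitions are executed immediately, re-feeding the same character.
            while state != 0 and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            if self.word[state] is not None:
                hits.append((pos, self.word[state]))
        return hits
```

For example, `Automaton(["he", "she", "his", "hers"]).search("ushers")` returns `[(3, 'she'), (5, 'hers')]`. As in the description, only the keyword stored in the state actually reached is output, so `he`, which ends at the same position as `she`, is not reported separately; reporting such nested keywords would require also following fail transitions at output time, which the text does not describe.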

In summary, while ensuring correctness, the automaton method of the present invention greatly improves the efficiency of information discovery in massive texts; using hash tables reduces the time spent looking up the jump table during automaton transitions; the whole process is automated and user-friendly.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims

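1. An automaton method for information discovery in massive short texts, characterized by comprising the following steps:

1) Automaton construction:

a. store all keywords in a Trie structure, in which tree nodes are regarded as states and tree edges as state transitions;

b. build a hash table for the transition information of each node;

c. add an end-of-word mark and whole-word information to each end-of-word state;

d. add a fail transition to each node;

e. end;

2) Automaton running:

a. read the text character by character, and perform a transition according to the transition information of the current state and the character read;

b. whenever an end-of-word state is reached, output the whole-word information stored in that state;

c. end.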

2. The automaton method for information discovery in massive short texts according to claim 1, characterized in that in step 1)-a the Trie is a tree structure whose advantage is that it uses the common prefixes of strings to reduce query time, and the Trie built in this step forms the skeleton of the automaton.

3. The automaton method for information discovery in massive short texts according to claim 1, characterized in that in step 1)-b the purpose of building a hash table for the transition information of each node is to convert lookup into computation, reducing the cost of querying the jump table. The hash table here is a dynamic hash table with an initial capacity of 16 and a fixed load factor of 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, in which the entries of the table are re-inserted into the new table.

4. The automaton method for information discovery in massive short texts according to claim 1, characterized in that in step 1)-c the end-of-word mark is used, while the automaton is running, to indicate whether the current state has recognized a keyword, and the purpose of adding whole-word information is to output the keyword that has been recognized.

5. The automaton method for information discovery in massive short texts according to claim 1, characterized in that in step 1)-d the fail transition is the default next state used when the input character cannot be found in the current jump table. The fail transition of a state points to the state for the longest proper suffix of that state's input string that can be reached by transitions, where the input string of a state is the shortest string that must be input to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself.

6. The automaton method for information discovery in massive short texts according to claim 1, characterized in that in the transition operation of step 2)-a the fail transition is executed immediately: when inputting a character X at state A causes a fail transition to state B, the character X must then be input to state B, and so on, until the resulting state is the initial state or a state that has a transition for X.