CN104077358A - Automata method for finding large number of short text information - Google Patents
Automata method for finding large number of short text information
- Publication number
- CN104077358A
- Authority
- CN
- China
- Prior art keywords
- state
- information
- suffix
- jump
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an automaton method for discovering information in massive volumes of short text. The method comprises the following steps: 1) automaton construction: a. storing all keywords in a Trie tree structure, where tree nodes are regarded as states and tree edges as state jumps; b. building a hash table for the jump information of each node; c. adding a suffix mark and whole-word information to each suffix state; d. adding a fail jump to each node; e. ending; 2) automaton running: a. reading the text character by character and performing a jump according to the jump information of the current state and the character read; b. whenever a suffix state is reached, outputting the whole-word information stored in that state; c. ending. On the premise of ensuring correctness, the method greatly improves the efficiency of discovering information in massive short text; using hash tables reduces the time spent looking up jump tables during automaton jumps; the whole process is automatic and user-friendly.
Description
Technical field
The present invention relates to the field of multi-pattern matching, and is an automaton method for massive short text information discovery.
Background technology
With the arrival of the big data era, daily life constantly produces information such as microblog posts, SMS messages, and QQ chat records. This information shares several features. Its content is short: at most 140 characters for a microblog post and 80 characters for an SMS message. Its volume is large: SMS traffic nationwide in 2012 alone reached 897.31 billion messages (data from the Ministry of Industry and Information Technology). How to process such large numbers of text messages is one of the key problems in analyzing and processing big data.
Multi-keyword matching means probing a specified text for keywords in order to decide whether that text passes detection. Multi-keyword matching technology is now widely applied, and one of the best-known applications is China's GFW system. Its main principle is that websites containing key vocabulary are blocked at the router layer. Although the technology this system uses is multi-keyword filtering, the principle of multi-keyword filtering is very similar to that of multi-keyword matching: both work by probing for keywords. They can be regarded as two applications, one positive and one negative, of the same method.
In this system we need to analyze massive numbers of short messages. Specifically, we need to find out whether each message contains the names of certain sensitive organizations and use this to determine some social attributes of the message recipient. This requirement calls for multi-text, multi-keyword filtering technology.
Summary of the invention
The problem the present invention solves is massive short text information discovery, with the main concerns being efficiency, accuracy, and automation. To solve this problem, the steps of the automaton method for massive short text information discovery are as follows:
1) Automaton construction:
a. store all keywords in a Trie tree structure, where tree nodes are regarded as states and tree edges as state jumps;
b. build a hash table for the jump information of each node;
c. add a suffix mark and whole-word information to each suffix state;
d. add a fail jump to each node;
e. end;
2) Automaton running:
a. read the text character by character, and perform a jump according to the jump information of the current state and the character read;
b. on jumping to a suffix state, output the whole-word information stored in that state;
c. end.
In the present invention, the Trie tree in step 1)-a is a tree structure whose advantage is that it uses the common prefixes of strings to reduce query time. The Trie tree structure completed in this step is the framework of the automaton.
In the present invention, the purpose of building a hash table for the jump information of each node in step 1)-b is to convert lookup into computation, reducing the overhead of querying the jump table. A dynamic hash table is used here, with an initial capacity of 16 and a load factor fixed at 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, re-adding the entries to the new table.
In the present invention, the suffix mark in step 1)-c indicates, while the automaton is running, whether the current state has recognized a keyword. The purpose of adding whole-word information is to output the recognized keyword.
In the present invention, the fail jump in step 1)-d is the default next state when the input character cannot be found in the current jump table. The fail jump of a state points to the state reached by the longest proper suffix of that state's input string for which a jump path exists. Here the input string of a state is the shortest string that must be input to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself.
In the present invention, regarding the jump operation in step 2)-a, note that fail jumps are executed immediately: when inputting a character X at state A causes a fail jump to state B, the character X must then be input to state B, and so on, until the resulting state is the initial state or a state that has a jump for X.
The beneficial effects of the invention are as follows: on the premise of ensuring correctness, the automaton method for massive short text information discovery greatly improves the efficiency of discovering information in massive text; using hash tables reduces the time spent looking up the jump table during automaton jumps; the whole process is automatic and user-friendly.
Brief description of the drawings
Fig. 1 is a flowchart of the automaton method for massive short text information discovery of the present invention.
Embodiment
To make the technical content of the present invention easier to understand, a specific embodiment is described below with reference to the accompanying drawing.
Fig. 1 is a flowchart of the automaton method for massive short text information discovery according to an embodiment of the present invention. The method comprises two stages: an automaton construction stage and an automaton execution stage. The specific steps are as follows:
Step 1: store all keywords in a Trie tree structure. The Trie tree is built as follows:
1) Create a root node, i.e. the initial state, and make it the current state.
2) Take the first character of a keyword to be added. If the current state has no jump for this character, create a new state, point the current state's jump for this character to it, and make the new state the current state. If the current state already has a jump for this character, follow it and make the target state the current state.
3) Repeat step 2) for each subsequent character until the end of the keyword.
4) Repeat steps 2) and 3), starting each keyword from the initial state, until all keywords have been added.
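The construction in Step 1 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `build_trie`, the list-of-dicts representation, and the sample keywords are assumptions made for the example.

```python
def build_trie(keywords):
    """Build a trie where each state is a dict mapping a character to a state index."""
    jumps = [{}]  # state 0 is the root node, i.e. the initial state
    for word in keywords:
        state = 0  # each keyword is added starting from the initial state
        for ch in word:
            if ch not in jumps[state]:
                # No jump for this character yet: create a new state and point to it.
                jumps.append({})
                jumps[state][ch] = len(jumps) - 1
            # Follow the jump (new or existing) and make its target the current state.
            state = jumps[state][ch]
    return jumps


# Illustrative keywords; "he", "his" and "hers" share the prefix state for "h".
trie = build_trie(["he", "she", "his", "hers"])
print(len(trie))  # 10 states, including the initial state
```

Because keywords with a common prefix reuse the same states, four keywords with eleven characters in total need only nine states beyond the root.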
Step 2: build a hash table for the jump information of each node. The hash algorithm used here is a base-31 algorithm; for a character string of length n+1, its hash value is:
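The formula itself is missing from this text. Assuming the conventional base-31 polynomial hash (the form documented for Java's `String.hashCode`, which is an assumption and not something the source states), and writing $s_0, \dots, s_n$ for the character codes of the string, a plausible reconstruction is:

```latex
h(s) = \sum_{i=0}^{n} s_i \cdot 31^{\,n-i}
     = s_0 \cdot 31^{n} + s_1 \cdot 31^{n-1} + \cdots + s_{n-1} \cdot 31 + s_n
```

Under this assumption the string "ab" (character codes 97 and 98) hashes to $97 \cdot 31 + 98 = 3105$.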
A dynamic hash table is used here, with an initial capacity of 16 and a load factor fixed at 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, re-adding the entries to the new table.
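The resizing rule can be illustrated with a small sketch. The names `hash31` and `JumpTable`, the use of separate chaining for collisions, and the 32-bit truncation of the hash are assumptions made for this example; the source only fixes the base (31), the initial capacity (16), the load factor (0.75), and the doubling-plus-rehash behaviour.

```python
def hash31(key):
    """Base-31 polynomial hash; truncation to 32 bits is an assumption of this sketch."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


class JumpTable:
    """Chained hash table: initial capacity 16, load factor 0.75, doubling on overflow."""

    LOAD_FACTOR = 0.75

    def __init__(self):
        self.capacity = 16
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    def get(self, key):
        for k, v in self.buckets[hash31(key) % self.capacity]:
            if k == key:
                return v
        return None  # no jump stored for this key

    def put(self, key, value):
        bucket = self.buckets[hash31(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.capacity * self.LOAD_FACTOR:
            self._grow()

    def _grow(self):
        # Double the capacity and re-add every entry to the new table (rehash).
        entries = [entry for bucket in self.buckets for entry in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in entries:
            self.buckets[hash31(k) % self.capacity].append((k, v))
```

With these parameters the table keeps its 16 buckets through the 12th entry (12 is not greater than 16 × 0.75) and doubles to 32 when the 13th entry is added.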
Step 3: add a suffix mark and whole-word information to each suffix node. The suffix mark indicates, while the automaton is running, whether the current state has recognized a keyword. The purpose of adding whole-word information is to output the recognized keyword.
Step 4: add a fail jump to each node. The fail jump is the default next state when the input character cannot be found in the current jump table. The fail jump of a state should point to the state reached by the longest proper suffix of that state's input string for which a jump path exists. Here the input string of a state is the shortest string that must be input to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself. The steps for adding a fail jump to a node are as follows:
1) Obtain the input string of this state.
2) Search through the proper suffixes of that input string in turn, from longest to shortest. As soon as one of them is found to be the input string of some state X, stop searching and set the fail jump of this state to state X.
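The direct suffix search of Step 4 can be sketched as below. The function names and the `inputs` list (the input string of every state) are assumptions made for illustration. The sketch also assumes that a state with no matching non-empty proper suffix fails to the initial state, which corresponds to the empty suffix. Note that the usual Aho-Corasick construction computes the same targets breadth-first from parent fail jumps; the version here follows the search the text describes.

```python
def build_states(keywords):
    """Build the trie and record the input string of every state."""
    jumps = [{}]   # jump table of each state; state 0 is the initial state
    inputs = [""]  # input string of each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in jumps[state]:
                jumps.append({})
                inputs.append(inputs[state] + ch)
                jumps[state][ch] = len(jumps) - 1
            state = jumps[state][ch]
    return jumps, inputs


def add_fail_jumps(inputs):
    """For each state, find the longest proper suffix that is itself a state's input string."""
    state_of = {s: i for i, s in enumerate(inputs)}
    fail = [0] * len(inputs)  # default: fail to the initial state (empty suffix)
    for state, s in enumerate(inputs):
        # s[1:], s[2:], ... are the non-empty proper suffixes, longest first.
        for start in range(1, len(s)):
            if s[start:] in state_of:
                fail[state] = state_of[s[start:]]
                break
    return fail


jumps, inputs = build_states(["he", "she", "his", "hers"])
fail = add_fail_jumps(inputs)
print(inputs[5], "->", inputs[fail[5]])  # she -> he
```

For the illustrative keywords, the state for "she" fails to the state for "he", and the states for "his" and "hers" both fail to the state for "s".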
Step 5: starting from the initial state, read the text character by character and perform state jumps according to the jump table. Fail jumps must be executed immediately: when inputting a character X at state A causes a fail jump to state B, the character X must then be input to state B. If B still has no jump for X, another fail jump to the next state is needed, and so on, until the fail jump reaches the initial state or a state that has a jump for X.
Step 6: on jumping to a state that carries a suffix mark, output the whole-word information stored in that state.
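Steps 5 and 6 can be sketched end to end as follows. Several points are assumptions of this sketch rather than statements from the source: the names `build` and `run`; the use of Python dicts as the per-state hash tables; computing fail jumps breadth-first (the usual Aho-Corasick construction, which yields the same targets as the suffix search of Step 4); and letting each state inherit the whole-word information of its fail target, so that a keyword ending inside a longer match (such as "he" inside "she") is also reported.

```python
from collections import deque


def build(keywords):
    jumps, words = [{}], [[]]  # per state: jump table and whole-word information
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in jumps[state]:
                jumps.append({})
                words.append([])
                jumps[state][ch] = len(jumps) - 1
            state = jumps[state][ch]
        words[state].append(word)  # a non-empty list doubles as the suffix mark
    fail = [0] * len(jumps)
    queue = deque(jumps[0].values())  # depth-1 states fail to the initial state
    while queue:
        state = queue.popleft()
        for ch, nxt in jumps[state].items():
            f = fail[state]
            while f and ch not in jumps[f]:
                f = fail[f]
            fail[nxt] = jumps[f].get(ch, 0)
            # Assumption: inherit keywords that end at the fail target.
            words[nxt] = words[nxt] + words[fail[nxt]]
            queue.append(nxt)
    return jumps, fail, words


def run(text, jumps, fail, words):
    state, found = 0, []
    for pos, ch in enumerate(text):
        # Fail jumps execute immediately; the same character is retried each time.
        while state and ch not in jumps[state]:
            state = fail[state]
        state = jumps[state].get(ch, 0)
        for word in words[state]:  # output whole-word information on a suffix state
            found.append((pos - len(word) + 1, word))
    return found


automaton = build(["he", "she", "his", "hers"])
print(run("ushers", *automaton))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

In the trace for "ushers", reading "r" in the state for "she" triggers a fail jump to the state for "he", which does have a jump for "r", so the match on "hers" is not lost.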
In summary, on the premise of ensuring correctness, the automaton method for massive short text information discovery of the present invention greatly improves the efficiency of discovering information in massive text; using hash tables reduces the time spent looking up the jump table during automaton jumps; the whole process is automatic and user-friendly.
Although the present invention has been disclosed above by way of a preferred embodiment, this is not intended to limit the invention. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (6)
1. An automaton method for massive short text information discovery, characterized by comprising the following steps:
1) Automaton construction:
a. store all keywords in a Trie tree structure, where tree nodes are regarded as states and tree edges as state jumps;
b. build a hash table for the jump information of each node;
c. add a suffix mark and whole-word information to each suffix state;
d. add a fail jump to each node;
e. end;
2) Automaton running:
a. read the text character by character, and perform a jump according to the jump information of the current state and the character read;
b. on jumping to a suffix state, output the whole-word information stored in that state;
c. end.
2. The automaton method for massive short text information discovery according to claim 1, characterized in that the Trie tree in step 1)-a is a tree structure whose advantage is that it uses the common prefixes of strings to reduce query time. The Trie tree structure completed in this step is the framework of the automaton.
3. The automaton method for massive short text information discovery according to claim 1, characterized in that the purpose of building a hash table for the jump information of each node in step 1)-b is to convert lookup into computation, reducing the overhead of querying the jump table. A dynamic hash table is used, with an initial capacity of 16 and a load factor fixed at 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, re-adding the entries to the new table.
4. The automaton method for massive short text information discovery according to claim 1, characterized in that the suffix mark in step 1)-c indicates, while the automaton is running, whether the current state has recognized a keyword. The purpose of adding whole-word information is to output the recognized keyword.
5. The automaton method for massive short text information discovery according to claim 1, characterized in that the fail jump in step 1)-d is the default next state when the input character cannot be found in the current jump table. The fail jump of a state points to the state reached by the longest proper suffix of that state's input string for which a jump path exists, where the input string of a state is the shortest string that must be input to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself.
6. The automaton method for massive short text information discovery according to claim 1, characterized in that, regarding the jump operation in step 2)-a, fail jumps are executed immediately: when inputting a character X at state A causes a fail jump to state B, the character X must then be input to state B, and so on, until the resulting state is the initial state or a state that has a jump for X.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410243718.3A CN104077358A (en) | 2014-06-03 | 2014-06-03 | Automata method for finding large number of short text information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410243718.3A CN104077358A (en) | 2014-06-03 | 2014-06-03 | Automata method for finding large number of short text information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104077358A (en) | 2014-10-01 |
Family
ID=51598612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410243718.3A Pending CN104077358A (en) | 2014-06-03 | 2014-06-03 | Automata method for finding large number of short text information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104077358A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016101552A1 (en) * | 2014-12-25 | 2016-06-30 | 深圳市中兴微电子技术有限公司 | Message detection method and device, and storage medium |
CN108133052A (en) * | 2018-01-18 | 2018-06-08 | 广州汇智通信技术有限公司 | A kind of searching method of multiple key, system, medium and equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412858A (en) * | 2012-07-02 | 2013-11-27 | 清华大学 | Method for large-scale feature matching of text content or network content analyses |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412858A (en) * | 2012-07-02 | 2013-11-27 | 清华大学 | Method for large-scale feature matching of text content or network content analyses |
Non-Patent Citations (2)
Title |
---|
Anonymous: "Strings: KMP, Eentend-Kmp, automaton, trie graph, trie tree, suffix tree, suffix array", 《DUANPLE.BLOG.163.COM/BLOG/STATIC/709717672009825004092/》 * |
Anonymous: "My string report (AC automaton in detail, no suffix array)", 《WONDERFLOW.GITHUB.IO/BLOG/2012/09/05/E68891E79A84E5AD97E7ACA6E4B8B2E68AA5E5918A/》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016101552A1 (en) * | 2014-12-25 | 2016-06-30 | 深圳市中兴微电子技术有限公司 | Message detection method and device, and storage medium |
CN105791124A (en) * | 2014-12-25 | 2016-07-20 | 深圳市中兴微电子技术有限公司 | Message detection method and device |
CN105791124B (en) * | 2014-12-25 | 2019-04-30 | 深圳市中兴微电子技术有限公司 | Message detecting method and device |
CN108133052A (en) * | 2018-01-18 | 2018-06-08 | 广州汇智通信技术有限公司 | A kind of searching method of multiple key, system, medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109670163B (en) | Information identification method, information recommendation method, template construction method and computing device | |
CN101950312B (en) | Method for analyzing webpage content of internet | |
CN105224692A (en) | Support the system and method for the SDN multilevel flow table parallel search of polycaryon processor | |
TWI554897B (en) | Mail index establishment method and system, mail search method and system | |
CN101551803A (en) | Method and device for establishing pattern matching state machine and pattern recognition | |
CN103942308A (en) | Method and device for detecting large-scale social network communities | |
CN104025520B (en) | Lookup table creation method and query method, and controller, forwarding device and system therefor | |
CN106599091B (en) | RDF graph structure storage and index method based on key value storage | |
WO2015021879A1 (en) | Method and device for mining data regular expression | |
CN102870116A (en) | Method and apparatus for content matching | |
CN101630323A (en) | Method for compressing space of finite automaton | |
CN105138649B (en) | Searching method, device and the terminal of data | |
US9251290B2 (en) | Method, server, terminal device, and computer-readable recording medium for selectively removing nondeterminism of nondeterministic finite automata | |
CN113312539B (en) | Method, device, equipment and medium for providing search service | |
CN104077358A (en) | Automata method for finding large number of short text information | |
CN108984626B (en) | Data processing method and device and server | |
US10885453B2 (en) | Calculation device, calculation method, and non-transitory computer-readable recording medium | |
CN103020186B (en) | A kind of document retrieval method based on embedded device, device and equipment | |
CN112860412A (en) | Service data processing method and device, electronic equipment and storage medium | |
CN102253983A (en) | Method and system for identifying Chinese high-risk words | |
CN112214494B (en) | Retrieval method and device | |
Dong et al. | Content-aware partial compression for textual big data analysis in hadoop | |
CN103176953A (en) | Text processing method and text processing system | |
CN108509438A (en) | A kind of ElasticSearch fragments extended method | |
CN108304186A (en) | A kind of method and apparatus executing multi-mode operation based on synthesized configuration file |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20141001 |