CN104077358A - Automata method for finding large number of short text information - Google Patents

Info

Publication number
CN104077358A
Authority
CN
China
Prior art keywords
state
information
suffix
jump
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410243718.3A
Other languages
Chinese (zh)
Inventor
王崇骏
杨骏元
彭岳
杨骏
谢俊元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201410243718.3A
Publication of CN104077358A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an automaton method for finding information in a large number of short texts. The method comprises the following steps: 1) automaton construction: a. storing all keywords in a Trie structure, wherein tree nodes are regarded as states and tree edges are regarded as state jumps; b. building a hash table for the jump information of each node; c. adding a suffix mark and whole-word information to each suffix state (a state at which a keyword ends); d. adding a fail jump to each node; e. ending; 2) automaton running: a. reading a text character by character, and jumping according to the jump information of the current state and the character read; b. whenever a suffix state is reached, outputting the whole-word information stored in that state; c. ending. While ensuring correctness, the method improves the efficiency of finding information in a large number of short texts; the use of hash tables reduces the time spent looking up jump tables during automaton jumps; and the whole process is automatic, giving a friendly user experience.

Description

Automaton method for information discovery in massive short texts
Technical field
The present invention relates to the field of multi-pattern matching, and is an automaton method for information discovery in massive short texts.
Background art
With the arrival of the big-data era, everyday life produces large volumes of information such as microblog posts, SMS messages, and QQ chat records. This information shares several features: each item is short (a microblog post is at most 140 characters, an SMS message 80 characters), and the quantity is large (national SMS traffic in 2012 alone reached 897.31 billion messages, according to the Ministry of Industry and Information Technology). How to handle this many text messages is one of the key problems in analysing and processing big data.
Multi-keyword matching means probing a given text to determine whether it contains any keyword, and thereby judging whether the text passes inspection. Multi-keyword matching is already widely used; one of the best-known applications is China's GFW system, whose main principle is that websites containing key words are blocked at the router layer. Although the technique that system uses is multi-keyword filtering, the principles of multi-keyword filtering and multi-keyword matching are very similar: both work by probing for keywords. They can be regarded as two complementary applications of the same method.
In the present system, we need to analyse a massive number of SMS messages. Specifically, we need to find out whether each message contains the names of certain sensitive organizations, and from this determine certain social attributes of the message recipient. This requirement calls for multi-text, multi-keyword filtering.
Summary of the invention
The problem the present invention solves is information discovery in massive short texts, with the main concerns being efficiency, accuracy, and automation. To solve this problem, the steps of the automaton method for information discovery in massive short texts are as follows:
1) Automaton construction:
a. Store all keywords in a Trie structure, where tree nodes are regarded as states and tree edges as state jumps;
b. Build a hash table for the jump information of each node;
c. Add a suffix mark and whole-word information to each suffix state (a state at which a keyword ends);
d. Add a fail jump to each node;
e. End;
2) Automaton running:
a. Read the text character by character, and jump according to the jump information of the current state and the character read;
b. When a suffix state is reached, output the whole-word information stored in that state;
c. End.
In the present invention, the Trie of step 1)-a is a tree structure whose advantage is that it uses the common prefixes of strings to reduce query time. The Trie built in this step is the skeleton of the automaton.
In the present invention, the purpose of building a hash table for the jump information of each node in step 1)-b is to turn a lookup into a computation, reducing the cost of querying the jump table. A dynamic hash table is used, with an initial capacity of 16 and a load factor fixed at 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, in which the existing entries are re-inserted into the new table.
In the present invention, the suffix mark of step 1)-c is used while the automaton runs to indicate whether the current state has recognized a keyword. The whole-word information is added so that the recognized keyword can be output.
In the present invention, the fail jump of step 1)-d is the default next state taken when the input character cannot be found in the current jump table. The fail jump of a state points to the state reached by the longest proper suffix of that state's input string for which such a state exists. Here the input string of a state is the shortest string that must be read to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself.
In the present invention, regarding the jump operation of step 2)-a, note that fail jumps are executed immediately: when state A reads a character X and a fail jump leads to state B, the character X must then be fed to state B, and so on, until the resulting state is the initial state or a state that has a jump on X.
The beneficial effects of the invention are as follows. While ensuring correctness, the automaton method improves the efficiency of information discovery in massive texts; the use of hash tables reduces the time spent looking up jump tables during automaton jumps; and the whole process is automatic, giving a friendly user experience.
Brief description of the drawings
Fig. 1 is a flowchart of the automaton method for information discovery in massive short texts according to the present invention.
Embodiment
To make the technical content of the present invention easier to understand, a specific embodiment is described below with reference to the accompanying drawing.
Fig. 1 is a flowchart of the automaton method for information discovery in massive short texts according to the embodiment of the present invention. The method comprises two stages: the automaton construction stage and the automaton execution stage. The concrete steps are as follows:
Step 1: store all keywords in a Trie structure. The Trie is built as follows:
1) Create a root node, i.e. the initial state, and make it the current state.
2) Take the first character of a keyword to be added (on later passes, its next character). If the current state has no jump on this character, create a new state, point the current state's jump on this character to it, and make the new state the current state. If the current state already has a jump on this character, follow it and make the target state the current state.
3) Repeat step 2) until the end of the keyword is reached.
4) Repeat steps 2) and 3), starting again from the initial state each time, until all keywords have been added.
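The construction steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's own code: the names `State`, `build_trie`, and `count_states` are hypothetical, and a built-in `dict` stands in for the per-node jump table. The end-of-keyword mark and whole word, which the text introduces in Step 3, are recorded in the same pass purely for convenience.

```python
class State:
    """One trie node, treated as an automaton state (hypothetical name)."""

    def __init__(self):
        self.jumps = {}      # character -> next State
        self.is_end = False  # mark: a keyword ends at this state
        self.word = None     # the whole keyword stored at an end state


def build_trie(keywords):
    root = State()                 # 1) the root node is the initial state
    for keyword in keywords:       # 4) repeat for every keyword
        current = root             # each keyword starts from the initial state
        for ch in keyword:         # 2)-3) walk an existing jump or create one
            if ch not in current.jumps:
                current.jumps[ch] = State()
            current = current.jumps[ch]
        current.is_end = True
        current.word = keyword
    return root


def count_states(state):
    """Count all states reachable from `state`, including itself."""
    return 1 + sum(count_states(child) for child in state.jumps.values())


root = build_trie(["he", "she", "his", "hers"])
print(count_states(root))  # prints 10
```

Because "he" and "hers" share the prefix "he", they share states, which is the prefix-sharing advantage the description attributes to the Trie.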
Step 2: build a hash table for the jump information of each node. The hash algorithm used here is a base-31 algorithm; for a string of length n+1, its hash value is: [formula not reproduced in this text]
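The hash formula itself is missing from this text. Assuming it is the conventional base-31 polynomial hash (the scheme used by Java's `String.hashCode`, apart from Java's 32-bit wrap-around), a string s[0] s[1] ... s[n] of length n+1 would hash to s[0]*31^n + s[1]*31^(n-1) + ... + s[n]. The sketch below computes that value; the function name `base31_hash` is hypothetical, and the exact formula in the original publication may differ.

```python
def base31_hash(text):
    """Polynomial hash in base 31 over the character codes of `text`.

    Assumption: this is the conventional base-31 scheme; the patent's own
    formula is not available in this text.
    """
    value = 0
    for ch in text:
        # Horner's rule: multiplying by 31 at each step raises the power
        # applied to every earlier character by one.
        value = value * 31 + ord(ch)
    return value


print(base31_hash("ab"))  # 97 * 31 + 98 = 3105
```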
A dynamic hash table is used here, with an initial capacity of 16 and a load factor fixed at 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, in which the existing entries are re-inserted into the new table.
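The resizing rule can be illustrated with a small chained hash table. This is a sketch under assumptions: the class name `JumpTable`, the use of separate chaining for collisions, and the base-31 hash function are illustrative choices; only the initial capacity of 16, the load factor of 0.75, and the double-then-rehash behaviour come from the text.

```python
class JumpTable:
    """Chained hash table that doubles when size > capacity * load factor."""

    INITIAL_CAPACITY = 16  # from the text
    LOAD_FACTOR = 0.75     # from the text

    def __init__(self):
        self.capacity = self.INITIAL_CAPACITY
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    @staticmethod
    def _hash(key):
        # Assumed base-31 polynomial hash over character codes.
        value = 0
        for ch in key:
            value = value * 31 + ord(ch)
        return value

    def _bucket(self, key):
        return self.buckets[self._hash(key) % self.capacity]

    def get(self, key, default=None):
        for stored_key, stored_value in self._bucket(key):
            if stored_key == key:
                return stored_value
        return default

    def put(self, key, value):
        bucket = self._bucket(key)
        for index, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                bucket[index] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.capacity * self.LOAD_FACTOR:
            self._grow()

    def _grow(self):
        # Double the capacity, then rehash every entry into the new buckets.
        entries = [entry for bucket in self.buckets for entry in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for key, value in entries:
            self._bucket(key).append((key, value))


table = JumpTable()
for code in range(ord("a"), ord("a") + 12):
    table.put(chr(code), code)
print(table.capacity)  # 16: twelve entries do not exceed 16 * 0.75
table.put("m", ord("m"))
print(table.capacity)  # 32: the thirteenth entry triggers doubling
```

With these parameters the first resize happens on the thirteenth distinct key, since 16 * 0.75 = 12.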
Step 3: add a suffix mark and whole-word information to each suffix node (a node at which a keyword ends). The suffix mark is used while the automaton runs to indicate whether the current state has recognized a keyword. The whole-word information is added so that the recognized keyword can be output.
Step 4: add a fail jump to each node. The fail jump is the default next state taken when the input character cannot be found in the current jump table. The fail jump of a state should point to the state reached by the longest proper suffix of that state's input string for which such a state exists. Here the input string of a state is the shortest string that must be read to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself. The steps for adding a fail jump to a node are as follows:
1) Obtain the input string of this state.
2) Search its proper suffixes in order, from longest to shortest. On finding one that is the input string of some state X, stop searching and set the fail jump of this state to state X.
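The two sub-steps above amount to a direct suffix search, sketched below. The names `State`, `find_state`, and `add_fail_jumps` are hypothetical. One assumption is made explicit: the empty string is treated as the input string of the initial state, so the search always terminates there if no longer suffix matches.

```python
class State:
    def __init__(self, text=""):
        self.text = text  # input string: shortest string leading to this state
        self.jumps = {}   # character -> next State
        self.fail = None  # fail jump target (stays None for the initial state)


def build_trie(keywords):
    root = State()
    for keyword in keywords:
        current = root
        for ch in keyword:
            if ch not in current.jumps:
                current.jumps[ch] = State(current.text + ch)
            current = current.jumps[ch]
    return root


def find_state(root, text):
    """Return the state whose input string is `text`, or None if none exists."""
    current = root
    for ch in text:
        current = current.jumps.get(ch)
        if current is None:
            return None
    return current


def add_fail_jumps(root):
    pending = list(root.jumps.values())
    while pending:
        state = pending.pop()
        pending.extend(state.jumps.values())
        # Try proper suffixes from longest to shortest. The last candidate is
        # the empty string, which find_state maps to the initial state.
        for start in range(1, len(state.text) + 1):
            target = find_state(root, state.text[start:])
            if target is not None:
                state.fail = target
                break


root = build_trie(["he", "she", "his", "hers"])
add_fail_jumps(root)
print(find_state(root, "she").fail.text)  # prints he
```

This direct search is simple but repeats work for every state. The classic Aho-Corasick construction computes the same links breadth-first from each parent's fail jump, which the text does not describe but which would be a natural optimisation.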
Step 5: starting from the initial state, read the text character by character and perform state jumps according to the jump tables. Fail jumps must be executed immediately: when state A reads a character X and a fail jump leads to state B, the character X must then be fed to state B. If B still has no jump on X, a further fail jump to the next state is needed, and so on, until the fail jump reaches the initial state or reaches a state that has a jump on X.
Step 6: when a state carrying a suffix mark is reached, output the whole-word information stored in that state.
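Steps 5 and 6 can be sketched end to end as follows. Several things here are assumptions rather than the patent's own design: the names `State`, `build_automaton`, and `find_keywords` are hypothetical; fail jumps are built breadth-first, as in the classic Aho-Corasick construction, which produces the same links as the suffix search of Step 4; and each state also inherits the keywords of its fail target, a step the text does not mention but without which a keyword that is a proper suffix of a longer path (such as "he" inside "she") would go unreported.

```python
from collections import deque


class State:
    def __init__(self):
        self.jumps = {}   # character -> next State
        self.fail = None  # fail jump target
        self.words = []   # keywords to output when this state is reached


def build_automaton(keywords):
    root = State()
    for keyword in keywords:
        current = root
        for ch in keyword:
            if ch not in current.jumps:
                current.jumps[ch] = State()
            current = current.jumps[ch]
        current.words.append(keyword)
    queue = deque()
    for child in root.jumps.values():
        child.fail = root
        queue.append(child)
    while queue:
        state = queue.popleft()
        for ch, child in state.jumps.items():
            target = state.fail
            while target is not root and ch not in target.jumps:
                target = target.fail
            child.fail = target.jumps.get(ch, root)
            # Assumption beyond the text: inherit keywords ending at the
            # fail target so that suffix keywords are also reported.
            child.words = child.words + child.fail.words
            queue.append(child)
    return root


def find_keywords(root, text):
    hits = []
    state = root
    for position, ch in enumerate(text):
        # Fail jumps are taken immediately, retrying the same character,
        # until the initial state or a state with a jump on `ch` is reached.
        while state is not root and ch not in state.jumps:
            state = state.fail
        state = state.jumps.get(ch, root)
        for word in state.words:
            hits.append((position - len(word) + 1, word))
    return hits


root = build_automaton(["he", "she", "his", "hers"])
print(find_keywords(root, "ushers"))
# prints [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Each hit is a pair of start offset and keyword. The text is scanned once, and no character is re-read from the input; only the automaton state moves backwards along fail jumps.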
In summary, while ensuring correctness, the automaton method of the present invention improves the efficiency of information discovery in massive texts; the use of hash tables reduces the time spent looking up jump tables during automaton jumps; and the whole process is automatic, giving a friendly user experience.
Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. A person of ordinary skill in the technical field of the invention may make various modifications and variations without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the claims.

Claims (6)

1. An automaton method for information discovery in massive short texts, characterized by comprising the following steps:
1) Automaton construction:
a. Store all keywords in a Trie structure, where tree nodes are regarded as states and tree edges as state jumps;
b. Build a hash table for the jump information of each node;
c. Add a suffix mark and whole-word information to each suffix state (a state at which a keyword ends);
d. Add a fail jump to each node;
e. End;
2) Automaton running:
a. Read the text character by character, and jump according to the jump information of the current state and the character read;
b. When a suffix state is reached, output the whole-word information stored in that state;
c. End.
2. The automaton method for information discovery in massive short texts according to claim 1, characterized in that the Trie of step 1)-a is a tree structure whose advantage is that it uses the common prefixes of strings to reduce query time. The Trie built in this step is the skeleton of the automaton.
3. The automaton method for information discovery in massive short texts according to claim 1, characterized in that the purpose of building a hash table for the jump information of each node in step 1)-b is to turn a lookup into a computation, reducing the cost of querying the jump table. A dynamic hash table is used, with an initial capacity of 16 and a load factor fixed at 0.75. When the number of entries in the hash table exceeds the current capacity multiplied by the load factor, the capacity is automatically doubled and a rehash is performed, in which the existing entries are re-inserted into the new table.
4. The automaton method for information discovery in massive short texts according to claim 1, characterized in that the suffix mark of step 1)-c is used while the automaton runs to indicate whether the current state has recognized a keyword. The whole-word information is added so that the recognized keyword can be output.
5. The automaton method for information discovery in massive short texts according to claim 1, characterized in that the fail jump of step 1)-d is the default next state taken when the input character cannot be found in the current jump table. The fail jump of a state points to the state reached by the longest proper suffix of that state's input string for which such a state exists, where the input string of a state is the shortest string that must be read to move from the initial state to that state, and the proper suffixes of a string are all of its suffixes except the string itself.
6. The automaton method for information discovery in massive short texts according to claim 1, characterized in that, regarding the jump operation of step 2)-a, fail jumps are executed immediately: when state A reads a character X and a fail jump leads to state B, the character X must then be fed to state B, and so on, until the resulting state is the initial state or a state that has a jump on X.
CN201410243718.3A 2014-06-03 2014-06-03 Automata method for finding large number of short text information Pending CN104077358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410243718.3A CN104077358A (en) 2014-06-03 2014-06-03 Automata method for finding large number of short text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410243718.3A CN104077358A (en) 2014-06-03 2014-06-03 Automata method for finding large number of short text information

Publications (1)

Publication Number Publication Date
CN104077358A true CN104077358A (en) 2014-10-01

Family

ID=51598612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410243718.3A Pending CN104077358A (en) 2014-06-03 2014-06-03 Automata method for finding large number of short text information

Country Status (1)

Country Link
CN (1) CN104077358A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016101552A1 (en) * 2014-12-25 2016-06-30 深圳市中兴微电子技术有限公司 Message detection method and device, and storage medium
CN108133052A (en) * 2018-01-18 2018-06-08 广州汇智通信技术有限公司 A kind of searching method of multiple key, system, medium and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412858A (en) * 2012-07-02 2013-11-27 清华大学 Method for large-scale feature matching of text content or network content analyses

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412858A (en) * 2012-07-02 2013-11-27 清华大学 Method for large-scale feature matching of text content or network content analyses

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Anonymous: "Strings: KMP, Extend-KMP, automata, trie graph, trie tree, suffix tree, suffix array", 《DUANPLE.BLOG.163.COM/BLOG/STATIC/709717672009825004092/》 *
Anonymous: "My string report (AC automaton in detail, suffix array not covered)", 《WONDERFLOW.GITHUB.IO/BLOG/2012/09/05/E68891E79A84E5AD97E7ACA6E4B8B2E68AA5E5918A/》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016101552A1 (en) * 2014-12-25 2016-06-30 深圳市中兴微电子技术有限公司 Message detection method and device, and storage medium
CN105791124A (en) * 2014-12-25 2016-07-20 深圳市中兴微电子技术有限公司 Message detection method and device
CN105791124B (en) * 2014-12-25 2019-04-30 深圳市中兴微电子技术有限公司 Message detecting method and device
CN108133052A (en) * 2018-01-18 2018-06-08 广州汇智通信技术有限公司 A kind of searching method of multiple key, system, medium and equipment

Similar Documents

Publication Publication Date Title
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN101950312B (en) Method for analyzing webpage content of internet
CN105224692A (en) Support the system and method for the SDN multilevel flow table parallel search of polycaryon processor
TWI554897B (en) Mail index establishment method and system, mail search method and system
CN101551803A (en) Method and device for establishing pattern matching state machine and pattern recognition
CN103942308A (en) Method and device for detecting large-scale social network communities
CN104025520B (en) Lookup table creation method and query method, and controller, forwarding device and system therefor
CN106599091B (en) RDF graph structure storage and index method based on key value storage
WO2015021879A1 (en) Method and device for mining data regular expression
CN102870116A (en) Method and apparatus for content matching
CN101630323A (en) Method for compressing space of finite automaton
CN105138649B (en) Searching method, device and the terminal of data
US9251290B2 (en) Method, server, terminal device, and computer-readable recording medium for selectively removing nondeterminism of nondeterministic finite automata
CN113312539B (en) Method, device, equipment and medium for providing search service
CN104077358A (en) Automata method for finding large number of short text information
CN108984626B (en) Data processing method and device and server
US10885453B2 (en) Calculation device, calculation method, and non-transitory computer-readable recording medium
CN103020186B (en) A kind of document retrieval method based on embedded device, device and equipment
CN112860412A (en) Service data processing method and device, electronic equipment and storage medium
CN102253983A (en) Method and system for identifying Chinese high-risk words
CN112214494B (en) Retrieval method and device
Dong et al. Content-aware partial compression for textual big data analysis in hadoop
CN103176953A (en) Text processing method and text processing system
CN108509438A (en) A kind of ElasticSearch fragments extended method
CN108304186A (en) A kind of method and apparatus executing multi-mode operation based on synthesized configuration file

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141001
