CN105843854A - Network data oriented rapid recognition system for topic document - Google Patents

Publication number
CN105843854A
Authority
CN
China
Prior art keywords
document
backtracking
rule
time
processing module
Legal status
Granted
Application number
CN201610150817.6A
Other languages
Chinese (zh)
Other versions
CN105843854B (en
Inventor
程工
刘春阳
庞琳
王卿
李雄
张旭
马宏远
张丽
毕明珠
刘玮
贺敏
杨亚茹
Current Assignee
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Application filed by National Computer Network and Information Security Management Center
Publication of CN105843854A
Application granted
Publication of CN105843854B
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a network data oriented rapid recognition system for topic documents, which recognizes topics rapidly by efficiently matching documents against different rules. The system mainly consists of a document acquisition module, a document result storage module, a polling monitoring module, a real-time service interface, a history service interface, a rule tree construction module, a real-time filtering processing module and a backtracking filtering processing module. The system processes real-time data and valid historical data simultaneously, can process large volumes of document data in batches, supports dynamic hot-swapping of the processing algorithm while the system keeps running normally, and continues to run normally after the content of the input/output interfaces changes. It thereby overcomes the defects of some existing document recognition systems, which cannot be freely modified and have poor flexibility and reusability, and it adapts well to changing requirements.

Description

A network data oriented rapid recognition system for topic documents
Technical field
The invention belongs to the technical field of computer applications and network information, and specifically relates to a network data oriented rapid recognition system for topic documents.
Background art
With the spread of the Internet and mobile phones, the volume of network data produced by Internet users has grown explosively, and content from forums, blogs, news sites and social media is everywhere. According to ZDNet's 2013 technology report, China produced more than 0.8 ZB of data in 2013 (equivalent to 800 million TB), twice the 2012 figure and equal to the data volume of the entire world in 2009. By 2020, the data volume produced by China is expected to exceed 8.5 ZB, more than 10 times the 2013 figure. Finding useful information in such massive user data is not easy. Social media are fast and immediate; while they bring convenience to people's lives, the speed at which they generate data far exceeds people's capacity to use it, so that it is difficult to obtain useful information from the data in time.
Topics of events in network data cover every aspect of society, including politics, economics, the military, finance, daily life and entertainment. In politics, for example, the network has become one of the main carriers of public opinion: people freely express their emotions, attitudes and views online, forming a force of public opinion that cannot be ignored and that influences the shift of social focus and the development of certain events. Public opinion information can therefore be obtained by analyzing network data. In other areas, the timeliness of network data can be used to monitor specific events such as earthquakes, and spatio-temporal information can be used to locate the epicenter. As another example, a model for predicting box office revenue can be built by mining online discussion of a film or event. Rapidly recognizing the topic to which each document in massive data belongs is therefore of great practical significance. However, most existing topic recognition systems suffer from a narrow range of functions, poor generality, poor adaptability to business changes and low recognition efficiency.
Summary of the invention
To address the defects of the prior art, the present invention provides a network data oriented rapid recognition method and system for topic documents, which can effectively solve the above problems.
The technical solution adopted by the present invention is as follows:
The present invention provides a network data oriented rapid recognition system for topic documents, comprising a document acquisition module, a document result storage module, a polling monitoring module, a real-time service interface, a history service interface, a rule tree construction module, a real-time filtering processing module and a backtracking filtering processing module; wherein the polling monitoring module is connected to the real-time filtering processing module through the real-time service interface, and the real-time filtering processing module is connected to the document acquisition module and the document result storage module respectively; the polling monitoring module is connected to the backtracking filtering processing module through the history service interface; and the backtracking filtering processing module is connected to the document acquisition module and the document result storage module respectively;
The polling monitoring module is used to distribute event rules to the real-time service interface and the history service interface respectively;
The rule tree construction module is used to receive event rules from the polling monitoring module and to build, from those event rules, a double-array trie that supports dynamic hot-swapping;
The real-time filtering processing module is used to obtain real-time document data through the document acquisition module and to process the real-time document data into a real-time document structure; it then scans the real-time document structure with the double-array trie, recognizes the real-time documents that meet the topic requirements, and stores the recognition results through the document result storage module;
The backtracking filtering processing module is used to obtain historical document data through the document acquisition module and to process the historical document data into a historical document structure; it then scans the historical document structure with the double-array trie, recognizes the historical documents that meet the topic requirements, and stores the recognition results through the document result storage module.
Preferably, the polling monitoring module is further used as follows:
While distributing event rules to the real-time service interface and the history service interface, the polling monitoring module also receives the heartbeats returned by the real-time service interface and the history service interface. From these it detects whether the state of the real-time filtering processing module connected to the real-time service interface is normal and, if it is abnormal, restarts the real-time filtering processing module through the real-time service interface. At the same time it detects whether the state of the backtracking filtering processing module connected to the history service interface is normal and, if it is abnormal, restarts the backtracking filtering processing module through the history service interface.
Preferably, the rule tree construction module is specifically used as follows:
Step A.2.1: When the rule tree construction module receives event rules distributed by the polling monitoring module, it checks whether the tree-building flag is true or false, and thereby determines whether another tree-building process is running;
Step A.2.2: If the tree-building flag is true, another tree-building process exists; the received event rules are then backed up in a global event rule variable and written to the corresponding log. After waiting 1 second, the flag is checked again, and this repeats until no other tree-building process is running;
If the tree-building flag is false, no other tree-building process exists, and the event rules are used to build a double-array trie. While the trie is being built, on the one hand the tree-building process is locked, and the lock is released once the tree has been built; on the other hand the tree-building flag is set to true;
Here, building the double-array trie means: extract the word set from the event rules, build a double-array trie, and point a local smart pointer at the newly built trie. The global smart pointer is then used to update the global trie; the update is performed under a lock, which is released once the update completes. The tree-building flag is then set to false, indicating that this tree-building process has finished;
Step A.2.3: Check whether the backed-up event rules are empty. If they are empty, finish and continue waiting for new event rules; if they are not empty, return to step A.2.2.
Preferably, extracting the word set from the event rules and building the double-array trie specifically comprises the following steps:
Step A.2.2.1: The event rules contain multiple word-set variables. Traverse each word-set variable in the event rules and check whether the word-set variable set already contains it; if not, store the traversed word-set variable in the word-set variable set; if so, skip it;
Step A.2.2.2: The event rules contain multiple rules. Traverse each rule in the event rules and check whether the rule set already contains it; if not, store the traversed rule in the rule set; if so, skip it;
Step A.2.2.3: Use the rule parsing algorithm to traverse the word-set variable set and the rule set, and extract the word set that conforms to the rules;
Step A.2.2.4: Build the double-array trie from the extracted word set.
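Steps A.2.2.1 to A.2.2.3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `event_rules` layout (a list of dicts with `variables` and `rules` keys) and the function name `extract_word_set` are hypothetical, and the regular expressions are a much simplified stand-in for the patent's rule parsing algorithm, which is not specified here. The sketch assumes that rules refer to variables as `{var}` and to literal keywords as `"string"`.

```python
import re

def extract_word_set(event_rules):
    """Collect unique word-set variables and rules, then extract keywords.

    `event_rules` is a hypothetical list of dicts with two keys:
    "variables" (name -> list of words) and "rules" (list of rule strings).
    """
    variable_set = {}    # de-duplicated word-set variables (step A.2.2.1)
    rule_set = []        # de-duplicated rules in first-seen order (step A.2.2.2)
    seen_rules = set()
    for event in event_rules:
        for name, words in event["variables"].items():
            if name not in variable_set:   # already stored: skip
                variable_set[name] = words
        for rule in event["rules"]:
            if rule not in seen_rules:     # already stored: skip
                seen_rules.add(rule)
                rule_set.append(rule)

    # Step A.2.2.3, simplified: take quoted strings and resolve {var} references.
    keywords = set()
    for rule in rule_set:
        keywords.update(re.findall(r'"([^"]+)"', rule))
        for var in re.findall(r'\{(\w+)\}', rule):
            keywords.update(variable_set.get(var, []))
    return sorted(keywords)
```

The resulting keyword list is what step A.2.2.4 would hand to the trie builder.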
Preferably, the real-time filtering processing module is specifically used as follows:
Step B.1: The real-time filtering processing module reads the real-time filtering configuration file and obtains the configuration information;
Step B.2: From the configuration information, the real-time filtering processing module obtains the channel types in which real-time documents reside and the number of threads for each channel; it then starts the corresponding real-time document processing threads for each channel;
Each real-time document processing thread carries out the real-time filtering process, which includes:
Step B.2.1: The real-time document processing thread reads several real-time documents through the input module and packs them into a document structure;
Step B.2.2: The real-time document processing thread assembles the documents: the fields of the document structure are concatenated in field order, separated by a special character, and the offsets of each document and each field are recorded;
Step B.2.3: The real-time document processing thread matches the assembled document against the rules using the double-array trie, and recognizes the documents in the real-time document structure that meet the topic requirements.
Preferably, step B.2.3 comprises the following steps:
Step B.2.3.1: The real-time document processing thread checks whether the double-array trie has been fully built; if not, it waits; if so, it performs step B.2.3.2;
Step B.2.3.2: The real-time document processing thread obtains the assembled document, which is a long string structure; it then scans the long string structure with the double-array trie to perform keyword matching and obtain the scan results; the scan of the long string structure with the double-array trie must be performed under a lock;
Step B.2.3.3: Rule parsing: apply set operations to the scan results according to the rule syntax;
Step B.2.3.4: Information source range filtering: filter the results of step B.2.3.3 by information source range, removing those whose sources fall outside the range required by the event rules, to obtain the documents relevant to the required topic.
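The document assembly of step B.2.2 and the scan of step B.2.3.2 can be sketched as below. The separator character, the field names `title` and `body`, and the function names are assumptions made for illustration. A naive substring search stands in for the double-array trie scan so that the example stays short; the point of the sketch is how recorded offsets map a hit in the long string back to its document and field.

```python
from bisect import bisect_right

FIELD_SEP = "\x01"  # hypothetical special character separating fields

def assemble(docs, field_order=("title", "body")):
    """Join documents into one long string and record where each field starts."""
    parts, starts, owners = [], [], []
    pos = 0
    for doc_idx, doc in enumerate(docs):
        for field in field_order:
            text = doc.get(field, "")
            starts.append(pos)                # offset of this field
            owners.append((doc_idx, field))   # which document and field it is
            parts.append(text + FIELD_SEP)
            pos += len(text) + 1
    return "".join(parts), starts, owners

def scan(long_text, starts, owners, keywords):
    """Return {keyword: set of (doc_idx, field)} for every keyword occurrence."""
    hits = {}
    for kw in keywords:
        at = long_text.find(kw)
        while at != -1:
            slot = bisect_right(starts, at) - 1   # last field starting at or before `at`
            hits.setdefault(kw, set()).add(owners[slot])
            at = long_text.find(kw, at + 1)
    return hits
```

Because keywords never contain the separator, a match cannot span two fields, so the start offset alone identifies the field that was hit.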
Preferably, the backtracking filtering processing module is specifically used as follows:
Step C.1: The backtracking filtering processing module reads the backtracking filtering configuration file and obtains from it the backtracking time interval and the backtracking delay;
Step C.2: The input module reads tasks from the database through an API, and forwards each task that meets the backtracking condition to the backtracking filtering processing module for processing; specifically:
Step C.2.1: The input module obtains from the event rule database the records whose is_rollback_ended field is 0. This field indicates whether a task needs backtracking: 0 means backtracking is needed, 3 means backtracking is in progress, and 1 means backtracking is complete;
Step C.2.2: The input module obtains the value of the rollback_early_time field from the task it has obtained; this value is the time of the last backtracking. If this time is later than the task start time, the task needs backtracking and is sent to the backtracking filtering processing module; otherwise the task does not meet the backtracking condition and cannot be backtracked;
Step C.2.3: The input module sends the obtained task to the backtracking filtering processing module, which reads the historical document data within the validity period of the task;
Step C.2.4: The backtracking filtering processing module matches the historical document data against the rules, mainly through the following steps:
Step C.2.4.1: The backtracking filtering processing module traverses each word-set variable in the event rules and checks whether the word-set variable set already contains it; if not, it stores the traversed word-set variable in the word-set variable set; if so, it skips it;
It traverses each rule in the event rules and checks whether the rule set already contains it; if not, it stores the traversed rule in the rule set; if so, it skips it;
It uses the rule parsing algorithm to traverse the word-set variable set and the rule set, and extracts the word set that conforms to the rules;
Step C.2.4.2: Build the double-array trie from the extracted word set;
Step C.2.4.3: The backtracking filtering processing module assembles the documents it has read into a long string, with fields in field order separated by a special character, and records the offsets of each document and each field; it then matches the historical documents against the rules and recognizes the documents in the historical document structure that meet the topic requirements;
Step C.2.5: Write the results of rule matching back to the event rule database;
Step C.2.6: Send the backtracking result report to the polling monitoring module, update the is_rollback_ended and rollback_early_time fields in the event rule database, and determine the time period for the next backtracking of this task.
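The task selection in steps C.2.1 and C.2.2 can be sketched as follows. The field names `is_rollback_ended` and `rollback_early_time` and the status values 0, 1 and 3 come from the text; the `start_time` field name, the use of plain dicts in place of database rows, and the function name are assumptions made for illustration.

```python
# Status values of the is_rollback_ended field, as described in step C.2.1.
NEEDS_BACKTRACK, DONE, IN_PROGRESS = 0, 1, 3

def select_backtrack_tasks(rows):
    """Return the tasks that satisfy the backtracking condition.

    Each row is a hypothetical dict standing in for a database record;
    "start_time" is an assumed name for the task start time.
    """
    selected = []
    for row in rows:
        if row["is_rollback_ended"] != NEEDS_BACKTRACK:
            continue  # step C.2.1: only tasks flagged as needing backtracking
        # Step C.2.2: the last backtracking time must be later than the start time.
        if row["rollback_early_time"] > row["start_time"]:
            selected.append(row)
    return selected
```

Each selected task would then be forwarded to the backtracking filtering processing module, as in step C.2.3.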
Preferably, this step mainly comprises the following steps:
Step C.2.4.3.1: Scan the assembled long string with the double-array trie that has been built, performing keyword matching to obtain the scan results;
Step C.2.4.3.2: Rule parsing: apply set operations to the scan results according to the rule syntax;
Step C.2.4.3.3: Information source range filtering: filter the results of step C.2.4.3.2 by information source range, removing those whose sources fall outside the range required by the event rules, to obtain the documents relevant to the required topic.
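The set operations and the source range filter in the last two steps can be sketched as below. The sketch assumes that the rule has already been parsed into a nested tuple such as `("and", left, right)`; this tuple form, the operator names and the function names are hypothetical and are not the patent's rule syntax. Hit sets are modelled as sets of document ids per keyword.

```python
def evaluate(node, hits):
    """Evaluate a pre-parsed rule tree against per-keyword hit sets."""
    if isinstance(node, str):            # leaf: a keyword
        return set(hits.get(node, ()))
    op, left, right = node
    a, b = evaluate(left, hits), evaluate(right, hits)
    if op == "and":
        return a & b                     # documents hit by both sides
    if op == "or":
        return a | b                     # documents hit by either side
    if op == "not":
        return a - b                     # left hits excluding right hits
    raise ValueError(f"unknown operator: {op}")

def filter_sources(doc_ids, doc_source, allowed_sources):
    """Keep only documents whose information source is in the allowed range."""
    return {d for d in doc_ids if doc_source.get(d) in allowed_sources}
```

Working on sets of document ids keeps rule evaluation independent of how the keyword scan itself was carried out.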
The network data oriented rapid recognition system for topic documents provided by the present invention has the following advantages:
On the one hand, the invention adopts a modular architecture, so that each module can be changed to suit different business scenarios without making the other modules unavailable; on the other hand, it defines a fairly general data model, which makes the system highly adaptable to different businesses.
Brief description of the drawings
Fig. 1 is the overall flow chart of the main thread of the network data oriented rapid recognition system for topic documents provided by the present invention;
Fig. 2 is the rule tree construction flow chart provided by the present invention;
Fig. 3 is the real-time document filtering flow chart provided by the present invention;
Fig. 4 is the document backtracking filtering flow chart provided by the present invention;
Fig. 5 is the architecture diagram of the rapid recognition system for topic documents provided by the present invention.
Detailed description
To make the technical problem solved by the invention, its technical solution and its beneficial effects clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The aim of the invention is to provide a general and effective method for the massive data on the network, so that the topic to which a document belongs is recognized rapidly and the valid documents under a specific topic are obtained in time.
Specifically, the invention implements a network data oriented rapid recognition system for topic documents, which recognizes topics rapidly by efficiently matching documents against different rules. The system mainly consists of a document acquisition module, a document result storage module, a polling monitoring module, a real-time service interface, a history service interface, a rule tree construction module, a real-time filtering processing module and a backtracking filtering processing module. The invention processes real-time data and valid historical data simultaneously and defines a fairly general framework model for document topic recognition. It can process large volumes of document data in batches, supports dynamic hot-swapping of the processing algorithm while the system keeps running normally, and continues to run normally after the content of the input/output interfaces changes. It thereby overcomes the defects of some current document recognition systems, which cannot be freely modified and have poor flexibility and reusability, and it adapts well to changing requirements.
Specifically, the network data oriented rapid recognition system for topic documents provided by the present invention comprises a document acquisition module, a document result storage module, a polling monitoring module, a real-time service interface, a history service interface, a rule tree construction module, a real-time filtering processing module and a backtracking filtering processing module; wherein the polling monitoring module is connected to the real-time filtering processing module through the real-time service interface, and the real-time filtering processing module is connected to the document acquisition module and the document result storage module respectively; the polling monitoring module is connected to the backtracking filtering processing module through the history service interface; and the backtracking filtering processing module is connected to the document acquisition module and the document result storage module respectively.
Referring to Fig. 1, the main thread of the system mainly performs the following: (1) read the configuration file; (2) start the heartbeat monitoring service; (3) start the real-time filtering processing threads; (4) start the backtracking filtering processing threads.
The system contains two major concurrent processing flows: the real-time filtering processing flow and the backtracking filtering processing flow. These two flows form the core of the whole system. The real-time filtering processing flow itself comprises two concurrent flows: A. loading the event rules and building the trie, whose flow chart is shown in Fig. 2; and B. obtaining real-time documents and filtering them in real time, whose flow chart is shown in Fig. 3.
The operation of each module is described in detail below:
(1) Polling monitoring module
The polling monitoring module is used to distribute event rules to the real-time service interface and the history service interface respectively;
It is further used as follows: while distributing event rules to the real-time service interface and the history service interface, the polling monitoring module also receives the heartbeats returned by the real-time service interface and the history service interface. From these it detects whether the state of the real-time filtering processing module connected to the real-time service interface is normal and, if it is abnormal, restarts the real-time filtering processing module through the real-time service interface. At the same time it detects whether the state of the backtracking filtering processing module connected to the history service interface is normal and, if it is abnormal, restarts the backtracking filtering processing module through the history service interface.
In practice, the polling monitoring module periodically polls the event rules in the database. When the system has just started, or when the event rules in the database have been updated, the module sends all valid event rules to the rule tree construction module, which uses them to build the double-array trie.
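The heartbeat check can be sketched as follows. This is a minimal illustration rather than the system's actual mechanism: the timeout-based definition of an abnormal state, the class name `ServiceMonitor`, the `restart` callback and the injectable clock are all assumptions, since the text does not say how an abnormal state is determined.

```python
import time

class ServiceMonitor:
    """Track heartbeats per service and restart services that fall silent."""

    def __init__(self, timeout_seconds, restart, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.restart = restart      # hypothetical callback taking a service name
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def heartbeat(self, service):
        """Record a heartbeat returned by a service interface."""
        self.last_seen[service] = self.clock()

    def check(self):
        """Restart every service whose last heartbeat is older than the timeout."""
        now = self.clock()
        restarted = []
        for service, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.restart(service)
                self.last_seen[service] = now   # fresh window after the restart
                restarted.append(service)
        return restarted
```

In such a design the polling loop would call `check()` on each cycle, after distributing rules and collecting the returned heartbeats.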
(2) Rule tree construction module
The rule tree construction module is used to receive event rules from the polling monitoring module and to build, from those event rules, a double-array trie that supports dynamic hot-swapping.
A brief explanation of the double-array trie (Double-Array Trie) is in order here. It is a simple and efficient implementation of a trie, made up of two integer arrays, base[] and check[]. For an array index i, the two arrays always satisfy the condition base[i] + a = t and check[t] = i, where a is the code of the input character and t is the index of the next state. If base and check are both 0, the position is empty. If base is negative, the state is a word. Here the double-array trie is mainly used to store keywords and to perform efficient multi-pattern matching on documents, returning the matching results when a match succeeds. Throughout this document, "trie" refers to the double-array trie.
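The transition condition can be made concrete with a small static double-array trie. This is a sketch under stated assumptions, not the patent's implementation: the class and method names are hypothetical, characters are mapped to dense codes starting at 1, the root sits at index 1 so that a check value of 0 can mean an empty slot, and a state that is both a word and has children stores the negated base. The trie is built once from a keyword list and does not support insertion or deletion afterwards.

```python
from collections import deque

class DoubleArrayTrie:
    """Static double-array trie: t = |base[s]| + code(c), valid iff check[t] == s."""

    ROOT = 1  # index 0 stays unused so that check == 0 can mean "empty slot"

    def __init__(self, words):
        chars = sorted({c for w in words for c in w})
        self.code = {c: i + 1 for i, c in enumerate(chars)}  # codes start at 1
        self.base = [0, 1]
        self.check = [0, -1]  # -1 marks the root slot as occupied
        self._build(words)

    def _grow(self, size):
        if size > len(self.base):
            extra = size - len(self.base)
            self.base.extend([0] * extra)
            self.check.extend([0] * extra)

    def _build(self, words):
        # Build an ordinary nested-dict trie first, then pack it into the arrays.
        root = {"next": {}, "end": False}
        for w in words:
            node = root
            for c in w:
                node = node["next"].setdefault(c, {"next": {}, "end": False})
            node["end"] = True
        queue = deque([(root, self.ROOT)])
        while queue:
            node, s = queue.popleft()
            codes = [self.code[c] for c in node["next"]]
            b = 1
            while codes:  # smallest base whose target slots are all free
                self._grow(b + max(codes) + 1)
                if all(self.check[b + k] == 0 for k in codes):
                    break
                b += 1
            self.base[s] = -b if node["end"] else b  # negative base marks a word
            for c, child in node["next"].items():
                t = b + self.code[c]
                self.check[t] = s
                queue.append((child, t))

    def _step(self, s, c):
        k = self.code.get(c)
        if k is None:
            return None
        t = abs(self.base[s]) + k
        return t if t < len(self.check) and self.check[t] == s else None

    def scan(self, text):
        """Return (start, keyword) for every keyword occurrence in text."""
        found = []
        for i in range(len(text)):
            s = self.ROOT
            for j in range(i, len(text)):
                s = self._step(s, text[j])
                if s is None:
                    break
                if self.base[s] < 0:
                    found.append((i, text[i:j + 1]))
        return found
```

For example, `DoubleArrayTrie(["he", "her", "hers", "she"]).scan("ushers")` reports every keyword together with its start position. Restarting the walk at each text position keeps the sketch simple; a production matcher would typically add failure links so that the text is scanned only once.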
Referring to Fig. 2, the rule tree construction module is specifically used as follows:
Step A.2.1: When the rule tree construction module receives event rules distributed by the polling monitoring module, it checks whether the tree-building flag is true or false, and thereby determines whether another tree-building process is running;
The rule tree construction module loads the event rules in order to build the double-array trie. When the system has just started, the tree-building flag is set to false; a false flag indicates that no tree-building process is currently running.
Step A.2.2: If the tree-building flag is true, another tree-building process exists; the received event rules are then backed up in a global event rule variable and written to the corresponding log. After waiting 1 second, the flag is checked again, and this repeats until no other tree-building process is running;
If the tree-building flag is false, no other tree-building process exists, and the event rules are used to build a double-array trie. While the trie is being built, on the one hand the tree-building process is locked, and the lock is released once the tree has been built; on the other hand the tree-building flag is set to true;
Here, building the double-array trie means: extract the word set from the event rules, build a double-array trie, and point a local smart pointer at the newly built trie. The global smart pointer is then used to update the global trie; the update is performed under a lock, which is released once the update completes. The tree-building flag is then set to false, indicating that this tree-building process has finished;
In this step, extracting the word set from the event rules and building the double-array trie specifically comprises the following steps:
Step A.2.2.1: The event rules contain multiple word-set variables. Traverse each word-set variable in the event rules and check whether the word-set variable set already contains it; if not, store the traversed word-set variable in the word-set variable set; if so, skip it;
Step A.2.2.2: The event rules contain multiple rules. Traverse each rule in the event rules and check whether the rule set already contains it; if not, store the traversed rule in the rule set; if so, skip it;
Step A.2.2.3: Use the rule parsing algorithm to traverse the word-set variable set and the rule set, and extract the word set that conforms to the rules;
Step A.2.2.4: Build the double-array trie from the extracted word set.
Step A.2.3: Check whether the backed-up event rules are empty. If they are empty, finish and continue waiting for new event rules; if they are not empty, return to step A.2.2.
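The interplay of the tree-building flag, the backed-up rules and the hot swap of the global trie can be sketched as below. This is a variant under stated assumptions rather than a transcription of steps A.2.1 to A.2.3: instead of waiting 1 second and re-checking the flag, a caller that finds a build in progress backs up its rules and returns, and the running builder drains the backup. The class name, the `build_trie` callback and the use of a plain attribute in place of a smart pointer are hypothetical.

```python
import threading

class RuleTreeBuilder:
    """Build tries one at a time and hot-swap the shared trie when done."""

    def __init__(self, build_trie):
        self.build_trie = build_trie         # hypothetical function: rules -> trie
        self.building = False                # the tree-building flag
        self.pending = None                  # backup of rules received while busy
        self.state_lock = threading.Lock()   # guards `building` and `pending`
        self.swap_lock = threading.Lock()    # guards the shared trie reference
        self.trie = None                     # global trie read by filter threads

    def submit(self, rules):
        """Build a trie for `rules`; return False if the rules were only backed up."""
        with self.state_lock:
            if self.building:                # another build is running: back up
                self.pending = rules
                return False
            self.building = True
        while rules is not None:
            new_trie = self.build_trie(rules)   # build aside, readers unaffected
            with self.swap_lock:
                self.trie = new_trie            # hot swap of the global trie
            with self.state_lock:
                rules, self.pending = self.pending, None
                if rules is None:               # no backup left: build finished
                    self.building = False
        return True

    def current(self):
        """Return the trie that filter threads should use right now."""
        with self.swap_lock:
            return self.trie
```

Building the new trie outside the swap lock means that filter threads are blocked only for the brief moment in which the reference is replaced.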
(3) Real-time filtering processing module
The real-time filtering processing module is used to obtain real-time document data through the document acquisition module and to process the real-time document data into a real-time document structure; it then scans the real-time document structure with the double-array trie, recognizes the real-time documents that meet the topic requirements, and stores the recognition results through the document result storage module;
Referring to Fig. 3, the real-time filtering processing module is specifically used as follows:
Step B.1: The real-time filtering processing module reads the real-time filtering configuration file and obtains the configuration information;
Step B.2: From the configuration information, the real-time filtering processing module obtains the channel types in which real-time documents reside and the number of threads for each channel; it then starts the corresponding real-time document processing threads for each channel;
Each real-time document processing thread carries out the real-time filtering process, which includes:
Step B.2.1: The real-time document processing thread reads several real-time documents through the input module and packs them into a document structure;
Specifically, the input module reads real-time documents from a ZeroMQ message queue through an API, determines from the system configuration how many documents to read per batch, and packs them into a document structure.
Step B.2.2: The real-time document processing thread assembles the documents: the fields of the document structure are concatenated in field order, separated by a special character, and the offsets of each document and each field are recorded;
Step B.2.3: The real-time document processing thread matches the assembled document against the rules using the double-array trie, and recognizes the documents in the real-time document structure that meet the topic requirements.
Step B.2.3 comprises the following steps:
Step B.2.3.1: The real-time document processing thread checks whether the double-array trie has been fully built; if not, it waits; if so, it performs step B.2.3.2;
Step B.2.3.2: The real-time document processing thread obtains the assembled document, which is a long string structure; it then scans the long string structure with the double-array trie to perform keyword matching and obtain the scan results; the scan of the long string structure with the double-array trie must be performed under a lock;
Step B.2.3.3: Rule parsing: apply set operations to the scan results according to the rule syntax;
In this step, the rule grammar is as follows:
(1) Rules are line-based: each rule ends with the newline character '\n'.
(2) If a rule contains multiple sub-rules and any one of them carries a title-field or body-field restriction, then every sub-rule must carry a field restriction; the full-text default may not be relied on by omission. Here @title denotes the title field, @body denotes the body field, and @text denotes the full-text field, which covers both the title and the body.
(3) A rule contains one or more factors, connected by rule operators. The operators " " (space), "|", and "-" denote the set operations AND, OR, and NOT respectively. The operator of a factor determines how its hit set is combined with the other hit sets.
(4) A factor consists of "@field" and a term: @field term. A factor may also consist of a term alone, with no field restriction: term.
(5) A term may contain the following units: a rule: (rule); a variable: {var}; a string: "string"; a number: num; a word: word.
(6) Within a term, the units may be connected with the " " (space), "|", and "-" symbols, which have the same meaning as the set operations above.
(7) A rule here is a rule as described above; when referenced inside a term it must be enclosed in parentheses.
(8) A variable is defined as var=(N words connected by logical operators); that is, var is an expression of several words joined by the logical symbols above. Inside a term it is referenced as {var}.
(9) A string is defined as "N characters"; the text inside the quotation marks is treated as a single unit. This form is mostly used to define English phrases.
(10) A number is defined as: a numeral.
(11) A word is defined as: a single word.
Step B.2.3.4: The result of step B.2.3.3 is filtered by information source scope (for example, microblog or forum). Documents whose information source is outside the scope required by the event rule are filtered out, leaving the documents relevant to the required topic.
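As a non-limiting illustration, the set-operation semantics of the rule grammar can be sketched in Python. The sketch assumes that operators are applied left to right with equal precedence (the text does not state a precedence), that parentheses group sub-rules, and that each word has already been mapped to the set of ids of the documents it hit. The names `tokenize` and `evaluate` are hypothetical, and field restrictions such as @title, as well as quoted strings, are left out for brevity.

```python
import re

# Tokens: parentheses, '|', '-', {var} references, or bare words.
TOKEN = re.compile(r"\(|\)|\||-|\{\w+\}|\w+")

def tokenize(rule):
    return TOKEN.findall(rule)

def evaluate(rule, hits, variables=None):
    """Evaluate a rule to a set of document ids.

    hits: word -> set of ids of the documents in which the word was hit.
    variables: name -> rule text, referenced inside a rule as {name}.
    Adjacent terms mean AND, '|' means OR, '-' means NOT (set difference).
    """
    variables = variables or {}
    tokens = tokenize(rule)
    pos = 0

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = expr()
            pos += 1  # skip the closing ')'
            return result
        if tok.startswith("{"):
            return evaluate(variables[tok[1:-1]], hits, variables)
        return set(hits.get(tok, ()))  # copy, so hits is never mutated

    def expr():
        nonlocal pos
        result = term()
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "|":
                pos += 1
                result |= term()
            elif tokens[pos] == "-":
                pos += 1
                result -= term()
            else:  # juxtaposition (space) means AND
                result &= term()
        return result

    return expr()

hits = {"spring": {1, 2}, "festival": {2, 3}, "winter": {4}}
both = evaluate("spring festival", hits)
either = evaluate("spring | festival", hits)
```

A word set variable is passed through the `variables` argument, for example `evaluate("spring | {vacation}", hits, {"vacation": "winter | festival"})`.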
(4) Backtracking filtering processing module
The backtracking filtering processing module obtains historical document data through the document acquisition module, processes the data, and converts it into a historical document structure. It then scans the historical document structure with the double-array trie, identifies the historical documents that meet the topic requirements, and stores the recognition results through the document result storage module.
The backtracking filtering processing module is specifically configured to perform the following steps:
Step C.1: The backtracking filtering processing module reads the backtracking filtering configuration file and obtains from it the backtracking time interval (the time span covered by each backtracking pass) and the backtracking delay time.
Step C.2: The input module reads tasks from the database through an API; each task found to satisfy the backtracking condition is forwarded to the backtracking filtering processing module for processing. Specifically:
Step C.2.1: The input module fetches from the event rule database the records whose is_rollback_ended field is 0. This field indicates whether a task needs backtracking: 0 means backtracking is needed, 3 means backtracking is in progress, and 1 means backtracking is complete.
Step C.2.2: The input module reads the rollback_early_time field of each fetched task; this value is the time reached by the previous backtracking pass. If it is later than the task start time, the task needs backtracking and is sent to the backtracking filtering processing module; otherwise the task does not satisfy the backtracking condition and is not backtracked.
Step C.2.3: The input module sends the fetched task to the backtracking filtering processing module, which reads the historical document data within the validity period of the task; the data can be read from MongoDB or MySQL.
Step C.2.4: The backtracking filtering processing module applies rule matching to the historical document data, mainly as follows:
Step C.2.4.1: The backtracking filtering processing module traverses each word set variable in the event rules and checks whether it already exists in the word set variable collection. If not, the variable is added to the collection; if so, it is skipped.
The module likewise traverses each rule in the event rules and checks whether it already exists in the rule collection. If not, the rule is added to the collection; if so, it is skipped.
A rule parsing algorithm then traverses the word set variable collection and the rule collection and extracts the word sets that conform to the rules.
Step C.2.4.2: A double-array trie is built from the extracted word sets.
Step C.2.4.3: The backtracking filtering processing module assembles the documents it has read into a long string, with the fields concatenated in field order and separated by a special character, and records the offset of each document and each field. It then applies rule matching to the historical documents and identifies the documents in the historical document structure that meet the topic requirements.
Step C.2.4.3 mainly comprises the following steps:
Step C.2.4.3.1: The assembled long string is scanned with the double-array trie that has been built, keywords are matched, and the scan result is obtained.
Step C.2.4.3.2: Rule parsing: set operations are applied to the scan result according to the rule grammar.
Step C.2.4.3.3: Information source scope filtering: the result of step C.2.4.3.2 is filtered by information source scope. Documents whose information source is outside the scope required by the event rule are filtered out, leaving the documents relevant to the required topic.
Step C.2.5: The result of rule matching is written back to the event rule database.
Step C.2.6: A report of the backtracking result is sent to the polling monitoring module, and the is_rollback_ended and rollback_early_time fields in the event rule database are updated, which determines the time period of the next backtracking pass for this task.
The distributed architecture of the network-data-oriented rapid topic document recognition system provided by the invention is shown in Fig. 5:
The rule tree building module receives event rules from the polling monitoring module, builds the double-array trie, and dynamically hot-switches the global rule tree, so that the system can update the rule tree without stopping.
The real-time filtering processing module obtains real-time document data from the input module, processes the document data by scanning it with the double-array trie, and stores the results.
The backtracking filtering processing module receives event rules from the polling monitoring program, reads documents from MongoDB or MySQL, filters them, and writes the results to MongoDB or MySQL.
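As a non-limiting illustration of hot switching, the Python sketch below builds a replacement tree off to the side and swaps the shared reference under a lock, so that readers always see either the old tree or the new one. The class name `RuleTreeHolder` is hypothetical, and a frozenset of keywords stands in for the double-array trie, which is an assumption made only to keep the sketch short.

```python
import threading

class RuleTreeHolder:
    """Holds the global rule tree and lets it be replaced while readers run."""

    def __init__(self, keywords=()):
        self._lock = threading.Lock()
        self._tree = frozenset(keywords)  # stand-in for a double-array trie

    def current(self):
        # Readers take a reference to whichever tree is current.
        with self._lock:
            return self._tree

    def rebuild(self, keywords):
        # Build the new tree outside the lock; only the swap is locked.
        new_tree = frozenset(keywords)
        with self._lock:
            self._tree = new_tree

holder = RuleTreeHolder(["spring"])
old_tree = holder.current()
holder.rebuild(["spring", "winter vacation"])
new_tree = holder.current()
```

A reader that obtained `old_tree` before the swap can finish its scan on the old tree, while later readers pick up `new_tree`.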
In terms of reuse, the invention on the one hand adopts a modular architecture, so that each module can be changed to suit a different business scenario without making the other modules unavailable; on the other hand it defines a fairly general data model (document structure, event rule structure, processing mode), which gives the system strong adaptability to different businesses.
In terms of performance, the system uses a double-array trie matching algorithm that completes keyword matching in linear time complexity, which greatly improves processing efficiency. Different channels can share one global trie and perform efficient, safe matching in blocking mode, which saves a great deal of memory. In addition, the system can be deployed on a distributed cluster, so the whole system can be scaled horizontally to increase its parallel processing capacity.
The principles and features of the invention are described below with reference to the business flow charts. The examples serve only to explain the system implemented by the invention and are not intended to limit the scope of the claims.
(1) Example of the document processing flow for real-time filtering and backtracking filtering
The processing module (real-time filtering and backtracking filtering follow similar flows here and are collectively called the processing module) obtains the documents of a channel from the input module. The system mainly reads JSON strings of the form {docid:1000000, title:"holidays in the winter vacation", body:"The holidays in the winter vacation are: the Spring Festival, the Lantern Festival.", boardscopeid1:"100", boardscopeid1:"101", boardscopeid1:"102"}, {docid:1000001, title:"spring", body:"Spring is a season full of hope", boardscopeid1:"103"}. The system obtains the content of each whole document by parsing the corresponding fields and, according to the configuration information, fetches at most 200 documents per batch.
These documents are then assembled into a text sequence, which is a structure containing the documents: the text of each document, the offset of each document, and the information source scope of each document. The long string is delimited mainly by the '\n' character: "holidays in the winter vacation\nThe holidays in the winter vacation are: the Spring Festival, the Lantern Festival.\nspring\nSpring is a season full of hope\n". The offsets, measured on the original Chinese text, are 0, 19, 59, and 66. The system uses UTF-8 encoding, in which one Chinese character occupies 3 bytes.
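As a non-limiting illustration, the assembly step can be sketched in Python as follows. The function name `assemble` and the tuple layout of the offset table are hypothetical, and the two sample documents use short English placeholder text, so their offsets differ from the Chinese example above.

```python
def assemble(documents, fields=("title", "body"), sep="\n"):
    """Join document fields into one UTF-8 byte string.

    Returns (data, offsets), where offsets is a list of
    (byte_offset, docid, field) entries in ascending offset order.
    """
    data = bytearray()
    offsets = []
    for doc in documents:
        for field in fields:
            # Record where this field starts before appending it.
            offsets.append((len(data), doc["docid"], field))
            data += doc[field].encode("utf-8")
            data += sep.encode("utf-8")
    return bytes(data), offsets

docs = [
    {"docid": 1000000, "title": "holidays", "body": "spring festival"},
    {"docid": 1000001, "title": "spring", "body": "a season of hope"},
]
data, offsets = assemble(docs)
```

Offsets are counted in bytes rather than characters, which is why a Chinese character advances the offset by 3 under UTF-8.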
The main contents of the event filtering rule are: event id: r1918; word set: vacation=(winter vacation | Spring Festival); rule list: @title (spring | Spring Festival) @body (spring Spring Festival) @title (spring | {vacation}); information source scope: 100, 101, 102. A trie for matching keywords in documents is built from this event rule.
The processing module calls the trie according to its state (whether it has been fully built). For example, if another channel has also read documents and is ready to match but finds that the matching process of the trie is locked, some channel is currently matching, and the new channel blocks. When the lock is released, the waiting channel is woken up and passes its assembled document structure to the trie. Matching yields a result that contains the hit words together with the position and count of each hit. In this example, "spring" is hit 2 times, at positions 59 and 66; "Spring Festival" is hit once, at position 41; "winter vacation" is hit 2 times, at positions 0 and 19.
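As a non-limiting illustration, the locked keyword scan can be sketched in Python. The sketch uses a plain nested-dictionary trie with a restart at every byte position, not a double-array trie, so it shows the shape of the result rather than the linear-time behavior described for the invention. The names `build_trie`, `scan`, and `scan_lock` are hypothetical.

```python
import threading

def build_trie(keywords):
    """Build a nested-dict trie over the UTF-8 bytes of each keyword."""
    root = {}
    for word in keywords:
        node = root
        for byte in word.encode("utf-8"):
            node = node.setdefault(byte, {})
        node[None] = word  # end-of-keyword marker
    return root

scan_lock = threading.Lock()

def scan(trie, data):
    """Return {keyword: [byte positions]} for every occurrence in data."""
    hits = {}
    with scan_lock:  # one channel scans at a time; other channels block here
        for start in range(len(data)):
            node = trie
            i = start
            while i < len(data) and data[i] in node:
                node = node[data[i]]
                i += 1
                if None in node:
                    hits.setdefault(node[None], []).append(start)
    return hits

trie = build_trie(["spring", "spring festival", "winter"])
hits = scan(trie, b"winter\nspring festival\nspring\n")
```

Each entry of `hits` carries both the positions and, through the list length, the hit count of a keyword.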
The processing module then converts this result into document ids and the hit words each document contains, and stores it in an intermediate result organized by event. For example, "winter vacation" is at position 0 of document 1000000, and "spring" is at position 0 of document 1000001.
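As a non-limiting illustration, a hit position in the long string can be mapped back to a document id and field with a binary search over the recorded offsets. The sketch below uses the offsets 0, 19, 59, and 66 from the example; the function name `locate` and the tuple layout are hypothetical.

```python
from bisect import bisect_right

def locate(position, offsets):
    """Map a byte position to (docid, field, position inside the field).

    offsets: ascending list of (byte_offset, docid, field) tuples.
    """
    starts = [entry[0] for entry in offsets]
    index = bisect_right(starts, position) - 1  # last field starting at or before position
    start, docid, field = offsets[index]
    return docid, field, position - start

offsets = [
    (0, 1000000, "title"),
    (19, 1000000, "body"),
    (59, 1000001, "title"),
    (66, 1000001, "body"),
]
festival_hit = locate(41, offsets)
```

With these offsets, the hit at position 41 falls in the body of document 1000000, and the hits at positions 59 and 66 fall at the start of the title and the body of document 1000001.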
The hits are then combined by set operations according to the rule algorithm, which screens out documents that are not hit by any rule and documents removed by the NOT operation. For example, "@title (spring | Spring Festival)" means that "spring" or "Spring Festival" appears in the document title; "@text (spring Spring Festival)" means that "spring" and "Spring Festival" both appear in the full content of the document; "@title (spring | {vacation})" means that the title field of the document contains "spring" or one of the words in the variable "vacation", namely "winter vacation" or "Spring Festival".
The processing module then filters the result of the previous step by information source scope. For example, the event rule above requires 100, 101, or 102, so only document 1000000 satisfies the condition.
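As a non-limiting illustration, the information source scope filter can be sketched in Python. The sketch assumes that a document passes when at least one of its information source ids is in the scope required by the rule, which the text does not state explicitly; the function name `filter_by_source` and the `scopes` key are hypothetical.

```python
def filter_by_source(matched_docs, allowed_scopes):
    """Keep documents that have at least one source id in allowed_scopes."""
    allowed = set(allowed_scopes)
    return [doc for doc in matched_docs if allowed & set(doc["scopes"])]

matched = [
    {"docid": 1000000, "scopes": ["100", "101", "102"]},
    {"docid": 1000001, "scopes": ["103"]},
]
kept = filter_by_source(matched, ["100", "101", "102"])
```

Here only document 1000000 remains in `kept`, in line with the example.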
Finally, the result is assembled into JSON and stored in MongoDB. For this example the result is: {_id:1000000, tid:"r1918"}.
(2) Example of the task acquisition flow for backtracking filtering
The polling monitoring program obtains from the event rule database the backtracking tasks that satisfy the backtracking condition. The flow is as follows:
task_id task_start_time task_end_time is_rollback_ended rollback_early_time
1010 2015-1-1 00:00:00 2016-1-1 00:00:00 0 2015-10-1 00:00:00
The polling monitoring system fetches from the event rule database the records whose is_rollback_ended field is not 1. This field indicates whether a rule should be backtracked: 0 means the rule needs backtracking, 3 or 5 means the rule is being backtracked, and 1 means backtracking is complete and no further backtracking is needed.
After the polling monitoring system has fetched task 1010 from the table above, it reads the value of rollback_early_time, which is the time reached by the previous backtracking pass of the task. Backtracking filtering proceeds backward in time, from the most recent time toward earlier history. Once the value is read, the system checks whether it is later than the task start time; because 2015-10-1 is later than 2015-1-1, the task can be backtracked. The backtracking time interval is then read from the configuration file; this value sets the maximum time span of each backtracking pass. In this example the interval is 5 months. The task is packed into a task object with the backtracking time range 2015-5-1 00:00:00 to 2015-10-1 00:00:00 and sent to the backtracking filtering processing module for processing.
After the backtracking filtering processing module has finished, it sends a report to the polling monitoring system, which then updates the is_rollback_ended and rollback_early_time fields in the event rule database, as in the following table:
task_id task_start_time task_end_time is_rollback_ended rollback_early_time
1010 2015-1-1 00:00:00 2016-1-1 00:00:00 3 2015-5-1 00:00:00
The polling monitoring system then fetches the task again and repeats the backtracking filtering according to the flow above. When rollback_early_time minus the backtracking time interval is earlier than task_start_time, the task has reached its final backtracking pass; the task is packed with the backtracking time range 2015-1-1 00:00:00 to 2015-5-1 00:00:00 and sent again.
After backtracking is complete, the polling monitoring system updates the fields in the event rule database, with the result shown in the following table:
task_id task_start_time task_end_time is_rollback_ended rollback_early_time
1010 2015-1-1 00:00:00 2016-1-1 00:00:00 1 2015-1-1 00:00:00
At this point, the backtracking filtering process is complete.
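As a non-limiting illustration, the sequence of backtracking time ranges in this example can be reproduced with a short Python sketch. It assumes the interval is a whole number of months and that each window is clamped at the task start time; the helper names `minus_months` and `backtrack_windows` are hypothetical.

```python
from datetime import datetime

def minus_months(moment, months):
    """Subtract whole months from a datetime whose day exists in every month."""
    total = moment.year * 12 + (moment.month - 1) - months
    return moment.replace(year=total // 12, month=total % 12 + 1)

def backtrack_windows(task_start, early_time, interval_months):
    """Yield (window_start, window_end) pairs from newest to oldest."""
    while early_time > task_start:
        # Never reach back past the start time of the task.
        window_start = max(task_start, minus_months(early_time, interval_months))
        yield window_start, early_time
        early_time = window_start  # becomes the new rollback_early_time

windows = list(backtrack_windows(datetime(2015, 1, 1), datetime(2015, 10, 1), 5))
```

For task 1010 this yields two windows, 2015-5-1 to 2015-10-1 and then 2015-1-1 to 2015-5-1, after which rollback_early_time equals task_start_time and no further pass is scheduled.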
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims (8)

1. A network-data-oriented rapid topic document recognition system, characterized by comprising a document acquisition module, a document result storage module, a polling monitoring module, a real-time service interface, a history service interface, a rule tree building module, a real-time filtering processing module, and a backtracking filtering processing module; wherein the polling monitoring module is connected to the real-time filtering processing module through the real-time service interface, and the real-time filtering processing module is connected to the document acquisition module and the document result storage module respectively; the polling monitoring module is connected to the backtracking filtering processing module through the history service interface; and the backtracking filtering processing module is connected to the document acquisition module and the document result storage module respectively;
the polling monitoring module is configured to distribute event rules to the real-time service interface and the history service interface respectively;
the rule tree building module is configured to receive the event rules from the polling monitoring module and to build, from the event rules, a double-array trie that supports dynamic hot switching;
the real-time filtering processing module is configured to obtain real-time document data through the document acquisition module, process the real-time document data, and convert it into a real-time document structure; and then to scan the real-time document structure with the double-array trie, identify the real-time documents that meet the topic requirements, and store the recognition results through the document result storage module;
the backtracking filtering processing module is configured to obtain historical document data through the document acquisition module, process the historical document data, and convert it into a historical document structure; and then to scan the historical document structure with the double-array trie, identify the historical documents that meet the topic requirements, and store the recognition results through the document result storage module.
2. The network-data-oriented rapid topic document recognition system according to claim 1, characterized in that the polling monitoring module is further configured as follows:
while distributing event rules to the real-time service interface and the history service interface, the polling monitoring module also receives the heartbeats returned by the real-time service interface and the history service interface, and thereby detects whether the state of the real-time filtering processing module connected to the real-time service interface is normal; if the state is abnormal, it restarts the real-time filtering processing module through the real-time service interface; it likewise detects whether the state of the backtracking filtering processing module connected to the history service interface is normal and, if the state is abnormal, restarts the backtracking filtering processing module through the history service interface.
3. The network-data-oriented rapid topic document recognition system according to claim 1, characterized in that the rule tree building module is specifically configured to perform the following steps:
Step A.2.1: when the rule tree building module receives an event rule distributed by the polling monitoring module, it checks whether the tree-building flag is true or false and thereby determines whether another tree-building process is in progress;
Step A.2.2: if the tree-building flag is true, another tree-building process exists; the received event rule is backed up in a global event rule variable and written to the corresponding log; the module then waits 1 second and checks the tree-building flag again, repeating until no other tree-building process exists;
if the tree-building flag is false, no other tree-building process exists, and the double-array trie is built from the event rule; while the double-array trie is being built, on the one hand the tree-building process is locked and the lock is released once the tree has been built; on the other hand the tree-building flag is set to true;
wherein building the double-array trie means: extracting the word sets from the event rule, building a double-array trie, and pointing a local smart pointer at the newly built double-array trie; then updating the global trie through the global smart pointer, the update of the global trie being performed under a lock that is released after the update; and then setting the tree-building flag to false, indicating that this tree-building process has ended;
Step A.2.3: checking whether the backed-up event rule is empty; if it is empty, finishing and continuing to wait for new event rules; if it is not empty, returning to step A.2.2.
4. The network-data-oriented rapid topic document recognition system according to claim 3, characterized in that extracting the word sets from the event rule and building the double-array trie specifically comprises the following steps:
Step A.2.2.1: the event rules include multiple word set variables; each word set variable in the event rules is traversed and checked against the word set variable collection; if the traversed word set variable is not in the collection, it is added to the collection; if it is, it is skipped;
Step A.2.2.2: the event rules include multiple rules; each rule in the event rules is traversed and checked against the rule collection; if the traversed rule is not in the collection, it is added to the collection; if it is, it is skipped;
Step A.2.2.3: a rule parsing algorithm traverses the word set variable collection and the rule collection and extracts the word sets that conform to the rules;
Step A.2.2.4: a double-array trie is built from the extracted word sets.
5. The network-data-oriented rapid topic document recognition system according to claim 1, characterized in that the real-time filtering processing module is specifically configured to perform the following steps:
Step B.1: the real-time filtering processing module reads the real-time filtering configuration file and obtains the configuration information;
Step B.2: based on the configuration information, the real-time filtering processing module obtains the channel types of the real-time documents and the number of threads for each channel; it then starts the corresponding real-time document processing threads for each channel;
each real-time document processing thread carries out the real-time filtering process, including:
Step B.2.1: the real-time document processing thread reads a number of real-time documents through the input module and packs them into a document structure;
Step B.2.2: the real-time document processing thread assembles the documents: the fields of the document structure are concatenated in field order, separated by a special character, and the offset of each document and each field is recorded;
Step B.2.3: the real-time document processing thread applies rule matching to the assembled document using the double-array trie and identifies the documents in the real-time document structure that meet the topic requirements.
6. The network-data-oriented rapid topic document recognition system according to claim 5, characterized in that step B.2.3 comprises the following steps:
Step B.2.3.1: the real-time document processing thread checks whether the double-array trie has been fully built; if not, it waits; if so, it proceeds to step B.2.3.2;
Step B.2.3.2: the real-time document processing thread obtains the assembled document, which is a long string structure; it then scans the long string with the double-array trie to match keywords and obtains the scan result; wherein the scan of the long string with the double-array trie must be performed under a lock;
Step B.2.3.3: rule parsing: set operations are applied to the scan result according to the rule grammar;
Step B.2.3.4: the result of step B.2.3.3 is filtered by information source scope; documents whose information source is outside the scope required by the event rule are filtered out, leaving the documents relevant to the required topic.
7. The network-data-oriented rapid topic document recognition system according to claim 1, characterized in that the backtracking filtering processing module is specifically configured to perform the following steps:
Step C.1: the backtracking filtering processing module reads the backtracking filtering configuration file and obtains from it the backtracking time interval and the backtracking delay time;
Step C.2: the input module reads tasks from the database through an API; each task found to satisfy the backtracking condition is forwarded to the backtracking filtering processing module for processing; specifically:
Step C.2.1: the input module fetches from the event rule database the records whose is_rollback_ended field is 0; this field indicates whether a task needs backtracking: 0 means backtracking is needed, 3 means backtracking is in progress, and 1 means backtracking is complete;
Step C.2.2: the input module reads the rollback_early_time field of each fetched task, this value being the time reached by the previous backtracking pass; if it is later than the task start time, the task needs backtracking and is sent to the backtracking filtering processing module; otherwise the task does not satisfy the backtracking condition and is not backtracked;
Step C.2.3: the input module sends the fetched task to the backtracking filtering processing module; the backtracking filtering processing module reads the historical document data within the validity period of the task;
Step C.2.4: the backtracking filtering processing module applies rule matching to the historical document data, mainly as follows:
Step C.2.4.1: the backtracking filtering processing module traverses each word set variable in the event rules and checks whether it already exists in the word set variable collection; if not, the variable is added to the collection; if so, it is skipped;
each rule in the event rules is traversed and checked against the rule collection; if the traversed rule is not in the collection, it is added to the collection; if it is, it is skipped;
a rule parsing algorithm traverses the word set variable collection and the rule collection and extracts the word sets that conform to the rules;
Step C.2.4.2: a double-array trie is built from the extracted word sets;
Step C.2.4.3: the backtracking filtering processing module assembles the documents it has read into a long string, with the fields concatenated in field order and separated by a special character, and records the offset of each document and each field; it then applies rule matching to the historical documents and identifies the documents in the historical document structure that meet the topic requirements;
Step C.2.5: the result of rule matching is written back to the event rule database;
Step C.2.6: a report of the backtracking result is sent to the polling monitoring module, and the is_rollback_ended and rollback_early_time fields in the event rule database are updated, which determines the time period of the next backtracking pass for this task.
8. The network-data-oriented rapid topic document recognition system according to claim 7, characterized in that step C.2.4.3 mainly comprises the following steps:
Step C.2.4.3.1: the assembled long string is scanned with the double-array trie that has been built, keywords are matched, and the scan result is obtained;
Step C.2.4.3.2: rule parsing: set operations are applied to the scan result according to the rule grammar;
Step C.2.4.3.3: information source scope filtering: the result of step C.2.4.3.2 is filtered by information source scope; documents whose information source is outside the scope required by the event rule are filtered out, leaving the documents relevant to the required topic.
CN201610150817.6A 2015-03-16 2016-03-16 A kind of thematic document system for rapidly identifying of network-oriented data Expired - Fee Related CN105843854B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510114360 2015-03-16
CN2015101143609 2015-03-16

Publications (2)

Publication Number Publication Date
CN105843854A true CN105843854A (en) 2016-08-10
CN105843854B CN105843854B (en) 2019-02-05

Family

ID=56587197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610150817.6A Expired - Fee Related CN105843854B (en) 2015-03-16 2016-03-16 A kind of thematic document system for rapidly identifying of network-oriented data

Country Status (1)

Country Link
CN (1) CN105843854B (en)

Also Published As

Publication number Publication date
CN105843854B (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN110704411B (en) Knowledge graph building method and device suitable for art field and electronic equipment
CN105095223B (en) File classification method and server
CA2513851C (en) Phrase-based generation of document descriptions
CN103297435B (en) Abnormal access behavior detection method and system based on web logs
CN109213756B (en) Data storage method, data retrieval method, data storage device, data retrieval device, server and storage medium
CN108776671A (en) Network public opinion monitoring system and method
CN104850574B (en) Sensitive word filtering method for text information
CN109656999B (en) Method, device, storage medium and apparatus for synchronizing large data volume data
US20150341771A1 (en) Hotspot aggregation method and device
CN104268192B (en) Webpage information extraction method, device and terminal
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN107832333B (en) Method and system for constructing user network data fingerprint based on distributed processing and DPI data
CN106033438B (en) Public sentiment data storage method and server
CN108154390A (en) Method and device for placing advertising blog posts, storage medium and computing device
CN104809252A (en) Internet data extraction system
CN103198146B (en) Real-time event filtering method and real-time event filtering system oriented to network stream data
CN111737742B (en) Sensitive data scanning method and system
CN109308330A (en) Internet-based method for extracting, analyzing and classifying enterprise leaked information
WO2014040570A1 (en) Spam template article identification method and device
CN106209863A (en) Website security monitoring method based on full-site scanning
CN103593442A (en) Duplication eliminating method and device for log data
CN105930313A (en) Method and device for processing notification message
CN106844497A (en) Device and method for checking database code
CN107451212A (en) Synonym mining method and device based on related search
CN112948429B (en) Data reporting method, device and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (Granted publication date: 2019-02-05; Termination date: 2020-03-16)