CN103778200B - A kind of message information source abstracting method and its system - Google Patents

A kind of message information source abstracting method and its system

Info

Publication number
CN103778200B
Authority
CN
China
Prior art keywords
information source
message
extraction
character
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410010836.XA
Other languages
Chinese (zh)
Other versions
CN103778200A (en
Inventor
刘春阳
程工
张旭
王卿
程学旗
吴琼
徐学可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS and National Computer Network and Information Security Management Center
Priority to CN201410010836.XA
Publication of CN103778200A
Application granted
Publication of CN103778200B
Active (legal status)
Anticipated expiration (legal status)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a message information source extraction method and system. The method extracts the information sources in a message by matching keywords from an information source extraction rule base, and determines the type of each information source by matching rules in that rule base. The method comprises a message parsing step and an information source extraction step. The message parsing step extracts the characters from the input text and segments them into clauses. The information source extraction step performs keyword matching on each clause according to the information source extraction rule base, extracts a sequence of useful elements from the clause, extracts the information source on the basis of that sequence, and determines the information source type by matching rules in the rule base.

Description

A message information source extraction method and system
Technical field
The present invention relates to the field of text mining, and more particularly to a message information source extraction method and system.
Background
In recent years, with the development of Internet technology, all kinds of information have spread widely across the network. The quality and credibility of this information vary greatly: there are relatively formal traditional news media, and there are also emerging media of lower credibility such as forums, blogs, and microblogs. How to extract useful information sources has therefore become a research question of wide concern.
Information extraction (IE), as the name suggests, structures the information contained in text into a table-like organization. The input to an information extraction system is raw text, and the output is information points in a fixed format. Extracting information points from a variety of documents and then integrating them in a unified form is the main task of information extraction.
Information extraction does not attempt to understand an entire document comprehensively; it only analyzes the parts of the document that contain relevant information. Which information is relevant is determined by the domain fixed when the system is designed.
Information extraction is very useful when specific facts must be extracted from a large number of documents. The Internet is such a document collection: information on the same subject is usually scattered across different websites and presented in different forms. Storing this information together in structured form would be highly beneficial.
Summary of the invention
The technical problem to be solved by the present invention is to provide a message information source extraction method and system that overcome the low extraction efficiency and complex operation of information extraction techniques in the prior art.
To achieve the above object, the invention provides a message information source extraction method, characterized in that the method extracts the information sources in a message by matching keywords from an information source extraction rule base, and determines the information source type by matching rules in the information source extraction rule base. The method includes:
Message parsing step: according to the input text, extract the characters in the text and segment the characters into clauses;
Information source extraction step: perform keyword matching on the clauses according to the information source extraction rule base, extract a sequence of useful elements from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base.
In the above message information source extraction method, the information source extraction rule base further comprises: a useful element library, real information source identification rules, information source type identification rules, and character type identification rules.
In the above message information source extraction method, before the message parsing step the method further includes:
Message content adaptation step: shield differences in message encoding or storage, and provide a unified interface for iteratively reading message characters.
In the above message information source extraction method, the method further includes:
Information source statistics step: collect the results of information source extraction and compute statistics on the information sources.
In the above message information source extraction method, the message parsing step also includes:
Message character reading step: read the message byte stream and assemble the bytes into actual characters according to the encoding;
Character type judgment step: classify the characters into different types according to the character type identification rules;
Event response step: according to the type of each character, notify the user to perform the extraction operation for that character type.
In the above message information source extraction method, the information source extraction step also includes:
Index building step: build a TRIE keyword index from the useful element library;
Clause segmentation step: segment the characters from the event response step into clauses;
Extraction processing step: using the TRIE keyword index, perform keyword matching on each clause, extract the information source, judge whether the information source is real, and determine the information source type;
Output step: output the information source and the information source type.
In the above message information source extraction method, the extraction processing step also includes:
Information source extraction step: perform information source extraction clause by clause, using the TRIE keyword index built from the useful element library, and extract a candidate information source or a list of candidate information sources;
Useful element extraction step: according to the candidate information source or the list of candidate information sources, extract from the clause the useful elements and the position of each useful element in the clause;
Real information source judgment step: judge, by the predefined real information source identification rules, whether the candidate information source is a real information source;
Information source type extraction step: determine the information source type by matching the predefined information source type identification rules against the useful elements.
In the above message information source extraction method, the useful element library contains useful elements, which include: media name indicator words, date information, media reporting-action words, and media indicator words.
In the above message information source extraction method, the real information source identification rules are heuristic rules, formulated manually by observing messages; rules can be added or modified.
In the above message information source extraction method, the real information source identification rules include the following heuristic rule: if a clause contains only one candidate information source and contains a media reporting-action word, and in addition the candidate information source string ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
In the above message information source extraction method, the information source types include: news media, forums, blogs, and microblogs.
In the above message information source extraction method, in the information source type extraction step, for an information source whose type is blog and/or microblog, the user name or blog website information must be further extracted.
The present invention also provides a message information source extraction system that uses the message information source extraction method described above, characterized in that the system includes:
Message parsing module: according to the input text, perform encoding parsing, extract the characters in the text, and segment the characters into clauses;
Information source extraction module: perform keyword matching on the clauses according to the information source extraction rule base, extract a sequence of useful elements from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base.
In the above message information source extraction system, the information source extraction rule base further comprises: a useful element library, real information source identification rules, information source type identification rules, and character type identification rules.
In the above message information source extraction system, the system further comprises:
Message content adaptation module: shield differences in message encoding or storage, and provide a unified interface for iteratively reading message characters.
In the above message information source extraction system, the system further comprises:
Information source statistics module: collect the results of information source extraction and compute statistics on the information sources.
In the above message information source extraction system, the message parsing module also includes:
Message character reading module: read the message byte stream and assemble the bytes into actual characters according to the encoding;
Character type judgment module: classify the characters into different types according to the character type identification rules;
Event response module: according to the type of each character, notify the user to perform the extraction operation for that character type.
In the above message information source extraction system, the information source extraction module also includes:
Index building module: build a TRIE keyword index from the useful element library;
Clause segmentation module: segment the characters from the event response step into clauses;
Extraction processing module: using the TRIE keyword index, perform keyword matching on each clause, extract the information source, judge whether the information source is real, and determine the information source type;
Output module: output the information source and the information source type.
In the above message information source extraction system, the extraction processing module also includes:
Information source extraction module: perform information source extraction clause by clause, using the TRIE keyword index built from the useful element library, and extract a candidate information source or a list of candidate information sources;
Useful element extraction module: according to the candidate information source or the list of candidate information sources, extract from the clause the useful elements and the position of each useful element in the clause;
Real information source judgment module: judge, by the predefined real information source identification rules, whether the candidate information source is a real information source;
Information source type extraction module: determine the information source type by matching the predefined information source type identification rules against the useful elements.
In the above message information source extraction system, in the information source type extraction module, for an information source whose type is blog or microblog, the user name and blog website information must be further extracted.
Compared with the prior art, the beneficial effects of the present invention are:
1. The present invention is based on a general, event-response-driven information extraction framework that can be flexibly extended to implement specific extraction tasks.
2. The present invention effectively integrates the information source extraction rule base, extracts message sources from messages, and determines their types, which improves message information source extraction efficiency and reduces operational difficulty.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention;
Fig. 2 is a schematic diagram of the message parsing step of the present invention;
Fig. 3 is a schematic diagram of the information source extraction step of the present invention;
Fig. 4 is a schematic diagram of the extraction processing step of the present invention;
Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention;
Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the present invention;
Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the present invention;
Fig. 8 is a structural diagram of the message information source extraction system of the present invention;
Fig. 9 is a structural diagram of the message information source extraction system of a specific embodiment of the present invention.
Wherein, the reference numerals are:
1 message content adaptation module; 2 information source extraction module
3 message parsing module; 4 information source statistics module
21 message character reading module; 22 character type judgment module
23 event response module
31 index building module; 32 clause segmentation module
33 extraction processing module; 34 output module
331 information source extraction module; 332 useful element extraction module
333 real information source judgment module; 334 information source type extraction module
S1~S4, S11~S13, S21~S24, S231~S234, S100~S102, S1031~S1034: steps of the embodiments of the present invention.
Embodiments
Specific embodiments of the present invention are given below and described in detail with reference to the drawings.
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention. As shown in Fig. 1, the method extracts the information sources in a message by matching keywords from an information source extraction rule base, and determines the information source type by matching rules in the information source extraction rule base. The method includes:
Message content adaptation step S1: shield differences in message encoding or storage, and provide a unified interface for iteratively reading message characters;
Message parsing step S2: according to the input text, extract the characters in the text and segment the characters into clauses;
Information source extraction step S3: perform keyword matching on the clauses according to the information source extraction rule base, extract a sequence of useful elements from each clause, extract the information source on the basis of the useful element sequence, and determine the information source type by matching rules in the information source extraction rule base;
Information source statistics step S4: collect the results of information source extraction and compute statistics on the information sources.
The information source extraction rule base further comprises: a useful element library, real information source identification rules, information source type identification rules, and character type identification rules.
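The unified character iteration interface of step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `iter_chars` and the chunked-bytes input are assumptions, and the standard-library incremental decoder stands in for whatever adaptation layer the inventors used.

```python
import codecs
from typing import Iterable, Iterator


def iter_chars(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded characters one at a time from an iterable of byte chunks.

    Hypothetical helper: hides the message encoding behind a plain
    character iterator, as step S1 describes.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        # The incremental decoder buffers an incomplete multi-byte sequence
        # until the rest of it arrives in a later chunk.
        for ch in decoder.decode(chunk):
            yield ch
    # Flush whatever is still buffered at the end of the input.
    for ch in decoder.decode(b"", final=True):
        yield ch
```

A caller can then treat a GBK file, a UTF-8 network stream, or an in-memory buffer the same way, because the parser only ever sees characters.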
Fig. 2 is a schematic diagram of the message parsing step of the present invention. As shown in Fig. 2, the message parsing step S2 also includes:
Message character reading step S21: read the message byte stream and assemble the bytes into actual characters according to the encoding;
Character type judgment step S22: classify the characters into different types according to the character type identification rules;
Event response step S23: according to the type of each character, notify the user to perform the extraction operation for that character type.
Fig. 3 is a schematic diagram of the information source extraction step of the present invention. As shown in Fig. 3, the information source extraction step S3 also includes:
Index building step S31: build a TRIE keyword index from the useful element library;
Clause segmentation step S32: segment the characters from the event response step into clauses;
Extraction processing step S33: using the TRIE keyword index, perform keyword matching on each clause, extract the information source, judge whether the information source is real, and determine the information source type;
Output step S34: output the information source and the information source type.
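The clause segmentation of step S32 can be sketched as a small generator over the character stream. The delimiter set and the name `split_clauses` are assumptions made for illustration; the patent only says that clauses are split at punctuation marks.

```python
from typing import Iterable, Iterator

# Assumed delimiter set: common ASCII and full-width Chinese punctuation.
CLAUSE_DELIMITERS = set(",;.!?，；。！？")


def split_clauses(chars: Iterable[str]) -> Iterator[str]:
    """Group a character stream into clauses, dropping the delimiters."""
    buffer = []
    for ch in chars:
        if ch in CLAUSE_DELIMITERS:
            clause = "".join(buffer).strip()
            if clause:
                yield clause
            buffer = []
        else:
            buffer.append(ch)
    # Emit a trailing clause that was not closed by punctuation.
    tail = "".join(buffer).strip()
    if tail:
        yield tail
```

Note that splitting on "." would also break strings such as domain names, so a real rule set would need to treat those cases separately.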
Fig. 4 is a schematic diagram of the detailed steps of the message information source extraction method of the present invention. As shown in Fig. 4, the extraction processing step S33 also includes:
Information source extraction step S331: perform information source extraction clause by clause, using the TRIE keyword index built from the useful element library, and extract a candidate information source or a list of candidate information sources;
Useful element extraction step S332: according to the candidate information source or the list of candidate information sources, extract from the clause the useful elements and the position of each useful element in the clause;
Real information source judgment step S333: judge, by the predefined real information source identification rules, whether the candidate information source is a real information source;
Information source type extraction step S334: determine the information source type by matching the predefined information source type identification rules against the useful elements.
The useful element library contains useful elements, which include: media name indicator words, date information, media reporting-action words, and media indicator words.
The real information source identification rules are heuristic rules, formulated manually by observing messages; rules can be added or modified.
Further, the real information source identification rules of the invention include the following heuristic rule: if a clause contains only one candidate information source and contains a media reporting-action word, and in addition the candidate information source string ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
The information source types include: news media, forums, blogs, and microblogs.
In the information source type extraction step S334, for an information source whose type is blog and/or microblog, the user name or blog website information must be further extracted.
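One way to picture the type determination of step S334 is a suffix lookup on the source name plus a pattern for the user name. Everything below is an illustrative assumption: the suffix table, the fallback to "news media", the `user "..."` pattern, and both function names are hypothetical, since the patent does not publish its type identification rules.

```python
import re
from typing import Optional

# Assumed suffix-to-type table. "microblog" is listed before "blog"
# because every name ending in "microblog" also ends in "blog".
SUFFIX_TYPES = [("microblog", "microblog"), ("blog", "blog"), ("forum", "forum")]

# Assumed surface pattern for a user name following the source name.
USER_PATTERN = re.compile(r'user\s+"([^"]+)"')


def classify_source(name: str) -> str:
    """Map an information source name to one of the four types."""
    lowered = name.lower()
    for suffix, source_type in SUFFIX_TYPES:
        if lowered.endswith(suffix):
            return source_type
    return "news media"  # assumed default when no suffix matches


def extract_user(clause: str) -> Optional[str]:
    """Return the quoted user name in a clause, or None if absent."""
    match = USER_PATTERN.search(clause)
    return match.group(1) if match else None
```

For blog and microblog sources, the caller would run `extract_user` on the surrounding clause to obtain the additional user information the step requires.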
The steps are illustrated below with reference to a specific embodiment of the invention. Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention. As shown in Fig. 5, the operating steps of this specific embodiment illustrate the message information source extraction process.
The present invention aims to provide a user-friendly information extraction technique that extracts the information sources appearing in a message, automatically analyzes the type (news, forum, blog, microblog) and name of each message source, and extracts the user names of blogs and microblogs.
To achieve this goal, the invention provides a rule-matching-based method and a rule base for information source extraction, comprising the following steps:
Step S100: Read the rule base, extract the keywords and their type information from it, and build the TRIE keyword index.
Step S101: Perform encoding parsing on the input text, that is, extract the character stream from the text, such as Chinese characters, punctuation, and so on.
Step S102: Perform sentence segmentation, dividing the input text into clauses.
Step S103: Process each clause as follows:
Step S1031: Using the TRIE tree index built in advance, perform multi-pattern matching and date matching, split the clause into a sequence of "useful elements", and record the position of each "useful element" in the clause. Useful elements include media name indicator words, media reporting-action words, media indicator words, and so on.
Step S1032: On the useful element sequence, match the predefined rules one by one; if a candidate news information source exists, extract it and determine whether it is a real information source.
Step S1033: By matching predefined rules, further determine the type of the extracted information source.
Step S1034: Output the result.
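Steps S100 and S1031 can be sketched with a small dictionary-based trie. This is a simplified illustration under stated assumptions: the class and function names are hypothetical, the rule base is modeled as a plain keyword-to-type mapping, and the scan restarts at every position rather than using a failure-link automaton such as Aho-Corasick.

```python
from typing import Dict, List, Tuple

_KIND = "$kind"  # sentinel key; cannot collide with single-character keys


class TrieIndex:
    """Keyword index mapping each keyword to its element type."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, keyword: str, kind: str) -> None:
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_KIND] = kind

    def find_all(self, clause: str) -> List[Tuple[int, str, str]]:
        """Return (position, keyword, type) for every keyword occurrence."""
        hits = []
        for start in range(len(clause)):
            node = self.root
            for end in range(start, len(clause)):
                node = node.get(clause[end])
                if node is None:
                    break
                if _KIND in node:
                    hits.append((start, clause[start:end + 1], node[_KIND]))
        return hits


def build_index(rule_base: Dict[str, str]) -> TrieIndex:
    """Build the index from a keyword -> type mapping (assumed rule format)."""
    index = TrieIndex()
    for keyword, kind in rule_base.items():
        index.add(keyword, kind)
    return index
```

The recorded positions are what later rules use to check, for example, whether an indicator word sits at the end of a candidate string or merely near it.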
Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the invention. As shown in Fig. 6, message parsing consists of three steps:
Step S200: Read message characters. The Parser reads one character at a time through the message character iteration interface; that is, the interface reads the message byte stream, assembles the bytes into actual characters (such as Chinese characters) according to the corresponding encoding, and returns them to the Parser.
Step S201: Determine the character type. Characters are divided into different types according to their functional role in extracting different elements, such as year, month, and day characters and certain special punctuation marks.
Step S202: Notify the Listeners to respond to events. According to the character type, each Listener (observer) is notified to execute its corresponding callback function in response to the character-read event.
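The Parser and Listener interaction of steps S200 to S202 follows the observer pattern and can be sketched as below. The type labels, the character sets, and the callback signature are assumptions chosen for illustration; only the names Parser and Listener come from the text.

```python
from typing import Callable, Iterable, List

DATE_MARK, DIGIT, PUNCT, OTHER = "date_mark", "digit", "punct", "other"

DATE_MARKS = set("年月日")          # year, month, day characters
PUNCTUATION = set(",.;!?，。；！？")  # assumed set of special punctuation


def classify(ch: str) -> str:
    """Assign a character to one of the assumed functional types."""
    if ch in DATE_MARKS:
        return DATE_MARK
    if ch.isdigit():
        return DIGIT
    if ch in PUNCTUATION:
        return PUNCT
    return OTHER


class Parser:
    """Reads characters and notifies every registered listener."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str, str], None]] = []

    def add_listener(self, callback: Callable[[str, str], None]) -> None:
        self._listeners.append(callback)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:
            kind = classify(ch)
            # Each listener reacts to the character-read event on its own.
            for callback in self._listeners:
                callback(ch, kind)
```

Because the Parser knows nothing about what listeners do with the events, a new extraction task can be added by registering another callback, which is the extensibility the framework claims.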
Information source extraction is in practice the implementation of one specific Listener of the general extraction framework; it completes the extraction by continually responding to character-read events. Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the invention. As shown in Fig. 7, the steps of the flow are as follows:
Step S301: We split the text into clauses at punctuation marks such as ",", and then perform information source extraction clause by clause.
Step S302: We extract the candidate information source (usually enclosed in " " or《》) or the list of candidate information sources.
Step S303: If a candidate information source exists, we extract the useful elements and their positions from the clause. These useful elements and their positions help locate the real information source and determine its type. Here, useful elements include the following types:
a) Media name indicator words, such as "Times", "net", "news", "blog", "mhkc", "evening paper". If a candidate news source string ends with a media name indicator word, the candidate news source is probably a real media name, such as "Sina Blog" or《Maeil Business Newspaper》.
b) Date information. Candidate news sources are often accompanied by a report date, such as "June 24-25" or "April 1".
c) Media reporting-action words, such as "message", "report", "reprint", "comment", "publish", "issue". They indicate that the clause may state a news reporting action, and so help judge whether a candidate news source is a real news source.
d) Media indicator words, such as "domestic", "according to", "media", "website". They usually appear around a candidate news source and indicate that the candidate news source string is probably a media name.
Step S304: On this basis, the predefined rules can easily be matched one by one to judge whether the candidate information source (if any) is a real information source.
For example, one of the simplest heuristic rules is as follows: if a clause contains only one candidate information source and contains a media reporting-action word, and one of the following conditions is also met, the candidate information source can be judged to be a real information source:
a) The candidate news source string ends with a media name indicator word.
b) Date information appears in the clause containing the candidate news source string.
c) A media indicator word such as "domestic", "according to", "media", or "website" appears around the candidate news source string.
For example, the clause "NGO Development Exchange Network" March 11 domestic daily journal note satisfies the heuristic rule above, so "NGO Development Exchange Network" can be extracted as the information source.
The heuristic rules here are mainly formulated manually by observing messages; they may include many complex rules, and rules are continually added or modified. We implement an efficient, extensible information extraction system that flexibly supports adding or modifying rules.
Step S305: We further determine the type of each extracted information source (news media, forum, blog, or microblog); for blogs and microblogs, we further extract the user name and the blog or microblog site information. Here too we formulate a series of rules and determine the information source type by matching them one by one. These rules use the useful element information provided by step S303, including the media name indicator word contained in the information source name (if any) and the other element information around it. For example, for www.xinhuanet.com microblog user "XXXX", the extracted information source type is microblog, the user name is "XXXX", and the microblog site is "www.xinhuanet.com microblog".
Step S306: We output all information sources in the message and their type information.
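The simplest heuristic rule of step S304 can be written down directly. The sketch below is an assumption-laden illustration: the `Element` record, the kind labels, and the default indicator word list are hypothetical, and condition c) is simplified to "a media indicator word appears anywhere in the clause" instead of checking proximity to the candidate.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Element:
    kind: str  # "media_name", "date", "report_action" or "media_indicator"
    text: str
    pos: int   # position of the element in the clause


def is_real_source(
    candidates: List[str],
    elements: List[Element],
    name_words: Sequence[str] = ("Times", "net", "news", "blog", "evening paper"),
) -> bool:
    """Apply the single heuristic rule to one clause."""
    # The rule only fires when the clause has exactly one candidate.
    if len(candidates) != 1:
        return False
    kinds = {element.kind for element in elements}
    # A media reporting-action word is mandatory.
    if "report_action" not in kinds:
        return False
    suffixes = tuple(word.lower() for word in name_words)
    ends_with_name_word = candidates[0].lower().endswith(suffixes)
    # Any one of conditions a), b), c) is sufficient.
    return ends_with_name_word or "date" in kinds or "media_indicator" in kinds
```

Keeping each rule as a small predicate over the candidate list and the element sequence makes it straightforward to add or modify rules, which the text names as a design goal.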
Present invention also offers a kind of message information source extraction system, message information source abstracting method is employed, Fig. 8 is this Invention message information source extraction system structural representation, as shown in figure 8, the system includes:
Message content adaptation module 1:For shielding the coding of message or the difference of storage mode, there is provided unified message word Accord with iteration and read interface;
Packet parsing module 2:According to the text of input, code parsing is carried out, the character in text is extracted, and character is entered Row punctuate is processed as different subordinate sentences;
Information source abstraction module 3:Keywords matching is carried out to subordinate sentence according to information source decimation rule storehouse, subordinate sentence, which is extracted, to be had With wanting prime sequences, and wanted useful on prime sequences, extract information source, and pass through the rule judgment in match information source decimation rule storehouse Information source type;
Information source statistical module 4:Collect the extraction result for extracting information source, calculate the statistical information of information source.
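As a rough illustration of what the statistics module computes, the sketch below counts how often each source was extracted and how those extractions are distributed over message categories. The input shape (pairs of message category and source name) and the function name `source_statistics` are assumptions for the example; the text does not specify the data layout.

```python
from collections import Counter, defaultdict
from typing import Iterable


def source_statistics(results: Iterable[tuple[str, str]]):
    """results: (message category, extracted source name) pairs, one per extraction."""
    occurrences: Counter = Counter()                       # source -> total count
    by_category: dict[str, Counter] = defaultdict(Counter)  # source -> category counts
    for category, source in results:
        occurrences[source] += 1
        by_category[source][category] += 1
    return occurrences, by_category
```

In a full system the two returned mappings would then be written to the database, which is left out here.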
Wherein, message parsing module 2 further includes:
Message character reading module 21: reads the message byte stream and assembles the bytes into actual characters according to the encoding;
Character type judgment module 22: classifies characters into different types according to the character type recognition rules;
Event response module 23: according to the type of each character, notifies the user to carry out the extraction operation for that character type.
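The three sub-modules can be sketched together in Python: bytes are assembled into characters with an incremental decoder, each character is classified, and the handler registered for that class is notified. The character classes and the names `read_chars`, `char_type` and `parse` are assumptions made for this example; the actual character type recognition rules are not listed in the text.

```python
import codecs
import unicodedata
from typing import Callable, Iterable, Iterator


def read_chars(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Assemble bytes into characters, even when a character spans two chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        yield from decoder.decode(chunk)
    yield from decoder.decode(b"", final=True)


def char_type(ch: str) -> str:
    """Hypothetical character classes used only for this illustration."""
    if ch in "。！？；.!?;":
        return "terminator"
    category = unicodedata.category(ch)
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("N"):
        return "digit"
    if ch.isspace():
        return "space"
    return "text"


def parse(chunks: Iterable[bytes],
          handlers: dict[str, Callable[[str], None]],
          encoding: str = "utf-8") -> None:
    """Notify the handler registered for each character's type, if any."""
    for ch in read_chars(chunks, encoding):
        handler = handlers.get(char_type(ch))
        if handler is not None:
            handler(ch)
```

The incremental decoder is what lets the reader accept a byte stream cut at arbitrary positions, which matters for multi-byte encodings such as UTF-8 or GBK.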
Wherein, information source extraction module 3 further includes:
Index building module 31: builds a TRIE keyword index from the useful element library;
Clause segmentation module 32: segments the characters from the event response step into different clauses;
Extraction processing module 33: according to the TRIE keyword index, performs keyword matching on the different clauses, extracts information sources, judges the authenticity of each information source, and determines the information source type;
Output module 34: outputs the information sources and their type information.
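A TRIE keyword index of the kind built by module 31 can be sketched as nested dictionaries, with a scan that reports the longest keyword found at each position of a clause. This is a minimal sketch under stated assumptions: the class layout, the element labels and the longest-match policy are choices made for the example, not details confirmed by the text.

```python
END = ""  # marker key; it can never collide with a single character


class Trie:
    """Keyword index: each node is a dict of children; END stores the keyword's label."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, keyword: str, label: str) -> None:
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = label

    def find_all(self, text: str) -> list[tuple[int, str, str]]:
        """Return (position, keyword, label) for the longest keyword at each start."""
        matches = []
        i = 0
        while i < len(text):
            node, best = self.root, None
            for j in range(i, len(text)):
                node = node.get(text[j])
                if node is None:
                    break
                if END in node:
                    best = (i, text[i:j + 1], node[END])
            if best is None:
                i += 1
            else:
                matches.append(best)
                i += len(best[1])   # continue after the matched keyword
        return matches
```

The returned positions correspond to the position information of useful elements within the clause that the later judgment rules rely on.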
Wherein, extraction processing module 33 further includes:
Information source extraction module 331: performs information source extraction clause by clause and, according to the TRIE keyword index built from the useful element library, extracts a candidate information source or a list of candidate information sources;
Useful element extraction module 332: according to the candidate information source or list of candidate information sources, extracts from the clause the useful elements and the position information of the useful elements within the clause;
Real information source judgment module 333: judges whether a candidate information source is a real information source by means of the predefined real information source recognition rules;
Information source type extraction module 334: determines the information source type by matching the predefined information source type recognition rules against the useful elements.
Wherein, in information source type extraction module 334, for information sources whose type is blog or microblog, the user name and blog site information further need to be extracted.
The message information source extraction system is described below with reference to a specific embodiment of the invention. Fig. 9 is a structural schematic diagram of the message information source extraction system of the specific embodiment. As shown in Fig. 9, the message information source extraction system of the invention comprises the following four layers:
1) Message content adaptation layer: shields differences in message encoding, storage mode and the like, and provides upper-layer modules with a consistent interface for iteratively reading message characters, so that upper-layer modules only need to be concerned with the extraction logic.
2) Parser layer: the overall event-response-based information extraction flow. The observer design pattern is used here: the Parser is in fact a subject (Subject) with which a series of observers (Observer) are registered. The overall flow is as follows: message characters are read iteratively through the content adaptation layer; each character read is treated as an event, and every observer is notified to execute the corresponding callback function in response to the event.
3) Extractor layer: in fact corresponds to an observer (Listener); by implementing specific event response actions, it carries out specific information extraction functions. Information source extraction is one concrete implementation of the Extractor layer: from the input message content it extracts information sources of types such as news, forum, blog and microblog; it provides name normalization for news and forum information sources, and user name and site name extraction for blog and microblog information sources.
4) Information source statistics layer: information source statistics traverses the message database to read messages and performs information source extraction on the content of each message. Finally, all extraction results are collected, statistics such as the number of occurrences of each extracted information source and its distribution over message categories are calculated, and the statistical results are written to the database.
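The relationship between the Parser layer and the Extractor layer can be sketched with a small observer setup in Python. The class and method names (`Parser`, `register`, `on_char`, `on_end`, `ClauseCollector`) and the set of clause delimiters are assumptions for this illustration; a real Extractor would perform information source extraction rather than merely collecting clauses.

```python
from typing import Iterable, Protocol


class Listener(Protocol):
    def on_char(self, ch: str) -> None: ...
    def on_end(self) -> None: ...


class Parser:
    """Subject: reads characters and notifies every registered observer."""

    def __init__(self) -> None:
        self._listeners: list[Listener] = []

    def register(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:                      # each character read is one event
            for listener in self._listeners:
                listener.on_char(ch)
        for listener in self._listeners:      # signal the end of the message
            listener.on_end()


class ClauseCollector:
    """Observer: an illustrative extractor that splits the stream into clauses."""

    TERMINATORS = "。！？；，.!?;,"   # assumed clause delimiters

    def __init__(self) -> None:
        self.clauses: list[str] = []
        self._buffer: list[str] = []

    def on_char(self, ch: str) -> None:
        if ch in self.TERMINATORS:
            self._flush()
        else:
            self._buffer.append(ch)

    def on_end(self) -> None:
        self._flush()

    def _flush(self) -> None:
        clause = "".join(self._buffer).strip()
        if clause:
            self.clauses.append(clause)
        self._buffer.clear()
```

Since the Parser only knows the listener interface, further extractors can be registered without changing the parsing loop, which is the decoupling the layered design aims for.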
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art can make various corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the invention.

Claims (17)

1. A message information source extraction method, characterized in that the method extracts the information sources in a message by matching keywords of an information source extraction rule base, and judges the information source type by matching rules of the information source extraction rule base, the rules being manually formulated by observing messages and able to be added or modified, the method comprising:
A message parsing step: according to the input text, extracting the characters in the text and segmenting the characters into different clauses, the message parsing step further comprising:
A message character reading step: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
A character type judgment step: classifying characters into different types according to character type recognition rules;
An event response step: according to the type of each character, notifying the user to carry out the extraction operation for that character type;
An information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting information sources on the basis of the useful element sequence, and judging the information source type by matching rules of the information source extraction rule base, the information source extraction rule base further comprising: a useful element library, real information source recognition rules, information source type recognition rules and character type recognition rules, the information source extraction step further comprising:
An index building step: building a TRIE keyword index from the useful element library;
A clause segmentation step: segmenting the characters from the event response step into different clauses;
An extraction processing step: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting information sources, judging the authenticity of each information source, and determining the information source type;
An output step: outputting the information sources and their type information.
2. The message information source extraction method according to claim 1, characterized in that, before the message parsing step, the method further comprises:
A message content adaptation step: shielding differences in message encoding or storage mode and providing a unified interface for iteratively reading message characters.
3. The message information source extraction method according to claim 2, characterized in that the method further comprises:
An information source statistics step: collecting the extraction results for the extracted information sources and calculating statistical information about the information sources.
4. The message information source extraction method according to claim 1, characterized in that the extraction processing step further comprises:
An information source extraction step: performing information source extraction clause by clause and, according to the TRIE keyword index built from the useful element library, extracting a candidate information source or a list of candidate information sources;
A useful element extraction step: according to the candidate information source or list of candidate information sources, extracting from the clause the useful elements and the position information of the useful elements within the clause;
A real information source judgment step: judging whether the candidate information source is a real information source by means of the predefined real information source recognition rules;
An information source type extraction step: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
5. The message information source extraction method according to claim 4, characterized in that the useful element library contains useful elements, the useful elements comprising: media name indicator words, date information, media reporting-behavior words and media indicator words.
6. The message information source extraction method according to claim 5, characterized in that the real information source recognition rules are heuristic rules that are manually formulated by observing messages and can be added or modified.
7. The message information source extraction method according to claim 6, characterized in that the real information source recognition rules include a heuristic rule: if the clause contains only one candidate information source and the media reporting-behavior word is present, and either the candidate information source string ends with the media indicator word, or the date information appears in the clause containing the candidate information source string, or the media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
8. The message information source extraction method according to claim 1, characterized in that the information source types include: news media, forum, blog and microblog.
9. The message information source extraction method according to claim 4, characterized in that, in the information source type extraction step, for information sources whose type is blog and/or microblog, the user name or blog site information further needs to be extracted.
10. A message information source extraction system, employing the message information source extraction method according to any one of claims 1-9, characterized in that the system comprises:
A message parsing module: according to the input text, performing encoding parsing, extracting the characters in the text, and segmenting the characters into different clauses;
An information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting information sources on the basis of the useful element sequence, and judging the information source type by matching rules of the information source extraction rule base.
11. The message information source extraction system according to claim 10, characterized in that the information source extraction rule base further comprises: a useful element library, real information source recognition rules, information source type recognition rules and character type recognition rules.
12. The message information source extraction system according to claim 10, characterized in that the system further comprises:
A message content adaptation module: shielding differences in message encoding or storage mode and providing a unified interface for iteratively reading message characters.
13. The message information source extraction system according to claim 10 or 11, characterized in that the system further comprises:
An information source statistics module: collecting the extraction results for the extracted information sources and calculating statistical information about the information sources.
14. The message information source extraction system according to claim 10, characterized in that the message parsing module further comprises:
A message character reading module: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
A character type judgment module: classifying characters into different types according to the character type recognition rules;
An event response module: according to the type of each character, notifying the user to carry out the extraction operation for that character type.
15. The message information source extraction system according to claim 10, characterized in that the information source extraction module further comprises:
An index building module: building a TRIE keyword index from the useful element library;
A clause segmentation module: segmenting the characters from the event response step into different clauses;
An extraction processing module: according to the TRIE keyword index, performing keyword matching on the different clauses, extracting information sources, judging the authenticity of each information source, and determining the information source type;
An output module: outputting the information sources and their type information.
16. The message information source extraction system according to claim 15, characterized in that the extraction processing module further comprises:
An information source extraction module: performing information source extraction clause by clause and, according to the TRIE keyword index built from the useful element library, extracting a candidate information source or a list of candidate information sources;
A useful element extraction module: according to the candidate information source or list of candidate information sources, extracting from the clause the useful elements and the position information of the useful elements within the clause;
A real information source judgment module: judging whether the candidate information source is a real information source by means of the predefined real information source recognition rules;
An information source type extraction module: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
17. The message information source extraction system according to claim 16, characterized in that, in the information source type extraction module, for information sources whose type is blog or microblog, the user name and blog site information further need to be extracted.
CN201410010836.XA 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system Active CN103778200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410010836.XA CN103778200B (en) 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410010836.XA CN103778200B (en) 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system

Publications (2)

Publication Number Publication Date
CN103778200A CN103778200A (en) 2014-05-07
CN103778200B true CN103778200B (en) 2017-08-08

Family

ID=50570435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410010836.XA Active CN103778200B (en) 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system

Country Status (1)

Country Link
CN (1) CN103778200B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408101B (en) * 2014-11-19 2018-01-09 南京大学 A kind of full range Web information extracts integrated approach
CN106815203B (en) * 2015-12-01 2021-03-30 北京国双科技有限公司 Method and device for analyzing amount of money in referee document
CN105447202A (en) * 2015-12-31 2016-03-30 宁波公众信息产业有限公司 Internet information collecting system
CN106021439A (en) * 2016-05-16 2016-10-12 腾讯科技(深圳)有限公司 Communication number processing method and device
CN106484767B (en) * 2016-09-08 2019-06-21 中国科学院信息工程研究所 A kind of event extraction method across media
CN108268438B (en) * 2016-12-30 2021-10-22 腾讯科技(深圳)有限公司 Page content extraction method and device and client
CN107423279B (en) * 2017-04-11 2021-01-15 美林数据技术股份有限公司 Information extraction and analysis method for financial credit short message
CN107169061B (en) * 2017-05-02 2020-12-11 广东工业大学 Text multi-label classification method fusing double information sources
CN111090744A (en) * 2019-12-17 2020-05-01 中科鼎富(北京)科技发展有限公司 Stock market operation risk information mining method and device
CN112380257A (en) * 2020-11-26 2021-02-19 厦门市美亚柏科信息股份有限公司 Network data stream locking method, terminal equipment and storage medium
CN112597405A (en) * 2020-12-17 2021-04-02 中国科学院计算技术研究所数字经济产业研究院 Event external information source extraction method based on microblog platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071420A (en) * 2007-06-22 2007-11-14 腾讯科技(深圳)有限公司 Method and system for cutting index participle
CN101344889A (en) * 2008-07-31 2009-01-14 中国农业大学 Method and system for network information extraction
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN102999534A (en) * 2011-09-19 2013-03-27 北京金和软件股份有限公司 Chinese word segmentation algorithm based on reverse maximum matching
CN103150432A (en) * 2013-03-07 2013-06-12 宁波成电泰克电子信息技术发展有限公司 Method for internet public opinion analysis

Also Published As

Publication number Publication date
CN103778200A (en) 2014-05-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant