CN103699663A - Hot event mining method based on large-scale knowledge base - Google Patents


Publication number
CN103699663A
CN103699663A
Authority
CN
China
Prior art keywords
concept
tuple
word
event
predicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310741535.XA
Other languages
Chinese (zh)
Other versions
CN103699663B (en)
Inventor
郝红卫
孙正雅
王桂香
梁倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201310741535.XA
Publication of CN103699663A
Application granted
Publication of CN103699663B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hot event mining method based on a large-scale knowledge base. The method includes the steps of: based on data acquired from the internet, automatically establishing a text-understanding-oriented large-scale knowledge base and allowing for automatic optimization and knowledge updating of the base; based on the large-scale knowledge base, performing structured information extraction on the short texts to be detected, classifying the short texts according to the extracted structured information, and screening out the corresponding event texts; and, based on the large-scale knowledge base, clustering the screened event texts to screen out hot events. The method has the advantages that structured knowledge representations are automatically extracted from the internet, structured tuple representations are established for the semantic relations between instances and concepts, a knowledge backtracking mechanism is provided, and the accuracy of structured information extraction from short texts is improved.

Description

Hot event mining method based on a large-scale knowledge base
Technical field
The present invention relates to the field of natural language processing, and in particular to a hot event mining method for short texts based on a large-scale knowledge base.
Background art
With the arrival of the big data era, effectively acquiring and processing massive amounts of information is testing the level of intelligence of computers. Many institutions and individuals have begun to study how to make computers understand human thinking and our world; to date, this remains a very challenging goal. Against this background, hot event mining for short texts such as microblogs has gradually become a research focus.
When the event detection task was first proposed, the vector space model was the most widely used method. In a vector space model one dimension represents a report and the other represents a feature, so each report can be expressed as a vector, and existing classification or clustering methods are then applied to detect events. In the vector space model, feature selection or extraction is an important factor affecting detection results, and existing methods focus on surface information that can be extracted directly from a report, such as characters, words, times and persons. Because the features in such methods are mutually independent, they are also called "bag-of-words" models.
With the arrival of the web and the big data era, event detection has received increasing attention and a great variety of methods have emerged, among which event detection based on topic models is representative. As milestone topic models, Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have produced a series of significant results, and many methods based on PLSA and LDA, such as the biterm LDA model built on word pairs, the U-LDA model that incorporates user information, and supervised or semi-supervised topic models, still occupy a place among currently popular methods.
After years of development, event detection has matured. However, whether vector space models or topic models, they all start from the surface of a report and do not deeply process or understand the relations between words or phrases, so the knowledge in each report is not fully mined and exploited, especially deep semantic knowledge and syntactic knowledge. To reach this deeper information a knowledge base is indispensable: with the content of a knowledge base, connections that humans can understand and computers can use can be established between words and between phrases. Many institutions have taken an early lead in the construction and application of knowledge bases.
Wikipedia, Baidu Baike and the like collect a large amount of knowledge through open editing by users and classify it to build their own knowledge repositories. Freebase goes further by letting users edit knowledge within a similar framework and system, using concepts to distinguish categories, which makes it convenient to query and extract all knowledge related to a given attribute. The Google Knowledge Graph is devoted to building a complete knowledge system that displays all possible meanings of a search term and the related knowledge in a "knowledge graph", convenient for users to search and use. Microsoft's Probase simulates human thinking by automatically extracting concepts, instances, attributes and attribute values at the million scale, building a knowledge base and computing data such as concept frequencies and knowledge uncertainty, on which applications such as topic detection and text understanding have been developed. The most widely used and most influential resource at present is WordNet, which provides glosses for each word together with synonym information.
The above knowledge base systems either lack hierarchy in their concept relations, or have poorly regulated knowledge organization, or lack an effective representation of the relations between concepts. In the present invention, to adapt to the big data era's demand for knowledge understanding, a large-scale knowledge base is first built, establishing a knowledge system with a unified concept hierarchy, structured representations of instance relations, concept relations and rules, and the ability to automatically acquire and update knowledge; event detection and hot event mining are then carried out on this basis.
Summary of the invention
The object of the present invention is to provide, based on a large-scale knowledge base, a hot event mining method for short-text information. The invention comprises three aspects: automatic construction of a large-scale knowledge base oriented toward text understanding; structured information extraction from short texts and short-text event classification; and multi-feature-fusion short-text event clustering and hot event screening.
According to one aspect of the present invention, a hot event mining method based on a large-scale knowledge base is provided, characterized by comprising the following steps:
Step S1: based on data acquired from the internet, automatically building a large-scale knowledge base oriented toward text understanding, and realizing its automatic optimization and knowledge updating;
Step S2: based on the large-scale knowledge base, performing structured information extraction on the short texts to be detected, classifying the short texts according to the extracted structured information, and screening out the corresponding event-class texts;
Step S3: based on the large-scale knowledge base, clustering the screened event-class texts and then screening out hot events.
The present invention provides an effective method for hot event mining. Compared with the prior art, it has the following advantages:
1. The knowledge base constructed by the present invention not only builds a hierarchical concept structure in which concepts form network relations, but also automatically extracts entities and entity relations from the internet and establishes structured knowledge representations for them, yielding a large-scale knowledge base system composed of a relation base, an ontology base, a concept knowledge base, a fact base and a rule base. The knowledge backtracking method provides an effective means, based on structured tuples and rules, of supervising and optimizing the automatic construction of the system, so that knowledge circulates within the system, which helps improve the precision of knowledge base construction.
2. The constructed knowledge base system is applied to the hot event mining problem, and its role is reflected in every link of the problem. Structured information extraction based on the ontology base, the concept knowledge base and the fact base can use accurate knowledge to guide and evaluate the information extraction process and improve its accuracy; the use of the concept hierarchy and instance information in the ontology base introduces prior guidance and helps improve short-text classification precision; rule-base-driven matching of structured short-text features helps realize effective matching of diverse forms of linguistic expression, compensating for the brevity of short texts. Throughout the hot event detection process, effective use of the knowledge base helps the algorithm remove noise and improves detection precision.
Accompanying drawing explanation
In order to describe the above advantages and features of the present invention, the detailed content of the invention is explained with the aid of the specific embodiments shown in the accompanying drawings. It should be understood that these drawings only depict exemplary embodiments of the present invention and do not limit it. Any drawing that expresses the steps or content of the invention in other forms falls within the scope of the invention.
Fig. 1 is a flow diagram of the hot event mining method based on a large-scale knowledge base in the present invention;
Fig. 2 is a flow diagram of the automatic knowledge base construction method proposed by the present invention;
Fig. 3 is a flow diagram of the ontology base construction method proposed by the present invention;
Fig. 4 is the tree-shaped concept hierarchy diagram proposed by the present invention;
Fig. 5 is a diagram of the variant structure of the concept hierarchy proposed by the present invention;
Fig. 6 is a diagram of the implementation process of the hierarchical multi-label concept identification technique in the present invention;
Fig. 7 is a flow diagram of the concept knowledge base construction method in the present invention;
Fig. 8 is a flowchart of the specific implementation of the fact base construction technique in the present invention;
Fig. 9 is a process diagram of the rule base construction method in the present invention;
Fig. 10 is a schematic diagram of the generation process of each base of the knowledge base system and of the knowledge backtracking process in the present invention;
Fig. 11 is a flow diagram of the event detection method based on the large-scale knowledge base in the present invention;
Fig. 12 is a schematic diagram of concept identification in the concept knowledge base in an embodiment of the present invention;
Fig. 13 is a flow diagram of the method for short-text event clustering and hot event screening based on the knowledge base in the present invention.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
Fig. 1 shows a flow diagram of the hot event mining method based on a large-scale knowledge base provided by the present invention. As shown in Fig. 1, the method comprises:
Step S1: based on data acquired from the internet, automatically building a large-scale knowledge base oriented toward text understanding, and realizing its automatic optimization and knowledge updating;
Step S2: based on the large-scale knowledge base, performing structured information extraction on short texts, classifying the short texts, and screening out the required event-class texts;
Step S3: based on the large-scale knowledge base, applying the structured information extraction algorithm of step S2 and taking the event-class texts obtained in step S2 as input, realizing multi-feature-fusion short-text event clustering and hot event screening.
The knowledge base is a structured representation system whose components include:
- relation base: the entity n-tuples extracted from the corpus;
- ontology base: comprising three parts, namely the concept hierarchy, the instance-concept mapping table and the category table;
- concept knowledge base: the concept n-tuples produced by conceptualizing the instance n-tuples;
- fact base: instance n-tuples labeled with predicates and concepts;
- rule base: a set of weighted rules in Horn clause form.
Within this knowledge base structure, both a general-purpose knowledge base and domain knowledge bases are embodiments. The general-purpose knowledge base is used for general problems, while a domain knowledge base is built for a specific domain to solve specific problems, such as the event domain knowledge base used in step S2. The two differ in the internet data extracted when they are built: the general-purpose knowledge base is built by extracting internet data for general problems, whereas a domain knowledge base is built by extracting internet data of the specific domain.
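A minimal sketch of how the five sub-bases described above could be held together in one structure is given below; the class and field names are illustrative assumptions for this sketch, not the patent's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    # The five sub-bases listed above; all names here are assumptions.
    relation_base: list = field(default_factory=list)   # raw entity n-tuples from the corpus
    ontology: dict = field(default_factory=lambda: {
        "hierarchy": {},          # concept -> parent concept path
        "instance_concepts": {},  # instance -> candidate concepts
        "categories": {},         # category table (subjective verbs, sentiment words, ...)
    })
    concept_base: dict = field(default_factory=dict)     # labeled concept n-tuple -> frequency
    fact_base: list = field(default_factory=list)        # instance n-tuples with predicate/concept labels
    rule_base: list = field(default_factory=list)        # (Horn clause, weight) pairs

# A domain knowledge base would simply be a second instance built from domain-specific texts.
kb = KnowledgeBase()
kb.ontology["instance_concepts"]["China"] = ["geography/region/country"]
```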
Fig. 2 shows the flow diagram of the automatic knowledge base construction method proposed by the present invention. The information extraction algorithm of step S2 is applied to the unstructured texts in the corpus to produce entity n-tuples, which provide data support for the subsequent steps; in other words, the information extraction algorithm extracts the relation base data, namely the entity n-tuples, from the unstructured texts of the corpus. As shown in Fig. 2, besides relation base extraction, the construction of the knowledge base in step S1 comprises the following steps:
Step S11: ontology base construction: build the multi-level concept structure by hierarchical clustering, perform hierarchical multi-label concept identification to build the instance-concept mapping table, and build the category table on this basis;
Step S12: concept knowledge base construction: taking the instance-concept mapping table as the basis, generate concept n-tuples, evaluate them through the concept mapping, and select high-quality concept n-tuples to form the concept knowledge base;
Step S13: fact base construction: taking the concept n-tuples in the concept knowledge base as reference, perform concept mapping and ambiguity elimination on the instance n-tuples, evaluate the n-tuples, and retain high-quality instance n-tuples to form the fact base;
Step S14: rule base construction: build the rule base through rule path mining, rule confidence assessment and rule weight learning;
Step S15: knowledge backtracking and automatic updating: use the ontology base, the concept knowledge base and the rule base to guide the automatic construction of the knowledge base, realizing automatic expansion of the fact base and optimization of the knowledge extraction process.
Each of these techniques is described in detail below, in the above order and with reference to the accompanying drawings.
Fig. 3 shows the flow diagram of the ontology base construction method proposed by the present invention. As shown in Fig. 3, the ontology base construction step S11 specifically comprises:
Step S111: for the material in the corpus, establish the concept hierarchy on the basis of hierarchical clustering analysis of instance attributes;
The concept hierarchy built in step S111 comprises a basic form and a variant of it.
Fig. 4 shows the tree-shaped concept hierarchy diagram proposed by the present invention. As shown in Fig. 4, the features of the tree-shaped concept hierarchy include:
- The concept of the root node is "everything", and the root node has several sub-concepts that partition the concept space.
For example, "organism", "object" and "organization" are sub-concepts of the root node.
- The concept at each node has an attribute set, used for attribute similarity computation and for distinguishing it from other concepts. A sub-concept inherits the attributes of its parent concept and also has its own unique attributes that distinguish it from its sibling nodes.
- There are hierarchical relations between concepts.
For example, the concept of "China" is "geography/region/country", and "geography/region" is the hypernym of "country".
- The hierarchical relations between concepts allow an instance to be further generalized to its parent or grandparent concept.
For example, the concept of "peony" is "organism/plant/flower"; the parent concept of "flower" is "plant", so the concept of "peony" can be generalized to "organism/plant", or even further, as needed, to "organism". The benefit of this generalization is that instances of the same concept can be found in a wider scope, which, while remaining reasonable, helps the computer compare and analyze them.
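The generalization described above (lifting an instance's concept to a parent or grandparent concept) can be sketched as follows; the hierarchy fragment and the helper name are assumptions made only for illustration.

```python
# Hypothetical fragment of the tree-shaped hierarchy: concept path -> parent concept path.
HIERARCHY = {
    "organism/plant/flower": "organism/plant",
    "organism/plant": "organism",
    "organism": "everything",
}

def generalize(concept: str, levels: int = 1) -> str:
    """Lift a concept to an ancestor, one hierarchy level per step."""
    for _ in range(levels):
        concept = HIERARCHY.get(concept, concept)
    return concept

# "peony" has concept "organism/plant/flower"; one step gives "organism/plant",
# two steps give "organism", matching the example in the text.
print(generalize("organism/plant/flower", 1))   # organism/plant
print(generalize("organism/plant/flower", 2))   # organism
```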
Fig. 5 shows the variant structure of the concept hierarchy proposed by the present invention. As shown in Fig. 5, besides having the features of the tree-shaped concept hierarchy, the variant structure is further characterized in that:
- Part of the concept tree forms a directed acyclic graph (DAG). As shown in Fig. 5, this DAG structure lets concepts form network relations, and a concept can be linked to other branches through its parent concepts, which helps knowledge generalization and reasoning.
This structure has a reasonable real-world meaning, because many things in reality are described from different perspectives and therefore have different concepts. For example, one parent concept of "vegetable" is "object/diet/food" and a second is "organism/plant"; one parent concept of "scenic spot" is "geography/region", a second is "building", and a third is "organization".
Step S112: hierarchical multi-label concept identification based on attributes and non-attribute multi-source information, establishing the instance-concept mapping table.
In step S112, the data used for hierarchical multi-label concept identification come from particular types of instance n-tuples in the relation base, including:
- attribute tuples of an instance: tuples formed by an instance and its attributes;
- open-category tuples of an instance: tuples formed by an instance and each category label in its open categories;
- polysemous-word tuples: instance tuples carrying the senses of a polysemous word;
- synonym tuples: tuples formed by an instance and its synonyms.
Here, "attributes" are features that an instance possesses, such as "color", "age", "sports event" and "founding time"; the "open categories" are the set of all possible categories of an instance, and an ambiguous instance generally has several groups of open categories; a polysemous word is an ambiguous instance with several concepts, such as "cuckoo" (杜鹃), whose concepts include "organism/person", "organism/animal" and "organism/plant/flower".
Fig. 6 shows the implementation process of step S112, the hierarchical multi-label concept identification technique, which further comprises the following steps:
Step S1121: concept identification using a hierarchical multi-label classifier based on attribute discriminability assessment and attribute construction. Step S1121 further comprises the following steps:
Step S11211: attribute discriminability assessment: compute the ability of an attribute to distinguish different categories, with metrics including:
- information gain;
- information gain ratio;
- maximum entropy.
Step S11212: attribute construction, which comprises the following steps:
Step S112121: taking 2 or 3 attributes as a group, generate all possible attribute combinations under the specified structures;
Step S112122: assess the class discriminability of each combination, using any one of the metrics in step S11211;
Step S112123: under a given threshold, select the attribute combinations with high discriminability as new composite attributes;
wherein the specified structures include:
Structure 1: the intersection of all attributes in the combination, i.e. all the attributes appear together;
Structure 2: the union of all attributes in the combination, i.e. at least one attribute appears;
Structure 3: a disjunctive normal form of several attributes;
Structure 4: a conjunctive normal form of several attributes.
The purpose of attribute construction is to increase the effective utilization rate of attributes. Because of the large scale of instances there are too many attributes, while the depth of a decision tree model is limited, so not every attribute can be used for model training; as a result the attribute space is huge while the model effectively uses only a few attributes, and many attributes that are never used effectively lose classification precision when the categories are divided.
Attribute construction makes it possible to optimize classifier performance without increasing the depth of the tree. This is illustrated below:
For example, for the attribute "region", the concept of an instance carrying this attribute may be any of "geography/region", "organization/university/school", "organization/club" or "building", so the class discriminability of "region" alone is very low. When the intersection combination "region ^ school type" is considered, an instance carrying this attribute combination has with very high probability the concept "organization/university/school", so the discriminability of the combination is much higher. Similarly, selecting the attribute combination "region ^ sports event ^ competition" can distinguish the category "organization/club".
As another example, consider the concept category "geography/region/country": almost all instances under this concept have attributes such as "National Day", "capital" and "national anthem", which generally only countries possess. If the union of these three attributes is represented as "National Day ∨ capital ∨ national anthem", then when some of these attributes appear it can be judged with very high probability that the concept of the instance is "geography/region/country". If information gain is used as the attribute discriminability metric, the discriminability of a combination is the mean of the information gains of the attributes in it. When a node test is performed, the attribute combination with the largest information gain is used as the discriminating attribute of that node.
The newly constructed attributes and the original attributes jointly form the test attribute set of the node.
It should be noted that the attribute combination forms in the above attribute construction are only for better explaining the present invention and are not limiting. Any method that uses simple variants or similar composition modes thereof and applies the above attribute construction process falls within the scope of the invention.
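A toy sketch of the attribute construction idea in steps S11211-S112123, using information gain over Structure 1 (intersection) combinations as the discriminability metric; the scoring function, the sample data and the threshold are assumptions for illustration only.

```python
import math
from itertools import combinations

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(instances, labels, attr_group):
    """Gain of the test 'all attributes in attr_group are present' (Structure 1)."""
    has = [lab for inst, lab in zip(instances, labels) if attr_group <= inst]
    not_has = [lab for inst, lab in zip(instances, labels) if not attr_group <= inst]
    split = sum(len(part) / len(labels) * entropy(part) for part in (has, not_has) if part)
    return entropy(labels) - split

# Toy data: each instance is represented by its attribute set, each label is its concept.
instances = [{"region", "school type"}, {"region", "sports event"}, {"region"}]
labels = ["organization/university/school", "organization/club", "geography/region"]

attrs = set().union(*instances)
scored = sorted(((information_gain(instances, labels, set(g)), g)
                 for g in combinations(sorted(attrs), 2)), reverse=True)
composite_attrs = [g for score, g in scored if score > 0.5]   # threshold is an assumption
print(composite_attrs)   # combinations such as ("region", "school type") score highest
```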
The classifier in step S1121 is one of the following algorithms:
Multi-Label C4.5: an improvement of the C4.5 decision tree algorithm adapted to multi-label classification;
Predictive Clustering Trees (PCTs): a hierarchical multi-label classifier based on top-down induction of decision trees;
Random forest of PCTs: on the basis of PCTs, randomly sample several subsets and train models, with the final classification decided by voting;
Random forest of ML-C4.5: apply the random forest idea on the basis of ML-C4.5.
Step S1122: hierarchical multi-label concept identification based on concept similarity computation and open-category information;
Step S1122, instance concept identification based on open categories, handles instances whose attributes cannot be obtained, and comprises the following steps:
Step S11221: judge whether the instance is an ambiguous word; if it is ambiguous, go to step S11226, otherwise go to step S11222;
Step S11222: obtain the single group of open categories of the unambiguous instance;
Concept identification based on open categories differs from attribute-based concept identification: the open categories are themselves concept categories of the instance, and are often multi-label categories. However, the open categories have no effective concept hierarchy, and not every category label is reasonable. To solve this problem, the similarity between each category in the open categories and the concepts in the concept hierarchy must be computed, and the most similar concept in the concept hierarchy identified.
Step S11223: perform concept similarity computation for each category label; each concept carries attribute information, and the similarity between two concepts is obtained as the weighted sum of the similarities of their attributes;
Step S11224: obtain the group of concepts that best match the instance;
After the first fusion rule is applied, the concept set is either a single concept or several concepts that can coexist.
For example, "biologist" can have the two concepts "person" and "occupation", although it is not an ambiguous word in the encyclopedia categories. As another example, "Herba Andrographitis" is a kind of "plant" and can also be a kind of "medicine".
Step S11225: apply the first fusion rule to this group of concepts;
The first fusion rule is:
- compute the overlap ratio of the attributes between concepts, and judge the degree of similarity of the concepts against a given threshold;
- when the attribute sets of two concepts overlap completely, the two concepts are identical, and the duplicate concept is removed;
- when one attribute set contains the other, the two concepts are in a parent-child relation, and the child concept is taken;
For example, when the concepts of "Qinghai-Tibet Plateau" are "geography/landform" and "geography/landform/plateau", because "plateau" is a sub-concept of "landform", the final concept of this instance is "geography/landform/plateau".
- when the two concepts partially overlap, they have a certain similarity, and the intersection of the two concepts is taken as the final concept; a concept similarity assessment is needed here, which comprises:
Step 1: compute the degree of similarity between the attribute sets of the concepts.
By attribute set matching, obtain the several concepts most similar to the concept to be matched, forming the similar concept set O1;
By matching under particular instances and relations, obtain several concepts similar to the concept to be matched, forming the similar concept set O2;
Concepts that collocate with a given instance under a given relation usually have a certain similarity. For example, from grow<farmer, vegetables> and grow<farmer, farm vegetables>, the concepts collocated with the instance "farmer" under the "grow" relation are similar, so "farm vegetables" is similar to "vegetables"; in addition, grow<farmer, plant> is also a common collocation, so "farm vegetables" is also similar to "plant".
Step 2: concept selection.
Take the intersection of the similar concept sets O1 and O2 and select the concept with the highest similarity in it as the recognition result for the concept to be matched.
- when the attribute sets of the concepts do not intersect, the two concepts are mutually exclusive, the instance is an ambiguous word, and all the concepts are retained.
Step S11226: obtain the several groups of open categories of the ambiguous instance;
Step S11227: for each category label in each group of open categories, execute steps S11222-S11225;
Step S11228: merge the several groups of concepts obtained and then apply the first fusion rule.
Step S1123: hierarchical multi-label concept identification based on concept similarity computation and polysemous-word information. This method handles ambiguous instances and comprises the following steps:
- structurally segment each sense of the polysemous word according to its grammar and extract candidate concepts;
The emergence of many new technologies and new works produces many ambiguous instances, such as "potato" (土豆), "millet" (小米) and "apple" (苹果). Besides open-category information, many ambiguous words also carry the senses of the polysemous word, and effective use of these senses helps the concept identification of ambiguous instances.
- concept similarity computation and concept identification for the polysemous word;
For each sense of the polysemous word, perform a concept similarity assessment and, according to the result, label the sense with the best-matching concept.
- apply the first fusion rule.
Step S1124: apply the second fusion rule to perform hierarchical concept fusion. In summary, steps S1121-S1123 each perform concept labeling, so there may be redundant results on the same instance. This redundant information, however, can improve the accuracy of identification when the concepts are merged. The purpose of merging is to make full use of the redundant information so as to improve the accuracy and completeness of concept identification. The second fusion rule is as follows:
- when the instance is unambiguous: the attribute-based concept identification result prevails;
- when the instance is ambiguous: take the union of the attribute-based labels, the open-category labels and the polysemous-word labels, and apply the first fusion rule;
- for an instance that has no labeled concept and only synonym information, determine its concept by querying the concepts of its synonyms.
For example, "人大" has neither attribute information nor open-category or polysemous-word information, but only synonym information. Querying the synonyms of "人大" yields two: "Renmin University of China" and "People's Congress", whose concepts are "organization/university/school" and "organization/office" respectively. Therefore "人大" is an ambiguous word whose concepts are "organization/university/school" and "organization/office".
Step S113: based on the concept hierarchy and grammatical knowledge, perform category mapping on the instances in the instance-concept mapping table to generate the category table of instances;
The category table is special knowledge that assists the text processing procedure. It includes categories such as subjective verbs, sentiment words, modal verbs, special word classes and synonym classes, and serves the feature matching processes of the event detection in event classification step S2 and of the event clustering step S3.
It should be noted that the above examples are only for better explaining the present invention and do not limit it. Those skilled in the art should appreciate that any use of data obtained through other channels, any other form of data preprocessing, and any concept identification or entity classification according to the above scheme or its simple variants or combinations all fall within the scope of the invention.
Fig. 7 shows the flow diagram of the construction method of the concept knowledge base in step S12 of the present invention. The data used by this construction method are the unambiguous entity n-tuples in the relation base; their representation format and composition are first introduced below.
Here, "unambiguous" means that every entity in the n-tuple is unambiguous. The relation base contains both unambiguous and ambiguous n-tuples; these n-tuples describe some fact or reasonable behavior, and the entities include both concepts and instances. Different n-tuples have similar formats. An entity n-tuple generally has one of the following forms, or a variant of them:
predicate<entity 1, entity 2, region, time>
<predicate, entity 1, entity 2, region, time>
where "region" or "time" is an entity whose concept is "region" or "time", and the "predicate" describes the connection or behavior between the entities, divided into attribute-type predicates and relation-type predicates. The number of entities is at most 2, while the number of regions and times can be greater than 1, depending on the sentence structure.
According to the type of its "predicate", an entity n-tuple can be divided into attribute-type n-tuples, hypernym n-tuples and behavior-type n-tuples. In an attribute-type n-tuple, the predicate is usually an attribute of an entity, such as "birthplace", "floor area" or "director"; in a hypernym n-tuple, the predicate expresses the subordination of entities, such as "is" or "belongs to"; in a behavior-type n-tuple, the predicate is a word characterizing the behavioral connection between entities, such as "like", "cultivate" or "create".
Among the behavior-type tuples there is one relation with a very wide descriptive range, the "co-occurrence" relation. Here "co-occurrence" means that instance 1 and instance 2 appear close to each other in the same sentence with no verb between them. The use of co-occurrence is that, although there is no obvious behavioral relation between the entities it describes, the co-occurrence relation can help in subsequent sentence disambiguation or text understanding.
Instance n-tuples are exemplified as follows:
Attribute-type tuple: author<Romance of the Three Kingdoms, Luo Guanzhong>
Hypernym tuple: is<dolphin, mammal>
Behavior-type tuple: <occur, the Northeast, earthquake, Japan, today, afternoon>
Co-occurrence tuple: <co-occur, entertainment, magazine>
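A minimal in-memory representation of the entity n-tuples described above; the field names are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityTuple:
    predicate: str                 # e.g. "author", "is", "occur", "co-occur"
    entities: tuple                # at most two entities
    regions: tuple = ()            # zero or more region entities
    times: tuple = ()              # zero or more time entities
    label: Optional[int] = None    # predicate label, assigned later (e.g. occur_12)

# The four example tuples above, expressed in this form:
examples = [
    EntityTuple("author", ("Romance of the Three Kingdoms", "Luo Guanzhong")),
    EntityTuple("is", ("dolphin", "mammal")),
    EntityTuple("occur", ("earthquake",), regions=("the Northeast", "Japan"),
                times=("today", "afternoon")),
    EntityTuple("co-occur", ("entertainment", "magazine")),
]
```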
The concept knowledge base construction method proposed in the present invention queries the unambiguous instance n-tuples, conceptualizes them, and labels the predicates so as to distinguish the different concept n-tuples under the same predicate. The tuples it handles are the unambiguous part of the entity n-tuples.
Step S12: concept knowledge base construction, whose flow is shown in Fig. 7, further comprises the following steps:
Step S121: search the relation base for the unambiguous n-tuples, i.e. the n-tuples in which every entity is unambiguous;
Here, unambiguous means that each entity in the tuple has only one concept.
Step S122: judge whether the entity n-tuple is already a concept n-tuple, and conceptualize each n-tuple by labeling each entity in it with its concept;
For example: pick<people, cotton> => pick<group, crop>
author<publication, person> => author<publication, person> (already a concept n-tuple, so it is unchanged)
Step S123: count the frequency with which each concept n-tuple occurs, record this frequency and remove duplicates, obtaining the initial concept n-tuple set;
Step S124: for any predicate, label all the concept n-tuples under that predicate starting from 1, so as to distinguish the different concept n-tuples under the same predicate;
For example, pick<group, crop> and pick<person, flower> are different concept n-tuples; after labeling they become pick_5<group, crop> and pick_20<person, flower> respectively.
Step S125: evaluate each concept n-tuple according to its frequency, and select the concept n-tuples whose frequency is above a certain threshold to form the concept knowledge base.
Here, each concept n-tuple has its frequency statistic, which characterizes the number of instance n-tuples under that concept n-tuple.
Concept n-tuples obtained after the above steps are exemplified as follows:
Attribute-type tuple: author_2<publication, person>
Hypernym tuple: is_1<dolphin_8, animal>
Behavior-type tuple: <occur_12, region, disaster, region, time, time>
Co-occurrence tuple: <co-occur_179, activity, publication>
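A sketch of steps S121-S125 (conceptualization, frequency counting, predicate labeling and thresholding); the mapping table, the data and the frequency threshold are assumptions.

```python
from collections import Counter

# Assumed fragment of the instance-concept mapping table (unambiguous instances only).
INSTANCE_CONCEPT = {"people": "group", "cotton": "crop", "dolphin": "animal"}

def conceptualize(predicate, instances):
    """S122: replace every instance with its (single) concept."""
    return (predicate, tuple(INSTANCE_CONCEPT.get(i, i) for i in instances))

def build_concept_base(instance_tuples, min_freq=2):
    counts = Counter(conceptualize(p, ins) for p, ins in instance_tuples)   # S123
    concept_base, next_id = {}, Counter()
    for (pred, concepts), freq in counts.items():
        if freq >= min_freq:                 # S125: keep only sufficiently frequent tuples
            next_id[pred] += 1               # S124: number tuples under each predicate from 1
            concept_base[f"{pred}_{next_id[pred]}"] = (concepts, freq)
    return concept_base

tuples = [("pick", ("people", "cotton"))] * 3 + [("pick", ("people", "dolphin"))]
print(build_concept_base(tuples))   # {'pick_1': (('group', 'crop'), 3)}
```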
It should be noted that the above illustrations are only for better explaining the present invention and are not limiting; the above tuple structure forms are merely a preferred embodiment of the invention. Those skilled in the art should appreciate that any use of tuples or structured knowledge representations of other forms together with the above conceptualization technique falls within the scope of the invention.
Fig. 8 shows the specific implementation flow of the fact base construction technique in step S13 of the present invention. The instance n-tuples handled by this technique comprise two parts: unambiguous n-tuples and ambiguous n-tuples. The data structure form of the fact base is first explained below, and then each step is elaborated.
The fact base consists of the instance n-tuples carrying concept and predicate labels. After concept identification, an instance n-tuple becomes an instance n-tuple with concept labels; each concept has a unique label, called the concept label, which is a simplification introduced to make labeling concepts convenient. The predicate of an instance n-tuple with concept labels likewise needs a label, which is obtained by matching the same predicate in the concept n-tuples.
According to the above explanation of entity n-tuples, the fact base can be divided by predicate type into attribute-type facts, hypernym facts and behavior-type facts, which accordingly have one of the following forms or a variant of them:
predicate_id<instance 1_id, instance 2_id, region_id, time_id>
<predicate_id, instance 1_id, instance 2_id, region_id, time_id>
where id is the label of the predicate or instance, generally in digital form, with the following meaning: for a predicate, it distinguishes the different concept n-tuples under that predicate; for an instance, it is its concept label.
After labeling, the instance n-tuples form the fact base, in the following forms:
Attribute-type fact: author_2<Romance of the Three Kingdoms_75, Luo Guanzhong_2>
Hypernym fact: is_1<dolphin_8, mammal_8>
Behavior-type fact: <occur_12, the Northeast_51, earthquake_200, Japan_51, today_139, afternoon_139>
Co-occurrence fact: <co-occur_179, entertainment_185, magazine_75>
Each n-tuple in the fact base can find its corresponding conceptualized tuple in the concept knowledge base; see the concept n-tuple generation part above.
The fact base construction process needs to handle both ambiguous and unambiguous instance n-tuples, and an ambiguous instance n-tuple must undergo concept disambiguation before its correct concepts can be obtained. Disambiguation uniquely determines the correct concept of each instance in an instance n-tuple by matching against the unambiguous concept n-tuples under the same predicate.
Step S13: fact base construction. In conjunction with the embodiment, its implementation comprises the following steps:
Step S131: for any instance n-tuple, judge whether the n-tuple is ambiguous, i.e. whether it contains an ambiguous instance; if it is unambiguous, go to step S132; if it is ambiguous, the n-tuple needs disambiguation, so go to step S136;
Step S132: query the instance-concept mapping table and label each instance in the n-tuple with its concept;
For example, for an unambiguous case, the original n-tuple is pick<people, cotton>; querying the instance-concept mapping table shows that the concept of "people" is "group" and the concept of "cotton" is "crop".
Step S133: query the concept knowledge base and, ignoring the predicate labels, look for matching entries; there may be no match or exactly one match;
For example, for the above case, its concept n-tuple with the predicate label ignored is pick<group, crop>.
Step S134: if there is exactly one match, go to step S135; otherwise, go to step S13A;
For example, matching the above concept n-tuple against the concept knowledge base yields one result: pick_5<group, crop>.
Step S135: assign the predicate label and concept labels of this concept n-tuple to the instance n-tuple to be labeled, then go to step S13A;
For example, for the above case, according to the rule that each concept has a unique label, the label of "group" is "6" and the label of "crop" is "11". Assigning all the labels to this instance n-tuple gives the labeling result pick_5<people_6, cotton_11>.
Step S136: query the instance-concept mapping table and obtain all possible concept n-tuples I1 of this n-tuple under its predicate;
For example, the original n-tuple is enter<reporter, Jinjiang>; querying the instance-concept mapping table shows that the concepts of "reporter" are "person" and "occupation" and the concepts of "Jinjiang" are "city" and "river", so both instances are ambiguous. Ignoring the predicate label, all its possible concept n-tuples are:
enter<person, city>, enter<person, river>, enter<occupation, river>, enter<occupation, city>.
Step S137: search the concept knowledge base for all concept n-tuples I2 under this predicate;
For example, the concept knowledge base contains 152 concept n-tuples with "enter" as the predicate, such as enter_15<culture, place>, etc.
Step S138: ignoring the predicate labels, match the two sets I1 and I2;
For example, for the above case, matching yields a unique result: enter_5<person, city>.
Step S139: if a unique match is obtained, the disambiguation succeeds, go to step S135; otherwise, go to step S13A;
For example, for the above case, the instance n-tuple finally obtained after disambiguation is enter_5<reporter_2, Jinjiang_55>.
Step S13A: if all instance n-tuples have been traversed, finish; otherwise, go to step S131.
Step S13B: compute the frequency with which each instance occurs in the corpus, and evaluate the tuples in the fact base according to the instance frequencies, retaining the high-quality instance n-tuples to form the fact base.
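A sketch of the matching-based disambiguation of steps S136-S139: enumerate all candidate concept n-tuples of the ambiguous instance tuple and intersect them with the concept n-tuples recorded under the same predicate; the data and function names are assumptions.

```python
from itertools import product

# Assumed data: candidate concepts per instance, and concept n-tuples already recorded
# in the concept knowledge base under the predicate "enter" (labels after the "_").
CANDIDATES = {"reporter": ["person", "occupation"], "Jinjiang": ["city", "river"]}
CONCEPT_BASE = {"enter": {("person", "city"): "enter_5", ("culture", "place"): "enter_15"}}

def disambiguate(predicate, instances):
    i1 = set(product(*(CANDIDATES[i] for i in instances)))   # S136: all candidate tuples
    i2 = CONCEPT_BASE.get(predicate, {})                     # S137: tuples under this predicate
    matches = [c for c in i1 if c in i2]                     # S138: intersect, labels ignored
    if len(matches) == 1:                                    # S139: unique match -> success
        return i2[matches[0]], matches[0]
    return None                                              # no unique match: leave for later

print(disambiguate("enter", ("reporter", "Jinjiang")))
# ('enter_5', ('person', 'city')): the tuple can now be labeled enter_5<reporter, Jinjiang>
```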
Fig. 9 is the process diagram of the rule base construction method in step S14 of the present invention. The process comprises the following steps:
Step S140: knowledge preprocessing;
The purpose of preprocessing is to filter out some low-frequency entities or tuples. This improves the accuracy and credibility of the knowledge base, and is based on the understanding that, in the statistical sense, frequently occurring knowledge is more credible and useful than rarely occurring knowledge. This preprocessing is especially necessary for co-occurrence n-tuples: because sentence forms are complex and variable, different entities easily co-occur, and at large scale and under large-scale data only entities that co-occur frequently are likely to have a fixed connection and be of reference value.
Step S141: use the Relational Pathfinding (RPF) algorithm to search the concept knowledge base for all rule paths of length L and convert them into Horn clause form;
A rule is a high-quality criterion learned from knowledge and can be applied in knowledge reasoning or text understanding. In the present invention, a rule has the following Horn clause form:
predicate_1<concept, concept> ^ predicate_2<concept, concept> ^ ... => predicate_L<concept, concept>
where ^ denotes logical AND and => denotes implication ("derives"); the part before => is called the "rule premise", the part after it the "rule conclusion", and L is the length of the rule. The rule expresses that if the rule premise holds, the rule conclusion can be derived.
A rule path must satisfy the rule path search constraint: it contains L tuples, of which the rule premise accounts for L-1 tuples and the rule conclusion for 1 tuple. Each tuple shares at least one identical instance with its adjacent tuple, to guarantee that the path is connected from beginning to end, where the head and tail tuples are regarded as adjacent. For example, in the rule path:
jointly-build_78<v2, v0> ^ undertake_1<v2, v1> => be-responsible-for_570<v0, v1>
tuple 1 and tuple 2 form the rule premise and tuple 3 is the rule conclusion. Tuple 1 and tuple 2 share v2, tuple 2 and tuple 3 share v1, and tuple 3 and tuple 1 share v0, so the rule path constraint is satisfied. The rule expresses that if jointly-build_78<v2, v0> and undertake_1<v2, v1> hold simultaneously, then be-responsible-for_570<v0, v1> holds.
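A small sketch of the rule representation and of the path constraint check described above (every tuple, including head and tail, must share a variable with its neighbor); the class names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    predicate: str      # labeled predicate, e.g. "jointly-build_78"
    args: tuple         # variables, e.g. ("v2", "v0")

@dataclass
class Rule:
    premise: list       # L-1 atoms
    conclusion: Atom    # 1 atom
    weight: float = 0.0 # learned later (step S144)

def satisfies_path_constraint(rule: Rule) -> bool:
    atoms = rule.premise + [rule.conclusion]
    # adjacent atoms (head and tail included) must share at least one variable
    return all(set(atoms[i].args) & set(atoms[(i + 1) % len(atoms)].args)
               for i in range(len(atoms)))

example = Rule(premise=[Atom("jointly-build_78", ("v2", "v0")),
                        Atom("undertake_1", ("v2", "v1"))],
               conclusion=Atom("be-responsible-for_570", ("v0", "v1")))
print(satisfies_path_constraint(example))   # True
```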
Step S142: query the fact base, instantiate the rules while taking parent-child relations into account, and compute the rule confidence as the evaluation criterion for rule screening so as to filter out inaccurate rules; give all the combinations and express them in Horn clause form;
- The instantiation method is: a rule path is an expression at the concept level, so the fact base is first searched for all instances under each concept tuple; the instance n-tuples under each concept are then expressed, in combined form, as instantiated rule paths. The mapping of a rule path from concepts to instances is a process of scaling up from few to many.
- For each rule obtained in step S141, compute the confidence that the rule holds, as follows:
Step S1421: search the fact base and instantiate the rule: during instantiation, consider not only the instances under each concept n-tuple of the rule but also all the child instances whose parent is that concept n-tuple;
For example, for the concept n-tuple manage_87<person, company>, because "celebrity" is a sub-concept of "person", the child instances of this concept n-tuple include the instances whose concept tuple is manage_87<celebrity, company>, such as manage_87<Ma Yun_4, Alibaba_101>.
Step S1422: count the number Q1 of instantiated rules of this rule that hold;
Step S1423: search for all rules that take this rule's premise as their premise, instantiate them, and count the total number Q2 of instantiated rules that hold;
Step S1424: take the ratio of Q1 to Q2 as the confidence of this rule.
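A sketch of the confidence computation in steps S1421-S1424: Q1 counts the instantiations under which this rule holds, Q2 counts the instantiations, over all rules sharing the same premise, that hold, and the confidence is Q1/Q2. Instantiation over child concepts is omitted here, and the fact data and bindings are assumptions.

```python
def rule_holds(facts, premise, conclusion, binding):
    """True if every ground atom of premise + conclusion is present in the fact base."""
    ground = [(p, tuple(binding[v] for v in args)) for p, args in premise + [conclusion]]
    return all(g in facts for g in ground)

def confidence(facts, premise, conclusion, other_conclusions, bindings):
    q1 = sum(rule_holds(facts, premise, conclusion, b) for b in bindings)          # Q1
    q2 = sum(rule_holds(facts, premise, c, b)                                      # Q2
             for c in [conclusion] + other_conclusions for b in bindings)
    return q1 / q2 if q2 else 0.0

# Toy fact base of labeled ground tuples (assumed data).
facts = {("jointly-build_78", ("cityA", "firmB")),
         ("undertake_1", ("cityA", "projectC")),
         ("be-responsible-for_570", ("firmB", "projectC"))}
premise = [("jointly-build_78", ("v2", "v0")), ("undertake_1", ("v2", "v1"))]
bindings = [{"v0": "firmB", "v1": "projectC", "v2": "cityA"}]
print(confidence(facts, premise, ("be-responsible-for_570", ("v0", "v1")), [], bindings))  # 1.0
```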
Step S143: screen the rules according to rule confidence;
All rules are screened according to this probability, and only rules whose probability is above a certain threshold are retained. The probability that a rule holds describes the uncertainty of knowledge from statistics and experience, and this uncertainty really exists in real knowledge networks.
Step S144: rule weight learning based on Markov logic networks.
Steps S11-S14, as sub-parts of the automatic knowledge base construction process, jointly form a complementary and mutually interacting knowledge base system. The relations among the modules of this system are shown in Fig. 10.
Step S15: knowledge backtracking and automatic updating: use the ontology base, the concept knowledge base and the rule base to guide the automatic construction of the knowledge base, realizing automatic expansion of the fact base and optimization of the knowledge extraction process. This step further comprises the following steps:
Step S151: fact base expansion;
Taking the rule base as the basis, uncertain reasoning techniques are used to expand the instance n-tuples in the fact base: starting from the existing knowledge, new knowledge is obtained through knowledge learning and rule-based reasoning. The method comprises the following steps:
Step S1511: for the fact base, use uncertain reasoning based on Markov logic networks to infer new instance n-tuples;
Step S1512: update the fact base.
Step S152: use the existing rule base, fact base and concept knowledge base to optimize the relation base extraction process, while updating the concept knowledge base and the fact base. This step further comprises the following steps:
Step S1521: conceptualize each extracted entity n-tuple, and match the conceptualized n-tuple against the concept knowledge base while ignoring predicate labels;
Step S1522: if the entity n-tuple is conceptually unambiguous:
if a unique match with the concept knowledge base is obtained and the fact base does not contain this entity n-tuple, label it with the correct predicate and concept labels and store it in the fact base, and store its conceptualized tuple in the concept knowledge base; otherwise, put the entity n-tuple into the relation base to await further processing such as concept mapping and disambiguation;
if there is no match with the concept knowledge base, conceptualize this n-tuple, give it the next label under the corresponding predicate and add it to the concept knowledge base; at the same time, label the original entity n-tuple with the correct predicate and concept labels and add it to the fact base;
Step S1523: if the entity n-tuple is conceptually ambiguous:
if a unique match with the concept knowledge base can be obtained and the fact base does not contain this instance n-tuple, label it with the correct predicate and concept labels and add it to the fact base, and add its conceptualized tuple to the concept knowledge base; otherwise, discard it;
if no unique match or no match at all can be obtained from the concept knowledge base, the entity n-tuple cannot be disambiguated and is put into the relation base to await further processing such as concept mapping and disambiguation.
In Fig. 10, the corpus M00 is the unstructured text data obtained from the internet; the knowledge base M7 comprises the relation base M0, the ontology base M3, the concept knowledge base M4, the fact base M5 and the finally generated rule base M6.
The generation relations among the above bases are as follows:
M00 yields M0 through entity and relation extraction;
M0 and M1 generate M2 through the concept identification technique;
M0, M1 and M2 generate M4 through concept mapping;
M0, M2 and M4 generate M5 through concept backtracking and disambiguation;
M4 and M5 generate M6 through rule learning;
M6 and M5 generate a new M5 through fact base expansion;
M4, M5 and M6 generate a new M5 and M0 through extraction process optimization.
The outermost loop of this system, S0-4 + S15, is the automatic knowledge base construction loop, where S0-4 comprises the five processes S0-S4 and S15 is the knowledge backtracking process.
It should be noted that the large-scale knowledge base built in step S1 is a preferred embodiment, and a domain knowledge base is another embodiment of it. In the present invention a domain knowledge base comprises a domain ontology base, a domain concept knowledge base and a domain fact base, and is used to solve problems of a specific domain. There can be multiple domain knowledge bases.
Fig. 11 is the flow diagram of the event detection method based on the large-scale knowledge base in step S2 of the present invention. As shown in Fig. 11, the method comprises the following steps:
Step S21: use Chinese word segmentation to convert the text to be processed into an ordered word sequence and give each word its corresponding part-of-speech tag; at the same time, merge certain special word subsequences into single words and revise their part of speech;
Step S21 comprises the following steps:
Step S211: use Chinese word segmentation to convert the short text to be processed into an ordered word sequence and provide the part-of-speech tag corresponding to each word.
Take the example sentence: Last night, Chang'e-3 successfully landed on the Moon and for the first time completed mutual photographing with the "Yutu" (Jade Rabbit) lunar rover; China's lunar exploration program achieved complete success.
After Chinese word segmentation, the following tagged sequence is obtained:
<last night/t, Chang'e/n, 3/m, hao/q, successfully/ad, land/v, Moon/n, and/cc, with/p, "/wyz, Yutu/n, hao/n, "/wyy, lunar rover/n, complete/v, first time/n, mutually photograph/v, China/ns, Moon/n, exploration/v, program/n, achieve/v, complete/a, success/n>, where t denotes a time word, n a noun, m a numeral, q a measure word, ad an adverb, v a verb, cc a conjunction, p a preposition, ns a place name, a an adjective, wyz a left quotation mark and wyy a right quotation mark.
Step S212: in the part-of-speech-tagged word sequence obtained in S211, word subsequences bounded by quotation marks or book-title marks, or matching frequently occurring part-of-speech templates, are merged, and the part of speech of each merged word is then revised.
The frequently occurring part-of-speech templates cover the following cases: (1) the template v+k; (2) the template n+k; (3) the template n+ng; (4) the template m+q; (5) the template n+q; (6) the template v+ng, where "v" denotes a verb, "k" a suffix component, "ng" a nominal morpheme, "m" a numeral, "q" a measure word, and "+" the concatenation of parts of speech.
The merged words are then given corrected part-of-speech tags: subsequences enclosed by quotation marks or book-title marks and subsequences matching cases (1) to (5) are re-tagged as nouns (tag n), while subsequences matching case (6) are re-tagged as intransitive verbs (tag vi).
For the part-of-speech sequence of the example sentence, "3/m, hao/q" is merged into "No. 3/n", and ""/wyz, Yutu/n, hao/n, "/wyy" is merged into ""Yutu"/n". The part-of-speech sequence obtained after merging is therefore:
<last night/t, Chang'e/n, No. 3/n, successfully/ad, land/v, moon/n, and/cc, with/p, "Yutu"/n, lunar rover/n, complete/v, first-time/n, mutual-photograph/v, China/ns, moon/n, exploration/v, program/n, achieve/v, perfect/a, success/n>
It should be noted that these part-of-speech subsequences are only the basic forms used for merging and do not limit the present invention; those skilled in the art will appreciate that extensions of these subsequences also fall within the scope of the present invention.
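For illustration only, the following Python sketch shows one possible realization of the template-based merging of step S212; the tag set follows the text, while the data structures and function names are assumptions rather than part of the patented method.

```python
# A minimal sketch of the POS-template merging of step S212.
MERGE_TEMPLATES = {
    ("v", "k"): "n", ("n", "k"): "n", ("n", "ng"): "n",
    ("m", "q"): "n", ("n", "q"): "n", ("v", "ng"): "vi",
}

def merge_by_templates(tagged):
    """tagged: list of (word, pos) pairs; returns the merged list."""
    out, i = [], 0
    while i < len(tagged):
        # merge a span enclosed by quotation/book-title marks into one noun
        if tagged[i][1] == "wyz":
            j = i + 1
            while j < len(tagged) and tagged[j][1] != "wyy":
                j += 1
            out.append(("".join(w for w, _ in tagged[i + 1:j]), "n"))
            i = j + 1
            continue
        # merge two adjacent words matching a frequent POS template
        if i + 1 < len(tagged):
            key = (tagged[i][1], tagged[i + 1][1])
            if key in MERGE_TEMPLATES:
                out.append((tagged[i][0] + tagged[i + 1][0], MERGE_TEMPLATES[key]))
                i += 2
                continue
        out.append(tagged[i])
        i += 1
    return out
```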
Step S22: based on the word sequence obtained in step S21, its entities are mapped into the hierarchical concept space, and the ambiguous entities are given a coarse semantic disambiguation according to the concepts of the entities in the sentence.
Step S22 further comprises the following steps:
Step S221: based on the ontology base in the large-scale knowledge base, each entity in the sentence that carries attribute information is mapped into its hierarchical concept space; the entities in the above example are accordingly mapped as:
Chang'e: article/appliance/equipment; creature/virtual character
No. 3: culture/symbol
moon: celestial system/celestial body/satellite
Yutu: article/appliance/equipment
lunar rover: article/appliance/equipment
The concept descriptions of these entities are obtained off-line when the ontology base is built; only their results need to be stored and indexed, and they are queried directly during on-line analysis.
Step S222: semantic disambiguation is performed according to the candidate concepts of each entity in the sentence; under the constraint of the concepts of the unambiguous entities in the sentence, a probability is computed for each candidate concept of an ambiguous entity, and the concept with the highest probability is taken as the first candidate concept of that entity in this sentence.
Step S222 further comprises the following steps:
Step S2221: each unambiguous entity obtained in S221 (i.e. an entity with a unique candidate concept) is paired with each ambiguous entity to form entity pairs, and according to the concepts of the two entities in each pair, the entity pairs are mapped to concept pairs. For this example, the entity pairs and the corresponding concept pairs are:
(Chang'e, moon): (article/appliance/equipment, celestial system/celestial body/satellite), (creature/virtual character, celestial system/celestial body/satellite)
(Chang'e, Yutu): (article/appliance/equipment, article/appliance/equipment), (creature/virtual character, article/appliance/equipment)
(Chang'e, lunar rover): (article/appliance/equipment, article/appliance/equipment), (creature/virtual character, article/appliance/equipment)
Step S2222: based on the general-knowledge base in the large-scale knowledge base, the frequency of each concept pair from S2221 is counted; taking each candidate concept of the ambiguous entity as the center, the frequencies of the concept pairs containing it are summed and normalized to obtain a probability. Figure 12 gives the corresponding concept pairs and their frequencies in the general-knowledge base; based on this example, the probability computation for the candidate concepts of the ambiguous entity "Chang'e" is as follows:
the total frequency of "Chang'e" under the hierarchical concept "article/appliance/equipment" is 46+2529+2529=5104;
the total frequency of "Chang'e" under the hierarchical concept "creature/virtual character" is 2+102+102=206.
Normalizing these results gives a probability of 0.96 for the candidate hierarchical concept "article/appliance/equipment" of "Chang'e" and 0.04 for the candidate hierarchical concept "creature/virtual character", so the first candidate concept of the ambiguous entity "Chang'e" in this sentence is "article/appliance/equipment".
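The frequency normalization of step S2222 can be sketched as follows; the container names are assumptions, and the general-knowledge base is abstracted here as a dictionary of concept-pair frequencies.

```python
# Candidate concepts of an ambiguous entity are scored by summing, over the
# concept pairs they form with the unambiguous entities in the sentence, the
# pair frequencies stored in the general-knowledge base, then normalizing.
def disambiguate(candidates, unambiguous_concepts, pair_frequency):
    """candidates: concepts of the ambiguous entity;
    unambiguous_concepts: concepts of the unambiguous entities in the sentence;
    pair_frequency: dict mapping (concept_a, concept_b) -> corpus frequency."""
    scores = {}
    for c in candidates:
        scores[c] = sum(pair_frequency.get((c, u), 0) + pair_frequency.get((u, c), 0)
                        for u in unambiguous_concepts)
    total = sum(scores.values()) or 1
    probs = {c: s / total for c, s in scores.items()}
    return max(probs, key=probs.get), probs

# With the figures quoted in the text (46+2529+2529 vs. 2+102+102) this
# returns "article/appliance/equipment" with probability about 0.96.
```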
Step S23: based on the word sequence with candidate concepts obtained in step S22, a Chinese dependency extraction technique is used, together with the basic Chinese sentence patterns, to convert the word sequence into a sequence of structured tuples carrying semantic information. This sequence preserves the semantic information of the sentence and supports the subsequent knowledge-base-based feature extraction and feature recognition of event texts.
Step S23 further comprises the following steps:
Step S231: using the Chinese dependency extraction technique, dependency analysis is carried out on the word sequence obtained in S22, and the dependency relations between words in the sentence are obtained and stored. A dependency relation is expressed as:
relation(word_i-id_i, word_j-id_j)
where relation denotes the dependency relation between the two words, word_i denotes the i-th word in the sentence, and id_i denotes the positional index of word_i in the sentence.
For the word sequence of the above example, <last night/t, Chang'e/n, No. 3/n, successfully/ad, land/v, moon/n, and/cc, with/p, "Yutu"/n, lunar rover/n, complete/v, first-time/n, mutual-photograph/v, China/ns, moon/n, exploration/v, program/n, achieve/v, perfect/a, success/n>, dependency analysis yields the following sequence of word dependencies:
[tmod(land-5, last night-1), nsubj(land-5, Chang'e-2), tmod(land-5, No. 3-3), advmod(land-5, successfully-4), root(ROOT-0, land-5), dobj(land-5, moon-6), prep(mutual-photograph-13, with-8), nn(lunar rover-10, "Yutu"-9), pobj(with-8, lunar rover-10), dobj(mutual-photograph-13, complete-11), advmod(mutual-photograph-13, first-time-12), conj(land-5, complete-11), nn(program-18, China-15), nn(program-18, moon-16), nn(program-18, exploration-17), nsubj(achieve-19, program-18), dobj(success-21, achieve-19), amod(success-21, perfect-20)]
where tmod denotes a temporal modifier relation, nsubj a subject-predicate relation, advmod an adverbial modifier relation, root the core predicate of the sentence, dobj a verb-object relation, prep a prepositional modifier relation, nn a noun modifier relation, pobj a preposition-object relation, conj a coordination relation, and amod an adjectival modifier relation.
Step S232: based on the general-knowledge base and the factbase in the large-scale knowledge base, and using the dependency relations from S231 and the word sequence with first candidate concepts generated in S22, structured tuples are extracted by part-of-speech templates with reference to the basic Chinese sentence patterns, generating the structured tuples. The basic Chinese sentence patterns include: (1) NP+VP; (2) NP1+[ba(把)+NP2]+VP; (3) NP1+[bei(被)+NP2]+VP; (4) NP1+[shi(是)]+NP2, etc., where NP denotes a noun phrase and VP a verb phrase, and the forms of the corresponding extracted structured tuples are respectively: (1) VP head word(s: NP head word, o: VP object); (2) VP head word(s: NP1 head word, o: NP2 head word); (3) VP(s: NP2 head word, o: NP1 head word); (4) shi(NP1 head word, NP2 head word), where s denotes the subject and o the object.
It should be noted that Chinese complex sentences are essentially composed of these basic sentence patterns; therefore, sentence patterns formed by combinations of these basic patterns also fall within the scope of the present invention.
In more detail, step S232 further comprises the following steps:
Step S2321: sentence core-predicate recognition. The core predicate is identified from the subjective-verb list in the ontology classification table and the dependency sequence, as follows: if the sentence contains a subjective verb, that subjective verb is the core predicate; otherwise the predicate whose dependency relation is "root" is taken as the core predicate. Since the example sentence contains no subjective verb, the verb "land", whose dependency relation is "root", is taken as the core verb of the example sentence.
Step S2322: noun-phrase recognition. According to the dependency sequence, a word tagged as a verb but involved in an "nn" dependency is re-tagged as a noun. For the dependency "nn(program-18, exploration-17)" in the example sentence, "exploration" is tagged as a verb but forms a nominal modifier relation with "program", so its part of speech in the part-of-speech sequence is revised to noun (n).
Step S2323: prepositional-phrase recognition based on part-of-speech templates. Prepositional phrases are characterized by templates over the part-of-speech sequence, covering the following cases:
(1) p+...+f: p is a preposition and f a locative noun; with p as the left boundary and f as the right boundary, the whole forms a prepositional phrase;
(2) p+...+when/time/moment: with p as the left boundary and the time word "when/time/moment" as the right boundary, the whole forms a prepositional phrase;
(3) v+p+...+n: with p as the left boundary and n as the right boundary, the whole forms a prepositional phrase;
(4) p+...+de(的)+n+v: with p as the left boundary and the auxiliary word "de (的)" as the right boundary, the whole forms a prepositional phrase;
(5) p+n1+...+nj: with p as the left boundary and nj as the right boundary, the whole forms a prepositional phrase.
Whenever the part-of-speech sequence matches one of the above cases, a prepositional phrase of the sentence is obtained. The example sentence matches case (5), so the extracted prepositional phrase is "with the "Yutu" lunar rover".
Step S2324: the prepositional-phrase sequence obtained in S2323 is separated from the original word sequence, and function words, adverbs, adjectives and similar parts of speech are filtered out, giving a simplified part-of-speech sequence. Each verb in this sequence is scanned, and structured tuples are extracted using the part-of-speech templates of the four basic sentence patterns, which are respectively:
for NP+VP the template is n1+...+ni+v+nk+...+nj, and the corresponding tuple is v(s:ni, o:nj);
for NP1+[ba+NP2]+VP the template is n1+...+ni+ba+nk+...+nj+v, and the corresponding tuple is v(s:ni, o:nj);
for NP1+[bei+NP2]+VP the template is n1+...+ni+bei+nk+...+nj+v, and the corresponding tuple is v(s:nj, o:ni);
for NP1+[shi]+NP2 the template is n1+...+ni+shi+nk+...+nj, and the corresponding tuple is shi(s:ni, o:nj).
After removing the prepositional phrase, function words, adverbs, adjectives and the like from the example sentence, the simplified part-of-speech sequence is:
<last night/t, Chang'e/n, No. 3/n, land/v, moon/n, complete/v, first-time/n, mutual-photograph/v, China/ns, moon/n, exploration/n, program/n, achieve/v, success/n>
The structured tuples generated from the basic sentence-pattern templates are:
land(s: No. 3, o: moon, t: last night)
complete(o: first-time, p: with the "Yutu" lunar rover)
mutual-photograph(s: first-time, p: with the "Yutu" lunar rover)
achieve(s: program, o: success, place: China)
where s denotes the subject, o the object, t the time, p the prepositional phrase and place the location. The tuple whose predicate is the core verb is the core tuple, so the core tuple of the example sentence is "land(s: No. 3, o: moon, t: last night)".
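As a rough illustration of the NP+VP template of step S2324 (the other three templates are handled analogously), the following sketch scans the simplified part-of-speech sequence; the representation of the sequence and the helper logic are assumptions made only for illustration.

```python
# For each verb, take the nearest preceding noun as subject (ni) and the last
# noun before the next verb as object (nj), yielding a tuple v(s: ni, o: nj).
def extract_np_vp_tuples(simplified):
    """simplified: list of (word, pos) pairs after prepositional phrases,
    function words, adverbs and adjectives have been removed."""
    tuples = []
    for k, (word, pos) in enumerate(simplified):
        if pos != "v":
            continue
        # subject ni: the noun immediately preceding the verb
        subj = next((w for w, p in reversed(simplified[:k]) if p.startswith("n")), None)
        # object nj: the last noun occurring before the next verb
        obj = None
        for w, p in simplified[k + 1:]:
            if p == "v":
                break
            if p.startswith("n"):
                obj = w
        tuples.append({"predicate": word, "s": subj, "o": obj})
    return tuples
```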
Step S2325: tuple collocation verification based on the general-knowledge base and the dependency relations. Structured-tuple extraction based on part-of-speech templates can introduce noise: an extracted tuple may fail to express the meaning of the sentence and may contain a wrong subject-predicate or predicate-object collocation. Wrong collocations are therefore verified and corrected using the general-knowledge base, the factbase and the dependency relations. The procedure is as follows:
Step S23251: the instances in each structured tuple are mapped to their first candidate concepts, forming a conceptualized tuple, and this concept tuple is queried in the general-knowledge base. If the query result is true, the collocation of the tuple is considered reasonable and the algorithm returns; otherwise step S23252 is performed;
Step S23252: if the subject (or object) concept of the tuple is a wrong collocation, the word that forms the dependency relation "nsubj" (or "dobj") with the predicate of the tuple is looked up first; if such a word is found, it replaces the subject (or object) component of the original tuple; otherwise all noun entities located before the subject in the part-of-speech template are traversed and substituted in turn for the subject (or object) component of the original tuple. The replaced tuple is verified against the general-knowledge base; if the result is true, the part of speech of the substituted word is revised to noun and the algorithm returns, otherwise the tuple is deleted.
For instance, for the tuple "land(s: No. 3, o: moon, t: last night)" of the example sentence, the concept tuple generated from the first candidate concepts of the entities is "land(s: culture/symbol, o: celestial system/celestial body/satellite, t: time)". Querying the general-knowledge base shows that this concept tuple does not exist, and the subject collocation is judged to be wrong. In the dependency relation "nsubj(land-5, Chang'e-2)" of the predicate "land", "Chang'e" can serve as the subject of "land"; it therefore replaces the subject "No. 3" of the original tuple, forming the new tuple "land(s: Chang'e, o: moon, t: last night)" and the conceptualized tuple "land(s: article/appliance/equipment, o: celestial system/celestial body/satellite, t: time)". Since the general-knowledge base contains this conceptualized tuple, the query returns true, the extraction of this tuple is considered reasonable, and the algorithm returns. The other tuples of the example sentence are handled similarly, so the structured-tuple sequence of the example sentence obtained after verification is:
land(s: Chang'e, o: moon, t: last night)
complete(o: mutual-photograph, p: with the "Yutu" lunar rover)
achieve(s: program, o: success, place: China)
Here, because "mutual-photograph" acts as the object of the predicate "complete", its verb part of speech is revised to noun and its hierarchical concept is obtained, namely:
mutual-photograph: action/behavior
The structured tuple whose predicate is "mutual-photograph" can now be deleted.
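A simplified sketch of the collocation check of step S2325 follows; the general-knowledge base is abstracted as a set of conceptualized tuples and the dependency store as a dictionary, both assumptions made purely for illustration.

```python
# Map tuple arguments to first candidate concepts, query the general-knowledge
# base, and on failure try the nsubj/dobj word of the predicate as a repair.
def verify_tuple(tup, first_concept, general_kb, dependencies):
    """tup: {'predicate': str, 's': str, 'o': str};
    first_concept: dict word -> first candidate concept;
    general_kb: set of (predicate, subject_concept, object_concept);
    dependencies: dict (relation, predicate) -> dependent word."""
    def concept_key(t):
        return (t["predicate"], first_concept.get(t["s"]), first_concept.get(t["o"]))

    if concept_key(tup) in general_kb:
        return tup                      # collocation already reasonable
    for role, rel in (("s", "nsubj"), ("o", "dobj")):
        candidate = dependencies.get((rel, tup["predicate"]))
        if candidate is None:
            continue
        repaired = dict(tup, **{role: candidate})
        if concept_key(repaired) in general_kb:
            return repaired             # repaired collocation verified
    return None                         # no valid collocation: delete the tuple
```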
Step S2326: filling of missing tuple components based on the general-knowledge base and the dependency relations. In the part-of-speech-template extraction of structured tuples, because punctuation splits the sentence, an extracted tuple may lack a subject or an object; the missing component therefore has to be filled in reasonably using the dependency relations and the general-knowledge base. The concrete steps are as follows:
Step S23261: for a tuple missing its subject (or object), if a word forming an "nsubj" (or "dobj") dependency with the predicate of the tuple can be found, that word is filled in as the subject (or object) to form a new tuple; otherwise step S23262 is performed. The newly generated tuple is mapped to a concept tuple via the first candidate concepts of its instances and verified against the general-knowledge base; if the verification passes, the algorithm returns, otherwise S23262 is performed;
Step S23262: starting from the position of the tuple's predicate in the word sequence, all entities are searched forwards (or backwards), on the assumption that an entity closer to the predicate is more likely to be the subject (or object) of the tuple; this yields a sequence of candidate subjects (or objects), and step S23263 is performed;
Step S23263: the candidate subject (or object) entity sequence is scanned; each candidate is filled into the tuple to form a new structured tuple, which is mapped to a concept tuple via the first candidate concepts and verified with the general-knowledge base. If the verification result is true, the newly generated tuple is returned, otherwise the original tuple is returned and the algorithm ends.
For instance, the tuple "complete(o: mutual-photograph, p: with the "Yutu" lunar rover)" of this example lacks its subject component, and no word forms an "nsubj" relation with the predicate "complete" in the dependency sequence, so step S23262 is performed. The generated candidate-subject entity sequence is <No. 3, Chang'e>, giving the new tuples "complete(s: No. 3, o: mutual-photograph, p: with the "Yutu" lunar rover)" and "complete(s: Chang'e, o: mutual-photograph, p: with the "Yutu" lunar rover)". After mapping them to conceptualized tuples and verifying them with the general-knowledge base, the reasonable tuple obtained is "complete(s: Chang'e, o: mutual-photograph, p: with the "Yutu" lunar rover)". In summary, the structured-tuple sequence obtained for the example sentence is:
land(s: Chang'e, o: moon, t: last night)
complete(s: Chang'e, o: mutual-photograph, p: with the "Yutu" lunar rover)
achieve(s: program, o: success, place: China)
where the core tuple is "land(s: Chang'e, o: moon, t: last night)".
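The component filling of step S2326 can be sketched in the same assumed representation as the verification sketch above; again this is an illustration under stated assumptions, not the patented implementation.

```python
# Fill a missing subject or object: first try the nsubj/dobj dependent of the
# predicate (S23261), then try entities ordered by closeness to the predicate
# (S23262/S23263), keeping the first candidate verified by the knowledge base.
def fill_missing_role(tup, role, dependencies, entities_by_distance,
                      first_concept, general_kb):
    """role: 's' or 'o'; entities_by_distance: entities ordered by closeness
    to the predicate in the word sequence (closer first)."""
    def verified(candidate):
        key = (candidate["predicate"], first_concept.get(candidate["s"]),
               first_concept.get(candidate["o"]))
        return key in general_kb

    rel = "nsubj" if role == "s" else "dobj"
    word = dependencies.get((rel, tup["predicate"]))          # S23261
    if word is not None and verified(dict(tup, **{role: word})):
        return dict(tup, **{role: word})
    for entity in entities_by_distance:                       # S23262 / S23263
        candidate = dict(tup, **{role: entity})
        if verified(candidate):
            return candidate
    return tup                                                # keep the original tuple
```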
Step S24: from the tuple sequence obtained in step S23, text features are extracted based on the domain knowledge base. The text features include: (1) the category of the predicate of the core tuple; (2) the category of the subject of the structured tuple; (3) the event-type judgement of the structured tuple. It should be noted that the domain knowledge base comprises a domain ontology base, a domain general-knowledge base and a domain factbase. Specifically, the method of event-feature extraction and feature recognition based on the domain knowledge base further comprises the following steps:
Step S241: for the predicate of the core tuple in the structured-tuple sequence generated in S23, here called the core predicate, a judgement is made based on the subjective-verb list and the modal-verb table in the ontology classification table: if the core predicate is a subjective or modal verb, the text is marked as a non-event text and the algorithm ends; otherwise step S242 is performed;
Step S242: based on the domain ontology base, the predicate of each tuple is judged: if the predicate exists in the domain ontology base, step S243 is performed; otherwise the core predicate is generalized, via its synonyms, into a synonym predicate sequence, and if the domain ontology base contains none of the words in this sequence, the text is marked as a non-event text and the algorithm ends; otherwise step S243 is performed;
Step S243: if the core tuple lacks its subject component, the text is marked as a non-event text and the algorithm ends; otherwise, based on the sentiment word list in the domain ontology classification table, if the subject appears in the sentiment word list, the text is marked as a non-event text and the algorithm ends; otherwise step S244 is performed;
Step S244: the core tuple is mapped to a concept tuple according to the first candidate concepts of its instances, and the tuple event judgement is made with the domain general-knowledge base: if the domain general-knowledge base contains this conceptualized tuple, the text is marked as an event text and the algorithm ends; otherwise step S245 is performed;
Step S245: each instance of the core tuple is generalized via its synonyms, generating a synonym sequence for each entity, which is then combined with the core predicate to generate a sequence of core tuples; each newly generated tuple is looked up in the domain factbase in turn, and if one exists there the text is marked as an event text; otherwise it is marked as a non-event text, and the algorithm ends.
For instance, in this example the core tuple is "land(s: Chang'e, o: moon, t: last night)". Its core predicate "land" exists in the domain ontology base, the subject of the tuple exists and is not a sentiment word, and the conceptualized tuple corresponding to this tuple, "land(s: article/appliance/equipment, o: celestial system/celestial body/satellite, t: time)", is not present in the domain general-knowledge base. Therefore "Chang'e" and "moon" are generalized via synonyms: the synonym sequence obtained for "Chang'e" is <Shenzhou, Apollo>, and the synonym sequence obtained for "moon" is <the moon>, so combining them with the core predicate generates the following tuples:
land(s: Shenzhou, o: moon, t: last night)
land(s: Apollo, o: moon, t: last night)
land(s: Chang'e, o: moon, t: last night)
Among these, the tuple "land(s: Apollo, o: moon, t: last night)" exists in the domain factbase, so the text is marked as an event text and the algorithm ends.
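The decision cascade of steps S241 to S245 can be summarized by the following sketch; all knowledge-base containers (ontology, sentiment list, general-knowledge base, factbase, synonym table) are abstracted as in-memory collections purely for illustration.

```python
# Label a text as event / non-event from its core tuple, following S241-S245.
def is_event_text(core_tuple, domain_ontology, subjective_verbs, sentiment_words,
                  domain_general_kb, domain_factbase, synonyms, first_concept):
    pred, subj, obj = core_tuple["predicate"], core_tuple.get("s"), core_tuple.get("o")
    # S241: a subjective (or modal) core predicate means a non-event text
    if pred in subjective_verbs:
        return False
    # S242: the predicate, or one of its synonyms, must exist in the domain ontology
    if pred not in domain_ontology and not any(p in domain_ontology
                                               for p in synonyms.get(pred, [])):
        return False
    # S243: a missing or sentiment-word subject means a non-event text
    if subj is None or subj in sentiment_words:
        return False
    # S244: conceptualized core tuple found in the domain general-knowledge base
    if (pred, first_concept.get(subj), first_concept.get(obj)) in domain_general_kb:
        return True
    # S245: expand each instance by its synonyms and look the tuples up in the factbase
    for s in [subj] + synonyms.get(subj, []):
        for o in [obj] + synonyms.get(obj, []):
            if (pred, s, o) in domain_factbase:
                return True
    return False
```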
Figure 13 is a flow chart of the method for short-text event clustering and hot-event screening based on the knowledge base in step S3 of the present invention. The short-text input of step S3 is provided by the event-detection step S2; supported by the knowledge base, a clustering algorithm is applied to mine potential event clusters, and hot events are screened out according to a given threshold, as shown in Figure 13. A simple embodiment of the processing in step S3 follows:
text1: At 20:02 on 31 October, a magnitude-6.7 earthquake occurred in Hualien.
text2: On the evening of 31 October, a magnitude-6.7 earthquake occurred in Ruisui Township, Hualien County, damaging many houses.
text3: On 31 October, two earthquakes of magnitude 5 or above occurred in quick succession in Songyuan, Jilin; so far 4,929 households' houses have been damaged.
Step S3: hot-event screening based on the knowledge base, which comprises the following steps:
Step S31: short-text preprocessing and word segmentation; according to the inference relations between words in the rule base, clauses describing event results are filtered out, to prevent different events with similar results from interfering with the clustering result.
For example, in text2 "damaging many houses" is an event result, while "a magnitude-6.7 earthquake occurred in Ruisui Township, Hualien County" is the event body; text3 has a similar event result. Such cases are likely to disturb the clustering result because of the partially similar features, so event results need to be filtered out.
Step S31 further comprises the following steps:
Step S311: the text is split into clauses at commas, and the core word of each clause is determined; the core word is the event-related noun or verb;
Step S312: the rule base is searched, and the antecedent-consequent inference relations between words in the rule base are matched against the core words of the clauses; if the rule weight is greater than a given threshold and the match succeeds, an inference relation is considered to hold between the clauses, the latter word is considered the result of the former word, and the clause containing the latter word is filtered out. A sketch of this filtering is given after the following example.
For example, the rule "earthquake -> houses damaged" expresses that when an earthquake occurs, it can be inferred that houses are damaged.
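A minimal sketch of the filtering of steps S311 and S312 follows; the rule format (a weighted cause-effect word pair) and the threshold value are assumptions made for illustration.

```python
# Clauses whose core word is the *consequence* of an earlier clause's core
# word, according to a sufficiently weighted inference rule, are dropped
# before clustering.
def filter_result_clauses(clauses, core_word, rules, threshold=0.5):
    """clauses: list of short clauses split on commas;
    core_word: function clause -> event-related noun/verb;
    rules: dict (cause_word, effect_word) -> rule weight."""
    cores = [core_word(c) for c in clauses]
    kept = []
    for j, clause in enumerate(clauses):
        is_result = any(rules.get((cores[i], cores[j]), 0.0) > threshold
                        for i in range(j))
        if not is_result:
            kept.append(clause)       # keep the event body, drop event results
    return kept

# e.g. with rules = {("earthquake", "damaged"): 0.9}, the clause
# "damaging many houses" would be dropped from text2.
```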
Step S32: the information-extraction technique of step S2 is used to obtain the structured representation of each short-text event, i.e. the instance n-tuple representation; the number of structured tuples varies with the text length. After text1 to text3 are converted into structured tuples, their representations are:
text1: occur(s: Taiwan+Hualien, o: 6.7-magnitude+earthquake);
text2: damaged(s: houses), occur(s: Taiwan/Hualien County+Ruisui Township, o: 6.7-magnitude+earthquake);
text3: occur-in-succession(s: Jilin+Songyuan, o: earthquake), cause(s: Songyuan, o: 4,929 households+houses), damaged(s: 4,929 households+houses)
where s indicates that the instance acts as the subject of the tuple, o indicates that it acts as the object, the words before "+" are the modifiers of the corresponding component, and multiple modifiers are separated by "/".
Step S33: incremental feature selection is performed and matched against the structured tuples, while the feature-value vectors for clustering are constructed. A feature is represented as a structured tuple, and the dimension of the feature vector grows incrementally as the number of short texts increases. During structured-tuple matching, the corresponding components of two tuples are compared and the weighted sum gives the final matching score of the tuples; this value is normalized to between 0 and 1, and a larger value indicates that the two tuples are more similar.
Step S33 further comprises the following steps:
Step S331: the structured-tuple representation of the short-text event is obtained;
Step S332: each structured tuple is compared one by one against the feature vector, and the feature with the highest similarity to it is retained. During matching, a structured tuple is divided into five components, namely predicate, subject, object, subject modifier and object modifier, and each component is assigned a corresponding weight w1-w5;
For example, for the tuple occur(s: Taiwan/Hualien County+Ruisui Township, o: 6.7-magnitude+earthquake): predicate = occur, subject = Ruisui Township, object = earthquake, subject modifier = Taiwan/Hualien County, object modifier = 6.7-magnitude.
For each structured tuple in the feature vector, step S332 further comprises the following steps:
Step S3321: before matching, the quasi-similarity of the tuple to be matched is initialized to 0;
Step S3322: taking the feature tuple as the reference, whenever a component of the feature tuple is matched in the tuple to be matched, the corresponding weight is added to the quasi-similarity. The matching process comprises three kinds of operation, namely identity judgement, synonymy judgement and concept-identity judgement; these three operations represent decreasing degrees of similarity and apply a corresponding decay to the accumulation of the quasi-similarity;
For example, for the feature tuple occur(s: Taiwan+Hualien, o: 6.7-magnitude+earthquake) and the tuple to be compared occur-in-succession(s: Jilin+Songyuan, o: earthquake), the corresponding components are compared in turn: "occur vs. occur-in-succession" is a synonym comparison, while "Hualien vs. Songyuan" are both places and form a same-concept comparison; two instances matched at this level have the lowest similarity.
Step S3323: when the above three operations cannot match two corresponding instances, the simplified instance rules in the rule base are searched for an antecedent-consequent inference relation between the two instances whose inference weight is greater than a given threshold; if such a rule exists, the two instances are considered matched at the rule level and the corresponding weight is added to the quasi-similarity;
For example, when comparing "the US President" with "Obama", the rule "US President -> Obama" realizes the matching of these two instances.
Step S3324: assuming the sum of the weights of all components of the feature tuple is N, the similarity of the tuple to be matched with the feature tuple is the quasi-similarity divided by N.
Step S333: if the similarity is greater than a specific threshold, the structured tuple is considered to have matched a certain feature, and the feature-value vector of the short text is set to 1 at the corresponding position and 0 otherwise. If the matching fails, the tuple is considered a new feature and is added to the feature vector; the feature-value vector of the short text is 1 at the corresponding position and 0 at the remaining positions.
Step S334: whether all short texts have completed structure matching is checked; if so the procedure exits, otherwise it returns to step S331. (A sketch of the component matching of step S332 follows.)
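The component matching of steps S3321 to S3324 can be sketched as follows; the component weights and decay factors are illustrative assumptions, not values prescribed by the method.

```python
# Quasi-similarity between a feature tuple and a tuple to be matched, over the
# five components of step S332, with decreasing credit for identity, synonymy,
# same-concept and rule-level matches.
COMPONENTS = ["predicate", "subject", "object", "subj_mod", "obj_mod"]
WEIGHTS = {"predicate": 3, "subject": 3, "object": 3, "subj_mod": 1, "obj_mod": 1}
DECAY = {"identical": 1.0, "synonym": 0.8, "same_concept": 0.6, "rule": 0.5}

def tuple_similarity(feature, candidate, are_synonyms, same_concept, rule_match):
    quasi = 0.0                                   # S3321: initialize to 0
    for comp in COMPONENTS:                       # S3322: accumulate matched weights
        a, b = feature.get(comp), candidate.get(comp)
        if a is None or b is None:
            continue
        if a == b:
            quasi += WEIGHTS[comp] * DECAY["identical"]
        elif are_synonyms(a, b):
            quasi += WEIGHTS[comp] * DECAY["synonym"]
        elif same_concept(a, b):
            quasi += WEIGHTS[comp] * DECAY["same_concept"]
        elif rule_match(a, b):                    # S3323: rule-level match
            quasi += WEIGHTS[comp] * DECAY["rule"]
    return quasi / sum(WEIGHTS.values())          # S3324: normalize by N
```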
Step S34: based on the structured feature-value vectors obtained, a clustering algorithm is applied to obtain preliminary event clusters C1, where each short text has been converted into one feature vector.
The clustering algorithm used in step S34 comprises one of the following algorithms (a minimal usage sketch is given after the list):
- K-means algorithm;
- Affinity Propagation (AP) algorithm;
- stream clustering algorithm;
- ClusTree clustering algorithm.
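For illustration only, the 0/1 structured feature-value vectors of step S34 can be clustered with an off-the-shelf K-means implementation; scikit-learn is used here merely as an example, and any of the algorithms listed above could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_feature_vectors(vectors, n_clusters):
    """vectors: list of equal-length 0/1 feature-value vectors, one per short text."""
    X = np.asarray(vectors, dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return labels  # cluster index of each short text (the preliminary clusters C1)
```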
Step S35: based on the ontology base in the knowledge base, the places in the short-text events are extracted and the place similarity between events is computed. Places have hierarchical (upper/lower level) relations, e.g. Heilongjiang Province, Qiqihar City, Nehe County; this hierarchy is used to compute place inclusion relations during place matching.
Step S35 uses the place indicator word list in the ontology base, whose salient indicator words include:
continent, country, province/state/prefecture, city, county, township, village, district, island, town, ...
These place indicator words generally appear at the end of a place word, and the hierarchical relations between them are recorded in the concept hierarchy of the ontology base, where "/" indicates synonymy, and "state" and "prefecture" are mostly endings of foreign place names.
Step S35 further comprises the following steps:
Step S351: based on the instance-concept mapping table in the ontology base, the place words are extracted from the segmented short text; when a new place word cannot be recognized by the word-segmentation algorithm, the place-word boundary is identified by matching place indicator words, and the place word preceding it is used to confirm the correctness of the newly found place word;
For example, "Dapeng Town, Pingnan County, Guangxi" is segmented as "Guangxi\ns Pingnan County\ns Da\a peng\n Town\n"; from the "ns" tags, "Guangxi" and "Pingnan County" are both recognized as places, while "Dapeng Town" is not correctly recognized by the segmentation algorithm. Since "Town" is a place concept, the word before "Dapeng Town" is a place whose level, "county", is higher than "town", and place expressions follow a descending-level convention, "Dapeng Town" is judged to be a place.
Step S352: according to the hierarchical relations between places, the places extracted from the short text are correctly grouped, so that mentions of the same place are put together and different places are distinguished;
For example, for the short text "An earthquake occurred in Hualien County, Taiwan Province, and tremors were felt in Fujian", Hualien County, Taiwan Province is one place and Fujian is another.
Step S353: place similarity matching, which comprises the following steps:
Step S3531: the instance-concept mapping table in the ontology base is queried; when two places are identical or have a parent-child inclusion relation, the place match is considered successful, otherwise step S3532 is performed;
For example, for Sichuan and Chengdu, the concept of "Sichuan" is "province" and the concept of "Chengdu" is "city"; according to the inclusion relation between "province" and "city", "Sichuan" and "Chengdu" match successfully.
Step S3532: the thesaurus in the ontology base is queried; if the two place words are synonyms, or there is a parent-child inclusion relation between their synonyms, the match succeeds, otherwise step S3533 is performed;
For example, for Kunming and "Flower City", the thesaurus shows that the two places are synonyms, so "Kunming" is matched with "Flower City".
Step S3533: when the two places are neither identical nor in an inclusion relation, and one of the place words does not end with a place indicator word, place indicator words are appended to it and step S3531 is executed again.
For example, for "Hangzhou" and "Hangzhou City": if the thesaurus does not record the synonymy of "Hangzhou" and "Hangzhou City", place indicator words are appended to the suffix of "Hangzhou", e.g. "Hangzhou Province", "Hangzhou City", and the new place words are matched against "Hangzhou City".
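A schematic sketch of the place matching of step S353 follows; the ancestor and synonym helpers are assumed to be backed by the ontology base and thesaurus, and the English indicator words merely stand in for the Chinese ones.

```python
# Identical places or ancestor/descendant places match (S3531); otherwise
# synonyms are tried (S3532); otherwise indicator words are appended to an
# unmarked place name and the containment check is repeated (S3533).
PLACE_MARKERS = ["Province", "City", "County", "Township", "District", "Town"]

def places_match(a, b, ancestors, are_synonyms):
    """ancestors(place): chain of enclosing places from the ontology hierarchy;
    are_synonyms(x, y): thesaurus lookup."""
    def contains(x, y):
        return x == y or x in ancestors(y) or y in ancestors(x)

    if contains(a, b):                                   # S3531: same or parent/child
        return True
    if are_synonyms(a, b):                               # S3532: thesaurus lookup
        return True
    if not any(a.endswith(m) for m in PLACE_MARKERS):    # S3533: append indicator words
        return any(contains(a + " " + m, b) for m in PLACE_MARKERS)
    return False
```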
Step S36: time extraction and time matching based on the ontology base. During time extraction, besides the time words tagged by the word-segmentation algorithm, words that are not recognized as time words by the segmentation algorithm but are marked as times in the instance-concept mapping table also need to be processed. Whatever form the time takes, in the present invention it is finally represented as: year-month-day-week-hour.
For example, "just now" is an adverb for the segmentation algorithm, but in the sentence it actually acts as a time word; extracting this class of words therefore needs the support of the knowledge base.
In addition, computing with time words also needs the support of the knowledge base. For example, "Beijing time" differs from "London time" by several time zones; "this morning" and "tonight" do not refer to the same time, while "last midnight" and "this morning" may well refer to the same time, and such knowledge must be provided by the knowledge base. Computing the similarity of these times requires the knowledge base to tell the computer common-sense facts such as that "morning" and "midnight" are periods of the day and how many hours a day has.
Step S36 further comprises the following steps:
Step S361: the publication time of the short-text event is obtained; each event has a news publication time, which starts from the year and is accurate to the hour or the minute.
Step S362: time words are extracted based on the instance-concept mapping table and the classification table in the ontology base;
Step S363: time similarity is computed. Times may be precise to the minute, may be periods, or may be fuzzy expressions such as "recently"; people rarely express time precisely, so the time comparison here adopts interval containment: two times match when their difference does not exceed a certain threshold or their intervals intersect. For example, for the times "20:02 on 31 October" and "the evening of 31 October" in text1 and text2, the former is a precise time and the latter ("evening") is a time period; according to the division of time periods for time words in the ontology base, the period denoted by "evening" contains "20:02", so text1 and text2 match in time.
Step S36 also includes the following:
converting lunar-calendar times to solar-calendar times;
converting solar-calendar times to days of the week;
taking time-zone conversion into account during time computation.
An interval-based sketch of the time matching of step S363 is given below.
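This sketch assumes that every resolved time expression is represented as a (start, end) interval; the 12-hour threshold is likewise an assumption made only for illustration.

```python
from datetime import datetime, timedelta

def times_match(interval_a, interval_b, threshold=timedelta(hours=12)):
    """interval_x: (start, end) datetimes; a precise time has start == end."""
    start_a, end_a = interval_a
    start_b, end_b = interval_b
    if start_a <= end_b and start_b <= end_a:      # intervals intersect
        return True
    gap = min(abs(start_a - end_b), abs(start_b - end_a))
    return gap <= threshold                        # within the allowed difference

# e.g. "20:02 on 31 October" becomes a point interval and "the evening of
# 31 October" roughly the interval (18:00, 24:00) of that day; the point falls
# inside the interval, so text1 and text2 match in time.
```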
Step S37: the short-text events are clustered again based on time-and-place matching. After the short-text events have undergone time and place matching, each short text obtains a time-place feature-value vector, and the clustering algorithm described in step S34 is used to cluster the short-text events again, giving new event clusters C2.
Step S38: based on the text bag-of-words model, the event clusters C1 obtained by structured-feature clustering are fused with the event clusters C2 obtained by time-place feature clustering, yielding the final cluster centers and event clusters C.
Step S39: hot-event ranking and screening: the events are ranked by the size of each event cluster, and the events with high heat are screened out according to a given threshold, achieving the goal of timely hot-event discovery. (A minimal sketch of this ranking follows.)
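A minimal sketch of the ranking and screening of step S39, taking the fused clusters C of step S38 as input; the fusion itself is not shown.

```python
# Rank event clusters by size and report those above a given threshold as hot
# events; the threshold value is application-specific.
def screen_hot_events(clusters, size_threshold):
    """clusters: dict event_id -> list of member short texts."""
    ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(event, len(texts)) for event, texts in ranked
            if len(texts) >= size_threshold]
```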
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and do not limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (15)

1. A hot event mining method based on a large-scale knowledge base, characterized by comprising the following steps:
Step S1: based on data acquired from the internet, automatically building a large-scale knowledge base oriented to text understanding, and realizing its automatic optimization and knowledge updating;
Step S2: based on the large-scale knowledge base, performing structured information extraction on short texts to be detected, classifying the short texts to be detected according to the extracted structured information, and screening out the corresponding event texts;
Step S3: based on the large-scale knowledge base, clustering the screened event texts and then screening out hot events.
2. The method according to claim 1, characterized in that step S1 comprises:
Step S11: constructing a multi-level concept structure by hierarchical clustering, performing hierarchical multi-label concept recognition to build an instance-concept mapping table, building a classification table on this basis, and forming the ontology base;
Step S12: taking the instance-concept mapping table as the basis, generating concept n-tuples by concept mapping, evaluating them, and selecting high-quality concept n-tuples to form the general-knowledge base;
Step S13: taking the concept n-tuples in the general-knowledge base as the reference, performing concept mapping and ambiguity elimination on instance n-tuples, and retaining high-quality instance n-tuples through n-tuple evaluation to form the factbase;
Step S14: building the rule base through rule path mining, rule confidence assessment and rule weight learning;
Step S15: using the ontology base, the general-knowledge base and the rule base to guide the automatic construction process of the knowledge base, realizing the automatic expansion of the factbase and the optimization of the knowledge-extraction process;
wherein the large-scale knowledge base comprises a general-purpose knowledge base and a domain knowledge base; the general-purpose knowledge base is used to handle general problems and the domain knowledge base is used to solve specific problems.
3. The method according to claim 2, characterized in that, in step S11, a hierarchical multi-label concept recognition technique based on attribute and non-attribute multi-source information is applied, and the establishment of the instance-concept mapping table specifically comprises the following steps:
Step S1121: performing concept recognition with a hierarchical multi-label classifier based on attribute-discriminability assessment and attribute construction, wherein the attribute construction comprises the following steps:
Step S11211: taking 2 or 3 attributes as a group and generating all possible attribute combinations under specific structures;
Step S11212: assessing the class-discrimination degree of each combination;
Step S11213: under a certain threshold, selecting the attribute combinations with high discrimination degree as new composite attributes;
wherein the specific structures comprise:
structure 1: taking the intersection of all attributes in the combination, i.e. all attributes occur simultaneously;
structure 2: taking the union of all attributes in the combination, i.e. at least one attribute occurs;
structure 3: a plurality of attributes forming a disjunctive normal form;
structure 4: a plurality of attributes forming a conjunctive normal form;
Step S1122: performing hierarchical multi-label concept recognition based on concept-similarity computation and open classification information, specifically comprising the following steps:
Step S11221: judging whether the instance is an ambiguous word; if it is ambiguous, performing step S11226, otherwise performing step S11222;
Step S11222: obtaining the group of open classifications of the unambiguous instance;
Step S11223: performing concept-similarity computation for each concept label therein, wherein each concept carries attribute information, and the similarity between concepts is obtained by comparing the similarity of their attributes and taking the weighted sum;
Step S11224: obtaining the group of concepts that best fit the instance label;
Step S11225: applying the first fusion rule to this group of concepts;
wherein the first fusion rule is:
- computing the overlap ratio of the attributes between concepts and judging the similarity of the concepts according to a given threshold;
- when the attribute sets of two concepts overlap completely, the two concepts are identical and the duplicate concept is removed;
- when the attribute sets of two concepts are in an inclusion relation, the two concepts are in a parent-child relation and the child concept is taken;
- when two concepts are in an overlapping relation, the two concepts have a certain similarity and the intersection of the two concepts is taken as the final concept;
- when the attribute sets of two concepts have no intersection, the two concepts are mutually exclusive, the instance is an ambiguous word, and both concepts are retained;
Step S11226: obtaining the groups of open classifications of the ambiguous instance;
Step S11227: for each class label of each group of open classifications, performing steps S11222 to S11225;
Step S11228: merging the resulting groups of concepts and then applying the first fusion rule;
Step S1123: performing hierarchical multi-label concept recognition based on concept-similarity computation and polysemous-word information;
Step S1124: applying the second fusion rule to fuse the hierarchical concepts;
wherein the second fusion rule comprises the following:
- when the instance is unambiguous: the attribute-based concept recognition result prevails;
- when the instance is ambiguous: the union of the results of attribute labelling, open-classification labelling and polysemous-word labelling is taken, and the first fusion rule is applied;
- for an instance whose concept cannot be labelled and which only has synonym information, the concept of the instance is determined by querying the concepts of its synonyms.
4. The method according to claim 3, wherein the hierarchical multi-label classifier algorithm comprises one of the following algorithms:
- Multi-Label C4.5: an improvement of the C4.5 decision-tree algorithm adapted to multi-label classification;
- Predictive Clustering Trees (PCTs): a hierarchical multi-label classifier based on top-down induction of decision trees;
- Random Forest of PCTs: on the basis of PCTs, a plurality of random subsets are drawn and models are trained, and the final classification is determined by voting;
- Random Forest ML-C4.5: applying the Random Forest idea on the basis of ML-C4.5.
5. The method according to claim 2, characterized in that, in step S12, the construction of the general-knowledge base comprises the following steps:
Step S121: searching the relation base for unambiguous entity n-tuples;
Step S122: judging whether the unambiguous entity n-tuples are already concept n-tuples, and conceptualizing each entity n-tuple to form a concept n-tuple;
Step S123: counting the frequency of occurrence of each concept n-tuple, recording this frequency and removing duplicates to obtain the initial concept n-tuple set;
Step S124: for any predicate in the initial concept n-tuple set, labelling the predicate of all concept n-tuples under that predicate starting from 1, the purpose being to distinguish the different concept n-tuples under the same predicate;
Step S125: evaluating each concept n-tuple according to the frequency of the instance n-tuples under it, and selecting the concept n-tuples whose frequency is higher than a certain threshold to form the general-knowledge base.
6. The method according to claim 2, characterized in that, in step S13, the construction of the factbase comprises the following steps:
Step S131: for any instance n-tuple, judging whether the instance n-tuple is ambiguous; if unambiguous, performing step S132; if ambiguous, performing step S136;
Step S132: querying the instance-concept mapping table and labelling a concept for each instance in the instance n-tuple;
Step S133: querying the general-knowledge base and, ignoring the predicate label, finding matching items;
Step S134: if there is only one matching item, performing step S135, otherwise performing step S13A;
Step S135: assigning the predicate and concept label of this concept n-tuple to the instance n-tuple to be labelled, and performing step S13A;
Step S136: querying the instance-concept mapping table and obtaining all its possible concept n-tuple set C1 under this predicate;
Step S137: searching the general-knowledge base for all the concept n-tuple set C2 under this predicate;
Step S138: ignoring the predicate label and matching the two sets C1 and C2;
Step S139: if a unique match can be achieved, the disambiguation succeeds and step S135 is performed, otherwise step S13A is performed;
Step S13A: if the instance n-tuples have all been traversed, ending, otherwise performing step S131;
Step S13B: computing the frequency with which each instance occurs in the corpus, and evaluating the tuples in the factbase according to the instance frequency, so as to retain high-quality instance n-tuples and form the factbase.
7. The method according to claim 2, characterized in that, in step S14, the construction of the rule base comprises the following steps:
Step S141: performing rule path search in the general-knowledge base based on a relation path-finding algorithm;
Step S142: instantiating the rules while taking parent-child relations into account, and computing rule confidence;
Step S143: filtering the rules according to rule confidence;
Step S144: learning the rule weights based on a Markov Logic Network.
8. The method according to claim 2, characterized in that the knowledge trace-back process of step S15 comprises the following steps:
Step S151: factbase expansion: based on the rule base, expanding the factbase using uncertain-reasoning techniques, which comprises the following steps:
Step S1511: performing reasoning over the rules based on the uncertain-reasoning technique;
Step S1512: on the basis of the reasoning, mining new knowledge, i.e. new instance n-tuples, for the factbase;
Step S1513: updating the factbase;
Step S152: optimization of structured-tuple extraction: using the existing rule base, factbase and general-knowledge base to optimize the extraction process of the relation base, while updating the general-knowledge base and the factbase.
9. The method according to claim 1, characterized in that step S2 comprises the following steps:
Step S21: using a Chinese word segmentation technique to convert the short text to be detected into an ordered word sequence, giving each word a corresponding part-of-speech tag, then merging particular word sequences into single words according to part-of-speech templates and revising their parts of speech;
Step S22: based on the word sequence obtained in step S21, mapping its entities into the hierarchical concept space and performing coarse semantic disambiguation on the ambiguous words therein, step S22 further comprising the following steps:
Step S221: based on the ontology base in the large-scale knowledge base, mapping the entities carrying attribute information in the sentence into their hierarchical concept space;
Step S222: performing semantic disambiguation according to the candidate concepts of each entity in the sentence; under the constraint of the concepts of the unambiguous entities in the sentence, computing a probability for each candidate concept of an ambiguous entity and taking the concept with the highest probability as the first candidate concept of that entity in the sentence;
Step S23: based on the result obtained in step S22, using a Chinese dependency extraction technique, combined with the basic Chinese sentence patterns, to convert the word sequence into a sequence of structured tuples with semantic information, step S23 further comprising the following steps:
Step S231: using the Chinese dependency extraction technique to perform dependency analysis on the word sequence obtained in S22, obtaining and storing the dependency relations between words in the sentence;
Step S232: based on the general-knowledge base and the factbase in the large-scale knowledge base, according to the dependency relations of S231 and the word sequence with first candidate concepts generated in step S22, and with reference to the basic Chinese sentence patterns, performing structured-tuple extraction based on part-of-speech templates and generating structured tuples;
Step S24: according to the tuple sequence obtained in step S23, extracting the text feature set based on the event domain knowledge base and performing recognition according to the text feature set, step S24 further comprising the following steps:
Step S241: based on the structured-tuple sequence generated in step S23, when the predicate of the core tuple is a subjective or modal verb, marking the text as a non-event text and ending the recognition, otherwise performing step S242;
Step S242: based on the domain ontology base, for the predicate of each tuple, if the predicate exists in the domain ontology base, performing step S243; otherwise generalizing the core predicate, via its synonyms, into a synonym predicate sequence, and if the domain ontology base contains none of the words in this sequence, marking the text as a non-event text and ending the recognition, otherwise performing step S243;
Step S243: if the core tuple lacks its subject component, marking the text as a non-event text and ending the recognition; otherwise, based on the sentiment word list in the domain ontology classification table, if the subject appears in the sentiment word list, marking the text as a non-event text and ending the recognition, otherwise performing step S244;
Step S244: mapping the core tuple to a concept tuple according to the first candidate concepts of its instances and making the tuple event judgement based on the domain general-knowledge base;
Step S245: generalizing each instance of the core tuple via its synonyms to generate a synonym sequence for each entity, combining them with the core predicate to generate a sequence of core tuples, making the tuple event judgement based on the domain factbase, forming the first feature set of event texts and performing feature recognition, and finally obtaining the event text collection.
10. The method according to claim 9, characterized in that step S222 further comprises the following steps:
Step S2221: combining each unambiguous entity obtained in S221 with the ambiguous entities to form entity pairs, and mapping the entity pairs to concept pairs according to the concepts of the entities in each pair;
Step S2222: based on the general-knowledge base in the large-scale knowledge base, counting the frequency of each concept pair from step S2221, and, taking each candidate concept of the ambiguous entity as the center, summing the frequencies of the concept pairs containing it and normalizing to obtain a probability.
11. The method according to claim 9, characterized in that step S232 further comprises the following steps:
Step S2321: sentence core-predicate recognition, identifying the core predicate from the subjective-verb list in the ontology classification table and the dependency sequence;
Step S2322: noun-phrase recognition, re-tagging as nouns, according to the dependency sequence, words tagged as verbs that participate in predetermined dependency relations;
Step S2323: prepositional-phrase recognition based on part-of-speech templates, the prepositional phrases being characterized by templates over the part-of-speech sequence;
Step S2324: separating the prepositional-phrase sequence obtained in S2323 from the original word sequence, filtering out function words, adverbs, adjectives and similar parts of speech to obtain a simplified part-of-speech sequence, scanning each verb in this sequence and extracting structured tuples using the part-of-speech templates of the four basic sentence patterns;
Step S2325: verifying tuple collocations based on the general-knowledge base and the dependency relations;
Step S2326: filling the components of the structured tuples based on the general-knowledge base and the dependency relations.
12. The method according to claim 1, characterized in that step S3 comprises the following steps:
Step S31: preprocessing and segmenting the short texts, and filtering out clauses that describe event results according to the inference relations between words in the rule base, wherein step S31 further comprises the following steps:
Step S311: splitting each short text into clauses at commas and determining the core word of each clause, the core word being an event-related noun or verb;
Step S312: searching the rule base and matching the antecedent-consequent inference relations between words in the rule base against the core words of multiple clauses; if the rule weight is greater than a given threshold and the match succeeds, an inference relation is established between the clauses, the latter word is regarded as the result of the former, and the clause containing the latter word is filtered out;
Step S32: obtaining the structured statement of each short-text event, namely its instance n-tuple representation, using the information extraction technique of step S2;
Step S33: performing incremental feature selection while matching structured tuples, and simultaneously constructing the feature value vectors used for clustering, wherein each feature is represented as a structured tuple;
Step S34: applying a clustering algorithm to the obtained structured feature value vectors to obtain preliminary event clusters C1;
Step S35: based on the ontology library in the knowledge base, extracting the places in the short-text events and computing the place similarity between events, taking the inclusion relations between places into account;
Step S36: extracting and matching times based on the ontology library, wherein, besides the time words marked by the word segmentation algorithm, words that the segmentation algorithm should have identified as time words are recovered by querying the instance-concept mapping table; step S36 further comprises the following steps:
Step S361: obtaining the publication time of each short-text event;
Step S362: extracting time words based on the instance-concept mapping table and the category table in the ontology library;
Step S363: computing time similarity, where times are compared by interval containment, and two times are considered to match if their difference does not exceed a given threshold or if the two time intervals overlap;
Step S37: re-clustering the short-text events based on time and place matching; after time and place matching, each short text obtains a time-place feature value vector, and the short-text events are clustered with the clustering algorithm described in step S34 to obtain new event clusters C2;
Step S38: based on a bag-of-words model, fusing the event clusters C1 obtained by structured-feature clustering with the event clusters C2 obtained by time-place feature clustering to obtain the final event clusters C;
Step S39: ranking and screening hot events, sorting the events according to the size of each event cluster and selecting, according to a given threshold, the events with high popularity;
wherein, in step S34, the clustering algorithm used is one of the following:
- the K-means algorithm;
- the Affinity Propagation (AP) algorithm;
- a stream clustering algorithm;
- the ClusTree clustering algorithm.
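For illustration only, a minimal sketch of step S34 (clustering the binary structured-feature vectors) followed by step S39 (ranking event clusters by size and keeping those above a popularity threshold). K-means is one of the algorithms listed above; the scikit-learn dependency, the cluster count and the size threshold are assumptions.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def cluster_events(feature_vectors, n_clusters=2):
    """Step S34: group short-text events by their structured feature value vectors."""
    X = np.asarray(feature_vectors, dtype=float)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)


def hot_events(labels, min_size=2):
    """Step S39: sort clusters by size and keep those above a given threshold."""
    sizes = Counter(int(label) for label in labels)
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    return [(cluster, size) for cluster, size in ranked if size >= min_size]


if __name__ == "__main__":
    # Each row is the 0/1 feature value vector of one short text (from step S33).
    vectors = [[1, 0, 1, 0],
               [1, 0, 1, 1],
               [0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1]]
    labels = cluster_events(vectors, n_clusters=2)
    print(hot_events(labels, min_size=2))   # e.g. [(0, 3), (1, 2)]
```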
13. The method according to claim 12, characterized in that step S33 comprises the following steps:
Step S331: obtaining the structured tuple representation of each short-text event;
Step S332: comparing each structured tuple with the features in the feature vector one by one and retaining the feature with the highest similarity;
Step S333: if the similarity is greater than a specified threshold, the structured tuple is considered to have matched that feature, and the feature value vector of the short text is set to 1 at the corresponding position and 0 otherwise; if the match fails, the structured tuple is treated as a new feature and added to the feature vector, and the feature value vector of the short text is set to 1 at the corresponding position and 0 at all other positions;
Step S334: checking whether all short texts have completed structure matching; if so, exiting, otherwise returning to step S331;
wherein, in step S332, each structured tuple is divided into five kinds of components: predicate, subject, object, subject modifier and object modifier, and corresponding weights w1 to w5 are assigned to the components.
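For illustration only, a minimal sketch of the incremental feature construction of steps S331 to S334: each structured tuple is compared against the current feature set, a sufficiently similar feature sets the corresponding vector position to 1, and an unmatched tuple is registered as a new feature. The placeholder similarity function and the thresholds are assumptions; claim 14 below describes the weighted component matching actually intended.

```python
def tuple_similarity(t1, t2):
    """Placeholder: fraction of identical, non-empty components (see claim 14 for the real scheme)."""
    keys = ("predicate", "subject", "object", "subject_mod", "object_mod")
    return sum(t1.get(k) is not None and t1.get(k) == t2.get(k) for k in keys) / len(keys)


def build_vectors(event_tuples, threshold=0.5):
    """Steps S331-S334: build the feature set and one 0/1 feature value vector per short text."""
    features, vectors = [], []
    for tuples in event_tuples:                     # one list of structured tuples per short text
        vec = [0] * len(features)
        for t in tuples:
            best, best_sim = None, 0.0              # step S332: keep the most similar feature
            for idx, feat in enumerate(features):
                sim = tuple_similarity(t, feat)
                if sim > best_sim:
                    best, best_sim = idx, sim
            if best is not None and best_sim > threshold:
                vec[best] = 1                       # step S333: match succeeded
            else:
                features.append(t)                  # step S333: register a new feature
                vec.append(1)
        vectors.append(vec)
    dim = len(features)                             # pad earlier vectors to the final dimension
    return features, [v + [0] * (dim - len(v)) for v in vectors]


if __name__ == "__main__":
    texts = [[{"predicate": "逮捕", "subject": "警方", "object": "嫌疑人"}],
             [{"predicate": "逮捕", "subject": "警方", "object": "毒贩"}]]
    feats, vecs = build_vectors(texts, threshold=0.3)
    print(len(feats), vecs)                         # 1 [[1], [1]]
```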
14. The method according to claim 13, characterized in that step S332 further comprises the following steps:
Step S3321: before matching, initializing the quasi-similarity of the tuple to be matched to 0;
Step S3322: taking the feature tuple as the benchmark, each time a component of the feature tuple is matched in the tuple to be matched, adding the corresponding weight to the quasi-similarity; the matching process comprises three kinds of operations, namely identity judgment, synonymy judgment and concept-identity judgment, which correspond to decreasing degrees of similarity, and the accumulation of the quasi-similarity is decayed accordingly;
Step S3323: when the above three operations cannot match two corresponding instances, searching the simplified instance rules in the rule base for an antecedent-consequent inference relation between the two instances whose inference weight is greater than a given threshold; if such a rule exists, the two instances are considered to match at the rule level and the corresponding weight is added to the quasi-similarity;
Step S3324: assuming the sum of the weights of all components in the feature tuple is N, the similarity between the tuple to be matched and the feature tuple is the quasi-similarity divided by N.
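For illustration only, a minimal sketch of the quasi-similarity of steps S3321 to S3324, with identity, synonymy and concept-identity judgments applied in decreasing order of similarity, a rule-level fallback, and normalization by the weight sum N. The concrete weights, decay factors and the toy synonym, instance-concept and rule tables are assumptions.

```python
WEIGHTS = {"predicate": 3, "subject": 2, "object": 2, "subject_mod": 1, "object_mod": 1}  # assumed w1-w5
DECAY = {"identity": 1.0, "synonym": 0.8, "concept": 0.6, "rule": 0.5}                    # assumed decay

SYNONYMS = {("落网", "被捕")}                    # assumed synonym table fragment
CONCEPT_OF = {"歹徒": "人", "嫌疑人": "人"}       # assumed instance -> concept table fragment
RULES = {("爆炸", "伤亡"): 0.9}                  # assumed antecedent -> consequent rules with weights


def component_match(a, b, rule_threshold=0.7):
    """Return the decay factor of the strongest applicable judgment, or 0 if none applies."""
    if a is None or b is None:
        return 0.0
    if a == b:                                            # identity judgment
        return DECAY["identity"]
    if (a, b) in SYNONYMS or (b, a) in SYNONYMS:          # synonymy judgment
        return DECAY["synonym"]
    ca, cb = CONCEPT_OF.get(a), CONCEPT_OF.get(b)
    if ca is not None and ca == cb:                       # concept-identity judgment
        return DECAY["concept"]
    if RULES.get((a, b), 0) > rule_threshold or RULES.get((b, a), 0) > rule_threshold:
        return DECAY["rule"]                              # step S3323: rule-level match
    return 0.0


def quasi_similarity(candidate, feature):
    """Steps S3321-S3324: accumulate weighted component matches, then normalize by N."""
    score = 0.0                                           # step S3321
    for comp, w in WEIGHTS.items():                       # step S3322
        score += w * component_match(candidate.get(comp), feature.get(comp))
    return score / sum(WEIGHTS.values())                  # step S3324


if __name__ == "__main__":
    feature = {"predicate": "被捕", "subject": "嫌疑人", "object": None}
    candidate = {"predicate": "落网", "subject": "歹徒", "object": None}
    print(round(quasi_similarity(candidate, feature), 3))  # 0.4: synonym + concept matches
```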
15. The method according to claim 12, characterized in that step S35 further comprises the following steps:
Step S351: extracting the place words of the segmented short texts based on the instance-concept mapping table in the ontology library; when a new place word cannot be recognized by the word segmentation algorithm, its boundary is identified by matching place indicator words, and the correctness of the newly found place word is confirmed by matching the place word preceding it;
Step S352: classifying the places extracted from the short texts according to the hierarchical relationships between places, grouping identical places together while distinguishing different places;
Step S353: matching place similarity by querying the instance-concept mapping table and the synonym table in the ontology library to determine whether two places are identical, similar or in a parent-child inclusion relation, or whether such a relation holds after a place indicator word is appended to the end;
wherein the place indicator words are stored in the concept hierarchy of the ontology library, and typical indicator words are listed below:
continent, country, province/state/prefecture, city, county, township, village, district, island, town;
The above place indicator words are generally located at the end of a place word, and the hierarchical relationships between them are indicated in the concept hierarchy of the ontology library.
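For illustration only, a minimal sketch of the place handling of steps S351 to S353: indicator words at the end of a word signal a place, and two places match when they are identical or stand in a parent-child inclusion relation. The Chinese renderings of the indicator words and the tiny place hierarchy are assumptions.

```python
# Assumed Chinese renderings of the indicator words listed in the claim.
PLACE_INDICATORS = ("洲", "国", "省", "州", "府", "市", "县", "乡", "村", "区", "岛", "镇")

# Assumed fragment of the ontology's place hierarchy (child -> parent).
PARENT_OF = {"海淀区": "北京市", "朝阳区": "北京市", "北京市": "中国"}


def is_place_word(word):
    """Step S351: a candidate place word typically ends with an indicator word."""
    return word.endswith(PLACE_INDICATORS)


def ancestors(place):
    """Step S352: climb the place hierarchy to classify a place under its parents."""
    chain = []
    while place in PARENT_OF:
        place = PARENT_OF[place]
        chain.append(place)
    return chain


def place_match(a, b):
    """Step S353: identical places or a parent-child inclusion relation count as a match."""
    return a == b or a in ancestors(b) or b in ancestors(a)


if __name__ == "__main__":
    print(is_place_word("石家庄市"))        # True
    print(place_match("海淀区", "北京市"))   # True: inclusion relation
    print(place_match("海淀区", "朝阳区"))   # False: sibling districts differ
```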
CN201310741535.XA 2013-12-27 2013-12-27 Hot event mining method based on large-scale knowledge base Active CN103699663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310741535.XA CN103699663B (en) 2013-12-27 2013-12-27 Hot event mining method based on large-scale knowledge base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310741535.XA CN103699663B (en) 2013-12-27 2013-12-27 Hot event mining method based on large-scale knowledge base

Publications (2)

Publication Number Publication Date
CN103699663A true CN103699663A (en) 2014-04-02
CN103699663B CN103699663B (en) 2017-02-08

Family

ID=50361191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310741535.XA Active CN103699663B (en) 2013-12-27 2013-12-27 Hot event mining method based on large-scale knowledge base

Country Status (1)

Country Link
CN (1) CN103699663B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070088794A1 (en) * 2005-09-27 2007-04-19 Cymer, Inc. Web-based method for information services
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment
CN103176981A (en) * 2011-12-20 2013-06-26 中国科学院计算机网络信息中心 Event information mining and warning method
CN103455705A (en) * 2013-05-24 2013-12-18 中国科学院自动化研究所 Analysis and prediction system for cooperative correlative tracking and global situation of network social events

Cited By (130)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199809A (en) * 2014-04-24 2014-12-10 江苏大学 Semantic representation method for patent text vectors
CN103942339A (en) * 2014-05-08 2014-07-23 深圳市宜搜科技发展有限公司 Synonym mining method and device
CN103942339B (en) * 2014-05-08 2017-06-09 深圳市宜搜科技发展有限公司 Synonym method for digging and device
CN104199838A (en) * 2014-08-04 2014-12-10 浙江工商大学 User model building method based on label disambiguation
CN104199838B (en) * 2014-08-04 2017-09-29 浙江工商大学 A kind of user model constructing method based on label disambiguation
CN106663124B (en) * 2014-08-11 2020-02-28 微软技术许可有限责任公司 Generating and using knowledge-enhanced models
CN106663124A (en) * 2014-08-11 2017-05-10 微软技术许可有限责任公司 Generating and using a knowledge-enhanced model
CN106663221A (en) * 2014-08-19 2017-05-10 高通股份有限公司 Knowledge-graph biased classification for data
CN104239436B (en) * 2014-08-27 2018-01-02 南京邮电大学 It is a kind of that method is found based on the network hotspot event of text classification and cluster analysis
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN105608091A (en) * 2014-11-21 2016-05-25 ***通信集团公司 Construction method and device of dynamic medical knowledge base
CN105608091B (en) * 2014-11-21 2019-02-05 ***通信集团公司 A kind of construction method and device of dynamic medical knowledge base
CN106033414A (en) * 2015-03-09 2016-10-19 北大方正集团有限公司 A hot spot information processing method and system
CN104794163A (en) * 2015-03-25 2015-07-22 中国人民大学 Entity set extension method
CN104794163B (en) * 2015-03-25 2018-07-13 中国人民大学 Entity sets extended method
CN104751230A (en) * 2015-04-03 2015-07-01 武汉理工大学 Ontology-based automatic manuscript reviewing method
CN105260375B (en) * 2015-08-05 2019-04-12 北京工业大学 Event ontology learning method
CN105260375A (en) * 2015-08-05 2016-01-20 北京工业大学 Event ontology learning method
CN105117483A (en) * 2015-09-17 2015-12-02 浙江农林大学 Ontology-driven mass data event decision-making method
CN105354224B (en) * 2015-09-30 2019-07-23 百度在线网络技术(北京)有限公司 The treating method and apparatus of knowledge data
CN105354224A (en) * 2015-09-30 2016-02-24 百度在线网络技术(北京)有限公司 Knowledge data processing method and apparatus
CN106682060A (en) * 2015-11-11 2017-05-17 奥多比公司 Structured Knowledge Modeling, Extraction and Localization from Images
CN105468583A (en) * 2015-12-09 2016-04-06 百度在线网络技术(北京)有限公司 Entity relationship obtaining method and device
CN105608199B (en) * 2015-12-25 2020-08-25 上海智臻智能网络科技股份有限公司 Extension method and device for standard questions in intelligent question-answering system
CN105608199A (en) * 2015-12-25 2016-05-25 上海智臻智能网络科技股份有限公司 Extension method and device for standard questions in intelligent question answering system
CN106991090A (en) * 2016-01-20 2017-07-28 北京国双科技有限公司 The analysis method and device of public sentiment event entity
CN105808734B (en) * 2016-03-10 2020-07-28 同济大学 Semantic-net-based method for acquiring implicit relation between knowledge in steel manufacturing process
CN105808734A (en) * 2016-03-10 2016-07-27 同济大学 Semantic web based method for acquiring implicit relationship among steel iron making process knowledge
CN105893485B (en) * 2016-03-29 2019-02-12 浙江大学 A kind of thematic automatic generation method based on library catalogue
CN105893485A (en) * 2016-03-29 2016-08-24 浙江大学 Automatic special subject generating method based on book catalogue
CN105956052A (en) * 2016-04-27 2016-09-21 青岛海尔软件有限公司 Building method of knowledge map based on vertical field
CN106777274A (en) * 2016-06-16 2017-05-31 北京理工大学 A kind of Chinese tour field knowledge mapping construction method and system
CN106777274B (en) * 2016-06-16 2018-05-29 北京理工大学 A kind of Chinese tour field knowledge mapping construction method and system
CN106156316A (en) * 2016-07-04 2016-11-23 长江大学 Special name under a kind of big data environment and native place correlating method and system
CN106227768B (en) * 2016-07-15 2019-09-03 国家计算机网络与信息安全管理中心 A kind of short text opining mining method based on complementary corpus
CN106227768A (en) * 2016-07-15 2016-12-14 国家计算机网络与信息安全管理中心 A kind of short text opining mining method based on complementary language material
CN106156365B (en) * 2016-08-03 2019-06-18 北京儒博科技有限公司 A kind of generation method and device of knowledge mapping
CN106156365A (en) * 2016-08-03 2016-11-23 北京智能管家科技有限公司 A kind of generation method and device of knowledge mapping
CN106294325A (en) * 2016-08-11 2017-01-04 海信集团有限公司 The optimization method and device of spatial term statement
CN106294325B (en) * 2016-08-11 2019-01-04 海信集团有限公司 The optimization method and device of spatial term sentence
CN106407271B (en) * 2016-08-26 2020-02-14 厦门快商通科技股份有限公司 Intelligent customer service system and updating method of intelligent customer service knowledge base thereof
CN106407271A (en) * 2016-08-26 2017-02-15 厦门快商通科技股份有限公司 Intelligent customer service system, and method for updating intelligent customer service knowledge base of intelligent customer service system
CN106407473B (en) * 2016-10-27 2020-01-31 西南石油大学 event similarity modeling-based method and system for acquiring event context
CN106407473A (en) * 2016-10-27 2017-02-15 西南石油大学 Event similarity modeling-based event context acquisition method and system
CN106503256A (en) * 2016-11-11 2017-03-15 中国科学院计算技术研究所 A kind of hot information method for digging based on social networkies document
CN106503256B (en) * 2016-11-11 2019-05-07 中国科学院计算技术研究所 A kind of hot information method for digging based on social networks document
CN106776933A (en) * 2016-12-01 2017-05-31 厦门市美亚柏科信息股份有限公司 A kind of processing method and system that polymerization is analyzed to similar case information
CN108153793A (en) * 2016-12-02 2018-06-12 航天星图科技(北京)有限公司 A kind of original data processing method
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN107122420A (en) * 2017-04-01 2017-09-01 上海诺悦智能科技有限公司 A kind of tourist hot spot event detecting method and system
CN108694177A (en) * 2017-04-06 2018-10-23 北大方正集团有限公司 Knowledge mapping construction method and system
CN107193872A (en) * 2017-04-14 2017-09-22 深圳前海微众银行股份有限公司 Question and answer data processing method and device
CN107038261A (en) * 2017-05-28 2017-08-11 海南大学 A kind of processing framework resource based on data collection of illustrative plates, Information Atlas and knowledge mapping can Dynamic and Abstract Semantic Modeling Method
CN107341252A (en) * 2017-07-10 2017-11-10 北京神州泰岳软件股份有限公司 A kind of method and device of the unknown incidence relation of mining rule correlation model
CN107545033A (en) * 2017-07-24 2018-01-05 清华大学 A kind of computational methods based on the knowledge base entity classification for representing study
CN107247709A (en) * 2017-07-28 2017-10-13 广州多益网络股份有限公司 The optimization method and system of a kind of encyclopaedia entry label
CN107741927A (en) * 2017-09-25 2018-02-27 沈阳航空航天大学 Had complementary advantages tactful prepositional phrase recognition methods based on multi-model
CN107729493A (en) * 2017-09-29 2018-02-23 北京创鑫旅程网络技术有限公司 Travel the construction method of knowledge mapping, device and travelling answering method, device
CN108154555A (en) * 2017-12-27 2018-06-12 江西理工大学 A kind of complex geological structure three-dimensional modeling method under knowledge rule constraint
CN108052659B (en) * 2017-12-28 2022-03-11 北京百度网讯科技有限公司 Search method and device based on artificial intelligence and electronic equipment
US11275898B2 (en) 2017-12-28 2022-03-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Search method and device based on artificial intelligence
CN108052659A (en) * 2017-12-28 2018-05-18 北京百度网讯科技有限公司 Searching method, device and electronic equipment based on artificial intelligence
US10534862B2 (en) 2018-02-01 2020-01-14 International Business Machines Corporation Responding to an indirect utterance by a conversational system
US10832006B2 (en) 2018-02-01 2020-11-10 International Business Machines Corporation Responding to an indirect utterance by a conversational system
US11954613B2 (en) 2018-02-01 2024-04-09 International Business Machines Corporation Establishing a logical connection between an indirect utterance and a transaction
CN110297868A (en) * 2018-03-22 2019-10-01 奥多比公司 Construct enterprise's specific knowledge figure
CN108491512A (en) * 2018-03-23 2018-09-04 北京奇虎科技有限公司 The method of abstracting and device of headline
CN109388788A (en) * 2018-04-28 2019-02-26 云天弈(北京)信息技术有限公司 A kind of intelligence assisted writing system
CN109388788B (en) * 2018-04-28 2023-06-20 云天弈(北京)信息技术有限公司 Intelligent auxiliary writing system
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system
CN110633410A (en) * 2018-06-21 2019-12-31 中兴通讯股份有限公司 Information processing method and device, storage medium, and electronic device
CN110209808B (en) * 2018-08-08 2023-03-10 腾讯科技(深圳)有限公司 Event generation method based on text information and related device
CN110209808A (en) * 2018-08-08 2019-09-06 腾讯科技(深圳)有限公司 A kind of event generation method and relevant apparatus based on text information
CN109243528B (en) * 2018-08-14 2022-02-08 张旭蓓 Biological process control method based on knowledge graph digraph
CN109243528A (en) * 2018-08-14 2019-01-18 张旭蓓 The bioprocess control method of knowledge based map digraph
CN109101633A (en) * 2018-08-15 2018-12-28 北京神州泰岳软件股份有限公司 A kind of hierarchy clustering method and device
US11244232B2 (en) 2018-08-22 2022-02-08 Advanced New Technologies Co., Ltd. Feature relationship recommendation method, apparatus, computing device, and storage medium
CN109189937A (en) * 2018-08-22 2019-01-11 阿里巴巴集团控股有限公司 A kind of characteristic relation recommended method and device, a kind of calculating equipment and storage medium
CN109241273B (en) * 2018-08-23 2022-02-18 云南大学 Method for extracting minority subject data in new media environment
CN109241273A (en) * 2018-08-23 2019-01-18 云南大学 The abstracting method of ethnic group's subject data under a kind of new media environment
CN109408804A (en) * 2018-09-03 2019-03-01 平安科技(深圳)有限公司 The analysis of public opinion method, system, equipment and storage medium
CN109582783A (en) * 2018-10-26 2019-04-05 中国科学院自动化研究所 Hot topic detection method and device
CN109582783B (en) * 2018-10-26 2020-10-02 中国科学院自动化研究所 Hot topic detection method and device
CN109299273A (en) * 2018-11-02 2019-02-01 广州语义科技有限公司 Based on the multi-source multi-tag file classification method and its system for improving seq2seq model
CN109739975B (en) * 2018-11-15 2021-03-09 东软集团股份有限公司 Hot event extraction method and device, readable storage medium and electronic equipment
CN109739975A (en) * 2018-11-15 2019-05-10 东软集团股份有限公司 Focus incident abstracting method, device, readable storage medium storing program for executing and electronic equipment
WO2020114022A1 (en) * 2018-12-04 2020-06-11 平安科技(深圳)有限公司 Knowledge base alignment method and apparatus, computer device and storage medium
CN109597856A (en) * 2018-12-05 2019-04-09 北京知道创宇信息技术有限公司 A kind of data processing method, device, electronic equipment and storage medium
CN109597856B (en) * 2018-12-05 2020-12-25 北京知道创宇信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium
CN109614620B (en) * 2018-12-10 2023-01-17 齐鲁工业大学 HowNet-based graph model word sense disambiguation method and system
CN109614620A (en) * 2018-12-10 2019-04-12 齐鲁工业大学 A kind of graph model Word sense disambiguation method and system based on HowNet
CN109902144A (en) * 2019-01-11 2019-06-18 杭州电子科技大学 A kind of entity alignment schemes based on improvement WMD algorithm
TWI778234B (en) * 2019-02-22 2022-09-21 中華電信股份有限公司 Speaker verification system
CN110110744A (en) * 2019-03-27 2019-08-09 平安国际智慧城市科技股份有限公司 Text matching method, device and computer equipment based on semantic understanding
US11526769B2 (en) 2019-03-30 2022-12-13 International Business Machines Corporation Encoding knowledge graph entries with searchable geotemporal values for evaluating transitive geotemporal proximity of entity mentions
US11681927B2 (en) 2019-03-30 2023-06-20 International Business Machines Corporation Analyzing geotemporal proximity of entities through a knowledge graph
CN110046260A (en) * 2019-04-16 2019-07-23 广州大学 A kind of darknet topic discovery method and system of knowledge based map
CN110705710A (en) * 2019-04-17 2020-01-17 中国石油大学(华东) Knowledge graph-based industrial fault analysis expert system
CN110134762A (en) * 2019-04-23 2019-08-16 南京邮电大学 Deceptive information identifying system and recognition methods based on event topic analysis
CN110134762B (en) * 2019-04-23 2023-07-11 南京邮电大学 False information identification system and false information identification method based on event topic analysis
CN110297904A (en) * 2019-06-17 2019-10-01 北京百度网讯科技有限公司 Generation method, device, electronic equipment and the storage medium of event name
CN110263254A (en) * 2019-06-20 2019-09-20 北京百度网讯科技有限公司 Event stage division, device, equipment and medium
CN110879843B (en) * 2019-08-06 2020-08-04 上海孚典智能科技有限公司 Method for constructing self-adaptive knowledge graph technology based on machine learning
CN110879843A (en) * 2019-08-06 2020-03-13 上海孚典智能科技有限公司 Self-adaptive knowledge graph technology based on machine learning
CN110543914B (en) * 2019-09-04 2022-06-24 软通智慧信息技术有限公司 Event data processing method and device, computing equipment and medium
CN110543914A (en) * 2019-09-04 2019-12-06 软通动力信息技术有限公司 Event data processing method and device, computing equipment and medium
CN110597957A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Text information retrieval method and related device
CN110597957B (en) * 2019-09-11 2022-04-22 腾讯科技(深圳)有限公司 Text information retrieval method and related device
CN112699909B (en) * 2019-10-23 2024-03-19 中移物联网有限公司 Information identification method, information identification device, electronic equipment and computer readable storage medium
CN112699909A (en) * 2019-10-23 2021-04-23 中移物联网有限公司 Information identification method and device, electronic equipment and computer readable storage medium
CN111159409A (en) * 2019-12-31 2020-05-15 腾讯科技(深圳)有限公司 Text classification method, device, equipment and medium based on artificial intelligence
CN111339299A (en) * 2020-02-27 2020-06-26 北京明略软件***有限公司 Method and device for constructing domain knowledge base
CN111339299B (en) * 2020-02-27 2023-06-02 北京明略软件***有限公司 Construction method and device of domain knowledge base
CN112199960A (en) * 2020-11-12 2021-01-08 北京三维天地科技股份有限公司 Standard knowledge element granularity analysis system
CN112668836A (en) * 2020-12-07 2021-04-16 数据地平线(广州)科技有限公司 Risk graph-oriented associated risk evidence efficient mining and monitoring method and device
CN112668836B (en) * 2020-12-07 2024-04-05 数据地平线(广州)科技有限公司 Risk spectrum-oriented associated risk evidence efficient mining and monitoring method and apparatus
CN112883145A (en) * 2020-12-24 2021-06-01 浙江万里学院 Emotion multi-tendency classification method for Chinese comments
CN112784049B (en) * 2021-01-28 2023-05-12 电子科技大学 Text data-oriented online social platform multi-element knowledge acquisition method
CN112784049A (en) * 2021-01-28 2021-05-11 电子科技大学 Online social platform multivariate knowledge acquisition method facing text data
CN113157882A (en) * 2021-03-31 2021-07-23 山东大学 Knowledge graph path retrieval method and device with user semantics as center
CN113435212B (en) * 2021-08-26 2021-11-16 山东大学 Text inference method and device based on rule embedding
CN113435212A (en) * 2021-08-26 2021-09-24 山东大学 Text inference method and device based on rule embedding
CN113704436A (en) * 2021-09-02 2021-11-26 宁波深擎信息科技有限公司 User portrait label mining method and device based on session scene
CN113704436B (en) * 2021-09-02 2023-08-08 宁波深擎信息科技有限公司 User portrait tag mining method and device based on session scene
CN113722509A (en) * 2021-09-07 2021-11-30 中国人民解放军32801部队 Knowledge graph data fusion method based on entity attribute similarity
CN113821739A (en) * 2021-11-22 2021-12-21 南方科技大学 Local event detection method, device, equipment and storage medium
CN114065770A (en) * 2022-01-17 2022-02-18 江苏联著实业股份有限公司 Method and system for constructing semantic knowledge base based on graph neural network
WO2023236238A1 (en) * 2022-06-09 2023-12-14 深圳计算科学研究院 Relational data-based data processing method and apparatus thereof
CN114969385A (en) * 2022-08-03 2022-08-30 北京长河数智科技有限责任公司 Knowledge graph optimization method and device based on document attribute assignment entity weight
CN115577713A (en) * 2022-12-07 2023-01-06 中科雨辰科技有限公司 Text processing method based on knowledge graph

Also Published As

Publication number Publication date
CN103699663B (en) 2017-02-08

Similar Documents

Publication Publication Date Title
CN103699663B (en) Hot event mining method based on large-scale knowledge base
Mendes et al. DBpedia spotlight: shedding light on the web of documents
CN101630314B (en) Semantic query expansion method based on domain knowledge
CN104091054A (en) Mass disturbance warning method and system applied to short texts
CN105393263A (en) Feature completion in computer-human interactive learning
CN111177591B (en) Knowledge graph-based Web data optimization method for visual requirements
CN112148890B (en) Teaching knowledge graph pedigree system based on network group intelligence
CN108710663A (en) A kind of data matching method and system based on ontology model
WO2014210387A2 (en) Concept extraction
CN107590119B (en) Method and device for extracting person attribute information
CN110321561A (en) A kind of keyword extracting method and device
Zheng et al. Semi-supervised event-related tweet identification with dynamic keyword generation
Korn et al. Automatically generating interesting facts from wikipedia tables
Hättasch et al. It's ai match: A two-step approach for schema matching using embeddings
CN115309885A (en) Knowledge graph construction, retrieval and visualization method and system for scientific and technological service
CN113505190B (en) Address information correction method, device, computer equipment and storage medium
Dybczyński et al. Important stellar perturbers found during the StePPeD database update based on Gaia EDR3 data
Ma et al. Matching descriptions to spatial entities using a Siamese hierarchical attention network
Premalatha et al. Text processing in information retrieval system using vector space model
CN112597768A (en) Text auditing method and device, electronic equipment, storage medium and program product
Zhou et al. Unsupervised event exploration from social text streams
Veltmeijer et al. SentiMap: Domain-Adaptive Geo-Spatial Sentiment Analysis
Chen et al. Construction Methods of Knowledge Mapping for Full Service Power Data Semantic Search System
KR20110054904A (en) Former times processing device of document improved accuracy of analysis
Luberg et al. Information retrieval and deduplication for tourism recommender sightsplanner

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant