CN108563637A - Sentence entity completion method fusing a triple knowledge base - Google Patents

Sentence entity completion method fusing a triple knowledge base

Info

Publication number
CN108563637A
Authority
CN
China
Prior art keywords
entity
sentence
entities
triple
vector
Prior art date
Legal status
Pending
Application number
CN201810328826.9A
Other languages
Chinese (zh)
Inventor
黄河燕
魏骁驰
史学文
刘茜
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201810328826.9A
Publication of CN108563637A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A sentence entity completion method fusing a triple knowledge base, belonging to the field of computer natural language processing. The concrete steps are: 1. build the data set used for model training; 2. represent entities, relations, and sentence templates as vectors; 3. complete the entity word in the sentence. Compared with the prior art, the proposed method can take into account the relations between the entity word to be completed and the other entity words in the sentence when completing an entity word, which effectively solves the difficulty that conventional sentence completion methods have in completing entity words. Experiments show that the proposed method achieves a clear improvement on the mean rank (MR) and top-10 hit rate (H@10) evaluation metrics.

Description

Sentence entity completion method fusing a triple knowledge base
Technical field
The present invention relates to a sentence entity completion method that fuses a triple knowledge base, and belongs to the technical field of computer natural language processing.
Background technology
Sentence completion is a technique that supplements an incomplete sentence into a complete one. It is widely applicable to intelligent prompting in input methods, query suggestion in search engines, and similar scenarios. With the widespread adoption of computers and other electronic devices, text input has become a basic operation on these devices. Sentence completion technology can greatly reduce the number of words a user has to type, increase input speed, and improve the user experience, and it is already used at scale by input-method vendors.
Traditional sentence completion methods are mostly based on word co-occurrence statistics. By counting over large-scale text, they obtain the co-occurrence probabilities between each word and all other words; these probabilities are then used to compute the probability of filling different words into the sentence to be completed, the candidate words are ranked by this probability, and the several highest-probability words are returned to the user as candidates to choose from.
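As a point of reference only, a minimal sketch of such a co-occurrence baseline (not part of the patent; all names and the toy corpus are illustrative) might look like this:

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(sentences):
    """Count how often each pair of words appears in the same sentence."""
    cooc = defaultdict(Counter)
    for sent in sentences:
        words = set(sent.lower().split())
        for w1, w2 in combinations(words, 2):
            cooc[w1][w2] += 1
            cooc[w2][w1] += 1
    return cooc

def rank_candidates(cooc, context_words, vocabulary, top_k=10):
    """Score each candidate by its total co-occurrence with the context words."""
    scores = {c: sum(cooc[w][c] for w in context_words) for c in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = ["obama was born in hawaii", "paris is the capital of france"]
cooc = build_cooccurrence(corpus)
print(rank_candidates(cooc, ["obama", "born"], ["hawaii", "france"]))
```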
The drawback of these traditional methods is that they only consider co-occurrence information between words in the sentence and ignore the semantic relations between entity words in the sentence. When an entity word in the sentence needs to be completed, the semantic relation between the entity word to be completed and the other entity words in the sentence is often more important than co-occurrence information; this is especially evident for subject-predicate-object sentences that describe relations between entities. Since the co-occurrence information used by conventional methods cannot effectively capture the relations between entities, it is difficult for them to produce satisfactory completions for missing entity words.
Summary of the invention
The purpose of the present invention is to solve the problem that sentence completion methods have difficulty completing entity words; to this end, a sentence entity completion method fusing a triple knowledge base is proposed.
The present invention is achieved through the following technical solution.
The related definitions are given first, as follows:
Definition 1: entity, an identifier that represents a specific thing;
Definition 2: relation, an identifier that represents the connection between the things represented by two entities;
Definition 3: triple, a structure composed of two entities and the relation between them; wherein the two entities form an entity pair;
Definition 4: triple knowledge base, a set composed of a large number of triples;
wherein the set of all entities appearing in the triples is called the entity set, and the set of all relations is called the relation set;
Definition 5: sentence entity, a noun in a sentence that can be matched to an entity;
Definition 6: sentence template, the content that remains after deleting any two sentence entities from a sentence;
Definition 7: named entity recognition method, a method that takes a sentence as input and returns its sentence entities;
Definition 8: entity linking method, a method that takes a sentence entity as input and finds the entity it can be matched to;
Definition 9: triple connection rule, i.e., if the second entity of one triple is identical to the first entity of another triple, the two triples can be connected through that shared entity;
Definition 10: relation path, i.e., the set of relations contained in a chain of triples connected according to the triple connection rule;
wherein the triple connection rule is as in Definition 9; a small sketch of the data structures implied by these definitions is given after this list.
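These definitions map naturally onto simple data structures. The following is a minimal sketch (all class and function names, and the toy triples, are illustrative and not taken from the patent):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # first entity (Definition 1)
    relation: str  # relation between the two entities (Definition 2)
    tail: str      # second entity

# A triple knowledge base (Definition 4) is a collection of triples.
kb = [
    Triple("Barack_Obama", "born_in", "Honolulu"),
    Triple("Honolulu", "located_in", "Hawaii"),
]

entity_set = {e for t in kb for e in (t.head, t.tail)}    # entity set
relation_set = {t.relation for t in kb}                   # relation set

def can_connect(t1: Triple, t2: Triple) -> bool:
    """Triple connection rule (Definition 9): t1's second entity equals t2's first."""
    return t1.tail == t2.head

# A relation path (Definition 10) of a connected chain of triples is the
# sequence of their relations, e.g. ["born_in", "located_in"].
path = [kb[0].relation, kb[1].relation] if can_connect(kb[0], kb[1]) else []
print(path)
```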
The concrete steps of the sentence entity completion method fusing a triple knowledge base are:
Step 1: build the data set used for model training.
Given a sentence set, perform steps 1.1-1.3 for each sentence in the sentence set:
Step 1.1: extract all sentence entities in the sentence using the named entity recognition method; the resulting sentence entities form the set N;
wherein sentence entity is as in Definition 5 and the named entity recognition method is as in Definition 7;
Step 1.2: pair up the sentence entities in the N obtained in step 1.1 by traversing all combinations; every two matched sentence entities form a sentence entity pair;
Step 1.3: for each sentence entity pair obtained in step 1.2, perform steps 1.3.1-1.3.2:
Step 1.3.1: delete the two sentence entities of the pair from the sentence to obtain a sentence template;
wherein sentence template is as in Definition 6;
Step 1.3.2: use the entity linking method to find the two entities in the entity set that match the two sentence entities of the pair, and form an entity pair; then traverse all triples in the triple knowledge base to find a relation path that connects the two entities of the entity pair;
wherein the entity linking method is as in Definition 8, the triple knowledge base is as in Definition 4, and the relation path is as in Definition 10. A sketch of this data-set construction follows.
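A minimal sketch of step 1, in which the hypothetical callables `recognize_entities` and `link_entity` stand in for the named entity recognition and entity linking methods of Definitions 7 and 8 (the knowledge base is passed as plain (head, relation, tail) tuples):

```python
from itertools import combinations
from collections import defaultdict, deque

def build_training_examples(sentences, kb, recognize_entities, link_entity):
    """Step 1: turn raw sentences into (template, entity pair, relation path) examples."""
    # Index the knowledge base for path search: head entity -> [(relation, tail)].
    out_edges = defaultdict(list)
    for head, relation, tail in kb:
        out_edges[head].append((relation, tail))

    examples = []
    for sentence in sentences:
        n = recognize_entities(sentence)                        # step 1.1: sentence entities
        for w1, w2 in combinations(n, 2):                       # step 1.2: sentence entity pairs
            template = sentence.replace(w1, "").replace(w2, "") # step 1.3.1: sentence template
            e1, e2 = link_entity(w1), link_entity(w2)           # step 1.3.2: entity pair
            path = find_relation_path(out_edges, e1, e2)        # step 1.3.2: relation path
            if path:
                examples.append((template, (e1, e2), path))
    return examples

def find_relation_path(out_edges, source, target, max_len=3):
    """Breadth-first search for a chain of triples (Definitions 9 and 10) linking source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        entity, path = queue.popleft()
        if entity == target and path:
            return path
        if len(path) >= max_len:
            continue
        for relation, tail in out_edges[entity]:
            if tail not in seen:
                seen.add(tail)
                queue.append((tail, path + [relation]))
    return []
```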
Step 2: represent entities, relations, and sentence templates as vectors;
wherein entity is as in Definition 1, relation is as in Definition 2, and sentence template is as in Definition 6;
Step 2.1: randomly initialize a vector for each entity in the entity set, the set of all resulting vectors being denoted E, and randomly initialize a vector for each relation in the relation set, the set of all resulting vectors being denoted R;
Step 2.2: substitute every triple in the triple knowledge base into the scoring formula in turn to compute L_k^i, then sum L_k^i over all triples to obtain L_k;
wherein e_i^1 denotes the first entity of the i-th triple, e_i^2 the second entity, r_i the relation of the i-th triple, and E(e_i^1), E(e_i^2) and E(r_i) the vectors randomly initialized in step 2.1 for these two entities and this relation.
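The formula for L_k^i is not reproduced in this text (it appears only as an image in the published document). Given the variables listed above, a TransE-style translation distance is a natural reading; the sketch below is therefore an assumption, not the patent's verbatim formula, and the entities, relations, and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Step 2.1: randomly initialize one vector per entity and per relation.
entities = ["Barack_Obama", "Honolulu", "Hawaii"]
relations = ["born_in", "located_in"]
E = {e: rng.normal(size=dim) for e in entities}
R = {r: rng.normal(size=dim) for r in relations}

def triple_loss(e1, r, e2):
    """Assumed TransE-style term: L_k^i = ||E(e1) + E(r) - E(e2)||_2."""
    return np.linalg.norm(E[e1] + R[r] - E[e2])

kb = [("Barack_Obama", "born_in", "Honolulu"), ("Honolulu", "located_in", "Hawaii")]
L_k = sum(triple_loss(*t) for t in kb)   # step 2.2: sum over all triples
print(L_k)
```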
Step 2.3: for each sentence template obtained in step 1.3.1, perform steps 2.3.1-2.3.2:
Step 2.3.1: feed the i-th sentence template s_i into the sequence-based neural network model; its output is the vector representation of the template, denoted f(s_i);
Step 2.3.2: by Definition 6, each sentence template has a corresponding sentence entity pair; substitute the two entities of the entity pair obtained for this sentence template in step 1.3.2, together with the f(s_i) obtained in step 2.3.1, into formula (1) to compute L_s^i;
wherein E(w_i^1) and E(w_i^2) denote the vectors initialized in step 2.1 for the two entities of the entity pair, and ‖·‖ denotes the two-norm;
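Formula (1) is likewise not reproduced here. A plausible assumed form, consistent with the two-norm and the variables named above, lets the template vector play the role the relation vector plays in step 2.2; the sketch below uses a plain recurrent encoder (the embodiment later mentions a recurrent neural network) and is a sketch under that assumption, not the patent's exact model:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 50

# Toy word vectors for template tokens and toy entity vectors (illustrative only).
word_vecs = {w: rng.normal(size=dim) for w in ["was", "born", "in"]}
E = {e: rng.normal(size=dim) for e in ["Barack_Obama", "Honolulu"]}

# A vanilla recurrent encoder standing in for the sequence-based model of step 2.3.1.
W_h = rng.normal(size=(dim, dim)) * 0.1
W_x = rng.normal(size=(dim, dim)) * 0.1

def encode_template(tokens):
    """f(s_i): run the template tokens through a simple RNN and return the last state."""
    h = np.zeros(dim)
    for tok in tokens:
        h = np.tanh(W_h @ h + W_x @ word_vecs.get(tok, np.zeros(dim)))
    return h

def template_loss(template_tokens, e1, e2):
    """Assumed formula (1): L_s^i = ||E(w_i^1) + f(s_i) - E(w_i^2)||_2."""
    return np.linalg.norm(E[e1] + encode_template(template_tokens) - E[e2])

print(template_loss(["was", "born", "in"], "Barack_Obama", "Honolulu"))
```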
Step 2.4: sum all L_s^i obtained in step 2.3 to obtain L_s;
Step 2.5: for every sentence template s_i obtained in step 1.3.1, substitute the vector f(s_i) obtained in step 2.3.1 and the relation path obtained in step 1.3.2 for the sentence entity pair corresponding to this template into formula (2) to compute L_p^i;
wherein E(r_j) denotes the vector initialized in step 2.1 for each relation on this relation path, and Σ denotes summation; then sum L_p^i over all sentence templates to obtain L_p;
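Formula (2) is also not rendered in this text. Since it combines the template vector with the relation vectors of the path under a summation, an assumed form is a two-norm between the template vector and the sum of the path's relation vectors, as sketched below (an assumption, not the patent's exact formula):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 50
R = {r: rng.normal(size=dim) for r in ["born_in", "located_in"]}

def path_loss(template_vec, relation_path):
    """Assumed formula (2): L_p^i = ||f(s_i) - sum_j E(r_j)||_2 over the relation path."""
    path_sum = np.sum([R[r] for r in relation_path], axis=0)
    return np.linalg.norm(template_vec - path_sum)

f_si = rng.normal(size=dim)    # f(s_i) from the template encoder
print(path_loss(f_si, ["born_in", "located_in"]))
```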
Step 2.6: sum the L_k obtained in step 2.2, the L_s obtained in step 2.4, and the L_p obtained in step 2.5 to obtain the optimization objective function L;
Step 2.7: use a gradient descent algorithm to optimize the entity vectors, the relation vectors, and the parameters of the sequence-based neural network model that appear in the objective function L so that L is minimized; after optimization, the optimal entity vectors, relation vectors, and parameters of the sequence-based neural network model are obtained.
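Putting the three terms together, a minimal training sketch for L = L_k + L_s + L_p could look like the following. The patent does not name a framework; PyTorch, the GRU encoder, and the assumed loss forms are illustrative choices, and the integer-id tensors (triples, token ids, entity ids, relation-path ids) are hypothetical inputs:

```python
import torch
import torch.nn as nn

class CompletionModel(nn.Module):
    """Joint model: entity/relation embeddings plus a sequence encoder for templates."""
    def __init__(self, n_entities, n_relations, vocab_size, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)    # step 2.1: entity vectors
        self.rel = nn.Embedding(n_relations, dim)   # step 2.1: relation vectors
        self.word = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # sequence-based model

    def encode_template(self, token_ids):
        emb = self.word(token_ids).unsqueeze(0)     # (1, seq_len, dim), batch of one
        _, h = self.encoder(emb)                    # h: (1, 1, dim)
        return h[-1, 0]                             # f(s_i): (dim,)

    def loss(self, triples, templates):
        # triples: LongTensor (N, 3) of (entity, relation, entity) ids.
        # templates: list of (token_ids LongTensor, (w1, w2) entity-id tensors, path LongTensor).
        e1, r, e2 = (triples[:, i] for i in range(3))
        # L_k: assumed TransE-style term over knowledge-base triples.
        L_k = torch.norm(self.ent(e1) + self.rel(r) - self.ent(e2), dim=1).sum()
        # L_s and L_p: assumed forms of formulas (1) and (2) over training templates.
        L_s = L_p = torch.zeros(())
        for token_ids, (w1, w2), path in templates:
            f_s = self.encode_template(token_ids)
            L_s = L_s + torch.norm(self.ent(w1) + f_s - self.ent(w2))
            L_p = L_p + torch.norm(f_s - self.rel(path).sum(dim=0))
        return L_k + L_s + L_p                      # step 2.6: overall objective L

model = CompletionModel(n_entities=100, n_relations=20, vocab_size=1000)
opt = torch.optim.SGD(model.parameters(), lr=0.01)  # step 2.7: gradient descent
# One update would be: opt.zero_grad(); model.loss(triples, templates).backward(); opt.step()
```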
Step 3: complete the entity word in the sentence, which specifically includes the following sub-steps:
Step 3.1: the user provides a sentence s in which an entity word needs to be completed; extract all sentence entities in s using the named entity recognition method; the resulting sentence entities form the set E_1;
wherein the named entity recognition method is as in Definition 7;
Step 3.2: for each sentence entity in E_1, perform steps 3.2.1-3.2.6, where the i-th sentence entity is denoted w_i:
Step 3.2.1: combine the sentence entity w_i with the sentence entity w to be completed to form a sentence entity pair;
Step 3.2.2: delete w_i from s to obtain a sentence template;
wherein sentence template is as in Definition 6;
Step 3.2.3: feed the sentence template and the optimal parameters of the sequence-based neural network model obtained in step 2.7 into the sequence-based neural network model to obtain the vector representing the sentence template;
Step 3.2.4: use the entity linking method to find the entity e_i in the entity set that matches the sentence entity w_i;
Step 3.2.5: substitute the entity e_i obtained in step 3.2.4 and the template vector obtained in step 3.2.3 into formula (3) to compute the vector E_i(e);
wherein E(e_i) is the vector obtained for e_i in step 2.7;
Step 3.2.6: use a similarity formula to compute the similarity between the vector E_i(e) and the vector of every entity obtained in step 2.7, sort all entities by this similarity in descending order, and record the rank of each entity;
Step 3.3: for each entity in the entity set, sum the ranks it received in step 3.2.6 to obtain its rank sum;
Step 3.4: sort the entities of the entity set by the rank sum obtained in step 3.3 in ascending order, return the sentence entity corresponding to the top-ranked entity, and use it to complete the sentence s given by the user in step 3.1;
wherein sentence entity is as in Definition 5. A sketch of this completion procedure is given below.
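Formula (3) is not reproduced either. Because the resulting vector E_i(e) is then compared against all entity vectors, an assumed translation-style form E_i(e) = E(e_i) + f_i (mirroring the assumed formula (1)) is used in the sketch below, with cosine similarity as the similarity formula (the embodiment uses cosine similarity); the helper callables are hypothetical:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def complete_entity(sentence_entities, entity_vecs, encode_template, link_entity, make_template):
    """Steps 3.2-3.4: rank every knowledge-base entity by its summed rank over all cues."""
    rank_sums = {e: 0 for e in entity_vecs}
    for w_i in sentence_entities:
        f_i = encode_template(make_template(w_i))          # step 3.2.3: template vector
        e_i = link_entity(w_i)                             # step 3.2.4: linked entity
        query = entity_vecs[e_i] + f_i                     # step 3.2.5: assumed formula (3)
        ranked = sorted(entity_vecs,                       # step 3.2.6: similarity ranking
                        key=lambda e: cosine(query, entity_vecs[e]),
                        reverse=True)
        for rank, e in enumerate(ranked, start=1):         # step 3.3: accumulate ranks
            rank_sums[e] += rank
    return min(rank_sums, key=rank_sums.get)               # step 3.4: smallest rank sum wins
```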
Advantageous effects
Compared with existing sentence completion methods, the sentence entity completion method fusing a triple knowledge base of the present invention has the following advantages:
1. When completing an entity word in a sentence, it can take into account the relations between the entity word to be completed and the other entity words in the sentence, which solves the problem that conventional sentence completion methods have difficulty completing entity words.
2. In a sentence entity completion task on a data set constructed from Wikipedia and Freebase, with the Wikipedia sentences split randomly into a training set and a test set, the experimental results show that, on the same data set, the proposed completion method is clearly better than both a traditional language model based on word co-occurrence and a restricted system that does not use triple information, as measured by the mean rank (MR) and top-10 hit rate (H@10) evaluation metrics.
Description of the drawings
Fig. 1 is the overall framework and design flow chart of the sentence entity completion method fusing a triple knowledge base of the present invention.
Detailed description
The method of the invention is described in detail below with reference to the accompanying drawings and embodiments.
Embodiment 1
The detailed flow of the sentence entity completion method fusing a triple knowledge base is shown in Fig. 1. This embodiment describes the flow of the method of the invention and its specific implementation.
The data used in this embodiment consist of 50,000 sentences from Wikipedia and 200 million triples from Freebase.
The flow chart of the sentence entity completion method fusing a triple knowledge base used in this embodiment is shown in Fig. 1; the specific steps are:
Step A: build the data set used for model training.
Given a sentence set, perform steps A.1-A.3 for each sentence in the set:
Step A.1: extract all sentence entities in the sentence using the named entity recognition method; the resulting sentence entities form the set E. The named entity recognition method is implemented with the open-source tool OpenNLP: all sentences are fed into OpenNLP to obtain the entity words.
Step A.2: pair up the sentence entities in the E obtained in step A.1 by traversing all combinations; every two matched sentence entities form a sentence entity pair;
Step A.3: for each sentence entity pair obtained in step A.2, perform steps A.3.1-A.3.2:
Step A.3.1: delete the two sentence entities of the pair from the sentence to obtain a sentence template;
Step A.3.2: use the entity linking method to find the two entities in the entity set that match the two sentence entities of the pair and form an entity pair; the entity linking is realized by matching against anchor texts in Wikipedia. Then traverse all triples in Freebase to find a relation path that connects the two entities of the entity pair;
wherein step A.3.1 and step A.3.2 can be performed simultaneously, or step A.3.2 can be performed first and step A.3.1 afterwards. A sketch of the anchor-text linking idea is given below.
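OpenNLP itself is a Java library, so the NER step is not sketched here; the snippet below only illustrates the anchor-text linking idea, under the assumption that a table mapping Wikipedia anchor strings to knowledge-base entities has been extracted beforehand (the table contents and function name are illustrative):

```python
def link_by_anchor_text(sentence_entity, anchor_table):
    """Entity linking via Wikipedia anchor texts: look the surface form up in a
    precomputed table mapping anchor strings to knowledge-base entities."""
    return anchor_table.get(sentence_entity.lower())

# Toy anchor table; in the embodiment this would be extracted from Wikipedia dumps.
anchor_table = {
    "obama": "Barack_Obama",
    "barack obama": "Barack_Obama",
    "hawaii": "Hawaii",
}

print(link_by_anchor_text("Obama", anchor_table))   # -> "Barack_Obama"
```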
Step B: represent entities, relations, and sentence templates as vectors.
Step B.1: randomly initialize a vector for each entity in the Freebase entity set and a vector for each relation in the Freebase relation set;
Step B.2: substitute every triple in Freebase into the scoring formula to compute L_k^i, then sum over all triples to obtain L_k; wherein e_i^1 denotes the first entity of the i-th Freebase triple, e_i^2 the second entity, r_i its relation, and E(e_i^1), E(e_i^2) and E(r_i) the vectors randomly initialized in step B.1 for these two entities and this relation;
Step B.3: for each sentence template obtained in step A.3.1, perform steps B.3.1-B.3.2:
Step B.3.1: feed the i-th sentence template s_i into the sequence-based neural network model; its output is the vector representation of the template, denoted f(s_i);
Step B.3.2: substitute the two entities of the entity pair obtained for this sentence template in step A.3.2 and the f(s_i) obtained in step B.3.1 into formula (1) to compute L_s^i; wherein E(w_i^1) and E(w_i^2) denote the vectors initialized in step B.1 for the two entities of the entity pair;
Step B.4: sum all L_s^i obtained in step B.3 to obtain L_s;
Step B.5: for every sentence template obtained in step A.3.1, substitute the vector obtained for it in step B.3.1 and the relation path obtained through steps A.3.1-A.3.2 into formula (2) to compute L_p^i; wherein E(r_j) denotes the vector initialized in step B.1 for each relation on the relation path between the two entities in Freebase; then sum L_p^i over all sentence templates to obtain L_p;
Step B.6: sum the L_k obtained in step B.2, the L_s obtained in step B.4, and the L_p obtained in step B.5 to obtain the optimization objective function L;
Step B.7: use a gradient descent algorithm to optimize the entity vectors, the relation vectors, and the parameters of the recurrent neural network model in the objective function L so that L is minimized; after optimization, the optimal vectors of all Freebase entities, the vectors of all relations, and the parameters of the recurrent neural network model are obtained.
Step C: complete the entity word in the sentence.
Step C.1: the given sentence to be completed is "Obama was born in ___, U.S."; using the named entity recognition method, all sentence entities in it, "Obama" and "U.S.", are extracted, and the resulting sentence entities form the set E_1;
Step C.2: for each sentence entity in E_1, i.e., "Obama" and "U.S.", perform steps C.2.1-C.2.6:
Step C.2.1: combine this sentence entity with the sentence entity w to be completed to form a sentence entity pair;
Step C.2.2: delete this sentence entity from the original sentence to obtain a sentence template; for example, deleting "Obama" yields the template "___ was born in ___, U.S.", and deleting "U.S." yields the template "Obama was born in ___, ___";
Step C.2.3: feed the sentence template and the optimal parameters of the recurrent neural network obtained in step B.7 into the recurrent neural network model to obtain the vector representing the sentence template;
Step C.2.4: use the entity linking method to find the entity e_i in the Freebase entity set that matches this sentence entity;
Step C.2.5: substitute the entity obtained in step C.2.4 and the template vector f obtained in step C.2.3 into formula (3) to compute E_i(e); wherein E(e_i) is the vector obtained for e_i in step B.7;
Step C.2.6: use cosine similarity to compute the similarity between the vector E_i(e) and the vector of every Freebase entity obtained in step B.7, sort all entities by this similarity in descending order, and record the rank of each entity;
Step C.3: for each entity in the Freebase entity set, sum all the ranks it received in step C.2.6 to obtain its rank sum;
Step C.4: sort the entities of the Freebase entity set by the rank sum obtained in step C.3 in ascending order, return the sentence entity "Hawaii" corresponding to the top-ranked entity, and complete the original sentence with it, obtaining the complete sentence "Obama was born in Hawaii, U.S.". A worked sketch of this rank aggregation follows.
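As an illustration of steps C.2.6-C.4, the toy sketch below aggregates the ranks produced by the two cues ("Obama" and "U.S.") and returns the entity with the smallest rank sum; the candidate entities and their orderings are invented for illustration only:

```python
# Descending-similarity rankings produced (hypothetically) by the two cues.
ranking_from_obama = ["Hawaii", "Ohio", "Texas"]
ranking_from_us    = ["Texas", "Hawaii", "Ohio"]

rank_sums = {}
for ranking in (ranking_from_obama, ranking_from_us):
    for rank, entity in enumerate(ranking, start=1):            # step C.2.6: rank per cue
        rank_sums[entity] = rank_sums.get(entity, 0) + rank     # step C.3: rank sum

best = min(rank_sums, key=rank_sums.get)                        # step C.4: smallest rank sum
print(rank_sums, best)    # "Hawaii" has rank sum 1 + 2 = 3 and is returned
```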
Embodiment 2
In a sentence entity completion task on the data set constructed from Wikipedia and Freebase, the Wikipedia sentences are split randomly into a training set and a test set. Under the same data set, the method used herein is compared with a traditional language model based on word co-occurrence and with a restricted system that does not use triple information, using the mean rank (MR) and top-10 hit rate (H@10) as evaluation metrics; the following experimental results are obtained.
Table 1: performance comparison between the method proposed by the present invention and other sentence completion methods
The experimental results in Table 1 show that, with identical training and test data, the method of the invention achieves a clear improvement on the MR and H@10 metrics compared with the methods that do not use the present invention.
The above is a preferred embodiment of the present invention, and the present invention should not be limited to the content disclosed in the embodiment and the accompanying drawings. Any equivalent implementation or modification completed without departing from the spirit of the present disclosure falls within the scope of protection of the present invention.

Claims (1)

1. A sentence entity completion method fusing a triple knowledge base, characterized in that: the related definitions are given first, as follows:
Definition 1: entity, an identifier that represents a specific thing;
Definition 2: relation, an identifier that represents the connection between the things represented by two entities;
Definition 3: triple, a structure composed of two entities and the relation between them; wherein the two entities form an entity pair;
Definition 4: triple knowledge base, a set composed of a large number of triples;
wherein the set of all entities appearing in the triples is called the entity set, and the set of all relations is called the relation set;
Definition 5: sentence entity, a noun in a sentence that can be matched to an entity;
Definition 6: sentence template, the content that remains after deleting any two sentence entities from a sentence;
Definition 7: named entity recognition method, a method that takes a sentence as input and returns its sentence entities;
Definition 8: entity linking method, a method that takes a sentence entity as input and finds the entity it can be matched to;
Definition 9: triple connection rule, i.e., if the second entity of one triple is identical to the first entity of another triple, the two triples can be connected through that shared entity;
Definition 10: relation path, i.e., the set of relations contained in a chain of triples connected according to the triple connection rule;
wherein the triple connection rule is as in Definition 9;
the concrete steps of the sentence entity completion method fusing a triple knowledge base are:
Step 1: build the data set used for model training;
given a sentence set, perform steps 1.1-1.3 for each sentence in the sentence set:
Step 1.1: extract all sentence entities in the sentence using the named entity recognition method; the resulting sentence entities form the set N;
wherein sentence entity is as in Definition 5 and the named entity recognition method is as in Definition 7;
Step 1.2: pair up the sentence entities in the N obtained in step 1.1 by traversing all combinations; every two matched sentence entities form a sentence entity pair;
Step 1.3: for each sentence entity pair obtained in step 1.2, perform steps 1.3.1-1.3.2:
Step 1.3.1: delete the two sentence entities of the pair from the sentence to obtain a sentence template;
wherein sentence template is as in Definition 6;
Step 1.3.2: use the entity linking method to find the two entities in the entity set that match the two sentence entities of the pair, and form an entity pair; then traverse all triples in the triple knowledge base to find a relation path that connects the two entities of the entity pair;
wherein the entity linking method is as in Definition 8, the triple knowledge base is as in Definition 4, and the relation path is as in Definition 10;
Step 2: represent entities, relations, and sentence templates as vectors;
wherein entity is as in Definition 1, relation is as in Definition 2, and sentence template is as in Definition 6;
Step 2.1: randomly initialize a vector for each entity in the entity set, the set of all resulting vectors being denoted E, and randomly initialize a vector for each relation in the relation set, the set of all resulting vectors being denoted R;
Step 2.2: substitute every triple in the triple knowledge base into the scoring formula in turn to compute L_k^i, then sum L_k^i over all triples to obtain L_k;
wherein e_i^1 denotes the first entity of the i-th triple, e_i^2 the second entity, r_i the relation of the i-th triple, and E(e_i^1), E(e_i^2) and E(r_i) the vectors randomly initialized in step 2.1 for these two entities and this relation;
Step 2.3: for each sentence template obtained in step 1.3.1, perform steps 2.3.1-2.3.2:
Step 2.3.1: feed the i-th sentence template s_i into the sequence-based neural network model; its output is the vector representation of the template, denoted f(s_i);
Step 2.3.2: by Definition 6, each sentence template has a corresponding sentence entity pair; substitute the two entities of the entity pair obtained for this sentence template in step 1.3.2, together with the f(s_i) obtained in step 2.3.1, into formula (1) to compute L_s^i;
wherein E(w_i^1) and E(w_i^2) denote the vectors initialized in step 2.1 for the two entities of the entity pair, and ‖·‖ denotes the two-norm;
Step 2.4: sum all L_s^i obtained in step 2.3 to obtain L_s;
Step 2.5: for every sentence template s_i obtained in step 1.3.1, substitute the vector f(s_i) obtained in step 2.3.1 and the relation path obtained in step 1.3.2 for the sentence entity pair corresponding to this template into formula (2) to compute L_p^i;
wherein E(r_j) denotes the vector initialized in step 2.1 for each relation on this relation path, and Σ denotes summation; then sum L_p^i over all sentence templates to obtain L_p;
Step 2.6: sum the L_k obtained in step 2.2, the L_s obtained in step 2.4, and the L_p obtained in step 2.5 to obtain the optimization objective function L;
Step 2.7: use a gradient descent algorithm to optimize the entity vectors, the relation vectors, and the parameters of the sequence-based neural network model that appear in the objective function L so that L is minimized; after optimization, the optimal entity vectors, relation vectors, and parameters of the sequence-based neural network model are obtained;
Step 3: complete the entity word in the sentence, which specifically includes the following sub-steps:
Step 3.1: the user provides a sentence s in which an entity word needs to be completed; extract all sentence entities in s using the named entity recognition method; the resulting sentence entities form the set E_1;
wherein the named entity recognition method is as in Definition 7;
Step 3.2: for each sentence entity in E_1, perform steps 3.2.1-3.2.6, where the i-th sentence entity is denoted w_i:
Step 3.2.1: combine the sentence entity w_i with the sentence entity w to be completed to form a sentence entity pair;
Step 3.2.2: delete w_i from s to obtain a sentence template;
wherein sentence template is as in Definition 6;
Step 3.2.3: feed the sentence template and the optimal parameters of the sequence-based neural network model obtained in step 2.7 into the sequence-based neural network model to obtain the vector representing the sentence template;
Step 3.2.4: use the entity linking method to find the entity e_i in the entity set that matches the sentence entity w_i;
Step 3.2.5: substitute the entity e_i obtained in step 3.2.4 and the template vector obtained in step 3.2.3 into formula (3) to compute the vector E_i(e);
wherein E(e_i) is the vector obtained for e_i in step 2.7;
Step 3.2.6: use a similarity formula to compute the similarity between the vector E_i(e) and the vector of every entity obtained in step 2.7, sort all entities by this similarity in descending order, and record the rank of each entity;
Step 3.3: for each entity in the entity set, sum the ranks it received in step 3.2.6 to obtain its rank sum;
Step 3.4: sort the entities of the entity set by the rank sum obtained in step 3.3 in ascending order, return the sentence entity corresponding to the top-ranked entity, and use it to complete the sentence s given by the user in step 3.1;
wherein sentence entity is as in Definition 5.
CN201810328826.9A 2018-04-13 2018-04-13 Sentence entity completion method fusing a triple knowledge base Pending CN108563637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810328826.9A CN108563637A (en) Sentence entity completion method fusing a triple knowledge base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810328826.9A CN108563637A (en) Sentence entity completion method fusing a triple knowledge base

Publications (1)

Publication Number Publication Date
CN108563637A true CN108563637A (en) 2018-09-21

Family

ID=63534799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810328826.9A Pending CN108563637A (en) Sentence entity completion method fusing a triple knowledge base

Country Status (1)

Country Link
CN (1) CN108563637A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1049532A (en) * 1996-07-31 1998-02-20 A T R Onsei Honyaku Tsushin Kenkyusho:Kk Example machine translation system
CN102262658A (en) * 2011-07-13 2011-11-30 东北大学 Method for extracting web data from bottom to top based on entity
CN105808688A (en) * 2016-03-02 2016-07-27 百度在线网络技术(北京)有限公司 Complementation retrieval method and device based on artificial intelligence
CN107357787A (en) * 2017-07-26 2017-11-17 微鲸科技有限公司 Semantic interaction method, apparatus and electronic equipment
CN107491500A (en) * 2017-07-28 2017-12-19 中国人民大学 A kind of knowledge base complementing method of strong adaptability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOCHI WEI et al.: "I Know What You Want to Express: Sentence Element Inference by Incorporating External Knowledge Base", IEEE Transactions on Knowledge and Data Engineering *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918640A (en) * 2018-12-22 2019-06-21 浙江工商大学 A kind of Chinese text proofreading method of knowledge based map
CN109918640B (en) * 2018-12-22 2023-05-02 浙江工商大学 Chinese text proofreading method based on knowledge graph
CN111858867A (en) * 2019-04-30 2020-10-30 广东小天才科技有限公司 Incomplete corpus completion method and device
CN110263324A (en) * 2019-05-16 2019-09-20 华为技术有限公司 Text handling method, model training method and device
US20220147715A1 (en) * 2019-05-16 2022-05-12 Huawei Technologies Co., Ltd. Text processing method, model training method, and apparatus
CN114090722A (en) * 2022-01-19 2022-02-25 支付宝(杭州)信息技术有限公司 Method and device for automatically completing query content

Similar Documents

Publication Publication Date Title
CN109902171B (en) Text relation extraction method and system based on hierarchical knowledge graph attention model
CN104615767B (en) Training method, search processing method and the device of searching order model
CN106202032B (en) A kind of sentiment analysis method and its system towards microblogging short text
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN108563637A (en) A kind of sentence entity complementing method of fusion triple knowledge base
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN106649275A (en) Relation extraction method based on part-of-speech information and convolutional neural network
CN106855853A (en) Entity relation extraction system based on deep neural network
CN106055604B (en) Word-based network carries out the short text topic model method for digging of feature extension
CN109101479A (en) A kind of clustering method and device for Chinese sentence
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
WO2020063092A1 (en) Knowledge graph processing method and apparatus
CN111310438A (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN108763376B (en) Knowledge representation learning method for integrating relationship path, type and entity description information
CN108280064A (en) Participle, part-of-speech tagging, Entity recognition and the combination treatment method of syntactic analysis
CN111639252A (en) False news identification method based on news-comment relevance analysis
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
CN110675269B (en) Text auditing method and device
CN111651566B (en) Multi-task small sample learning-based referee document dispute focus extraction method
CN112100365A (en) Two-stage text summarization method
CN107679225A (en) A kind of reply generation method based on keyword
CN108073576A (en) Intelligent search method, searcher and search engine system
CN109117474A (en) Calculation method, device and the storage medium of statement similarity
CN107092605A (en) A kind of entity link method and device
CN105446955A (en) Adaptive word segmentation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180921

WD01 Invention patent application deemed withdrawn after publication