CN109359303A - Word sense disambiguation method and system based on a graph model - Google Patents

Word sense disambiguation method and system based on a graph model

Info

Publication number
CN109359303A
CN109359303A
Authority
CN
China
Prior art keywords
word
meaning
similarity
concept
sim
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811503355.7A
Other languages
Chinese (zh)
Other versions
CN109359303B (en)
Inventor
孟凡擎
燕孝飞
张强
陈文平
鹿文鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zaozhuang University
Original Assignee
Zaozhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zaozhuang University
Priority to CN201811503355.7A
Publication of CN109359303A
Application granted
Publication of CN109359303B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word sense disambiguation method and system based on a graph model, belonging to the field of natural language processing. The technical problem to be solved by the invention is how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge contained in those resources, and improve word sense disambiguation performance. The technical solution adopted is as follows: 1. A word sense disambiguation method based on a graph model, comprising the following steps: S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs; S2, similarity calculation: separately perform English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation; S3, constructing the disambiguation graph; S4, selecting the correct word sense. 2. A word sense disambiguation system based on a graph model, comprising a context knowledge extraction unit, a similarity calculation unit, a disambiguation graph construction unit, and a correct sense selection unit.

Description

Word sense disambiguation method and system based on a graph model
Technical field
The present invention relates to the field of natural language processing, and specifically to a word sense disambiguation method and system based on a graph model.
Background technique
Word sense disambiguation refers to determining the specific sense of an ambiguous word according to the specific context in which it occurs. It is a fundamental research problem in natural language processing and directly affects upper-layer applications such as machine translation, information extraction, information retrieval, text classification, and sentiment analysis. Polysemy is ubiquitous, whether in Chinese or in English and other Western languages.
Traditional graph-model-based methods for the Chinese word sense disambiguation task mainly use one or more Chinese knowledge resources; troubled by the problem of insufficient knowledge resources, their disambiguation performance is low. Therefore, how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance is a technical problem that currently urgently needs to be solved.
The patent document CN105893346A discloses a graph-model word sense disambiguation method based on dependency syntax trees. Its steps are: 1. preprocess the sentence and extract the content words to be disambiguated, mainly including normalization, tokenization, and lemmatization; 2. perform dependency syntactic analysis on the sentence and construct its dependency syntax tree; 3. obtain the distances between words on the dependency syntax tree, i.e., the lengths of the shortest paths; 4. construct a disambiguation knowledge graph for the sense concepts of the words in the sentence according to a knowledge base; 5. calculate the graph score of each sense node according to the lengths of the semantic association paths between sense nodes in the disambiguation knowledge graph, the weights of the associated edges, and the distances of the path endpoints on the dependency syntax tree; 6. for each ambiguous word, select the sense with the highest graph score as the correct sense. However, that technical solution uses the semantic associations contained in BabelNet rather than the semantic knowledge in HowNet; it is suitable for English word sense disambiguation but not for Chinese, and it cannot solve the problem of how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance.
Summary of the invention
The technical task of the present invention is to provide a word sense disambiguation method and system based on a graph model, so as to solve the problem of how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance.
The technical task of the present invention is achieved in the following manner. A word sense disambiguation method based on a graph model includes the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
S2, similarity calculation: separately perform English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation;
S3, constructing the disambiguation graph: perform weight optimization on the similarities using the simulated annealing algorithm to obtain the fused similarity; then, taking word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights, construct the disambiguation graph;
S4, selecting the correct sense: score the candidate senses in the graph by graph scoring, obtain the score list of the candidate senses, and select the highest-scoring sense as the correct one.
Preferably, the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then use the word similarity algorithm based on word vectors and a knowledge base to calculate similarities between the obtained English words. In addition, since HowNet is bilingual, the sense mapping here directly obtains the English word information in HowNet;
S202, word-vector-based similarity calculation: the Sogou full-network news corpus totals 1.43 GB; train word vectors on this corpus with Google's word2vec toolkit to obtain a word vector file, obtain the word vectors of two given words from the word vector file, and calculate the cosine similarity between the word vectors as their similarity (a minimal sketch follows this list);
S203, HowNet-based similarity calculation: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and calculate the similarity between senses using the concept similarity toolkit provided by HowNet.
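As an illustration of step S202, the following is a minimal Python sketch of the cosine similarity between two word vectors; the file name and the loading helper are illustrative assumptions, not taken from the patent:

import numpy as np

def load_vectors(path):
    # word2vec text format: a header line, then one word per line
    # followed by its vector components
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header (vocabulary size, dimension)
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# usage: similarity of two context words under the trained vectors
# vectors = load_vectors("sogou_news_vectors.txt")  # hypothetical file name
# print(cosine_similarity(vectors["医疗"], vectors["机构"]))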
More preferably, the word similarity algorithm based on word vectors and a knowledge base in step S201 is as follows:
S20101, determine whether the given items are words or phrases:
1. if two English words are given, the similarity between the two words is obtained by calculating the cosine similarity of their two word vectors;
2. if a given item is a phrase, the word vectors of the words in the phrase need to be added to obtain the vector representation of the phrase, from which the similarity of phrases is obtained as follows:
sim(p1, p2) = cos((1/|p1|) · Σ_{i=1..|p1|} v(wi), (1/|p2|) · Σ_{j=1..|p2|} v(wj))
where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; wi and wj denote the i-th word of p1 and the j-th word of p2, respectively, and v(·) denotes a word's vector;
S20102, synset relevant to two English words is iteratively searched for, until iterative steps are more than γ;
S20103, construct a synset graph based on the two English words and the synsets related to them;
S20104, within the set distance range, calculate in the graph the overlap of the synsets related to the two English words, with the following formula:
sim_lap(wi, wj) = d · count(wi, wj) / (count(wi) + count(wj))
where count(wi, wj) denotes the number of synsets shared by words wi and wj; count(wi) and count(wj) are the numbers of synsets of wi and wj, respectively; and d denotes the value of the set distance range;
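For instance (illustrative numbers, not from the patent), with d = 2, if wi and wj share 3 synsets within the range while having 10 and 14 synsets respectively, then sim_lap(wi, wj) = 2 × 3 / (10 + 14) = 0.25.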
S20105, calculate the shortest path between wi and wj in the graph using Dijkstra's algorithm, and obtain the similarity of wi and wj with the following formula:
sim_bn(wi, wj) = α · (1/δ^path) + (1 − α) · sim_lap(wi, wj)
where path is the shortest path between wi and wj; δ is a value used to adjust the similarity; sim_lap(wi, wj) denotes the overlap between wi and wj; and the parameter α is a regulatory factor that balances the two parts of the formula;
S20106, linearly combine the similarity sim_vec obtained by the word-vector-based method in step S20101 with the similarity sim_bn obtained by the knowledge-base-based method in step S20105 to obtain the final similarity, with the following formula:
sim_final(wi, wj) = β · sim_vec + (1 − β) · sim_bn
where sim_bn and sim_vec denote the similarity obtained by the knowledge-base-based method and the similarity obtained by the word-vector-based method, respectively; and the parameter β is a regulatory factor that balances the results of the two methods;
S20107, return the similarity sim_final.
Preferably, the specific steps of constructing the disambiguation graph in step S3 are as follows:
S301, weight optimization: the weight optimization algorithm based on simulated annealing automatically optimizes the three similarity values from step S2 to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec
where ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: in the disambiguation graph, the senses are the vertices, the semantic relations between senses are the edges, and the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses.
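A minimal sketch of step S303, building the disambiguation graph with the fused similarities as edge weights; it assumes the three per-pair similarity scores have already been computed, and the function and variable names are illustrative, not the patent's own code:

import itertools
import networkx as nx

def build_disambiguation_graph(senses, sim_how, sim_en, sim_vec, alpha, beta, gamma):
    # senses: sense concepts, one vertex each
    # sim_how / sim_en / sim_vec: dicts mapping a sense pair to a score
    # alpha / beta / gamma: weights from simulated annealing, summing to 1
    graph = nx.Graph()
    graph.add_nodes_from(senses)
    for ws, ws2 in itertools.combinations(senses, 2):
        fused = (alpha * sim_how.get((ws, ws2), 0.0)
                 + beta * sim_en.get((ws, ws2), 0.0)
                 + gamma * sim_vec.get((ws, ws2), 0.0))
        if fused > 0:
            graph.add_edge(ws, ws2, weight=fused)  # edge weight = fused similarity
    return graph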
More preferably, the formula by which the simulated annealing algorithm in step S301 performs parameter optimization is:
P = 1, if result(x_new) ≥ result(x_old);
P = exp((result(x_new) − result(x_old)) / (δ·t)), if result(x_new) < result(x_old)
where result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly sampled parameters; and x_old denotes the original parameters.
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective value of the new parameters x_new is not less than the objective value of the original parameters x_old, the new parameters x_new are selected with probability P = 1;
(b) if the objective value of the new parameters x_new is less than the objective value of the original parameters x_old, then P = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with P:
1. if the randomly generated probability value is not greater than P, the new parameters x_new are selected;
2. if the randomly generated probability value is greater than P, the new parameters x_new are discarded.
The sense in step S303 refers to a triple, expressed as Word(No., Sword, Enword), where No. denotes the concept number, Sword denotes the first sememe word, and Enword denotes the English word. No., Sword, and Enword form an organic whole describing the same sense concept: a sense concept number uniquely identifies one sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can then be mapped to an English word.
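As a data structure, the sense triple can be sketched as follows; the class and field names are illustrative, and the two example triples are the senses of "中医" (Chinese medicine) given in the embodiment below:

from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    no: str      # HowNet concept number, unique per sense
    sword: str   # first sememe word of the concept definition
    enword: str  # English word the sense maps to

senses_of_zhongyi = [
    Sense("157329", "人", "practitioner of Chinese medicine"),
    Sense("157332", "知识", "traditional Chinese science"),
]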
Preferably, the specific steps of selecting the correct sense in step S4 are as follows:
S401, graph scoring: call a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts by score in descending order to form the candidate sense concept list;
S402, selecting the correct sense: select the correct sense from the disambiguation result, covering the following two cases:
1. if there is only one sense concept in the disambiguation result, take that sense concept as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, take the highest-scoring sense concept as the correct sense.
More preferably, graph scoring in step S401 uses the PageRank algorithm. The PageRank algorithm evaluates the nodes in the graph based on the Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 − α)/N + α · Σ_{u∈in(v)} PR(u)/|out(u)|
where 1 − α denotes the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α denotes the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
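A minimal sketch of the PageRank scoring of step S401 over the weighted disambiguation graph; it relies on the pagerank function of the networkx library, whose damping factor alpha plays the role of α above:

import networkx as nx

def score_senses(graph, alpha=0.85):
    # edge weights (the fused similarities) bias the random walk;
    # returns the candidate sense concepts sorted by score, descending
    scores = nx.pagerank(graph, alpha=alpha, weight="weight")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# usage: the top-ranked sense concept is taken as the correct sense
# ranking = score_senses(graph)
# correct_sense = ranking[0][0]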
A word sense disambiguation system based on a graph model, the system comprising:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
a similarity calculation unit, for separately performing English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation;
a disambiguation graph construction unit, for performing weight optimization on the similarities using the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights;
a correct sense selection unit, for scoring the candidate senses in the graph by graph scoring, obtaining the score list of the candidate senses, and selecting the sense with the maximum score as the correct sense.
Preferably, the similarity calculation unit includes:
an English similarity calculation unit, for annotating the context knowledge with HowNet sense information and performing sense mapping to obtain a set of English words, and then calculating similarities between the obtained English words with the word similarity algorithm based on word vectors and a knowledge base; since HowNet is bilingual, the sense mapping here directly obtains the English word information in HowNet;
a word vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, obtaining the word vectors of two given words from the word vector file, and calculating the cosine similarity between the word vectors as their similarity; it should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased toward some of the more common senses of that ambiguous word;
a HowNet similarity calculation unit, for annotating the context knowledge with sense information using HowNet, in the form of word plus concept number, and calculating the similarity between senses using the concept similarity toolkit provided by HowNet;
the disambiguation graph construction unit includes:
a weight optimization unit, for automatically optimizing the three similarity values of the English-based similarity calculation, the word-vector-based similarity calculation, and the HowNet-based similarity calculation with the weight optimization algorithm based on simulated annealing, to obtain the optimal weight parameters; the formula by which the simulated annealing algorithm performs parameter optimization is:
P = 1, if result(x_new) ≥ result(x_old);
P = exp((result(x_new) − result(x_old)) / (δ·t)), if result(x_new) < result(x_old)
where result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly sampled parameters; and x_old denotes the original parameters;
the formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective value of the new parameters x_new is not less than the objective value of the original parameters x_old, the new parameters x_new are selected with probability P = 1;
(b) if the objective value of the new parameters x_new is less than the objective value of the original parameters x_old, then P = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with P:
1. if the randomly generated probability value is not greater than P, the new parameters x_new are selected;
2. if the randomly generated probability value is greater than P, the new parameters x_new are discarded;
a similarity fusion unit: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec
where ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a graph construction unit, for constructing the disambiguation graph with senses as vertices and semantic relations between senses as edges, using the weight optimization algorithm based on simulated annealing to integrate the three similarity values as the edge weights between senses; here a sense refers to a triple, expressed as Word(No., Sword, Enword), where No. denotes the concept number, Sword denotes the first sememe word, and Enword denotes the English word; No., Sword, and Enword form an organic whole describing the same sense concept: a sense concept number uniquely identifies one sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can then be mapped to an English word.
More preferably, the correct sense selection unit includes:
a graph scoring unit, for calling a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, the candidate sense concepts are arranged by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph based on the Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it; the PageRank score of a node v is calculated as:
PR(v) = (1 − α)/N + α · Σ_{u∈in(v)} PR(u)/|out(u)|
where 1 − α denotes the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α denotes the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct sense selection subunit, for selecting the correct sense from the disambiguation result, covering the following two cases:
1. if there is only one sense concept in the disambiguation result, take that sense concept as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, take the highest-scoring sense concept as the correct sense.
The word sense disambiguation method and system based on a graph model of the present invention have the following advantages:
(1) by combining multiple Chinese and English resources whose advantages complement each other, the present invention fully mines the disambiguation knowledge in those resources, which helps improve word sense disambiguation performance;
(2) the present invention separately performs English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation, ensuring that multiple knowledge resources can be effectively integrated and improving disambiguation accuracy;
(3) the present invention performs weight optimization on the similarities using simulated annealing to obtain the fused similarity, and then constructs the disambiguation graph with word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights, guaranteeing that the similarity values of multiple knowledge resources are automatically optimized;
(4) when performing the English similarity calculation, the present invention annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, ensuring that Chinese and English knowledge resources can be aligned automatically;
(5) the present invention scores the candidate senses in the graph by graph scoring, obtains the score list of the candidate senses, and selects the sense with the maximum score as the correct sense, so the correct sense selection for the target ambiguous word can be realized automatically.
Description of the drawings
The following further describes the present invention with reference to the drawings.
Figure 1 is a flow diagram of the word sense disambiguation method based on a graph model;
Figure 2 is a flow diagram of the similarity calculation;
Figure 3 is a flow diagram of constructing the disambiguation graph;
Figure 4 is a flow diagram of selecting the correct sense;
Figure 5 is a structural block diagram of the word sense disambiguation system based on a graph model;
Figure 6 is the sense information diagram of the example word "中医" (Chinese medicine);
Figure 7 is the synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base.
Specific embodiment
The word sense disambiguation method and system based on a graph model of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 1, the word sense disambiguation method based on a graph model of the present invention includes the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
Example: take the sentence "Centering on the implementation of the 'guiding opinions,' and in light of the actual conditions of traditional Chinese medicine work, all regions should intensify their efforts and actively and steadily advance the reform of TCM medical institutions." for processing, in which "Chinese medicine" (中医) is the word to be disambiguated. Part-of-speech tagging uses the Chinese Academy of Sciences word segmentation system NLPIR-ICTCLAS. After part-of-speech tagging, the content words are extracted and formatted to facilitate subsequent processing, giving "Chinese medicine_n_25: around_v_0 guidance_vn_2 opinion_n_3 carry out_v_6 implement_v_7 combine_v_9 traditional Chinese medicine_n_10 work_vn_11 reality_n_13 want_v_16 increase_v_17 intensity_n_18 active_a_20 steady_a_22 advance_v_24 Chinese medicine_n_25 medical treatment_n_26 institution_n_27 reform_vn_28", where the token before the colon is the word to be disambiguated and the number after each part-of-speech tag is the word's position in the sentence.
S2, similarity calculation: separately perform English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation;
As shown in Fig. 2, the specific steps of the similarity calculation are as follows:
English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then use the word similarity algorithm based on word vectors and a knowledge base to calculate similarities between the obtained English words. In addition, since HowNet is bilingual, the sense mapping here directly obtains the English word information in HowNet. The main part of the word similarity algorithm based on word vectors and a knowledge base is described row by row below (see the sketch after this paragraph).
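The original code listing is not reproduced in this text. The following is a reconstructed Python sketch of rows 1-9 as described below, under stated assumptions (γ = 10, d = 2, δ = 1.4, β = 0.6; the balance factor α = 0.5 and the helpers vec_sim and related_synsets, which stand for the word-vector cosine of row 1 and for knowledge-base synset lookup, are hypothetical, not the patent's own code or a real BabelNet API):

import networkx as nx

GAMMA = 10    # maximum iteration depth for synset search (rows 2-4)
D = 2         # set distance range for the overlap (row 6)
DELTA = 1.4   # value adjusting the path similarity (row 7)
ALPHA = 0.5   # balance between path and overlap parts -- assumed value
BETA = 0.6    # balance between vector and knowledge-base parts (row 8)

def word_similarity(w1, w2, vec_sim, related_synsets):
    # rows 2-5: iteratively collect synsets related to w1 and w2
    # (at most GAMMA steps) and build the synset graph
    graph = nx.Graph()
    frontier = [w1, w2]
    for _ in range(GAMMA):
        next_frontier = []
        for node in frontier:
            for syn in related_synsets(node):
                if syn not in graph:
                    next_frontier.append(syn)
                graph.add_edge(node, syn)
        frontier = next_frontier
    # row 6: overlap of the synsets lying within distance D of w1 and w2
    near1 = set(nx.single_source_shortest_path_length(graph, w1, cutoff=D))
    near2 = set(nx.single_source_shortest_path_length(graph, w2, cutoff=D))
    sim_lap = D * len(near1 & near2) / (len(near1) + len(near2))
    # row 7: Dijkstra shortest path between w1 and w2, decayed by DELTA
    path = nx.dijkstra_path_length(graph, w1, w2)
    sim_bn = ALPHA * (1.0 / DELTA ** path) + (1 - ALPHA) * sim_lap
    # rows 8-9: linear combination with the vector-based similarity
    return BETA * vec_sim(w1, w2) + (1 - BETA) * sim_bn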
In the word similarity algorithm based on word vectors and a knowledge base, row 1: given two English words, their similarity is obtained by calculating the cosine similarity of their two word vectors; if a given item is a phrase, since the trained word vectors contain no phrases, the phrase needs further processing: the word vectors of the words in the phrase are added to obtain the phrase's vector representation, and the phrase similarity is then obtained as
sim(p1, p2) = cos((1/|p1|) · Σ_{i=1..|p1|} v(wi), (1/|p2|) · Σ_{j=1..|p2|} v(wj))
where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; wi and wj denote the i-th word of p1 and the j-th word of p2, respectively.
Rows 2-4 iteratively search for the synsets related to words w1 and w2 until the number of iteration steps exceeds γ; since the computational cost of the graph is high when there are too many nodes, the maximum number of iteration steps γ is set to 10. Row 5 constructs the graph from w1, w2, and the synsets associated with them. Row 6 calculates, within a certain distance range in the graph, the overlap of the synsets related to w1 and w2, with the distance set to 2:
sim_lap(w1, w2) = 2 · count(w1, w2) / (count(w1) + count(w2))
where count(w1, w2) denotes the number of synsets shared by words w1 and w2; count(w1) and count(w2) are the numbers of synsets of w1 and w2, respectively.
Row 7 calculates the shortest path between w1 and w2 in the graph using Dijkstra's algorithm and further obtains the similarity of w1 and w2:
sim_bn(w1, w2) = α · (1/δ^path) + (1 − α) · sim_lap(w1, w2)
where path is the shortest path between w1 and w2; δ is a value used to adjust the similarity, set to 1.4; sim_lap(w1, w2) denotes the overlap between w1 and w2; and the parameter α is a regulatory factor that balances the two parts of the formula.
Row 8 linearly combines the word-vector-based method above with the knowledge-base (BabelNet) method to obtain the final similarity:
sim_final(w1, w2) = β · sim_vec + (1 − β) · sim_bn
where sim_bn and sim_vec denote the similarities obtained by the knowledge-base method and the word-vector-based method, respectively; the parameter β is a regulatory factor that balances the results of the two methods, and is specifically set to 0.6.
Row 9 returns the similarity sim_final.
In the word similarity algorithm based on word vectors and a knowledge base, the word-vector processing trains word vectors on the unannotated English Wikipedia corpus using the word2vec toolkit. Before training, the data are preprocessed and the file format is converted from Unicode to UTF-8. The training window is set to 5, the default vector dimension is set to 200, and the Skip-gram model is selected. After training ends, a word vector file is obtained in which each word is mapped to a 200-dimensional vector, each dimension being a double-precision value.
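A minimal sketch of this training setup, using the gensim implementation of word2vec (the corpus path and its tokenization are placeholders; the patent itself uses Google's original word2vec toolkit):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus: one tokenized sentence per line, UTF-8 encoded (placeholder path)
sentences = LineSentence("enwiki_tokenized.txt")

# window = 5, 200-dimensional vectors, sg = 1 selects the Skip-gram model
model = Word2Vec(sentences, vector_size=200, window=5, sg=1)

# save in word2vec text format: one word and its 200 components per line
model.wv.save_word2vec_format("enwiki_vectors.txt", binary=False)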
The knowledge base chosen is BabelNet. BabelNet provides rich concepts and named entities interlinked by a large number of semantic relations; the semantic relations here include synonymy, hypernymy/hyponymy, part-whole relations, and so on. Given two words (concepts or named entities), their respective synsets, together with the synsets linked to them by semantic relations, can be obtained by means of the BabelNet API. A synset is a set of synonyms with a unique identifier in BabelNet that denotes one specific sense. For example, "bn:00021464n" identifies the synset "computer, computing machine, computing device, data processor, electronic computer, information processing system", which denotes the specific sense "computer". The synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base is shown in Fig. 7.
Example: annotate the context knowledge with HowNet sense information, specifically the sense numbers, obtaining "Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 carry out_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 intensity_n_18:076991 active_a_20:057562 active_a_20:057564 steady_a_22:126267 steady_a_22:126269 advance_v_24:122203 advance_v_24:122206 advance_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 institution_n_27:057323 institution_n_27:057325 institution_n_27:057326 reform_vn_28:041189".
After the sense mapping is performed, we obtain "Chinese medicine_n_25: around_v_0:124932|revolve round around_v_0:124933|centre on guidance_vn_2:155807|direct opinion_n_3:143264|complaint opinion_n_3:143267|idea carry out_v_6:047082|carry out implement_v_7:081572|feel at ease implement_v_7:081573|ascertain implement_v_7:081575|fulfil combine_v_9:064548|be united in wedlock combine_v_9:064549|combination traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs work_vn_11:044068|work reality_n_13:109077|reality reality_n_13:109078|practice want_v_16:140522|want to want_v_16:140530|ask want_v_16:140532|ask for want_v_16:140534|take increase_v_17:059967|widen increase_v_17:059968|enhance increase_v_17:059969|enlarge intensity_n_18:076991|dynamics active_a_20:057562|active active_a_20:057564|positive steady_a_22:126267|safe steady_a_22:126269|reliable advance_v_24:122203|move forward advance_v_24:122206|advance advance_v_24:122211|push into Chinese medicine_n_25:157332|traditional_Chinese_medical_science Chinese medicine_n_25:157329|practitioner_of_Chinese_medicine institution_n_27:057323|institution institution_n_27:057325|internal structure of an organization institution_n_27:057326|mechanism reform_vn_28:041189|reform".
English similarity calculation is then performed between any two of the obtained English words (each HowNet sense concept corresponds to one English word), obtaining "Chinese medicine_n_25: around_v_0:124932|revolve round and guidance_vn_2:155807|direct is 0.2920; and opinion_n_3:143264|complaint is 0.3085; and opinion_n_3:143267|idea is 0.3742; and carry out_v_6:047082|carry out is 0.4015; and implement_v_7:081572|feel at ease is 0.3575; and implement_v_7:081573|ascertain is 0.3215; and implement_v_7:081575|fulfil is 0.3541; and combine_v_9:064548|be united in wedlock is 0.3299; and combine_v_9:064549|combination is 0.3487; and traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs is 0.3520; and work_vn_11:044068|work is 0.3478; and reality_n_13:109077|reality is 0.3664; and reality_n_13:109078|practice is 0.3907; and want_v_16:140522|want to is 0.3375; and want_v_16:140530|ask is 0.3482". Since space is limited, only part of the similarity results are shown here.
Word-vector-based similarity calculation: the Sogou full-network news corpus totals 1.43 GB; train word vectors on this corpus with Google's word2vec toolkit to obtain a word vector file, obtain the word vectors of two given words from the word vector file, and calculate the cosine similarity between the word vectors as their similarity.
It should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased toward some of the more common senses of that ambiguous word. For this reason, HowNet is used to convert the ambiguous word into the senses it possesses, i.e., the first sememe in each concept definition; as shown in Fig. 6, the ambiguous word "中医" (Chinese medicine) is converted to "人" (person) and "知识" (knowledge).
Example: after the ambiguous words are processed with HowNet, we obtain "Chinese medicine_n_25: around_v_0:124932|surround around_v_0:124933|surround guidance_vn_2:155807|order opinion_n_3:143264|Chinese language opinion_n_3:143267|thought carry out_v_6:047082|implement implement_v_7:081572|feel at ease implement_v_7:081573|decide implement_v_7:081575|realize combine_v_9:064548|marry combine_v_9:064549|merge traditional Chinese medicine_n_10:157339|knowledge_drug work_vn_11:044068|affair reality_n_13:109077|entity reality_n_13:109078|thing want_v_16:140522|expect want_v_16:140530|require want_v_16:140532|seek want_v_16:140534|spend increase_v_17:059967|alter shape increase_v_17:059968|optimize increase_v_17:059969|enlarge intensity_n_18:076991|strength active_a_20:057562|active active_a_20:057564|positive steady_a_22:126267|serve as steady_a_22:126269|firm advance_v_24:122203|advance advance_v_24:122206|mobilize advance_v_24:122211|push Chinese medicine_n_25:157332|knowledge Chinese medicine_n_25:157329|people institution_n_27:057323|mechanism institution_n_27:057325|part institution_n_27:057326|component reform_vn_28:041189|improve".
Word-vector-based similarity calculation is performed between any two of the obtained Chinese words (each corresponding to a specific HowNet sense concept), obtaining "Chinese medicine_n_25: around_v_0:124932|surround and guidance_vn_2:155807|order is -0.0145; and opinion_n_3:143264|Chinese language is -0.0264; and opinion_n_3:143267|thought is -0.0366; and carry out_v_6:047082|implement is 0.2071; and implement_v_7:081572|feel at ease is -0.0430; and implement_v_7:081573|decide is 0.1502; and implement_v_7:081575|realize is 0.2254; and combine_v_9:064548|marry is -0.0183; and combine_v_9:064549|merge is 0.0745; and traditional Chinese medicine_n_10:157339|knowledge_drug is 0.0866; and work_vn_11:044068|affair is 0.1434; and reality_n_13:109077|entity is 0.1503; and reality_n_13:109078|thing is -0.0571; and want_v_16:140522|expect is 0.1009; and want_v_16:140530|require is 0.2090; and want_v_16:140532|seek is 0.0496; and want_v_16:140534|spend is 0.0176; and increase_v_17:059967|alter shape is 0.0000; and increase_v_17:059968|optimize is 0.2410; and increase_v_17:059969|enlarge is 0.1911; and intensity_n_18:076991|strength is 0.0592; and active_a_20:057562|active is 0.3089; and active_a_20:057564|positive is 0.0554; and steady_a_22:126267|serve as is 0.0245; and steady_a_22:126269|firm is 0.0490; and advance_v_24:122203|advance is 0.1917; and advance_v_24:122206|mobilize is 0.0277; and advance_v_24:122211|push is 0.1740; and Chinese medicine_n_25:157332|knowledge is 0.2205; and Chinese medicine_n_25:157329|people is -0.0686; and institution_n_27:057323|mechanism is 0.0945; and institution_n_27:057325|part is 0.0582; and institution_n_27:057326|component is 0.0582". Since space is limited, only part of the similarity results are shown here.
HowNet-based similarity calculation: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and calculate the similarity between senses using the concept similarity toolkit provided by HowNet.
Example: annotate the context knowledge with HowNet sense information, specifically the sense numbers, obtaining "Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 carry out_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 intensity_n_18:076991 active_a_20:057562 active_a_20:057564 steady_a_22:126267 steady_a_22:126269 advance_v_24:122203 advance_v_24:122206 advance_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 institution_n_27:057323 institution_n_27:057325 institution_n_27:057326 reform_vn_28:041189".
The similarity between senses is calculated using the concept similarity toolkit provided by HowNet, obtaining "Chinese medicine_n_25: around_v_0:124932 and guidance_vn_2:155807 is 0.015094; and opinion_n_3:143264 is 0.000624; and opinion_n_3:143267 is 0.010256; and carry out_v_6:047082 is 0.013793; and implement_v_7:081572 is 0.010256; and implement_v_7:081573 is 0.013793; and implement_v_7:081575 is 0.013793; and combine_v_9:064548 is 0.016667; and combine_v_9:064549 is 0.018605; and traditional Chinese medicine_n_10:157339 is 0.000624; and work_vn_11:044065 is 0.000624; and work_vn_11:044067 is 0.000624; and work_vn_11:044068 is 0.015094; and reality_n_13:109077 is 0.000624; and reality_n_13:109078 is 0.000624; and want_v_16:140522 is 0.010959; and want_v_16:140530 is 0.015094; and want_v_16:140532 is 0.018605; and want_v_16:140534 is 0.015094; and increase_v_17:059967 is 0.013793; and increase_v_17:059968 is 0.015094; and increase_v_17:059969 is 0.013793; and intensity_n_18:076991 is 0.000624; and active_a_20:057562 is 0.000624; and active_a_20:057564 is 0.000624; and steady_a_22:126267 is 0.000624; and steady_a_22:126269 is 0.000624".
S3, constructing the disambiguation graph: perform weight optimization on the similarities using the simulated annealing algorithm to obtain the fused similarity; then, taking word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights, construct the disambiguation graph. As shown in Fig. 3, the specific steps of constructing the disambiguation graph are as follows:
S301, weight optimization: the weight optimization algorithm based on simulated annealing automatically optimizes the three similarity values from step S2 to obtain the optimal weight parameters. The formula by which the simulated annealing algorithm performs parameter optimization is:
P = 1, if result(x_new) ≥ result(x_old);
P = exp((result(x_new) − result(x_old)) / (δ·t)), if result(x_new) < result(x_old)
where result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly sampled parameters; and x_old denotes the original parameters.
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective value of the new parameters x_new is not less than the objective value of the original parameters x_old, the new parameters x_new are selected with probability P = 1;
(b) if the objective value of the new parameters x_new is less than the objective value of the original parameters x_old, then P = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with P:
1. if the randomly generated probability value is not greater than P, the new parameters x_new are selected;
2. if the randomly generated probability value is greater than P, the new parameters x_new are discarded.
The main part of the weight optimization algorithm based on simulated annealing is described row by row below (see the sketch after the description):
In the weight optimization algorithm based on simulated annealing, row 1 is the initialization: the initial temperature t is set to 100, the temperature lower bound t_min to 0.001, the cooling rate delta to 0.98, and the maximum number of iteration steps k to 100. Rows 2-3 control the temperature and the iteration steps. Rows 4-5 randomly select a double-precision value in [0, 1−y] as x, and assign 1−x−y to z. In row 6, the function getEvalResult(x, y, z) is the objective function; it returns the disambiguation accuracy obtained with the given weight parameters x, y, z. Row 7 selects a new value x_new in the neighborhood of x. Rows 8-18 determine whether x_new is retained to replace x, following the parameter optimization formula of the simulated annealing algorithm. Row 20 decreases t by the cooling rate delta. Row 22 returns the optimal parameter combination x, y, z.
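The original listing is likewise not reproduced in this text. The following is a reconstructed Python sketch of rows 1-22 under the settings just described; get_eval_result stands for the objective function getEvalResult (running the disambiguator and measuring accuracy), and the neighborhood step size of 0.05 is an assumed value:

import math
import random

def anneal_weight(get_eval_result, y):
    # optimize x (and z = 1 - x - y) for a fixed y by simulated annealing
    t, t_min, delta, k = 100.0, 0.001, 0.98, 100      # row 1: initialization
    x = random.uniform(0.0, 1.0 - y)                  # rows 4-5: initial x and z
    z = 1.0 - x - y
    while t > t_min:                                  # rows 2-3: temperature and
        for _ in range(k):                            # iteration-step control
            x_new = min(max(x + random.uniform(-0.05, 0.05), 0.0), 1.0 - y)  # row 7
            old = get_eval_result(x, y, z)            # row 6: objective function
            new = get_eval_result(x_new, y, 1.0 - x_new - y)
            # rows 8-18: keep x_new with probability 1 if it is no worse,
            # otherwise with probability exp((new - old) / (delta * t))
            if new >= old or random.random() <= math.exp((new - old) / (delta * t)):
                x, z = x_new, 1.0 - x_new - y
        t *= delta                                    # row 20: cooling
    return x, y, z                                    # row 22: optimal combination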
Here x, y, z denote the weight variables of the three similarity results. When the algorithm is executed for the first time, y is set to 1/3; after this run, the weight optimization parameters for x and y are obtained. Then min(x, y) is fixed, and the algorithm is executed a second time; after that run, the other two weight parameters are determined.
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec
where ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0.
Example: after weight optimization, the three similarity values are fused according to the final fusion formula between senses, giving "Chinese medicine_n_25: around_v_0:124932|revolve round|surround and guidance_vn_2:155807|direct|order is 0.015094|0.2929|-0.0145; and opinion_n_3:143264|complaint|Chinese language is 0.000624|0.3085|-0.0264; and opinion_n_3:143267|idea|thought is 0.010256|0.3742|-0.0366; and carry out_v_6:047082|carry out|implement is 0.013793|0.4015|0.2071; and implement_v_7:081572|feel at ease|feel at ease is 0.010256|0.3575|-0.0430; and implement_v_7:081573|ascertain|decide is 0.013793|0.3215|0.1502; and implement_v_7:081575|fulfil|realize is 0.013793|0.3541|0.2254; and combine_v_9:064548|be united in wedlock|marry is 0.016667|0.3299|-0.0183; and combine_v_9:064549|combination|merge is 0.018605|0.3487|0.0745; and traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs|knowledge_drug is 0.000624|0.3520|0.0866; and work_vn_11:044068|work|affair is 0.015094|0.3478|0.1434; and reality_n_13:109077|reality|entity is 0.000624|0.3664|0.1503; and reality_n_13:109078|practice|thing is 0.000624|0.3907|-0.0571; and want_v_16:140522|want to|expect is 0.010959|0.3375|0.1009; and want_v_16:140530|ask|require is 0.015094|0.3482|0.2090; and want_v_16:140532|ask for|seek is 0.018605|0.3648|0.0496". Here, in order to show the process, the values are not fused further; for example, "0.018605|0.3648|0.0496" denotes the three similarity values, whose fusion is α·0.018605 + β·0.3648 + γ·0.0496.
S303, constructing the disambiguation graph: in the disambiguation graph, the senses are the vertices, the semantic relations between senses are the edges, and the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses. Here a sense refers to a triple, expressed as Word(No., Sword, Enword), where No. denotes the concept number, Sword denotes the first sememe word, and Enword denotes the English word; No., Sword, and Enword form an organic whole describing the same sense concept: a sense concept number uniquely identifies one sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can then be mapped to an English word.
Senses in this triple form enable the above three similarity calculation methods to be integrated into a whole. Taking "中医" (Chinese medicine) as an example, "中医" has two senses, corresponding to two sense triples: "Chinese medicine (157329, person, practitioner of Chinese medicine)" and "Chinese medicine (157332, knowledge, traditional Chinese science)". The edge weight between any two vertices in the disambiguation graph, i.e., the semantic similarity between the senses, can then be obtained from the final fused similarity calculation between senses.
S4, selecting the correct sense: score the candidate senses in the graph by graph scoring, obtain the score list of the candidate senses, and select the highest-scoring sense as the correct one. As shown in Fig. 4, the specific steps of selecting the correct sense are as follows:
S401, graph scoring: call a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts by score in descending order to form the candidate sense concept list. Graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph based on the Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 − α)/N + α · Σ_{u∈in(v)} PR(u)/|out(u)|
where 1 − α denotes the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α denotes the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
Example: after graph scoring, the candidate sense concept list is obtained:
Chinese medicine_n_25:157332 2.1213090873827947E58;
Chinese medicine_n_25:157329 1.8434688340823378E58.
S402, selecting the correct sense: select the correct sense from the disambiguation result, covering the following two cases:
1. if there is only one sense concept in the disambiguation result, take that sense concept as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, take the highest-scoring sense concept as the correct sense.
Example: the sense concept with the highest score, namely "Chinese medicine_n_25:157332", is selected as the correct sense.
Embodiment 2:
As shown in Fig. 5, the word sense disambiguation system based on a graph model of the present invention includes:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
a similarity calculation unit, which performs the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively. The similarity calculation unit comprises:
an English similarity calculation subunit, which annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, and then applies the word similarity algorithm based on word vectors and a knowledge base to compute similarities among the resulting English words; since HowNet is bilingual, the sense mapping here directly takes the English word information recorded in HowNet.
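For illustration, a condensed sketch of this word-vector-plus-knowledge-base similarity (detailed as steps S20101 to S20107 in claim 3 below); the embedding table, synset dictionary, synset graph and parameter values are placeholders we assume, not fixtures of the patent:

```python
import numpy as np
import networkx as nx

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def phrase_vec(phrase, emb):
    """Vector of a word or phrase: the averaged sum of its word vectors."""
    return np.mean([emb[w] for w in phrase.split()], axis=0)

def knowledge_sim(w1, w2, g, synsets, d=3, alpha=0.5, delta=1.0):
    """Synset overlap plus a shortest-path term over the synset graph g.
    synsets: dict mapping a word to the set of its related synset ids."""
    sim_lap = d * len(synsets[w1] & synsets[w2]) / (len(synsets[w1]) + len(synsets[w2]))
    path = nx.dijkstra_path_length(g, w1, w2)  # assumes w1 != w2
    return alpha * (1.0 / (delta * path)) + (1 - alpha) * sim_lap

def final_sim(w1, w2, emb, g, synsets, beta=0.5):
    """Linear combination of the vector-based and knowledge-base similarities."""
    sim_vec = cos(phrase_vec(w1, emb), phrase_vec(w2, emb))
    return beta * sim_vec + (1 - beta) * knowledge_sim(w1, w2, g, synsets)
```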
a word-vector similarity calculation subunit, which trains word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieves the vectors of the two given words from that file, and takes the cosine similarity between the vectors as their similarity. It should be noted that the more senses an ambiguous word has, the more likely the trained word vector is to be biased towards the word's more common senses; for this reason, HowNet is used to convert the ambiguous word into the first sememe of each of its concept definitions, i.e. one sememe per sense. As shown in Fig. 6, the ambiguous word "Chinese medicine" is converted to "people" and "knowledge".
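A minimal training-and-lookup sketch with the gensim implementation of word2vec; the corpus path and hyper-parameters are assumptions for the example:

```python
from gensim.models import Word2Vec

# One whitespace-tokenized sentence per line in a plain-text corpus (assumed path).
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)

# Cosine similarity between the first sememes standing in for the two senses.
sim = model.wv.similarity("people", "knowledge")
```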
a HowNet similarity calculation subunit, which annotates the context knowledge with sense information by means of HowNet, in the form of word plus concept number, and computes the similarity between senses with the concept similarity toolkit provided with HowNet.
a disambiguation graph construction unit, which optimizes the similarity weights by simulated annealing to obtain the fused similarity, and then builds the disambiguation graph with word-sense concepts as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights. The disambiguation graph construction unit comprises:
a weight optimization subunit, which applies the simulated-annealing-based weight optimization algorithm to automatically optimize the three similarity values, i.e. those of the English-based, word-vector-based and HowNet-based similarity calculations, and obtain the optimal weight parameters; the acceptance probability used by the simulated annealing for parameter optimization is

p = 1, if result(x_new) ≥ result(x_old)
p = exp((result(x_new) − result(x_old))/(δ · t)), otherwise

where result(x) denotes the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the previous parameter.
The acceptance formula of the simulated annealing covers the following two cases:
(a) if the objective value of the new parameter x_new is not smaller than that of the old parameter x_old, x_new is accepted with probability p = 1;
(b) if the objective value of x_new is smaller than that of x_old, the probability p = exp((result(x_new) − result(x_old))/(δ · t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value is not greater than p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded.
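A compact sketch of this acceptance rule used to tune the weights (α, β, γ) on a development set; the perturbation scheme, schedule and function names are our assumptions:

```python
import math
import random

def anneal(accuracy, t=1.0, t_min=1e-3, delta=0.95, step=0.05):
    """Maximize accuracy(weights) over the simplex alpha + beta + gamma = 1."""
    x = [1 / 3, 1 / 3, 1 / 3]
    while t > t_min:
        # Perturb every coordinate, clip at zero, renormalize onto the simplex.
        new = [max(0.0, w + random.uniform(-step, step)) for w in x]
        s = sum(new)
        new = [w / s for w in new]
        gain = accuracy(new) - accuracy(x)
        # Accept with p = 1 if no worse, else p = exp(gain / (delta * t)).
        if gain >= 0 or random.random() <= math.exp(gain / (delta * t)):
            x = new
        t *= delta  # cooling
    return x
```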
a similarity fusion subunit: after weight optimization, the finally fused similarity between two senses is

Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote two word senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the result of the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0.
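With the optimized weights in hand, the fused edge weight reduces to a single expression; the function below simply restates the formula:

```python
def fused_sim(sim_how, sim_en, sim_vec, weights):
    """Edge weight Sim(ws, ws') = alpha*sim_how + beta*sim_en + gamma*sim_vec."""
    alpha, beta, gamma = weights
    return alpha * sim_how + beta * sim_en + gamma * sim_vec
```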
a disambiguation graph building subunit, which builds the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, integrating the three similarity values into the edge weights by means of the simulated-annealing-based weight optimization algorithm; here a word sense is a triple Word(No., Sword, Enword), where No. denotes the concept number, Sword the first sememe and Enword the English word; the three elements form an organic whole describing one and the same sense concept: a concept number uniquely identifies a sense in HowNet, from it the first sememe of the concept definition can be obtained, and the sense can in turn be mapped to an English word.
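A sketch of assembling the disambiguation graph, here with networkx as an assumed graph library, one vertex per candidate sense and the fused similarity as edge weight:

```python
import networkx as nx

def build_disambiguation_graph(senses, fused):
    """senses: list of sense ids; fused(a, b): fused similarity of two senses.
    Links every pair here for brevity; in practice the senses of one
    ambiguous word would not be linked to each other."""
    g = nx.Graph()
    g.add_nodes_from(senses)
    for i, a in enumerate(senses):
        for b in senses[i + 1:]:
            g.add_edge(a, b, weight=fused(a, b))
    return g
```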
a correct-sense selection unit, which scores the candidate senses in the graph by graph scoring, obtains the score list of the candidate senses, and selects the highest-scoring one as the correct sense. The correct-sense selection unit comprises:
a graph scoring subunit, which invokes the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph and, after scoring, sorts the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a single node is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct-sense selection subunit, which selects the correct sense from the disambiguation result in one of two cases:
(1) if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
(2) if the disambiguation result is a sense list composed of several sense concepts, the sense concept with the highest score is taken as the correct sense.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A word sense disambiguation method based on a graph model, characterized by comprising the following steps:
S1, extracting context knowledge: performing part-of-speech tagging on the ambiguous sentence and extracting content words, namely nouns, verbs, adjectives and adverbs, as context knowledge;
S2, similarity calculation: performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
S3, constructing the disambiguation graph: optimizing the similarity weights by simulated annealing to obtain the fused similarity, and building the disambiguation graph with word-sense concepts as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights;
S4, selecting the correct word sense: scoring the candidate senses in the graph by graph scoring to obtain a score list of the candidate senses, and selecting the highest-scoring sense as the correct one.
2. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the similarity calculation of step S2 proceeds as follows:
S201, English-based similarity calculation: annotating the context knowledge with HowNet sense information and performing sense mapping to obtain a set of English words, and then applying the word similarity algorithm based on word vectors and a knowledge base to compute similarities among the resulting English words;
S202, word-vector-based similarity calculation: training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving the vectors of the two given words from that file, and taking the cosine similarity between the vectors as their similarity;
S203, HowNet-based similarity calculation: annotating the context knowledge with sense information by means of HowNet, in the form of word plus concept number, and computing the similarity between senses with the concept similarity toolkit provided with HowNet.
3. The word sense disambiguation method based on a graph model according to claim 2, characterized in that the word similarity algorithm based on word vectors and a knowledge base in step S201 proceeds as follows:
S20101, judging whether the given items are words or phrases:
(1) if two English words are given, the similarity between them is obtained as the cosine similarity of the two word vectors;
(2) if a given item is a phrase, the word vectors of the words in the phrase are added up to obtain the vector representation of the phrase, and the phrase similarity is obtained as

sim(p1, p2) = cos((1/|p1|) · Σ_{i=1..|p1|} w_i, (1/|p2|) · Σ_{j=1..|p2|} w_j)

where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2, and w_i and w_j denote the vectors of the i-th word of p1 and the j-th word of p2 respectively;
S20102, iteratively searching for the synsets related to the two English words, until the number of iteration steps exceeds γ;
S20103, constructing a synset graph from the two English words and their related synsets;
S20104, within the set distance range, computing in the graph the overlap of the synsets related to the two words:

sim_lap(w_i, w_j) = d · count(w_i, w_j)/(count(w_i) + count(w_j))

where count(w_i, w_j) denotes the number of synsets shared by the words w_i and w_j; count(w_i) and count(w_j) are the numbers of synsets of w_i and w_j respectively; and d is the value of the set distance range;
S20105, computing the shortest path between w_i and w_j in the graph with Dijkstra's algorithm, and obtaining the similarity of w_i and w_j as

sim_bn(w_i, w_j) = α · 1/(δ · path) + (1 − α) · sim_lap(w_i, w_j)

where path is the shortest path between w_i and w_j; δ is a value used to scale the similarity; sim_lap(w_i, w_j) is the overlap between w_i and w_j; and the parameter α is a regulatory factor that balances the two parts of the formula;
S20106, linearly combining the similarity sim_vec obtained by the word-vector-based approach of step S20101 with the similarity sim_bn obtained by the knowledge-base approach of step S20105 to obtain the final similarity:

sim_final(w_i, w_j) = β · sim_vec + (1 − β) · sim_bn

where sim_bn and sim_vec denote the similarities obtained by the knowledge-base approach and by the word-vector approach respectively, and the parameter β is a regulatory factor that balances the two similarity results;
S20107, returning the similarity sim_final.
4. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the disambiguation graph of step S3 is constructed as follows:
S301, weight optimization: applying the simulated-annealing-based weight optimization algorithm to automatically optimize the three similarity values of step S2 and obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the finally fused similarity between two senses is

Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote two word senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the result of the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: the disambiguation graph takes word senses as vertices and the semantic relations between senses as edges, and integrates the three similarity values into the edge weights by means of the simulated-annealing-based weight optimization algorithm.
5. The word sense disambiguation method based on a graph model according to claim 4, characterized in that the acceptance probability used by the simulated annealing for parameter optimization in step S301 is

p = 1, if result(x_new) ≥ result(x_old)
p = exp((result(x_new) − result(x_old))/(δ · t)), otherwise

where result(x) denotes the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the previous parameter;
the acceptance formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not smaller than that of the old parameter x_old, x_new is accepted with probability p = 1;
(b) if the objective value of x_new is smaller than that of x_old, the probability p = exp((result(x_new) − result(x_old))/(δ · t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value is not greater than p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded;
and in that the word sense in step S303 is a triple Word(No., Sword, Enword), where No. denotes the concept number, Sword the first sememe and Enword the English word; the three elements form an organic whole describing one and the same sense concept: a concept number uniquely identifies a sense in HowNet, from it the first sememe of the concept definition can be obtained, and the sense can in turn be mapped to an English word.
6. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the correct sense is selected in step S4 as follows:
S401, graph scoring: invoking the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph and, after scoring, sorting the candidate sense concepts by score in descending order to form the candidate sense concept list;
S402, selecting the correct sense: selecting the correct sense from the disambiguation result, in one of two cases:
(1) if the disambiguation result contains only one sense concept, taking that sense concept as the correct sense;
(2) if the disambiguation result is a sense list composed of several sense concepts, taking the sense concept with the highest score as the correct sense.
7. The word sense disambiguation method based on a graph model according to claim 6, characterized in that the graph scoring of step S401 uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a single node is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v.
8. A word sense disambiguation system based on a graph model, characterized in that the system comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words, namely nouns, verbs, adjectives and adverbs, as context knowledge;
a similarity calculation unit, which performs the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
a disambiguation graph construction unit, which optimizes the similarity weights by simulated annealing to obtain the fused similarity and builds the disambiguation graph with word-sense concepts as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights;
a correct-sense selection unit, which scores the candidate senses in the graph by graph scoring, obtains the score list of the candidate senses, and selects the highest-scoring one as the correct sense.
9. The word sense disambiguation system based on a graph model according to claim 8, characterized in that the similarity calculation unit comprises:
an English similarity calculation subunit, which annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, and then applies the word similarity algorithm based on word vectors and a knowledge base to compute similarities among the resulting English words;
a word-vector similarity calculation subunit, which trains word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieves the vectors of the two given words from that file, and takes the cosine similarity between the vectors as their similarity;
a HowNet similarity calculation subunit, which annotates the context knowledge with sense information by means of HowNet, in the form of word plus concept number, and computes the similarity between senses with the concept similarity toolkit provided with HowNet;
and in that the disambiguation graph construction unit comprises:
a weight optimization subunit, which applies the simulated-annealing-based weight optimization algorithm to automatically optimize the three similarity values of the English-based, word-vector-based and HowNet-based similarity calculations and obtain the optimal weight parameters, the acceptance probability used for parameter optimization being

p = 1, if result(x_new) ≥ result(x_old)
p = exp((result(x_new) − result(x_old))/(δ · t)), otherwise

where result(x) denotes the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the previous parameter;
the acceptance formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not smaller than that of the old parameter x_old, x_new is accepted with probability p = 1;
(b) if the objective value of x_new is smaller than that of x_old, the probability p = exp((result(x_new) − result(x_old))/(δ · t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value is not greater than p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded;
a similarity fusion subunit: after weight optimization, the finally fused similarity between two senses is

Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote two word senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the result of the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building subunit, which builds the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, integrating the three similarity values into the edge weights by means of the simulated-annealing-based weight optimization algorithm; here a word sense is a triple Word(No., Sword, Enword), where No. denotes the concept number, Sword the first sememe and Enword the English word; the three elements form an organic whole describing one and the same sense concept: a concept number uniquely identifies a sense in HowNet, from it the first sememe of the concept definition can be obtained, and the sense can in turn be mapped to an English word.
10. The word sense disambiguation system based on a graph model according to claim 8 or 9, characterized in that the correct-sense selection unit comprises:
a graph scoring subunit, which invokes the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph and, after scoring, sorts the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a single node is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct-sense selection subunit, which selects the correct sense from the disambiguation result in one of two cases:
(1) if the disambiguation result contains only one sense concept, taking that sense concept as the correct sense;
(2) if the disambiguation result is a sense list composed of several sense concepts, taking the sense concept with the highest score as the correct sense.
CN201811503355.7A 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model Active CN109359303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811503355.7A CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model


Publications (2)

Publication Number Publication Date
CN109359303A true CN109359303A (en) 2019-02-19
CN109359303B CN109359303B (en) 2023-04-07

Family

ID=65332018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811503355.7A Active CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model

Country Status (1)

Country Link
CN (1) CN109359303B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002017128A1 (en) * 2000-08-24 2002-02-28 Science Applications International Corporation Word sense disambiguation
WO2014087506A1 (en) * 2012-12-05 2014-06-12 三菱電機株式会社 Word meaning estimation device, word meaning estimation method, and word meaning estimation program
WO2016050066A1 (en) * 2014-09-29 2016-04-07 华为技术有限公司 Method and device for parsing interrogative sentence in knowledge base
CN105760363A (en) * 2016-02-17 2016-07-13 腾讯科技(深圳)有限公司 Text file word sense disambiguation method and device
CN105893346A (en) * 2016-03-30 2016-08-24 齐鲁工业大学 Graph model word sense disambiguation method based on dependency syntax tree
WO2017217661A1 (en) * 2016-06-15 2017-12-21 울산대학교 산학협력단 Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
CN106951684A (en) * 2017-02-28 2017-07-14 北京大学 A kind of method of entity disambiguation in medical conditions idagnostic logout
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108959461A (en) * 2018-06-15 2018-12-07 东南大学 A kind of entity link method based on graph model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鹿文鹏 (Lu Wenpeng): "Research on Word Sense Disambiguation Based on Dependency and Domain Knowledge", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413989A (en) * 2019-06-19 2019-11-05 北京邮电大学 A kind of text field based on domain semantics relational graph determines method and system
CN110413989B (en) * 2019-06-19 2020-11-20 北京邮电大学 Text field determination method and system based on field semantic relation graph
CN110362691A (en) * 2019-07-19 2019-10-22 大连语智星科技有限公司 A kind of tree bank building system
CN110598209A (en) * 2019-08-21 2019-12-20 合肥工业大学 Method, system and storage medium for extracting keywords
CN110705295A (en) * 2019-09-11 2020-01-17 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110766072A (en) * 2019-10-22 2020-02-07 探智立方(北京)科技有限公司 Automatic generation method of computational graph evolution AI model based on structural similarity
CN111310475B (en) * 2020-02-04 2023-03-10 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111310475A (en) * 2020-02-04 2020-06-19 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111783418A (en) * 2020-06-09 2020-10-16 北京北大软件工程股份有限公司 Chinese meaning representation learning method and device
CN111783418B (en) * 2020-06-09 2024-04-05 北京北大软件工程股份有限公司 Chinese word meaning representation learning method and device
CN112256885A (en) * 2020-10-23 2021-01-22 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN112256885B (en) * 2020-10-23 2023-10-27 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN113158687B (en) * 2021-04-29 2021-12-28 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN115114397A (en) * 2022-05-09 2022-09-27 泰康保险集团股份有限公司 Annuity information updating method, device, electronic device, storage medium, and program
CN115114397B (en) * 2022-05-09 2024-05-31 泰康保险集团股份有限公司 Annuity information updating method, annuity information updating device, electronic device, storage medium, and program

Also Published As

Publication number Publication date
CN109359303B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109359303A (en) A kind of Word sense disambiguation method and system based on graph model
CN104915340B (en) Natural language question-answering method and device
KR101850124B1 (en) Evaluating query translations for cross-language query suggestion
Ramisch et al. mwetoolkit: A framework for multiword expression identification.
US9514098B1 (en) Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases
CN106815252A (en) A kind of searching method and equipment
KR20180125746A (en) System and Method for Sentence Embedding and Similar Question Retrieving
CN109614620A (en) A kind of graph model Word sense disambiguation method and system based on HowNet
CN110377745B (en) Information processing method, information retrieval device and server
Vlachos et al. A new corpus and imitation learning framework for context-dependent semantic parsing
Xie et al. Knowledge base question answering based on deep learning models
US20150161109A1 (en) Reordering words for machine translation
Kanojia et al. Challenge dataset of cognates and false friend pairs from indian languages
Park et al. Frame-Semantic Web: a Case Study for Korean.
Nishihara et al. Word complexity estimation for Japanese lexical simplification
CN108255818B (en) Combined machine translation method using segmentation technology
Harshawardhan et al. Phrase based English–Tamil Translation System by Concept Labeling using Translation Memory
He et al. Language post positioned characteristic based Chinese-Vietnamese statistical machine translation method
Huang et al. A simple, straightforward and effective model for joint bilingual terms detection and word alignment in SMT
CN108280066B (en) Off-line translation method from Chinese to English
Jusoh et al. Automated translation machines: Challenges and a proposed solution
Marinova Evaluation of stacked embeddings for Bulgarian on the downstream tasks POS and NERC
Passban et al. Improving phrase-based SMT using cross-granularity embedding similarity
Mascarell et al. Detecting document-level context triggers to resolve translation ambiguity
TWI492072B (en) Input system and input method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant