CN109359303A - Word sense disambiguation method and system based on a graph model - Google Patents

Word sense disambiguation method and system based on a graph model

Info

Publication number
CN109359303A
CN109359303A
Authority
CN
China
Prior art keywords
word
meaning
similarity
concept
sim
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811503355.7A
Other languages
Chinese (zh)
Other versions
CN109359303B (en)
Inventor
孟凡擎
燕孝飞
张强
陈文平
鹿文鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zaozhuang University
Original Assignee
Zaozhuang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zaozhuang University
Priority to CN201811503355.7A
Publication of CN109359303A
Application granted
Publication of CN109359303B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word sense disambiguation method and system based on a graph model, belonging to the field of natural language processing. The technical problem to be solved by the invention is how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge contained in those resources, and improve word sense disambiguation performance. The technical solution adopted is as follows: 1. A word sense disambiguation method based on a graph model, comprising the following steps: S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs; S2, similarity calculation: separately perform English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation; S3, constructing the disambiguation graph; S4, selecting the correct word sense. 2. A word sense disambiguation system based on a graph model, comprising a context knowledge extraction unit, a similarity calculation unit, a disambiguation graph construction unit, and a correct sense selection unit.

Description

Word sense disambiguation method and system based on a graph model
Technical field
The present invention relates to the field of natural language processing, and specifically to a word sense disambiguation method and system based on a graph model.
Background technique
Word sense disambiguation refers to determining the specific sense of an ambiguous word according to the specific context in which it occurs. It is a fundamental research problem in natural language processing and directly affects upper-layer applications such as machine translation, information extraction, information retrieval, text classification, and sentiment analysis. Polysemy is ubiquitous, whether in Chinese or in English and other Western languages.
Traditional graph-model-based methods for the Chinese word sense disambiguation task mainly use one or more Chinese knowledge resources; troubled by the problem of insufficient knowledge resources, their disambiguation performance is low. Therefore, how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance is a technical problem that currently urgently needs to be solved.
The patent document CN105893346A discloses a graph-model word sense disambiguation method based on dependency syntax trees. Its steps are: 1. preprocess the sentence and extract the content words to be disambiguated, mainly including normalization, tokenization, and lemmatization; 2. perform dependency syntactic analysis on the sentence and construct its dependency syntax tree; 3. obtain the distances between words on the dependency syntax tree, i.e., the lengths of the shortest paths; 4. construct a disambiguation knowledge graph for the sense concepts of the words in the sentence according to a knowledge base; 5. calculate the graph score of each sense node according to the lengths of the semantic association paths between sense nodes in the disambiguation knowledge graph, the weights of the associated edges, and the distances of the path endpoints on the dependency syntax tree; 6. for each ambiguous word, select the sense with the highest graph score as the correct sense. However, that technical solution uses the semantic associations contained in BabelNet rather than the semantic knowledge in HowNet; it is suitable for English word sense disambiguation but not for Chinese, and it cannot solve the problem of how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance.
Summary of the invention
The technical task of the present invention is to provide a word sense disambiguation method and system based on a graph model, so as to solve the problem of how to combine multiple Chinese and English resources so that their advantages complement each other, fully mine the disambiguation knowledge in those resources, and improve word sense disambiguation performance.
The technical task of the present invention is achieved in the following manner. A word sense disambiguation method based on a graph model includes the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
S2, similarity calculation: separately perform English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation;
S3, constructing the disambiguation graph: perform weight optimization on the similarities using the simulated annealing algorithm to obtain the fused similarity; then, taking word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights, construct the disambiguation graph;
S4, selecting the correct sense: score the candidate senses in the graph by graph scoring, obtain the score list of the candidate senses, and select the highest-scoring sense as the correct one.
Preferably, the specific steps of the similarity calculation in step S2 are as follows:
S201, English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then use the word similarity algorithm based on word vectors and a knowledge base to calculate similarities between the obtained English words. In addition, since HowNet is bilingual, the sense mapping here directly obtains the English word information in HowNet;
S202, word-vector-based similarity calculation: the Sogou full-network news corpus totals 1.43 GB; train word vectors on this corpus with Google's word2vec toolkit to obtain a word vector file, obtain the word vectors of two given words from the word vector file, and calculate the cosine similarity between the word vectors as their similarity (a minimal sketch follows this list);
S203, HowNet-based similarity calculation: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and calculate the similarity between senses using the concept similarity toolkit provided by HowNet.
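As an illustration of step S202, the following is a minimal Python sketch of the cosine similarity between two word vectors; the file name and the loading helper are illustrative assumptions, not taken from the patent:

import numpy as np

def load_vectors(path):
    # word2vec text format: a header line, then one word per line
    # followed by its vector components
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header (vocabulary size, dimension)
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# usage: similarity of two context words under the trained vectors
# vectors = load_vectors("sogou_news_vectors.txt")  # hypothetical file name
# print(cosine_similarity(vectors["医疗"], vectors["机构"]))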
More preferably, the word similarity algorithm based on word vectors and a knowledge base in step S201 is as follows:
S20101, determine whether the given items are words or phrases:
1. if two English words are given, the similarity between the two words is obtained by calculating the cosine similarity of their two word vectors;
2. if a given item is a phrase, the word vectors of the words in the phrase need to be added to obtain the vector representation of the phrase, from which the similarity of phrases is obtained as follows:
sim(p1, p2) = cos((1/|p1|) · Σ_{i=1..|p1|} v(wi), (1/|p2|) · Σ_{j=1..|p2|} v(wj))
where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; wi and wj denote the i-th word of p1 and the j-th word of p2, respectively, and v(·) denotes a word's vector;
S20102, synset relevant to two English words is iteratively searched for, until iterative steps are more than γ;
S20103, construct a synset graph based on the two English words and the synsets related to them;
S20104, within the set distance range, calculate in the graph the overlap of the synsets related to the two English words, with the following formula:
sim_lap(wi, wj) = d · count(wi, wj) / (count(wi) + count(wj))
where count(wi, wj) denotes the number of synsets shared by words wi and wj; count(wi) and count(wj) are the numbers of synsets of wi and wj, respectively; and d denotes the value of the set distance range;
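For instance (illustrative numbers, not from the patent), with d = 2, if wi and wj share 3 synsets within the range while having 10 and 14 synsets respectively, then sim_lap(wi, wj) = 2 × 3 / (10 + 14) = 0.25.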
S20105, calculate the shortest path between wi and wj in the graph using Dijkstra's algorithm, and obtain the similarity of wi and wj with the following formula:
sim_bn(wi, wj) = α · (1/δ^path) + (1 − α) · sim_lap(wi, wj)
where path is the shortest path between wi and wj; δ is a value used to adjust the similarity; sim_lap(wi, wj) denotes the overlap between wi and wj; and the parameter α is a regulatory factor that balances the two parts of the formula;
S20106, linearly combine the similarity sim_vec obtained by the word-vector-based method in step S20101 with the similarity sim_bn obtained by the knowledge-base-based method in step S20105 to obtain the final similarity, with the following formula:
sim_final(wi, wj) = β · sim_vec + (1 − β) · sim_bn
where sim_bn and sim_vec denote the similarity obtained by the knowledge-base-based method and the similarity obtained by the word-vector-based method, respectively; and the parameter β is a regulatory factor that balances the results of the two methods;
S20107, return the similarity sim_final.
Preferably, the specific steps of constructing the disambiguation graph in step S3 are as follows:
S301, weight optimization: the weight optimization algorithm based on simulated annealing automatically optimizes the three similarity values from step S2 to obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec
where ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: in the disambiguation graph, the senses are the vertices, the semantic relations between senses are the edges, and the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses.
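A minimal sketch of step S303, building the disambiguation graph with the fused similarities as edge weights; it assumes the three per-pair similarity scores have already been computed, and the function and variable names are illustrative, not the patent's own code:

import itertools
import networkx as nx

def build_disambiguation_graph(senses, sim_how, sim_en, sim_vec, alpha, beta, gamma):
    # senses: sense concepts, one vertex each
    # sim_how / sim_en / sim_vec: dicts mapping a sense pair to a score
    # alpha / beta / gamma: weights from simulated annealing, summing to 1
    graph = nx.Graph()
    graph.add_nodes_from(senses)
    for ws, ws2 in itertools.combinations(senses, 2):
        fused = (alpha * sim_how.get((ws, ws2), 0.0)
                 + beta * sim_en.get((ws, ws2), 0.0)
                 + gamma * sim_vec.get((ws, ws2), 0.0))
        if fused > 0:
            graph.add_edge(ws, ws2, weight=fused)  # edge weight = fused similarity
    return graph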
More preferably, the formula by which the simulated annealing algorithm in step S301 performs parameter optimization is:
P = 1, if result(x_new) ≥ result(x_old);
P = exp((result(x_new) − result(x_old)) / (δ·t)), if result(x_new) < result(x_old)
where result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly sampled parameters; and x_old denotes the original parameters.
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective value of the new parameters x_new is not less than the objective value of the original parameters x_old, the new parameters x_new are selected with probability P = 1;
(b) if the objective value of the new parameters x_new is less than the objective value of the original parameters x_old, then P = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with P:
1. if the randomly generated probability value is not greater than P, the new parameters x_new are selected;
2. if the randomly generated probability value is greater than P, the new parameters x_new are discarded.
The sense in step S303 refers to a triple, expressed as Word(No., Sword, Enword), where No. denotes the concept number, Sword denotes the first sememe word, and Enword denotes the English word. No., Sword, and Enword form an organic whole describing the same sense concept: a sense concept number uniquely identifies one sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can then be mapped to an English word.
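As a data structure, the sense triple can be sketched as follows; the class and field names are illustrative, and the two example triples are the senses of "中医" (Chinese medicine) given in the embodiment below:

from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    no: str      # HowNet concept number, unique per sense
    sword: str   # first sememe word of the concept definition
    enword: str  # English word the sense maps to

senses_of_zhongyi = [
    Sense("157329", "人", "practitioner of Chinese medicine"),
    Sense("157332", "知识", "traditional Chinese science"),
]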
Preferably, the specific steps of selecting the correct sense in step S4 are as follows:
S401, graph scoring: call a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts by score in descending order to form the candidate sense concept list;
S402, selecting the correct sense: select the correct sense from the disambiguation result, covering the following two cases:
1. if there is only one sense concept in the disambiguation result, take that sense concept as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, take the highest-scoring sense concept as the correct sense.
More preferably, graph scoring in step S401 uses the PageRank algorithm. The PageRank algorithm evaluates the nodes in the graph based on the Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 − α)/N + α · Σ_{u∈in(v)} PR(u)/|out(u)|
where 1 − α denotes the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α denotes the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
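A minimal sketch of the PageRank scoring of step S401 over the weighted disambiguation graph; it relies on the pagerank function of the networkx library, whose damping factor alpha plays the role of α above:

import networkx as nx

def score_senses(graph, alpha=0.85):
    # edge weights (the fused similarities) bias the random walk;
    # returns the candidate sense concepts sorted by score, descending
    scores = nx.pagerank(graph, alpha=alpha, weight="weight")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# usage: the top-ranked sense concept is taken as the correct sense
# ranking = score_senses(graph)
# correct_sense = ranking[0][0]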
A word sense disambiguation system based on a graph model, the system comprising:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
a similarity calculation unit, for separately performing English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation;
a disambiguation graph construction unit, for performing weight optimization on the similarities using the simulated annealing algorithm to obtain the fused similarity, and then constructing the disambiguation graph with word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights;
a correct sense selection unit, for scoring the candidate senses in the graph by graph scoring, obtaining the score list of the candidate senses, and selecting the sense with the maximum score as the correct sense.
Preferably, the similarity calculation unit includes:
an English similarity calculation unit, for annotating the context knowledge with HowNet sense information and performing sense mapping to obtain a set of English words, and then calculating similarities between the obtained English words with the word similarity algorithm based on word vectors and a knowledge base; since HowNet is bilingual, the sense mapping here directly obtains the English word information in HowNet;
a word vector similarity calculation unit, for training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, obtaining the word vectors of two given words from the word vector file, and calculating the cosine similarity between the word vectors as their similarity; it should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased toward some of the more common senses of that ambiguous word;
a HowNet similarity calculation unit, for annotating the context knowledge with sense information using HowNet, in the form of word plus concept number, and calculating the similarity between senses using the concept similarity toolkit provided by HowNet;
the disambiguation graph construction unit includes:
a weight optimization unit, for automatically optimizing the three similarity values of the English-based similarity calculation, the word-vector-based similarity calculation, and the HowNet-based similarity calculation with the weight optimization algorithm based on simulated annealing, to obtain the optimal weight parameters; the formula by which the simulated annealing algorithm performs parameter optimization is:
P = 1, if result(x_new) ≥ result(x_old);
P = exp((result(x_new) − result(x_old)) / (δ·t)), if result(x_new) < result(x_old)
where result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly sampled parameters; and x_old denotes the original parameters;
the formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective value of the new parameters x_new is not less than the objective value of the original parameters x_old, the new parameters x_new are selected with probability P = 1;
(b) if the objective value of the new parameters x_new is less than the objective value of the original parameters x_old, then P = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with P:
1. if the randomly generated probability value is not greater than P, the new parameters x_new are selected;
2. if the randomly generated probability value is greater than P, the new parameters x_new are discarded;
a similarity fusion unit: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec
where ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a graph construction unit, for constructing the disambiguation graph with senses as vertices and semantic relations between senses as edges, using the weight optimization algorithm based on simulated annealing to integrate the three similarity values as the edge weights between senses; here a sense refers to a triple, expressed as Word(No., Sword, Enword), where No. denotes the concept number, Sword denotes the first sememe word, and Enword denotes the English word; No., Sword, and Enword form an organic whole describing the same sense concept: a sense concept number uniquely identifies one sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can then be mapped to an English word.
More preferably, the correct sense selection unit includes:
a graph scoring unit, for calling a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, the candidate sense concepts are arranged by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph based on the Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it; the PageRank score of a node v is calculated as:
PR(v) = (1 − α)/N + α · Σ_{u∈in(v)} PR(u)/|out(u)|
where 1 − α denotes the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α denotes the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct sense selection subunit, for selecting the correct sense from the disambiguation result, covering the following two cases:
1. if there is only one sense concept in the disambiguation result, take that sense concept as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, take the highest-scoring sense concept as the correct sense.
The word sense disambiguation method and system based on a graph model of the present invention have the following advantages:
(1) by combining multiple Chinese and English resources whose advantages complement each other, the present invention fully mines the disambiguation knowledge in those resources, which helps improve word sense disambiguation performance;
(2) the present invention separately performs English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation, ensuring that multiple knowledge resources can be effectively integrated and improving disambiguation accuracy;
(3) the present invention performs weight optimization on the similarities using simulated annealing to obtain the fused similarity, and then constructs the disambiguation graph with word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights, guaranteeing that the similarity values of multiple knowledge resources are automatically optimized;
(4) when performing the English similarity calculation, the present invention annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, ensuring that Chinese and English knowledge resources can be aligned automatically;
(5) the present invention scores the candidate senses in the graph by graph scoring, obtains the score list of the candidate senses, and selects the sense with the maximum score as the correct sense, so the correct sense selection for the target ambiguous word can be realized automatically.
Description of the drawings
The following further describes the present invention with reference to the drawings.
Figure 1 is a flow diagram of the word sense disambiguation method based on a graph model;
Figure 2 is a flow diagram of the similarity calculation;
Figure 3 is a flow diagram of constructing the disambiguation graph;
Figure 4 is a flow diagram of selecting the correct sense;
Figure 5 is a structural block diagram of the word sense disambiguation system based on a graph model;
Figure 6 is the sense information diagram of the example word "中医" (Chinese medicine);
Figure 7 is the synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base.
Specific embodiment
The word sense disambiguation method and system based on a graph model of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 1, the word sense disambiguation method based on a graph model of the present invention includes the following steps:
S1, extracting context knowledge: perform part-of-speech tagging on the ambiguous sentence and extract content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
Example: take the sentence "Centering on the implementation of the 'guiding opinions,' and in light of the actual conditions of traditional Chinese medicine work, all regions should intensify their efforts and actively and steadily advance the reform of TCM medical institutions." for processing, in which "Chinese medicine" (中医) is the word to be disambiguated. Part-of-speech tagging uses the Chinese Academy of Sciences word segmentation system NLPIR-ICTCLAS. After part-of-speech tagging, the content words are extracted and formatted to facilitate subsequent processing, giving "Chinese medicine_n_25: around_v_0 guidance_vn_2 opinion_n_3 carry out_v_6 implement_v_7 combine_v_9 traditional Chinese medicine_n_10 work_vn_11 reality_n_13 want_v_16 increase_v_17 intensity_n_18 active_a_20 steady_a_22 advance_v_24 Chinese medicine_n_25 medical treatment_n_26 institution_n_27 reform_vn_28", where the token before the colon is the word to be disambiguated and the number after each part-of-speech tag is the word's position in the sentence.
S2, similarity calculation: separately perform English-based similarity calculation, word-vector-based similarity calculation, and HowNet-based similarity calculation;
As shown in Fig. 2, the specific steps of the similarity calculation are as follows:
English-based similarity calculation: annotate the context knowledge with HowNet sense information and perform sense mapping to obtain a set of English words; then use the word similarity algorithm based on word vectors and a knowledge base to calculate similarities between the obtained English words. In addition, since HowNet is bilingual, the sense mapping here directly obtains the English word information in HowNet. The main part of the word similarity algorithm based on word vectors and a knowledge base is described row by row below (see the sketch after this paragraph).
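The original code listing is not reproduced in this text. The following is a reconstructed Python sketch of rows 1-9 as described below, under stated assumptions (γ = 10, d = 2, δ = 1.4, β = 0.6; the balance factor α = 0.5 and the helpers vec_sim and related_synsets, which stand for the word-vector cosine of row 1 and for knowledge-base synset lookup, are hypothetical, not the patent's own code or a real BabelNet API):

import networkx as nx

GAMMA = 10    # maximum iteration depth for synset search (rows 2-4)
D = 2         # set distance range for the overlap (row 6)
DELTA = 1.4   # value adjusting the path similarity (row 7)
ALPHA = 0.5   # balance between path and overlap parts -- assumed value
BETA = 0.6    # balance between vector and knowledge-base parts (row 8)

def word_similarity(w1, w2, vec_sim, related_synsets):
    # rows 2-5: iteratively collect synsets related to w1 and w2
    # (at most GAMMA steps) and build the synset graph
    graph = nx.Graph()
    frontier = [w1, w2]
    for _ in range(GAMMA):
        next_frontier = []
        for node in frontier:
            for syn in related_synsets(node):
                if syn not in graph:
                    next_frontier.append(syn)
                graph.add_edge(node, syn)
        frontier = next_frontier
    # row 6: overlap of the synsets lying within distance D of w1 and w2
    near1 = set(nx.single_source_shortest_path_length(graph, w1, cutoff=D))
    near2 = set(nx.single_source_shortest_path_length(graph, w2, cutoff=D))
    sim_lap = D * len(near1 & near2) / (len(near1) + len(near2))
    # row 7: Dijkstra shortest path between w1 and w2, decayed by DELTA
    path = nx.dijkstra_path_length(graph, w1, w2)
    sim_bn = ALPHA * (1.0 / DELTA ** path) + (1 - ALPHA) * sim_lap
    # rows 8-9: linear combination with the vector-based similarity
    return BETA * vec_sim(w1, w2) + (1 - BETA) * sim_bn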
In the word similarity algorithm based on word vectors and a knowledge base, row 1: given two English words, their similarity is obtained by calculating the cosine similarity of their two word vectors; if a given item is a phrase, since the trained word vectors contain no phrases, the phrase needs further processing: the word vectors of the words in the phrase are added to obtain the phrase's vector representation, and the phrase similarity is then obtained as
sim(p1, p2) = cos((1/|p1|) · Σ_{i=1..|p1|} v(wi), (1/|p2|) · Σ_{j=1..|p2|} v(wj))
where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2; wi and wj denote the i-th word of p1 and the j-th word of p2, respectively.
Rows 2-4 iteratively search for the synsets related to words w1 and w2 until the number of iteration steps exceeds γ; since the computational cost of the graph is high when there are too many nodes, the maximum number of iteration steps γ is set to 10. Row 5 constructs the graph from w1, w2, and the synsets associated with them. Row 6 calculates, within a certain distance range in the graph, the overlap of the synsets related to w1 and w2, with the distance set to 2:
sim_lap(w1, w2) = 2 · count(w1, w2) / (count(w1) + count(w2))
where count(w1, w2) denotes the number of synsets shared by words w1 and w2; count(w1) and count(w2) are the numbers of synsets of w1 and w2, respectively.
Row 7 calculates the shortest path between w1 and w2 in the graph using Dijkstra's algorithm and further obtains the similarity of w1 and w2:
sim_bn(w1, w2) = α · (1/δ^path) + (1 − α) · sim_lap(w1, w2)
where path is the shortest path between w1 and w2; δ is a value used to adjust the similarity, set to 1.4; sim_lap(w1, w2) denotes the overlap between w1 and w2; and the parameter α is a regulatory factor that balances the two parts of the formula.
Row 8 linearly combines the word-vector-based method above with the knowledge-base (BabelNet) method to obtain the final similarity:
sim_final(w1, w2) = β · sim_vec + (1 − β) · sim_bn
where sim_bn and sim_vec denote the similarities obtained by the knowledge-base method and the word-vector-based method, respectively; the parameter β is a regulatory factor that balances the results of the two methods, and is specifically set to 0.6.
Row 9 returns the similarity sim_final.
In the word similarity algorithm based on word vectors and a knowledge base, the word-vector processing trains word vectors on the unannotated English Wikipedia corpus using the word2vec toolkit. Before training, the data are preprocessed and the file format is converted from Unicode to UTF-8. The training window is set to 5, the default vector dimension is set to 200, and the Skip-gram model is selected. After training ends, a word vector file is obtained in which each word is mapped to a 200-dimensional vector, each dimension being a double-precision value.
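A minimal sketch of this training setup, using the gensim implementation of word2vec (the corpus path and its tokenization are placeholders; the patent itself uses Google's original word2vec toolkit):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus: one tokenized sentence per line, UTF-8 encoded (placeholder path)
sentences = LineSentence("enwiki_tokenized.txt")

# window = 5, 200-dimensional vectors, sg = 1 selects the Skip-gram model
model = Word2Vec(sentences, vector_size=200, window=5, sg=1)

# save in word2vec text format: one word and its 200 components per line
model.wv.save_word2vec_format("enwiki_vectors.txt", binary=False)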
The knowledge base chosen is BabelNet. BabelNet provides rich concepts and named entities interlinked by a large number of semantic relations; the semantic relations here include synonymy, hypernymy/hyponymy, part-whole relations, and so on. Given two words (concepts or named entities), their respective synsets, together with the synsets linked to them by semantic relations, can be obtained by means of the BabelNet API. A synset is a set of synonyms with a unique identifier in BabelNet that denotes one specific sense. For example, "bn:00021464n" identifies the synset "computer, computing machine, computing device, data processor, electronic computer, information processing system", which denotes the specific sense "computer". The synset graph constructed in the word similarity algorithm based on word vectors and a knowledge base is shown in Fig. 7.
Example: annotate the context knowledge with HowNet sense information, specifically the sense numbers, obtaining "Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 carry out_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 intensity_n_18:076991 active_a_20:057562 active_a_20:057564 steady_a_22:126267 steady_a_22:126269 advance_v_24:122203 advance_v_24:122206 advance_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 institution_n_27:057323 institution_n_27:057325 institution_n_27:057326 reform_vn_28:041189".
After the sense mapping is performed, we obtain "Chinese medicine_n_25: around_v_0:124932|revolve round around_v_0:124933|centre on guidance_vn_2:155807|direct opinion_n_3:143264|complaint opinion_n_3:143267|idea carry out_v_6:047082|carry out implement_v_7:081572|feel at ease implement_v_7:081573|ascertain implement_v_7:081575|fulfil combine_v_9:064548|be united in wedlock combine_v_9:064549|combination traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs work_vn_11:044068|work reality_n_13:109077|reality reality_n_13:109078|practice want_v_16:140522|want to want_v_16:140530|ask want_v_16:140532|ask for want_v_16:140534|take increase_v_17:059967|widen increase_v_17:059968|enhance increase_v_17:059969|enlarge intensity_n_18:076991|dynamics active_a_20:057562|active active_a_20:057564|positive steady_a_22:126267|safe steady_a_22:126269|reliable advance_v_24:122203|move forward advance_v_24:122206|advance advance_v_24:122211|push into Chinese medicine_n_25:157332|traditional_Chinese_medical_science Chinese medicine_n_25:157329|practitioner_of_Chinese_medicine institution_n_27:057323|institution institution_n_27:057325|internal structure of an organization institution_n_27:057326|mechanism reform_vn_28:041189|reform".
English similarity calculation is then performed between any two of the obtained English words (each HowNet sense concept corresponds to one English word), obtaining "Chinese medicine_n_25: around_v_0:124932|revolve round and guidance_vn_2:155807|direct is 0.2920; and opinion_n_3:143264|complaint is 0.3085; and opinion_n_3:143267|idea is 0.3742; and carry out_v_6:047082|carry out is 0.4015; and implement_v_7:081572|feel at ease is 0.3575; and implement_v_7:081573|ascertain is 0.3215; and implement_v_7:081575|fulfil is 0.3541; and combine_v_9:064548|be united in wedlock is 0.3299; and combine_v_9:064549|combination is 0.3487; and traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs is 0.3520; and work_vn_11:044068|work is 0.3478; and reality_n_13:109077|reality is 0.3664; and reality_n_13:109078|practice is 0.3907; and want_v_16:140522|want to is 0.3375; and want_v_16:140530|ask is 0.3482". Since space is limited, only part of the similarity results are shown here.
Word-vector-based similarity calculation: the Sogou full-network news corpus totals 1.43 GB; train word vectors on this corpus with Google's word2vec toolkit to obtain a word vector file, obtain the word vectors of two given words from the word vector file, and calculate the cosine similarity between the word vectors as their similarity.
It should be noted that the more senses an ambiguous word has, the more likely the trained word vector file is to be biased toward some of the more common senses of that ambiguous word. For this reason, HowNet is used to convert the ambiguous word into the senses it possesses, i.e., the first sememe in each concept definition; as shown in Fig. 6, the ambiguous word "中医" (Chinese medicine) is converted to "人" (person) and "知识" (knowledge).
Example: after the ambiguous words are processed with HowNet, we obtain "Chinese medicine_n_25: around_v_0:124932|surround around_v_0:124933|surround guidance_vn_2:155807|order opinion_n_3:143264|Chinese language opinion_n_3:143267|thought carry out_v_6:047082|implement implement_v_7:081572|feel at ease implement_v_7:081573|decide implement_v_7:081575|realize combine_v_9:064548|marry combine_v_9:064549|merge traditional Chinese medicine_n_10:157339|knowledge_drug work_vn_11:044068|affair reality_n_13:109077|entity reality_n_13:109078|thing want_v_16:140522|expect want_v_16:140530|require want_v_16:140532|seek want_v_16:140534|spend increase_v_17:059967|alter shape increase_v_17:059968|optimize increase_v_17:059969|enlarge intensity_n_18:076991|strength active_a_20:057562|active active_a_20:057564|positive steady_a_22:126267|serve as steady_a_22:126269|firm advance_v_24:122203|advance advance_v_24:122206|mobilize advance_v_24:122211|push Chinese medicine_n_25:157332|knowledge Chinese medicine_n_25:157329|people institution_n_27:057323|mechanism institution_n_27:057325|part institution_n_27:057326|component reform_vn_28:041189|improve".
Word-vector-based similarity calculation is performed between any two of the obtained Chinese words (each corresponding to a specific HowNet sense concept), obtaining "Chinese medicine_n_25: around_v_0:124932|surround and guidance_vn_2:155807|order is -0.0145; and opinion_n_3:143264|Chinese language is -0.0264; and opinion_n_3:143267|thought is -0.0366; and carry out_v_6:047082|implement is 0.2071; and implement_v_7:081572|feel at ease is -0.0430; and implement_v_7:081573|decide is 0.1502; and implement_v_7:081575|realize is 0.2254; and combine_v_9:064548|marry is -0.0183; and combine_v_9:064549|merge is 0.0745; and traditional Chinese medicine_n_10:157339|knowledge_drug is 0.0866; and work_vn_11:044068|affair is 0.1434; and reality_n_13:109077|entity is 0.1503; and reality_n_13:109078|thing is -0.0571; and want_v_16:140522|expect is 0.1009; and want_v_16:140530|require is 0.2090; and want_v_16:140532|seek is 0.0496; and want_v_16:140534|spend is 0.0176; and increase_v_17:059967|alter shape is 0.0000; and increase_v_17:059968|optimize is 0.2410; and increase_v_17:059969|enlarge is 0.1911; and intensity_n_18:076991|strength is 0.0592; and active_a_20:057562|active is 0.3089; and active_a_20:057564|positive is 0.0554; and steady_a_22:126267|serve as is 0.0245; and steady_a_22:126269|firm is 0.0490; and advance_v_24:122203|advance is 0.1917; and advance_v_24:122206|mobilize is 0.0277; and advance_v_24:122211|push is 0.1740; and Chinese medicine_n_25:157332|knowledge is 0.2205; and Chinese medicine_n_25:157329|people is -0.0686; and institution_n_27:057323|mechanism is 0.0945; and institution_n_27:057325|part is 0.0582; and institution_n_27:057326|component is 0.0582". Since space is limited, only part of the similarity results are shown here.
HowNet-based similarity calculation: annotate the context knowledge with sense information using HowNet, in the form of word plus concept number, and calculate the similarity between senses using the concept similarity toolkit provided by HowNet.
Example: annotate the context knowledge with HowNet sense information, specifically the sense numbers, obtaining "Chinese medicine_n_25: around_v_0:124932 around_v_0:124933 guidance_vn_2:155807 opinion_n_3:143264 opinion_n_3:143267 carry out_v_6:047082 implement_v_7:081572 implement_v_7:081573 implement_v_7:081575 combine_v_9:064548 combine_v_9:064549 traditional Chinese medicine_n_10:157339 work_vn_11:044068 reality_n_13:109077 reality_n_13:109078 want_v_16:140522 want_v_16:140530 want_v_16:140532 want_v_16:140534 increase_v_17:059967 increase_v_17:059968 increase_v_17:059969 intensity_n_18:076991 active_a_20:057562 active_a_20:057564 steady_a_22:126267 steady_a_22:126269 advance_v_24:122203 advance_v_24:122206 advance_v_24:122211 Chinese medicine_n_25:157332 Chinese medicine_n_25:157329 institution_n_27:057323 institution_n_27:057325 institution_n_27:057326 reform_vn_28:041189".
The similarity between senses is calculated using the concept similarity toolkit provided by HowNet, obtaining "Chinese medicine_n_25: around_v_0:124932 and guidance_vn_2:155807 is 0.015094; and opinion_n_3:143264 is 0.000624; and opinion_n_3:143267 is 0.010256; and carry out_v_6:047082 is 0.013793; and implement_v_7:081572 is 0.010256; and implement_v_7:081573 is 0.013793; and implement_v_7:081575 is 0.013793; and combine_v_9:064548 is 0.016667; and combine_v_9:064549 is 0.018605; and traditional Chinese medicine_n_10:157339 is 0.000624; and work_vn_11:044065 is 0.000624; and work_vn_11:044067 is 0.000624; and work_vn_11:044068 is 0.015094; and reality_n_13:109077 is 0.000624; and reality_n_13:109078 is 0.000624; and want_v_16:140522 is 0.010959; and want_v_16:140530 is 0.015094; and want_v_16:140532 is 0.018605; and want_v_16:140534 is 0.015094; and increase_v_17:059967 is 0.013793; and increase_v_17:059968 is 0.015094; and increase_v_17:059969 is 0.013793; and intensity_n_18:076991 is 0.000624; and active_a_20:057562 is 0.000624; and active_a_20:057564 is 0.000624; and steady_a_22:126267 is 0.000624; and steady_a_22:126269 is 0.000624".
S3, constructing the disambiguation graph: perform weight optimization on the similarities using the simulated annealing algorithm to obtain the fused similarity; then, taking word-sense concepts as vertices, semantic relations between concepts as edges, and the fused similarities as edge weights, construct the disambiguation graph. As shown in Fig. 3, the specific steps of constructing the disambiguation graph are as follows:
S301, weight optimization: the weight optimization algorithm based on simulated annealing automatically optimizes the three similarity values from step S2 to obtain the optimal weight parameters. The formula by which the simulated annealing algorithm performs parameter optimization is:
P = 1, if result(x_new) ≥ result(x_old);
P = exp((result(x_new) − result(x_old)) / (δ·t)), if result(x_new) < result(x_old)
where result(x) denotes the objective function, namely the disambiguation accuracy; δ denotes the cooling rate; t denotes the current temperature; x_new denotes the newly sampled parameters; and x_old denotes the original parameters.
The formula by which the simulated annealing algorithm performs parameter optimization covers the following two cases:
(a) if the objective value of the new parameters x_new is not less than the objective value of the original parameters x_old, the new parameters x_new are selected with probability P = 1;
(b) if the objective value of the new parameters x_new is less than the objective value of the original parameters x_old, then P = exp((result(x_new) − result(x_old)) / (δ·t)) serves as the basis for selecting x_new: a probability value is generated at random and compared with P:
1. if the randomly generated probability value is not greater than P, the new parameters x_new are selected;
2. if the randomly generated probability value is greater than P, the new parameters x_new are discarded.
The main part of the weight optimization algorithm based on simulated annealing is described row by row below (see the sketch after the description):
In the weight optimization algorithm based on simulated annealing, row 1 is the initialization: the initial temperature t is set to 100, the temperature lower bound t_min to 0.001, the cooling rate delta to 0.98, and the maximum number of iteration steps k to 100. Rows 2-3 control the temperature and the iteration steps. Rows 4-5 randomly select a double-precision value in [0, 1−y] as x, and assign 1−x−y to z. In row 6, the function getEvalResult(x, y, z) is the objective function; it returns the disambiguation accuracy obtained with the given weight parameters x, y, z. Row 7 selects a new value x_new in the neighborhood of x. Rows 8-18 determine whether x_new is retained to replace x, following the parameter optimization formula of the simulated annealing algorithm. Row 20 decreases t by the cooling rate delta. Row 22 returns the optimal parameter combination x, y, z.
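The original listing is likewise not reproduced in this text. The following is a reconstructed Python sketch of rows 1-22 under the settings just described; get_eval_result stands for the objective function getEvalResult (running the disambiguator and measuring accuracy), and the neighborhood step size of 0.05 is an assumed value:

import math
import random

def anneal_weight(get_eval_result, y):
    # optimize x (and z = 1 - x - y) for a fixed y by simulated annealing
    t, t_min, delta, k = 100.0, 0.001, 0.98, 100      # row 1: initialization
    x = random.uniform(0.0, 1.0 - y)                  # rows 4-5: initial x and z
    z = 1.0 - x - y
    while t > t_min:                                  # rows 2-3: temperature and
        for _ in range(k):                            # iteration-step control
            x_new = min(max(x + random.uniform(-0.05, 0.05), 0.0), 1.0 - y)  # row 7
            old = get_eval_result(x, y, z)            # row 6: objective function
            new = get_eval_result(x_new, y, 1.0 - x_new - y)
            # rows 8-18: keep x_new with probability 1 if it is no worse,
            # otherwise with probability exp((new - old) / (delta * t))
            if new >= old or random.random() <= math.exp((new - old) / (delta * t)):
                x, z = x_new, 1.0 - x_new - y
        t *= delta                                    # row 20: cooling
    return x, y, z                                    # row 22: optimal combination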
Here x, y, z denote the weight variables of the three similarity results. When the algorithm is executed for the first time, y is set to 1/3; after this run, the weight optimization parameters for x and y are obtained. Then min(x, y) is fixed, and the algorithm is executed a second time; after that run, the other two weight parameters are determined.
S302, similarity fusion: after weight optimization, the final fused similarity between senses is:
Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec
where ws and ws' denote two senses; sim_how denotes the result of the HowNet-based similarity calculation, with weight α; sim_en denotes the result of the word similarity calculation based on word vectors and a knowledge base, with weight β; sim_vec denotes the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0.
Example: after weight optimization, the three similarity values are fused according to the final fusion formula between senses, giving "Chinese medicine_n_25: around_v_0:124932|revolve round|surround and guidance_vn_2:155807|direct|order is 0.015094|0.2929|-0.0145; and opinion_n_3:143264|complaint|Chinese language is 0.000624|0.3085|-0.0264; and opinion_n_3:143267|idea|thought is 0.010256|0.3742|-0.0366; and carry out_v_6:047082|carry out|implement is 0.013793|0.4015|0.2071; and implement_v_7:081572|feel at ease|feel at ease is 0.010256|0.3575|-0.0430; and implement_v_7:081573|ascertain|decide is 0.013793|0.3215|0.1502; and implement_v_7:081575|fulfil|realize is 0.013793|0.3541|0.2254; and combine_v_9:064548|be united in wedlock|marry is 0.016667|0.3299|-0.0183; and combine_v_9:064549|combination|merge is 0.018605|0.3487|0.0745; and traditional Chinese medicine_n_10:157339|traditional Chinese medicine and drugs|knowledge_drug is 0.000624|0.3520|0.0866; and work_vn_11:044068|work|affair is 0.015094|0.3478|0.1434; and reality_n_13:109077|reality|entity is 0.000624|0.3664|0.1503; and reality_n_13:109078|practice|thing is 0.000624|0.3907|-0.0571; and want_v_16:140522|want to|expect is 0.010959|0.3375|0.1009; and want_v_16:140530|ask|require is 0.015094|0.3482|0.2090; and want_v_16:140532|ask for|seek is 0.018605|0.3648|0.0496". Here, in order to show the process, the values are not fused further; for example, "0.018605|0.3648|0.0496" denotes the three similarity values, whose fusion is α·0.018605 + β·0.3648 + γ·0.0496.
S303, constructing the disambiguation graph: in the disambiguation graph, the senses are the vertices, the semantic relations between senses are the edges, and the three similarity values, integrated by the weight optimization algorithm based on simulated annealing, serve as the edge weights between senses. Here a sense refers to a triple, expressed as Word(No., Sword, Enword), where No. denotes the concept number, Sword denotes the first sememe word, and Enword denotes the English word; No., Sword, and Enword form an organic whole describing the same sense concept: a sense concept number uniquely identifies one sense in HowNet, from which the first sememe in its concept definition can be obtained, and the sense can then be mapped to an English word.
Senses in this triple form enable the above three similarity calculation methods to be integrated into a whole. Taking "中医" (Chinese medicine) as an example, "中医" has two senses, corresponding to two sense triples: "Chinese medicine (157329, person, practitioner of Chinese medicine)" and "Chinese medicine (157332, knowledge, traditional Chinese science)". The edge weight between any two vertices in the disambiguation graph, i.e., the semantic similarity between the senses, can then be obtained from the final fused similarity calculation between senses.
S4, selecting the correct sense: score the candidate senses in the graph by graph scoring, obtain the score list of the candidate senses, and select the highest-scoring sense as the correct one. As shown in Fig. 4, the specific steps of selecting the correct sense are as follows:
S401, graph scoring: call a graph scoring method to score the importance of the sense concept vertices in the disambiguation graph; after graph scoring is completed, arrange the candidate sense concepts by score in descending order to form the candidate sense concept list. Graph scoring uses the PageRank algorithm, which evaluates the nodes in the graph based on the Markov chain model: the PageRank score of a node depends on the PageRank scores of all nodes linked to it. The PageRank score of a node v is calculated as:
PR(v) = (1 − α)/N + α · Σ_{u∈in(v)} PR(u)/|out(u)|
where 1 − α denotes the probability, during the random walk, of jumping out of the current Markov chain and randomly selecting a node; α denotes the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| denotes the out-degree of node u; and in(v) is the set of all nodes linked to node v.
Example: after graph scoring, the candidate sense concept list is obtained:
Chinese medicine_n_25:157332 2.1213090873827947E58;
Chinese medicine_n_25:157329 1.8434688340823378E58.
S402, selecting the correct sense: select the correct sense from the disambiguation result, covering the following two cases:
1. if there is only one sense concept in the disambiguation result, take that sense concept as the correct sense;
2. if the disambiguation result is a sense list composed of multiple sense concepts, take the highest-scoring sense concept as the correct sense.
Example: the sense concept with the highest score, namely "Chinese medicine_n_25:157332", is selected as the correct sense.
Embodiment 2:
As shown in Fig. 5, the word sense disambiguation system based on a graph model of the present invention includes:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words as context knowledge; content words are nouns, verbs, adjectives, and adverbs;
a similarity calculation unit, which performs the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively. The similarity calculation unit comprises:
an English similarity calculation subunit, which annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, and then applies the word similarity algorithm based on word vectors and a knowledge base to compute similarities among the resulting English words; since HowNet is bilingual, the sense mapping here directly takes the English word information recorded in HowNet.
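For illustration, a condensed sketch of this word-vector-plus-knowledge-base similarity (detailed as steps S20101 to S20107 in claim 3 below); the embedding table, synset dictionary, synset graph and parameter values are placeholders we assume, not fixtures of the patent:

```python
import numpy as np
import networkx as nx

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def phrase_vec(phrase, emb):
    """Vector of a word or phrase: the averaged sum of its word vectors."""
    return np.mean([emb[w] for w in phrase.split()], axis=0)

def knowledge_sim(w1, w2, g, synsets, d=3, alpha=0.5, delta=1.0):
    """Synset overlap plus a shortest-path term over the synset graph g.
    synsets: dict mapping a word to the set of its related synset ids."""
    sim_lap = d * len(synsets[w1] & synsets[w2]) / (len(synsets[w1]) + len(synsets[w2]))
    path = nx.dijkstra_path_length(g, w1, w2)  # assumes w1 != w2
    return alpha * (1.0 / (delta * path)) + (1 - alpha) * sim_lap

def final_sim(w1, w2, emb, g, synsets, beta=0.5):
    """Linear combination of the vector-based and knowledge-base similarities."""
    sim_vec = cos(phrase_vec(w1, emb), phrase_vec(w2, emb))
    return beta * sim_vec + (1 - beta) * knowledge_sim(w1, w2, g, synsets)
```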
a word-vector similarity calculation subunit, which trains word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieves the vectors of the two given words from that file, and takes the cosine similarity between the vectors as their similarity. It should be noted that the more senses an ambiguous word has, the more likely the trained word vector is to be biased towards the word's more common senses; for this reason, HowNet is used to convert the ambiguous word into the first sememe of each of its concept definitions, i.e. one sememe per sense. As shown in Fig. 6, the ambiguous word "Chinese medicine" is converted to "people" and "knowledge".
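A minimal training-and-lookup sketch with the gensim implementation of word2vec; the corpus path and hyper-parameters are assumptions for the example:

```python
from gensim.models import Word2Vec

# One whitespace-tokenized sentence per line in a plain-text corpus (assumed path).
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)

# Cosine similarity between the first sememes standing in for the two senses.
sim = model.wv.similarity("people", "knowledge")
```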
a HowNet similarity calculation subunit, which annotates the context knowledge with sense information by means of HowNet, in the form of word plus concept number, and computes the similarity between senses with the concept similarity toolkit provided with HowNet.
a disambiguation graph construction unit, which optimizes the similarity weights by simulated annealing to obtain the fused similarity, and then builds the disambiguation graph with word-sense concepts as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights. The disambiguation graph construction unit comprises:
a weight optimization subunit, which applies the simulated-annealing-based weight optimization algorithm to automatically optimize the three similarity values, i.e. those of the English-based, word-vector-based and HowNet-based similarity calculations, and obtain the optimal weight parameters; the acceptance probability used by the simulated annealing for parameter optimization is

p = 1, if result(x_new) ≥ result(x_old)
p = exp((result(x_new) − result(x_old))/(δ · t)), otherwise

where result(x) denotes the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the previous parameter.
The acceptance formula of the simulated annealing covers the following two cases:
(a) if the objective value of the new parameter x_new is not smaller than that of the old parameter x_old, x_new is accepted with probability p = 1;
(b) if the objective value of x_new is smaller than that of x_old, the probability p = exp((result(x_new) − result(x_old))/(δ · t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value is not greater than p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded.
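A compact sketch of this acceptance rule used to tune the weights (α, β, γ) on a development set; the perturbation scheme, schedule and function names are our assumptions:

```python
import math
import random

def anneal(accuracy, t=1.0, t_min=1e-3, delta=0.95, step=0.05):
    """Maximize accuracy(weights) over the simplex alpha + beta + gamma = 1."""
    x = [1 / 3, 1 / 3, 1 / 3]
    while t > t_min:
        # Perturb every coordinate, clip at zero, renormalize onto the simplex.
        new = [max(0.0, w + random.uniform(-step, step)) for w in x]
        s = sum(new)
        new = [w / s for w in new]
        gain = accuracy(new) - accuracy(x)
        # Accept with p = 1 if no worse, else p = exp(gain / (delta * t)).
        if gain >= 0 or random.random() <= math.exp(gain / (delta * t)):
            x = new
        t *= delta  # cooling
    return x
```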
a similarity fusion subunit: after weight optimization, the finally fused similarity between two senses is

Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote two word senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the result of the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0.
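With the optimized weights in hand, the fused edge weight reduces to a single expression; the function below simply restates the formula:

```python
def fused_sim(sim_how, sim_en, sim_vec, weights):
    """Edge weight Sim(ws, ws') = alpha*sim_how + beta*sim_en + gamma*sim_vec."""
    alpha, beta, gamma = weights
    return alpha * sim_how + beta * sim_en + gamma * sim_vec
```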
a disambiguation graph building subunit, which builds the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, integrating the three similarity values into the edge weights by means of the simulated-annealing-based weight optimization algorithm; here a word sense is a triple Word(No., Sword, Enword), where No. denotes the concept number, Sword the first sememe and Enword the English word; the three elements form an organic whole describing one and the same sense concept: a concept number uniquely identifies a sense in HowNet, from it the first sememe of the concept definition can be obtained, and the sense can in turn be mapped to an English word.
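A sketch of assembling the disambiguation graph, here with networkx as an assumed graph library, one vertex per candidate sense and the fused similarity as edge weight:

```python
import networkx as nx

def build_disambiguation_graph(senses, fused):
    """senses: list of sense ids; fused(a, b): fused similarity of two senses.
    Links every pair here for brevity; in practice the senses of one
    ambiguous word would not be linked to each other."""
    g = nx.Graph()
    g.add_nodes_from(senses)
    for i, a in enumerate(senses):
        for b in senses[i + 1:]:
            g.add_edge(a, b, weight=fused(a, b))
    return g
```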
a correct-sense selection unit, which scores the candidate senses in the graph by graph scoring, obtains the score list of the candidate senses, and selects the highest-scoring one as the correct sense. The correct-sense selection unit comprises:
a graph scoring subunit, which invokes the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph and, after scoring, sorts the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a single node is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct-sense selection subunit, which selects the correct sense from the disambiguation result in one of two cases:
(1) if the disambiguation result contains only one sense concept, that sense concept is taken as the correct sense;
(2) if the disambiguation result is a sense list composed of several sense concepts, the sense concept with the highest score is taken as the correct sense.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A word sense disambiguation method based on a graph model, characterized by comprising the following steps:
S1, extracting context knowledge: performing part-of-speech tagging on the ambiguous sentence and extracting content words, namely nouns, verbs, adjectives and adverbs, as context knowledge;
S2, similarity calculation: performing the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
S3, constructing the disambiguation graph: optimizing the similarity weights by simulated annealing to obtain the fused similarity, and building the disambiguation graph with word-sense concepts as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights;
S4, selecting the correct word sense: scoring the candidate senses in the graph by graph scoring to obtain a score list of the candidate senses, and selecting the highest-scoring sense as the correct one.
2. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the similarity calculation of step S2 proceeds as follows:
S201, English-based similarity calculation: annotating the context knowledge with HowNet sense information and performing sense mapping to obtain a set of English words, and then applying the word similarity algorithm based on word vectors and a knowledge base to compute similarities among the resulting English words;
S202, word-vector-based similarity calculation: training word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieving the vectors of the two given words from that file, and taking the cosine similarity between the vectors as their similarity;
S203, HowNet-based similarity calculation: annotating the context knowledge with sense information by means of HowNet, in the form of word plus concept number, and computing the similarity between senses with the concept similarity toolkit provided with HowNet.
3. The word sense disambiguation method based on a graph model according to claim 2, characterized in that the word similarity algorithm based on word vectors and a knowledge base in step S201 proceeds as follows:
S20101, judging whether the given items are words or phrases:
(1) if two English words are given, the similarity between them is obtained as the cosine similarity of the two word vectors;
(2) if a given item is a phrase, the word vectors of the words in the phrase are added up to obtain the vector representation of the phrase, and the phrase similarity is obtained as

sim(p1, p2) = cos((1/|p1|) · Σ_{i=1..|p1|} w_i, (1/|p2|) · Σ_{j=1..|p2|} w_j)

where |p1| and |p2| denote the numbers of words contained in phrases p1 and p2, and w_i and w_j denote the vectors of the i-th word of p1 and the j-th word of p2 respectively;
S20102, iteratively searching for the synsets related to the two English words, until the number of iteration steps exceeds γ;
S20103, constructing a synset graph from the two English words and their related synsets;
S20104, within the set distance range, computing in the graph the overlap of the synsets related to the two words:

sim_lap(w_i, w_j) = d · count(w_i, w_j)/(count(w_i) + count(w_j))

where count(w_i, w_j) denotes the number of synsets shared by the words w_i and w_j; count(w_i) and count(w_j) are the numbers of synsets of w_i and w_j respectively; and d is the value of the set distance range;
S20105, computing the shortest path between w_i and w_j in the graph with Dijkstra's algorithm, and obtaining the similarity of w_i and w_j as

sim_bn(w_i, w_j) = α · 1/(δ · path) + (1 − α) · sim_lap(w_i, w_j)

where path is the shortest path between w_i and w_j; δ is a value used to scale the similarity; sim_lap(w_i, w_j) is the overlap between w_i and w_j; and the parameter α is a regulatory factor that balances the two parts of the formula;
S20106, linearly combining the similarity sim_vec obtained by the word-vector-based approach of step S20101 with the similarity sim_bn obtained by the knowledge-base approach of step S20105 to obtain the final similarity:

sim_final(w_i, w_j) = β · sim_vec + (1 − β) · sim_bn

where sim_bn and sim_vec denote the similarities obtained by the knowledge-base approach and by the word-vector approach respectively, and the parameter β is a regulatory factor that balances the two similarity results;
S20107, returning the similarity sim_final.
4. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the disambiguation graph of step S3 is constructed as follows:
S301, weight optimization: applying the simulated-annealing-based weight optimization algorithm to automatically optimize the three similarity values of step S2 and obtain the optimal weight parameters;
S302, similarity fusion: after weight optimization, the finally fused similarity between two senses is

Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote two word senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the result of the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
S303, constructing the disambiguation graph: the disambiguation graph takes word senses as vertices and the semantic relations between senses as edges, and integrates the three similarity values into the edge weights by means of the simulated-annealing-based weight optimization algorithm.
5. The word sense disambiguation method based on a graph model according to claim 4, characterized in that the acceptance probability used by the simulated annealing for parameter optimization in step S301 is

p = 1, if result(x_new) ≥ result(x_old)
p = exp((result(x_new) − result(x_old))/(δ · t)), otherwise

where result(x) denotes the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the previous parameter;
the acceptance formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not smaller than that of the old parameter x_old, x_new is accepted with probability p = 1;
(b) if the objective value of x_new is smaller than that of x_old, the probability p = exp((result(x_new) − result(x_old))/(δ · t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value is not greater than p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded;
and in that the word sense in step S303 is a triple Word(No., Sword, Enword), where No. denotes the concept number, Sword the first sememe and Enword the English word; the three elements form an organic whole describing one and the same sense concept: a concept number uniquely identifies a sense in HowNet, from it the first sememe of the concept definition can be obtained, and the sense can in turn be mapped to an English word.
6. The word sense disambiguation method based on a graph model according to claim 1, characterized in that the correct sense is selected in step S4 as follows:
S401, graph scoring: invoking the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph and, after scoring, sorting the candidate sense concepts by score in descending order to form the candidate sense concept list;
S402, selecting the correct sense: selecting the correct sense from the disambiguation result, in one of two cases:
(1) if the disambiguation result contains only one sense concept, taking that sense concept as the correct sense;
(2) if the disambiguation result is a sense list composed of several sense concepts, taking the sense concept with the highest score as the correct sense.
7. The word sense disambiguation method based on a graph model according to claim 6, characterized in that the graph scoring of step S401 uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a single node is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v.
8. A word sense disambiguation system based on a graph model, characterized in that the system comprises:
a context knowledge extraction unit, which performs part-of-speech tagging on the ambiguous sentence and extracts content words, namely nouns, verbs, adjectives and adverbs, as context knowledge;
a similarity calculation unit, which performs the English-based similarity calculation, the word-vector-based similarity calculation and the HowNet-based similarity calculation respectively;
a disambiguation graph construction unit, which optimizes the similarity weights by simulated annealing to obtain the fused similarity and builds the disambiguation graph with word-sense concepts as vertices, the semantic relations between concepts as edges, and the fused similarities as edge weights;
a correct-sense selection unit, which scores the candidate senses in the graph by graph scoring, obtains the score list of the candidate senses, and selects the highest-scoring one as the correct sense.
9. The word sense disambiguation system based on a graph model according to claim 8, characterized in that the similarity calculation unit comprises:
an English similarity calculation subunit, which annotates the context knowledge with HowNet sense information and performs sense mapping to obtain a set of English words, and then applies the word similarity algorithm based on word vectors and a knowledge base to compute similarities among the resulting English words;
a word-vector similarity calculation subunit, which trains word vectors on the corpus with Google's word2vec toolkit to obtain a word vector file, retrieves the vectors of the two given words from that file, and takes the cosine similarity between the vectors as their similarity;
a HowNet similarity calculation subunit, which annotates the context knowledge with sense information by means of HowNet, in the form of word plus concept number, and computes the similarity between senses with the concept similarity toolkit provided with HowNet;
and in that the disambiguation graph construction unit comprises:
a weight optimization subunit, which applies the simulated-annealing-based weight optimization algorithm to automatically optimize the three similarity values of the English-based, word-vector-based and HowNet-based similarity calculations and obtain the optimal weight parameters, the acceptance probability used for parameter optimization being

p = 1, if result(x_new) ≥ result(x_old)
p = exp((result(x_new) − result(x_old))/(δ · t)), otherwise

where result(x) denotes the objective function, namely the disambiguation accuracy; δ is the cooling rate; t is the current temperature; x_new is the newly drawn parameter; and x_old is the previous parameter;
the acceptance formula covers the following two cases:
(a) if the objective value of the new parameter x_new is not smaller than that of the old parameter x_old, x_new is accepted with probability p = 1;
(b) if the objective value of x_new is smaller than that of x_old, the probability p = exp((result(x_new) − result(x_old))/(δ · t)) serves as the basis for accepting x_new: a probability value is generated at random and compared with p:
(1) if the randomly generated value is not greater than p, the new parameter x_new is accepted;
(2) if the randomly generated value is greater than p, the new parameter x_new is discarded;
a similarity fusion subunit: after weight optimization, the finally fused similarity between two senses is

Sim(ws, ws') = α·sim_how + β·sim_en + γ·sim_vec

where ws and ws' denote two word senses; sim_how is the result of the HowNet-based similarity calculation, with weight α; sim_en is the result of the word similarity based on word vectors and a knowledge base, with weight β; sim_vec is the result of the word-vector-based similarity calculation, with weight γ; and α + β + γ = 1, α ≥ 0, β ≥ 0, γ ≥ 0;
a disambiguation graph building subunit, which builds the disambiguation graph with word senses as vertices and the semantic relations between senses as edges, integrating the three similarity values into the edge weights by means of the simulated-annealing-based weight optimization algorithm; here a word sense is a triple Word(No., Sword, Enword), where No. denotes the concept number, Sword the first sememe and Enword the English word; the three elements form an organic whole describing one and the same sense concept: a concept number uniquely identifies a sense in HowNet, from it the first sememe of the concept definition can be obtained, and the sense can in turn be mapped to an English word.
10. The word sense disambiguation system based on a graph model according to claim 8 or 9, characterized in that the correct-sense selection unit comprises:
a graph scoring subunit, which invokes the graph scoring method to score the importance of the sense-concept vertices in the disambiguation graph and, after scoring, sorts the candidate sense concepts by score in descending order to form the candidate sense concept list; graph scoring uses the PageRank algorithm, which evaluates the nodes of a graph on the basis of a Markov chain model, the PageRank score of a node depending on the PageRank scores of all nodes linked to it; the PageRank score of a single node is computed as

PR(v) = (1 − α)/N + α · Σ_{u ∈ in(v)} PR(u)/|out(u)|

where 1 − α is the probability, during the random walk, of jumping out of the current Markov chain and selecting a node at random; α is the probability of continuing the current Markov chain; N is the total number of nodes; |out(u)| is the out-degree of node u; and in(v) is the set of all nodes linked to node v;
a correct-sense selection subunit, which selects the correct sense from the disambiguation result in one of two cases:
(1) if the disambiguation result contains only one sense concept, taking that sense concept as the correct sense;
(2) if the disambiguation result is a sense list composed of several sense concepts, taking the sense concept with the highest score as the correct sense.
CN201811503355.7A 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model Active CN109359303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811503355.7A CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model


Publications (2)

Publication Number Publication Date
CN109359303A true CN109359303A (en) 2019-02-19
CN109359303B CN109359303B (en) 2023-04-07

Family

ID=65332018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811503355.7A Active CN109359303B (en) 2018-12-10 2018-12-10 Word sense disambiguation method and system based on graph model

Country Status (1)

Country Link
CN (1) CN109359303B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002017128A1 (en) * 2000-08-24 2002-02-28 Science Applications International Corporation Word sense disambiguation
WO2014087506A1 (en) * 2012-12-05 2014-06-12 三菱電機株式会社 Word meaning estimation device, word meaning estimation method, and word meaning estimation program
WO2016050066A1 (en) * 2014-09-29 2016-04-07 华为技术有限公司 Method and device for parsing interrogative sentence in knowledge base
CN105760363A (en) * 2016-02-17 2016-07-13 腾讯科技(深圳)有限公司 Text file word sense disambiguation method and device
CN105893346A (en) * 2016-03-30 2016-08-24 齐鲁工业大学 Graph model word sense disambiguation method based on dependency syntax tree
WO2017217661A1 (en) * 2016-06-15 2017-12-21 울산대학교 산학협력단 Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
CN106951684A (en) * 2017-02-28 2017-07-14 北京大学 A kind of method of entity disambiguation in medical conditions idagnostic logout
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108959461A (en) * 2018-06-15 2018-12-07 东南大学 A kind of entity link method based on graph model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鹿文鹏 (Lu Wenpeng): "Research on Word Sense Disambiguation Based on Dependency and Domain Knowledge", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413989A (en) * 2019-06-19 2019-11-05 北京邮电大学 A kind of text field based on domain semantics relational graph determines method and system
CN110413989B (en) * 2019-06-19 2020-11-20 北京邮电大学 Text field determination method and system based on field semantic relation graph
CN110362691A (en) * 2019-07-19 2019-10-22 大连语智星科技有限公司 A kind of tree bank building system
CN110598209A (en) * 2019-08-21 2019-12-20 合肥工业大学 Method, system and storage medium for extracting keywords
CN110705295A (en) * 2019-09-11 2020-01-17 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110766072A (en) * 2019-10-22 2020-02-07 探智立方(北京)科技有限公司 Automatic generation method of computational graph evolution AI model based on structural similarity
CN111310475B (en) * 2020-02-04 2023-03-10 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111310475A (en) * 2020-02-04 2020-06-19 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model
CN111783418A (en) * 2020-06-09 2020-10-16 北京北大软件工程股份有限公司 Chinese meaning representation learning method and device
CN111783418B (en) * 2020-06-09 2024-04-05 北京北大软件工程股份有限公司 Chinese word meaning representation learning method and device
CN112256885A (en) * 2020-10-23 2021-01-22 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN112256885B (en) * 2020-10-23 2023-10-27 上海恒生聚源数据服务有限公司 Label disambiguation method, device, equipment and computer readable storage medium
CN113158687B (en) * 2021-04-29 2021-12-28 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN115114397A (en) * 2022-05-09 2022-09-27 泰康保险集团股份有限公司 Annuity information updating method, device, electronic device, storage medium, and program
CN115114397B (en) * 2022-05-09 2024-05-31 泰康保险集团股份有限公司 Annuity information updating method, annuity information updating device, electronic device, storage medium, and program

Also Published As

Publication number Publication date
CN109359303B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109359303A (en) A kind of Word sense disambiguation method and system based on graph model
CN104915340B (en) Natural language question-answering method and device
KR101850124B1 (en) Evaluating query translations for cross-language query suggestion
Ramisch et al. mwetoolkit: A framework for multiword expression identification.
US9514098B1 (en) Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases
CN106815252A (en) A kind of searching method and equipment
KR20180125746A (en) System and Method for Sentence Embedding and Similar Question Retrieving
CN109614620A (en) A kind of graph model Word sense disambiguation method and system based on HowNet
CN110377745B (en) Information processing method, information retrieval device and server
Vlachos et al. A new corpus and imitation learning framework for context-dependent semantic parsing
Xie et al. Knowledge base question answering based on deep learning models
US20150161109A1 (en) Reordering words for machine translation
Kanojia et al. Challenge dataset of cognates and false friend pairs from indian languages
Park et al. Frame-Semantic Web: a Case Study for Korean.
Nishihara et al. Word complexity estimation for Japanese lexical simplification
CN108255818B (en) Combined machine translation method using segmentation technology
Harshawardhan et al. Phrase based English–Tamil Translation System by Concept Labeling using Translation Memory
He et al. Language post positioned characteristic based Chinese-Vietnamese statistical machine translation method
Huang et al. A simple, straightforward and effective model for joint bilingual terms detection and word alignment in SMT
CN108280066B (en) Off-line translation method from Chinese to English
Jusoh et al. Automated translation machines: Challenges and a proposed solution
Marinova Evaluation of stacked embeddings for Bulgarian on the downstream tasks POS and NERC
Passban et al. Improving phrase-based SMT using cross-granularity embedding similarity
Mascarell et al. Detecting document-level context triggers to resolve translation ambiguity
TWI492072B (en) Input system and input method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant