CN103235833B - Answer search method and device by the aid of statistical machine translation - Google Patents

Publication number
CN103235833B
Authority
CN
China
Legal status
Active
Application number
CN201310180146.4A
Other languages
Chinese (zh)
Other versions
CN103235833A (en
Inventor
周光有
赵军
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science

Abstract

The invention discloses an answer retrieval method and device using statistical machine translation. The method first translates the candidate answers into a plurality of other languages with a statistical machine translation tool, obtaining multiple equivalent representations of the candidate answers. It then reduces the dimensionality of these representations by matrix factorization to obtain a low-dimensional latent representation. Next, the query question is mapped into the same low-dimensional latent representation through statistical machine translation and matrix factorization. Finally, the similarities between the query question and the candidate answers are computed in the latent space, and the candidate answers with the highest similarity are returned as the answers to the query question. The method effectively addresses vocabulary mismatch and lexical ambiguity, and experiments on a large-scale community question answering data set show that answer retrieval performance improves by 29.36%.

Description

An answer retrieval method and device using statistical machine translation
Technical field
The present invention relates to the technical field of natural language processing, and in particular to an answer retrieval method and device that use statistical machine translation.
Background
With the rapid development of Internet technology, Internet services built on user-generated content (UGC) are becoming increasingly popular. Community question answering (CQA) is a new question-and-answer-based system for information exchange and knowledge sharing that emerged in this context; examples include Yahoo! Answers and Baidu Zhidao. Unlike automatic question answering systems, CQA lets users ask any kind of question and answer any kind of question posed by other users. Answer retrieval is the foundation of CQA analysis and occupies an important position. The task of answer retrieval is to retrieve, from a large-scale repository of candidate answers, answers that are semantically similar or close to a query question, and to use them to answer that query question. Answer retrieval therefore has both theoretical significance and practical value.
The main challenges currently facing answer retrieval are the lexical gap (vocabulary mismatch) and lexical ambiguity between query questions and candidate answers. Lexical ambiguity usually causes an answer retrieval model to return many answers that do not match the user's query intent. The main reason is that in CQA both the query questions and the answers are written by users, whose query intents are highly diverse. For example, depending on the user, the word "interest" can mean either "curiosity" or "a charge for borrowing money". The lexical gap is a common phenomenon between query questions and candidate answers: many words rarely co-occur in a query question and a candidate answer, or do not appear in one of them at all, so traditional methods based on term matching cannot be applied.
One way to address the lexical ambiguity and lexical gap problems described above is to use statistical machine translation, representing ambiguous words in the source language, and words that differ on the surface, by their corresponding translations. Such an approach first requires a reasonable objective function that integrates the source language and its translations in a single framework; second, the noise introduced by statistical machine translation must be reduced as far as possible; and finally, a fast method is needed to solve the objective function. If the translated words are instead added directly to the source language, retrieval accuracy degrades considerably, mainly because doing so greatly increases computational complexity and because machine translation errors introduce a great deal of noise.
The task of answer retrieval is, for a query question entered by a user, to retrieve from an answer document collection the answers that can answer that query. The main difficulty is that query questions and candidate answers use different wording to express the same or similar meaning, which easily leads to vocabulary mismatch and lexical ambiguity. Traditional methods rely mainly on mining word associations within a single language and ignore the semantic associations available across multiple languages.
Summary of the invention
To solve the above problems, the present invention first designs a reasonable objective function that integrates the source language and its corresponding translations in a single framework, and within this framework constrains the effect of machine translation noise on answer retrieval. A fast solving method is then designed for this objective function and its constraints. Solving the objective function yields latent representations of the source language and its translations, and the similarity between the user's query and the candidate answers is finally computed in the latent space. Following this approach, the invention targets the two main difficulties of answer retrieval and incorporates statistical machine translation into the retrieval process. Experiments confirm that the method improves the accuracy of answer retrieval.
The basic idea of the present invention is to make full use of statistical machine translation, representing ambiguous words in the source language, and words that differ on the surface, by their corresponding translations, thereby improving answer retrieval performance.
The invention discloses an answer retrieval method using statistical machine translation, comprising the following steps:
Step 1: using a statistical machine translation tool, translate all candidate answers expressed in the source language into a plurality of other languages;
Step 2: integrate the candidate answers expressed in every language, including the source language, into a framework based on non-negative matrix factorization;
Step 3: solve the framework based on non-negative matrix factorization with a fast gradient descent algorithm based on least squares, obtaining a low-dimensional representation of all candidate answers in each language;
Step 4: using the statistical machine translation tool, translate the query question expressed in the source language into the plurality of other languages;
Step 5: using the low-dimensional representations of all candidate answers in each language obtained in Step 3, map the query question and its translations into the low-dimensional space;
Step 6: from the low-dimensional representations of the query question and its translations and of the corresponding candidate answers, compute the similarity between the query question (and each of its translations) and the corresponding candidate answers, and obtain the final retrieval result from these similarities.
The invention also discloses an answer retrieval device using statistical machine translation, comprising:
a candidate answer translation module, for translating the candidate answers into other languages;
a matrix decomposition module, for integrating the candidate answers expressed in every language, including the source language, into a framework based on non-negative matrix factorization;
an optimization solving module, for solving the framework based on non-negative matrix factorization with a fast gradient descent algorithm based on least squares, obtaining a low-dimensional representation of all candidate answers of each question in each language;
a query question translation module, for translating the query question into other languages;
a similarity calculation module based on the low-dimensional space, for mapping the query question into the low-dimensional space and computing the similarity between the query question and the candidate answers in that space;
a result ranking learning module, for obtaining the final retrieved answers from the similarities computed by the similarity calculation module.
The present invention uses statistical machine translation to improve answer retrieval performance. Using the statistical machine translation tool Google Translate, ambiguous words in the source language, and words that differ on the surface, are represented by their corresponding translations, thereby improving answer retrieval performance.
Brief description of the drawings
Fig. 1 shows the answer retrieval method using statistical machine translation according to the present invention.
Fig. 2 is a structural diagram of the answer retrieval device using statistical machine translation according to the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The invention discloses an answer retrieval method and device using statistical machine translation. The method is divided into an offline process and an online process. The offline process is implemented by three modules: the candidate answer translation module, the matrix decomposition module, and the optimization solving module. The online process is likewise carried out by three modules: the query question translation module, the similarity calculation module based on the low-dimensional space, and the result ranking learning module.
Fig. 1 shows the answer retrieval method using statistical machine translation proposed by the present invention. As shown in Fig. 1, it comprises two stages, an offline part and an online part. The offline process includes:
Step (1): using a statistical machine translation tool, translate all candidate answers expressed in the source language $l_1$ (for example, English) to obtain equivalent representations in the $L-1$ other languages $\{l_2, \dots, l_L\}$, where $L$ is the total number of languages. The statistical machine translation tool may be, for example, Google Translate.
Step (2): represent the candidate answer set in each language as an $M_p \times N$ word-document matrix $\bar{D}_p$, where $M_p$ is the number of distinct words in the candidate answer set expressed in the $p$-th language and $N$ is the number of answers in the candidate answer set.
Step (3): design a new objective function that uses non-negative matrix factorization to integrate the candidate answers expressed in the $L$ different languages into a unified framework, and uses a regularization strategy to reduce the noise introduced by statistical machine translation.
Step (4): design a fast gradient descent algorithm based on least squares and solve the above objective function to obtain the low-dimensional representations for the $L$ languages, namely the coefficient matrices $\bar{U}_1, \dots, \bar{U}_L$ and the reconstruction matrices $\bar{V}_1, \dots, \bar{V}_L$.
The online process includes:
Step (1): using a statistical machine translation tool, translate the query question expressed in the source language $l_1$ (for example, English) into equivalent representations in the $L-1$ other languages. The statistical machine translation tool may be, for example, Google Translate.
Step (2): using the coefficient matrices $\bar{U}_1, \dots, \bar{U}_L$ obtained in step (4) of the offline process, map the query question and its $L-1$ translations into the low-dimensional space.
Step (3): compute the similarity between the query question and the candidate answers in their low-dimensional representations.
Step (4): using a linear learning-to-rank strategy, merge the similarities obtained in the low-dimensional space for the $L$ languages, and return the highest-scoring candidate answers as the final answers.
Fig. 2 shows the answer retrieval device using statistical machine translation proposed by the present invention. As shown in Fig. 2, the retrieval device includes a candidate answer translation module, a matrix decomposition module, an optimization solving module, a query question translation module, a similarity calculation module based on the low-dimensional space, and a result ranking learning module.
The candidate answer translation module is used, in the offline stage, to translate all candidate answers expressed in the source language $l_1$ (for example, English) into equivalent representations in the $L-1$ other languages $\{l_2, \dots, l_L\}$, where $L$ is the total number of languages; that is, translating the candidate answer set $D_1$ yields the candidate answer sets $D_2, \dots, D_L$ expressed in the other $L-1$ languages.
Candidate answer translation is one of the techniques of the present invention. Translating the candidate answers from one language into $L-1$ other languages by hand would be time-consuming and laborious; for a real task such as community question answering retrieval in particular, manually translating candidate answers at large scale is clearly impractical. Fortunately, machine translation has developed well within natural language processing, even though translation quality is not yet fully satisfactory, and many free public translation tools now provide everyday translation services. The preferred embodiment of the present invention uses Google Translate. This tool uses statistical machine learning to train translation models on large-scale parallel corpora, can take rich contextual information into account when translating from one language into another, and shows good translation performance among the many available tools. Translating the candidate answer set $D_1$ yields the candidate answer sets $D_2, \dots, D_L$ expressed in the other $L-1$ languages.
The matrix decomposition module is used, in the offline stage, to represent the candidate answer set in each language as an $M_p \times N$ word-document matrix $\bar{D}_p$, where $M_p$ is the number of distinct words in the candidate answer set expressed in the $p$-th language and $N$ is the number of answers in the candidate answer set.
The matrix decomposition module is one of the key techniques of the present invention. Let $\{l_1, l_2, \dots, l_L\}$ denote the set of languages used, where $L$ is the number of languages, $l_1$ is the source language (for example, English), and $l_2, \dots, l_L$ are the other $L-1$ languages. Let $D_p = \{d_1^{(p)}, \dots, d_N^{(p)}\}$ denote the candidate answer set expressed in language $l_p$. Each candidate answer $d_i^{(p)}$ can be represented as an $M_p$-dimensional vector $\bar{d}_i^{(p)}$, in which each element corresponds to a word and gives the importance of that word in the $i$-th candidate answer. This vector can be computed with tf-idf, a statistical measure of how important a word is to one document in a collection or corpus. $D_p$ can then be expressed as an $M_p \times N$ word-document matrix $\bar{D}_p$ in which each row represents a distinct word and each column represents a candidate answer; $M_p$ is the number of distinct words in $D_p$ and $N$ is the number of candidate answers in $D_p$.
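The construction of the word-document matrix can be illustrated with a minimal sketch. The patent only states that tf-idf may be used; the whitespace tokenizer, the smoothed idf variant (`log(N / df) + 1`), the function name `tfidf_matrix`, and the toy answers below are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build an M x N word-document matrix (rows = words, columns = answers)."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of answers that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    matrix = [[0.0] * n_docs for _ in vocab]
    for j, toks in enumerate(tokenized):
        counts = Counter(toks)
        for i, w in enumerate(vocab):
            if counts[w]:
                tf = counts[w] / len(toks)
                idf = math.log(n_docs / df[w]) + 1.0  # assumed idf variant
                matrix[i][j] = tf * idf
    return vocab, matrix

# Hypothetical candidate answers in the source language.
answers = ["bank charges interest", "interest in music", "music bank"]
vocab, D = tfidf_matrix(answers)
print(len(vocab), len(D[0]))  # M (distinct words) and N (answers)
```

All entries are non-negative, which is what the non-negative factorization in the following paragraphs requires.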
Intuitively, the words of the translated candidate answer sets $D_2, \dots, D_L$ could be added directly to the original candidate answer set $D_1$, which would increase the dimension of the matrix $\bar{D}_1$ corresponding to $D_1$ from $M_1 \times N$ to $\left(\sum_{p=1}^{L} M_p\right) \times N$. This approach has two shortcomings: (1) it causes data sparsity; (2) the translation errors of statistical machine translation introduce noise. To address these problems, the present invention adopts matrix factorization.
Assume that the matrix $\bar{D}_p$ can be decomposed into two low-dimensional matrices $\bar{U}_p$ and $\bar{V}_p$. If each matrix $\bar{D}_p$ is also treated as independent of the matrices of the other languages, the following objective function is obtained:
$$F(\bar{U}_p, \bar{V}_p) = \min_{\bar{U}_p \ge 0,\ \bar{V}_p \ge 0} \left\| \bar{D}_p - \bar{U}_p \bar{V}_p \right\|_F^2$$
where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $\bar{U}_p \in \mathbb{R}^{M_p \times K}$ is the coefficient matrix obtained from the decomposition, $\bar{V}_p \in \mathbb{R}^{K \times N}$ is the reconstruction matrix obtained from the decomposition, and $K$ is the dimension of the latent space.
To reduce the noise caused by statistical machine translation errors, the present invention assumes that the reconstruction matrix $\bar{V}_p$ obtained from the matrix $\bar{D}_p$ ($p \in [2, L]$) should be as close as possible to the reconstruction matrix $\bar{V}_1$ obtained from the matrix $\bar{D}_1$. The present invention therefore proposes to minimize the distance between each reconstruction matrix $\bar{V}_p$ ($p \in [2, L]$) and the reconstruction matrix $\bar{V}_1$:
$$F'(\bar{V}_p) = \min_{\bar{V}_p \ge 0} \sum_{p=2}^{L} \left\| \bar{V}_p - \bar{V}_1 \right\|_F^2$$
Combining the two objective functions above gives the following objective function:
$$F''(\bar{U}_1, \dots, \bar{U}_L; \bar{V}_1, \dots, \bar{V}_L) = \sum_{p=1}^{L} \left\| \bar{D}_p - \bar{U}_p \bar{V}_p \right\|_F^2 + \sum_{p=2}^{L} \lambda_p \left\| \bar{V}_p - \bar{V}_1 \right\|_F^2$$
where the parameters $\lambda_p$ ($p \in [2, L]$) adjust the relative weight of the two parts. If $\lambda_p$ is set to a small value, the objective function $F''$ is close to conventional non-negative matrix factorization (NMF); if $\lambda_p$ is set to a larger value, $F''$ places more emphasis on the errors introduced by statistical machine translation, that is, on keeping each $\bar{V}_p$ close to $\bar{V}_1$.
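As a minimal sketch of how the combined objective $F''$ could be evaluated with NumPy: the function name `objective`, the list-based layout (index 0 holding the source language), and the random toy matrices are assumptions for illustration and do not come from the patent.

```python
import numpy as np

def objective(D, U, V, lam):
    """Evaluate F'' for lists D, U, V indexed by language (index 0 = source).

    lam[p] weights the alignment term of language p; lam[0] is unused.
    """
    n_lang = len(D)
    # Reconstruction error summed over all languages.
    recon = sum(np.linalg.norm(D[p] - U[p] @ V[p], "fro") ** 2 for p in range(n_lang))
    # Distance of each translated reconstruction matrix from the source one.
    align = sum(lam[p] * np.linalg.norm(V[p] - V[0], "fro") ** 2 for p in range(1, n_lang))
    return recon + align

# Toy data (hypothetical): two languages sharing the same latent matrix.
rng = np.random.default_rng(0)
K, N = 2, 4
V_shared = rng.random((K, N))
U = [rng.random((5, K)), rng.random((6, K))]
V = [V_shared.copy(), V_shared.copy()]
D = [U[p] @ V[p] for p in range(2)]
lam = [0.0, 0.5]
print(objective(D, U, V, lam))  # an exact factorization with identical V_p gives zero
```

Moving any $\bar{V}_p$ away from $\bar{V}_1$ raises the second term in proportion to $\lambda_p$, which is how the regularizer penalizes disagreement between a translation and the source language.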
The optimization solving module solves for the parameters of the matrix decomposition module, namely the coefficient matrices $\bar{U}_1, \dots, \bar{U}_L$ and the reconstruction matrices $\bar{V}_1, \dots, \bar{V}_L$. This module yields locally optimal values of the coefficient matrices and reconstruction matrices, which constitute the output of the offline part.
The optimization solving module is one of the core techniques of the present invention. The objective function $F''$ accounts for both data sparsity and statistical machine translation errors. It contains $2L$ optimization variables, and when all $\bar{U}_p$ and $\bar{V}_p$ are considered simultaneously it is difficult to find an algorithm that solves the minimization problem. The present invention therefore proposes a fast gradient descent algorithm based on least squares to find a locally optimal solution: when one variable is optimized, the other $2L-1$ variables are held fixed.
Holding all other matrices fixed, the iterative update of the coefficient matrix $\bar{U}_p$ reduces the objective function $F''$ to the following optimization problem:
$$\min_{\bar{U}_p \ge 0} \left\| \bar{D}_p - \bar{U}_p \bar{V}_p \right\|_F^2$$
Let $\bar{d}_i^{(p)}$ be the column vector containing all elements of the $i$-th row of the matrix $\bar{D}_p$, and let $\bar{u}_i^{(p)}$ be the column vector containing all elements of the $i$-th row of the coefficient matrix $\bar{U}_p$. The optimization problem above then decomposes into $M_p$ mutually independent subproblems, each corresponding to one row of the coefficient matrix $\bar{U}_p$:
$$\min_{\bar{u}_i^{(p)} \ge 0} \left\| \bar{d}_i^{(p)} - \bar{V}_p^{T} \bar{u}_i^{(p)} \right\|_2^2$$
for $i = 1, \dots, M_p$, where $M_p$ is the number of distinct words in $D_p$.
Each subproblem is a standard least squares problem, whose solution is:
$$\bar{u}_i^{(p)} = \left( \bar{V}_p \bar{V}_p^{T} \right)^{-1} \bar{V}_p \bar{d}_i^{(p)}$$
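The row-wise solution above can be applied to all rows of $\bar{U}_p$ at once. The following is a minimal NumPy sketch; the function name `update_U` is hypothetical, and clipping negative entries to zero is an assumption added here to respect the non-negativity constraint, since the closed form by itself does not enforce it.

```python
import numpy as np

def update_U(D_p, V_p):
    """Least squares update of the coefficient matrix with V_p held fixed.

    Row i of the result is (V_p V_p^T)^{-1} V_p d_i, where d_i is row i of D_p.
    """
    gram = V_p @ V_p.T                          # K x K
    U = np.linalg.solve(gram, V_p @ D_p.T).T    # M_p x K, all rows solved together
    # Assumption: clip negatives so that U stays non-negative.
    return np.maximum(U, 0.0)

# Toy check (hypothetical data): recover U from an exact factorization.
rng = np.random.default_rng(1)
U_true = rng.random((6, 2)) + 0.1
V_p = rng.random((2, 5))
U_est = update_U(U_true @ V_p, V_p)
print(np.allclose(U_est, U_true))
```

Using `np.linalg.solve` instead of forming the inverse explicitly gives the same result as the formula and is numerically more stable.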
Holding the coefficient matrices and the other reconstruction matrices fixed, the iterative update of the reconstruction matrix $\bar{V}_p$ reduces the objective function $F''$ to one of the following two optimization problems.
When $p \in [2, L]$, $F''$ reduces to the following objective function:
$$\min_{\bar{V}_p \ge 0} \left\| \bar{D}_p - \bar{U}_p \bar{V}_p \right\|_F^2 + \lambda_p \left\| \bar{V}_p - \bar{V}_1 \right\|_F^2$$
When $p = 1$, $F''$ reduces to the following objective function:
$$\min_{\bar{V}_1 \ge 0} \left\| \bar{D}_1 - \bar{U}_1 \bar{V}_1 \right\|_F^2 + \lambda_1 \left\| \bar{V}_1 \right\|_F^2$$
For the objective function of the first case, let $\bar{d}_j^{(p)}$ be the $j$-th column vector of the matrix $\bar{D}_p$ and $\bar{v}_j^{(p)}$ the $j$-th column vector of the reconstruction matrix $\bar{V}_p$. The objective function of the first case then decomposes into $N$ mutually independent subproblems, each corresponding to one column of the reconstruction matrix $\bar{V}_p$:
$$\min_{\bar{v}_j^{(p)} \ge 0} \left\| \bar{d}_j^{(p)} - \bar{U}_p \bar{v}_j^{(p)} \right\|_2^2 + \lambda_p \left\| \bar{v}_j^{(p)} - \bar{v}_j^{(1)} \right\|_2^2$$
for $j = 1, \dots, N$, where $N$ is the number of candidate answers in the set $D_p$.
Each subproblem is a standard least squares problem with $L_2$ regularization, whose solution is:
$$\bar{v}_j^{(p)} = \left( \bar{U}_p^{T} \bar{U}_p + \lambda_p \bar{I} \right)^{-1} \left( \bar{U}_p^{T} \bar{d}_j^{(p)} + \lambda_p \bar{v}_j^{(1)} \right)$$
where $p \in [2, L]$ indexes the $p$-th translated language and $\bar{I}$ is the identity matrix.
Similarly, the objective function of the second case can be solved in the same way, and its solution is:
$$\bar{v}_j^{(1)} = \left( \bar{U}_1^{T} \bar{U}_1 + \lambda_1 \bar{I} \right)^{-1} \bar{U}_1^{T} \bar{d}_j^{(1)}$$
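The two reconstruction-matrix updates can be sketched in NumPy as follows, solving all $N$ columns at once. The function names are hypothetical, and clipping negative entries is an assumption added to keep the matrices non-negative; the patent states only the closed-form solutions.

```python
import numpy as np

def update_V_translated(D_p, U_p, V_1, lam_p):
    """Case p >= 2: columns v_j = (U^T U + lam I)^{-1} (U^T d_j + lam v_j^(1))."""
    K = U_p.shape[1]
    lhs = U_p.T @ U_p + lam_p * np.eye(K)
    V = np.linalg.solve(lhs, U_p.T @ D_p + lam_p * V_1)
    return np.maximum(V, 0.0)  # assumption: clip to keep V non-negative

def update_V_source(D_1, U_1, lam_1):
    """Case p = 1: columns v_j = (U^T U + lam I)^{-1} U^T d_j."""
    K = U_1.shape[1]
    lhs = U_1.T @ U_1 + lam_1 * np.eye(K)
    return np.maximum(np.linalg.solve(lhs, U_1.T @ D_1), 0.0)

# Toy check (hypothetical data): a translation that factorizes exactly with
# the source reconstruction matrix is left unchanged for any lam_p.
rng = np.random.default_rng(2)
U_p = rng.random((6, 2))
V_1 = rng.random((2, 4))
print(np.allclose(update_V_translated(U_p @ V_1, U_p, V_1, 0.7), V_1))
```

In an alternating scheme, these updates would be interleaved with the coefficient-matrix update until the objective stops decreasing; the stopping rule is not specified in the text and would have to be chosen by the implementer.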
The query question translation module is used, in the online stage, to translate the query question into equivalent representations in the $L-1$ other languages with a statistical machine translation tool, which may be, for example, Google Translate.
As with the candidate answer translation module, the present invention uses the statistical machine translation tool Google Translate to translate the query question from one language into the other $L-1$ languages. For a given query question $q$, translation yields the query questions $q_2, \dots, q_L$ expressed in the other $L-1$ languages.
The similarity calculation module based on the low-dimensional space computes the similarity between the query question and the candidate answers in their low-dimensional representations.
The similarity calculation module based on the low-dimensional space is one of the key techniques of the present invention. A given query question $q$ and its translations $q_2, \dots, q_L$ in the other $L-1$ languages must be mapped into the low-dimensional space. For convenience, the symbol $q_1$ is used in place of the query question $q$ expressed in the source language, that is, $q = q_1$. The following formula maps $q_1$ into the low-dimensional space:
$$\bar{v}_{q_1} = \arg\min_{\bar{v} \ge 0} \left\| \bar{q}_1 - \bar{U}_1 \bar{v} \right\|_2^2 + \lambda_1 \left\| \bar{v} \right\|_2^2$$
where $\bar{q}_1$ is the vector representation of the query question $q_1$, $\bar{v}_{q_1}$ is the representation of $q_1$ in the low-dimensional space (its reconstruction vector), and $\bar{U}_1$ is the coefficient matrix for the source language obtained by the optimization solving module. For a candidate answer $d_1$, the result of the low-dimensional transformation produced by the matrix decomposition module can be used directly, namely the corresponding column $\bar{v}_{d_1}$ of $\bar{V}_1$. The similarity between the query question $q_1$ and the candidate answer $d_1$ in the low-dimensional space can be expressed by the cosine similarity:
$$s(q_1, d_1) = \frac{\langle \bar{v}_{q_1}, \bar{v}_{d_1} \rangle}{\left\| \bar{v}_{q_1} \right\|_2 \cdot \left\| \bar{v}_{d_1} \right\|_2}$$
where $s(q_1, d_1)$ is the similarity between the query question $q_1$ and the candidate answer $d_1$ in the low-dimensional space.
The translation $q_i$ ($i \in [2, L]$) of $q_1$ can be mapped into the low-dimensional space with the following formula:
$$\bar{v}_{q_i} = \arg\min_{\bar{v} \ge 0} \left\| \bar{q}_i - \bar{U}_i \bar{v} \right\|_2^2 + \lambda_i \left\| \bar{v} - \bar{v}_{q_1} \right\|_2^2$$
where $\bar{q}_i$ is the vector representation of the query question $q_i$. Similarly, for the translation $d_i$ ($i \in [2, L]$) of the candidate answer $d_1$, the result $\bar{v}_{d_i}$ of the low-dimensional transformation produced by the matrix decomposition module can be used directly. The similarity in the low-dimensional space between the translation $q_i$ of $q_1$ and the translation $d_i$ of $d_1$ is computed with the same cosine similarity.
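The query mapping and the cosine similarity can be sketched together. The text gives the mapping only as an arg-min; the sketch below assumes it is solved with the same regularized least squares closed form used for the reconstruction matrices, followed by clipping to zero. The names `fold_in`, `cosine`, and `anchor` are hypothetical.

```python
import numpy as np

def fold_in(q_vec, U, lam, anchor=None):
    """Map a query vector into the latent space.

    anchor is None for the source language (plain L2 penalty) and the
    source-language latent vector v_{q_1} for a translated query.
    """
    K = U.shape[1]
    if anchor is None:
        anchor = np.zeros(K)
    v = np.linalg.solve(U.T @ U + lam * np.eye(K), U.T @ q_vec + lam * anchor)
    return np.maximum(v, 0.0)  # assumption: clip to keep v non-negative

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy example (hypothetical data): a query generated from a known latent vector.
rng = np.random.default_rng(3)
U_1 = rng.random((6, 2))
v_true = np.array([0.3, 0.7])
v_q1 = fold_in(U_1 @ v_true, U_1, lam=0.0)
print(cosine(v_q1, v_true))
```

For a translated query, the same function would be called with the coefficient matrix of that language and `anchor=v_q1`, which pulls the translated representation toward the source-language one.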
The result ranking learning module merges the similarities obtained in the low-dimensional space for the $L$ languages and returns the highest-scoring candidate answers as the final answers. For a given query question $q_1$ and candidate answer $d_1$, the present invention designs the following ranking function:
$$\mathrm{Score}(q_1, d_1) = \bar{\theta} \cdot \Phi(q_1, d_1)$$
where $\mathrm{Score}(q_1, d_1)$ is the final score of the query question $q_1$ and the candidate answer $d_1$, $\bar{\theta}$ is the weight vector of the features, and $\Phi(q_1, d_1) = \{s(q_1, d_1), s(q_2, d_2), \dots, s(q_L, d_L)\}$ is the feature vector, which contains the similarities of $q_1$ and $d_1$ in the low-dimensional space for the $L$ languages. The parameter $\bar{\theta}$ is tuned by cross-validation, the strategy most commonly used in statistical machine learning. Finally, the candidate answers are sorted by $\mathrm{Score}(q_1, d_1)$ and the highest-scoring ones are returned as the final answers.
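A minimal sketch of the linear ranking step is given below. The weights `theta`, the candidate identifiers, and the similarity values are made-up illustrations; in the patent the weights are obtained by cross-validation, which is not shown here.

```python
def score(theta, sims):
    """Linear score: dot product of feature weights and per-language similarities."""
    return sum(t * s for t, s in zip(theta, sims))

def rank(theta, candidates, top_n=2):
    """Return the identifiers of the top_n candidates by descending score."""
    ordered = sorted(candidates.items(), key=lambda kv: score(theta, kv[1]), reverse=True)
    return [name for name, _ in ordered[:top_n]]

# Hypothetical weights for L = 5 languages and similarities for three candidates.
theta = [0.4, 0.2, 0.2, 0.1, 0.1]
candidates = {
    "a1": [0.9, 0.8, 0.7, 0.9, 0.8],
    "a2": [0.2, 0.3, 0.1, 0.2, 0.1],
    "a3": [0.6, 0.5, 0.6, 0.4, 0.5],
}
print(rank(theta, candidates))  # ['a1', 'a3']
```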
To demonstrate the performance of the device, the improvement that statistical machine translation brings to answer retrieval performance is verified experimentally.
The experimental data come from the Yahoo! Answers community question answering system. In this collection of historical questions, each question consists of four parts: the question title, the question category, the question description, and the answers. The data set used contains 1,232 user category labels and 2,288,607 question-answer pairs. To evaluate the effectiveness of the method, a further 252 query questions were selected as the test set. For each query question in the test set, the top 20 results were retrieved with a language model and then labeled manually by two annotators. A returned candidate answer was labeled "relevant" if it was similar to the query question and "irrelevant" otherwise. When the two annotators' labels conflicted, a third person made the final decision. When judging whether a candidate answer was similar to a query question, the annotators saw only the question itself.
In the present invention the parameter is set to $L = 5$; that is, the English text is translated into four other languages (Chinese, French, Italian, and German).
Let $Q_t$ denote the set of test questions. The present invention uses the following two evaluation metrics:
Mean average precision (MAP), computed as follows:
$$\mathrm{MAP}(Q_t) = \frac{1}{|Q_t|} \sum_{q \in Q_t} \frac{1}{m_q} \sum_{k=1}^{m_q} \mathrm{Precision}(R_k)$$
where $m_q$ is the number of questions relevant to the query question $q$, $R_k$ is the set consisting of the $k$-th relevant question in the retrieval results and all questions ranked before it, and $\mathrm{Precision}(R_k)$ is the proportion of questions in $R_k$ that are relevant to $q$. This metric reflects the average level of the test results as a whole.
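The MAP formula can be sketched in a few lines of Python. The function names and the toy rankings are assumptions for illustration; relevant items that are never retrieved contribute zero precision, which matches dividing by $m_q$.

```python
def average_precision(ranked, relevant):
    """Average of Precision(R_k) over the relevant items, divided by m_q."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at the k-th relevant item
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked results, set of relevant items), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical judgments for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```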
Precision@n (P@n): defined as the precision of the top $n$ results that the system returns for a query question. The Precision@n of the whole test set is the mean of the Precision@n values of all questions in the test set. For a single query question it is computed as follows:
$$P(q)@n = \frac{k}{n}$$
where $k$ is the number of relevant questions among the top $n$ questions returned by the retrieval system, and $n$ is the total number of questions returned. Therefore,
$$P@n = \frac{\sum_{q=1}^{|Q_t|} P(q)@n}{|Q_t|}$$
Since users examining retrieval results usually hope to find the information they need among the first few results, $n = 10$ is typically used.
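P@n can be sketched in the same style. The function names and the toy ranking are illustrative assumptions.

```python
def precision_at_n(ranked, relevant, n=10):
    """Fraction of the top n returned items that are relevant (k / n)."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def mean_precision_at_n(runs, n=10):
    """Average P(q)@n over all queries; runs holds (ranked, relevant) pairs."""
    return sum(precision_at_n(r, rel, n) for r, rel in runs) / len(runs)

# Hypothetical ranking of ten results with three relevant ones.
ranked = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10"]
relevant = {"r1", "r4", "r9"}
print(precision_at_n(ranked, relevant))  # 0.3
```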
By using statistical machine translation to represent words with their translations, the present invention effectively addresses the lexical ambiguity and lexical gap problems that exist between query questions and candidate answers. Table 1 gives the experimental results for answer retrieval performance using statistical machine translation.
Retrieval method    MAP                 P@10
TRLM                0.436               0.261
SMT                 0.564 (↑ 29.36%)    0.291 (↑ 11.49%)
Table 1: Answer retrieval performance using statistical machine translation
As shown in Table 1, TRLM denotes the traditional answer retrieval method based on monolingual translation, and SMT denotes the answer retrieval method using statistical machine translation proposed by the present invention. The comparison in Table 1 shows that the proposed method clearly improves answer retrieval performance: MAP improves by 29.36% and P@10 improves by 11.49%. The experimental results show that the present invention improves answer retrieval performance.
The experimental results in Table 1 show that the answer retrieval method using statistical machine translation performs well, which demonstrates that the method is effective.
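The percentages in Table 1 are relative improvements over the TRLM baseline, which can be checked with a short calculation. The helper name `relative_gain` is an assumption; the four scores are taken from Table 1.

```python
def relative_gain(baseline, new):
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0

map_gain = relative_gain(0.436, 0.564)   # MAP: TRLM vs. SMT
p10_gain = relative_gain(0.261, 0.291)   # P@10: TRLM vs. SMT
print(f"{map_gain:.2f}% {p10_gain:.2f}%")  # 29.36% 11.49%
```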
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and do not limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (12)

1. An answer retrieval method by statistical machine translation, comprising the following steps:
Step 1: using a statistical machine translation tool, translating all candidate answers expressed in the source language into multiple other languages;
Step 2: integrating the candidate answers expressed in each language, including the source language, into a framework based on non-negative matrix factorization;
Step 3: solving the framework based on non-negative matrix factorization using a fast gradient descent algorithm based on the least squares method, to obtain the low-dimensional representation of all candidate answers in each language;
Step 4: using the statistical machine translation tool, translating the query question expressed in the source language into the multiple other languages;
Step 5: using the low-dimensional representations of all candidate answers in each language obtained in Step 3, transforming the query question and its translations into the other languages into the low-dimensional space;
Step 6: according to the query question and its translations into the other languages, and the low-dimensional representations of the candidate answers corresponding to them, computing the similarity between the query question and its translations and their corresponding candidate answers, and obtaining the final retrieval result according to the similarity.
2. The method of claim 1, characterized in that the framework based on non-negative matrix factorization is expressed as follows:
$$F''(\bar{U}_1,\dots,\bar{U}_L;\bar{V}_1,\dots,\bar{V}_L)=\sum_{p=1}^{L}\|\bar{D}_p-\bar{U}_p\bar{V}_p\|_F^2+\sum_{p=2}^{L}\lambda_p\|\bar{V}_p-\bar{V}_1\|_F^2$$
where $F''(\bar{U}_1,\dots,\bar{U}_L;\bar{V}_1,\dots,\bar{V}_L)$ denotes the objective function of the framework; $L$ denotes the number of languages, including the source language; $\bar{D}_p$ denotes the $M_p \times N$ word-document matrix corresponding to the $p$-th language, where $M_p$ is the number of distinct words in the set of all candidate answers and $N$ is the number of candidate answers; each element of the vector $\bar{d}_i^{(p)}$, the column of $\bar{D}_p$ for the $i$-th candidate answer, corresponds to one word of that candidate answer, and its value represents the importance of the word in the $i$-th candidate answer; $\bar{U}_p$ denotes the coefficient matrix obtained by decomposing $\bar{D}_p$, and $\bar{V}_p$ denotes the reconstruction matrix obtained by decomposing $\bar{D}_p$; $\|\cdot\|_F$ denotes the matrix norm; the parameter $\lambda_p$ adjusts the relative weight of the two terms; and $\bar{V}_1$ denotes the reconstruction matrix corresponding to the source language.
3. The method of claim 2, characterized in that solving the framework based on non-negative matrix factorization with the fast gradient descent algorithm based on the least squares method specifically consists of finding a locally optimal solution for $\bar{U}_p$ and $\bar{V}_p$; wherein, when optimizing the $p$-th coefficient matrix $\bar{U}_p$, the remaining matrices are held fixed and $\bar{U}_p$ is updated iteratively, so that the objective function $F''$ reduces to the following optimization problem:
$$\min_{\bar{U}_p\ge 0}\|\bar{D}_p-\bar{U}_p\bar{V}_p\|_F^2.$$
4. The method of claim 3, characterized in that, when optimizing the $p$-th reconstruction matrix $\bar{V}_p$, the coefficient matrices and the remaining reconstruction matrices are held fixed and $\bar{V}_p$ is updated iteratively, so that the objective function $F''$ reduces to the following two classes of optimization problems:
First class of optimization problem: when $p \in [2, L]$, $F''$ is converted into the following objective function:
$$\min_{\bar{V}_p\ge 0}\|\bar{D}_p-\bar{U}_p\bar{V}_p\|_F^2+\lambda_p\|\bar{V}_p-\bar{V}_1\|_F^2$$
Second class of optimization problem: when $p = 1$, $F''$ is converted into the following objective function:
$$\min_{\bar{V}_1\ge 0}\|\bar{D}_1-\bar{U}_1\bar{V}_1\|_F^2+\lambda_1\|\bar{V}_1\|_F^2.$$
5. The method of claim 3, characterized in that, when the coefficient matrix $\bar{U}_p$ is updated iteratively, the optimization problem of the objective function decomposes into $M_p$ mutually independent sub-problems, each corresponding to one row of the coefficient matrix $\bar{U}_p$:
$$\min_{\bar{u}_i^{(p)}\ge 0}\|\bar{d}_i^{(p)}-\bar{V}_p^{T}\bar{u}_i^{(p)}\|_2^2$$
where $\bar{d}_i^{(p)}$ is a column vector holding all elements of the $i$-th row of the matrix $\bar{D}_p$, and $\bar{u}_i^{(p)}$ is a column vector holding all elements of the $i$-th row of the coefficient matrix $\bar{U}_p$.
6. The method of claim 4, characterized in that, when the reconstruction matrix $\bar{V}_p$ is updated iteratively, the first class of optimization problem decomposes into $N$ mutually independent sub-problems, each corresponding to one column of the reconstruction matrix $\bar{V}_p$:
$$\min_{\bar{v}_j^{(p)}\ge 0}\|\bar{d}_j^{(p)}-\bar{U}_p\bar{v}_j^{(p)}\|_2^2+\lambda_p\|\bar{v}_j^{(p)}-\bar{v}_j^{(1)}\|_2^2$$
where $\bar{d}_j^{(p)}$ is defined as the $j$-th column vector of the matrix $\bar{D}_p$, and $\bar{v}_j^{(p)}$ denotes the $j$-th column vector of the reconstruction matrix $\bar{V}_p$;
likewise, the second class of optimization problem can be solved in the same way as the first class.
7. The method of claim 5, characterized in that the numerical solution of each of the $M_p$ mutually independent sub-problems is:
$$\bar{u}_i^{(p)}=(\bar{V}_p\bar{V}_p^{T})^{-1}\bar{V}_p\bar{d}_i^{(p)}$$
8. The method of claim 6, characterized in that the numerical solution of the first class of optimization problem is:
$$\bar{v}_j^{(p)}=(\bar{U}_p^{T}\bar{U}_p+\lambda_p\bar{I})^{-1}(\bar{U}_p^{T}\bar{d}_j^{(p)}+\lambda_p\bar{v}_j^{(1)})$$
where $p \in [2, L]$ denotes the $p$-th language after translation and $\bar{I}$ denotes the identity matrix;
and the numerical solution of the second class of optimization problem is:
$$\bar{v}_j^{(1)}=(\bar{U}_1^{T}\bar{U}_1+\lambda_1\bar{I})^{-1}\bar{U}_1^{T}\bar{d}_j^{(1)}.$$
9. The method of claim 2, characterized in that, in Step 5, the query question is transformed into the low-dimensional space using the low-dimensional representations of all candidate answers in each language, computed as follows:
$$\bar{v}_{q_1}=\arg\min_{\bar{v}\ge 0}\|\bar{q}_1-\bar{U}_1\bar{v}\|_2^2+\lambda_1\|\bar{v}\|_2^2$$
where $\bar{q}_1$ is the vector representation of the query question $q_1$, $\bar{v}_{q_1}$ is the vector representation of $q_1$ in the low-dimensional space, $\bar{U}_1$ denotes the coefficient matrix corresponding to the source language, $\bar{v}$ denotes a low-dimensional vector representation of $q_1$, and the parameter $\lambda_1$ adjusts the relative weight of the two terms.
10. The method of claim 2, characterized in that, in Step 5, the translations into the other languages are transformed into the low-dimensional space using the low-dimensional representations of all candidate answers in each language, expressed as follows:
$$\bar{v}_{q_i}=\arg\min_{\bar{v}\ge 0}\|\bar{q}_i-\bar{U}_i\bar{v}\|_2^2+\lambda_i\|\bar{v}-\bar{v}_{q_1}\|_2^2$$
where $\bar{q}_i$ is the vector representation of $q_i$, the translation of the query question into another language; $\bar{U}_i$ denotes the coefficient matrix corresponding to the translation $q_i$; $\bar{v}$ denotes a low-dimensional vector representation of the translation $q_i$ of the query question $q_1$; $\bar{v}_{q_1}$ denotes the optimal low-dimensional vector representation of $q_1$; and the parameter $\lambda_i$ adjusts the relative weight of the two terms.
11. The method of claim 1, characterized in that the similarity between the query question $q_1$ and a candidate answer $d_1$ in the low-dimensional space is computed as follows:
$$s(q_1,d_1)=\frac{\langle\bar{v}_{q_1},\bar{v}_{d_1}\rangle}{\|\bar{v}_{q_1}\|_2\cdot\|\bar{v}_{d_1}\|_2}$$
where $s(q_1, d_1)$ denotes the similarity between the query question $q_1$ and the candidate answer $d_1$ in the low-dimensional space, and $\bar{v}_{q_1}$ and $\bar{v}_{d_1}$ denote the vector representations of $q_1$ and $d_1$ in the low-dimensional space, respectively;
likewise, the similarity in the low-dimensional space between the translation $q_i$ of the query question $q_1$ and the translation $d_i$ of the candidate answer $d_1$ is computed in the same way.
12. An answer retrieval device by statistical machine translation, comprising:
a candidate answer translation module, configured to translate candidate answers into other languages;
a matrix decomposition module, configured to integrate the candidate answers expressed in each language, including the source language, into a framework based on non-negative matrix factorization;
an optimization solving module, configured to solve the framework based on non-negative matrix factorization using a fast gradient descent algorithm based on the least squares method, to obtain the low-dimensional representation, in each language, of all candidate answers of each question;
a query question translation module, configured to translate the query question into other languages;
a similarity calculation module based on the low-dimensional space, configured to transform the query question into the low-dimensional space and to compute the similarity between the query question and the candidate answers in the low-dimensional space;
a result ranking learning module, configured to obtain the final retrieved answers according to the similarity computed by the similarity calculation module.
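The closed-form updates of claims 7 and 8, the query projection of claims 9 and 10, and the cosine similarity of claim 11 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the patented implementation: NumPy, all function names, the latent dimension `K`, the random initialisation, the fixed iteration count, the small diagonal jitter, and the clipping of negative entries to zero (used here as a simple way to respect the non-negativity constraints) are assumptions that the claims do not specify.

```python
import numpy as np


def update_U(D, V):
    """Claim 7, all rows at once: u_i = (V V^T)^{-1} V d_i, then clip at 0 (assumption)."""
    K = V.shape[0]
    G = V @ V.T + 1e-9 * np.eye(K)          # jitter for invertibility (assumption)
    return np.maximum(np.linalg.solve(G, V @ D.T).T, 0.0)


def update_V(D, U, lam, V_ref=None):
    """Claim 8: v_j = (U^T U + lam I)^{-1} (U^T d_j + lam v_j^(1)); V_ref=None is the p=1 case."""
    K = U.shape[1]
    A = U.T @ U + lam * np.eye(K)
    B = U.T @ D
    if V_ref is not None:
        B = B + lam * V_ref
    return np.maximum(np.linalg.solve(A, B), 0.0)   # clipping is an assumption


def factorize(Ds, K, lams, iters=50, seed=0):
    """Alternate the two updates over all languages; Ds[0] is the source language."""
    rng = np.random.default_rng(seed)
    N = Ds[0].shape[1]
    Vs = [rng.random((K, N)) for _ in Ds]
    Us = [None] * len(Ds)
    for _ in range(iters):
        for p, D in enumerate(Ds):
            Us[p] = update_U(D, Vs[p])
        Vs[0] = update_V(Ds[0], Us[0], lams[0])
        for p in range(1, len(Ds)):
            Vs[p] = update_V(Ds[p], Us[p], lams[p], Vs[0])
    return Us, Vs


def project_query(q, U, lam, v_ref=None):
    """Claims 9 and 10: fold a query vector into the low-dimensional space."""
    ref = None if v_ref is None else v_ref.reshape(-1, 1)
    return update_V(q.reshape(-1, 1), U, lam, ref).ravel()


def cosine(a, b):
    """Claim 11: inner product divided by the product of the 2-norms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy data: 4 candidate answers, 6 source-language words, 5 translated words.
rng = np.random.default_rng(1)
D1, D2 = rng.random((6, 4)), rng.random((5, 4))
Us, Vs = factorize([D1, D2], K=3, lams=[0.1, 0.1])

q1 = D1[:, 2]                                # reuse answer 2 as a stand-in query
v_q1 = project_query(q1, Us[0], 0.1)
scores = [cosine(v_q1, Vs[0][:, j]) for j in range(4)]
print(np.argsort(scores)[::-1])              # candidate indices, best first
```

A translated query would be projected the same way by passing the source-language projection as `v_ref`, for example `project_query(q2, Us[1], 0.1, v_q1)` for a hypothetical translated query vector `q2`.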
CN201310180146.4A 2013-05-15 2013-05-15 Answer search method and device by the aid of statistical machine translation Active CN103235833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310180146.4A CN103235833B (en) 2013-05-15 2013-05-15 Answer search method and device by the aid of statistical machine translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310180146.4A CN103235833B (en) 2013-05-15 2013-05-15 Answer search method and device by the aid of statistical machine translation

Publications (2)

Publication Number Publication Date
CN103235833A CN103235833A (en) 2013-08-07
CN103235833B true CN103235833B (en) 2017-02-08

Family

ID=48883874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310180146.4A Active CN103235833B (en) 2013-05-15 2013-05-15 Answer search method and device by the aid of statistical machine translation

Country Status (1)

Country Link
CN (1) CN103235833B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782789A (en) * 2020-07-03 2020-10-16 江苏瀚涛软件科技有限公司 Intelligent question and answer method and system
CN112182439B (en) * 2020-09-30 2023-05-23 中国人民大学 Search result diversification method based on self-attention network
US12027070B2 (en) 2022-03-15 2024-07-02 International Business Machines Corporation Cognitive framework for identification of questions and answers

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
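《Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives》;Guangyou Zhou et al.;《Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics》;20110624;pp. 653-662 *
《互联网机器翻译》 (Internet Machine Translation);王海峰,吴华,刘占一;《中文信息学报》 (Journal of Chinese Information Processing);20111130;Vol. 25, No. 6;pp. 72-80 *
《非负矩阵分解及其应用现状分析》 (Analysis of Non-negative Matrix Factorization and the Current State of Its Applications);徐泰燕,郝玉龙;《武汉工业学院学报》 (Journal of Wuhan Polytechnic University);20100331;Vol. 29, No. 1;pp. 109-114 *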

Also Published As

Publication number Publication date
CN103235833A (en) 2013-08-07

Similar Documents

Publication Publication Date Title
CN109271505B (en) Question-answering system implementation method based on question-answer pairs
CN107239446B (en) A kind of intelligence relationship extracting method based on neural network Yu attention mechanism
CN107562792B (en) question-answer matching method based on deep learning
CN109271506A (en) A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning
CN109255031A (en) The data processing method of knowledge based map
CN106055675B (en) A kind of Relation extraction method based on convolutional neural networks and apart from supervision
CN102662931B (en) Semantic role labeling method based on synergetic neural network
CN105930452A (en) Smart answering method capable of identifying natural language
CN112434517B (en) Community question-answering website answer ordering method and system combined with active learning
CN110287482B (en) Semi-automatic participle corpus labeling training device
DE112013004082T5 (en) Search system of the emotion entity for the microblog
CN113112164A (en) Transformer fault diagnosis method and device based on knowledge graph and electronic equipment
CN111339269B (en) Knowledge graph question-answering training and application service system capable of automatically generating templates
CN113962219A (en) Semantic matching method and system for knowledge retrieval and question answering of power transformer
CN111125316B (en) Knowledge base question-answering method integrating multiple loss functions and attention mechanism
CN113076411B (en) Medical query expansion method based on knowledge graph
CN111782769A (en) Intelligent knowledge graph question-answering method based on relation prediction
CN112115242A (en) Intelligent customer service question-answering system based on naive Bayes classification algorithm
CN115840812A (en) Method and system for intelligently matching enterprises according to policy text
CN113468304A (en) Construction method of ship berthing knowledge question-answering query system based on knowledge graph
CN103235833B (en) Answer search method and device by the aid of statistical machine translation
CN115422323A (en) Intelligent intelligence question-answering method based on knowledge graph
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112417170B (en) Relationship linking method for incomplete knowledge graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant