CN106844356A

CN106844356A - A method for improving English-Chinese machine translation quality based on data selection

Info

Publication number: CN106844356A
Application number: CN201710031264.7A
Authority: CN
Inventors: 程国艮; 汪鸣; 汪一鸣
Original assignee: Mandarin Technology (beijing) Co Ltd
Current assignee: Mandarin Technology (beijing) Co Ltd
Priority date: 2017-01-17
Filing date: 2017-01-17
Publication date: 2017-06-13
Anticipated expiration: 2037-01-17
Also published as: CN106844356B

Abstract

The invention discloses a method for improving English-Chinese machine translation quality based on data selection. The method comprises: re-expressing the data in bag-of-words form; using the cosine to express the distance between sentences, then computing a final score for each sentence pair from the cosine values; and ranking the general data by score and selecting the relevant data to train the machine translation system. On the one hand, the invention reduces the time cost and storage cost of training a statistical machine translation system, because it uses less training data than a system trained on multi-domain general data. On the other hand, because the selected data all come from the same domain as the data to be tested and are related to it in content, a system trained on the data selected by this method should in theory outperform a machine translation system trained on all the data.

Description

A method for improving English-Chinese machine translation quality based on data selection

Technical field

The invention belongs to the technical field of data selection, and in particular relates to a method for improving English-Chinese machine translation quality based on data selection.

Background technology

Since IBM proposed its statistical models, statistical machine translation has gradually replaced rule-based translation and become the mainstream machine translation method at this stage. Its basic idea is to use statistical methods to learn translation knowledge automatically from large-scale bilingual corpora and build a translation model.

In traditional statistical machine translation, the quality of the corpus directly determines the quality of the final translation system. In this era of information explosion, the information on the internet grows exponentially, which also provides machine translation with large amounts of monolingual and bilingual corpora.

In theory, the quality of a translation system keeps improving as the amount of training data increases. However, experiments have shown that once the training data reaches a certain order of magnitude, adding more training data yields only a very small improvement in the translation results and can sometimes even reduce translation quality. The quality of a translation system is therefore not related only to the quantity of training data. Data on the internet come from complex sources and often belong to different domains in content, including politics, economics, tourism, entertainment, and so on. When the test data and the training data belong to the same domain, the results are often better than those of a translation system trained on multiple domains or on another single domain. For example, if the test set comes from the political domain, an English-Chinese translation system trained on 5 million items of political-domain data should in theory perform better than one trained on 5 million items of entertainment-domain data.

To sum up these two points: more training data is not always better, and an English word may have different Chinese translations in different domains, so increasing the quantity of data sometimes makes the translation system perform worse. Domain adaptation based on data selection was proposed to solve this problem. Its core idea is to select, from a multi-domain dataset, the training data related to the test data, train a translation system on the selected data, and then use this system to translate the data to be translated.

In summary, most machine translation systems at this stage are trained on tens of millions or even hundreds of millions of bilingual data items. The whole training process requires a large amount of training time and a large amount of disk space to store the data and models. At the same time, a translation system trained on a large amount of multi-domain data cannot achieve the best translation results for a particular domain; in fact, the best results can be obtained by using only part of these data or by giving specific data a higher weight.

Summary of the invention

The object of the invention is to provide a method for improving English-Chinese machine translation quality based on data selection, aiming to solve the problems that most current machine translation systems require a large amount of training time and a large amount of disk space to store data and models, and that a translation system trained on a large amount of multi-domain training data cannot achieve the best translation results for a particular domain.

The invention is realized as follows.

A method for improving English-Chinese machine translation quality based on data selection, the method comprising:

Step 1: re-expressing the data in bag-of-words form;

Step 2: using the cosine to express the distance between sentences, then computing a final score for each sentence pair from the cosine values;

Step 3: ranking the general data by score, and selecting the relevant data to train the machine translation system.

Further, before the data are converted into bag-of-words form, three kinds of data are prepared. The first is general data covering all domains. The second is in-domain data, that is, data related to the data to be tested or data from the specific domain. The third is out-of-domain data, that is, data unrelated to the test set data or completely unrelated to the specific domain.

Further, in Step 1, re-expressing the data in bag-of-words form specifically comprises:

Converting all three kinds of data into bag-of-words form. The bag of words is a matrix with one row and N columns, where N equals the total number of distinct words in the whole dataset.

Suppose a sentence has n words and S_i denotes its i-th word, i ∈ {1, 2, 3, ..., n}. Suppose the bag of words has m columns in total, V_j denotes the word represented by the j-th column, and V_jc denotes the final value of that column, V_jc ∈ {0, 1}. The final value of the j-th column among the m columns is then given by:

V_{jc} = \begin{cases} 1, & S_{i} = V_{j} \text{ for some } i \\ 0, & \text{otherwise} \end{cases}

That is, for each sentence, if it contains the word corresponding to the j-th column, the value of that column is 1; otherwise it is 0.

For example, if a dataset contains the two sentences "I am a boy" and "I am a girl", it contains five distinct words in total (I, am, a, boy, girl), so N is 5. Suppose the five columns correspond to the words I, am, a, boy, and girl respectively. The bag-of-words representation of the first sentence is then (1,1,1,1,0), and the second sentence is represented as (1,1,1,0,1).
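The conversion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_vocabulary` and `to_bag_of_words` are hypothetical, and splitting on whitespace is an assumption that works for the English example; Chinese text would first need a word segmenter, which the source does not specify.

```python
def build_vocabulary(sentences):
    """Map each distinct word to a column index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():  # assumption: whitespace tokenization
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def to_bag_of_words(sentence, vocab):
    """Return a binary vector: 1 where the column's word occurs, else 0."""
    vector = [0] * len(vocab)
    for word in sentence.split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector


corpus = ["I am a boy", "I am a girl"]
vocab = build_vocabulary(corpus)
vectors = [to_bag_of_words(s, vocab) for s in corpus]
print(vectors)  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 1]]
```

The vocabulary here has five entries, so each vector has N = 5 columns, matching the example in the text.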

Further, in Step 2, using the cosine to express the distance between sentences and then computing a final score for each sentence pair from the cosine values specifically comprises:

Relevance is compared using the cosine value. The cosine value of any two sentences is computed with the following formula:

C = \frac{\sum_{i=1}^{n} S_{i} T_{i}}{\sqrt{\sum_{i=1}^{n} S_{i}^{2}} \sqrt{\sum_{i=1}^{n} T_{i}^{2}}}

where S and T are the vectors of the two sentences, and S_i and T_i are the values in the i-th column of those vectors. For each source-language sentence in the general data, its cosine value with every source-language sentence of the in-domain data is computed, and all the cosine values for that sentence are then summed and averaged:

P_{IS} = \frac{\sum_{j=1}^{m} C_{j}}{m}

where C_j is the cosine value between the sentence and the j-th sentence of the in-domain data, and m is the number of sentences in the in-domain data. The same operation is performed against every source-language sentence of the out-of-domain data to obtain P_OS, and likewise on the target-language side to obtain P_IT and P_OT. The final score of the sentence pair is given by:

P = P_{IS} - P_{OS} + P_{IT} - P_{OT},

where P_IS is the probability that the sentence belongs to the in-domain data on the source-language side, P_OS is the probability that it belongs to the out-of-domain data on the source-language side, and P_IT and P_OT are the probabilities that it belongs to the in-domain and out-of-domain data, respectively, on the target-language side.
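The scoring in Step 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patentee's implementation: the function names `cosine`, `mean_cosine`, and `pair_score` are hypothetical, returning 0.0 for an all-zero vector or an empty reference set is an assumption the source does not address, and the toy vectors are invented for illustration.

```python
import math


def cosine(s, t):
    """Cosine of two equal-length bag-of-words vectors."""
    dot = sum(a * b for a, b in zip(s, t))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_t = math.sqrt(sum(b * b for b in t))
    if norm_s == 0 or norm_t == 0:
        return 0.0  # assumption: an all-zero vector is treated as unrelated
    return dot / (norm_s * norm_t)


def mean_cosine(vector, reference_vectors):
    """Average cosine of one sentence against every reference sentence."""
    if not reference_vectors:
        return 0.0  # assumption: an empty reference set contributes nothing
    total = sum(cosine(vector, ref) for ref in reference_vectors)
    return total / len(reference_vectors)


def pair_score(src, tgt, in_src, out_src, in_tgt, out_tgt):
    """P = P_IS - P_OS + P_IT - P_OT for one general-data sentence pair."""
    p_is = mean_cosine(src, in_src)    # source side vs. in-domain data
    p_os = mean_cosine(src, out_src)   # source side vs. out-of-domain data
    p_it = mean_cosine(tgt, in_tgt)    # target side vs. in-domain data
    p_ot = mean_cosine(tgt, out_tgt)   # target side vs. out-of-domain data
    return p_is - p_os + p_it - p_ot


# Toy vectors (hypothetical); source and target sides use separate vocabularies.
score = pair_score(
    src=[1, 1, 1, 1, 0], tgt=[1, 0, 0],
    in_src=[[1, 1, 1, 0, 1]], out_src=[[0, 0, 0, 0, 1]],
    in_tgt=[[1, 0, 0]], out_tgt=[[0, 1, 1]],
)
print(score)
```

Comparing every general-data sentence with every in-domain and out-of-domain sentence is quadratic in corpus size, so a practical system would likely use sparse vectors or batched matrix operations; the source does not say how this is handled.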

Further, in Step 3, ranking the general data by score and selecting the relevant data to train the machine translation system specifically comprises:

Selecting the specific data. After Step 2, every sentence pair in the general data has a final score. When the cosine value is used to express the distance between two sentences, a larger value means the two sentences are more similar, so all sentence pairs in the data are sorted by this score from high to low, and a specific proportion of the data is selected as the final training data. The final training data may be a specific top N sentence pairs or a specific percentage of the data. The translation model is obtained by extracting the word alignment of each sentence pair in the training data together with the corresponding probabilities; the language model is trained by counting n-gram frequencies in the target-language monolingual data; and the reordering model is trained from the recombinations of phrases or words that may arise during phrase extraction.
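The ranking and selection in Step 3 can be sketched as below. This is an illustrative sketch only: the function name `select_training_data` and its parameters `top_n` and `fraction` are hypothetical, and the scored pairs are made-up values standing in for the output of Step 2.

```python
def select_training_data(scored_pairs, top_n=None, fraction=None):
    """Sort (score, source, target) tuples from high to low and keep the best.

    Exactly one of top_n (an absolute count) or fraction (a share of the
    data between 0 and 1) must be given, mirroring the two options in the text.
    """
    if (top_n is None) == (fraction is None):
        raise ValueError("specify exactly one of top_n or fraction")
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    if top_n is None:
        top_n = int(len(ranked) * fraction)
    return ranked[:top_n]


# Hypothetical scores for four general-data sentence pairs.
pairs = [
    (0.2, "src a", "tgt a"),
    (1.5, "src b", "tgt b"),
    (-0.3, "src c", "tgt c"),
    (0.9, "src d", "tgt d"),
]
best_two = select_training_data(pairs, top_n=2)
best_half = select_training_data(pairs, fraction=0.5)
print(best_two)
```

The selected pairs would then be passed to whatever statistical machine translation toolkit is used for training; the source does not name one.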

The invention provides a domain adaptation method based on data selection that can select relatively effective training data; training the translation system on this part of the data improves the performance of the English-Chinese translation system.

On the one hand, this domain adaptation method saves time and space; on the other hand, it allows the trained translation system to outperform a translation system trained on all the data.

On the one hand, the invention reduces the time cost and storage cost of training a statistical machine translation system, because it uses less training data than a system trained on multi-domain general data. On the other hand, because the selected data all come from the same domain as the data to be tested and are related to it in content, a system trained on the data selected by this method should in theory outperform a machine translation system trained on all the data. In terms of time cost, training a translation system on 20 million general sentence pairs takes about 24 hours, whereas with this method about 5 million items can be selected to train a better-performing domain-specific translation system, and training on 5 million items with the same configuration and training parameters takes only about 4 hours. In terms of storage cost, the translation model, language model, and reordering model trained on 20 million items occupy 37 GB in total, while the three models produced from 5 million training items total about 9 GB. In the news domain, using the above data and data selection method improves the translation results on the test set by 1 to 2 BLEU points.

Brief description of the drawings

Fig. 1 is a flowchart of the method for improving English-Chinese machine translation quality based on data selection provided by an embodiment of the invention.

Specific embodiment

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The application principle of the invention is described in detail below with reference to the accompanying drawing.

As shown in Fig. 1, the method for improving English-Chinese machine translation quality based on data selection provided by an embodiment of the invention comprises:

S101: Re-express the data in bag-of-words form.

S102: Use the cosine to express the distance between sentences, then compute a final score for each sentence pair from the cosine values.

S103: Rank the general data by score, and select the relevant data to train the machine translation system.

The application principle of the invention is further described below with reference to a specific embodiment.

Statistical machine translation is a data-driven method, so in theory the larger the amount of data, the better the machine translation system performs. The training data of most commercial systems on the market have reached the order of tens of millions or even hundreds of millions of items. Such a huge amount of data occupies a large amount of storage and also incurs a huge time cost. In practice, however, the quality of a translation system depends not only on the quantity of training data but also, to a large degree, on its quality. When the data to be tested and the training data are related in content, the test results are very likely to be good. Domain adaptation based on data selection was proposed to address this. Its main idea is to pick out, from a large multi-domain dataset, the data of a specific domain or the data related to the data to be tested, and to train the translation system on them. The method proposed by the invention is one such method.

The method proposed by the invention can be roughly divided into three steps.

The first step converts all the data into bag-of-words form.

The second step compares the relevance of each sentence in the general data to the in-domain data and to the out-of-domain data.

The third step ranks each English-Chinese sentence pair of the general data by the relevance score from the previous step, and finally trains the required machine translation system on the selected data.

The application principle of the invention is further described below with reference to data conversion.

Before data selection is carried out, three kinds of data are prepared. The first is general data, that is, data covering all domains; this part of the data is very large, and the final training data are chosen from it. The second is in-domain data, that is, data related to the data to be tested or data from the specific domain. The last is out-of-domain data, that is, data unrelated to the test set data or completely unrelated to the specific domain.

In this step, all three kinds of data are converted into bag-of-words form. The bag of words is a matrix representation: a matrix with one row and N columns, where N equals the total number of distinct words in the whole dataset. For each sentence, if it contains the word corresponding to a column, the value of that column is 1; otherwise it is 0. For example, if a dataset contains the two sentences "I am a boy" and "I am a girl", it contains five distinct words in total (I, am, a, boy, girl), so N is 5. Suppose the five columns correspond to the words I, am, a, boy, and girl respectively. The bag-of-words representation of the first sentence is then (1,1,1,1,0), and the second sentence is represented as (1,1,1,0,1).

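The application principle of the invention is further described below with reference to relevance comparison.

This step mainly compares the relevance of each sentence in the general data to the in-domain data and to the out-of-domain data.

Relevance is mainly computed using the cosine value. The cosine value of any two sentences can be computed with the following formula:

C = \frac{\sum_{i=1}^{n} S_{i} T_{i}}{\sqrt{\sum_{i=1}^{n} S_{i}^{2}} \sqrt{\sum_{i=1}^{n} T_{i}^{2}}}

where S and T are the vectors of the two sentences, and S_i and T_i are the values in the i-th column of those vectors. For each source-language sentence in the general data, its cosine value with every source-language sentence of the in-domain data is computed, and all the cosine values for that sentence are then summed and averaged:

P_{IS} = \frac{\sum_{j=1}^{m} C_{j}}{m}

where C_j is the cosine value between the sentence and the j-th sentence of the in-domain data, and m is the number of sentences in the in-domain data. The same operation is performed against every source-language sentence of the out-of-domain data to obtain P_OS, and likewise on the target-language side to obtain P_IT and P_OT. The final score of the sentence pair is given by:

P = P_{IS} - P_{OS} + P_{IT} - P_{OT}.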
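As a worked illustration (not part of the original text), the cosine of the two example vectors from the data conversion step, S = (1,1,1,1,0) for "I am a boy" and T = (1,1,1,0,1) for "I am a girl", is:

C = \frac{1 \cdot 1 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1}{\sqrt{1 + 1 + 1 + 1 + 0} \sqrt{1 + 1 + 1 + 0 + 1}} = \frac{3}{\sqrt{4} \sqrt{4}} = \frac{3}{4} = 0.75

The two sentences share three of their four words, so the cosine is high but below 1, the value two identical sentences would give.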

The application principle of the invention is further described below with reference to data selection.

This step mainly selects the specific data. After the second step, every sentence pair in the general data has a final score. When the cosine value is used to express the distance between two sentences, a larger value means the two sentences are more similar, so all sentence pairs in the data are sorted by this score from high to low, and a specific proportion of the data is selected as the final training data; this may be a specific top N sentence pairs or a specific percentage of the data. The selected data are then used to train the translation model, language model, and reordering model of the machine translation system.

The invention re-expresses the data in bag-of-words form, uses the cosine to express the distance between sentences, computes a final score for each sentence pair from the cosine values, ranks the general data by score, and selects the relevant data to train the machine translation system.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for improving English-Chinese machine translation quality based on data selection, characterized in that the method comprises:

Step 1: re-expressing the data in bag-of-words form;

Step 2: using the cosine to express the distance between sentences, then computing a final score for each sentence pair from the cosine values;

Step 3: ranking the general data by score, and selecting the relevant data to train the machine translation system.

2. The method for improving English-Chinese machine translation quality based on data selection according to claim 1, characterized in that before the data are converted into bag-of-words form, three kinds of data are prepared: the first is general data covering all domains; the second is in-domain data, that is, data related to the data to be tested or data from the specific domain; the third is out-of-domain data, that is, data unrelated to the test set data or completely unrelated to the specific domain.

3. The method for improving English-Chinese machine translation quality based on data selection according to claim 1, characterized in that in Step 1, re-expressing the data in bag-of-words form specifically comprises:

Converting all three kinds of data into bag-of-words form; the bag of words is a matrix with one row and N columns, where N equals the total number of distinct words in the whole dataset;

Suppose a sentence has n words and S_i denotes its i-th word, i ∈ {1, 2, 3, ..., n}; suppose the bag of words has m columns in total, V_j denotes the word represented by the j-th column, and V_jc denotes the final value of that column, V_jc ∈ {0, 1}; the final value of the j-th column among the m columns is then given by:

V_{jc} = \begin{cases} 1, & S_{i} = V_{j} \text{ for some } i \\ 0, & \text{otherwise} \end{cases}

4. The method for improving English-Chinese machine translation quality based on data selection according to claim 1, characterized in that in Step 2, using the cosine to express the distance between sentences and then computing a final score for each sentence pair from the cosine values specifically comprises:

C = \frac{\sum_{i=1}^{n} S_{i} T_{i}}{\sqrt{\sum_{i=1}^{n} S_{i}^{2}} \sqrt{\sum_{i=1}^{n} T_{i}^{2}}}

where S and T are the vectors of the two sentences, and S_i and T_i are the values in the i-th column of those vectors; for each source-language sentence in the general data, its cosine value with every source-language sentence of the in-domain data is computed, and all the cosine values for that sentence are then summed and averaged:

P_{IS} = \frac{\sum_{j=1}^{m} C_{j}}{m}

where C_j is the cosine value between the sentence and the j-th sentence of the in-domain data, and m is the number of sentences in the in-domain data; the same operation is performed against every source-language sentence of the out-of-domain data to obtain P_OS, and likewise on the target-language side to obtain P_IT and P_OT; the final score of the sentence pair is given by:

P = P_{IS} - P_{OS} + P_{IT} - P_{OT},

where P_IS is the probability that the sentence belongs to the in-domain data on the source-language side, P_OS is the probability that it belongs to the out-of-domain data on the source-language side, and P_IT and P_OT are the probabilities that it belongs to the in-domain and out-of-domain data, respectively, on the target-language side.

5. The method for improving English-Chinese machine translation quality based on data selection according to claim 1, characterized in that in Step 3, ranking the general data by score and selecting the relevant data to train the machine translation system specifically comprises:

Selecting the specific data; after Step 2, every sentence pair in the general data has a final score; when the cosine value is used to express the distance between two sentences, a larger value means the two sentences are more similar, so all sentence pairs in the data are sorted by this score from high to low, and a specific proportion of the data is selected as the final training data; the final training data are a specific top N sentence pairs or a specific percentage of the data. The translation model is obtained by extracting the word alignment of each sentence pair in the training data together with the corresponding probabilities; the language model is trained by counting n-gram frequencies in the target-language monolingual data; and the reordering model is trained from the recombinations of phrases or words that may arise during phrase extraction.