CN111160046A - Data processing method and device and data processing device - Google Patents

Data processing method and device and data processing device

Info

Publication number
CN111160046A
CN111160046A (application CN201811320799.7A)
Authority
CN
China
Prior art keywords
corpus
translation model
pair
translation
probability value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811320799.7A
Other languages
Chinese (zh)
Inventor
施亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201811320799.7A
Publication of CN111160046A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a data processing method and device and a device for data processing. The method specifically comprises the following steps: determining, according to a translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair, wherein the first corpus and the second corpus have a translation relationship and the translation model is trained using corpus pairs in the corpus as samples; and filtering the corpus pairs in the corpus according to the probability value. The embodiment of the invention can filter out more corpus pairs whose errors are subtle semantic differences, thereby improving both the filtering accuracy and the translation quality of a machine translation system.

Description

Data processing method and device and data processing device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, and an apparatus for data processing.
Background
With increasing international information exchange, user demand for high-quality machine translation keeps growing. The translation quality of a machine translation system depends to a great extent on the quality of its bilingual corpus: the larger the amount of bilingual corpus data and the higher its quality, the better the machine translation system translates.
However, when a bilingual corpus is first established, it may already contain some incorrect corpus pairs. In addition, with the rise of internet technology, large numbers of corpora found online are gradually accumulated into the bilingual corpus, and because the quality of corpora on the internet is uneven, the bilingual corpus may come to contain even more incorrect corpus pairs. It is therefore necessary to filter out the incorrect corpus pairs in the bilingual corpus to improve the translation quality of the machine translation system.
At present, a rule-based method is usually adopted to filter the incorrect corpus pairs in a bilingual corpus. However, rule-based approaches can only filter out pairs with obvious errors, for example by checking whether the lengths of the two sentences in a pair differ greatly, or whether the words in the two sentences fail to correspond to each other. Such rules have difficulty filtering out pairs whose error is subtle. Consider a corpus pair whose first sentence means "we are eating" but whose second sentence is "In the morning, we eat": the two sentences are similar in length and their words largely correspond, yet the pairing is wrong. The correct correspondences would pair "we are eating" with "We are eating", and "in the morning, we eat" with "In the morning, we eat". A rule-based filter cannot identify such an incorrect corpus pair, so its filtering accuracy is low, which in turn harms the translation quality of the machine translation system.
Disclosure of Invention
The embodiments of the invention provide a data processing method and device and a data processing device, which can improve the accuracy of filtering a bilingual corpus and thereby improve the translation quality of a machine translation system.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, including:
determining, according to a translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair; wherein the first corpus and the second corpus have a translation relationship, and the translation model is trained using corpus pairs in the corpus as samples;
and filtering the corpus pairs in the corpus according to the probability value.
In another aspect, an embodiment of the invention discloses a data processing device, comprising:
a probability determining module, configured to determine, according to a translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair; wherein the first corpus and the second corpus have a translation relationship, and the translation model is trained using corpus pairs in the corpus as samples;
and a filtering module, configured to filter the corpus pairs in the corpus according to the probability value.
In yet another aspect, an embodiment of the present invention discloses an apparatus for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for:
determining, according to a translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair; wherein the first corpus and the second corpus have a translation relationship, and the translation model is trained using corpus pairs in the corpus as samples;
and filtering the corpus pairs in the corpus according to the probability value.
In yet another aspect, an embodiment of the invention discloses a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of the preceding aspects.
The embodiment of the invention has the following advantages:
according to the embodiments of the invention, the probability value of a first corpus corresponding to a second corpus in a corpus pair can be determined according to a translation model, and the corpus pairs in the corpus can be filtered according to that probability value. The first corpus and the second corpus have a translation relationship, and the translation model is trained using the corpus pairs in the corpus as samples. Because the corpus contains a large number of corpus pairs, a translation model trained on so many samples has a certain capacity for semantic understanding: it can recognize many semantic errors rather than merely applying rules such as sentence length or word-for-word correspondence. The probability value of the first corpus corresponding to the second corpus can therefore be computed more accurately according to the translation model, and filtering the corpus pairs by this probability value removes more of the pairs whose errors are subtle semantic differences, improving filtering accuracy and, in turn, the translation quality of the machine translation system.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without inventive labor.
FIG. 1 is a flow chart of the steps of one data processing method embodiment of the present invention;
FIG. 2 is a block diagram of an embodiment of a data processing apparatus according to the present invention;
FIG. 3 is a block diagram of an apparatus 800 for data processing of the present invention; and
FIG. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:
step 101, determining, according to a translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair; wherein the first corpus and the second corpus may have a translation relationship, and the translation model may be trained using corpus pairs in the corpus as samples;
optionally, the corpus pair samples may be derived from the corpus; alternatively, the corpus pair sample may be a filtered corpus pair in a corpus.
Step 102, filtering the corpus pairs in the corpus according to the probability value.
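The two steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `score_pair` stands in for a trained translation model, and the toy length-ratio scorer and threshold are assumptions.

```python
def filter_corpus(corpus_pairs, score_pair, threshold):
    """Step 101/102 sketch: score each pair with a translation model
    and keep only pairs whose probability exceeds the threshold."""
    return [(first, second) for first, second in corpus_pairs
            if score_pair(first, second) > threshold]

# Toy stand-in for a real translation model: scores by token-count
# ratio only. A trained model would score semantic correspondence.
def toy_score(first, second):
    a, b = len(first.split()), len(second.split())
    return min(a, b) / max(a, b)

pairs = [
    ("we are eating", "We are eating"),
    ("we are eating", "In the morning we eat at a very long table"),
]
print(filter_corpus(pairs, toy_score, 0.5))
```

With the toy scorer only the first, correctly matched pair survives; swapping in a real model-based scorer is what lets subtler semantic mismatches be caught as well.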
The embodiments of the invention can be used to filter incorrect corpus pairs out of a corpus so as to improve the translation quality of a machine translation system. A corpus pair may specifically include a first corpus and a second corpus: the language corresponding to the first corpus may be the source language and the language corresponding to the second corpus the target language, or the language corresponding to the first corpus may be the target language and the language corresponding to the second corpus the source language.
The embodiments of the present invention do not limit the types of the source language and target language, which may include Chinese, English, Japanese, Korean, German, French, and other languages, including minority languages. For example, the language corresponding to the first corpus in a pair may be the source language, such as Chinese, and the language corresponding to the second corpus the target language, such as English. As another example, the source language may be Japanese and the target language German, and so on.
It can be understood that the embodiments of the present invention do not limit the types of the corpora or of the corpus. For example, a corpus may be a vocabulary item, a sentence, or a paragraph, and the corpus may correspondingly be a bilingual vocabulary corpus, a bilingual sentence corpus, or a bilingual paragraph corpus. For convenience of description, the embodiments are described using sentence-level corpus pairs as an example; the processing of other types of corpus pairs is similar and the descriptions refer to each other.
In one application example of the present invention, consider a corpus pair whose first corpus is a Chinese sentence meaning "you must pay with yen" and whose second corpus is the English sentence "You must pay in Japanese yen."; the two are translations of each other.
Specifically, the embodiments of the invention can train a translation model using the corpus pairs in the corpus as samples; the translation model can then be used to determine the translation probability from a source-language sentence to a target-language sentence.
In the embodiment of the invention, the corpus pairs in the existing corpus can be used as samples to train the translation model. Of course, in practical applications, the corpus pair samples for training the translation model may also be obtained through other approaches, for example, a large number of corpus pairs may also be collected from the network as corpus pair samples for training the translation model.
It is to be understood that the embodiment of the present invention does not limit the specific way of training the translation model and the specific type of the translation model. For example, the translation model may be trained in a statistical-based manner, or in a neural network-based manner.
Statistics-based translation models are constructed by statistical analysis of a large number of corpus pairs. Such models evolved from early word-based translation models to phrase-based models, and then to models that gradually incorporate syntactic information, continuously improving the accuracy of machine translation. Statistical machine translation treats translating a source-language sentence into a target-language sentence as a probability problem: any target-language sentence may be a translation of any source-language sentence, only with different probabilities, and the goal is to find the target-language sentence with the maximum probability given the source-language sentence. Statistics-based translation models may include the noisy-channel model, IBM Models 1 through 5 proposed by researchers at IBM (International Business Machines Corporation), maximum-entropy-based models, and the like.
The core of neural machine translation is a deep neural network with massive numbers of nodes (neurons), through which translation knowledge can be learned automatically from a corpus. Neural-network-based translation models may include translation models based on RNNs (Recurrent Neural Networks), on LSTMs (Long Short-Term Memory networks), and the like.
Taking a phrase-based translation model as an example, the training process may specifically include: preprocessing of corpus pairs, word alignment, phrase extraction, phrase feature extraction, and minimum-error-rate training.
Preprocessing the corpus pairs means applying certain text normalization to the collected pairs: for example, performing morpheme segmentation on English corpora, such as splitting 's into a separate token or separating punctuation attached to a word; Chinese corpora require word segmentation. Optionally, pairs with obvious errors, such as over-long sentences or sentences with badly mismatched lengths, may be filtered preliminarily. In addition, the collected corpus pairs may be divided into two parts, the first used for word alignment and phrase extraction and the second for minimum-error-rate training.
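The preliminary rule-based prefilter described above can be sketched as below; the length limit and ratio are hypothetical values, not taken from the patent:

```python
def rule_prefilter(pairs, max_len=100, max_ratio=3.0):
    """Drop corpus pairs with obvious errors: empty or over-long
    sentences, or sentences whose token counts are badly mismatched."""
    kept = []
    for src, tgt in pairs:
        a, b = len(src.split()), len(tgt.split())
        if a == 0 or b == 0:                    # one side is empty
            continue
        if max(a, b) > max_len:                 # sentence too long
            continue
        if max(a, b) / min(a, b) > max_ratio:   # lengths mismatched
            continue
        kept.append((src, tgt))
    return kept

print(rule_prefilter([("a b c", "x y z"), ("a", "x y z w u v")]))
```

Note this is exactly the kind of rule the Background criticizes: it catches gross mismatches but not semantically wrong pairs of similar length, which is why the model-based filtering of steps 101 and 102 is still needed afterwards.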
Word alignment pairs up the words in a corpus pair that are translations of each other. Phrase extraction then extracts phrase pairs with a mutual translation relationship from the corpus pairs, each pair consisting of a source-language phrase and a target-language phrase.
After phrase extraction is finished, phrase feature extraction can be carried out, that is, computing the phrase translation probabilities and the word-level translation probabilities within phrases.
For example, suppose a source-language phrase s appears 70 times in the bilingual corpus, a target-language phrase t appears 90 times, and the two appear together as an inter-translated phrase pair 20 times. Then the phrase translation probability of s being translated into t is Prob(t | s) = 20/70, and the probability of t being translated back into s is Prob(s | t) = 20/90.
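The relative-frequency estimate behind this example can be sketched as follows; the phrase strings and counts are hypothetical, and here the counts come only from the extracted phrase pairs rather than from the whole corpus:

```python
from collections import Counter

def phrase_translation_probs(extracted_pairs):
    """Estimate phrase translation probabilities by relative frequency
    over the phrase pairs collected during phrase extraction.
    Returns fwd[(s, t)] = P(t | s) and bwd[(s, t)] = P(s | t)."""
    joint = Counter(extracted_pairs)
    src = Counter(s for s, _ in extracted_pairs)
    tgt = Counter(t for _, t in extracted_pairs)
    fwd = {p: c / src[p[0]] for p, c in joint.items()}
    bwd = {p: c / tgt[p[1]] for p, c in joint.items()}
    return fwd, bwd

# Toy extraction counts: the source phrase occurs 3 times, aligned
# twice to "you want" and once to "do you want".
pairs = [("ni xiang", "you want")] * 2 + [("ni xiang", "do you want")]
fwd, bwd = phrase_translation_probs(pairs)
print(fwd[("ni xiang", "you want")])
```

The forward probability here is 2/3, mirroring how the 20/70 and 20/90 figures in the example divide a joint count by a marginal count.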
Minimum-error-rate training then optimizes the feature weights on the second part of the collected corpus pairs (the tuning set) according to an optimization criterion. Common criteria include information entropy, BLEU (bilingual evaluation understudy), and the like. This stage decodes the tuning set several times with the decoder; each pass produces the N highest-scoring results and adjusts the feature weights. When the feature weights are adjusted, the ranking of the N results also changes, and the highest-scoring result, i.e., the decoding result, is used to compute the BLEU score. Whenever a new set of feature weights improves the score over the whole tuning set, the next decoding pass is run. This is repeated until no further improvement is observed.
A translation model trained on a large number of samples has a certain capacity for semantic representation and can recognize many semantic errors, rather than merely applying rules such as sentence length or word-for-word correspondence. The probability value of the first corpus corresponding to the second corpus can therefore be computed more accurately according to the translation model; filtering the corpus pairs by this probability value removes more of the pairs whose errors are subtle, which improves filtering accuracy and, in turn, the translation quality of the machine translation system.
In an alternative embodiment of the present invention, the translation model may be specifically trained by the following steps:
training the translation model with the language corresponding to the first corpus in the samples as the source language and the language corresponding to the second corpus as the target language; or
training the translation model with the language corresponding to the first corpus in the samples as the target language and the language corresponding to the second corpus as the source language.
In practical applications, translation may be bidirectional: a Chinese sentence may be translated into an English sentence, and the English sentence may likewise be translated into the Chinese sentence. Therefore, in the embodiments of the invention, the probability value of the first corpus corresponding to the second corpus may specifically refer to the probability of the first corpus being translated into the second corpus, or the probability of the second corpus being translated into the first corpus.
Accordingly, the embodiments of the invention can train two translation models for the two directions of translation: from the first language (as source) to the second language (as target), and from the second language (as source) to the first language (as target).
In an application example of the embodiment of the present invention, taking a language corresponding to the first corpus as chinese and a language corresponding to the second corpus as english as an example, the embodiment of the present invention may train the language corresponding to the first corpus in the sample as a source language and the language corresponding to the second corpus in the sample as a target language to obtain the translation model S1.
According to translation model S1, the probability of translating the first corpus into the second corpus can be computed. Specifically, the first corpus SentZh in a pair may be used as the input of S1 and the second corpus SentEn as its expected output, so that S1 computes the probability P(SentEn | SentZh) of SentEn being the translation given SentZh.
For example, for a corpus pair whose first corpus is a Chinese sentence meaning "I like you." and whose second corpus is "I like you.", translation model S1 can compute the conditional probability P("I like you." | SentZh), where SentZh denotes the Chinese first corpus.
In addition, the embodiments of the present invention may also train with the language corresponding to the second corpus in the samples as the source language and the language corresponding to the first corpus as the target language, to obtain translation model S2.
According to translation model S2, the probability of translating the second corpus into the first corpus can be computed. Specifically, the second corpus SentEn in a pair may be used as the input of S2 and the first corpus SentZh as its expected output, so that S2 computes the probability P(SentZh | SentEn).
For example, for the same corpus pair, translation model S2 can compute the conditional probability P(SentZh | "I like you."), where SentZh denotes the Chinese first corpus.
In an alternative embodiment of the present invention, the translation model may be specifically trained by the following steps:
step S31, segmenting the first corpus and the second corpus in each corpus pair into words to obtain a word sequence corresponding to the first corpus and a word sequence corresponding to the second corpus;
step S32, performing a reverse operation on the word sequence corresponding to the first corpus and the word sequence corresponding to the second corpus to obtain a reverse word sequence corresponding to the first corpus and a reverse word sequence corresponding to the second corpus;
step S33, training the translation model using, as samples, the reverse corpus pairs composed of the reverse word sequence corresponding to the first corpus and the reverse word sequence corresponding to the second corpus.
In the embodiments of the present invention, the first corpus in a pair may be segmented into words to obtain its word sequence, and a reversal operation performed on that sequence to obtain its reverse word sequence. For example, consider a corpus pair whose first corpus is a Chinese sentence meaning "I am a boy." and whose second corpus is "I am a boy.". Segmenting the first corpus into words yields a word sequence corresponding to "I | am | a | boy | ."; reversing this sequence yields the reverse word sequence of the first corpus. Similarly, the reverse word sequence of the second corpus "I am a boy." is "boy a am I.".
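Steps S31 and S32 can be sketched for the English side of the example; the Chinese side would first be word-segmented the same way, and treating tokens as space-separated (with punctuation dropped here) is an assumption of the sketch:

```python
def to_reverse_word_sequence(sentence):
    """Segment a sentence into a word sequence (step S31) and
    reverse that sequence (step S32)."""
    return list(reversed(sentence.split()))

# The English second corpus of the example, without the final period.
print(" ".join(to_reverse_word_sequence("I am a boy")))  # "boy a am I"
```

Applying the same operation to both sides of every pair produces the reverse corpus pairs used as training samples in step S33.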
The embodiments of the invention can perform the reversal operation on the first and second corpora of every pair in the corpus and train a translation model using the resulting reverse corpus pairs as samples, so that the trained model can compute translation probabilities for reversed sentences.
Furthermore, for the two directions of translation, the embodiments of the present invention may train two translation models on reverse corpus pairs, from the first language (as source) to the second language (as target) and vice versa; assume these are denoted translation model S3 and translation model S4, respectively.
Therefore, according to the embodiments of the present invention, the probability value of the first corpus corresponding to the second corpus in a pair may be determined according to translation model S3 or translation model S4. For example, after segmenting and reversing the first and second corpora of a pair, the probability of translating the reverse word sequence of the first corpus into the reverse word sequence of the second corpus is computed according to S3 or S4.
Thus, for each corpus pair in the corpus, the probability value of the first corpus corresponding to the second corpus can be determined according to a translation model, and the pairs filtered according to that probability value. The translation model may be one trained on forward corpus pairs, such as translation model S1 or S2, or one trained on reverse corpus pairs, such as translation model S3 or S4.
In practical applications, the corpus may be filtered with a model trained on forward corpus pairs, or with a model trained on reverse corpus pairs, or filtered twice, once with each. In the two-pass case the two translation models complement each other, so more incorrect corpus pairs can be filtered out and the correctness of the remaining pairs improved.
In an optional embodiment of the present invention, the filtering the corpus pairs in the corpus according to the probability value may specifically include: and deleting corpus pairs with probability values not exceeding a preset threshold value from the corpus.
Specifically, in the embodiments of the present invention, a preset threshold may be set. For a given corpus pair, if the computed probability value of the first corpus corresponding to the second corpus does not exceed the threshold, the probability of the first corpus translating into the second corpus is considered low, the pair is likely an incorrect one, and it may be deleted from the corpus.
It is understood that the embodiments of the present invention do not limit the specific manner of setting the preset threshold. For example, the preset threshold may be set empirically, or set statistically, for example by the 3-sigma rule.
In an optional embodiment of the present invention, the preset threshold may be specifically determined by the following steps:
step S21, determining a probability value corresponding to each corpus pair in the corpus according to the translation model;
step S22, determining an average probability value according to the probability values respectively corresponding to the plurality of corpus pairs;
and step S23, determining a preset threshold value according to the average probability value.
Specifically, in the embodiment of the present invention, first, according to a translation model obtained by training, probability values of a first corpus corresponding to a second corpus in each corpus pair are determined, so that a probability value is corresponding to each corpus pair in the corpus. Then, an average probability value may be calculated for the probability values corresponding to the corpus pairs in the corpus. Finally, a preset threshold value may be determined according to the average probability value.
In an application example of the embodiments of the present invention, an average probability value F and a standard deviation T may be computed from the probability values corresponding to the corpus pairs in the corpus, and the preset threshold set to F − 3T. Of course, in practice a person skilled in the art can set the preset threshold flexibly as needed and adjust it to the actual situation; the larger the preset threshold, the more corpus pairs are filtered out.
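Steps S21 to S23 with the F − 3T rule can be sketched as below; the probability values are hypothetical:

```python
from statistics import mean, stdev

def preset_threshold(prob_values, k=3.0):
    """Compute the preset threshold F - k*T from the mean F and the
    standard deviation T of the per-pair probability values."""
    f = mean(prob_values)
    t = stdev(prob_values)
    return f - k * t

# Hypothetical probability values, one per corpus pair in the corpus.
probs = [0.4, 0.5, 0.6]
print(preset_threshold(probs))  # 0.5 - 3 * 0.1, i.e. about 0.2
```

Raising k lowers the threshold and keeps more pairs; lowering it filters more aggressively, matching the note above that a larger threshold filters out more corpus pairs.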
In an application example of the embodiments of the present invention, assume the preset threshold is 0.0001. For a corpus pair whose first corpus is a Chinese sentence meaning "I like you." and whose second corpus is the mistranslated English sentence "I like apple.", suppose the probability value computed according to translation model S1 is P("I like apple." | SentZh) = 0.00003, where SentZh denotes the Chinese first corpus. Since 0.00003 does not exceed the preset threshold of 0.0001, the pair is an incorrect corpus pair and can be deleted from the corpus.
According to the embodiments of the invention, the probability value of the first corpus corresponding to the second corpus can be computed for each pair according to the translation model, compared with the preset threshold, and the pairs whose probability values do not exceed the threshold deleted from the corpus. Deleting the incorrect pairs in this way improves the correctness of the corpus pairs in the corpus and thereby the machine translation quality.
In an alternative embodiment of the present invention, the method may further comprise the steps of:
step S11, determining the corpus pair with the probability value not exceeding a preset threshold value as a target corpus pair;
step S12, determining the error type of the target corpus pair;
step S13, determining a correct first corpus and a correct second corpus of the target corpus pair according to the error type;
step S14, updating the target corpus pair in the corpus according to the correct first corpus and the correct second corpus.
Therefore, in order to preserve the size of the corpus, the embodiments of the present invention may further identify incorrect corpus pairs in the corpus according to the translation model and correct them to obtain correct corpus pairs.
Specifically, the embodiments of the present invention may compute, according to a translation model, the probability value of the first corpus corresponding to the second corpus for each pair, compare it with the preset threshold, and take the pairs whose probability values do not exceed the threshold as target corpus pairs. The error type of each target corpus pair is determined by machine recognition or manual inspection; the correct first corpus and correct second corpus of the target corpus pair are determined according to the error type; and the target corpus pair in the corpus is updated according to the correct first corpus and correct second corpus, so that the correct pair is stored in the corpus.
In an application example of the embodiment of the present invention, assume the preset threshold is 0.0001. Consider a corpus pair whose first corpus is a source-language sentence meaning "I like you." and whose second corpus is "I like apple.". Suppose the probability value of the first corpus corresponding to the second corpus, calculated according to the translation model S, is P("I like apple." | "I like you.") = 0.00003. Since 0.00003 does not exceed the preset threshold 0.0001, the corpus pair is a wrong corpus pair. Through syntactic analysis, the error type of this target corpus pair is determined to be a word translation error; the mistranslated word can then be corrected according to the error type, so that the correct first corpus of the target corpus pair, "I like you.", and the correct second corpus, "I like you.", are obtained. Finally, the target corpus pair in the corpus can be updated according to the correct first corpus and the correct second corpus: the original corpus pair ("I like you.", "I like apple.") is updated to ("I like you.", "I like you.").
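The update step of this example can be sketched as follows; the helper name is hypothetical, and the error-type analysis itself is not modeled here.

```python
def update_corpus(corpus, target_pair, corrected_pair):
    """Replace a wrong target corpus pair with its corrected version."""
    return [corrected_pair if pair == target_pair else pair for pair in corpus]


corpus = [("I like you.", "I like apple."),  # wrong pair (below threshold)
          ("Hello.", "Hello.")]              # correct pair, left untouched
fixed = update_corpus(corpus,
                      ("I like you.", "I like apple."),
                      ("I like you.", "I like you."))
```

Only the identified target corpus pair is rewritten, so the size of the corpus is preserved.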
In an alternative embodiment of the present invention, the method may further comprise the steps of: and taking the corpus pairs in the filtered corpus as samples to train the translation model.
After the corpus is filtered by this data processing method, the corpus pairs in the filtered corpus have high correctness. Therefore, training the translation model with the corpus pairs in the filtered corpus as samples can improve the accuracy of the translation model, and performing machine translation with the improved translation model can further improve the quality of machine translation.
To sum up, the embodiment of the present invention may determine, according to the translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair, and filter the corpus pairs in the corpus according to the probability value. The first corpus and the second corpus have a translation relationship, and the translation model is obtained by training with corpus pairs in the corpus as samples. Because the corpus contains a large amount of corpus pair data, a translation model trained on a large number of samples has a certain semantic understanding capability and can recognize more semantic errors, rather than relying only on simple rules such as sentence length or word-to-word correspondence. Therefore, the probability value of the first corpus corresponding to the second corpus in a corpus pair can be calculated more accurately according to the translation model, and filtering the corpus pairs in the corpus according to this probability value can filter out more corpus pairs with subtle semantic errors, thereby improving the filtering accuracy and the translation quality of a machine translation system.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 2, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include: a probability determination module 201 and a filtering module 202;
the probability determining module 201 is configured to determine, according to the translation model, a probability value of a first corpus corresponding to a second corpus in a corpus pair; wherein the first corpus and the second corpus can have a translation relationship; and the translation model can be obtained by training with corpus pairs in the corpus as samples;
the filtering module 202 is configured to filter the corpus pairs in the corpus according to the probability value.
Optionally, the filtering module 202 may specifically include:
and the first filtering submodule is used for deleting the corpus pairs with the probability value not exceeding a preset threshold value from the corpus.
Optionally, the apparatus may further include: a threshold determination module, configured to determine the preset threshold; the threshold determination module includes:
the probability value calculation submodule is used for determining the probability value corresponding to each corpus pair in the corpus according to the translation model;
the average value calculation submodule is used for determining an average probability value according to the probability values respectively corresponding to the plurality of corpus pairs;
and the threshold value determining submodule is used for determining a preset threshold value according to the average probability value.
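The threshold-determination submodules can be sketched as follows. The `scale` factor is an assumption, since the embodiment only states that the preset threshold is determined "according to" the average probability value.

```python
def preset_threshold(corpus_pairs, score_fn, scale=0.1):
    """Determine the preset threshold from the average probability value."""
    probs = [score_fn(first, second) for first, second in corpus_pairs]
    average = sum(probs) / len(probs)  # average probability value over all pairs
    return average * scale             # threshold derived from the average


# Hypothetical per-pair probability values produced by a translation model.
scores = {("a", "b"): 0.8, ("c", "d"): 0.2}
threshold = preset_threshold(list(scores), lambda f, s: scores[(f, s)])
```

With these toy values the average probability is 0.5, giving a threshold of 0.05 at the default scale.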
Optionally, the apparatus may further include: a first training module for training the translation model; the first training module comprising:
the first training submodule is used for training a translation model by taking the language corresponding to the first corpus in the sample as a source language and the language corresponding to the second corpus in the sample as a target language; or
And the second training submodule is used for training a translation model by taking the language corresponding to the first corpus in the sample as a target language and the language corresponding to the second corpus in the sample as a source language.
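The two training directions differ only in which side of each corpus pair is treated as the source language. A sketch of the sample construction (the model training itself is omitted, and the function name is an assumption):

```python
def build_samples(corpus_pairs, first_as_source=True):
    """Arrange each corpus pair as a (source, target) training sample."""
    if first_as_source:
        # First corpus as source language, second corpus as target language.
        return [(first, second) for first, second in corpus_pairs]
    # First corpus as target language, second corpus as source language.
    return [(second, first) for first, second in corpus_pairs]


forward = build_samples([("ni hao", "hello")])
backward = build_samples([("ni hao", "hello")], first_as_source=False)
```

Either arrangement yields a valid training set; the resulting model scores P(target | source) in the chosen direction.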
Optionally, the apparatus may further include: the second training module is used for training the translation model; the second training module comprising:
the segmentation submodule is used for segmenting a first corpus and a second corpus in the corpus by taking words as units so as to obtain a word sequence corresponding to the first corpus and a word sequence corresponding to the second corpus;
the reverse submodule is used for performing reverse operation on the word sequence corresponding to the first corpus and the word sequence corresponding to the second corpus to obtain a reverse word sequence corresponding to the first corpus and a reverse word sequence corresponding to the second corpus;
and the reverse training submodule is used for training a translation model by taking a reverse corpus pair consisting of the reverse word sequence corresponding to the first corpus and the reverse word sequence corresponding to the second corpus as a sample.
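The segmentation and reversal submodules amount to the following sketch; whitespace tokenization is an assumption, as the embodiment only specifies word-level segmentation.

```python
def reverse_corpus_pair(first, second):
    """Segment both corpora into word sequences and reverse each sequence,
    producing a reverse corpus pair for training."""
    def reverse_words(text):
        return " ".join(reversed(text.split()))
    return reverse_words(first), reverse_words(second)


rev_pair = reverse_corpus_pair("i like you", "je t aime")
```

The resulting reverse corpus pair (here `("you like i", "aime t je")`) is used as a training sample in place of the original pair.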
Optionally, the apparatus may further include:
the target determining module is used for determining the corpus pair with the probability value not exceeding a preset threshold value as a target corpus pair;
an error determination module for determining the error type of the target corpus pair;
a correction determining module, configured to determine, according to the error type, a correct first corpus and a correct second corpus of the target corpus pair;
and the updating module is used for updating the target corpus pair in the corpus according to the correct first corpus and the correct second corpus.
Optionally, the apparatus may further include:
and the third training module is used for taking the corpus pairs in the filtered corpus as samples to train the translation model.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: determining the probability value of a first corpus corresponding to a second corpus in a corpus pair according to the translation model, wherein the first corpus and the second corpus have a translation relationship, and the translation model is obtained by training with corpus pairs in the corpus as samples; and filtering the corpus pairs in the corpus according to the probability value.
Fig. 3 is a block diagram illustrating an apparatus 800 for data processing in accordance with an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a speech recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800, the relative positioning of the components, such as a display and keypad of the apparatus 800, the sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the data processing method shown in fig. 1.
A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a data processing method, the method comprising: determining the probability value of a first corpus corresponding to a second corpus in a corpus pair according to the translation model, wherein the first corpus and the second corpus have a translation relationship, and the translation model is obtained by training with corpus pairs in the corpus as samples; and filtering the corpus pairs in the corpus according to the probability value.
The embodiment of the invention discloses A1, a data processing method, wherein the method comprises the following steps:
determining the probability value of a first corpus corresponding to a second corpus in a corpus pair according to the translation model; wherein the first corpus and the second corpus have a translation relationship; and the translation model is obtained by training with corpus pairs in the corpus as samples;
and filtering the corpus pairs in the corpus according to the probability value.
A2, according to the method in a1, the filtering corpus pairs in the corpus according to the probability value includes:
and deleting corpus pairs with probability values not exceeding a preset threshold value from the corpus.
A3, according to the method of A2, determining the preset threshold by:
determining a probability value corresponding to each corpus pair in the corpus according to the translation model;
determining an average probability value according to the probability values respectively corresponding to the plurality of corpus pairs;
and determining a preset threshold value according to the average probability value.
A4, training the translation model according to the method of A1 by:
taking a language corresponding to a first corpus in the sample as a source language and taking a language corresponding to a second corpus in the sample as a target language, and training a translation model; or
And taking the language corresponding to the first corpus in the sample as a target language and taking the language corresponding to the second corpus in the sample as a source language, and training a translation model.
A5, training the translation model according to the method of A1 by:
segmenting a first corpus and a second corpus in the corpus by taking words as units to obtain a word sequence corresponding to the first corpus and a word sequence corresponding to the second corpus;
performing reverse operation on the word sequence corresponding to the first corpus and the word sequence corresponding to the second corpus to obtain a reverse word sequence corresponding to the first corpus and a reverse word sequence corresponding to the second corpus;
and taking a reverse corpus pair consisting of the reverse word sequence corresponding to the first corpus and the reverse word sequence corresponding to the second corpus as a sample, and training a translation model.
A6, the method of A1, the method further comprising:
determining the corpus pair with the probability value not exceeding a preset threshold value as a target corpus pair;
determining an error type of the target corpus pair;
determining a correct first corpus and a correct second corpus of the target corpus pair according to the error type;
and updating the target corpus pair in the corpus according to the correct first corpus and the correct second corpus.
A7, the method of A1, the method further comprising:
and taking the corpus pairs in the filtered corpus as samples to train the translation model.
The embodiment of the invention discloses B8, a data processing device, comprising:
the probability determining module is used for determining the probability value of the first corpus corresponding to the second corpus in a corpus pair according to the translation model; wherein the first corpus and the second corpus have a translation relationship; and the translation model is obtained by training with corpus pairs in the corpus as samples;
and the filtering module is used for filtering the corpus pairs in the corpus according to the probability value.
B9, the device according to B8, the filtration module comprising:
and the first filtering submodule is used for deleting the corpus pairs with the probability value not exceeding a preset threshold value from the corpus.
B10, the apparatus of B9, the apparatus further comprising: a threshold determination module, configured to determine the preset threshold; the threshold determination module includes:
the probability value calculation submodule is used for determining the probability value corresponding to each corpus pair in the corpus according to the translation model;
the average value calculation submodule is used for determining an average probability value according to the probability values respectively corresponding to the plurality of corpus pairs;
and the threshold value determining submodule is used for determining a preset threshold value according to the average probability value.
B11, the apparatus of B8, the apparatus further comprising: a first training module for training the translation model; the first training module comprising:
the first training submodule is used for training a translation model by taking the language corresponding to the first corpus in the sample as a source language and the language corresponding to the second corpus in the sample as a target language; or
And the second training submodule is used for training a translation model by taking the language corresponding to the first corpus in the sample as a target language and the language corresponding to the second corpus in the sample as a source language.
B12, the apparatus of B8, the apparatus further comprising: the second training module is used for training the translation model; the second training module comprising:
the segmentation submodule is used for segmenting a first corpus and a second corpus in the corpus by taking words as units so as to obtain a word sequence corresponding to the first corpus and a word sequence corresponding to the second corpus;
the reverse submodule is used for performing reverse operation on the word sequence corresponding to the first corpus and the word sequence corresponding to the second corpus to obtain a reverse word sequence corresponding to the first corpus and a reverse word sequence corresponding to the second corpus;
and the reverse training submodule is used for training a translation model by taking a reverse corpus pair consisting of the reverse word sequence corresponding to the first corpus and the reverse word sequence corresponding to the second corpus as a sample.
B13, the apparatus of B8, the apparatus further comprising:
the target determining module is used for determining the corpus pair with the probability value not exceeding a preset threshold value as a target corpus pair;
an error determination module for determining the error type of the target corpus pair;
a correction determining module, configured to determine, according to the error type, a correct first corpus and a correct second corpus of the target corpus pair;
and the updating module is used for updating the target corpus pair in the corpus according to the correct first corpus and the correct second corpus.
B14, the apparatus of B8, the apparatus further comprising:
and the third training module is used for taking the corpus pairs in the filtered corpus as samples to train the translation model.
The embodiment of the invention discloses C15, an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors comprise instructions for:
determining the probability value of a first corpus corresponding to a second corpus in a corpus pair according to the translation model; wherein the first corpus and the second corpus have a translation relationship; and the translation model is obtained by training with corpus pairs in the corpus as samples;
and filtering the corpus pairs in the corpus according to the probability value.
C16, the apparatus according to C15, the filtering corpus pairs in the corpus according to the probability value, comprising:
and deleting corpus pairs with probability values not exceeding a preset threshold value from the corpus.
C17, according to the device of C16, determining the preset threshold value by:
determining a probability value corresponding to each corpus pair in the corpus according to the translation model;
determining an average probability value according to the probability values respectively corresponding to the plurality of corpus pairs;
and determining a preset threshold value according to the average probability value.
C18, according to the device of C15, training the translation model by:
taking a language corresponding to a first corpus in the sample as a source language and taking a language corresponding to a second corpus in the sample as a target language, and training a translation model; or
And taking the language corresponding to the first corpus in the sample as a target language and taking the language corresponding to the second corpus in the sample as a source language, and training a translation model.
C19, according to the device of C15, training the translation model by:
segmenting a first corpus and a second corpus in the corpus by taking words as units to obtain a word sequence corresponding to the first corpus and a word sequence corresponding to the second corpus;
performing reverse operation on the word sequence corresponding to the first corpus and the word sequence corresponding to the second corpus to obtain a reverse word sequence corresponding to the first corpus and a reverse word sequence corresponding to the second corpus;
and taking a reverse corpus pair consisting of the reverse word sequence corresponding to the first corpus and the reverse word sequence corresponding to the second corpus as a sample, and training a translation model.
C20, the device of C15, the device also configured to execute the one or more programs by one or more processors including instructions for:
determining the corpus pair with the probability value not exceeding a preset threshold value as a target corpus pair;
determining an error type of the target corpus pair;
determining a correct first corpus and a correct second corpus of the target corpus pair according to the error type;
and updating the target corpus pair in the corpus according to the correct first corpus and the correct second corpus.
C21, the device of C15, the device also configured to execute the one or more programs by one or more processors including instructions for:
and taking the corpus pairs in the filtered corpus as samples to train the translation model.
Embodiments of the present invention disclose D22, a machine-readable medium having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The data processing method, the data processing apparatus and the apparatus for data processing provided by the present invention are described in detail above, and specific examples are applied herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of data processing, the method comprising:
determining the probability value of a first corpus corresponding to a second corpus in a corpus pair according to the translation model; wherein the first corpus and the second corpus have a translation relationship; and the translation model is obtained by training with corpus pairs in the corpus as samples;
and filtering the corpus pairs in the corpus according to the probability value.
2. The method of claim 1, wherein the filtering corpus pairs in the corpus according to the probability value comprises:
and deleting corpus pairs with probability values not exceeding a preset threshold value from the corpus.
3. The method of claim 2, wherein the preset threshold is determined by:
determining a probability value corresponding to each corpus pair in the corpus according to the translation model;
determining an average probability value according to the probability values respectively corresponding to the plurality of corpus pairs;
and determining a preset threshold value according to the average probability value.
4. The method of claim 1, wherein the translation model is trained by:
taking a language corresponding to a first corpus in the sample as a source language and taking a language corresponding to a second corpus in the sample as a target language, and training a translation model; or
And taking the language corresponding to the first corpus in the sample as a target language and taking the language corresponding to the second corpus in the sample as a source language, and training a translation model.
5. The method of claim 1, wherein the translation model is trained by:
segmenting, in units of words, a first corpus and a second corpus in the corpus to obtain a word sequence corresponding to the first corpus and a word sequence corresponding to the second corpus;
reversing the word sequence corresponding to the first corpus and the word sequence corresponding to the second corpus to obtain a reversed word sequence corresponding to the first corpus and a reversed word sequence corresponding to the second corpus;
and training the translation model by taking, as a sample, a reversed corpus pair consisting of the reversed word sequence corresponding to the first corpus and the reversed word sequence corresponding to the second corpus.
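A sketch of the reverse-pair construction in claim 5, assuming whitespace tokenization for illustration only (the patent does not specify a segmenter, and Chinese text would require a proper word segmenter):

```python
def reverse_pair(first_corpus, second_corpus, tokenize=str.split):
    """Build the reversed corpus pair of claim 5: segment each corpus
    into a word sequence, then reverse both sequences.  `tokenize`
    defaults to whitespace splitting, a simplifying assumption."""
    return (list(reversed(tokenize(first_corpus))),
            list(reversed(tokenize(second_corpus))))

rev = reverse_pair("the cat sat", "le chat dormait")
```

Training on such reversed pairs yields a second model that reads both sides right-to-left, which can expose alignment errors the forward model scores too leniently.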
6. The method of claim 1, further comprising:
determining the corpus pair with the probability value not exceeding a preset threshold value as a target corpus pair;
determining an error type of the target corpus pair;
determining a correct first corpus and a correct second corpus of the target corpus pair according to the error type;
and updating the target corpus pair in the corpus according to the correct first corpus and the correct second corpus.
7. The method of claim 1, further comprising:
and taking the corpus pairs in the filtered corpus as samples to train the translation model.
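Claims 1, 3, and 7 together suggest an iterative loop: train, score every pair, filter by a threshold derived from the average probability, then retrain on the filtered corpus. A toy sketch, where `train` and `score` are hypothetical callables standing in for a real translation-model toolkit:

```python
def filter_and_retrain(corpus_pairs, train, score, rounds=1):
    """Train a translation model, drop corpus pairs whose probability
    does not exceed the average probability value (claim 3's threshold),
    and retrain on the filtered corpus (claim 7).  `train` and `score`
    are hypothetical stand-ins, not APIs named in the patent."""
    model = train(corpus_pairs)
    for _ in range(rounds):
        probs = {pair: score(model, pair) for pair in corpus_pairs}
        threshold = sum(probs.values()) / len(probs)  # average probability value
        corpus_pairs = [p for p in corpus_pairs if probs[p] > threshold]
        model = train(corpus_pairs)  # retrain on the filtered corpus
    return model, corpus_pairs

# Toy run: two candidate pairs, one plausibly aligned, one not.
good = ("good morning", "zao shang hao")
bad = ("good morning", "wan an")
toy_scores = {good: 0.9, bad: 0.1}
model, kept = filter_and_retrain([good, bad],
                                 train=lambda pairs: tuple(pairs),
                                 score=lambda m, p: toy_scores[p])
```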
8. A data processing apparatus, comprising:
the probability determining module is used for determining, according to a translation model, the probability value of the first corpus corresponding to the second corpus in a corpus pair; wherein the first corpus and the second corpus have a translation relationship, and the translation model is obtained by training on samples from the corpus;
and the filtering module is used for filtering corpus pairs in the corpus according to the probability value.
9. An apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
determining, according to a translation model, the probability value of a first corpus corresponding to a second corpus in a corpus pair; wherein the first corpus and the second corpus have a translation relationship, and the translation model is obtained by training on samples from the corpus;
and filtering corpus pairs in the corpus according to the probability value.
10. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a data processing method as claimed in one or more of claims 1 to 7.
CN201811320799.7A 2018-11-07 2018-11-07 Data processing method and device and data processing device Pending CN111160046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811320799.7A CN111160046A (en) 2018-11-07 2018-11-07 Data processing method and device and data processing device


Publications (1)

Publication Number Publication Date
CN111160046A 2020-05-15

Family

ID=70555430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811320799.7A Pending CN111160046A (en) 2018-11-07 2018-11-07 Data processing method and device and data processing device

Country Status (1)

Country Link
CN (1) CN111160046A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714136A (en) * 2008-10-06 2010-05-26 株式会社东芝 Method and device for adapting a machine translation system based on language database to new field
CN102799579A (en) * 2012-07-18 2012-11-28 西安理工大学 Statistical machine translation method with error self-diagnosis and self-correction functions
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN106156010A (en) * 2015-04-20 2016-11-23 阿里巴巴集团控股有限公司 Translation training method, device, system and translation on line method and device
CN106383818A (en) * 2015-07-30 2017-02-08 阿里巴巴集团控股有限公司 Machine translation method and device
CN108038111A (en) * 2017-12-11 2018-05-15 中译语通科技股份有限公司 Method and system for building a machine translation pipeline, computer program, and computer
CN108363704A (en) * 2018-03-02 2018-08-03 北京理工大学 Neural network machine translation corpus expansion method based on a statistical phrase table


Similar Documents

Publication Publication Date Title
CN111145756B (en) Voice recognition method and device for voice recognition
JP6918181B2 (en) Machine translation model training methods, equipment and systems
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN111128183B (en) Speech recognition method, apparatus and medium
CN111368541B (en) Named entity identification method and device
CN109815396B (en) Search term weight determination method and device
US20210158126A1 (en) Method and device for compressing a neural network model for machine translation and storage medium
RU2733816C1 (en) Method of processing voice information, apparatus and storage medium
CN110069624B (en) Text processing method and device
CN108345625B (en) Information mining method and device for information mining
CN111832315B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN110069143B (en) Information error correction preventing method and device and electronic equipment
CN112133295B (en) Speech recognition method, device and storage medium
US11461561B2 (en) Method and device for information processing, and storage medium
CN112036195A (en) Machine translation method, device and storage medium
CN111324214B (en) Statement error correction method and device
CN111832297A (en) Part-of-speech tagging method and device and computer-readable storage medium
CN111414766B (en) Translation method and device
CN107301188B (en) Method for acquiring user interest and electronic equipment
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN111160046A (en) Data processing method and device and data processing device
CN110837741B (en) Machine translation method, device and system
CN113326706A (en) Cross-language retrieval method and device and electronic equipment
CN112579767B (en) Search processing method and device for search processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination