CN112199963A - Text processing method and device and text processing device

Text processing method and device and text processing device

Info

Publication number
CN112199963A
CN112199963A (application CN202011063600.4A)
Authority
CN
China
Prior art keywords
text
language
model
vector
copy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011063600.4A
Other languages
Chinese (zh)
Inventor
李质轩
许静芳
鲁涛
戴磊
武静
杨正彪
殷明明
王坤
王青龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202011063600.4A priority Critical patent/CN112199963A/en
Publication of CN112199963A publication Critical patent/CN112199963A/en
Pending legal-status Critical Current

Classifications

    • G06F40/55 Rule-based translation
    • G06F40/56 Natural language generation
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/295 Named entity recognition
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a text processing method and device and a text processing device. The method comprises the following steps: vectorizing an original text to obtain an original text vector; inputting the original text vector into a first model and outputting a target text through the first model. The original text and the target text correspond to the same language. The first model comprises a copy network, which is used for keeping the copied text of the original text in the target text. The first model is trained based on a translation parallel corpus between a first language and a second language and on the output of a second model, where the second model is used for translating text in the first language into text in the second language. The embodiment of the invention can improve the efficiency and the accuracy of text polishing.

Description

Text processing method and device and text processing device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a text processing method and apparatus, and an apparatus for text processing.
Background
For non-English users, using English can present various problems. For example, when writing in English, a user may fall back on the expression habits of their native language; the resulting English sentences may conform to English grammar yet still fail to match idiomatic English expression.
Therefore, in scenarios such as English learning and English writing, it is necessary to polish English texts written by non-English users so that the polished texts conform both to English grammar and to English expression habits. At present, polishing is usually done manually: a native English user reads and corrects the English text written by the non-English user. Manual polishing not only consumes a large amount of labor cost but is also inefficient.
Disclosure of Invention
The embodiment of the invention provides a text processing method and device and a text processing device, which can improve the efficiency and accuracy of text polishing.
In order to solve the above problem, an embodiment of the present invention discloses a text processing method, where the method includes:
vectorizing the original text to obtain an original text vector;
inputting the original text vector into a first model and outputting a target text through the first model, wherein the original text and the target text correspond to the same language; the first model comprises a copy network used for keeping the copied text of the original text in the target text; the first model is trained based on a translation parallel corpus between the first language and a second language and on the output of a second model; and the second model is used for translating text in the first language into text in the second language.
On the other hand, the embodiment of the invention discloses a text processing device, which comprises:
the first vectorization module is used for vectorizing the original text to obtain an original text vector;
the processing module is used for inputting the original text vector into a first model and outputting a target text through the first model, wherein the original text and the target text correspond to the same language; the first model comprises a copy network used for keeping the copied text of the original text in the target text; the first model is trained based on a translation parallel corpus between the first language and a second language and on the output of a second model; and the second model is used for translating text in the first language into text in the second language.
In yet another aspect, an embodiment of the present invention discloses an apparatus for text processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
vectorizing the original text to obtain an original text vector;
inputting the original text vector into a first model and outputting a target text through the first model, wherein the original text and the target text correspond to the same language; the first model comprises a copy network used for keeping the copied text of the original text in the target text; the first model is trained based on a translation parallel corpus between the first language and a second language and on the output of a second model; and the second model is used for translating text in the first language into text in the second language.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a text processing method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
The embodiment of the invention trains the first model in advance. The first model can be used to polish an original text to obtain a target text, where the original text and the target text correspond to the same language. The first model is trained based on a translation parallel corpus between the first language and the second language and on the output of the second model, and the second model is used for translating text in the first language into text in the second language. The embodiment of the invention can automatically polish the original text through the first model; compared with manual polishing, this saves a large amount of labor cost and improves polishing efficiency. In addition, because translation parallel corpora are easier to obtain than polishing parallel corpora, the embodiment of the invention combines the first model and the second model and trains the first model using the translation parallel corpus from the first language to the second language together with the output of the second model, which alleviates the scarcity of training data for the first model and improves its robustness. Furthermore, the first model comprises a copy network used for keeping the copied text of the original text in the target text; that is, the copy network can directly copy the parts of the original text (such as named entities) that are correct and need no translation or modification, which reduces the probability of semantic deviation and further improves polishing accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one embodiment of a text processing method of the present invention;
FIG. 2 is a block diagram of a text processing apparatus according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus 800 for text processing of the present invention;
fig. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a text processing method according to the present invention is shown, which may specifically include the following steps:
step 101, vectorizing an original text to obtain an original text vector;
102, inputting the original text vector into a first model and outputting a target text through the first model, wherein the original text and the target text correspond to the same language; the first model comprises a copy network used for keeping the copied text of the original text in the target text; the first model is trained based on a translation parallel corpus between the first language and a second language and on the output of a second model; and the second model is used for translating text in the first language into text in the second language.
The copy network is configured to keep the copied text of the original text in the target text; that is, the copy network can directly copy the parts of the original text (such as named entities) that are correct and need no translation or modification, which reduces the probability of semantic deviation and further improves polishing accuracy.
The text processing method provided by the embodiment of the invention can be applied to electronic equipment. The electronic devices include, but are not limited to: a server, a smart phone, a recording pen, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a car computer, a desktop computer, a set-top box, a smart tv, a wearable device, and the like.
The embodiment of the invention obtains the first model by pre-training. The first model can be used to automatically polish an original text to obtain a target text, and the original text and the target text correspond to the same language.
To ensure the prediction accuracy of the first model, a large amount of polishing parallel corpora would normally be required to train it. A parallel corpus (Parallel Corpora) refers to a data set containing pairwise-aligned sentence pairs from different domains. Taking English as an example, training the first model requires a large English polishing parallel corpus, containing pairwise-aligned English sentence pairs before and after polishing. In practical applications, however, few polishing parallel corpora are available, so the training data for the first model is insufficient. To solve this problem, the embodiment of the present invention trains the first model based on a translation parallel corpus between the first language and the second language and on the output of the second model. The translation parallel corpus contains pairwise-aligned sentence pairs of the first language and the second language. The second model may be a machine translation model used to translate text in the first language into text in the second language.
The first language and the second language are different languages. In one example, the first language is Chinese and the second language is English. Of course, the embodiment of the present invention does not limit the types of the first language and the second language. For example, the first language may be English and the second language Chinese, or the first language French and the second language Japanese. The first language and the second language may be any languages, such as Chinese, English, French, Italian, German, Portuguese, Japanese, Korean, and the like. For convenience of description, in the embodiments of the present invention the first language is Chinese and the second language is English; that is, the second model may be used to translate a Chinese text into an English text, and the first model may be used to polish an English original text to obtain a polished English target text.
In practical applications, translation parallel corpora are easier to obtain than polishing parallel corpora. The embodiment of the invention therefore combines the first model and the second model, training the first model using the translation parallel corpus from the first language to the second language and the output of the second model, so as to solve the scarcity of training data for the first model.
After the first model is trained, it can be used to polish original text in the second language. Specifically, the original text is first vectorized to obtain an original text vector. The original text may be a sentence, a word, an article, and so on. Optionally, in the embodiment of the present invention, polishing is performed in units of sentences. Vectorizing the original text means converting each participle in the original text into a corresponding vector. Assuming the original text has a length of N (the original text includes N participles, where N is a positive integer) and each participle is converted into a vector x of fixed length, the original text vector can be represented as (x_1, x_2, ..., x_N). The original text vector is then input into the first model, and the target text obtained by polishing the original text is output through the first model.
In one example, the original text is "All in all, in the past people used the letter or email times more than they used telephones". Through the polishing of the first model, the original text may be converted into the target text "In general, people used letters or email more than telephones in the past". The target text obtained after conversion better conforms to English expression habits and is expressed in a more standard manner.
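By way of illustration only, the following is a minimal sketch of the vectorization step, assuming a toy whitespace tokenizer, a small hypothetical word list, and a learned embedding table (PyTorch is used as an example framework; none of these names come from the original disclosure):

```python
import torch
import torch.nn as nn

# Hypothetical word list; in practice this is the trained English vocabulary.
vocab = {"<unk>": 0, "in": 1, "the": 2, "past": 3, "people": 4,
         "used": 5, "letters": 6, "or": 7, "email": 8}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=256)

def vectorize(text: str) -> torch.Tensor:
    """Convert each participle of the original text into a fixed-length vector,
    producing the original text vector (x_1, x_2, ..., x_N)."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]
    return embedding(torch.tensor(ids))  # shape: (N, 256)

x = vectorize("In the past people used letters or email")
print(x.shape)  # torch.Size([8, 256])
```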
Optionally, to address the scarcity of training data for the first model, the embodiment of the present invention may also use the machine translation model to train an inverse model, which converts text expressing the preset language in a standard way into text expressing it in a non-standard way, thereby producing the large amount of polishing parallel corpora needed to train the first model. The preset language refers to the language of the text to be polished.
In practical application, the first model may be trained based on the trained second model, or the first model and the second model may be simultaneously trained based on the translated parallel corpus of the first language corresponding to the second language. The embodiment of the present invention is not limited thereto.
In an embodiment of the present invention, the first model and the second model may be sequence-to-sequence models. A sequence-to-sequence model refers to a neural network model capable of transforming a sequence from one domain into a corresponding sequence in another domain.
It should be noted that the model structures and training methods of the first model and the second model are not limited in the embodiments of the present invention. The first model and the second model may fuse multiple neural networks, including but not limited to at least one of, or a combination, superposition, or nesting of at least two of, the following: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, RNN (Simple Recurrent Neural Network), GRU (Gated Recurrent Unit), attention neural network, and the like.
In an optional embodiment of the present invention, the translation parallel corpus includes a first language text and a second language standard text corresponding to the first language text, and the first model may be trained through the following steps:
step S11, vectorizing the first language text to obtain a first language text vector;
step S12, inputting the first language text vector into a second model, and outputting a translation text of the first language text corresponding to a second language through the second model;
step S13, inputting the translated text into a first model, and outputting a processed text corresponding to the translated text through the first model;
step S14, calculating a total loss value according to the difference between the processed text and the second language standard text;
and step S15, adjusting the model parameters of the first model according to the total loss value until the calculated total loss value reaches a preset convergence condition, and obtaining the trained first model.
The translation parallel corpus is a pre-collected data set containing pairwise-aligned sentence pairs of a first language and a second language. Specifically, the translation parallel corpus includes a first language text and a second language standard text corresponding to the first language text. The first language text and the second language standard text take sentences as units, and both are pre-collected texts with standard expression.
In one example, the translation parallel corpus corresponding to the second language in the first language includes the following sentence pairs: the first language text is "people used letters or e-mail more frequently than phone in the past. "and the second language standard text corresponding to the first language text is" peer used letters or e-mail more than one languages in the pass ".
Firstly, the first language text is vectorized to obtain a first language text vector, denoted (x_1, x_2, ..., x_N). The first language text vector is then input into the second model, and the translated text of the first language text in the second language is output through the second model. Suppose the output translated text is "In the past people used the letter or email times more than they used telephones". The second model used here may be a trained second model or an initial second model. Next, the translated text is input into the first model, and the processed text corresponding to the translated text is output through the first model. The processed text is the translated text after polishing; suppose the processed text is "People used letters or email times more than telephones in the past". It is to be understood that the first model here is an initial first model. Finally, the total loss value is calculated according to the difference between the processed text (e.g., "People used letters or email times more than telephones in the past") and the second language standard text (e.g., "People used letters or e-mail more than telephones in the past"), and the model parameters of the first model are adjusted according to the total loss value until the calculated total loss value reaches a preset convergence condition, yielding the trained first model.
Further, if the first model and the second model are trained simultaneously, the model parameters of the first model and the model parameters of the second model may be adjusted according to the total loss value until the calculated total loss value reaches a preset convergence condition, so as to obtain the trained first model and the trained second model.
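A schematic sketch of one training step following steps S11 through S15, under strong simplifying assumptions: both models are stand-in networks, the translated and standard texts are taken to have equal length, and (as in step S15) only the first model's parameters are updated:

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 256

# Stand-ins for the two models: any sequence-to-sequence network whose
# output is a per-position distribution over the vocabulary would do here.
second_model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                             nn.Linear(hidden, vocab_size))  # translation model
first_model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                            nn.Linear(hidden, vocab_size))   # polishing model

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(first_model.parameters())  # step S15: first model only

def train_step(first_lang_ids, second_lang_std_ids):
    # Step S12: translate the first-language text with the second model.
    with torch.no_grad():
        trans_ids = second_model(first_lang_ids).argmax(dim=-1)
    # Step S13: polish the translated text with the first model.
    polish_logits = first_model(trans_ids)               # (T, vocab_size)
    # Step S14: loss from the difference between the processed text and
    # the second-language standard text (equal lengths assumed here).
    loss = loss_fn(polish_logits, second_lang_std_ids)
    # Step S15: adjust the first model's parameters.
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(train_step(torch.randint(0, vocab_size, (6,)),
                 torch.randint(0, vocab_size, (6,))))
```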
In an alternative embodiment of the invention, the first model comprises a first encoder and a first decoder, and the second model comprises a second encoder and a second decoder.
The second encoder and the second decoder may form a machine translation model structure from the first language to the second language, while the first encoder and the first decoder may form a polishing model structure for the second language. After the training process is completed, the polishing model structure comprising the first encoder and the first decoder may be extracted, yielding the first model for automatically polishing text in the second language.
In an optional embodiment of the present invention, the step S12 of inputting the first language text vector into a second model, and outputting the translated text of the first language text corresponding to a second language through the second model includes:
step S121, inputting the first language text vector into the second encoder for encoding to obtain a second encoder intermediate vector;
step S122, inputting the intermediate vector of the second encoder into the second decoder for decoding to obtain an intermediate vector of the second decoder;
step S123, generating a translation word segmentation probability sequence according to the second decoder intermediate vector, wherein each element in the translation word segmentation probability sequence represents the probability of each word in a second language word list appearing at each position in a translation text;
and step S124, determining target participles corresponding to each position in the translation text in the second language word list according to the translation participle probability sequence to obtain the translation text.
The second model comprises the second encoder and the second decoder, and inputting the first language text vector into the second model means inputting it into the second encoder. The second encoder converts the input first language text vector (x_1, x_2, ..., x_N) into a vector sequence, which is then input into the second decoder; the second decoder converts the vector sequence output by the second encoder into the translated text.
The second encoder may be an encoder including a multi-layer neural network, and the second decoder may be a decoder including a multi-layer neural network. The second encoder and the second decoder may use the same neural network or different neural networks.
The first language text vector is input into the second encoder for encoding to obtain a second encoder intermediate vector, denoted h_s. The process by which the second encoder encodes the first language text vector into the second encoder intermediate vector can be represented by the following formula:

h_s = encoder2(x_1, x_2, ..., x_N)

wherein encoder2 represents the encoding operation of the second encoder. The second encoder intermediate vector h_s is input into the second decoder for decoding to obtain a second decoder intermediate vector h_t:

h_t = decoder2(h_s)

wherein decoder2 denotes the decoding operation of the second decoder.
According to the second decoder intermediate vector h_t, a translation word segmentation probability sequence may be generated. Specifically, a normalization function (e.g., softmax) may map the second decoder intermediate vector h_t into the translation word segmentation probability sequence, where each element in the sequence represents the probability of each word in the second language word list appearing at the corresponding position in the translated text. Taking English as the second language, the second language word list is an English word list. The English word list contains the words whose frequency of occurrence, counted over English sentences, is above a set threshold.
In one example, the generated translation participle probability sequence is: [ { word1:0.01, word2:0.91, word3:0.01 … }, { word1:0.01, word2:0.04, word3:0.82 … } … ]. The 1 st element in the translation word segmentation probability sequence is { word1:0.01, word2:0.91, word3:0.01 … }, and the element represents the probability that each word in the English word list appears at the first position in the translation text, namely, the probability that the word1 in the English word list appears at the first position in the translation text is 0.01, the probability that the word2 appears at the first position in the translation text is 0.91, the probability that the word3 appears at the first position in the translation text is 0.01, and so on. Similarly, the 2 nd element in the translation word segmentation probability sequence is { word1:0.01, word2:0.04, word3:0.82 … }, and the element represents the probability that each word in the english word list appears at the second position in the translation text, that is, the probability that the word1 in the english word list appears at the second position in the translation text is 0.01, the probability that the word2 appears at the second position in the translation text is 0.04, the probability that the word3 appears at the second position in the translation text is 0.82, and so on.
The number of elements in the translation word segmentation probability sequence can be determined according to the length of the translation text predicted by the second model. In the embodiment of the invention, the length of the text can be represented by the number of the participles contained in the text. For example, the length of the translated text predicted by the second model is m (m is a positive integer), which means that the translated text includes m participles, and then the translated participle probability sequence contains m elements, each element corresponding to a participle probability distribution of one position.
According to the translation word segmentation probability sequence, target word segmentation corresponding to each position in the translation text can be determined in the second language word list, and the translation text can be obtained. For example, in the above example, the probability of the word2 in the 1 st element is the highest and is 0.91, and therefore, the target word corresponding to the first position of the translated text is word 2. Similarly, the probability of word3 in element 2 is the highest and is 0.82, so the target word in the second position of the translated text is word 3. And repeating the steps until the target word segmentation at the m positions is determined, so as to obtain the translated text.
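A minimal sketch of this per-position argmax selection, using the toy word list and probabilities from the example above:

```python
# Toy second-language word list and a two-position translation
# word segmentation probability sequence, as in the example.
prob_sequence = [
    {"word1": 0.01, "word2": 0.91, "word3": 0.01},  # position 1
    {"word1": 0.01, "word2": 0.04, "word3": 0.82},  # position 2
]

# For each position, pick the word with the highest probability.
translation = [max(pos.items(), key=lambda kv: kv[1])[0] for pos in prob_sequence]
print(translation)  # ['word2', 'word3']
```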
Optionally, in step S121, after the first language text vector is input to the second encoder for encoding, so as to obtain a second encoder intermediate vector, the method may further include: and carrying out normalization processing on the intermediate vector of the second encoder to obtain the length of the predicted translation text.
Specifically, a normalization function (e.g., softmax) may be used to map the second encoder intermediate vector h_s to a probability distribution of the predicted translated-text length over each integer between 1 and the maximum length, and the value with the highest probability is selected as the predicted translated-text length L_predict. The maximum length refers to the maximum predicted length of the translated text; the length of the translated text should lie between 1 and the maximum length.
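A minimal sketch of the length prediction, assuming the second encoder intermediate vector is pooled into a single vector before a linear layer scores every candidate length (the pooling and the layer shapes are assumptions, not from the disclosure):

```python
import torch
import torch.nn as nn

hidden, max_len = 256, 128
length_head = nn.Linear(hidden, max_len)  # one score per candidate length 1..max_len

def predict_length(h_s: torch.Tensor) -> int:
    """Map the second encoder intermediate vector (N, hidden) to the
    predicted translated-text length L_predict."""
    pooled = h_s.mean(dim=0)                           # collapse positions (assumption)
    probs = torch.softmax(length_head(pooled), dim=-1)  # distribution over 1..max_len
    return int(probs.argmax()) + 1                      # index 0 corresponds to length 1

print(predict_length(torch.randn(10, hidden)))
```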
In an optional embodiment of the present invention, after obtaining the translated text in step S124, the method may further include:
step S125, determining out-of-set words in the translation text, wherein the out-of-set words are words existing in the translation text but not existing in the second language word list;
and step S126, adding the out-of-set words to the second language word list.
In practical applications, named entities often appear in the original text. Named entities refer to person names, organization names, place names, and all other entities identified by a name. If the second language word list does not contain a named entity appearing in the original text, the target text obtained through the polishing of the first model may suffer semantic deviation. Semantic deviation refers to the phenomenon in which the polished text deviates in meaning from the original text or loses information.
For example, assume the original text is "The price given by Hualian is very acceptable", which contains the named entity "Hualian", a business name. Since this named entity does not exist in the English dictionary, polishing the original text through the first model may output the target text "The price offered by Hunan is acceptable", resulting in semantic deviation.
In order to avoid the situation of semantic deviation, in the embodiment of the present invention, after obtaining the translated text, an out-of-set word is determined in the translated text, and the out-of-set word is added to the second language word list, so as to implement expansion of the second language word list.
An out-of-set word refers to a word that does not require translation or modification, such as a named entity, a common abbreviation, or an acronym. By expanding the second language word list with the out-of-set words in the translated text, the expanded word list can be used in the subsequent training of the first model, avoiding semantic deviation caused by the first model failing to recognize out-of-set words in the translated text, and thereby improving the polishing accuracy of the first model.
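A minimal sketch of steps S125 and S126, assuming the word list is a token-to-id dictionary:

```python
def expand_vocab(translated_tokens, vocab):
    """Step S125: find out-of-set words (present in the translated text but
    absent from the second-language word list). Step S126: add them."""
    for tok in translated_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # append with a fresh id
    return vocab

vocab = {"the": 0, "price": 1, "given": 2, "by": 3, "is": 4}
print(expand_vocab(["the", "price", "given", "by", "Hualian"], vocab))
# 'Hualian' is added as an out-of-set word with id 5.
```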
In an optional embodiment of the present invention, the step S13, inputting the translated text into a first model, and outputting a processed text corresponding to the translated text through the first model, includes:
step S131, vectorizing the translation text to obtain a translation text vector;
step S132, inputting the translation text vector into the first encoder for encoding to obtain a first encoder intermediate vector;
step S133, inputting the first encoder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector;
step S134, generating a processed word segmentation probability sequence according to the first decoder intermediate vector, wherein each element in the processed word segmentation probability sequence represents the probability of each word in the first language word list appearing at each position in the processed text;
step S135, determining target segmented words corresponding to each position in the processed text in the first language word list according to the processed segmented word probability sequence, so as to obtain the processed text.
The first model comprises the first encoder and the first decoder, and inputting the translated text into the first model means inputting it into the first encoder. The first encoder converts the input translated text into a vector sequence, which is then input into the first decoder; the first decoder converts the vector sequence output by the first encoder into the processed text. The processed text is the translated text after polishing.
The first encoder may be an encoder including a multi-layer neural network, and the first decoder may be a decoder including a multi-layer neural network. The first encoder and the first decoder may use the same neural network or may use different neural networks.
The translated text of length m is vectorized to obtain a translated text vector, denoted (y_1, y_2, ..., y_m). The translated text vector is input into the first encoder for encoding to obtain a first encoder intermediate vector, denoted H_s. The process by which the first encoder encodes the translated text vector into the first encoder intermediate vector can be represented by the following formula:

H_s = encoder1(y_1, y_2, ..., y_m)

wherein encoder1 represents the encoding operation of the first encoder. The first encoder intermediate vector H_s is input into the first decoder for decoding to obtain a first decoder intermediate vector H_t:

H_t = decoder1(H_s)

wherein decoder1 denotes the decoding operation of the first decoder.
From the first decoder intermediate vector, a processed word segmentation probability sequence can be generated, denoted P_t(w). Each element in the processed word segmentation probability sequence represents the probability of each participle in the first language word list appearing at each position in the processed text. According to the processed word segmentation probability sequence, the target participle corresponding to each position in the processed text is determined in the word list, yielding the processed text. For example, after polishing the translated text "In the past people used the letter or email times more than they used telephones", the processed text "People used letters or email more than telephones in the past" is obtained.
Further, the embodiment of the present invention adds a copy network in the first model, and the copy network can be used to directly copy the correct part (such as named entity) of the original text that does not need to be translated or modified, so as to reduce the probability of occurrence of semantic deviation.
In an optional embodiment of the present invention, the generating a processed word segmentation probability sequence according to the first decoder intermediate vector in step S134 includes:
step A1, generating a first word segmentation probability sequence according to the first decoder intermediate vector;
step A2, connecting the first encoder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector;
step A3, inputting the combination vector into a full connection layer for calculation to obtain a copy vector;
step A4, generating a second word segmentation probability sequence according to the copy vector;
and A5, determining a processed word segmentation probability sequence according to the first word segmentation probability sequence and the second word segmentation probability sequence.
The embodiment of the invention generates the processed word segmentation probability sequence P_t(w) from the first decoder intermediate vector H_t as follows. First, a first word segmentation probability sequence is generated from the first decoder intermediate vector H_t: specifically, a normalization function (e.g., softmax) maps H_t into the first word segmentation probability sequence. The first word segmentation probability sequence represents, for each participle position in the sentence, a probability distribution over the whole second language vocabulary; that is, it represents the probability of generating each participle in the processed text, and the first word segmentation probability may therefore also be referred to as the "generation probability". Suppose the first word segmentation probability sequence is denoted P_gen.

Then, based on a preset dimension, the first encoder intermediate vector H_s and the first decoder intermediate vector H_t are connected to obtain a combined vector, denoted H_concat, where dim-1 indicates that the preset dimension is the first dimension. The first dimension represents the length of the text. Connecting the first encoder intermediate vector and the first decoder intermediate vector based on the first dimension means merging, for each word in the text, the corresponding first encoder intermediate vector and first decoder intermediate vector.

Next, the combined vector H_concat is input into a fully connected layer for calculation to obtain a copy vector, denoted H_copy:

H_copy = W_concat(H_concat)

wherein W_concat represents the fully connected layer computation, and the first dimension of the copy vector H_copy has the same length as the first decoder intermediate vector H_t.

The copy vector obtained by combining the first encoder intermediate vector and the first decoder intermediate vector retains the information of the parts that are correct and do not require translation or modification. From the copy vector, a second word segmentation probability sequence can be generated: specifically, the copy vector H_copy is normalized using a normalization function (e.g., softmax) and mapped into a probability distribution of each participle in the sentence over the whole second language vocabulary. Each element in the second word segmentation probability sequence represents the probability that the participle at each position in the processed text is the corresponding participle in the original text; the second word segmentation probability may therefore also be referred to as the "copy probability". Suppose the second word segmentation probability sequence is denoted P_copy: the higher the copy probability of a word in the second word segmentation probability sequence, the higher the probability that the word is copied from the original text.

Finally, the processed word segmentation probability sequence P_t(w) is determined according to the first word segmentation probability sequence P_gen and the second word segmentation probability sequence P_copy, that is, jointly from the generation probability and the copy probability. Specifically, two trainable parameters α_copy and α_gen may be used to combine P_gen and P_copy into the processed word segmentation probability sequence P_t(w), i.e., the final probability distribution.
Specifically, P_t(w) can be calculated by the following formula:

P_t(w) = α_copy × P_copy + α_gen × P_gen    (1)
According to the processed word segmentation probability sequence P_t(w), the target participle at each position is selected from the second language word list to obtain the processed text.
It should be noted that the length of the processed text may be the same as or different from the length of the translated text. In practical applications, the length of the processed text may be greater than, less than, or equal to the length of the translated text. In the embodiment of the present invention, the length of the processed text is equal to the length of the translated text.
The embodiment of the invention adds a copy network to the first model. The parameters α_copy and α_gen of the copy network are updated during the training of the first model according to the parts left unchanged between the translated text and the second language standard text; by continuously adjusting α_copy and α_gen, the copy network learns the pattern of the parts of the second language standard text that are correct or need no modification. After training, the first model can automatically identify the correct or unmodified parts of the original text and copy them directly.
For example, for the original text "The price given by Hualian is very acceptable", the copy network combines the first encoder intermediate vector and the first decoder intermediate vector of the original text to obtain the copy vector, and normalizes the copy vector to obtain the copy probabilities, including the copy probability of the participle "Hualian". In the resulting processed word segmentation probability sequence, the participle "Hualian" can therefore be copied directly, so that the named entity "Hualian" is retained in the target text, reducing the probability of semantic deviation caused by an unrecognizable named entity.
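A minimal sketch of the copy network computation described above, assuming per-position concatenation of H_s and H_t and a shared output projection for both probability sequences (these choices, and all shapes, are illustrative assumptions rather than the patent's exact architecture):

```python
import torch
import torch.nn as nn

T, hidden, vocab_size = 12, 256, 1000
W_concat = nn.Linear(2 * hidden, hidden)      # the fully connected layer
out_proj = nn.Linear(hidden, vocab_size)      # vectors -> vocabulary scores (assumption)
alpha_copy = nn.Parameter(torch.tensor(0.5))  # trainable mixing parameters
alpha_gen = nn.Parameter(torch.tensor(0.5))

H_s = torch.randn(T, hidden)  # first encoder intermediate vector
H_t = torch.randn(T, hidden)  # first decoder intermediate vector

# Generation probability P_gen: softmax over the vocabulary from H_t.
P_gen = torch.softmax(out_proj(H_t), dim=-1)          # (T, vocab_size)

# Combined vector: connect the per-position encoder and decoder vectors,
# then the fully connected layer yields the copy vector H_copy, whose
# first dimension has the same length as H_t.
H_concat = torch.cat([H_s, H_t], dim=-1)              # (T, 2*hidden)
H_copy = W_concat(H_concat)                           # (T, hidden)

# Copy probability P_copy: softmax over the vocabulary from H_copy.
P_copy = torch.softmax(out_proj(H_copy), dim=-1)      # (T, vocab_size)

# Formula (1): P_t(w) = alpha_copy * P_copy + alpha_gen * P_gen.
P_t = alpha_copy * P_copy + alpha_gen * P_gen
print(P_t.shape)  # torch.Size([12, 1000])
```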
In an optional embodiment of the present invention, the training of the first model and the second model may be performed simultaneously based on parallel corpora from the first language to the second language, and the calculating of the total loss value according to the difference between the processed text and the standard text in the second language in step S14 specifically includes:
step S141, calculating a first loss value according to the difference between the processed text and the second language standard text;
step S142, calculating a second loss value according to the difference between the translated text and the second language standard text;
step S143, calculating a third loss value according to the difference between the length of the translation text and the length of the second language standard text;
step S144, calculating a total loss value according to the first loss value, the second loss value, and the third loss value.
Specifically, a first loss value Loss_polish is calculated according to the difference between the processed text and the second language standard text. A second loss value Loss_trans is calculated according to the difference between the translated text and the second language standard text. A third loss value Loss_length is calculated according to the difference between the predicted translated-text length L_predict and the actual length L_real of the second language standard text. Finally, the total loss value Loss is calculated according to the first loss value Loss_polish, the second loss value Loss_trans, and the third loss value Loss_length.
Optionally, using two trainable parameters α and β, the total loss value Loss is calculated according to:

Loss = Loss_polish + α × Loss_trans + β × Loss_length    (2)
The model parameters of the first model and of the second model are adjusted according to the total loss value Loss so as to minimize Loss, at which point the training of the first model and the second model is complete. The first model can then be extracted for automatically polishing original text in the second language.
It should be noted that, owing to the general characteristics of neural network models, the first model easily produces redundant sentences (e.g., repeating a word or phrase within a sentence) or missing content at the initial stage of training, which manifests as sentences that are too long or too short. Therefore, the embodiment of the invention uses the loss value of the translated-text length as a training parameter guiding the first model and the second model, which reduces the probability that the second decoder generates redundant or incomplete sentences and improves the overall training effect of the two models. It can be understood that the number and type of training parameters used to calculate the total loss value are not limited in the embodiments of the present invention.
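A minimal sketch of formula (2), assuming cross-entropy for the two text losses and an absolute difference for the length loss (the patent does not fix the individual loss functions, so these are illustrative choices):

```python
import torch
import torch.nn.functional as F

alpha, beta = 0.5, 0.5  # trainable in the patent; fixed constants here

def total_loss(polish_logits, trans_logits, std_ids, L_predict, L_real):
    """Formula (2): Loss = Loss_polish + alpha * Loss_trans + beta * Loss_length."""
    loss_polish = F.cross_entropy(polish_logits, std_ids)  # processed vs. standard
    loss_trans = F.cross_entropy(trans_logits, std_ids)    # translated vs. standard
    loss_length = abs(float(L_predict) - float(L_real))    # length difference
    return loss_polish + alpha * loss_trans + beta * loss_length

loss = total_loss(torch.randn(6, 1000), torch.randn(6, 1000),
                  torch.randint(0, 1000, (6,)), L_predict=7, L_real=6)
print(loss.item())
```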
The following describes in detail the specific process of polishing an original text in the second language with the trained first model to obtain the target text. Taking English as the second language, before polishing the original text, the trained first model and the English word list need to be obtained first. The process of polishing the original text to be polished in the second language using the first model is as follows:
and step B1, vectorizing the original text to obtain an original text vector.
Specifically, the original text is vectorized in units of sentences, with each participle in the sentence converted into a corresponding vector. Assuming the original text has length N (N being a positive integer) and each participle is converted into a fixed-length vector x, the original text vector can be represented as (x_1, x_2, ..., x_N).
And step B2, inputting the original text vector into the first encoder for encoding to obtain an intermediate vector of the first encoder.
Specifically, the original text vector (x_1, x_2, ..., x_N) is input into the first encoder for encoding, and the first encoder converts it into the first encoder intermediate vector H_s.
And step B3, inputting the first coder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector.
Specifically, the first encoder intermediate vector H_s is input into the first decoder for decoding to obtain the first decoder intermediate vector H_t.
And step B4, generating a first word segmentation probability sequence according to the first decoder intermediate vector.
Specifically, the first decoder intermediate vector H_t is mapped, using a normalization function such as softmax, into a probability distribution P_gen of each participle in the sentence over the whole second language vocabulary, also referred to for short as the "generation probability".
And step B5, connecting the first coder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector.
Specifically, the first encoder intermediate vector H_s and the first decoder intermediate vector H_t are connected and combined along the first dimension to obtain the combined vector H_concat.
And step B6, inputting the combined vector into a full connection layer for calculation to obtain a copy vector.
Specifically, the combined vector H_concat is input into a fully connected layer for calculation to obtain the copy vector H_copy, where the first dimension of H_copy has the same length as the decoder intermediate vector H_t.
And B7, generating a second word segmentation probability sequence according to the copy vector.
Specifically, the copy vector H_copy is normalized using a normalization function such as softmax and mapped into a probability distribution P_copy of each participle in the sentence over the whole second language vocabulary, also referred to as the "copy probability".
And step B8, determining the processed word segmentation probability sequence according to the first word segmentation probability sequence, the second word segmentation probability sequence and the trained parameters.
Specifically, using the trained parameters α_copy and α_gen, P_copy and P_gen are combined to obtain the processed word segmentation probability sequence: P_t(w) = α_copy × P_copy + α_gen × P_gen. According to the processed word segmentation probability sequence, the target participle at each position is selected from the English word list to obtain the target text.
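A minimal end-to-end sketch of steps B1 through B8, with GRU stand-ins for the first encoder and first decoder and fixed mixing weights standing in for the trained parameters (all of these are illustrative assumptions, not the patent's exact networks):

```python
import torch
import torch.nn as nn

hidden, vocab_size = 256, 1000
embed = nn.Embedding(vocab_size, hidden)
encoder1 = nn.GRU(hidden, hidden)         # first encoder (stand-in)
decoder1 = nn.GRU(hidden, hidden)         # first decoder (stand-in)
W_concat = nn.Linear(2 * hidden, hidden)
out_proj = nn.Linear(hidden, vocab_size)
alpha_copy, alpha_gen = 0.5, 0.5          # trained parameters (fixed here)

def polish(original_ids: torch.Tensor) -> torch.Tensor:
    x = embed(original_ids).unsqueeze(1)           # B1: original text vector (N,1,hidden)
    H_s, _ = encoder1(x)                           # B2: first encoder intermediate vector
    H_t, _ = decoder1(H_s)                         # B3: first decoder intermediate vector
    P_gen = torch.softmax(out_proj(H_t), -1)       # B4: generation probability
    H_concat = torch.cat([H_s, H_t], dim=-1)       # B5: combined vector
    H_copy = W_concat(H_concat)                    # B6: copy vector
    P_copy = torch.softmax(out_proj(H_copy), -1)   # B7: copy probability
    P_t = alpha_copy * P_copy + alpha_gen * P_gen  # B8: processed probabilities
    return P_t.argmax(dim=-1).squeeze(1)           # target participle ids per position

print(polish(torch.tensor([3, 7, 42, 9])))
```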
In an optional embodiment of the present invention, before the vectorizing the original text in step S101, the method may further include:
step S21, identifying the named entity in the original text;
step S22, marking the participles corresponding to the named entities in the original text;
after the target text is output through the first model, the method may further include: and replacing the translation result corresponding to the marked word segmentation in the target text with the named entity.
To further reduce the probability of semantic deviation caused by unrecognized named entities, the embodiment of the invention performs named entity recognition on the original text before polishing it, so as to identify the named entities it contains.
After the named entities in the original text are identified, the participles corresponding to the named entities are marked. After the marked original text is polished by the trained first model to obtain the target text, the translation result corresponding to each marked participle is replaced with the marked participle (i.e., the named entity) according to the correspondence between the marks and the participles in the original text.
For example, for the original text "The price given by Hualian is very acceptable", entity recognition is first performed before polishing, finding that the original text contains the named entity "Hualian", and the corresponding participle is marked. Suppose that after the first model polishes the original text, the obtained target text is "The price offered by Hunan is acceptable". According to the correspondence between the marks and the participles in the original text, the translation result ("Hunan") corresponding to the marked participle ("Hualian") is replaced with the marked participle, and the replaced target text is "The price offered by Hualian is acceptable". This further reduces the probability of semantic deviation caused by an unrecognizable named entity and improves the accuracy of the final target text.
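A minimal sketch of the marking-and-replacement flow, assuming a hypothetical placeholder scheme for the marks and a toy stand-in for the first model:

```python
def polish_with_entities(original: str, polish_fn, entities):
    """Steps S21-S22 plus the post-replacement: mark named entities before
    polishing, then restore them in the target text. The placeholder scheme
    (ENT0, ENT1, ...) is a hypothetical marking convention."""
    placeholders = {}
    marked = original
    for i, entity in enumerate(entities):
        placeholder = f"ENT{i}"
        placeholders[placeholder] = entity
        marked = marked.replace(entity, placeholder)
    target = polish_fn(marked)                  # polishing by the first model
    for placeholder, entity in placeholders.items():
        target = target.replace(placeholder, entity)
    return target

# Toy polish function standing in for the first model.
fake_polish = lambda s: s.replace("given", "offered")
print(polish_with_entities("The price given by Hualian is very acceptable",
                           fake_polish, ["Hualian"]))
# -> The price offered by Hualian is very acceptable
```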
In an optional embodiment of the present invention, after the outputting the target text through the first model in step 102, the method may further include: and displaying the modification information of the target text relative to the original text.
After the first model polishes the original text and outputs the target text, the embodiment of the invention can display the modification information of the target text relative to the original text, so that the user can see intuitively which modifications were made. The modification information at least includes deleting participles from the original text, adding new participles, adjusting the order of participles, and the like.
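A minimal sketch of displaying the modification information, assuming a word-level diff (difflib is used here as an illustrative tool; the patent does not specify how the modifications are computed):

```python
import difflib

def show_modifications(original: str, target: str):
    """Display modification information of the target text relative to the
    original text: deleted participles and added participles."""
    for token in difflib.ndiff(original.split(), target.split()):
        if token.startswith("- "):
            print(f"deleted: {token[2:]}")
        elif token.startswith("+ "):
            print(f"added:   {token[2:]}")

show_modifications(
    "In the past people used the letter or email times more than they used telephones",
    "People used letters or email more than telephones in the past",
)
```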
To sum up, the embodiment of the present invention trains the first model in advance. The first model may be used to polish an original text to obtain a target text, where the original text and the target text correspond to the same language. The first model is trained based on a translation parallel corpus between the first language and the second language and on the output of the second model, and the second model is used for translating text in the first language into text in the second language. The embodiment of the invention can automatically polish the original text through the first model; compared with manual polishing, this saves a large amount of labor cost and improves polishing efficiency. In addition, because translation parallel corpora are easier to obtain than polishing parallel corpora, the embodiment of the invention combines the first model and the second model and trains the first model using the translation parallel corpus from the first language to the second language together with the output of the second model, which alleviates the scarcity of training data for the first model and improves its robustness.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 2, a block diagram of a text processing apparatus according to an embodiment of the present invention is shown, where the apparatus may include:
a first vectorization module 201, configured to perform vectorization processing on an original text to obtain an original text vector;
the processing module 202 is configured to input the original text vector into a first model and output a target text through the first model, where the original text and the target text correspond to the same language; the first model includes a copy network configured to retain copied text from the original text in the target text; the first model is trained based on the translation parallel corpus of the first language corresponding to a second language and on the output result of the second model; and the second model is configured to translate text of the first language into text of the second language.
Optionally, the translated parallel corpus includes a first language text and a second language standard text corresponding to the first language text, and the apparatus further includes:
the second vectorization module is used for vectorizing the first language text to obtain a first language text vector;
the translation module is used for inputting the first language text vector into a second model and outputting a translation text of the first language text corresponding to the second language through the second model;
the polishing module is used for inputting the translated text into the first model, retaining the copied text in the translated text through the copy network of the first model, and outputting, through the first model, a processed text corresponding to the translated text, wherein the copied text in the translated text is retained in the processed text;
the loss calculation module is used for calculating a total loss value according to the difference between the processed text and the second language standard text;
and the parameter adjusting module is used for adjusting the model parameters of the first model according to the total loss value until the calculated total loss value reaches a preset convergence condition, so as to obtain the trained first model.
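A compact sketch of this training procedure is given below, assuming PyTorch, hypothetical `second_model.translate` and `first_model.polish` interfaces, and a hypothetical `compute_total_loss` helper of the kind sketched after the loss calculation module further down; none of these names or the convergence test are prescribed by the embodiment.

```python
import torch

def train_first_model(first_model, second_model, parallel_corpus,
                      optimizer, epochs=10, threshold=1e-3):
    second_model.eval()  # the translation model stays fixed here
    for _ in range(epochs):
        for source_text, reference_text in parallel_corpus:
            with torch.no_grad():
                translated = second_model.translate(source_text)
            # Polish the machine translation; copied spans are kept
            # by the copy network inside the first model.
            processed = first_model.polish(translated)
            # Hypothetical helper: weighted sum of the three losses
            # described by the loss calculation module below.
            loss = compute_total_loss(processed, translated, reference_text)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < threshold:  # assumed convergence condition
            return first_model
    return first_model
```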
Optionally, the copy network includes a first encoder and a first decoder, and the polishing module includes:
the vectorization submodule is used for vectorizing the translation text to obtain a translation text vector;
the first coding submodule is used for inputting the translation text vector into the first coder for coding to obtain a first coder intermediate vector;
the first decoding submodule is used for inputting the first encoder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector;
a copy sub-module for generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector;
the first probability calculation submodule is used for generating a processed word segmentation probability sequence according to the copy vector, and each element in the processed word segmentation probability sequence represents the probability of each word in the first language word list appearing at each position in the processed text;
and the processing submodule is used for determining target word segmentation corresponding to each position in the processed text in the first language word list according to the processed word segmentation probability sequence to obtain the processed text.
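One way to wire these submodules together is sketched below as a single PyTorch module; the GRU cells, the hidden size, and feeding the decoder with the same embedded input (rather than shifted target tokens) are simplifying assumptions, not details fixed by the embodiment. The mixing of the two probability sequences is sketched after the next two lists.

```python
import torch
import torch.nn as nn

class CopyPolisher(nn.Module):
    # Minimal encoder-decoder with a copy path, mirroring the
    # submodules above (dimensions and cell types are assumptions).
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.generate = nn.Linear(dim, vocab_size)     # generation path
        self.copy_fc = nn.Linear(2 * dim, vocab_size)  # copy path

    def forward(self, token_ids):
        emb = self.embed(token_ids)           # translated-text vector
        enc_states, h = self.encoder(emb)     # first encoder intermediate vector
        dec_states, _ = self.decoder(emb, h)  # first decoder intermediate vector
        # First word-segmentation probability sequence, decoder only.
        p_gen = torch.softmax(self.generate(dec_states), dim=-1)
        # Combined vector: encoder and decoder states joined along the
        # feature ("preset") dimension, then a fully connected layer.
        combined = torch.cat([enc_states, dec_states], dim=-1)
        p_copy = torch.softmax(self.copy_fc(combined), dim=-1)
        return p_gen, p_copy

model = CopyPolisher(vocab_size=10000)
p_gen, p_copy = model(torch.randint(0, 10000, (1, 6)))  # one sentence
```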
Optionally, the copy submodule includes:
a first probability calculation unit for generating a first segmentation probability sequence based on the first decoder intermediate vector;
a connection unit, configured to connect the first encoder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector;
and the copying unit is used for inputting the combination vector into a full-connection layer for calculation to obtain a copy vector.
Optionally, the first probability computation submodule includes:
the second probability calculation unit is used for generating a second word segmentation probability sequence according to the copy vector;
and the third probability calculation unit is used for determining the processed word segmentation probability sequence according to the first word segmentation probability sequence, the second word segmentation probability sequence, the first parameter corresponding to the first word segmentation probability sequence and the second parameter corresponding to the second word segmentation probability sequence.
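Continuing the sketch above, the first probability calculation submodule then mixes the two sequences. Treating the first and second parameters as fixed scalar weights is an assumption for illustration; they could equally be learned or position-dependent.

```python
def mix_probabilities(p_gen, p_copy, w1=0.5, w2=0.5):
    # Processed word-segmentation probability sequence: for every
    # position, a distribution over the first-language vocabulary.
    return w1 * p_gen + w2 * p_copy
```

Taking the argmax of the mixed sequence over the vocabulary axis at each position then yields the target word segments of the processed text.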
Optionally, the second model includes a second encoder and a second decoder, and the translation module includes:
the second coding submodule is used for inputting the first language text vector into the second coder for coding to obtain a second coder intermediate vector;
the second decoding submodule is used for inputting the second encoder intermediate vector into the second decoder for decoding to obtain a second decoder intermediate vector;
the second probability calculation submodule is used for generating a translation word segmentation probability sequence according to the second decoder intermediate vector, and each element in the translation word segmentation probability sequence represents the probability of each word in a second language word list appearing at each position in a translation text;
and the translation submodule is used for determining, in the second language word list, the target word segment corresponding to each position in the translated text according to the translation word-segmentation probability sequence, so as to obtain the translated text.
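The step from a translation word-segmentation probability sequence to the translated text can be sketched as greedy decoding over the second-language vocabulary; the embodiment does not fix the search strategy, so beam search would fit equally well.

```python
import torch

def decode_sequence(prob_seq, vocab):
    # prob_seq: (seq_len, vocab_size) probabilities per position;
    # vocab: list mapping index -> second-language word.
    ids = prob_seq.argmax(dim=-1)
    return " ".join(vocab[int(i)] for i in ids)
```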
Optionally, the apparatus further comprises:
an out-of-set word determination module, configured to determine out-of-set words in the translated text, where an out-of-set word is a word that exists in the translated text but not in the second language vocabulary;
and the out-of-set word adding module is used for adding the out-of-set words to the second language word list.
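A minimal sketch of the out-of-set word handling, assuming the vocabulary is kept as a list plus an index dictionary (an implementation choice, not part of the embodiment); appending unseen words lets the copy path point at them.

```python
def add_out_of_set_words(translated_tokens, vocab, index):
    # vocab: index -> word; index: word -> index.
    for token in translated_tokens:
        if token not in index:         # word absent from the second
            index[token] = len(vocab)  # language vocabulary: append it
            vocab.append(token)
    return vocab, index
```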
Optionally, the loss calculating module includes:
the first loss calculation submodule is used for calculating a first loss value according to the difference between the processed text and the second language standard text;
the second loss calculation submodule is used for calculating a second loss value according to the difference between the translated text and the second language standard text;
a third loss calculation sub-module, configured to calculate a third loss value according to a difference between the length of the translated text and the length of the second language standard text;
and the total loss calculation submodule is used for calculating a total loss value according to the first loss value, the second loss value and the third loss value.
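The three losses can be combined as a weighted sum, for example as below. Using cross entropy for the first two terms and these particular weights are assumptions; the embodiment only requires that the total loss reflect all three differences.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits, reference_ids):
    # Token-level cross entropy between predicted scores (logits) and
    # a reference token sequence, assumed padded to a common length.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           reference_ids.view(-1))

def total_loss(processed_logits, translated_logits, reference_ids,
               translated_len, reference_len, weights=(1.0, 1.0, 0.1)):
    l1 = sequence_nll(processed_logits, reference_ids)   # first loss
    l2 = sequence_nll(translated_logits, reference_ids)  # second loss
    l3 = torch.tensor(float(abs(translated_len - reference_len)))  # length gap
    return weights[0] * l1 + weights[1] * l2 + weights[2] * l3
```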
Optionally, the apparatus further comprises:
the identification module is used for identifying the named entities in the original text;
the marking module is used for marking the word segments corresponding to the named entities in the original text;
the device further comprises:
and the replacing module is used for replacing the translation result corresponding to the marked word segmentation in the target text with the named entity.
Optionally, the apparatus further comprises:
and the information display module is used for displaying the modification information of the target text relative to the original text.
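The modification information can be derived, for example, with a standard token-level diff; `difflib` is one convenient choice for illustration, not something the embodiment prescribes.

```python
import difflib

def show_modifications(original_text, target_text):
    # Print word segments deleted from the original and segments added
    # in the target; reorderings show up as paired deletions/additions.
    for line in difflib.ndiff(original_text.split(), target_text.split()):
        if line.startswith("- "):
            print("deleted:", line[2:])
        elif line.startswith("+ "):
            print("added:  ", line[2:])

show_modifications("the price given by Hualian is reasonable",
                   "the price offered by Hualian is reasonable")
# -> deleted: given / added: offered
```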
The embodiment of the invention trains the first model in advance; the first model can be used to polish the original text to obtain the target text, and the original text and the target text correspond to the same language. The first model is trained based on the translation parallel corpus of the first language corresponding to the second language and on the output result of the second model, and the second model is used to translate text of the first language into text of the second language. According to the embodiment of the invention, automatic polishing of the original text is realized through the first model; compared with manual polishing, a large amount of labor cost is saved and polishing efficiency is improved. In addition, because translation parallel corpora are easier to obtain than polishing parallel corpora, the embodiment of the invention combines the first model and the second model, training the first model with the translation parallel corpus from the first language to the second language and the output result of the second model; this alleviates the scarcity of training data for the first model and improves its robustness.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for text processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: vectorizing the original text to obtain an original text vector; and inputting the original text vector into a first model and outputting a target text through the first model, where the original text and the target text correspond to the same language, the first model includes a copy network configured to retain copied text from the original text in the target text, the first model is trained based on the translation parallel corpus of the first language corresponding to a second language and on the output result of the second model, and the second model is used to translate text of the first language into text of the second language.
Fig. 3 is a block diagram illustrating an apparatus 800 for text processing according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the on/off status of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the text processing method shown in fig. 1.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a text processing method, the method comprising: vectorizing the original text to obtain an original text vector; inputting the original text vector into a first model, outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network, the copy network is used for reserving the copy text in the original text in the target text, the first model is obtained by training based on a translation parallel corpus of the first language corresponding to a second language and an output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
The embodiment of the invention discloses A1, a text processing method, comprising the following steps:
vectorizing the original text to obtain an original text vector;
inputting the original text vector into a first model, outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network, the copy network is used for reserving the copy text in the original text in the target text, the first model is obtained by training based on a translation parallel corpus of the first language corresponding to a second language and an output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
A2, according to the method in A1, the translated parallel corpus includes a first language text and a second language standard text corresponding to the first language text, and the first model is trained through the following steps:
vectorizing the first language text to obtain a first language text vector;
inputting the first language text vector into a second model, and outputting a translation text of the first language text corresponding to a second language through the second model;
inputting the translated text into a first model, reserving the copied text in the translated text through a copy network of the first model, and outputting a processed text corresponding to the translated text through the first model, wherein the copied text in the translated text is reserved in the processed text;
calculating a total loss value according to the difference between the processed text and the second language standard text;
and adjusting the model parameters of the first model according to the total loss value until the calculated total loss value reaches a preset convergence condition, so as to obtain the trained first model.
A3, the method according to A2, wherein the copy network comprises a first encoder and a first decoder, the inputting the translated text into a first model, retaining the copy text in the translated text through the copy network of the first model, and outputting the processed text corresponding to the translated text through the first model, comprises:
vectorizing the translation text to obtain a translation text vector;
inputting the translation text vector into the first encoder for encoding to obtain a first encoder intermediate vector;
inputting the first encoder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector;
generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector;
generating a processed word segmentation probability sequence according to the copy vector, wherein each element in the processed word segmentation probability sequence represents the probability of each word in the first language word list appearing at each position in the processed text;
and determining target word segmentation corresponding to each position in the processed text in the first language word list according to the processed word segmentation probability sequence to obtain the processed text.
A4, the method of A3, wherein generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector, comprises:
generating a first segmentation probability sequence according to the first decoder intermediate vector;
connecting the first encoder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector;
and inputting the combination vector into a full connection layer for calculation to obtain a copy vector.
A5, according to the method of A4, the generating a processed word segmentation probability sequence according to the copy vector includes:
generating a second word segmentation probability sequence according to the copy vector;
and determining the word segmentation probability sequence after processing according to the first word segmentation probability sequence, the second word segmentation probability sequence, the first parameter corresponding to the first word segmentation probability sequence and the second parameter corresponding to the second word segmentation probability sequence.
A6, the method according to A2, the second model comprising a second encoder and a second decoder, the inputting the first language text vector into the second model, outputting a translated text of the first language text corresponding to a second language through the second model, comprising:
inputting the first language text vector into the second encoder for encoding to obtain a second encoder intermediate vector;
inputting the intermediate vector of the second encoder into the second decoder for decoding to obtain an intermediate vector of the second decoder;
generating a translation word segmentation probability sequence according to the second decoder intermediate vector, wherein each element in the translation word segmentation probability sequence represents the probability of each word in a second language word list appearing at each position in a translation text;
and determining target participles corresponding to each position in the translation text in the second language word list according to the translation participle probability sequence to obtain the translation text.
A7, according to the method of A6, after the obtaining the translated text, the method further comprises:
determining out-of-set words in the translated text, wherein the out-of-set words are words existing in the translated text but not existing in the second language vocabulary;
and adding the out-of-set words to the second language word list.
A8, according to the method of A2, calculating a total loss value according to a difference between the processed text and the second language standard text, including:
calculating a first loss value according to the difference between the processed text and the second language standard text;
calculating a second loss value according to the difference between the translated text and the second language standard text;
calculating a third loss value according to a difference between the length of the translated text and the length of the second language standard text;
calculating a total loss value based on the first loss value, the second loss value, and the third loss value.
A9, the method according to A1, wherein before the vectorizing of the original text, the method further comprises:
identifying a named entity in the original text;
marking the participles corresponding to the named entities in the original text;
after the target text is output through the first model, the method further comprises:
and replacing the translation result corresponding to the marked word segmentation in the target text with the named entity.
A10, the method according to A1, wherein after the target text is output through the first model, the method further comprises:
and displaying the modification information of the target text relative to the original text.
The embodiment of the invention discloses B11, a text processing device, comprising:
the first vectorization module is used for vectorizing the original text to obtain an original text vector;
the processing module is used for inputting the original text vector into a first model and outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network configured to retain the copied text from the original text in the target text, the first model is trained based on the translation parallel corpus of the first language corresponding to the second language and the output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
B12, the apparatus according to B11, wherein the translated parallel corpus includes a first language text and a second language standard text corresponding to the first language text, the apparatus further including:
the second vectorization module is used for vectorizing the first language text to obtain a first language text vector;
the translation module is used for inputting the first language text vector into a second model and outputting a translation text of the first language text corresponding to the second language through the second model;
the polishing module is used for inputting the translated text into the first model, retaining the copied text in the translated text through the copy network of the first model, and outputting, through the first model, a processed text corresponding to the translated text, wherein the copied text in the translated text is retained in the processed text;
the loss calculation module is used for calculating a total loss value according to the difference between the processed text and the second language standard text;
and the parameter adjusting module is used for adjusting the model parameters of the first model according to the total loss value until the calculated total loss value reaches a preset convergence condition, so as to obtain the trained first model.
B13, the apparatus according to B12, the copy network comprising a first encoder and a first decoder, the polishing module comprising:
the vectorization submodule is used for vectorizing the translation text to obtain a translation text vector;
the first coding submodule is used for inputting the translation text vector into the first coder for coding to obtain a first coder intermediate vector;
the first decoding submodule is used for inputting the first encoder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector;
a copy sub-module for generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector;
the first probability calculation submodule is used for generating a processed word segmentation probability sequence according to the copy vector, and each element in the processed word segmentation probability sequence represents the probability of each word in the first language word list appearing at each position in the processed text;
and the processing submodule is used for determining target word segmentation corresponding to each position in the processed text in the first language word list according to the processed word segmentation probability sequence to obtain the processed text.
B14, the apparatus of B13, the copy submodule comprising:
a first probability calculation unit for generating a first segmentation probability sequence based on the first decoder intermediate vector;
a connection unit, configured to connect the first encoder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector;
and the copying unit is used for inputting the combination vector into a full-connection layer for calculation to obtain a copy vector.
B15, the apparatus of B14, the first probability computation submodule comprising:
the second probability calculation unit is used for generating a second word segmentation probability sequence according to the copy vector;
and the third probability calculation unit is used for determining the processed word segmentation probability sequence according to the first word segmentation probability sequence, the second word segmentation probability sequence, the first parameter corresponding to the first word segmentation probability sequence and the second parameter corresponding to the second word segmentation probability sequence.
B16, the apparatus according to B12, the second model comprising a second encoder and a second decoder, the translation module comprising:
the second coding submodule is used for inputting the first language text vector into the second coder for coding to obtain a second coder intermediate vector;
the second decoding submodule is used for inputting the second encoder intermediate vector into the second decoder for decoding to obtain a second decoder intermediate vector;
the second probability calculation submodule is used for generating a translation word segmentation probability sequence according to the second decoder intermediate vector, and each element in the translation word segmentation probability sequence represents the probability of each word in a second language word list appearing at each position in a translation text;
and the translation sub-module is used for determining target participles corresponding to each position in the translation text in the second language word list according to the translation participle probability sequence to obtain the translation text.
B17, the apparatus of B16, the apparatus further comprising:
an out-of-set word determination module, configured to determine out-of-set words in the translated text, where an out-of-set word is a word that exists in the translated text but not in the second language vocabulary;
and the out-of-set word adding module is used for adding the out-of-set words to the second language word list.
B18, the apparatus of B12, the loss calculation module comprising:
the first loss calculation submodule is used for calculating a first loss value according to the difference between the processed text and the second language standard text;
the second loss calculation submodule is used for calculating a second loss value according to the difference between the translated text and the second language standard text;
a third loss calculation sub-module, configured to calculate a third loss value according to a difference between the length of the translated text and the length of the second language standard text;
and the total loss calculation submodule is used for calculating a total loss value according to the first loss value, the second loss value and the third loss value.
B19, the apparatus of B11, the apparatus further comprising:
the identification module is used for identifying the named entities in the original text;
the marking module is used for marking the participles corresponding to the named entities in the original text;
the device further comprises:
and the replacing module is used for replacing the translation result corresponding to the marked word segmentation in the target text with the named entity.
B20, the apparatus of B11, the apparatus further comprising:
and the information display module is used for displaying the modification information of the target text relative to the original text.
The embodiment of the invention discloses C21, a device for text processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
vectorizing the original text to obtain an original text vector;
inputting the original text vector into a first model, outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network, the copy network is used for reserving the copy text in the original text in the target text, the first model is obtained by training based on a translation parallel corpus of the first language corresponding to a second language and an output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
C22, according to the device of C21, the translated parallel corpus includes a first language text and a second language standard text corresponding to the first language text, and the first model is trained through the following steps:
vectorizing the first language text to obtain a first language text vector;
inputting the first language text vector into a second model, and outputting a translation text of the first language text corresponding to a second language through the second model;
inputting the translated text into a first model, reserving the copied text in the translated text through a copy network of the first model, and outputting a processed text corresponding to the translated text through the first model, wherein the copied text in the translated text is reserved in the processed text;
calculating a total loss value according to the difference between the processed text and the second language standard text;
and adjusting the model parameters of the first model according to the total loss value until the calculated total loss value reaches a preset convergence condition, so as to obtain the trained first model.
C23, the apparatus according to C22, the copy network comprising a first encoder and a first decoder, the inputting the translated text into a first model, retaining the copied text in the translated text through the copy network of the first model, and outputting the processed text corresponding to the translated text through the first model, comprising:
vectorizing the translation text to obtain a translation text vector;
inputting the translation text vector into the first encoder for encoding to obtain a first encoder intermediate vector;
inputting the first encoder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector;
generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector;
generating a processed word segmentation probability sequence according to the copy vector, wherein each element in the processed word segmentation probability sequence represents the probability of each word in the first language word list appearing at each position in the processed text;
and determining target word segmentation corresponding to each position in the processed text in the first language word list according to the processed word segmentation probability sequence to obtain the processed text.
C24, the apparatus of C23, the generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector, comprising:
generating a first segmentation probability sequence according to the first decoder intermediate vector;
connecting the first encoder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector;
and inputting the combination vector into a full connection layer for calculation to obtain a copy vector.
C25, the apparatus of C24, the generating a processed participle probability sequence according to the copy vector, comprising:
generating a second word segmentation probability sequence according to the copy vector;
and determining the word segmentation probability sequence after processing according to the first word segmentation probability sequence, the second word segmentation probability sequence, the first parameter corresponding to the first word segmentation probability sequence and the second parameter corresponding to the second word segmentation probability sequence.
C26, the apparatus according to C22, the second model comprising a second encoder and a second decoder, the inputting the first language text vector into the second model, outputting a translated text of the first language text corresponding to a second language through the second model, comprising:
inputting the first language text vector into the second encoder for encoding to obtain a second encoder intermediate vector;
inputting the intermediate vector of the second encoder into the second decoder for decoding to obtain an intermediate vector of the second decoder;
generating a translation word segmentation probability sequence according to the second decoder intermediate vector, wherein each element in the translation word segmentation probability sequence represents the probability of each word in a second language word list appearing at each position in a translation text;
and determining target participles corresponding to each position in the translation text in the second language word list according to the translation participle probability sequence to obtain the translation text.
C27, the device according to C26, the device also being configured to execute, by one or more processors, the one or more programs including instructions for:
determining out-of-set words in the translated text, wherein the out-of-set words are words existing in the translated text but not existing in the second language vocabulary;
and adding the out-of-set words to the second language word list.
C28, the apparatus of C22, the calculating a total loss value according to the difference between the processed text and the second language standard text, comprising:
calculating a first loss value according to the difference between the processed text and the second language standard text;
calculating a second loss value according to the difference between the translated text and the second language standard text;
calculating a third loss value according to a difference between the length of the translated text and the length of the second language standard text;
calculating a total loss value based on the first loss value, the second loss value, and the third loss value.
C29, the device according to C21, the device also being configured to execute, by one or more processors, the one or more programs including instructions for:
identifying a named entity in the original text;
marking the participles corresponding to the named entities in the original text;
the device is also configured to execute, by one or more processors, the one or more programs including instructions for:
and replacing the translation result corresponding to the marked word segmentation in the target text with the named entity.
C30, the device according to C21, the device also being configured to execute, by one or more processors, the one or more programs including instructions for:
and displaying the modification information of the target text relative to the original text.
Embodiments of the present invention disclose D31, a machine-readable medium having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform the text processing method described in any one of A1 to A10.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The text processing method, the text processing device and the device for text processing provided by the invention are described in detail, specific examples are applied in the text to explain the principle and the implementation mode of the invention, and the description of the above examples is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of text processing, the method comprising:
vectorizing the original text to obtain an original text vector;
inputting the original text vector into a first model, outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network, the copy network is used for reserving the copy text in the original text in the target text, the first model is obtained by training based on a translation parallel corpus of the first language corresponding to a second language and an output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
2. The method according to claim 1, wherein the translated parallel corpus comprises a first language text and a second language standard text corresponding to the first language text, and the first model is trained by:
vectorizing the first language text to obtain a first language text vector;
inputting the first language text vector into a second model, and outputting a translation text of the first language text corresponding to a second language through the second model;
inputting the translated text into a first model, reserving the copied text in the translated text through a copy network of the first model, and outputting a processed text corresponding to the translated text through the first model, wherein the copied text in the translated text is reserved in the processed text;
calculating a total loss value according to the difference between the processed text and the second language standard text;
and adjusting the model parameters of the first model according to the total loss value until the calculated total loss value reaches a preset convergence condition, so as to obtain the trained first model.
3. The method of claim 2, wherein the copy network comprises a first encoder and a first decoder, the inputting the translated text into a first model, retaining the copied text in the translated text through the copy network of the first model, and outputting the processed text corresponding to the translated text through the first model comprises:
vectorizing the translation text to obtain a translation text vector;
inputting the translation text vector into the first encoder for encoding to obtain a first encoder intermediate vector;
inputting the first encoder intermediate vector into the first decoder for decoding to obtain a first decoder intermediate vector;
generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector;
generating a processed word segmentation probability sequence according to the copy vector, wherein each element in the processed word segmentation probability sequence represents the probability of each word in the first language word list appearing at each position in the processed text;
and determining target word segmentation corresponding to each position in the processed text in the first language word list according to the processed word segmentation probability sequence to obtain the processed text.
4. The method of claim 3, wherein generating a copy vector from the first encoder intermediate vector and the first decoder intermediate vector comprises:
generating a first segmentation probability sequence according to the first decoder intermediate vector;
connecting the first encoder intermediate vector and the first decoder intermediate vector based on a preset dimension to obtain a combined vector;
and inputting the combination vector into a full connection layer for calculation to obtain a copy vector.
5. The method of claim 4, wherein generating a processed word segmentation probability sequence from the copy vector comprises:
generating a second word segmentation probability sequence according to the copy vector;
and determining the word segmentation probability sequence after processing according to the first word segmentation probability sequence, the second word segmentation probability sequence, the first parameter corresponding to the first word segmentation probability sequence and the second parameter corresponding to the second word segmentation probability sequence.
6. The method of claim 2, wherein the second model comprises a second encoder and a second decoder, and wherein inputting the first language text vector into the second model and outputting the translated text of the first language text corresponding to the second language through the second model comprises:
inputting the first language text vector into the second encoder for encoding to obtain a second encoder intermediate vector;
inputting the intermediate vector of the second encoder into the second decoder for decoding to obtain an intermediate vector of the second decoder;
generating a translation word segmentation probability sequence according to the second decoder intermediate vector, wherein each element in the translation word segmentation probability sequence represents the probability of each word in a second language word list appearing at each position in a translation text;
and determining target participles corresponding to each position in the translation text in the second language word list according to the translation participle probability sequence to obtain the translation text.
7. The method of claim 6, wherein after obtaining the translated text, the method further comprises:
determining out-of-set words in the translated text, wherein the out-of-set words are words existing in the translated text but not existing in the second language vocabulary;
and adding the out-of-set words to the second language word list.
8. A text processing apparatus, characterized in that the apparatus comprises:
the first vectorization module is used for vectorizing the original text to obtain an original text vector;
the processing module is used for inputting the original text vector into a first model and outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network configured to retain the copied text from the original text in the target text, the first model is trained based on the translation parallel corpus of the first language corresponding to the second language and the output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
9. An apparatus for text processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
vectorizing the original text to obtain an original text vector;
inputting the original text vector into a first model, outputting a target text through the first model, wherein the original text and the target text correspond to the same language, the first model comprises a copy network, the copy network is used for reserving the copy text in the original text in the target text, the first model is obtained by training based on a translation parallel corpus of the first language corresponding to a second language and an output result of the second model, and the second model is used for translating the text of the first language into the text of the second language.
10. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the text processing method of any of claims 1 to 7.
CN202011063600.4A 2020-09-30 2020-09-30 Text processing method and device and text processing device Pending CN112199963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011063600.4A CN112199963A (en) 2020-09-30 2020-09-30 Text processing method and device and text processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011063600.4A CN112199963A (en) 2020-09-30 2020-09-30 Text processing method and device and text processing device

Publications (1)

Publication Number Publication Date
CN112199963A true CN112199963A (en) 2021-01-08

Family

ID=74013577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011063600.4A Pending CN112199963A (en) 2020-09-30 2020-09-30 Text processing method and device and text processing device

Country Status (1)

Country Link
CN (1) CN112199963A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861548A (en) * 2021-02-10 2021-05-28 百度在线网络技术(北京)有限公司 Natural language generation and model training method, device, equipment and storage medium
WO2024085461A1 (en) * 2022-10-18 2024-04-25 삼성전자주식회사 Electronic device and method for providing translation service


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination