CN112149415B - Training method and device for text generation model and readable storage medium


Info

Publication number
CN112149415B
Authority
CN
China
Prior art keywords: text, sequence, word, frequency, generation model
Prior art date
Legal status: Active (an assumption, not a legal conclusion)
Application number
CN202011086728.2A
Other languages
Chinese (zh)
Other versions
CN112149415A (en)
Inventor
胡晓林
刘涵
李建民
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202011086728.2A
Publication of CN112149415A
Application granted
Publication of CN112149415B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a training method, an apparatus, and a readable storage medium for a text generation model. The training method comprises the following steps: acquiring a first text sequence generated by a text generation model; replacing a target word in the first text sequence with a replacement word from a preset lexicon to obtain a second text sequence; and adjusting model parameters of the text generation model according to the value of a loss function to obtain a trained text generation model, where the trained text generation model is used for generating a text sequence corresponding to given input content. With the training method, apparatus, and readable storage medium of the present disclosure, whether each target word in the generated text sequence is accurate can be fed back to the model based on content change and length change, so that the model obtains more guiding information and the accuracy of the generated text sequence is improved.

Description

Training method and device for text generation model and readable storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a training method and apparatus for a text generation model, and a readable storage medium.
Background
Text generation refers to natural language processing whose output takes the form of a text sequence; it is commonly used in tasks such as machine translation, speech recognition, image description, and video description.
A representative current training method for text generation models is Self-Critical Sequence Training (SCST), a reinforcement learning algorithm that optimizes a model by comparing the result of randomly sampled word selection with the result of word selection according to the maximum probability. Although it achieves a certain training effect, the accuracy of the text generated by the trained model still leaves room for improvement.
Disclosure of Invention
In view of this, the present disclosure proposes a training method, an apparatus, and a readable storage medium for a text generation model, so as to improve the accuracy of the text sequences generated by the model.
According to an aspect of the present disclosure, there is provided a training method of a text generation model, including: acquiring a first text sequence generated by a text generation model;
replacing a target word in the first text sequence with a replacement word from a preset lexicon to obtain a second text sequence;
adjusting model parameters of the text generation model according to the value of a loss function to obtain a trained text generation model, where the trained text generation model is used for generating a text sequence corresponding to given input content;
wherein the loss function comprises a first loss function determined from the content change and the length change of the second text sequence relative to the first text sequence.
In one possible implementation, the content change is represented by a text content change score, and the length change is represented by a text length change score;
the first loss function is determined from the text content change score and the text length change score.
In one possible implementation manner, the training method of the text generation model further includes:
determining the text content change score according to a first frequency of the n-grams of the first text sequence and a second frequency of the n-grams of the second text sequence.
In one possible implementation, determining the text content change score according to the first frequency of the n-grams of the first text sequence and the second frequency of the n-grams of the second text sequence includes:
determining a first frequency vector formed by the first frequencies and a second frequency vector formed by the second frequencies;
with the length of the second text sequence held unchanged relative to the first text sequence, determining the text content change score according to the difference between the natural language evaluation indexes corresponding to the second frequency vector and the first frequency vector respectively,
where the natural language evaluation index is determined from the frequency vector and the length of a text sequence.
In one possible implementation, determining the second frequency vector formed by the second frequencies includes:
when the target word in the first text sequence is replaced by the replacement word, the second frequency of the n-gram ending with the target word is reduced by 1 relative to the first frequency, the second frequency of the n-gram ending with the replacement word is increased by 1 relative to the first frequency, and the second frequencies of the other n-grams are unchanged relative to the first frequencies.
In one possible implementation manner, the training method of the text generation model further includes:
determining the text length change score, in the case that the second frequency vector corresponding to the second text sequence is unchanged relative to the first frequency vector corresponding to the first text sequence, according to the difference between the natural language evaluation indexes corresponding to the length of the second text sequence and the length of the first text sequence, where the natural language evaluation index is determined from the frequency vector and the length of a text sequence.
In one possible implementation, when the target word is the ending identifier of the first text sequence, the length of the second text sequence is increased by 1 relative to the length of the first text sequence; when the target word is a word other than the ending identifier, the length of the second text sequence is equal to the length of the first text sequence.
In one possible implementation, the first loss function is determined from the text content change score when each target word of the first text sequence is replaced by a replacement word and the text length change score when the target word is the ending identifier of the first text sequence.
In a possible implementation, the loss function further comprises a second loss function obtained by the self-critical sequence training method SCST, the loss function being determined from a weighted sum of the first and second loss functions.
In one possible implementation, the input content of the trained text generation model includes one or more of images, text, audio, and video.
According to another aspect of the present disclosure, there is provided a text generation method including:
acquiring input content to be processed;
Inputting the input content into a trained text generation model to obtain a text sequence corresponding to the input content output by the trained text generation model;
The trained text generation model is a text generation model trained in advance using the training method of any one of the embodiments described above.
According to another aspect of the present disclosure, there is provided a training apparatus of a text generation model, the training apparatus of a text generation model including:
the acquisition module is used for acquiring a first text sequence generated by the text generation model;
the replacing module is used for replacing a target word in the first text sequence with a replacement word from a preset lexicon to obtain a second text sequence;
the adjusting module is used for adjusting the model parameters of the text generation model according to the value of the loss function to obtain a trained text generation model, where the trained text generation model is used for generating a text sequence corresponding to given input content; the loss function comprises a first loss function determined from the content change and the length change of the second text sequence relative to the first text sequence.
For the training apparatus of the text generation model, in one possible implementation, the content change is represented by a text content change score, and the length change is represented by a text length change score; the first loss function is determined from the text content change score and the text length change score.
In one possible implementation manner, the training device of the text generation model further includes:
a text content change score determining module, configured to determine the text content change score according to the first frequency of the n-grams of the first text sequence and the second frequency of the n-grams of the second text sequence.
In one possible implementation manner, the text content change score determining module includes:
a frequency vector determining unit, configured to determine the first frequency vector formed by the first frequencies and the second frequency vector formed by the second frequencies;
a text content change score determining unit, configured to determine, with the length of the second text sequence held unchanged relative to the first text sequence, the text content change score according to the difference between the natural language evaluation indexes corresponding to the second frequency vector and the first frequency vector respectively, where the natural language evaluation index is determined from the frequency vector and the length of a text sequence.
For the training apparatus of the text generation model, in one possible implementation, determining the second frequency vector formed by the second frequencies may include: when the target word in the first text sequence is replaced by the replacement word, the second frequency of the n-gram ending with the target word is reduced by 1 relative to the first frequency, the second frequency of the n-gram ending with the replacement word is increased by 1 relative to the first frequency, and the second frequencies of the other n-grams are unchanged relative to the first frequencies.
In one possible implementation manner, the training device of the text generation model further includes:
the text length change score determining module is configured to determine the text length change score according to a difference value between natural language evaluation indexes corresponding to the length of the second text sequence and the length of the first text sequence, where the second frequency vector corresponding to the second text sequence is set to be unchanged relative to the first frequency vector corresponding to the first text sequence, and the natural language evaluation indexes are determined according to the frequency vector and the length of the text sequence.
For the training device of the text generation model, in one possible implementation, the length of the second text sequence is increased by 1 relative to the length of the first text sequence when the target word is the ending identifier of the first text sequence, and the length of the second text sequence is equal to the length of the first text sequence when the target word is a word other than the ending identifier.
In one possible implementation manner, the first loss function is determined according to a text content change score when each target word of the first text sequence is replaced by a replacement word and a text length change score when the target word is an ending identifier of the first text sequence.
In one possible implementation, the loss function further includes a second loss function obtained by the self-critical sequence training method SCST, and the loss function is determined from a weighted sum of the first loss function and the second loss function.
In one possible implementation, the input content of the trained text generation model includes one or more of images, text, audio, and video.
According to another aspect of the present disclosure, there is provided a text generating apparatus including:
the input content acquisition module is used for acquiring input content to be processed;
the text sequence output module is used for inputting the input content into the trained text generation model to obtain a text sequence corresponding to the input content output by the trained text generation model; the trained text generation model is a text generation model trained in advance using the training method of any one of the embodiments described above.
According to another aspect of the present disclosure, there is provided a training apparatus of a text generation model, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.
In the embodiments of the present disclosure, a first text sequence generated by a text generation model is acquired; a target word in the first text sequence is replaced with a replacement word from a preset lexicon to obtain a second text sequence; and the model parameters of the text generation model are adjusted based on a first loss function determined from the content change and the length change of the second text sequence relative to the first text sequence, so as to obtain the trained text generation model. In model training, whether each target word in the generated text sequence is accurate can thus be fed back to the model based on the content change and the length change, so that the model obtains more guiding information and the accuracy of the generated text sequence is improved.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of a training method of a text generation model according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of a structure of a text generation model of an input content for an image according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a text generation method according to an embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of a training device of a text generation model, according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a text generation device according to an embodiment of the present disclosure;
fig. 6 shows a block diagram of a training apparatus of a text generation model according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
In the embodiments of the present disclosure, a text generation model may refer to a model that outputs a text sequence, for example an image description model, a machine translation model, a speech recognition model, or a video description model. Based on the input content of the model and the application scenario, a person skilled in the art may select an existing natural language processing technique to construct the text generation model; for example, a neural network model with an encoder-decoder structure may be selected, which can be used in image description, machine translation, video description, speech recognition, and similar scenarios. It will be appreciated that those skilled in the art may adjust the neural network structures of the encoder and the decoder according to actual needs; the embodiments of the present disclosure place no limitation on the model structure of the text generation model. Content such as an image, text, audio, or video is input into the text generation model to obtain a corresponding text sequence, which can describe the content contained in that input; for example, an image is input into the text generation model, and a text sequence can be output to express the content displayed in the image.
Existing training methods for text generation models, such as SCST, may score the model-generated text as a whole based on an evaluation index, for example the Bilingual Evaluation Understudy (BLEU) index, the Metric for Evaluation of Translation with Explicit ORdering (METEOR) index, the Semantic Propositional Image Caption Evaluation (SPICE) index, or the Consensus-based Image Description Evaluation (CIDEr) index, allowing the model to learn the accuracy of the generated text and thereby reach better model parameters. However, such training cannot provide accurate guiding information for each individual word in the generated text, the training effect is limited, and the accuracy of the text generated by the model still needs to be improved.
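To make the whole-sequence scoring concrete, the following minimal Python sketch computes a sentence-level metric for a sampled sequence and a greedy baseline, as SCST-style methods do; it assumes NLTK's BLEU implementation and invented example sentences, whereas a real system might use CIDEr or another index.

```python
# A minimal sketch of whole-sequence scoring (assumes NLTK; sentences invented).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "plate", "of", "food", "on", "the", "table"]]
sampled = ["a", "plate", "of", "salad", "on", "a", "table"]   # random word selection
greedy = ["a", "plate", "of", "food", "on", "a", "table"]     # max-probability selection

smooth = SmoothingFunction().method1
reward = sentence_bleu(references, sampled, smoothing_function=smooth)
baseline = sentence_bleu(references, greedy, smoothing_function=smooth)

# SCST rewards the whole sampled sentence by its advantage over the baseline;
# no per-word signal is produced.
advantage = reward - baseline
print(advantage)
```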
Based on the above, the training method for a text generation model provided by the embodiments of the present disclosure replaces each target word during model training and, based on a loss function determined from the content change and the length change, feeds back to the model whether each target word in the generated text sequence is accurate, so that the model obtains more guiding information and the accuracy of the generated text sequence is improved.
FIG. 1 illustrates a flow chart of a training method of a text generation model according to an embodiment of the present disclosure. As shown in fig. 1, the training method of the text generation model includes:
Step 10: acquire a first text sequence generated by the text generation model.
In one possible implementation, the first text sequence may be the natural language output that the text generation model produces for given input content. The input content may include one of an image, text, audio, and video. For example, when the input content is an image, "A plate of food on the table" shown in Fig. 2 may be the first text sequence generated for the input image. In a text generation model composed of a convolutional neural network (CNN) and a long short-term memory (LSTM) network, the CNN extracts image features and the LSTM outputs the text sequence from the extracted image features.
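As an illustration only, a minimal encoder-decoder of this kind can be sketched in PyTorch (the patent prescribes no framework; the ResNet-18 backbone, the layer sizes, and the vocabulary size are assumptions, and a recent torchvision is assumed for the weights argument):

```python
# A minimal CNN + LSTM captioning sketch; all sizes are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)           # CNN encoder (assumed backbone)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.fc = nn.Linear(512, embed_dim)           # image feature -> embedding
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # per-step word distribution

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)        # (B, 512)
        feat = self.fc(feat).unsqueeze(1)             # (B, 1, E), first LSTM input
        emb = self.embed(captions)                    # (B, T, E)
        hidden, _ = self.lstm(torch.cat([feat, emb], dim=1))
        return self.out(hidden)                       # (B, T+1, V) logits
```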
It may be understood that the text sequence in the embodiment of the present disclosure may be chinese, english, or text of other language types, and those skilled in the art may control the language type of the text generated by the text generation model, and the embodiment of the present disclosure is not limited to the language type of the generated text sequence.
Step 11: replace a target word in the first text sequence with a replacement word from the preset lexicon to obtain a second text sequence.
In one possible implementation, the preset lexicon may be a set of preset words whose language type corresponds to the language type of the generated text sequence. The words included in the preset lexicon may be set according to actual requirements, for example common words or subject words; the embodiments of the present disclosure are not limited in this respect.
In one possible implementation, a replacement word may be a word in the preset lexicon used for replacement, and a target word may be a word in the first text sequence that is replaced. In the embodiments of the present disclosure, the replacement word and the target word each refer to a language unit constituting a text sequence, for example food, table, and the like.
In one possible implementation, the second text sequence may be obtained by replacing target words with all or some of the words in the preset lexicon, and the target words may be all or any part of the words in the first text sequence. It is understood that the first text sequence may include at least one target word, the preset lexicon may include at least one replacement word, and replacing a target word with a replacement word may yield at least one second text sequence. For example, assume that the preset lexicon includes food, salad, French, and fries, and the first text sequence is "A plate of food on the table". The word "A" can be replaced with food to obtain the second text sequence "food plate of food on the table"; the word "A" can likewise be replaced with salad, French, or fries; the word food can be replaced with salad, French, or fries; and so on. A small sketch of this enumeration is given below.
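The sketch uses the illustrative lexicon above; note that the replacement candidates range over the whole lexicon, matching the summation over w′_t ∈ V in the loss later on:

```python
# Enumerate candidate second sequences by replacing each target word (step 11).
def second_sequences(first_seq, lexicon):
    """Yield (position t, replacement word, second text sequence)."""
    for t in range(len(first_seq)):
        for w in lexicon:
            yield t, w, first_seq[:t] + [w] + first_seq[t + 1:]

first = "A plate of food on the table".split()
lexicon = ["food", "salad", "French", "fries"]
for t, w, seq in second_sequences(first, lexicon):
    print(t, w, " ".join(seq))
```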
In the embodiments of the present disclosure, the second text sequence obtained by replacing at least one target word in the first text sequence can be used to give more guiding information during model training.
Step 12: adjust the model parameters of the text generation model according to the value of the loss function to obtain a trained text generation model, where the trained text generation model is used for generating a text sequence corresponding to given input content.
In one possible implementation, replacing a word of the text sequence brings about a content change, a length change, and the like. These changes show how the second text sequence differs from the first text sequence in semantics and/or sentence structure after the target word is replaced by the replacement word, and feeding them back to the model enables effective training. In the embodiments of the present disclosure, the content change and the length change of the second text sequence relative to the first text sequence can be fed back to the model, so that the text generation model is trained effectively.
In one possible implementation, feeding the content change and the length change back to the model may be implemented through a loss function; in one or more embodiments of the present disclosure, the loss function may include a first loss function determined from the content change and the length change of the second text sequence relative to the first text sequence.
In one possible implementation, the content change may be represented by a text content change score and the length change by a text length change score, so as to quantify them, and the first loss function may be determined from the text content change score and the text length change score.
In natural language processing, it is generally considered that a subsequent word in a text sequence is associated with the preceding n words; when the text content changes, the frequencies of the n-grams in the text sequence also change. Accordingly, in one or more embodiments of the present disclosure, the text content change score may be determined from the first frequency of the n-grams of the first text sequence and the second frequency of the n-grams of the second text sequence.
An n-gram refers to n words that appear consecutively in the text, and the frequency of an n-gram is the number of times it appears in the text. For example, assume n is 3 and the text is abcdefabcd; the 3-grams of this text are abc, bcd, cde, def, efa, fab, abc, bcd. Here abc and bcd each appear twice, so their frequencies are 2, while each of the other 3-grams appears once and has a frequency of 1. It will be appreciated that those skilled in the art may set n to 1, 2, 3, or another value as desired; the embodiments of the present disclosure are not limited in this respect.
The frequencies of the n-grams may form a frequency vector whose elements are the frequencies of the n-grams of the text arranged in descending order; for example, with n equal to 3 as above, the frequency vector of the text abcdefabcd is {2, 2, 1, 1, 1, 1}.
In the embodiment of the disclosure, the value of n may be determined according to needs.
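The n-gram frequencies and the frequency vector can be computed as in the following minimal sketch, which reproduces the abcdefabcd example with n = 3 (character tokens are assumed for brevity):

```python
# Computing n-gram frequencies and the frequency vector for abcdefabcd (n = 3).
from collections import Counter

def ngram_counts(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

counts = ngram_counts(list("abcdefabcd"), 3)
# Frequency vector: the counts arranged in descending order.
freq_vector = sorted(counts.values(), reverse=True)
print(freq_vector)  # [2, 2, 1, 1, 1, 1] -> abc and bcd occur twice
```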
In one possible implementation, the text length may refer to the number of words contained in the text sequence; in general, a text sequence contains an ending identifier that marks its end. Accordingly, the length change refers to the change in the number of words of the second text sequence relative to the first text sequence, and in one or more embodiments of the present disclosure the text length change score may be determined from the length of the first text sequence and the length of the second text sequence.
In one possible implementation, the content change and the length change may be scored by a natural language evaluation index to determine the text content change score and the text length change score. The natural language evaluation index may include BLEU, METEOR, CIDEr, and the like; one or more of these indexes may be selected according to the model structure to score the content change and the length change, and the embodiments of the present disclosure are not limited in this respect.
In the embodiments of the present disclosure, representing the content change and the length change by the text content change score and the text length change score quantifies them simply and effectively.
In one possible implementation, the model parameters of the text generation model are adjusted according to the value of the loss function; the parameters may be adjusted by back propagation or the like according to the training objective, and the embodiments of the present disclosure are not limited in this respect.
Table 1 shows, for a text generation model whose input content is an image (i.e., an image description model), a comparison of the natural language evaluation indexes of the output text sequences under the SCST training method and under the training method of the embodiments of the present disclosure. Att2in is an image description model with an attention mechanism, Top-Down is an image description model with top-down attention, Up-Down is an image description model with both top-down and bottom-up attention, and SGAE is an image description model with auto-encoded scene graphs. VCST denotes the training method of the text generation model in the embodiments of the present disclosure.
It can be understood that the higher the value of a natural language evaluation index, the better the training effect, that is, the higher the accuracy of the text sequence output by the model. As is evident from Table 1, the image description models trained by the training method of the embodiments of the present disclosure attain higher indexes; that is, the training method of the embodiments of the present disclosure significantly improves the accuracy of the generated text sequences.
Table 1 Comparison results
In the embodiments of the present disclosure, a first text sequence generated by a text generation model is acquired; a target word in the first text sequence is replaced with a replacement word from the preset lexicon to obtain a second text sequence; and the model parameters of the text generation model are adjusted according to the value of the loss function, where the loss function includes a first loss function determined from the content change and the length change of the second text sequence relative to the first text sequence. In model training, by replacing each target word and determining the loss function based on the content change and the length change, whether each target word in the generated text sequence is accurate can be fed back to the model, so that the model obtains more guiding information and the accuracy of the generated text sequence is improved.
In one or more embodiments of the present disclosure, considering that a text sequence has multiple n-gram frequencies, these frequencies may be expressed as a vector so that the text content change can be scored by a natural language evaluation index. Determining the text content change score from the first frequency of the n-grams of the first text sequence and the second frequency of the n-grams of the second text sequence may include:
determining the first frequency vector formed by the first frequencies and the second frequency vector formed by the second frequencies;
with the length of the second text sequence held unchanged relative to the first text sequence, determining the text content change score according to the difference between the natural language evaluation indexes corresponding to the second frequency vector and the first frequency vector respectively.
In one possible implementation, the score may be determined by using the frequency vector and the length of the text sequence as the input parameters of the natural language evaluation index; the natural language evaluation index is then determined from the frequency vector and the length of the text sequence. A frequency vector is a vector composed of the n-gram frequencies.
In practice, the words following the replacement word in the second text sequence would actually have been generated conditioned on the n-grams containing the target word, so it is difficult to know which words the text generation model would generate after the replacement word, and hence difficult to obtain the true text content change score caused by replacing the target word. To solve this problem, in one possible implementation it may be assumed that each replacement affects only the frequency of the n-gram ending with the replacement word and the frequency of the n-gram ending with the target word. Determining the second frequency vector formed by the second frequencies may then include: when the target word in the first text sequence is replaced by the replacement word, the second frequency of the n-gram ending with the target word is reduced by 1 relative to the first frequency, the second frequency of the n-gram ending with the replacement word is increased by 1 relative to the first frequency, and the second frequencies of the other n-grams are unchanged relative to the first frequencies. This simply and effectively approximates the frequency vector of the text sequence that would be generated after the actual replacement.
In the embodiments of the present disclosure, the second frequency h_n(x|s′) may be expressed as formula 1, and the second frequency vector composed of the second frequencies may be expressed as H(s′) = {h_n(x|s′)}:

h_n(x|s′) = h_n(x|s*) − 1, if x is the n-gram ending with the target word w_t;
h_n(x|s′) = h_n(x|s*) + 1, if x is the n-gram ending with the replacement word w′_t;
h_n(x|s′) = h_n(x|s*), otherwise.  (formula 1)

Here n denotes the order of the n-gram, t denotes the t-th word in the first text sequence (i.e., the currently replaced target word), w′_t denotes the replacement word replacing the t-th word, with w′_t ∈ the preset lexicon, s* denotes the first text sequence, s′ denotes the second text sequence after the t-th word is replaced by w′_t, h_n denotes the frequency of an n-gram, and x denotes an n-gram of the text sequence.
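A sketch of this approximation: only the count of the n-gram ending with the target word and the count of the n-gram ending with the replacement word shift by 1. The function assumes token lists and the Counter representation from the earlier sketch:

```python
# Approximate the n-gram counts of the second sequence s' per formula 1.
from collections import Counter

def updated_counts(first_seq, counts, t, replacement, n):
    """Return the approximate n-gram counts after replacing word t."""
    new = Counter(counts)
    if t - n + 1 >= 0:                                 # a full n-gram ends at t
        old_gram = tuple(first_seq[t - n + 1:t + 1])   # ends with the target word
        new_gram = old_gram[:-1] + (replacement,)      # ends with the replacement
        new[old_gram] -= 1
        new[new_gram] += 1
    return new
```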
In one possible implementation, the text content change score may be determined by controlling variables, that is, by holding the length of the second text sequence unchanged relative to the first text sequence, so that the content change is reflected more accurately. The text content change score r_t(w′_t|s*) in the embodiments of the present disclosure may be expressed as formula 2:

r_t(w′_t|s*) ≈ Ψ(H(s′), l(s*)) − Ψ(H(s*), l(s*))  (formula 2)

where t denotes the t-th word in the first text sequence (i.e., the currently replaced target word), r_t(w′_t|s*) denotes the text content change score when the t-th word of the first text sequence is replaced by the replacement word w′_t, Ψ denotes the natural language evaluation index, s′ denotes the second text sequence after the t-th word is replaced by w′_t, H(s′) denotes the second frequency vector, H(s*) denotes the first frequency vector, and l(s*) denotes the length of the first text sequence.
In the embodiments of the present disclosure, with the length of the second text sequence held unchanged relative to the first text sequence, the text content change score is determined from the difference between the natural language evaluation indexes corresponding to the second frequency vector and the first frequency vector respectively; this reflects the content change more accurately, and the value of the loss function is then determined from the text content change score.
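A sketch of formula 2 with a stand-in evaluation index: psi below is an invented precision-style placeholder for Ψ (a real system would plug in BLEU-, METEOR-, or CIDEr-style scoring of the frequency vector and length against reference counts):

```python
def psi(freq_counts, length, ref_counts):
    """Invented stand-in for the evaluation index Psi: n-gram matches against
    reference counts, normalized by sequence length. Not an official metric."""
    matched = sum(min(c, ref_counts.get(g, 0)) for g, c in freq_counts.items())
    return matched / max(length, 1)

def content_change_score(counts_first, counts_second, length_first, ref_counts):
    # Formula 2: the length argument is held fixed at l(s*), so only the
    # frequency-vector change is reflected in the score.
    return (psi(counts_second, length_first, ref_counts)
            - psi(counts_first, length_first, ref_counts))
```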
In one or more embodiments of the present disclosure, as described above, the replacement word and the target word each refer to a language unit constituting a text sequence, and such units include the ending identifier. Since the ending identifier can also be replaced, when the target word is the ending identifier of the first text sequence, the length of the second text sequence is increased by 1 relative to the length of the first text sequence: after the ending identifier is replaced by a replacement word, the number of words in the second text sequence increases by 1. When the target word is a word other than the ending identifier, the length of the second text sequence is equal to the length of the first text sequence, i.e., the number of words does not increase.
In one possible implementation, determining the text length change score may include: in the case that the second frequency vector corresponding to the second text sequence is unchanged relative to the first frequency vector corresponding to the first text sequence, determining the text length change score according to the difference between the natural language evaluation indexes corresponding to the length of the second text sequence and the length of the first text sequence respectively.
Since the length of the second text sequence equals the length of the first text sequence when the target word is a word other than the ending identifier, only the length change that occurs when the target word is the ending identifier of the first text sequence needs to be considered when determining the text length change score. The text length change score r_len(s*) in the embodiments of the present disclosure may be expressed as formula 3:

r_len(s*) ≈ Ψ(H(s*), l(s*) + 1) − Ψ(H(s*), l(s*))  (formula 3)

where Ψ denotes the natural language evaluation index, H(s*) denotes the first frequency vector, l(s*) denotes the length of the first text sequence, and l(s*) + 1 denotes the length of the second text sequence.

The text length change score is given by formula 3 when the target word is the ending identifier of the first text sequence, and may be taken as 0 when the target word is any word other than the ending identifier.
In the embodiments of the present disclosure, the length change score provides information about the length change after replacement, giving the model comprehensive guiding information.
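A matching sketch of formula 3, reusing the stand-in psi from the previous sketch; the frequency vector is held fixed and only the length argument moves by 1:

```python
def length_change_score(counts_first, length_first, ref_counts):
    # Formula 3: same counts, length incremented by 1 (EOS replaced by a word).
    return (psi(counts_first, length_first + 1, ref_counts)
            - psi(counts_first, length_first, ref_counts))
```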
In one or more embodiments of the present disclosure, the first loss function may be determined from the text content change score when each target word of the first text sequence is replaced by a replacement word and the text length change score when the target word is the ending identifier of the first text sequence. The first loss function L_vcst may be expressed as formula 4:

L_vcst = −Σ_{t=1..T+1} Σ_{w′_t∈V} r_t(w′_t|s*) p_t(w′_t) + r_len(s*) p_{T+1}("EOS")  (formula 4)

where t denotes the t-th word in the first text sequence, 1 ≤ t ≤ T+1, T denotes the total number of words in the first text sequence, r_t(w′_t|s*) denotes the text content change score after the t-th word of the first text sequence is replaced by the replacement word w′_t, w′_t denotes a replacement word with w′_t ∈ the preset lexicon V, s* denotes the first text sequence, p_t(w′_t) denotes the probability of w′_t in the probability distribution generated by the text generation model when outputting the t-th word, r_len(s*) denotes the text length change score, with the product r_len(s*)p_{T+1}("EOS") relating to the mathematical expectation of the text length change score, w_{T+1} denotes the ending identifier EOS of the first text sequence, and p_{T+1}("EOS") denotes the probability of the ending identifier EOS in the probability distribution generated by the text generation model when outputting the ending identifier. The term r_len(s*)p_{T+1}("EOS") carries a plus sign in formula 4 because the presence of the ending identifier has a negative effect on the increase in length of the text sequence.
It should be noted that the above formula treats all words in the first text sequence as target words. If only some words of the first text sequence are taken as target words, then t denotes the t-th target word in the first text sequence, 1 ≤ t ≤ T+1, and T denotes the total number of target words in the first text sequence.
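A sketch of formula 4 as reconstructed above; the probs structure, the r_content callback, and the names are assumptions for illustration:

```python
def vcst_loss(probs, r_content, r_len_score, eos="EOS"):
    """probs[t][w] = p_t(w), the model's word distribution at step t, with the
    last step's distribution covering the ending identifier; r_content(t, w)
    returns r_t(w | s*). Structure and names are illustrative."""
    loss = 0.0
    for t, dist in enumerate(probs):
        for w, p in dist.items():
            loss -= r_content(t, w) * p            # content change term over V
    loss += r_len_score * probs[-1].get(eos, 0.0)  # length term at the EOS step
    return loss
```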
In one possible implementation, the training objective of the text generation model may include maximizing the score improvement brought by word replacement. For example, the convergence condition of training may be set such that the value of the loss function is sufficiently small (e.g., smaller than a predetermined threshold), which corresponds to maximizing the text content change score and the text length change score overall; the text generated by the resulting text generation model is then closer to a natural language description, i.e., the accuracy of the generated text sequence is high.
When the model parameters are adjusted through the loss function L_vcst in the embodiments of the present disclosure, gradient descent may be adopted.
It should be noted that although the first loss function is described above with the loss function L_vcst as an example, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, the first loss function can be set flexibly according to the actual training objective, as long as it is determined from the text content change score and the text length change score.
In one or more embodiments of the present disclosure, the loss function may further include a second loss function obtained by the self-critical sequence training method SCST, and the loss function may be determined from a weighted sum of the first loss function and the second loss function.
In one possible implementation, the weighting may be performed through a hyperparameter of the text generation model; for example, the loss function may be L_all = L_scst + αL_vcst, where L_scst denotes the second loss function, α denotes the hyperparameter, and L_vcst denotes the first loss function. It can be understood that the hyperparameters of the text generation model differ across model structures, and adjusting the weights of the first loss function and the second loss function through the weighting yields a better training effect.
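A one-function sketch of the weighted combination; the value of α is illustrative and would be tuned per model structure:

```python
def total_loss(loss_scst, loss_vcst, alpha=0.5):
    # L_all = L_scst + alpha * L_vcst; alpha = 0.5 is purely illustrative.
    return loss_scst + alpha * loss_vcst
```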
It should be noted that although the weighting is described above with a hyperparameter as an example, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, the weighting can be set flexibly according to actual requirements, as long as the weights of the first loss function and the second loss function can be adjusted.
In the embodiments of the present disclosure, through the first loss function and the second loss function, both the overall accuracy of the text sequence and its accuracy in detail can be fed back to the model, so that the model is trained comprehensively and the accuracy of the generated text sequence is improved.
Fig. 3 shows a flowchart of a text generation method according to an embodiment of the present disclosure. As shown in Fig. 3, the method includes:
Step 20: acquire input content to be processed;
Step 21: input the input content into the trained text generation model to obtain a text sequence corresponding to the input content output by the trained text generation model.
in one possible implementation, the input content may include one or more of text, images, audio, video.
In one possible implementation, a model structure of the text generation model may be determined according to input content, and a person skilled in the art may select an existing text generation model, for example, an image description model, a machine translation model, a speech recognition model, a video description model, etc., according to input content and application scenario, and the embodiment of the present disclosure is not limited to the structure of the text generation model.
In one possible implementation, the trained text generation model may be a text generation model pre-trained using the training method provided in any of the embodiments of the present disclosure. For the training process of the model, reference may be made to the detailed description of the embodiments above, which is not repeated here.
In one possible implementation, the text sequence of the input content may be the text that the text generation model outputs according to the input content; for example, it may be a natural language description output for an image, translated text output for text, recognized text output for audio, a natural language description output for a video, and so on.
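A hypothetical inference sketch of steps 20-21; model.step, the greedy decoding policy, and the vocabulary lookup are placeholders, since the patent does not fix a decoding API:

```python
import torch

def generate_text(model, input_tensor, vocab, eos="EOS", max_len=20):
    """Greedy decoding sketch; `model.step` is an assumed per-step API."""
    words, state, inp = [], None, input_tensor
    with torch.no_grad():
        for _ in range(max_len):
            logits, state = model.step(inp, state)   # placeholder call
            word_id = int(logits.argmax(dim=-1))
            word = vocab[word_id]
            if word == eos:                          # stop at the ending identifier
                break
            words.append(word)
            inp = word_id                            # feed the chosen word back
    return " ".join(words)
```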
With the text generation method in the embodiments of the present disclosure, the text sequence that the text generation model generates for the input content can be made more accurate.
Fig. 4 shows a block diagram of a training apparatus for a text generation model according to an embodiment of the present disclosure, as shown in fig. 4, including:
an obtaining module 101, configured to obtain a first text sequence generated by a text generation model;
a replacing module 102, configured to replace a target word in the first text sequence with a replacement word from a preset lexicon to obtain a second text sequence;
an adjusting module 103, configured to adjust the model parameters of the text generation model according to the value of the loss function to obtain a trained text generation model, where the trained text generation model is used for generating a text sequence corresponding to given input content, and the loss function comprises a first loss function determined from the content change and the length change of the second text sequence relative to the first text sequence.
In one possible implementation, the content change is represented by a text content change score, and the length change is represented by a text length change score; the first loss function is determined from the text content change score and the text length change score.
In one possible implementation manner, the training device of the text generation model may further include:
a text content change score determining module, configured to determine the text content change score according to the first frequency of the n-grams of the first text sequence and the second frequency of the n-grams of the second text sequence.
In one possible implementation, the text content change score determining module may include:
a frequency vector determining unit, configured to determine the first frequency vector formed by the first frequencies and the second frequency vector formed by the second frequencies;
a text content change score determining unit, configured to determine, with the length of the second text sequence held unchanged relative to the first text sequence, the text content change score according to the difference between the natural language evaluation indexes corresponding to the second frequency vector and the first frequency vector respectively, where the natural language evaluation index is determined from the frequency vector and the length of a text sequence.
For the training device of the text generation model, in one possible implementation, determining the second frequency vector formed by the second frequencies may include: when the target word in the first text sequence is replaced by the replacement word, the second frequency of the n-gram ending with the target word is reduced by 1 relative to the first frequency, the second frequency of the n-gram ending with the replacement word is increased by 1 relative to the first frequency, and the second frequencies of the other n-grams are unchanged relative to the first frequencies.
In one possible implementation manner, the training device of the text generation model may further include:
a text length change score determining module, configured to determine, in the case that the second frequency vector corresponding to the second text sequence is unchanged relative to the first frequency vector corresponding to the first text sequence, the text length change score according to the difference between the natural language evaluation indexes corresponding to the length of the second text sequence and the length of the first text sequence, where the natural language evaluation index is determined from the frequency vector and the length of a text sequence.
For the training device of the text generation model, in one possible implementation, the length of the second text sequence is increased by 1 relative to the length of the first text sequence when the target word is the ending identifier of the first text sequence, and the length of the second text sequence is equal to the length of the first text sequence when the target word is a word other than the ending identifier.
In one possible implementation manner, the first loss function is determined according to a text content change score when each target word of the first text sequence is replaced by a replacement word and a text length change score when the target word is an ending identifier of the first text sequence.
For the training device of the text generation model, in one possible implementation, the loss function may further include a second loss function obtained by the self-critical sequence training method SCST, and the loss function is determined from a weighted sum of the first loss function and the second loss function.
In one possible implementation, the input content of the trained text generation model includes one or more of images, text, audio, and video.
With the training device for a text generation model described above, during model training each target word is replaced and the loss function is determined based on the content change and the length change, so that whether each target word in the generated text sequence is accurate is fed back to the model; the model thus obtains more guiding information, and the accuracy of the generated text sequence is improved.
Fig. 5 shows a block diagram of a text generating apparatus according to an embodiment of the present disclosure, as shown in fig. 5, including:
An input content acquisition module 201, configured to acquire input content to be processed;
The text sequence output module 202 is configured to input the input content to the trained text generation model, so as to obtain a text sequence corresponding to the input content output by the trained text generation model.
In one possible implementation, the input content includes one or more of text, images, audio, video.
In one possible implementation, the trained text generation model is a text generation model that is pre-trained using the training method of the text generation model provided in any one of the embodiments of the present disclosure above.
With the text generation device in the embodiments of the present disclosure, the text sequence that the text generation model generates for the input content can be made more accurate.
FIG. 6 is a block diagram illustrating a training apparatus 1900 of a text generation model, according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to fig. 6, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device, over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, so that the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A method of training a text generation model, comprising:
acquiring a first text sequence generated by a text generation model;
replacing a target word in the first text sequence with a replacement word from a preset vocabulary to obtain a second text sequence; and
adjusting model parameters of the text generation model according to the value of a loss function to obtain a trained text generation model, wherein the trained text generation model is configured to generate, from input content, a text sequence corresponding to the input content;
wherein the loss function comprises a first loss function determined from a content change and a length change of the second text sequence relative to the first text sequence, the content change being represented by a text content change score; and wherein the method further comprises:
determining, according to a first frequency of each n-gram of the first text sequence and a second frequency of each n-gram of the second text sequence, a first frequency vector formed by the first frequencies and a second frequency vector formed by the second frequencies; and
under the condition that the length of the second text sequence is unchanged relative to the first text sequence, determining the text content change score according to the difference between the natural language evaluation indexes respectively corresponding to the second frequency vector and the first frequency vector, wherein a natural language evaluation index is determined according to a frequency vector and a length of a text sequence.
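(Illustrative sketch; not part of the claims.) One way to picture the text content change score is as the difference of an n-gram-based evaluation index computed for the second and first frequency vectors at a fixed sequence length. The Python sketch below makes this concrete under stated assumptions: the helper names (`ngram_counts`, `simple_metric`), the use of a single reference sequence, and the clipped-precision index are hypothetical simplifications; the embodiments may instead use a standard index such as BLEU or CIDEr.

```python
from collections import Counter
from typing import List

def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Frequency vector: the count of each n-gram in a text sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_metric(counts: Counter, ref_counts: Counter, length: int) -> float:
    """Toy natural language evaluation index determined by a frequency
    vector and a sequence length (a BLEU-style clipped n-gram precision)."""
    if length == 0:
        return 0.0
    overlap = sum(min(c, ref_counts[g]) for g, c in counts.items())
    return overlap / length

def content_change_score(first_seq: List[str], second_seq: List[str],
                         reference: List[str], n: int = 2) -> float:
    """Score the content change of the second sequence relative to the
    first, with the sequence length held unchanged (claim 1)."""
    assert len(first_seq) == len(second_seq), "length must be unchanged"
    ref = ngram_counts(reference, n)
    v1 = ngram_counts(first_seq, n)   # first frequency vector
    v2 = ngram_counts(second_seq, n)  # second frequency vector
    return (simple_metric(v2, ref, len(second_seq))
            - simple_metric(v1, ref, len(first_seq)))
```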
2. The method of claim 1, wherein the length change is represented by a text length change score; and
the first loss function is determined according to the text content change score and the text length change score.
3. The method of claim 1, wherein determining the second frequency vector formed by the second frequencies comprises:
when the target word in the first text sequence is replaced by the replacement word, reducing the second frequency of the n-gram ending with the target word by 1 relative to the first frequency, increasing the second frequency of the n-gram ending with the replacement word by 1 relative to the first frequency, and keeping the second frequencies of the other n-grams unchanged relative to the first frequencies.
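(Illustrative sketch; not part of the claims.) Because a single-word replacement touches only the n-gram ending at the replaced position, the second frequency vector can be derived from the first with one small update per n-gram order rather than recounting the whole sequence. A minimal Python sketch, reusing the `Counter`-based frequency vectors from the sketch after claim 1:

```python
from collections import Counter
from typing import List

def update_counts_on_replace(tokens: List[str], counts: Counter,
                             pos: int, replacement: str, n: int = 2) -> Counter:
    """Derive the second frequency vector from the first when the target
    word tokens[pos] is replaced (claim 3): the n-gram ending with the
    target word loses one count, the n-gram ending with the replacement
    word gains one, and all other n-grams keep their first frequencies."""
    new_counts = Counter(counts)
    start = pos - n + 1
    if start >= 0:  # an n-gram ending at position pos exists
        old_gram = tuple(tokens[start:pos + 1])
        new_gram = old_gram[:-1] + (replacement,)
        new_counts[old_gram] -= 1
        if new_counts[old_gram] <= 0:
            del new_counts[old_gram]
        new_counts[new_gram] += 1
    return new_counts
```

For n-gram orders 1 through N, the same update would be applied once per order.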
4. The method of claim 2, further comprising:
under the condition that the second frequency vector corresponding to the second text sequence is unchanged relative to the first frequency vector corresponding to the first text sequence, determining the text length change score according to the difference between the natural language evaluation indexes respectively corresponding to the length of the second text sequence and the length of the first text sequence, wherein a natural language evaluation index is determined according to a frequency vector and a length of a text sequence.
5. The method of claim 4, wherein the length of the second text sequence is increased by 1 relative to the length of the first text sequence when the target word is the ending identifier of the first text sequence, and the length of the second text sequence is equal to the length of the first text sequence when the target word is a word other than the ending identifier.
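(Illustrative sketch; not part of the claims.) Claims 4 and 5 together fix the frequency vector and vary only the length, so the text length change score reduces to a difference of the evaluation index at two lengths. A sketch under the same illustrative `simple_metric` as before; the `<eos>` ending identifier is a hypothetical token name:

```python
from collections import Counter

def simple_metric(counts: Counter, ref_counts: Counter, length: int) -> float:
    # Same illustrative evaluation index as in the sketch after claim 1.
    if length == 0:
        return 0.0
    overlap = sum(min(c, ref_counts[g]) for g, c in counts.items())
    return overlap / length

EOS = "<eos>"  # hypothetical ending identifier

def length_change_score(freq_vector: Counter, ref_counts: Counter,
                        first_len: int, target_word: str) -> float:
    """Claims 4-5: with the frequency vector held fixed, score the length
    change as the difference of the evaluation index at the two lengths;
    replacing the ending identifier lengthens the sequence by one word."""
    second_len = first_len + 1 if target_word == EOS else first_len
    return (simple_metric(freq_vector, ref_counts, second_len)
            - simple_metric(freq_vector, ref_counts, first_len))
```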
6. The method of any one of claims 2 to 5, wherein the first loss function is determined according to the text content change score obtained when each target word of the first text sequence is replaced by a replacement word, and the text length change score obtained when the target word is the ending identifier of the first text sequence.
7. The method of claim 1, wherein the loss function further comprises a second loss function obtained by self-critical sequence training (SCST), the loss function being determined from a weighted sum of the first and second loss functions.
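(Illustrative sketch; not part of the claims.) In self-critical sequence training, the reward of a greedily decoded baseline sequence is subtracted from the reward of a sampled sequence, and the difference weights the sampled sequence's log-probability as a REINFORCE advantage. A minimal PyTorch sketch; the weight `alpha` is a hypothetical hyperparameter, as the claims do not fix the weighting:

```python
import torch

def scst_loss(sum_log_probs: torch.Tensor, sample_reward: float,
              greedy_reward: float) -> torch.Tensor:
    """Second loss via SCST: REINFORCE with the greedy decode's reward as
    baseline. `sum_log_probs` is the sum of token log-probabilities of the
    sampled sequence under the current model."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum_log_probs

def total_loss(first_loss: torch.Tensor, second_loss: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    """Claim 7: overall loss as a weighted sum of the first loss function
    and the SCST loss; alpha is illustrative only."""
    return alpha * first_loss + (1.0 - alpha) * second_loss
```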
8. The method of claim 1, wherein the input content of the trained text generation model comprises one or more of images, text, audio, and video.
9. A text generation method, comprising:
acquiring input content to be processed; and
inputting the input content into a trained text generation model to obtain a text sequence, corresponding to the input content, output by the trained text generation model;
wherein the trained text generation model is a text generation model pre-trained by the method of any one of claims 1 to 8.
10. A training apparatus for a text generation model, comprising:
an acquisition module configured to acquire a first text sequence generated by the text generation model;
a replacement module configured to replace a target word in the first text sequence with a replacement word from a preset vocabulary to obtain a second text sequence; and
an adjustment module configured to adjust model parameters of the text generation model according to the value of a loss function to obtain a trained text generation model, the trained text generation model being configured to generate, from input content, a text sequence corresponding to the input content; wherein the loss function comprises a first loss function determined from a content change and a length change of the second text sequence relative to the first text sequence, the content change being represented by a text content change score; and wherein the training apparatus further comprises:
a text content change score determining module configured to determine, according to a first frequency of each n-gram of the first text sequence and a second frequency of each n-gram of the second text sequence, a first frequency vector formed by the first frequencies and a second frequency vector formed by the second frequencies; and,
under the condition that the length of the second text sequence is unchanged relative to the first text sequence, to determine the text content change score according to the difference between the natural language evaluation indexes respectively corresponding to the second frequency vector and the first frequency vector, wherein a natural language evaluation index is determined according to a frequency vector and a length of a text sequence.
11. A text generation apparatus, comprising:
an input content acquisition module configured to acquire input content to be processed; and
a text sequence output module configured to input the input content into a trained text generation model to obtain a text sequence, corresponding to the input content, output by the trained text generation model; wherein the trained text generation model is a text generation model pre-trained by the method of any one of claims 1 to 8.
12. A training apparatus for a text generation model, comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 9.
13. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 9.
CN202011086728.2A 2020-10-12 2020-10-12 Training method and device for text generation model and readable storage medium Active CN112149415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011086728.2A CN112149415B (en) 2020-10-12 2020-10-12 Training method and device for text generation model and readable storage medium


Publications (2)

Publication Number Publication Date
CN112149415A (en) 2020-12-29
CN112149415B (en) 2024-05-31

Family

ID=73953028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011086728.2A Active CN112149415B (en) 2020-10-12 2020-10-12 Training method and device for text generation model and readable storage medium

Country Status (1)

Country Link
CN (1) CN112149415B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11823013B2 (en) * 2017-08-29 2023-11-21 International Business Machines Corporation Text data representation learning using random document embedding

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188351A (en) * 2019-05-23 2019-08-30 北京神州泰岳软件股份有限公司 The training method and device of sentence smoothness degree and syntactic score model
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111008281A (en) * 2019-12-06 2020-04-14 浙江大搜车软件技术有限公司 Text classification method and device, computer equipment and storage medium
CN111177348A (en) * 2019-12-20 2020-05-19 卓尔智联(武汉)研究院有限公司 Training method and device for problem generation model, electronic equipment and storage medium
CN111625645A (en) * 2020-05-14 2020-09-04 北京字节跳动网络技术有限公司 Training method and device of text generation model and electronic equipment
CN111639477A (en) * 2020-06-01 2020-09-08 北京中科汇联科技股份有限公司 Text reconstruction training method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Chenglong. Research on Adversarial Text Generation Methods Based on Genetic Algorithms. China Master's Theses Full-text Database, Information Science and Technology, 2020, (No. 03), full text. *


Similar Documents

Publication Publication Date Title
US11157698B2 (en) Method of training a descriptive text generating model, and method and apparatus for generating descriptive text
KR102401942B1 (en) Method and apparatus for evaluating translation quality
US11928439B2 (en) Translation method, target information determining method, related apparatus, and storage medium
CN109214386B (en) Method and apparatus for generating image recognition model
CN113962315A (en) Model pre-training method, device, equipment, storage medium and program product
CN109815459A (en) Generate the target summary for being adjusted to the content of text of target audience's vocabulary
CN111179916B (en) Training method for re-scoring model, voice recognition method and related device
US20240005093A1 (en) Device, method and program for natural language processing
US20230023789A1 (en) Method for identifying noise samples, electronic device, and storage medium
US10963647B2 (en) Predicting probability of occurrence of a string using sequence of vectors
US10055404B2 (en) Translation apparatus
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN112163434A (en) Text translation method, device, medium and electronic equipment based on artificial intelligence
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN113268996A (en) Method for expanding corpus, training method for translation model and product
CN112149415B (en) Training method and device for text generation model and readable storage medium
CN110728137B (en) Method and device for word segmentation
CN114297349A (en) Model training method, dialogue abstract generation method, system, device and medium
CN109144284B (en) Information display method and device
CN113919372A (en) Machine translation quality evaluation method, device and storage medium
Takeda et al. Unsupervised segmentation of phoneme sequences based on Pitman-Yor semi-Markov model using phoneme length context
CN117174084B (en) Training data construction method and device, electronic equipment and storage medium
CN116861885B (en) Label generation method, device, equipment and medium
CN111988673B (en) Method and related equipment for generating video description sentences
US20230325658A1 (en) Conditional output generation through data density gradient estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant