CN116151194B - Method, device, equipment and storage medium for generating Chinese universal language - Google Patents


Info

Publication number
CN116151194B
Authority
CN
China
Prior art keywords
text
training
chinese
character
output
Prior art date
Legal status
Active
Application number
CN202310348704.7A
Other languages
Chinese (zh)
Other versions
CN116151194A (en)
Inventor
屈鑫
张亚林
高笑天
叶永青
Current Assignee
Shanghai Suiyuan Technology Co ltd
Original Assignee
Shanghai Enflame Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Enflame Technology Co ltd
Priority to CN202310348704.7A
Publication of CN116151194A
Application granted
Publication of CN116151194B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing and discloses a method, a device, equipment and a storage medium for generating Chinese universal language. The method comprises the following steps: acquiring a style prompt, a Chinese text prefix and a text generation length input by a user; inputting the style prompt, the Chinese text prefix and the text generation length into a pre-trained target language model and obtaining the renewal text output by the target language model, wherein the target language model is built based on a generative pre-training network; and displaying the renewal text. According to this technical scheme, a language model capable of generating Chinese universal language is established based on a generative pre-training network, so that Chinese text of different styles can be generated automatically by a single language model, which improves the diversity and universality of Chinese language generation.

Description

Method, device, equipment and storage medium for generating Chinese universal language
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular to a method, an apparatus, a device and a storage medium for generating a Chinese universal language.
Background
With the continuous development of machine learning technology, machine learning has been widely applied to scenarios such as language recognition, machine translation and text generation. Through a pre-trained machine learning model, efficient processing of the Chinese language can be achieved in different task scenarios.
At present, existing methods for processing the Chinese language usually build a separate language model for each style of Chinese text, for example novels, ancient poems and prose, so that the generation task for each style is executed by a different language model. However, as Chinese language generation scenarios grow richer, each prior-art language model is only applicable to a single, relatively narrow generation task, and its universality is poor.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for generating Chinese universal language, which can realize automatic generation of Chinese text of different styles based on a single language model and can improve the diversity and universality of Chinese language generation.
According to one aspect of the present invention, there is provided a method for generating a Chinese universal language, including:
acquiring a style prompt, a Chinese text prefix and a text generation length input by a user;
inputting the style prompt, the Chinese text prefix and the text generation length input by the user into a pre-trained target language model, and obtaining a renewal text output by the target language model; wherein the target language model is established based on a generative pre-training network;
and displaying the renewal text.
According to another aspect of the present invention, there is provided an apparatus for generating a Chinese universal language, including:
the user input acquisition module is used for acquiring the style prompt, Chinese text prefix and text generation length input by a user;
the renewal text acquisition module is used for inputting the style prompt, the Chinese text prefix and the text generation length input by the user into a pre-trained target language model to acquire the renewal text output by the target language model; wherein the target language model is established based on a generative pre-training network;
and the renewal text display module is used for displaying the renewal text.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method for generating a Chinese universal language according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the method for generating a Chinese universal language according to any one of the embodiments of the present invention when executed.
According to this technical scheme, the style prompt, Chinese text prefix and text generation length input by the user are obtained and fed into a pre-trained target language model to obtain the renewal text output by the model, and the renewal text is then displayed. Because a language model capable of generating Chinese universal language is established based on a generative pre-training network, automatic generation of Chinese text of different styles can be realized with a single language model, which improves the diversity and universality of Chinese language generation.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1A is a flowchart of a method for generating a Chinese universal language according to an embodiment of the present invention;
FIG. 1B is a flowchart of another method for generating a Chinese universal language according to an embodiment of the present invention;
FIG. 1C is a schematic flow chart of a top-k method according to a first embodiment of the present invention;
FIG. 1D is a schematic flow chart of an optimized top-k method according to a first embodiment of the present invention;
FIG. 2A is a flowchart of a method for generating a Chinese universal language according to a second embodiment of the present invention;
FIG. 2B is a schematic diagram of a training text provided according to a second embodiment of the present invention;
FIG. 2C is a schematic diagram of the network structure of GPT2 according to a second embodiment of the present invention;
FIG. 2D is a schematic diagram of the text labeling scheme of a prior art training text;
FIG. 2E is a schematic diagram of the optimized text labeling scheme of a training text according to a second embodiment of the present invention;
FIG. 2F is a schematic diagram of a loss vector update flow according to a second embodiment of the present invention;
FIG. 2G is a flowchart of another method for generating a Chinese universal language according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a device for generating a Chinese universal language according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device implementing a method for generating a Chinese universal language according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," "target," and the like in the description and claims of the present invention and in the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1A is a flowchart of a method for generating a Chinese universal language according to an embodiment of the present invention. The method may be applied to cases where a subsequent text is automatically written to continue a Chinese text input by a user. The method may be performed by a Chinese universal language generating device, which may be implemented in hardware and/or software and may be configured in an electronic device, typically a computer device or a server. As shown in Fig. 1A, the method includes:
S110, acquiring a style prompt, a Chinese text prefix and a text generation length which are input by a user.
The style prompt may be a string identifying a language style; for example, the style prompt corresponding to the ancient poetry style may be [GSC], the style prompt corresponding to the classical Chinese (literary) style may be [WYW], and the style prompt corresponding to the wiki text style may be [WIK]. The language style is not particularly limited in this embodiment.
The Chinese text prefix may be a Chinese text, input by the user, from which the continuation is to be written, for example the ancient-poem line "My old friend leaves the Yellow Crane Tower in the west"; the number of characters of the Chinese text prefix should be greater than or equal to 1. It can be appreciated that the number of characters of the Chinese text prefix affects the accuracy of the renewal text: generally, the more characters, the higher the accuracy. The text generation length may be the text length of the text to be written, as specified by the user, i.e. the number of characters of the renewal text, for example 32, 64, etc.
In a specific example, a user may enter the style prompt, Chinese text prefix and text generation length through a web page interface, a command line, or a functional page of an application, etc. The language style indicated by the style prompt and the language style of the Chinese text prefix may be the same or different.
S120, inputting the style prompt, the Chinese text prefix and the text generation length input by the user into a pre-trained target language model, and obtaining the renewal text output by the target language model; wherein the target language model is built based on a generative pre-training network.
In this embodiment, an initial language model may first be built based on a generative pre-training network, e.g., GPT2 (Generative Pre-Training 2); then, labeled training texts can be obtained and used to perform distributed model training on the initial language model until a trained target language model is obtained. A training text may be a Chinese corpus text to which a style prompt has been added, and different training texts may correspond to different language styles. In this embodiment, the training texts may come from open source datasets or may be collected over a network. The target language model can automatically generate Chinese text of different language styles according to the task scenario or user configuration.
In a specific example, after the style prompt, Chinese text prefix and text generation length input by the user are obtained, the target language model can be automatically invoked to generate the corresponding renewal text based on the user input; the character length of the renewal text equals the text generation length, and its language style matches the style prompt, which may be the same as or different from the style of the Chinese text prefix.
For example, when the style prompt matches the language style of the Chinese text prefix, the renewal text has the same language style as the Chinese text prefix, so that same-style text generation is achieved. When the style prompt and the Chinese text prefix correspond to different language styles, the renewal text follows the language style of the style prompt rather than that of the Chinese text prefix, so that text of another language style can be generated.
Alternatively, the user may input only the style prompt and the Chinese text prefix, in which case the character length of the renewal text output by the target language model may be a preset maximum renewal text length (e.g., 32, 64, etc.).
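To make step S120 more concrete, the following minimal sketch shows one way such an autoregressive renewal loop could look. It is an illustration only: the model.next_token_logits and tokenizer.encode/decode interfaces are assumed placeholders rather than anything defined by this embodiment, and greedy selection is used here for brevity (the top-k sampling actually described in this embodiment is sketched further below).

    def generate_renewal_text(model, tokenizer, style_prompt, prefix, gen_length):
        """Sketch: autoregressively continue `prefix` for `gen_length` characters."""
        # The style prompt is prepended to the Chinese text prefix, mirroring the training format.
        input_ids = tokenizer.encode(style_prompt + prefix)   # assumed tokenizer interface
        generated = []
        for _ in range(gen_length):
            logits = model.next_token_logits(input_ids)       # assumed model interface: one score per vocabulary character
            next_id = int(logits.argmax())                     # greedy choice for brevity; see the top-k sketch below
            input_ids.append(next_id)
            generated.append(next_id)
        return tokenizer.decode(generated)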
S130, displaying the renewal text.
In this embodiment, after the renewal text is obtained, it may be visually displayed on a web interface or at a designated location of the command line where the user input the information. The presentation form of the renewal text is not particularly limited in this embodiment.
In a specific implementation of this embodiment, the flow of the method for generating a Chinese universal language may be as shown in Fig. 1B. First, the user inputs a style prompt, a Chinese text prefix and a text generation length at the web front end or on a command line; then, same-style or other-style text generation is performed on the user input through the pre-trained target language model of the Chinese universal language generation system.
In one case, the style prompt matches the language style of the Chinese text prefix. For example, the user input is "[GSC] My old friend leaves the Yellow Crane Tower in the west", where both the style prompt and the Chinese text prefix correspond to ancient poetry, so the renewal text is also ancient poetry; or the user input is "[WYW] In the Taiyuan era of Jin, a man of Wuling made his living by fishing", where both correspond to classical Chinese, so the renewal text is also classical Chinese; or the user input is "[WIK] I work at XX laboratory", where both correspond to wiki text, so the renewal text is also wiki text. In this way, same-style text generation is achieved.
In another case, the language style of the style prompt differs from that of the Chinese text prefix. For example, the user inputs a [WIK] style prompt followed by an ancient-poem prefix: the style prompt corresponds to wiki text while the Chinese text prefix corresponds to ancient poetry; correspondingly, the language style of the renewal text follows the style prompt, and the renewal text is wiki text. In this way, text generation in other styles can be realized. The display effect of the renewal text may be as shown in Fig. 1B.
According to this technical scheme, the style prompt, Chinese text prefix and text generation length input by the user are obtained and fed into a pre-trained target language model to obtain the renewal text output by the model, and the renewal text is then displayed. Because a language model capable of generating Chinese universal language is established based on a generative pre-training network, automatic generation of Chinese text of different styles can be realized with a single language model, which improves the diversity and universality of Chinese language generation.
In an optional implementation manner of this embodiment, obtaining the renewal text output by the target language model may include:
obtaining an output character vector corresponding to the current text position output by the target language model and output probabilities corresponding to all output characters in the output character vector;
when a target output character in the output character vector is detected to be a style prompt, updating the output probability corresponding to the target output character to an infinitesimal value;
and obtaining the renewal text character corresponding to the current text position according to the output probabilities corresponding to the output characters.
In this embodiment, in the text renewal output stage, a top-k method may be used to determine the renewal text character corresponding to each text position, and the flow of the method may be as shown in Fig. 1C. Specifically, for each text position, the output probability corresponding to each vocabulary character is first obtained; the output probabilities are then normalized, and the k characters with the highest normalized output probabilities are selected in descending order; finally, one of the k characters is randomly sampled and taken as the final output character for that text position. The value of k can be set adaptively according to the actual task scenario.
Note that, in this embodiment, because special characters and style prompts such as [UNK], [SEP], [WIK] and [GSC] are newly added to the training texts, each of them also receives an output probability in the inference stage, which affects the accuracy of language generation. To avoid the influence of the style prompts on the language generation result, the top-k method may be optimized; the flow of the optimized top-k method may be as shown in Fig. 1D.
Specifically, for the current text position, the output character vector [1, 2, …, 25600] output by the target language model and the output probability corresponding to each output character in the vector are obtained; the output characters may include vocabulary characters and style prompts. Then, the output probabilities corresponding to special characters and style prompts such as [UNK], [SEP], [WIK] and [GSC] in the output character vector are set to an infinitesimal value (e.g., -inf). As a result, when the k output characters with the highest output probability are selected, the style prompts are automatically excluded, so their influence on the language generation result is avoided and the Chinese language generation effect is optimized.
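A minimal sketch of this masking step, assuming NumPy and made-up vocabulary positions for the special characters and style prompts (the real indices depend on the tokenizer), is shown below.

    import numpy as np

    def mask_style_prompts(logits, prompt_token_ids):
        """Set the scores of special characters / style prompts to -inf so top-k can never select them."""
        masked = logits.copy()
        masked[prompt_token_ids] = -np.inf   # e.g. positions of [UNK], [SEP], [WIK], [GSC]
        return masked

    # Hypothetical usage: the ids below are placeholders, not the real vocabulary positions.
    logits = np.random.randn(25600)          # output character vector of the model
    masked_logits = mask_style_prompts(logits, prompt_token_ids=[0, 1, 2, 3])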
In another optional implementation manner of this embodiment, obtaining the renewal text character corresponding to the current text position according to the output probabilities corresponding to the output characters may include:
normalizing the output probability corresponding to each output character to obtain the normalized output probability corresponding to each output character;
acquiring a preset number of candidate output characters from the output characters according to the order of the normalized output probability from high to low;
and randomly sampling among the candidate output characters to obtain the renewal text character corresponding to the current text position.
The preset number may be a preset number of candidate output characters, that is, a value of k.
In a specific example, when determining the final renewal text character from the output probabilities corresponding to the output characters, the output probability $y_j$ of each output character may first be normalized according to

$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$

to obtain the normalized output probability $\mathrm{softmax}(y_i)$. Then, a preset number of candidate output characters, i.e. the TOP_K outputs [1, 2, …, K], are selected in descending order of normalized output probability; finally, random sampling among TOP-1 to TOP-K yields the renewal text character corresponding to the current text position.
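The softmax normalization, top-k selection and random sampling above could be sketched as follows; whether the final sampling is uniform or probability-weighted over the k candidates is not specified in this embodiment, so the weighted variant used here is an assumption.

    import numpy as np

    def top_k_sample(logits, k):
        """Normalize the scores with softmax, keep the k most probable characters,
        then randomly sample one of them as the renewal text character."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                       # softmax(y_i) = exp(y_i) / sum_j exp(y_j)
        top_k_ids = np.argsort(probs)[-k:]         # TOP-1 ... TOP-K candidate characters
        top_k_probs = probs[top_k_ids] / probs[top_k_ids].sum()
        return int(np.random.choice(top_k_ids, p=top_k_probs))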
The technical scheme of this embodiment is widely applicable and can be used in the Chinese language generation tasks of a variety of software and hardware products.
Example 2
Fig. 2A is a flowchart of a method for generating a chinese general language according to a second embodiment of the present invention, where the technical solution in this embodiment may be combined with one or more of the foregoing embodiments. As shown in fig. 2A, the method includes:
S210, acquiring a style prompt, a Chinese text prefix and a text generation length input by a user.
S220, obtaining Chinese corpus texts of different text types, and dividing each Chinese corpus text based on a preset text length to obtain at least one sub-corpus text.
The preset text length may be a preset number of text characters, for example, 1023.
In a specific example, the Chinese corpus texts with different text types can be obtained through open source data sets or network data collection. Thereafter, each Chinese corpus text may be divided into one or more sub-corpus texts having a number of characters of 1023.
And S230, if the text length of the target sub-corpus text is detected to be smaller than the preset text length, filling the target sub-corpus text by adopting preset characters so that the text length of the target sub-corpus text is equal to the preset text length.
The target sub-corpus text may be a sub-corpus text with a text length smaller than a preset text length. The preset character may be a preset text character, for example, may be a [ PAD ] character.
In this embodiment, when the text length of the target sub-corpus text is less than 1023, [PAD] characters may be used to pad the target sub-corpus text, so as to ensure that the text length of every sub-corpus text equals 1023.
S240, according to the text types corresponding to the Chinese corpus texts, style prompts corresponding to the sub-corpus texts are obtained, and according to the sub-corpus texts and the corresponding style prompts, training texts are generated.
In this embodiment, the text type corresponding to each sub-corpus text may be determined according to the text type corresponding to each chinese corpus text; and further, according to the corresponding relation between the preset text type and the style prompt, the style prompt corresponding to each sub-corpus text can be obtained. Finally, a corresponding style prompt is inserted before each sub-corpus text to generate training text with a text length of 1024.
In a specific example, the training texts may be as shown in Fig. 2B: the text length of each training text is 1024, the first text character is a style prompt, the 2nd to 1024th text characters are Chinese characters, and missing Chinese characters are filled with [PAD] characters.
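A minimal sketch of the preprocessing in S220 to S240 is given below, assuming a dictionary mapping text types to style prompts and treating the whole string "[PAD]" as one padding character; both are illustrative assumptions rather than details fixed by this embodiment.

    STYLE_PROMPTS = {"ancient_poem": "[GSC]", "classical_chinese": "[WYW]", "wiki": "[WIK]"}  # assumed mapping

    def build_training_texts(corpus_text, text_type, chunk_len=1023, pad_char="[PAD]"):
        """Split one Chinese corpus text into 1023-character chunks, pad short chunks with [PAD],
        and prepend the style prompt so every training text is 1024 characters long."""
        prompt = STYLE_PROMPTS[text_type]
        training_texts = []
        for start in range(0, len(corpus_text), chunk_len):
            chunk = list(corpus_text[start:start + chunk_len])
            chunk += [pad_char] * (chunk_len - len(chunk))   # fill the last, shorter chunk up to 1023 characters
            training_texts.append([prompt] + chunk)          # the style prompt occupies the first of 1024 positions
        return training_texts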
S250, performing text labeling on each training text to obtain labeled training texts.
In a specific example, in each training text, a label value (the upcoming next text character) corresponding to each text character is set, so as to complete text labeling of each training text, thereby obtaining labeled training texts.
S260, establishing an initial language model based on the generated pre-training network, and training the initial language model based on each training text with completed labels to obtain a target language model with completed training.
A language model typically constructs an unsupervised distribution estimate from a set of data $(X_1, X_2, \ldots, X_n)$, where each piece of data consists of a variable-length symbol sequence $(s_1, s_2, \ldots, s_n)$. Since language has a natural sequential order, the joint probability over the symbols can be decomposed into a product of conditional probabilities, i.e.

$p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})$

In this embodiment, in order to learn and train effectively on corpora of multiple styles, so that the language model can support language generation tasks of multiple language styles and has diversity and versatility, a style prompt is inserted before corpus texts of different styles. The modified text symbol sequence can be expressed as

$(prompt, s_1, s_2, \ldots, s_n)$

Thus, the updated joint distribution probability of the text can be expressed as

$p(x) = p(prompt) \cdot \prod_{i=1}^{n} p(s_i \mid prompt, s_1, \ldots, s_{i-1})$
In another alternative implementation of this embodiment, establishing the initial language model based on the generative pre-training network may include:
establishing the initial language model based on the generative pre-training network according to

$p(x) = p(prompt) \cdot \prod_{i=1}^{n} p(s_i \mid prompt, s_1, \ldots, s_{i-1})$

wherein $(s_1, s_2, \ldots, s_n)$ represents a Chinese corpus text comprising n text characters, $i$ represents an index into the Chinese corpus text, $s_1$ represents the first text character, $prompt$ represents the style prompt, $p(x)$ represents the joint distribution probability of the text, $p(prompt)$ represents the probability that the text type matches $prompt$, $p(s_i \mid prompt, s_1, \ldots, s_{i-1})$ represents the probability that the Chinese corpus text is $(s_1, \ldots, s_n)$, and "$\cdot$" denotes multiplication.
In this embodiment, an initial language model may be built from the updated text joint distribution probability and the generative pre-training network, e.g., GPT2; the initial language model is then trained with the labeled training texts, and model training ends when the preset number of training iterations or a training termination condition is reached, yielding the trained target language model. The network structure of GPT2 may be as shown in Fig. 2C.
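One possible way to instantiate such a GPT2-based initial model is sketched below with the Hugging Face transformers library; the library choice and the layer/head/embedding sizes are assumptions for illustration, and only the 25600-entry output character vector and the 1024-position context (style prompt plus 1023 text characters) come from this embodiment.

    from transformers import GPT2Config, GPT2LMHeadModel

    # Assumed instantiation; the hyperparameters below are placeholders, not values from this patent.
    config = GPT2Config(
        vocab_size=25600,   # size of the output character vector mentioned in this embodiment
        n_positions=1024,   # style prompt + 1023 text characters
        n_layer=12,
        n_head=12,
        n_embd=768,
    )
    initial_model = GPT2LMHeadModel(config)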
S270, inputting the style prompt, the Chinese text prefix and the text generation length input by the user into the pre-trained target language model, and obtaining the renewal text output by the target language model; wherein the target language model is built based on a generative pre-training network.
And S280, displaying the renewal text.
According to the technical scheme of this embodiment, before the style prompt, Chinese text prefix and text generation length input by a user are fed into the pre-trained target language model, Chinese corpus texts of different text types are obtained, and each Chinese corpus text is divided based on a preset text length to obtain sub-corpus texts; if the text length of a target sub-corpus text is detected to be smaller than the preset text length, the target sub-corpus text is padded with preset characters so that its text length equals the preset text length; further, the style prompt corresponding to each sub-corpus text is obtained according to the text type of the corresponding Chinese corpus text, and the training texts are generated from the sub-corpus texts and their style prompts; next, each training text is labeled to obtain the labeled training texts; finally, an initial language model is established based on a generative pre-training network and trained with the labeled training texts to obtain the trained target language model. By adding the corresponding style prompt to each divided sub-corpus text to generate a training text, and then training the target language model on these training texts, the language model can support Chinese language generation in multiple language styles and thus has diversity and universality.
In an optional implementation manner of this embodiment, text labeling is performed on each training text to obtain each labeled training text, which may include:
judging whether the current text character is the last text character in the current training text;
if it is determined that the current text character is not the last text character, determining that the label value corresponding to the current text character is the adjacent next text character;
if it is determined that the current text character is the last text character, determining that the label value corresponding to the current text character is the second text character of the adjacent next training text;
and acquiring the current training text with the completed annotation.
It should be noted that the text labeling of existing training texts may be as shown in Fig. 2D: the label value corresponding to each text character is the adjacent next text character, and for the last text character of each training text, the corresponding label value is the first text character of the adjacent next training text; the current training text and the adjacent next training text may be adjacent texts obtained by dividing the same Chinese corpus text and may correspond to the same style prompt. In this embodiment, however, a style prompt is inserted at the very beginning of each training text, so under that scheme the label value corresponding to the last text character of each training text would be the style prompt of the adjacent next training text, and the originally continuous semantics would be interrupted.
To address this problem, this embodiment optimizes the text labeling of the training texts as shown in Fig. 2E: for a text character that is not the last text character, the corresponding label value is set to the adjacent next text character of the current training text; for the last text character, the corresponding label value is set to the second text character of the adjacent next training text. By splicing the training texts in this way, semantic continuity is maintained.
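A small sketch of the optimized labeling rule, under the assumption that adjacent training texts come from the same corpus text and that each training text is a list of 1024 characters beginning with its style prompt:

    def label_training_text(current_text, next_text):
        """Next-character labels for one training text (a list of 1024 characters, style prompt first)."""
        labels = []
        last = len(current_text) - 1
        for i in range(len(current_text)):
            if i < last:
                labels.append(current_text[i + 1])  # ordinary case: label is the adjacent next character
            else:
                # last character: skip the next training text's style prompt and use its
                # second character, so the original semantics stay continuous
                labels.append(next_text[1])
        return labels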
In another optional implementation manner of this embodiment, training the initial language model based on each training text with completed labeling to obtain a target language model with completed training may include:
judging whether the current training times is smaller than or equal to a preset time threshold value or not;
if yes, acquiring a label vector corresponding to the current training text, and judging whether each label value of the label vector is a preset character or a style prompt;
updating a preset identification vector according to a judging result of whether each label value of the label vector is a preset character or a style prompt, so as to obtain an updated identification vector corresponding to the current training text;
Acquiring an original loss vector corresponding to the current training text, and updating the original loss vector by adopting the updating identification vector to acquire an updating loss vector corresponding to the current training text;
and acquiring a loss function corresponding to each training text according to the updated loss vector corresponding to the current training text, and performing current training based on the loss function.
In this embodiment, the language model may be trained in a distributed manner, i.e., the workload of model training may be split and shared among multiple worker nodes (e.g., servers). For example, the language model may be divided into parts that run concurrently on different servers, each part may be trained with the same training text set, and the calculation results (e.g., forward loss values, backward gradients) from the different servers may be aggregated according to a preset rule (e.g., averaging) to obtain the final calculation results; the weight parameters of the model parts on the different servers are then updated based on these final results.
In a specific example, it is first judged whether the current training iteration n is less than or equal to a preset threshold skip_num. If so, the forward loss values $loss_k$ and backward gradient values $grad_k$ of the different training nodes in the distributed training system are calculated on the training text set; then, according to the preset all-reduce rule (averaging), the final loss value $loss_{final}$ and final backward gradient value $grad_{final}$ are calculated from the $loss_k$ and $grad_k$ of each training node, and the model weight parameters are updated based on $loss_{final}$ and $grad_{final}$ to complete the current training iteration; where $loss_{final}$ and $grad_{final}$ can be expressed as:

$loss_{final} = \frac{1}{N} \sum_{k=1}^{N} loss_k$, $grad_{final} = \frac{1}{N} \sum_{k=1}^{N} grad_k$

where N represents the number of training nodes and k represents the index of a training node.
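A toy sketch of this all-reduce averaging with made-up per-node loss values is shown below; in a real system the same element-wise averaging would also be applied to the backward gradient values, typically through a distributed-training framework rather than plain Python.

    def all_reduce_average(node_values):
        """Average a quantity computed independently on N training nodes (loss_k or, element-wise, grad_k)."""
        return sum(node_values) / len(node_values)

    # Hypothetical per-node forward losses from one training iteration:
    node_losses = [2.31, 2.28, 2.35, 2.30]        # loss_k for k = 1..N
    loss_final = all_reduce_average(node_losses)  # loss_final = (1/N) * sum_k loss_k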
In this embodiment, in order to avoid the increase of the loss value caused by introducing the preset character and the style prompt, the original loss vector may be updated based on the preset identification vector, so as to accelerate the training and convergence process of the model. The update flow of the loss vector may be as shown in fig. 2F.
First, an identification vector loss_mask is preset whose elements are all 1 and whose length equals the training text length; then, the label vector corresponding to the current training text is obtained, which consists of the label values corresponding to the text characters, and each element of the identification vector corresponds to the label value at the same position in the label vector. Further, each label value in the label vector is checked, and if a label value is a preset character or a style prompt, the identification-vector element corresponding to that label value is set to 0, for example loss_mask[label vector == '[PAD]'] = 0, loss_mask[label vector == '[GSC]'] = 0, loss_mask[label vector == '[WYW]'] = 0, loss_mask[label vector == '[WIK]'] = 0. The updated identification vector is thereby obtained.
Finally, the updated identification vector may be point-multiplied (element-wise) with the original loss vector to obtain the updated loss vector, and the current loss function is obtained from the updated loss vector. The forward loss value of the current training iteration can then be calculated from this loss function, and the model weight parameters can be adjusted according to the forward loss value. This process is repeated until the number of training iterations exceeds the preset threshold, after which the forward loss value is again calculated using the original loss vector.
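A minimal sketch of this loss-vector update, assuming a NumPy per-position loss vector; how the masked vector is reduced to a single forward loss value is not spelled out in this embodiment, so the mean over unmasked positions used here is an assumption.

    import numpy as np

    SPECIAL_LABELS = {"[PAD]", "[GSC]", "[WYW]", "[WIK]"}   # preset character and style prompts

    def masked_forward_loss(per_position_loss, labels):
        """Zero out positions whose label is a preset character or a style prompt, then reduce."""
        loss_mask = np.array([0.0 if lab in SPECIAL_LABELS else 1.0 for lab in labels])
        updated_loss = loss_mask * per_position_loss        # point (element-wise) multiplication
        return updated_loss.sum() / max(loss_mask.sum(), 1.0)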
In a specific implementation of this embodiment, the flow of the method for generating a Chinese universal language may be as shown in Fig. 2G. Specifically, the method may comprise a Chinese corpus preprocessing stage, a model training stage and an inference optimization stage. In the Chinese corpus preprocessing stage, the Chinese corpus texts are divided into sub-corpus texts of equal text length, the missing parts are padded with preset characters, and the corresponding style prompt is inserted at the very beginning of each sub-corpus text to generate the training texts; further, a corresponding label value is set for each text character in each training text to obtain the labeled training texts.
Next, in the model training stage, when the number of training iterations is less than or equal to the preset threshold, the updated identification vector is obtained by checking each label value in the label vector, the original loss vector is updated based on the updated identification vector, and the forward loss value is calculated from the updated loss vector. When the number of training iterations exceeds the preset threshold, the forward loss value is calculated from the original loss vector.
Finally, in the inference optimization stage, on top of the top-k method, the output probability corresponding to each style prompt is set to an infinitesimal value, so that the influence of style prompts on the final inference output, i.e. the renewal text characters, is avoided.
Example 3
Fig. 3 is a schematic structural diagram of a device for generating a Chinese universal language according to a third embodiment of the present invention. As shown in Fig. 3, the device may include: a user input acquisition module 310, a renewal text acquisition module 320, and a renewal text display module 330; wherein
a user input obtaining module 310, configured to obtain a style prompt, a chinese text prefix, and a text generation length input by a user;
the renewal text acquisition module 320 is configured to input the style prompt, the Chinese text prefix and the text generation length input by the user into a pre-trained target language model, and to obtain the renewal text output by the target language model; wherein the target language model is established based on a generative pre-training network;
and the renewal text display module 330 is configured to display the renewal text.
According to the technical scheme of this embodiment, the style prompt, Chinese text prefix and text generation length input by the user are obtained and fed into a pre-trained target language model to obtain the renewal text output by the model, and the renewal text is then displayed. Because a language model capable of generating Chinese universal language is established based on a generative pre-training network, automatic generation of Chinese text of different styles can be realized with a single language model, which improves the diversity and universality of Chinese language generation.
Optionally, the device for generating a Chinese universal language further includes:
the sub-corpus text acquisition module is used for acquiring Chinese corpus texts of different text types, and dividing each Chinese corpus text based on a preset text length so as to acquire at least one sub-corpus text;
the text filling module is used for filling the target sub-corpus text by adopting preset characters if the text length of the target sub-corpus text is detected to be smaller than the preset text length, so that the text length of the target sub-corpus text is equal to the preset text length;
The training text generation module is used for acquiring style prompt corresponding to each sub-corpus text according to the text type corresponding to each Chinese corpus text and generating each training text according to each sub-corpus text and the corresponding style prompt;
the text labeling module is used for labeling the text of each training text so as to obtain each labeled training text;
the model training module is used for establishing an initial language model based on a generative pre-training network, and training the initial language model based on each labeled training text so as to obtain a trained target language model.
Optionally, the text labeling module includes:
the text character judging unit is used for judging whether the current text character is the last text character in the current training text;
a first tag value obtaining unit, configured to obtain, if it is determined that the current text character is not the last text character, that a tag value corresponding to the current text character is an adjacent next text character;
a second tag value obtaining unit, configured to obtain, if it is determined that the current text character is the last text character, a second text character whose tag value corresponding to the current text character is the next adjacent training text;
And the training text acquisition unit is used for acquiring the current training text with completed annotation.
Optionally, the model training module includes:
the training frequency judging unit is used for judging whether the current training frequency is smaller than or equal to a preset frequency threshold value;
the label vector obtaining unit is used for obtaining a label vector corresponding to the current training text and judging whether each label value of the label vector is a preset character or a style prompt;
the updating identification vector obtaining unit is used for updating the preset identification vector according to the judging result of whether each label value of the label vector is a preset character or a style prompt, so as to obtain an updating identification vector corresponding to the current training text;
the updating loss vector obtaining unit is used for obtaining an original loss vector corresponding to the current training text, and updating the original loss vector by adopting the updating identification vector so as to obtain an updating loss vector corresponding to the current training text;
and the loss function acquisition unit is used for acquiring the loss function corresponding to each training text according to the updated loss vector corresponding to the current training text, and carrying out current training based on the loss function.
Optionally, the model training module is specifically configured to:
establish the initial language model based on the generative pre-training network according to

$p(x) = p(prompt) \cdot \prod_{i=1}^{n} p(s_i \mid prompt, s_1, \ldots, s_{i-1})$

wherein $(s_1, s_2, \ldots, s_n)$ represents a Chinese corpus text comprising n text characters, $i$ represents an index into the Chinese corpus text, $s_1$ represents the first text character, $prompt$ represents the style prompt, $p(x)$ represents the joint distribution probability of the text, $p(prompt)$ represents the probability that the text type matches $prompt$, and $p(s_i \mid prompt, s_1, \ldots, s_{i-1})$ represents the probability that the Chinese corpus text is $(s_1, \ldots, s_n)$.
Optionally, the renewal text acquisition module 320 includes:
the output probability acquisition unit is used for acquiring an output character vector corresponding to the current text position output by the target language model and the output probability corresponding to each output character in the output character vector;
the output probability updating unit is used for updating the output probability corresponding to a target output character to an infinitesimal value when the target output character in the output character vector is detected to be a style prompt;
and the renewal text character acquisition unit is used for obtaining the renewal text character corresponding to the current text position according to the output probabilities corresponding to the output characters.
Optionally, the renewal text character acquisition unit is specifically configured to normalize the output probability corresponding to each output character, so as to obtain the normalized output probability corresponding to each output character;
acquire a preset number of candidate output characters from the output characters in descending order of normalized output probability;
and randomly sample among the candidate output characters to obtain the renewal text character corresponding to the current text position.
The device for generating the Chinese universal language provided by the embodiment of the invention can execute the method for generating the Chinese universal language provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the technical solution of this embodiment, the acquisition, storage and use of users' personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
Example 4
Fig. 4 shows a schematic diagram of an electronic device 40 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, etc., in which the memory stores a computer program executable by the at least one processor, and the processor 41 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from the storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data required for the operation of the electronic device 40 may also be stored. The processor 41, the ROM 42 and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
Various components in electronic device 40 are connected to I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the electronic device 40 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 41 may be various general and/or special purpose processing components with processing and computing capabilities. Some examples of the processor 41 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 41 performs the various methods and processes described above, for example the method for generating a Chinese universal language.
In some embodiments, the method for generating a Chinese universal language may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into the RAM 43 and executed by the processor 41, one or more steps of the method for generating a Chinese universal language described above may be performed. Alternatively, in other embodiments, the processor 41 may be configured to perform the method for generating a Chinese universal language in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that the various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention can be achieved; the present invention is not limited in this respect.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method for generating a Chinese universal language, comprising:
acquiring a style prompt, a Chinese text prefix and a text generation length input by a user;
obtaining Chinese corpus texts of different text types, and dividing each Chinese corpus text based on a preset text length to obtain at least one sub-corpus text;
if the text length of the target sub-corpus text is detected to be smaller than the preset text length, filling the target sub-corpus text by adopting preset characters so that the text length of the target sub-corpus text is equal to the preset text length;
acquiring the style prompt corresponding to each sub-corpus text according to the text type corresponding to each Chinese corpus text, and generating each training text according to each sub-corpus text and the corresponding style prompt;
performing text labeling on each training text to obtain labeled training texts;
establishing an initial language model based on a generative pre-training network, and training the initial language model based on each labeled training text to obtain the trained target language model;
inputting the style prompt, the Chinese text prefix and the text generation length input by the user into the trained target language model, and obtaining a renewal text output by the target language model; wherein the target language model is established based on a generative pre-training network;
and displaying the renewal text.
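For illustration only, the following is a minimal Python sketch of how the data-preparation steps of claim 1 could be realized; the pad character, preset length, prompt format, and all function names are hypothetical assumptions and are not taken from the patent.

```python
# Illustrative sketch of the data-preparation steps in claim 1.
# PAD_CHAR, PRESET_LEN, and the prompt placement are assumptions, not the patent's values.
PAD_CHAR = "§"        # preset character used to fill short sub-corpus texts
PRESET_LEN = 512      # preset text length

def split_and_pad(corpus_text: str) -> list[str]:
    """Divide one Chinese corpus text into fixed-length sub-corpus texts, padding the tail."""
    chunks = [corpus_text[i:i + PRESET_LEN] for i in range(0, len(corpus_text), PRESET_LEN)]
    if chunks and len(chunks[-1]) < PRESET_LEN:
        chunks[-1] = chunks[-1] + PAD_CHAR * (PRESET_LEN - len(chunks[-1]))
    return chunks

def build_training_texts(corpora: dict[str, list[str]],
                         style_prompts: dict[str, str]) -> list[str]:
    """Prepend the style prompt that corresponds to each text type to every sub-corpus text."""
    training_texts = []
    for text_type, texts in corpora.items():
        prompt = style_prompts[text_type]          # style prompt chosen per text type
        for corpus_text in texts:
            for sub_text in split_and_pad(corpus_text):
                training_texts.append(prompt + sub_text)
    return training_texts
```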
2. The method of claim 1, wherein performing text labeling on each of the training texts to obtain each labeled training text comprises:
judging whether the current text character is the last text character in the current training text;
if the current text character is determined not to be the last text character, acquiring the label value corresponding to the current text character as the adjacent next text character;
if the current text character is determined to be the last text character, taking the second text character of the adjacent next training text as the label value corresponding to the current text character;
and acquiring the current training text with the completed annotation.
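As a hedged illustration of the labeling rule in claim 2, the Python sketch below assigns each character the adjacent next character as its label and, for the last character of a training text, the second character of the next training text; the wrap-around for the final training text and the function name are assumptions made only for this example.

```python
# Illustrative sketch of the labeling rule in claim 2; names and wrap-around are assumptions.
def label_training_texts(training_texts: list[str]) -> list[list[str]]:
    """For each character, the label is the next character; for the last character of a
    training text, the label is the second character of the adjacent next training text."""
    labeled = []
    for idx, text in enumerate(training_texts):
        next_text = training_texts[(idx + 1) % len(training_texts)]
        labels = []
        for pos in range(len(text)):
            if pos < len(text) - 1:
                labels.append(text[pos + 1])   # not the last character: adjacent next character
            else:
                labels.append(next_text[1])    # last character: second character of next text
        labeled.append(labels)
    return labeled
```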
3. The method of claim 2, wherein training the initial language model based on each labeled training text to obtain the trained target language model comprises:
judging whether the current number of training iterations is less than or equal to a preset iteration threshold;
if yes, acquiring a label vector corresponding to the current training text, and judging whether each label value of the label vector is a preset character or a style prompt;
updating a preset identification vector according to the judgment result of whether each label value of the label vector is a preset character or a style prompt, so as to obtain an updated identification vector corresponding to the current training text;
acquiring an original loss vector corresponding to the current training text, and updating the original loss vector by using the updated identification vector to obtain an updated loss vector corresponding to the current training text;
and acquiring a loss function corresponding to each training text according to the updated loss vector corresponding to the current training text, and performing the current training iteration based on the loss function.
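The loss-vector update of claim 3 can be pictured as masking: positions whose label is the preset (pad) character or a style prompt contribute nothing to the loss. The sketch below is one possible reading; the 0/1 mask values, the use of NumPy, and the final averaging are assumptions rather than the patented implementation.

```python
import numpy as np

# Illustrative sketch of the loss-masking step in claim 3; names and 0/1 masking are assumptions.
def masked_loss(label_values: list[str], original_loss: np.ndarray,
                pad_char: str, style_prompts: set[str]) -> float:
    """Zero out loss positions whose label is the preset (pad) character or a style prompt,
    then reduce the updated loss vector to the loss used for the current training step."""
    # updated identification vector: 0 where the label is a pad/style prompt, 1 otherwise
    mask = np.array([0.0 if (c == pad_char or c in style_prompts) else 1.0
                     for c in label_values])
    updated_loss = original_loss * mask            # updated loss vector
    return float(updated_loss.sum() / max(mask.sum(), 1.0))
```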
4. The method of claim 1, wherein the building an initial language model based on the generative pre-training network comprises:
according to
Figure QLYQS_1
Establishing an initial language model based on the generated pre-training network;
wherein,,
Figure QLYQS_4
representing a Chinese corpus text comprising n text characters,iindex representing Chinese corpus text, ++>
Figure QLYQS_8
Represent the firstlText characters->
Figure QLYQS_9
Representing style prompts->
Figure QLYQS_3
Representing text joint distribution probability, < >>
Figure QLYQS_5
Representing text type and +.>
Figure QLYQS_6
Probability of match, ++>
Figure QLYQS_7
Representing Chinese corpus text as +.>
Figure QLYQS_2
Is a probability of (2).
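One plausible reading, stated here only as an assumption about how a generative (GPT-style) pre-training network would realize the conditional term p(x_i | s), is an autoregressive factorization over the text characters:

```latex
p(x_i \mid s) = \prod_{l=1}^{n} p\left(x_l \mid s, x_1, x_2, \ldots, x_{l-1}\right)
```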
5. The method of claim 1, wherein the obtaining the renewal text output by the target language model comprises:
obtaining an output character vector corresponding to the current text position output by the target language model and output probabilities corresponding to all output characters in the output character vector;
when a target output character in the output character vector is detected to be a style prompt, updating the output probability corresponding to the target output character to an infinitesimally small value;
and obtaining the renewal text character corresponding to the current text position according to the output probability corresponding to each of the output characters.
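A minimal sketch of the style-prompt filtering in claim 5 follows; treating the model outputs as unnormalized scores and using negative infinity as the "infinitesimally small value" are assumptions made for illustration, and all names are hypothetical.

```python
import numpy as np

# Illustrative sketch of claim 5's filtering step; variable names and the use of
# negative infinity on unnormalized scores are assumptions.
def mask_style_prompts(output_chars: list[str], output_scores: np.ndarray,
                       style_prompts: set[str]) -> np.ndarray:
    """Push the score of any output character that is a style prompt down to an
    infinitesimally small value so it cannot be emitted as a renewal character."""
    masked = output_scores.astype(float).copy()
    for i, ch in enumerate(output_chars):
        if ch in style_prompts:
            masked[i] = -np.inf    # effectively removes the character after normalization
    return masked
```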
6. The method of claim 5, wherein obtaining the renewal text character corresponding to the current text position according to the output probability corresponding to each of the output characters comprises:
normalizing the output probability corresponding to each output character to obtain the normalized output probability corresponding to each output character;
acquiring a preset number of candidate output characters from the output characters according to the order of the normalized output probability from high to low;
and randomly sampling among the candidate output characters to obtain the renewal text character corresponding to the current text position.
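The normalization, candidate selection, and random sampling of claim 6 roughly amount to top-k sampling. The sketch below assumes softmax normalization, k = 10, and probability-weighted sampling among the candidates; none of these specifics are fixed by the claim, and the function name is hypothetical.

```python
import numpy as np

# Illustrative sketch of claim 6: normalize, keep the top-k candidates, sample randomly.
# The softmax normalization, k value, and weighted sampling are assumptions.
def sample_renewal_character(output_chars: list[str], masked_scores: np.ndarray,
                             k: int = 10) -> str:
    """Normalize the (masked) scores, take the k most probable candidate characters,
    and randomly sample one of them as the renewal character for the current position."""
    exp_scores = np.exp(masked_scores - np.max(masked_scores))   # numerically stable softmax
    probs = exp_scores / exp_scores.sum()                        # normalized output probabilities
    top_k = np.argsort(probs)[::-1][:k]                          # candidates, highest prob first
    top_probs = probs[top_k] / probs[top_k].sum()                # renormalize over the candidates
    choice = np.random.choice(top_k, p=top_probs)                # random sampling among candidates
    return output_chars[int(choice)]
```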
7. A Chinese universal language generation apparatus, comprising:
the user input acquisition module is used for acquiring the style prompt, Chinese text prefix and text generation length input by a user;
the sub-corpus text acquisition module is used for acquiring Chinese corpus texts of different text types, and dividing each Chinese corpus text based on a preset text length so as to acquire at least one sub-corpus text;
the text filling module is used for filling the target sub-corpus text by adopting preset characters if the text length of the target sub-corpus text is detected to be smaller than the preset text length, so that the text length of the target sub-corpus text is equal to the preset text length;
the training text generation module is used for acquiring style prompt corresponding to each sub-corpus text according to the text type corresponding to each Chinese corpus text and generating each training text according to each sub-corpus text and the corresponding style prompt;
the text labeling module is used for labeling the text of each training text so as to obtain each labeled training text;
the model training module is used for establishing an initial language model based on a generative pre-training network, and training the initial language model based on each labeled training text so as to obtain the trained target language model;
the renewal text acquisition module is used for inputting the style prompt, the Chinese text prefix and the text generation length input by the user into the trained target language model to acquire a renewal text output by the target language model; wherein the target language model is established based on a generative pre-training network;
and the renewal text display module is used for displaying the renewal text.
8. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method for generating a Chinese universal language according to any one of claims 1 to 6.
9. A computer readable storage medium storing computer instructions for causing a processor to execute the method for generating a Chinese universal language according to any one of claims 1 to 6.
CN202310348704.7A 2023-04-04 2023-04-04 Method, device, equipment and storage medium for generating Chinese universal language Active CN116151194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310348704.7A CN116151194B (en) 2023-04-04 2023-04-04 Method, device, equipment and storage medium for generating Chinese universal language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310348704.7A CN116151194B (en) 2023-04-04 2023-04-04 Method, device, equipment and storage medium for generating Chinese universal language

Publications (2)

Publication Number Publication Date
CN116151194A CN116151194A (en) 2023-05-23
CN116151194B (en) 2023-07-07

Family

ID=86358394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310348704.7A Active CN116151194B (en) 2023-04-04 2023-04-04 Method, device, equipment and storage medium for generating Chinese universal language

Country Status (1)

Country Link
CN (1) CN116151194B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116933751A (en) * 2023-06-30 2023-10-24 北京百度网讯科技有限公司 Article generation method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157693B2 (en) * 2020-02-25 2021-10-26 Adobe Inc. Stylistic text rewriting for a target author
CN111832275B (en) * 2020-09-21 2022-02-25 北京百度网讯科技有限公司 Text creation method, device, equipment and storage medium
US11487971B2 (en) * 2020-10-16 2022-11-01 Adobe Inc. Multi-dimensional language style transfer
KR102549937B1 (en) * 2021-01-12 2023-06-30 주식회사 어반베이스 Apparatus and method for providing model for analysis of user's interior style based on text data of social network service
CN113705187B (en) * 2021-08-13 2023-08-01 北京百度网讯科技有限公司 Method and device for generating pre-training language model, electronic equipment and storage medium
CN113962315B (en) * 2021-10-28 2023-12-22 北京百度网讯科技有限公司 Model pre-training method, device, equipment, storage medium and program product
CN114417827A (en) * 2022-01-28 2022-04-29 北京弗罗达教育科技有限公司 Text context processing method and device, electronic equipment and storage medium
CN114997164A (en) * 2022-05-31 2022-09-02 北京深言科技有限责任公司 Text generation method and device
CN115879469B (en) * 2022-12-30 2023-10-03 北京百度网讯科技有限公司 Text data processing method, model training method, device and medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116151194A (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306
Patentee after: Shanghai Suiyuan Technology Co.,Ltd.
Country or region after: China
Address before: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306
Patentee before: SHANGHAI ENFLAME TECHNOLOGY Co.,Ltd.
Country or region before: China