CN116842155B - Text generation method, training method and device of text generation model - Google Patents


Info

Publication number
CN116842155B
Authority
CN
China
Prior art keywords
text
question
answer
current
step sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310797048.9A
Other languages
Chinese (zh)
Other versions
CN116842155A (en)
Inventor
姜文斌
郝洋
冯知凡
吕雅娟
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310797048.9A priority Critical patent/CN116842155B/en
Priority to CN202410572183.8A priority patent/CN118312598A/en
Publication of CN116842155A publication Critical patent/CN116842155A/en
Application granted granted Critical
Publication of CN116842155B publication Critical patent/CN116842155B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text generation method, a training method and a training device of a text generation model, relates to the technical field of artificial intelligence, and particularly relates to the fields of natural language processing, deep learning, reinforcement learning and the like. The implementation scheme is as follows: acquiring a first question text; initializing a historical step sequence text to a preset value; and updating the historical step sequence text at least once based on the first question text to obtain a target step sequence text, each update comprising: generating a current step text based on the first question text and the current historical step sequence text, wherein the current step text represents a current answering step of the first question; in response to the current step text not being the preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; in response to the current step text being the termination text, determining the current historical step sequence text as the target step sequence text.

Description

Text generation method, training method and device of text generation model
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the technical fields of natural language processing, deep learning, reinforcement learning, and the like, and in particular, to a text generation method and apparatus, a training method and apparatus for a text generation model, an electronic device, a computer readable storage medium, and a computer program product.
Background
Artificial intelligence (AI) is the discipline of making computers simulate certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
A large language model (Large Language Model, LLM, also referred to as a large model) is a deep learning model trained on large amounts of text data that can generate natural language text or understand the meaning of natural language text. Large language models can handle a variety of natural language tasks, such as text classification, text generation, question answering, and dialogue, and are an important approach to artificial intelligence.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a text generation method and apparatus, a training method and apparatus of a text generation model, an electronic device, a computer readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a text generation method including: acquiring a first question text, wherein the first question text represents a first question of a to-be-determined solving step sequence; initializing a historical step sequence text to a preset value; and updating the historical step sequence text at least once based on the first question text to obtain a target step sequence text, wherein the target step sequence text represents a solving step sequence of the first question, the target step sequence text comprises at least one step text, and each step text in the at least one step text represents one solving step of the first question; wherein each of the at least one update comprises: generating a current step text based on the first question text and a current historical step sequence text, wherein the current step text represents a current answering step of the first question; in response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text.
According to an aspect of the present disclosure, there is provided a training method of a text generation model, including: acquiring a question-answer text pair, wherein the question-answer text pair comprises a sample question text representing a sample question and a sample answer text representing an answer of the sample question; initializing a historical step sequence text to a preset value; repeatedly performing the following operations to generate a target step sequence text, wherein the target step sequence text represents a solution step sequence of the sample question: inputting the sample question text and the current historical step sequence text into the text generation model to obtain a current step text output by the text generation model, wherein the current step text represents a current answering step of the sample question; in response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text; generating a predicted answer text of the sample question based on the target step sequence text; determining rewards of the text generation model based on the predicted answer text and the sample answer text; and adjusting parameters of the text generation model based on the rewards.
According to an aspect of the present disclosure, there is provided a text generating apparatus including: the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire a first question text, and the first question text represents a first question of a to-be-determined solution step sequence; the initialization module is configured to initialize the historical step sequence text to a preset value; and an updating module configured to update the historical step sequence text at least once based on the first question text to obtain a target step sequence text, wherein the target step sequence text represents a solution step sequence of the first question, the target step sequence text comprises at least one step text, and each step text in the at least one step text represents one solution step of the first question; wherein the update module comprises: a generating unit configured to generate a current step text based on the first question text and a current history step sequence text, wherein the current step text represents a current solving step of the first question; an updating unit configured to splice the current historical step sequence text with the current step text in response to the current step text not being a preset termination text, so as to obtain an updated historical step sequence text; and a determining unit configured to determine the current history step sequence text as the target step sequence text in response to the current step text being the termination text.
According to an aspect of the present disclosure, there is provided a training apparatus of a text generation model, including: an acquisition module configured to acquire a question-answer text pair, wherein the question-answer text pair includes a sample question text representing a sample question and a sample answer text representing an answer to the sample question; the initialization module is configured to initialize the historical step sequence text to a preset value; a first generation module configured to repeatedly perform the following operations to generate a target step sequence text, wherein the target step sequence text represents a solution step sequence of the sample question: inputting the sample question text and the current historical step sequence text into the text generation model to obtain a current step text output by the text generation model, wherein the current step text represents a current answering step of the sample question; in response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text; a second generation module configured to generate a predicted answer text for the sample question based on the target step sequence text; a determining module configured to determine rewards of the text generation model based on the predicted answer text and the sample answer text; and an adjustment module configured to adjust parameters of the text generation model based on the rewards.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the above aspects.
According to an aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when executed by a processor, implement the method of any of the above aspects.
In accordance with one or more embodiments of the present disclosure, machine question-answering prompt texts with solution steps are efficiently constructed.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a text generation method according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a training method of a text generation model according to an embodiment of the present disclosure;
Fig. 4 shows a block diagram of a text generating apparatus according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a training device of a text generation model according to an embodiment of the present disclosure; and
Fig. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items. "plurality" means two or more.
In the technical solution of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Large language model technology has made remarkable progress in recent years, becoming a revolutionary breakthrough in the AI field. Large language models such as ERNIE Bot, ChatGPT (Chat Generative Pre-trained Transformer), and GPT-4 exhibit powerful task-solving capabilities and can handle multiple natural language tasks such as text classification, text generation, question answering, and dialogue. In question-answering tasks, large language models can give accurate answers to relatively simple questions, but because they lack memory and reasoning capabilities, they have difficulty solving complex questions (e.g., complex mathematical or physical calculation questions) effectively. For complex questions, large language models may produce answers that lack a logical basis or contain factual errors, misleading the user.
In the related art, in order to improve the ability of a large language model to handle complex questions, a chain-of-thought prompting (Chain-of-Thought Prompting) method is generally adopted to guide the large language model to reason step by step and give an answer.
A thought chain (Chain-of-Thought, CoT) is a series of consecutive intermediate reasoning steps for deriving the final answer to a question; that is, a thought chain is a solution step sequence consisting of a series of solving steps of the question. For example, solving the question "Who is older, Xiaoming or Xiaohua?" specifically includes the four solving steps "find Xiaoming's age", "find Xiaohua's age", "compare the two ages", and "describe the comparison result". Therefore, the thought chain for solving this question is "find Xiaoming's age → find Xiaohua's age → compare the two ages → describe the comparison result".
Chain-of-thought prompting refers to taking a reference question and its answer with the solving process (i.e., the thought chain) as a prompt text (Prompt), and inputting the prompt text together with the question to be solved into a large language model, so as to guide the large language model to reason step by step and give an answer to the question to be solved.
In the related art, a thought chain is generally constructed by manually labeling the solving steps of a question. This approach has high labor cost and low labeling efficiency, and it is difficult to quickly construct a large amount of thought chain data.
In view of the above problems, the embodiments of the present disclosure provide a text generation method capable of automatically generating the solution step sequence of a question by means of text generation, so as to efficiently construct a large number of machine question-answering prompt texts with solution steps. Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the client devices 101, 102, 103, 104, 105, and 106 and the server 120 may run one or more services or software applications that enable execution of a text generation method or training method of a text generation model.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The client devices 101, 102, 103, 104, 105, and/or 106 may provide interfaces that enable a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, vehicle-mounted devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems; or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. The cloud server is a host product in a cloud computing service system, intended to overcome the drawbacks of high management difficulty and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
Fig. 2 shows a flow chart of a text generation method 200 according to an embodiment of the present disclosure. The subject of execution of the various steps of method 200 is typically a server, such as server 120 shown in fig. 1. In some embodiments, the subject of execution of method 200 may also be a client device, such as client devices 101-106 shown in FIG. 1.
As shown in FIG. 2, method 200 includes operations S210-S230.
In operation S210, a first question text is acquired. The first question text represents a first question of the sequence of solution steps to be determined.
In operation S220, the history step sequence text is initialized to a preset value.
In operation S230, the history step sequence text is updated at least once based on the first question text to obtain the target step sequence text. The target step sequence text represents a solution step sequence of the first question, the target step sequence text comprising at least one step text, each of the at least one step text representing one solution step of the first question.
Each of the at least one update of operation S230 described above includes operations S231-S233.
In operation S231, a current step text is generated based on the first question text and the current historical step sequence text. Wherein the current step text represents the current solving step of the first question.
In operation S232, in response to the current step text not being the preset termination text, the current historical step sequence text is spliced with the current step text to obtain an updated historical step sequence text.
In operation S233, in response to the current step text being the termination text, the current history step sequence text is determined as the target step sequence text.
According to the embodiment of the disclosure, based on the first question text and the already generated historical step sequence text, each step text is generated step by step, and the target step sequence text, i.e., the solution step sequence (thought chain) of the first question, is thereby obtained. By automatically generating the solution step sequence of a question by means of text generation, a large number of machine question-answering prompt texts with solution steps can be efficiently constructed.
The steps of method 200 are described in detail below.
In an embodiment of the present disclosure, the first problem is a problem to be determined of a sequence of solving steps, i.e. a problem to be constructing a thought chain. The first question is represented in text form, i.e. the first question is represented as first question text.
In an embodiment of the present disclosure, the historical step sequence text represents a solution step sequence generated in the text generation process. The target step sequence text represents the generated solution step sequence of the first question, i.e. the thought chain of the first question.
It will be appreciated that in embodiments of the present disclosure, the historical step sequence text is a variable whose initial value is a preset value. As operations S231-S233 are executed, the value of the historical step sequence text is continuously updated, and its final value is the target step sequence text. Specifically, each time operation S231 is performed, a new step text is generated that represents the current answering step of the first question, that is, the next answering step to be performed as currently determined based on the first question text and the already generated historical step sequence text. By performing operation S232, the newly generated step text is continuously added to the historical step sequence text. The updating process of the historical step sequence text ends when the newly generated step text is the termination text, at which point the historical step sequence text is the target step sequence text.
According to some embodiments, the preset value (initial value) of the historical step sequence text may be, for example, a blank string, the string "Null", or the string "Thought chain:". The preset termination text may be, for example, "Complete", "End", or the like.
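For clarity, the update loop of operations S220-S233 can be summarized in code. The following Python sketch is illustrative only: the function generate_step stands in for whichever step-generation method an embodiment uses in operation S231, and the initial value, the termination text, and the max_updates safeguard are assumptions rather than part of the disclosure.

```python
# Illustrative sketch of operations S210-S233; not the claimed implementation.
TERMINATION_TEXT = "End"            # assumed preset termination text
INITIAL_HISTORY = "Thought chain:"  # assumed preset initial value

def generate_step(question_text: str, history_text: str) -> str:
    """Placeholder for operation S231 (e.g., a call to the text generation model)."""
    raise NotImplementedError

def generate_target_step_sequence(question_text: str, max_updates: int = 16) -> str:
    history_text = INITIAL_HISTORY                              # operation S220
    for _ in range(max_updates):                                # operation S230
        step_text = generate_step(question_text, history_text)  # operation S231
        if step_text == TERMINATION_TEXT:                       # operation S233
            return history_text  # current history is the target step sequence text
        history_text = history_text + " -> " + step_text        # operation S232: splice
    return history_text
```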
There are various methods for generating the current step text with respect to operation S231.
According to some embodiments, the first question text and the current historical step sequence text may be entered into a text generation model to obtain the current step text output by the text generation model.
According to the embodiment, the text generation model is utilized to generate the step text, so that the efficiency and generalization of text generation can be improved.
According to some embodiments, the text generation model may be a large language model, which generally includes an N-layer Transformer network with an encoder (Encoder) and a decoder (Decoder). The large language model is obtained by pre-training (pre-training) with a large amount of natural language data. Pre-training gives the large language model certain prior knowledge and capabilities, thereby improving its performance on a variety of tasks.
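As one concrete but non-limiting illustration of how such a model could be invoked in operation S231, the snippet below assumes the Hugging Face transformers library and a generic encoder-decoder checkpoint; the checkpoint name and the prompt wording are placeholders, not part of the disclosure.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and an
# encoder-decoder checkpoint; "some-pretrained-seq2seq-model" is a placeholder.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("some-pretrained-seq2seq-model")
model = AutoModelForSeq2SeqLM.from_pretrained("some-pretrained-seq2seq-model")

def generate_step(question_text: str, history_text: str) -> str:
    # The prompt format is an assumption; any encoding of (question, history) works.
    prompt = f"Question: {question_text}\nSteps so far: {history_text}\nNext step:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```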
In other embodiments, the text generation model may also be other deep neural network models, such as Seq2Seq, etc.
According to some embodiments, the text generation model may be derived by training a pre-trained first language model through a reinforcement learning strategy. The training samples for training are question-answer text pairs, which include sample question text representing a sample question and sample answer text representing an answer to the sample question. The first language model may be, for example, a large language model. The text generation model may be trained, for example, by the training method 300 of the text generation model of an embodiment of the present disclosure.
According to the above embodiment, the text generation model is further trained on the basis of the pre-trained first language model by using question-answer text pairs, so that the text generation model has not only the language understanding capability of the first language model but also the capability of accurately constructing the thought chain of a question. In addition, because the text generation model is trained with the reinforcement learning strategy on a plurality of training samples, its behavior has global consistency. Thus, the text generation model is able to generate a correct and consistent chain of thought for new tasks (e.g., mathematical calculations, physical calculations, etc.). The generated thought chain may, as a prompt text (Prompt), guide a large language model (e.g., the third language model below) to reason step by step and give an answer to the question to be answered, thereby improving the accuracy of the answer.
According to further embodiments, the text generation model may be a pre-trained second language model. The second language model may be, for example, a large language model. According to the embodiment, the pre-trained large language model is directly adopted as the text generation model, and additional training steps are not required to be executed to adjust parameters of the large language model, so that the consumption of computing resources and time caused by model training is avoided, and the method has strong practicability.
According to some embodiments, the second language model may be the same as the first language model above.
It can be appreciated that in the above embodiment, since the text generation model directly adopts the pre-trained second language model, no question-answer text pairs are used to adjust the parameters of the second language model in a targeted manner, so the accuracy of the thought chain generated by the text generation model may be lower. In this case, in order to ensure the accuracy of the generated thought chain, a preset executor may be used to execute the answering step corresponding to each step text, and the generated thought chain may be verified according to the execution result of the last answering step. In this way, the accuracy of the generated thought chain can be ensured while improving efficiency and practicability.
Specifically, each update of the history step sequence text further includes operation S234. In operation S234, a current answering step represented by the current step text is performed to obtain a current execution result text of the current answering step.
In correspondence with the above-described operation S234, operation S233 further includes: in response to the current step text being the termination text and the current execution result text being the first answer text representing an answer to the first question, determining the current historical step sequence text as the target step sequence text.
It can be appreciated that the above embodiment needs to obtain the first answer text corresponding to the first question text, and generates the target step sequence text (i.e., the thought chain) of the first question based on the first question text and the first answer text. When the current step text is the termination text, the updating process of the historical step sequence text ends; correspondingly, the current step text indicates the last answering step of the first question, and the execution result of that answering step is the predicted answer to the first question. If the execution result text corresponding to the last answering step is the same as the first answer text, the predicted answer to the first question is the same as the correct answer and the generated historical step sequence text can correctly answer the first question; the generated historical step sequence text is therefore considered correct and is taken as the target step sequence text. Otherwise, if the execution result text corresponding to the last answering step is different from the first answer text, the generated historical step sequence text is considered wrong, and it is discarded and not used as the target step sequence text.
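This accept-or-discard logic can be sketched as follows. In the sketch, execute_step is a placeholder for the preset executor of operation S234, and the termination text "End" is an assumed value; both are illustrative rather than prescribed.

```python
# Sketch of the verification variant of operation S233 (assumptions noted above).
TERMINATION_TEXT = "End"  # assumed preset termination text

def execute_step(step_text: str) -> str:
    """Placeholder executor: returns the execution result text of one answering step."""
    raise NotImplementedError

def verified_update(history_text: str, step_text: str, first_answer_text: str):
    result_text = execute_step(step_text)                  # operation S234
    if step_text == TERMINATION_TEXT:
        if result_text == first_answer_text:               # prediction matches the answer
            return "accept", history_text                  # target step sequence text
        return "discard", None                             # generated chain is wrong
    return "continue", history_text + " -> " + step_text   # operation S232
```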
According to some embodiments, operation S234 may be implemented using a preset executor. The executor may be, for example, a large language model or a manually designed solving model.
According to some embodiments, operation S234 may include operations S2341-S2343.
In operation S2341, the knowledge domain corresponding to the first question is identified.
In operation S2342, an executor for performing question solving steps in that knowledge domain is acquired.
In operation S2343, the current step text is input into the executor to obtain the current execution result text output by the executor.
According to the above-described embodiments, the solving step of the first question is performed using the executor corresponding to the knowledge domain to which the first question belongs, which can improve the accuracy of the execution result.
According to some embodiments, in operation S2341, the first question text may be input into a trained classification model to obtain the knowledge domain, output by the classification model, to which the first question belongs.
The classification model may be implemented as a neural network. In some embodiments, the classification model may be a large language model, such as the first language model, the second language model, and the like, above.
The knowledge domain includes, for example, a general knowledge domain, a mathematical knowledge domain, a physical knowledge domain, a medical knowledge domain, and the like.
According to some embodiments, executors corresponding to different knowledge domains may be preset, and the correspondence between knowledge domains and executors may be stored. Accordingly, in operation S2342, the corresponding executor may be found from this correspondence based on the knowledge domain of the first question. It will be appreciated that the executors corresponding to different knowledge domains are typically different. For example, the executor in the general knowledge domain may be a large language model. A large language model lacks solving capability in technical knowledge domains such as mathematics and physics, so a solving model for each such technical knowledge domain may be manually designed according to the characteristics of that domain and used as its executor.
According to some embodiments, the executor may acquire external knowledge by calling an external interface and generate a current execution result text based on the acquired external knowledge in operation S2343.
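A simple way to realize operations S2341-S2343 is a lookup table from knowledge domains to executors, as in the sketch below. The domain names, the classify_domain placeholder, and the individual executor functions are all assumptions introduced for illustration.

```python
# Illustrative dispatch for operations S2341-S2343 (all names are assumptions).
def classify_domain(question_text: str) -> str:
    """Placeholder for the trained classification model of operation S2341."""
    raise NotImplementedError

def llm_executor(step_text: str) -> str:
    """Placeholder: a large language model executes a general-knowledge step."""
    raise NotImplementedError

def math_executor(step_text: str) -> str:
    """Placeholder: a manually designed solving model executes a mathematical step."""
    raise NotImplementedError

# Stored correspondence between knowledge domains and executors (operation S2342).
EXECUTORS = {
    "general": llm_executor,
    "math": math_executor,
}

def execute_current_step(question_text: str, step_text: str) -> str:
    domain = classify_domain(question_text)          # operation S2341
    executor = EXECUTORS.get(domain, llm_executor)   # operation S2342 (LLM as fallback)
    return executor(step_text)                       # operation S2343
```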
According to some embodiments, in operation S231, a question-answer step template may be utilized to generate a current step text. For example, a plurality of question-answering step templates may be preset. Each question-answering step template defines a correspondence from question text and step sequence text to step text. Matching the first question text and the current historical step sequence text with a plurality of preset question-answering step templates to obtain target question-answering step templates corresponding to the first question text and the current historical step sequence text. Further, based on the target question-answering step template, a current step text is determined.
According to some embodiments, in operation S231, a current step text may be determined using a preset step database. For example, the first question text and the current historical step sequence text may be matched with a plurality of step texts in the step database, and the step text with the highest matching degree may be used as the current step text.
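The template-matching and step-database variants of operation S231 described in the two preceding paragraphs both reduce to selecting the stored entry that best matches the first question text and the current historical step sequence text. The sketch below illustrates the step-database variant; the database contents (taken from the running example) and the standard-library similarity score are assumptions.

```python
# Retrieval-based variant of operation S231 (database contents and the
# similarity measure are assumptions for illustration).
from difflib import SequenceMatcher

STEP_DATABASE = [
    "find Xiaoming's age",
    "find Xiaohua's age",
    "compare the two ages",
    "describe the comparison result",
    "End",
]

def match_current_step(question_text: str, history_text: str) -> str:
    query = question_text + " " + history_text
    # The step text with the highest matching degree is used as the current step text.
    return max(STEP_DATABASE,
               key=lambda step: SequenceMatcher(None, query, step).ratio())
```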
In operations S231-S233, the text generation model continuously generates the step text of the next solving step according to the current state (including the first question text and the current historical step sequence text), and updates the current state (i.e., updates the historical step sequence text) according to the generated step text until the generated step text is the termination text. The historical step sequence text at that point is the target step sequence text of the first question, namely the thought chain of the first question.
The following table illustrates an exemplary text generation process according to an embodiment of the present disclosure:
according to some embodiments, the method 200 further comprises operations S240-S260.
In operation S240, each solution step in the target step sequence text is separately executed to obtain an execution result text of each solution step.
In operation S250, a first answer text is generated based on the target step sequence text and the execution result text of each solution step. Wherein the first answer text represents the answering process of the first question.
In operation S260, an output of the third language model is optimized based on the first question text and the first answer text.
According to the embodiment, the output of the third language model is optimized by using the first question text and the first answer text, so that the third language model can learn the answer process of the first question, and the accuracy of answers of other questions output by the third language model is improved.
According to some embodiments, operation S240 may include operations S2341-S2343 above. That is, in operation S240, each solution step in the target step sequence text is executed by a preset executor, respectively, resulting in an execution result text of each solution step.
According to some embodiments, in operation S250, the step text of each solving step may be combined with its execution result text to generate the first answer text. For example, the target step sequence text of the first question text "Who is older, Xiaoming or Xiaohua?" is "Thought chain: find Xiaoming's age → find Xiaohua's age → compare the two ages → describe the comparison result", and the corresponding execution result texts are "3", "5", "5 is greater than 3", and "Xiaohua is older", respectively. The generated first answer text may be, for example, "Xiaoming is 3 years old, Xiaohua is 5 years old, 5 is greater than 3, and thus Xiaohua is older."
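As a concrete illustration of operation S250, the step texts and their execution result texts can be concatenated pairwise into the first answer text. The sketch below reuses the example above; the joining template is an assumption, and any readable combination would serve.

```python
# Sketch of operation S250: combine each solving step with its execution result.
def compose_answer_text(step_texts, result_texts):
    parts = [f"{step}: {result}" for step, result in zip(step_texts, result_texts)]
    return "; ".join(parts) + "."

steps = ["find Xiaoming's age", "find Xiaohua's age",
         "compare the two ages", "describe the comparison result"]
results = ["3 years old", "5 years old", "5 is greater than 3", "Xiaohua is older"]
print(compose_answer_text(steps, results))
# -> "find Xiaoming's age: 3 years old; find Xiaohua's age: 5 years old; ..."
```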
According to some embodiments, operation S260 may include operations S261 and S262.
In operation S261, a second question text is acquired, wherein the second question text represents a second question to be solved.
The first question text, the first answer text, and the second question text are input into the third language model to obtain a second answer text output from the third language model in operation S262. Wherein the second answer text represents an answer process of the second question.
According to this embodiment, the first question text and the first answer text are taken as a reference example of machine question answering and input into the third language model together with the second question text, so that the third language model can learn the answering process of the first question and answer the second question with reference to the answering process of the first question, thereby improving the accuracy of the answer to the second question.
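Operations S261-S262 therefore amount to few-shot prompting: the first question and its answering process serve as the in-context example, followed by the second question. The prompt layout and the generate_answer placeholder below are assumptions introduced for illustration, not the claimed format.

```python
# Sketch of operations S261-S262 (prompt layout and model call are assumptions).
def generate_answer(prompt: str) -> str:
    """Placeholder for the third language model."""
    raise NotImplementedError

def answer_second_question(first_question: str, first_answer: str,
                           second_question: str) -> str:
    prompt = (
        f"Reference question: {first_question}\n"
        f"Reference answering process: {first_answer}\n"
        f"Question: {second_question}\n"
        f"Answering process:"
    )
    return generate_answer(prompt)   # the second answer text (operation S262)
```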
According to some embodiments, the second question text may be user specified. For example, the user may enter the second question text via a client device (e.g., client devices 101-106 shown in FIG. 1). In the case where the subject of execution of the method 200 is a server, the client device sends the second question text entered by the user to the server (e.g., the server 120 shown in fig. 1). In the case where the subject of execution of method 200 is a client device, the second question text is saved locally at the client device.
It should be noted that, according to operations S210-S250, a plurality of first question texts may be processed, so as to obtain a first answer text for each of the plurality of first question texts. The plurality of first question texts may belong to different knowledge domains. Accordingly, according to some embodiments, in operation S262, the knowledge domain to which the second question belongs may be identified, and a first question text of that knowledge domain and its first answer text may be input into the third language model together with the second question text.
The third language model may be a large language model. According to some embodiments, the third language model may be the same as the trained text generation model (i.e., the text generation model trained on the pre-trained first language model) used to generate the current step text in operation S231. Therefore, the third language model has the capability of generating the answering step and answering the questions, the answering step generated by the third language model can be better understood and learned, and the efficiency, accuracy and consistency of question and answer are improved.
According to some embodiments, the third language model may be the pre-trained large language model itself, such as the second language model above.
According to some embodiments, the first language model, the second language model, and the third language model may be the same large language model.
According to an embodiment of the present disclosure, a training method of a text generation model is also provided. The training method trains a text generation model based on the reinforcement learning strategy, and the trained text generation model can be used to perform operation S231 above to generate step texts corresponding to respective solution steps of the problem, thereby generating a thought chain of the problem.
Fig. 3 illustrates a flowchart of a training method 300 of a text generation model according to an embodiment of the present disclosure. The subject of the method 300 is typically a server. In some embodiments, the execution subject of method 300 may also be a client device, which generally requires a higher hardware configuration and computing power for the client device. As shown in FIG. 3, method 300 includes operations S310-S360.
In operation S310, a question-answer text pair is acquired. Wherein the question-answer text pair includes a sample question text representing a sample question and a sample answer text representing an answer to the sample question.
In operation S320, the history step sequence text is initialized to a preset value.
In operation S330, the following operations S331 to S333 are repeatedly performed to generate a target step sequence text. Wherein the target step sequence text represents a solution step sequence of the sample question.
In operation S331, the sample question text and the current history step sequence text are input into the text generation model to obtain the current step text output by the text generation model. The current step text represents the current solving step of the sample question.
In operation S332, in response to the current step text not being the preset termination text, the current historical step sequence text is spliced with the current step text to obtain an updated historical step sequence text.
In operation S333, in response to the current step text being the termination text, the current history step sequence text is determined as the target step sequence text.
In operation S340, a predicted answer text of the sample question is generated based on the target step sequence text.
In operation S350, a reward for the text generation model is determined based on the predicted answer text and the sample answer text.
In operation S360, parameters of the text generation model are adjusted based on the rewards.
According to an embodiment of the present disclosure, a text generation model is trained with a reinforcement learning strategy based on episodic tasks (Episodic Tasks). The text generation model corresponds to the policy (Policy) adopted by the agent (Agent) in the reinforcement learning strategy. The text generation model takes a sample question text and the current historical step sequence text as the input state (State), and outputs the current step text as the next action (Action) taken in that state. When the step text output by the text generation model is the termination text, the episode ends. The generated target step sequence text (i.e., the thought chain) is the termination state at the end of the episode. A predicted answer text is generated based on the target step sequence text, the reward (Reward) of the text generation model is evaluated based on the predicted answer text and the sample answer text, and the parameters of the text generation model are adjusted accordingly, so that the text generation model always evolves in the direction of leading the thought chain to the correct answer and is thereby able to generate thought chains accurately.
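The episode structure of method 300 can be sketched as follows. Every function in the sketch is a placeholder standing in for the corresponding component of the method (the text generation model, the executor, the reward of operation S350, and the parameter update of operation S360, e.g., a policy-gradient step); the concrete model and optimization algorithm are assumptions, not limitations.

```python
# Illustrative rollout of one training episode of method 300 (operations S310-S360).
def model_generate_step(question: str, history: str) -> str:
    raise NotImplementedError   # the text generation model (operation S331)

def execute_step(step_text: str) -> str:
    raise NotImplementedError   # the executor (operation S341)

def compute_reward(predicted_answer: str, sample_answer: str) -> float:
    raise NotImplementedError   # reward from answer similarity (operation S350)

def update_parameters(reward: float) -> None:
    raise NotImplementedError   # e.g. a policy-gradient update (operation S360)

def run_episode(sample_question: str, sample_answer: str,
                termination_text: str = "End", max_steps: int = 16):
    history = "Thought chain:"                                   # operation S320
    steps = []
    for _ in range(max_steps):                                   # operation S330
        step = model_generate_step(sample_question, history)     # operation S331
        if step == termination_text:                             # operation S333
            break
        history = history + " -> " + step                        # operation S332
        steps.append(step)
    result_texts = [execute_step(s) for s in steps]              # operation S341
    predicted_answer = result_texts[-1] if result_texts else ""  # operation S342
    reward = compute_reward(predicted_answer, sample_answer)     # operation S350
    update_parameters(reward)                                    # operation S360
    return history, reward
```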
Since the agent's strategy (i.e., the text generation model) is trained through reinforcement learning on numerous training samples (i.e., question-answer text pairs), its behavior tends to have global consistency. Thus, the text generation model trained in accordance with embodiments of the present disclosure is capable of automatically generating a correct and consistent thought chain step by step for any new question, thereby enabling efficient generation of large-scale, high-quality thought chain data. The generated thought chain can be used as a prompt text for machine question answering to guide a large language model (such as the third language model above) to reason step by step and give an answer to the question to be answered, thereby improving the accuracy of the answer.
According to some embodiments, the initial value of the text generation model is a pre-trained language model, and the language model is a large language model. According to this embodiment, the text generation model is trained on the basis of a pre-trained large language model, so that the text generation model has the strong language understanding capability of the large language model and can quickly learn the capability of accurately generating the thought chain of a question.
According to some embodiments, operation S340 may include operations S341 and S342.
In operation S341, each solution step in the target step sequence text is executed separately to obtain an execution result text of each solution step.
In operation S342, the execution result text of the last answer step in the target step sequence text is determined as the predicted answer text.
According to this embodiment, the predicted answer text can be generated quickly and automatically.
According to some embodiments, operation S341 may include operations S3411-S3413.
In operation S3411, the knowledge domain corresponding to the sample question is identified.
In operation S3412, an executor for performing question solving steps in that knowledge domain is acquired.
In operation S3413, the step text corresponding to the solving step is input into the executor to obtain the execution result text output by the executor.
According to the above embodiment, executing the solving steps of the sample question with the executor corresponding to the knowledge domain to which the sample question belongs can improve the accuracy of the execution results.
Specific embodiments of S3411-S3413 may refer to the relevant descriptions of operations S2341-S2343 above, and are not repeated here.
According to some embodiments, in operation S350, the reward of the text generation model may be determined based on the similarity between the predicted answer text and the sample answer text. The reward is positively correlated with the similarity: the greater the similarity between the predicted answer text and the sample answer text, the greater the reward; the smaller the similarity, the smaller the reward. This generates a feedback signal for the text generation model, guiding it to evolve in a direction that generates the correct predicted answer text.
The similarity between the predicted answer text and the sample answer text may be a literal similarity between the two, such as the edit distance or the maximum number of consecutive matching characters, or the cosine distance between their embedding vectors. The embedding vectors of the predicted answer text and the sample answer text may be obtained, for example, by a trained text representation model; that is, the predicted answer text or the sample answer text is input into the text representation model to obtain its embedding vector output by the text representation model.
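A minimal reward of this kind might look like the sketch below, which uses a standard-library string matcher as a stand-in for the edit-distance or embedding-based similarities mentioned above; the specific measure is an assumption, and any similarity that increases with answer agreement would fit operation S350.

```python
# Sketch of operation S350: a reward positively correlated with the similarity
# between the predicted answer text and the sample answer text.
from difflib import SequenceMatcher

def compute_reward(predicted_answer: str, sample_answer: str) -> float:
    # Literal similarity in [0, 1]; an embedding cosine similarity could be
    # substituted here without changing the rest of the training procedure.
    return SequenceMatcher(None, predicted_answer, sample_answer).ratio()
```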
In operation S360, the parameters of the text generation model are adjusted in the direction of increasing reward, thereby causing the text generation model to evolve toward leading the thought chain to the correct answer.
It is understood that operations S310-S360 may be performed in a loop a plurality of times until the text generation model training is completed when a preset termination condition is reached. The termination condition may be, for example, that the accuracy of the thought chain generated by the text generation model reaches a threshold, that the number of loops reaches a threshold, that the accuracy converges, or the like.
According to an embodiment of the present disclosure, there is also provided a text generating apparatus. Fig. 4 shows a block diagram of a text generating apparatus 400 according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes an acquisition module 410, an initialization module 420, and an update module 430.
The acquisition module 410 is configured to acquire a first question text, wherein the first question text represents a first question of a sequence of solution steps to be determined.
The initialization module 420 is configured to initialize the historical step sequence text to a preset value.
The updating module 430 is configured to update the historical step sequence text at least once based on the first question text to obtain a target step sequence text, wherein the target step sequence text represents a solution step sequence of the first question, the target step sequence text includes at least one step text, and each step text in the at least one step text represents one solution step of the first question.
The update module 430 includes a generation unit 431, an update unit 432, and a determination unit 433.
The generating unit 431 is configured to generate a current step text based on the first question text and a current historical step sequence text, wherein the current step text represents a current solving step of the first question;
The updating unit 432 is configured to splice the current historical step sequence text with the current step text to obtain an updated historical step sequence text in response to the current step text not being a preset termination text; and
The determining unit 433 is configured to determine the current historical step sequence text as the target step sequence text in response to the current step text being the termination text.
According to the embodiments of the present disclosure, each step text is generated step by step based on the first question text and the previously generated historical step sequence text, and the target step sequence text, i.e., the solution step sequence (chain of thought) of the first question, is thereby obtained. By automatically generating the solution step sequence of a question in this text generation manner, a large number of machine question answering prompt texts with solution steps can be constructed efficiently.
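For illustration only, this update loop can be sketched as follows; generate_step is a hypothetical wrapper around the text generation model and "<END>" is an assumed termination text, neither of which is prescribed by the present disclosure.

END_TEXT = "<END>"  # assumed preset termination text

def generate_target_step_sequence(question_text: str, generate_step, max_steps: int = 32) -> str:
    history = ""  # historical step sequence text initialized to a preset value (empty here)
    for _ in range(max_steps):
        # generate the current step text from the question text and the current history
        step_text = generate_step(question_text, history)
        if step_text == END_TEXT:
            break  # the current history is the target step sequence text
        history = history + step_text + "\n"  # splice the current step onto the history
    return history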
According to some embodiments, the generating unit 431 is further configured to: input the first question text and the current historical step sequence text into a text generation model to obtain the current step text output by the text generation model.
According to some embodiments, the text generation model is obtained by training a pre-trained first language model with a reinforcement learning strategy, and the training samples are question-answer text pairs, each comprising a sample question text representing a sample question and a sample answer text representing an answer to the sample question.
According to some embodiments, the text generation model is a pre-trained second language model, and the updating module 430 further includes: an execution unit configured to execute the current answering step represented by the current step text to obtain a current execution result text of the current answering step; and wherein the determining unit 433 is further configured to: determine the current historical step sequence text as the target step sequence text in response to the current step text being the termination text and the current execution result text being a first answer text representing an answer to the first question.
According to some embodiments, the execution unit comprises: an identification subunit configured to identify a knowledge domain corresponding to the first question; an acquisition subunit configured to acquire an executor for executing problem solving steps of the knowledge domain; and an execution subunit configured to input the current step text into the executor to obtain the current execution result text output by the executor.
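For illustration only, the routing performed by these subunits can be sketched as follows; identify_domain and the executors mapping are assumptions rather than components named in the present disclosure.

from typing import Callable, Dict

def execute_step(question_text: str,
                 step_text: str,
                 identify_domain: Callable[[str], str],
                 executors: Dict[str, Callable[[str], str]]) -> str:
    domain = identify_domain(question_text)  # identify the knowledge field of the question
    executor = executors[domain]             # acquire the executor for that field
    return executor(step_text)               # step text in, execution result text out

A calculator-style executor could, for example, evaluate arithmetic step texts for mathematics questions, while other knowledge fields register their own executors.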
According to some embodiments, the apparatus 400 further comprises: the execution module is configured to execute each answering step in the target step sequence text respectively so as to obtain an execution result text of each answering step; a generation module configured to generate a first answer text based on the target step sequence text and the execution result text of each answer step, wherein the first answer text represents an answer process of the first question; and an optimization module configured to optimize an output of a third language model based on the first question text and the first answer text.
According to some embodiments, the optimization module comprises: an acquisition unit configured to acquire a second question text, wherein the second question text represents a second question to be solved; and an input unit configured to input the first question text, the first answer text, and the second question text into the third language model to obtain a second answer text output by the third language model, wherein the second answer text represents an answer process of the second question.
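For illustration only, this use of the first question-answer pair as an in-context demonstration can be sketched as follows; the prompt format and call_language_model are assumptions, not an interface defined by the present disclosure.

def build_prompt(first_question: str, first_answer: str, second_question: str) -> str:
    return (
        "Question: " + first_question + "\n"
        "Answer: " + first_answer + "\n"
        "Question: " + second_question + "\n"
        "Answer:"
    )

def answer_with_demonstration(first_question, first_answer, second_question, call_language_model):
    prompt = build_prompt(first_question, first_answer, second_question)
    return call_language_model(prompt)  # second answer text: a step-by-step answer process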
According to some embodiments, the third language model is the same as a trained text generation model used to generate the current step text.
It should be appreciated that the various modules and units of the apparatus 400 shown in fig. 4 may correspond to the various steps in the method 200 described with reference to fig. 2. Thus, the operations, features and advantages described above with respect to method 200 are equally applicable to apparatus 400 and the modules and units comprising the same. For brevity, certain operations, features and advantages are not described in detail herein.
According to an embodiment of the present disclosure, there is also provided a training apparatus of a text generation model. Fig. 5 shows a block diagram of a training apparatus 500 of a text generation model according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 includes an acquisition module 510, an initialization module 520, a first generation module 530, a second generation module 540, a determination module 550, and an adjustment module 560.
The acquisition module 510 is configured to acquire a question-answer text pair, wherein the question-answer text pair includes a sample question text representing a sample question and a sample answer text representing an answer to the sample question.
The initialization module 520 is configured to initialize the historical step sequence text to a preset value.
The first generation module 530 is configured to repeatedly perform the following operations to generate a target step sequence text, wherein the target step sequence text represents a solution step sequence of the sample question:
Inputting the sample question text and the current historical step sequence text into the text generation model to obtain a current step text output by the text generation model, wherein the current step text represents a current answering step of the sample question;
In response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or
Determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text.
The second generation module 540 is configured to generate predicted answer text for the sample question based on the target step sequence text.
The determination module 550 is configured to determine rewards for the text generation model based on the predicted answer text and the sample answer text.
The adjustment module 560 is configured to adjust parameters of the text generation model based on the rewards.
According to an embodiment of the present disclosure, the text generation model is trained with a reinforcement learning strategy based on episodic tasks. The text generation model corresponds to the policy (Policy) adopted by the agent (Agent) in the reinforcement learning strategy. The text generation model takes the sample question text and the current historical step sequence text as the input state (State), and outputs the current step text as the next action (Action) taken in that state. When the step text output by the text generation model is the termination text, the episode ends. The generated target step sequence text (i.e., the chain of thought) is the termination state at the end of the episode. A predicted answer text is generated based on the target step sequence text, a reward (Reward) of the text generation model is evaluated based on the predicted answer text and the sample answer text, and the parameters of the text generation model are adjusted accordingly, so that the text generation model always evolves in the direction of leading the chain of thought to a correct answer and can thus generate the chain of thought accurately.
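For illustration only, one such training episode can be sketched as follows; policy_generate, execute_steps, compute_reward and update_policy are hypothetical placeholders, and update_policy stands in for whatever policy-gradient style update is used to adjust the model parameters toward a higher reward.

END_TEXT = "<END>"  # assumed termination text

def train_one_episode(sample_question, sample_answer, policy_generate,
                      execute_steps, compute_reward, update_policy, max_steps=32):
    history = ""     # state: (sample question text, current historical step sequence text)
    trajectory = []  # (question, history, step) triples collected during the episode
    for _ in range(max_steps):
        step_text = policy_generate(sample_question, history)  # action taken in the current state
        trajectory.append((sample_question, history, step_text))
        if step_text == END_TEXT:
            break                                               # the episode ends
        history = history + step_text + "\n"
    predicted_answer = execute_steps(history)                   # predicted answer text
    episode_reward = compute_reward(predicted_answer, sample_answer)  # positively correlated with similarity
    update_policy(trajectory, episode_reward)                   # adjust parameters toward a higher reward
    return episode_reward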
Since the agent's policy (i.e., the text generation model) is trained through reinforcement learning on numerous training samples (i.e., question-answer text pairs), its behavior tends to be globally consistent. Thus, a text generation model trained in accordance with embodiments of the present disclosure can automatically generate a correct and consistent chain of thought, step by step, for any new question, and can thereby produce large-scale, high-quality chain-of-thought data. The generated chain of thought can be used as a prompt text for machine question answering to guide a large language model (such as the third language model above) to reason step by step and give an answer to the question to be answered, thereby improving the accuracy of the answer.
According to some embodiments, the initial value of the text generation model is a pre-trained language model.
According to some embodiments, the second generating module 540 includes: an execution unit configured to execute each answering step in the target step sequence text respectively to obtain an execution result text of each answering step; and a determining unit configured to determine an execution result text of a last answering step in the target step sequence text as the predicted answer text.
According to some embodiments, the execution unit comprises: an identification subunit configured to identify a knowledge domain corresponding to the sample question; an acquisition subunit configured to acquire an executor for executing problem solving steps of the knowledge domain; and an execution subunit configured to input the step text corresponding to each answering step into the executor to obtain the execution result text output by the executor.
According to some embodiments, the determining module 550 is further configured to: determine the reward of the text generation model based on the similarity between the predicted answer text and the sample answer text, wherein the reward is positively correlated with the similarity.
It should be appreciated that the various modules and units of the apparatus 500 shown in fig. 5 may correspond to the various steps in the method 300 described with reference to fig. 3. Thus, the operations, features and advantages described above with respect to method 300 apply equally to apparatus 500 and the modules and units comprising it. For brevity, certain operations, features and advantages are not described in detail herein.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module.
It should also be appreciated that various techniques may be described herein in the general context of software and hardware elements or program modules. The various units described above with respect to fig. 4 and 5 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the units may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these units may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the modules 410-560 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip comprising one or more components of a processor (e.g., a central processing unit (CPU), a microcontroller, a microprocessor, a digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
There is also provided, in accordance with an embodiment of the present disclosure, an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the text generation method and/or the training method of the text generation model of the disclosed embodiments.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text generation method and/or training method of the text generation model of the embodiments of the present disclosure.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising computer program instructions which, when executed by a processor, implement the text generation method and/or the training method of the text generation model of the embodiments of the present disclosure.
Referring to fig. 6, a block diagram of an electronic device 600 that may serve as a server or a client of the present disclosure will now be described; the electronic device is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 can also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
A number of components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as method 200 or method 300. For example, in some embodiments, the methods 200 and 300 may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of method 200 and method 300 described above may be performed. Alternatively, in other embodiments, computing unit 601 may be configured to perform method 200 or method 300 in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatuses are merely illustrative embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It is to be understood that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (21)

1. A training method of a text generation model, comprising:
Acquiring a question-answer text pair, wherein the question-answer text pair comprises a sample question text representing a sample question and a sample answer text representing an answer of the sample question;
Initializing a historical step sequence text to a preset value;
repeatedly performing the following operations to generate a target step sequence text, wherein the target step sequence text represents a solution step sequence of the sample question:
Inputting the sample question text and the current historical step sequence text into the text generation model to obtain a current step text output by the text generation model, wherein the current step text represents a current answering step of the sample question;
In response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or
Determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text;
generating a predicted answer text of the sample question based on the target step sequence text;
Determining rewards of the text generation model based on the predicted answer text and the sample answer text; and
Adjusting parameters of the text generation model based on the rewards.
2. The method of claim 1, wherein the initial value of the text generation model is a pre-trained language model.
3. The method of claim 1 or 2, wherein the generating predicted-answer text for the sample question based on the target-step-sequence text comprises:
executing each answering step in the target step sequence text respectively to obtain an execution result text of each answering step; and
Determining the execution result text of the last answering step in the target step sequence text as the predicted answer text.
4. The method of claim 3, wherein the separately executing each solution step in the target step sequence text to obtain execution result text for each solution step comprises:
Identifying a knowledge domain corresponding to the sample problem;
acquiring an executor for executing a problem solving step in the knowledge field; and
Inputting a step text corresponding to the answering step into the executor to obtain the execution result text output by the executor.
5. The method of claim 1, wherein the determining a reward for the text generation model based on the predicted answer text and the sample answer text comprises:
determining rewards of the text generation model based on the similarity of the predicted answer text and the sample answer text, wherein the rewards are positively correlated with the similarity.
6. A text generation method, comprising:
Acquiring a first question text, wherein the first question text represents a first question for which a solving step sequence is to be determined;
Initializing a historical step sequence text to a preset value; and
Updating the historical step sequence text at least once based on the first question text to obtain a target step sequence text, wherein the target step sequence text represents a solving step sequence of the first question, the target step sequence text comprises at least one step text, and each step text in the at least one step text represents one solving step of the first question;
wherein each of the at least one update comprises:
inputting the first question text and the current historical step sequence text into a text generation model to obtain a current step text output by the text generation model, wherein the text generation model is trained according to the method of any one of claims 1-5, and the current step text represents a current solving step of the first question;
In response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or
Determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text.
7. The method of claim 6, further comprising:
executing each answering step in the target step sequence text respectively to obtain an execution result text of each answering step;
generating a first answer text based on the target step sequence text and the execution result text of each answer step, wherein the first answer text represents an answer process of the first problem; and
Optimizing the output of a third language model based on the first question text and the first answer text.
8. The method of claim 7, wherein the optimizing the output of a third language model based on the first question text and the first answer text comprises:
acquiring a second question text, wherein the second question text represents a second question to be solved; and
Inputting the first question text, the first answer text, and the second question text into the third language model to obtain a second answer text output by the third language model, wherein the second answer text represents an answer process of the second question.
9. The method of claim 7 or 8, wherein the third language model is the same as a trained text generation model used to generate the current step text.
10. A training device for a text generation model, comprising:
An acquisition module configured to acquire a question-answer text pair, wherein the question-answer text pair includes a sample question text representing a sample question and a sample answer text representing an answer to the sample question;
The initialization module is configured to initialize the historical step sequence text to a preset value;
a first generation module configured to repeatedly perform the following operations to generate a target step sequence text, wherein the target step sequence text represents a solution step sequence of the sample question:
Inputting the sample question text and the current historical step sequence text into the text generation model to obtain a current step text output by the text generation model, wherein the current step text represents a current answering step of the sample question;
In response to the current step text not being a preset termination text, splicing the current historical step sequence text with the current step text to obtain an updated historical step sequence text; or
Determining the current historical step sequence text as the target step sequence text in response to the current step text being the termination text;
a second generation module configured to generate a predicted answer text for the sample question based on the target step sequence text;
A determining module configured to determine rewards of the text generation model based on the predicted answer text and the sample answer text; and
An adjustment module configured to adjust parameters of the text generation model based on the rewards.
11. The apparatus of claim 10, wherein the initial value of the text generation model is a pre-trained language model.
12. The apparatus of claim 10 or 11, wherein the second generation module comprises:
An execution unit configured to execute each answering step in the target step sequence text respectively to obtain an execution result text of each answering step; and
a determining unit configured to determine the execution result text of the last answering step in the target step sequence text as the predicted answer text.
13. The apparatus of claim 12, wherein the execution unit comprises:
An identification subunit configured to identify a knowledge domain corresponding to the sample problem;
an acquisition subunit configured to acquire an executor for executing a problem solving step of the knowledge domain; and
an execution subunit configured to input the step text corresponding to the answering step into the executor to obtain the execution result text output by the executor.
14. The apparatus of claim 10, wherein the determination module is further configured to:
determine rewards of the text generation model based on the similarity between the predicted answer text and the sample answer text, wherein the rewards are positively correlated with the similarity.
15. A text generation apparatus comprising:
an acquisition module configured to acquire a first question text, wherein the first question text represents a first question for which a solution step sequence is to be determined;
The initialization module is configured to initialize the historical step sequence text to a preset value; and
An updating module configured to update the historical step sequence text at least once based on the first question text to obtain a target step sequence text, wherein the target step sequence text represents a solution step sequence of the first question, the target step sequence text comprises at least one step text, and each step text in the at least one step text represents one solution step of the first question;
Wherein the update module comprises:
A generating unit configured to input the first question text and a current historical step sequence text into a text generating model to obtain a current step text output by the text generating model, wherein the text generating model is trained according to the apparatus of any one of claims 10-14, and the current step text represents a current solving step of the first question;
an updating unit configured to splice the current historical step sequence text with the current step text in response to the current step text not being a preset termination text, so as to obtain an updated historical step sequence text; and
a determining unit configured to determine the current historical step sequence text as the target step sequence text in response to the current step text being the termination text.
16. The apparatus of claim 15, further comprising:
The execution module is configured to execute each answering step in the target step sequence text respectively so as to obtain an execution result text of each answering step;
a generation module configured to generate a first answer text based on the target step sequence text and the execution result text of each answer step, wherein the first answer text represents an answer process of the first question; and
an optimization module configured to optimize an output of a third language model based on the first question text and the first answer text.
17. The apparatus of claim 16, wherein the optimization module comprises:
an acquisition unit configured to acquire a second question text, wherein the second question text represents a second question to be solved; and
an input unit configured to input the first question text, the first answer text, and the second question text into the third language model to obtain a second answer text output by the third language model, wherein the second answer text represents an answer process of the second question.
18. The apparatus of claim 16 or 17, wherein the third language model is the same as a trained text generation model used to generate the current step text.
19. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1-9.
CN202310797048.9A 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model Active CN116842155B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310797048.9A CN116842155B (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model
CN202410572183.8A CN118312598A (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310797048.9A CN116842155B (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202410572183.8A Division CN118312598A (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model

Publications (2)

Publication Number Publication Date
CN116842155A CN116842155A (en) 2023-10-03
CN116842155B true CN116842155B (en) 2024-06-21

Family

ID=88164665

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310797048.9A Active CN116842155B (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model
CN202410572183.8A Pending CN118312598A (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202410572183.8A Pending CN118312598A (en) 2023-06-30 2023-06-30 Text generation method, training method and device of text generation model

Country Status (1)

Country Link
CN (2) CN116842155B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273868A (en) * 2023-11-20 2023-12-22 浙江口碑网络技术有限公司 Shop recommendation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035629A (en) * 2020-08-17 2020-12-04 北京理工大学 Method for implementing question-answer model based on symbolized knowledge and neural network
CN112183051A (en) * 2020-09-02 2021-01-05 北京源和汇升科技中心(有限合伙) Intelligent voice follow-up method, system, computer equipment, storage medium and program product

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101916174B1 (en) * 2016-08-22 2018-11-07 최준영 Method and apparatus for processing language based on machine learning
US20190385711A1 (en) * 2018-06-19 2019-12-19 Ellipsis Health, Inc. Systems and methods for mental health assessment
JP7224990B2 (en) * 2019-03-25 2023-02-20 Tis株式会社 QA generation device, QA generation method and program
CN111639163A (en) * 2020-04-29 2020-09-08 深圳壹账通智能科技有限公司 Problem generation model training method, problem generation method and related equipment
CN113672708A (en) * 2020-05-13 2021-11-19 武汉Tcl集团工业研究院有限公司 Language model training method, question and answer pair generation method, device and equipment
CN115809318A (en) * 2021-09-14 2023-03-17 北京猿力未来科技有限公司 Question answering model training method and device and question answering method and device
CN116302240A (en) * 2022-12-14 2023-06-23 北京有竹居网络技术有限公司 Information display method and device, computer equipment and storage medium
CN115862031B (en) * 2022-12-30 2024-02-20 北京百度网讯科技有限公司 Text processing method, neural network training method, device and equipment
CN116028605B (en) * 2023-01-03 2023-11-14 北京百度网讯科技有限公司 Logic expression generation method, model training method, device and medium
CN116049370A (en) * 2023-02-01 2023-05-02 北京百度网讯科技有限公司 Information query method and training method and device of information generation model

Also Published As

Publication number Publication date
CN118312598A (en) 2024-07-09
CN116842155A (en) 2023-10-03

Similar Documents

Publication Publication Date Title
WO2018224471A1 (en) Selecting actions using multi-modal inputs
CN114612749B (en) Neural network model training method and device, electronic device and medium
CN112579909A (en) Object recommendation method and device, computer equipment and medium
CN113642740B (en) Model training method and device, electronic equipment and medium
CN116842155B (en) Text generation method, training method and device of text generation model
CN112559721B (en) Method, device, equipment, medium and program product for adjusting man-machine dialogue system
CN116028605B (en) Logic expression generation method, model training method, device and medium
CN112784985A (en) Training method and device of neural network model, and image recognition method and device
CN116821684A (en) Training method, device, equipment and medium for large language model
CN114005452A (en) Method and device for extracting voice features, electronic equipment and storage medium
CN115862031B (en) Text processing method, neural network training method, device and equipment
CN117743542A (en) Information processing method and device based on artificial intelligence, electronic equipment and intelligent body
CN116205819B (en) Character image generation method, training method and device of deep learning model
CN115600646B (en) Language model training method, device, medium and equipment
CN115879469B (en) Text data processing method, model training method, device and medium
CN115170887B (en) Target detection model training method, target detection method and target detection device
CN114881170A (en) Training method of neural network for conversation task and conversation task processing method
CN116842156B (en) Data generation method, device, equipment and medium
CN115713071B (en) Training method for neural network for processing text and method for processing text
CN116860933B (en) Dialogue model training method, reply information generating method, device and medium
CN113284484B (en) Model training method and device, voice recognition method and voice synthesis method
CN115578451B (en) Image processing method, training method and device of image processing model
CN114821233B (en) Training method, device, equipment and medium of target detection model
CN116306862A (en) Training method, device and medium for text processing neural network
CN117709471A (en) Method, apparatus, device and medium for interpretation analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant