CN110909527A - Text processing model operation method and device, electronic equipment and storage medium - Google Patents

Text processing model operation method and device, electronic equipment and storage medium

Info

Publication number
CN110909527A
Authority
CN
China
Prior art keywords
layer
data
text
vector
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911222138.5A
Other languages
Chinese (zh)
Other versions
CN110909527B (en)
Inventor
王晓晖
李磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201911222138.5A priority Critical patent/CN110909527B/en
Publication of CN110909527A publication Critical patent/CN110909527A/en
Application granted granted Critical
Publication of CN110909527B publication Critical patent/CN110909527B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present disclosure disclose a method and apparatus for operating a text processing model, an electronic device, and a storage medium, wherein the text processing model comprises at least one encoder layer and at least one decoder layer, and the method comprises the following steps: acquiring an input text vector; inputting the text vector into at least one encoder layer for processing to form a hidden layer vector; and inputting the hidden layer vector into at least one decoder layer for processing to generate an output text vector. In the data calculation process of the encoder layer and/or the decoder layer, a combined kernel function is called to process data; the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for completing mathematical level calculation of the data, and the combined kernel function is used for completing functional level calculation of the data. The technical scheme of this embodiment can reduce the number of GPU video memory reads and writes and improve operating efficiency.

Description

Text processing model operation method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of natural language processing, in particular to a method and a device for operating a text processing model, electronic equipment and a storage medium.
Background
In the field of natural language processing, a sequence-to-sequence (seq2seq) model is generally used; it mainly includes at least one encoder and at least one decoder. The main principle of the seq2seq model is as follows: the source sentence is divided into a word sequence, the word sequence is input into the encoder, and a hidden layer vector is output; the hidden layer vector is then used as the input of the decoder, which generates one target word at each time step, each subsequent target word being generated based on the hidden layer vector and the previously output target words, and the final target word sequence forms the translated target sentence.
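Stated compactly (this formulation is standard seq2seq background rather than text from the patent), the encoder maps the source word sequence to a hidden vector and the decoder factorizes the target sentence autoregressively:

```latex
% h: hidden layer vector produced by the encoder from the source words x_1..x_n;
% each target word y_t is generated from h and the previously output words y_{<t}.
\[
  h = \mathrm{Encoder}(x_1, \dots, x_n), \qquad
  p(y_1, \dots, y_m \mid x) = \prod_{t=1}^{m} p\left(y_t \mid y_{<t},\, h\right)
\]
```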
Generally, when the seq2seq model is used for natural language processing, the amount of computation is so large that the processing time is long; for example, translating a sentence of 20 words with a Transformer model on a P4 GPU takes about 1 second. In a business scenario that often receives tens of thousands of translation requests per second, this is unacceptable both from a machine-cost and from a user-experience perspective.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide an operating method and apparatus for a text processing model, an electronic device, and a storage medium, so as to reduce the read-write frequency of a GPU video memory and improve the operating efficiency.
Additional features and advantages of the disclosed embodiments will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosed embodiments.
In a first aspect, an embodiment of the present disclosure provides a method for operating a text processing model, where the text processing model includes at least one encoder layer and at least one decoder layer, and includes:
acquiring an input text vector;
inputting the text vector into at least one encoder layer for processing to form a hidden layer vector;
inputting the hidden layer vector into at least one decoder layer for processing to generate an output text vector;
in the data calculation process of the encoder layer and/or the decoder layer, a combined kernel function is called to process data, the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for finishing mathematical level calculation of the data, and the combined kernel function is used for finishing functional level calculation of the data.
In a second aspect, an embodiment of the present disclosure further provides an apparatus for running a text processing model, where the text processing model includes at least one encoder layer and at least one decoder layer, and the apparatus includes:
an input acquisition unit for acquiring an input text vector;
the encoder layer processing unit is used for inputting the text vector into at least one encoder layer for processing so as to form a hidden layer vector;
the decoder layer processing unit is used for inputting the hidden layer vector into at least one decoder layer for processing so as to generate an output text vector;
the encoder layer processing unit calls a combined kernel function to process data in the data calculation process of the encoder layer and/or the decoder layer processing unit in the decoder layer, wherein the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for finishing mathematical level calculation of the data, and the combined kernel function is used for finishing functional level calculation of the data.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of the first aspect.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any one of the first aspect.
In the embodiments of the present disclosure, the input text vector is input into at least one encoder layer for processing to form a hidden layer vector, and the hidden layer vector is input into at least one decoder layer for processing to generate an output text vector; in the data calculation process of the encoder layer and/or the decoder layer, a combined kernel function is called to process the data, the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for completing the mathematical level calculation of the data, and the combined kernel function is used for completing the functional level calculation of the data. The embodiments of the present disclosure can greatly reduce the number of GPU video memory reads and writes and improve operating efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments of the present disclosure will be briefly described below, and it is obvious that the drawings in the following description are only a part of the embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the contents of the embodiments of the present disclosure and the drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating a method for operating a text processing model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating an example of processing data by invoking a combined kernel function according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an example of a Transformer model provided in an embodiment of the present disclosure;
FIG. 4 is a schematic internal structure diagram of an encoder layer and a decoder layer of a Transformer model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating calculation of three vectors of a self attention mechanism layer of a Transformer model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a calculation process of a self-attention mechanism layer of a Transformer model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a calculation process of a self-attention mechanism layer of a Transformer model according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a calculation process of a self-attention mechanism layer of a Transformer model according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a calculation process of a self-attention mechanism layer of a Transformer model according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a calculation process of a self-attention mechanism layer of a Transformer model according to an embodiment of the present disclosure;
FIG. 11 is a flowchart illustrating another method for operating a text processing model according to an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of an apparatus for executing a text processing model according to an embodiment of the present disclosure;
FIG. 13 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
In order to make the technical problems solved, technical solutions adopted and technical effects achieved by the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments, but not all embodiments, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.
It should be noted that the terms "system" and "network" are often used interchangeably in the embodiments of the present disclosure. Reference to "and/or" in embodiments of the present disclosure is meant to include any and all combinations of one or more of the associated listed items. The terms "first", "second", and the like in the description and claims of the present disclosure and in the drawings are used for distinguishing between different objects and not for limiting a particular order.
It should also be noted that, in the embodiments of the present disclosure, each of the following embodiments may be executed alone, or may be executed in combination with each other, and the embodiments of the present disclosure are not limited specifically.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The technical solutions of the embodiments of the present disclosure are further described by the following detailed description in conjunction with the accompanying drawings.
Fig. 1 is a flowchart illustrating a method for operating a text processing model according to an embodiment of the present disclosure, where the embodiment is applicable to a case where a text processing model including at least one encoder layer and at least one decoder layer is operated, and the method may be executed by an apparatus for operating a text processing model configured in an electronic device, as shown in fig. 1, where the method for operating a text processing model according to the embodiment includes:
in step S110, an input text vector is acquired.
In step S120, the text vector is input into at least one encoder layer for processing to form a hidden layer vector, and in the data calculation process of each encoder layer, a combined kernel function is called to process data.
The combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for completing mathematical level calculation of data, and the combined kernel function is used for completing functional level calculation of the data.
The data calculation of each encoder layer usually involves invoking kernel functions, for example kernel functions for matrix row average, matrix row variance, and matrix dot product. The kernel functions in scientific computing base libraries are usually fine-grained, and calling them directly requires frequent video memory reads and writes. If, in the data calculation of each encoder layer, the operations of multiple base-library kernel functions are spliced and combined into one combined kernel function (for example, a combined kernel function covering matrix row average, matrix row variance, matrix dot product, and the like), and that combined kernel function is called to process the data, then intermediate values can be kept temporarily in registers and only the final result of each component is written to the video memory, reducing the number of video memory reads and writes severalfold.
For example, if the text processing model employs a Transformer model, the computing components of its self-attention mechanism layer need to invoke, in sequence, kernel functions such as matrix row average, matrix row variance, and matrix dot product, which generates at least 5 reads and writes of the whole matrix. If a combined kernel function is called instead, so that this sequence of kernel calls is carried out inside one combined kernel function using thread-group communication, only one read and one write of the matrix are needed, and the latency can be reduced by about 80%.
It should be noted that the combined kernel function described in this step refers to a coarse-grained kernel function obtained by combining at least two basic kernel functions that are called sequentially; which basic kernel functions are combined is determined by the basic kernel functions actually involved in the calculation of each encoder layer, and this embodiment places no limitation on it.
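As a concrete illustration, the following is a minimal sketch of what such a combined kernel function could look like on the GPU: a fused row-wise layer normalization that computes the matrix row average, the matrix row variance, and the element-wise scaling in a single kernel, keeping intermediate values in registers and shared memory. The kernel name, parameters, and the assumption of a power-of-two thread block are illustrative and not taken from the patent.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Fused "combined kernel": row mean + row variance + normalization in one pass.
// One thread block handles one matrix row; blockDim.x is assumed to be a power of two.
__global__ void fused_layer_norm(const float* __restrict__ in,
                                 const float* __restrict__ gamma,
                                 const float* __restrict__ beta,
                                 float* __restrict__ out,
                                 int cols, float eps) {
    extern __shared__ float buf[];                 // 2 * blockDim.x floats
    int row = blockIdx.x;
    const float* x = in + (size_t)row * cols;
    float* y = out + (size_t)row * cols;

    // Partial sums stay in registers; nothing is written back to global memory yet.
    float sum = 0.f, sq = 0.f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        float v = x[c];
        sum += v;
        sq  += v * v;
    }
    // Block-level reduction (thread-group communication) for row mean and variance.
    buf[threadIdx.x] = sum;
    buf[blockDim.x + threadIdx.x] = sq;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            buf[threadIdx.x]              += buf[threadIdx.x + s];
            buf[blockDim.x + threadIdx.x] += buf[blockDim.x + threadIdx.x + s];
        }
        __syncthreads();
    }
    float mean = buf[0] / cols;
    float var  = buf[blockDim.x] / cols - mean * mean;
    float rstd = rsqrtf(var + eps);

    // Only the final normalized result is written to video memory.
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        y[c] = (x[c] - mean) * rstd * gamma[c] + beta[c];
    }
}
```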
In step S130, the hidden layer vector is input into at least one decoder layer for processing, so as to generate an output text vector, and in the data calculation process of each decoder layer, a combined kernel function is called to process data.
Likewise, the combined kernel function comprises at least two basic kernel functions, wherein the basic kernel functions are used for completing mathematical level calculation of data, and the combined kernel function is used for completing functional level calculation of the data.
For the same reason as in step S120, in order to reduce the number of video memory reads and writes, a combined kernel function may be constructed from the basic kernel functions that need to be called in the data calculation of each decoder layer, and this combined kernel function is then called to process the data.
It should be noted that, compared with the prior art in which each calculation step directly calls basic kernel functions from the base library, calling a combined kernel function to process data in the data calculation of any encoder layer or any decoder layer reduces the number of video memory reads and writes in this embodiment, thereby lowering the video memory read-write frequency and improving operating efficiency.
Exemplarily, fig. 2 is a schematic flowchart of an example of processing data by calling a combined kernel function according to an embodiment of the present disclosure. As shown in fig. 2, in the data calculation process of the encoder layer and/or the decoder layer, calling the combined kernel function to process data includes:
in step S210, in the data calculation process of the encoder layer and/or the decoder layer, a thread group is assigned to the called combined kernel function, and data is read from the video memory space.
In step S220, the combined kernel function is run through the thread group, and the intermediate data in the calculation process of the combined kernel function is read and written in the register in the thread group communication manner.
In step S230, the final output data of the combined kernel function is written into the video memory space.
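A hedged host-side sketch of the flow in steps S210 to S230, reusing the fused_layer_norm kernel from the sketch above: a thread group (one block per matrix row) is assigned to the combined kernel, the data is read from video memory, intermediates stay on chip, and only the final output is written back. The wrapper name, block size, and epsilon value are assumptions for illustration.

```cuda
// Host-side launch of the combined kernel sketched earlier.
void run_fused_layer_norm(const float* d_in, const float* d_gamma,
                          const float* d_beta, float* d_out,
                          int rows, int cols) {
    int threads = 256;                            // thread group assigned per row
    size_t shmem = 2 * threads * sizeof(float);   // workspace for mean/variance partials
    fused_layer_norm<<<rows, threads, shmem>>>(d_in, d_gamma, d_beta,
                                               d_out, cols, 1e-6f);
    cudaDeviceSynchronize();                      // final output now resides in video memory
}
```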
Exemplarily, the following describes the technical solution of the present embodiment by taking a Transformer model as an example of the text processing model, where the Transformer model used as the text processing model in the present embodiment includes a plurality of encoder layers connected in sequence and a plurality of decoder layers connected in sequence, and a hidden layer vector is transmitted between a last encoder layer and each decoder layer; each encoder layer at least comprises a self-attention mechanism layer and a feedforward neural network layer; each decoder layer at least comprises a self-attention mechanism layer, a coding and decoding attention mechanism layer and a feedforward neural network layer.
Taking as an example the Transformer model with 6 encoder layers and 6 decoder layers shown in fig. 3, the simplified internal structure of each encoder layer and decoder layer is shown in fig. 4.
As shown in fig. 4, the encoder layer includes two sub-layers, a self-attention mechanism layer and a feedforward neural network layer; the self-attention mechanism layer helps the current node attend not only to the current word, so that the semantics of the context can be captured.
The decoder layer also includes the two sub-layers mentioned for the encoder layer, but an encoding-decoding attention mechanism layer is arranged between them to help the current node obtain the important content that currently requires attention.
For the internal details of the model, reference may be made to fig. 5. First, the model needs to perform an embedding operation (similar to the operation of w2c, which maps natural language into numeric vectors, i.e. maps a text vector into a sequence vector). After embedding, the result is input into an encoder layer; after the self-attention mechanism layer processes the data, the data is sent to the feedforward neural network layer, whose calculation can be performed in parallel, and the resulting output is input into the next encoder layer. If the input text vectors are 'Thinking' and 'Machines', after the embedding mapping is performed, the sequence vectors X1 and X2 are obtained respectively.
The idea of the self-attention mechanism layer is similar to the attention mechanism, but self-attention is the concept the Transformer uses to incorporate the "understanding" of other related words into the word currently being processed.
First, the self-attention mechanism layer computes three new vectors for each word, which reflect the weight of the word from three aspects. Specifically, the sequence vector undergoes three weight calculations, called Query, Key and Value; the three vectors are the result of multiplying the embedding vectors (such as the sequence vectors X1 and X2) by randomly initialized matrices (such as the matrices WQ, WK and WV shown in fig. 5). As shown in fig. 5, multiplying X1 by the matrices WQ, WK and WV yields the vectors q1, k1 and v1, and multiplying X2 by the matrices WQ, WK and WV yields the vectors q2, k2 and v2.
For example, a randomly initialized matrix of dimension (64, 512) is used. Note that the second dimension needs to be the same as the embedding dimension; the values of this matrix are updated continuously during backpropagation (BP). The dimension of the three resulting vectors is 64, which is lower than the embedding dimension. A schematic diagram of the calculation process is shown in fig. 5.
Then, the self-attention mechanism layer needs to calculate a score value, which determines, when a word is encoded at a certain position, how much attention is paid to the other parts of the input sentence. The score value is calculated as the dot product of Query and Key. Taking fig. 6 as an example (the labels in the figure follow those of fig. 5), a score of each other word with respect to the word "Thinking" is calculated: first for the word itself, namely q1·k1, and then for the second word, namely q1·k2. The calculation process is illustrated in fig. 6; in the example of fig. 6, the dot product score of q1 and k1 is 112, and the dot product result of q1 and k2 is 112.
Next, the result of the dot product is divided by a constant; this value is generally the square root of the first dimension of the matrix mentioned above, for example the square root of 64, which is 8, although other values may be selected. The result is then put through a softmax calculation, which yields the relevance (a confidence probability) of each word to the word at the current position; the relevance of the word at the current position to itself is naturally large.
On the basis of fig. 6, the calculation result of this step is shown in fig. 7 (the labels follow those of figs. 5 and 6). The additional formula shown there (rendered as an image, reference BDA0002301151220000091, in the original publication) denotes the Score of the previous step divided by 8.
Then, each Value vector is multiplied by the corresponding softmax weight and the products are added; the result obtained is the output of the self-attention mechanism layer at the current node. For the above example, the calculation result of this step is shown in fig. 8 (the labels follow those of figs. 5 to 7, and Sum indicates the addition).
It should be noted that, in an actual application scenario, in order to increase the calculation speed, the Query, Key and Value matrices are computed directly in matrix form; the calculation result of this step is shown in fig. 9.
The embedding values are then multiplied directly by the three weight matrices; the resulting matrix Q is multiplied by the transpose of K, scaled by the constant, put through the softmax operation, and finally multiplied by the matrix V. For the above example this is illustrated in fig. 10, where the symbols follow figs. 5 to 9.
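The matrix-form computation just described corresponds to the standard scaled dot-product attention formula, written out here for reference with the dimension d_k = 64 used in the example above (the formula is common knowledge about the Transformer rather than text quoted from the patent):

```latex
\[
  \mathrm{Attention}(Q, K, V)
    = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
  \qquad d_k = 64 \;\Rightarrow\; \sqrt{d_k} = 8
\]
```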
The text processing model described in this embodiment may use many types of seq2seq model. Taking the Transformer model as the text processing model as an example, if the prior art is used, basic kernel functions have to be called during the softmax calculation of the self-attention mechanism layer of each encoder layer and each decoder layer: matrix multiplication kernel functions are involved when the Query, Key and Value matrices are computed, matrix dot-product and related functions need to be called when the Query of each layer is multiplied with the Key of each layer, and finally a variance function needs to be called when each softmax value is obtained.
According to this embodiment, the combined kernel function can be pre-built according to the calling sequence of these functions, and the combined kernel function is then called instead, which greatly reduces the number of video memory reads and writes and improves running efficiency.
Further, on the basis, the combined kernel function can be operated through a thread group, intermediate data in the calculation process of the combined kernel function is read and written in a register in a thread group communication mode, and final output data of the combined kernel function is written into a video memory space, so that the efficiency of calling the combined kernel function can be further improved.
The embodiment forms the hidden layer vector by inputting the input text vector into at least one encoder layer for processing; and inputting the hidden layer vector into at least one decoder layer for processing so as to generate an output text vector, wherein in the data calculation process of the encoder layer and/or the decoder layer, a combined kernel function is called to process the data, the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for completing the mathematical grade calculation of the data, and the combined kernel function is used for completing the functional grade calculation of the data. The video memory read-write frequency can be greatly reduced, and the operating efficiency can be improved.
Fig. 11 is a flowchart illustrating another method for operating a text processing model according to an embodiment of the present disclosure, where the embodiment is based on the foregoing embodiment and is optimized. As shown in fig. 11, the method for operating the text processing model according to this embodiment includes:
in step S1110, an input text vector is acquired.
In step S1120, a video memory space with a fixed position is allocated to at least one computing module of the encoder layer, and the size of the video memory space is fixed and remains in an empty state when no data is read or written.
Taking the model described in fig. 3 and fig. 4 as an example, a first video memory space with a fixed position may be allocated to the self-attention mechanism modules of the first, third and fifth encoder layers of the Transformer model, and a second video memory space with a fixed position may be allocated to the self-attention mechanism modules of the second, fourth and sixth encoder layers. Because the video memory reads and writes of the self-attention mechanism modules of the first, third and fifth encoder layers are performed sequentially and never overlap in time, they can reuse the same video memory space (namely the first video memory space); for the same reason, the self-attention mechanism modules of the second, fourth and sixth encoder layers can reuse the second video memory space.
In order to widen the range of text that the text processing model can handle, the size of the allocated video memory space is determined by the maximum size of the inputtable text vector. The maximum size of the inputtable text vector is determined based on the historical maximum of the input text vectors, so that the text processing model can process any text vector whose length does not exceed this maximum.
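A hedged sketch of this fixed-workspace idea in host-side CUDA code: two buffers are sized once from an assumed maximum sequence length and reused by alternating encoder layers, so no per-request allocation or release is needed. MAX_TOKENS, HIDDEN_DIM, the struct name, and the odd/even layer split are illustrative assumptions based on the example above.

```cuda
#include <cuda_runtime.h>

constexpr int MAX_TOKENS = 256;   // assumed upper bound on input text vector length
constexpr int HIDDEN_DIM = 512;   // assumed hidden size

struct AttentionWorkspace {
    float* buf_first  = nullptr;  // fixed space shared by encoder layers 1, 3, 5
    float* buf_second = nullptr;  // fixed space shared by encoder layers 2, 4, 6

    void allocate_once() {
        size_t bytes = (size_t)MAX_TOKENS * HIDDEN_DIM * sizeof(float);
        cudaMalloc(&buf_first,  bytes);   // fixed position and size for the whole run
        cudaMalloc(&buf_second, bytes);   // stays idle between uses, never freed per request
    }
    float* for_layer(int layer_index) {   // 1-based encoder layer index
        return (layer_index % 2 == 1) ? buf_first : buf_second;
    }
};
```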
In step S1130, the text vector is input into at least one encoder layer for processing, so as to form a hidden layer vector, and in the data calculation process of each encoder layer, a combined kernel function is called to process data.
In step S1140, the hidden layer vector is input to at least one decoder layer for processing to generate an output text vector.
It should be noted that the decoder layer includes at least two computing modules having a time-sharing computing relationship; the same fixed-position video memory space may likewise be allocated to these at least two computing modules, with the size of the video memory space fixed and the space kept idle when no data is being read or written, so that the at least two computing modules reuse the same video memory space. Similarly, after the read-write operations on the reused video memory space are finished, the modules do not need to release it; this reduces the number of GPU video memory release and allocation operations while the text processing model is running, and further improves processing efficiency.
Taking a Transformer model as the text processing model as an example, the self-attention mechanism layer of the Transformer model uses a mechanism called "MultiHead-attention" (multi-head attention). In an embodiment, this operation is converted, in a logically equivalent way, into a calculation mode with higher concurrency; in the data calculation process of the encoder layer and/or the decoder layer, a logically equivalent calculation module can thus be adopted to process the data, improving concurrency and further improving processing efficiency.
For example, a splicing weight calculation module may be used to perform matrix calculation on the sequence vector of the input text vector to generate a splicing matrix of the sequence vector. For example, the splicing weight calculation module includes a query weight calculation matrix, a key weight calculation matrix, a numerical weight calculation matrix, and the like spliced together, and the splicing matrix of the sequence vector includes a query weight matrix, a key weight matrix, a numerical weight matrix, and the like spliced together.
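A rough illustration of this splicing idea, under assumed shapes and names (not the patent's implementation; a tuned GEMM library would replace the naive kernel below): the three weight matrices are stored row-concatenated as w_qkv, so a single matrix multiplication produces a concatenated (Query | Key | Value) result.

```cuda
// x:      [seq, d_in]       input sequence vectors, row-major
// w_qkv:  [3*d_out, d_in]   row-spliced query/key/value weight matrices
// qkv:    [seq, 3*d_out]    columns [0,d_out) hold Q, then K, then V
__global__ void qkv_projection(const float* __restrict__ x,
                               const float* __restrict__ w_qkv,
                               float* __restrict__ qkv,
                               int seq, int d_in, int d_out) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // token index
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // position inside (Q|K|V)
    if (row >= seq || col >= 3 * d_out) return;

    float acc = 0.f;
    for (int k = 0; k < d_in; ++k) {
        acc += x[row * d_in + k] * w_qkv[col * d_in + k];
    }
    qkv[row * (3 * d_out) + col] = acc;   // one mapping yields the spliced matrix
}
```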
Processing the data with the logically equivalent computation module may also include: during data processing at the decoder layer, determining the texts of each output bit and their text credibility probabilities through a softmax function; and, for the current output bit, retaining only the partial texts whose text credibility probabilities meet a set condition, together with those probabilities, and passing them to the next output bit in the form of a variable vector for calculating its texts and text credibility probabilities.
This scheme changes the inherent calculation process of the Transformer model while only ensuring logical equivalence. For example, in the MultiHead-attention process, the input is ordinarily mapped through three separate weight matrices to obtain query, key and value; after the logically equivalent transformation, the three weight matrices can be row-spliced, and the input is mapped once through the spliced matrix to obtain a (query, key, value) splicing matrix, thereby improving concurrency. In addition, in the softmax calculation, only the probability values of the top-K ranked model outputs are saved for the subsequent cluster (beam) search, instead of computing the probability value of every symbol (token) of the whole vocabulary, which further improves processing efficiency.
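A minimal host-side sketch of the top-K retention step (the function name, the container types, and performing the selection on the CPU are illustrative assumptions; in practice the selection would typically run on the GPU as part of the beam search):

```cuda
#include <algorithm>
#include <utility>
#include <vector>

// Keep only the K most probable tokens for the current output position; the small
// (token id, probability) list is what gets passed on to the next position.
std::vector<std::pair<int, float>> keep_top_k(const std::vector<float>& probs, int k) {
    std::vector<std::pair<int, float>> scored;
    scored.reserve(probs.size());
    for (int i = 0; i < (int)probs.size(); ++i) scored.emplace_back(i, probs[i]);

    // Partially sort so only the k most probable tokens are ordered and retained.
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    scored.resize(k);
    return scored;
}
```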
As an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for running a text processing model, and fig. 12 is a schematic structural diagram of an apparatus for running a text processing model provided in this embodiment, where the embodiment of the apparatus corresponds to the method embodiments shown in fig. 1 to fig. 11, and the apparatus may be applied to various electronic devices. As shown in fig. 12, the device for executing the text processing model according to the present embodiment includes an input obtaining unit 1210, an encoder layer processing unit 1220, and a decoder layer processing unit 1230.
The input obtaining unit 1210 is configured to obtain an input text vector.
The encoder layer processing unit 1220 is configured to input the text vector into at least one encoder layer for processing to form a hidden layer vector.
The decoder layer processing unit 1230 is configured to input the hidden layer vector into at least one decoder layer for processing to generate an output text vector.
The encoder layer processing unit calls a combined kernel function to process data in the data calculation process of the encoder layer and/or the decoder layer processing unit in the decoder layer, wherein the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for finishing mathematical level calculation of the data, and the combined kernel function is used for finishing functional level calculation of the data.
Further, the base kernel function includes at least one of: matrix row average, matrix row variance, and matrix dot product;
the combined kernel function includes a normalization processing function including a matrix row average, a matrix row variance, and a matrix dot product.
Further, the calling, by the encoder layer processing unit 1220, the combining kernel function to process the data during the data calculation process of the encoder layer and/or the decoder layer processing unit 1230 at the decoder layer includes:
in the data calculation process of the encoder layer and/or the decoder layer, a thread group is allocated to the called combined kernel function, and data are read in from a video memory space;
running the combined kernel function through the thread group, and reading and writing intermediate data in the calculation process of the combined kernel function in a register by utilizing a thread group communication mode;
and writing the final output data of the combined kernel function into a video memory space.
In an embodiment, the apparatus further includes a video memory space configuration unit (not shown in fig. 12) configured to, before the data calculation process of the encoder layer and/or the decoder layer: allocating a video memory space with a fixed position for at least one computing module of the encoder layer and/or the decoder layer; the size of the video memory space is fixed, and the video memory space is kept in an idle state when no data is read and written.
Further, the video memory space configuration unit is configured to be used for multiplexing the video memory spaces allocated to the at least two computing modules having the time-sharing computing relationship.
Further, in the video memory space configuration unit, the computation module for multiplexing the same video memory space includes at least one group of the following: a self-attention mechanism module of two encoder layers spaced apart.
In one embodiment, the size of the video memory space is determined by the maximum value of the input text vector.
In an embodiment, during the calculation of the data at the encoder layer by the encoder layer processing unit 1220 and/or the data at the decoder layer by the decoder layer processing unit 1230, the method further includes processing the data by using a logical equivalent calculation module.
Further, the processing of the data by the logic equivalence computation module includes:
adopting a splicing weight calculation module to perform matrix calculation on the sequence vector of the input text vector to generate a splicing matrix of the sequence vector;
the splicing weight calculation module comprises a query weight calculation matrix, a key weight calculation matrix and a numerical weight calculation matrix which are spliced together; the splicing matrix of the sequence vector comprises a query weight matrix, a key weight matrix and a numerical weight matrix which are spliced together.
Further, the processing of the data by the logic equivalence computation module includes:
in the data processing process of a decoder layer, determining texts of output bits and text credibility probability through a softmax function;
and for the current output bit, reserving partial texts and text credibility probabilities of which the text credibility probabilities meet set conditions, and transmitting the partial texts and the text credibility probabilities to the next output bit in a variable vector form to calculate the text and text credibility probabilities.
In one embodiment, the text processing model includes a plurality of encoder layers connected in sequence and a plurality of decoder layers connected in sequence, and a hidden layer vector is transmitted between the last encoder layer and each decoder layer; each encoder layer at least comprises a self-attention mechanism layer and a feedforward neural network layer; each decoder layer at least comprises a self-attention mechanism layer, a coding and decoding attention mechanism layer and a feedforward neural network layer.
The running device of the text processing model provided by the embodiment can execute the running method of the text processing model provided by the embodiment of the method disclosed by the invention, and has corresponding functional modules and beneficial effects of the execution method.
Referring now to FIG. 13, shown is a schematic diagram of an electronic device 1300 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 13, electronic device 1300 may include a processing means (e.g., central processing unit, graphics processor, etc.) 1301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)1302 or a program loaded from storage device 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for the operation of the electronic apparatus 1300 are also stored. The processing device 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
Generally, the following devices may be connected to the I/O interface 1305: input devices 1306 including, for example, touch screens, touch pads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, and the like; an output device 1307 including, for example, a Liquid Crystal Display (LCD), speaker, vibrator, etc.; storage devices 1308 including, for example, magnetic tape, hard disk, etc.; and a communication device 1309. The communications device 1309 may allow the electronic device 1300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 13 illustrates an electronic device 1300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 1309, or installed from the storage device 1308, or installed from the ROM 1302. The computer program, when executed by the processing apparatus 1301, performs the functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium described above in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring an input text vector;
inputting the text vector into at least one encoder layer for processing to form a hidden layer vector;
inputting the hidden layer vector into at least one decoder layer for processing to generate an output text vector;
in the data calculation process of the encoder layer and/or the decoder layer, a combined kernel function is called to process data, the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for finishing mathematical level calculation of the data, and the combined kernel function is used for finishing functional level calculation of the data.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
According to one or more embodiments of the present disclosure, in the method for operating the text processing model, the basic kernel function includes at least one of: matrix row average, matrix row variance, and matrix dot product;
the combined kernel function includes a normalization processing function including a matrix row average, a matrix row variance, and a matrix dot product.
According to one or more embodiments of the present disclosure, in the method for operating the text processing model, in the data calculation process of the encoder layer and/or the decoder layer, invoking a combined kernel function to process data includes:
in the data calculation process of the encoder layer and/or the decoder layer, a thread group is allocated to the called combined kernel function, and data are read in from a video memory space;
running the combined kernel function through the thread group, and reading and writing intermediate data in the calculation process of the combined kernel function in a register by utilizing a thread group communication mode;
and writing the final output data of the combined kernel function into a video memory space.
According to one or more embodiments of the present disclosure, in the method for operating the text processing model, before the data calculation process of the encoder layer and/or the decoder layer, the method further includes:
allocating a video memory space with a fixed position for at least one computing module of the encoder layer and/or the decoder layer; the size of the video memory space is fixed, and the video memory space is kept in an idle state when no data is read and written.
According to one or more embodiments of the present disclosure, in the operation method of the text processing model, the video memory spaces allocated to at least two computing modules having a time-sharing computing relationship are the same multiplexed space.
According to one or more embodiments of the present disclosure, the computation module for multiplexing the same video memory space includes at least one group of:
a self-attention mechanism module of two encoder layers spaced apart.
According to one or more embodiments of the present disclosure, in the text processing model operating method, the size of the video memory space is determined by a maximum value of an input text vector.
According to one or more embodiments of the present disclosure, in the method for operating the text processing model, during the data calculation process of the encoder layer and/or the decoder layer, the method further includes:
and processing the data by adopting a logic equivalent calculation module.
According to one or more embodiments of the present disclosure, in the text processing model operating method, processing data by using a logic equivalent computation module includes:
adopting a splicing weight calculation module to perform matrix calculation on the sequence vector of the input text vector to generate a splicing matrix of the sequence vector;
the splicing weight calculation module comprises a query weight calculation matrix, a key weight calculation matrix and a numerical weight calculation matrix which are spliced together; the splicing matrix of the sequence vector comprises a query weight matrix, a key weight matrix and a numerical weight matrix which are spliced together.
According to one or more embodiments of the present disclosure, in the text processing model operating method, processing data by using a logic equivalent computation module includes:
in the data processing process of a decoder layer, determining texts of output bits and text credibility probability through a softmax function;
and for the current output bit, reserving partial texts and text credibility probabilities of which the text credibility probabilities meet set conditions, and transmitting the partial texts and the text credibility probabilities to the next output bit in a variable vector form to calculate the text and text credibility probabilities.
According to one or more embodiments of the present disclosure, in the method for operating the text processing model, the text processing model includes a plurality of encoder layers connected in sequence and a plurality of decoder layers connected in sequence, and a hidden layer vector is transmitted between a last encoder layer and each decoder layer; each encoder layer at least comprises a self-attention mechanism layer and a feedforward neural network layer; each decoder layer at least comprises a self-attention mechanism layer, a coding and decoding attention mechanism layer and a feedforward neural network layer.
According to one or more embodiments of the present disclosure, in the running device of the text processing model, the basic kernel function includes at least one of: matrix row average, matrix row variance, and matrix dot product;
the combined kernel function includes a normalization processing function including a matrix row average, a matrix row variance, and a matrix dot product.
According to one or more embodiments of the present disclosure, in the apparatus for operating the text processing model, the invoking, by the encoder layer processing unit, a combination kernel function to process data in a data calculation process of the encoder layer and/or the decoder layer processing unit at the decoder layer includes:
in the data calculation process of the encoder layer and/or the decoder layer, a thread group is allocated to the called combined kernel function, and data are read in from a video memory space;
running the combined kernel function through the thread group, and reading and writing intermediate data in the calculation process of the combined kernel function in a register by utilizing a thread group communication mode;
and writing the final output data of the combined kernel function into a video memory space.
According to one or more embodiments of the present disclosure, the apparatus for executing the text processing model further includes a video memory space configuration unit, configured to, before the data calculation process of the encoder layer and/or the decoder layer:
allocating a video memory space with a fixed position for at least one computing module of the encoder layer and/or the decoder layer; the size of the video memory space is fixed, and the video memory space is kept in an idle state when no data is read and written.
According to one or more embodiments of the present disclosure, in the device for operating the text processing model, the video memory space configuration unit is further configured to: at least two video memory spaces distributed by the computing modules with time-sharing computing relation are the same multiplexing space.
According to one or more embodiments of the present disclosure, in the video memory space configuration unit, the calculation module that multiplexes the same video memory space includes at least one group of:
a self-attention mechanism module of two encoder layers spaced apart.
According to one or more embodiments of the present disclosure, in the device for operating the text processing model, the size of the video memory space is determined by the maximum value of the input text vector.
According to one or more embodiments of the present disclosure, in the apparatus for operating the text processing model, during the data calculation process of the encoder layer processing unit at the encoder layer and/or the decoder layer processing unit at the decoder layer, the method further includes processing the data by using a logical equivalent calculation module.
According to one or more embodiments of the present disclosure, in the running device of the text processing model, processing data by using a logic equivalent computation module includes:
adopting a splicing weight calculation module to perform matrix calculation on the sequence vector of the input text vector to generate a splicing matrix of the sequence vector;
the splicing weight calculation module comprises a query weight calculation matrix, a key weight calculation matrix and a numerical weight calculation matrix which are spliced together; the splicing matrix of the sequence vector comprises a query weight matrix, a key weight matrix and a numerical weight matrix which are spliced together.
According to one or more embodiments of the present disclosure, in the apparatus for operating the text processing model, processing data by using a logically equivalent computation module includes:
in the data processing of the decoder layer, determining the texts of each output position and their text confidence probabilities through a softmax function;
and, for the current output position, retaining the partial texts whose text confidence probabilities meet a set condition, together with those probabilities, and passing them in the form of a variable-length vector to the next output position for calculating its texts and text confidence probabilities.
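As an illustrative sketch only (the disclosure requires only that the retained texts satisfy a set condition; the top-k rule, the Candidate type, and all names here are assumptions), the pruning at one output position could look as follows: a softmax over the candidates of the current output position, retention of the most credible ones, and a short variable-length vector handed to the next position.

```cuda
// Illustrative only: pruning at one output position. A softmax turns the
// scores of the current output position into confidence probabilities, only
// the best keep_top_k candidates are retained (one possible "set condition"),
// and this short variable-length vector is what the next output position
// receives instead of the full candidate set.
#include <algorithm>
#include <cmath>
#include <vector>

struct Candidate { int token; float prob; };

std::vector<Candidate> prune_output_position(const std::vector<float>& logits,
                                             size_t keep_top_k) {
    std::vector<Candidate> cands(logits.size());
    if (logits.empty()) return cands;

    // Numerically stable softmax over the candidates of this position.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float denom = 0.f;
    for (size_t i = 0; i < logits.size(); ++i) {
        float e = std::exp(logits[i] - max_logit);
        cands[i] = {static_cast<int>(i), e};
        denom += e;
    }
    for (auto& c : cands) c.prob /= denom;

    // Retain only the candidates that meet the condition (here: top-k by
    // confidence); nothing else is carried over to the next output position.
    keep_top_k = std::min(keep_top_k, cands.size());
    std::partial_sort(cands.begin(), cands.begin() + keep_top_k, cands.end(),
                      [](const Candidate& a, const Candidate& b) {
                          return a.prob > b.prob;
                      });
    cands.resize(keep_top_k);
    return cands;
}
```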
According to one or more embodiments of the present disclosure, in the apparatus for operating the text processing model, the text processing model includes a plurality of encoder layers connected in sequence and a plurality of decoder layers connected in sequence, and the hidden layer vector is transmitted from the last encoder layer to each decoder layer; each encoder layer includes at least a self-attention mechanism layer and a feedforward neural network layer; and each decoder layer includes at least a self-attention mechanism layer, an encoding-decoding attention mechanism layer, and a feedforward neural network layer.
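Purely as a structural sketch with illustrative type names (not the disclosure's implementation), the stack just described can be summarized as:

```cuda
// Illustrative type names only; members are sketched as comments.
#include <vector>

struct EncoderLayer { /* self-attention mechanism layer + feedforward neural network layer */ };
struct DecoderLayer { /* self-attention + encoding-decoding attention + feedforward layers */ };

struct TextProcessingModel {
    std::vector<EncoderLayer> encoder_layers;   // connected in sequence
    std::vector<DecoderLayer> decoder_layers;   // connected in sequence
    // Forward pass: the input text vector passes through every encoder layer;
    // the hidden layer vector from the last encoder layer feeds the
    // encoding-decoding attention of every decoder layer, which then produce
    // the output text vector.
};
```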
The foregoing description is only a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure involved in the embodiments of the present disclosure is not limited to technical solutions formed by the particular combination of the above-described features, and also covers other technical solutions formed by any combination of the above-described features or their equivalents without departing from the disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (14)

1. A method of operating a text processing model, the text processing model including at least one encoder layer and at least one decoder layer, comprising:
acquiring an input text vector;
inputting the text vector into at least one encoder layer for processing to form a hidden layer vector;
inputting the hidden layer vector into at least one decoder layer for processing to generate an output text vector;
in the data calculation process of the encoder layer and/or the decoder layer, calling a combined kernel function to process data, wherein the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for completing mathematical-level calculation of the data, and the combined kernel function is used for completing functional-level calculation of the data.
2. The method of claim 1, wherein:
the basic kernel function includes at least one of: matrix row average, matrix row variance, and matrix dot product;
the combined kernel function includes a normalization processing function, and the normalization processing function includes a matrix row average, a matrix row variance, and a matrix dot product.
3. The method according to claim 1 or 2, wherein calling a combined kernel function to process data in the data calculation process of the encoder layer and/or the decoder layer comprises:
allocating, in the data calculation process of the encoder layer and/or the decoder layer, a thread group to the called combined kernel function and reading data from a video memory space;
running the combined kernel function through the thread group, and reading and writing the intermediate data of the combined kernel function's calculation in registers by means of thread group communication;
and writing the final output data of the combined kernel function into the video memory space.
4. The method of claim 1, further comprising, prior to the data computation process at the encoder layer and/or decoder layer:
allocating a video memory space at a fixed position for at least one computing module of the encoder layer and/or the decoder layer; wherein the size of the video memory space is fixed, and the video memory space remains idle when no data is being read or written.
5. The method according to claim 4, wherein the video memory spaces allocated to at least two computing modules having a time-sharing computing relationship are the same multiplexed space.
6. The method of claim 5, wherein the computing modules that reuse the same video memory space comprise at least the following group:
the self-attention mechanism modules of two spaced-apart encoder layers.
7. The method of claim 4, wherein the size of the video memory space is determined by a maximum value of the input text vector.
8. The method according to claim 1, further comprising, during the data calculation process of the encoder layer and/or the decoder layer:
processing the data by using a logically equivalent computation module.
9. The method of claim 8, wherein processing data by using a logically equivalent computation module comprises:
using a splicing weight calculation module to perform matrix calculation on the sequence vectors of the input text vector to generate a splicing matrix of the sequence vectors;
wherein the splicing weight calculation module comprises a query weight calculation matrix, a key weight calculation matrix, and a value weight calculation matrix spliced together; and the splicing matrix of the sequence vectors comprises a query weight matrix, a key weight matrix, and a value weight matrix spliced together.
10. The method of claim 8, wherein processing data by using a logically equivalent computation module comprises:
in the data processing of the decoder layer, determining the texts of each output position and their text confidence probabilities through a softmax function;
and, for the current output position, retaining the partial texts whose text confidence probabilities meet a set condition, together with those probabilities, and passing them in the form of a variable-length vector to the next output position for calculating its texts and text confidence probabilities.
11. The method of any of claims 1-10, wherein the text processing model comprises a plurality of encoder layers connected in sequence and a plurality of decoder layers connected in sequence, and the hidden layer vector is transmitted from the last encoder layer to each decoder layer; each encoder layer comprises at least a self-attention mechanism layer and a feedforward neural network layer; and each decoder layer comprises at least a self-attention mechanism layer, an encoding-decoding attention mechanism layer, and a feedforward neural network layer.
12. An apparatus for running a text processing model, the text processing model comprising at least one encoder layer and at least one decoder layer, the apparatus comprising:
an input acquisition unit for acquiring an input text vector;
an encoder layer processing unit for inputting the text vector into at least one encoder layer for processing so as to form a hidden layer vector;
a decoder layer processing unit for inputting the hidden layer vector into at least one decoder layer for processing so as to generate an output text vector;
wherein the encoder layer processing unit and/or the decoder layer processing unit calls a combined kernel function to process data in the data calculation process of the encoder layer and/or the decoder layer, the combined kernel function comprises at least two basic kernel functions, the basic kernel functions are used for completing mathematical-level calculation of the data, and the combined kernel function is used for completing functional-level calculation of the data.
13. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-11.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN201911222138.5A 2019-12-03 2019-12-03 Text processing model running method and device, electronic equipment and storage medium Active CN110909527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911222138.5A CN110909527B (en) 2019-12-03 2019-12-03 Text processing model running method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911222138.5A CN110909527B (en) 2019-12-03 2019-12-03 Text processing model running method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110909527A true CN110909527A (en) 2020-03-24
CN110909527B CN110909527B (en) 2023-12-08

Family

ID=69822124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911222138.5A Active CN110909527B (en) 2019-12-03 2019-12-03 Text processing model running method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110909527B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834746A (en) * 2015-05-23 2015-08-12 华东交通大学 Heterogeneous feature time sequence data evolution and clustering method based on graphic processing unit
US10380236B1 (en) * 2017-09-22 2019-08-13 Amazon Technologies, Inc. Machine learning system for annotating unstructured text
US20190236148A1 (en) * 2018-02-01 2019-08-01 Jungle Disk, L.L.C. Generative text using a personality model
CN109543200A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind of text interpretation method and device
CN109635269A (en) * 2019-01-31 2019-04-16 苏州大学 A kind of post-editing method and device of machine translation text
CN109783827A (en) * 2019-01-31 2019-05-21 沈阳雅译网络技术有限公司 A kind of deep layer nerve machine translation method based on dynamic linear polymerization
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022151966A1 (en) * 2021-01-15 2022-07-21 北京有竹居网络技术有限公司 Processing method and apparatus for language model, text generation method and apparatus, and medium
CN115129233A (en) * 2021-03-26 2022-09-30 中科寒武纪科技股份有限公司 Data processing device, method and related product
CN115129233B (en) * 2021-03-26 2024-03-19 中科寒武纪科技股份有限公司 Data processing device, method and related product
CN117059081A (en) * 2023-08-30 2023-11-14 易方信息科技股份有限公司 Lightweight voice recognition method, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN110909527B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
JP7029554B2 (en) Methods and equipment for training deep learning models, electronic devices, computer-readable storage media and computer programs
CN106056529B (en) Method and equipment for training convolutional neural network for picture recognition
CN110909527B (en) Text processing model running method and device, electronic equipment and storage medium
US11210131B2 (en) Method and apparatus for assigning computing task
CN111090628A (en) Data processing method and device, storage medium and electronic equipment
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
WO2022151966A1 (en) Processing method and apparatus for language model, text generation method and apparatus, and medium
JP7520246B2 (en) Method and apparatus for generating text
CN114528044B (en) Interface calling method, device, equipment and medium
CN115908640A (en) Method and device for generating image, readable medium and electronic equipment
CN112199923A (en) Identification generation method, system, device and medium based on distributed system
CN114625536A (en) Video memory allocation method, device, medium and electronic equipment
CN111324376A (en) Function configuration method and device, electronic equipment and computer readable medium
CN116741197B (en) Multi-mode image generation method and device, storage medium and electronic equipment
CN111324258B (en) Method, device, equipment and medium for generating contents of configuration items of multilevel pull-down menu
US11157406B2 (en) Methods for providing data values using asynchronous operations and querying a plurality of servers
CN117370488A (en) Data processing method, device, electronic equipment and computer readable storage medium
CN111767433A (en) Data processing method, device, storage medium and terminal
CN114817845B (en) Data processing method, device, electronic equipment and storage medium
CN109635238A (en) Matrix operation method, apparatus, equipment and readable medium
CN111382557B (en) Batch processing method, device, terminal and storage medium for non-fixed-length input data
CN114818746A (en) Text generation method and device, computer equipment and storage medium
CN113064704A (en) Task processing method and device, electronic equipment and computer readable medium
CN116804915B (en) Data interaction method, processor, device and medium based on memory
US11847054B2 (en) Providing data values using asynchronous operations and based on timing of occurrence of requests for the data values

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant