CN114201588A - Training method, device and medium for machine reading understanding model - Google Patents

Training method, device and medium for machine reading understanding model

Info

Publication number
CN114201588A
Authority
CN
China
Prior art keywords
understanding model
reading understanding
machine reading
machine
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010908799.XA
Other languages
Chinese (zh)
Inventor
张高升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd filed Critical Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN202010908799.XA priority Critical patent/CN114201588A/en
Publication of CN114201588A publication Critical patent/CN114201588A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a training method, device and medium for a machine reading understanding model. The method comprises the following steps: acquiring a trained first machine reading understanding model and a preset hierarchical mapping relation; constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relation; and acquiring a training set and a target loss function, and training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model to obtain a trained second machine reading understanding model as the target machine reading understanding model. According to the method and the device, an error is calculated with the loss function for each pair of layers having a corresponding relation, the total loss value is then calculated as a weighted sum, and the parameters of the second machine reading understanding model are adjusted based on the total loss value to obtain the target machine reading understanding model, which greatly reduces the GPU memory occupied by the model, greatly shortens the inference time, and reduces cost.

Description

Training method, device and medium for machine reading understanding model
Technical Field
The invention relates to the technical field of natural language processing, in particular to a training method, a device and a medium for a machine reading understanding model.
Background
Deep learning has achieved remarkable results in fields such as image recognition and speech recognition, and Machine Reading Comprehension (MRC) has become a new hot spot in artificial intelligence research and applications. Its main function is to read and understand a given article or context and automatically answer related questions. With the development of machine reading comprehension technology, the reading comprehension task has been continuously upgraded, from the early cloze (fill-in-the-blank) form, to single-document reading comprehension based on Wikipedia, and further to multi-document reading comprehension based on web (web page) data.
Currently, the training methods for the above types of machine reading understanding adopt models based on BERT (Bidirectional Encoder Representations from Transformers), which achieve excellent performance on machine reading comprehension tasks. However, when the existing BERT model is deployed in practice, its network is too deep, its parameter count is too large, it occupies too many resources, its inference time is too long, and its deployment cost is too high.
Accordingly, there is a need for improvements and developments in the art.
Disclosure of Invention
Based on this, the invention provides a training method, device and medium for a machine reading understanding model, aiming at the technical problems in the prior art that training a machine reading understanding model occupies too many resources, the inference time is too long, and the deployment cost is too high.
In order to achieve the purpose, the invention adopts the following technical scheme:
a training method of a machine reading understanding model comprises the following steps:
acquiring a trained first machine reading understanding model and a preset hierarchical mapping relation;
constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relation; wherein the number of layers of the second machine reading understanding model is less than the number of layers of the first machine reading understanding model;
and acquiring a training set and a target loss function, training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model to obtain a trained second machine reading understanding model, and taking the trained second machine reading understanding model as a target machine reading understanding model.
Optionally, the training step of the trained first machine-reading understanding model specifically includes:
acquiring a first training set and constructing a language model; the first training set comprises a plurality of training text corpora and the questions corresponding to the training text corpora;
inputting the first training set into the language model to obtain answers corresponding to all questions in the first training set output by the language model;
adjusting parameters of the language model by taking a cross entropy loss function as a target so as to stop training when cross entropy loss values of answers corresponding to all questions in the first training set output by the language model and standard answers do not change any more;
and when the cross entropy loss value of the answer corresponding to each question in the first training set and the standard answer is not changed any more, the corresponding language model is used as a first machine reading understanding model.
Optionally, the hierarchical structure of the first machine reading understanding model includes an input layer, a plurality of cascaded interaction layers, and an output layer, and the hierarchical structure of the second machine reading understanding model includes an input layer, a plurality of cascaded interaction layers, and an output layer, where the number of interaction layers of the second machine reading understanding model is smaller than the number of interaction layers of the first machine reading understanding model.
Optionally, the function formula of the preset hierarchical mapping relationship is as follows:
n = g(m), where:
when m = 1, n = 1;
when m < (M/2), n = (m/2) + 1;
when m > (M/2), n = (m/2) - 1;
Here 1 ≤ m ≤ M, M denotes the number of interaction layers in the first machine reading understanding model, and m denotes the m-th of the plurality of interaction layers in the first machine reading understanding model; 1 ≤ n ≤ N, N denotes the number of interaction layers in the second machine reading understanding model, and n denotes the n-th of the plurality of interaction layers in the second machine reading understanding model.
Optionally, the constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relationship specifically includes:
determining an input layer and an output layer corresponding to the second machine reading understanding model based on the input layer and the output layer of the first machine reading understanding model;
determining the number of layers of a plurality of interaction layers in the second machine reading understanding model according to the layer mapping relation; wherein each interaction layer in the second machine-reading understanding model has its corresponding interaction layer in the first machine-reading understanding model.
Optionally, the step of obtaining the target loss function specifically includes:
acquiring a second training set and a plurality of loss functions to be tested; wherein, the second training set corresponding to each loss function to be tested is the same;
based on the same second training set, training the first machine reading understanding model and the second machine reading understanding model respectively through the loss functions to be tested to obtain a training result corresponding to each loss function to be tested;
and determining a target loss function according to the training result.
Optionally, the target loss function is a weighted sum of loss values of layers in the second machine-read understanding model.
Optionally, the obtaining a training set and a target loss function, and training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model to obtain a trained second machine reading understanding model specifically includes:
acquiring a training set, inputting the training set into the first machine reading understanding model and the second machine reading understanding model respectively, and then training the second machine reading understanding model;
calculating a target loss value of the second machine reading understanding model according to the target loss function;
adjusting parameters of the second machine reading understanding model based on the target loss value until the target loss value is smaller than a preset threshold value, and stopping training of the second machine reading understanding model;
and acquiring a corresponding second machine reading understanding model when the target loss value is smaller than a preset threshold value, and taking the second machine reading understanding model as a target machine reading understanding model.
Based on the above method, the present application further provides a training apparatus for a machine reading understanding model, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the training method for the machine reading understanding model when executing the computer program.
Based on the above method, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the training method for machine reading understanding model.
Advantageous effects:
according to the method and the device, the machine reading understanding model and the second machine reading understanding model are used for establishing the hierarchical mapping relation, loss function calculation errors are adopted for each layer with the corresponding relation, then the total loss value is calculated in a weighting and mode, and the parameters of the second machine reading understanding model are adjusted based on the total loss value, so that the target machine reading understanding model is obtained, the video memory occupied by the model is greatly reduced, the reasoning time is greatly reduced, and the cost is reduced.
Drawings
Fig. 1 is a flowchart of a training method for a machine reading understanding model according to the present invention.
Fig. 2 is a flowchart illustrating an embodiment of a method for training a machine reading understanding model according to the present invention.
Fig. 3 is a block diagram of a training apparatus for machine reading understanding model according to the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Through the inventor's research, existing training methods for the BERT model (Bidirectional Encoder Representations from Transformers) calculate the loss only on the final output of the model. As the network becomes deeper and the parameter count grows, calculating the final loss function becomes overly complicated and the backward training time becomes too long; and when the model is deployed in practice, it occupies too many resources, the inference time is too long, and the deployment cost is too high.
Therefore, in view of the above problems, the present application provides a training method, device and medium for a machine reading understanding model in which the loss function is not calculated only on the final output. Instead, a hierarchical mapping relation is established between the first machine reading understanding model and a second machine reading understanding model, an error is calculated with the loss function for each pair of layers having a corresponding relation, the total loss value is calculated as a weighted sum, and the parameters of the second machine reading understanding model are adjusted based on the total loss value to obtain the target machine reading understanding model. This greatly reduces the GPU memory occupied by the model, greatly shortens the inference time, and reduces cost.
For further understanding of the technical solutions of the present application, the following description is made with reference to the accompanying drawings:
referring to fig. 1, fig. 1 is a flowchart of a method for training a machine reading understanding model according to the present invention, and it should be noted that the method for training a machine reading understanding model according to the present invention is not limited to the steps and the sequence in the flowchart shown in fig. 1, and the steps in the flowchart may be added, removed, or changed according to different requirements.
As shown in fig. 1, the training method of machine reading understanding model provided by the present invention includes the following steps:
and S10, acquiring the trained first machine reading understanding model and a preset hierarchical mapping relation.
In this embodiment, the first training set is used to train the first machine reading understanding model. The first training set refers to the set of sample data selected for model training. It comprises a plurality of training text corpora, the questions corresponding to the training text corpora, and the answers corresponding to the questions. Thus, a sample consists of a text corpus, a question, and the answer to the question.
For example, take the text corpus "南京市长江大桥" (rendered in translation as "Changjiang river bridge in Nanjing city"). Different sentence breaks produce sentences with different semantics, and the sentence is understood differently in different semantic scenarios: "南京市/长江大桥" ("Nanjing City / Yangtze River Bridge") or "南京市长/江大桥" ("Nanjing Mayor / Jiang Daqiao"). For the character "长" in the sentence, in the former semantic scenario it must be combined with the character "江" (river) to form a correct semantic unit; in the latter semantic scenario it must be combined with the character "市" (city) to form a correct semantic unit. Thus, the answer to the question is output as: Nanjing City / Yangtze River Bridge (the position of the sentence break) or Nanjing Mayor / Jiang Daqiao (the position of the sentence break).
Further, when selecting samples, the subject matter of the samples is chosen according to the application field of the model; for example, for a machine reading understanding field concerning daily life, documents on daily-life topics are selected.
The preset language model refers to the model to be trained, which here is a generic BERT model. The generic BERT model comprises an input layer, a cascade of several interaction layers (Transformer layers), and an output layer, where the output layer is a fully connected layer. The text corpus and question of each sample in the first training set are then input into the language model to be trained, and the following processing steps are executed:
S121: The model performs word segmentation and character-level preprocessing on the sentences in the text corpus and the question, maps the words and characters to the corresponding word vectors and character vectors in the vocabulary, concatenates the word vectors and character vectors, and forms the initial feature-vector representation of the text corpus and the question through a highway network.
S122: The shallow word-vector representations of the text corpus and the question are each passed into an embedding encoder layer for processing. The embedding encoder layer is composed of encoder blocks; a single encoder block consists, from bottom to top, of a positional encoding, a convolution (conv) layer, a self-attention layer, and a feed-forward network (FFN) layer. The convolution in the encoder block captures local context structure, while self-attention captures global interactions across the document. Through this processing, deeper feature representations of the text corpus and the question are learned.
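Minimal sketches of steps S121 and S122 are given below, assuming PyTorch. The vocabulary sizes, embedding dimensions, number of heads, kernel size, residual connections with layer normalization, and the simplification of using one character index per token are illustrative assumptions, not parameters stated in this application.

```python
import torch
import torch.nn as nn

class HighwayEmbedding(nn.Module):
    """Sketch of S121: word vectors and character vectors are looked up, concatenated,
    and passed through a highway layer to form the initial feature representation."""

    def __init__(self, word_vocab=30000, char_vocab=5000, word_dim=200, char_dim=100):
        super().__init__()
        dim = word_dim + char_dim
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)   # one char index per token, for brevity
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, word_ids, char_ids):
        # word_ids, char_ids: (batch, seq_len)
        x = torch.cat([self.word_emb(word_ids), self.char_emb(char_ids)], dim=-1)
        t = torch.sigmoid(self.gate(x))        # highway gate
        h = torch.relu(self.transform(x))      # transformed candidate
        return t * h + (1 - t) * x             # initial feature-vector representation

class EncoderBlock(nn.Module):
    """Sketch of S122: position encoding -> convolution -> self-attention -> feed-forward."""

    def __init__(self, dim=128, heads=8, kernel=7, max_len=512):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)                                # positional encoding
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)         # local context
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)      # global interactions
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        x = x + self.pos(torch.arange(x.size(1), device=x.device))
        x = self.norm1(x + self.conv(x.transpose(1, 2)).transpose(1, 2))
        attn_out, _ = self.attn(x, x, x)
        x = self.norm2(x + attn_out)
        return self.norm3(x + self.ffn(x))
```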
S123: A document-to-question attention (context-to-query attention) and a question-to-document attention (query-to-context attention) are calculated through a bidirectional attention mechanism, and a query-aware vector representation of the original text is finally formed based on the attention calculation.
S124: The bidirectional attention vectors obtained in step S123 are processed by the encoder blocks.
S125: the coding blocks obtained in step S124 are spliced together, and the start and end positions of the answer to the question and the corresponding probability distributions are obtained through the linear layer and the softmax layer, respectively.
Through the above steps S121-S125, the language model outputs the answer corresponding to each question in the first training set, where the answer includes the answer content, the start and end positions of the answer, and the probabilities of those start and end positions.
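As a minimal sketch of the span-prediction step S125 (assuming PyTorch; the hidden dimension and the name SpanHead are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Sketch of S125: a linear layer plus softmax yields probability distributions
    over the start and end positions of the answer."""

    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, 2)   # one logit each for start and end

    def forward(self, hidden):
        # hidden: (batch, seq_len, dim), the spliced output of the encoder blocks
        start_logits, end_logits = self.proj(hidden).split(1, dim=-1)
        start_probs = torch.softmax(start_logits.squeeze(-1), dim=-1)
        end_probs = torch.softmax(end_logits.squeeze(-1), dim=-1)
        return start_probs, end_probs
```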
The parameters of the language model are then adjusted with the cross-entropy loss function as the objective, and steps S121-S125 are executed again for further training, so that the probability distribution of the answers output by the language model is continuously optimized and the predicted answers to the questions in the first training set move closer to the standard answers; that is, training stops when the cross-entropy loss value between the answers output by the language model for the questions in the first training set and the standard answers no longer changes. Thus, after multiple training iterations, the first machine reading understanding model is obtained. The first machine reading understanding model is also called the teacher model.
Illustratively, the training of the first machine reading understanding model specifically includes:
S11, acquiring a first training set and constructing a language model; the first training set comprises a plurality of training text corpora and the questions corresponding to the training text corpora;
S12, inputting the first training set into the language model to obtain the answers corresponding to the questions in the first training set output by the language model;
S13, adjusting the parameters of the language model with the cross-entropy loss function as the objective, and stopping training when the cross-entropy loss value between the answers output by the language model for the questions in the first training set and the standard answers no longer changes;
S14, taking the language model obtained when the cross-entropy loss value between the answers to the questions in the first training set and the standard answers no longer changes as the first machine reading understanding model.
The hierarchical structure of the pre-trained first machine reading understanding model comprises an input layer, a plurality of cascaded interaction layers, and an output layer, and this structure serves as the basis for subsequently training the second machine reading understanding model (i.e., the student model).
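A minimal sketch of the teacher-model training loop (steps S11-S14 above) is shown below, assuming PyTorch; the data-loader format, the model's output signature, and the tolerance used to decide that the loss "no longer changes" are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_teacher(model, loader, optimizer, tol=1e-4, max_epochs=100):
    """Sketch of S11-S14: train the language model with cross-entropy on the answer
    start/end positions and stop once the epoch loss no longer changes."""
    prev_loss = None
    for epoch in range(max_epochs):
        total = 0.0
        for inputs, start_gold, end_gold in loader:          # assumed batch format
            start_logits, end_logits = model(inputs)          # (batch, seq_len) position logits
            loss = F.cross_entropy(start_logits, start_gold) + F.cross_entropy(end_logits, end_gold)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if prev_loss is not None and abs(prev_loss - total) < tol:
            break                                             # loss no longer changes -> stop
        prev_loss = total
    return model                                              # trained first machine reading understanding model
```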
S20, constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relation; wherein the number of layers of the second machine reading understanding model is less than the number of layers of the first machine reading understanding model.
In this embodiment, the first machine reading understanding model obtained in step S10 is a complex model or a combination of several models. In order to further increase inference speed and improve training performance, the knowledge learned by the first machine reading understanding model is usually migrated to another lightweight model, which can have the same capability as the first machine reading understanding model while greatly reducing the occupied GPU memory and inference time. The technique of migrating the knowledge learned by the original model to a lightweight model is known as model distillation. The original model may be referred to as the teacher model, such as the first machine reading understanding model obtained in S10 above, and the lightweight model may be referred to as the student model, such as the second machine reading understanding model.
In specific implementation, the second machine reading understanding model to be trained is constructed in advance. The second machine reading understanding model is constructed based on the first machine reading understanding model and the hierarchical mapping relation. The hierarchical mapping relation means that each interaction layer of the second machine reading understanding model has a corresponding interaction layer in the first machine reading understanding model. The hierarchical mapping relation is preconfigured.
Correspondingly, the function formula of the preset hierarchical mapping relationship is as follows:
n = g(m), where:
when m = 1, n = 1;
when m < (M/2), n = (m/2) + 1;
when m > (M/2), n = (m/2) - 1;
Here 1 ≤ m ≤ M, M denotes the number of interaction layers in the first machine reading understanding model, and m denotes the m-th of the plurality of interaction layers in the first machine reading understanding model; 1 ≤ n ≤ N, N denotes the number of interaction layers in the second machine reading understanding model, and n denotes the n-th of the plurality of interaction layers in the second machine reading understanding model.
It should be noted that the input layer in the first machine reading understanding model corresponds to the input layer in the second machine reading understanding model, and the output layer in the first machine reading understanding model corresponds to the output layer in the second machine reading understanding model.
That is to say, according to the layer structures of the first machine reading understanding model and the second machine reading understanding model, a two-dimensional list pairing each layer of the second machine reading understanding model with its corresponding layer of the first machine reading understanding model is established and stored.
For example: L = [(m1, n1), (m2, n2), (m3, n3), (m4, n4), …]
As another example: the number of interaction layers (Transformer layers) in the teacher model is 12, that is, M = 12, and the number of interaction layers (Transformer layers) in the student model is 4, that is, N = 4. The two-dimensional list of correspondences is then L = [(0,0), (1,1), (4,2), (8,3), (11,4), (13,5)].
Wherein, (0,0) represents that the input layer of the teacher model corresponds to the input layer of the student model;
(1,1) representing that a first interaction layer in the teacher model corresponds to a first interaction layer in the student model;
(4,2) representing that the fourth interaction layer in the teacher model corresponds to the second interaction layer in the student model;
(8,3) represents that the eighth interaction layer in the teacher model corresponds to the third interaction layer in the student model;
(11,4) representing that the eleventh interaction layer in the teacher model corresponds to the fourth interaction layer in the student model;
and (13,5) represents that the output layer of the teacher model corresponds to the output layer of the student model.
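For reference, the same example can be written out directly as data; a minimal sketch in Python follows (the variable names are illustrative, and the per-layer weights are those used in the embodiment described further below):

```python
# Hierarchical mapping for the 12-layer teacher / 4-layer student example above:
# index 0 is the input layer, 13 (teacher) / 5 (student) the output layer, and the
# tuples in between pair teacher interaction layers with student interaction layers.
layer_map = [(0, 0), (1, 1), (4, 2), (8, 3), (11, 4), (13, 5)]

# Per-layer weights later used when the total loss is computed as a weighted sum.
layer_weights = [0.1, 0.1, 0.15, 0.15, 0.2, 0.3]
```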
In this way, the second machine reading understanding model is constructed from the trained first machine reading understanding model and the preset hierarchical mapping relation. The hierarchical structure of the second machine reading understanding model comprises an input layer, a plurality of cascaded interaction layers, and an output layer; however, the number of interaction layers in the second machine reading understanding model is far smaller than the number of interaction layers in the first machine reading understanding model.
It should be noted that, in both the first machine reading understanding model and the second machine reading understanding model, the output layer is a fully connected layer.
Illustratively, constructing the second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relation specifically includes:
S21, determining an input layer and an output layer corresponding to the second machine reading understanding model based on the input layer and the output layer of the first machine reading understanding model;
S22, determining the number of interaction layers in the second machine reading understanding model according to the hierarchical mapping relation; wherein each interaction layer in the second machine reading understanding model has its corresponding interaction layer in the first machine reading understanding model.
And S30, acquiring a training set and a target loss function, training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model to obtain a trained second machine reading understanding model, and taking the trained second machine reading understanding model as the target machine reading understanding model.
In this embodiment, in order to improve training quality, the second machine reading understanding model needs to be initialized after it is constructed, so as to achieve optimal training performance. Initialization refers to initializing the parameters of the second machine reading understanding model, and different initializations lead to different loss functions.
In practical applications there are three initialization modes, corresponding to three loss functions respectively: first, setting the parameters of the second machine reading understanding model to random values; second, setting the parameters of the second machine reading understanding model to a fixed value; and third, using the values of selected layers of the first machine reading understanding model as the parameters of the second machine reading understanding model. The initialization process of the second machine reading understanding model therefore also determines the target loss function that optimizes the training effect. In this embodiment, the third initialization mode achieves the best training effect. Thus, based on the hierarchical mapping relation and the trained first machine reading understanding model, an initialized second machine reading understanding model is output.
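A minimal sketch of the third initialization mode is given below, assuming both models expose their interaction layers as a `layers` list of modules with matching shapes; the function and attribute names are assumptions for illustration.

```python
def init_student_from_teacher(student, teacher, layer_map):
    """Sketch of the third initialization mode: copy the parameters of the mapped
    teacher interaction layers into the corresponding student interaction layers."""
    for teacher_idx, student_idx in layer_map[1:-1]:      # skip the input/output entries
        teacher_layer = teacher.layers[teacher_idx - 1]   # 1-based indices in the mapping
        student.layers[student_idx - 1].load_state_dict(teacher_layer.state_dict())
    return student
```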
After the second machine reading understanding model is initialized, the first machine reading understanding model and the second machine reading understanding model are trained jointly, and the model parameters are continuously corrected through the target loss function so as to maximize performance. Joint training ordinarily means that the parameters of both models are continuously updated with different weights; in this embodiment, however, the joint training only updates the parameters of the student model according to the loss function and does not update the parameters of the teacher model. During training, the hierarchical mapping relation between the teacher model and the student model and the weight values of the layers having the hierarchical mapping relation remain unchanged.
Correspondingly, during joint training, training with different loss functions yields target machine reading understanding models with different performance. The candidate loss functions can be divided into a first loss function, a second loss function, a third loss function, and so on.
The first loss function is formulated as
L_pred = -softmax(z_T) · log softmax(z_S),
where z_T denotes the prediction vector of the layer of the first machine reading understanding model corresponding to the t-th layer of the second machine reading understanding model, z_S denotes the prediction vector of the t-th layer of the second machine reading understanding model, softmax is the normalized exponential function that converts a vector into a probability distribution, and log denotes taking the logarithm of the vector.
The second loss function is expressed as the mean squared error between the prediction vector of the first machine reading understanding model and the prediction vector of the second machine reading understanding model.
The third loss function is expressed as a weighted sum of loss values of layers having a hierarchical mapping relationship in the first machine-reading understanding model and the second machine-reading understanding model.
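The three candidate loss functions can be sketched as follows (assuming PyTorch; the function names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def first_loss(z_t, z_s):
    """L_pred = -softmax(z_T) · log softmax(z_S): soft cross-entropy between the
    teacher's and the student's prediction vectors."""
    return -(F.softmax(z_t, dim=-1) * F.log_softmax(z_s, dim=-1)).sum(dim=-1).mean()

def second_loss(z_t, z_s):
    """Mean squared error between teacher and student prediction vectors."""
    return F.mse_loss(z_s, z_t)

def third_loss(layer_losses, weights):
    """Weighted sum of the loss values of the layer pairs having a hierarchical mapping."""
    return sum(w * l for w, l in zip(weights, layer_losses))
```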
Illustratively, the determining process of the target loss function specifically includes:
m1, acquiring a second training set and a plurality of loss functions to be tested; wherein, the second training set corresponding to each loss function to be tested is the same;
the second training set refers to a set of sample data selected for model training. The second training set comprises a plurality of training text corpora, questions corresponding to the training text corpora and answers corresponding to the questions. Thus, a sample is made up of a text corpus, questions, and answers to the questions.
M2, based on the same second training set, respectively training the first machine reading understanding model and the second machine reading understanding model through the plurality of loss functions to be tested, and obtaining a training result corresponding to each loss function to be tested;
and M3, determining a target loss function according to the training result.
The training result refers to the probability distribution with which the answers output by the first machine reading understanding model and the second machine reading understanding model approach the standard answers. Correspondingly, the loss function for which the probability of the output answers approaching the standard answers is largest is taken as the target loss function.
Multiple experiments show that with the target loss function defined as the weighted sum of the loss values of the layers in the second machine reading understanding model, the trained second machine reading understanding model reaches the best fitting state and its performance is maximized. That is, the first machine reading understanding model and the second machine reading understanding model are trained simultaneously through the training set and the target loss function to obtain the trained second machine reading understanding model. The training set refers to the set of sample data selected for model training; it comprises a plurality of training text corpora, the questions corresponding to the training text corpora, and the answers corresponding to the questions, so that a sample consists of a text corpus, a question, and the answer to the question.
The training set, the first training set, and the second training set may be the same or different.
Illustratively, acquiring a training set and a target loss function, and simultaneously training the first machine reading understanding model and the second machine reading understanding model through the training set and the target loss function to obtain a trained second machine reading understanding model, specifically includes:
S31, acquiring a training set, inputting the training set into the first machine reading understanding model and the second machine reading understanding model respectively, and then training the second machine reading understanding model;
S32, calculating a target loss value of the second machine reading understanding model according to the target loss function;
S33, adjusting the parameters of the second machine reading understanding model based on the target loss value, and stopping training of the second machine reading understanding model when the target loss value is smaller than a preset threshold;
S34, acquiring the second machine reading understanding model obtained when the target loss value is smaller than the preset threshold, and taking it as the target machine reading understanding model.
In specific implementation, the text corpora and questions of all samples in the training set are input into the teacher model and the student model simultaneously, so as to continuously optimize the probability distribution of the answers output by the student model and bring the predicted answers to the questions in the training set closer to the standard answers. The training process is similar to steps S121-S125 and is therefore not repeated here.
In each training round, the loss values between the layers of the teacher model and the student model that have a hierarchical mapping relation are calculated. The loss value is calculated by mapping a given layer of the student model and the layer of the teacher model associated with it into a vector space of the same dimension, obtaining the two prediction vectors output by the associated layers, and taking the distance between the two prediction vectors as the loss value of that pair of associated layers. Then, based on the weight values of the associated layers, the total loss value of the student model is calculated. That is, in each iteration, the weighted sum over the layer pairs of the first machine reading understanding model and the second machine reading understanding model that have a hierarchical mapping relation is calculated to obtain the total loss value.
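A minimal sketch of this per-iteration calculation is given below (assuming PyTorch; `teacher_states` and `student_states` are assumed to be lists of per-layer hidden states indexed as in `layer_map`, and `proj` is an assumed linear projection bringing the student states to the teacher dimension; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def total_distill_loss(teacher_states, student_states, layer_map, weights, proj):
    """Sketch: for each mapped (teacher, student) layer pair, project both outputs into
    a common space, take their distance as the layer loss, and combine the layer
    losses into a weighted sum (the total loss value)."""
    total = 0.0
    for (t_idx, s_idx), w in zip(layer_map, weights):
        t_hidden = teacher_states[t_idx]            # (batch, seq_len, teacher_dim)
        s_hidden = proj(student_states[s_idx])      # project to the teacher dimension
        total = total + w * F.mse_loss(s_hidden, t_hidden)   # distance between prediction vectors
    return total
```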
For example, as shown in FIG. 2:
the weight coefficient was defined as [0.1,0.1,0.15,0.15,0.2,0.3]
In a certain iteration:
the loss value of the hierarchical mapping relation (0,0) is 1.0;
the loss value of the hierarchical mapping relation (1,1) is 2.0;
the loss value of the hierarchical mapping relation (4,2) is 3.0;
the loss value of the hierarchical mapping relation (8,3) is 4.0;
the loss value of the hierarchical mapping relation (11,4) is 5.0;
the loss value of the hierarchical mapping relation (13,5) is 6.0;
the total loss value was 0.1 × 1.0+0.1 × 2.0+0.15 × 3.0+0.15 × 4.0+0.2 × 5.0+0.3 × 6.0.
In this embodiment, in each round of joint training, the total loss value is calculated through the target loss function and compared with a preset threshold. If the total loss value is greater than the preset threshold, training continues and the parameters of the second machine reading understanding model are updated; if the total loss value is smaller than the preset threshold, training of the second machine reading understanding model stops, and the second machine reading understanding model corresponding to that total loss value is taken as the target machine reading understanding model. The preset threshold is 0.5-0.8 and can be set according to different training scenarios and requirements; in this embodiment the preset threshold is preferably 0.5.
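Putting the pieces together, a minimal sketch of the joint training loop follows; it reuses `total_distill_loss` from the sketch above, and the data-loader format, the models' output signatures, and the step limit are assumptions for illustration.

```python
import torch

def joint_train(teacher, student, loader, optimizer, layer_map, weights, proj,
                threshold=0.5, max_steps=100000):
    """Sketch of the joint training: the teacher is frozen, only the student's
    parameters are updated, and training stops once the total loss value falls
    below the preset threshold (0.5 in this embodiment)."""
    teacher.eval()
    step = 0
    while step < max_steps:
        for batch in loader:
            with torch.no_grad():
                teacher_states = teacher(batch)          # teacher parameters stay fixed
            student_states = student(batch)
            loss = total_distill_loss(teacher_states, student_states, layer_map, weights, proj)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if loss.item() < threshold:
                return student                           # target machine reading understanding model
    return student
```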
Thus, after training is completed through steps S10-S30, the student model is output. Performance tests show that the occupied GPU memory is reduced by half and inference is roughly three times faster. Therefore, by establishing the hierarchical mapping relation between the first machine reading understanding model and the second machine reading understanding model, calculating an error with the loss function for each pair of layers having a corresponding relation, calculating the total loss value as a weighted sum, and adjusting the parameters of the second machine reading understanding model based on the total loss value, the target machine reading understanding model is obtained, which facilitates production and popularization and brings convenience to users.
Further, in practical applications, only the final target machine reading understanding model is deployed.
Experimental data:
the data set contains 11 ten thousand english (question, text, answer) triplets, wherein the training set contains 10 ten thousand (question, text, answer) triplets, the verification set contains 5000 triplets, and the test set contains 5000 triplets. The data as shown in table 1 were obtained by the above training method of the present application.
TABLE 1
Learning rate | Epochs | EM | F1 | Remarks
0.00003 | 10 | 60.5 | 59.9 | Under-fitting
0.00003 | 20 | 71.8 | 62.2 | Under-fitting
0.00003 | 30 | 78.3 | 70.0 | Under-fitting
0.00003 | 40 | 86.8 | 79.3 | Under-fitting
0.00003 | 50 | 92.9 | 81.4 | Optimal
0.00003 | 60 | 87.8 | 79.3 | Over-fitting
0.00003 | 70 | 86.6 | 78.9 | Over-fitting
In Table 1 above, Epochs is the number of training iteration rounds, EM refers to the exact-match accuracy, and F1 combines precision and recall.
After many rounds of iterative training, the row marked Optimal in the remarks of Table 1 corresponds to the final student model, i.e., the target machine reading understanding model. Performance of its teacher model: 800 MB of GPU memory occupied, with an average single-inference time of 90 milliseconds. Performance of the student model: 400 MB of GPU memory occupied, with an average single-inference time of 30 milliseconds.
In one embodiment, the present invention further provides a training device for a machine reading understanding model. As shown in fig. 3, the device 100 includes a processor 11 and a memory 22 connected to the processor 11. Fig. 3 shows only some of the components of the device 100; it is to be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
The memory 22 may be an internal storage unit of the device 100 in some embodiments, such as a memory of the device 100. The memory 22 may also be an external storage device of the apparatus 100 in other embodiments, such as a plug-in U-disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the apparatus 100. Further, the memory 22 may also include both internal storage units and external storage devices of the apparatus 100. The memory 22 is used for storing application software installed in the device 100 and various types of data, such as training program code of the machine reading understanding model. The memory 22 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 22 stores a training program of the machine reading understanding model, which can be executed by the processor 11, so as to implement the training method of the machine reading understanding model in the present application, as described above.
The processor 11 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor, a mobile-phone baseband processor, or another data processing chip, and is configured to run the program code stored in the memory 22 or to process data, for example to execute the training method of the machine reading understanding model described above.
Based on the foregoing method, the present invention also provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps of the training method for machine-readable understanding models described above.
Those skilled in the art will appreciate that fig. 3 is a block diagram of only a portion of the structure associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, and that a particular intelligent terminal may include more or less components than those shown, or may combine certain components, or have a different arrangement of components. The processor, when executing the computer program, implements the steps of the training method for machine reading understanding model described above, which are specifically described above.
In summary, the present invention discloses a training method, device and medium for a machine reading understanding model. The method includes: obtaining a trained first machine reading understanding model and a preset hierarchical mapping relation; constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relation, wherein the number of layers of the second machine reading understanding model is less than that of the first machine reading understanding model; and obtaining a training set and a target loss function, training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model, and taking the trained second machine reading understanding model as the target machine reading understanding model. According to the method and the device, a hierarchical mapping relation is established between the first machine reading understanding model and the second machine reading understanding model, an error is calculated with the loss function for each pair of layers having a corresponding relation, the total loss value is then calculated as a weighted sum, and the parameters of the second machine reading understanding model are adjusted based on the total loss value to obtain the target machine reading understanding model, which greatly reduces the GPU memory occupied by the model, greatly shortens the inference time, and reduces cost.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program may be stored in a computer readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A training method of a machine reading understanding model is characterized by comprising the following steps:
acquiring a trained first machine reading understanding model and a preset hierarchical mapping relation;
constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relation; wherein the number of layers of the second machine reading understanding model is less than the number of layers of the first machine reading understanding model;
and acquiring a training set and a target loss function, training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model to obtain a trained second machine reading understanding model, and taking the trained second machine reading understanding model as the target machine reading understanding model.
2. The method for training a machine-reading understanding model according to claim 1, wherein the step of training the trained first machine-reading understanding model specifically comprises:
acquiring a first training set and constructing a language model; the first training set comprises a plurality of training text corpora and the questions corresponding to the training text corpora;
inputting the first training set into the language model to obtain answers corresponding to all questions in the first training set output by the language model;
adjusting parameters of the language model by taking a cross entropy loss function as a target so as to stop training when cross entropy loss values of answers corresponding to all questions in the first training set output by the language model and standard answers do not change any more;
and when the cross entropy loss value of the answer corresponding to each question in the first training set and the standard answer is not changed any more, the corresponding language model is used as a first machine reading understanding model.
3. The method of claim 1, wherein the hierarchy of the first machine-readable understanding model comprises an input layer, a plurality of cascaded interaction layers, and an output layer, and the hierarchy of the second machine-readable understanding model comprises an input layer, a plurality of cascaded interaction layers, and an output layer, wherein the number of interaction layers of the second machine-readable understanding model is less than the number of interaction layers of the first machine-readable understanding model.
4. The method for training a machine-readable understanding model according to claim 3, wherein the predetermined hierarchical mapping is a function of:
n = g(m), where:
when m = 1, n = 1;
when m < (M/2), n = (m/2) + 1;
when m > (M/2), n = (m/2) - 1;
Here 1 ≤ m ≤ M, M denotes the number of interaction layers in the first machine reading understanding model, and m denotes the m-th of the plurality of interaction layers in the first machine reading understanding model; 1 ≤ n ≤ N, N denotes the number of interaction layers in the second machine reading understanding model, and n denotes the n-th of the plurality of interaction layers in the second machine reading understanding model.
5. The method for training a machine reading understanding model according to claim 4, wherein the constructing a second machine reading understanding model based on the first machine reading understanding model and the hierarchical mapping relationship specifically includes:
determining an input layer and an output layer corresponding to the second machine reading understanding model based on the input layer and the output layer of the first machine reading understanding model;
determining the number of layers of a plurality of interaction layers in the second machine reading understanding model according to the layer mapping relation; wherein each interaction layer in the second machine-reading understanding model has its corresponding interaction layer in the first machine-reading understanding model.
6. The method for training a machine-readable understanding model of claim 1, wherein the step of obtaining the target loss function specifically comprises:
acquiring a second training set and a plurality of loss functions to be tested; wherein, the second training set corresponding to each loss function to be tested is the same;
based on the same second training set, training the first machine reading understanding model and the second machine reading understanding model respectively through the loss functions to be tested to obtain a training result corresponding to each loss function to be tested;
and determining a target loss function according to the training result.
7. The method of claim 6, wherein the target loss function is a weighted sum of loss values of layers in the second machine-read understanding model.
8. The method of claim 7, wherein the obtaining a training set and a target loss function, and training the second machine reading understanding model through the training set, the target loss function and the first machine reading understanding model to obtain a trained second machine reading understanding model specifically comprises:
acquiring a training set, inputting the training set into the first machine reading understanding model and the second machine reading understanding model respectively, and then training the second machine reading understanding model;
calculating a target loss value of the second machine reading understanding model according to the target loss function;
adjusting parameters of the second machine reading understanding model based on the target loss value until the target loss value is smaller than a preset threshold value, and stopping training of the second machine reading understanding model;
and acquiring a corresponding second machine reading understanding model when the target loss value is smaller than a preset threshold value, and taking the second machine reading understanding model as a target machine reading understanding model.
9. A training apparatus for a machine-readable understanding model, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the training method for a machine-readable understanding model according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a machine-reading understanding model of any one of claims 1 to 8.
CN202010908799.XA 2020-09-02 2020-09-02 Training method, device and medium for machine reading understanding model Pending CN114201588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010908799.XA CN114201588A (en) 2020-09-02 2020-09-02 Training method, device and medium for machine reading understanding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010908799.XA CN114201588A (en) 2020-09-02 2020-09-02 Training method, device and medium for machine reading understanding model

Publications (1)

Publication Number Publication Date
CN114201588A true CN114201588A (en) 2022-03-18

Family

ID=80644323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010908799.XA Pending CN114201588A (en) 2020-09-02 2020-09-02 Training method, device and medium for machine reading understanding model

Country Status (1)

Country Link
CN (1) CN114201588A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315797A (en) * 2017-06-19 2017-11-03 江西洪都航空工业集团有限责任公司 A kind of Internet news is obtained and text emotion forecasting system
CN109685212A (en) * 2018-12-14 2019-04-26 安徽省泰岳祥升软件有限公司 A kind of machine reading understands the training method stage by stage and device of model
CN110781829A (en) * 2019-10-28 2020-02-11 华北电力大学 Light-weight deep learning intelligent business hall face recognition method
CN110852040A (en) * 2019-11-05 2020-02-28 中电科大数据研究院有限公司 Punctuation prediction model training method and text punctuation determination method
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315797A (en) * 2017-06-19 2017-11-03 江西洪都航空工业集团有限责任公司 A kind of Internet news is obtained and text emotion forecasting system
CN109685212A (en) * 2018-12-14 2019-04-26 安徽省泰岳祥升软件有限公司 A kind of machine reading understands the training method stage by stage and device of model
CN110781829A (en) * 2019-10-28 2020-02-11 华北电力大学 Light-weight deep learning intelligent business hall face recognition method
CN110852040A (en) * 2019-11-05 2020-02-28 中电科大数据研究院有限公司 Punctuation prediction model training method and text punctuation determination method
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOQI JIAO ET AL.: "TinyBERT: Distilling BERT for Natural Language Understanding", ARXIV, 23 September 2019 (2019-09-23), pages 3-9 *

Similar Documents

Publication Publication Date Title
CN108733792B (en) Entity relation extraction method
US11468246B2 (en) Multi-turn dialogue response generation with template generation
CN111612134B (en) Neural network structure searching method and device, electronic equipment and storage medium
CN112016332B (en) Multi-modal machine translation method based on variational reasoning and multi-task learning
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN118194960A (en) Regularized neural network architecture search
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN111753076B (en) Dialogue method, dialogue device, electronic equipment and readable storage medium
CN110968671A (en) Intent determination method and device based on Bert
CN114596566B (en) Text recognition method and related device
CN114386409A (en) Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium
CN118119954A (en) Hint adjustment using one or more machine learning models
WO2021257160A1 (en) Model selection learning for knowledge distillation
CN113313250B (en) Neural network training method and system adopting mixed precision quantization and knowledge distillation
CN117787241A (en) Method and device for controlling length of generated text based on large language model
CN111797220A (en) Dialog generation method and device, computer equipment and storage medium
CN114201588A (en) Training method, device and medium for machine reading understanding model
CN116663523A (en) Semantic text similarity calculation method for multi-angle enhanced network
CN113539241B (en) Speech recognition correction method and corresponding device, equipment and medium thereof
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN115906879A (en) Translation model training method for vertical domain and storage medium
CN109815323B (en) Human-computer interaction training question-answer generation algorithm
CN112560507A (en) User simulator construction method and device, electronic equipment and storage medium
CN114492457B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
CN116776870B (en) Intention recognition method, device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination