CN112712130B - Visual understanding model training method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN112712130B
Authority
CN
China
Prior art keywords
samples
image
sample
representative
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110044054.8A
Other languages
Chinese (zh)
Other versions
CN112712130A (en)
Inventor
戴琼海 (Dai Qionghai)
郭雨晨 (Guo Yuchen)
方璐 (Fang Lu)
丁贵广 (Ding Guiguang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110044054.8A
Publication of CN112712130A
Application granted
Publication of CN112712130B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a visual understanding model training method and device. The method comprises the following steps: respectively encoding a first image sample of the first visual information understanding task and a second image sample of the second visual information understanding task to obtain a first encoding result corresponding to the first image sample and a second encoding result corresponding to the second image sample; clustering a plurality of first image samples based on the first coding result to obtain a cluster category to which each first image sample belongs; forming at least one batch of training data by each first image sample according to the same category to train a decoder, and after training of all categories is finished, selecting M representative samples from the first image samples and storing the M representative samples in a primary memory of a multi-stage memory module; and forming at least one batch of training data according to the same clustering class based on all representative samples in the primary memory in the multi-stage memory module and the second image samples to train the decoder.

Description

Visual understanding model training method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer multimedia technologies, and in particular, to a method and an apparatus for training a visual understanding model, a computer device, and a storage medium.
Background
The rapid popularization of internet technology and smartphones has caused the number of pictures and videos on the network to grow explosively, and using computers to understand visual information has attracted wide attention and application. A visual understanding system understands pictures and videos based on their information characteristics and generates human-readable semantic descriptions of the visual data, such as a sentence. Among them, picture title (caption) generation is an important branch of semantic understanding technology for images and videos. Existing semantic understanding technologies for images and videos are based on an encoder-decoder structure and improve single-task performance mainly by improving the encoder or decoder, adding a generative adversarial network, and the like; their focus is the improvement of single-task performance. However, as the semantic understanding tasks of images and videos become more complex, for example with flexible changes in the style of generated sentences and large differences in image and video types, existing methods have to train different models for different tasks and cannot complete the semantic understanding of different images or videos with a single model, which greatly increases time cost and space cost.
Therefore, how to reduce the forgetting rate of the model as much as possible and establish a model capable of completing a plurality of visual information understanding tasks is an important research topic in academia and industry.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a visual understanding model training method, which can utilize a multi-level memory module to realize the memory transfer of information, effectively retain the information of old tasks in a model, reduce the forgetting rate of the model, and improve the self-adaptive performance of the model, so as to effectively reduce the time cost and the space cost of model training, and improve the utilization rate of the model.
A second object of the present application is to provide a visual understanding model training apparatus.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, a first aspect of the present application provides a method for training a visual understanding model, including:
acquiring a plurality of first image samples for a first visual information understanding task and acquiring a plurality of second image samples for a second visual information understanding task;
encoding each first image sample and each second image sample respectively based on an encoder to obtain a first encoding result corresponding to each first image sample and a second encoding result corresponding to each second image sample;
clustering the plurality of first image samples based on each first encoding result to obtain a cluster category to which each first image sample belongs;
according to the cluster type of each first image sample, forming at least one batch of training data of each first image sample according to the same type to train a decoder, selecting M representative samples from the plurality of first image samples after training of all types is finished, and storing a first coding result corresponding to each representative sample into a first-level memory of a multi-level memory module;
acquiring coding results corresponding to all representative samples in one-level memory in the multi-level memory module, and clustering the plurality of second image samples and the representative samples based on the coding results corresponding to the representative samples and each second coding result to obtain clustering categories to which each second image sample and each representative sample respectively belong;
and forming at least one batch of training data by each second image sample and each representative sample according to the same cluster type, and training the decoder until all the cluster types are trained.
The embodiment of the second aspect of the present application provides a visual understanding model training device, including:
the system comprises a sample acquisition module, a first image analysis module and a second image analysis module, wherein the sample acquisition module is used for acquiring a plurality of first image samples aiming at a first visual information understanding task and acquiring a plurality of second image samples aiming at a second visual information understanding task;
an encoding module, configured to encode each first image sample and each second image sample respectively based on an encoder to obtain a first encoding result corresponding to each first image sample and a second encoding result corresponding to each second image sample;
the clustering module is used for clustering the plurality of first image samples based on each first coding result so as to obtain a clustering category to which each first image sample belongs;
the training module is used for forming at least one batch of training data from each first image sample according to the same category to train a decoder according to the cluster category to which each first image sample belongs, selecting M representative samples from the plurality of first image samples after training of all categories is finished, and storing a first coding result corresponding to each representative sample in a first-level memory of the multi-level memory module;
the clustering module is further configured to obtain a coding result corresponding to all representative samples in a first-level memory of the multi-level memory module, and cluster the plurality of second image samples and the representative samples based on the coding result corresponding to all representative samples and each second coding result to obtain a cluster category to which each second image sample and each representative sample belong;
the training module is further configured to combine each second image sample and each representative sample into at least one batch of training data according to the same cluster type, and train the decoder until all cluster types are trained.
In an embodiment of the third aspect of the present application, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the visual understanding model training method described in the embodiment of the first aspect of the present application is implemented.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method for training a visual understanding model according to the embodiment of the first aspect of the present application.
According to the technical scheme of the embodiment of the application, a first image sample of a first visual information understanding task and a second image sample of a second visual information understanding task are respectively encoded to obtain a first encoding result corresponding to the first image sample and a second encoding result corresponding to the second image sample; clustering a plurality of first image samples based on the first coding result to obtain a cluster category to which each first image sample belongs; forming at least one batch of training data by each first image sample according to the same category to train a decoder, and after training of all categories is finished, selecting M representative samples from the first image samples and storing the M representative samples in a primary memory of a multi-stage memory module; and forming at least one batch of training data according to the same clustering class based on all representative samples and second image samples in the first-level memory in the multi-level memory module to train the decoder until all clustering classes are trained. Therefore, the multi-task function of the same model is guaranteed by fusing multi-level memory in the model, good performance of the model in both an old task and a new task is guaranteed, the model is built at the lowest time cost and space cost, the function of a visual understanding system is improved, and the application effect of the method is improved. The method and the device can solve the problem that a training model of an old task forgets the old task after new task training is carried out again in picture title generation, realize that a single model completes a multi-task function by fusing a multi-level memory module in the model, reduce the training time and space cost as far as possible, and improve the self-adaptive performance of the model.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a method for training a visual understanding model according to an embodiment of the present disclosure;
FIG. 2 is a schematic block diagram of a visual understanding model system according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a visual understanding model training device according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device according to one embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The rapid popularization of internet technology and smartphones has caused the number of pictures and videos on the network to grow explosively, and using computers to understand visual information has attracted wide attention and application. A visual understanding system understands pictures and videos based on their information characteristics and generates human-readable semantic descriptions of the visual data, such as a sentence. Among them, picture title (caption) generation is an important branch of semantic understanding technology for images and videos. Existing semantic understanding technologies for images and videos are based on an encoder-decoder structure and improve single-task performance mainly by improving the encoder or decoder, adding a generative adversarial network, and the like; their focus is the improvement of single-task performance. However, as the semantic understanding tasks of images and videos become more complex, for example with flexible changes in the style of generated sentences and large differences in image and video types, existing methods have to train different models for different tasks and cannot complete the semantic understanding of different images or videos with a single model, which greatly increases time cost and space cost. Therefore, how to reduce the forgetting rate of the model as much as possible and establish a model capable of completing a plurality of visual information understanding tasks is an important research topic in academia and industry.
In order to reduce the forgetting rate of the model and enable the same model to perform well in a multi-task environment, an existing solution is to modify the parameter weights in the model using the Orthogonal Weight Modification (OWM) method. The basic idea of this method is as follows: when a network is trained continually to perform different tasks, its weights are only allowed to be modified in a direction orthogonal to the subspace spanned by all inputs of the already trained network, which ensures that the new learning process does not interfere with the tasks already learned. However, because the weights can only be modified in that orthogonal direction, the training process of a new task searches for a locally optimal solution, and the model therefore cannot be guaranteed to perform well when learning the new task. In practical applications, the model is expected to use the information it has already learned to promote learning of the new task, so that it performs well on both the new task and the old task, but the existing solution cannot realize this expected function.
From current research, although existing learning methods can reduce the forgetting rate of the model, they do so at the cost of sacrificing the model's performance on the new task; they improve the model's adaptive capacity to a certain extent by balancing its performance on the new and old tasks. In practical applications, however, it is desirable that the model not only maintain its performance on the old task, but also exhibit the same superior effect on the new task as on the old task. Therefore, inspired by the multi-level memory mechanism of the interaction between the cortex and the hippocampus when the human brain learns, how to establish a visual understanding system with multi-level memory is worth further research.
The method and the device aim to solve the problem of how to integrate multilevel memory in the visual understanding system, improve the self-adaption performance of the visual understanding system and solve the forgetting problem of the model in different tasks. The method and the device take a picture title generation model as an example, solve the problem that a training model of an old task forgets the old task after new task training is carried out again in picture title generation, realize that a single model completes a multi-task function by fusing a multi-level memory module in the model, reduce the training time and space cost as far as possible, and improve the self-adaptability of the model. In particular, a visual understanding model training method, apparatus, computer device, and storage medium of embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a visual understanding model training method according to an embodiment of the present disclosure. It should be noted that the visual understanding model training method according to the embodiment of the present application is applicable to the visual understanding model training apparatus according to the embodiment of the present application. The visual understanding model training apparatus may be configured to a computer device. As shown in fig. 1, the visual understanding model training method may include the following steps.
In step 101, a plurality of first image samples for a first visual information understanding task is acquired, and a plurality of second image samples for a second visual information understanding task is acquired.
In step 102, each first image sample and each second image sample are encoded based on an encoder to obtain a first encoding result corresponding to each first image sample and a second encoding result corresponding to each second image sample.
In some embodiments of the present application, each first encoding result may include a feature vector of the first image sample, semantic description information of the first image sample, and a task category of the first image sample. Each second encoding result comprises a feature vector of the second image sample, semantic description information of the second image sample and a task category of the second image sample.
The task category may be understood as a category of a task to which the image sample belongs, for example, the task category to which the first image sample belongs is a first visual information understanding task, and the task category to which the second image sample belongs is a second visual information understanding task.
In step 103, a plurality of first image samples are clustered based on each first encoding result to obtain a cluster category to which each first image sample belongs.
In the embodiment of the application, based on each first encoding result, a clustering algorithm is used for clustering a plurality of first image samples to obtain a cluster category to which each first image sample belongs.
In step 104, according to the cluster type to which each first image sample belongs, each first image sample forms at least one batch of training data according to the same type to train the decoder, and after all types of training are finished, M representative samples are selected from the plurality of first image samples, and a first coding result corresponding to each representative sample is stored in a primary memory of the multi-stage memory module.
In some embodiments, the training model of the decoder and its loss function may be represented as follows (the two formulas appear as equation images in the original publication):
wherein γ is a variable coefficient representing the degree of importance of different training samples; if the training samples come from different tasks, γ takes different values. Specifically, γ = γ_A0 when the first visual information understanding task A is trained; γ = γ_A1 when, during training of the second visual information understanding task B, the sample is a representative sample stored in the primary memory; and γ = γ_B0 when the sample is a sample of the second visual information understanding task B. In general, γ_A0 and γ_B0 may be set to the same coefficient γ_0, and γ_A1 set to γ_1, with 1 ≤ γ_0 ≤ γ_1. By this method, the forgetting rate of the model on the old task when training a new task can be reduced, and the adaptive performance of the model is improved. I represents a training sample, namely a picture or a video input into the model; S represents the sentence generated according to I; and θ represents a training parameter in the model.
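Since these two formulas appear only as images in this text, a plausible form, given purely as an assumption consistent with the symbols I, S, θ and γ defined in the preceding paragraph, is the γ-weighted maximum-likelihood objective and per-sample loss of a standard encoder-decoder captioning model; the exact formulas in the original may differ:

θ* = argmax_θ Σ_{(I,S)} γ · log p(S | I; θ)

L(I, S; θ) = -γ · Σ_{t=1}^{n} log p(s_t | s_0, ..., s_{t-1}, I; θ)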
Optionally, the cluster category to which each first image sample belongs is added to the corresponding first encoding result to obtain a third encoding result, so that the third encoding result corresponding to each representative sample is stored in the primary memory of the multi-level memory module. That is, each representative sample may be stored in the primary memory of the multi-level memory module in the form of (feature vector of the image sample, semantic description information, task category, cluster category).
In order to better utilize the samples of the first visual information understanding task, representative samples are selected from all samples of the first visual information understanding task through an evaluation module and stored in the multi-level memory module as effective information. The first evaluation mechanism evaluates the samples through a loss function; the second evaluation mechanism calculates a score for each sample using scoring indexes and selects the M representative samples according to these scores.
As an example, the M most representative samples may be selected from the plurality of first image samples based on a preset loss function. As another example, a plurality of scoring indexes for the first visual information understanding task are determined, a composite score of each first image sample is obtained according to the plurality of scoring indexes and an evaluation coefficient of each scoring index, and M most representative samples are selected from the plurality of first image samples according to the composite score.
In step 105, the coding results corresponding to all the representative samples in the primary memory of the multi-level memory module are obtained, and based on the coding results corresponding to all the representative samples and each second coding result, the plurality of second image samples and all the representative samples are clustered to obtain the cluster categories to which each second image sample and each representative sample belong respectively.
In step 106, at least one batch of training data is formed by each second image sample and each representative sample according to the same cluster type, and the decoder is trained until all the cluster types are trained, so that the training for the second visual information understanding task is completed.
In this embodiment of the application, after the training for the second visual information understanding task is completed, new M representative samples may be further selected from among the plurality of second image samples and all representative samples in the primary memory, and the encoding result corresponding to each new representative sample may be stored in the primary memory of the multi-stage memory module. That is, after the training of the second visual information understanding task is completed, the representative samples in the primary memory and all samples of the second visual information understanding task are subjected to sample representative calculation at the same time to obtain M samples which are most representative at this time, and new representative samples are rewritten into the primary memory in the form of (feature vectors of image samples, semantic description information, task categories, and cluster categories) to complete the updating of the most representative samples in the primary memory.
It should be noted that there are many ways to select the new M representative samples from the plurality of second image samples and all representative samples in the primary memory; for example, the most representative samples may be selected by using a loss function, or by using a composite score calculated from scoring indexes. As an example, the new M representative samples are selected from the plurality of second image samples and all representative samples in the primary memory based on a preset loss function; or, a plurality of scoring indexes for the second visual information understanding task are determined, a composite score of each representative sample in the primary memory and of each second image sample is obtained according to the plurality of scoring indexes and the evaluation coefficient of each scoring index, and the new M representative samples are selected from the plurality of second image samples and all representative samples in the primary memory according to the composite scores.
It can thus be seen that the present application addresses how to fuse multi-level memories into a visual understanding model with multi-task requirements, so that the same model realizes multi-task functionality, performs well on both the old task and the new task, and is built with the lowest time cost and space cost, thereby improving the functionality of the visual understanding system and the application effect of the method.
In order to facilitate a clearer understanding of the present application by those skilled in the art, the present application will be described in detail below with reference to fig. 2.
In the embodiment of the present application, as shown in fig. 2, the visual understanding model system may include: a selection module, an encoder, a decoder, an evaluation module and a multi-level memory module. The multi-level memory module comprises a first-level (primary) memory unit and a second-level (secondary) memory unit. The functional block diagram of the visual understanding model system with multi-level memory provided by the application is shown in fig. 2: the memory of old-task information is realized by fusing a multi-level memory module into the traditional encoder-decoder structure, and the selection module handles the selection of samples in the training process as well as the read and write operations of the primary and secondary memories in the multi-level memory module, thereby realizing the adaptive capacity of the model. For convenience of description, the present application assumes that the old task is task A (such as the first visual information understanding task described above) and the new task is task B (such as the second visual information understanding task described above), and takes the picture title generation task as an example. In the encoding stage, the encoder performs feature encoding on the picture information in task A and task B to form feature vectors, and the selection module performs selection using the feature vectors of task A as the index: it uses a clustering algorithm to form K_A cluster centers and divides the samples into K_A classes, with the cosine similarity between the sample feature information and the cluster centers as the index; the samples of task B and all representative samples in the primary memory of the multi-level memory module are clustered and selected in the same way to form K_B classes. The final output results of the task A and task B encoders comprise the feature vector of the picture, the accurate semantic description information of the picture, the cluster category of the picture and the task tag to which the picture belongs. For task A, the output results of the encoder are grouped, according to the cluster category given by the selection module, into one or more batches per category and fed to the decoder; after all categories have been trained, M representative samples are selected from all samples of task A according to an evaluation mechanism, and these representative samples are stored in the primary memory of the multi-level memory module in the form of (picture feature vector, accurate semantic description, task category, cluster category). For task B, the secondary memory in the multi-level memory module reads the representative sample information of the u-th cluster class in the primary memory and the sample information of the u-th cluster class in task B; these samples are then combined into one or more batches and fed into the decoder for training, until all classes are trained. Finally, the primary memory is updated: the original representative samples in the primary memory and all samples of task B are scored by the evaluation module, new M representative samples are selected from them, and these are stored again in the primary memory of the multi-level memory module in the form of (picture feature vector, accurate semantic description, task category, cluster category).
The above process completes the learning of two tasks, and when a new task needs to be learned, the operation process of the task B is repeated. Meanwhile, a new parameter weight modification mode is provided in the decoder training process, and the forgetting rate of the model to the old task is reduced.
It can be seen from the above description that the information can be memorized and transferred by using the multi-stage memory module, the information of the old task is effectively retained in the model, the forgetting rate of the model is reduced, the self-adaptive performance of the model is improved, and thus the time cost and the space cost of model training can be effectively reduced, and the utilization rate of the model is improved. The specific implementation mode is as follows:
(1) Image encoding
In order to better describe the content contained in a picture, the input image needs to be encoded into a vector form that a computer can understand; that is, feature encoding is performed and the picture information is encoded into a vector. To this end, existing convolutional neural networks such as ResNet can be used for feature extraction of the pictures, and the extracted features are combined into an integral vector of the following form:
x_i = (x_{11}, x_{12}, ..., x_{1n_1}, ..., x_{k1}, x_{k2}, ..., x_{kn_k}, ..., x_{m1}, x_{m2}, ..., x_{mn_m})
where x_{kj} denotes the j-th component of the k-th feature vector, and x_{kn_k} denotes the last component of the k-th feature vector. The accurate semantic description of the picture is represented by S_i = (s_0, s_1, ..., s_n), where s_0 indicates the beginning of the semantic description of the picture and s_n indicates its end. After encoding is finished, an image feature and picture semantic description pair is formed, and (x_i, s_i) is used to represent the encoding result of an image. Since the model needs to be trained for multiple tasks, the task category to which the sample belongs is added to the output of the encoder, so the final output result of the encoder is (x_i, s_i, t_i), where t_i represents the task category of the sample.
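As an illustration of this encoding step, the following minimal sketch uses a torchvision ResNet-50 with its classification head removed as the feature extractor; the backbone choice, the function name encode_image and the tuple layout are assumptions for illustration only, not the exact implementation of the patent.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Feature extractor: a ResNet with the final classification layer removed,
# so that it outputs a single feature vector per picture.
_resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
_encoder = torch.nn.Sequential(*list(_resnet.children())[:-1]).eval()

_preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_image(image_path, description, task_tag):
    """Return the encoder output (x_i, s_i, t_i) for one picture."""
    img = _preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x_i = _encoder(img).flatten(1).squeeze(0)        # feature vector x_i
    s_i = ["<s0>"] + description.split() + ["<sn>"]      # s_0 ... s_n with start/end marks
    return (x_i, s_i, task_tag)                          # t_i is the task category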
(2) Image selection-clustering method
The clustering operation is explained by taking a clustering algorithm K-means as an example. Firstly, a clustering category value K is set according to an elbow rule or an empirical value, then an initial clustering center is selected, a sample point is randomly selected from a data set to serve as a first clustering center, and the shortest distance between the remaining sample points and all the clustering centers is calculated, so that:
D(x^{(i)}) = min[dist(x^{(i)}, C_1), dist(x^{(i)}, C_2), ..., dist(x^{(i)}, C_n)]
where dist(x^{(i)}, C_j) denotes the distance between x^{(i)} and the j-th cluster center. The probability that a sample point is selected as the next cluster center is then given by the k-means++ weighting (the formula appears as an equation image in the original):
P(x^{(i)}) = D(x^{(i)})^2 / Σ_{x∈X} D(x)^2
The sample point with the maximum probability is taken as the second cluster center, and the previous two steps are repeated until K cluster centers have been selected. After the K cluster centers are determined, the feature part x_i of the encoded picture semantic description pairs is used as the input of the clustering; the clustering result is then output, expressed as k_i, and added to the picture semantic description pair, so that the output result becomes (x_i, s_i, t_i, k_i), where k_i indicates the cluster category to which the picture belongs.
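A minimal sketch of this selection/clustering step is given below, using scikit-learn's KMeans (whose default k-means++ initialization matches the seeding procedure described above); the features are L2-normalized first so that Euclidean distance approximates the cosine-similarity index mentioned earlier, and all names are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cluster_encoded_samples(encoded_samples, k):
    """encoded_samples: list of (x_i, s_i, t_i) tuples.
    Returns a list of (x_i, s_i, t_i, k_i) tuples with the cluster label k_i appended."""
    feats = np.stack([np.asarray(x, dtype=np.float32) for x, _, _ in encoded_samples])
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # cosine-style comparison
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(feats)
    return [(x, s, t, int(k_i)) for (x, s, t), k_i in zip(encoded_samples, labels)]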
(3) Semantic understanding decoder training
At present, there are many semantic generation models, such as RNN and LSTM. In this application, two different tasks A and B are taken as an example. For the two tasks A and B, the training of the decoder model is divided into two parts: the first part is the first training of the model under task A, and the second part is the second training under task B.
Training the model on the task A, setting samples of the same category in the task A into a batch according to the clustering category labels of the samples, training the samples of the next category after completing the training of one category, and finally outputting the picture semantic descriptions of all the samples.
The training of the model on the task B is different from the training of the model on the task A in that the trained samples not only comprise the samples in the task B, but also comprise the representative samples of the task A stored in the multi-stage memory module. In the training process, a representative sample of the u-th clustering class in the primary memory is read and written into the secondary memory, meanwhile, data of the u-th class in the task B is also written into the secondary memory, the samples in the secondary memory are divided into one batch or a plurality of batches and are sent into a decoder for training, wherein the read operation and the write operation of the primary memory and the secondary memory are specifically introduced in the steps (5) to (8). The training model of the decoder can be expressed as:
(The training objective and the loss function appear as equation images in the original publication; see the formulas discussed above.)
wherein γ is a variable coefficient representing the importance of different training samples; if the training samples come from different tasks, γ takes different values. Specifically, γ = γ_A0 during the training of task A; γ = γ_A1 when, during the training of task B, the sample is a representative sample stored in the primary memory; and γ = γ_B0 when the sample is a sample of task B. In general, γ_A0 and γ_B0 may be set to the same coefficient γ_0, and γ_A1 set to γ_1, with 1 ≤ γ_0 ≤ γ_1. By this method, the forgetting rate of the model on old tasks while it is trained on a new task can be reduced, and the adaptive performance of the model is improved. I represents a training sample, namely a picture or a video input into the model; S represents the sentence generated according to I; and θ represents a training parameter in the model.
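The γ weighting can be illustrated with the following minimal sketch, which assumes the per-sample negative log-likelihood values have already been produced by the caption decoder; the γ values and every name in the sketch are illustrative assumptions rather than the patent's implementation.

import torch

def weighted_caption_loss(nll_per_sample, task_tags, current_task, gamma_0=1.0, gamma_1=1.5):
    """Samples of the task currently being trained are weighted by gamma_0; representative
    samples replayed from an earlier task are weighted by gamma_1, with 1 <= gamma_0 <= gamma_1."""
    gammas = torch.tensor([gamma_0 if t == current_task else gamma_1 for t in task_tags],
                          dtype=nll_per_sample.dtype)
    return (gammas * nll_per_sample).mean()

# Illustrative batch during the training of task B: two replayed task-A representatives
# and two task-B samples, with dummy per-sample loss values.
nll = torch.tensor([2.3, 1.7, 2.9, 2.1])
loss = weighted_caption_loss(nll, task_tags=["A", "A", "B", "B"], current_task="B")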
(4) Evaluation module
In order to better utilize samples in the task A, a representative sample is selected from all samples of the task A through an effective evaluation module and is stored in a multi-stage memory module as effective information.
The first evaluation mechanism is an evaluation by a loss function of the form:
(The loss function appears as an equation image in the original publication.)
if the value of the loss function of a sample is larger, it indicates that the sample is more representative. And selecting the most representative M samples from all samples of the task A through a loss function, and storing the samples into a primary memory of a multi-stage memory module.
The second evaluation mechanism combines the scoring indexes of the picture semantic generation task, i.e., BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr and the like, into a comprehensive scoring formula, that is:
S = α_1·BLEU-1 + α_2·BLEU-2 + α_3·BLEU-3 + α_4·BLEU-4 + α_5·CIDEr
where α_i is the evaluation coefficient of each scoring index. The method takes the result of the above formula as the representativeness index of each sample; the larger the value of S, the more representative the sample. In this way, the M most representative task A samples are selected and stored in the primary memory of the multi-level memory module.
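A minimal sketch of this second mechanism is shown below; the BLEU and CIDEr values per sample are assumed to be precomputed (for example with a standard captioning evaluation toolkit), and the coefficient values are illustrative only.

def composite_score(metrics, alphas=(0.1, 0.15, 0.2, 0.25, 0.3)):
    """metrics: dict with keys 'bleu1'..'bleu4' and 'cider' for one sample's generated caption."""
    keys = ("bleu1", "bleu2", "bleu3", "bleu4", "cider")
    return sum(a * metrics[k] for a, k in zip(alphas, keys))

def select_representatives_by_score(memory_units, metrics_per_unit, m):
    """Keep the M memory units with the largest composite score S."""
    ranked = sorted(zip(memory_units, metrics_per_unit),
                    key=lambda pair: composite_score(pair[1]), reverse=True)
    return [unit for unit, _ in ranked[:m]]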
(5) Write operation of first-level memory
First write of the primary memory: according to the evaluation mechanism in step (4), the M most representative samples are selected from task A and stored as groups of information in the form of (picture feature, picture description, task category, cluster category); the storage form of one representative sample is (x_i, s_i, t_i, k_i), and such a group of information serves as a memory unit. Compared with the encoding information of the task A samples, the cluster category to which the sample belongs is still retained, although it has no practical significance at this point, because the representative samples will be clustered again together with the new samples of task B to form new cluster categories. The task category of the representative sample is also retained; it is used, during the training of task B, to distinguish the representative samples from the samples of task B, so that different γ values are selected according to the task category when the model parameters are updated, realizing the different influences of the new and old tasks on the model weights.
Updating: after the training of task B is finished, representativeness is computed simultaneously for the representative samples in the primary memory and all samples of task B through the scoring mechanism in step (4), the M most representative samples at this point are obtained, and the new representative samples are rewritten into the primary memory in the form (x_i, s_i, t_i, k_i).
(6) Read operation of first level memory
The clustering module reads the memory units of the representative samples in the primary memory and uses the picture features in those memory units, namely x_i, together with the picture features of the task B samples, as the basis for clustering. According to the clustering result, a new cluster category is formed for each representative sample and written, in turn, into the corresponding position of the representative-sample memory unit in the primary memory. Meanwhile, the cache can read the information of the representative samples of one cluster category in the primary memory according to the new clusters.
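The read-and-recluster operation can be sketched as follows, modelling each memory unit as a mutable list [x_i, s_i, t_i, k_i] so that the cluster field can be overwritten in place; this representation and all names are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

def recluster_with_task_b(primary_memory, task_b_units, k_b):
    """primary_memory / task_b_units: lists of [x_i, s_i, t_i, k_i] entries (k_i mutable).
    Re-clusters all picture features jointly and writes the new label back into each unit."""
    all_units = primary_memory + task_b_units
    feats = np.stack([np.asarray(u[0], dtype=np.float32) for u in all_units])
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)        # cosine-style comparison
    labels = KMeans(n_clusters=k_b, n_init=10, random_state=0).fit_predict(feats)
    for unit, label in zip(all_units, labels):
        unit[3] = int(label)                                     # overwrite cluster field k_i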
(7) Write operation of the secondary memory
According to the clustering result, all representative memory units (x_i, s_i, t_i, k_i) of the u-th cluster class are read from the primary memory and written into the secondary memory. The secondary memory is written not only with the information of the u-th cluster class among the representative samples, but also with the information of the u-th cluster class in task B.
(8) Read operation of the secondary memory
When task B starts training, the encoder reads the information of the u-th cluster class from the secondary memory and reads one batch for training each time; the secondary memory is then updated and the training process of the (u+1)-th cluster class is carried out, until all samples in the secondary memory are trained.
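The secondary-memory traffic of steps (7) and (8) can be sketched as a per-cluster training loop; the batch size and the train_step callback (one decoder update per batch) are illustrative assumptions.

def train_task_b_by_cluster(primary_memory, task_b_units, num_clusters, batch_size, train_step):
    """For each cluster class u: write the matching representative units and task-B units
    into the secondary memory, then feed them to the decoder batch by batch."""
    for u in range(num_clusters):
        secondary_memory = (
            [unit for unit in primary_memory if unit[3] == u] +     # representatives of class u
            [unit for unit in task_b_units if unit[3] == u]         # task-B samples of class u
        )
        for start in range(0, len(secondary_memory), batch_size):
            train_step(secondary_memory[start:start + batch_size])  # one batch into the decoder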
(9) Repeating the steps (1) to (8)
The above steps only complete the model training for two tasks; if the model is expected to realize the functions of three or more tasks, the training can be repeated according to the above steps. The number of tasks that the model can realize differs according to the capacity of the primary memory in the multi-level memory module: the larger the capacity of the primary memory, the larger the maximum number of tasks the model can realize. The primary memory capacity should therefore be chosen reasonably according to the task requirements.
In summary, the present application improves the adaptive capability of the model and reduces its forgetting rate by fusing a multi-level memory module consisting of a primary memory and a secondary memory into the existing basic encoder-decoder model. In addition, through a new training mode, the representative samples of task A are retained and stored in the primary memory, the samples of task B and the representative samples in the primary memory are divided into different categories through clustering, and training is then carried out on groups of samples of the same category, which helps to retain the performance of the model on task A while showing a good effect on task B. Furthermore, the training method provided by the application is no longer limited to the single-task situation and still performs well under multi-task requirements, so the adaptive performance of the model can be effectively improved.
In order to implement the above embodiments, the present application further provides a visual understanding model training apparatus.
Fig. 3 is a schematic structural diagram of a visual understanding model training apparatus according to an embodiment of the present application.
As shown in fig. 3, the visual understanding model training apparatus 300 includes: a sample acquisition module 310, an encoding module 320, a clustering module 330, and a training module 340.
Specifically, the sample acquiring module 310 is configured to acquire a plurality of first image samples for a first visual information understanding task and acquire a plurality of second image samples for a second visual information understanding task.
The encoding module 320 is configured to encode each first image sample and each second image sample based on an encoder to obtain a first encoding result corresponding to each first image sample and a second encoding result corresponding to each second image sample.
The clustering module 330 is configured to cluster the plurality of first image samples based on each first encoding result to obtain a cluster category to which each first image sample belongs.
The training module 340 is configured to train the decoder with at least one batch of training data composed of each first image sample according to the same class according to the cluster class to which each first image sample belongs, select M representative samples from the plurality of first image samples after training of all classes is completed, and store the first coding result corresponding to each representative sample in the primary memory of the multi-stage memory module.
In this embodiment of the application, the clustering module 330 is further configured to obtain a coding result corresponding to all representative samples in a first-level memory of the multi-level memory module, and cluster the plurality of second image samples and all representative samples based on the coding result corresponding to all representative samples and each second coding result, so as to obtain a cluster category to which each second image sample and each representative sample belong respectively.
The training module 340 is further configured to train at least one batch of training data for the decoder, where the batch of training data is formed by each second image sample and each representative sample according to the same cluster type, until all cluster types are trained.
It should be noted that the foregoing explanation on the embodiment of the visual understanding model training method is also applicable to the visual understanding model training apparatus of this embodiment, and details are not repeated here.
In order to implement the above embodiments, the present application also provides a computer device.
FIG. 4 is a schematic block diagram of a computer device according to one embodiment of the present application. As shown in fig. 4, the computer device 400 may include: a memory 401, a processor 402 and a computer program 403 stored in the memory 401 and operable on the processor 402, wherein the processor 402 implements the method for training a visual understanding model according to any of the above embodiments when the processor 402 executes the program 403.
To achieve the above embodiments, the present application further proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the visual understanding model training method described in any of the above embodiments of the present application.
According to the technical scheme of the embodiment of the application, a first image sample of a first visual information understanding task and a second image sample of a second visual information understanding task are respectively encoded to obtain a first encoding result corresponding to the first image sample and a second encoding result corresponding to the second image sample; clustering a plurality of first image samples based on the first coding result to obtain a cluster category to which each first image sample belongs; forming at least one batch of training data by each first image sample according to the same category to train a decoder, and after training of all categories is finished, selecting M representative samples from the first image samples and storing the M representative samples in a primary memory of a multi-stage memory module; and forming at least one batch of training data according to the same clustering class based on all representative samples and second image samples in the first-level memory in the multi-level memory module to train the decoder until all clustering classes are trained. Therefore, the multi-task function of the same model is guaranteed by fusing multi-level memory in the model, good performance of the model in both an old task and a new task is guaranteed, the model is built at the lowest time cost and space cost, the function of a visual understanding system is improved, and the application effect of the method is improved. The method and the device can solve the problem that a training model of an old task forgets the old task after new task training is carried out again in picture title generation, realize that a single model completes a multi-task function by fusing a multi-level memory module in the model, reduce the training time and space cost as far as possible, and improve the self-adaptive performance of the model.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A method for training a visual understanding model, comprising:
acquiring a plurality of first image samples for a first visual information understanding task and acquiring a plurality of second image samples for a second visual information understanding task;
respectively encoding each first image sample and each second image sample based on an encoder to obtain a first encoding result corresponding to each first image sample and a second encoding result corresponding to each second image sample;
clustering the plurality of first image samples based on each first coding result to obtain a cluster category to which each first image sample belongs;
according to the cluster category to which each first image sample belongs, grouping the first image samples of the same category into at least one batch of training data to train a decoder, selecting M representative samples from the plurality of first image samples after training on all categories is finished, and storing the first encoding result corresponding to each representative sample into a primary memory of a multi-level memory module;
acquiring the encoding results corresponding to all representative samples in the primary memory of the multi-level memory module, and clustering the plurality of second image samples and the representative samples based on the encoding results corresponding to the representative samples and each second encoding result, so as to obtain the cluster category to which each second image sample and each representative sample respectively belongs;
and grouping the second image samples and the representative samples of the same cluster category into at least one batch of training data, and training the decoder until all cluster categories have been trained (an illustrative end-to-end sketch of this procedure is given after the claims).
2. The method according to claim 1, wherein each first encoding result comprises a feature vector of the first image sample, semantic description information of the first image sample, and a task category of the first image sample; and each second encoding result comprises a feature vector of the second image sample, semantic description information of the second image sample, and a task category of the second image sample (a small data-structure sketch of such a result is given after the claims).
3. The method of claim 1, further comprising:
adding, to the corresponding first encoding result, the cluster category to which each first image sample belongs, to obtain a third encoding result;
wherein storing the first encoding result corresponding to each representative sample into the primary memory of the multi-level memory module comprises:
storing the third encoding result corresponding to each representative sample into the primary memory of the multi-level memory module.
4. The method of claim 1, wherein said selecting M representative samples from said plurality of first image samples comprises:
selecting M most representative samples from the plurality of first image samples based on a preset loss function; or,
determining a plurality of scoring indexes for the first visual information understanding task, obtaining a composite score for each first image sample according to the plurality of scoring indexes and the evaluation coefficient of each scoring index, and selecting the M most representative samples from the plurality of first image samples according to the composite scores (a sketch of such a composite-scoring scheme is given after the claims).
5. The method of claim 1, wherein after completing training for the second visual information understanding task, the method further comprises:
and selecting M new representative samples from the plurality of second image samples and all representative samples in the primary memory, and storing the encoding result corresponding to each new representative sample in the primary memory of the multi-level memory module.
6. The method of claim 5, wherein said selecting M new representative samples from the plurality of second image samples and all representative samples in the primary memory comprises:
selecting M new representative samples from the plurality of second image samples and all representative samples in the primary memory based on a preset loss function; or,
determining a plurality of scoring indexes for the second visual information understanding task, obtaining a composite score for each second image sample and each representative sample in the primary memory according to the plurality of scoring indexes and the evaluation coefficient of each scoring index, and selecting M new representative samples from the plurality of second image samples and all representative samples in the primary memory according to the composite scores.
7. The method according to any of claims 1 to 6, wherein the training model of the decoder is represented as follows:
[The training model is given by two equations that appear only as images in the original publication (FDA0002896897000000021 and FDA0002896897000000022); the variables they use are defined below.]
wherein γ is a variable coefficient indicating the importance of a training sample: samples from different tasks correspond to different values of γ, and a larger γ means a more important sample; I denotes a training sample, i.e., the image or video input to the model; S denotes the words generated from I; and θ denotes the trainable parameters of the model (a hedged, illustrative formulation of such an objective is given after the claims).
8. A visual understanding model training apparatus, comprising:
a sample acquisition module, configured to acquire a plurality of first image samples for a first visual information understanding task and a plurality of second image samples for a second visual information understanding task;
an encoding module, configured to encode each first image sample and each second image sample respectively based on an encoder to obtain a first encoding result corresponding to each first image sample and a second encoding result corresponding to each second image sample;
the clustering module is used for clustering the plurality of first image samples based on each first coding result so as to obtain a clustering category to which each first image sample belongs;
the training module is configured to group the first image samples of the same category into at least one batch of training data, according to the cluster category to which each first image sample belongs, so as to train a decoder; to select M representative samples from the plurality of first image samples after training on all categories is finished; and to store the first encoding result corresponding to each representative sample in a primary memory of a multi-level memory module;
the clustering module is further configured to obtain the encoding results corresponding to the representative samples in the primary memory of the multi-level memory module, to cluster the plurality of second image samples and the representative samples based on the encoding results corresponding to the representative samples and each of the second encoding results, and to obtain the cluster category to which each second image sample and each representative sample belongs;
the training module is further configured to group each second image sample and each representative sample of the same cluster category into at least one batch of training data, and to train the decoder until all cluster categories have been trained.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements a visual understanding model training method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a visual understanding model training method according to any one of claims 1 to 7.
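
As a reading aid only, the following is a minimal Python sketch of the procedure recited in claims 1 and 5: encode samples, cluster the encodings, train the decoder on per-cluster batches, keep M representative samples as a primary memory, and rehearse them together with the next task's samples. The helper names, the k-means clustering, the toy representativeness score, and the no-op `decoder_step` are all assumptions made for illustration, not the applicant's implementation.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def cluster_ids(codes, n_clusters):
    """Cluster encoding results; returns the cluster category of each sample."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(codes)


def batches_by_cluster(codes, labels, batch_size):
    """Yield batches whose samples all belong to the same cluster category."""
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    for lab, idxs in groups.items():
        for start in range(0, len(idxs), batch_size):
            yield lab, codes[idxs[start:start + batch_size]]


def representativeness(codes):
    """Toy stand-in score: closeness to the mean encoding (purely illustrative)."""
    return -np.linalg.norm(codes - codes.mean(axis=0), axis=1)


def train_task(decoder_step, codes, primary_memory, n_clusters, batch_size, m):
    """One task: cluster, train per cluster, then refresh the primary memory.

    `decoder_step(batch)` stands in for one optimisation step of the decoder;
    the patent does not specify the decoder, so it is passed in as a callable.
    """
    pool = codes if primary_memory is None else np.vstack([primary_memory, codes])
    labels = cluster_ids(pool, n_clusters)
    for _, batch in batches_by_cluster(pool, labels, batch_size):
        decoder_step(batch)
    keep = np.argsort(representativeness(pool))[-m:]  # M most representative samples
    return pool[keep]                                 # becomes the new primary memory


# First task: no memory yet; second task: rehearse the stored representatives.
rng = np.random.default_rng(0)
step = lambda batch: None  # no-op decoder step, for illustration only
memory = train_task(step, rng.normal(size=(200, 16)), None, n_clusters=5, batch_size=32, m=20)
memory = train_task(step, rng.normal(size=(150, 16)), memory, n_clusters=5, batch_size=32, m=20)
```

On the first call `primary_memory` is `None`; the array returned by each call plays the role of the primary memory of the multi-level memory module for the next task.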
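
Claim 2 lists what each encoding result carries, and claim 3 later appends the cluster category to form a third encoding result. A small data-structure sketch of such a record (field names are assumed, not taken from the patent) might look like:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class EncodingResult:
    """One encoding result per claim 2; field names are assumed, not from the patent."""
    feature_vector: np.ndarray        # visual feature vector produced by the encoder
    semantic_description: str         # semantic description information of the sample
    task_category: str                # which visual information understanding task the sample belongs to
    cluster_category: Optional[int] = None  # appended later (claim 3) to form the "third encoding result"
```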
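
Claims 4 and 6 offer two routes for selecting the M most representative samples: a preset loss function, or a composite score built from several scoring indexes weighted by their evaluation coefficients. A minimal sketch of the weighted-sum reading of the second route (the example index values and coefficients are invented for illustration):

```python
import numpy as np


def composite_scores(index_values, coefficients):
    """Weighted sum of per-sample scoring indexes.

    index_values: (num_samples, num_indexes) matrix, one column per scoring index.
    coefficients: (num_indexes,) evaluation coefficient of each scoring index.
    """
    return index_values @ coefficients


def select_representatives(index_values, coefficients, m):
    """Return the indices of the M samples with the highest composite score."""
    scores = composite_scores(index_values, coefficients)
    return np.argsort(scores)[::-1][:m]


# Two invented indexes (e.g. cluster-centre proximity and caption quality).
values = np.array([[0.9, 0.2],
                   [0.4, 0.8],
                   [0.7, 0.6]])
weights = np.array([0.6, 0.4])
print(select_representatives(values, weights, m=2))  # prints [2 0]
```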
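
The claim 7 training model is published only as two equation images. Assuming it takes the common form of an importance-weighted maximum-likelihood objective over generated word sequences, one hedged, illustrative way to write such an objective with the variables defined in claim 7 is:

```latex
% Assumed, illustrative form only -- the patent's actual equations are the two images referenced in claim 7.
% theta: decoder parameters; D: set of training pairs (I, S); gamma_I: importance weight of sample I;
% S = (s_1, ..., s_T): the word sequence generated for input I.
\theta^{*} = \arg\max_{\theta} \sum_{(I, S) \in \mathcal{D}} \gamma_{I} \, \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{T} \log p\left(s_{t} \mid s_{1:t-1}, I; \theta\right)
```

Under this reading, samples rehearsed from the primary memory of earlier tasks would simply carry a different γ than samples of the current task.
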
CN202110044054.8A 2021-01-13 2021-01-13 Visual understanding model training method and device, computer equipment and storage medium Active CN112712130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110044054.8A CN112712130B (en) 2021-01-13 2021-01-13 Visual understanding model training method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112712130A CN112712130A (en) 2021-04-27
CN112712130B CN112712130B (en) 2022-06-10

Family

ID=75548974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110044054.8A Active CN112712130B (en) 2021-01-13 2021-01-13 Visual understanding model training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112712130B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108024158A (en) * 2017-11-30 2018-05-11 天津大学 There is supervision video abstraction extraction method using visual attention mechanism
CN108764299A (en) * 2018-05-04 2018-11-06 北京物灵智能科技有限公司 Story model training and generation method, system, robot and storage device
CN110728203A (en) * 2019-09-23 2020-01-24 清华大学 Sign language translation video generation method and system based on deep learning
CN110909736A (en) * 2019-11-12 2020-03-24 北京工业大学 Image description method based on long-short term memory model and target detection algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder for Unsupervised Anomaly Detection; Dong Gong et al.; arXiv:1904.02639 [cs.CV]; 20190806; full text *
Deep Learning for Gait Recognition: A Survey; He Yiwei et al.; Pattern Recognition and Artificial Intelligence; 20180515; Vol. 31, No. 5; full text *

Also Published As

Publication number Publication date
CN112712130A (en) 2021-04-27

Similar Documents

Publication Publication Date Title
CN112685597B (en) Weak supervision video clip retrieval method and system based on erasure mechanism
Lu et al. Bayesian analogy with relational transformations.
CN111612134B (en) Neural network structure searching method and device, electronic equipment and storage medium
CN118349673A (en) Training method of text processing model, text processing method and device
US11334791B2 (en) Learning to search deep network architectures
CN111400548B (en) Recommendation method and device based on deep learning and Markov chain
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN110472255A (en) Neural network machine interpretation method, model, electric terminal and storage medium
CN112131372B (en) Knowledge-driven conversation strategy network optimization method, system and device
CN115861995A (en) Visual question-answering method and device, electronic equipment and storage medium
CN117217277A (en) Pre-training method, device, equipment, storage medium and product of language model
CN116168401A (en) Training method of text image translation model based on multi-mode codebook
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN114137967B (en) Driving behavior decision method based on multi-network joint learning
CN114925703A (en) Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN112599246B (en) Vital sign data processing method, system, device and computer readable medium
CN112329735B (en) Training method of face recognition model and online education system
CN112712130B (en) Visual understanding model training method and device, computer equipment and storage medium
CN116525052A (en) Hierarchical image report generation method and device combined with sentence level contrast learning
CN116230146A (en) Data processing method, training method of ICD (ICD coding) model and related equipment
CN114330674A (en) Pulse coding method, system, electronic equipment and storage medium
CN110147881B (en) Language processing method, device, equipment and storage medium
Refat et al. SentiNet: A nonverbal facial sentiment analysis using convolutional neural network
CN118132733B (en) Test question retrieval method, system, storage medium and electronic equipment
CN117972435B (en) Digital human text action model training method and digital human action generating method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant