CN111813978A

CN111813978A - Image description model generation method and device and storage medium

Info

Publication number: CN111813978A
Application number: CN201910295123.5A
Authority: CN
Inventors: 潘滢炜; 姚霆; 梅涛
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-04-12
Filing date: 2019-04-12
Publication date: 2020-10-23

Abstract

The application discloses a generation method, a generation device and a storage medium of an image description model, which particularly comprises the steps of obtaining an image sample and a description sentence corresponding to the image sample; inputting an image sample and a description sentence into a first prediction model to generate a first distribution probability set; acquiring a second distribution probability set corresponding to a target vocabulary corresponding to each entity object in the image sample; respectively distributing weighted values to the first distribution probability set and the second distribution probability set to obtain a third distribution probability set containing a third distribution probability corresponding to the target vocabulary at each decoding moment, and outputting a prediction statement; and respectively determining a statement entity object coverage loss function and a language sequence loss function based on the first distribution probability set, the second distribution probability set and the weight values, and determining a model loss function so as to optimize the image description model until the image description model is generated. According to the image description method and device, the accuracy of image description is improved by optimizing the image description model.

Description

Image description model generation method and device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a storage medium for generating an image description model.

Background

The image description mainly describes the visual content of an image through natural language expression. In recent years, image description technology has been widely used, and automatic image description generation can be applied to image retrieval. The existing image description technology mainly extracts feature vectors of input image data through a Convolutional Neural Network (CNN), and then decodes the extracted features by using a Recurrent Neural Network (RNN) or a long-short Term Memory Network (LSTM), so as to obtain an output sequence and generate an image description.

In the conventional image description technology, when a generated model of image description is trained, images and corresponding description sentences are usually trained as a sample data set, so that a word list is fixed, the generated model is generally limited to be used in the trained image data, and the application of the generated model in a scene with new image data as input is limited. In addition, the recognition accuracy of the object to be recognized in the descriptive sentence output by the above model is not high.

Disclosure of Invention

The embodiment of the application provides a method for generating an image description model, which improves the accuracy of image description of an input image to be recognized by establishing the image description model.

The method comprises the following steps:

acquiring an image sample and a description sentence corresponding to the image sample;

sequentially inputting the image sample and each target word in the descriptive sentence into a pre-trained first prediction model, and generating a first distribution probability set of a first distribution probability corresponding to each target word in the descriptive sentence at each decoding moment;

acquiring a second distribution probability set of second distribution probabilities corresponding to the target vocabularies corresponding to the entity objects in the image sample at each decoding moment;

respectively distributing weighted values for a first distribution probability corresponding to the target vocabulary in the first distribution probability set and a second distribution probability corresponding to the target vocabulary in the second distribution probability set to obtain a third distribution probability set containing a third distribution probability corresponding to the target vocabulary at each decoding moment, and outputting a prediction statement;

respectively determining a statement entity object coverage loss function and a language sequence loss function based on the first distribution probability set, the second distribution probability set and the weight values;

and determining a model loss function according to the statement entity object coverage loss function and the language sequence loss function, and optimizing the image description model according to the model loss function until the image description model is generated.

Optionally, the generating method further comprises:

inputting the image sample into the convolution neural network model trained in advance, and extracting a feature vector of the image sample;

and inputting the target vocabulary corresponding to the feature vector and the descriptive statement at the last decoding moment into the long-short term memory network model.

Optionally, the generating method includes:

inputting the image sample into an object detector trained in advance based on a preset image data set, and generating a prediction score set, wherein each entity object contained in the image sample corresponds to a prediction score of the target vocabulary;

and determining the target vocabulary as a correct semantic word of the entity object based on the prediction score set and the hidden state of the long-short term memory network model at the current decoding moment, and generating a second distribution probability set of a second distribution probability of the target vocabulary at each decoding moment.

Optionally, the generating method includes:

calculating the third distribution probability set containing the corresponding third distribution probability of the target vocabulary at each decoding moment based on a first weight value distributed to the first distribution probability of the target vocabulary and a second weight value distributed to the second distribution probability, wherein the sum of the first weight value and the second weight value is 1.

Optionally, the generating method includes:

determining a sentence entity object coverage loss function based on the sum of the second distribution probability corresponding to each target vocabulary at each decoding moment in the second distribution probability set and the assigned weight value and the actual distribution probability of the target vocabulary in the descriptive sentence;

determining a language sequence loss function based on the prediction statement and the description statement for the image sample output by the first prediction model.

In another embodiment of the present invention, there is provided a method of acquiring an image description, the method comprising:

acquiring an image to be identified;

the image to be recognized is input into the image description model in each step in the image description model generation method to generate the description sentence of the image to be recognized.

In another embodiment of the present invention, there is provided an image description model generation apparatus including:

the acquisition module is used for acquiring an image sample and a descriptive statement corresponding to the image sample;

the first generation module is used for sequentially inputting the image sample and each target word in the descriptive sentence into a first pre-trained prediction model and generating a first distribution probability set of a first distribution probability corresponding to each target word contained in the descriptive sentence at each decoding moment;

the second generation module is used for acquiring a second distribution probability set of second distribution probabilities corresponding to the target vocabularies corresponding to the entity objects in the image sample at each decoding moment;

a third generating module, configured to respectively assign weighted values to a first distribution probability corresponding to the target vocabulary in the first distribution probability set and a second distribution probability corresponding to the target vocabulary in the second distribution probability set, so as to obtain a third distribution probability set including a third distribution probability corresponding to the target vocabulary at each decoding time, and output a prediction statement;

a first determining module, configured to determine a statement entity object coverage loss function and a language sequence loss function based on the first distribution probability set, the second distribution probability set, and the weight values, respectively;

and the second determining module is used for determining a model loss function according to the statement entity object coverage loss function and the language sequence loss function, and optimizing the image description model according to the model loss function until the image description model is generated.

Optionally, the generating device further includes:

the extraction module is used for inputting the image sample into the convolutional neural network model trained in advance and extracting the characteristic vector of the image sample;

and the input module is used for inputting the target vocabulary corresponding to the feature vector and the descriptive statement at the last decoding moment into the long-term and short-term memory network model.

Optionally, the second generating module includes:

the input unit is used for inputting the image sample into an object detector which is trained in advance based on a preset image data set, and generating a prediction score set of prediction scores of the target vocabulary corresponding to all entity objects contained in the image sample;

and the generating unit is used for determining the target vocabulary as the correct semantic words of the entity object based on the prediction score set and the hidden state of the long-short term memory network model at the current decoding moment, and generating a second distribution probability set of a second distribution probability of the target vocabulary at each decoding moment.

Optionally, the third generating module is further configured to:

Optionally, the first determining module includes:

a first determining unit, configured to determine a sentence entity object coverage loss function based on a sum of the second distribution probability corresponding to each target vocabulary at each decoding time in the second distribution probability set and the assigned weight value, and an actual distribution probability of the target vocabulary in the descriptive sentence;

a second determination unit configured to determine a language sequence loss function based on the prediction statement and the description statement for the image sample output by the first prediction model.

In another embodiment of the invention, there is provided an apparatus for acquiring a description of an image, the apparatus comprising:

the acquisition module is used for acquiring an image to be identified;

and the generating module is used for inputting the image to be recognized into the image description model in each step in the image description model generating method so as to generate the description sentence of the image to be recognized.

In another embodiment of the present invention, a non-transitory computer readable storage medium is provided, which stores instructions that, when executed by a processor, cause the processor to perform the steps of one of the above-described image description model generation methods.

In another embodiment of the present invention, a terminal device is provided, which includes a processor configured to execute the steps of the image description model generation method.

As can be seen from the above, based on the above-described embodiment, an image sample and a descriptive sentence corresponding to the image sample are first acquired. Secondly, sequentially inputting each target word in the image sample and the descriptive sentence into a first pre-trained prediction model, and generating a first distribution probability set of first distribution probabilities corresponding to each target word in the descriptive sentence at each decoding moment. And meanwhile, acquiring a second distribution probability set of a second distribution probability corresponding to each target vocabulary corresponding to each entity object in the image sample at each decoding moment. Further, weight values are respectively distributed to a first distribution probability corresponding to the target vocabulary in the first distribution probability set and a second distribution probability corresponding to the target vocabulary in the second distribution probability set, so that a third distribution probability set containing a third distribution probability corresponding to the target vocabulary at each decoding moment is obtained, and the prediction sentence is output. Then, respectively determining a statement entity object coverage loss function and a language sequence loss function based on the first distribution probability set, the second distribution probability set and the weight values. And finally, determining a model loss function according to the statement entity object coverage loss function and the language sequence loss function, and optimizing the image description model according to the model loss function until the image description model is generated. According to the image description method and device, the image description model is optimized through determining the model loss function, and the accuracy of image description is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a schematic flow chart illustrating a method for generating an image description model according to embodiment 10 of the present application;

fig. 2 is a schematic diagram illustrating a generating sentence entity object coverage loss function and a language sequence loss function in a generating method of an image description model provided in embodiment 20 of the present application;

fig. 3 is a schematic diagram illustrating a specific flow in a method for generating an image description model in embodiment 30 provided in the present application;

FIG. 4 is a flow chart illustrating a method for obtaining an image description provided by embodiment 40 of the present application;

fig. 5 is a schematic diagram illustrating an apparatus for generating an image description model according to embodiment 50 of the present application;

FIG. 6 shows a schematic diagram of an apparatus for obtaining an image description provided in an embodiment 60 of the present application;

fig. 7 shows a schematic diagram of a terminal device provided in embodiment 70 of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

Based on the problems in the prior art, the embodiments of the present application provide a method for generating an image description model by providing an LSTM-based pointing mechanism LSTM-P for an object to be recognized, first extracting visual features of a given image through CNN, and then inputting the visual features into LSTM, which outputs a first distribution probability of all target words contained in a word list. In addition, the entity objects in the input image are identified through a pre-trained object detector, the prediction scores of all the entity objects are obtained, the prediction scores of all the entity objects and the hidden state of the LSTM at the current decoding time are used as input in a copying layer, and the second distribution probability of the description of the entity objects directly taking the target vocabulary as the object words is output. And finally, inputting the first distribution probability and the second distribution probability which are respectively output by the two models into a pointing mechanism, dynamically distributing weights by the pointing mechanism, outputting the final probability distribution of each target vocabulary, and calculating to obtain a corresponding statement entity object coverage loss function and a language sequence loss function so as to determine a model loss function of the whole image description model and optimize the image description model. The image description model based on the LSTM-P enlarges the range of recognizable target words, and simultaneously improves the accuracy of description sentences of the image description.

The application field of the application is mainly in the technical field of computers, and is suitable for the technical field of computer vision and the field of natural language processing. Fig. 1 is a schematic diagram of an embodiment 10 of a method for generating an image description model according to an embodiment of the present application. The detailed steps are as follows:

and S11, acquiring the image sample and the descriptive sentence corresponding to the image sample.

In this step, a plurality of image samples and description sentences of the image samples are obtained as a plurality of pairs of input data, for example, the image samples shown in fig. 2, and the corresponding description sentences are "a large dog playing on a blanket on a core", where the description sentences corresponding to the image samples need to be preset. Meanwhile, the descriptive statement is a correct description of the meaning of the contents of the image sample.

S12, the image sample and the descriptive sentence are input into the first pre-trained prediction model, and a first distribution probability set of first distribution probabilities corresponding to each target word included in the descriptive sentence at each decoding time is generated.

In this step, the first prediction model is mainly composed of CNN and LSTM. The CNN model is mainly used for extracting the characteristic vector of the image sample, inputting the image sample into the CNN model and outputting the characteristic vector of the image sample. Specifically, the image sample is input into the CNN, and the feature vector of the image sample is extracted through a network layer except for the softmax classifier which outputs the category score by using a pre-trained network VGG-16 or ResNet. CNN, commonly referred to as an Encoder (Encoder), encodes a large amount of information contained in an image sample into feature vectors, processes them, and uses the extracted feature vectors as an initial input for subsequent LSTM. The role of the LSTM is mainly to process features by decodingAnd vector converting the vector into natural language. Where the image sample is input to the LSTM along with the corresponding descriptive statement. Specifically, the target vocabulary at the last decoding time in the CNN extracted feature and descriptive sentences is input into LSTM, and the LSTM generates the predicted sentences one by one according to the order of words in the sentences. Here, the LSTM finally outputs a first distribution probability set of the first distribution probability of the target word at each decoding time in the order of words in the sentence. As shown in FIG. 2, wherein

The method comprises the steps of obtaining a first distribution probability set which is composed of predicted first distribution probabilities obtained by training an image sample and a corresponding descriptive statement input content through a first prediction model containing CNN and LSTM.

And S13, acquiring a second distribution probability set of second distribution probabilities corresponding to the target words corresponding to the entity objects in the image sample at each decoding moment.

In this step, first, an image sample is input into a pre-trained object detector, and the object detector identifies each entity object that may be included in the image sample. Wherein the object detector is trained beforehand from the image data set. The image data set contains a large amount of image data and corresponding description words, and the object detector is used for identifying each entity object in the image sample to obtain a prediction score set of prediction scores of each entity object which may be a corresponding target vocabulary. As shown in fig. 2, the Object detector is denoted as Object Leaner, and after the image sample is input into the Object Leaner, the prediction score of each entity Object contained in the image sample, which may be the target vocabulary, is output. When an image sample as shown in fig. 2 is input at the Object Leaner, the Object Leaner outputs a prediction probability that each entity Object may be a relevant target vocabulary, i.e., a prediction score, such as dog: 1.00, touch; 0.21, bed: 0.13, blanket: 0.12.

next, after acquiring a prediction score set of prediction scores for each entity object, the hidden state of LSTM at the current decoding time and the acquired prediction score set are simultaneously input to a complexAnd (5) making a coating Layer. In the replication layer, the target words of the entity objects which are recognized are determined as the correct semantic words of the entity objects in the input image sample, and a second distribution probability set of second distribution probabilities corresponding to the target words at each decoding moment is generated

And S14, respectively allocating weights to the first distribution probability corresponding to the target vocabulary in the first distribution probability set and the second distribution probability corresponding to the target vocabulary in the second distribution probability set to obtain a third distribution probability set containing the third distribution probability corresponding to the target vocabulary at each decoding moment.

In this step, after a first distribution probability set generated by the first prediction model and a second distribution probability set generated in the Copying Layer are obtained, both the two distribution probability sets are input into the created Pointing Mechanism model Pointing Mechanism. Pointing Mechanism dynamically assigns a weight p to two sets of probability distributions_tAnd 1-p_tAnd calculating a weighted sum to determine a final generated third distribution probability set Pr containing third distribution probabilities of the respective target words in the prediction sentence^t。

And S15, determining a statement entity object coverage loss function and a language sequence loss function based on the first distribution probability set, the second distribution probability set and the weight values.

In this step, since the sentence finally generated by the first prediction model is composed of each vocabulary generated one by one in the order of describing the target vocabulary in the sentence. Therefore, the first distribution rate of the sentence finally generated by the first prediction model is equal to the sum of the logarithmic probabilities that each vocabulary in the sentence is generated to represent the language sequence loss function, and by minimizing the language sequence loss function, the context and grammar dependency of each target vocabulary in the finally output sentence is ensured, so that the generated sentence is fluent in grammar.

In addition, the target vocabulary is subjected to multi-label classification aiming at the final semantic consistency of the predicted sentences. Specifically, in the second distribution probability set generated by the replication layer, the sum of the products of the second distribution probability at each decoding time and the assigned corresponding weight value is calculated, and the word output probability of each target word at each decoding time is obtained. And finally, acquiring a statement entity object coverage loss function according to the word output probability. Meanwhile, by minimizing the sentence entity object coverage loss function, the accurate meanings of each target word in the finally output sentence are ensured to be consistent with the correct meanings of the entity objects in the corresponding image samples, and the accuracy of each target word in the generated sentence is improved.

And S16, determining a model loss function according to the statement entity object coverage loss function and the language sequence loss function, and optimizing the image description model according to the model loss function until the image description model is generated.

In this step, after the semantic consistency loss function and the language sequence loss function are determined, an optimization parameter is selected, and a model loss function is determined. And meanwhile, continuously correcting the LSTM-P image description model according to the model loss function until the optimal image description model is generated.

Based on the above embodiment of the present application, first, an image sample and a description sentence corresponding to the image sample are obtained, then, each target word in the image sample and the description sentence is sequentially input into a pre-trained first prediction model, a first distribution probability set of a first distribution probability corresponding to each target word included in the description sentence at each decoding time is generated, at the same time, a second distribution probability set of a second distribution probability corresponding to each target word of each entity object at each decoding time is obtained, then, weight values are respectively assigned to the first distribution probability corresponding to the target word in the first distribution probability set and the second distribution probability corresponding to the target word in the second distribution probability set, so as to obtain a third distribution probability set including a third distribution probability corresponding to the target word at each decoding time, and a prediction sentence is output, and finally, determining a model loss function according to the statement entity object coverage loss function and the language sequence loss function, and optimizing the image description model according to the model loss function until the image description model is generated. The image description method and the image description device design an image description model based on the LSTM-P to realize the image description of the image to be recognized. And dynamically integrating the correct target words corresponding to the identified entity objects into the output prediction sentences through a directional mechanism. The LSTM-P based image description model first fully mines the contextual relevance between the generated target words through the conventional CNN & RNN language model. At the same time, a subject trainer trained on the identified data set identifies solid subjects in the input image sample. In addition, words are directly copied from the target words corresponding to the recognized entity objects through the Copying Layer. Further, the pointing mechanism generates a predicted sentence for the first prediction model composed of CNN and LSTM, respectively, and copies the vocabulary directly from the recognized target vocabulary, and dynamically assigns a weight value to the target vocabulary through these two ways. The Pointing Mechanism can screen the best opportunity to copy the object sample to the corresponding descriptive statement in the current context. And the LSTM-P-based image description model is globally trained by reducing the error rate of statement entity object coverage loss and language sequence loss in the output prediction statement. In addition, the lack of coverage rate of the sentence level is used as feedback, and the coverage rate of the output prediction sentences on each entity object in the image sample is further improved.

Fig. 3 is a schematic diagram illustrating a specific flow of a method for generating an image description model in embodiment 30 provided in the present application. Wherein, the detailed process of the specific flow is as follows:

s301, acquiring an image sample and a descriptive statement corresponding to the image sample.

S302, sequentially inputting each target vocabulary in the image sample and the corresponding descriptive sentence into a pre-trained first prediction model.

Here, the first predictive training model includes a convolutional neural network CNN and a long-short term memory network model LSTM. Specifically, the image sample is input into a pre-trained convolutional neural network model, and the feature vector of the image sample is extracted. And further, inputting the target words corresponding to the feature vectors and the descriptive sentences at the last decoding moment into the long-term and short-term memory network model.

S303, generating a first distribution probability set.

In this step, the first prediction model outputs a first distribution probability set describing a first distribution probability corresponding to each target word included in the sentence at each decoding time. For an image sample I, the image sample I is described in sentence S. Wherein, S ═ { ω₁，ω₂，...，ω_NsFrom N_sAnd each word is formed. I is as large as R^DvAnd Wt ∈ R^DwRespectively representing Dv-dimensional visual features and Dw-dimensional text features of the t-th word in the sentence S. In the first prediction model stage, the target vocabulary ω_t+1Equation 1 for the first distribution probability of (a) is expressed as follows:

wherein h is^tIs the output state of the LSTM at the time of t-decode.

S304, the image sample is input to the object detector, and a set of prediction scores is generated.

Here, an image sample is input to an object detector trained in advance based on a preset image data set, and a prediction score set in which a prediction score corresponding to each entity object included in the image sample is a target vocabulary prediction score is generated.

S305, generating a second distribution probability set according to the prediction score set and the hidden state of the LSTM at the current time.

In this step, first, an image sample is input to an object detector trained in advance based on a preset image data set, and a prediction score set in which prediction scores corresponding to target words are generated for each of the entity objects included in the image sample. Further, the target vocabulary is determined as the correct semantic words of the entity objects based on the prediction score set and the hidden state of the long-short term memory network model at the current decoding moment, and a second distribution probability set of the second distribution probability of the target vocabulary at each decoding moment is generated. The second distribution probability is calculated as follows:

wherein h is^tFor the output state of the LSTM at the moment of t-decoding, I_cIs the output of the object detector and is,

and

is a transformation matrix.

S306, the pointing mechanism dynamically allocates weight values for the first distribution probability set and the second distribution probability set.

Here, after the first distribution probability set and the second distribution probability set are input into the pointing mechanism, the pointing mechanism dynamically assigns the weight values, and the specific calculation process is represented by the following calculation formula 3:

Pt＝σ(G_sw_t+G_hh^t+b_p) (3)

wherein the content of the first and second substances,

and

transformation matrices representing text features and LSTM model output, respectively, b_pIs a bias vector.

And S307, determining a third distribution probability set according to the distributed weight values.

Here, the calculation of the target vocabulary is performed based on the first weight value assigned to the first distribution probability of the target vocabulary and the second weight value assigned to the second distribution probabilityAnd a third distribution probability set of a third distribution probability corresponding to the decoding time, wherein the sum of the first weight value and the second weight value is 1. Wherein, the target vocabulary omega_t+1Is calculated by assigning the weight value p calculated by equation 3 to

equations

1 and 2 above_tTo obtain:

wherein the content of the first and second substances,

and

respectively, a first weight value and a second weight value, phi representing a normalization function, such as Softmax.

S308, determining a language sequence loss function.

In this step, a language sequence loss function is determined based on a prediction sentence and a description sentence for the image sample output by the first prediction model. Because the first prediction model generates a corresponding target vocabulary at each decoding moment, the distribution probability of a plurality of vocabularies is modeled by adopting a chain rule. The negative logarithmic probability of the whole output prediction sentence is equal to the sum of the negative logarithmic first distribution probabilities of all the target words, which can be expressed as the following equation 5:

by minimizing the language sequence loss function shown in equation 5, the context dependency between the target words in the output prediction statement can be made more accurate.

S309, determining a statement entity object coverage loss function.

Here, the coverage of each target word included in the output predicted sentence is further calculated based on the second distribution probability of the target word expressed by the above equation 2 and the weight value determined by the equation 3. The specific calculation formula is as follows:

wherein, ω is_oiRepresenting the copied target words, and sigma is a nonlinear activation function, such as Sigmoid function, and as shown in the right side of fig. 2, the products of the second distribution probability and the weight value of each target word are obtained at each decoding moment, and the sum of the products of each target word in the sentence, i.e. the weighted sum, is obtained. The coverage of each target vocabulary is determined. Further, determining a sentence entity object coverage loss function according to the following formula:

equation 7 is the final statement entity object coverage loss function. By minimizing the sentence entity object coverage loss function, the descriptive sentence is closer to the true meaning of the image sample.

S310, determining a model loss function.

Here, the model loss function is determined from the language sequence loss function determined by equation 5 and the sentence entity object coverage loss function determined by equation 7. The specific model loss function is expressed as follows:

where λ is a trade-off parameter. The purpose of the model loss function is to optimize the image description model so that the description statement finally predicted by the image description model conforms to the language logic, and simultaneously, the input image to be predicted can be accurately described.

And S311, optimizing the image description model and generating the model.

The embodiment of the application realizes a method for generating an image description model based on the steps. The image description method is realized by constructing an image description model based on an LSTM-P structure. Meanwhile, a configuration method of the vocabulary extension and a training method of a mixed network of the image description model are realized. Wherein the object detector is pre-trained based on available known image data sets, and then a pointing mechanism is constructed to equalize a first distribution probability for the vocabulary recognition results generated based on a first prediction model composed of CNN and LSTM, and to replicate the extracted target vocabulary directly from the data recognized by the object detector. In addition, the coverage rate of the target vocabulary in the prediction sentence is further mined, so that the accuracy of image description is improved. Experiments based on the COCO image description dataset and the ImageNet image recognition dataset fully validated the model and analysis results.

In addition, as shown in fig. 4, for a method for obtaining an image description provided in embodiment 40 of the present application, an image to be recognized is input into an image description model in each step in the above method for generating an image description model, so as to generate a description sentence of the image to be recognized.

And S41, acquiring the image to be recognized.

S42, inputting the image to be recognized into the image description model to generate a description sentence of the image to be recognized.

Based on the same inventive concept, embodiment 50 of the present application further provides an apparatus for generating an image description model, where as shown in fig. 5, the apparatus includes:

an obtaining module 51, configured to obtain an image sample and a description sentence corresponding to the image sample;

a first generating module 52, configured to sequentially input the image sample and each target word in the descriptive sentence into a pre-trained first prediction model, and generate a first distribution probability set of a first distribution probability corresponding to each target word included in the descriptive sentence at each decoding time;

the second generating module 53 is configured to obtain a second distribution probability set of second distribution probabilities, corresponding to each decoding time, of the target vocabulary corresponding to each entity object in the image sample;

a third generating module 54, configured to respectively assign weighted values to a first distribution probability corresponding to the target vocabulary in the first distribution probability set and a second distribution probability corresponding to the target vocabulary in the second distribution probability set, so as to obtain a third distribution probability set including a third distribution probability corresponding to the target vocabulary at each decoding time, and output a prediction statement;

a first determining module 55, configured to determine a sentence entity object coverage loss function and a language sequence loss function based on the first distribution probability set, the second distribution probability set, and the weight values, respectively;

and a second determining module 56, configured to determine a model loss function according to the statement entity object coverage loss function and the language sequence loss function, and optimize the image description model according to the model loss function until the image description model is generated.

In this embodiment, specific functions and interaction manners of the obtaining module 51, the first generating module 52, the second generating module 53, the third generating module 54, the first determining module 55, and the second determining module 56 may refer to the record of the embodiment corresponding to fig. 1, and are not described herein again.

Optionally, the generating device further includes:

the extraction module 57 is configured to input the image sample into a pre-trained convolutional neural network model, and extract a feature vector of the image sample;

and the input module 58 is used for inputting the target vocabulary corresponding to the feature vector and the descriptive statement at the last decoding moment into the long-term and short-term memory network model.

Optionally, the second generating module 53 includes:

the input unit is used for inputting the image sample into an object detector which is trained in advance based on a preset image data set, and generating a prediction score set of prediction scores of target vocabularies corresponding to all entity objects contained in the image sample;

Optionally, the third generating module 54 is further configured to:

and calculating a third distribution probability set containing a third distribution probability corresponding to the target vocabulary at each decoding moment based on a first weight value distributed to the first distribution probability of the target vocabulary and a second weight value distributed to the second distribution probability, wherein the sum of the first weight value and the second weight value is 1.

Optionally, the first determining module 55 includes:

a first determining unit, configured to determine a statement entity object coverage loss function based on a sum of a second distribution probability corresponding to each target word at each decoding time in a second distribution probability set and an assigned weight value, and an actual distribution probability of the target word in the description statement;

and a second determination unit configured to determine a language sequence loss function based on the prediction statement and description statement for the image sample output by the first prediction model.

Based on the same inventive concept, an embodiment 60 of the present application further provides an apparatus for obtaining an image description, wherein, as shown in fig. 6, the apparatus includes:

the acquisition module 61 is used for acquiring an image to be identified;

and a generating module 62, configured to input the image to be recognized into the image description model to generate a description sentence of the image to be recognized.

In this embodiment, the specific functions and interaction manners of the obtaining module 61 and the generating module 62 can be referred to the record of the embodiment corresponding to fig. 4, and are not described herein again.

As shown in fig. 7, another embodiment 70 of the present application further provides a terminal device, which includes a processor 70, wherein the processor 70 is configured to execute the steps of the … method.

As can also be seen from fig. 7, the terminal device provided by the above embodiment further includes a non-transitory computer readable storage medium 71, the non-transitory computer readable storage medium 71 stores thereon a computer program, and the computer program is executed by the processor 70 to perform the steps of the above-mentioned method for generating an image description model.

In particular, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, a FLASH, and the like, and when the computer program on the storage medium is executed, the method for generating the image description model can be executed.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for generating an image description model, comprising:

respectively determining a semantic consistency loss function and a grammar consistency loss function based on the first distribution probability set, the second distribution probability set and the weight values;

and determining a model loss function according to the semantic consistency loss function and the grammar consistency loss function, and optimizing the image description model according to the model loss function until the image description model is generated.

2. The generation method according to claim 1, wherein the first prediction model is composed of a convolutional neural network model and a long-short term memory network model, and between the step of sequentially inputting each target word in the image sample and the description sentence into the pre-trained first prediction model and the step of generating the first distribution probability set of the first distribution probability corresponding to each target word included in the description sentence at each decoding time, the generation method further comprises:

3. The method according to claim 2, wherein the step of obtaining a second distribution probability set of second distribution probabilities corresponding to the target vocabulary for each physical object in the image sample at each decoding time comprises:

4. The method according to claim 3, wherein the step of obtaining a third distribution probability set containing a third distribution probability corresponding to the target vocabulary at each decoding time comprises:

5. The method according to claim 4, wherein the step of determining the semantic consistency loss function and the syntactic consistency loss function respectively comprises:

determining a semantic consistency loss function based on the sum of the second distribution probability corresponding to each target vocabulary at each decoding moment in the second distribution probability set and the assigned weight value and the actual distribution probability of the target vocabulary in the description statement;

determining a syntax continuity loss function based on the prediction statement and the description statement for the image sample output by the first prediction model.

6. A method for obtaining a description of an image, characterized in that, based on the method according to any one of the steps of claims 1 to 5:

acquiring an image to be identified;

inputting the image to be recognized into an image description model to generate a description sentence of the image to be recognized.

7. An apparatus for generating an image description model, comprising:

a first determining module, configured to determine a semantic consistency loss function and a syntax consistency loss function based on the first distribution probability set, the second distribution probability set, and the weight values, respectively;

and the second determining module is used for determining a model loss function according to the semantic consistency loss function and the grammar consistency loss function, and optimizing the image description model according to the model loss function until the image description model is generated.

8. The generation apparatus according to claim 7, characterized in that the generation apparatus further comprises:

9. The generation apparatus according to claim 8, wherein the second generation module comprises:

10. The generation apparatus of claim 9, wherein the third generation module is further configured to:

11. The generation apparatus according to claim 10, wherein the first determination module comprises:

a first determining unit, configured to determine a semantic consistency loss function based on a sum of the second distribution probability corresponding to each target vocabulary at each decoding time in the second distribution probability set and the assigned weight value, and an actual distribution probability of the target vocabulary in the description sentence;

a second determination unit configured to determine a syntax continuity loss function based on the description statement and the prediction statement for the image sample output by the first prediction model.

12. An apparatus for obtaining a description of an image, comprising:

the acquisition module is used for acquiring an image to be identified;

and the generation module is used for inputting the image to be recognized into an image description model so as to generate a description sentence of the image to be recognized.

13. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of a method of generating an image description model as claimed in any one of claims 1 to 5.

14. A terminal device, characterized in that it comprises a processor for executing the steps of a method for generating an image description model according to any one of claims 1 to 5.