CN111783809A - Image description generation method and device and computer readable storage medium - Google Patents

Image description generation method and device and computer readable storage medium

Info

Publication number
CN111783809A
Authority
CN
China
Prior art keywords
target
image
feature
target frame
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910841842.2A
Other languages
Chinese (zh)
Other versions
CN111783809B (en)
Inventor
潘滢炜
姚霆
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910841842.2A priority Critical patent/CN111783809B/en
Publication of CN111783809A publication Critical patent/CN111783809A/en
Application granted granted Critical
Publication of CN111783809B publication Critical patent/CN111783809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image description generation method and apparatus and a computer-readable storage medium, and relates to the field of computer technology. The method of the present disclosure comprises: constructing a semantic tree of an image according to the relationships among the targets in the image, the target frames of the targets, and the image, each node of the semantic tree corresponding to one of the targets, one of the target frames, or the image; performing feature fusion with a tree-structured long short-term memory (Tree-LSTM) network according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, to determine the fused target frame features and the fused image global feature, a target frame feature being a feature of the image within the target frame of a target; and determining a description text of the image with an image description generation model according to the target features, the fused target frame features, and the fused image global feature.

Description

Image description generation method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating an image description, and a computer-readable storage medium.
Background
Automatic image description generation aims to make a machine understand an image and automatically generate a description text for that image.
At present, generating image descriptions automatically with a recurrent deep neural network is a common approach in academia.
Disclosure of Invention
The inventors have found that the description text obtained by current image description generation methods is not accurate; for example, some targets are not described, or the relationships between targets are not reflected.
One technical problem to be solved by the present disclosure is to improve the accuracy of image description.
According to some embodiments of the present disclosure, there is provided an image description generation method, comprising: constructing a semantic tree of an image according to the relationships among the targets in the image, the target frames of the targets, and the image, each node of the semantic tree corresponding to one of the targets, one of the target frames, or the image; performing feature fusion with a tree-structured long short-term memory (Tree-LSTM) network according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, to determine the fused target frame features and the fused image global feature, a target frame feature being a feature of the image within the target frame of a target; and determining a description text of the image with an image description generation model according to the target features, the fused target frame features, and the fused image global feature.
In some embodiments, constructing the semantic tree of the image comprises: configuring the node corresponding to the image as the root node of the semantic tree; configuring the nodes corresponding to the target frames as intermediate-layer nodes of the semantic tree; configuring the nodes corresponding to the targets as leaf nodes of the semantic tree; and configuring the leaf node corresponding to each target as a child node of the node corresponding to that target's target frame.
In some embodiments, configuring the nodes corresponding to the target frames as intermediate-layer nodes of the semantic tree comprises: arranging the target frames in order of decreasing area; taking the nodes corresponding to the target frames as nodes to be added, one by one, in that order; determining the overlapping area between the target frame corresponding to the node to be added and the target frame corresponding to each node already added; and, if there is an added node whose target frame overlaps the target frame of the node to be added by more than a threshold, configuring the node to be added as a child node of that added node, otherwise configuring the node to be added as a child node of the root node.
In some embodiments, performing feature fusion with the Tree-LSTM network comprises: starting from the layer of the leaf nodes of the semantic tree, inputting the features corresponding to all child nodes of the same parent node, together with the feature corresponding to that parent node, into the Tree-LSTM network to obtain a fused feature for the parent node, and updating the feature corresponding to the parent node to the fused feature; updating the features corresponding to the nodes of each layer in turn, from bottom to top; and determining the fused target frame features and the fused image global feature according to the updated features corresponding to the nodes. When the root node serves as the parent node, an average pooling operation is performed on the target features to obtain a first global target feature, an average pooling operation is performed on the target frame features to obtain a first global target frame feature, and the first global target feature and the first global target frame feature are weighted to obtain the input feature corresponding to the root node.
In some embodiments, determining the description text of the image with the image description generation model according to the target features, the fused target frame features, and the fused image global feature comprises: for each target feature, combining the target feature with the corresponding target frame feature and the corresponding fused target frame feature to obtain a first target local feature; performing an average pooling operation on the target features to obtain a first global target feature; performing an average pooling operation on the target frame features to obtain a first global target frame feature; combining the first global target feature, the first global target frame feature, and the fused image global feature to obtain a first global image expression feature; and inputting the first global image expression feature and each first target local feature into the image description generation model to obtain the output description text of the image.
In some embodiments, inputting the first global image expression feature and each first target local feature into the image description generation model to obtain the output description text of the image comprises: combining the feature of the description word at the current time step, the first global image expression feature, and the feature output by the second-layer long short-term memory (LSTM) network of the image description generation model at the previous time step, and inputting the result into the first-layer LSTM network of the image description generation model; inputting the feature output by the first-layer LSTM network and each first target local feature into an attention mechanism module; and combining the feature output by the attention mechanism module with the feature output by the first-layer LSTM network and inputting the result into the second-layer LSTM network to obtain the description word output at the next time step.
In some embodiments, determining the description text of the image with the image description generation model according to the target features, the fused target frame features, and the fused image global feature comprises: inputting the target features into a graph convolutional network to obtain the updated target features; inputting the fused target frame features into the graph convolutional network to obtain the updated fused target frame features; inputting the target frame features into the graph convolutional network to obtain the updated target frame features; for each updated target feature, combining the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature; performing an average pooling operation on the updated target features to obtain a second global target feature; performing an average pooling operation on the updated target frame features to obtain a second global target frame feature; performing an average pooling operation on the updated fused target frame features to obtain a third global target frame feature; combining the second global target feature, the second global target frame feature, and the third global target frame feature to obtain a second global image expression feature; and inputting the second global image expression feature and each second target local feature into the image description generation model to obtain the output description text of the image.
In some embodiments, inputting the second global image expression feature and each second target local feature into the image description generation model to obtain the output description text of the image comprises: combining the feature of the description word at the current time step, the second global image expression feature, and the feature output by the second-layer LSTM network of the image description generation model at the previous time step, and inputting the result into the first-layer LSTM network of the image description generation model; inputting the feature output by the first-layer LSTM network and each second target local feature into an attention mechanism module; and combining the feature output by the attention mechanism module with the feature output by the first-layer LSTM network and inputting the result into the second-layer LSTM network to obtain the description word output at the next time step.
In some embodiments, the method further comprises: performing target detection on the image to obtain the target frames and segmentation regions in the image; extracting features from the image within each target frame to obtain the target frame feature of each target frame in the image; and extracting features from the image within each segmentation region to obtain the target feature of each target in the image.
In some embodiments, extracting features from the image within each segmentation region to obtain the target feature of each target in the image comprises: for the image within each segmentation region, setting the image within the segmentation region to white and the rest of the image to black to obtain a binarized image; superimposing the binarized image on the original image to obtain an image with the background removed; and inputting the background-removed image into an object detector to obtain the target features of the targets in the respective regions.
In some embodiments, the method further comprises: obtaining training samples, the training samples comprising sample images and the description texts corresponding to the sample images; obtaining the target features of the targets in a sample image and the target frame features of the target frames; constructing a semantic tree of the sample image according to the relationships among the targets in the sample image, the target frames of the targets, and the sample image; and training the Tree-LSTM network to be trained and the image description generation model to be trained according to the semantic tree, the target features of the targets in the sample image, and the target frame features of the target frames.
In some embodiments, training the Tree-LSTM network to be trained and the image description generation model to be trained comprises: performing feature fusion with the Tree-LSTM network to be trained according to the relationships of the nodes in the semantic tree of the sample image, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, to determine the fused target frame features and the fused image global feature of the sample image; determining the output description text of the sample image with the image description generation model to be trained, according to the target features of the sample image, the fused target frame features, and the fused image global feature; and adjusting the parameters of the Tree-LSTM network to be trained and of the image description generation model to be trained according to the output description text of the sample image and the labeled description text of the sample image, until a preset convergence condition is met, thereby completing the training of the models.
According to other embodiments of the present disclosure, there is provided an image description generation apparatus, comprising: a semantic tree construction module, configured to construct a semantic tree of an image according to the relationships among the targets in the image, the target frames of the targets, and the image, each node of the semantic tree corresponding to one of the targets, one of the target frames, or the image; a feature fusion module, configured to perform feature fusion with a tree-structured long short-term memory (Tree-LSTM) network according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, to determine the fused target frame features and the fused image global feature, a target frame feature being a feature of the image within the target frame of a target; and a description generation module, configured to determine a description text of the image with an image description generation model according to the target features, the fused target frame features, and the fused image global feature.
In some embodiments, the semantic tree construction module is configured to configure the node corresponding to the image as the root node of the semantic tree; configure the nodes corresponding to the target frames as intermediate-layer nodes of the semantic tree; configure the nodes corresponding to the targets as leaf nodes of the semantic tree; and configure the leaf node corresponding to each target as a child node of the node corresponding to that target's target frame.
In some embodiments, the semantic tree construction module is configured to arrange the target frames in order of decreasing area; take the nodes corresponding to the target frames as nodes to be added, one by one, in that order; determine the overlapping area between the target frame corresponding to the node to be added and the target frame corresponding to each node already added; and, if there is an added node whose target frame overlaps the target frame of the node to be added by more than a threshold, configure the node to be added as a child node of that added node, otherwise configure the node to be added as a child node of the root node.
In some embodiments, the feature fusion module is configured to, starting from the layer of the leaf nodes of the semantic tree, input the features corresponding to all child nodes of the same parent node, together with the feature corresponding to that parent node, into the Tree-LSTM network to obtain a fused feature for the parent node, and update the feature corresponding to the parent node to the fused feature; update the features corresponding to the nodes of each layer in turn, from bottom to top; and determine the fused target frame features and the fused image global feature according to the updated features corresponding to the nodes. When the root node serves as the parent node, an average pooling operation is performed on the target features to obtain a first global target feature, an average pooling operation is performed on the target frame features to obtain a first global target frame feature, and the first global target feature and the first global target frame feature are weighted to obtain the input feature corresponding to the root node.
In some embodiments, the description generation module is configured to, for each target feature, combine the target feature with the corresponding target frame feature and the corresponding fused target frame feature to obtain a first target local feature; perform an average pooling operation on the target features to obtain a first global target feature; perform an average pooling operation on the target frame features to obtain a first global target frame feature; combine the first global target feature, the first global target frame feature, and the fused image global feature to obtain a first global image expression feature; and input the first global image expression feature and each first target local feature into the image description generation model to obtain the output description text of the image.
In some embodiments, the description generation module is configured to combine the feature of the description word at the current time step, the first global image expression feature, and the feature output by the second-layer LSTM network of the image description generation model at the previous time step, and input the result into the first-layer LSTM network of the image description generation model; input the feature output by the first-layer LSTM network and each first target local feature into an attention mechanism module; and combine the feature output by the attention mechanism module with the feature output by the first-layer LSTM network and input the result into the second-layer LSTM network to obtain the description word output at the next time step.
In some embodiments, the description generation module is configured to input the target features into a graph convolutional network to obtain the updated target features; input the fused target frame features into the graph convolutional network to obtain the updated fused target frame features; input the target frame features into the graph convolutional network to obtain the updated target frame features; for each updated target feature, combine the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature; perform an average pooling operation on the updated target features to obtain a second global target feature; perform an average pooling operation on the updated target frame features to obtain a second global target frame feature; perform an average pooling operation on the updated fused target frame features to obtain a third global target frame feature; combine the second global target feature, the second global target frame feature, and the third global target frame feature to obtain a second global image expression feature; and input the second global image expression feature and each second target local feature into the image description generation model to obtain the output description text of the image.
In some embodiments, the description generation module is configured to combine the feature of the description word at the current time step, the second global image expression feature, and the feature output by the second-layer LSTM network of the image description generation model at the previous time step, and input the result into the first-layer LSTM network of the image description generation model; input the feature output by the first-layer LSTM network and each second target local feature into an attention mechanism module; and combine the feature output by the attention mechanism module with the feature output by the first-layer LSTM network and input the result into the second-layer LSTM network to obtain the description word output at the next time step.
In some embodiments, the apparatus further comprises: a feature extraction module, configured to perform target detection on the image to obtain the target frames and segmentation regions in the image; extract features from the image within each target frame to obtain the target frame feature of each target frame in the image; and extract features from the image within each segmentation region to obtain the target feature of each target in the image.
In some embodiments, the feature extraction module is configured to, for the image within each segmentation region, set the image within the segmentation region to white and the rest of the image to black to obtain a binarized image; superimpose the binarized image on the original image to obtain an image with the background removed; and input the background-removed image into an object detector to obtain the target features of the targets in the respective regions.
In some embodiments, the apparatus further comprises: a training module, configured to obtain training samples, the training samples comprising sample images and the description texts corresponding to the sample images; obtain the target features of the targets in a sample image and the target frame features of the target frames; construct a semantic tree of the sample image according to the relationships among the targets in the sample image, the target frames of the targets, and the sample image; and train the Tree-LSTM network to be trained and the image description generation model to be trained according to the semantic tree, the target features of the targets in the sample image, and the target frame features of the target frames.
In some embodiments, the training module is configured to perform feature fusion with the Tree-LSTM network to be trained according to the relationships of the nodes in the semantic tree of the sample image, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, to determine the fused target frame features and the fused image global feature of the sample image; determine the output description text of the sample image with the image description generation model to be trained, according to the target features of the sample image, the fused target frame features, and the fused image global feature; and adjust the parameters of the Tree-LSTM network to be trained and of the image description generation model to be trained according to the output description text of the sample image and the labeled description text of the sample image, until a preset convergence condition is met, thereby completing the training of the models.
According to still other embodiments of the present disclosure, there is provided an apparatus for generating an image description, including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform a method of generating an image description as in any of the preceding embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method of generating an image description of any of the foregoing embodiments.
The image is divided into semantic information at different levels, namely the targets, the target frames, and the image, and a semantic tree of the image is constructed according to the relationships among the targets, the target frames of the targets, and the image, so that the semantic information at different levels is embodied. Feature fusion is then performed with a tree-structured long short-term memory (Tree-LSTM) network according to the semantic tree, yielding the fused target frame features and the fused image global feature; these fused features reflect the relationships among the semantic information at different levels. Finally, the image description generation model is used with the target features, the fused target frame features, and the fused image global feature to obtain the description text of the image. In this scheme, the semantic tree is used to mine and embody the rich, multi-level semantic information of the image, and the relationships between features at different levels are then obtained based on the semantic tree, so that the image description generation model can understand the multi-level semantic information of the image and the generated description text is richer and more accurate.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 illustrates a flow diagram of a method of generating an image description of some embodiments of the present disclosure.
Fig. 2A illustrates a schematic diagram of a semantic tree of an image of some embodiments of the present disclosure.
Fig. 2B illustrates a structural schematic of an image description generation model of some embodiments of the present disclosure.
Fig. 2C illustrates a structural schematic of an image description generation model of further embodiments of the present disclosure.
Fig. 3 shows a flow diagram of a method of generating an image description of further embodiments of the present disclosure.
Fig. 4 shows a flow diagram of a method of generating an image description of further embodiments of the present disclosure.
Fig. 5 shows a schematic structural diagram of an image description generation apparatus of some embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of an image description generating device of further embodiments of the present disclosure.
Fig. 7 shows a schematic structural diagram of an image description generation apparatus according to further embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of the embodiments. It is apparent that the described embodiments are only a part of the embodiments of the present disclosure, not all of them. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments obtained by a person skilled in the art from the disclosed embodiments without creative effort fall within the protection scope of the present disclosure.
The present scheme is proposed to solve the problem that existing image description generation methods are not accurate. Some embodiments of the image description generation method of the present disclosure are described below in conjunction with fig. 1.
Fig. 1 is a flow chart of some embodiments of a method of generating an image description of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S106.
In step S102, a semantic tree of the image is constructed according to the relationship among the objects in the image, the object frames of the objects, and the image.
The targets and the target frames may be obtained by a target detection method, which is described in detail in subsequent embodiments. A target frame is the bounding box obtained when the target is detected, and may also be understood as the image framed by the bounding box. Each target corresponds to one target frame, and the image within the target frame is a part of the global image.
Each node of the semantic tree corresponds to each target, each target frame and the image respectively. In some embodiments, the node corresponding to the image is configured as a root node of the semantic tree. And configuring the nodes corresponding to the target boxes as middle-layer nodes of the semantic tree. And configuring the nodes corresponding to the targets as leaf nodes of the semantic tree. And the leaf nodes corresponding to the targets are configured as child nodes of the target frame corresponding to the targets. The image (global image or part of global image) corresponding to the parent node contains the image (part of global image) corresponding to the child node. The image in the target frame includes a target and a background image.
In some embodiments, the target frames are arranged in order of decreasing area. The nodes corresponding to the target frames are taken as nodes to be added, one by one, in that order. The overlapping area between the target frame corresponding to the node to be added and the target frame corresponding to each node already added is determined. If there is an added node whose target frame overlaps the target frame of the node to be added by more than a threshold, the node to be added is configured as a child node of that added node; otherwise, the node to be added is configured as a child node of the root node.
Fig. 2A shows a semantic tree of an image. The global image serves as the root node of the tree, and the targets include a tree, a person, and the hat and glasses worn by the person; these targets are leaf nodes. The target frame of the person, the target frame of the tree, the target frame of the hat, and the target frame of the glasses serve as intermediate-layer nodes. The target frames are arranged in order of decreasing area. The node corresponding to the target frame of the tree is taken first as the node to be added and is added as a child node of the root node. The node corresponding to the target frame of the person is then added, also as a child node of the root node. Next, the node corresponding to the target frame of the hat is taken as the node to be added; by comparing the target frame of the hat with the target frame of the person, the overlapping area is found to exceed the threshold, so the node corresponding to the target frame of the hat is added as a child node of the node corresponding to the target frame of the person. Similarly, the node corresponding to the target frame of the glasses is added as a child node of the node corresponding to the target frame of the person. Finally, the node corresponding to each target is added as a child node of its corresponding target frame. It can be seen that the image corresponding to a parent node contains the images corresponding to its child nodes, so the semantic tree divides the image into semantic information at different levels.
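For concreteness, the following Python sketch illustrates one way the construction procedure above could be implemented. It is only an illustration under stated assumptions: the Node structure, the coordinate format, the choice to normalize the overlap by the smaller box's area, and the threshold value are not specified by this disclosure.

```python
# Illustrative sketch of the semantic-tree construction described above.
# The Node structure, box format (x1, y1, x2, y2), overlap normalization and
# threshold are assumptions, not part of the disclosure.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                     # "image", "frame:<id>" or "target:<id>"
    box: tuple = None             # bounding box for target-frame nodes
    children: list = field(default_factory=list)

def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def overlap_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def build_semantic_tree(boxes, overlap_thresh=0.5):
    """boxes: dict {target_id: (x1, y1, x2, y2)} produced by a detector."""
    root = Node("image")
    added = []                    # target-frame nodes already placed
    # add target-frame nodes in order of decreasing box area
    for tid, box in sorted(boxes.items(), key=lambda kv: area(kv[1]), reverse=True):
        frame = Node(f"frame:{tid}", box=box)
        parent = root
        for cand in added:
            # attach under an earlier (larger) frame if the overlap is large enough
            if overlap_area(box, cand.box) / max(area(box), 1e-6) > overlap_thresh:
                parent = cand
                break
        parent.children.append(frame)
        added.append(frame)
    # add each target as a leaf child of its own target frame
    for frame in added:
        frame.children.append(Node(frame.name.replace("frame:", "target:")))
    return root
```

Under these assumptions, applying the sketch to the example of Fig. 2A should attach the hat and glasses frames under the person frame and the tree and person frames under the root, matching the structure described above.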
In step S104, according to the relationships of the nodes in the semantic tree, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, feature fusion is performed with a tree-structured long short-term memory (Tree-LSTM) network, and the fused target frame features and the fused image global feature are determined.
A target frame feature is a feature of the image within the target frame of a target. The target features of the targets and the target frame features of the target frames may be extracted by a model such as an object detector, as described in the following embodiments.
Performing feature fusion with the Tree-LSTM network means encoding the features corresponding to the nodes of the semantic tree: the features corresponding to the nodes form a tree-shaped sequence according to the structure of the semantic tree and are input into the Tree-LSTM network, and the relationships of the nodes are taken into account during encoding, so that the fused features contain the relationships between different nodes.
In some embodiments, starting from the layer of the leaf nodes of the semantic tree, the features corresponding to all child nodes of the same parent node and the feature corresponding to that parent node are input into the Tree-LSTM network, a fused feature for the parent node is obtained, and the feature corresponding to the parent node is updated to the fused feature. This process is repeated from bottom to top, updating the features corresponding to the nodes of each layer in turn. The fused target frame features and the fused image global feature are then determined from the updated features corresponding to the nodes. For a node corresponding to a target, the feature corresponding to the node represents the target feature; for a node corresponding to a target frame, the feature corresponding to the node represents the target frame feature; and for the node corresponding to the image, the feature corresponding to the node represents the image global feature.
Similar to a standard LSTM, the Tree-LSTM has, for each node j (j being a positive integer indexing a node in the semantic tree), a memory cell c_j, a hidden state h_j, an input gate i_j, and an output gate o_j. Unlike a standard LSTM, which updates its memory cell based only on the previous hidden state, the Tree-LSTM updates the memory cell of a parent node based on the hidden states of all of its child nodes. Each child node also has its own forget gate f_jk, where k is a positive integer indexing a child of the same parent node. For a node j of the semantic tree, x_j and h_j denote the input feature and the output fused feature respectively, and may be represented as feature vectors; C(j) denotes the set of children of node j; W denotes the input weight matrices, U denotes the recurrent weight matrices, and b denotes the biases; σ is the sigmoid function and tanh is the hyperbolic tangent function; ⊙ denotes the element-wise product of two vectors. Feature fusion, i.e. the update process of the Tree-LSTM, can be computed as follows:

h̃_j = Σ_{k∈C(j)} h_k    (1)
i_j = σ(W_i x_j + U_i h̃_j + b_i)    input gate (2)
o_j = σ(W_o x_j + U_o h̃_j + b_o)    output gate (3)
f_jk = σ(W_f x_j + U_f h_k + b_f)    forget gate (4)
u_j = tanh(W_u x_j + U_u h̃_j + b_u)    (5)
c_j = u_j ⊙ i_j + Σ_{k∈C(j)} c_k ⊙ f_jk    cell state (6)
h_j = o_j ⊙ tanh(c_j)    hidden state (7)
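A compact PyTorch sketch of a child-sum Tree-LSTM cell implementing equations (1)-(7) is given below for concreteness; the module and dimension names are assumptions, and this is not the reference implementation of the disclosure.

```python
# Child-sum Tree-LSTM cell following equations (1)-(7); a sketch, not the
# disclosure's reference implementation. x_j: (d_in,); each child state
# (h_k, c_k): (d_hid,).
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.W_iou = nn.Linear(d_in, 3 * d_hid)              # W_i, W_o, W_u
        self.U_iou = nn.Linear(d_hid, 3 * d_hid, bias=False)  # U_i, U_o, U_u
        self.W_f = nn.Linear(d_in, d_hid)                     # W_f
        self.U_f = nn.Linear(d_hid, d_hid, bias=False)        # U_f

    def forward(self, x_j, child_h, child_c):
        # child_h, child_c: (num_children, d_hid)
        h_tilde = child_h.sum(dim=0)                                    # eq. (1)
        i, o, u = (self.W_iou(x_j) + self.U_iou(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)     # eqs. (2), (3), (5)
        f = torch.sigmoid(self.W_f(x_j) + self.U_f(child_h))            # eq. (4), one gate per child
        c_j = u * i + (f * child_c).sum(dim=0)                          # eq. (6)
        h_j = o * torch.tanh(c_j)                                       # eq. (7)
        return h_j, c_j
```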
For the node corresponding to each target and the node corresponding to each target frame, the target feature and the target frame feature of target i (i being a positive integer indexing the target) are respectively used as the input x_j of the node. An average pooling operation is performed on the target features to obtain a first global target feature, and an average pooling operation is performed on the target frame features to obtain a first global target frame feature. The first global target feature and the first global target frame feature are weighted to obtain the input feature corresponding to the root node.
The features corresponding to the nodes of each layer are updated from bottom to top through the Tree-LSTM, and the target frame features are enhanced according to the contextual information mined from the target features and/or the finer target frame features, yielding the fused target frame features. The fused image global feature Ih thus carries multi-level information from the targets, the target frames, and the whole image.
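The bottom-up update just described could be driven by a simple post-order traversal over the semantic tree, as in the sketch below. The feature lookup, the handling of leaves, and the weighting coefficient alpha at the root are assumptions for illustration only.

```python
# Bottom-up feature fusion over the semantic tree; a sketch under stated
# assumptions. `cell` is a ChildSumTreeLSTMCell as above, `feat(node)` is an
# assumed lookup returning the target / target-frame feature of a node, and
# `alpha` is an assumed weighting coefficient for the root input.
import torch

def fuse(node, cell, feat, obj_feats, box_feats, d_hid, alpha=0.5):
    if not node.children:                        # leaf: target node, no children
        zero = torch.zeros(1, d_hid)
        return cell(feat(node), zero, zero)
    child_h, child_c = zip(*(fuse(c, cell, feat, obj_feats, box_feats, d_hid, alpha)
                             for c in node.children))
    child_h, child_c = torch.stack(child_h), torch.stack(child_c)
    if node.name == "image":                     # root: weighted average-pooled globals as input
        x = alpha * torch.stack(obj_feats).mean(0) + (1 - alpha) * torch.stack(box_feats).mean(0)
    else:                                        # intermediate node: target-frame feature
        x = feat(node)
    h, c = cell(x, child_h, child_c)
    node.fused = h        # fused target-frame feature, or fused global feature at the root
    return h, c
```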
In step S106, a description text of the image is determined with the image description generation model according to the target features, the fused target frame features, and the fused image global feature.
The image description generation model may use an existing model, such as the attention-based top-down LSTM decoder in Up-Down, or GCN-LSTM (a graph convolutional network combined with a long short-term memory network), and is not limited to these examples.
The input of the image description generation model is improved. In some embodiments, for each target feature, the target feature is combined with the corresponding target frame feature (each target feature corresponds one-to-one to the target frame feature of the target frame in which the target is located) and the corresponding fused target frame feature to obtain a first target local feature, and a set of first target local features can thus be constructed. An average pooling operation is performed on the target features to obtain a first global target feature, and an average pooling operation is performed on the target frame features to obtain a first global target frame feature. The first global target feature, the first global target frame feature, and the fused image global feature Ih are combined to obtain a first global image expression feature. The first global image expression feature and each first target local feature are input into the image description generation model to obtain the output description text of the image.
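A short sketch of how these decoder inputs could be assembled is shown below; using concatenation as the combining operation and mean pooling for the global features is an assumption.

```python
# Assembling the decoder inputs described above; concatenation as the
# combining operation is an assumption.
import torch

def build_decoder_inputs(obj_feats, box_feats, fused_box_feats, fused_global):
    # obj_feats[i], box_feats[i] and fused_box_feats[i] belong to the same target i
    local = [torch.cat([o, b, fb])                          # first target local features
             for o, b, fb in zip(obj_feats, box_feats, fused_box_feats)]
    g_obj = torch.stack(obj_feats).mean(0)                  # first global target feature
    g_box = torch.stack(box_feats).mean(0)                  # first global target-frame feature
    global_expr = torch.cat([g_obj, g_box, fused_global])   # first global image expression feature
    return global_expr, local
```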
As shown in Fig. 2B, in some embodiments, the image description generation model comprises a two-layer LSTM and an attention mechanism module connected to the first-layer LSTM and the second-layer LSTM respectively. The feature of the description word w_t at the current time step (i.e., the last description word obtained), the first global image expression feature, and the feature output by the second-layer LSTM at the previous time step are combined and input into the first-layer LSTM of the image description generation model. The feature output by the first-layer LSTM and each first target local feature are input into the attention mechanism module. The feature output by the attention mechanism module and the feature output by the first-layer LSTM are combined and input into the second-layer LSTM to obtain the description word w_{t+1} output at the next time step.
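One decoding step of this two-layer LSTM with attention could look like the following sketch. The layer sizes, the soft-attention form, and the word embedding are assumptions; only the wiring of the inputs and outputs follows the description above.

```python
# One step of the two-layer attention LSTM decoder of Fig. 2B; a sketch under
# stated assumptions (soft attention, embedding layer, layer sizes).
import torch
import torch.nn as nn

class TwoLayerAttnDecoderStep(nn.Module):
    def __init__(self, vocab, d_emb, d_glob, d_loc, d_hid):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_emb)
        self.lstm1 = nn.LSTMCell(d_emb + d_glob + d_hid, d_hid)
        self.attn_v = nn.Linear(d_loc, d_hid, bias=False)
        self.attn_h = nn.Linear(d_hid, d_hid, bias=False)
        self.attn_out = nn.Linear(d_hid, 1, bias=False)
        self.lstm2 = nn.LSTMCell(d_loc + d_hid, d_hid)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, w_t, global_expr, locals_, state1, state2):
        # w_t: (B,) word ids, global_expr: (B, d_glob), locals_: (B, N, d_loc)
        x1 = torch.cat([self.embed(w_t), global_expr, state2[0]], dim=-1)
        h1, c1 = self.lstm1(x1, state1)
        # attention over the target local features, conditioned on h1
        scores = self.attn_out(torch.tanh(self.attn_v(locals_) + self.attn_h(h1).unsqueeze(1)))
        ctx = (torch.softmax(scores, dim=1) * locals_).sum(dim=1)
        # second layer takes the attended feature together with h1
        h2, c2 = self.lstm2(torch.cat([ctx, h1], dim=-1), state2)
        return self.out(h2), (h1, c1), (h2, c2)     # logits over the next word
```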
As shown in Fig. 2C, in some embodiments, the image description generation model comprises a two-layer LSTM, an attention mechanism module, and a graph convolutional network. In some embodiments, the target features are input into the graph convolutional network to obtain the updated target features; the fused target frame features are input into the graph convolutional network to obtain the updated fused target frame features; and the target frame features are input into the graph convolutional network to obtain the updated target frame features.
The graph convolutional network can update the corresponding features according to the relationships among the nodes, so that the features reflect the relationships among different targets; the generated description text can then express the relationships among the targets, further improving the accuracy of the description.
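A minimal graph-convolution layer that could perform such an update is sketched below; the adjacency construction (for example, from spatial or semantic relations between targets) and the single residual layer are assumptions rather than the exact GCN-LSTM configuration.

```python
# Minimal graph-convolution update of target / target-frame features; the
# adjacency matrix and single residual layer are assumptions.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, d_feat):
        super().__init__()
        self.lin = nn.Linear(d_feat, d_feat)

    def forward(self, feats, adj):
        # feats: (N, d_feat) node features; adj: (N, N) relation / adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = adj @ feats / deg                    # average over related nodes
        return torch.relu(self.lin(agg) + feats)   # residual keeps the original feature
```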
For each updated target feature, the updated target feature is combined with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature, and a set of second target local features is constructed. An average pooling operation is performed on the updated target features to obtain a second global target feature; an average pooling operation is performed on the updated target frame features to obtain a second global target frame feature; and an average pooling operation is performed on the updated fused target frame features to obtain a third global target frame feature. The second global target feature, the second global target frame feature, and the third global target frame feature are combined to obtain a second global image expression feature. The second global image expression feature and each second target local feature are input into the image description generation model to obtain the output description text of the image.
The feature of the description word w_t at the current time step (i.e., the last description word obtained), the second global image expression feature, and the feature output by the second-layer LSTM of the image description generation model at the previous time step are combined and input into the first-layer LSTM of the image description generation model. The feature output by the first-layer LSTM and each second target local feature are input into the attention mechanism module. The feature output by the attention mechanism module and the feature output by the first-layer LSTM are combined and input into the second-layer LSTM to obtain the description word w_{t+1} output at the next time step.
Experiments show that constructing the semantic tree of the image, performing feature fusion with the Tree-LSTM network according to the semantic tree, and then generating the description text with the image description generation model, as in the above embodiments, significantly improves the accuracy of the generated description text compared with the prior art. For example, for an image that contains a giraffe and two zebras standing beside a tree, the description generated by a prior-art method may be "a group of zebras standing next to a giraffe", whereas the scheme of the present disclosure can accurately describe it as "a giraffe and two zebras standing beside a tree".
The method of these embodiments divides the image into semantic information at different levels, namely the targets, the target frames, and the image, and constructs a semantic tree of the image according to the relationships among the targets, the target frames of the targets, and the image, so that the semantic information at different levels is embodied. Feature fusion is then performed with the Tree-LSTM network according to the semantic tree, yielding the fused target frame features and the fused image global feature; these fused features reflect the relationships among the semantic information at different levels. Finally, the image description generation model is used with the target features, the fused target frame features, and the fused image global feature to obtain the description text of the image. In this scheme, the semantic tree is used to mine and embody the rich, multi-level semantic information of the image, and the relationships between features at different levels are then obtained based on the semantic tree, so that the image description generation model can understand the multi-level semantic information of the image and the generated description text is richer and more accurate.
Further embodiments of the image description generation method of the present disclosure are described below in conjunction with fig. 3.
FIG. 3 is a flow chart of further embodiments of a method of generating an image description of the present disclosure. As shown in fig. 3, steps S102 to S106 are preceded by: steps S302 to S306.
In step S302, target detection is performed on the image, and each target frame and each divided region in the image are obtained.
Target detection may be performed with an existing model, for example Mask R-CNN (Mask Region-based Convolutional Neural Network), which yields the target frames and segmentation regions in the image; the method is not limited to this example. Mask R-CNN is a pixel-level semantic segmentation model that can determine the category of each pixel, thereby realizing semantic segmentation of the image. A segmentation region is the region enclosed by the contour of a target.
In step S304, features are extracted from the image within each target frame, and the target frame feature of each target frame in the image is output.
Features can be extracted from the image within each target frame using an existing model, for example the object detector Faster R-CNN (Faster Region-based Convolutional Neural Network).
In step S306, features are extracted from the image within each segmentation region, and the target feature of each target in the image is output.
Steps S304 and S306 may be performed in parallel. To improve the accuracy of the target features, in some embodiments, for the image within each segmentation region, the image within the segmentation region is set to white and the rest of the image is set to black to obtain a binarized image. The binarized image is superimposed on the original image to obtain an image with the background removed. The background-removed image is input into an object detector to obtain the target features of the targets in the respective regions. The object detector is, for example, Faster R-CNN.
By removing the background other than the target, the features of the target can be extracted more accurately, so that the subsequently generated image description text is more accurate.
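The detection and background-removal steps could be sketched as follows, using torchvision's Mask R-CNN purely as an illustrative detector; the thresholds and the multiplicative masking are assumptions and not necessarily the detector or procedure used here.

```python
# Sketch of target detection and background removal; torchvision's Mask R-CNN
# is used only as an illustrative detector, and the thresholds are assumptions.
import torch
import torchvision

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def targets_without_background(image, score_thresh=0.7, mask_thresh=0.5):
    """image: float tensor (3, H, W) in [0, 1]. Returns each detected target's
    box together with a background-removed image for further feature extraction."""
    with torch.no_grad():
        out = detector([image])[0]
    results = []
    for box, score, mask in zip(out["boxes"], out["scores"], out["masks"]):
        if score < score_thresh:
            continue
        binary = (mask[0] > mask_thresh).float()   # 1 inside the segmentation region, 0 elsewhere
        no_background = image * binary             # black out everything outside the region
        results.append({"box": box, "image": no_background})
    return results
```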
The overall model training method is described below with reference to fig. 4.
FIG. 4 is a flow chart of still further embodiments of the disclosed image description generation method. As shown in fig. 4, steps S102 to S106 are preceded by: steps S402 to S408.
In step S402, training samples are obtained, the training samples including: the sample image and the corresponding description text of the sample image.
In step S404, the target feature of each target in the sample image and the target frame feature of each target frame are obtained.
The target feature and the target frame feature can be obtained according to the method of the foregoing embodiment, and models for target detection, target feature extraction, and target frame feature extraction can be trained in advance.
In step S406, a semantic tree of the sample image is constructed according to the relationship among the objects in the sample image, the object frames of the objects, and the sample image.
The method of building a semantic tree refers to the previous embodiments.
In step S408, the Tree-LSTM network to be trained and the image description generation model to be trained are trained according to the semantic tree, the target features of the targets in the sample image, and the target frame features of the target frames.
In some embodiments, feature fusion is performed with the Tree-LSTM network to be trained according to the relationships of the nodes in the semantic tree of the sample image, the target features of the targets corresponding to the nodes, and the target frame features of the target frames corresponding to the nodes, and the fused target frame features and the fused image global feature of the sample image are determined. The output description text of the sample image is determined with the image description generation model to be trained, according to the target features of the sample image, the fused target frame features, and the fused image global feature. The parameters of the Tree-LSTM network to be trained and of the image description generation model to be trained are adjusted according to the output description text of the sample image and the labeled description text of the sample image until a preset convergence condition is met, thereby completing the training of the models.
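A minimal joint-training sketch is given below. Teacher forcing with a cross-entropy loss against the labeled caption is an assumed training objective, and encode_with_tree is a hypothetical helper standing in for the feature-fusion step described earlier; neither is prescribed by this disclosure.

```python
# Minimal joint training step for the Tree-LSTM and the description decoder;
# cross-entropy with teacher forcing is an assumed objective, and
# encode_with_tree is a hypothetical helper for the feature-fusion step.
import torch
import torch.nn as nn

def train_step(tree_lstm, decoder, optimizer, sample, pad_id=0):
    optimizer.zero_grad()
    # feature fusion over the sample's semantic tree (see the sketches above)
    global_expr, local_feats = encode_with_tree(tree_lstm, sample)
    caption = sample["caption_ids"]                 # (T,) labeled description word ids
    d_hid = decoder.lstm1.hidden_size
    zeros = torch.zeros(1, d_hid)
    state1, state2 = (zeros, zeros.clone()), (zeros.clone(), zeros.clone())
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    loss = 0.0
    for t in range(len(caption) - 1):
        logits, state1, state2 = decoder(caption[t:t + 1], global_expr, local_feats,
                                         state1, state2)
        loss = loss + criterion(logits, caption[t + 1:t + 2])
    loss.backward()                                 # adjusts both models' parameters
    optimizer.step()
    return loss.item()
```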
The target features and the target frame features of the sample image are processed in the same way as in the application stage of the foregoing embodiments.
The present disclosure also provides an image description generation apparatus, which is described below with reference to fig. 5.
Fig. 5 is a block diagram of some embodiments of the image description generation apparatus of the present disclosure. As shown in fig. 5, the apparatus 50 of this embodiment includes: a semantic tree construction module 502, a feature fusion module 504, and a description generation module 506.
A semantic tree constructing module 502, configured to construct a semantic tree of an image according to relationships among objects in the image, object frames of the objects, and the image; and each node of the semantic tree corresponds to each target, each target frame and each image respectively.
In some embodiments, semantic tree construction module 502 is configured to configure a node corresponding to an image as a root node of a semantic tree; configuring nodes corresponding to the target frames into middle layer nodes of a semantic tree; configuring nodes corresponding to all targets as leaf nodes of a semantic tree; and the leaf node corresponding to each target is configured as a child node of the target frame corresponding to the target.
In some embodiments, the semantic tree construction module 502 is configured to arrange the target boxes in an order from large area to small area; sequentially taking nodes corresponding to all targets as nodes to be added according to the arrangement sequence; determining the overlapping area of a target frame corresponding to a node to be added and a target frame corresponding to each added node; and under the condition that the added node with the overlapping area of the target frame corresponding to the node to be added exceeding the threshold exists, configuring the node to be added as the child node of the added node, otherwise, configuring the node to be added as the child node of the root node.
A feature fusion module 504, configured to perform feature fusion by using a tree-shaped long-and-short memory network according to a relationship between each node in the semantic tree, a target feature of each target corresponding to the node, and a target frame feature of each target frame corresponding to the node, and determine each fused target frame feature and a fused image global feature; the target frame feature is a feature of an image in the target frame of each target.
In some embodiments, the feature fusion module 504 is configured to, starting from the layer where the leaf nodes of the semantic tree are located, input the features corresponding to all child nodes belonging to the same parent node and the feature corresponding to that parent node into the tree-shaped long short-term memory network to obtain the output fused feature corresponding to the parent node, and update the feature corresponding to the parent node to the fused feature; update the features corresponding to the nodes of each layer in this way, layer by layer from bottom to top; and determine the fused target frame features and the fused image global feature according to the updated features corresponding to the nodes. When the root node serves as the parent node, an average pooling operation is performed on the target features to obtain a first global target feature, an average pooling operation is performed on the target frame features to obtain a first global target frame feature, and the first global target feature and the first global target frame feature are weighted to obtain the input feature corresponding to the root node.
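Purely as an illustration of the bottom-up fusion, a child-sum tree LSTM cell is one possible realization; the specific gating below and the feature dimensions are assumptions, and the weighting of the two pooled global features at the root is not shown.

    import torch
    import torch.nn as nn

    class ChildSumTreeLSTMCell(nn.Module):
        # One possible cell for fusing the features of all children of the same parent
        # with the parent's own feature; the embodiment does not prescribe this exact form.
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.iou = nn.Linear(in_dim + hid_dim, 3 * hid_dim)
            self.f = nn.Linear(in_dim + hid_dim, hid_dim)

        def forward(self, x, child_h, child_c):
            # x: (in_dim,) parent feature; child_h, child_c: (num_children, hid_dim)
            h_sum = child_h.sum(dim=0)
            i, o, u = torch.chunk(self.iou(torch.cat([x, h_sum])), 3)
            i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
            f = torch.sigmoid(self.f(torch.cat([x.expand(child_h.size(0), -1), child_h], dim=1)))
            c = i * u + (f * child_c).sum(dim=0)
            h = o * torch.tanh(c)
            return h, c  # h becomes the fused feature that replaces the parent's feature

Processing the layers bottom-up with such a cell yields the fused target frame features at the middle-layer nodes and the fused image global feature at the root.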
A description generation module 506, configured to determine a description text of the image using an image description generation model according to the target features, the fused target frame features and the fused image global feature.
In some embodiments, the description generation module 506 is configured to, for each target feature, combine the target feature with the corresponding target frame feature and the corresponding fused target frame feature to obtain a first target local feature; perform an average pooling operation on the target features to obtain a first global target feature; perform an average pooling operation on the target frame features to obtain a first global target frame feature; combine the first global target feature, the first global target frame feature and the fused image global feature to obtain a combined first global image expression feature; and input the first global image expression feature and each first target local feature into the image description generation model to obtain the output description text of the image.
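As a simple illustration of the merging and pooling steps just described (variable names and dimensions are assumptions):

    import torch

    def build_decoder_inputs(target_feats, frame_feats, fused_frame_feats, fused_global):
        # target_feats, frame_feats, fused_frame_feats: (N, D), aligned per target;
        # fused_global: (D,) fused image global feature from the tree LSTM.
        first_local = torch.cat([target_feats, frame_feats, fused_frame_feats], dim=1)  # first target local features
        g_target = target_feats.mean(dim=0)   # first global target feature (average pooling)
        g_frame = frame_feats.mean(dim=0)     # first global target frame feature (average pooling)
        first_global = torch.cat([g_target, g_frame, fused_global])  # first global image expression feature
        return first_global, first_local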
In some embodiments, the description generation module 506 is configured to combine the feature of the description word at the current time step, the first global image expression feature, and the feature output at the previous time step by the second-layer long short-term memory network of the image description generation model, and input the combined feature into the first-layer long short-term memory network of the image description generation model; input the features output by the first-layer long short-term memory network and each first target local feature into an attention mechanism module; and combine the features output by the attention mechanism module with the features output by the first-layer long short-term memory network, and input the combined features into the second-layer long short-term memory network to obtain the description word output at the next time step.
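One conventional way to realize the two-layer long short-term memory network with an attention mechanism module is sketched below; the additive attention form, the layer sizes and all names are assumptions chosen only to mirror the data flow of the preceding paragraph.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoLayerAttnDecoderStep(nn.Module):
        # One decoding step: word feature + global image expression feature -> first-layer LSTM
        # -> attention over the target local features -> second-layer LSTM -> next-word logits.
        def __init__(self, word_dim, feat_dim, hid_dim, vocab_size):
            super().__init__()
            self.lstm1 = nn.LSTMCell(word_dim + feat_dim + hid_dim, hid_dim)
            self.lstm2 = nn.LSTMCell(feat_dim + hid_dim, hid_dim)
            self.att_v = nn.Linear(feat_dim, hid_dim)
            self.att_h = nn.Linear(hid_dim, hid_dim)
            self.att_out = nn.Linear(hid_dim, 1)
            self.word_out = nn.Linear(hid_dim, vocab_size)

        def forward(self, word_t, global_feat, local_feats, state1, state2):
            # word_t: (B, word_dim); global_feat: (B, feat_dim); local_feats: (B, N, feat_dim)
            h2_prev = state2[0]  # feature output by the second-layer LSTM at the previous step
            h1, c1 = self.lstm1(torch.cat([word_t, global_feat, h2_prev], dim=1), state1)
            scores = self.att_out(torch.tanh(self.att_v(local_feats) + self.att_h(h1).unsqueeze(1)))
            alpha = F.softmax(scores, dim=1)              # attention weights over the targets
            context = (alpha * local_feats).sum(dim=1)    # attended local feature
            h2, c2 = self.lstm2(torch.cat([context, h1], dim=1), state2)
            return self.word_out(h2), (h1, c1), (h2, c2)  # logits of the word at the next step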
In some embodiments, the description generation module 506 is configured to input each target feature into a graph convolution network to obtain each output updated target feature; input the fused target frame features into a graph convolution network to obtain the output updated fused target frame features; input each target frame feature into a graph convolution network to obtain each output updated target frame feature; for each updated target feature, combine the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature; perform an average pooling operation on the updated target features to obtain a second global target feature; perform an average pooling operation on the updated target frame features to obtain a second global target frame feature; perform an average pooling operation on the updated fused target frame features to obtain a third global target frame feature; combine the second global target feature, the second global target frame feature and the third global target frame feature to obtain a combined second global image expression feature; and input the second global image expression feature and each second target local feature into the image description generation model to obtain the output description text of the image.
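For the graph-convolution variant, a single shared graph-convolution layer is sketched below as one possibility; the adjacency construction (for example from box overlaps or detected relations) and the use of a single layer are assumptions.

    import torch
    import torch.nn as nn

    class SimpleGraphConv(nn.Module):
        # One graph-convolution layer: each node's feature is updated from its neighbours.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, feats, adj):
            # feats: (N, D) node features; adj: (N, N) normalized adjacency (assumed given)
            return torch.relu(self.proj(adj @ feats))

    # Illustrative use: the same kind of layer could update the target features, the target
    # frame features and the fused target frame features before the pooling and merging above.
    # gcn = SimpleGraphConv(dim=2048)
    # updated_target_feats = gcn(target_feats, adj)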
In some embodiments, the description generation module 506 is configured to input the feature of the description word at the current time step, the second global image expression feature, and the feature output at the previous time step by the second-layer long short-term memory network of the image description generation model into the first-layer long short-term memory network of the image description generation model; input the features output by the first-layer long short-term memory network and each second target local feature into an attention mechanism module; and combine the features output by the attention mechanism module with the features output by the first-layer long short-term memory network, and input the combined features into the second-layer long short-term memory network to obtain the description word output at the next time step.
In some embodiments, the apparatus 50 further comprises: a feature extraction module 508, configured to perform target detection on the image to obtain each target frame and each segmented region in the image; extract features from the image in each target frame to obtain and output the target frame feature of each target frame in the image; and extract features from the image in each segmented region to obtain and output the target feature of each target in the image.
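Only as an orientation, an off-the-shelf instance-segmentation detector could supply the target frames and segmented regions used by module 508; the use of torchvision's Mask R-CNN and the 0.5 score threshold are assumptions and not part of this embodiment.

    import torch
    import torchvision

    # Assumption: a pre-trained Mask R-CNN stands in for the detector used by module 508.
    detector = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

    def detect_frames_and_regions(image, score_thr=0.5):
        # image: (3, H, W) tensor scaled to [0, 1]
        with torch.no_grad():
            out = detector([image])[0]
        keep = out["scores"] > score_thr
        boxes = out["boxes"][keep]            # target frames
        masks = out["masks"][keep, 0] > 0.5   # segmented regions (one binary mask per target)
        return boxes, masks

    # Target frame features can then be extracted from the image cropped to each box,
    # for example with the backbone of the same detector or another CNN.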
In some embodiments, the feature extraction module 508 is configured to, for the image in each segmented region, set the pixels within the segmented region to white and the pixels of the other portions to black to obtain a binarized image; superpose the binarized image on the original image to obtain an image with the background removed; and input the background-removed image into an object detector to obtain and output the target feature of the target in each segmented region.
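The binarization and superposition described above might look like the following sketch; interpreting the superposition as masking the original image with the binarized region is one reading, stated here only as an assumption.

    import torch

    def remove_background(image, region_mask):
        # image: (3, H, W) original image; region_mask: (H, W) bool, True inside the segmented region.
        binary = region_mask.float()           # binarized image: white (1) in the region, black (0) elsewhere
        return image * binary.unsqueeze(0)     # superpose on the original image: background removed

    # The background-removed image is then input into the object detector to obtain the
    # target feature of the target in this segmented region.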
In some embodiments, the apparatus 50 further comprises: a training module 510, configured to obtain training samples, where the training samples include a sample image and a description text corresponding to the sample image; acquire the target features of the targets in the sample image and the target frame features of the target frames; construct a semantic tree of the sample image according to the relationships among the targets in the sample image, the target frames of the targets and the sample image; and train the tree-shaped long short-term memory network to be trained and the image description generation model to be trained according to the semantic tree, the target features of the targets in the sample image and the target frame features of the target frames.
In some embodiments, the training module 510 is configured to perform feature fusion using the tree-shaped long short-term memory network to be trained according to the relationship of each node in the semantic tree of the sample image, the target feature of each target corresponding to a node and the target frame feature of each target frame corresponding to a node, and determine the fused target frame features of the sample image and the fused image global feature; determine the output description text of the sample image using the image description generation model to be trained according to the target features of the sample image, the fused target frame features and the fused image global feature; and adjust the parameters of the tree-shaped long short-term memory network to be trained and the image description generation model to be trained according to the output description text of the sample image and the annotated description text of the sample image, until a preset convergence condition is met, thereby completing the training of the models.
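Only by way of example, a single training iteration of module 510 could be organized as below; the cross-entropy loss over the annotated description text, teacher forcing, and a joint Adam optimizer over both networks are assumptions, and tree_lstm and captioner are hypothetical stand-ins for the two networks to be trained.

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        list(tree_lstm.parameters()) + list(captioner.parameters()), lr=1e-4)

    def train_step(semantic_tree, target_feats, frame_feats, caption_ids):
        # Fuse features along the semantic tree, decode a description, compare with the annotation.
        fused_frame_feats, fused_global = tree_lstm(semantic_tree, target_feats, frame_feats)
        logits = captioner(target_feats, frame_feats, fused_frame_feats, fused_global,
                           caption_ids[:-1])                     # teacher forcing on the annotated text
        loss = criterion(logits.reshape(-1, logits.size(-1)), caption_ids[1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()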
The image description generation apparatus in the embodiments of the present disclosure may be implemented by various computing devices or computer systems, which are described below in conjunction with Figs. 6 and 7.
Fig. 6 is a block diagram of some embodiments of an image description generation apparatus of the present disclosure. As shown in Fig. 6, the apparatus 60 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the image description generation method of any of the embodiments of the present disclosure based on instructions stored in the memory 610.
Memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 7 is a block diagram of further embodiments of an image description generation apparatus of the present disclosure. As shown in Fig. 7, the apparatus 70 of this embodiment includes: a memory 710 and a processor 720, which are similar to the memory 610 and the processor 620, respectively. The apparatus may also include an input/output interface 730, a network interface 740, a storage interface 750, and the like. These interfaces 730, 740, 750, as well as the memory 710 and the processor 720, may be connected, for example, by a bus 760. The input/output interface 730 provides a connection interface for input/output devices such as a display, a mouse, a keyboard and a touch screen. The network interface 740 provides a connection interface for various networking devices, such as a database server or a cloud storage server. The storage interface 750 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (15)

1. A method of generating an image description, comprising:
constructing a semantic tree of the image according to the relation among each target in the image, the target frame of each target and the image; each node of the semantic tree corresponds to each target, each target frame and the image respectively;
according to the relation of each node in the semantic tree, the target feature of each target corresponding to a node and the target frame feature of each target frame corresponding to a node, performing feature fusion by using a tree-shaped long short-term memory network, and determining each fused target frame feature and the fused image global feature; the target frame feature of each target is a feature of the image in the target frame of that target;
and determining a description text of the image by using an image description generation model according to the target features, the fused target frame features and the fused image global features.
2. The method of claim 1, wherein,
the constructing the semantic tree of the image comprises:
configuring a node corresponding to the image as a root node of the semantic tree;
configuring nodes corresponding to the target frames as middle-layer nodes of the semantic tree;
configuring nodes corresponding to the targets as leaf nodes of the semantic tree;
and the leaf node corresponding to each target is configured as a child node of the target frame corresponding to the target.
3. The method of claim 1, wherein,
configuring the nodes corresponding to the target frames as middle-layer nodes of the semantic tree comprises:
arranging the target frames in descending order of area;
sequentially taking the nodes corresponding to the target frames as nodes to be added according to the arranged order;
determining the overlapping area of a target frame corresponding to a node to be added and a target frame corresponding to each added node;
and under the condition that there is an added node whose target frame has an overlapping area with the target frame of the node to be added exceeding a threshold, configuring the node to be added as a child node of that added node; otherwise, configuring the node to be added as a child node of the root node.
4. The method of claim 1, wherein,
the performing feature fusion by using the tree-shaped long short-term memory network comprises:
starting from the layer where the leaf nodes of the semantic tree are located, inputting the features corresponding to all child nodes belonging to the same parent node and the feature corresponding to the parent node into the tree-shaped long short-term memory network to obtain the output fused feature corresponding to the parent node, and updating the feature corresponding to the parent node to the fused feature;
sequentially updating the characteristics corresponding to the nodes of each layer according to the sequence from bottom to top;
determining the fused target frame features and the fused image global features according to the updated features corresponding to the nodes;
when the root node serves as the parent node, performing an average pooling operation on each target feature to obtain a first global target feature; performing an average pooling operation on each target frame feature to obtain a first global target frame feature; and weighting the first global target feature and the first global target frame feature to obtain the input feature corresponding to the root node.
5. The method of claim 1, wherein,
determining the description text of the image by using an image description generation model according to the target features, the fused target frame features and the fused image global features comprises:
for each target feature, merging the target feature with the corresponding target frame feature and the corresponding fused target frame feature to obtain a first target local feature;
carrying out average pooling operation on each target feature to obtain a first global target feature;
carrying out average pooling operation on each target frame characteristic to obtain a first global target frame characteristic;
merging the first global target feature, the first global target frame feature and the fused image global feature to obtain a merged first global image expression feature;
and inputting the first global image expression characteristics and each first target local characteristic into the image description generation model to obtain an output description text of the image.
6. The method of claim 5, wherein,
the step of inputting the first global image expression feature and each first target local feature into the image description generation model to obtain an output description text of the image comprises:
combining the feature of the description word at the current moment, the first global image expression feature and the feature output at the previous moment by the second-layer long short-term memory network of the image description generation model, and inputting the combined feature into the first-layer long short-term memory network of the image description generation model;
inputting the features output by the first-layer long short-term memory network and each first target local feature into an attention mechanism module;
and combining the features output by the attention mechanism module with the features output by the first-layer long short-term memory network, and inputting the combined features into the second-layer long short-term memory network to obtain the description word output at the next moment.
7. The method of claim 1, wherein,
determining the description text of the image by using an image description generation model according to the target features, the fused target frame features and the fused image global features comprises:
inputting each target feature into a graph convolution network to obtain each output updated target feature; inputting the fused target frame features into a graph convolution network to obtain the output updated fused target frame features; inputting each target frame feature into a graph convolution network to obtain each output updated target frame feature;
for each updated target feature, merging the updated target feature with the corresponding updated target frame feature and the corresponding updated fused target frame feature to obtain a second target local feature;
carrying out average pooling operation on each updated target feature to obtain a second global target feature;
carrying out average pooling operation on each updated target frame characteristic to obtain a second global target frame characteristic;
carrying out average pooling operation on each updated fused target frame characteristic to obtain a third global target frame characteristic;
combining the second global target feature, the second global target frame feature and the third global target frame feature to obtain a combined second global image expression feature;
and inputting the second global image expression characteristics and each second target local characteristic into the image description generation model to obtain an output description text of the image.
8. The method of claim 7, wherein,
the step of inputting the second global image expression features and each second target local feature into the image description generation model to obtain an output description text of the image comprises:
inputting the feature of the description word at the current moment, the second global image expression feature and the feature output at the previous moment by the second-layer long short-term memory network of the image description generation model into the first-layer long short-term memory network of the image description generation model;
inputting the features output by the first-layer long short-term memory network and each second target local feature into an attention mechanism module;
and combining the features output by the attention mechanism module with the features output by the first-layer long short-term memory network, and inputting the combined features into the second-layer long short-term memory network to obtain the description word output at the next moment.
9. The method of claim 1, further comprising:
carrying out target detection on the image to obtain each target frame and each segmented region in the image;
extracting features from the image in each target frame to obtain and output the target frame feature of each target frame in the image;
and extracting features from the image in each segmented region to obtain and output the target feature of each target in the image.
10. The method of claim 9, wherein,
the extracting features from the image in each segmented region to obtain and output the target feature of each target in the image comprises:
for the image in each segmented region, setting the pixels within the segmented region to white and the pixels of the other portions to black to obtain a binarized image;
superposing the binarized image on the original image to obtain an image with the background removed;
and inputting the background-removed image into an object detector to obtain and output the target feature of the target in each segmented region.
11. The method of claim 1, further comprising:
obtaining training samples, the training samples comprising: the sample image and the description text corresponding to the sample image;
acquiring target characteristics of each target in the sample image and target frame characteristics of each target frame;
constructing a semantic tree of the sample image according to the relation among each target in the sample image, the target frame of each target and the sample image;
and training a tree-shaped long short-term memory network to be trained and an image description generation model to be trained according to the semantic tree, the target features of the targets in the sample image and the target frame features of the target frames.
12. The method according to claim 11, wherein,
the training of the tree-shaped long short-term memory network to be trained and the image description generation model to be trained comprises:
according to the relation of each node in the semantic tree of the sample image, the target feature of each target corresponding to a node and the target frame feature of each target frame corresponding to a node, performing feature fusion by using the tree-shaped long short-term memory network to be trained, and determining the fused target frame features of the sample image and the fused image global feature;
determining an output description text of the sample image by using an image description generation model to be trained according to each target feature of the sample image, each fused target frame feature and each fused image global feature;
and according to the output description text of the sample image and the annotated description text of the sample image, adjusting parameters of the tree-shaped long short-term memory network to be trained and the image description generation model to be trained until a preset convergence condition is met, thereby completing the training of the models.
13. An apparatus for generating an image description, comprising:
the semantic tree building module is used for building a semantic tree of the image according to the relation among each target in the image, the target frame of each target and the image; each node of the semantic tree corresponds to each target, each target frame and the image respectively;
the feature fusion module is used for performing feature fusion by utilizing a tree-shaped long short-term memory network according to the relation of each node in the semantic tree, the target feature of each target corresponding to a node and the target frame feature of each target frame corresponding to a node, and determining the fused target frame features and the fused image global feature; the target frame feature of each target is a feature of the image in the target frame of that target;
and the description generation module is used for determining the description text of the image by utilizing an image description generation model according to the target features, the fused target frame features and the fused image global features.
14. An apparatus for generating an image description, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the method of generating an image description according to any of claims 1-12.
15. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of the method of any of claims 1-12.
CN201910841842.2A 2019-09-06 2019-09-06 Image description generation method, device and computer readable storage medium Active CN111783809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910841842.2A CN111783809B (en) 2019-09-06 2019-09-06 Image description generation method, device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN111783809A true CN111783809A (en) 2020-10-16
CN111783809B CN111783809B (en) 2024-03-05

Family

ID=72755705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910841842.2A Active CN111783809B (en) 2019-09-06 2019-09-06 Image description generation method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111783809B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063352A1 (en) * 2008-10-16 2016-03-03 Curators Of The University Of Missouri Identifying geographic areas based on change patterns detected from high-resolution, remotely sensed imagery
CN103324954A (en) * 2013-05-31 2013-09-25 中国科学院计算技术研究所 Image classification method based on tree structure and system using same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA JUANJUAN; PAN QUAN; LIANG YAN; HU JINWEN; ZHAO CHUNHUI; WANG HUAXIA: "Target detection based on a depth-first random forest classifier", Journal of Chinese Inertial Technology, no. 04 *

Also Published As

Publication number Publication date
CN111783809B (en) 2024-03-05

Similar Documents

Publication Publication Date Title
KR102635987B1 (en) Method, apparatus, device and storage medium for training an image semantic segmentation network
Hui et al. Linguistic structure guided context modeling for referring image segmentation
WO2021077984A1 (en) Object recognition method and apparatus, electronic device, and readable storage medium
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
CN109961008B (en) Table analysis method, medium and computer equipment based on text positioning recognition
KR102223296B1 (en) Structure learning in convolutional neural networks
Mnih et al. Learning to label aerial images from noisy data
CN109993102B (en) Similar face retrieval method, device and storage medium
CN111241989B (en) Image recognition method and device and electronic equipment
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
EP3992975A1 (en) Compound property analysis method and apparatus, compound property analysis model training method, and storage medium
CN110533097A (en) A kind of image definition recognition methods, device, electronic equipment and storage medium
CN110232373A (en) Face cluster method, apparatus, equipment and storage medium
CN111401521B (en) Neural network model training method and device, and image recognition method and device
CN109145766A (en) Model training method, device, recognition methods, electronic equipment and storage medium
US20220020064A1 (en) Feature processing method and apparatus for artificial intelligence recommendation model, electronic device, and storage medium
CN109377445A (en) Model training method, the method, apparatus and electronic system for replacing image background
CN110196945B (en) Microblog user age prediction method based on LSTM and LeNet fusion
CN110852393A (en) Remote sensing image segmentation method and system
CN110738102A (en) face recognition method and system
US10832036B2 (en) Meta-learning for facial recognition
KR102593835B1 (en) Face recognition technology based on heuristic Gaussian cloud transformation
CN116309992A (en) Intelligent meta-universe live person generation method, equipment and storage medium
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN113434722B (en) Image classification method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant