CN112734881A - Text-to-image synthesis method and system based on saliency scene graph analysis

Text-to-image synthesis method and system based on saliency scene graph analysis

Info

Publication number
CN112734881A
Authority
CN
China
Prior art keywords
background
graph
scene
scene graph
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011381287.9A
Other languages
Chinese (zh)
Other versions
CN112734881B (en)
Inventor
郎丛妍
汪敏
李浥东
冯松鹤
王涛
孙鑫雨
李尊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202011381287.9A
Publication of CN112734881A
Application granted
Publication of CN112734881B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 11/00: 2D [Two Dimensional] image generation
            • G06T 11/60: Editing figures and text; Combining figures or text
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 40/00: Handling natural language data
            • G06F 40/20: Natural language analysis
              • G06F 40/279: Recognition of textual entities
              • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
            • G06F 40/30: Semantic analysis
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a text-to-image synthesis method and system based on saliency scene graph analysis. The method comprises the following steps: constructing a dependency tree from the textual description according to dependency parsing, performing tree transformations to obtain a semantic graph, and building a rule-based scene graph parser that maps the dependency syntactic representation to a scene graph; retrieving with the scene graph through a background retrieval module to obtain the candidate segmentation maps most relevant to the given scene graph; encoding the candidate segmentation maps through a background fusion module to obtain background features; and feeding the foreground objects and the background feature representation into a generative adversarial network to obtain a text-to-image synthesis model, which, given a test textual description as input, generates a high-resolution image with a visually consistent foreground and background. By introducing the saliency-based scene graph into image synthesis and exploring the cross-modal text-to-semantic spatial configuration, the invention effectively improves the accuracy of image synthesis.

Description

Text-to-image synthesis method and system based on saliency scene graph analysis
Technical Field
The invention relates to the technical field of computer vision, and in particular to a text-to-image synthesis method and system based on saliency scene graph analysis.
Background
Generating images from textual descriptions has long been an active research topic in computer vision. Because of its great potential and the challenges it poses in many applications, text-to-image generation has become an active research area for both the natural language processing and computer vision communities, with a wide range of applications including photo editing and computer-aided design. By allowing a user to describe visual concepts in natural language, it provides a natural and flexible interface for controlling image generation. With the rise of generative adversarial networks (GANs), image synthesis techniques have shown excellent results. Within the GAN framework, the quality of the generated images can be further improved by generating high-resolution images or by enhancing the textual information.
At present, the generation of complex images remains challenging. For example, generating an image from the textual description "people ride elephants and go through a river" requires reasoning about a variety of visual concepts, such as the object categories (people and elephants), the spatial configuration of the objects (riding) and the scene background (crossing a river), which is far more complex than generating a single large object. Due to the difficulty of learning a direct text-to-pixel mapping from ordinary images, existing methods have not succeeded in generating reasonable images for such complex textual descriptions. Developing a method that can efficiently generate complex images directly from textual descriptions is therefore a problem to be solved.
Disclosure of Invention
The embodiments of the invention provide a text-to-image synthesis method and system based on saliency scene graph analysis, so as to generate complex images directly and effectively from textual descriptions.
To achieve this purpose, the invention adopts the following technical solutions.
According to one aspect of the invention, a text-to-image synthesis method based on saliency scene graph analysis is provided, comprising the following steps:
Step 1: extracting textual descriptions from existing datasets, and constructing an object dataset, an attribute dataset and a relation dataset from the textual descriptions according to dependency parsing;
Step 2: extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain a semantic graph;
Step 3: constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph through the scene graph parser, and obtaining foreground objects from the scene graph;
Step 4: retrieving with the scene graph using a background retrieval module, and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score;
Step 5: encoding the candidate segmentation maps through a background fusion module to generate the best-matching background features;
Step 6: inputting the foreground objects and the background feature representation into a generative adversarial network, and training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network;
Step 7: taking the trained generative adversarial network as the text-to-image synthesis model, inputting the textual description to be converted into the trained model, and having the model output the image corresponding to that textual description.
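For illustration only, the following minimal Python sketch shows how the modules described in Steps 1 to 7 could be chained for a single textual description; the function names and interfaces (parse_scene_graph, retrieve_backgrounds, fuse_background, generate) are hypothetical stand-ins and are not defined by the patent.

```python
# Hypothetical interface names for illustration only; the patent does not define this API.
from typing import Any, Callable, Sequence


def synthesize_image(
    text: str,
    parse_scene_graph: Callable[[str], Any],               # Steps 1-3: text -> saliency scene graph
    retrieve_backgrounds: Callable[[Any], Sequence[Any]],   # Step 4: scene graph -> candidate segmentation maps
    fuse_background: Callable[[Sequence[Any]], Any],        # Step 5: candidates -> best-matching background feature
    generate: Callable[[Any, Any], Any],                    # Steps 6-7: (foreground objects, background feature) -> image
) -> Any:
    """Chain the modules of Steps 1-7 for a single textual description."""
    scene_graph = parse_scene_graph(text)           # dependency parse, tree transforms, rule-based scene graph parser
    foreground = scene_graph.foreground_objects     # assumes the scene graph object exposes its instances
    candidates = retrieve_backgrounds(scene_graph)  # ranked by the layout similarity score IoU_r
    background_feature = fuse_background(candidates)
    return generate(foreground, background_feature)  # trained GAN generator produces the final image
```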
Preferably, extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain a semantic graph comprises:
extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset; taking the dependency parse of the image description as the starting point of the scene graph; outputting the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2; parsing the objects, attributes and relations into a dependency tree according to these dependency relations; and applying quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
Preferably, retrieving with the scene graph using the background retrieval module and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score comprises:
searching the candidate semantic segmentation map database with the background retrieval module based on the instance information in the scene graph, computing a distance value between the scene graph and each retrieved segmentation map, taking the distance values as layout similarity scores, sorting all the layout similarity scores in descending order, and taking the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
given containing k instances
Figure RE-GDA0002991057340000031
Scene graph S, c ofiIs example SiGiven a category comprising l instances
Figure RE-GDA0002991057340000032
The segmentation map M and the corresponding correct label map I, the distance IoU between the scene map S and the segmentation map MrThe calculation method comprises the following steps:
Figure RE-GDA0002991057340000033
where C is the total number of object classes,
Figure RE-GDA0002991057340000041
Mjrepresents l division graphs
Figure RE-GDA0002991057340000042
Set of (1), SjRepresenting k scene graphs
Figure RE-GDA0002991057340000043
And (2) the sets of (U) and (n) respectively represent a union operation.
Preferably, encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background features comprises:
encoding the group of x candidate segmentation maps with the background fusion module, feeding the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenating the feature representations of the x segmentation maps channel by channel into an overall feature representation, applying a pooling operation to the overall feature representation, concatenating the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and performing feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature.
Preferably, encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background feature further comprises:
letting the m retrieved segmentation maps be M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, where each retrieved segmentation map corresponds to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; obtaining l*_{r,0}, ..., l*_{r,m} and concatenating the l*_{r,i} (i = 0, 1, ..., m); encoding the concatenated background label maps into a background feature map with a convolutional network F_1, where Pool denotes average pooling; and using another convolutional network F_2 to obtain an updated feature map at each step; after T steps, the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
Preferably, inputting the foreground objects and the background feature representation into the generative adversarial network and training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network comprises:
forming the generator of the generative adversarial network from the scene graph generation module, the background retrieval module, the background fusion module and the image generation module; inputting the foreground objects and the background features into the generative adversarial network, which applies spatially-adaptive normalization to encode the foreground objects and the background features; feeding the spatially-adaptively normalized image into the discriminator of the generative adversarial network; using a matching-aware discriminator to judge whether the image fed to the discriminator is a real image or a generated image; and training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, where the generator keeps generating more images while the discriminator keeps distinguishing real images from generated images, and the generator and discriminator keep playing this game until Nash equilibrium between them is reached, yielding the trained generative adversarial network.
According to another aspect of the present invention, there is provided a text-to-image synthesis device based on saliency scene graph analysis, comprising:
a data preprocessing module for extracting textual descriptions from existing datasets and constructing an object dataset, an attribute dataset and a relation dataset from the textual descriptions according to dependency parsing;
a scene graph generation module for extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, performing tree transformations on the dependency tree to obtain a semantic graph, constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph through the scene graph parser, and obtaining foreground objects from the scene graph;
a background retrieval module for retrieving with the scene graph and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score;
a background fusion module for encoding the candidate segmentation maps to generate the best-matching background feature;
an image generation module for inputting the foreground objects and the background feature representation into a generative adversarial network, training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network, taking the trained generative adversarial network as the text-to-image synthesis model, inputting the textual description to be converted into the trained model, and outputting the image corresponding to that textual description.
Preferably, the scene graph generation module is specifically configured to extract all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, take the dependency parse of the image description as the starting point of the scene graph, output the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2, parse the objects, attributes and relations into a dependency tree according to these dependency relations, and apply quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
Preferably, the background retrieval module is specifically configured to search the candidate semantic segmentation map database based on the instance information in the scene graph, compute a distance value between the scene graph and each retrieved segmentation map, take the distance values as layout similarity scores, sort all the layout similarity scores in descending order, and take the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
Preferably, the background fusion module is specifically configured to encode the group of x candidate segmentation maps, feed the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenate the feature representations of the x segmentation maps channel by channel into an overall feature representation, apply a pooling operation to the overall feature representation, concatenate the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and perform feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature;
wherein the m retrieved segmentation maps are M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, each retrieved segmentation map corresponding to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; l*_{r,0}, ..., l*_{r,m} are obtained and the l*_{r,i} (i = 0, 1, ..., m) are concatenated; a convolutional network F_1 encodes the concatenated background label maps into a background feature map, where Pool denotes average pooling; another convolutional network F_2 is used to obtain an updated feature map at each step; and after T steps, the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
Preferably, the image generation module is specifically configured to form, together with the scene graph generation module, the background retrieval module and the background fusion module, the generator of the generative adversarial network; to input the foreground objects and the background features into the generative adversarial network, which applies spatially-adaptive normalization to encode them; to feed the spatially-adaptively normalized image into the discriminator of the generative adversarial network; to use a matching-aware discriminator to judge whether the image fed to the discriminator is a real image or a generated image; and to train the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, where the generator keeps generating more images while the discriminator keeps distinguishing real images from generated images, and the generator and discriminator keep playing this game until Nash equilibrium between them is reached, yielding the trained generative adversarial network.
According to the technical solution provided by the embodiments of the invention, the text-to-image synthesis method based on saliency scene graph analysis introduces the saliency-based scene graph into image synthesis, and effectively improves the accuracy of image synthesis by exploring the cross-modal text-to-semantic spatial configuration.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of the text-to-image synthesis method based on saliency scene graph analysis according to an embodiment of the present invention;
Fig. 2 is a detailed structural diagram of the text-to-image synthesis device based on saliency scene graph analysis according to an embodiment of the present invention, comprising: a data preprocessing module 21, a scene graph generation module 22, a background retrieval module 23, a background fusion module 24 and an image generation module 25.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Example one
The processing flow of the text-to-image synthesis method based on saliency scene graph analysis provided by the embodiment of the invention is shown in Fig. 1 and comprises the following processing steps:
step S10: the textual description is extracted from the existing dataset. The existing datasets include the CUB dataset, the Oxford-102 dataset, and the MS-COCO dataset.
The dependency relations of a large number of existing textual descriptions are analyzed, and an object dataset, an attribute dataset and a relation dataset are constructed from the analysis results: the object dataset is obtained by sampling the noun objects in the textual descriptions, the attribute dataset consists of the attributes describing those objects, and the relation dataset consists of the relations between any two objects in the object dataset.
For example, given the textual description "A man is looking at his black watch", text dependency parsing resolves it into two objects o_1 = man and o_2 = watch, the relations e_1 = (o_1, look at, o_2) and e_2 = (o_1, have, o_2), and the attribute a = ({black}) attached to o_2.
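To make this parse result concrete, here is a minimal Python sketch of one possible in-memory representation of the resulting scene graph fragment; the dataclass names and fields are assumptions made for illustration, not data structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    subject: SceneObject
    predicate: str
    obj: SceneObject


# "A man is looking at his black watch"
man = SceneObject("man")                      # o1
watch = SceneObject("watch", ["black"])       # o2, with attribute a = ({black})
relations = [
    Relation(man, "look at", watch),          # e1 = (o1, look at, o2)
    Relation(man, "have", watch),             # e2 = (o1, have, o2), recovered from the possessive "his"
]
```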
Step S20: extracting all objects in the object data set, extracting all attributes in the attribute data set, extracting all relations in the relation data set, analyzing the obtained objects, attributes and relations into a dependency tree, performing tree conversion on the dependency tree for several times to obtain a semantic graph,
generic dependency parsing is in many ways close to a shallow semantic representation and is therefore the starting point for parsing an image description into a scene graph. The enhanced dependency representation output by Stanford Parser v3.5.2 is utilized to generate a dependency tree, and then, quantized modifier processing, pronoun parsing processing and plural noun processing steps are performed on the dependency tree to process complex quantized modifiers, parse pronouns and process multiple nouns, thereby generating a semantic graph.
The quantificational-modifier processing makes the first word the head of all other words in the expression and then makes this new multi-word expression depend on the following noun phrase. This step ensures that the semantic graphs of "both cars" and "both of the cars" have similar structures, in which the semantically salient word, cars, is the head. The pronoun resolution uses an improved intra-sentential pronoun resolver derived from the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in the sentence. The plural-noun processing copies each node of the graph according to the value of its numeric modifier, while limiting the number of copies per node to 20; if a plural noun lacks such a modifier, exactly one copy of the node is made.
The scene graph parser maps the dependency syntax in the dependency tree into a scene graph, and the scene graph can contain multiple instances.
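As a small illustration of the plural-noun rule described above, the sketch below copies a graph node according to its numeric modifier, capping the number of copies at 20 and falling back to a single copy when no modifier is present; the graph representation and function name are assumptions made for illustration.

```python
from typing import Dict, List, Optional

MAX_COPIES = 20  # the parser limits the number of copies per node to 20


def expand_plural_node(node: Dict, numeric_modifier: Optional[int] = None) -> List[Dict]:
    """Copy a plural-noun node according to the value of its numeric modifier."""
    if numeric_modifier is None:
        count = 1                                # plural noun without a numeric modifier: one copy
    else:
        count = min(numeric_modifier, MAX_COPIES)
    return [dict(node) for _ in range(count)]


# "three cars" -> three copies of the car node; "cars" alone -> one copy
assert len(expand_plural_node({"label": "car"}, numeric_modifier=3)) == 3
assert len(expand_plural_node({"label": "car"})) == 1
```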
Step S30: and searching the scene graph generated in the last step by using a background searching module, and selecting a group of candidate segmentation graphs which are most relevant to the scene graph from a large candidate semantic segmentation graph database by using the layout similarity score.
The background retrieval module utilizes the distance IoU between the scene graph and the segmentation graph as described belowrAnd (4) obtaining the product. Scene graph has no segmentation graph corresponding to the pair in the segmentation graph data set, and we will IoUrAnd (5) sorting the scores in a descending order, and calculating the Top-m (Top 3) scene graphs which are the most similar as candidates.
The layout similarity score is a variant of the intersection-over-union (IoU) metric; it measures the distance between the scene graph and a candidate segmentation map. Given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the layout similarity score is used to retrieve the pair (M, I) most similar to S, i.e., to measure the distance IoU_r between the scene graph describing the salient-object layout and the segmentation map containing the fine objects:

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
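A minimal NumPy sketch of the layout similarity score IoU_r and the Top-m candidate selection described above follows; it assumes that both the scene graph layout and each candidate segmentation map have been rasterized into per-class binary masks of shape (C, H, W), which is an illustrative choice rather than the patent's exact representation.

```python
import numpy as np


def layout_iou(scene_masks: np.ndarray, seg_masks: np.ndarray) -> float:
    """IoU_r between a scene-graph layout and a segmentation map.

    Both arguments are boolean arrays of shape (C, H, W): for each object
    class, the union of that class's instance masks.
    """
    num_classes = scene_masks.shape[0]
    total = 0.0
    for j in range(num_classes):
        union = np.logical_or(scene_masks[j], seg_masks[j]).sum()
        if union == 0:
            continue                      # class absent from both maps
        inter = np.logical_and(scene_masks[j], seg_masks[j]).sum()
        total += inter / union
    return total / num_classes


def top_m_candidates(scene_masks: np.ndarray, database: list, m: int = 3) -> list:
    """Indices of the m segmentation maps with the highest IoU_r (descending order)."""
    scores = [layout_iou(scene_masks, seg) for seg in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)[:m]
```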
Step S40: the set of candidate segmentation maps is encoded by a background fusion module to produce the best matching background features.
The background fusion module encodes the set of x candidate segmentation maps: the x segmentation maps are fed into x convolutional layers (Conv layers) to obtain the feature representations of the different segmentation maps, and these are concatenated channel-wise into one feature representation, to which a pooling operation is applied. The feature representation of the convolved scene graph and the pooled feature representation of the segmentation maps are then concatenated channel-wise into a feature representation containing the foreground scene and the background categories. Finally, 2 convolutional layers perform feature learning and refinement to obtain the scene graph features fused with the background.
To visualize a smoother background, the background fusion module encodes the candidate segmentation maps and, given the retrieved candidate segmentation maps and their corresponding background label maps, downsamples the data with average pooling to reduce dimensionality. A convolutional network then encodes the background label maps into a background feature map.
Suppose the m retrieved segmentation maps are M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, and each retrieved segmentation map corresponds to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}. The maps l*_{r,0}, ..., l*_{r,m} are obtained and the l*_{r,i} (i = 0, 1, ..., m) are concatenated. A convolutional network F_1 then encodes the concatenated background label maps into a background feature map, where Pool denotes average pooling, and another convolutional neural network F_2 is used to obtain an updated feature map at each step. After T steps, the final background feature map l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
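The following PyTorch sketch illustrates the background fusion step described above, assuming the candidate background label maps are one-hot encoded with n_classes channels; the layer sizes, the nearest-neighbour upsampling of the pooled summary and the use of PyTorch are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackgroundFusion(nn.Module):
    """Fuse x candidate background label maps into one background feature map."""

    def __init__(self, num_candidates: int = 3, n_classes: int = 182, feat: int = 64):
        super().__init__()
        # one small convolutional encoder per candidate label map
        self.encoders = nn.ModuleList(
            nn.Conv2d(n_classes, feat, kernel_size=3, padding=1) for _ in range(num_candidates)
        )
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # average pooling to reduce dimensionality
        # two convolutional layers for feature learning and refinement
        self.refine = nn.Sequential(
            nn.Conv2d(2 * num_candidates * feat, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, kernel_size=3, padding=1),
        )

    def forward(self, label_maps):
        # label_maps: list of x tensors, each of shape (B, n_classes, H, W)
        feats = [enc(m) for enc, m in zip(self.encoders, label_maps)]
        cat = torch.cat(feats, dim=1)                                # channel-wise concatenation
        pooled = F.interpolate(self.pool(cat), size=cat.shape[-2:])  # pooled summary, back at full resolution
        fused = torch.cat([cat, pooled], dim=1)                      # channel-wise concat of both representations
        return self.refine(fused)                                    # best-matching background feature
```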
Step S50: and obtaining the foreground object from the generated scene graph.
The scene graph generation module, the background retrieval module, the background fusion module and the image generation module together form a generator (generator) for generating the countermeasure network. Meanwhile, a matching-aware discriminator is used to determine whether the image inputted into the discriminator is a real image (real image) or a generated image (generated image). The generator continuously generates more vivid images, and meanwhile, the discriminator continuously identifies whether the generated images are real images or not, and the two networks continuously game until Nash equilibrium is reached.
The foreground object and the background feature are input into a generation countermeasure network, the generation countermeasure network carries out space self-adaptive normalized coding on the foreground object and the background feature, the generation countermeasure network converts a batch normalization layer into a condition normalization layer, the batch normalization layer is modulated and activated by using the input through space self-adaptive learning conversion, and semantic information can be effectively spread in the whole network.
Let h^i denote the activations of the i-th layer of the deep convolutional network for a batch of N samples, let C^i be the number of channels in that layer, and let H^i and W^i be the height and width of the activation map in that layer. Unlike batch normalization, which applies the same channel-wise modulation everywhere, spatially-adaptive normalization modulates the normalized activations with learnable, spatially varying scaling and translation parameters, so that the output activation is

γ^i_{c,y,x} · (h^i_{n,c,y,x} − μ^i_c) / σ^i_c + β^i_{c,y,x},

where n ∈ N, c ∈ C^i, y ∈ H^i, x ∈ W^i; h^i_{n,c,y,x} is the activation at that location before normalization; and μ^i_c and σ^i_c are the mean and standard deviation of the activations in channel c:

μ^i_c = (1 / (N·H^i·W^i)) · Σ_{n,y,x} h^i_{n,c,y,x},
σ^i_c = sqrt( (1 / (N·H^i·W^i)) · Σ_{n,y,x} (h^i_{n,c,y,x})² − (μ^i_c)² ).
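A minimal PyTorch sketch of a spatially-adaptive normalization layer along the lines described above: activations are normalized per channel and then modulated by spatially varying γ and β predicted from the semantic input. The two-layer modulation network, the kernel sizes and the use of PyTorch are illustrative assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyAdaptiveNorm(nn.Module):
    """Per-channel normalization followed by spatially varying scale and shift."""

    def __init__(self, num_features: int, layout_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)   # computes (h - mu_c) / sigma_c
        self.shared = nn.Sequential(
            nn.Conv2d(layout_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, h: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        # h: (N, C, H, W) activations; layout: semantic input (e.g. foreground plus background features)
        normalized = self.norm(h)
        layout = F.interpolate(layout, size=h.shape[-2:], mode="nearest")
        actv = self.shared(layout)
        return self.gamma(actv) * normalized + self.beta(actv)   # gamma_{c,y,x} * h_hat + beta_{c,y,x}
```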
The weights of the generative adversarial network are trained with the adversarial loss function and the perceptual loss function to obtain the trained generative adversarial network, which is taken as the text-to-image synthesis model.
An adversarial loss function L_adv is used to train the generative adversarial network and fine-tune the entire network. The generator keeps producing more realistic images while the discriminator keeps trying to identify generated rather than real images, and the two networks keep playing this game until Nash equilibrium is reached.
The global difference between a generated image feature and the corresponding real image feature is measured with a perceptual loss function, which extracts features from the 7th fully-connected layer (FC7) of a pre-trained VGG16 network to measure the global difference between images.
L_p = (1/n) · Σ_{j=1}^{n} (1 / (C_j·H_j·W_j)) · ‖φ(x_j) − φ(x_r)‖²,

where x_j denotes the j-th generated image; x_r denotes the real image corresponding to the j-th generated image; n is the total number of images in the dataset; C_j, H_j and W_j are the number of channels, the height and the width of the image features; φ(·) denotes the features extracted by the pre-trained VGG16 network; and L_p denotes the perceptual loss function.
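A minimal PyTorch sketch of such a perceptual loss, taking VGG16 features up to the FC7 layer from torchvision; the exact layer slicing, the squared-error form and the use of torchvision are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16


class PerceptualLoss(nn.Module):
    """Squared distance between VGG16 FC7 features of generated and real images."""

    def __init__(self):
        super().__init__()
        net = vgg16(pretrained=True)
        self.features = net.features
        self.avgpool = net.avgpool
        # classifier[:4] ends with the second 4096-d linear layer, i.e. FC7
        self.fc7 = nn.Sequential(*list(net.classifier.children())[:4])
        for p in self.parameters():
            p.requires_grad_(False)          # the loss network stays frozen

    def _embed(self, x: torch.Tensor) -> torch.Tensor:
        x = self.avgpool(self.features(x))
        return self.fc7(torch.flatten(x, 1))

    def forward(self, generated: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        # mean over the batch and the feature dimension, standing in for the 1/(C*H*W) normalization
        return ((self._embed(generated) - self._embed(real)) ** 2).mean()
```

In training, this term would be added to the adversarial loss with a weighting coefficient when fine-tuning the whole network.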
Step S60: and inputting the text description to be converted into the trained text composite image model, and outputting a high-resolution image with consistent foreground and background vision by the text composite image model.
Example two
Fig. 2 is a schematic structural diagram of the text-to-image synthesis device based on saliency scene graph analysis, comprising:
the data preprocessing module 21 is configured to extract text descriptions from an existing data set, and construct an object data set, an attribute data set, and a relationship data set from the text descriptions according to dependency analysis;
the scene graph generation module 22 is configured to extract all objects in the object data set, all attributes in the attribute data set, and all relationships in the relationship data set, analyze the obtained objects, attributes, and relationships into a dependency tree, perform tree transformation on the dependency tree to obtain a semantic graph, construct a rule-based scene graph analyzer according to the semantic graph, map a dependency syntax in the dependency tree into a scene graph through the scene graph analyzer, and obtain a foreground object from the scene graph;
a background retrieval module 23, configured to retrieve the scene graph by using the background retrieval module, and select a group of candidate segmentation graphs most relevant to the scene graph from the candidate semantic segmentation graph database according to the layout similarity score;
a background fusion module 24, configured to encode the candidate segmentation map through the background fusion module to generate a best matching background feature;
the image generation module 25 is configured to input the foreground object and the background feature representation into the generated countermeasure network, and train the weight of the generated countermeasure network by using the countermeasure loss function and the perceptual loss function to obtain a trained generated countermeasure network; and taking the trained generated confrontation network as a text composite image model, inputting the text description to be converted into the trained text composite image model, and outputting an image corresponding to the text description to be converted by the text composite image model.
Specifically, the scene graph generation module 22 is configured to extract all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, take the dependency parse of the image description as the starting point of the scene graph, output the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2, parse the objects, attributes and relations into a dependency tree according to these dependency relations, and apply quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
Specifically, the background retrieval module 23 is configured to search the candidate semantic segmentation map database based on the instance information in the scene graph, compute a distance value between the scene graph and each retrieved segmentation map, take the distance values as layout similarity scores, sort all the layout similarity scores in descending order, and take the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
Specifically, the background fusion module 24 is configured to encode the group of x candidate segmentation maps, feed the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenate the feature representations of the x segmentation maps channel by channel into an overall feature representation, apply a pooling operation to the overall feature representation, concatenate the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and perform feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature;
wherein the m retrieved segmentation maps are M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, each retrieved segmentation map corresponding to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; l*_{r,0}, ..., l*_{r,m} are obtained and the l*_{r,i} (i = 0, 1, ..., m) are concatenated; a convolutional network F_1 encodes the concatenated background label maps into a background feature map, where Pool denotes average pooling; another convolutional network F_2 is used to obtain an updated feature map at each step; and after T steps, the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
Specifically, the image generation module 25 is configured to form, together with the scene graph generation module, the background retrieval module and the background fusion module, the generator of the generative adversarial network; to input the foreground objects and the background features into the generative adversarial network, which applies spatially-adaptive normalization to encode them; to feed the spatially-adaptively normalized image into the discriminator of the generative adversarial network; to use a matching-aware discriminator to judge whether the image fed to the discriminator is a real image or a generated image; and to train the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, where the generator keeps generating more images while the discriminator keeps distinguishing real images from generated images, and the generator and discriminator keep playing this game until Nash equilibrium between them is reached, yielding the trained generative adversarial network.
The specific process of performing text-to-image synthesis based on saliency scene graph analysis with the apparatus of the embodiment of the present invention is similar to that of the foregoing method embodiment and is not repeated here.
In summary, the text-to-image synthesis method based on saliency scene graph analysis according to the embodiments of the invention introduces the saliency scene graph into image synthesis, and effectively improves the accuracy of image synthesis by exploring the cross-modal text-to-semantic spatial configuration.
By screening the segmentation maps in the candidate set, the method automatically selects segmentation maps of better quality, fuses the background features, makes the foreground and background of the generated images consistent, improves image quality and reduces memory usage.
All modules of the system of the embodiment of the invention are automatic and require no manual intervention, so the system can easily be integrated into other text-based image synthesis systems. The system can also be divided into two subsystems: a text-to-saliency-scene-graph subsystem and a salient-object-layout-to-image subsystem. The text-to-saliency-scene-graph subsystem can be embedded into a general image retrieval and analysis system; similarly, the salient-object-layout-to-image subsystem can change the layout positions and categories according to user requirements to generate a general scene meeting those requirements, and therefore has broad application prospects.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are described relatively briefly since they are substantially similar to the method embodiments, and reference may be made to the corresponding parts of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A text-to-image synthesis method based on saliency scene graph analysis, characterized by comprising the following steps:
Step 1: extracting textual descriptions from existing datasets, and constructing an object dataset, an attribute dataset and a relation dataset from the textual descriptions according to dependency parsing;
Step 2: extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain a semantic graph;
Step 3: constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph through the scene graph parser, and obtaining foreground objects from the scene graph;
Step 4: retrieving with the scene graph using a background retrieval module, and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score;
Step 5: encoding the candidate segmentation maps through a background fusion module to generate the best-matching background features;
Step 6: inputting the foreground objects and the background feature representation into a generative adversarial network, and training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network;
Step 7: taking the trained generative adversarial network as the text-to-image synthesis model, inputting the textual description to be converted into the trained model, and having the model output the image corresponding to that textual description.
2. The method of claim 1, wherein extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain the semantic graph comprises:
extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset; taking the dependency parse of the image description as the starting point of the scene graph; outputting the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2; parsing the objects, attributes and relations into a dependency tree according to these dependency relations; and applying quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
wherein the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; and the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
3. The method of claim 1, wherein retrieving with the scene graph using a background retrieval module and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score comprises:
searching the candidate semantic segmentation map database with the background retrieval module based on the instance information in the scene graph, computing a distance value between the scene graph and each retrieved segmentation map, taking the distance values as layout similarity scores, sorting all the layout similarity scores in descending order, and taking the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
wherein, given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
4. The method according to claim 1, wherein encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background features comprises:
encoding the group of x candidate segmentation maps with the background fusion module, feeding the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenating the feature representations of the x segmentation maps channel by channel into an overall feature representation, applying a pooling operation to the overall feature representation, concatenating the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and performing feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature.
5. The method of claim 4, wherein encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background feature further comprises:
letting the m retrieved segmentation maps be M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, each retrieved segmentation map corresponding to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; obtaining l*_{r,0}, ..., l*_{r,m} and concatenating the l*_{r,i} (i = 0, 1, ..., m); encoding the concatenated background label maps into a background feature map with a convolutional network F_1, where Pool denotes average pooling; and using another convolutional network F_2 to obtain an updated feature map at each step, wherein after T steps the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
6. The method of claim 4, wherein inputting the foreground objects and the background feature representation into the generative adversarial network and training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function to obtain a trained generative adversarial network comprises:
a generator of the generative adversarial network is formed by the scene graph generation module, the background retrieval module, the background fusion module and the image generation module; the foreground objects and background features are input into the generative adversarial network, which performs spatially adaptive normalization encoding on them; the spatially adaptively normalized image is input into the discriminator of the generative adversarial network, and a matching discriminator judges whether the image input to the discriminator is a real image or a generated image; the weights of the generative adversarial network are trained with the adversarial loss function and the perceptual loss function; the generator continuously generates more images while the discriminator continuously distinguishes real images from generated images, and the generator and the discriminator keep playing this game until a Nash equilibrium between them is reached, yielding the trained generative adversarial network.
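A minimal PyTorch training-step sketch of the adversarial-plus-perceptual objective described in claim 6 follows. The generator, discriminator, and feature extractor `feat_net` are placeholders; the binary cross-entropy adversarial loss and the weight 10.0 on the perceptual term are assumptions; the spatially adaptive normalization inside the generator is not shown.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, feat_net,
                      g_opt, d_opt, foreground, background, real_image):
    """One update of the discriminator and one of the generator, combining a
    binary cross-entropy adversarial loss with an L1 perceptual loss computed
    on features from a fixed extractor `feat_net` (e.g. VGG features)."""
    # --- discriminator step: real images vs. generated images ---
    fake = generator(foreground, background).detach()
    real_logits, fake_logits = discriminator(real_image), discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- generator step: fool the discriminator and match deep features ---
    fake = generator(foreground, background)
    fake_logits = discriminator(fake)
    adv_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    perc_loss = F.l1_loss(feat_net(fake), feat_net(real_image).detach())
    g_loss = adv_loss + 10.0 * perc_loss          # the weight 10.0 is an assumption
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```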
7. A text-to-image synthesis device based on saliency scene graph analysis, comprising:
the data preprocessing module is used for extracting text descriptions from the existing data set and constructing an object data set, an attribute data set and a relation data set from the text descriptions according to dependency analysis;
the scene graph generation module is used for extracting all objects in the object data set, all attributes in the attribute data set and all relations in the relation data set, parsing the obtained objects, attributes and relations into a dependency tree, performing a tree transformation on the dependency tree to obtain a semantic graph, constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph with the scene graph parser, and obtaining the foreground objects from the scene graph;
the background retrieval module is used for retrieving with the scene graph and selecting, according to the layout similarity scores, a set of candidate segmentation maps most relevant to the scene graph from the candidate semantic segmentation map database;
the background fusion module is used for encoding the candidate segmentation maps to obtain the best-matched background feature;
the image generation module is used for inputting the foreground objects and the background feature representation into the generative adversarial network, training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function to obtain the trained generative adversarial network, taking the trained generative adversarial network as the text-to-image synthesis model, inputting the text description to be converted into the trained text-to-image synthesis model, and outputting from the model an image corresponding to the text description to be converted.
8. The apparatus of claim 7, wherein:
the scene graph generation module is specifically used for extracting all objects in the object data set, all attributes in the attribute data set and all relations in the relation data set, taking the parsing of the image description into a scene graph as the starting point, outputting the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2, parsing the objects, attributes and relations into a dependency tree according to these dependency relations, performing quantificational modifier processing, pronoun resolution processing and plural noun processing on the dependency tree, and converting the dependency tree into a semantic graph;
the quantificational modifier processing takes one word as the semantically most salient word, which becomes the head of all other words in the expression, and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution processing uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree and recover the relations between objects within a sentence; the plural noun processing replicates each node of the graph according to the value of its numeric modifier.
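As an illustration of the plural-noun rule only, the sketch below replicates each graph node according to its numeric modifier and duplicates its incident relations. The dictionary-based node and edge format is hypothetical and not the patent's data structure.

```python
from copy import deepcopy

def expand_plural_nouns(nodes, edges):
    """Replicate every node according to its numeric modifier ("two dogs"
    becomes two dog nodes) and duplicate its incident relations."""
    new_nodes, new_edges, copies = [], [], {}
    for node in nodes:
        count = max(1, node.get("num", 1))       # numeric modifier, default 1
        copies[node["id"]] = []
        for k in range(count):
            clone = deepcopy(node)
            clone["id"] = f'{node["id"]}_{k}'
            clone.pop("num", None)
            new_nodes.append(clone)
            copies[node["id"]].append(clone["id"])
    for edge in edges:                           # re-attach each relation to every copy
        for subj in copies[edge["subject"]]:
            for obj in copies[edge["object"]]:
                new_edges.append({"subject": subj, "predicate": edge["predicate"], "object": obj})
    return new_nodes, new_edges

# Example: expand_plural_nouns([{"id": "dog", "num": 2}, {"id": "park"}],
#                              [{"subject": "dog", "predicate": "in", "object": "park"}])
# yields dog_0, dog_1, park_0 and two "in" relations.
```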
9. The apparatus of claim 8, wherein:
the background retrieval module is specifically used for retrieving in the candidate semantic segmentation map database based on the instance information in the scene graph, computing a distance value between the scene graph and each retrieved segmentation map and taking it as the layout similarity score, sorting all layout similarity scores in descending order, and taking the top-ranked segmentation maps as the set of candidate segmentation maps most relevant to the scene graph;
given a scene graph S containing k instances {S_1, ..., S_k}, where c_i is the category of instance S_i, and given a segmentation map M containing l instances {M_1, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

$$\mathrm{IoU}_r(S, M) = \frac{1}{C}\sum_{j=1}^{C}\frac{\lvert S_j \cap M_j \rvert}{\lvert S_j \cup M_j \rvert}$$

where C is the total number of object classes, M_j denotes the set of the l segmentation instances whose category is j, S_j denotes the set of the k scene-graph instances whose category is j, and ∪ and ∩ denote the union and intersection operations, respectively.
10. The apparatus of claim 9, wherein:
the background fusion module is specifically configured to encode the set of x candidate segmentation maps, input the x encoded segmentation maps into x convolution layers, the x convolution layers respectively outputting feature representations of the x segmentation maps, concatenate these feature representations channel by channel into an overall feature representation, perform a pooling operation on the overall feature representation, concatenate the overall feature representation and the pooled overall feature representation along the channel dimension into a feature representation containing the foreground scene and the background categories, and perform feature learning and refinement on this feature representation with 2 convolution layers to obtain the best-matched background feature;
suppose m segmentation maps M_{r,0}, ..., M_{r,m} are retrieved together with their corresponding background label maps l_{r,0}, ..., l_{r,m}; combining the scene graph with each retrieved background label map yields a background label map l*_{r,i} for that candidate, giving l*_{r,0}, ..., l*_{r,m}; concatenating l*_{r,i} (i = 0, 1, ..., m) and encoding the result with a convolutional network F_1 gives the initial background feature map

$$l^{0} = F_1\big([\,l^{*}_{r,0}; \dots; l^{*}_{r,m}\,]\big)$$

which is then updated with another convolutional neural network F_2 as

$$l^{t+1} = F_2\big([\,l^{t}; \mathrm{Pool}(l^{t})\,]\big)$$

where Pool denotes average pooling; after T steps, the best-matched background feature l* = l^T is obtained, containing information from the scene graph and from the backgrounds of the salient objects;
the image generation module is specifically used for forming the generator of the generative adversarial network together with the scene graph generation module, the background retrieval module and the background fusion module, inputting the foreground objects and background features into the generative adversarial network, performing spatially adaptive normalization encoding on the foreground objects and background features, inputting the spatially adaptively normalized image into the discriminator of the generative adversarial network, judging with a matching discriminator whether the image input to the discriminator is a real image or a generated image, and training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, the generator continuously generating more images while the discriminator continuously distinguishes real images from generated images, the generator and the discriminator playing this game until a Nash equilibrium between them is reached, thereby obtaining the trained generative adversarial network.
CN202011381287.9A 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis Active CN112734881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011381287.9A CN112734881B (en) 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011381287.9A CN112734881B (en) 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis

Publications (2)

Publication Number Publication Date
CN112734881A true CN112734881A (en) 2021-04-30
CN112734881B CN112734881B (en) 2023-09-22

Family

ID=75598027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011381287.9A Active CN112734881B (en) 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis

Country Status (1)

Country Link
CN (1) CN112734881B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10297070B1 (en) * 2018-10-16 2019-05-21 Inception Institute of Artificial Intelligence, Ltd 3D scene synthesis techniques using neural network architectures
US20200242774A1 (en) * 2019-01-25 2020-07-30 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
兰红; 刘秦邑: "Scene graph to image generation model based on graph attention networks", 中国图象图形学报 (Journal of Image and Graphics), no. 08, pages 83 - 95 *
张素素; 倪建成; 周子力; 侯杰: "Image generation fusing semantic labels and noise prior", 计算机应用 (Journal of Computer Applications), no. 05, pages 195 - 203 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505772A (en) * 2021-06-23 2021-10-15 北京华创智芯科技有限公司 License plate image generation method and system based on generation countermeasure network
CN113505772B (en) * 2021-06-23 2024-05-10 北京华创智芯科技有限公司 License plate image generation method and system based on generation countermeasure network
CN113487629B (en) * 2021-07-07 2023-04-07 电子科技大学 Image attribute editing method based on structured scene and text description
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description
CN113793403A (en) * 2021-08-19 2021-12-14 西南科技大学 Text image synthesis method for simulating drawing process
CN113793403B (en) * 2021-08-19 2023-09-22 西南科技大学 Text image synthesizing method for simulating painting process
CN114048340A (en) * 2021-11-15 2022-02-15 电子科技大学 Hierarchical fusion combined query image retrieval method
CN114048340B (en) * 2021-11-15 2023-04-21 电子科技大学 Hierarchical fusion combined query image retrieval method
WO2023207531A1 (en) * 2022-04-29 2023-11-02 华为技术有限公司 Image processing method and related device
CN114708472B (en) * 2022-06-06 2022-09-09 浙江大学 AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
CN114708472A (en) * 2022-06-06 2022-07-05 浙江大学 AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
CN117593527A (en) * 2024-01-18 2024-02-23 厦门大学 Directional 3D instance segmentation method based on chain perception
CN117593527B (en) * 2024-01-18 2024-05-24 厦门大学 Directional 3D instance segmentation method based on chain perception

Also Published As

Publication number Publication date
CN112734881B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112734881B (en) Text synthesized image method and system based on saliency scene graph analysis
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
US11657230B2 (en) Referring image segmentation
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111144410B (en) Cross-modal image semantic extraction method, system, equipment and medium
CN111598183A (en) Multi-feature fusion image description method
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN117390497B (en) Category prediction method, device and equipment based on large language model
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN115269882A (en) Intellectual property retrieval system and method based on semantic understanding
CN113392265A (en) Multimedia processing method, device and equipment
CN111079374A (en) Font generation method, device and storage medium
CN114332288B (en) Method for generating text generation image of confrontation network based on phrase drive and network
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
Du et al. From plane to hierarchy: Deformable transformer for remote sensing image captioning
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
Zhou et al. Joint scence network and attention-guided for image captioning
CN114972907A (en) Image semantic understanding and text generation based on reinforcement learning and contrast learning
CN114580385A (en) Text semantic similarity calculation method combined with grammar
CN113486180A (en) Remote supervision relation extraction method and system based on relation hierarchy interaction
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant