CN112734881A - Text-to-image synthesis method and system based on saliency scene graph analysis

Text-to-image synthesis method and system based on saliency scene graph analysis

Info

Publication number
CN112734881A
Authority
CN
China
Prior art keywords
background
graph
scene
scene graph
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011381287.9A
Other languages
Chinese (zh)
Other versions
CN112734881B (en)
Inventor
郎丛妍
汪敏
李浥东
冯松鹤
王涛
孙鑫雨
李尊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202011381287.9A
Publication of CN112734881A
Application granted
Publication of CN112734881B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 11/00: 2D [Two Dimensional] image generation
            • G06T 11/60: Editing figures and text; Combining figures or text
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 40/00: Handling natural language data
            • G06F 40/20: Natural language analysis
              • G06F 40/279: Recognition of textual entities
              • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
            • G06F 40/30: Semantic analysis
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a text-to-image synthesis method and system based on saliency scene graph analysis. The method comprises the following steps: constructing a dependency tree from the textual description according to dependency parsing, performing tree transformations to obtain a semantic graph, and building a rule-based scene graph parser that maps the dependency syntactic representation to a scene graph; retrieving with the scene graph through a background retrieval module to obtain the candidate segmentation maps most relevant to the given scene graph; encoding the candidate segmentation maps through a background fusion module to obtain background features; and feeding the foreground objects and the background feature representation into a generative adversarial network to obtain a text-to-image synthesis model, which, given a test textual description as input, generates a high-resolution image with a visually consistent foreground and background. By introducing the saliency-based scene graph into image synthesis and exploring the cross-modal text-to-semantic spatial configuration, the invention effectively improves the accuracy of image synthesis.

Description

Text-to-image synthesis method and system based on saliency scene graph analysis
Technical Field
The invention relates to the technical field of computer vision, and in particular to a text-to-image synthesis method and system based on saliency scene graph analysis.
Background
Generating images from textual descriptions has long been an active research topic in computer vision. Because of its great potential and the challenges it poses in many applications, text-to-image generation has become an active research area for both the natural language processing and computer vision communities, with a wide range of applications including photo editing and computer-aided design. By allowing a user to describe visual concepts in natural language, it provides a natural and flexible interface for controlling image generation. With the rise of generative adversarial networks (GANs), image synthesis techniques have shown excellent results. Within the GAN framework, the quality of the generated images can be further improved by generating high-resolution images or by enhancing the textual information.
At present, the generation of complex images remains challenging. For example, generating an image from the textual description "people ride elephants and go through a river" requires reasoning about a variety of visual concepts, such as the object categories (people and elephants), the spatial configuration of the objects (riding) and the scene background (crossing a river), which is far more complex than generating a single large object. Due to the difficulty of learning a direct text-to-pixel mapping from ordinary images, existing methods have not succeeded in generating reasonable images for such complex textual descriptions. Developing a method that can efficiently generate complex images directly from textual descriptions is therefore a problem to be solved.
Disclosure of Invention
The embodiments of the invention provide a text-to-image synthesis method and system based on saliency scene graph analysis, so as to generate complex images directly and effectively from textual descriptions.
To achieve this purpose, the invention adopts the following technical solutions.
According to one aspect of the invention, a text-to-image synthesis method based on saliency scene graph analysis is provided, comprising the following steps:
Step 1: extracting textual descriptions from existing datasets, and constructing an object dataset, an attribute dataset and a relation dataset from the textual descriptions according to dependency parsing;
Step 2: extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain a semantic graph;
Step 3: constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph through the scene graph parser, and obtaining foreground objects from the scene graph;
Step 4: retrieving with the scene graph using a background retrieval module, and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score;
Step 5: encoding the candidate segmentation maps through a background fusion module to generate the best-matching background features;
Step 6: inputting the foreground objects and the background feature representation into a generative adversarial network, and training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network;
Step 7: taking the trained generative adversarial network as the text-to-image synthesis model, inputting the textual description to be converted into the trained model, and having the model output the image corresponding to that textual description.
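For illustration only, the following minimal Python sketch shows how the modules described in Steps 1 to 7 could be chained for a single textual description; the function names and interfaces (parse_scene_graph, retrieve_backgrounds, fuse_background, generate) are hypothetical stand-ins and are not defined by the patent.

```python
# Hypothetical interface names for illustration only; the patent does not define this API.
from typing import Any, Callable, Sequence


def synthesize_image(
    text: str,
    parse_scene_graph: Callable[[str], Any],               # Steps 1-3: text -> saliency scene graph
    retrieve_backgrounds: Callable[[Any], Sequence[Any]],   # Step 4: scene graph -> candidate segmentation maps
    fuse_background: Callable[[Sequence[Any]], Any],        # Step 5: candidates -> best-matching background feature
    generate: Callable[[Any, Any], Any],                    # Steps 6-7: (foreground objects, background feature) -> image
) -> Any:
    """Chain the modules of Steps 1-7 for a single textual description."""
    scene_graph = parse_scene_graph(text)           # dependency parse, tree transforms, rule-based scene graph parser
    foreground = scene_graph.foreground_objects     # assumes the scene graph object exposes its instances
    candidates = retrieve_backgrounds(scene_graph)  # ranked by the layout similarity score IoU_r
    background_feature = fuse_background(candidates)
    return generate(foreground, background_feature)  # trained GAN generator produces the final image
```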
Preferably, extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain a semantic graph comprises:
extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset; taking the dependency parse of the image description as the starting point of the scene graph; outputting the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2; parsing the objects, attributes and relations into a dependency tree according to these dependency relations; and applying quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
Preferably, retrieving with the scene graph using the background retrieval module and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score comprises:
searching the candidate semantic segmentation map database with the background retrieval module based on the instance information in the scene graph, computing a distance value between the scene graph and each retrieved segmentation map, taking the distance values as layout similarity scores, sorting all the layout similarity scores in descending order, and taking the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
given containing k instances
Figure RE-GDA0002991057340000031
Scene graph S, c ofiIs example SiGiven a category comprising l instances
Figure RE-GDA0002991057340000032
The segmentation map M and the corresponding correct label map I, the distance IoU between the scene map S and the segmentation map MrThe calculation method comprises the following steps:
Figure RE-GDA0002991057340000033
where C is the total number of object classes,
Figure RE-GDA0002991057340000041
Mjrepresents l division graphs
Figure RE-GDA0002991057340000042
Set of (1), SjRepresenting k scene graphs
Figure RE-GDA0002991057340000043
And (2) the sets of (U) and (n) respectively represent a union operation.
Preferably, encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background features comprises:
encoding the group of x candidate segmentation maps with the background fusion module, feeding the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenating the feature representations of the x segmentation maps channel by channel into an overall feature representation, applying a pooling operation to the overall feature representation, concatenating the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and performing feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature.
Preferably, encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background feature further comprises:
letting the m retrieved segmentation maps be M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, where each retrieved segmentation map corresponds to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; obtaining l*_{r,0}, ..., l*_{r,m} and concatenating the l*_{r,i} (i = 0, 1, ..., m); encoding the concatenated background label maps into a background feature map with a convolutional network F_1, where Pool denotes average pooling; and using another convolutional network F_2 to obtain an updated feature map at each step; after T steps, the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
Preferably, inputting the foreground objects and the background feature representation into the generative adversarial network and training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network comprises:
forming the generator of the generative adversarial network from the scene graph generation module, the background retrieval module, the background fusion module and the image generation module; inputting the foreground objects and the background features into the generative adversarial network, which applies spatially-adaptive normalization to encode the foreground objects and the background features; feeding the spatially-adaptively normalized image into the discriminator of the generative adversarial network; using a matching-aware discriminator to judge whether the image fed to the discriminator is a real image or a generated image; and training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, where the generator keeps generating more images while the discriminator keeps distinguishing real images from generated images, and the generator and discriminator keep playing this game until Nash equilibrium between them is reached, yielding the trained generative adversarial network.
According to another aspect of the present invention, there is provided a text-to-image synthesis device based on saliency scene graph analysis, comprising:
a data preprocessing module for extracting textual descriptions from existing datasets and constructing an object dataset, an attribute dataset and a relation dataset from the textual descriptions according to dependency parsing;
a scene graph generation module for extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, performing tree transformations on the dependency tree to obtain a semantic graph, constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph through the scene graph parser, and obtaining foreground objects from the scene graph;
a background retrieval module for retrieving with the scene graph and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score;
a background fusion module for encoding the candidate segmentation maps to generate the best-matching background feature;
an image generation module for inputting the foreground objects and the background feature representation into a generative adversarial network, training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network, taking the trained generative adversarial network as the text-to-image synthesis model, inputting the textual description to be converted into the trained model, and outputting the image corresponding to that textual description.
Preferably, the scene graph generation module is specifically configured to extract all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, take the dependency parse of the image description as the starting point of the scene graph, output the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2, parse the objects, attributes and relations into a dependency tree according to these dependency relations, and apply quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
Preferably, the background retrieval module is specifically configured to search the candidate semantic segmentation map database based on the instance information in the scene graph, compute a distance value between the scene graph and each retrieved segmentation map, take the distance values as layout similarity scores, sort all the layout similarity scores in descending order, and take the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
Preferably, the background fusion module is specifically configured to encode the group of x candidate segmentation maps, feed the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenate the feature representations of the x segmentation maps channel by channel into an overall feature representation, apply a pooling operation to the overall feature representation, concatenate the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and perform feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature;
wherein the m retrieved segmentation maps are M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, each retrieved segmentation map corresponding to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; l*_{r,0}, ..., l*_{r,m} are obtained and the l*_{r,i} (i = 0, 1, ..., m) are concatenated; a convolutional network F_1 encodes the concatenated background label maps into a background feature map, where Pool denotes average pooling; another convolutional network F_2 is used to obtain an updated feature map at each step; and after T steps, the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
Preferably, the image generation module is specifically configured to form, together with the scene graph generation module, the background retrieval module and the background fusion module, the generator of the generative adversarial network; to input the foreground objects and the background features into the generative adversarial network, which applies spatially-adaptive normalization to encode them; to feed the spatially-adaptively normalized image into the discriminator of the generative adversarial network; to use a matching-aware discriminator to judge whether the image fed to the discriminator is a real image or a generated image; and to train the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, where the generator keeps generating more images while the discriminator keeps distinguishing real images from generated images, and the generator and discriminator keep playing this game until Nash equilibrium between them is reached, yielding the trained generative adversarial network.
According to the technical solution provided by the embodiments of the invention, the text-to-image synthesis method based on saliency scene graph analysis introduces the saliency-based scene graph into image synthesis, and effectively improves the accuracy of image synthesis by exploring the cross-modal text-to-semantic spatial configuration.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of the text-to-image synthesis method based on saliency scene graph analysis according to an embodiment of the present invention;
Fig. 2 is a detailed structural diagram of the text-to-image synthesis device based on saliency scene graph analysis according to an embodiment of the present invention, comprising: a data preprocessing module 21, a scene graph generation module 22, a background retrieval module 23, a background fusion module 24 and an image generation module 25.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Example one
The processing flow of the text-to-image synthesis method based on saliency scene graph analysis provided by the embodiment of the invention is shown in Fig. 1 and comprises the following processing steps:
step S10: the textual description is extracted from the existing dataset. The existing datasets include the CUB dataset, the Oxford-102 dataset, and the MS-COCO dataset.
The dependency relations of a large number of existing textual descriptions are analyzed, and an object dataset, an attribute dataset and a relation dataset are constructed from the analysis results: the object dataset is obtained by sampling the noun objects in the textual descriptions, the attribute dataset consists of the attributes describing those objects, and the relation dataset consists of the relations between any two objects in the object dataset.
For example, given the textual description "A man is looking at his black watch", text dependency parsing resolves it into two objects o_1 = man and o_2 = watch, the relations e_1 = (o_1, look at, o_2) and e_2 = (o_1, have, o_2), and the attribute a = ({black}) attached to o_2.
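To make this parse result concrete, here is a minimal Python sketch of one possible in-memory representation of the resulting scene graph fragment; the dataclass names and fields are assumptions made for illustration, not data structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    subject: SceneObject
    predicate: str
    obj: SceneObject


# "A man is looking at his black watch"
man = SceneObject("man")                      # o1
watch = SceneObject("watch", ["black"])       # o2, with attribute a = ({black})
relations = [
    Relation(man, "look at", watch),          # e1 = (o1, look at, o2)
    Relation(man, "have", watch),             # e2 = (o1, have, o2), recovered from the possessive "his"
]
```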
Step S20: extracting all objects in the object data set, extracting all attributes in the attribute data set, extracting all relations in the relation data set, analyzing the obtained objects, attributes and relations into a dependency tree, performing tree conversion on the dependency tree for several times to obtain a semantic graph,
generic dependency parsing is in many ways close to a shallow semantic representation and is therefore the starting point for parsing an image description into a scene graph. The enhanced dependency representation output by Stanford Parser v3.5.2 is utilized to generate a dependency tree, and then, quantized modifier processing, pronoun parsing processing and plural noun processing steps are performed on the dependency tree to process complex quantized modifiers, parse pronouns and process multiple nouns, thereby generating a semantic graph.
The quantificational-modifier processing makes the first word the head of all other words in the expression and then makes this new multi-word expression depend on the following noun phrase. This step ensures that the semantic graphs of "both cars" and "both of the cars" have similar structures, in which the semantically salient word, cars, is the head. The pronoun resolution uses an improved intra-sentential pronoun resolver derived from the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in the sentence. The plural-noun processing copies each node of the graph according to the value of its numeric modifier, while limiting the number of copies per node to 20; if a plural noun lacks such a modifier, exactly one copy of the node is made.
The scene graph parser maps the dependency syntax in the dependency tree into a scene graph, and the scene graph can contain multiple instances.
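As a small illustration of the plural-noun rule described above, the sketch below copies a graph node according to its numeric modifier, capping the number of copies at 20 and falling back to a single copy when no modifier is present; the graph representation and function name are assumptions made for illustration.

```python
from typing import Dict, List, Optional

MAX_COPIES = 20  # the parser limits the number of copies per node to 20


def expand_plural_node(node: Dict, numeric_modifier: Optional[int] = None) -> List[Dict]:
    """Copy a plural-noun node according to the value of its numeric modifier."""
    if numeric_modifier is None:
        count = 1                                # plural noun without a numeric modifier: one copy
    else:
        count = min(numeric_modifier, MAX_COPIES)
    return [dict(node) for _ in range(count)]


# "three cars" -> three copies of the car node; "cars" alone -> one copy
assert len(expand_plural_node({"label": "car"}, numeric_modifier=3)) == 3
assert len(expand_plural_node({"label": "car"})) == 1
```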
Step S30: and searching the scene graph generated in the last step by using a background searching module, and selecting a group of candidate segmentation graphs which are most relevant to the scene graph from a large candidate semantic segmentation graph database by using the layout similarity score.
The background retrieval module utilizes the distance IoU between the scene graph and the segmentation graph as described belowrAnd (4) obtaining the product. Scene graph has no segmentation graph corresponding to the pair in the segmentation graph data set, and we will IoUrAnd (5) sorting the scores in a descending order, and calculating the Top-m (Top 3) scene graphs which are the most similar as candidates.
The layout similarity score is a variant of the intersection-over-union (IoU) metric; it measures the distance between the scene graph and a candidate segmentation map. Given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the layout similarity score is used to retrieve the pair (M, I) most similar to S, i.e., to measure the distance IoU_r between the scene graph describing the salient-object layout and the segmentation map containing the fine objects:

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
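A minimal NumPy sketch of the layout similarity score IoU_r and the Top-m candidate selection described above follows; it assumes that both the scene graph layout and each candidate segmentation map have been rasterized into per-class binary masks of shape (C, H, W), which is an illustrative choice rather than the patent's exact representation.

```python
import numpy as np


def layout_iou(scene_masks: np.ndarray, seg_masks: np.ndarray) -> float:
    """IoU_r between a scene-graph layout and a segmentation map.

    Both arguments are boolean arrays of shape (C, H, W): for each object
    class, the union of that class's instance masks.
    """
    num_classes = scene_masks.shape[0]
    total = 0.0
    for j in range(num_classes):
        union = np.logical_or(scene_masks[j], seg_masks[j]).sum()
        if union == 0:
            continue                      # class absent from both maps
        inter = np.logical_and(scene_masks[j], seg_masks[j]).sum()
        total += inter / union
    return total / num_classes


def top_m_candidates(scene_masks: np.ndarray, database: list, m: int = 3) -> list:
    """Indices of the m segmentation maps with the highest IoU_r (descending order)."""
    scores = [layout_iou(scene_masks, seg) for seg in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)[:m]
```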
Step S40: the set of candidate segmentation maps is encoded by a background fusion module to produce the best matching background features.
The background fusion module encodes the set of x candidate segmentation maps: the x segmentation maps are fed into x convolutional layers (Conv layers) to obtain the feature representations of the different segmentation maps, and these are concatenated channel-wise into one feature representation, to which a pooling operation is applied. The feature representation of the convolved scene graph and the pooled feature representation of the segmentation maps are then concatenated channel-wise into a feature representation containing the foreground scene and the background categories. Finally, 2 convolutional layers perform feature learning and refinement to obtain the scene graph features fused with the background.
To visualize a smoother background, the background fusion module encodes the candidate segmentation maps and, given the retrieved candidate segmentation maps and their corresponding background label maps, downsamples the data with average pooling to reduce dimensionality. A convolutional network then encodes the background label maps into a background feature map.
Suppose the m retrieved segmentation maps are M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, and each retrieved segmentation map corresponds to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}. The maps l*_{r,0}, ..., l*_{r,m} are obtained and the l*_{r,i} (i = 0, 1, ..., m) are concatenated. A convolutional network F_1 then encodes the concatenated background label maps into a background feature map, where Pool denotes average pooling, and another convolutional neural network F_2 is used to obtain an updated feature map at each step. After T steps, the final background feature map l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
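The following PyTorch sketch illustrates the background fusion step described above, assuming the candidate background label maps are one-hot encoded with n_classes channels; the layer sizes, the nearest-neighbour upsampling of the pooled summary and the use of PyTorch are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackgroundFusion(nn.Module):
    """Fuse x candidate background label maps into one background feature map."""

    def __init__(self, num_candidates: int = 3, n_classes: int = 182, feat: int = 64):
        super().__init__()
        # one small convolutional encoder per candidate label map
        self.encoders = nn.ModuleList(
            nn.Conv2d(n_classes, feat, kernel_size=3, padding=1) for _ in range(num_candidates)
        )
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # average pooling to reduce dimensionality
        # two convolutional layers for feature learning and refinement
        self.refine = nn.Sequential(
            nn.Conv2d(2 * num_candidates * feat, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, kernel_size=3, padding=1),
        )

    def forward(self, label_maps):
        # label_maps: list of x tensors, each of shape (B, n_classes, H, W)
        feats = [enc(m) for enc, m in zip(self.encoders, label_maps)]
        cat = torch.cat(feats, dim=1)                                # channel-wise concatenation
        pooled = F.interpolate(self.pool(cat), size=cat.shape[-2:])  # pooled summary, back at full resolution
        fused = torch.cat([cat, pooled], dim=1)                      # channel-wise concat of both representations
        return self.refine(fused)                                    # best-matching background feature
```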
Step S50: and obtaining the foreground object from the generated scene graph.
The scene graph generation module, the background retrieval module, the background fusion module and the image generation module together form a generator (generator) for generating the countermeasure network. Meanwhile, a matching-aware discriminator is used to determine whether the image inputted into the discriminator is a real image (real image) or a generated image (generated image). The generator continuously generates more vivid images, and meanwhile, the discriminator continuously identifies whether the generated images are real images or not, and the two networks continuously game until Nash equilibrium is reached.
The foreground object and the background feature are input into a generation countermeasure network, the generation countermeasure network carries out space self-adaptive normalized coding on the foreground object and the background feature, the generation countermeasure network converts a batch normalization layer into a condition normalization layer, the batch normalization layer is modulated and activated by using the input through space self-adaptive learning conversion, and semantic information can be effectively spread in the whole network.
Let h^i denote the activations of the i-th layer of the deep convolutional network for a batch of N samples, let C^i be the number of channels in that layer, and let H^i and W^i be the height and width of the activation map in that layer. Unlike batch normalization, which applies the same channel-wise modulation everywhere, spatially-adaptive normalization modulates the normalized activations with learnable, spatially varying scaling and translation parameters, so that the output activation is

γ^i_{c,y,x} · (h^i_{n,c,y,x} − μ^i_c) / σ^i_c + β^i_{c,y,x},

where n ∈ N, c ∈ C^i, y ∈ H^i, x ∈ W^i; h^i_{n,c,y,x} is the activation at that location before normalization; and μ^i_c and σ^i_c are the mean and standard deviation of the activations in channel c:

μ^i_c = (1 / (N·H^i·W^i)) · Σ_{n,y,x} h^i_{n,c,y,x},
σ^i_c = sqrt( (1 / (N·H^i·W^i)) · Σ_{n,y,x} (h^i_{n,c,y,x})² − (μ^i_c)² ).
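A minimal PyTorch sketch of a spatially-adaptive normalization layer along the lines described above: activations are normalized per channel and then modulated by spatially varying γ and β predicted from the semantic input. The two-layer modulation network, the kernel sizes and the use of PyTorch are illustrative assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyAdaptiveNorm(nn.Module):
    """Per-channel normalization followed by spatially varying scale and shift."""

    def __init__(self, num_features: int, layout_channels: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)   # computes (h - mu_c) / sigma_c
        self.shared = nn.Sequential(
            nn.Conv2d(layout_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, h: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        # h: (N, C, H, W) activations; layout: semantic input (e.g. foreground plus background features)
        normalized = self.norm(h)
        layout = F.interpolate(layout, size=h.shape[-2:], mode="nearest")
        actv = self.shared(layout)
        return self.gamma(actv) * normalized + self.beta(actv)   # gamma_{c,y,x} * h_hat + beta_{c,y,x}
```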
The weights of the generative adversarial network are trained with the adversarial loss function and the perceptual loss function to obtain the trained generative adversarial network, which is taken as the text-to-image synthesis model.
An adversarial loss function L_adv is used to train the generative adversarial network and fine-tune the entire network. The generator keeps producing more realistic images while the discriminator keeps trying to identify generated rather than real images, and the two networks keep playing this game until Nash equilibrium is reached.
The global difference between a generated image feature and the corresponding real image feature is measured with a perceptual loss function, which extracts features from the 7th fully-connected layer (FC7) of a pre-trained VGG16 network to measure the global difference between images.
L_p = (1/n) · Σ_{j=1}^{n} (1 / (C_j·H_j·W_j)) · ‖φ(x_j) − φ(x_r)‖²,

where x_j denotes the j-th generated image; x_r denotes the real image corresponding to the j-th generated image; n is the total number of images in the dataset; C_j, H_j and W_j are the number of channels, the height and the width of the image features; φ(·) denotes the features extracted by the pre-trained VGG16 network; and L_p denotes the perceptual loss function.
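A minimal PyTorch sketch of such a perceptual loss, taking VGG16 features up to the FC7 layer from torchvision; the exact layer slicing, the squared-error form and the use of torchvision are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16


class PerceptualLoss(nn.Module):
    """Squared distance between VGG16 FC7 features of generated and real images."""

    def __init__(self):
        super().__init__()
        net = vgg16(pretrained=True)
        self.features = net.features
        self.avgpool = net.avgpool
        # classifier[:4] ends with the second 4096-d linear layer, i.e. FC7
        self.fc7 = nn.Sequential(*list(net.classifier.children())[:4])
        for p in self.parameters():
            p.requires_grad_(False)          # the loss network stays frozen

    def _embed(self, x: torch.Tensor) -> torch.Tensor:
        x = self.avgpool(self.features(x))
        return self.fc7(torch.flatten(x, 1))

    def forward(self, generated: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        # mean over the batch and the feature dimension, standing in for the 1/(C*H*W) normalization
        return ((self._embed(generated) - self._embed(real)) ** 2).mean()
```

In training, this term would be added to the adversarial loss with a weighting coefficient when fine-tuning the whole network.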
Step S60: and inputting the text description to be converted into the trained text composite image model, and outputting a high-resolution image with consistent foreground and background vision by the text composite image model.
Example two
Fig. 2 is a schematic structural diagram of the text-to-image synthesis device based on saliency scene graph analysis, comprising:
the data preprocessing module 21 is configured to extract text descriptions from an existing data set, and construct an object data set, an attribute data set, and a relationship data set from the text descriptions according to dependency analysis;
the scene graph generation module 22 is configured to extract all objects in the object data set, all attributes in the attribute data set, and all relationships in the relationship data set, analyze the obtained objects, attributes, and relationships into a dependency tree, perform tree transformation on the dependency tree to obtain a semantic graph, construct a rule-based scene graph analyzer according to the semantic graph, map a dependency syntax in the dependency tree into a scene graph through the scene graph analyzer, and obtain a foreground object from the scene graph;
a background retrieval module 23, configured to retrieve the scene graph by using the background retrieval module, and select a group of candidate segmentation graphs most relevant to the scene graph from the candidate semantic segmentation graph database according to the layout similarity score;
a background fusion module 24, configured to encode the candidate segmentation map through the background fusion module to generate a best matching background feature;
the image generation module 25 is configured to input the foreground object and the background feature representation into the generated countermeasure network, and train the weight of the generated countermeasure network by using the countermeasure loss function and the perceptual loss function to obtain a trained generated countermeasure network; and taking the trained generated confrontation network as a text composite image model, inputting the text description to be converted into the trained text composite image model, and outputting an image corresponding to the text description to be converted by the text composite image model.
Specifically, the scene graph generation module 22 is configured to extract all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, take the dependency parse of the image description as the starting point of the scene graph, output the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2, parse the objects, attributes and relations into a dependency tree according to these dependency relations, and apply quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
Specifically, the background retrieval module 23 is configured to search the candidate semantic segmentation map database based on the instance information in the scene graph, compute a distance value between the scene graph and each retrieved segmentation map, take the distance values as layout similarity scores, sort all the layout similarity scores in descending order, and take the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
Specifically, the background fusion module 24 is configured to encode the group of x candidate segmentation maps, feed the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenate the feature representations of the x segmentation maps channel by channel into an overall feature representation, apply a pooling operation to the overall feature representation, concatenate the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and perform feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature;
wherein the m retrieved segmentation maps are M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, each retrieved segmentation map corresponding to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; l*_{r,0}, ..., l*_{r,m} are obtained and the l*_{r,i} (i = 0, 1, ..., m) are concatenated; a convolutional network F_1 encodes the concatenated background label maps into a background feature map, where Pool denotes average pooling; another convolutional network F_2 is used to obtain an updated feature map at each step; and after T steps, the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
Specifically, the image generation module 25 is configured to form, together with the scene graph generation module, the background retrieval module and the background fusion module, the generator of the generative adversarial network; to input the foreground objects and the background features into the generative adversarial network, which applies spatially-adaptive normalization to encode them; to feed the spatially-adaptively normalized image into the discriminator of the generative adversarial network; to use a matching-aware discriminator to judge whether the image fed to the discriminator is a real image or a generated image; and to train the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, where the generator keeps generating more images while the discriminator keeps distinguishing real images from generated images, and the generator and discriminator keep playing this game until Nash equilibrium between them is reached, yielding the trained generative adversarial network.
The specific process of performing text-to-image synthesis based on saliency scene graph analysis with the apparatus of the embodiment of the present invention is similar to that of the foregoing method embodiment and is not repeated here.
In summary, the text-to-image synthesis method based on saliency scene graph analysis according to the embodiments of the invention introduces the saliency scene graph into image synthesis, and effectively improves the accuracy of image synthesis by exploring the cross-modal text-to-semantic spatial configuration.
By screening the segmentation maps in the candidate set, the method automatically selects segmentation maps of better quality, fuses the background features, makes the foreground and background of the generated images consistent, improves image quality and reduces memory usage.
All modules of the system of the embodiment of the invention are automatic and require no manual intervention, so the system can easily be integrated into other text-based image synthesis systems. The system can also be divided into two subsystems: a text-to-saliency-scene-graph subsystem and a salient-object-layout-to-image subsystem. The text-to-saliency-scene-graph subsystem can be embedded into a general image retrieval and analysis system; similarly, the salient-object-layout-to-image subsystem can change the layout positions and categories according to user requirements to generate a general scene meeting those requirements, and therefore has broad application prospects.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are described relatively briefly since they are substantially similar to the method embodiments, and reference may be made to the corresponding parts of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A text-to-image synthesis method based on saliency scene graph analysis, characterized by comprising the following steps:
Step 1: extracting textual descriptions from existing datasets, and constructing an object dataset, an attribute dataset and a relation dataset from the textual descriptions according to dependency parsing;
Step 2: extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain a semantic graph;
Step 3: constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph through the scene graph parser, and obtaining foreground objects from the scene graph;
Step 4: retrieving with the scene graph using a background retrieval module, and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score;
Step 5: encoding the candidate segmentation maps through a background fusion module to generate the best-matching background features;
Step 6: inputting the foreground objects and the background feature representation into a generative adversarial network, and training the weights of the generative adversarial network with an adversarial loss function and a perceptual loss function to obtain a trained generative adversarial network;
Step 7: taking the trained generative adversarial network as the text-to-image synthesis model, inputting the textual description to be converted into the trained model, and having the model output the image corresponding to that textual description.
2. The method of claim 1, wherein extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset, parsing the obtained objects, attributes and relations into a dependency tree, and performing tree transformations on the dependency tree to obtain the semantic graph comprises:
extracting all objects in the object dataset, all attributes in the attribute dataset and all relations in the relation dataset; taking the dependency parse of the image description as the starting point of the scene graph; outputting the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2; parsing the objects, attributes and relations into a dependency tree according to these dependency relations; and applying quantificational-modifier processing, pronoun resolution and plural-noun processing to the dependency tree to convert it into a semantic graph;
wherein the quantificational-modifier processing makes one word, the semantically salient one, the head of all other words in the expression and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree to recover the relations between objects in a sentence; and the plural-noun processing copies each node of the graph according to the value of its numeric modifier.
3. The method of claim 1, wherein retrieving with the scene graph using a background retrieval module and selecting, from the candidate semantic segmentation map database, a group of candidate segmentation maps most relevant to the scene graph according to the layout similarity score comprises:
searching the candidate semantic segmentation map database with the background retrieval module based on the instance information in the scene graph, computing a distance value between the scene graph and each retrieved segmentation map, taking the distance values as layout similarity scores, sorting all the layout similarity scores in descending order, and taking the top-ranked segmentation maps as the group of candidate segmentation maps most relevant to the scene graph;
wherein, given a scene graph S containing k instances {S_1, S_2, ..., S_k}, where c_i denotes the category of instance S_i, and given a segmentation map M containing l instances {M_1, M_2, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

IoU_r(S, M) = (1/C) · Σ_{j=1}^{C} |S_j ∩ M_j| / |S_j ∪ M_j|,

where C is the total number of object classes, M_j denotes the set formed by the instance masks {M_1, ..., M_l} belonging to class j, S_j denotes the set formed by the scene graph instances {S_1, ..., S_k} belonging to class j, and ∪ and ∩ denote the union and intersection operations, respectively.
4. The method according to claim 1, wherein encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background features comprises:
encoding the group of x candidate segmentation maps with the background fusion module, feeding the encoded x segmentation maps into x convolutional layers which output the feature representations of the x segmentation maps, concatenating the feature representations of the x segmentation maps channel by channel into an overall feature representation, applying a pooling operation to the overall feature representation, concatenating the overall feature representation and the pooled overall feature representation channel-wise into a feature representation containing the foreground scene and the background categories, and performing feature learning and refinement on this feature representation with 2 convolutional layers to obtain the best-matching background feature.
5. The method of claim 4, wherein encoding the candidate segmentation maps through the background fusion module to obtain the best-matching background feature further comprises:
letting the m retrieved segmentation maps be M_{r,0}, ..., M_{r,m} with corresponding background label maps l_{r,0}, ..., l_{r,m}, each retrieved segmentation map corresponding to a background label map l*_{r,i} derived from M_{r,i} and l_{r,i}; obtaining l*_{r,0}, ..., l*_{r,m} and concatenating the l*_{r,i} (i = 0, 1, ..., m); encoding the concatenated background label maps into a background feature map with a convolutional network F_1, where Pool denotes average pooling; and using another convolutional network F_2 to obtain an updated feature map at each step, wherein after T steps the best-matching background feature l* = l^T is obtained, which contains information from the scene graph and the background of the salient objects.
6. The method of claim 4, wherein inputting the foreground objects and the background feature representation into the generative adversarial network and training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function to obtain a trained generative adversarial network comprises:
a generator of the generative adversarial network is formed by the scene graph generation module, the background retrieval module, the background fusion module and the image generation module; the foreground objects and background features are input into the generative adversarial network, which performs spatially adaptive normalization encoding on them; the spatially adaptively normalized image is input into the discriminator of the generative adversarial network, and a matching discriminator judges whether the image input to the discriminator is a real image or a generated image; the weights of the generative adversarial network are trained with the adversarial loss function and the perceptual loss function; the generator continuously generates more images while the discriminator continuously distinguishes real images from generated images, and the generator and the discriminator keep playing this game until a Nash equilibrium between them is reached, yielding the trained generative adversarial network.
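A minimal PyTorch training-step sketch of the adversarial-plus-perceptual objective described in claim 6 follows. The generator, discriminator, and feature extractor `feat_net` are placeholders; the binary cross-entropy adversarial loss and the weight 10.0 on the perceptual term are assumptions; the spatially adaptive normalization inside the generator is not shown.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, feat_net,
                      g_opt, d_opt, foreground, background, real_image):
    """One update of the discriminator and one of the generator, combining a
    binary cross-entropy adversarial loss with an L1 perceptual loss computed
    on features from a fixed extractor `feat_net` (e.g. VGG features)."""
    # --- discriminator step: real images vs. generated images ---
    fake = generator(foreground, background).detach()
    real_logits, fake_logits = discriminator(real_image), discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- generator step: fool the discriminator and match deep features ---
    fake = generator(foreground, background)
    fake_logits = discriminator(fake)
    adv_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    perc_loss = F.l1_loss(feat_net(fake), feat_net(real_image).detach())
    g_loss = adv_loss + 10.0 * perc_loss          # the weight 10.0 is an assumption
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```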
7. A text-to-image synthesis device based on saliency scene graph analysis, comprising:
the data preprocessing module is used for extracting text descriptions from the existing data set and constructing an object data set, an attribute data set and a relation data set from the text descriptions according to dependency analysis;
the scene graph generation module is used for extracting all objects in the object data set, all attributes in the attribute data set and all relations in the relation data set, parsing the obtained objects, attributes and relations into a dependency tree, performing a tree transformation on the dependency tree to obtain a semantic graph, constructing a rule-based scene graph parser from the semantic graph, mapping the dependency syntax in the dependency tree into a scene graph with the scene graph parser, and obtaining the foreground objects from the scene graph;
the background retrieval module is used for retrieving with the scene graph and selecting, according to the layout similarity scores, a set of candidate segmentation maps most relevant to the scene graph from the candidate semantic segmentation map database;
the background fusion module is used for encoding the candidate segmentation maps to obtain the best-matched background feature;
the image generation module is used for inputting the foreground objects and the background feature representation into the generative adversarial network, training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function to obtain the trained generative adversarial network, taking the trained generative adversarial network as the text-to-image synthesis model, inputting the text description to be converted into the trained text-to-image synthesis model, and outputting from the model an image corresponding to the text description to be converted.
8. The apparatus of claim 7, wherein:
the scene graph generation module is specifically used for extracting all objects in the object data set, all attributes in the attribute data set and all relations in the relation data set, taking the parsing of the image description into a scene graph as the starting point, outputting the dependency relations among the objects, attributes and relations with Stanford Parser v3.5.2, parsing the objects, attributes and relations into a dependency tree according to these dependency relations, performing quantificational modifier processing, pronoun resolution processing and plural noun processing on the dependency tree, and converting the dependency tree into a semantic graph;
the quantificational modifier processing takes one word as the semantically most salient word, which becomes the head of all other words in the expression, and makes this new multi-word expression depend on the following noun phrase; the pronoun resolution processing uses an improved intra-sentential pronoun resolver inspired by the first three rules of the Hobbs algorithm, so that it can operate on the dependency tree and recover the relations between objects within a sentence; the plural noun processing replicates each node of the graph according to the value of its numeric modifier.
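As an illustration of the plural-noun rule only, the sketch below replicates each graph node according to its numeric modifier and duplicates its incident relations. The dictionary-based node and edge format is hypothetical and not the patent's data structure.

```python
from copy import deepcopy

def expand_plural_nouns(nodes, edges):
    """Replicate every node according to its numeric modifier ("two dogs"
    becomes two dog nodes) and duplicate its incident relations."""
    new_nodes, new_edges, copies = [], [], {}
    for node in nodes:
        count = max(1, node.get("num", 1))       # numeric modifier, default 1
        copies[node["id"]] = []
        for k in range(count):
            clone = deepcopy(node)
            clone["id"] = f'{node["id"]}_{k}'
            clone.pop("num", None)
            new_nodes.append(clone)
            copies[node["id"]].append(clone["id"])
    for edge in edges:                           # re-attach each relation to every copy
        for subj in copies[edge["subject"]]:
            for obj in copies[edge["object"]]:
                new_edges.append({"subject": subj, "predicate": edge["predicate"], "object": obj})
    return new_nodes, new_edges

# Example: expand_plural_nouns([{"id": "dog", "num": 2}, {"id": "park"}],
#                              [{"subject": "dog", "predicate": "in", "object": "park"}])
# yields dog_0, dog_1, park_0 and two "in" relations.
```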
9. The apparatus of claim 8, wherein:
the background retrieval module is specifically used for retrieving in the candidate semantic segmentation map database based on the instance information in the scene graph, computing a distance value between the scene graph and each retrieved segmentation map and taking it as the layout similarity score, sorting all layout similarity scores in descending order, and taking the top-ranked segmentation maps as the set of candidate segmentation maps most relevant to the scene graph;
given a scene graph S containing k instances {S_1, ..., S_k}, where c_i is the category of instance S_i, and given a segmentation map M containing l instances {M_1, ..., M_l} together with its corresponding ground-truth label map I, the distance IoU_r between the scene graph S and the segmentation map M is computed as

$$\mathrm{IoU}_r(S, M) = \frac{1}{C}\sum_{j=1}^{C}\frac{\lvert S_j \cap M_j \rvert}{\lvert S_j \cup M_j \rvert}$$

where C is the total number of object classes, M_j denotes the set of the l segmentation instances whose category is j, S_j denotes the set of the k scene-graph instances whose category is j, and ∪ and ∩ denote the union and intersection operations, respectively.
10. The apparatus of claim 9, wherein:
the background fusion module is specifically configured to encode the set of x candidate segmentation maps, input the x encoded segmentation maps into x convolution layers, the x convolution layers respectively outputting feature representations of the x segmentation maps, concatenate these feature representations channel by channel into an overall feature representation, perform a pooling operation on the overall feature representation, concatenate the overall feature representation and the pooled overall feature representation along the channel dimension into a feature representation containing the foreground scene and the background categories, and perform feature learning and refinement on this feature representation with 2 convolution layers to obtain the best-matched background feature;
suppose m segmentation maps M_{r,0}, ..., M_{r,m} are retrieved together with their corresponding background label maps l_{r,0}, ..., l_{r,m}; combining the scene graph with each retrieved background label map yields a background label map l*_{r,i} for that candidate, giving l*_{r,0}, ..., l*_{r,m}; concatenating l*_{r,i} (i = 0, 1, ..., m) and encoding the result with a convolutional network F_1 gives the initial background feature map

$$l^{0} = F_1\big([\,l^{*}_{r,0}; \dots; l^{*}_{r,m}\,]\big)$$

which is then updated with another convolutional neural network F_2 as

$$l^{t+1} = F_2\big([\,l^{t}; \mathrm{Pool}(l^{t})\,]\big)$$

where Pool denotes average pooling; after T steps, the best-matched background feature l* = l^T is obtained, containing information from the scene graph and from the backgrounds of the salient objects;
the image generation module is specifically used for forming the generator of the generative adversarial network together with the scene graph generation module, the background retrieval module and the background fusion module, inputting the foreground objects and background features into the generative adversarial network, performing spatially adaptive normalization encoding on the foreground objects and background features, inputting the spatially adaptively normalized image into the discriminator of the generative adversarial network, judging with a matching discriminator whether the image input to the discriminator is a real image or a generated image, and training the weights of the generative adversarial network with the adversarial loss function and the perceptual loss function, the generator continuously generating more images while the discriminator continuously distinguishes real images from generated images, the generator and the discriminator playing this game until a Nash equilibrium between them is reached, thereby obtaining the trained generative adversarial network.
CN202011381287.9A 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis Active CN112734881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011381287.9A CN112734881B (en) 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011381287.9A CN112734881B (en) 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis

Publications (2)

Publication Number Publication Date
CN112734881A true CN112734881A (en) 2021-04-30
CN112734881B CN112734881B (en) 2023-09-22

Family

ID=75598027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011381287.9A Active CN112734881B (en) 2020-12-01 2020-12-01 Text synthesized image method and system based on saliency scene graph analysis

Country Status (1)

Country Link
CN (1) CN112734881B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10297070B1 (en) * 2018-10-16 2019-05-21 Inception Institute of Artificial Intelligence, Ltd 3D scene synthesis techniques using neural network architectures
US20200242774A1 (en) * 2019-01-25 2020-07-30 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
兰红; 刘秦邑: "Scene graph to image generation model based on graph attention networks", 中国图象图形学报 (Journal of Image and Graphics), no. 08, pages 83 - 95 *
张素素; 倪建成; 周子力; 侯杰: "Image generation fusing semantic labels and noise prior", 计算机应用 (Journal of Computer Applications), no. 05, pages 195 - 203 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505772A (en) * 2021-06-23 2021-10-15 北京华创智芯科技有限公司 License plate image generation method and system based on generation countermeasure network
CN113505772B (en) * 2021-06-23 2024-05-10 北京华创智芯科技有限公司 License plate image generation method and system based on generation countermeasure network
CN113487629B (en) * 2021-07-07 2023-04-07 电子科技大学 Image attribute editing method based on structured scene and text description
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description
CN113793403A (en) * 2021-08-19 2021-12-14 西南科技大学 Text image synthesis method for simulating drawing process
CN113793403B (en) * 2021-08-19 2023-09-22 西南科技大学 Text image synthesizing method for simulating painting process
CN114048340A (en) * 2021-11-15 2022-02-15 电子科技大学 Hierarchical fusion combined query image retrieval method
CN114048340B (en) * 2021-11-15 2023-04-21 电子科技大学 Hierarchical fusion combined query image retrieval method
WO2023207531A1 (en) * 2022-04-29 2023-11-02 华为技术有限公司 Image processing method and related device
CN114708472B (en) * 2022-06-06 2022-09-09 浙江大学 AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
CN114708472A (en) * 2022-06-06 2022-07-05 浙江大学 AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
CN117593527A (en) * 2024-01-18 2024-02-23 厦门大学 Directional 3D instance segmentation method based on chain perception
CN117593527B (en) * 2024-01-18 2024-05-24 厦门大学 Directional 3D instance segmentation method based on chain perception

Also Published As

Publication number Publication date
CN112734881B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN112734881B (en) Text synthesized image method and system based on saliency scene graph analysis
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
US11657230B2 (en) Referring image segmentation
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111144410B (en) Cross-modal image semantic extraction method, system, equipment and medium
CN111598183A (en) Multi-feature fusion image description method
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN117390497B (en) Category prediction method, device and equipment based on large language model
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN115269882A (en) Intellectual property retrieval system and method based on semantic understanding
CN113392265A (en) Multimedia processing method, device and equipment
CN111079374A (en) Font generation method, device and storage medium
CN114332288B (en) Method for generating text generation image of confrontation network based on phrase drive and network
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
Du et al. From plane to hierarchy: Deformable transformer for remote sensing image captioning
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
Zhou et al. Joint scence network and attention-guided for image captioning
CN114972907A (en) Image semantic understanding and text generation based on reinforcement learning and contrast learning
CN114580385A (en) Text semantic similarity calculation method combined with grammar
CN113486180A (en) Remote supervision relation extraction method and system based on relation hierarchy interaction
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant