CN113793404A - Artificially controllable image synthesis method based on text and outline - Google Patents

Artificially controllable image synthesis method based on text and outline

Info

Publication number
CN113793404A
CN113793404A (application CN202110953936.6A)
Authority
CN
China
Prior art keywords
image
text information
optimized
feature vector
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110953936.6A
Other languages
Chinese (zh)
Other versions
CN113793404B (en
Inventor
俞文心
张志强
甘泽军
龚梦石
文茄汁
龚俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN202110953936.6A priority Critical patent/CN113793404B/en
Publication of CN113793404A publication Critical patent/CN113793404A/en
Application granted granted Critical
Publication of CN113793404B publication Critical patent/CN113793404B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an artificially controllable image synthesis method based on text and outline. In the custom-synthesis step, a basic outline is drawn and basic text information is input; the outline and the text are encoded into their respective feature vectors, which are then combined to synthesize a corresponding image. In the optimization-and-correction step, optimized text information is input; the synthesized image and the optimized text information are encoded into corresponding feature vectors, which are combined to synthesize an optimized image. The invention improves the controllability of image synthesis and can synthesize high-quality image results that fully conform to a person's subjective intent.

Description

Artificially controllable image synthesis method based on text and outline
Technical Field
The invention belongs to the technical field of image synthesis, and particularly relates to an artificially controllable image synthesis method based on text and outline.
Background
Controllable image synthesis is one of the most important technical challenges currently facing artificial intelligence in the field of computer vision. Images contain far richer content than text or voice, which makes high-quality image synthesis difficult for a machine. The rapid development of artificial intelligence in recent years has brought major breakthroughs in image synthesis, and machines can now synthesize subjectively realistic complex images (such as faces, album covers, and room layouts). With the advent of the artificial-intelligence era, machines are entrusted with more missions, the most important of which is to understand human subjective intent more accurately and thus serve humans better. Against this background, image synthesis is moving toward human control, i.e. a machine synthesizes the corresponding image according to human intent. Artificially controllable image synthesis improves the practicality of image synthesis technology and helps popularize image synthesis software. In addition, artificial control makes machines more intelligent, further advancing the development of artificial intelligence.
Prior-art image synthesis techniques are unsatisfactory in terms of human controllability. Most cannot introduce human control factors at all, that is, the whole synthesis process cannot be controlled manually. Some techniques introduce limited control factors, such as letting a person enter the class label of an image to determine the type of image synthesized, or enter a natural-language description to determine the basic content of the synthesized image. Entering a category label provides only coarse control, because a label carries too little information. For example, given the manually input label "bird", a label-based method can synthesize an image of a bird, but the bird's specifics (such as its color and size) remain manually uncontrollable. A natural-language description carries more information and provides better artificial control, but it lacks a constraint on the overall layout of the image, so the synthesized result may still fall short of human expectations.
Disclosure of Invention
In order to solve the problems, the invention provides a text and outline-based artificially controllable image synthesis method, which can improve the controllability of image synthesis and synthesize a high-quality image result under the condition of completely meeting the subjective intention of people.
In order to achieve the purpose, the invention adopts the technical scheme that: a method for synthesizing artificially controllable images based on texts and outlines comprises the following steps:
custom synthesis: drawing a basic outline and inputting basic text information, coding the basic outline and the basic text information to obtain respective feature vectors, and combining the feature vectors of the basic outline and the basic text information together to synthesize a corresponding image;
optimizing and correcting: inputting optimized text information, encoding the synthesized image and the optimized text information to obtain corresponding feature vectors, and combining these feature vectors to synthesize an optimized image.
Further, when encoding the basic outline and the basic text information to obtain respective feature vectors: the feature vector of the basic outline is obtained through a convolutional neural network, and the feature vector of the basic text information is obtained through a bidirectional long short-term memory network.
Further, when combining the feature vectors of the basic outline and the basic text information to synthesize a corresponding image, the outline feature vector and the basic text feature vector are concatenated, and the concatenated feature vector is converted into the corresponding image by deconvolution.
Further, when the synthesized image and the optimized text information are encoded to obtain the corresponding feature vector: and acquiring the feature vector of the synthesized image through a convolutional neural network, and acquiring the feature vector of the optimized text information through a bidirectional long-short term memory network.
Furthermore, when the feature vectors of the synthesized image and the optimized text information are combined to obtain the optimized image, the two feature vectors are first concatenated, and the concatenated feature vector is then converted into the corresponding optimized image by deconvolution.
Further, optimized text information may be input multiple times: each sequentially synthesized image and the newly added optimized text information are encoded to obtain corresponding feature vectors, which are then combined to obtain a further optimized image.
Further, when the sequentially synthesized image and the newly added optimized text information are encoded to obtain the corresponding feature vectors: the feature vector of the synthesized image is obtained through a convolutional neural network, and the feature vector of the newly added optimized text information is obtained through a bidirectional long short-term memory network.
Furthermore, when the feature vectors of the sequentially synthesized image and the newly added optimized text information are combined to obtain the optimized image, the two feature vectors are first concatenated, and the concatenated feature vector is then converted into the corresponding optimized image by deconvolution.
The beneficial effects of the technical scheme are as follows:
The invention synthesizes corresponding images from text information and simple outline information: the text controls the basic content of the synthesized image, and the outline controls its basic shape. Both kinds of information are input manually, are simple, and match natural human input habits. The invention therefore realizes a highly controllable image synthesis technique that can synthesize high-quality results fully conforming to a person's subjective intent, which helps advance controllable image synthesis and machine intelligence. Because a person participates in the whole synthesis process and plays a key controlling role, the synthesized result can meet basic human expectations, improving the practicality of image synthesis technology and aiding the popularization of image synthesis software.
Drawings
FIG. 1 is a schematic flow chart of a method for synthesizing artificially controllable images based on texts and outlines according to the present invention;
FIG. 2 is a schematic diagram illustrating the principle of a text and outline-based artificially controllable image synthesis method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.
In this embodiment, referring to fig. 1 and fig. 2, the present invention provides a method for synthesizing an artificially controllable image based on text and outline, comprising the steps of:
custom synthesis: drawing a basic outline and inputting basic text information, coding the basic outline and the basic text information to obtain respective feature vectors, and combining the feature vectors of the basic outline and the basic text information together to synthesize a corresponding image;
optimizing and correcting: inputting optimized text information, encoding the synthesized image and the optimized text information to obtain corresponding feature vectors, and combining these feature vectors to synthesize an optimized image.
As an optimization scheme of the above embodiment, when encoding the basic outline and the basic text information to obtain respective feature vectors: the feature vector of the basic outline is obtained through a convolutional neural network, and the feature vector of the basic text information is obtained through a bidirectional long short-term memory network.
The concrete implementation formulas are as follows:
Feature vector of the basic outline: enc_contour = CNN(I_c);
Feature vector of the basic text information: enc_text = Bi_LSTM(T);
where I_c denotes the manually drawn simple outline map and T denotes the manually input text content. CNN denotes a convolutional neural network used to encode the outline map into the corresponding feature vector; Bi-LSTM denotes a bidirectional long short-term memory network used to encode the text into a text vector.
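As an illustrative sketch only, the two encoders can be mimicked with toy stand-ins: a single random-weight convolution with global pooling in place of the CNN, and a simple bidirectional tanh RNN in place of the Bi-LSTM. All layer sizes, weights, and the tokenization below are invented for the example and are not the patented networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_contour(img, n_filters=8, k=3):
    """Toy CNN encoder: one valid convolution + ReLU + global average
    pooling, yielding an n_filters-dimensional feature vector.
    Weights are random here; in practice they would be learned."""
    h, w = img.shape
    filters = rng.standard_normal((n_filters, k, k)) * 0.1
    feats = np.zeros(n_filters)
    for f in range(n_filters):
        acc = 0.0
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                acc += max(0.0, np.sum(img[i:i+k, j:j+k] * filters[f]))
        feats[f] = acc / ((h - k + 1) * (w - k + 1))
    return feats

def encode_text(token_ids, vocab=100, emb=16, hidden=8):
    """Toy Bi-LSTM stand-in: a bidirectional plain RNN (tanh cell)
    whose final forward and backward states are concatenated."""
    E = rng.standard_normal((vocab, emb)) * 0.1
    Wx = rng.standard_normal((hidden, emb)) * 0.1
    Wh = rng.standard_normal((hidden, hidden)) * 0.1
    def run(ids):
        h = np.zeros(hidden)
        for t in ids:
            h = np.tanh(Wx @ E[t] + Wh @ h)
        return h
    return np.concatenate([run(token_ids), run(token_ids[::-1])])

contour = rng.random((16, 16))           # I_c: hand-drawn outline map
tokens = [5, 17, 42, 3]                  # T: tokenized text description
enc_contour = encode_contour(contour)    # feature vector of the outline
enc_text = encode_text(tokens)           # feature vector of the text
print(enc_contour.shape, enc_text.shape)  # (8,) (16,)
```

A trained model would replace the random weights with learned parameters and use a real LSTM cell with gates; the sketch only shows how each input modality is reduced to a fixed-length feature vector.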
When the feature vectors of the basic outline and the basic text information are combined to synthesize the corresponding image, the outline feature vector and the basic text feature vector are concatenated, and the concatenated feature vector is converted into the corresponding image by deconvolution.
The concrete implementation formulas are as follows:
f_c = concat(enc_contour, enc_text);
I_g = deconvolution(f_c);
where f_c denotes the feature vector obtained by concatenating the outline feature vector and the basic text feature vector, and I_g denotes the generated image result; the concat function concatenates the feature vectors, and deconvolution denotes a transposed convolution used to convert the concatenated feature vector into the corresponding image.
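The concat-then-deconvolution step can be sketched as follows; the kernel, stride, and seed-map shape are illustrative assumptions, not values from the patent:

```python
import numpy as np

def deconv2d(x, kernel, stride=2):
    """Transposed convolution ('deconvolution'): each input value
    stamps a stride-spaced, scaled copy of the kernel onto the
    output grid, so a small map is upsampled to a larger one."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros((stride * (h - 1) + k, stride * (w - 1) + k))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+k, j*stride:j*stride+k] += x[i, j] * kernel
    return out

rng = np.random.default_rng(0)
enc_contour = rng.random(8)    # stand-ins for the two encoder outputs
enc_text = rng.random(16)

# f_c = concat(enc_contour, enc_text)
f_c = np.concatenate([enc_contour, enc_text])          # shape (24,)

# I_g = deconvolution(f_c): reshape the joint vector into a small
# spatial seed map, then upsample it with a (here random) kernel.
seed_map = f_c.reshape(4, 6)
kernel = rng.standard_normal((3, 3)) * 0.1
img = deconv2d(seed_map, kernel)
print(f_c.shape, img.shape)    # (24,) (9, 13)
```

In a real generator the transposed convolutions would be stacked and their kernels learned jointly with the encoders; the single random kernel here merely shows how a concatenated feature vector is mapped back to a spatial grid.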
As an optimization scheme of the above embodiment, when encoding the synthesized image and the optimized text information to obtain the corresponding feature vectors: the feature vector of the synthesized image is obtained through a convolutional neural network, and the feature vector of the optimized text information is obtained through a bidirectional long short-term memory network.
The concrete implementation formulas are as follows:
Feature vector of the synthesized image: enc_gen = CNN(I_g);
Feature vector of the optimized text information: enc_text_new = Bi_LSTM(T_new);
where I_g denotes the synthesized image and T_new denotes the optimized text information; CNN denotes a convolutional neural network used to encode the synthesized image into the corresponding feature vector; Bi-LSTM denotes a bidirectional long short-term memory network used to encode the text into a text vector.
When the feature vectors of the synthesized image and the optimized text information are combined to synthesize the optimized image, the two feature vectors are first concatenated, and the concatenated feature vector is then converted into the corresponding optimized image by deconvolution.
The concrete implementation formulas are as follows:
f_c_new = concat(enc_gen, enc_text_new);
I_g_new = deconvolution(f_c_new);
where f_c_new denotes the feature vector obtained by concatenating the feature vector of the synthesized image and that of the optimized text, and I_g_new denotes the newly synthesized, optimized image; the concat function concatenates the feature vectors, and deconvolution denotes a transposed convolution used to convert the concatenated feature vector into the corresponding image.
The person then continues to enter text information to modify the previously synthesized image result, and this modification process can continue until the synthesized image meets the person's requirements. Specifically, the result of the custom-synthesis stage may not be satisfactory, so text information conforming to the person's subjective intent can be input to modify the synthesized image; if the newly synthesized content is still unsatisfactory, further text can be entered to modify it again. The entire content-modification stage thus provides a highly human-controllable factor.
The custom-synthesis stage allows a person to draw an outline to determine the basic shape of the image result and to input text information to determine its basic content. The subsequent content-modification stage allows the person to keep entering new text descriptions to modify the content of the synthesized image until the result is satisfactory. Because a person participates in and controls the whole synthesis process, the invention realizes an image synthesis effect with the highest degree of artificial controllability currently available.
As an optimization scheme of the above embodiment, as shown in fig. 2, optimized text information can be input multiple times: each sequentially synthesized image and the newly added optimized text information are encoded to obtain corresponding feature vectors, which are then combined to obtain a further optimized image.
When the sequentially synthesized image and the newly added optimized text information are encoded to obtain the corresponding feature vectors: the feature vector of the synthesized image is obtained through a convolutional neural network, and the feature vector of the newly added optimized text information is obtained through a bidirectional long short-term memory network.
When the feature vectors of the sequentially synthesized image and the newly added optimized text information are combined to obtain the optimized image, the two feature vectors are first concatenated, and the concatenated feature vector is then converted into the corresponding optimized image by deconvolution.
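The two-stage flow with repeated correction rounds can be summarized in a sketch; all encoder and decoder bodies below are trivial placeholders standing in for the trained CNN, Bi-LSTM, and deconvolution networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder encoders/decoder standing in for the trained networks.
def cnn_encode(image):
    return image.mean(axis=1)             # (H, W) -> (H,) toy feature

def bilstm_encode(text):
    return np.full(8, float(len(text)))   # length-8 toy text feature

def deconv_decode(vec):
    side = 16
    return np.outer(vec, vec)[:side, :side] % 1.0  # toy 16x16 image

def synthesize(contour, text):
    """Custom-synthesis stage: outline + basic text -> first image."""
    f = np.concatenate([cnn_encode(contour), bilstm_encode(text)])
    return deconv_decode(f)

def refine(image, new_text):
    """One optimization-and-correction round: previous result +
    newly added optimized text -> updated image."""
    f = np.concatenate([cnn_encode(image), bilstm_encode(new_text)])
    return deconv_decode(f)

contour = rng.random((16, 16))
image = synthesize(contour, "a small red bird on a branch")
for new_text in ["make the bird blue", "add a longer tail"]:
    image = refine(image, new_text)       # repeat until satisfied
print(image.shape)  # (16, 16)
```

The loop mirrors the patent's iterative correction: the output of each round is fed back as the image input of the next round, together with whatever new text the user supplies.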
Specific examples may employ:
First, a web-based image synthesis system
A web interface similar to Baidu Translate is provided; a person enters a text description and draws simple outline information in the interface, then clicks a synthesis button to generate the corresponding image result. The user can then continue to enter text descriptions in the interface to modify the previously synthesized image content.
Second, customized image synthesis software
The software comprises two parts: customized synthesis of images and content modification of images.
The customized image synthesis software built on the invention allows a user to draw simple outlines and input text information in the software, after which the software automatically synthesizes the corresponding image. The user can then continue to enter text descriptions that meet personal expectations in the content-modification function, and the software modifies the previously synthesized image content according to the newly entered text. The software can be used, for example, in early-childhood education and computer-aided structural design.
The foregoing shows and describes the general principles, principal features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the embodiments and descriptions in the specification merely illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A method for synthesizing an artificially controllable image based on text and outline is characterized by comprising the following steps:
custom synthesis: drawing a basic outline and inputting basic text information, coding the basic outline and the basic text information to obtain respective feature vectors, and combining the feature vectors of the basic outline and the basic text information together to synthesize a corresponding image;
optimizing and correcting: inputting optimized text information, encoding the synthesized image and the optimized text information to obtain corresponding feature vectors, and combining these feature vectors to synthesize an optimized image.
2. The method of claim 1, wherein when encoding the basic outline and the basic text information to obtain the respective feature vectors: the feature vector of the basic outline is obtained through a convolutional neural network, and the feature vector of the basic text information is obtained through a bidirectional long short-term memory network.
3. The method as claimed in claim 2, wherein when the feature vectors of the basic outline and the basic text information are combined to synthesize the corresponding image, the outline feature vector and the basic text feature vector are concatenated, and the concatenated feature vector is converted into the corresponding image by deconvolution.
4. The method as claimed in claim 1 or 3, wherein when the synthesized image and the optimized text information are encoded to obtain the corresponding feature vectors: the feature vector of the synthesized image is obtained through a convolutional neural network, and the feature vector of the optimized text information is obtained through a bidirectional long short-term memory network.
5. The method as claimed in claim 4, wherein when the feature vectors of the synthesized image and the optimized text information are combined to obtain the optimized image, the two feature vectors are concatenated, and the concatenated feature vector is then converted into the corresponding optimized image by deconvolution.
6. The method as claimed in claim 1, wherein the optimized text information is input multiple times; the sequentially synthesized image and the newly added optimized text information are encoded to obtain corresponding feature vectors, and these feature vectors are then combined to obtain the optimized image.
7. The method as claimed in claim 6, wherein when the sequentially synthesized image and the newly added optimized text information are encoded to obtain the corresponding feature vectors: the feature vector of the synthesized image is obtained through a convolutional neural network, and the feature vector of the newly added optimized text information is obtained through a bidirectional long short-term memory network.
8. The method as claimed in claim 7, wherein when the feature vectors of the sequentially synthesized image and the newly added optimized text information are combined to obtain the optimized image, the feature vector of the synthesized image and that of the newly added optimized text are concatenated, and the concatenated feature vector is then converted into the corresponding optimized image by deconvolution.
CN202110953936.6A 2021-08-19 2021-08-19 Manual controllable image synthesis method based on text and contour Active CN113793404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110953936.6A CN113793404B (en) 2021-08-19 2021-08-19 Manual controllable image synthesis method based on text and contour

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110953936.6A CN113793404B (en) 2021-08-19 2021-08-19 Manual controllable image synthesis method based on text and contour

Publications (2)

Publication Number Publication Date
CN113793404A true CN113793404A (en) 2021-12-14
CN113793404B CN113793404B (en) 2023-07-04

Family

ID=79181923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110953936.6A Active CN113793404B (en) 2021-08-19 2021-08-19 Manual controllable image synthesis method based on text and contour

Country Status (1)

Country Link
CN (1) CN113793404B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009004636A2 * 2007-07-05 2009-01-08 Playwagon Ltd. A method, device and system for providing rendered multimedia content to a message recipient device
WO2013179303A2 * 2012-05-16 2013-12-05 Tata Consultancy Services Limited A system and method for personalization of an appliance by using context information
CN109190611A * 2018-08-14 2019-01-11 Jiangxi Normal University Internet-based crowdsourced genealogy compilation system
CN109271537A * 2018-08-10 2019-01-25 Peking University Text-to-image generation method and system based on distillation learning
CN109472858A * 2017-09-06 2019-03-15 NVIDIA Corporation Differentiable rendering pipeline for inverse graphics
CN110188775A * 2019-05-28 2019-08-30 创意信息技术股份有限公司 Automatic image content description generation method based on a joint neural network model
CN110503054A * 2019-08-27 2019-11-26 Guangdong University of Technology Text image processing method and device
CN111260740A * 2020-01-16 2020-06-09 South China University of Technology Text-to-image generation method based on generative adversarial networks
US10713821B1 * 2019-06-27 2020-07-14 Amazon Technologies, Inc. Context aware text-to-image synthesis
CN111858954A * 2020-06-29 2020-10-30 Southwest China Institute of Electronic Technology (the 10th Research Institute of China Electronics Technology Group Corporation) Task-oriented text-to-image generation network model
CN112712095A * 2019-10-24 2021-04-27 Southwest University of Science and Technology Robust multi-kernel subspace clustering algorithm based on joint-entropy scale weighting and block-diagonal regularization

Non-Patent Citations (1)

Title
ZHANG Zhiqiang: "Research on Image-Text Conversion Algorithms Based on Deep Learning", China Masters' Theses Full-text Database (Electronic Journals), pages 138-423 *

Also Published As

Publication number Publication date
CN113793404B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
Guo et al. Auto-embedding generative adversarial networks for high resolution image synthesis
Liao et al. Tada! text to animatable digital avatars
CN111915693B (en) Sketch-based face image generation method and sketch-based face image generation system
Orvalho et al. A Facial Rigging Survey.
CN113269872A (en) Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN110880315A (en) Personalized voice and video generation system based on phoneme posterior probability
CN104463788B (en) Human motion interpolation method based on movement capturing data
CN102497513A (en) Video virtual hand language system facing digital television
Guo et al. Sparsectrl: Adding sparse controls to text-to-video diffusion models
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
CN113051420B (en) Robot vision man-machine interaction method and system based on text generation video
CN111160138A (en) Fast face exchange method based on convolutional neural network
CN115272632B (en) Virtual fitting method based on gesture migration
CN117994447B (en) Auxiliary generation method and system for 3D image of vehicle model design oriented to sheet
CN110189404B (en) Virtual face modeling method based on real face image
Song et al. Clipvg: Text-guided image manipulation using differentiable vector graphics
Li et al. Tuning-free image customization with image and text guidance
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
CN113793404B (en) Manual controllable image synthesis method based on text and contour
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
CN112241708A (en) Method and apparatus for generating new person image from original person image
CN116563443A (en) Shoe appearance design and user customization system based on 3D generation countermeasure network
CN111862276B (en) Automatic skeletal animation production method based on formalized action description text
CN112435319A (en) Two-dimensional animation generating system based on computer processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant