CN117911548A - Target scene synthesis using generative AI - Google Patents

Target scene synthesis using generative AI

Info

Publication number: CN117911548A
Authority: CN (China)
Prior art keywords: image, sub, visual element, generated, visual
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202310958821.5A
Other languages: Chinese (zh)
Inventors: O·布拉迪克兹卡, I·罗斯卡, A·达拉比, A·V·科斯汀, A·奇库丽塔
Current assignee: Adobe Inc (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Adobe Systems Inc
Priority claimed from: US 18/322,073 (related publication US 2024/0127511 A1)
Application filed by: Adobe Systems Inc
Publication of: CN117911548A

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06F 40/20: Natural language analysis
    • G06F 40/253: Grammatical analysis; Style critique
    • G06F 40/30: Semantic analysis
    • G06N 20/00: Machine learning
    • G06T 2210/22: Cropping (indexing scheme for image generation or computer graphics)
    • G06T 2210/61: Scene description (indexing scheme for image generation or computer graphics)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present disclosure relate to target scene synthesis using generative AI. Embodiments of the present disclosure provide techniques for generating synthetic target scenes using natural language prompts. The method includes receiving a natural language description of an image to be generated using a machine learning model. The method further includes extracting control elements and sub-prompts from the natural language description of the image to be generated. The method further includes identifying relationships between the control elements and the sub-prompts based on the natural language description of the image to be generated. The method also includes generating, by the machine learning model, an image based on the control elements, the sub-prompts, and the relationships. The image includes visual elements corresponding to the control elements and the sub-prompts.

Description

Target scene synthesis using generative AI
Cross Reference to Related Applications
The present application claims the benefit of U.S. Provisional Application No. 63/416,882, filed October 17, 2022, and of U.S. Application No. 18/322,523, filed May 23, 2023, both of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of computers, and more particularly, to methods, systems, and media for target scene synthesis.
Background
Digital tools allow artists to express creative efforts in digital workspaces. For example, an artist (or other creator) creates a scene in a digital workspace. A scene is a collection of concepts or objects, and the relationships between those objects, created in a digital workspace through an artist's creative effort and ideas. In particular, the scene includes a composition (or structural arrangement) of visual elements. Sometimes, an artist creates each of the objects (or other visual elements) of a scene. Alternatively, the artist may reuse portions of previously created objects and adapt such objects to new scenes. However, different levels of artist skill lead to inconsistent scene quality and varying amounts of effort, time, and resources (both computational and human) required to create a scene. Furthermore, adapting previously created objects to new scenes can be very time consuming.
Disclosure of Invention
Presented herein are techniques for generating a synthetic target scene using natural language prompts. In other embodiments, synthesizing the target scene is also based on a source image. The synthesized target scene includes a structure that follows any desired styles, visual elements, and/or image operations included in the natural language prompt. The generation of the target scene enables the user to create digital art using a natural language description of the user's creative ideas. In effect, the target scene generation system renders the creative ideas of the user regardless of the user's skill. Using prompts, or natural language instructions, the target scene generation system creates a composition of images and/or generates images to facilitate the user's creative exploration.
More particularly, in one or more embodiments, the target scene generation system decomposes the received text description of the target scene into separate sub-prompts for image generation. Such decomposition is performed using natural language processing techniques. Further, the decomposition performed by the target scene generation system parses control language out of the sub-prompts (such as objects) of the target scene, wherein the control language defines image operations on the composition of visual elements.
The target scene generation system also derives groupings of sub-prompts from the arrangement of the control language fragments and the syntactic structure of the prompt. Such groupings are converted into visual elements of the scene, as well as additional image operations. Finally, the user may edit the generated scene, where the generated scene is a recommendation of arranged visual elements determined by the target scene generation system from the image operations and sub-prompts.
Additional features and advantages of exemplary embodiments of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of such exemplary embodiments.
Drawings
The detailed description will be described with reference to the accompanying drawings, in which:
FIG. 1 illustrates a diagram of a process of generating a target scene based on input in accordance with one or more embodiments;
FIG. 2 illustrates an example of an input extractor in accordance with one or more embodiments;
FIG. 3 illustrates natural language prompts processed by an input extractor in accordance with one or more embodiments;
FIG. 4 illustrates an example frame-generated image in accordance with one or more embodiments;
FIGS. 5-6 illustrate examples of described scenes and corresponding recommended target scenes in accordance with one or more embodiments;
FIG. 7 illustrates an example implementation of a diffusion model in accordance with one or more embodiments;
FIG. 8 illustrates a diffusion process for training a diffusion model in accordance with one or more embodiments;
FIG. 9 illustrates a schematic diagram of a target scene generation system in accordance with one or more embodiments;
FIG. 10 illustrates a flow diagram of a series of actions in a method of composing a target scene using natural language descriptions in accordance with one or more embodiments; and
FIG. 11 illustrates a block diagram of an exemplary computing device in accordance with one or more embodiments.
Detailed Description
One or more embodiments of the present disclosure include a target scene generation system that uses prompts to create a composition of an image and/or to generate an image as a structured scene. One conventional approach involves manually creating the visual elements of a scene and adapting the visual elements to match the creative effort of the user. However, this approach is time consuming, and the manually created image varies based on the skill level of the user. Other conventional methods automatically generate images from provided descriptions of images and styles. However, these methods cannot generate structured scenes. For example, the composition of visual elements does not preserve the original image details (or the user's target scene description). In other words, the structure of the visual elements is not hierarchical or otherwise properly arranged in the scene.
To address these and other deficiencies in conventional systems, the target scene generation system of the present disclosure combines natural language processing for analyzing and decomposing text descriptions, text-based image operations for defining a composition, and generative AI for automatically creating a composite image having a desired style, visual elements, and image operations.
Providing a scene recommendation to a user that includes structured visual elements (e.g., a composition) reduces the computing resources (such as power, memory, and bandwidth) spent adjusting, creating, or otherwise adapting the visual elements of a target scene. For example, the target scene generation system of the present disclosure maintains the composition of the target scene by decomposing the natural language description of the target scene. Decomposing the natural language description results in the identification of control language and of descriptive scene language, which allows groups of sub-prompts to be derived. Because of this grouping, the structural composition of the target scene is preserved. In this way, the likelihood that the generated scene fails to maintain the composition of the target scene is reduced. Thus, the user does not have to execute the scene generation algorithm (or refine the structure of the generated scene) multiple times because of a failed structural composition of the visual elements.
FIG. 1 illustrates a diagram of a process of generating a target scene based on input in accordance with one or more embodiments. As shown in FIG. 1, an embodiment includes a target scene generation system 100. The target scene generation system 100 includes an input extractor 102, an image composer 110, one or more modules (such as a generative AI module 106), and an image compiler 118.
At numeral 1, the target scene generation system 100 receives an input 120. The input 120 is a prompt, or a textual description, that includes 1) image operations, 2) sentences with a syntactic structure reflecting the composition, and/or 3) objects and subjects. The desired composition (otherwise referred to herein as a description of the target scene or image to be generated) is described using a prompt in a natural language format.
In some embodiments, the target scene generation system 100 receives a source image as part of the input 120. The source image may be a computer-generated image, a user-uploaded image (such as a video frame or a picture captured by a camera (or other sensor)), or the like. In some embodiments, the source image is used as a basis (or baseline) for the target scene determined by the target scene generation system 100. For example, the target scene generation system 100 may revise visual elements of the source image, add visual elements (or remove visual elements), etc., to change the composition of the source image.
In some embodiments, the source image is a previously generated image (e.g., output 122). In these embodiments, the input 120 may include a revision (e.g., a prompt for a revision) of the previous description of the scene. These inputs relate to user revisions/modifications to the scene based on the displayed scene (e.g., output 122). Revisions to the previous description of the scene include a revised natural language description, or a detected user interaction with a portion of the scene. The detected user interactions include mouse presses, mouse releases, haptic feedback, keyboard input, voice commands, and the like. In a particular example, the user may resize a visual element of the scene by clicking/dragging a portion of the visual element of the output 122. Thus, the input 120 to the target scene generation system 100 includes the adjustment to that portion of the visual element, and a revised visual element is generated.
At numeral 2, the input extractor 102 extracts information from the input using any one or more modules, as described with reference to FIG. 2. In operation, the input extractor 102 identifies and parses out information defining the visual elements of the target scene (e.g., the arrangement/composition of objects in the image). For example, such parsed information from the natural language description of the image (e.g., input 120) may include sub-prompts from the received prompt and identified control language. Although this disclosure primarily describes control language and sub-prompts, it should be understood that one or more modules of the input extractor 102 can be used to extract other information from the input. The parsed information (e.g., control language and sub-prompts) is used to generate visual elements (e.g., objects) in the target scene.
At numeral 3, the image composer 110 maps the parsed information from the input to one or more modules. For example, the image composer may receive the sub-prompt(s) and the control language. The image composer 110 provides the information determined from the input 120 to subsequent modules (such as the generative AI module 106 and the image compiler 118). For example, the image composer 110 may receive the sub-prompts determined by the structure analyzer 204 of the input extractor 102 described in FIG. 2. The image composer 110 then provides such sub-prompts to the generative AI module 106. By providing sub-prompts to the generative AI module 106, the image composer 110 determines how many times the generative AI module 106 is executed (e.g., how many visual elements should be generated by the generative AI module 106). Similarly, the image composer 110 may receive the control language determined by the control language identifier 202 of the input extractor 102 described in FIG. 2. Subsequently, the image composer 110 supplies such control language to the image compiler 118. By providing the control language to the image compiler 118, the image composer 110 may arrange the composition of the target scene.
In some embodiments, image composer 110 determines one or more semantically related terms for the sub-prompts and/or control language parsed from input 120. Image composer 110 may also group the semantically related terms with the sub-prompts and/or control language to maintain the relationship between the semantically related terms and the sub-prompts and/or control language. Image composer 110 may perform any one or more semantic similarity analyses to determine features/characteristics corresponding to the sub-prompts. Such semantically related features/characteristics corresponding to the sub-prompts may be obtained from one or more external or internal (not shown) databases, data stores, memories, servers, applications, and the like. For example, the image composer 110 may retrieve the mapped features/characteristics of the sub-prompts. In response to determining/obtaining one or more semantically related features/characteristics corresponding to a sub-prompt, the image composer 110 may provide the generative AI module 106 with features in the set of one or more semantically related features corresponding to the sub-prompt. In this manner, the generative AI module 106 generates visual elements for one or more features of the sub-prompts.
In a non-limiting example, the sub-prompts extracted from the input 120 may include "pirate". In response to receiving the sub-prompt, image composer 110 may determine semantically related features of a pirate. For example, a semantically related feature corresponding to a pirate may be a smirk. Subsequently, the image composer 110 provides the "smirk" feature corresponding to the "pirate" sub-prompt to the generative AI module 106. Thus, the generative AI module 106 generates a "smirk" facial expression. To ensure that the generated "smirk" feature is applied to the "pirate" sub-prompt, the image composer 110 groups the "smirk" feature and the "pirate" sub-prompt together. In this way, when the image compiler 118 arranges visual elements to generate an image, the relationship between "smirk" and "pirate" is maintained, as described herein.
The image composer 110 may determine the number of features in the set of one or more semantically related features provided to the generative AI module 106 based on user-configurable parameters or other information extracted from the input. For example, in response to a user indicating a highly stylized target scene, the image composer 110 will determine to send more features from the feature set to the generative AI module 106. By sending more features of the sub-prompts to the generative AI module 106, the image composer 110 receives more visual elements semantically related to the sub-prompts. Thus, the target scene includes more visual elements. Conversely, in response to the user indicating a lightly stylized target scene, the image composer 110 will determine to send fewer features from the feature set to the generative AI module 106. By sending fewer features of the sub-prompts to the generative AI module 106, the image composer 110 receives fewer visual elements semantically related to the sub-prompts. Thus, the target scene includes fewer visual elements.
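To make the feature-selection behavior concrete, the following is a minimal sketch assuming a simple lookup table of semantically related features and a normalized stylization degree; the feature dictionary, function name, and 0.0 to 1.0 scale are illustrative assumptions rather than the patent's implementation.

```python
# Minimal sketch of selecting semantically related features for a sub-prompt.
# The feature dictionary and the stylization parameter are illustrative
# assumptions; a production system might use embeddings or an external catalog.

SEMANTIC_FEATURES = {
    "pirate": ["smirk", "rosy cheeks", "eye patch", "weathered skin"],
    "butterfly": ["iridescent wings", "curled antennae", "painterly texture"],
}

def features_for_sub_prompt(sub_prompt: str, stylization: float) -> list:
    """Return more related features for a higher stylization degree (0.0-1.0)."""
    candidates = SEMANTIC_FEATURES.get(sub_prompt, [])
    count = round(stylization * len(candidates))
    return candidates[:count]

# A highly stylized scene pulls in more features for "pirate".
print(features_for_sub_prompt("pirate", stylization=0.75))  # three features
print(features_for_sub_prompt("pirate", stylization=0.25))  # one feature
```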
At numeral 4, the image composer 110 provides the image information to the generative AI module 106 so that the generative AI module 106 can generate visual elements. For example, the image composer 110 provides each sub-prompt in the set of sub-prompts (e.g., one or more sub-prompts extracted from the input 120) to the generative AI module 106. Additionally or alternatively, the image composer 110 provides each feature in the feature set corresponding to the sub-prompts to the generative AI module 106. The image composer 110 may also provide control language to the generative AI module 106, depending on the control language determined from the input 120. For example, some control language describes effects such as frames, shapes, shadows, exposure, and the like. These effects require the generative AI module 106 to generate visual elements. Such control language is referred to herein as control elements. Other control language may indicate image operations and/or composition operations. Such control language is not a control element and, thus, does not correspond to a visual element generated by the generative AI module 106.
At numeral 5, the generative AI module 106 generates visual elements (e.g., objects or subjects) of the target scene using the information received from the image composer (e.g., each of the sub-prompts, features, and/or control elements). The generative AI module 106 generates an image using the control elements, sub-prompts, and relationships identified between the control elements and the sub-prompts (determined using the input extractor 102 as described herein) obtained from the natural language description of the image (e.g., the input 120).
In some embodiments, the generative AI module 106 can receive a batch of prompts, wherein each prompt in the batch includes a sub-prompt, a feature of the sub-prompt, and/or a control element. The sub-prompts, features of the sub-prompts, and/or control elements each correspond to a subject/object (e.g., a visual element) of the target scene. In other embodiments, the generative AI module 106 can receive a single prompt including a plurality of sub-prompts, features, and/or control elements.
The generative AI module 106 may be any generative AI configured to generate an image using natural language prompts. In some embodiments, the generative AI module 106 generates a neural image and a neural layer based on the sub-prompts. In other embodiments, the visual elements generated by the generative AI module 106 may be generated from scratch (e.g., using the generative AI module 106) and/or generated using a source image (e.g., an image including one or more objects received as part of the input 120).
The generative AI module 106 may be any artificial intelligence including one or more neural networks. A neural network may include a machine learning model that may be adjusted (e.g., trained) to approximate an unknown function based on training input. In particular, the neural network may include a model of interconnected digital neurons that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. For example, the neural network includes one or more machine learning algorithms. In other words, a neural network is an algorithm that implements deep learning techniques, i.e., machine learning that utilizes a set of algorithms to attempt to model high-level abstractions in data.
At numeral 6, the generative AI module 106 communicates each generated visual element to the image composer 110. Thus, the image composer 110 obtains all of the foreground and background objects (e.g., objects, visual elements, subjects, etc.) associated with the target scene. The image composer 110 may store each received generated visual element in a buffer or other memory until all visual elements associated with the target scene (or a portion of the target scene) are generated.
As shown at numerals 4-6, the image composer 110 passes the information extracted from the input 120 (using one or more components of the input extractor 102, as described in FIG. 2) to a module (illustrated here as the generative AI module 106). The module then passes the visual elements back to the image composer 110. In response to receiving all of the visual elements corresponding to the target scene (or a portion of the visual elements corresponding to a portion of the target scene), the image composer 110 passes the visual elements to the image compiler 118 for use in target scene synthesis.
Although only one module (e.g., the generative AI module 106) is shown, it should be appreciated that different modules may be invoked by the image composer 110 to perform one or more operations based on the extracted input information. For example, a stylizer module (not shown) may be invoked by image composer 110 to generate visual elements and/or revise visual elements in response to one or more extracted style descriptions determined from input 120. The image composer 110 invokes the stylizer module to generate visual elements and/or revise visual elements using the extracted style description(s) received from the input extractor 102.
At numeral 7, the image composer 110 passes all of the received visual elements (determined via the generative AI module 106) and any control language (determined via the input extractor 102) to the image compiler 118. In some embodiments, the image composer 110 reformats the visual elements and/or control language before passing the information to the image compiler 118.
At numeral 8, the image compiler 118 arranges the one or more received visual elements in a representation (e.g., a target scene or image) that the user can edit and further refine. As described above, the refinements may include natural language descriptions of adjustments to one or more portions of the target scene, user interactions with one or more portions of the target scene, and the like. Refinements may include revisions to visual elements or to the source image. Further, the refinements may include resizing revisions, color revisions, position revisions, and the like.
The image compiler 118 is configured to perform one or more layering operations, image operations, etc., to compile an image (e.g., a target scene). For example, the image compiler 118 may use several generated visual elements (e.g., a smirk, rosy cheeks, etc.) corresponding to a pirate, together with the source image, to arrange the pirate in the target scene. In addition, additional visual elements including a parrot and a hat are arranged in the target scene.
The image compiler 118 also utilizes the identified control language. Using the control language (and/or any other grammatical/structural relationship identified by the input extractor 102), the image compiler 118 performs one or more operations. Operations performed by the image compiler 118 include "select subject" (where one or more functions are used to identify the subject of a source image), layering (where visual elements are applied to each other in an ordered fashion), background removal, duotone (where contrasting colors are applied to the target scene), exposure (where the brightness of the target scene is adjusted), cropping (where one or more portions of the visual elements and/or source image are removed), and the like. FIG. 4 illustrates a frame-generated image and describes the operation of the image compiler 118 in more detail. In some embodiments, a control element may be converted into a particular neural layer (corresponding to a generated image determined using the generative AI module 106). In some embodiments, the image compiler 118 combines the neural image and the neural layers. In other embodiments, the generative AI module 106 combines the neural image and the neural layers.
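The operations named above can be illustrated with a small operation table. The sketch below uses Pillow and is a simplified assumption of how a compiler might expose cropping, layering, and exposure; the function names, signatures, and the lookup table are illustrative, and subject selection and duotone are omitted.

```python
# Minimal sketch of an image-compiler operation table using Pillow. The
# operation names and signatures are illustrative assumptions; the patent
# describes the operations only at a conceptual level.
from PIL import Image, ImageEnhance

def crop(image: Image.Image, width: int, height: int) -> Image.Image:
    # Keep the top-left width x height region of the visual element.
    return image.crop((0, 0, width, height))

def layer(base: Image.Image, overlay: Image.Image, position=(0, 0)) -> Image.Image:
    # Apply one visual element on top of another in an ordered fashion.
    composed = base.copy()
    composed.paste(overlay, position, overlay.convert("RGBA"))
    return composed

def exposure(image: Image.Image, factor: float) -> Image.Image:
    # Adjust the brightness of the target scene (factor > 1 brightens).
    return ImageEnhance.Brightness(image).enhance(factor)

# The compiler could look operations up by the control language it received.
OPERATIONS = {"crop": crop, "layer": layer, "exposure": exposure}
```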
At numeral 9, the image compiler 118 provides the target scene as output 122. The target scene is constructed (or synthesized/compiled) from the inputs. The target scene is a customizable image based on the received prompt input (and, in some embodiments, the received source image). Output 122 is displayed on one or more user devices and/or transmitted to one or more downstream devices (e.g., servers, applications, systems, processors, or some combination thereof). Downstream devices may perform subsequent processing on output 122.
FIG. 2 illustrates an example of an input extractor in accordance with one or more embodiments. As described herein, the input extractor 102 uses one or more modules to extract information from the prompt (e.g., the textual description that is part of the input 120) and/or the source image. As shown, the modules include a control language identifier 202 and a structure analyzer 204. However, additional modules may be executed by the input extractor 102 to extract different types of information from the input 120 (including the prompt and/or any images).
For example, a style analyzer 206 may be executed by the input extractor 102 to identify style information present in the prompt. The style information may include nouns (or other parts of speech) that describe a style or topic. In some embodiments, style analyzer 206 uses any semantic similarity analysis and/or a dictionary mapping terms to styles/topics to determine terms that are semantically related to the identified style/topic. The semantically related terms may be fed to the generative AI module 106 as sub-prompts and/or features of sub-prompts. In this manner, the generative AI module 106 is encouraged to generate different objects in the scene.
Additionally or alternatively, a fidelity analyzer 208 may be executed by the input extractor 102 to extract a degree of stylization associated with the target scene. For example, a prompt including adjectives and/or adverbs may describe a degree of artistic expression (or degree of fidelity). In a non-limiting example, a prompt describing "an angry butterfly" may result in a more artistic expression of the butterfly output from the generative AI module, as opposed to a prompt describing simply "a butterfly". The degree of stylization extracted by the fidelity analyzer 208 may determine the number of semantically related terms associated with the input 120. For example, to generate the "angry butterfly" described above, the image composer determines a higher number of semantically related features/characteristics corresponding to the "angry" sub-prompt. In contrast, to generate a "butterfly", the image composer feeds the prompt "butterfly" to the generative AI module 106 (rather than additional terms semantically related to "angry"). In some embodiments, the fidelity analyzer 208 maps adjectives/adverbs to a number of semantically related terms to be generated.
In some embodiments, an image comparator (not shown) may be executed by the input extractor 102 to identify adjustments to one or more visual elements and/or adjustments to the source image. For example, the image comparator may compare a generated visual element (generated as part of output 122) with the adjusted visual element (received as part of input 120). By comparing the generated visual element with the adjusted visual element, the image comparator may determine changes to the generated visual element, including color changes, size changes, position changes, and the like.
In a particular example, the input extractor 102 identifies and extracts control language from the prompt. The control language identifier 202 is used to identify the control language. Control language may refer to language that describes control elements, such as visual elements of boundaries (e.g., frames), shapes (e.g., circles), illustrations, one or more effects (e.g., double exposure), and the like. Control language may also refer to language describing the composition of the scene. The control language identifier 202 identifies the control language using any suitable mechanism. For example, the control language identifier 202 identifies the control language in the prompt using string matching, by comparing strings in the prompt to a set of control language terms. In other embodiments, the control language identifier 202 uses semantic similarity techniques (e.g., determining terms in the prompt that are semantically similar to control language) to identify the control language in the prompt. For example, character strings in the prompt may be semantically related to the set of control elements.
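A minimal sketch of the string-matching variant is shown below; the control vocabulary, the handling of trailing integer parameters, and the function name are assumptions for illustration, and non-numeric parameters (such as a style description) are left in the remaining prompt.

```python
# Minimal sketch of control-language identification by string matching against
# a known vocabulary. The vocabulary and the regex for trailing numeric
# parameters are illustrative assumptions.
import re

CONTROL_VOCABULARY = ("frame", "circle", "double exposure", "blur")

def identify_control_language(prompt: str):
    """Return (control fragments, remaining prompt text)."""
    controls = []
    remaining = prompt
    for term in CONTROL_VOCABULARY:
        # Match the control term plus any trailing integer parameters, e.g. "frame 60 20".
        match = re.search(rf"\b{re.escape(term)}\b((?:\s+\d+)*)", remaining, re.IGNORECASE)
        if match:
            params = [int(p) for p in match.group(1).split()]
            controls.append({"element": term, "parameters": params})
            remaining = " ".join(
                (remaining[:match.start()] + remaining[match.end():]).split())
    return controls, remaining

controls, rest = identify_control_language("a pirate with a hat frame 60 20 color")
# controls -> [{'element': 'frame', 'parameters': [60, 20]}]
# rest     -> "a pirate with a hat color"
```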
In some embodiments, the control elements identified by the control language identifier 202 are defined by a particular set of parameters. In some embodiments, the control element determines the number and type of parameters. For example, a frame element may require parameters such as width (integer), height (integer), and description (string). In a particular example, the prompt may include "frame 60 blue and red smoke". Other control elements may have different parameters. For example, a blur control element may use radial parameters to blur a defined region. In this example, the parameters of the blur control element may include a radial x-value and a radial y-value that indicate blur radii in the x-direction and the y-direction, respectively. The region to be blurred may be described using a character string or image coordinates.
In operation, a prompt including control language (such as a frame) describes the dimensions of the frame, the configuration of the frame, and a textual description of the style or any style elements of the frame. In some embodiments, if parameters of a particular control element are not defined in the prompt (e.g., the user does not specify the width of the frame), the control language identifier 202 may determine one or more default parameters. FIG. 4, described herein, illustrates an example frame-generated image.
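A sketch of how such parameterized control elements could be represented, including defaults for omitted values, is shown below; the field names, defaults, and the mapping of the numeric prompt parameters to width and height are assumptions for illustration only.

```python
# Minimal sketch of control elements with typed parameters and defaults,
# based on the parameters named in the text (width, height, description for a
# frame; radial x/y and a region for a blur). Defaults are illustrative.
from dataclasses import dataclass

@dataclass
class FrameElement:
    width: int = 20           # default used when the prompt omits a width
    height: int = 20          # default used when the prompt omits a height
    description: str = ""     # free-text style, e.g. "red and blue smoke"

@dataclass
class BlurElement:
    radial_x: int = 5         # blur radius in the x direction
    radial_y: int = 5         # blur radius in the y direction
    region: str = ""          # region to blur, as a string or image coordinates

# A prompt such as "frame 60 10 red and blue smoke" could populate (the
# assignment of 60/10 to height/width here is an assumption):
frame = FrameElement(width=10, height=60, description="red and blue smoke")
```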
In some embodiments, the control language identifier 202 is configured to derive groupings from the arrangement of the control language and the syntactic structure of the prompt. In some embodiments, the groupings may be converted into one or more image operations.
The control language identifier 202 may derive the groupings using any suitable grouping technique (e.g., any natural language processing technique, any clustering technique, etc.). For example, the control language identifier 202 executes the Natural Language Toolkit (NLTK) to identify the grammatical relations of the prompt. The identified grammatical relations group related information in the prompt. Control language identifier 202 may indicate such groupings to image composer 110. Thus, the image composer 110 passes the groupings to the image compiler 118 so that a target scene is generated that maintains the structure of the prompt (e.g., input 120). The groupings maintain the relationships of the objects in a group such that the image compiler 118 arranges visual elements (corresponding to the objects of the group) according to the group. For example, a prompt describing "a pirate with a hat with a parrot over the shoulder, frame rectangle 60 20 color" may cause the control language identifier 202 to derive a frame grouping. The frame grouping groups the parameters of the frame (e.g., "60", "20", and "color") with the frame. Additionally or alternatively, the control language identifier 202 groups the frame with the subject of the image (e.g., the pirate). The structure analyzer 204 may also derive groups, as described herein. For example, the pirate is grouped with the hat and the parrot. Thus, one subject (e.g., the pirate) may be in two groups (e.g., the frame group and the pirate object group). Alternatively, a group may include both control elements and subject/object relationships (determined by the structure analyzer 204). As described herein, the image compiler 118 may perform image operations on such groupings.
In some embodiments, the input extractor 102 executes the structure analyzer 204 to identify sub-prompts (including subjects, related objects, and attributes (e.g., adjectives)). In some embodiments, the structure analyzer 204 executes on the remaining prompt (e.g., the input 120 with the control language parsed out). It should be appreciated that while the present disclosure describes the structure analyzer 204 executing on the remaining prompt (e.g., after executing the control language identifier 202), the structure analyzer 204 may execute prior to the control language identifier 202 (and/or in parallel with the control language identifier 202).
In operation, another NLP algorithm is utilized by the structure analyzer 204 to identify the different parts of speech, and their relationships, in the remaining prompt. In a particular example, the structure analyzer 204 may employ a perceptron tagger that tags parts of speech using an averaged perceptron algorithm.
The remaining prompt is separated into sub-prompts for image generation. In a particular example, the remaining prompt describes "a pirate with a hat with a parrot". The sub-prompts identified by the structure analyzer 204 include "pirate", "hat", and "parrot", and such sub-prompts are generated as visual elements using the generative AI module 106. Each sub-prompt identified by the structure analyzer 204 can include a noun for a subject/object of the input and any associated attributes (e.g., adjectives).
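The sketch below shows this step with NLTK's averaged perceptron tagger, which the text names; pairing each noun with the adjectives that immediately precede it is a simplifying assumption rather than the patent's exact grouping logic.

```python
# Minimal sketch of sub-prompt extraction using NLTK part-of-speech tagging.
import nltk

# Tokenizer and tagger resources (names vary slightly across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_sub_prompts(remaining_prompt: str) -> list:
    tokens = nltk.word_tokenize(remaining_prompt)
    tagged = nltk.pos_tag(tokens)  # e.g. [('pirate', 'NN'), ('with', 'IN'), ...]
    sub_prompts, adjectives = [], []
    for word, tag in tagged:
        if tag.startswith("JJ"):        # adjective: hold until its noun appears
            adjectives.append(word)
        elif tag.startswith("NN"):      # noun: one sub-prompt per subject/object
            sub_prompts.append(" ".join(adjectives + [word]))
            adjectives = []
    return sub_prompts

print(extract_sub_prompts("a pirate with a hat with a parrot"))
# -> ['pirate', 'hat', 'parrot']
```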
By using the part-of-speech tags to identify sub-prompts comprising subjects, related objects, and related attributes, the structure analyzer 204 derives groupings from the syntactic structure of the remaining prompt. Additionally or alternatively, the structure analyzer 204 uses any grouping technique (such as any one or more natural language processing techniques, clustering techniques, etc.) to determine groupings from the remaining prompt. For example, the structure analyzer executes NLTK to identify grammatical relations between the verbs, nouns, and adjectives of the remaining prompt. The identified grammatical relations become groupings of related subjects, objects, and attributes of the remaining prompt. The structure analyzer 204 may indicate such groupings to the image composer 110. Thus, the image composer 110 passes the groupings to the image compiler 118 so that a target scene is generated that maintains the structure of the prompt (e.g., input 120). As described herein, the image compiler 118 may perform image operations on such groupings. In some embodiments, the groupings are converted into neural images and/or neural layers (e.g., objects in the target scene).
The relationships of the visual elements in a scene (and/or any determined groupings) affect the composition/structural arrangement of the scene. For example, a prompt describing "a pirate wearing a hat with a parrot on the shoulder" should produce a target scene/image with the pirate as the subject of the scene and the hat and parrot associated with the pirate. In this way, a group is created in which the hat and the parrot are associated with the pirate. If the grouping is not considered, the target scene generation system 100 (and in particular, the image compiler 118) may synthesize a target scene with poor composition. For example, the system may generate an image with the pirate, hat, and parrot merely clustered together.
In some embodiments, a neural composition procedure is determined by the structure analyzer 204 from the relationships parsed from the remaining prompt. Neural composition is a model-agnostic, dynamic, context-sensitive, and personalized approach using generative models. Neural composition is used to generate a neural image for the subject of the composition, and a neural layer for each related object. The relationships between the nouns (e.g., the subject and related objects) in the remaining prompt specify the neural image (subject) and the neural layers (objects) applied to the neural image. In an example, the structure analyzer 204 identifies "pirate" (a neural image, because the pirate is the subject), "hat" (a neural layer, because the hat is an object related to the pirate), and "parrot" (another neural layer, because the parrot is an object related to the pirate).
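A plain data structure is enough to capture this subject/object split; the class below is a minimal sketch with illustrative names, not the patent's representation of neural images and neural layers.

```python
# Minimal sketch of the neural composition structure: one neural image for the
# subject and one neural layer per related object.
from dataclasses import dataclass, field

@dataclass
class NeuralComposition:
    subject: str                                   # e.g. "pirate" (neural image)
    layers: list = field(default_factory=list)     # e.g. ["hat", "parrot"]

    def add_layer(self, related_object: str) -> None:
        # Each object related to the subject becomes a neural layer applied
        # on top of the subject's neural image.
        self.layers.append(related_object)

composition = NeuralComposition(subject="pirate")
composition.add_layer("hat")
composition.add_layer("parrot")
```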
FIG. 3 illustrates a natural language prompt processed by the input extractor 102 in accordance with one or more embodiments. As illustrated, the prompt describes "a pirate with a hat with a parrot, frame rectangle 60 color". As described herein, the control language identifier 202 of the input extractor 102 operates on the prompt to identify the control language (and in particular the control element), which includes "frame rectangle 60 color". The structure analyzer 204 operates on the remaining prompt, "a pirate with a hat with a parrot".
As shown, 302 illustrates an example of the part-of-speech tags extracted using a tagger implemented by the structure analyzer 204, and their relationships. As shown, the structure analyzer 204 determines groupings based on the word relationships in the remaining prompt. Each grouping collects the attributes of a noun (or subject/object). As described herein, a grouping may also include (or otherwise be associated with) control language (such as a frame control element).
As shown at 304, the structure analyzer 204 of the input extractor 102 parses the remaining prompt to separate out the sub-prompts used to generate visual elements. The sub-prompts identified by the structure analyzer 204 include "pirate", "hat", and "parrot", and such sub-prompts are generated as images or visual elements using the generative AI module 106. As illustrated in 302, a group is formed using "pirate", "wear", and "hat". As described herein, the image compiler 118 receives such groupings to group the "pirate" and "hat" visual elements together in a manner such that the generated pirate visual element has a relationship with the generated hat visual element (e.g., the hat is being worn). In some embodiments, beginning with the primary noun or subject, the structure analyzer 204 identifies (and in some embodiments, generates) a neural image for the subject, and adds a neural layer for each related object.
FIG. 4 illustrates an example frame-generated image in accordance with one or more embodiments. The input 120 received by the target scene generation system 100 may include a source image (identified at 404) and a free-text description, "frame 60 red and blue smoke" (identified at 406). In response to such inputs (e.g., the source image and the prompt), the target scene generation system 100 performs the processes described herein to decompose the prompt into control language and sub-prompts, and as a result, image 402 is created as the target scene.
In particular, the input extractor 102 (and in particular the control language identifier 202) parses out "frame 60 red and blue smoke". The control element and the corresponding parameter set are passed to the image composer 110. The image composer 110 passes the control element to the generative AI module 106 so that the generative AI module 106 can generate a smoke image based on the prompt "red and blue smoke". The generative AI module 106 then passes the smoke image to the image composer 110, and the image composer 110 provides the smoke image and the source image to the image compiler 118. The image compiler 118 uses the smoke image and the source image to perform a composition operation.
In some embodiments, the composition operation associated with the "frame" control element includes a "select subject" operation on the source image and a layering of a "60 10" cutout of the generated red and blue smoke image over the subject. That is, the image compiler 118 applies a first visual element (e.g., the generated red and blue smoke image) to a second visual element (e.g., the source image) in a layering operation. In particular, the image compiler 118 identifies the subject of the source image. Subsequently, the image compiler 118 cuts a "60 10" layer out of the generated red and blue smoke image (e.g., the smoke image). In operation, the image compiler 118 performs an image operation on the generated smoke image to create a frame by cropping the generated image to 60 pixels high and 10 pixels wide. Subsequently, the image compiler 118 layers the frame and the source image by applying the frame to the source image. Thus, in image 402, the frame highlights the subject of the source image. The frame is red and blue smoke.
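The crop-and-layer sequence can be sketched with Pillow as below; a solid-color placeholder stands in for the generated smoke image, and the paste position and image sizes are assumptions for illustration.

```python
# Minimal sketch of the crop-and-layer sequence described for FIG. 4.
from PIL import Image

source = Image.new("RGB", (256, 256), "gray")         # stands in for the source image
smoke = Image.new("RGB", (256, 256), (180, 40, 90))   # stands in for generated smoke

# Crop the generated image to the stated 60 x 10 pixel cutout
# (box is (left, upper, right, lower): 10 wide, 60 high).
cutout = smoke.crop((0, 0, 10, 60))

# Layer the cutout onto the source image near the subject area.
framed = source.copy()
framed.paste(cutout, (10, 10))
framed.save("frame_example.png")
```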
FIGS. 5-6 illustrate examples of described scenes and corresponding synthesized target scenes in accordance with one or more embodiments. As illustrated in FIG. 5, the inputs provided to the target scene generation system 100 include a source image 502 and a scene description at 504. As illustrated in FIG. 6, in response to the input (e.g., the prompt or scene description at 504 and the source image 502), the target scene generation system 100 generates a target scene 606. As illustrated, the natural language description (e.g., prompt 504) is decomposed into sub-prompts and control elements. The sub-prompts and control elements are human-understandable prompts used to generate visual elements. Each sub-prompt/control element represents a visual element to be displayed in the target scene 606. The displayed visual elements may be generated using a generative AI model or retrieved from a catalog/marketplace of visual elements.
As shown, the source image (e.g., image 502) is modified. In particular, the generative AI module generates visual elements corresponding to the prompt. Subsequently, one or more image operations are used to superimpose or otherwise incorporate the visual elements into the source image. For example, the prompt is used to modify the subject (e.g., the human) of the source image. That is, the subject of the source image is represented as a pirate in image 606 using pirate features (e.g., a smirk, rosy cheeks, etc.). In particular, such features are superimposed or otherwise incorporated into the source image 502. In addition, a colored frame highlights the subject in the image 606 (e.g., the human depicted as a pirate). As described herein, the frame is generated and then merged with the image 502 (e.g., using a cropping operation and a layering operation). Image 606 also includes a hat on the pirate's head and a parrot on the pirate's shoulder. Thus, the target scene (e.g., image 606) faithfully retains the structure of the input (e.g., source image 502 and prompt 504).
As illustrated in FIG. 6, once the prompt has been decomposed and the target scene created, the user may further edit the target scene, or elements of the target scene, by editing the resulting sub-prompts. Additionally or alternatively, the user may revise the prompt itself. Further, the user may interact with the target scene to edit the target scene (e.g., click on visual elements to move them, rearrange visual elements, delete visual elements, resize visual elements, change properties of visual elements, etc.). As a result of one or more edits, a new target scene is created. That is, the edits, source image, prompt, etc., are fed back as input 120 to the target scene generation system 100.
FIG. 7 illustrates an example implementation of a diffusion model in accordance with one or more embodiments. As described herein, generative AI may be implemented using any suitable mechanism. In some embodiments, such generative AI is implemented using a diffusion model.
The diffusion model is one example architecture for implementing generative AI. Generative AI involves predicting the features associated with a given label. For example, given a label (or a natural language prompt describing a "cat"), the generative AI module determines the features most likely to be associated with a "cat". The features associated with the label are determined during training using a reverse diffusion process in which a noisy image is iteratively denoised to obtain an image. In operation, a function is determined that predicts the noise of the latent-space features associated with the label.
During training, images (e.g., images of cats) and corresponding labels (e.g., "cat") are used to teach the diffusion model the features of a prompt (e.g., the label "cat"). As shown in FIG. 7, the image input 702 and the text input 712 are converted into a latent space 720 using an image encoder 704 and a text encoder 714, respectively. Thus, latent image features 706 and text features 708 are determined from the image input 702 and text input 712, respectively. The latent space 720 is a space in which unobserved features are represented such that relationships and other dependencies among such features can be learned. In some embodiments, the image encoder 704 and/or the text encoder 714 are pre-trained. In other embodiments, the image encoder 704 and/or the text encoder are trained jointly.
Once the image features 706 have been determined by the image encoder 704, a forward diffusion process 716 is performed according to a fixed Markov chain to inject Gaussian noise into the image features 706. The forward diffusion process 716 is described in more detail with reference to FIG. 8. As a result of the forward diffusion process 716, a set of noisy image features 710 is obtained.
The text features 708 and the noisy image features 710 are algorithmically combined in one or more steps (e.g., iterations) of the reverse diffusion process 726. The reverse diffusion process 726 is described in more detail with reference to FIG. 8. As a result of performing the reverse diffusion, image features 718 are determined, where such image features 718 should be similar to image features 706. The image features 718 are decoded using an image decoder 722 to predict an image output 724. The similarity between the image features 706 and 718 may be determined in any manner. In some embodiments, the similarity between the image input 702 and the predicted image output 724 is determined in any manner. The similarity between the image features 706 and 718 and/or the images 702 and 724 is used to adjust one or more parameters of the reverse diffusion process 726.
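The forward-diffusion step can be sketched in NumPy using the standard closed form for a fixed variance schedule, which matches the fixed Markov chain described above; the schedule values, feature shape, and function name are assumptions for illustration.

```python
# Minimal sketch of forward diffusion (noising) of latent features.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # fixed Markov-chain noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative product of (1 - beta_t)
rng = np.random.default_rng(0)

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) for latent features x0 at time step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps  # eps is the target a denoising network learns to predict

x0 = np.zeros((4, 64, 64))               # stand-in for latent image features 706
xt, eps = add_noise(x0, t=500)
```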
FIG. 8 illustrates a diffusion process for training a diffusion model in accordance with one or more embodiments. The diffusion model may be implemented using any artificial intelligence/machine learning architecture in which the input and output dimensions are the same. For example, the diffusion model may be implemented according to a U-Net neural network architecture.
As described herein, the forward diffusion process adds noise over a series of steps (iterations t) according to a fixed Markov diffusion chain. Subsequently, the reverse diffusion process removes the noise, and the model learns this reverse process so that a desired image (based on the text input) can be constructed from noise. During deployment of the diffusion model, the reverse diffusion process is used in the generative AI module to generate an image from the input text. In some embodiments, the input image is not provided to the diffusion model.
The forward diffusion process 716 begins with an input (e.g., features x_0, indicated by 802). Over at most T iterations, at each time step t noise is added to the features x, such that features x_T, indicated by 810, are determined. As described herein, the features into which noise is injected are latent-space features. The denoising performed during the reverse diffusion process 726 may be more accurate if the noise injected at each step is small. The noise added to the features x may be described as a Markov chain, where the distribution of the noise injected at each time step depends on the previous time step. That is, in standard diffusion-model notation, the forward diffusion process 716 may be mathematically represented as
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),
where \beta_t is the (small) noise variance added at step t.
The reverse diffusion process 726 begins with a noise input (e.g., the noisy features x_T indicated by 810). At each time step t, noise is removed from the features. The denoising may likewise be described as a Markov chain, where the transition at each time step is a Gaussian distribution over the features of the previous iteration, conditioned on the current features. That is, the reverse diffusion process 726 may be mathematically represented as the joint probability of the sequence of samples in the Markov chain, where the marginal probability of the final noise sample is multiplied by the conditional probability of the denoising transition at each iteration of the Markov chain. In other words, the reverse diffusion process 726 is
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \quad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right).
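One reverse-diffusion step under this formulation can be sketched as follows, continuing the NumPy sketch above; the noise predictor here is a placeholder standing in for a trained, text-conditioned denoising network, so this is illustrative rather than the patent's implementation.

```python
# Minimal sketch of one reverse-diffusion (denoising) step.
import numpy as np

def reverse_step(xt, t, predict_noise, betas, alphas_bar, rng):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) given a noise predictor."""
    alpha_t = 1.0 - betas[t]
    eps_hat = predict_noise(xt, t)
    mean = (xt - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean
    # Add scaled Gaussian noise for all but the final step.
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

# Placeholder predictor; a trained, text-conditioned U-Net would go here.
predict_noise = lambda xt, t: np.zeros_like(xt)
```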
FIG. 9 illustrates a schematic diagram of a target scene generation system (e.g., the "target scene generation system" described above) in accordance with one or more embodiments. As shown, the target scene generation system 900 may include, but is not limited to, a user interface manager 904, an input extractor 902, an image composer 910, a generative AI module 906, an image compiler 908, a neural network manager 912, a training manager 914, and a storage manager 922. As described herein, the input extractor 902 includes various modules, such as a control language identifier and a structure analyzer; however, these modules are not shown in FIG. 9. The storage manager 922 includes training data 918 and control language data 916.
As illustrated in fig. 9, the target scene generation system 900 includes a user interface manager 904. The user interface manager 904 allows a user to provide input (e.g., input 120 in fig. 1) to the target scene generation system 900. The user interface manager 904 also enables a user to view the resulting target scene output image and/or request further editing of the scene using a user interface.
The input received by the user interface manager 904 includes a natural language description of the target scene. In particular, the input prompt may include 1) image operations, 2) sentences having a syntactic structure reflecting the composition, and/or 3) objects and subjects.
The input may also include a source image or other baseline/base image. The source image may be a computer generated image, a video frame, a picture captured by a camera (or other sensor), or the like. In some embodiments, user interface manager 904 may enable a user to download images from a local storage location or a remote storage location. For example, the user may provide an address such as a URL or other endpoint associated with the source image. In some embodiments, the user interface manager 904 may enable a user to link an image capture device, such as a camera or other hardware, to capture image data and provide it to the target scene generation system 900.
In some embodiments, the source image is a previously generated image (e.g., a target scene). Additionally or alternatively, the input may include a revision to the previous description of the scene. These inputs relate to user revisions/modifications to the scene based on the displayed scene (e.g., output).
As illustrated in FIG. 9, the target scene generation system 900 includes an input extractor 902. The input extractor 902 executes various modules to extract information from the input. In some embodiments, the input extractor 902 may include modules such as a style analyzer, a fidelity analyzer, an image comparator, and the like, as described herein. In some embodiments, input extractor 902 is configured to identify and parse control language and sub-prompts. The control language and sub-prompts are used to generate visual elements in the target scene (e.g., objects in the target scene). The control language may include visual elements (referred to herein as control elements), such as boundaries (e.g., frames), shapes (e.g., circles), illustrations, effects (e.g., double exposure), and the like. Each control element (identified by the control language in the prompt) may be defined by a particular set of parameters. The control language may also include target scene composition relationships. The sub-prompts include the noun(s) corresponding to a subject/object of the input, as well as any associated attributes (e.g., adjectives).
In some embodiments, any of the modules of input extractor 902 (e.g., the control language identifier and/or the structure analyzer) is configured to derive groupings from the arrangement of the extracted information (e.g., control language and/or sub-prompts) and the syntactic structure of the prompt. The groupings may be converted into one or more image operations. The modules of the input extractor 902 may derive the groupings using any suitable grouping technique (e.g., any natural language processing technique, any clustering technique, etc.). Using these groupings, a target scene is generated that maintains the structure of the prompt.
As illustrated in FIG. 9, the target scene generation system 900 includes an image composer 910. The image composer 910 maps the extracted information from the input extractor 902 to one or more modules of the target scene generation system 900 (such as the generative AI module 906 hosted by the neural network manager 912, described below). The generative AI module 906 may be any generative AI module configured to generate an image using natural language prompts. In particular, the generative AI module 906 may receive a batch of sub-prompts, each of which includes information for generating a visual element corresponding to a subject/object of the target scene.
In response to providing sub-prompt information, control element information, and/or other extracted information to the modules of the target scene generation system 900, the image composer 910 receives visual elements. Subsequently, the image composer 910 buffers the generated visual elements and provides the generated set of visual elements to the image compiler 908.
The image compiler 908 arranges one or more received generated visual elements in a representation (e.g., a target scene) that the user can edit and further refine. The image compiler 908 is configured to perform one or more layering operations, image operations, etc. to synthesize a target scene. The image compiler 908 utilizes any identified control language by performing a particular composition operation associated with the control language.
As illustrated in fig. 9, the target scene generation system 900 also includes a neural network manager 912. The neural network manager 912 can host a plurality of neural networks or other machine learning models, such as the generative AI module 906. The neural network manager 912 may include an execution environment, libraries, and/or any other data needed to execute the machine learning models. In some embodiments, the neural network manager 912 may be associated with dedicated software and/or hardware resources to execute a machine learning model. As discussed, the generative AI module 906 can be implemented as any type of generative AI model. In various embodiments, each neural network hosted by the neural network manager 912 may be the same type of neural network or a different type of neural network, depending on the implementation. Although depicted in fig. 9 as being hosted by a single neural network manager 912, in various embodiments the neural networks may be hosted by multiple neural network managers and/or as part of different components. For example, the generative AI module 906 may be hosted by its own neural network manager or by another hosting environment in which the respective neural network executes, or the generative AI module 906 may be deployed across multiple neural network managers depending on, for example, the resource requirements of the generative AI module 906.
As illustrated in fig. 9, the target scene generation system 900 also includes a training manager 914. The training manager 914 may teach, direct, adjust, and/or train one or more neural networks. In particular, the training manager 914 may train a neural network based on a plurality of training data. For example, the generative AI module may be trained to perform a reverse diffusion process. More particularly, the training manager 914 may access, identify, generate, create, and/or determine training inputs, and utilize the training inputs to train and fine-tune a neural network.
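As an illustration of training a generative model for a reverse diffusion process, the following PyTorch sketch shows a single noise-prediction training step. It is a minimal sketch under standard diffusion-model assumptions and is not the training procedure of the training manager 914 itself.

```python
# One noise-prediction training step for a diffusion model (PyTorch sketch).
# `model` is any network that predicts noise from (noisy image, timestep);
# `alphas_cumprod` is the cumulative noise schedule. Illustrative only.
import torch
import torch.nn.functional as F


def diffusion_training_step(model, images, alphas_cumprod, optimizer, num_timesteps=1000):
    batch_size = images.shape[0]
    t = torch.randint(0, num_timesteps, (batch_size,), device=images.device)
    noise = torch.randn_like(images)
    a_bar = alphas_cumprod[t].view(batch_size, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1.0 - a_bar).sqrt() * noise   # forward diffusion
    predicted_noise = model(noisy, t)                              # reverse-process network
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```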
As illustrated in fig. 9, the target scene generation system 900 also includes a storage manager 922. The storage manager 922 maintains data for the target scene generation system 900. The storage manager 922 may maintain data of any type, size, or kind needed to perform the functions of the target scene generation system 900. As shown in fig. 9, the storage manager 922 includes training data 918. The training data 918 includes manually labeled data for supervised learning; training with supervised learning forms part of the training performed during semi-supervised learning. The storage manager 922 also stores control language data 916. The control language data 916 includes a set of control elements and corresponding parameters. The control language data 916 may be accessed by the control language identifier when identifying control language in the prompt. Additionally or alternatively, the control language data 916 may be accessed by the control language identifier to obtain the format of an identified control element. For example, a border control element may include parameters such as a width (integer), a height (integer), and a description (string). In some embodiments, the control language data 916 also includes default parameter values.
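The following Python sketch illustrates one possible organization of the control language data 916, mapping each control element to a parameter schema and default values. The specific elements, parameters, and defaults shown are assumptions made for this example.

```python
# Illustrative organization of control language data 916: each control element
# maps to a parameter schema and default parameter values.
CONTROL_LANGUAGE_DATA = {
    "border": {
        "parameters": {"width": int, "height": int, "description": str},
        "defaults": {"width": 512, "height": 512, "description": ""},
    },
    "circle": {
        "parameters": {"radial_x": int, "radial_y": int},
        "defaults": {"radial_x": 256, "radial_y": 256},
    },
    "double exposure": {
        "parameters": {"description": str},
        "defaults": {"description": ""},
    },
}


def resolve_parameters(element_name: str, provided: dict) -> dict:
    """Fill in default values for any parameters not supplied in the prompt."""
    return {**CONTROL_LANGUAGE_DATA[element_name]["defaults"], **provided}
```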
Each of the components 902-914 of the target scene generation system 900 and their corresponding elements (as shown in fig. 9) may communicate with each other using any suitable communication technique. It will be appreciated that although the components 902-914 and their corresponding elements are shown as separate in fig. 9, any of the components 902-914 and their corresponding elements may be combined into fewer components (such as into a single facility or module), divided into more components, or configured into different components, as may serve a particular embodiment.
The components 902-914 and their corresponding elements may comprise software, hardware, or both. For example, the components 902-914 and their corresponding elements may comprise one or more instructions stored on a computer-readable storage medium and executable by a processor of one or more computing devices. The computer-executable instructions of the target scene generation system 900, when executed by one or more processors, may cause a client device and/or a server device to perform the methods described herein. Alternatively, the components 902-914 and their corresponding elements may comprise hardware, such as dedicated processing devices for performing a particular function or group of functions. Additionally, the components 902-914 and their corresponding elements may comprise a combination of computer-executable instructions and hardware.
Further, the components 902-914 of the target scene generation system 900 may be implemented, for example, as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be invoked by other applications, and/or as a cloud computing model. Accordingly, the components 902-914 of the target scene generation system 900 may be implemented as a stand-alone application, such as a desktop or mobile application. Further, the components 902-914 of the target scene generation system 900 may be implemented as one or more web-based applications hosted on a remote server. Alternatively or additionally, the components 902-914 of the target scene generation system 900 may be implemented in a suite of mobile device applications or "apps."
As shown, the target scene generation system 900 may be implemented as a single system. In other embodiments, the target scene generation system 900 may be implemented across multiple systems. For example, one or more functions of the target scene generation system 900 may be performed by one or more servers and one or more functions of the target scene generation system 900 may be performed by one or more client devices.
For example, in one or more embodiments, when a client device accesses a web page or other web application hosted at one or more servers, the one or more servers may provide access to a user interface displayed at the client device that prompts a user for input of a scene to be generated (e.g., description of the scene, baseline image for the scene, etc.). The client device may provide input to one or more servers. Upon receiving input of a scene to be generated, one or more servers may automatically perform the above-described methods and processes to extract the structure of the input and generate a composite target scene. One or more servers may provide access to a user interface with a target scene displayed at a client device.
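As an illustration of such a client/server deployment, the following sketch assumes a FastAPI server that accepts the scene description (and an optional baseline image) and returns the synthesized target scene. The endpoint name, request fields, and placeholder pipeline function are hypothetical.

```python
# Hypothetical FastAPI server for the client/server flow described above.
# Run with: uvicorn scene_service:app
import base64
from io import BytesIO

from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel

app = FastAPI()


class SceneRequest(BaseModel):
    description: str                      # natural language description of the scene
    source_image_b64: str | None = None   # optional baseline image, base64-encoded


def synthesize_target_scene(description: str, source_image_b64: str | None) -> Image.Image:
    """Placeholder for the extraction, generation, and compilation pipeline."""
    return Image.new("RGB", (512, 512), "white")


@app.post("/generate-scene")
def generate_scene(request: SceneRequest):
    target_scene = synthesize_target_scene(request.description, request.source_image_b64)
    buffer = BytesIO()
    target_scene.save(buffer, format="PNG")
    return {"image_b64": base64.b64encode(buffer.getvalue()).decode("ascii")}
```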
Figures 1-9, corresponding text, and examples provide a variety of different systems and devices that allow a user to synthesize a target scene using natural language descriptions. In addition to the foregoing, embodiments may be described in terms of flow charts including acts and steps in methods for achieving a particular result. For example, FIG. 10 illustrates a flow diagram of an exemplary method in accordance with one or more embodiments. The method described with respect to fig. 10 may be performed with fewer or more steps/acts, or the steps/acts may be performed in a different order. Additionally, the steps/acts described herein may be repeated or performed in parallel with each other or in parallel with different instances of the same or similar steps/acts.
FIG. 10 illustrates a flow diagram 1000 of a series of actions in a method of composing a target scene using natural language descriptions in accordance with one or more embodiments. In one or more embodiments, the method 1000 is performed in a digital media environment that includes the target scene generation system 900. Method 1000 is intended to illustrate one or more methods in accordance with the present disclosure and is not intended to limit the possible embodiments. Alternative embodiments may include more, fewer, or different steps than those expressed in fig. 10.
As illustrated in fig. 10, method 1000 includes an act 1002 of receiving a natural language description of an image to be generated using a machine learning model. The natural language description of the image may include 1) image operations, 2) sentences with syntactic structure reflecting the composition, and/or 3) the objects and topics of the scene. In some embodiments, a source image is also received. The desired composition (alternatively referred to as the description of the target scene or of the image to be generated) is described using a prompt in a natural language format.
As illustrated in fig. 10, method 1000 includes an act 1004 of extracting control elements and sub-hints from the natural language description of the image to be generated. As described herein, one or more modules of the input extractor may extract information from the natural language description of the image to be generated. For example, the control language identifier of the input extractor parses out any control language in the natural language description of the image to be generated. Control language may refer to language that describes control elements, which are visual elements such as boundaries (e.g., borders), shapes (e.g., circles), illustrations, one or more effects (e.g., double exposure), and the like. Control language may also refer to language describing the composition of the scene. The structure analyzer of the input extractor parses out any sub-hints in the natural language description of the image to be generated. In some embodiments, one or more features/characteristics of a sub-hint are determined by identifying semantically-related terms corresponding to the sub-hint. The semantically-related features/characteristics are grouped with their sub-hint to preserve the relationship between the sub-hint and its features/characteristics.
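The following simplified Python sketch illustrates act 1004: the description is split into fragments, fragments containing known control keywords are treated as control language, and the remainder are treated as sub-hints. The keyword list and splitting heuristic are assumptions made for illustration; the input extractor described above may use more sophisticated NLP.

```python
# Simplified extraction for act 1004: classify prompt fragments as control
# language or sub-hints. Keywords and delimiters are assumptions.
import re

CONTROL_KEYWORDS = ("border", "circle", "double exposure", "illustration")


def extract(description: str):
    fragments = [f.strip() for f in re.split(r",| and | with ", description.lower()) if f.strip()]
    control_elements = [f for f in fragments if any(kw in f for kw in CONTROL_KEYWORDS)]
    sub_hints = [f for f in fragments if f not in control_elements]
    return control_elements, sub_hints


control_elements, sub_hints = extract(
    "A fluffy brown dog and a red ball, with a border and a double exposure effect"
)
# control_elements -> ['a border', 'a double exposure effect']
# sub_hints -> ['a fluffy brown dog', 'a red ball']
```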
As illustrated in fig. 10, method 1000 includes an act 1006 of identifying, based on the natural language description of the image to be generated, a relationship between the control element and the sub-hint. For example, the control language identifier executes the Natural Language Toolkit (NLTK) to identify grammatical relationships in the natural language description of the image to be generated. The identified grammatical relationships group the related information. Similarly, the structure analyzer may derive groupings of sub-hints. For example, the structure analyzer utilizes another NLP algorithm to identify the different parts of speech, and their relationships, in the natural language description of the image to be generated. As described herein, each sub-hint identified by the structure analyzer may include noun(s) corresponding to the subject/object of the input, as well as any associated attributes (e.g., adjectives). By using part-of-speech tags to identify sub-hints comprising subjects, related objects, and related attributes, the structure analyzer derives groupings from the syntactic structure of the natural language description of the image to be generated. The modules of the input extractor derive groupings between control language and sub-hints, between sub-hints and other sub-hints, between sub-hints and their semantically related features, and so on. The groupings maintain the relationships of the objects within a group, such that the visual elements (the objects corresponding to the group) are arranged according to the group.
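The following sketch illustrates one way the structure analyzer could derive sub-hint groupings from part-of-speech tags using NLTK, grouping adjectives with the noun they precede. It is a minimal example of the grouping step, not necessarily the algorithm used.

```python
# Grouping adjectives with the noun they precede, using NLTK part-of-speech
# tags (data package names may differ slightly across NLTK versions).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)


def group_sub_hints(description: str) -> list[dict]:
    tagged = nltk.pos_tag(nltk.word_tokenize(description))
    groups, attributes = [], []
    for word, tag in tagged:
        if tag.startswith("JJ"):        # adjective: attribute of the upcoming noun
            attributes.append(word)
        elif tag.startswith("NN"):      # noun: the subject/object of a sub-hint
            groups.append({"subject": word, "attributes": attributes})
            attributes = []
    return groups


print(group_sub_hints("A fluffy dog and a red ball"))
# Expected to yield roughly:
# [{'subject': 'dog', 'attributes': ['fluffy']}, {'subject': 'ball', 'attributes': ['red']}]
```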
As illustrated in fig. 10, method 1000 includes an act 1008 of generating, by the machine learning model, an image based on the control elements, the sub-hints, and the relationships. The image includes visual elements corresponding to the control elements and the sub-hints. As described herein, a machine learning model, such as a generative AI model, receives the sub-hint information and control element information determined from the natural language description of the image to be generated. The generative AI module generates visual elements corresponding to the information extracted from the input (e.g., sub-hints, control language, features of the sub-hints, etc.). The image described by the natural language description is generated by arranging the generated visual elements according to the relationships identified between the control elements and the sub-hints.
Embodiments of the present disclosure may include or utilize a special purpose or general-purpose computer including computer hardware, such as one or more processors and system memory, as discussed in more detail below. Embodiments within the scope of the present disclosure also include physical media and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be at least partially implemented as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., memory, etc.) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
Computer readable media can be any available media that can be accessed by a general purpose or special purpose computer system. The computer-readable medium storing computer-executable instructions is a non-transitory computer-readable storage medium (device). The computer-readable medium carrying computer-executable instructions is a transmission medium. Thus, by way of example, and not limitation, embodiments of the present disclosure may include at least two distinct categories of computer-readable media: a non-transitory computer readable storage medium (device) and a transmission medium.
Non-transitory computer readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives ("SSDs") (e.g., RAM-based), flash memory, phase change memory ("PCM"), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions or data structures and that can be accessed by a general purpose or special purpose computer.
A "network" is defined as one or more data links capable of transmitting electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. The transmission media can include networks and/or data links, which can be used to carry desired program code means in the form of computer-executable instructions or data structures, and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
Furthermore, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link may be cached in RAM within a network interface module (e.g., a "NIC") and then ultimately transferred to computer system RAM and/or less volatile computer storage media (devices) at the computer system. Thus, it should be understood that a non-transitory computer readable storage medium (device) can be included in a computer system component that also (or even primarily) utilizes transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions execute on a general purpose computer to transform the general purpose computer into a special purpose computer that implements the elements of the present disclosure. The computer-executable instructions may be, for example, binary code, intermediate format instructions (such as assembly language), or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablet computers, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Embodiments of the present disclosure may also be implemented in a cloud computing environment. In this specification, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing may be employed in the marketplace to provide ubiquitous and convenient on-demand access to a shared pool of configurable computing resources. The shared pool of configurable computing resources may be rapidly provisioned via virtualization and released with minimal management effort or service provider interaction, and then scaled accordingly.
A cloud computing model may be composed of various characteristics, such as on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud computing model may also expose various service models, such as software as a service ("SaaS"), platform as a service ("PaaS"), and infrastructure as a service ("IaaS"). A cloud computing model may also be deployed using different deployment models, such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this specification and in the claims, a "cloud computing environment" is an environment in which cloud computing is employed.
FIG. 11 illustrates, in block diagram form, an exemplary computing device 1100 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices, such as computing device 1100, may implement a target scene generation system. As shown by fig. 11, a computing device may include a processor 1102, a memory 1104, one or more communication interfaces 1106, a storage device 1108, and one or more I/O devices/interfaces 1110. In some embodiments, computing device 1100 may include fewer or more components than those shown in fig. 11. The components of computing device 1100 shown in fig. 11 will now be described in more detail.
In particular embodiments, the processor(s) 1102 include hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, the processor(s) 1102 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1104, or the storage device 1108, decode them, and execute them. In various embodiments, the processor(s) 1102 may include one or more Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), systems on a chip (SoCs), or other processor(s) or combinations of processors.
Computing device 1100 includes memory 1104 coupled to processor(s) 1102. Memory 1104 may be used for storing data, metadata, and programs for execution by the processor(s). Memory 1104 may include one or more of volatile memory and non-volatile memory, such as random access memory ("RAM"), read-only memory ("ROM"), solid state disk ("SSD"), flash memory, phase change memory ("PCM"), or other types of data storage. The memory 1104 may be internal memory or distributed memory.
Computing device 1100 may also include one or more communication interfaces 1106. Communication interface 1106 may include hardware, software, or both. The communication interface 1106 may provide one or more interfaces for communication (e.g., packet-based communication) between the computing device and one or more other computing devices 1100 or one or more networks. By way of example, and not limitation, communication interface 1106 may include a Network Interface Controller (NIC) or network adapter for communicating with an ethernet or other line-based network, or a Wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as WI-FI. Computing device 1100 may also include a bus 1112. Bus 1112 may include hardware, software, or both that couple components of computing device 1100 to one another.
Computing device 1100 includes a storage device 1108 that includes storage for storing data or instructions. By way of example, and not limitation, storage device 1108 may include the non-transitory storage media described above. Storage 1108 may include a Hard Disk Drive (HDD), flash memory, a Universal Serial Bus (USB) drive, or a combination of these or other storage devices. Computing device 1100 also includes one or more input or output ("I/O") devices/interfaces 1110 that are provided to allow a user to provide input to computing device 1100, such as user strokes, to receive output from computing device 1100, and to otherwise transfer data to and from computing device 1100. These I/O devices/interfaces 1110 may include a mouse, a keypad or keyboard, a touch screen, a camera, an optical scanner, a network interface, a modem, other known I/O devices, or a combination of such I/O devices/interfaces 1110. The touch screen may be activated with a stylus or finger.
The I/O devices/interfaces 1110 may include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In some embodiments, the I/O device/interface 1110 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation.
In the foregoing specification, embodiments have been described with reference to specific exemplary embodiments thereof. Various embodiments are described with reference to details discussed herein and the accompanying figures illustrate the various embodiments. The above description and drawings are illustrative of one or more embodiments and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the various embodiments.
Embodiments may include other specific forms without departing from the spirit or essential characteristics thereof. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts, or the steps/acts may be performed in a different order. Additionally, the steps/acts described herein may be repeated or performed in parallel with each other or in parallel with different instances of the same or similar steps/acts. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
In the various embodiments described above, unless specifically indicated otherwise, disjunctive language such as the phrase "at least one of A, B, or C" is intended to be understood to mean A, B, or C, or any combination thereof (e.g., A, B, and/or C). Thus, the disjunctive language is not intended, nor should it be construed, to imply that at least one of A, at least one of B, or at least one of C is required for a given embodiment.

Claims (20)

1. A method, comprising:
receiving a natural language description of an image to be generated using a machine learning model;
extracting control elements and sub-hints from the natural language description of the image to be generated;
identifying a relationship between the control element and the sub-hint based on the natural language description of the image to be generated; and
generating, by the machine learning model, an image based on the control element, the sub-hint, and the relationship, wherein the image includes visual elements corresponding to the control element and the sub-hint.
2. The method of claim 1, wherein the control element comprises a set of parameters defining a first visual element of the image and the sub-hint defines a second visual element of the image.
3. The method of claim 2, wherein generating, by the machine learning model, the image based on the control element, the sub-hint, and the relationship further comprises:
generating the first visual element and the second visual element using the machine learning model;
performing an operation on the first visual element based on the set of parameters; and
arranging the image using the first visual element and the second visual element.
4. The method of claim 3, wherein performing the operation comprises: cropping the first visual element according to the set of parameters.
5. The method of claim 3, wherein arranging the image comprises:
applying the first visual element to the second visual element.
6. The method of claim 2, wherein the set of parameters is based on the control element and includes one or more of: a height parameter, a width parameter, a radial x value, a radial y value, or a description.
7. The method of claim 2, further comprising:
determining a semantically-related term using the sub-hint, wherein the semantically-related term defines a third visual element;
generating the third visual element using the machine learning model; and
arranging the image using the first visual element, the second visual element, and the third visual element.
8. The method of claim 7, further comprising:
Grouping the second visual element and the third visual element to maintain a relationship between the second visual element and the third visual element.
9. The method of claim 1, wherein the relationship between the control element and the sub-hint is a grammatical relationship that groups the control element and the sub-hint.
10. The method of claim 1, further comprising:
receiving a source image; and
generating, by the machine learning model, the image based on the source image, the control element, the sub-hint, and the relationship, wherein the image includes visual elements corresponding to the control element and the sub-hint.
11. A system, comprising:
A memory component; and
A processing device coupled to the memory component, the processing device to perform operations comprising:
receiving a natural language description of an image to be generated using a machine learning model;
extracting control elements and sub-hints from the natural language description of the image to be generated;
identifying a relationship between the control element and the sub-hint based on the natural language description of the image to be generated; and
generating, by the machine learning model, an image based on the control element, the sub-hint, and the relationship, wherein the image includes visual elements corresponding to the control element and the sub-hint.
12. The system of claim 11, wherein the control element comprises a set of parameters defining a first visual element of the image and the sub-hint defines a second visual element of the image.
13. The system of claim 12, wherein generating, by the machine learning model, the image based on the control element, the sub-hint, and the relationship further comprises:
generating the first visual element and the second visual element using the machine learning model;
performing an operation on the first visual element based on the set of parameters; and
arranging the image using the first visual element and the second visual element.
14. The system of claim 11, wherein the relationship between the control element and the sub-hint is a grammatical relationship that groups the control element and the sub-hint.
15. A computer-readable medium storing executable instructions that, when executed by a processing device, cause the processing device to perform operations comprising:
receiving a natural language description of an image and a source image;
parsing the natural language description of the image to identify a sub-hint;
generating a visual element using a generative model, wherein the visual element is based on the sub-hint; and
arranging an image using the visual element and the source image.
16. The computer-readable medium of claim 15, wherein arranging the image comprises:
performing a composition operation based on the source image and the visual element.
17. The computer-readable medium of claim 15, storing executable instructions that further cause the processing device to perform operations comprising:
receiving a revision of at least one of the visual element or the source image.
18. The computer-readable medium of claim 17, wherein the revision is at least one of a size revision, a color revision, or a location revision.
19. The computer-readable medium of claim 15, wherein parsing the natural language description of the image further comprises: identifying a control element defining a set of parameters, the set of parameters defining a visual element, wherein the control element is used to generate another visual element using the generative model.
20. The computer-readable medium of claim 15, storing executable instructions that further cause the processing device to perform operations comprising:
outputting the image for display to a user.