CN117541668A - Virtual character generation method, device, equipment and storage medium

Virtual character generation method, device, equipment and storage medium

Info

Publication number
CN117541668A
CN117541668A (application number CN202311373791.8A)
Authority
CN
China
Prior art keywords
prop
character
sample
image
virtual
Prior art date
Legal status
Pending
Application number
CN202311373791.8A
Other languages
Chinese (zh)
Inventor
李玺
吴欣填
张亚庆
万乐
Current Assignee
Zhejiang University ZJU
Tencent Technology Shenzhen Co Ltd
Original Assignee
Zhejiang University ZJU
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU, Tencent Technology Shenzhen Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202311373791.8A
Publication of CN117541668A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a virtual character generation method, device, equipment and storage medium, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring character pose information, a prop mask and a description text; performing character feature extraction through a character control network based on the character pose information and the description text to obtain character conditioning features; performing prop feature extraction through a prop control network based on the prop mask and the description text to obtain prop conditioning features; and generating a virtual character image through a stable diffusion model based on the description text, the character conditioning features and the prop conditioning features, wherein the virtual character in the virtual character image accords with the character pose information, the appearance and position of the virtual prop in the virtual character image accord with the prop mask, and the interaction relationship between the virtual character and the virtual prop accords with the description text.

Description

Virtual character generation method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to a virtual character generation method, device, equipment and storage medium.
Background
Text-to-image generation is becoming widely accepted by the general public: a corresponding character image can be obtained simply by inputting a descriptive text for the character.
In the related art, a stable diffusion model can generate a high-quality character image from input text by using the text as a condition for image generation, thereby producing a virtual character related to the text. The virtual props carried by the virtual characters in the image can also be described in the text, so that the stable diffusion model generates corresponding props for the virtual characters.
However, in the scheme provided by the related art, characteristics of different dimensions such as the type, size and angle of the prop held by the character can only be coarsely regulated through the descriptive text. As a result, the interaction relationship between the virtual character and the virtual prop in the generated image often fails to meet the user's expectations, and the generation effect of the virtual character and the virtual prop is poor.
Disclosure of Invention
The embodiment of the application provides a virtual character generation method, device and equipment and a storage medium. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for generating a virtual character, where the method includes:
Acquiring character posture information, a prop mask and a description text, wherein the character posture information is used for indicating the posture of a virtual character, the prop mask is used for indicating the appearance of the virtual prop and the position of the virtual prop in an image, and the description text is used for describing image content;
based on the character posture information and the description text, character feature extraction is carried out through a character control network to obtain character conditioning features, wherein the character conditioning features are used for representing features of the virtual character required to meet conditions;
based on the prop mask and the description text, prop feature extraction is carried out through a prop control network, so that prop conditional features are obtained, and the prop conditional features are used for representing features of the virtual prop, wherein the features are required to meet conditions;
generating a virtual character image through a stable diffusion model based on the descriptive text, the character conditioning feature and the prop conditioning feature, wherein a virtual character in the virtual character image accords with the character gesture information, the appearance and the position of a virtual prop in the virtual character image accord with the prop mask, and the interaction relationship between the virtual character and the virtual prop accords with the descriptive text.
On the other hand, the embodiment of the application provides a virtual character generating device, which comprises:
the system comprises an acquisition module, a description text and a display module, wherein the acquisition module is used for acquiring character gesture information, a prop mask and a description text, the character gesture information is used for indicating the gesture of a virtual character, the prop mask is used for indicating the appearance of the virtual prop and the position of the virtual prop in an image, and the description text is used for describing the content of the image;
the character feature extraction module is used for extracting character features through a character control network based on the character posture information and the description text to obtain character conditioning features, wherein the character conditioning features are used for characterizing features of the virtual character required to meet conditions;
the prop feature extraction module is used for extracting prop features through a prop control network based on the prop mask and the description text to obtain prop conditional features, wherein the prop conditional features are used for representing features of the virtual prop, which are required to meet conditions;
the generating module is used for generating a virtual character image through a stable diffusion model based on the descriptive text, the character conditioning characteristics and the prop conditioning characteristics, wherein a virtual character in the virtual character image accords with the character gesture information, the appearance and the position of a virtual prop in the virtual character image accord with the prop mask, and the interaction relation between the virtual character and the virtual prop accords with the descriptive text.
In another aspect, embodiments of the present application provide a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement a method for generating a virtual character as described in the foregoing aspect.
In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the method of generating a virtual character as described in the above aspect.
In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the virtual character generating method provided in the above aspect.
The beneficial effects that technical scheme that this application embodiment provided include at least:
in the embodiment of the application, in the process of generating the virtual character image, character feature extraction is performed by a character control network on the character pose information and the description text describing the virtual character to be generated, so as to obtain character conditioning features. Prop feature extraction is performed by a prop control network on the prop mask and the description text describing the appearance and position of the virtual prop to be generated, so as to obtain prop conditioning features. By using the character conditioning features and the prop conditioning features as control conditions for generating the virtual character image, the virtual character and the virtual prop in the generated image conform to the user's expectations, improving the generation effect of the virtual character image. Moreover, because two control networks are added on the basis of the stable diffusion model to control the generated image, after the prop control network and the character control network respectively output the prop conditioning features and the character conditioning features, feature aggregation is performed on the character conditioning features and the prop conditioning features to obtain aggregated conditioning features, so that the stable diffusion model generates the virtual character image from the aggregated conditioning features and the description text. By introducing the character control network and the prop control network to separately control the pose of the virtual character and the appearance and position of the virtual prop, a better interaction effect between the virtual character and the virtual prop is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic diagram of a control network;
FIG. 2 illustrates a schematic view of a virtual character image;
FIG. 3 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 4 illustrates a flowchart of a method for generating virtual roles provided by an exemplary embodiment of the present application;
FIG. 5 illustrates a schematic diagram of a process for generating a virtual character image provided by an exemplary embodiment of the present application;
FIG. 6 shows a schematic block diagram of a residual block and a spatial Transformer according to an exemplary embodiment of the present application;
FIG. 7 illustrates a schematic diagram of an image generation model provided by an exemplary embodiment of the present application;
FIG. 8 illustrates a schematic diagram of a training process provided by an exemplary embodiment of the present application;
FIG. 9 illustrates a flowchart of a process for acquiring a sample image, sample character pose information, sample prop mask, and sample descriptive text provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a training sample acquisition process provided in an exemplary embodiment of the present application;
FIG. 11 illustrates a schematic diagram of a virtual character image provided in an exemplary embodiment of the present application;
fig. 12 is a block diagram showing a configuration of a virtual character generating apparatus according to an exemplary embodiment of the present application;
fig. 13 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The Diffusion Model (DM) is a deep generative model framework in which random noise is gradually added to the data in the form of a Markov chain, and the model then learns to reconstruct the required data samples from the noise through a reverse diffusion process.
The stable diffusion model (Stable Diffusion model, SD) is a diffusion model capable of generating high-quality pictures from text input. It introduces text prompts as conditions and performs diffusion learning in a latent space to reduce the computational cost of the model, and is an advanced text-to-image diffusion model framework.
The Control Network (CN) is a conditional plug-in model applicable to the SD model. By encoding given condition information, such as a pose map, a segmentation map or a depth map, and fusing the resulting features with the latent space of the SD model, it implements conditionally controlled generation; it is a text-to-image plug-in model capable of conditional control.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning. With the development and progress of artificial intelligence, it has been researched and applied in various fields, such as smart home, intelligent customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots and smart medical treatment, and with further technological development it will be applied in more fields and play an increasingly important role.
Machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance.
Embodiments of the present application relate to artificial intelligence (Artificial Intelligence, AI) and Machine Learning techniques, designed based on Machine Learning (ML) in artificial intelligence.
In the related art, a stable diffusion model is combined with a control network, and control conditions are added to the stable diffusion model through the control network, so that a virtual character that meets the user's requirements is generated. The condition information is encoded by the control network, and the control features obtained by encoding are fused with the latent space of the stable diffusion model, so that the stable diffusion model generates a corresponding image according to the control conditions. The condition information may include a pose image, a segmentation image or a depth image of the character, and is generally limited to characteristics of the character itself.
Referring to fig. 1, a schematic diagram of a control network is shown. The prompt text is input into a text encoder, so that the text encoder encodes the prompt text into text features. The text features are input into both the diffusion model and the control network, so that the control network controls the character pose in the image output by the stable diffusion model according to the input character pose information, and the generated virtual character accords with that pose information; the input of the stable diffusion model is a random noise image and a description text. The stable diffusion model comprises 4 coding blocks, a middle layer block and 4 decoding blocks, and the control network comprises 4 coding blocks and a middle layer block. The coding blocks and the middle layer block are used for feature extraction, and the decoding blocks are used for up-sampling. The 1/1 coding block operates on the 64×64 "latent image"; after the latent image is down-sampled by one coding block it is converted into a 32×32 feature space, the 1/4 coding block corresponds to a 16×16 feature space, and the 1/8 coding block corresponds to a 4×4 feature space. The corresponding 4 decoding blocks perform up-sampling in sequence, so that the feature space is successively enlarged.
However, in the scheme provided by the related art, only the pose of the generated virtual character can be controlled by the control network based on the character pose information. If the user needs to generate a virtual prop associated with the virtual character in the virtual character image, the only option is to add a corresponding description to the description text, which provides merely coarse-grained regulation of the virtual prop to be generated. The number and position of the generated virtual props cannot be controlled, some virtual props may be omitted from the generated virtual character image, and it is difficult to generate a virtual character image that meets the user's expectations.
Referring to fig. 2 for illustration, a schematic diagram of virtual character images is shown. According to the description text "the person drawing holding the sword on the beach" and the first character pose information 201, two swords (corresponding to the elements framed in the drawing) are generated in the first virtual character image 202; the description text fails to control the position of the virtual prop in the first virtual character image 202. Based on the description text "an avatar of a person holding a sword, shield, and armor" and the second character pose information 203, the shield is missing from the second virtual character image 204. It can be seen that virtual props may be missed when generating a virtual character image based only on the description text and the character pose information.
Referring to fig. 3, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown. The implementation environment includes a terminal 310 and a server 320. The data communication between the terminal 310 and the server 320 is performed through a communication network, alternatively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 310 is an electronic device having a function of generating a virtual character image based on the descriptive text, character pose information, and prop mask, and the electronic device may be a mobile terminal such as a smart phone, a tablet computer, a laptop, or a terminal such as a desktop computer, a projection computer, or the like, which is not limited in the embodiment of the present application. And the virtual character image generating function may be provided in the terminal through, for example, a picture processing type application, a communication type application, a game type application, etc., which is not limited in this embodiment.
The server 320 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligent platform. In the embodiment of the present application, the server 320 is a background server that provides a virtual character image generating function in the terminal 310, and may generate a virtual character image based on the description text, character pose information, and prop mask input by the user, and return to the terminal 310.
As shown in fig. 3, when the terminal 310 receives the description text, character pose information, and prop mask input by the user, the terminal 310 transmits the description text, character pose information, and prop mask to the server 320, and after the server 320 determines character-conditioning features and prop-conditioning features, respectively, virtual character images are generated based on the character-conditioning features, prop-conditioning features, and description text, and then the virtual character images are transmitted to the terminal 310, and the generated virtual character images are displayed to the user by the terminal 310.
In another possible implementation, server 320 is configured to train the image generation model, server 320 obtains the sample image, sample character pose information, sample prop mask, and sample description text, and trains the image generation model based on the sample image, sample character pose information, sample prop mask, and sample description text. The trained image generation model is used to generate virtual character images.
For convenience of description, the following embodiments are described as examples of the creation method of the virtual character executed by the computer device.
Referring to fig. 4, a flowchart of a method for generating a virtual character according to an exemplary embodiment of the present application is shown. This embodiment will be described by taking the method for a computer device as an example, and the method includes the following steps.
Step 401, acquiring character posture information, prop masks and description texts;
the character gesture information is used for indicating the gesture of the virtual character in the virtual character image to be generated, the prop mask is used for indicating the appearance of the virtual prop and the position of the virtual prop in the virtual character image to be generated, and the description text is used for describing the image content of the virtual character image to be generated.
Optionally, the character pose information may be represented in the form of a skeleton diagram (character pose image), in which different key points correspond to different character parts; that is, the character pose information can indicate both the pose of the virtual character and the position of the virtual character in the virtual character image to be generated.
The description text is used for describing the content of the image, including image elements, image colors, the interaction relationship between virtual characters and virtual props in the image, and the like. The description text may also describe details of specific image elements. For example, in the description text "a girl stands on a beach, the girl has black short hair", the beach is a description of the image background, and the black short hair is a detailed description of the virtual character.
A mask is a string of binary digits. The prop mask of the virtual character image is a binary matrix with the same size as the virtual character image, in which each matrix element corresponds to one pixel of the virtual character image and takes the value 0 or 1: a matrix element corresponding to a position where the virtual prop exists in the virtual character image is 1, and a matrix element corresponding to a position where the virtual prop does not exist is 0.
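As an illustration of this data structure, the following is a minimal sketch that builds a prop mask as a NumPy binary matrix; the function name, shapes and the rectangular prop region are assumptions made for the example and are not specified by the patent.

```python
import numpy as np

def build_prop_mask(height, width, prop_pixels):
    """Build a binary prop mask the same size as the target image.

    prop_pixels: iterable of (row, col) coordinates covered by the virtual prop.
    Matrix elements are 1 where the prop should appear and 0 elsewhere.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for r, c in prop_pixels:
        mask[r, c] = 1
    return mask

# Example: a 64x64 image with a rectangular prop region (rows 10-29, cols 20-39).
prop_region = [(r, c) for r in range(10, 30) for c in range(20, 40)]
mask = build_prop_mask(64, 64, prop_region)
print(mask.sum())  # 400 pixels belong to the prop
```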
Step 402, based on the character gesture information and the description text, character feature extraction is performed through a character control network, and character conditioning features are obtained.
Since the character pose information can characterize the position and pose of the character in the image, the character pose information is used as a control condition for generating the virtual character in the virtual character image, so that the pose of the virtual character in the generated virtual character image matches the character pose information.
Optionally, the text description is used as a text control condition for the virtual character image generating process, the character posture information is used as a character posture information control condition for the virtual character image generating process, and the text control condition and the character posture information control condition are interacted, so that the generated virtual character image can simultaneously accord with the text description and the character posture information.
The character control network obtains character conditioning characteristics after interaction of two control conditions by extracting characteristics of character gesture information and description text, wherein the character conditioning characteristics are used for representing characteristics of the virtual character required to meet the conditions.
Optionally, image coding is performed on the character pose information to obtain a character pose information vector, text coding is performed on the description text to obtain a text description vector, and feature splicing is performed on the character pose information vector and the text description vector to obtain the input features of the character control network. The character control network performs feature extraction on the input features to obtain the character conditioning features.
Step 403, extracting prop features through a prop control network based on the prop mask and the description text, and obtaining prop conditional features.
The prop mask can characterize the position and form of the prop in the image, so the prop mask is used as a control condition for generating the virtual prop in the virtual character image, and the position and form of the virtual prop in the generated virtual character image can match the prop form and position represented by the prop mask.
Optionally, the text description is used as a text control condition for the virtual character generating process, the prop mask is used as a prop control condition for the virtual character image generating process, and the text control condition and the prop control condition are interacted, so that the generated virtual character image can simultaneously accord with the text description and the prop mask.
The prop control network performs feature extraction through prop masks and descriptive texts, so that prop conditional features after interaction of two control conditions are obtained, and the prop conditional features are used for representing features of virtual props required to meet the conditions.
Optionally, image encoding is performed on the prop mask to obtain a prop mask vector, text encoding is performed on the descriptive text to obtain a text vector, and feature stitching is performed on the prop mask vector and the text vector to obtain input features of the prop control network. And the prop control network performs feature extraction on the input features so as to obtain prop conditional features.
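The following sketch illustrates one way the spliced input features of the two control network branches could be assembled, assuming the condition image (pose image or prop mask), the text vector and the noise latent are concatenated along the channel dimension as described above; all module names, tensor shapes and embedding sizes are hypothetical and do not come from the patent.

```python
import torch
import torch.nn as nn

# Illustrative channel/embedding sizes; the real model's sizes are not given in the text.
LATENT_C, TEXT_DIM, COND_DIM = 4, 16, 8

class ConditionInputBuilder(nn.Module):
    """Builds the spliced input feature for a control network (character or prop branch)."""
    def __init__(self):
        super().__init__()
        # Hypothetical condition-image encoder standing in for the image encoding step.
        self.cond_encoder = nn.Conv2d(1, COND_DIM, kernel_size=3, padding=1)

    def forward(self, noise_latent, text_feature, condition_image):
        # noise_latent: (B, LATENT_C, H, W) latent of the random noise image
        # text_feature: (B, TEXT_DIM) encoded description text, broadcast over H and W
        # condition_image: (B, 1, H, W) character pose image or prop mask
        b, _, h, w = noise_latent.shape
        cond_feat = self.cond_encoder(condition_image)
        text_map = text_feature[:, :, None, None].expand(b, TEXT_DIM, h, w)
        # Feature splicing along the channel dimension, as described above.
        return torch.cat([noise_latent, text_map, cond_feat], dim=1)

builder = ConditionInputBuilder()
x = builder(torch.randn(1, LATENT_C, 64, 64), torch.randn(1, TEXT_DIM), torch.rand(1, 1, 64, 64))
print(x.shape)  # torch.Size([1, 28, 64, 64])
```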
Step 404, generating a virtual character image by the stable diffusion model based on the descriptive text, the character conditioning features, and the prop conditioning features.
Wherein the virtual character in the virtual character image accords with character gesture information, the appearance and position of the virtual prop in the virtual character image accords with prop masks, and the interaction relation between the virtual character and the virtual prop accords with descriptive text.
Optionally, the stable diffusion model generates a random noise image in the process of generating the virtual character image, and the generation of the virtual character image is a process of denoising (reverse diffusion) the random noise image based on the description text and the aggregated conditioning features.
Optionally, the prop control network and the character control network are both connected to the stable diffusion model, so that the prop conditioning features output by the prop control network and the character conditioning features output by the character control network are input into the stable diffusion model, and a virtual character image is thereby generated by the stable diffusion model based on the description text, the character conditioning features and the prop conditioning features.
In summary, in the embodiment of the present application, in the process of generating the virtual character image, the character control network performs character feature extraction on the character pose information and the description text describing the virtual character to be generated, so as to obtain the character conditioning features. The prop control network performs feature extraction on the prop mask and the description text describing the form and position of the virtual prop to be generated, so as to obtain the prop conditioning features. By using the character conditioning features and the prop conditioning features as controls for generating the virtual character image, the virtual character and the virtual prop in the generated image conform to the user's expectations, improving the generation effect of the virtual character image. Moreover, because two control networks are added on the basis of the stable diffusion model to control the generated image, after the prop control network and the character control network respectively output the prop conditioning features and the character conditioning features, feature aggregation is performed on the character conditioning features and the prop conditioning features to obtain aggregated conditioning features, so that the stable diffusion model generates the virtual character image from the aggregated conditioning features and the description text. By introducing the character control network and the prop control network to separately control the pose of the virtual character and the appearance and position of the virtual prop, a better interaction effect between the virtual character and the virtual prop is achieved.
In one possible embodiment, the prop control network, the character control network and the stable diffusion model form an image generation model. When the user needs to generate a virtual character image, a text description, character pose information and a prop mask may be provided. Once the text description, the character pose information and the prop mask are acquired, the computer device inputs them into the image generation model: the prop conditioning features are determined by the prop control network based on the prop mask and the description text, and the character conditioning features are determined by the character control network based on the character pose information and the description text. The stable diffusion model then generates a virtual character image based on the description text and the aggregated conditioning features.
Optionally, while the user provides the text description, character pose information, and prop masks, the user may set the resolution of the virtual character image, i.e., the length and width of the virtual character image, so that the image model generates a virtual character image of a corresponding resolution.
Optionally, when providing the text description, the character pose information and the prop mask, the user can also set the number of execution steps of the image generation model (that is, the number of denoising steps of the stable diffusion model). The more execution steps, the more denoising steps are applied to the random noise image, the stronger the denoising effect, the higher the quality of the obtained virtual character image, and the better it conforms to the input text description, prop mask and character pose information. However, the greater the number of execution steps set by the user, the longer the delay in generating the virtual character image, and the higher the performance requirements on the computer device. For example, the number of execution steps of the model may be set to 28 steps, 30 steps, 32 steps, and so on.
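A possible way to expose these user settings is a small configuration object such as the sketch below; the field names and defaults are illustrative assumptions and are not specified by the patent.

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    """User-facing settings mentioned above; all names and defaults are illustrative."""
    width: int = 512              # resolution of the virtual character image
    height: int = 512
    num_denoise_steps: int = 28   # more steps -> stronger denoising, but higher latency
    num_images: int = 4           # how many candidate images to return for selection

cfg = GenerationConfig(num_denoise_steps=30)
print(cfg)
```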
Optionally, the image generating model processes the input text description, the character pose information and the prop mask, so that a plurality of different virtual character images can be generated, the poses of the virtual characters in the plurality of virtual character images all accord with the character pose information, the appearance and the positions of the virtual props in the plurality of virtual character images all accord with the prop mask, and the interaction relations of the virtual props and the virtual characters in the plurality of virtual character images all accord with the text description.
Referring to fig. 5, a schematic diagram of a process for generating a virtual character image according to an exemplary embodiment of the present application is shown. The user inputs character pose information as character control conditions, prop masks as prop control conditions, and text descriptions at an application interface providing a virtual character image generation function. And, the user needs to perform parameter configuration. The computer equipment inputs character attitude information, prop masks and text descriptions into the image generation model, and also inputs parameters configured by a user into the image generation model, and the parameters are processed through the image generation model to obtain a plurality of virtual character pictures output by the model for selection by the user.
In another possible implementation, the user may provide text replacement information and an example image to express the virtual character image generation requirement. The example image includes an example virtual character and an example prop, which serve as examples of the virtual character and the virtual prop in the generated virtual character image. The pose of the virtual character in the virtual character image output by the image generation model matches the pose of the example virtual character in the example image, the form and position of the virtual prop in the output image match the form and position of the example prop in the example image, the interaction relationship between the virtual character and the virtual prop in the output image matches the interaction relationship between the example virtual character and the example prop, and the interaction relationship between the virtual character and the virtual prop also accords with the text replacement information.
In the case where the user provides text as well as example images, the computer device obtains the example images. Then, the gesture extractor is used for extracting the gesture of the example virtual character in the example image, so that character gesture information is obtained.
The pose extractor may be a CPM (Convolutional Pose Machines) model, a CPN (Cascaded Pyramid Network) model, an OpenPose model, or the like.
The pose extractor detects the keypoint positions of the example virtual character from the example image, thereby identifying the pose of the example virtual character and obtaining the character pose information.
Optionally, in the process of extracting the pose of the example virtual character from the example image, the pose extractor detects the skeleton joint positions of the example virtual character, connects the key points into a human skeleton, maps the key point positions onto a solid-color image with the same resolution as the original image, and finally outputs the human pose image.
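The rendering step can be sketched as below, under assumed keypoint and limb conventions; a real extractor such as OpenPose defines its own keypoint order and drawing style, so the skeleton layout here is purely illustrative.

```python
import numpy as np

# Hypothetical skeleton: pairs of keypoint indices to connect with a limb line.
LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def render_pose_image(keypoints, height, width):
    """Draw detected keypoints and limbs onto a solid-color canvas at the original resolution."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)  # solid black background
    for x, y in keypoints:
        canvas[max(0, y - 2):y + 3, max(0, x - 2):x + 3] = (255, 255, 255)  # keypoint dot
    for a, b in LIMBS:
        xa, ya = keypoints[a]
        xb, yb = keypoints[b]
        # Naive line rasterization between the two joints.
        for t in np.linspace(0.0, 1.0, num=100):
            x = int(round(xa + t * (xb - xa)))
            y = int(round(ya + t * (yb - ya)))
            if 0 <= y < height and 0 <= x < width:
                canvas[y, x] = (0, 255, 0)
    return canvas

pose_image = render_pose_image([(32, 10), (32, 20), (22, 30), (20, 42), (42, 30), (44, 42)], 64, 64)
print(pose_image.shape)  # (64, 64, 3)
```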
The computer device performs mask extraction on the example props in the example image through an image segmenter to obtain the prop mask.
Optionally, the computer device identifies the image elements in the example image through a RAM (Recognize Anything Model) to obtain image element tags, and screens the image element tags through a GPT (Generative Pre-trained Transformer) model to filter out prop element tags related to prop elements, such as "spades", "bows", "organs", "flowers", and the like. Finally, the prop elements in the example image are framed based on the prop element tags through a Grounding-DINO (DETR with Improved deNoising anchOr boXes) model, and mask extraction is then performed on the example props in the example image through the image segmenter based on the framed prop elements, so as to obtain the prop mask.
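This pipeline can be sketched as a function that orchestrates the four stages; every callable below is a placeholder standing in for the corresponding model (a RAM-style tagger, an LLM-based tag filter, a Grounding-DINO-style detector, and an image segmenter), and no real library API is assumed.

```python
def extract_prop_mask(example_image, tagger, tag_filter, detector, segmenter):
    """Sketch of the prop-mask pipeline described above.

    tagger      - stands in for a RAM-style model returning image element tags
    tag_filter  - stands in for an LLM-based filter keeping only prop-related tags
    detector    - stands in for a Grounding-DINO-style detector returning boxes per tag
    segmenter   - stands in for an image segmenter returning a binary mask per box
    All callables are placeholders; none of the real models' APIs are assumed here.
    """
    tags = tagger(example_image)                 # e.g. ["person", "beach", "sword"]
    prop_tags = tag_filter(tags)                 # e.g. ["sword"]
    boxes = [b for tag in prop_tags for b in detector(example_image, tag)]
    # The union of the per-box masks gives the final prop mask.
    mask = None
    for box in boxes:
        m = segmenter(example_image, box)
        mask = m if mask is None else (mask | m)
    return mask
```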
And the computer equipment performs text content replacement on the example description text corresponding to the example image based on the input text replacement information to obtain the description text.
Optionally, the computer device identifies the elements in the example image through a RAM model to obtain an image element tag, and converts the image element tag into the example description text through a tag2text model in the RAM.
And under the condition that the user inputs text replacement information, replacing corresponding content in the example description text to obtain the description text. For example, the example descriptive text is "a person standing on a beach with black hair" and the text input by the user is "a child with blue hair" on a beach, then the computer device may determine that the text replacement information is "a child with blue hair" and the descriptive text obtained by text content replacement of the example descriptive text corresponding to the example image is "a child with blue hair standing on a beach".
In the embodiment of the application, when the user inputs the description text, the character pose information and the prop mask, a virtual character image that meets the user's expectations is generated through an image generation model constructed from the stable diffusion model, the character control network and the prop control network. When the user inputs text replacement information and an example image, the character pose information and the prop mask are obtained by processing the example image, the description text is obtained by replacing content in the example description text, and the virtual character image is then generated, so that different input forms can be offered to the user.
In this embodiment, the stable diffusion model includes an encoding network and a decoding network, where the decoding network includes n decoding blocks. The prop control network and the character control network each comprise n coding blocks; the n coding blocks in the prop control network are connected in series, the n coding blocks in the character control network are connected in series, and the feature spaces of the n coding blocks decrease in sequence.
In addition, in the embodiment of the application, the coding network of the stable diffusion model has the same structure as the character control network and the prop control network.
Optionally, the decoding network and the encoding network in the stable diffusion model are connected through a middle layer block, and the feature space of the middle layer block is identical to the feature space of the nth coding block and the feature space of the first decoding block. For example, if the feature space of the nth coding block is 4×4, the feature space of the first decoding block is identical to it, and the feature space of the middle layer block is also identical to it; that is, the middle layer block does not include a down-sampling layer or an up-sampling layer. Similarly, a middle layer block is also connected after the n coding blocks in the prop control network and the character control network.
Optionally, the coding blocks included in the encoding network of the stable diffusion model, the n coding blocks in the prop control network, and the n coding blocks in the character control network have the same structure, and each coding block is formed by a residual block (ResBlock) - spatial Transformer (equivalent to a self-attention layer) - residual block - spatial Transformer - down-sampling hierarchy. Optionally, the middle layer block of the stable diffusion model has the same structure as the middle layer blocks contained in the prop control network and the character control network, and each middle layer block is formed by two residual blocks.
Referring to fig. 6, a schematic diagram of a residual block and a spatial Transformer according to an exemplary embodiment of the present application is shown. The residual block comprises two convolution layers and a fully connected layer. The fully connected layer is used for mapping the input time embedding information to the sample space, where the time embedding information can be understood as the number of denoising steps of the model. The input features are the output features of the previous coding block (corresponding, in the different networks, to character conditioning features, prop conditioning features, or the output features of a coding block in the stable diffusion model). The residual block directly combines the input features with the intermediate features produced by the first convolution layer and then feeds the result into the next convolution layer; that is, the mapping the network needs to learn is converted from "input features to intermediate features" into learning the difference between the intermediate features and the input features, which makes network optimization simpler, and the deeper the network is, the more accurate the residual it learns.
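A minimal PyTorch sketch of such a residual block is given below, assuming SiLU activations and illustrative channel sizes; beyond the two convolutions, the fully connected time-embedding projection and the input-plus-intermediate combination described above, the exact layer configuration is an assumption.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block sketch: two convolutions plus a fully connected layer that maps
    the time (denoising-step) embedding into the feature space."""
    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))                      # intermediate features after the first conv
        h = h + self.time_proj(t_emb)[:, :, None, None]  # inject the time embedding
        # Combine the input features with the intermediate features, then apply the next conv,
        # so the network effectively learns the difference between the two.
        return self.conv2(self.act(x + h))

block = ResBlock(channels=64, time_dim=128)
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 128))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```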
In fig. 6, in the spatial Transformer, the Q matrix represents the Query, the K matrix represents the Key, and the V matrix represents the Value. The Q, K and V matrices are obtained by linearly transforming the vector matrix of the input features. A similarity matrix is obtained by the dot product (MatMul) of the Q matrix and the K matrix, and the similarity matrix is scaled and normalized by a Softmax activation function, so that the values of the matrix elements in the resulting matrix all lie between 0 and 1; the resulting matrix can therefore be understood as a weight matrix. The dot product of the weight matrix and the V matrix gives a weighted sum, and a convolution operation is performed on the weighted sum through a convolution layer to obtain the output features. The weighted features make the variance of the model smaller and the gradients more stable during training.
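The attention step can be sketched as follows; the single-head formulation and projection layout are simplifying assumptions, but the Q·K similarity, the Softmax weighting, the weighted sum of V and the final convolution follow the description above.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Sketch of the spatial Transformer (self-attention) step; dimensions are illustrative."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)
        self.proj_out = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                    # (B, H*W, C)
        q, k, v = self.to_q(tokens), self.to_k(tokens), self.to_v(tokens)
        # Q.K similarity, scaled and normalized into weights between 0 and 1.
        weights = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(b, c, h, w)  # weighted sum of V
        return self.proj_out(out)                                # final convolution layer

attn = SpatialSelfAttention(channels=64)
print(attn(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])
```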
In generating a virtual character image through the stable diffusion model based on the description text, the character conditioning features and the prop conditioning features, feature aggregation is performed on the character conditioning features and the prop conditioning features to obtain aggregated conditioning features.
In order for the character conditioning features and the prop conditioning features to serve as control conditions for generating the virtual character image, feature aggregation is first performed on the character conditioning features output by the character control network and the prop conditioning features output by the prop control network, and the resulting aggregated conditioning features are input into the stable diffusion model.
Optionally, the character conditioning feature is output by an encoding block in the character control network, the prop conditioning feature is output by an encoding block in the prop control network, and the aggregate conditioning feature is input to a decoding block of the stable diffusion model.
Optionally, feature aggregation is feature fusion, which fuses the character conditioning features and the prop conditioning features into the aggregated conditioning features. The aggregated conditioning features are the control condition features by which the character pose information and the prop mask constrain the virtual character image to be generated.
The n coding blocks included in the prop control network output n prop conditioning features, and the n coding blocks included in the character control network output n character conditioning features. In order for the character conditioning features and the prop conditioning features to act jointly on the decoding network of the stable diffusion model, the computer device performs feature aggregation on the i-th character conditioning features output by the i-th coding block in the character control network and the i-th prop conditioning features output by the i-th coding block in the prop control network, to obtain the i-th aggregated conditioning features. Since the prop control network and the character control network each include n coding blocks, n aggregated conditioning features are obtained. The n aggregated conditioning features are respectively input to different decoding blocks in the stable diffusion model, that is, the i-th aggregated conditioning features are the input features of the (n-i+1)-th decoding block. After the i-th aggregated conditioning features are passed to the decoding block of the stable diffusion model for further fusion, the fused features contain both the condition control information provided by the character control network and the prop control network and the basic image information provided by the stable diffusion model.
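A sketch of this per-level routing is shown below; element-wise addition is used as a stand-in fusion operator since the text does not fix the exact fusion operation, and the tensor shapes are illustrative.

```python
import torch

def aggregate_condition_features(char_feats, prop_feats):
    """Per-level aggregation sketch: char_feats[i] and prop_feats[i] come from the i-th
    coding blocks of the character and prop control networks; addition is an assumed fusion."""
    return [c + p for c, p in zip(char_feats, prop_feats)]

n = 4
char_feats = [torch.randn(1, 64 * 2 ** i, 64 // 2 ** i, 64 // 2 ** i) for i in range(n)]
prop_feats = [torch.randn_like(f) for f in char_feats]
agg = aggregate_condition_features(char_feats, prop_feats)

# The i-th aggregated feature feeds the (n - i + 1)-th decoding block of the diffusion model.
for i, feat in enumerate(agg, start=1):
    print(f"aggregated feature {i} -> decoding block {n - i + 1}, shape {tuple(feat.shape)}")
```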
Optionally, feature aggregation is performed on the (n+1)-th character conditioning features output by the middle layer block of the character control network and the (n+1)-th prop conditioning features output by the middle layer block of the prop control network to obtain the (n+1)-th aggregated conditioning features, where the (n+1)-th aggregated conditioning features serve as the input features of the middle layer block of the stable diffusion model.
In the embodiment of the application, when generating the virtual character image, in addition to the character pose information, prop mask and description text input into the image generation model, a random noise image needs to be generated, and the stable diffusion model denoises the random noise image to obtain the virtual character image. The features of the random noise image are also used as input to the character control network and the prop control network, so the generated character conditioning features contain feature data for regions other than the virtual character, and the generated prop conditioning features likewise contain feature data for regions other than the virtual prop. Therefore, in order that the character conditioning features output by the character control network are used only to control the virtual character and the prop conditioning features output by the prop control network are used only to control the virtual prop, the computer device performs feature aggregation on the i-th character conditioning features and the i-th prop conditioning features based on the prop mask, to obtain the i-th aggregated conditioning features.
Specifically, since the binary data corresponding to the prop region in the prop mask is 1 and the binary data corresponding to other regions is 0, the i-th prop region features of the prop region can be extracted from the i-th prop conditioning features based on the prop mask, so that the prop conditioning features retain only information within the prop region. For the character conditioning features, the i-th character region features outside the prop region can be extracted from the i-th character conditioning features based on the inverse prop mask corresponding to the prop mask, where the inverse prop mask is obtained by inverting the prop mask.
Optionally, when extracting the i-th prop region features of the prop region from the i-th prop conditioning features based on the prop mask, a prop mask with the same H×W size as the i-th prop conditioning features is multiplied element-wise (Hadamard product) with the i-th prop conditioning features to obtain the i-th prop region features of the prop region. When extracting the i-th character region features outside the prop region from the i-th character conditioning features based on the inverse prop mask corresponding to the prop mask, the prop mask with the same H×W size as the i-th character conditioning features is inverted to obtain the inverse prop mask, and the inverse prop mask is multiplied with the i-th character conditioning features to obtain the i-th character region features outside the prop region.
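The mask-guided aggregation can be sketched as below; the nearest-neighbor resizing of the mask to each feature level and the summation of the two masked features are assumptions, since the text only specifies the Hadamard products with the prop mask and the inverse prop mask.

```python
import torch
import torch.nn.functional as F

def masked_aggregate(char_feat, prop_feat, prop_mask):
    """Keep the prop conditioning feature inside the prop region (Hadamard product with the
    prop mask) and the character conditioning feature outside it (product with the inverted
    mask), then combine them; the final sum is an assumed fusion step."""
    # Resize the binary mask to the H x W of the current feature level.
    mask = F.interpolate(prop_mask, size=char_feat.shape[-2:], mode="nearest")
    inverse_mask = 1.0 - mask
    prop_region_feat = prop_feat * mask            # i-th prop region features
    char_region_feat = char_feat * inverse_mask    # i-th character region features
    return prop_region_feat + char_region_feat

prop_mask = torch.zeros(1, 1, 64, 64)
prop_mask[:, :, 10:30, 20:40] = 1.0
agg = masked_aggregate(torch.randn(1, 320, 16, 16), torch.randn(1, 320, 16, 16), prop_mask)
print(agg.shape)  # torch.Size([1, 320, 16, 16])
```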
Referring to fig. 7, a schematic diagram of an image generation model according to an exemplary embodiment of the present application is shown, which includes a prop control network, a stable diffusion model and a character control network. The two random noise images shown are the same random noise image, randomly generated by the computer device; in fact, the size of the random noise image should fit the feature space (64×64) of coding block 1/1, and the computer device first encodes the random noise image into random noise image features. In addition, a text encoder is included in the image generation model to encode the description text into text features, and the computer device also encodes the character pose information and the prop mask into character pose information features and prop features. The input of the stable diffusion model is a first spliced feature obtained by splicing the random noise image features and the text features; the input of the prop control network is a second spliced feature obtained by feature splicing of the random noise image features, the text features and the prop features; and the input of the character control network is a third spliced feature obtained by feature splicing of the random noise image features, the text features and the character pose information features.
The stable diffusion model comprises 4 coding blocks, a middle layer block and 4 decoding blocks; the character control network and the prop control network each comprise 4 coding blocks and a middle layer block, where the coding blocks and the middle layer block are used for feature extraction, and the structures of the prop control network and the character control network are identical. When feature aggregation is performed on the character conditioning features and the prop conditioning features, the character conditioning features are multiplied by the inverse prop mask, the prop conditioning features are multiplied by the prop mask, and after feature fusion the results are input into the corresponding decoding blocks. The image generation model further includes a decoder of a VAE (Variational Auto-Encoder) for decoding the virtual character image from the virtual character image features obtained by the stable diffusion model.
Corresponding to fig. 7, in the application process, the computer device inputs the random noise image and the description text into the encoding network of the stable diffusion model to obtain the intermediate noise image features output by the encoding network, then performs noise prediction through the decoding network of the stable diffusion model based on the intermediate noise image features and the aggregated conditioning features to obtain the prediction noise, and finally denoises the random noise image based on the prediction noise to obtain the virtual character image.
Optionally, the computer device encodes the randomly generated random noise image to obtain random noise image features, inputs the random noise image features into the encoding network of the stable diffusion model to obtain the intermediate noise image features output by the encoding network, and then performs noise prediction through the decoding network of the stable diffusion model based on the intermediate noise image features and the aggregated conditioning features to obtain predicted noise image features. Finally, the random noise image features are denoised based on the predicted noise image features to obtain virtual character image features, and the virtual character image features are decoded by the VAE decoder to obtain the virtual character image.
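A minimal Python sketch of this application-stage flow is shown below; the module interfaces (sd_model.encode, sd_model.decode, vae_decoder) and the simplified denoising update are assumptions for illustration, not the exact sampler used in the embodiment:

```python
import torch

@torch.no_grad()
def generate_character_image(sd_model, vae_decoder, agg_features, text_feat, steps=50):
    """Sketch of the application-stage flow described above (interfaces assumed).

    sd_model.encode / sd_model.decode stand for the encoding and decoding
    networks of the stable diffusion model; agg_features are the aggregated
    conditioning features fed to the decoding blocks.
    """
    latent = torch.randn(1, 4, 64, 64)                   # random noise image features
    for _ in range(steps):                               # iterative denoising
        mid = sd_model.encode(latent, text_feat)         # intermediate noise image features
        pred_noise = sd_model.decode(mid, agg_features)  # noise prediction
        latent = latent - pred_noise / steps             # simplified denoising update
    return vae_decoder(latent)                           # virtual character image
```

In practice the update would follow a proper diffusion sampler schedule (e.g. DDPM or DDIM) rather than the uniform step shown here.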
In the training process, the computer device first needs to acquire sample images, sample character pose information, sample prop masks and sample description texts, and then trains the image generation model based on the sample character pose information, the sample prop masks, the sample description texts and the sample images.
Wherein the sample prop mask is used for indicating the appearance of the sample prop in the sample image and the position of the sample prop in the sample image, and the sample description text is used for describing the image content of the sample image. The image generation model is composed of a role control network, a prop control network and a stable diffusion model.
In the process of training the image generation model, since mature pre-trained character control networks and stable diffusion models already exist in the related art, a pre-trained character control network and a pre-trained stable diffusion model can be directly combined with an untrained prop control network to construct the image generation model; that is, the parameters of the character control network and the stable diffusion model are frozen during training. For example, the stable diffusion model may be initialized with the parameters of the angry-v3 model, and the character control network with the parameters of the sd-controlnet-openpose model.
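A short sketch of this freezing strategy in PyTorch follows; the attribute names (character_control_net, stable_diffusion, prop_control_net) are illustrative assumptions for the composite model object:

```python
def freeze_pretrained_parts(image_gen_model):
    """Freeze the pre-trained character control network and stable diffusion
    model so that only the prop control network is updated (a sketch; the
    attribute names are assumptions)."""
    for module in (image_gen_model.character_control_net,
                   image_gen_model.stable_diffusion):
        for p in module.parameters():
            p.requires_grad_(False)
    # Only the prop control network parameters reach the optimizer.
    trainable = [p for p in image_gen_model.prop_control_net.parameters()
                 if p.requires_grad]
    return trainable
```

An optimizer would then be built only over the returned list, for example torch.optim.AdamW(trainable, lr=1e-5), so that gradient updates touch the prop control network alone.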
Referring to fig. 8, a schematic diagram of the training process provided in an exemplary embodiment of the present application is shown. Solid-line network layers represent parameter-frozen, i.e., untrainable, layers, and dashed-line boxes represent trainable layers. The noisy image is used as part of the input information during training.
The process of generating a model based on sample character pose information, sample prop masks, sample descriptive text, and sample image training images is described below with reference to fig. 8.
Firstly, character feature extraction is carried out through a character control network based on sample character posture information and sample description text to obtain sample character conditioning features, and prop feature extraction is carried out through a prop control network based on a sample prop mask and sample description text to obtain sample prop conditioning features.
The specific implementations for generating the sample character conditioning features and the sample prop conditioning features are the same as those for generating the character conditioning features and the prop conditioning features in the above embodiment, and are not repeated here.
And secondly, carrying out feature aggregation on the sample character conditional features and the sample prop conditional features to obtain sample aggregation conditional features. Specifically, based on the sample prop mask, feature aggregation is performed on the ith sample character conditional feature and the ith sample prop conditional feature, and the ith sample aggregation conditional feature is obtained. The i-th sample aggregate conditioning feature is used as an input feature for the n-i+1-th decoding block in the stable diffusion model. In fig. 8, the light portion of the sample prop mask corresponds to a value of 1, and the dark portion corresponds to a value of 0.
Optionally, the aggregated conditioning feature is an N×C×H×W tensor, where N represents the batch size (the number of samples processed in one iteration), C represents the number of feature channels, H represents the feature height, and W represents the feature width.
Then, noise prediction is performed on the noisy image through the stable diffusion model based on the sample description text and the sample aggregated conditioning features to obtain the sample prediction noise, a noise prediction loss is determined based on the sample prediction noise and the sample noise, and the prop control network is trained based on the noise prediction loss. In the sample prop mask, the sample prop region takes the value 1 and the other regions take the value 0; the reverse sample prop mask is the opposite.
The noisy image is obtained by adding sample noise to the sample image. Optionally, the image generation model further comprises a VAE encoder: the sample image is encoded into sample image features, and sample noise is then added to the sample image features to obtain the noisy image features.
The computer device inputs the noisy image features into the encoding network of the stable diffusion model to obtain intermediate noisy image features, and the decoding network of the stable diffusion model predicts the sample noise based on the intermediate noisy image features and the sample aggregated conditioning features, obtaining the sample prediction noise features.
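The noise-prediction objective described above amounts to a mean-squared error between the predicted noise and the sample noise; a hedged sketch of one training step, with assumed module interfaces, might look like this:

```python
import torch
import torch.nn.functional as F

def training_step(vae_encoder, sd_model, sample_image, sample_agg_features,
                  text_feat, optimizer):
    """One training step matching the loss described above (a sketch; the
    module interfaces are assumptions, not the patented implementation)."""
    latents = vae_encoder(sample_image)                  # sample image features
    sample_noise = torch.randn_like(latents)             # sample noise
    noisy = latents + sample_noise                       # noisy image features (no noise schedule shown)
    mid = sd_model.encode(noisy, text_feat)              # intermediate noisy image features
    pred_noise = sd_model.decode(mid, sample_agg_features)
    loss = F.mse_loss(pred_noise, sample_noise)          # noise prediction loss
    loss.backward()                                      # only prop control net params carry gradients
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the prop control network parameters are registered with the optimizer, the frozen character control network and stable diffusion model are unaffected by the gradient step.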
In the embodiment of the application, sample noise is added to the sample image during training to serve as part of the input information of the image generation model, and the sample prediction noise obtained by the image generation model is used to train the prop control network. In addition, since the sample aggregated conditioning features are input into the stable diffusion model during training, the sample prop mask acts as a factor affecting the noise predicted by the model, which achieves the goal of training the prop control network. The trained prop control network, the character control network and the stable diffusion model together form the image generation model, which can generate virtual character images conforming to the prop mask and the character pose information.
In the training process, training samples are a large number of candidate images, and the computer equipment needs to screen sample images from the candidate images and acquire sample character pose information, sample prop masks and sample description text. The process of acquiring a sample image, sample character pose information, sample prop mask, and sample descriptive text is described below with one exemplary embodiment.
Referring to fig. 9, a flowchart of acquiring a sample image, sample character pose information, sample prop mask, and sample description text according to an exemplary embodiment of the present application is shown, where the process includes:
step 901, performing label recognition on at least two candidate images to obtain sample element labels of image elements in the candidate images.
The candidate images may be a screenshot of a game screen, or a screenshot of other video playing interfaces, etc. In the training process, a large number of candidate images need to be subjected to label recognition so as to be capable of screening at least tens of thousands of sample images from the candidate images for training a prop control network.
Optionally, the tag identification may be implemented through a feature model, a Viola-Jones algorithm, or a RAM model, to obtain a sample element tag of an image element in the candidate image, which is not limited in this embodiment.
Sample elements in the candidate image may include sample roles, sample props, environmental elements, plant elements, and so forth.
Step 902, based on the sample element tags, a sample image with sample prop element tags and sample role element tags is screened from at least two candidate images.
The sample element labels contained in the candidate images vary, and some candidate images may lack a sample character, a sample prop, or both; such images cannot be used as sample images for the image generation model constructed in the embodiment of the application. Therefore, candidate images containing both sample prop element labels and sample character element labels need to be screened out from the candidate images as sample images.
Optionally, the computer device screens sample images containing sample character element tags (e.g., "person", "human", etc.), and sample prop element tags (e.g., "swords", "bow", etc.).
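A simple filtering pass over the recognized tags is enough to implement this screening step; the tag vocabularies below are illustrative, not an exhaustive list from the patent:

```python
def select_sample_images(candidates):
    """Keep only candidates whose recognized tags contain both a character
    element and a prop element (tag vocabularies here are illustrative)."""
    character_tags = {"person", "human", "woman", "man"}
    prop_tags = {"sword", "bow", "staff", "flower", "grass"}
    samples = []
    for image, tags in candidates:          # candidates: list of (image, set_of_tags)
        tags = {t.lower() for t in tags}
        if tags & character_tags and tags & prop_tags:
            samples.append((image, tags))
    return samples
```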
In step 903, the sample element labels of the sample image are converted into sample description text by a text converter.
For example, the computer device may convert the image element labels into the sample description text based on a text generation model, such as the tag2text model used in RAM or a seq2seq model; the specific type of text converter is not limited by the embodiments of the present application.
The computer equipment converts the sample element label of the sample image into descriptive text, namely a process for describing the content of the sample image based on the sample element label.
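As a placeholder for the text converter, the label-to-text step could be approximated by a trivial template; a real system would use a Tag2Text-style or seq2seq caption model instead:

```python
def labels_to_description(labels):
    """Naive stand-in for the text converter: join sample element labels into
    one sentence (illustrative only; not the Tag2Text / seq2seq behaviour)."""
    return "an image containing " + ", ".join(sorted(labels))

# Example: {"woman", "staff", "forest"} -> "an image containing forest, staff, woman"
```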
Step 904, extracting the gesture of the character in the sample image by a gesture extractor to obtain sample character gesture information.
After the sample images are screened out, each sample image contains a sample character element; the pose of the character in the sample image can be extracted through a DensePose, OpenPose, AlphaPose or DeepPose model, among others, to obtain the sample character pose information. The present embodiment does not limit the manner of pose extraction.
In step 905, the sample prop in the sample image is subjected to image segmentation by the image segmenter, so as to obtain a sample prop mask.
The image segmentation is to segment an image into a plurality of segments, and the image segmentation is performed on the sample image based on the sample prop label to obtain the segments of the sample prop, so as to obtain the sample prop mask.
In one possible embodiment, before image segmentation, a prop detector is used to frame-select the sample prop in the sample image, i.e., to determine the position of the sample prop in the sample image. Image segmentation is then carried out by the image segmenter based on the frame-selection result of the prop elements.
Alternatively, the computer device may implement image segmentation using a DeepLab model, Mask R-CNN (Region-based Convolutional Neural Network), a SAM model, or the like, which is not limited in this embodiment.
In another possible embodiment, the prop control network is trained with sample prop masks of only one prop type; for example, if only sample prop labels of the plant type, such as "flower" and "grass", are retained, the image generation model has a better generation effect on plant-type props in the application stage.
Before image segmentation, the position of a sample prop of the target prop type in the sample image needs to be determined. Specifically, the computer device screens out a target sample prop tag from sample prop element tags contained in the sample image through the tag screener, wherein the target sample prop tag is used for representing props of a target prop type. For example, the tag filter may be a GPT model.
And then, framing the sample prop of the target prop type in the sample image based on the target sample prop label contained in the sample image through the prop detector, namely determining the position of the sample prop of the target prop type in the sample image.
Alternatively, the prop detector may employ an R-CNN model, a YOLO (You Only Look Once) model, a Grounding-DINO model, etc., which is not limited in this embodiment.
After the sample prop of the target prop type in the sample image is selected in a frame mode, the computer equipment performs image segmentation on the sample prop belonging to the target prop type in the sample image through the image segmenter based on the frame selection condition of prop elements, and a sample prop mask is obtained.
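Putting the label screener, prop detector and image segmenter together, the sample prop mask extraction could be sketched as follows; the detector and segmenter are assumed wrapper objects (e.g. around a Grounding-DINO-style detector and a SAM-style segmenter), not a specific library API:

```python
import numpy as np

def build_sample_prop_mask(image, prop_labels, detector, segmenter):
    """Frame-select target-type props with a detector, then segment inside the
    boxes to get a binary sample prop mask (detector/segmenter interfaces are
    assumed wrappers, not a specific library API)."""
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for label in prop_labels:                        # e.g. ["flower", "grass"]
        boxes = detector.detect(image, label)        # frame selection of the prop
        for box in boxes:
            segment = segmenter.segment(image, box)  # binary segment inside the box
            mask = np.maximum(mask, segment.astype(np.uint8))
    return mask                                      # 1 in prop regions, 0 elsewhere
```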
Referring to fig. 10, a schematic diagram of the training sample acquisition procedure according to an exemplary embodiment of the present application is shown. First, label recognition is carried out on at least two candidate images to obtain sample element labels, and the images are screened according to the obtained sample element labels so that sample images containing both a sample character and a sample prop are selected. Then, text conversion is carried out on the sample element labels of the sample image to obtain the sample description text, which in fig. 10 corresponds to an illustration of a woman in costume holding a cane. Meanwhile, the computer device screens target sample prop labels corresponding to the target prop type from the sample element labels through the label screener, so that the prop detector frame-selects the target prop in the sample image based on the target sample prop labels (corresponding to the virtual prop held by the virtual character in fig. 10), and the image segmenter segments the sample image based on the frame-selection result of the sample prop to obtain the sample prop mask. In addition, after screening out the sample image, the computer device extracts the pose of the character in the sample image through the pose extractor to obtain the sample character pose information.
In the embodiment of the application, label recognition is performed on the candidate images to screen out sample images that contain both sample prop elements and sample character elements. The image generation model is trained on these sample images so that the prop control network learns the mapping from the input description text and prop mask to the prop conditioning features; in this way, the interaction relationship between the virtual character and the virtual prop in the generated image can be controlled through the prop mask when generating virtual character images.
Referring to fig. 11, a schematic diagram of a virtual character image provided in an exemplary embodiment of the present application is shown.
When the input character pose information and prop mask are the first character pose information and the first prop mask respectively, the virtual character in the generated first virtual character image accords with the first character pose information, and the appearance and position of the virtual prop in the first virtual character image accord with the first prop mask. When the input character pose information and prop mask are the first character pose information and the second prop mask respectively, the virtual character in the generated second virtual character image accords with the first character pose information, and the appearance and position of the virtual prop in the second virtual character image accord with the second prop mask. Because the character pose information input in both cases is the first character pose information, the poses of the virtual characters in the first and second virtual character images are essentially the same. Because the virtual prop positions indicated by the first prop mask and the second prop mask differ, the positions of the virtual props in the first and second virtual character images differ, while the shapes of the virtual props remain essentially the same.
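For illustration, the comparison in fig. 11 could be reproduced with a hypothetical generate function wrapping the whole pipeline sketched earlier:

```python
# `generate`, the pose and mask variables, and the prompt are hypothetical.
# Same pose, two different prop masks: the character posture stays the same,
# while the prop appears at the position each mask indicates.
image_a = generate(pose_info_1, prop_mask_1, "a woman holding a staff")
image_b = generate(pose_info_1, prop_mask_2, "a woman holding a staff")
```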
Referring to fig. 12, there is shown a block diagram of a virtual character generating apparatus according to an exemplary embodiment of the present application, as shown in fig. 12, the apparatus includes:
an obtaining module 1201, configured to obtain character pose information, a prop mask, and a description text, where the character pose information is used to indicate a pose of a virtual character, the prop mask is used to indicate an outline of a virtual prop and a position of the virtual prop in an image, and the description text is used to describe image content;
a character feature extraction module 1202, configured to perform character feature extraction through a character control network based on the character pose information and the description text, to obtain character conditioning features, where the character conditioning features are used to characterize features that are required to satisfy a condition of the virtual character;
the prop feature extraction module 1203 is configured to extract prop features through a prop control network based on the prop mask and the description text, to obtain prop conditional features, where the prop conditional features are used to characterize features of the virtual prop that are required to satisfy a condition;
and the generating module 1204 is configured to generate, based on the descriptive text, the character conditioning feature and the prop conditioning feature, a virtual character image through a stable diffusion model, where a virtual character in the virtual character image conforms to the character gesture information, an appearance and a position of a virtual prop in the virtual character image conform to the prop mask, and an interaction relationship between the virtual character and the virtual prop conforms to the descriptive text.
Optionally, the generating module 1204 is configured to:
performing feature aggregation on the character conditional feature and the prop conditional feature to obtain an aggregation conditional feature;
the virtual character image is generated by the stable diffusion model based on the descriptive text and the aggregated condition features.
Optionally, the stable diffusion model includes an encoding network and a decoding network, the decoding network includes n decoding blocks, and the prop control network and the character control network each include n encoding blocks;
the generating module 1204 is configured to:
performing feature aggregation on the ith character conditioning feature output by the ith coding block in the character control network and the ith prop conditioning feature output by the ith coding block in the prop control network to obtain an ith aggregation conditioning feature, wherein i is a positive integer smaller than or equal to n;
wherein the ith aggregation conditional characteristic is an input characteristic of an nth-i+1 decoding block.
Optionally, the generating module 1204 is configured to perform feature aggregation on the i-th character conditioning feature and the i-th prop conditioning feature based on the prop mask, to obtain the i-th aggregation conditioning feature.
Optionally, the generating module 1204 is configured to extract, based on the prop mask, an i-th prop region feature of the prop region from the i-th prop conditional features;
and extracting an i-th character region feature outside the prop region from the i-th character conditional features based on a reverse prop mask corresponding to the prop mask, wherein the reverse prop mask is obtained by inverting the prop mask.
Optionally, the generating module 1204 is configured to:
inputting a random noise image and the description text into the coding network of the stable diffusion model to obtain an intermediate noise image characteristic output by the coding network;
based on the intermediate noise image features and the aggregation conditional features, performing noise prediction through the decoding network of the stable diffusion model to obtain prediction noise;
and denoising the random noise image based on the predicted noise to obtain the virtual character image.
Optionally, the encoding network of the stable diffusion model is structurally identical to the character control network and the prop control network.
Optionally, the acquiring module 1201 is configured to:
Acquiring an example image, wherein the example image contains an example virtual character and an example prop;
extracting the gesture of the example virtual character in the example image through a gesture extractor to obtain character gesture information;
performing mask extraction on the example prop in the example image through an image segmenter to obtain the prop mask;
and replacing text contents of the example description text corresponding to the example image based on the input text replacement information to obtain the description text.
Optionally, the apparatus further includes:
the training module is used for acquiring a sample image, sample character pose information, sample prop masks and sample description text, wherein the sample character pose information is used for indicating the pose of a character in the sample image, the sample prop masks are used for indicating the appearance of a sample prop in the sample image and the position of the sample prop in the sample image, and the sample description text is used for describing the image content of the sample image;
the training module is used for training an image generation model based on the sample character posture information, the sample prop mask, the sample descriptive text and the sample image, wherein the image generation model is composed of the character control network, the prop control network and the stable diffusion model.
Optionally, the training module is configured to:
performing label recognition on at least two candidate images to obtain sample element labels of image elements in the candidate images;
screening the sample images with the sample prop element labels and the sample role element labels from at least two candidate images based on the sample element labels;
converting, by a text converter, the sample element labels of the sample image into the sample descriptive text;
extracting the gesture of the character in the sample image through a gesture extractor to obtain gesture information of the sample character;
and carrying out image segmentation on the sample prop in the sample image through an image segmenter to obtain the sample prop mask.
Optionally, the training module is configured to:
screening out a target sample prop label from the sample prop element labels contained in the sample image by a label screening device, wherein the target sample prop label is used for representing props of a target prop type;
selecting the sample prop of the target prop type in the sample image by a prop detector based on the target sample prop label contained in the sample image;
The image segmentation is performed on the sample prop in the sample image by an image segmenter to obtain the sample prop mask, including:
and based on the frame selection condition of the prop element, carrying out image segmentation on the sample prop belonging to the target prop type in the sample image through the image segmenter to obtain the sample prop mask.
Optionally, the parameters of the character control network and the stable diffusion model are frozen in the training process;
the training module is used for: based on the sample character posture information and the sample description text, character feature extraction is carried out through the character control network, and sample character conditioning features are obtained;
based on the sample prop mask and the sample description text, prop feature extraction is carried out through the prop control network, and sample prop conditional features are obtained;
performing feature aggregation on the sample character conditional features and the sample prop conditional features to obtain sample aggregation conditional features;
based on the sample description text and the sample aggregation conditional feature, carrying out noise prediction on the noisy image through the stable diffusion model to obtain sample prediction noise, wherein the noisy image is obtained by adding sample noise to the sample image;
Determining a noise prediction loss based on the sample prediction noise and the sample noise;
training the prop control network based on the noise prediction loss.
In summary, in the embodiment of the present application, in the process of generating the virtual character image, the character control network performs character feature extraction on the character pose information, which describes the pose of the virtual character to be generated, and the description text, so as to obtain the character conditioning features. The prop control network extracts features from the prop mask, which describes the shape and position of the virtual prop to be generated, and the description text, so as to obtain the prop conditioning features. By using the character conditioning features and the prop conditioning features as controls for generating the virtual character image, the virtual character and the virtual prop in the generated image accord with the user's expectations, improving the generation effect of the virtual character image. Because two control networks are added on top of the stable diffusion model to control the generated image, after the prop control network and the character control network respectively obtain the prop conditioning features and the character conditioning features, the two are aggregated into the aggregated conditioning features, and the stable diffusion model generates the virtual character image from the aggregated conditioning features and the description text. By introducing the character control network and the prop control network to respectively control the pose of the virtual character and the appearance and position of the virtual prop, the virtual character and the virtual prop achieve a better interaction effect.
Referring to fig. 13, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown; the computer device may be implemented as the terminal or the server in the foregoing embodiments. Specifically, the computer apparatus 1300 includes a central processing unit (Central Processing Unit, CPU) 1301, a system memory 1304 including a random access memory 1302 and a read only memory 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic Input/Output system (I/O system) 1306 to facilitate the transfer of information between the various devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
In some embodiments, the basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown), such as a hard disk or drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes random access Memory (Random Access Memory, RAM), read Only Memory (ROM), flash Memory or other solid state Memory technology, compact disk (Compact Disc Read-Only Memory, CD-ROM), digital versatile disk (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 1304 and mass storage device 1307 described above may be referred to collectively as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1301 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1300 may also operate by being connected to a remote computer on a network, such as the Internet. I.e., the computer device 1300 may be connected to the network 1312 via a network interface unit 1311 coupled to the system bus 1305, or alternatively, the network interface unit 1311 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs stored in the memory, the one or more programs including steps for performing the methods provided by the embodiments of the present application, as performed by the computer device.
The embodiment of the application further provides a computer readable storage medium, where at least one instruction, at least one section of program, a code set, or an instruction set is stored, where at least one instruction, at least one section of program, a code set, or an instruction set is loaded and executed by a processor to implement the method for generating a virtual character according to any one of the embodiments above.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the virtual character generating method provided in the above aspect.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing related hardware, and the program may be stored in a computer readable storage medium, which may be a computer readable storage medium included in the memory of the above embodiments; or may be a computer-readable storage medium, alone, that is not incorporated into the terminal. The computer readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement a method for generating a virtual character according to any of the method embodiments described above.
Alternatively, the computer-readable storage medium may include: ROM, RAM, solid state disk (Solid State Drives, SSD), or optical disk, etc. The RAM may include resistive random access memory (Resistance Random Access Memory, reRAM) and dynamic random access memory (Dynamic Random Access Memory, DRAM), among others. The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
It should be understood that references herein to "a plurality" are to two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. And references herein to "first," "second," etc. are used to distinguish similar objects and are not intended to limit a particular order or sequence. In addition, the step numbers described herein are merely exemplary of one possible execution sequence among steps, and in some other embodiments, the steps may be executed out of the order of numbers, such as two differently numbered steps being executed simultaneously, or two differently numbered steps being executed in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present application shall fall within its protection scope.

Claims (16)

1. A method for generating a virtual character, the method comprising:
acquiring character posture information, a prop mask and a description text, wherein the character posture information is used for indicating the posture of a virtual character, the prop mask is used for indicating the appearance of the virtual prop and the position of the virtual prop in an image, and the description text is used for describing the image content of the image of the virtual character to be generated;
based on the character posture information and the description text, character feature extraction is carried out through a character control network to obtain character conditioning features, wherein the character conditioning features are used for representing features of the virtual character required to meet conditions;
based on the prop mask and the description text, prop feature extraction is carried out through a prop control network, so that prop conditional features are obtained, and the prop conditional features are used for representing features of the virtual prop, wherein the features are required to meet conditions;
Generating a virtual character image through a stable diffusion model based on the descriptive text, the character conditioning feature and the prop conditioning feature, wherein a virtual character in the virtual character image accords with the character gesture information, the appearance and the position of a virtual prop in the virtual character image accord with the prop mask, and the interaction relationship between the virtual character and the virtual prop accords with the descriptive text.
2. The method of claim 1, wherein the generating a virtual character image by a stable diffusion model based on the descriptive text, the character conditioning feature, and the prop conditioning feature comprises:
performing feature aggregation on the character conditional feature and the prop conditional feature to obtain an aggregation conditional feature;
the virtual character image is generated by the stable diffusion model based on the descriptive text and the aggregated condition features.
3. The method of claim 2, wherein the stable diffusion model comprises an encoding network and a decoding network, the decoding network comprising n decoding blocks, the prop control network and the character control network each comprising n encoding blocks;
Performing feature aggregation on the character conditioning feature and the prop conditioning feature to obtain an aggregation conditioning feature, wherein the feature aggregation comprises the following steps:
performing feature aggregation on the ith character conditioning feature output by the ith coding block in the character control network and the ith prop conditioning feature output by the ith coding block in the prop control network to obtain an ith aggregation conditioning feature, wherein i is a positive integer smaller than or equal to n;
wherein the ith aggregation conditional characteristic is an input characteristic of an nth-i+1 decoding block.
4. A method according to claim 3, wherein feature aggregation is performed on the i-th character conditioning feature output by the i-th coding block in the character control network and the i-th prop conditioning feature output by the i-th coding block in the prop control network to obtain an i-th aggregation conditioning feature, and the method comprises:
and based on the prop mask, performing feature aggregation on the ith character conditional feature and the ith prop conditional feature to obtain the ith aggregation conditional feature.
5. The method of claim 4, wherein feature aggregating the i-th character-conditioned feature and the i-th prop-conditioned feature based on the prop mask to obtain the i-th aggregate-conditioned feature, comprising:
Extracting an ith prop region feature of a prop region from the ith prop conditional features based on the prop mask;
extracting an ith character region feature outside the prop region from the ith character conditional features based on a reverse prop mask corresponding to the prop mask, wherein the reverse prop mask is obtained by inverting the prop mask;
and performing feature aggregation on the i prop region features and the i character region features to obtain the i aggregation conditional features.
6. A method according to claim 3, wherein said generating a virtual character image by said stable diffusion model based on said descriptive text and said aggregated conditioning features comprises:
inputting a random noise image and the description text into the coding network of the stable diffusion model to obtain an intermediate noise image characteristic output by the coding network;
based on the intermediate noise image features and the aggregation conditional features, performing noise prediction through the decoding network of the stable diffusion model to obtain prediction noise;
and denoising the random noise image based on the predicted noise to obtain the virtual character image.
7. A method according to claim 3, wherein the coding network of the stable diffusion model is structurally identical to the character control network and the prop control network.
8. The method of claim 1, wherein the obtaining character pose information, prop masks, and descriptive text comprises:
acquiring an example image, wherein the example image contains an example virtual character and an example prop;
extracting the gesture of the example virtual character in the example image through a gesture extractor to obtain character gesture information;
performing mask extraction on the example prop in the example image through an image segmenter to obtain the prop mask;
and replacing text contents of the example description text corresponding to the example image based on the input text replacement information to obtain the description text.
9. The method according to claim 1, characterized in that the method comprises:
acquiring a sample image, sample character pose information, a sample prop mask and a sample description text, wherein the sample character pose information is used for indicating the pose of a character in the sample image, the sample prop mask is used for indicating the appearance of a sample prop in the sample image and the position of the sample prop in the sample image, and the sample description text is used for describing the image content of the sample image;
And training an image generation model based on the sample character pose information, the sample prop mask, the sample descriptive text and the sample image, wherein the image generation model is composed of the character control network, the prop control network and the stable diffusion model.
10. The method of claim 9, wherein the acquiring the sample image, the sample character pose information, the sample prop mask, and the sample descriptive text comprises:
performing label recognition on at least two candidate images to obtain sample element labels of image elements in the candidate images;
screening the sample images with the sample prop element labels and the sample role element labels from at least two candidate images based on the sample element labels;
converting, by a text converter, the sample element labels of the sample image into the sample descriptive text;
extracting the gesture of the character in the sample image through a gesture extractor to obtain gesture information of the sample character;
and carrying out image segmentation on the sample prop in the sample image through an image segmenter to obtain the sample prop mask.
11. The method of claim 10, wherein before the image segmentation is performed on the sample prop in the sample image by the image segmenter to obtain the sample prop mask, the method further comprises:
screening a target sample prop label from the sample prop element labels contained in the sample image by a label screening device, wherein the target sample prop label is used for representing the sample prop of a target prop type;
selecting the sample prop of the target prop type in the sample image by a prop detector based on the target sample prop label contained in the sample image;
the image segmentation is performed on the sample prop in the sample image by an image segmenter to obtain the sample prop mask, including:
based on the frame selection condition of the sample prop, image segmentation is carried out on the sample prop belonging to the target prop type in the sample image through the image segmenter, so as to obtain the sample prop mask.
12. The method of claim 9, wherein parameters of the character control network and the stable diffusion model are frozen during training;
The training image generation model based on the sample character pose information, the sample prop mask, the sample descriptive text, and the sample image comprises:
based on the sample character posture information and the sample description text, character feature extraction is carried out through the character control network, and sample character conditioning features are obtained;
based on the sample prop mask and the sample description text, prop feature extraction is carried out through the prop control network, and sample prop conditional features are obtained;
performing feature aggregation on the sample character conditional features and the sample prop conditional features to obtain sample aggregation conditional features;
based on the sample description text and the sample aggregation conditional feature, carrying out noise prediction on a noisy image through the stable diffusion model to obtain sample prediction noise, wherein the noisy image is obtained by adding sample noise to the sample image;
determining a noise prediction loss based on the sample prediction noise and the sample noise;
training the prop control network based on the noise prediction loss.
13. An apparatus for generating a virtual character, the apparatus comprising:
The system comprises an acquisition module, a description text and a display module, wherein the acquisition module is used for acquiring character gesture information, a prop mask and a description text, the character gesture information is used for indicating the gesture of a virtual character, the prop mask is used for indicating the appearance of the virtual prop and the position of the virtual prop in an image, and the description text is used for describing the content of the image;
the character feature extraction module is used for extracting character features through a character control network based on the character posture information and the description text to obtain character conditioning features, wherein the character conditioning features are used for characterizing features of the virtual character required to meet conditions;
the prop feature extraction module is used for extracting prop features through a prop control network based on the prop mask and the description text to obtain prop conditional features, wherein the prop conditional features are used for representing features of the virtual prop, which are required to meet conditions;
the generating module is used for generating a virtual character image through a stable diffusion model based on the descriptive text, the character conditioning characteristics and the prop conditioning characteristics, wherein a virtual character in the virtual character image accords with the character gesture information, the appearance and the position of a virtual prop in the virtual character image accord with the prop mask, and the interaction relation between the virtual character and the virtual prop accords with the descriptive text.
14. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the method of generating a virtual character according to any one of claims 1 to 12.
15. A computer readable storage medium, wherein at least one program is stored in the readable storage medium, and the at least one program is loaded and executed by a processor to implement the method for generating a virtual character according to any one of claims 1 to 12.
16. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, which processor executes the computer instructions to implement the method of generating a virtual character according to any of claims 1 to 12.
CN202311373791.8A 2023-10-20 2023-10-20 Virtual character generation method, device, equipment and storage medium Pending CN117541668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311373791.8A CN117541668A (en) 2023-10-20 2023-10-20 Virtual character generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311373791.8A CN117541668A (en) 2023-10-20 2023-10-20 Virtual character generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117541668A true CN117541668A (en) 2024-02-09

Family

ID=89783113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311373791.8A Pending CN117541668A (en) 2023-10-20 2023-10-20 Virtual character generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117541668A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071887A (en) * 2024-04-17 2024-05-24 腾讯科技(深圳)有限公司 Image generation method and related device


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication