CN115155058B - Face pinching method, face pinching system and storage medium - Google Patents


Info

Publication number
CN115155058B
Authority
CN
China
Prior art keywords
natural language
face
image
language description
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211081372.2A
Other languages
Chinese (zh)
Other versions
CN115155058A (en)
Inventor
华菁云
王宇龙
马超
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority to CN202211081372.2A priority Critical patent/CN115155058B/en
Publication of CN115155058A publication Critical patent/CN115155058A/en
Application granted granted Critical
Publication of CN115155058B publication Critical patent/CN115155058B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/50: Controlling the output signals based on the game progress
    • A63F 13/52: Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the field of natural language processing, and in particular to a face pinching method, a face pinching system and a storage medium. The face pinching method comprises the following steps: acquiring a natural language description of a target face image; randomly generating a group of face images; calculating the correlation between the natural language description and each image in the group of face images; and screening out the face images whose correlation is higher than a preset threshold value, taking the face image with the highest correlation as a first target face image. With this method, a user can obtain the imagined target face image simply by inputting a natural language description, so the operation is simple. In addition, because the correlation between each face image and the natural language description is taken into account when acquiring the first target face image, the acquired first target face image matches the description better. The invention also provides a face pinching system and a storage medium, which have the same beneficial effects as the face pinching method.

Description

Face pinching method, face pinching system and storage medium
Technical Field
The invention relates to the technical field of face image generation, and in particular to a face pinching method, a face pinching system and a storage medium.
Background
Currently, when a user enters a game, logs in to a website or enters a metaverse space and wants an imagined face image as an avatar, the avatar must be generated through a control panel full of complex slider bars. For ordinary users this operation is complicated, and the resulting avatar often differs greatly from what the user imagined.
Disclosure of Invention
To reduce the difficulty of face pinching, the invention provides a face pinching method, a face pinching system and a storage medium.
To solve the above technical problem, the invention provides a face pinching method comprising the following steps:
acquiring a natural language description of a target face image, wherein the natural language description comprises a natural language description in a speech modality;
randomly generating a group of false face images;
calculating the correlation between the natural language description and each image in the group of false face images based on a preset multi-modal two-tower architecture model, wherein the multi-modal two-tower architecture model comprises a text encoder and an image encoder and is obtained by pre-training on massive paired image and natural language data;
screening out the face images whose correlation is higher than a preset threshold value, and taking the face image with the highest correlation as a first target face image;
judging whether a new natural language description of the portrait exists; and
if so, modifying the first target face image based on the new natural language description through a preset large-scale pre-trained multi-modal model to obtain a second target face image.
Preferably, the group of false face images consists of 128 face images.
Preferably, the natural language description further comprises a natural language description in a text modality.
Preferably, if the natural language description is in a speech modality, it is converted into a natural language description in a text modality by a speech recognition model.
Preferably, the step of randomly generating a group of false face images comprises:
randomly generating the group of false face images through the generator of a generative adversarial network.
Preferably, the step of calculating the correlation between the natural language description and each image in the group of false face images is followed by:
judging whether the correlation between the natural language description and each image in the group of false face images exceeds a preset threshold value; and
if no correlation exceeds the preset threshold value, regenerating a new group of false face images.
To solve the above technical problem, the invention further provides a face pinching system for implementing the above face pinching method, comprising an input module, a portrait generating module, a text and image matching module and an optimization module; the text and image matching module is in signal connection with the input module and the portrait generating module respectively; the optimization module is in signal connection with the input module and the text and image matching module respectively; and the text and image matching module comprises a comparison module and a multi-modal two-tower architecture model;
the input module is used for acquiring the natural language description;
the portrait generating module is used for randomly generating a group of false face images;
the text and image matching module is used for acquiring the first target face image; the multi-modal two-tower architecture model comprises a text encoder and an image encoder, is obtained by pre-training on massive paired image and natural language data, and is used for calculating the correlation between the natural language description and each image in the group of false face images; the comparison module is used for comparing whether the correlation between the natural language description and each image in the group of false face images is greater than a preset threshold value;
the optimization module is used for modifying the first target face image to obtain a second target face image.
Preferably, the input module comprises a speech-to-text module for converting a natural language description in a speech modality into a natural language description in a text modality.
The invention further provides a storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above face pinching method.
Compared with the prior art, the face pinching method, the face pinching system and the storage medium of the present invention have the following advantages:
1. The face pinching method comprises the following steps: acquiring a natural language description of a target face image; randomly generating a group of face images; calculating the correlation between the natural language description and each image in the group of face images; and screening out the face images whose correlation is higher than a preset threshold value, taking the one with the highest correlation as a first target face image. It can be understood that with the face pinching method of the present invention, a user can generate and adjust a face image by directly inputting a natural language description and obtain the desired face image (the first or second target face image) without pinching the face through a control panel full of complex slider bars, which greatly reduces the difficulty of face pinching and lowers the user's operation threshold. In addition, because the correlation with the natural language description of the target face image is taken into account when acquiring the first target face image, generation efficiency is improved; the first target face image ends up very close to the natural language description of the target face image, which greatly improves the generation effect and makes it possible that the first target face image is already the face image the user wants.
2. After screening the face images whose correlation is higher than the preset threshold value and taking the face image with the highest correlation as the first target face image, the invention further comprises the following steps: judging whether a new natural language description of the portrait exists; and if so, acquiring the new natural language description and modifying the first target face image based on it to obtain a second target face image. It can be understood that when the first target face image does not meet the user's expectations, the user inputs a new natural language description of the portrait to modify it into the expected face image, namely the second target face image, which greatly improves the likelihood that the user obtains the expected face image.
3. The natural language description of the invention comprises descriptions in both a speech modality and a text modality, so a user can generate and adjust the face image either by typing a textual description of the portrait or by simply speaking the description aloud. The voice interaction form provided by the invention is therefore a more convenient and flexible way to generate and adjust a portrait; when a user wants to construct an imagined avatar, such as a customer service image or customer service head portrait used in software, the desired avatar can be generated through voice interaction without any threshold, making the interaction convenient, threshold-free, flexible and easy to use.
4. The step of randomly generating a group of face images comprises: randomly generating the group of face images through the generator of a generative adversarial network. By using GAN-generated images as false face images, the invention avoids the problem of infringing portrait rights.
5. The step of calculating the correlation between the natural language description and each image in the group of face images computes the correlations one by one through the multi-modal two-tower architecture model, which improves both the generation efficiency and the generation effect of the first target face image.
6. The steps after calculating the correlation between the natural language description and each image in the group of face images further comprise: judging whether any correlation exceeds a preset threshold value; and if none does, regenerating a new group of face images. The invention thus randomly generates groups of face images repeatedly until an image whose correlation with the natural language description of the target face image exceeds the preset threshold value is obtained and taken as the first target face image; if several face images in a group exceed the threshold, the one with the highest correlation is taken as the first target face image, which can greatly improve the generation effect of the first target face image.
7. The invention also provides a face pinching system, which has the same beneficial effects as the face pinching method, and is not repeated herein.
8. The present invention further provides a storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above-mentioned face-pinching method, and has the same beneficial effects as the above-mentioned face-pinching method, and the description thereof is omitted here.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart illustrating steps of a face-pinching method according to a first embodiment of the present invention.
Fig. 2 is a block diagram of a face-pinching system provided by a second embodiment of the present invention.
The attached drawings indicate the following:
1. a face-pinching system;
10. an input module; 20. a portrait generating module; 30. a text and image matching module; 40. an optimization module;
100. a voice-to-text module; 300. and a comparison module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and implementation examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The terms "vertical," "horizontal," "left," "right," "up," "down," "left up," "right up," "left down," "right down," and the like as used herein are for illustrative purposes only.
Referring to fig. 1, a first embodiment of the present invention provides a face-pinching method, including the following steps:
s1, acquiring natural language description of a target face image;
s2, randomly generating a group of face images;
s3, calculating the correlation between the natural language description and each image in a group of face images;
and S4, screening the face image with the correlation higher than a preset threshold value, and taking the face image with the highest correlation as a first target face image.
It can be understood that, in the present invention, steps S1 and S2 are not order-dependent and may also be performed simultaneously.
It can be understood that the actual semantics of the input natural language description are taken into account when acquiring the first target face image. This improves the generation efficiency of the first target face image, and because the first target face image ends up very close to the natural language description of the target face image, the generation effect is greatly improved; the first target face image may already be the face image the user wants, in which case the flow can end without performing the subsequent steps.
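The screening in steps S3 and S4 can be sketched as follows. This is a minimal numpy sketch assuming the two-tower model has already produced one text embedding and a batch of image embeddings; cosine similarity as the "correlation" measure and the function names are illustrative assumptions, and the 0.8 threshold follows the description below.

```python
import numpy as np

def correlations(text_emb, image_embs):
    """Cosine similarity between one text embedding and a batch of
    image embeddings (both assumed to come from the two-tower encoders)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return v @ t

def select_first_target(text_emb, image_embs, threshold=0.8):
    """Steps S3-S4: return the index of the best-matching image,
    or None if no image clears the preset threshold."""
    scores = correlations(text_emb, image_embs)
    if scores.max() < threshold:
        return None  # no candidate is close enough; regenerate (step S32)
    return int(np.argmax(scores))
```

Returning `None` rather than a weak match is what triggers the regeneration loop of steps S31/S32 described further below.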
Further, if the first target face image is not the face image desired by the user, it can be modified to obtain the desired image. Therefore, after screening the face images whose correlation is higher than the preset threshold value and taking the face image with the highest correlation as the first target face image, the method further comprises the following steps:
s5, judging whether a new natural language description of the portrait exists or not;
and S6, if so, acquiring new natural language description of the portrait and modifying the first target face image based on the new natural language description of the portrait to obtain a second target face image.
It can be understood that when the user is not satisfied with the first target face image, the user inputs a new natural language description of the portrait; the system recognizes this new description and performs the subsequent modification operation.
Further, modifying the first target face image based on the new natural language description may involve one or more modifications; if the first target face image is modified multiple times, a new natural language description must be acquired again before each modification. Therefore, steps S5 and S6 can be repeated an unlimited number of times until the second target face image desired by the user is finally obtained. It will be appreciated that the user enters a different new natural language description of the portrait each time.
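The repetition of steps S5 and S6 is a simple loop. In the sketch below, `get_new_description` and `apply_edit` are hypothetical callables standing in for the input module and the pre-trained multi-modal editing model respectively.

```python
def refine_until_satisfied(first_target, get_new_description, apply_edit):
    """Steps S5-S6, repeated: keep applying user-supplied edits until
    no new natural language description of the portrait arrives."""
    image = first_target
    while True:
        description = get_new_description()  # S5: is there a new description?
        if description is None:              # user is satisfied; stop
            return image
        image = apply_edit(image, description)  # S6: produce the next target
```

Each round consumes a fresh description, matching the statement that a new description must be acquired before every modification.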
Further, the steps after S3 further include:
s31, judging whether the correlation between the natural language description and each image in a group of face images exceeds a preset threshold value or not;
and S32, if the correlation between the natural language description and each image in the group of face images does not exceed a preset threshold value, regenerating a group of new face images.
It can be understood that when the correlation between every face image in the group and the natural language description is below the preset threshold value, a new group of face images is regenerated and step S3 is executed again, until the flow proceeds to step S4.
Further, the natural language description includes a natural language description in a speech modality and a natural language description in a text modality. The speech-modality description is the description of the portrait spoken directly by the user, and the text-modality description is the description of the portrait entered by the user as text.
It can be understood that when a user entering a game, logging in to a website or entering a metaverse space wants an imagined face image as an avatar, the face pinching method provided by the invention can generate and adjust the face image through voice interaction. Compared with operating a control panel full of complex slider bars, this is more convenient, more flexible and easier to operate; when a user wants to construct an imagined avatar, such as a game avatar or the customer service image or customer service head portrait used in software, the desired avatar can be generated through the voice interaction form of the invention without any threshold. In addition, the invention supports text input as well as voice, so when voice is inconvenient, the user can generate and adjust the face image by typing to obtain the desired face image.
Specifically, when the natural language description is in a speech modality, it is first converted into a natural language description in a text modality by an ASR (automatic speech recognition) model, and the subsequent steps then process the text, which is convenient for the computer.
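The modality routing described here can be sketched as follows. The `transcribe` method is an assumed interface for whichever ASR model is plugged in, not the API of any specific library.

```python
def to_text_modality(user_input, asr_model):
    """Convert a speech-modality description to text; pass text through.
    `asr_model` is any object exposing transcribe(audio) -> str
    (an assumed interface for the ASR model)."""
    if isinstance(user_input, (bytes, bytearray)):  # raw audio bytes
        return asr_model.transcribe(user_input)
    return user_input  # already a text-modality description
```

After this step, the rest of the pipeline only ever sees text, which is why the two-tower model needs only a single text encoder.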
Further, step S2 specifically randomly generates a group of face images through the generator of a generative adversarial network, wherein the group consists of 128 face images.
It will be appreciated that the generative adversarial network generator of the present invention needs to be pre-trained in advance. A generative adversarial network model comprises two parts: a generator and a discriminator. The generator generates fake pictures from random noise, and the discriminator distinguishes fake pictures from real pictures. The invention trains the generative adversarial network model on a data set of real pictures, but during implementation and use of the system only the generator part is used to generate false faces, so the face images in the invention are all false face images. Therefore, the face pinching method does not need to hire real models for photographs, does not infringe portrait rights, and is convenient to interact with, threshold-free, flexible and easy to use.
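At inference time only the generator half is used, mapping random noise to images. The linear "generator" below is a toy stand-in for a real pre-trained GAN generator; the shapes, the latent dimension and the tanh output range are illustrative assumptions.

```python
import numpy as np

def generate_false_faces(weights, batch_size=128, latent_dim=512, seed=0):
    """Sample a batch of false face images from random noise.
    `weights` stands in for the pre-trained generator's parameters."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((batch_size, latent_dim))  # random noise input
    return np.tanh(z @ weights)  # toy 'images', pixel values in (-1, 1)

# Example: a toy generator producing 64-dimensional 'images'.
w = np.random.default_rng(1).standard_normal((512, 64)) * 0.01
faces = generate_false_faces(w)
```

The discriminator exists only during training; discarding it at deployment is what guarantees every output face is synthetic.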
Further, step S3 specifically calculates, one by one through the multi-modal two-tower architecture model, the correlation between the natural language description and each image in the group of face images.
Further, the multi-modal two-tower architecture model is pre-trained on massive paired image and natural language data. The training process is roughly as follows: the data set is a massive image-text aligned data set, i.e., the natural language in each sample is an appropriate description of the corresponding image content, and the model input is text-modality natural language and a false face image. A batch (batch size 128) of text-modality natural language is converted into text embeddings by the text encoder, and a batch of face images is converted into image embeddings by the image encoder; the training loss is the cross entropy between the text embeddings and the image embeddings. The model is trained by back-propagation until the loss converges.
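The training objective described above (cross entropy between text and image embeddings over a batch of aligned pairs) can be sketched as the symmetric contrastive loss used in CLIP-style two-tower pretraining. The temperature value is an illustrative assumption.

```python
import numpy as np

def two_tower_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric cross entropy over the batch similarity matrix.
    Row i of each input is a matched text-image pair, so the target
    class for row i of the logits is i."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature  # (batch, batch) similarities
    n = len(logits)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pushes each text embedding toward its paired image embedding and away from the other images in the batch, which is exactly what makes the cosine-similarity "correlation" of step S3 meaningful.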
Further, in order to make the first target face image closer to the natural language description of the target face image, the preset threshold value in step S4 of the present invention is 0.8.
Further, the specific steps of step S6 include:
s61, acquiring new natural language description of the portrait based on the difference between the first target face image and the second target face image;
and S62, modifying the first target face image according to the new natural language description of the image based on the large-scale pre-training multi-mode model to obtain a second target face image.
It can be understood that the new natural language description of the portrait is content input by the user, and acquired by the computer, for the purpose of modifying the first target face image; it is acquired only when the user wants to make a modification, and the user formulates it based on the difference between the first target face image and the desired second target face image.
The large-scale pre-trained multi-modal model is a pre-trained vision-language model. An image-text loss function is designed based on this large-scale vision-language pre-trained model, and the image is iteratively optimized through gradient back-propagation so that it better matches the face description input by the user.
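The optimization step can be sketched with a toy differentiable surrogate: gradient descent pulls the current image (here represented by its embedding) toward the target description. In the real system the loss comes from the vision-language model and the gradient flows back into the image or its latent code; the quadratic loss, learning rate and step count below are illustrative stand-ins.

```python
import numpy as np

def optimize_toward_description(image_emb, text_emb, lr=0.1, steps=200):
    """Iteratively refine the image embedding by gradient descent on a
    toy loss ||x - t||^2, a stand-in for the image-text loss described."""
    x = np.asarray(image_emb, dtype=float).copy()
    for _ in range(steps):
        grad = 2.0 * (x - text_emb)  # gradient of the surrogate loss
        x -= lr * grad               # back-propagation style update
    return x
```

The same loop structure applies whether one optimizes pixels directly or a GAN latent code; only the loss and the variable being updated change.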
It can be understood that the face pinching method of the present invention can modify the first target face image an unlimited number of times until a second target face image meeting the user's expectations is finally obtained.
It can be understood that the face pinching method provided by the invention does not attempt to generate the user's desired face image directly from a single natural language description; instead, it supports modifying the first target face image an unlimited number of times through repeated natural language inputs until a face image meeting the user's expectations is obtained, which is closer to how users actually pinch faces in games, the metaverse and similar scenes. In addition, with the face pinching method provided by the invention, the user can modify the first target face image simply by speaking the desired change aloud, which lowers the technical threshold of face pinching, makes the operation simple, and makes it easier to obtain a face image meeting the user's expectations.
Illustratively, a user who wants a particular boy's face image can obtain it through the following steps:
The user inputs a natural language description of the target face image, for example: "I want a face image of a curly-haired, sunny boy." The system outputs a face image of a curly-haired sunny boy, which is the first target face image. At this point, if the user feels this is the desired face image, the flow can end without proceeding to the next step; if not, the user can modify it as follows:
The user inputs a new natural language description of the portrait, for example: "I want him to look angry." The system outputs a face image of a curly-haired angry boy. If the user feels this is the desired face image, it is the second target face image and the flow ends; otherwise the user can continue modifying it.
To continue modifying, the user inputs another new natural language description, for example: "I want him to smile." The system outputs a face image of a curly-haired smiling boy. Similarly, if this is the desired face image, it is the second target face image and the flow ends; otherwise the user continues modifying it in the same way until the desired face image, namely the second target face image, is finally obtained.
It will be appreciated that the modification of the first target face image may be performed an unlimited number of times until a target face image meeting the user's expectations is ultimately obtained.
It is understood that modifications to the face image include modifications to the facial features, hairstyle, skin tone, expression, hair color, makeup, headwear, and the like.
It can be understood that the content input by the user can be voice content input by the user in a speaking mode or text content input by the user in a typing mode, so that the user operation threshold is greatly reduced.
In conclusion, the face pinching method provided by the invention has a low threshold, is simple for users to operate, obtains the user's desired face image efficiently, and produces a face image closer to what the user imagined. In addition, since the generated face image is a false image, no infringement of portrait rights occurs.
Referring to fig. 2, a second embodiment of the present invention provides a face-pinching system 1, which includes an input module 10, a portrait generating module 20, a text and image matching module 30, and an optimizing module 40; the input module 10 is in signal connection with the text and image matching module 30 and the optimization module 40, and the text and image matching module 30 is in signal connection with the portrait generation module 20 and the optimization module 40.
The input module 10 is used for acquiring a natural language description; the portrait generating module 20 is used for randomly generating a group of face images; the text and image matching module 30 is used for acquiring a first target face image; the optimization module 40 is configured to modify the first target face image to output a second target face image.
Further, the input module 10 includes a speech-to-text module 100, and the speech-to-text module 100 is configured to convert the natural language description of the speech modality into a natural language description of the text modality, so as to facilitate subsequent processing by the computer.
Further, the text and image matching module 30 includes a comparison module 300, and the comparison module 300 is configured to compare whether the correlation between the natural language description of the target face image and each image in the set of face images is greater than a preset threshold.
It is understood that the user can input the natural language description of the portrait by voice or text into the input module 10, and when the user inputs the natural language description of the portrait by voice, the inputted voice content is converted into the natural language description of the text modality by the voice-to-text module 100.
Further, an ASR (automatic speech recognition) model is provided in the speech-to-text module 100; a pre-trained generative adversarial network (GAN) generator is provided in the portrait generating module 20; a pre-trained multi-modal two-tower architecture model is provided in the text and image matching module 30; and a pre-trained large-scale multi-modal model is provided in the optimization module 40.
Further, the operation of the face-pinching system 1 of the present invention is substantially as follows:
First, the input module 10 outputs the natural language description of the text modality of the target face image, and the generative adversarial network generator in the portrait generating module 20 generates a group of face images, where one group is 128 images.
It is understood that when the user inputs a natural language description of the speech modality, the input module 10 converts it into a natural language description of the text modality through the ASR model in the speech-to-text module 100 and outputs the converted description.
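As a rough illustration of this dispatch step, the sketch below routes speech-modality input through a transcription step so that downstream modules always see a text-modality description. The function `normalise_description` and the stand-in `fake_asr` are hypothetical names for illustration only, not part of the invention's actual ASR model:

```python
def normalise_description(user_input, transcribe):
    """Route speech-modality input through an ASR step so that downstream
    modules always receive a text-modality natural language description."""
    if isinstance(user_input, bytes):    # raw audio waveform -> ASR model
        return transcribe(user_input)
    return user_input                    # already a text-modality description

# Toy stand-in for the ASR model inside the speech-to-text module.
fake_asr = lambda audio: "a curly haired boy"

print(normalise_description(b"\x01\x02", fake_asr))        # a curly haired boy
print(normalise_description("a curly haired boy", fake_asr))  # a curly haired boy
```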
Then, the text and image matching module 30 uses the multi-modal two-tower architecture model to calculate, one by one, the correlation between each face image in the group generated by the portrait generating module 20 and the text-modality natural language description of the target face image output by the input module 10. When the correlation of at least one face image with the description exceeds a preset threshold, here 0.8, the face image with the highest correlation is selected and output as the first target face image. Otherwise, the portrait generating module 20 regenerates a group of face images, again 128, and the text and image matching module 30 recalculates the correlations in the same way until the first target face image is output.
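A minimal sketch of this matching step, assuming CLIP-style two-tower embeddings where correlation is cosine similarity. The embeddings, dimensions, and the `select_best_match` helper below are illustrative stand-ins, not the invention's actual encoders:

```python
import numpy as np

THRESHOLD = 0.8   # preset correlation threshold from the description
BATCH = 128       # one group of randomly generated face images

def select_best_match(text_emb, image_embs, threshold=THRESHOLD):
    """Return (index, score) of the image most correlated with the text
    description, or None if no image exceeds the threshold (which would
    trigger regenerating a new group of faces)."""
    # Normalise so the dot product is cosine similarity, as in a
    # two-tower text/image model.
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t                        # cosine similarities
    best = int(np.argmax(scores))
    if scores[best] <= threshold:
        return None
    return best, float(scores[best])

# Toy embeddings standing in for the text/image encoder outputs.
rng = np.random.default_rng(0)
images = rng.normal(size=(BATCH, 64))
text = images[37] + rng.normal(scale=0.01, size=64)   # image 37 nearly matches
result = select_best_match(text, images)
print(result[0])   # -> 37, the index of the best-matching face image
```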
Finally, the optimization module 40 may modify the first target face image one or more times through the large-scale pre-trained multi-modal model, including modifying details such as the facial features, hairstyle, and expression, until a face image meeting the user's expectation, that is, the second target face image, is output.
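The iterative modification loop can be sketched as follows, with toy stand-ins for the user feedback and for the large-scale multi-modal editing model. The names `refine`, `next_description`, and `edit_model` are hypothetical, chosen for illustration:

```python
def refine(image, next_description, edit_model):
    """Repeatedly apply natural-language edits until no new description
    is supplied, i.e. the user accepts the current image."""
    while True:
        description = next_description(image)
        if description is None:
            return image                    # the second target face image
        image = edit_model(image, description)

# Toy stand-ins: an "image" is a set of attribute tags, an edit adds one.
edits = iter(["curly hair", "smiling"])

def next_description(image):
    return next(edits, None)                # None signals user acceptance

def edit_model(image, description):
    return image | {description}

final = refine({"boy"}, next_description, edit_model)
print(sorted(final))   # ['boy', 'curly hair', 'smiling']
```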
It can be understood that the face-pinching system 1 provided in the second embodiment of the present invention may cooperate with the face-pinching method provided in the first embodiment to implement the face-pinching process. The face-pinching system 1 has the same beneficial effects as the face-pinching method of the first embodiment, which are not described again here.
Further, a third embodiment of the present invention provides a storage medium having a computer program stored thereon, the computer program, when being executed by a processor, implementing the face-pinching method provided by the first embodiment of the present invention. It can be understood that the storage medium provided in the third embodiment of the present invention has the same beneficial effects as the face-pinching method provided in the first embodiment of the present invention, and details are not described herein.
It will be appreciated that the processes described above with reference to the flowcharts may be implemented as computer software programs, in accordance with the disclosed embodiments of the invention. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program performs the above-mentioned functions defined in the method of the present application when executed by a Central Processing Unit (CPU). It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, from which B can be determined. It should also be understood, however, that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Compared with the prior art, the face pinching method, the face pinching system and the storage medium have the following advantages:
1. The face pinching method comprises the following steps: acquiring a natural language description of a target face image; randomly generating a group of face images; calculating the correlation between the natural language description and each image in the group of face images; and screening the face images whose correlation is higher than a preset threshold, taking the face image with the highest correlation as the first target face image. It can be understood that, with the face-pinching method of the present invention, a user can generate and adjust a face image by directly inputting a natural language description to obtain the desired face image, namely the first target face image or the second target face image, without pinching the face through a control panel full of complex slider bars, which greatly reduces the difficulty of face pinching and lowers the user operation threshold. In addition, because the correlation between the candidate images and the natural language description of the target face image is taken into account when obtaining the first target face image, the invention not only improves the generation efficiency of the first target face image but also greatly improves its generation effect: since the first target face image closely matches the natural language description of the target face image, it may already be exactly the face image the user wants.
2. After screening the face images whose correlation is higher than the preset threshold and taking the face image with the highest correlation as the first target face image, the invention further comprises the steps of: judging whether a new natural language description of the portrait exists; and if so, acquiring the new natural language description of the portrait and modifying the first target face image based on it to obtain a second target face image. It can be understood that, when the first target face image does not meet the user's expectation, the user inputs a new natural language description of the portrait to modify the first target face image and obtain the expected face image, namely the second target face image, which greatly improves the possibility that the user obtains the desired face image.
3. The natural language description of the invention includes a natural language description of the speech modality and a natural language description of the text modality. Therefore, a user can generate and adjust a face image by typing a textual description of the portrait, or can directly speak the description aloud. Generating and adjusting the portrait through the voice interaction provided by the invention is thus a more convenient and flexible approach: when a user wants to construct an imagined avatar, such as a customer service image or a customer service head portrait used in software, the desired portrait can be generated through voice interaction without any threshold, making the interaction convenient, threshold-free, flexible, and easy to use.
4. The step of randomly generating a group of face images comprises:
randomly generating a group of face images by a generator in a generative adversarial network. By using a group of face images generated by the generator in the generative adversarial network as false (synthetic) face images, the invention avoids the problem of infringing portrait rights.
5. The step of calculating the correlation between the natural language description and each image in a group of face images comprises: calculating the correlation between the natural language description and each image in the group one by one through a multi-modal two-tower architecture model, which improves both the generation efficiency and the generation effect of the first target face image.
6. The step after calculating the correlation between the natural language description and each image in a group of face images further comprises: judging whether the correlation between the natural language description and each image in the group exceeds a preset threshold; and if no image's correlation exceeds the preset threshold, regenerating a new group of face images. According to the invention, groups of face images are randomly generated repeatedly until an image whose correlation with the natural language description of the target face image is greater than the preset threshold is obtained and taken as the first target face image. If several face images in a group have correlations greater than the preset threshold, the face image with the highest correlation is taken as the first target face image, which can greatly improve the generation effect of the first target face image.
7. The invention also provides a face pinching system, which has the same beneficial effects as the face pinching method, and is not repeated herein.
8. The present invention further provides a storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above-mentioned face-pinching method, and has the same beneficial effects as the above-mentioned face-pinching method, and the description thereof is omitted here.
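The generate-and-rescore loop summarized in point 6 above might be sketched as follows. The generator and scorer here are toy stand-ins for the actual GAN generator and two-tower matching model, and the function names are illustrative only:

```python
import itertools

def generate_until_match(generate_batch, score_batch, threshold=0.8, batch=128):
    """Keep generating batches of candidate face images until at least one
    exceeds the correlation threshold; return the best image of that batch."""
    while True:
        faces = generate_batch(batch)
        scores = score_batch(faces)
        best = max(range(len(faces)), key=scores.__getitem__)
        if scores[best] > threshold:
            return faces[best], scores[best]

# Toy stand-ins: deterministic "generator" and "scorer" for demonstration.
counter = itertools.count()

def generate_batch(n):
    k = next(counter)                       # which batch this is
    return [f"face-{k}-{i}" for i in range(n)]

def score_batch(faces):
    # Toy scorer: only the third batch (k == 2) contains a good match.
    k = int(faces[0].split("-")[1])
    scores = [0.1] * len(faces)
    if k == 2:
        scores[5] = 0.93
    return scores

face, score = generate_until_match(generate_batch, score_batch, batch=8)
print(face, score)   # face-2-5 0.93
```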
The above is a detailed description of the face-pinching method, face-pinching system and storage medium disclosed in the embodiments of the present invention. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for persons skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this description should not be construed as a limitation of the present invention, and any modification, equivalent replacement, or improvement made within the principles of the present invention shall be included in its protection scope.

Claims (9)

1. A face pinching method is characterized in that: the method comprises the following steps:
acquiring natural language description of a target face image, wherein the natural language description comprises natural language description of a voice modality;
randomly generating a group of false face images;
calculating the correlation between the natural language description and each image in a group of false face images based on a preset multi-modal two-tower architecture model; the multi-modal two-tower architecture model comprises a text encoder and an image encoder and is obtained by pre-training on massive paired image and natural language data;
screening the face images with the correlation higher than a preset threshold value, and taking the face image with the highest correlation as a first target face image;
judging whether a new natural language description of the portrait exists or not;
and if so, modifying the first target face image based on the new natural language description of the face image through a preset large-scale pre-training multi-modal model to obtain a second target face image.
2. The face-pinching method as claimed in claim 1, wherein: one group of false face images is 128 face images.
3. The face-pinching method as claimed in claim 1, wherein: the natural language description also includes a natural language description of a textual modality.
4. A face-pinching method as claimed in claim 3, characterized in that: and if the natural language description is the natural language description of the voice mode, converting the natural language description of the voice mode into the natural language description of the text mode through a voice recognition model.
5. The face-pinching method as claimed in claim 1, wherein: the step of randomly generating a group of false face images comprises:
randomly generating a group of false face images by a generator of a generative adversarial network.
6. The face-pinching method as claimed in claim 1, wherein: the step after calculating the correlation of the natural language description to each image in a set of false face images further comprises:
judging whether the correlation between the natural language description and each image in a group of false face images exceeds a preset threshold value or not;
and if the correlation between the natural language description and each image in the group of false face images does not exceed a preset threshold value, regenerating a group of new false face images.
7. A face-pinching system for implementing the face-pinching method as claimed in any one of claims 1 to 6, characterized in that: the system comprises an input module, a portrait generation module, a text and image matching module and an optimization module; the text and image matching module is respectively in signal connection with the input module and the portrait generating module; the optimization module is respectively in signal connection with the input module and the text and image matching module; the text and image matching module comprises a comparison module and a multi-mode double-tower architecture model;
the input module is used for acquiring a natural language description of a voice modality;
the portrait generation module is used for randomly generating a group of false face images;
the text and image matching module is used for acquiring a first target face image; the multi-modal two-tower architecture model comprises a text encoder and an image encoder, is obtained by pre-training on massive paired image and natural language data, and is used for calculating the correlation between the natural language description and each image in a group of false face images; the comparison module is used for comparing whether the correlation between the natural language description and each image in a group of false face images is greater than a preset threshold;
the optimization module is used for modifying the first target face image to obtain a second target face image.
8. The face pinching system of claim 7, wherein: the input module comprises a voice-to-text module for converting a natural language description of a voice modality to a natural language description of a text modality.
9. A storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the face-pinching method of any one of claims 1-6.
CN202211081372.2A 2022-09-06 2022-09-06 Face pinching method, face pinching system and storage medium Active CN115155058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211081372.2A CN115155058B (en) 2022-09-06 2022-09-06 Face pinching method, face pinching system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211081372.2A CN115155058B (en) 2022-09-06 2022-09-06 Face pinching method, face pinching system and storage medium

Publications (2)

Publication Number Publication Date
CN115155058A CN115155058A (en) 2022-10-11
CN115155058B true CN115155058B (en) 2023-02-03

Family

ID=83482132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211081372.2A Active CN115155058B (en) 2022-09-06 2022-09-06 Face pinching method, face pinching system and storage medium

Country Status (1)

Country Link
CN (1) CN115155058B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116741197B (en) * 2023-08-11 2023-12-12 上海蜜度信息技术有限公司 Multi-mode image generation method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN114187165A (en) * 2021-11-09 2022-03-15 阿里巴巴云计算(北京)有限公司 Image processing method and device
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259698B (en) * 2018-11-30 2023-10-13 百度在线网络技术(北京)有限公司 Method and device for acquiring image
CN112132912B (en) * 2019-06-25 2024-02-13 北京百度网讯科技有限公司 Method and device for establishing face generation model and generating face image
CN113642359B (en) * 2020-04-27 2023-11-14 北京达佳互联信息技术有限公司 Face image generation method and device, electronic equipment and storage medium
JP6843409B1 (en) * 2020-06-23 2021-03-17 クリスタルメソッド株式会社 Learning method, content playback device, and content playback system
CN114359423B (en) * 2020-10-13 2023-09-12 四川大学 Text generation face method based on deep countermeasure generation network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837229A (en) * 2021-08-30 2021-12-24 厦门大学 Knowledge-driven text-to-image generation method
CN114187165A (en) * 2021-11-09 2022-03-15 阿里巴巴云计算(北京)有限公司 Image processing method and device
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"T2F: What You Describe Is What You See, One-Click Text-to-Face Generation with Deep Learning"; Animesh Kamewar;《https://www.cloud.tencent.com/developer/news/272688》;20180712; pages 1-2, 4-6 *

Also Published As

Publication number Publication date
CN115155058A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
CN111415677B (en) Method, apparatus, device and medium for generating video
JP7374274B2 (en) Training method for virtual image generation model and virtual image generation method
CN111145322B (en) Method, apparatus, and computer-readable storage medium for driving avatar
WO2022166709A1 (en) Virtual video live broadcast processing method and apparatus, and storage medium and electronic device
CN111383307A (en) Video generation method and device based on portrait and storage medium
CN113378697A (en) Method and device for generating speaking face video based on convolutional neural network
JP7479750B2 (en) Virtual video live broadcast processing method and device, electronic device
JP2020034895A (en) Responding method and device
JP7238204B2 (en) Speech synthesis method and device, storage medium
CN111401101A (en) Video generation system based on portrait
CN111785246A (en) Virtual character voice processing method and device and computer equipment
CN115155058B (en) Face pinching method, face pinching system and storage medium
CN115631267A (en) Method and device for generating animation
CN115356953B (en) Virtual robot decision method, system and electronic equipment
CN111696520A (en) Intelligent dubbing method, device, medium and electronic equipment
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
An et al. Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features
KR20210078863A (en) Server, method and computer program for providing avatar service
JP2021086415A (en) Virtual person interaction system, video generation method, and video generation program
CN117152308B (en) Virtual person action expression optimization method and system
CN109961152A (en) Personalized interactive method, system, terminal device and the storage medium of virtual idol
KR102318150B1 (en) Hand sign language image generation system based on Generative Adversarial Networks
CN113241054B (en) Speech smoothing model generation method, speech smoothing method and device
Viswanathan et al. Text to image translation using generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant