WO2024078293A1 - Image processing method, apparatus, electronic device and storage medium - Google Patents

Image processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2024078293A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
audio
audio frame
data
facial
Prior art date
Application number
PCT/CN2023/120412
Other languages
English (en)
French (fr)
Inventor
王胜男
彭威
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司 filed Critical 北京字跳网络技术有限公司
Publication of WO2024078293A1 publication Critical patent/WO2024078293A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/20 Drawing from basic elements, e.g. lines or circles
    • G06T 11/206 Drawing of charts or graphs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Definitions

  • the embodiments of the present disclosure relate to the field of Internet technology, and in particular, to an image processing method, device, electronic device, computer-readable storage medium, computer program product, and computer program.
  • Content creation platforms, such as short video applications, are popular among users for their rich and diverse content. For example, after a content creator records audio, generates an audio work and uploads it to the application platform, other users can listen to the audio work through the corresponding application client.
  • When an audio work is played, usually only a static picture is displayed in the playback interface of the client, which results in a single display mode and a poor display effect.
  • the embodiments of the present disclosure provide an image processing method, an apparatus, an electronic device, a computer-readable storage medium, a computer program product, and a computer program to overcome the problem of a single display method and poor display effect of displayed content when playing audio works.
  • an embodiment of the present disclosure provides an image processing method, including:
  • a target facial map corresponding to a target audio frame is generated, wherein the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and the target facial map is displayed in a first facial area of a target image, wherein the first facial area is used to show changes in the mouth shape as the audio data is played.
  • an image processing device including:
  • a processing module used for generating a target facial map corresponding to a target audio frame during the playback of audio data, wherein the target facial map is used for representing a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame;
  • a display module is used to display the target facial map in a first facial area of a target image, where the first facial area is used to show changes in a mouth shape as the audio data is played.
  • an electronic device including:
  • a processor and a memory communicatively connected to the processor
  • the memory stores computer-executable instructions
  • the processor executes the computer-executable instructions stored in the memory to implement the image processing method as described in the first aspect and various possible designs of the first aspect.
  • an embodiment of the present disclosure provides a computer-readable storage medium, in which computer-executable instructions are stored.
  • When a processor executes the computer-executable instructions, the image processing method described in the first aspect and various possible designs of the first aspect is implemented.
  • an embodiment of the present disclosure provides a computer program product, including a computer program, which, when executed by a processor, implements the image processing method as described in the first aspect and various possible designs of the first aspect.
  • an embodiment of the present disclosure provides a computer program, which, when executed by a processor, implements the image processing method as described in the first aspect and various possible designs of the first aspect.
  • FIG. 1 is a diagram showing an application scenario of the image processing method provided by an embodiment of the present disclosure.
  • FIG. 2 is a first flowchart of the image processing method provided in an embodiment of the present disclosure.
  • FIG3 is a schematic diagram of a correspondence between an audio frame and audio content provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a target facial map provided by an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a possible specific implementation step of step S101 in the embodiment shown in FIG. 2 .
  • FIG. 6 is a schematic diagram of a process of displaying a target facial map provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a process of alternately enlarging and displaying a target image provided by an embodiment of the present disclosure.
  • FIG8 is a second flow chart of the image processing method provided in an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of a correspondence between a pronunciation stage and first lip shape data provided by an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of a contour key point provided by an embodiment of the present disclosure.
  • FIG. 11 is a flowchart of a possible specific implementation step of step S205 in the embodiment shown in FIG. 8 .
  • FIG. 12 is a structural block diagram of an image processing device provided in an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present disclosure.
  • FIG1 is a diagram of an application scenario of the image processing method provided by the embodiment of the present disclosure.
  • the image processing method provided by the embodiments of the present disclosure can be applied to scenarios of audio/video production and audio/video playback. More specifically, for example, during audio/video playback on a short video platform, a "lip sync" effect can be achieved by triggering a platform prop (i.e., a function control).
  • the method provided by the embodiment of the present disclosure can be applied to a terminal device, wherein an application client for playing video or audio runs in the terminal device, and the terminal device obtains corresponding media data and plays it by communicating with a server, wherein, illustratively, the media data is, for example, audio data, or video data including audio data.
  • the client running in the terminal device displays a static target image in the playback interface of the client, wherein the audio data is, for example, data corresponding to a song, and while playing the audio data, the terminal device displays a photo of the singer of the song in the playback interface.
  • to solve the above problem, the disclosed embodiments provide an image processing method that generates dynamic facial maps during the playback of audio data to simulate the singer's singing/speaking process, thereby achieving a video-like display effect.
  • FIG. 2 is a flow chart of an image processing method provided by an embodiment of the present disclosure.
  • the method of this embodiment can be applied in a terminal device, and the image processing method includes:
  • Step S101 during the process of playing audio data, a target facial map corresponding to a target audio frame is generated, where the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame.
  • the execution subject of the method provided in this embodiment can be a terminal device, wherein an application client for playing video and audio runs in the terminal device, specifically, such as a short video client or a music client, and the client in the terminal device obtains the media data to be played by communicating with the server, wherein, in one possible implementation, the media data includes video data and audio data, and the terminal device parses the media data based on the video channel and the audio channel to obtain the data corresponding to the video channel and the data corresponding to the audio channel, that is, the video data and the corresponding audio data.
  • In another possible implementation, the media data is audio data, and the terminal device directly obtains the audio data by accessing the server, wherein the audio data is data corresponding to content such as song music, voice, etc.
  • the terminal device plays the audio data, wherein the audio data is composed of at least one audio segment, each audio segment includes multiple audio frames, and when the playback duration of the audio data is fixed, the number (and duration) of the audio frames is determined by the frame rate of the audio data.
  • the audio data corresponds to a song or a speech, for example, then each audio frame in the audio data corresponds to a pronunciation segment in the song or speech, and an audio segment composed of multiple audio frames can realize the complete pronunciation of a text, number, letter, word or beat, and the text, number, letter, word and beat are the audio content.
  • FIG3 is a schematic diagram of a correspondence between an audio frame and audio content provided by an embodiment of the present disclosure.
  • Exemplarily, referring to FIG. 3, the audio data corresponds to a segment of speech, the audio content is the speech content, and each character corresponds to an audio segment.
  • each audio segment includes a different number of audio frames, for example, the audio segment D1 includes n1 audio frames, the audio segment D2 includes n2 audio frames, the audio segment D3 includes n3 audio frames, and the audio segment D4 includes n4 audio frames.
  • each audio frame in the audio segment D1 corresponds to the audio content "one";
  • each audio frame in the audio segment D2 corresponds to the audio content "two";
  • each audio frame in the audio segment D3 corresponds to the audio content "开" (the first character of "start");
  • each audio frame in the audio segment D4 corresponds to the audio content "始" (the second character of "start").
  • the audio content may also be words, letters constituting words, etc.
  • each word, or letters constituting a word corresponds to an audio segment, and each audio segment is composed of multiple audio frames.
  • the implementation of the embodiment is similar to the above embodiment, and the specific implementation method can be set as needed and will not be repeated here.
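  • To make the segment/frame relationship described above concrete, the following minimal sketch (illustrative names and values only, not part of the patent disclosure) groups audio frames into segments, maps each segment to its audio content, and derives the frame count of a segment from the frame rate.

```python
from dataclasses import dataclass

@dataclass
class AudioSegment:
    content: str      # audio content realized by this segment, e.g. one character or word
    start_ms: int     # start time of the segment within the audio data
    duration_ms: int  # playback duration of the segment

def frames_in_segment(segment: AudioSegment, frame_rate_hz: float) -> int:
    """Number of audio frames in a segment, derived from the frame rate of the audio data."""
    frame_ms = 1000.0 / frame_rate_hz
    return max(1, round(segment.duration_ms / frame_ms))

# Hypothetical values mirroring FIG. 3: four segments D1..D4, one per character.
segments = [
    AudioSegment("one", 0, 300),
    AudioSegment("two", 300, 300),
    AudioSegment("kai", 600, 250),   # first character of "start"
    AudioSegment("shi", 850, 250),   # second character of "start"
]
for seg in segments:
    n = frames_in_segment(seg, frame_rate_hz=50.0)  # 50 fps -> 20 ms per audio frame
    print(seg.content, n)
```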
  • Exemplarily, each audio segment includes at least one target audio frame; more specifically, for example, the first frame of each audio segment is the target audio frame.
  • When playback reaches a target audio frame, a map corresponding to the target audio frame, that is, a target facial map, is generated.
  • the target facial map can show the mouth shape of a real person when emitting the audio content corresponding to the target audio frame.
  • FIG. 4 is a schematic diagram of a target facial map provided by an embodiment of the present disclosure.
  • Referring to FIG. 4, the audio data corresponds to a voice, and the audio content (voice content) is "One, two, start!".
  • the audio content corresponding to the target audio frame Frame_1 is the text "one";
  • the audio content corresponding to the target audio frame Frame_2 is the text "two";
  • the audio content corresponding to the target audio frame Frame_3 is the character "开" (the first character of "start");
  • the audio content corresponding to the target audio frame Frame_4 is the character "始" (the second character of "start").
  • a corresponding target facial map is generated, that is, the target audio frame Frame_1 corresponds to the target facial map P1
  • the target facial map P1 is the mouth shape of a real person when he/she pronounces the word "one”.
  • the target audio frame Frame_2 corresponds to the target facial map P2, and the target facial map P2 is the mouth shape of a real person when he/she pronounces the word "two".
  • the target audio frame Frame_3 corresponds to the target facial map P3, and the target facial map P3 is the mouth shape of a real person when he/she pronounces the character "开" (the first character of "start").
  • the target audio frame Frame_4 corresponds to the target facial map P4, and the target facial map P4 is the mouth shape of a real person when he/she pronounces the character "始" (the second character of "start").
  • the target facial maps corresponding to different target audio frames may be the same, for example, the target facial map P1 is the same as the target facial map P2.
  • Exemplarily, the specific implementation steps of step S101 include:
  • Step S1011 obtaining first lip shape data corresponding to the target audio frame, where the first lip shape data is used to represent the shape of the mouth.
  • Step S1012 Detect the target image and obtain second lip shape data, where the second lip shape data represents the size parameters of the target mouth shape.
  • Step S1013 Generate a target facial map based on the first lip shape data and the second lip shape data.
  • first lip shape data or a network model for generating the first lip shape data is preset, and the first lip shape data is used to characterize the shape of the mouth.
  • the first lip shape data may be an image, a logo, or other descriptive information that can describe the shape of the mouth.
  • the first lip shape data may be an image representing the mouth shape for the pronunciation of the word "one”
  • the first lip shape data may be a logo representing the mouth shape for the pronunciation of the letter "A”
  • the first lip shape data may be descriptive information representing the mouth shape for the pronunciation of the word "Apple”.
  • the corresponding first lip shape data can be determined based on the preset lip shape timing mapping information.
  • the lip shape timing mapping information represents the first lip shape data corresponding to each audio frame in the audio data.
  • Each audio data corresponds to unique lip shape timing mapping information, which can be pre-generated based on the specific audio content of the audio data, and will not be described in detail here.
  • Since the purpose of generating the target facial map is to simulate the mouth movements of a real user when pronouncing a word, it is necessary to set the size of the target facial map so that the size of the target mouth shape matches the size of the portrait in the target image.
  • the target mouth shape includes a length parameter and a width parameter.
  • Exemplarily, the facial dimensions f1 and f2 of the portrait in the target image are obtained by detection, where f1 represents the facial height and f2 represents the facial width; then, according to a preset face-to-mouth ratio coefficient, the length parameter c1 and the width parameter c2 that match the facial dimensions are obtained from f1 and f2.
  • a target facial map matching the target image can be obtained.
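  • The sizing step described above can be sketched as follows. This is a minimal illustration only: the ratio coefficients, names, and the exact mapping from facial dimensions to mouth dimensions are assumptions, since the patent only states that a preset face-to-mouth ratio is applied to f1 and f2 to obtain c1 and c2.

```python
def target_mouth_size(f1: float, f2: float,
                      length_ratio: float = 0.35,
                      width_ratio: float = 0.10) -> tuple:
    """Derive the size parameters of the target mouth shape from the facial dimensions.

    f1: facial height of the portrait in the target image (e.g. in pixels)
    f2: facial width of the portrait in the target image
    length_ratio / width_ratio: hypothetical preset face-to-mouth ratio coefficients

    Returns (c1, c2): the length parameter and the width parameter of the target mouth shape.
    """
    c1 = f2 * length_ratio  # mouth length taken as a fraction of the facial width
    c2 = f1 * width_ratio   # mouth width (opening height) taken as a fraction of the facial height
    return c1, c2

# e.g. a detected face 320 px tall and 240 px wide
c1, c2 = target_mouth_size(f1=320.0, f2=240.0)
```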
  • Step S102 Displaying a target facial map in a first facial region of the target image, where the first facial region is used to show changes in the mouth shape as the audio data is played.
  • the target facial map is displayed in the first facial area of the target image, wherein, exemplarily, the target image includes a portrait portion, and the first facial area can be the mouth area of the portrait face, or a facial area including the mouth area.
  • FIG. 6 is a schematic diagram of a process of displaying a target facial map provided by an embodiment of the present disclosure. Referring to FIG. 6, when playback reaches the target audio frame and the corresponding target facial map is displayed in the first facial area, the original mouth shape in the target image is covered and the target mouth shape represented by the target facial map is presented, thereby simulating the mouth movement of a real user during pronunciation.
  • the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately.
  • the specific implementation steps of step S102 include:
  • Step S1021 If the target audio frame is the first audio frame, the target image and the target facial map located in the first facial area are displayed based on the first magnification factor.
  • Step S1022 If the target audio frame is the second audio frame, the target image and the target facial map located in the first facial area are displayed based on a second magnification factor, wherein the first magnification factor is different from the second magnification factor.
  • each target audio frame corresponds to a mouth movement, and when different target audio frames are played in sequence, the mouth movement is switched and displayed.
  • the target audio frame is divided into a first audio frame and a second audio frame.
  • different magnification factors are used to alternately magnify and display the target image and the target facial map located in the first facial area, so that the target image displayed in the playback interface can present the effect of zooming in and zooming out according to the rhythm, thereby improving the visual expression of the target image.
  • the first audio frame is an odd audio frame among all target audio frames of the audio data
  • the second audio frame is an even audio frame among all target audio frames of the audio data.
  • the first magnification factor and the second magnification factor can be a scale factor based on the diagonal length of the target image, or a scale factor based on the area of the target image.
  • the first magnification factor and the second magnification factor are real numbers greater than 0. When they are greater than 1, they indicate that the size is enlarged; when they are less than 1, they indicate that the size is reduced.
  • FIG7 is a schematic diagram of a process of alternately magnifying and displaying a target image provided by an embodiment of the present disclosure.
  • When the target audio frame is Frame_1, it is a first audio frame, and the target image and the target facial map located in the first facial area are displayed with a 1-fold magnification factor (the first magnification factor), that is, at the original size.
  • When the target audio frame is Frame_2, it is a second audio frame, and the target image and the target facial map located in the first facial area are displayed with a 1.2-fold magnification factor (the second magnification factor).
  • When the target audio frame is Frame_3, it is again a first audio frame, which is handled in the same way as Frame_1 and will not be described in detail.
  • the target image and the target facial map are rhythmically presented with a lens push-out and pull-in effect, thereby improving the visual expressiveness of the target image.
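  • The alternating zoom described above can be sketched as below. The 1.0 and 1.2 factors follow the example of FIG. 7; treating odd-numbered target audio frames as first audio frames and even-numbered ones as second audio frames is an assumption for illustration, not a statement of the patented method.

```python
FIRST_MAGNIFICATION = 1.0   # first magnification factor: original size (FIG. 7, Frame_1)
SECOND_MAGNIFICATION = 1.2  # second magnification factor: 1.2x zoom-in (FIG. 7, Frame_2)

def magnification_for(target_frame_index: int) -> float:
    """Pick the display magnification for the target image and the target facial map.

    Odd-numbered target audio frames (the 1st, 3rd, ...) are treated as first audio
    frames and even-numbered ones as second audio frames, so the factor alternates.
    """
    return FIRST_MAGNIFICATION if target_frame_index % 2 == 1 else SECOND_MAGNIFICATION

# Frame_1 -> 1.0, Frame_2 -> 1.2, Frame_3 -> 1.0, Frame_4 -> 1.2
print([magnification_for(i) for i in (1, 2, 3, 4)])
```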
  • a target facial map corresponding to the target audio frame is generated, the target facial map is used to represent the target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; the target facial map is displayed in the first facial area of the target image, and the first facial area is used to show the changes in the mouth shape as the audio data is played.
  • the target facial map is used to simulate and display the target mouth shape corresponding to the currently playing target audio content, so that the mouth shape displayed in the facial area of the target image changes as the audio content changes, thereby imitating the process of a real person singing or speaking the audio corresponding to the audio data, so that the audio work can present the display effect of a video work.
  • FIG8 is a second flow chart of the image processing method provided by the embodiment of the present disclosure. Based on the embodiment shown in FIG2 , this embodiment further refines step S101 and adds a step of determining the target audio frame.
  • the image processing method provided by this embodiment can be applied to the scenario where the content corresponding to the audio data is a song.
  • the image processing method includes:
  • Step S201 obtaining beat information of audio data, where the beat information represents the time interval between at least two audio segments during the playback of the audio data, wherein the audio segment includes multiple audio frames, and the audio segment is used to achieve complete pronunciation of at least one audio content.
  • Step S202 Determine the target audio frame according to the beat information.
  • the beat information is information about the rhythm characteristics of the melody of the song corresponding to the audio data. More specifically, the beat information can be used to characterize the speed of the melody of the song. Exemplarily, when the melody rhythm is fast, the audio segment interval is short, that is, the time interval between the audio content is short; conversely, when the melody rhythm is slow, the audio segment interval is long, that is, the time interval between the audio content is long; wherein the audio content is, for example, the lyrics in the song.
  • The beat information can be a specific identifier or number indicating the length of a fixed time interval. For example, if the beat information is 300 ms (milliseconds), the interval between two audio segments is 300 ms, and according to the beat information a target audio frame is determined every 300 ms in the audio data. The process of determining the target audio frames can be completed before the audio data is played or while it is played, and can be set as needed. Afterwards, based on the evenly distributed target audio frames, the corresponding target facial maps are generated and displayed, so that the mouth shape displayed in the first facial area of the target image changes dynamically with a fixed period (300 ms). The beat information can be preset information corresponding to the audio data, and its acquisition method will not be repeated here.
  • Exemplarily, the specific implementation steps of step S202 include:
  • Step S2021 Obtain the beat number of the audio data according to the beat information, where the beat number represents the number of beats per minute of the melody corresponding to the audio data.
  • Step S2022 Determine the target audio frame based on the timestamp of the audio frame in the audio data and the number of beats of the audio data.
  • the beat number (Beats Per Minute, BPM) refers to the number of sound beats emitted within one minute.
  • the melody corresponding to the audio data has a fixed number of beats.
  • the timestamp of all target audio frames in the audio data can be determined by accumulating with the beat length as a fixed period.
  • all target audio frames can be obtained by recording based on the timestamp of the target audio frame.
  • During playback, the timestamp of the current audio frame is obtained. If the timestamp of the current audio frame equals the timestamp of the first-beat audio frame plus an integer multiple of the beat length, the current audio frame is determined as the target audio frame, and the subsequent steps are performed synchronously to generate the target facial map.
  • the first beat can be set based on user needs.
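  • A minimal sketch of the beat-based selection just described follows, under the assumption stated above that a frame is a target audio frame when its timestamp equals the first-beat timestamp plus an integer multiple of the beat length; the small tolerance for timestamp jitter and all names are illustrative additions.

```python
def beat_length_ms(bpm: float) -> float:
    """Beat length in milliseconds, derived from the beat number (beats per minute)."""
    return 60_000.0 / bpm

def is_target_audio_frame(frame_ts_ms: float, first_beat_ts_ms: float, bpm: float,
                          tolerance_ms: float = 1.0) -> bool:
    """True if the frame timestamp falls on the first-beat timestamp plus an integer
    multiple of the beat length (within a small tolerance for timestamp jitter)."""
    if frame_ts_ms < first_beat_ts_ms:
        return False
    beat_ms = beat_length_ms(bpm)
    offset = (frame_ts_ms - first_beat_ts_ms) % beat_ms
    return offset <= tolerance_ms or beat_ms - offset <= tolerance_ms

# 120 BPM -> 500 ms beat length, so frames at 0 ms, 500 ms, 1000 ms, ... are target frames.
print([is_target_audio_frame(t, first_beat_ts_ms=0.0, bpm=120.0) for t in (0.0, 250.0, 500.0)])
```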
  • the beat information includes a beat sequence, which includes multiple beat nodes, each of which indicates the timestamp of a target audio frame, that is, the beat sequence is a set of timestamps of each target audio frame.
  • the beat sequence can be preset by the user based on the content of the audio data. Based on the beat sequence, the target audio frames can be unevenly distributed.
  • For example, the melody rhythm of the first half of the audio data is slow, and the first half includes 50 target audio frames;
  • the melody rhythm of the second half of the audio data is fast, and the second half includes 100 target audio frames. That is, when the melody rhythm becomes faster, the appearance density of the target audio frames becomes higher, so that the update speed of the mouth shape in the target image also speeds up; and when the melody rhythm becomes slower, the appearance density of the target audio frames becomes lower, so that the update speed of the mouth shape in the target image also slows down. In this way, the density of the target audio frames changes with the melody rhythm, and the change speed of the target facial map displayed in the target image changes with the melody rhythm, which is closer to the pronunciation process of real users and improves the visual expressiveness of the target image.
  • the target audio frame is matched with the melody rhythm of the audio data, and then the changes of the subsequently generated target facial map are matched with the melody rhythm, thereby improving the visual expression effect of the target facial map.
  • Step S203 acquiring target semantic information corresponding to the target audio frame, where the target semantic information represents the audio content of the target audio frame.
  • Step S204 Based on a pre-trained Generative Adversarial Network (GAN), the target semantic information is processed to obtain the first lip shape data.
  • the target semantic information corresponding to the target audio frame is obtained.
  • the target semantic information may be preset information stored in the audio data and used to characterize the type of audio content corresponding to the audio frame.
  • the target semantic information is #001, which characterizes the Chinese character "开" (open)
  • the target semantic information is #002, which characterizes the Chinese character "始" (begin).
  • the target semantic information is input into the pre-trained adversarial neural network, and the generation capability of the pre-trained adversarial neural network is used to generate corresponding pronunciation mouth shape pictures, labels, description information, etc., such as the pronunciation mouth shape picture of the Chinese character "kai".
  • the adversarial neural network can be obtained by training with the pronunciation mouth shape pictures annotated with semantic information as training samples, and the specific training process will not be repeated here.
  • the target semantic information includes text information and a corresponding pronunciation stage identifier, wherein the text information represents the target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents the target pronunciation stage of the target text corresponding to the target audio frame.
  • the specific implementation steps of step S204 include: inputting the text information and the corresponding pronunciation stage identifier into the adversarial neural network to obtain the first lip shape data.
  • the target semantic information includes text information and corresponding pronunciation stage identifiers.
  • the target semantic information is an array, which includes a first field representing the text information and a second field representing the pronunciation stage identifier, wherein the content of the first field is "GB2312", representing the Chinese character "开"; the content of the second field is "stage_1", representing the mouth shape of the first pronunciation stage of the Chinese character "开".
  • the text information and the corresponding pronunciation stage identifier are input into the adversarial neural network, and the first lip shape data that matches the Chinese character "开" (the text information) and the "mouth shape of the first pronunciation stage" (the pronunciation stage identifier) can be obtained.
  • Figure 9 is a schematic diagram of the correspondence between a pronunciation stage and the first lip shape data provided by an embodiment of the present disclosure.
  • the first lip shape data is a picture representing the shape of the mouth.
  • the corresponding first lip shape information is generated based on the text information and the corresponding pronunciation stage identifier, thereby further refining the differences in mouth shape at different pronunciation stages during the pronunciation process, improving the precision of the mouth shape changes in the first facial area of the target image, and improving visual expression.
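  • The lookup of first lip shape data from the target semantic information might be organized as in the sketch below. The field layout mirrors the "GB2312"/"stage_1" example above, but the generator interface is purely an assumption; the patent only states that a pre-trained adversarial network maps the semantic information to lip shape data.

```python
from typing import Any, Callable, NamedTuple

class TargetSemanticInfo(NamedTuple):
    text: str      # first field: the target text, e.g. the Chinese character "开"
    stage_id: str  # second field: the pronunciation stage identifier, e.g. "stage_1"

def first_lip_shape_data(info: TargetSemanticInfo,
                         generator: Callable[..., Any]) -> Any:
    """Query a pre-trained generator for the mouth shape of one pronunciation stage.

    `generator` stands in for the pre-trained adversarial network; it is assumed to
    accept the text and the stage identifier and to return an image, label, or
    description of the corresponding mouth shape (the real interface is not
    specified by the patent).
    """
    return generator(text=info.text, stage=info.stage_id)

# e.g. first_lip_shape_data(TargetSemanticInfo("开", "stage_1"), generator=my_pretrained_gan)
```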
  • Step S205 Detect the target image and obtain second mouth shape data, where the second mouth shape data represents the size parameters of the target mouth shape.
  • Exemplarily, the specific implementation steps of step S205 include:
  • Step S2051 Based on the target image, perform mouth feature recognition to obtain contour key points, where the contour key points are used to characterize the length and width of the mouth contour in the target image.
  • Step S2052 Obtain second lip shape data based on the coordinates of the contour key points.
  • FIG10 is a schematic diagram of a contour key point provided by an embodiment of the present disclosure.
  • the mouth feature recognition is performed on the portrait portion thereof to obtain the contour key points thereof, wherein, as shown in the figure, by way of example, the contour key points may include the leftmost endpoint D1 of the mouth contour, the rightmost endpoint D2 of the mouth contour, and the uppermost endpoint D3 and the lowermost endpoint D4 of the mouth contour.
  • the second mouth shape data is obtained, wherein, by way of example, the second mouth shape data represents the length value and width value of the mouth shape, or the width-to-length ratio of the mouth shape.
  • FIG. 11 is a possible implementation of step S205. As shown in FIG. 11 , optionally, after step S2051, the method further includes:
  • Step S2051A Obtain the head turning angle of the portrait in the target image.
  • Correspondingly, the specific implementation method of step S2052 is: obtaining the second lip shape data based on the coordinates of the contour key points and the head turning angle.
  • the head turning angle is the angle between the plane where the face of the person in the target image is located and the screen plane.
  • the head turning angle can also be obtained by performing view detection on the target image.
  • the specific implementation method is the existing technology and will not be repeated here. After obtaining the head turning angle, the second lip shape data is calculated based on the coordinates of the contour key points and the head turning angle.
  • the specific implementation method is shown in formula (1):
  • mouthDis is the second mouth shape data, which represents the width-to-length ratio of the mouth shape
  • D3.y is the ordinate of the contour key point D3, D4.y is the ordinate of the contour key point D4, D1.x is the abscissa of the contour key point D1, D2.x is the abscissa of the contour key point D2, and yaw is the head turning angle.
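  • Formula (1) itself did not survive extraction; the sketch below shows one plausible form consistent with the variable definitions above (the vertical key-point distance divided by the horizontal distance, with the horizontal distance corrected for the head turning angle). The exact expression in the original document may differ.

```python
import math

def mouth_dis(D1, D2, D3, D4, yaw_rad: float) -> float:
    """Second mouth shape data: the width-to-length ratio of the mouth shape.

    D1, D2: (x, y) coordinates of the leftmost / rightmost mouth contour key points
    D3, D4: (x, y) coordinates of the uppermost / lowermost mouth contour key points
    yaw_rad: head turning angle between the face plane and the screen plane, in radians

    Assumed correction: the apparent horizontal mouth length shrinks by cos(yaw) when
    the head turns, so it is divided by cos(yaw) before the ratio is taken.
    """
    height = abs(D3[1] - D4[1])                                 # |D3.y - D4.y|
    length = abs(D1[0] - D2[0]) / max(math.cos(yaw_rad), 1e-6)  # |D1.x - D2.x| corrected by yaw
    return height / length

# e.g. mouth_dis((100, 200), (160, 200), (130, 185), (130, 215), yaw_rad=0.2) ~= 0.49
```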
  • Step S206 Generate a target facial map based on the first lip shape data and the second lip shape data.
  • Step S207 If the target audio frame is the first audio frame, the target image and the target facial map located in the first facial area are displayed based on the first magnification factor; if the target audio frame is the second audio frame, the target image and the target facial map located in the first facial area are displayed based on the second magnification factor, wherein the first magnification factor is different from the second magnification factor.
  • Exemplarily, the first lip shape data and the second lip shape data are used as inputs and processed or rendered to generate a corresponding target facial map. Then, based on the result of image detection on the target image, the position of the first facial region (e.g., the mouth region) in the target image is determined, the target mouth shape in the target facial map is aligned with the mouth region in the target image, and rendering is performed so that the target facial map is overlaid on the target image to imitate the mouth shape of a real person's pronunciation.
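  • A minimal compositing sketch of the overlay step described above follows, using NumPy array slicing as a stand-in for the renderer; the alignment to a rectangular mouth box and the optional alpha blending are illustrative assumptions, not details specified by the patent.

```python
from typing import Optional, Tuple
import numpy as np

def overlay_facial_map(target_image: np.ndarray, facial_map: np.ndarray,
                       mouth_box: Tuple[int, int, int, int],
                       alpha: Optional[np.ndarray] = None) -> np.ndarray:
    """Paste the target facial map over the first facial area of the target image.

    mouth_box: (x, y, w, h) of the detected mouth region in the target image
    alpha: optional (h, w, 1) mask in [0, 1] so that only the mouth pixels of the
           map replace the original image; None means a hard overlay
    """
    x, y, w, h = mouth_box
    out = target_image.copy()
    patch = facial_map[:h, :w]             # facial map assumed already resized to the region
    region = out[y:y + h, x:x + w]
    if alpha is None:
        out[y:y + h, x:x + w] = patch      # hard overlay: cover the original mouth shape
    else:
        blended = alpha * patch + (1.0 - alpha) * region
        out[y:y + h, x:x + w] = blended.astype(out.dtype)
    return out
```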
  • Furthermore, the first audio frame and the second audio frame are displayed in an alternating manner with different magnification factors, so that the target image and the target facial map present a zoom-in and zoom-out effect according to the rhythm, thereby improving the visual expressiveness of the target image.
  • the target audio frame (including the first audio frame and the second audio frame) is determined based on the beat information. Therefore, the density of the target audio frame can change with the change of the melody rhythm.
  • In the process of alternately magnifying the display for the first audio frame and the second audio frame to present the lens push-pull effect, the scheme can also make the frequency of the lens zoom-in and zoom-out change with the melody rhythm, thereby improving the visual expressiveness of the target image.
  • FIG12 is a structural block diagram of an image processing device provided by an embodiment of the present disclosure.
  • the image processing device 3 includes:
  • the processing module 31 is used to generate a target facial map corresponding to a target audio frame during the playback of audio data, wherein the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame.
  • the display module 32 is used to display the target facial map in a first facial area of the target image, where the first facial area is used to show the changes in the mouth shape as the audio data is played.
  • the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately;
  • the display module 32 is specifically used to: if the target audio frame is the first audio frame, then based on a first magnification factor, display the target image and the target facial map located in the first facial area; if the target audio frame is the second audio frame, then based on a second magnification factor, display the target image and the target facial map located in the first facial area; wherein the first magnification factor is different from the second magnification factor.
  • When generating a target facial map corresponding to a target audio frame, the processing module 31 is specifically used to: obtain first lip shape data corresponding to the target audio frame, the first lip shape data being used to characterize a mouth shape; detect the target image to obtain second lip shape data, the second lip shape data characterizing size parameters of the target mouth shape; and generate the target facial map based on the first lip shape data and the second lip shape data.
  • When the processing module 31 obtains the first lip shape data corresponding to the target audio frame, it is specifically used to: obtain target semantic information corresponding to the target audio frame, the target semantic information representing the audio content of the target audio frame; and process the target semantic information based on a pre-trained adversarial neural network to obtain the first lip shape data.
  • the target semantic information includes text information and a corresponding pronunciation stage identifier, wherein the text information represents a target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents a target pronunciation stage of the target audio frame corresponding to the target text; when the processing module 31 processes the target semantic information based on a pre-trained adversarial neural network to obtain the first lip shape data, it is specifically used to: input the text information and the corresponding pronunciation stage identifier into the adversarial neural network to obtain the first lip shape data.
  • When the processing module 31 detects the target image and obtains the second lip shape data, it is specifically used to: perform mouth feature recognition based on the target image to obtain contour key points, the contour key points being used to represent the length and width of the mouth contour in the target image; and obtain the second lip shape data based on the coordinates of the contour key points.
  • the processing module 31 is further used to: obtain the head turning angle of the portrait in the target image; when the processing module 31 obtains the second lip shape data based on the coordinates of the contour key points, it is specifically used to: obtain the second lip shape data based on the coordinates of the contour key points and the head turning angle.
  • Before generating a target facial map corresponding to a target audio frame, the processing module 31 is further used to: obtain beat information of the audio data, wherein the beat information represents a time interval between at least two audio segments during playback of the audio data, wherein the audio segment includes a plurality of audio frames, and the audio segment is used to realize a complete pronunciation of at least one audio content; and determine the target audio frame according to the beat information.
  • When the processing module 31 determines the target audio frame according to the beat information, it is specifically used to: obtain the beat number of the audio data according to the beat information, wherein the beat number represents the number of beats per minute of the melody corresponding to the audio data; and determine the target audio frame based on the timestamp of the audio frame in the audio data and the beat number of the audio data.
  • the processing module 31 is connected to the display module 32.
  • the image processing device 3 provided in this embodiment can execute the technical solution of the above method embodiment, and its implementation principle and technical effect are similar, which will not be described in detail in this embodiment.
  • FIG. 13 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 13 , the electronic device 4 includes:
  • the memory 42 stores computer-executable instructions
  • the processor 41 executes the computer-executable instructions stored in the memory 42 to implement the image processing method in the embodiments shown in FIG. 2 to FIG. 11 .
  • the processor 41 and the memory 42 are connected via a bus 43.
  • Referring to FIG. 14, it shows a schematic diagram of the structure of an electronic device 900 suitable for implementing the embodiments of the present disclosure.
  • the electronic device 900 may be a terminal device or a server.
  • the terminal device may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (PMPs), vehicle terminals (such as vehicle navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG. 14 is only an example and should not impose any limitations on the functions and scope of use of the embodiment of the present disclosure.
  • the electronic device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 to a random access memory (RAM) 903.
  • Various programs and data required for the operation of the electronic device 900 are also stored in the RAM 903.
  • the processing device 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904.
  • the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 907 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 908 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 909.
  • the communication device 909 can allow the electronic device 900 to communicate with other devices wirelessly or by wire to exchange data.
  • Although FIG. 14 shows an electronic device 900 with various devices, it should be understood that it is not required to implement or have all the devices shown. More or fewer devices may be implemented or provided instead.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902.
  • When the computer program is executed by the processing device 901, the above-mentioned functions defined in the method of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM) or flash memory, optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries a computer-readable program code. This propagated data signal can take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • Computer-readable signal media can also be any computer-readable medium other than computer-readable storage media, which can send, propagate or transmit programs for use by or in conjunction with instruction execution systems, apparatuses, or devices.
  • the program code contained on the computer-readable medium can be transmitted using any appropriate medium, including but not limited to: wires, optical cables, radio frequencies (RF), etc., or any suitable combination of the above.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
  • the computer-readable medium carries one or more programs; when the one or more programs are executed by the electronic device, the electronic device executes the method shown in the above embodiments.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer via any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet Service Provider).
  • each square box in the flow chart or block diagram can represent a module, a program segment or a part of a code, and the module, the program segment or a part of the code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the square box can also occur in a sequence different from that marked in the accompanying drawings. For example, two square boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each square box in the block diagram and/or flow chart, and the combination of the square boxes in the block diagram and/or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or hardware.
  • the name of a unit does not limit the unit itself in some cases.
  • the first acquisition unit may also be described as a "unit for acquiring at least two Internet Protocol addresses".
  • exemplary types of hardware logic components include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Parts (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), etc.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, device, or equipment.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • machine-readable storage media may include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a RAM, a ROM, an EPROM, or a flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • an image processing method comprising:
  • a target facial map corresponding to a target audio frame is generated, wherein the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and the target facial map is displayed in a first facial area of a target image, wherein the first facial area is used to show changes in the mouth shape as the audio data is played.
  • the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately;
  • the displaying the target facial map in the first facial area of the target image includes: if the target audio frame is the first audio frame, based on a first magnification factor, displaying the target image and the target facial map located in the first facial area; if the target audio frame is the second audio frame, based on a second magnification factor, displaying the target image and the target facial map located in the first facial area; wherein the first magnification factor is different from the second magnification factor.
  • generating a target facial map corresponding to a target audio frame includes: acquiring first lip shape data corresponding to the target audio frame, the first lip shape data being used to characterize a mouth shape; detecting the target image to obtain second lip shape data, the second lip shape data characterizing a size parameter of the target mouth shape; and generating the target facial map based on the first lip shape data and the second lip shape data.
  • obtaining the first lip shape data corresponding to the target audio frame includes: obtaining target semantic information corresponding to the target audio frame, the target semantic information representing the audio content of the target audio frame; and processing the target semantic information based on a pre-trained adversarial neural network to obtain the first lip shape data.
  • the target semantic information includes text information and a corresponding pronunciation stage identifier, wherein the text information represents a target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents a target pronunciation stage of the target audio frame corresponding to the target text;
  • the pre-trained adversarial neural network is used to process the target semantic information to obtain the first lip shape data, including: inputting the text information and the corresponding pronunciation stage identifier into the adversarial neural network to obtain the first lip shape data.
  • detecting the target image and obtaining the second lip shape data includes: based on the target image, performing mouth feature recognition to obtain contour key points, wherein the contour key points are used to characterize the length and width of the mouth contour in the target image; and obtaining the second lip shape data based on the coordinates of the contour key points.
  • the method further includes: obtaining the head turning angle of the portrait in the target image; and obtaining the second lip shape data based on the coordinates of the contour key points, including: obtaining the second lip shape data based on the coordinates of the contour key points and the head turning angle.
  • the method before generating a target facial map corresponding to a target audio frame, the method further includes: acquiring beat information of the audio data, wherein the beat information represents a time interval between at least two audio segments during playback of the audio data, wherein the audio segment includes a plurality of audio frames, and the audio segment is used to realize a complete pronunciation of at least one audio content; and determining the target audio frame according to the beat information.
  • determining the target audio frame based on the beat information includes: acquiring the beat number of the audio data based on the beat information, the beat number representing the number of beats per minute of the melody corresponding to the audio data; determining the target audio frame based on the timestamp of the audio frame in the audio data and the beat number of the audio data.
  • an image processing apparatus comprising:
  • the processing module is used to generate a target facial map corresponding to a target audio frame during the playback of audio data, wherein the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame.
  • a display module is used to display the target facial map in a first facial area of a target image, where the first facial area is used to show changes in a mouth shape as the audio data is played.
  • the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately;
  • the display module is specifically used to: if the target audio frame is the first audio frame, then based on a first magnification factor, display the target image and the target facial map located in the first facial area; if the target audio frame is the second audio frame, then based on a second magnification factor, display the target image and the target facial map located in the first facial area; wherein the first magnification factor is different from the second magnification factor.
  • When generating a target facial map corresponding to a target audio frame, the processing module is specifically used to: obtain first lip shape data corresponding to the target audio frame, the first lip shape data being used to represent the mouth shape; detect the target image to obtain second lip shape data, the second lip shape data representing the size parameters of the target mouth shape; and generate the target facial map based on the first lip shape data and the second lip shape data.
  • When the processing module obtains the first lip shape data corresponding to the target audio frame, it is specifically used to: obtain target semantic information corresponding to the target audio frame, the target semantic information representing the audio content of the target audio frame; and process the target semantic information based on a pre-trained adversarial neural network to obtain the first lip shape data.
  • the target semantic information includes text information and a corresponding pronunciation stage identifier, wherein the text information represents a target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents a target pronunciation stage of the target audio frame corresponding to the target text; when the processing module processes the target semantic information based on a pre-trained adversarial neural network to obtain the first lip shape data, the processing module is specifically used to: input the text information and the corresponding pronunciation stage identifier into the adversarial neural network to obtain the first lip shape data.
  • When the processing module detects the target image and obtains the second lip shape data, it is specifically used to: perform mouth feature recognition based on the target image to obtain contour key points, the contour key points being used to characterize the length and width of the mouth contour in the target image; and obtain the second lip shape data based on the coordinates of the contour key points.
  • the processing module is further used to: obtain the head turning angle of the portrait in the target image; when the processing module obtains the second lip shape data based on the coordinates of the contour key points, it is specifically used to: obtain the second lip shape data based on the coordinates of the contour key points and the head turning angle.
  • before generating a target facial map corresponding to a target audio frame, the processing module is further used to: obtain beat information of the audio data, where the beat information represents a time interval between at least two audio segments during playback of the audio data, each audio segment includes a plurality of audio frames, and an audio segment is used to realize the complete pronunciation of at least one piece of audio content; and determine the target audio frame according to the beat information.
  • when determining the target audio frame according to the beat information, the processing module is specifically used to: obtain the beat count of the audio data according to the beat information, where the beat count represents the number of beats per minute of the melody corresponding to the audio data; and determine the target audio frame based on the timestamps of the audio frames in the audio data and the beat count of the audio data.
  • an electronic device comprising: a processor, and a memory communicatively connected to the processor;
  • the memory stores computer-executable instructions
  • the processor executes the computer-executable instructions stored in the memory to implement the image processing method as described in the first aspect and various possible designs of the first aspect.
  • a computer-readable storage medium stores computer-executable instructions;
  • when a processor executes the computer-executable instructions, the image processing method described in the first aspect and various possible designs of the first aspect is implemented.
  • an embodiment of the present disclosure provides a computer program product, including a computer program, which, when executed by a processor, implements the image processing method as described in the first aspect and various possible designs of the first aspect.
  • an embodiment of the present disclosure provides a computer program, which, when executed by a processor, implements the image processing method as described in the first aspect and various possible designs of the first aspect.
  • the image processing method, apparatus, electronic device, computer-readable storage medium, computer program product, and computer program provided in this embodiment generate, during playback of audio data, a target facial map corresponding to a target audio frame, where the target facial map characterizes a target mouth shape and the target mouth shape corresponds to the audio content of the target audio frame; the target facial map is then displayed in the first facial area of the target image, and the first facial area shows how the mouth shape changes as the audio data is played.
  • the target facial map simulates the target mouth shape corresponding to the target audio content currently being played, so that the mouth shape shown in the facial area of the target image changes as the audio content changes. This imitates the way a real person would sing the audio corresponding to the audio data, allows an audio work to present the display effect of a video work, and improves the richness and diversity of the audio work's displayed content.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present disclosure provide an image processing method and apparatus, an electronic device, a computer-readable storage medium, a computer program product, and a computer program. During playback of audio data, a target facial map corresponding to a target audio frame is generated, where the target facial map represents a target mouth shape and the target mouth shape corresponds to the audio content of the target audio frame; the target facial map is displayed in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played. The target facial map is used to simulate the target mouth shape corresponding to the target audio content currently being played, so that the mouth shape shown in the facial area of the target image changes as the audio content changes, imitating the way a real person would sing or speak the audio corresponding to the audio data. The audio work can thus present the display effect of a video work, improving the richness and diversity of its displayed content.

Description

Image processing method and apparatus, electronic device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202211262215.1, filed with the Chinese Patent Office on October 14, 2022 and entitled "图像处理方法、装置、电子设备及存储介质" (Image processing method and apparatus, electronic device, and storage medium), the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of Internet technology, and in particular to an image processing method, an apparatus, an electronic device, a computer-readable storage medium, a computer program product, and a computer program.
Background
At present, content creation platforms such as short-video applications are popular with users for their rich and diverse content. For example, after a content creator records audio, generates an audio work, and uploads it to the application platform, other users can listen to the audio work through the corresponding application client.
However, in the prior art, an audio work is usually presented by displaying only a static picture in the playback interface of the client while the audio work is played, which results in a single display mode and a poor display effect.
Summary
Embodiments of the present disclosure provide an image processing method, an apparatus, an electronic device, a computer-readable storage medium, a computer program product, and a computer program, to overcome the problem that the displayed content has a single display mode and a poor display effect when an audio work is played.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
during playback of audio data, generating a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and displaying the target facial map in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played.
In a second aspect, an embodiment of the present disclosure provides an image processing apparatus, including:
a processing module, configured to generate, during playback of audio data, a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and
a display module, configured to display the target facial map in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a processor, and a memory communicatively connected to the processor, where
the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions stored in the memory to implement the image processing method according to the first aspect and the various possible designs of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions, where when a processor executes the computer-executable instructions, the image processing method according to the first aspect and the various possible designs of the first aspect is implemented.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program, where when the computer program is executed by a processor, the image processing method according to the first aspect and the various possible designs of the first aspect is implemented.
In a sixth aspect, an embodiment of the present disclosure provides a computer program, where when the computer program is executed by a processor, the image processing method according to the first aspect and the various possible designs of the first aspect is implemented.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of an application scenario of the image processing method provided by an embodiment of the present disclosure.
FIG. 2 is a first schematic flowchart of the image processing method provided by an embodiment of the present disclosure.
FIG. 3 is a schematic diagram of a correspondence between audio frames and audio content provided by an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a target facial map provided by an embodiment of the present disclosure.
FIG. 5 is a flowchart of one possible implementation of step S101 in the embodiment shown in FIG. 2.
FIG. 6 is a schematic diagram of a process of displaying a target facial map provided by an embodiment of the present disclosure.
FIG. 7 is a schematic diagram of a process of alternately magnifying and displaying a target image provided by an embodiment of the present disclosure.
FIG. 8 is a second schematic flowchart of the image processing method provided by an embodiment of the present disclosure.
FIG. 9 is a schematic diagram of a correspondence between pronunciation stages and first lip shape data provided by an embodiment of the present disclosure.
FIG. 10 is a schematic diagram of contour key points provided by an embodiment of the present disclosure.
FIG. 11 is a flowchart of one possible implementation of step S205 in the embodiment shown in FIG. 8.
FIG. 12 is a structural block diagram of the image processing apparatus provided by an embodiment of the present disclosure.
FIG. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
The application scenario of the embodiments of the present disclosure is explained below.
FIG. 1 is a diagram of an application scenario of the image processing method provided by an embodiment of the present disclosure. The image processing method provided by the embodiments of the present disclosure can be applied to audio/video production and audio/video playback scenarios, and more specifically, for example, to a scenario in which a "lip-sync singing" visual effect is triggered through a platform prop (i.e., a function control) while audio/video is played on a short-video platform. As shown in FIG. 1, the method provided by the embodiments of the present disclosure can be applied to a terminal device running an application client for playing video or audio. The terminal device communicates with a server to obtain corresponding media data and play it, where, exemplarily, the media data is audio data or video data that includes audio data. While playing the audio data, the client running on the terminal device displays a static target image in its playback interface; for example, the audio data corresponds to a song, and while playing the audio data, the terminal device displays a photo of the singer of the song in the playback interface.
In the prior art, while playing an audio work, the client usually displays only a static cover picture (a target image), such as a photo of the singer, in the playback interface. Because there is no video data, compared with the presentation of a video work, the display mode is single and the display effect is poor. The embodiments of the present disclosure provide an image processing method that, during playback of audio data, generates dynamic facial maps to simulate the singer's singing or speaking of the audio, thereby achieving a video-like presentation and solving the above problems.
Referring to FIG. 2, FIG. 2 is a first schematic flowchart of the image processing method provided by an embodiment of the present disclosure. The method of this embodiment can be applied to a terminal device, and the image processing method includes:
Step S101: during playback of audio data, generating a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame.
Exemplarily, the method provided by this embodiment may be executed by a terminal device running an application client for playing video and audio, specifically, for example, a short-video client or a music client. The client on the terminal device communicates with a server to obtain media data to be played. In one possible implementation, the media data includes video data and audio data; after the terminal device parses the media data over a video channel and an audio channel, the data corresponding to the video channel and the data corresponding to the audio channel, i.e., the video data and the corresponding audio data, are obtained. In another possible implementation, the media data is audio data, and the terminal device obtains the audio data directly by accessing the server. Further, the audio data is, for example, data corresponding to content such as a song or speech.
Exemplarily, after obtaining the audio data, the terminal device plays it. The audio data consists of at least one audio segment, and each audio segment includes a plurality of audio frames. For a fixed playback duration of the audio data, the number (and duration) of the audio frames is determined by the frame rate of the audio data; the specific correspondence belongs to the prior art and is not repeated here. Further, the audio data corresponds, for example, to a song or a speech, so each audio frame in the audio data corresponds to one pronunciation fragment of that song or speech, and an audio segment composed of a plurality of audio frames can realize the complete pronunciation of one character, digit, letter, word, or beat; these characters, digits, letters, words, and beats are the audio content. FIG. 3 is a schematic diagram of a correspondence between audio frames and audio content provided by an embodiment of the present disclosure. As shown in FIG. 3, the audio data corresponds to a piece of speech whose content is "一、二、开始" ("one, two, start"), where each character corresponds to one audio segment. Exemplarily, the character "一" corresponds to audio segment D1, the character "二" corresponds to audio segment D2, the character "开" corresponds to audio segment D3, and the character "始" corresponds to audio segment D4. Further, each audio segment includes a different number of audio frames; for example, audio segment D1 includes n1 audio frames, audio segment D2 includes n2 audio frames, audio segment D3 includes n3 audio frames, and audio segment D4 includes n4 audio frames. That is, the audio content corresponding to the audio frames in segment D1 is "一", the audio content corresponding to the audio frames in segment D2 is "二", the audio content corresponding to the audio frames in segment D3 is "开", and the audio content corresponding to the audio frames in segment D4 is "始".
In another possible case, the audio content may also be words and the letters that make up words. In this case, each word, or each letter making up a word, corresponds to one audio segment, and each audio segment consists of a plurality of audio frames. The specific implementation is similar to the above embodiment and can be configured as required, and is not repeated here.
Further, each audio segment contains at least one target audio frame; more specifically, for example, the first frame of each audio segment is a target audio frame. During playback of the audio data, when playback reaches a target audio frame, a map corresponding to that target audio frame, i.e., a target facial map, is generated. The target facial map can represent the mouth shape of a real person when uttering the audio content corresponding to that target audio frame. FIG. 4 is a schematic diagram of a target facial map provided by an embodiment of the present disclosure. As shown in FIG. 4, the audio data corresponds to a piece of speech whose content is "一、二、开始!" ("one, two, start!"), where target audio frame Frame_1 corresponds to the character "一", target audio frame Frame_2 corresponds to the character "二", target audio frame Frame_3 corresponds to the character "开", and target audio frame Frame_4 corresponds to the character "始". For each target audio frame, a corresponding target facial map is generated: Frame_1 corresponds to target facial map P1, which is the mouth shape of a real person uttering "一"; likewise, Frame_2 corresponds to P2, the mouth shape for "二"; Frame_3 corresponds to P3, the mouth shape for "开"; and Frame_4 corresponds to P4, the mouth shape for "始". In one possible implementation, the target facial maps corresponding to different target audio frames may be identical; for example, P1 and P2 may be the same.
Further, in one possible implementation, as shown in FIG. 5, the specific implementation of step S101 includes:
Step S1011: obtaining first lip shape data corresponding to the target audio frame, where the first lip shape data represents a mouth shape.
Step S1012: detecting the target image to obtain second lip shape data, where the second lip shape data represents size parameters of the target mouth shape.
Step S1013: generating the target facial map based on the first lip shape data and the second lip shape data.
Exemplarily, first lip shape data, or a network model for generating first lip shape data, is preset in the terminal device or in a cloud server supporting the server side. The first lip shape data represents a mouth shape; more specifically, the first lip shape data may be an image, an identifier, or other description information capable of describing a mouth shape. For example, the first lip shape data may be an image representing the mouth shape when the character "一" is pronounced, an identifier representing the mouth shape when the letter "A" is pronounced, or description information representing the mouth shape when the word "Apple" is pronounced.
After the target audio frame is determined, in one possible implementation, the corresponding first lip shape data may be determined based on preset lip-shape timing mapping information. Exemplarily, the lip-shape timing mapping information represents the first lip shape data corresponding to each audio frame in the audio data. Each piece of audio data corresponds to unique lip-shape timing mapping information, which may be generated in advance based on the specific audio content of the audio data; this is not described in detail here.
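For illustration only, a minimal sketch of such a lookup is given below. The mapping structure, the timestamps, and the resource names are assumptions made for the example; the disclosure only states that per-frame lip-shape timing mapping information is preset for each piece of audio data.

```python
# Hypothetical sketch of preset lip-shape timing mapping information.
# Keys: timestamps (ms) of target audio frames; values: identifiers of
# preset first lip shape data (e.g., lip-shape images) -- all assumed.
lip_timing_map = {
    0: "lip_yi",      # mouth shape for '一'
    361: "lip_er",    # mouth shape for '二'
    722: "lip_kai",   # mouth shape for '开'
    1083: "lip_shi",  # mouth shape for '始'
}

def first_lip_data_for(frame_timestamp_ms: int):
    """Return the preset first lip shape data for a target audio frame."""
    return lip_timing_map.get(frame_timestamp_ms)
```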
Further, since the purpose of generating the target facial map is to simulate the mouth movement of a real user when pronouncing, the size of the target facial map needs to be set so that the size of the target mouth shape matches the size of the portrait in the target image. Specifically, the target image is detected to determine the portrait size in the target image, and second lip shape data representing the size parameters of the target mouth shape is then obtained based on that portrait size. For example, the target mouth shape includes a length parameter and a width parameter. The facial dimensions f1 and f2 of the portrait in the target image are detected, where f1 denotes the face height and f2 denotes the face width; then, according to a preset face-to-mouth scale factor, a length parameter c1 and a width parameter c2 matching the face size are obtained from f1 and f2.
Further, after rendering and processing based on the size parameters (the second lip shape data) and the data representing the target mouth shape (the first lip shape data), a target facial map matching the target image can be obtained.
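A minimal sketch of this sizing step follows. The numeric ratios are hypothetical placeholders, since the disclosure only states that a preset face-to-mouth scale factor is applied to the detected face dimensions f1 and f2.

```python
# Hypothetical sketch: derive second lip shape data (mouth size
# parameters c1, c2) from detected face dimensions via preset ratios.
# The ratio values are illustrative assumptions, not disclosed values.
FACE_TO_MOUTH_LENGTH_RATIO = 0.35  # assumed preset coefficient
FACE_TO_MOUTH_WIDTH_RATIO = 0.12   # assumed preset coefficient

def mouth_size_from_face(f1: float, f2: float) -> tuple[float, float]:
    """f1: face height, f2: face width (pixels); returns (c1, c2)."""
    c1 = f2 * FACE_TO_MOUTH_LENGTH_RATIO  # mouth length matched to face width
    c2 = f1 * FACE_TO_MOUTH_WIDTH_RATIO   # mouth width matched to face height
    return c1, c2
```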
Step S102: displaying the target facial map in a first facial area of the target image, where the first facial area shows how the mouth shape changes as the audio data is played.
Exemplarily, after the target facial map is obtained, it is displayed in the first facial area of the target image. Exemplarily, the target image includes a portrait part, and the first facial area may be the mouth area of the portrait's face, or a facial area that includes the mouth area. FIG. 6 is a schematic diagram of a process of displaying a target facial map provided by an embodiment of the present disclosure. Referring to FIG. 6, when playback reaches the target audio frame, the corresponding target facial map is displayed in the first facial area, covering the original mouth shape in the target image and presenting the target mouth shape represented by the target facial map, thereby simulating the mouth movement of a real user when pronouncing.
In one possible implementation, the target audio frame includes a first audio frame and a second audio frame, the first audio frame and the second audio frame are played alternately, and the specific implementation of step S102 includes:
Step S1021: if the target audio frame is the first audio frame, displaying the target image and the target facial map located in the first facial area based on a first magnification factor.
Step S1022: if the target audio frame is the second audio frame, displaying the target image and the target facial map located in the first facial area based on a second magnification factor, where the first magnification factor is different from the second magnification factor.
Exemplarily, each target audio frame corresponds to one mouth movement, and as playback reaches different target audio frames in sequence, the displayed mouth movement switches. On this basis, in this embodiment, the target audio frames are divided into first audio frames and second audio frames. When playback reaches a first audio frame or a second audio frame, the target image and the target facial map in the first facial area are alternately magnified and displayed with different magnification factors, so that the target image shown in the playback interface presents a camera zoom-out and zoom-in effect in rhythm, improving the visual expressiveness of the target image. In one implementation, the first audio frames are the odd-numbered target audio frames among all target audio frames of the audio data, and the second audio frames are the even-numbered target audio frames. The first magnification factor and the second magnification factor may be scale factors applied to the diagonal length of the target image, or scale factors applied to the area of the target image. The first magnification factor and the second magnification factor are real numbers greater than 0; a value greater than 1 indicates enlargement, and a value less than 1 indicates reduction.
FIG. 7 is a schematic diagram of a process of alternately magnifying and displaying a target image provided by an embodiment of the present disclosure. As shown in FIG. 7, when the target audio frame is Frame_1, it is a first audio frame, and the target image and the target facial map in the first facial area are displayed at a magnification factor of 1 (the first magnification factor), i.e., at their original size. When the target audio frame is Frame_2, it is a second audio frame, and the target image and the target facial map in the first facial area are displayed at a magnification factor of 1.2 (the second magnification factor). When the target audio frame is Frame_3, it is a first audio frame and is handled in the same way as Frame_1, which is not repeated here. In this embodiment, by displaying the first audio frames and the second audio frames with different magnification factors, the target image and the target facial map present a camera zoom-out and zoom-in effect in rhythm, improving the visual expressiveness of the target image.
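The alternating zoom can be sketched as follows. The 1.0 and 1.2 factors come from the FIG. 7 example; the render callback is a placeholder, as the disclosure does not name a rendering interface.

```python
# Sketch of the alternating zoom from the FIG. 7 example.
# `render` is a hypothetical placeholder for the client's draw call.
FIRST_FACTOR = 1.0   # first magnification factor (odd target frames)
SECOND_FACTOR = 1.2  # second magnification factor (even target frames)

def display_target_frame(index: int, target_image, facial_map, render):
    """index: 1-based position of the target audio frame in the sequence."""
    factor = FIRST_FACTOR if index % 2 == 1 else SECOND_FACTOR
    render(target_image, facial_map, scale=factor)
```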
In this embodiment, during playback of audio data, a target facial map corresponding to a target audio frame is generated, where the target facial map represents a target mouth shape and the target mouth shape corresponds to the audio content of the target audio frame; the target facial map is displayed in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played. The target facial map is used to simulate the target mouth shape corresponding to the target audio content currently being played, so that the mouth shape shown in the facial area of the target image changes as the audio content changes, imitating the way a real person would sing the audio corresponding to the audio data, enabling the audio work to present the display effect of a video work and improving the richness and diversity of the displayed content of the audio work.
Referring to FIG. 8, FIG. 8 is a second schematic flowchart of the image processing method provided by an embodiment of the present disclosure. On the basis of the embodiment shown in FIG. 2, this embodiment further refines step S101 and adds a step of determining the target audio frame. The image processing method provided by this embodiment can be applied to a scenario in which the content corresponding to the audio data is a song, and the image processing method includes:
Step S201: obtaining beat information of the audio data, where the beat information represents a time interval between at least two audio segments during playback of the audio data, each audio segment includes a plurality of audio frames, and an audio segment is used to realize the complete pronunciation of at least one piece of audio content.
Step S202: determining the target audio frame according to the beat information.
Exemplarily, the beat information describes the rhythmic characteristics of the melody of the song corresponding to the audio data; more specifically, the beat information may represent how fast the rhythm of the melody is. Exemplarily, when the rhythm of the melody is fast, the interval between audio segments is short, i.e., the time interval between pieces of audio content is short; conversely, when the rhythm of the melody is slow, the interval between audio segments is long, i.e., the time interval between pieces of audio content is long, where the audio content is, for example, the lyrics of the song. In one possible implementation, the beat information may be a specific identifier or number indicating a fixed interval length; for example, beat information of 300 ms (milliseconds) indicates that two audio segments are 300 ms apart, and according to this beat information a target audio frame is determined every 300 ms in the audio data. The process of determining the target audio frames may be completed before the audio data is played or while it is being played, as required. Then, based on these uniformly distributed target audio frames, the corresponding target facial maps are generated and displayed, so that the mouth shape shown in the first facial area of the target image changes dynamically at a fixed period (300 ms). The beat information may be preset information corresponding to the audio data; the way of obtaining it is not repeated here.
In one possible implementation, the specific implementation of step S202 includes:
Step S2021: obtaining, according to the beat information, the beat count of the audio data, where the beat count represents the number of beats per minute of the melody corresponding to the audio data.
Step S2022: determining the target audio frame based on the timestamps of the audio frames in the audio data and the beat count of the audio data.
Exemplarily, the beat count (beats per minute, BPM) is the number of beats sounded within one minute. The melody corresponding to the audio data has a fixed beat count; for example, a beat count of 166 means that the melody has 166 beats per minute, so the corresponding beat length is 60/166 of a second, i.e., approximately 361 ms. In one possible implementation, before the audio data is played, starting from the timestamp of the audio frame corresponding to the first beat and accumulating with the beat length as a fixed period, the timestamps of all target audio frames in the audio data can be determined based on the audio-frame timestamps; all target audio frames are then obtained by recording these timestamps. In another possible implementation, during playback of the audio data, the timestamp of the current audio frame is obtained; if that timestamp equals the timestamp of the first beat's audio frame plus an integer multiple of the beat length, the current audio frame is determined as a target audio frame, and the subsequent steps are performed synchronously to generate the target facial map. The first beat may be set according to user needs.
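As a sketch, the beat length and the resulting target-frame timestamps could be computed as below; the first-beat timestamp and the audio duration are assumed inputs, and the example BPM of 166 reproduces the approximately 361 ms beat length mentioned above.

```python
# Sketch: derive target-audio-frame timestamps from the beat count (BPM).
# first_beat_ms: timestamp of the audio frame of the first beat (assumed
# to be configured by the user); duration_ms: total playback length.
def target_frame_timestamps(bpm: float, first_beat_ms: float,
                            duration_ms: float) -> list[float]:
    beat_length_ms = 60_000.0 / bpm  # e.g. 60000 / 166 ≈ 361 ms
    timestamps = []
    t = first_beat_ms
    while t <= duration_ms:
        timestamps.append(t)
        t += beat_length_ms
    return timestamps

# Example: a 10-second clip at 166 BPM, first beat at t = 0 ms.
print(target_frame_timestamps(166, 0, 10_000)[:4])
```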
In another possible implementation, for audio data with a varying rhythm, the beat information includes a beat sequence, the beat sequence includes a plurality of beat nodes, and each beat node indicates the timestamp of one target audio frame; that is, the beat sequence is the set of timestamps of the target audio frames. The beat sequence may be preset by the user based on the content of the audio data. Based on the beat sequence, the target audio frames may be non-uniformly distributed. For example, the melody of the first half of the audio data has a slow rhythm and includes 50 target audio frames, while the melody of the second half has a fast rhythm and includes 100 target audio frames. That is, when the rhythm of the melody becomes faster, the density of target audio frames becomes higher, so the mouth shape in the target image is updated faster; when the rhythm becomes slower, the density of target audio frames becomes lower, so the mouth shape in the target image is updated more slowly. The density of target audio frames thus changes with the rhythm of the melody, and accordingly the rate at which the target facial map displayed in the target image changes also follows the rhythm of the melody, which is closer to the pronunciation process of a real user and improves the visual expressiveness of the target image.
In this embodiment, by obtaining the beat information corresponding to the audio data and determining the target audio frames based on it, the target audio frames match the rhythm of the melody of the audio data, so that the changes of the subsequently generated target facial maps match the rhythm of the melody, improving the visual presentation of the target facial maps.
Step S203: obtaining target semantic information corresponding to the target audio frame, where the target semantic information represents the audio content of the target audio frame.
Step S204: processing the target semantic information based on a pre-trained generative adversarial network (GAN) to obtain the first lip shape data.
Exemplarily, after the target audio frame is determined, the target semantic information corresponding to the target audio frame is obtained. The target semantic information may be preset information stored in the audio data and is used to represent the type of audio content corresponding to the audio frame; for example, target semantic information #001 represents the Chinese character "开", and target semantic information #002 represents the Chinese character "始".
Then, the target semantic information is input into the pre-trained generative adversarial network, and the generative capability of the pre-trained network is used to generate the corresponding pronunciation lip-shape picture, identifier, description information, etc., for example, a picture of the lip shape for pronouncing the Chinese character "开". The generative adversarial network may be obtained by training with pronunciation lip-shape pictures annotated with semantic information as training samples; the specific training process is not repeated here.
In one possible implementation, the target semantic information includes text information and a corresponding pronunciation stage identifier, where the text information represents the target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents the target pronunciation stage of the target audio frame with respect to the target text. Further, the specific implementation of step S204 includes: inputting the text information and the corresponding pronunciation stage identifier into the generative adversarial network to obtain the first lip shape data.
Exemplarily, the actual pronunciation process of a real user is a continuous process during which the mouth shape changes continuously; therefore, to represent this process more accurately, multiple frames of facial maps are needed. Specifically, the target semantic information includes text information and a corresponding pronunciation stage identifier. For example, the target semantic information is an array including a first field representing the text information and a second field representing the pronunciation stage identifier, where the content of the first field is "GB2312", representing the Chinese character "开", and the content of the second field is "stage_1", representing the mouth shape of the first pronunciation stage of "开". The text information and the corresponding pronunciation stage identifier are then input into the generative adversarial network to obtain first lip shape data that matches both the character "开" (the text information) and "the mouth shape of the first pronunciation stage" (the pronunciation stage identifier).
The target text corresponds to multiple pronunciation stages. FIG. 9 is a schematic diagram of a correspondence between pronunciation stages and first lip shape data provided by an embodiment of the present disclosure. In this embodiment, the first lip shape data is a picture representing a mouth shape. Referring to FIG. 9, for the same text information (text_001), for example representing the Chinese character "开": when the pronunciation stage identifier corresponding to the target audio frame is stage_1, corresponding to the first pronunciation stage of "开", the first lip shape data generated by the generative adversarial network is P1; when the pronunciation stage identifier is stage_2, corresponding to the second pronunciation stage of "开", the first lip shape data generated is P2; similarly, when the pronunciation stage identifier is stage_3 or stage_4, the first lip shape data generated is P3 or P4.
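A sketch of the inference call is shown below. The function name, its input encoding, and the returned file name are assumptions for illustration; the disclosure only specifies that the text information and the pronunciation stage identifier are fed to a pre-trained generative adversarial network that outputs the first lip shape data.

```python
# Hypothetical sketch of querying the pre-trained generative adversarial
# network for first lip shape data; generate_lip_image is a placeholder
# for the real generator, not an API of any actual library.
def generate_lip_image(text_info: str, stage_id: str) -> str:
    """Return (a reference to) a lip-shape picture for the given target
    text and pronunciation stage, e.g. by running the GAN generator
    conditioned on both inputs."""
    # Stand-in for the network forward pass; a real implementation would
    # encode text_info and stage_id as conditioning inputs to the generator.
    return f"lip_{text_info}_{stage_id}.png"

# Example from FIG. 9: text_001 (the character '开') at its first stage.
first_lip_data = generate_lip_image("text_001", "stage_1")
```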
In this embodiment, by obtaining the text information and the corresponding pronunciation stage identifier of the target audio frame and generating the corresponding first lip shape data based on them, the differences in mouth shape at different pronunciation stages are further refined, improving the granularity with which the first facial area of the target image shows mouth shape changes and improving the visual expressiveness.
Step S205: detecting the target image to obtain second lip shape data, where the second lip shape data represents size parameters of the target mouth shape.
Exemplarily, the specific implementation of step S205 includes:
Step S2051: performing mouth feature recognition based on the target image to obtain contour key points, where the contour key points represent the length and width of the mouth contour in the target image.
Step S2052: obtaining the second lip shape data based on the coordinates of the contour key points.
FIG. 10 is a schematic diagram of contour key points provided by an embodiment of the present disclosure, and the above steps are described below with reference to FIG. 10. Exemplarily, after the target image is obtained, mouth feature recognition is performed on the portrait part of the image to obtain the contour key points. As shown in the figure, the contour key points may exemplarily include the leftmost end point D1 of the mouth contour, the rightmost end point D2 of the mouth contour, the uppermost end point D3 of the mouth contour, and the lowermost end point D4 of the mouth contour. The second lip shape data is then obtained based on the coordinates of these contour key points, where, exemplarily, the second lip shape data represents the length value and width value of the mouth shape, or the width-to-length ratio of the mouth shape.
FIG. 11 shows one possible implementation of step S205. As shown in FIG. 11, optionally, after step S2051, the method further includes:
Step S2051A: obtaining the head turning angle of the portrait in the target image.
Correspondingly, after step S2051A is performed, the specific implementation of step S2052 is: obtaining the second lip shape data based on the coordinates of the contour key points and the head turning angle.
Exemplarily, the head turning angle is the angle between the plane of the person's face in the target image and the plane of the screen. The head turning angle may also be obtained by performing view detection on the target image; the specific implementation belongs to the prior art and is not repeated here. After the head turning angle is obtained, the second lip shape data is calculated based on the coordinates of the contour key points and the head turning angle, as shown in Formula (1):
where mouthDis is the second lip shape data, representing the width-to-length ratio of the mouth shape; D3.y is the vertical coordinate of contour key point D3; D4.y is the vertical coordinate of contour key point D4; D1.x is the horizontal coordinate of contour key point D1; D2.x is the horizontal coordinate of contour key point D2; and yaw is the head turning angle.
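Formula (1) itself is not reproduced in this text. Based only on the variable definitions above, one plausible reconstruction, offered purely as an assumption, divides the vertical mouth extent (D3–D4) by the yaw-corrected horizontal extent (D1–D2):

```python
import math

# Hedged reconstruction of Formula (1): the exact expression is not shown
# in this text, so the placement of the yaw correction and the use of
# degrees for yaw are assumptions. mouthDis is the width-to-length ratio
# of the mouth shape computed from contour key points D1-D4.
def mouth_dis(d1_x: float, d2_x: float, d3_y: float, d4_y: float,
              yaw_deg: float) -> float:
    vertical = abs(d3_y - d4_y)        # mouth opening height (D3 to D4)
    horizontal = abs(d1_x - d2_x)      # corner-to-corner length (D1 to D2)
    # Undo the horizontal foreshortening caused by the head turning angle.
    corrected_length = horizontal / math.cos(math.radians(yaw_deg))
    return vertical / corrected_length
```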
Step S206: generating the target facial map based on the first lip shape data and the second lip shape data.
Step S207: if the target audio frame is the first audio frame, displaying the target image and the target facial map located in the first facial area based on a first magnification factor; if the target audio frame is the second audio frame, displaying the target image and the target facial map located in the first facial area based on a second magnification factor, where the first magnification factor is different from the second magnification factor.
Exemplarily, after the first lip shape data and the second lip shape data are obtained, they are used as inputs for processing or rendering, so that the corresponding target facial map is generated. Then, based on the result of image detection performed on the target image, the position of the first facial area (for example, the mouth area) in the target image is determined, the target mouth shape in the target facial map is aligned with the mouth area in the target image, and rendering is performed so that the target facial map is displayed over the target image, thereby imitating the mouth shape of a real person pronouncing.
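The alignment and overlay step can be sketched as follows. Pillow is used only for illustration, and the mouth bounding box is assumed to come from the image-detection step; the disclosure does not name a rendering library or box format.

```python
from PIL import Image

# Sketch: align the target facial map with the detected mouth area and
# composite it over the target image. mouth_box is an assumed output of
# the image-detection step: (left, top, right, bottom) in pixels.
def overlay_facial_map(target_image: Image.Image, facial_map: Image.Image,
                       mouth_box: tuple[int, int, int, int]) -> Image.Image:
    left, top, right, bottom = mouth_box
    resized = facial_map.resize((right - left, bottom - top))  # match mouth size
    out = target_image.copy()
    out.paste(resized, (left, top), resized.convert("RGBA"))   # alpha composite
    return out
```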
Further, based on the type of the target audio frame, different magnification factors are used to perform a secondary magnification for the first audio frames and the second audio frames, so that the target image and the target facial map present a camera zoom-out and zoom-in effect in rhythm, improving the visual expressiveness of the target image. For the specific implementation, reference may be made to the related description of the embodiment corresponding to FIG. 7, which is not repeated here.
It should be noted that, in this embodiment, the target audio frames (including the first audio frames and the second audio frames) are determined based on the beat information, so the density of the target audio frames can change with the rhythm of the melody. While making the target facial map (the mouth shape) change with the rhythm of the melody, this solution also makes the zoom-out and zoom-in effect (i.e., the frequency of zooming out and in) follow the rhythm of the melody when the first audio frames and the second audio frames are alternately magnified and displayed, improving the visual expressiveness of the target image.
Corresponding to the image processing method of the above embodiments, FIG. 12 is a structural block diagram of the image processing apparatus provided by an embodiment of the present disclosure. For ease of description, only the parts related to the embodiments of the present disclosure are shown. Referring to FIG. 12, the image processing apparatus 3 includes:
a processing module 31, configured to generate, during playback of audio data, a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and
a display module 32, configured to display the target facial map in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played.
In an embodiment of the present disclosure, the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately; the display module 32 is specifically configured to: if the target audio frame is the first audio frame, display the target image and the target facial map located in the first facial area based on a first magnification factor; and if the target audio frame is the second audio frame, display the target image and the target facial map located in the first facial area based on a second magnification factor, where the first magnification factor is different from the second magnification factor.
In an embodiment of the present disclosure, when generating the target facial map corresponding to the target audio frame, the processing module 31 is specifically configured to: obtain first lip shape data corresponding to the target audio frame, where the first lip shape data represents a mouth shape; detect the target image to obtain second lip shape data, where the second lip shape data represents size parameters of the target mouth shape; and generate the target facial map based on the first lip shape data and the second lip shape data.
In an embodiment of the present disclosure, when obtaining the first lip shape data corresponding to the target audio frame, the processing module 31 is specifically configured to: obtain target semantic information corresponding to the target audio frame, where the target semantic information represents the audio content of the target audio frame; and process the target semantic information based on a pre-trained generative adversarial network to obtain the first lip shape data.
In an embodiment of the present disclosure, the target semantic information includes text information and a corresponding pronunciation stage identifier, where the text information represents the target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents the target pronunciation stage of the target audio frame with respect to the target text; when processing the target semantic information based on the pre-trained generative adversarial network to obtain the first lip shape data, the processing module 31 is specifically configured to: input the text information and the corresponding pronunciation stage identifier into the generative adversarial network to obtain the first lip shape data.
In an embodiment of the present disclosure, when detecting the target image to obtain the second lip shape data, the processing module 31 is specifically configured to: perform mouth feature recognition based on the target image to obtain contour key points, where the contour key points represent the length and width of the mouth contour in the target image; and obtain the second lip shape data based on the coordinates of the contour key points.
In an embodiment of the present disclosure, the processing module 31 is further configured to: obtain the head turning angle of the portrait in the target image; and when obtaining the second lip shape data based on the coordinates of the contour key points, the processing module 31 is specifically configured to: obtain the second lip shape data based on the coordinates of the contour key points and the head turning angle.
In an embodiment of the present disclosure, before generating the target facial map corresponding to the target audio frame, the processing module 31 is further configured to: obtain beat information of the audio data, where the beat information represents a time interval between at least two audio segments during playback of the audio data, each audio segment includes a plurality of audio frames, and an audio segment is used to realize the complete pronunciation of at least one piece of audio content; and determine the target audio frame according to the beat information.
In an embodiment of the present disclosure, when determining the target audio frame according to the beat information, the processing module 31 is specifically configured to: obtain, according to the beat information, the beat count of the audio data, where the beat count represents the number of beats per minute of the melody corresponding to the audio data; and determine the target audio frame based on the timestamps of the audio frames in the audio data and the beat count of the audio data.
The processing module 31 and the display module 32 are connected. The image processing apparatus 3 provided in this embodiment can execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 13, the electronic device 4 includes:
a processor 41, and a memory 42 communicatively connected to the processor 41, where
the memory 42 stores computer-executable instructions; and
the processor 41 executes the computer-executable instructions stored in the memory 42 to implement the image processing method of the embodiments shown in FIGS. 2 to 11.
Optionally, the processor 41 and the memory 42 are connected via a bus 43.
For related descriptions, reference may be made to the descriptions and effects of the corresponding steps in the embodiments of FIGS. 2 to 11, which are not repeated here.
Referring to FIG. 14, it shows a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure. The electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Device, PAD), portable multimedia players (PMPs), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 14 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 14, the electronic device 900 may include a processing apparatus (for example, a central processing unit or a graphics processor) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage apparatus 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following apparatuses may be connected to the I/O interface 905: an input apparatus 906 including, for example, a touchscreen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output apparatus 907 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 908 including, for example, a magnetic tape and a hard disk; and a communication apparatus 909. The communication apparatus 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 14 shows the electronic device 900 with various apparatuses, it should be understood that it is not required to implement or include all of the illustrated apparatuses; more or fewer apparatuses may alternatively be implemented or included.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 909, or installed from the storage apparatus 908, or installed from the ROM 902. When the computer program is executed by the processing apparatus 901, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium containing or storing a program, where the program may be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and it may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: a wire, an optical cable, radio frequency (RF), etc., or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to perform the methods shown in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not in some cases constitute a limitation on the unit itself; for example, a first obtaining unit may also be described as "a unit for obtaining at least two Internet Protocol addresses".
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an EPROM or flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
In a first aspect, according to one or more embodiments of the present disclosure, an image processing method is provided, including:
during playback of audio data, generating a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and displaying the target facial map in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played.
According to one or more embodiments of the present disclosure, the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately; the displaying the target facial map in the first facial area of the target image includes: if the target audio frame is the first audio frame, displaying the target image and the target facial map located in the first facial area based on a first magnification factor; and if the target audio frame is the second audio frame, displaying the target image and the target facial map located in the first facial area based on a second magnification factor, where the first magnification factor is different from the second magnification factor.
According to one or more embodiments of the present disclosure, the generating a target facial map corresponding to a target audio frame includes: obtaining first lip shape data corresponding to the target audio frame, where the first lip shape data represents a mouth shape; detecting the target image to obtain second lip shape data, where the second lip shape data represents size parameters of the target mouth shape; and generating the target facial map based on the first lip shape data and the second lip shape data.
According to one or more embodiments of the present disclosure, the obtaining first lip shape data corresponding to the target audio frame includes: obtaining target semantic information corresponding to the target audio frame, where the target semantic information represents the audio content of the target audio frame; and processing the target semantic information based on a pre-trained generative adversarial network to obtain the first lip shape data.
According to one or more embodiments of the present disclosure, the target semantic information includes text information and a corresponding pronunciation stage identifier, where the text information represents the target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents the target pronunciation stage of the target audio frame with respect to the target text; the processing the target semantic information based on a pre-trained generative adversarial network to obtain the first lip shape data includes: inputting the text information and the corresponding pronunciation stage identifier into the generative adversarial network to obtain the first lip shape data.
According to one or more embodiments of the present disclosure, the detecting the target image to obtain second lip shape data includes: performing mouth feature recognition based on the target image to obtain contour key points, where the contour key points represent the length and width of the mouth contour in the target image; and obtaining the second lip shape data based on the coordinates of the contour key points.
According to one or more embodiments of the present disclosure, the method further includes: obtaining the head turning angle of the portrait in the target image; and the obtaining the second lip shape data based on the coordinates of the contour key points includes: obtaining the second lip shape data based on the coordinates of the contour key points and the head turning angle.
According to one or more embodiments of the present disclosure, before the generating a target facial map corresponding to a target audio frame, the method further includes: obtaining beat information of the audio data, where the beat information represents a time interval between at least two audio segments during playback of the audio data, each audio segment includes a plurality of audio frames, and an audio segment is used to realize the complete pronunciation of at least one piece of audio content; and determining the target audio frame according to the beat information.
According to one or more embodiments of the present disclosure, the determining the target audio frame according to the beat information includes: obtaining, according to the beat information, the beat count of the audio data, where the beat count represents the number of beats per minute of the melody corresponding to the audio data; and determining the target audio frame based on the timestamps of the audio frames in the audio data and the beat count of the audio data.
In a second aspect, according to one or more embodiments of the present disclosure, an image processing apparatus is provided, including:
a processing module, configured to generate, during playback of audio data, a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and
a display module, configured to display the target facial map in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played.
According to one or more embodiments of the present disclosure, the target audio frame includes a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately; the display module is specifically configured to: if the target audio frame is the first audio frame, display the target image and the target facial map located in the first facial area based on a first magnification factor; and if the target audio frame is the second audio frame, display the target image and the target facial map located in the first facial area based on a second magnification factor, where the first magnification factor is different from the second magnification factor.
According to one or more embodiments of the present disclosure, when generating the target facial map corresponding to the target audio frame, the processing module is specifically configured to: obtain first lip shape data corresponding to the target audio frame, where the first lip shape data represents a mouth shape; detect the target image to obtain second lip shape data, where the second lip shape data represents size parameters of the target mouth shape; and generate the target facial map based on the first lip shape data and the second lip shape data.
According to one or more embodiments of the present disclosure, when obtaining the first lip shape data corresponding to the target audio frame, the processing module is specifically configured to: obtain target semantic information corresponding to the target audio frame, where the target semantic information represents the audio content of the target audio frame; and process the target semantic information based on a pre-trained generative adversarial network to obtain the first lip shape data.
According to one or more embodiments of the present disclosure, the target semantic information includes text information and a corresponding pronunciation stage identifier, where the text information represents the target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents the target pronunciation stage of the target audio frame with respect to the target text; when processing the target semantic information based on the pre-trained generative adversarial network to obtain the first lip shape data, the processing module is specifically configured to: input the text information and the corresponding pronunciation stage identifier into the generative adversarial network to obtain the first lip shape data.
According to one or more embodiments of the present disclosure, when detecting the target image to obtain the second lip shape data, the processing module is specifically configured to: perform mouth feature recognition based on the target image to obtain contour key points, where the contour key points represent the length and width of the mouth contour in the target image; and obtain the second lip shape data based on the coordinates of the contour key points.
According to one or more embodiments of the present disclosure, the processing module is further configured to: obtain the head turning angle of the portrait in the target image; and when obtaining the second lip shape data based on the coordinates of the contour key points, the processing module is specifically configured to: obtain the second lip shape data based on the coordinates of the contour key points and the head turning angle.
According to one or more embodiments of the present disclosure, before generating the target facial map corresponding to the target audio frame, the processing module is further configured to: obtain beat information of the audio data, where the beat information represents a time interval between at least two audio segments during playback of the audio data, each audio segment includes a plurality of audio frames, and an audio segment is used to realize the complete pronunciation of at least one piece of audio content; and determine the target audio frame according to the beat information.
According to one or more embodiments of the present disclosure, when determining the target audio frame according to the beat information, the processing module is specifically configured to: obtain, according to the beat information, the beat count of the audio data, where the beat count represents the number of beats per minute of the melody corresponding to the audio data; and determine the target audio frame based on the timestamps of the audio frames in the audio data and the beat count of the audio data.
In a third aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, including: a processor, and a memory communicatively connected to the processor, where
the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions stored in the memory to implement the image processing method according to the first aspect and the various possible designs of the first aspect.
In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, the computer-readable storage medium storing computer-executable instructions, where when a processor executes the computer-executable instructions, the image processing method according to the first aspect and the various possible designs of the first aspect is implemented.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program, where when the computer program is executed by a processor, the image processing method according to the first aspect and the various possible designs of the first aspect is implemented.
In a sixth aspect, an embodiment of the present disclosure provides a computer program, where when the computer program is executed by a processor, the image processing method according to the first aspect and the various possible designs of the first aspect is implemented.
The image processing method, apparatus, electronic device, computer-readable storage medium, computer program product, and computer program provided in these embodiments generate, during playback of audio data, a target facial map corresponding to a target audio frame, where the target facial map represents a target mouth shape and the target mouth shape corresponds to the audio content of the target audio frame, and display the target facial map in a first facial area of a target image, where the first facial area shows how the mouth shape changes as the audio data is played. The target facial map is used to simulate the target mouth shape corresponding to the target audio content currently being played, so that the mouth shape shown in the facial area of the target image changes as the audio content changes, imitating the way a real person would sing the audio corresponding to the audio data, enabling the audio work to present the display effect of a video work and improving the richness and diversity of the displayed content of the audio work.
The above description covers only the preferred embodiments of the present disclosure and the technical principles employed. A person skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that these operations be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (14)

  1. An image processing method, comprising:
    during playback of audio data, generating a target facial map corresponding to a target audio frame, wherein the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and
    displaying the target facial map in a first facial area of a target image, wherein the first facial area is used to show changes in the mouth shape as the audio data is played.
  2. The method according to claim 1, wherein the target audio frame comprises a first audio frame and a second audio frame, and the first audio frame and the second audio frame are played alternately;
    the displaying the target facial map in a first facial area of a target image comprises:
    if the target audio frame is the first audio frame, displaying the target image and the target facial map located in the first facial area based on a first magnification factor; and
    if the target audio frame is the second audio frame, displaying the target image and the target facial map located in the first facial area based on a second magnification factor,
    wherein the first magnification factor is different from the second magnification factor.
  3. The method according to claim 1 or 2, wherein the generating a target facial map corresponding to a target audio frame comprises:
    obtaining first lip shape data corresponding to the target audio frame, wherein the first lip shape data is used to represent a mouth shape;
    detecting the target image to obtain second lip shape data, wherein the second lip shape data represents size parameters of the target mouth shape; and
    generating the target facial map based on the first lip shape data and the second lip shape data.
  4. The method according to claim 3, wherein the obtaining first lip shape data corresponding to the target audio frame comprises:
    obtaining target semantic information corresponding to the target audio frame, wherein the target semantic information represents the audio content of the target audio frame; and
    processing the target semantic information based on a pre-trained generative adversarial network to obtain the first lip shape data.
  5. The method according to claim 4, wherein the target semantic information comprises text information and a corresponding pronunciation stage identifier, wherein the text information represents a target text corresponding to the audio content of the target audio frame, and the pronunciation stage identifier represents a target pronunciation stage of the target audio frame with respect to the target text;
    the processing the target semantic information based on a pre-trained generative adversarial network to obtain the first lip shape data comprises:
    inputting the text information and the corresponding pronunciation stage identifier into the generative adversarial network to obtain the first lip shape data.
  6. The method according to any one of claims 3 to 5, wherein the detecting the target image to obtain second lip shape data comprises:
    performing mouth feature recognition based on the target image to obtain contour key points, wherein the contour key points are used to represent the length and width of the mouth contour in the target image; and
    obtaining the second lip shape data based on the coordinates of the contour key points.
  7. The method according to claim 6, wherein the method further comprises:
    obtaining a head turning angle of the portrait in the target image;
    the obtaining the second lip shape data based on the coordinates of the contour key points comprises:
    obtaining the second lip shape data based on the coordinates of the contour key points and the head turning angle.
  8. The method according to any one of claims 1 to 7, wherein before the generating a target facial map corresponding to a target audio frame, the method further comprises:
    obtaining beat information of the audio data, wherein the beat information represents a time interval between at least two audio segments during playback of the audio data, the audio segment comprises a plurality of audio frames, and the audio segment is used to realize the complete pronunciation of at least one piece of audio content; and
    determining the target audio frame according to the beat information.
  9. The method according to claim 8, wherein the determining the target audio frame according to the beat information comprises:
    obtaining, according to the beat information, a beat count of the audio data, wherein the beat count represents the number of beats per minute of the melody corresponding to the audio data; and
    determining the target audio frame based on the timestamps of the audio frames in the audio data and the beat count of the audio data.
  10. An image processing apparatus, comprising:
    a processing module, configured to generate, during playback of audio data, a target facial map corresponding to a target audio frame, wherein the target facial map is used to represent a target mouth shape, and the target mouth shape corresponds to the audio content of the target audio frame; and
    a display module, configured to display the target facial map in a first facial area of a target image, wherein the first facial area is used to show changes in the mouth shape as the audio data is played.
  11. An electronic device, comprising: a processor, and a memory communicatively connected to the processor, wherein
    the memory stores computer-executable instructions; and
    the processor executes the computer-executable instructions stored in the memory to implement the image processing method according to any one of claims 1 to 9.
  12. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the image processing method according to any one of claims 1 to 9 is implemented.
  13. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the image processing method according to any one of claims 1 to 9 is implemented.
  14. A computer program, wherein when the computer program is executed by a processor, the image processing method according to any one of claims 1 to 9 is implemented.
PCT/CN2023/120412 2022-10-14 2023-09-21 图像处理方法、装置、电子设备及存储介质 WO2024078293A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211262215.1A CN115619897A (zh) 2022-10-14 2022-10-14 图像处理方法、装置、电子设备及存储介质
CN202211262215.1 2022-10-14

Publications (1)

Publication Number Publication Date
WO2024078293A1 true WO2024078293A1 (zh) 2024-04-18

Family

ID=84862488

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/120412 WO2024078293A1 (zh) 2022-10-14 2023-09-21 图像处理方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN115619897A (zh)
WO (1) WO2024078293A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115619897A (zh) * 2022-10-14 2023-01-17 北京字跳网络技术有限公司 图像处理方法、装置、电子设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886643A (zh) * 2021-09-30 2022-01-04 深圳追一科技有限公司 数字人视频生成方法、装置、电子设备和存储介质
CN113987269A (zh) * 2021-09-30 2022-01-28 深圳追一科技有限公司 数字人视频生成方法、装置、电子设备和存储介质
CN115619897A (zh) * 2022-10-14 2023-01-17 北京字跳网络技术有限公司 图像处理方法、装置、电子设备及存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886643A (zh) * 2021-09-30 2022-01-04 深圳追一科技有限公司 数字人视频生成方法、装置、电子设备和存储介质
CN113987269A (zh) * 2021-09-30 2022-01-28 深圳追一科技有限公司 数字人视频生成方法、装置、电子设备和存储介质
CN115619897A (zh) * 2022-10-14 2023-01-17 北京字跳网络技术有限公司 图像处理方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN115619897A (zh) 2023-01-17

Similar Documents

Publication Publication Date Title
US11158102B2 (en) Method and apparatus for processing information
WO2020253806A1 (zh) 展示视频的生成方法、装置、设备及存储介质
WO2020113733A1 (zh) 动画生成方法、装置、电子设备及计算机可读存储介质
US11514923B2 (en) Method and device for processing music file, terminal and storage medium
CN113365134B (zh) 音频分享方法、装置、设备及介质
CN109474850B (zh) 运动像素视频特效添加方法、装置、终端设备及存储介质
JP7199527B2 (ja) 画像処理方法、装置、ハードウェア装置
CN111899706A (zh) 音频制作方法、装置、设备及存储介质
WO2024078293A1 (zh) 图像处理方法、装置、电子设备及存储介质
WO2022007565A1 (zh) 增强现实的图像处理方法、装置、电子设备及存储介质
CN109600559B (zh) 一种视频特效添加方法、装置、终端设备及存储介质
JP7427792B2 (ja) ビデオエフェクト処理方法及び装置
WO2021057740A1 (zh) 视频生成方法、装置、电子设备和计算机可读介质
US11886484B2 (en) Music playing method and apparatus based on user interaction, and device and storage medium
CN108986841A (zh) 音频信息处理方法、装置及存储介质
WO2023051246A1 (zh) 视频录制方法、装置、设备及存储介质
JP2013161205A (ja) 情報処理装置、情報処理方法、及びプログラム
CN113821189A (zh) 音频播放方法、装置、终端设备及存储介质
WO2024046360A1 (zh) 媒体内容处理方法、装置、设备、可读存储介质及产品
WO2023061229A1 (zh) 视频生成方法及设备
CN112883223A (zh) 音频展示方法、装置、电子设备及计算机存储介质
CN111028823A (zh) 音频生成方法、装置、计算机可读存储介质及计算设备
WO2024131585A1 (zh) 视频特效显示方法、装置、电子设备及存储介质
WO2024082948A1 (zh) 多媒体数据处理方法、装置、设备及介质
WO2024066790A1 (zh) 音频处理方法、装置及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876498

Country of ref document: EP

Kind code of ref document: A1