WO2021213008A1 - Audio-visual matching method for video, related apparatus, and storage medium - Google Patents

Audio-visual matching method for video, related apparatus, and storage medium

Info

Publication number
WO2021213008A1
Authority
WO
WIPO (PCT)
Prior art keywords
segment
duration
matched
voice
initial position
Application number
PCT/CN2021/078367
Other languages
English (en)
French (fr)
Inventor
凌永根
黄浩智
沈力
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Priority to EP21792639.3A (EP4033769A4)
Publication of WO2021213008A1
Priority to US17/712,060 (US11972778B2)


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/34 Indicating arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs

Definitions

  • This application relates to the field of artificial intelligence, and in particular to video audio-visual matching technology.
  • Video content can be generated with Generative Adversarial Networks (GAN). However, the video content generated by a GAN includes only image sequences, not voice content, and it is limited by insufficient training data and unstable training methods; the generated image sequences often have obvious flaws, so the authenticity of the generated video content is poor.
  • The embodiments of the present application provide a video audio-visual matching method, related devices, and storage media.
  • The start-stop identifier can be used to locate the position of an active segment in the image sequence, so that an active segment containing action is matched with a voice segment. In this way, the synthesized video segment has a more realistic visual effect: the scene of the character speaking in the video segment is close to the effect of a character speaking in a real scene, and it is difficult for viewers to recognize that the voice and image in the video segment are synthesized. In addition, the movement direction of the start-stop identifier can be used to match voice segments and active segments in an orderly manner, which improves the consistency and continuity of action and voice in the synthesized video segment.
  • A first aspect of the present application provides a video audio-visual matching method, including:
  • obtaining a voice sequence, where the voice sequence includes M voice segments, and M is an integer greater than or equal to 1;
  • obtaining a voice segment to be matched from the voice sequence, where the voice segment to be matched is any voice segment in the voice sequence;
  • obtaining the initial position and the movement direction of the start-stop identifier from an image sequence, where the image sequence includes N active segments, each active segment includes an action picture of an object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1;
  • determining the active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched; and
  • synthesizing the voice segment to be matched and the active segment to be matched to obtain a video segment, where the video segment includes the action picture of the object and the voice of the object.
  • A second aspect of the present application provides an audio-visual matching device, including:
  • a receiving module, configured to obtain a voice sequence, where the voice sequence includes M voice segments, and M is an integer greater than or equal to 1;
  • an acquiring module, configured to acquire a voice segment to be matched from the voice sequence, where the voice segment to be matched is any voice segment in the voice sequence;
  • the acquiring module is further configured to acquire the initial position and the movement direction of the start-stop identifier from the image sequence, where the image sequence includes N active segments, each active segment includes an action picture of the object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1;
  • the acquiring module is further configured to determine the active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched; and
  • a processing module, configured to synthesize the voice segment to be matched and the active segment to be matched to obtain a video segment, where the video segment includes the action picture of the object and the voice of the object.
  • A third aspect of the present application provides a computer device, including a memory, a transceiver, a processor, and a bus system, where the memory is configured to store programs; the processor is configured to execute the programs in the memory to implement the methods described in the foregoing aspects; and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
  • A fourth aspect of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the methods described in the foregoing aspects.
  • A fifth aspect of the present application provides a computer program product including instructions which, when run on a computer, cause the computer to execute the methods described in the foregoing aspects.
  • In the embodiments of the present application, the voice sequence sent by the client is first received, and the voice segment to be matched is obtained from the voice sequence; the initial position and the movement direction of the start-stop identifier are obtained from the image sequence; the active segment to be matched is determined according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched; finally, the voice segment to be matched and the active segment to be matched are synthesized to obtain a video segment.
  • In the above manner, the position of the active segment in the image sequence is located using the position of the start-stop identifier, so that an active segment containing action is matched with the voice segment. The synthesized video segment therefore has a more realistic visual effect: the scene of the character speaking in the video clip is close to the effect of a character speaking in a real scene, and it is difficult for viewers to recognize that the voice and image in the video clip are synthesized. In addition, using the movement direction of the start-stop identifier to match voice segments and active segments in an orderly manner improves the consistency and continuity of action and voice in the synthesized video segment.
  • FIG. 1 is a schematic diagram of a scene in which a video is generated based on the audio-visual matching method in an embodiment of the application;
  • FIG. 2 is a schematic diagram of an architecture of the audio-visual matching system in an embodiment of the application;
  • FIG. 3 is a schematic flowchart of a method for audio-visual matching of video in an embodiment of the application;
  • FIG. 4 is a schematic diagram of an embodiment of a method for audio-visual matching of video in an embodiment of the application;
  • FIG. 5 is a schematic diagram of an embodiment of a voice sequence in an embodiment of the application;
  • FIG. 6A is a schematic diagram of an embodiment of an image sequence in an embodiment of the application;
  • FIG. 6B is a schematic diagram of an embodiment of the initial position of the start-stop identifier in an embodiment of the application;
  • FIG. 6C is a schematic diagram of another embodiment of the initial position of the start-stop identifier in an embodiment of the application;
  • FIG. 7 is a schematic diagram of an embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 8 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 9 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 10 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 11 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 12 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 13 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 14 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 15 is a schematic diagram of another embodiment of determining an active segment to be matched in an embodiment of the application;
  • FIG. 16 is a schematic diagram of an embodiment of an audio-visual matching device in an embodiment of the application;
  • FIG. 17 is a schematic diagram of an embodiment of a terminal device in an embodiment of the application.
  • The audio-visual matching method provided by this application is applied to scenes of synthesizing video; that is, a video including both voice information and image information can be synthesized, and applications such as virtual idols, virtual commentary, or virtual teachers can be realized based on the video.
  • For example, to realize a virtual commentary, a video including an image sequence with speaking actions can first be obtained, and then a voice sequence can be obtained. The voice sequence can be pre-recorded, collected in real time, or obtained by text-to-speech conversion. Then, using the audio-visual matching method provided in this application, the voice sequence is aligned with the image sequence in the video and the corresponding video is synthesized, thereby realizing the virtual commentary.
  • FIG. 1 is a schematic diagram of a scene of generating a video based on the audio-visual matching method in an embodiment of the application.
  • As shown in (A) of FIG. 1, the terminal device can obtain the voice segment to be matched from the voice sequence input by the user, determine the active segment to be matched online, and then synthesize the voice segment to be matched with the active segment to be matched to obtain the video segment shown in FIG. 1; the video segment includes the generated action picture of the object and the voice of the object. A video segment synthesized in this way better conforms to the natural behavior of a character when speaking, so the video segment displayed on the client of the terminal device has better authenticity. It is understandable that the application scenarios are not exhaustively listed here.
  • To this end, this application proposes a video audio-visual matching method, which is applied to the video audio-visual matching system shown in FIG. 2. FIG. 2 is a schematic diagram of the architecture of the audio-visual matching system in an embodiment of the application.
  • The video audio-visual matching system includes a server and a terminal device, and the audio-visual matching device can be deployed on the server or on the terminal device.
  • In one exemplary method, the terminal device obtains the voice sequence, obtains the voice segment to be matched from the voice sequence, obtains the active segment to be matched from the image sequence according to the audio-visual matching method provided in this application, and then synthesizes the voice segment to be matched and the active segment to be matched on the terminal device side to obtain a video segment, which can be played directly by the terminal device.
  • In another exemplary method, the terminal device obtains the voice sequence and sends it to the server; the server obtains the voice segment to be matched from the voice sequence, obtains the active segment to be matched from the image sequence according to the audio-visual matching method provided in this application, and synthesizes the voice segment to be matched and the active segment to be matched on the server side to obtain a video segment. The server then feeds the video segment back to the terminal device, which plays it.
  • The server in FIG. 2 may be one server, or a server cluster or a cloud computing center composed of multiple servers, which is not limited here.
  • The terminal device can also be another voice interaction device; voice interaction devices include, but are not limited to, smart speakers and smart appliances.
  • Although only five terminal devices and one server are shown in FIG. 2, it should be understood that the example in FIG. 2 is only used to understand this solution, and the specific numbers of terminal devices and servers should be determined flexibly in light of actual conditions.
  • The embodiments of the application may implement audio-visual matching based on artificial intelligence (AI) technology.
  • FIG. 3 is a schematic flowchart of the audio-visual matching method of the video in an embodiment of the application. As shown in FIG. 3, the method includes the following steps:
  • In step S1, the initial position and the movement direction of the start-stop identifier are obtained from the image sequence.
  • In step S2, it is determined whether there is an active segment to be matched that, after scaling, has the same duration as the voice segment to be matched. If it exists, step S3 is executed. If it does not exist because the voice segment to be matched is too short, step S4 is executed. If it does not exist because the voice segment to be matched is too long, step S5 is executed.
  • In step S3, the scaled active segment to be matched is directly matched with the voice segment to be matched, and a video segment is obtained.
  • In step S4, the active segment to be matched is generated with the start-stop identifier as the center origin and is matched with the voice segment to be matched, and a video segment is obtained.
  • In step S5, an active segment to be matched is generated to match the first part of the voice segment to be matched, and then the initial position and the movement direction of the start-stop identifier are acquired again.
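  • For intuition, the following is a minimal sketch of this S1-S5 flow, written in Python with hypothetical helper names and the zoom ratios (0.8 and 1.25) used in the embodiments below; the interval formula anticipates the reconstruction given later in the description and is not the literal implementation of the application.

```python
# Hypothetical sketch of steps S1-S5; not the literal method of the patent.
SCALE_SHORT, SCALE_LONG = 0.8, 1.25   # minimum / maximum zoom ratios

def target_forward_interval(index, e_j, s_j1):
    # Reconstructed target forward duration interval (see the formulas below):
    # [scale_short * (e_j - Index + 1), scale_long * (s_(j+1) - Index + 1)]
    return SCALE_SHORT * (e_j - index + 1), SCALE_LONG * (s_j1 - index + 1)

def dispatch(voice_len, index, e_j, s_j1):
    lo, hi = target_forward_interval(index, e_j, s_j1)      # step S2
    if lo <= voice_len <= hi:
        return "S3: scale an active segment to the voice duration"
    if voice_len < lo:
        return "S4: swing around the start-stop identifier"
    return "S5: split the voice, match the first part, re-acquire the marker"

print(dispatch(8, 10, 16, 18))   # with the FIG. 7 frame numbers -> step S3
```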
  • An embodiment of the video audio-visual matching method in the embodiments of the present application includes the following steps.
  • The audio-visual matching device may receive a voice sequence sent by the client, where the voice sequence includes at least one voice segment.
  • The voice sequence sent by the client may be input online by the client user; for example, the user inputs a voice through a microphone to generate the corresponding voice sequence, or the user inputs text content, and the text content is converted into the voice sequence.
  • Alternatively, the audio-visual matching device may obtain a voice sequence from a database, where the voice sequence includes at least one voice segment. This application does not limit the manner of acquiring the voice sequence.
  • The audio-visual matching device can be deployed on any kind of computer device, such as a server or a terminal device, which is not limited here.
  • The audio-visual matching device can obtain a voice segment to be matched from the voice sequence, where the duration of the voice segment to be matched is l_i, and i is an integer greater than or equal to 1 and less than or equal to M.
  • The present application may extract segments from the voice sequence and the image sequence at a rate of 30 frames per second.
  • FIG. 5 is a schematic diagram of an example of a voice sequence in an embodiment of the application. As shown in FIG. 5, A0 indicates a voice sequence, and A1, A2, A3, A4, and A5 indicate different voice segments in the voice sequence; the voice segment to be matched can be any one of the five voice segments.
  • The image sequence includes N active segments, each active segment includes an action picture of the object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1.
  • The audio-visual matching device needs to obtain an image sequence, where the image sequence is composed of multiple frames of images and includes active segments and silent segments. Each active segment includes the action picture of the object, while a silent segment usually does not include the action picture of the object; for example, a silent segment may include only the background image.
  • The audio-visual matching device obtains the initial position and the movement direction of the start-stop identifier from the image sequence, where the initial position of the start-stop identifier can be the start frame of an active segment or the end frame of an active segment.
  • For example, the start-stop identifier may be implemented as a cursor. A cursor can move forward or backward, so it can be regarded as a pointer that can address any position in the image sequence or the voice sequence.
  • The start-stop identifier may also be implemented as a slider, which can likewise move forward or backward and address any position in the image sequence or the voice sequence. The start-stop identifier can therefore be expressed as a frame number in the image sequence, and durations are expressed in numbers of frames.
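  • As an illustration only, such an identifier could be represented as a frame index plus a movement direction; the small Python class below is an assumption for readability, not a data structure defined by the application.

```python
from dataclasses import dataclass

@dataclass
class StartStopIdentifier:
    index: int            # frame number within the image sequence
    forward: bool = True  # movement direction: True = forward, False = reverse

marker = StartStopIdentifier(index=10)  # e.g. at the start frame of a segment
marker.index = 16                       # moving the marker is reassignment
marker.forward = False                  # direction flips at the sequence ends
```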
  • The object in the active segment may be a virtual object, such as a virtual announcer, a virtual character, or a cartoon character, and the object may also be a real object, such as user A.
  • FIG. 6A is a schematic diagram of an embodiment of the image sequence in the embodiment of the application. As shown in FIG. 6A, B1 to B5 indicate different active segments in the image sequence.
  • FIG. 6B is a schematic diagram of an embodiment of the initial position of the start-stop identifier. As shown in FIG. 6B, when the movement direction of the start-stop identifier is forward, B6 indicates the initial position of the start-stop identifier, which is the start frame of the active segment B3.
  • FIG. 6C is a schematic diagram of another embodiment of the initial position of the start-stop identifier. As shown in FIG. 6C, when the movement direction of the start-stop identifier is reverse, B7 indicates the initial position of the start-stop identifier, which is the end frame of the active segment B3.
  • The audio-visual matching device can determine the active segment to be matched based on the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched, where the active segment to be matched includes the action picture of the object. Specifically, assuming that A3 in FIG. 5 is the voice segment to be matched, the movement direction of the start-stop identifier is forward, and the initial position of the start-stop identifier is the position shown as B6 in FIG. 6B, the active segment to be matched can be determined to be the active segment B3 in FIG. 6A, and the active segment B3 includes the action picture of the object.
  • The audio-visual matching device synthesizes the voice segment to be matched and the active segment to be matched to obtain a video segment. Specifically, assuming that A3 in FIG. 5 is the voice segment to be matched and B3 in FIG. 6A is the active segment to be matched, the voice segment A3 includes the voice of the object and the active segment B3 includes the action picture of the object; therefore, the video segment includes both the action picture of the object and the corresponding voice.
  • Optionally, a neural network can also be used to synthesize the corresponding lip shape according to the content of the speech, and the lip shape is then stitched into the synthesized video segment.
  • Video segments include, but are not limited to, virtual video segments, synthesized video segments, and clipped video segments. When the video segment is a virtual video segment, it includes the action picture of a virtual object and the voice of the virtual object. A synthesized video segment includes the action picture of the object and the voice of the object. A clipped video segment is a partial segment obtained from a complete video, and the segment includes the action picture of the object and the voice of the object.
  • In the embodiment of the present application, a method for audio-visual matching of a video is provided. In the above manner, the position of the active segment in the image sequence is located using the position of the start-stop identifier, and the active segment containing action is matched with the voice segment. This ensures that the synthesized video segment has a more realistic visual effect: the scene of the character speaking in the video segment is close to the effect of a character speaking in a real scene, and it is difficult for viewers to recognize that the voice and image in the video segment are synthesized. In addition, the movement direction of the start-stop identifier can be used to match voice segments and active segments in an orderly manner, which improves the consistency and continuity of the voice and image in the synthesized video segment.
  • Optionally, if the voice segment to be matched is within the target forward duration interval, the active segment to be matched is determined according to at least one of the j-th active segment and the (j+1)-th active segment. The video audio-visual matching method further includes: if the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier; and if the updated initial position of the start-stop identifier is greater than or equal to the position corresponding to the end frame of the N-th active segment, adjusting the movement direction of the start-stop identifier to reverse.
  • Specifically, the audio-visual matching device can determine the minimum value of the first forward duration and the maximum value of the first forward duration, and then determine the target forward duration interval. When the voice segment to be matched is within the target forward duration interval, the active segment to be matched can be determined.
  • Take the minimum zoom ratio of 0.8 and the maximum zoom ratio of 1.25 as an example for description.
  • the minimum value of the first forward duration can be calculated as: scale_short × (e_j − Index + 1);
  • the maximum value of the first forward duration can be calculated as: scale_long × (s_(j+1) − Index + 1);
  • the target forward duration interval is therefore: [scale_short × (e_j − Index + 1), scale_long × (s_(j+1) − Index + 1)];
  • where Index represents the initial position of the start-stop identifier, scale_short represents the minimum zoom ratio, e_j represents the end frame of the j-th active segment, scale_long represents the maximum zoom ratio, and s_(j+1) represents the start frame of the (j+1)-th active segment.
  • FIG. 7 is a schematic diagram of an embodiment of determining the active segment to be matched in the embodiment of the application. As shown in (A) of FIG. 7, C0 represents the initial position Index of the start-stop identifier, which is the 10th frame of the image sequence.
  • C1 represents the start frame s_j of the j-th active segment, which is the 10th frame of the image sequence.
  • C2 represents the end frame e_j of the j-th active segment, which is the 16th frame of the image sequence.
  • C3 represents the start frame s_(j+1) of the (j+1)-th active segment, which is the 18th frame of the image sequence.
  • C4 represents the length of the j-th active segment, and C5 represents the length of the (j+1)-th active segment.
  • Accordingly, the minimum value of the first forward duration is 0.8 × (16 − 10 + 1) = 5.6, and the maximum value of the first forward duration is 1.25 × (18 − 10 + 1) = 11.25, so the target forward duration interval is [5.6, 11.25]. If the duration of the voice segment to be matched falls within [5.6, 11.25], as with the voice segment C6 shown in (B) of FIG. 7, the active segment to be matched can be determined according to at least one of the active segment C4 and the active segment C5.
  • Further, if the start-stop identifier position update condition is met, the audio-visual matching device updates the initial position of the start-stop identifier. If the updated initial position of the start-stop identifier is greater than or equal to the position corresponding to the end frame of the N-th active segment, the movement direction of the start-stop identifier is adjusted to reverse. That is to say, if the movement direction of the start-stop identifier is forward and the updated initial position has passed the end frame of the last active segment in the image sequence, the movement direction of the start-stop identifier needs to be changed to reverse.
  • After the direction is reversed, matching proceeds analogously to the forward case. By updating the initial position and adjusting the movement direction of the start-stop identifier in this way, a voice sequence input in real time can be matched continuously, thereby generating real-time video with higher authenticity.
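  • A small sketch of this direction-flip rule, assuming the active segments are given as (start_frame, end_frame) pairs; the symmetric reverse-to-forward rule from the later embodiments is included for completeness.

```python
def update_direction(index, forward, segments):
    """Flip the marker direction at the ends of the image sequence."""
    if forward and index >= segments[-1][1]:     # passed end frame of N-th segment
        return False                             # switch to reverse
    if not forward and index <= segments[0][0]:  # passed start of 1st segment
        return True                              # switch back to forward
    return forward

print(update_direction(30, True, [(10, 16), (18, 25), (27, 30)]))  # False
```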
  • In the embodiment of the present application, a method for determining the active segment to be matched is provided. In the above manner, the initial position of the start-stop identifier and the start frame and end frame of the active segment, combined with the voice segment to be matched, are used to determine the active segment to be matched, so that the synthesized video better matches the scene of the object actually speaking and therefore has more authenticity. Moreover, the active segments to be matched that correspond to different voice segments to be matched are connected end to end, which improves the consistency and continuity of voice and image in the synthesized video segment.
  • Optionally, determining the active segment to be matched according to at least one of the active segments may include: if the voice segment to be matched is within the first forward duration interval, scaling the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment according to the duration of the voice segment to be matched, to obtain the active segment to be matched. Correspondingly, updating the initial position of the start-stop identifier may include: if the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier to the position corresponding to the end frame of the j-th active segment.
  • Specifically, the audio-visual matching device can determine the minimum value of the second forward duration, and then determine the first forward duration interval according to the minimum value of the first forward duration and the minimum value of the second forward duration. When the voice segment to be matched is within the first forward duration interval, the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched. In this case, the start-stop identifier position update condition is satisfied, and the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the j-th active segment.
  • the minimum value of the second forward duration can be calculated as: e_j − Index + 1;
  • the first forward duration interval is therefore: [scale_short × (e_j − Index + 1), e_j − Index + 1];
  • where Index represents the initial position of the start-stop identifier, scale_short represents the minimum zoom ratio, and e_j represents the end frame of the j-th active segment.
  • FIG. 8 is a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of the application. As shown in (A) of FIG. 8, D0 represents the initial position Index of the start-stop identifier, which is the 10th frame.
  • D1 represents the start frame s_j of the j-th active segment, which is the 10th frame of the image sequence.
  • D2 represents the end frame e_j of the j-th active segment, which is the 16th frame of the image sequence.
  • D3 represents the start frame s_(j+1) of the (j+1)-th active segment, which is the 18th frame of the image sequence.
  • D4 represents the length of the j-th active segment, and D5 represents the length of the (j+1)-th active segment.
  • The minimum value of the first forward duration is 5.6 frames and the minimum value of the second forward duration is 16 − 10 + 1 = 7 frames, so the first forward duration interval is [5.6, 7].
  • The voice segment D6 to be matched shown in (B) of FIG. 8 is 6 frames, so its duration is within the first forward duration interval. Therefore, according to the duration of the voice segment D6, the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment is scaled; for example, the j-th active segment is compressed from 7 frames to 6 frames so that it matches the voice segment D6.
  • In this case, the start-stop identifier position update condition is satisfied, and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the j-th active segment, that is, from the 10th frame to the 16th frame.
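  • A sketch of this compression case, under the assumption that "scaling" is realized by uniform frame resampling (one of several possible realizations; the helper name is hypothetical):

```python
def compress_to_voice(index, e_j, voice_len):
    """Compress frames [Index, e_j] to the voice duration (first interval case)."""
    n = e_j - index + 1                    # frames from the marker to e_j
    assert 0.8 * n <= voice_len <= n       # first forward duration interval
    step = n / voice_len                   # > 1: drop frames uniformly
    frames = [index + int(i * step) for i in range(voice_len)]
    return frames, e_j                     # marker is updated to e_j

frames, new_index = compress_to_voice(10, 16, 6)   # the FIG. 8 example
print(frames, new_index)   # [10, 11, 12, 13, 14, 15] 16
```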
  • Optionally, determining the active segment to be matched according to at least one of the active segments may include: if the voice segment to be matched is within the second forward duration interval, scaling the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment according to the duration of the voice segment to be matched, to obtain the active segment to be matched. Correspondingly, updating the initial position of the start-stop identifier may include: if the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier to the position corresponding to the start frame of the (j+1)-th active segment.
  • Specifically, the audio-visual matching device can determine the maximum value of the second forward duration, and then determine the second forward duration interval according to the maximum value of the first forward duration and the maximum value of the second forward duration. When the voice segment to be matched is within the second forward duration interval, the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched. In this case, the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the (j+1)-th active segment.
  • the maximum value of the second forward duration can be calculated as: s_(j+1) − Index + 1;
  • the second forward duration interval is therefore: [s_(j+1) − Index + 1, scale_long × (s_(j+1) − Index + 1)];
  • where Index represents the initial position of the start-stop identifier, scale_long represents the maximum zoom ratio, and s_(j+1) represents the start frame of the (j+1)-th active segment.
  • Take the following as an example for description: the initial position of the start-stop identifier is the 10th frame, the start frame of the j-th active segment is the 10th frame, the end frame of the j-th active segment is the 16th frame, the start frame of the (j+1)-th active segment is the 18th frame, and the duration of the voice segment to be matched is 10 frames.
  • FIG. 9 is a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of the application.
  • As shown in (A) of FIG. 9, E0 represents the initial position Index of the start-stop identifier, which is the 10th frame.
  • E1 represents the start frame s_j of the j-th active segment, which is the 10th frame of the image sequence; E2 represents the end frame e_j of the j-th active segment, which is the 16th frame of the image sequence; E3 represents the start frame s_(j+1) of the (j+1)-th active segment, which is the 18th frame of the image sequence; E4 represents the length of the j-th active segment; and E5 represents the length of the (j+1)-th active segment.
  • The maximum value of the first forward duration is 11.25 frames and the maximum value of the second forward duration is 18 − 10 + 1 = 9 frames, so the second forward duration interval is [9, 11.25].
  • The voice segment E6 to be matched shown in (B) of FIG. 9 is 10 frames, so its duration is within the second forward duration interval. Therefore, according to the duration of the voice segment E6, the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment is scaled; for example, the duration between E0 and E3 is stretched from 9 frames to 10 frames, which yields an active segment to be matched with the same duration as the voice segment E6.
  • In this case, the start-stop identifier position update condition is satisfied, and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the (j+1)-th active segment, that is, from the 10th frame to the 18th frame.
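  • The mirrored stretch case, again assuming uniform frame resampling as the scaling mechanism (a sketch, not the application's literal procedure):

```python
def stretch_to_voice(index, s_j1, voice_len):
    """Stretch frames [Index, s_(j+1)] to the voice duration (second interval case)."""
    n = s_j1 - index + 1                     # frames from the marker to s_(j+1)
    assert n <= voice_len <= 1.25 * n        # second forward duration interval
    step = n / voice_len                     # < 1: repeat frames uniformly
    frames = [index + int(i * step) for i in range(voice_len)]
    return frames, s_j1                      # marker is updated to s_(j+1)

frames, new_index = stretch_to_voice(10, 18, 10)   # the FIG. 9 example
print(frames, new_index)   # [10, 10, 11, 12, 13, 14, 15, 16, 17, 18] 18
```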
  • Optionally, determining the active segment to be matched according to at least one of the active segments may include: if the voice segment to be matched is within the third forward duration interval, determining the active segment to be matched according to the initial position of the start-stop identifier and the duration of the voice segment to be matched. Correspondingly, updating the initial position of the start-stop identifier may include: updating the initial position of the start-stop identifier to the position corresponding to the end frame of the active segment to be matched.
  • Specifically, the audio-visual matching device determines the minimum value of the second forward duration and the maximum value of the second forward duration, and then determines the third forward duration interval according to them. When the voice segment to be matched is within the third forward duration interval, the active segment to be matched is determined according to the initial position of the start-stop identifier and the duration of the voice segment to be matched, and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the active segment to be matched.
  • the third forward duration interval can be calculated as: [e_j − Index + 1, s_(j+1) − Index + 1];
  • where Index represents the initial position of the start-stop identifier, e_j represents the end frame of the j-th active segment, and s_(j+1) represents the start frame of the (j+1)-th active segment.
  • Take the following as an example for description: the initial position of the start-stop identifier is the 10th frame, the start frame of the j-th active segment is the 10th frame, the end frame of the j-th active segment is the 16th frame, the start frame of the (j+1)-th active segment is the 18th frame, and the duration of the voice segment to be matched is 8 frames.
  • FIG. 10 is a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of the application.
  • As shown in (A) of FIG. 10, F0 represents the initial position Index of the start-stop identifier, which is the 10th frame.
  • F1 represents the start frame s_j of the j-th active segment, which is the 10th frame of the image sequence; F2 represents the end frame e_j of the j-th active segment, which is the 16th frame of the image sequence; F3 represents the start frame s_(j+1) of the (j+1)-th active segment, which is the 18th frame of the image sequence; F4 represents the length of the j-th active segment; and F5 represents the length of the (j+1)-th active segment.
  • The minimum value of the second forward duration is 7 frames and the maximum value of the second forward duration is 9 frames, so the third forward duration interval is [7, 9].
  • The voice segment F6 to be matched shown in (B) of FIG. 10 is 8 frames, so its duration is within the third forward duration interval. Therefore, the active segment to be matched is determined from the initial position F0 of the start-stop identifier and the duration of the voice segment F6 as the frames [Index, Index + l_i − 1], where Index represents the initial position of the start-stop identifier and l_i represents the duration of the voice segment to be matched. With a voice segment of 8 frames, the active segment to be matched is the segment between the 10th frame and the 17th frame.
  • In this case, the start-stop identifier position update condition is satisfied, and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the active segment to be matched, that is, from the 10th frame to the 17th frame, which ensures that the initial position of the start-stop identifier lies in the silent segment.
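  • The exact-fit case needs no scaling; a minimal sketch:

```python
def exact_fit(index, voice_len):
    """Third interval case: take frames [Index, Index + l_i - 1] unscaled."""
    frames = list(range(index, index + voice_len))
    return frames, frames[-1]              # marker moves to the matched end frame

frames, new_index = exact_fit(10, 8)       # the FIG. 10 example
print(frames[0], frames[-1], new_index)    # 10 17 17 -> marker in the silent gap
```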
  • In the embodiment of the present application, a method for determining the active segment to be matched is provided. In the above manner, different methods can be used to determine the active segment to be matched for voice segments of different durations, which increases the diversity of the matching algorithm. The updated start-stop identifier falls in the silent segment, so the parts of the video equipped with voice are the active segments, which makes the synthesized video appear more natural. Moreover, the matching method provided by the present application is computationally simple, can run in real time, and can synthesize video segments online.
  • Optionally, the video audio-visual matching method further includes: if the voice segment to be matched is not within the target forward duration interval and is less than the minimum value of the first forward duration, determining the active segment to be matched according to the duration of the voice segment to be matched, the initial position of the start-stop identifier, and the moving radius; or, determining the active segment to be matched according to the duration of the voice segment to be matched and the initial position of the start-stop identifier.
  • Specifically, the audio-visual matching device may determine the active segment to be matched by means of a bidirectional swing, or the audio-visual matching device may take several frames toward the silent segment to determine the active segment to be matched.
  • Take the following as an example for illustration: the initial position of the start-stop identifier is the 10th frame, the start frame of the j-th active segment is the 10th frame, the end frame of the j-th active segment is the 16th frame, the start frame of the (j+1)-th active segment is the 18th frame, and the duration of the voice segment to be matched is 3 frames.
  • Based on the foregoing, the target forward duration interval is [5.6, 11.25]; the voice segment to be matched is therefore not within the target forward duration interval and is less than the minimum value 5.6 of the first forward duration. In this case, the initial position of the start-stop identifier can be taken as the center, and frames are taken back and forth within the active segment with a moving radius r, so as to obtain the active segment to be matched. The moving radius is usually an integer greater than or equal to 1 and less than or equal to 5, and the initial position of the start-stop identifier is not updated.
  • For example, with a radius of 3 and the initial position of the start-stop identifier at the 10th frame, the frames can be taken in the order: the 10th, 11th, 12th, 11th, 10th, 9th, 8th, 9th, 10th, 11th, 12th, 11th frames, and so on. The corresponding frames are taken in turn according to the duration of the voice segment to be matched; if the duration of the voice segment to be matched is 3 frames, the first 3 frames of the above sequence are taken, namely the 10th, 11th, and 12th frames.
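  • A sketch of a swing generator that reproduces the frame order above. Note one assumption: it reads the radius as counting the frames taken on each side including the center (so the swing amplitude is r − 1), which is how the r = 3 example yields peaks at the 12th and 8th frames.

```python
def swing_frames(center, radius, voice_len):
    """Oscillate around the marker and take the first voice_len frames."""
    a = radius - 1                # swing amplitude implied by the example
    if a == 0:
        return [center] * voice_len
    out = []
    for t in range(voice_len):
        m = t % (4 * a)           # triangular wave with period 4a
        off = m if m <= a else (2 * a - m if m <= 3 * a else m - 4 * a)
        out.append(center + off)
    return out

print(swing_frames(10, 3, 12))  # [10, 11, 12, 11, 10, 9, 8, 9, 10, 11, 12, 11]
print(swing_frames(10, 3, 3))   # [10, 11, 12] -> the 3-frame voice example
```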
  • There are two ways to delimit an active segment. The first is to use the first frame containing action as the start frame and the last frame containing action as the end frame; that is, the active segment coincides with the segment of motion visible to the naked eye.
  • The other way is to select one of the several silent frames before the first action frame as the start frame of the active segment, and similarly to select one of the several silent frames after the last action frame as the end frame of the active segment. In this case, the active segment actually includes a small silent segment at the beginning and at the end, which is closer to actual industrial applications.
  • FIG. 11 is a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of the application. As shown in (A) of FIG. 11, the active segment G1 includes a small silent segment, that is, the active segment G1 can include action frames and silent frames. Centering on the initial position of the start-stop identifier and moving back and forth with a moving radius r yields the active segment to be matched; the active segment to be matched usually includes several silent frames and may also include a small number of action frames.
  • (B) of FIG. 11 shows the active segment G2, which does not include silent frames, that is, the active segment G2 includes only action frames. In this case, one can directly move several frames from the initial position of the start-stop identifier toward the silent segment and take out the active segment to be matched, where the number of frames of the active segment to be matched equals the number of frames of the voice segment to be matched. That is, when the start-stop identifier is at the start frame of the active segment, frames are taken backward according to the duration of the voice segment to be matched; for example, 3 frames are taken from the 10th frame in the reverse direction (toward the silent segment) to obtain the active segment to be matched. When the start-stop identifier is at the end frame of the active segment, frames are taken forward according to the duration of the voice segment to be matched; for example, 3 frames are taken from the 16th frame in the forward direction (toward the silent segment) to obtain the active segment to be matched.
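  • A sketch of this alternative for segments without leading or trailing silent frames: step from the marker toward the silent region (hypothetical helper, frame order as taken from the marker outward):

```python
def take_toward_silence(index, voice_len, at_start_frame):
    """Take voice_len frames from the marker toward the silent segment."""
    if at_start_frame:   # marker on the start frame: step backward
        return [index - k for k in range(voice_len)]
    return [index + k for k in range(voice_len)]  # on the end frame: forward

print(take_toward_silence(10, 3, True))    # [10, 9, 8]
print(take_toward_silence(16, 3, False))   # [16, 17, 18]
```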
  • In the embodiment of the present application, a method for determining the active segment to be matched is provided. In the above manner, a silent segment can also be matched, so that the synthesized video does not appear too abrupt, thereby improving the authenticity of the video.
  • Optionally, the video audio-visual matching method further includes: if the voice segment to be matched is not within the target forward duration interval and is greater than the maximum value of the first forward duration, obtaining the k-th active segment from the image sequence, where k is an integer greater than or equal to 1 and less than or equal to N; if the voice segment to be matched is within the fourth forward duration interval, determining the duration of the active segment to be matched according to the initial position of the start-stop identifier, the maximum zoom ratio, and the start frame of the k-th active segment; and dividing the voice segment to be matched into a first voice segment and a second voice segment, where the duration of the first voice segment is the same as the duration of the active segment to be matched, and the second voice segment is matched with the corresponding action segment according to the updated initial position of the start-stop identifier. Correspondingly, updating the initial position of the start-stop identifier may include: if the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier to the position corresponding to the start frame of the k-th active segment.
  • Specifically, the audio-visual matching device obtains the k-th active segment from the image sequence, determines the minimum value of the third forward duration and the maximum value of the third forward duration, and thereby determines the fourth forward duration interval. When the voice segment to be matched is within the fourth forward duration interval, the duration of the active segment to be matched is determined according to the initial position of the start-stop identifier, the maximum zoom ratio, and the start frame of the k-th active segment, and the voice segment to be matched is divided into a first voice segment and a second voice segment. The second voice segment is matched with the corresponding action segment according to the updated initial position of the start-stop identifier; that is, the second voice segment is treated as the next voice segment to be matched and goes through audio-visual matching again. If the voice segment to be matched is within the fourth forward duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the k-th active segment.
  • Take the following as an example for illustration: the initial position of the start-stop identifier is the 10th frame, the start frame of the j-th active segment is the 10th frame, the end frame of the j-th active segment is the 16th frame, the start frame of the (j+1)-th active segment is the 18th frame, and the duration of the voice segment to be matched is 25 frames.
  • Based on the foregoing, the target forward duration interval is [5.6, 11.25] and the maximum value of the first forward duration is 11.25 frames, so the voice segment to be matched is not within the target forward duration interval and is greater than the maximum value of the first forward duration; therefore, the k-th active segment needs to be obtained.
  • the minimum value of the third forward duration can be calculated as: scale_long × (s_k − Index + 1);
  • the maximum value of the third forward duration can be calculated as: scale_short × (e_(k+1) − Index + 1);
  • the fourth forward duration interval is therefore: [scale_long × (s_k − Index + 1), scale_short × (e_(k+1) − Index + 1)];
  • the duration of the active segment to be matched can be calculated as: scale_long × (s_k − Index + 1);
  • where Index represents the initial position of the start-stop identifier, s_k represents the start frame of the k-th active segment, e_(k+1) represents the end frame of the (k+1)-th active segment, scale_short represents the minimum zoom ratio, and scale_long represents the maximum zoom ratio.
  • In this example, the start frame of the k-th active segment is the 26th frame, so the minimum value of the third forward duration is 1.25 × (26 − 10 + 1) = 21.25 frames, and the maximum value of the third forward duration is 28.8 frames.
  • The fourth forward duration interval is thus determined to be [21.25, 28.8]; with a voice segment to be matched of 25 frames, the voice segment to be matched is within the fourth forward duration interval. Further, according to the foregoing formula, the duration of the active segment to be matched is 21.25 frames.
  • the duration of the second voice segment is calculated as: l_i − scale_long × (s_k − Index + 1);
  • where Index represents the initial position of the start-stop identifier, s_k represents the start frame of the k-th active segment, scale_long represents the maximum zoom ratio, and l_i represents the duration of the voice segment to be matched.
  • Thereby, the first voice segment occupies the 1st frame to the 21.25th frame of the voice segment to be matched, and the duration of the second voice segment is 25 − 21.25 = 3.75 frames.
  • In this case, the start-stop identifier position update condition is satisfied, and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the k-th active segment, that is, from the 10th frame to the 26th frame. The second voice segment obtained above is then matched with the corresponding action segment according to the updated initial position of the start-stop identifier; the specific matching method is similar to the foregoing embodiments and is not repeated here.
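  • A numeric sketch of the split described above, using the reconstructed formulas and this example's frame numbers (s_k = 26 follows from the marker update to the 26th frame):

```python
SCALE_LONG = 1.25

def split_long_voice(index, s_k, voice_len):
    """Fourth interval case: split the voice; the remainder is re-matched."""
    first = SCALE_LONG * (s_k - index + 1)   # duration matched before s_k
    second = voice_len - first               # l_i - scale_long * (s_k - Index + 1)
    return first, second, s_k                # marker is updated to s_k

print(split_long_voice(10, 26, 25))   # (21.25, 3.75, 26) as in the example
```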
  • In the embodiment of the present application, a method for obtaining the active segment to be matched is provided. In the above manner, the accuracy of matching can be improved, thereby improving the degree of matching between the voice segments and the active segments in the video and enhancing the authenticity of the video.
  • Optionally, when the movement direction of the start-stop identifier is reverse, the video audio-visual matching method further includes: if the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier; and if the updated position of the start-stop identifier is less than or equal to the position corresponding to the start frame of the first active segment, adjusting the movement direction of the start-stop identifier to forward.
  • Specifically, the audio-visual matching device can determine the minimum value of the first reverse duration and the maximum value of the first reverse duration, and then determine the target reverse duration interval. When the voice segment to be matched is within the target reverse duration interval, the active segment to be matched can be determined. If the start-stop identifier position update condition is met, the initial position of the start-stop identifier is updated; if the updated position of the start-stop identifier is less than or equal to the position corresponding to the start frame of the first active segment, the movement direction of the start-stop identifier is adjusted to forward.
  • Take the minimum zoom ratio of 0.8 and the maximum zoom ratio of 1.25 as an example for description.
  • the minimum value of the first reverse duration can be calculated as: scale_short × (Index − s_p + 1);
  • the maximum value of the first reverse duration can be calculated as: scale_long × (Index − e_(p−1) + 1);
  • the target reverse duration interval is therefore: [scale_short × (Index − s_p + 1), scale_long × (Index − e_(p−1) + 1)];
  • where Index represents the initial position of the start-stop identifier, scale_short represents the minimum zoom ratio, s_p represents the start frame of the p-th active segment, scale_long represents the maximum zoom ratio, and e_(p−1) represents the end frame of the (p−1)-th active segment.
  • FIG. 12 is a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of the application. As shown in (A) of FIG. 12, H0 represents the initial position Index of the start-stop identifier, which is the 18th frame of the image sequence.
  • H1 represents the start frame s_p of the p-th active segment, which is the 11th frame of the image sequence.
  • H2 represents the end frame e_(p−1) of the (p−1)-th active segment, which is the 9th frame of the image sequence.
  • H3 represents the length of the p-th active segment, and H4 represents the length of the (p−1)-th active segment.
  • The minimum value of the first reverse duration is 0.8 × (18 − 11 + 1) = 6.4 frames and the maximum value of the first reverse duration is 1.25 × (18 − 9 + 1) = 12.5 frames, so the target reverse duration interval is [6.4, 12.5]. If the duration of the voice segment to be matched falls within [6.4, 12.5], as with the voice segment H5 shown in (B) of FIG. 12, the active segment to be matched can be determined according to at least one of the active segment H3 and the active segment H4.
  • When the start-stop identifier position update condition is met, the audio-visual matching device can also update the initial position of the start-stop identifier.
  • If the updated initial position of the start-stop identifier is less than or equal to the position corresponding to the start frame of the first active segment, the movement direction of the start-stop identifier is adjusted to forward.
  • That is, if the movement direction of the start-stop identifier is reverse and its updated initial position has passed the first frame of the first active segment in the image sequence, the movement direction needs to be changed to forward; the forward matching method has been introduced in the foregoing embodiments and is not repeated here.
  • By updating and adjusting the movement direction of the start-stop identifier between forward and reverse, a voice sequence input in real time can be matched, thereby generating highly authentic video in real time.
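  • As a concrete illustration of the interval used above, the following minimal Python sketch (all names are illustrative assumptions, not anything defined by this application) computes the target reverse duration interval and checks whether a voice segment falls inside it, reproducing the 6.4/12.5-frame worked example:

    # Minimal sketch, assuming illustrative names: compute the target reverse
    # duration interval for cursor position `index`, given the start frame s_p
    # of the p-th active segment and the end frame e_p1 of the (p-1)-th one.
    SCALE_SHORT, SCALE_LONG = 0.8, 1.25  # zoom ratios from the worked examples

    def target_reverse_interval(index: int, s_p: int, e_p1: int) -> tuple[float, float]:
        lo = SCALE_SHORT * (index - s_p + 1)    # first reverse duration minimum
        hi = SCALE_LONG * (index - e_p1 + 1)    # first reverse duration maximum
        return lo, hi

    lo, hi = target_reverse_interval(index=18, s_p=11, e_p1=9)
    print((lo, hi))        # (6.4, 12.5), matching FIG. 12
    print(lo <= 7 <= hi)   # True: a 7-frame voice segment can be matched by scaling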
  • In the embodiments of this application, another method for obtaining the active segment to be matched is provided.
  • In the above manner, the synthesized video better matches the scene of the object's actual speech, giving the video more authenticity.
  • In addition, the active segments to be matched corresponding to different voice segments to be matched are connected end to end, improving the consistency and continuity of voice and images in the synthesized video segment.
  • Determining the active segment to be matched according to at least one of the p-th active segment and the (p-1)-th active segment may include:
  • determining the second reverse duration minimum according to the initial position of the start-stop identifier and the start frame of the p-th active segment, where the second reverse duration minimum is greater than the first reverse duration minimum;
  • determining the first reverse duration interval according to the first reverse duration minimum and the second reverse duration minimum; and
  • if the voice segment to be matched is within the first reverse duration interval, scaling the duration between the start frame of the p-th active segment and the initial position of the start-stop identifier according to the duration of the voice segment to be matched, to obtain the active segment to be matched.
  • If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include:
  • if the voice segment to be matched is within the first reverse duration interval, the start-stop identifier position update condition is met; and
  • the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the p-th active segment.
  • In this embodiment, the audio-visual matching device may determine the first reverse duration interval according to the first reverse duration minimum and the second reverse duration minimum.
  • If the voice segment to be matched is within the first reverse duration interval, the duration between the start frame of the p-th active segment and the initial position of the start-stop identifier is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched. In that case the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the start frame of the p-th active segment.
  • The second reverse duration minimum can be calculated by the following formula: Index - s_p + 1.
  • The first reverse duration interval can be calculated by the following formula: [scale_short * (Index - s_p + 1), Index - s_p + 1].
  • Index represents the initial position of the start-stop identifier; scale_short represents the minimum zoom ratio; s_p represents the start frame of the p-th active segment.
  • For ease of understanding, take as an example the initial position of the start-stop identifier being the 18th frame of the image sequence (the end frame of the p-th active segment), the start frame of the p-th active segment being the 11th frame, the end frame of the (p-1)-th active segment being the 9th frame, and the duration of the voice segment to be matched being 7 frames; refer to FIG. 13.
  • I0 shown in (A) of FIG. 13 represents the initial position Index of the start-stop identifier, which is the 18th frame.
  • I1 represents the start frame s_p of the p-th active segment, namely the 11th frame of the image sequence.
  • I2 represents the end frame e_(p-1) of the (p-1)-th active segment, namely the 9th frame of the image sequence.
  • I3 represents the length of the p-th active segment, and I4 represents the length of the (p-1)-th active segment.
  • From the foregoing formulas, the first reverse duration minimum is 6.4 frames and the second reverse duration minimum is 8 frames, so the first reverse duration interval is [6.4, 8].
  • (B) of FIG. 13 shows that the duration of the voice segment to be matched I5 is 7 frames, which lies within the first reverse duration interval. Therefore, according to the duration of I5, the duration between the start frame of the p-th active segment and the initial position of the start-stop identifier can be scaled, for example, the duration of the p-th active segment is scaled to 7 frames, so as to match the voice segment I5.
  • Since the voice segment to be matched is within the first reverse duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier needs to be updated to the position corresponding to the start frame of the p-th active segment, that is, changed from the 18th frame to the 11th frame.
  • Determining the active segment to be matched according to at least one of the p-th active segment and the (p-1)-th active segment may include:
  • determining the second reverse duration maximum according to the initial position of the start-stop identifier and the end frame of the (p-1)-th active segment, where the second reverse duration maximum is less than the first reverse duration maximum;
  • determining the second reverse duration interval according to the first reverse duration maximum and the second reverse duration maximum; and
  • if the voice segment to be matched is within the second reverse duration interval, scaling the duration between the end frame of the (p-1)-th active segment and the initial position of the start-stop identifier according to the duration of the voice segment to be matched, to obtain the active segment to be matched.
  • If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the second reverse duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the (p-1)-th active segment.
  • In this embodiment, the audio-visual matching device may determine the second reverse duration interval according to the first reverse duration maximum and the second reverse duration maximum. If the voice segment to be matched is within the second reverse duration interval, the duration between the end frame of the (p-1)-th active segment and the initial position of the start-stop identifier is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched. In that case the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the (p-1)-th active segment.
  • The second reverse duration maximum can be calculated by the following formula: Index - e_(p-1) + 1.
  • The second reverse duration interval can be calculated by the following formula: [Index - e_(p-1) + 1, scale_long * (Index - e_(p-1) + 1)].
  • Index represents the initial position of the start-stop identifier; scale_long represents the maximum zoom ratio; e_(p-1) represents the end frame of the (p-1)-th active segment.
  • For ease of understanding, take as an example the initial position of the start-stop identifier being the 18th frame of the image sequence (the end frame of the p-th active segment), the start frame of the p-th active segment being the 11th frame, the end frame of the (p-1)-th active segment being the 9th frame, and the duration of the voice segment to be matched being 11 frames; refer to FIG. 14.
  • J0 shown in (A) of FIG. 14 represents the initial position Index of the start-stop identifier, which is the 18th frame.
  • J1 represents the start frame s_p of the p-th active segment, namely the 11th frame of the image sequence.
  • J2 represents the end frame e_(p-1) of the (p-1)-th active segment, namely the 9th frame of the image sequence.
  • J3 represents the length of the p-th active segment, and J4 represents the length of the (p-1)-th active segment.
  • From the foregoing formulas, the first reverse duration maximum is 12.5 frames and the second reverse duration maximum is 10 frames, so the second reverse duration interval is [10, 12.5].
  • (B) of FIG. 14 shows that the voice segment to be matched J5 is 11 frames, which lies within the second reverse duration interval. Therefore, according to the duration of J5, the duration between the end frame of the (p-1)-th active segment and the initial position of the start-stop identifier can be scaled, for example, the duration between J2 and J0 is scaled to 11 frames, so that an active segment to be matched with the same duration as J5 is obtained.
  • Since the voice segment to be matched is within the second reverse duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the (p-1)-th active segment, that is, changed from the 18th frame to the 9th frame.
  • Determining the active segment to be matched according to at least one of the p-th active segment and the (p-1)-th active segment may include:
  • determining the third reverse duration interval according to the second reverse duration minimum and the second reverse duration maximum; and
  • if the voice segment to be matched is within the third reverse duration interval, determining the active segment to be matched according to the initial position of the start-stop identifier and the duration of the voice segment to be matched.
  • If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the third reverse duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the active segment to be matched.
  • In this embodiment, the audio-visual matching device determines the third reverse duration interval according to the second reverse duration minimum and the second reverse duration maximum. If the voice segment to be matched is within the third reverse duration interval, the active segment to be matched is determined according to the initial position of the start-stop identifier and the duration of the voice segment to be matched; the start-stop identifier position update condition is then met, and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the active segment to be matched.
  • The third reverse duration interval can be calculated by the following formula: [Index - s_p + 1, Index - e_(p-1) + 1].
  • Index represents the initial position of the start-stop identifier; s_p represents the start frame of the p-th active segment; e_(p-1) represents the end frame of the (p-1)-th active segment.
  • For ease of understanding, take as an example the initial position of the start-stop identifier being the 18th frame of the image sequence (the end frame of the p-th active segment), the start frame of the p-th active segment being the 11th frame, the end frame of the (p-1)-th active segment being the 9th frame, and the duration of the voice segment to be matched being 9 frames.
  • K0 shown in (A) of FIG. 15 represents the initial position Index of the start-stop identifier, which is the 18th frame.
  • K1 represents the start frame s_p of the p-th active segment, namely the 11th frame of the image sequence.
  • K2 represents the end frame e_(p-1) of the (p-1)-th active segment, namely the 9th frame of the image sequence.
  • K3 represents the length of the p-th active segment, and K4 represents the length of the (p-1)-th active segment.
  • From the foregoing formulas, the second reverse duration minimum is 8 frames and the second reverse duration maximum is 10 frames, so the third reverse duration interval is [8, 10].
  • (B) of FIG. 15 shows that the duration of the voice segment to be matched K5 is 9 frames, which lies within the third reverse duration interval. Therefore, the active segment to be matched can be determined from the initial position K0 of the start-stop identifier and the duration of K5 in the following manner: [Index - l_i + 1, Index].
  • Index represents the initial position of the start-stop identifier, and l_i represents the duration of the voice segment to be matched. With a duration of 9 frames, the active segment to be matched is the segment between the 10th frame and the 18th frame.
  • Since the voice segment to be matched is within the third reverse duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the start frame of the active segment to be matched, that is, changed from the 18th frame to the 10th frame, ensuring that the initial position of the start-stop identifier lies in a silent segment.
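  • The three reverse sub-intervals above can be combined into a single dispatch, sketched below under assumed names (1-based frame numbers; li is the voice segment duration in frames). It returns the frame span of the active segment to be matched and the updated cursor position, and reproduces the worked examples of FIGS. 13 to 15:

    # Hedged sketch of the reverse-direction case analysis; illustrative names.
    def match_reverse(index, s_p, e_p1, li, scale_short=0.8, scale_long=1.25):
        if scale_short * (index - s_p + 1) <= li <= index - s_p + 1:
            # First reverse interval: scale the span [s_p, index] to li frames.
            return (s_p, index), s_p
        if index - e_p1 + 1 <= li <= scale_long * (index - e_p1 + 1):
            # Second reverse interval: scale the span [e_p1, index] to li frames.
            return (e_p1, index), e_p1
        if index - s_p + 1 <= li <= index - e_p1 + 1:
            # Third reverse interval: take [index - li + 1, index] without scaling.
            return (index - li + 1, index), index - li + 1
        return None  # outside the target reverse interval: swing or split instead

    print(match_reverse(18, 11, 9, 7))   # ((11, 18), 11) -> FIG. 13
    print(match_reverse(18, 11, 9, 11))  # ((9, 18), 9)   -> FIG. 14
    print(match_reverse(18, 11, 9, 9))   # ((10, 18), 10) -> FIG. 15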
  • In the embodiments of this application, methods for determining the active segment to be matched are provided. In the above manner, the active segment to be matched can be obtained in different ways for voice segments of different durations, improving the diversity of the matching algorithm. Moreover, the updated initial position of the start-stop identifier falls in a silent segment, so that active segments are accompanied by voice, making the synthesized video appear more natural. Furthermore, the matching method provided by this application is computationally simple, can be used for real-time computation, and can synthesize video segments online.
  • Optionally, after the target reverse duration interval is determined, the video audio-picture matching method further includes:
  • if the voice segment to be matched is not within the target reverse duration interval and its duration is less than the first reverse duration minimum, determining the active segment to be matched according to the duration of the voice segment to be matched, the initial position of the start-stop identifier, and the moving radius; or,
  • determining the active segment to be matched according to the duration of the voice segment to be matched and the initial position of the start-stop identifier.
  • In this embodiment, the audio-visual matching device may determine the active segment to be matched by swinging back and forth around the start-stop identifier, or may take several frames in the direction of the silent segment to obtain the active segment to be matched.
  • Specifically, take as an example the initial position of the start-stop identifier being the 18th frame (the end frame of the p-th active segment), the start frame of the p-th active segment being the 11th frame, the end frame of the (p-1)-th active segment being the 9th frame, and the duration of the voice segment to be matched being 2 frames.
  • From the foregoing formulas, the target reverse duration interval is [6.4, 12.5]; the voice segment to be matched is therefore not within the target reverse duration interval and is less than the first reverse duration minimum of 6.4. In this case, the initial position of the start-stop identifier can be taken as the center, and the active segment to be matched can be obtained by moving back and forth within the active segment with moving radius r.
  • It can be understood that the moving radius is usually an integer greater than or equal to 1 and less than or equal to 5, and the initial position of the start-stop identifier is not updated.
  • Assuming the initial position of the start-stop identifier is the 18th frame, the frames can be taken as the 18th, 17th, 18th, 19th, 18th, 17th, 18th, 19th, 18th, 17th, 18th, and so on.
  • The corresponding frames are then obtained in turn according to the duration of the voice segment to be matched; since the duration of the voice segment to be matched is 2 frames, the first 2 frames of images are taken from the above sequence, namely the 17th and 18th frames.
  • In the embodiments of this application, a method for obtaining the active segment to be matched is provided. In the above manner, an overly short voice segment can be matched with a silent segment, so that the synthesized video does not appear too abrupt, thereby improving the authenticity of the video.
  • Optionally, after the target reverse duration interval is determined, the video audio-picture matching method further includes:
  • if the voice segment to be matched is not within the target reverse duration interval and its duration is greater than the first reverse duration maximum, obtaining the q-th active segment from the image sequence, where q is an integer greater than or equal to 1 and less than or equal to N;
  • determining the third reverse duration minimum according to the maximum zoom ratio, the initial position of the start-stop identifier, and the end frame of the q-th active segment;
  • determining the third reverse duration maximum according to the minimum zoom ratio, the initial position of the start-stop identifier, and the start frame of the (q-1)-th active segment;
  • determining the fourth reverse duration interval according to the third reverse duration minimum and the third reverse duration maximum;
  • if the voice segment to be matched is within the fourth reverse duration interval, determining the duration of the active segment to be matched according to the initial position of the start-stop identifier, the maximum zoom ratio, and the end frame of the q-th active segment; and
  • dividing the voice segment to be matched into a third voice segment and a fourth voice segment according to the duration of the active segment to be matched, where the duration of the third voice segment is consistent with the duration of the active segment to be matched, and the fourth voice segment is used to match the corresponding action segment according to the updated initial position of the start-stop identifier.
  • If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the fourth reverse duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the q-th active segment.
  • In this embodiment, the audio-visual matching device may obtain the q-th active segment from the image sequence, determine the third reverse duration minimum and the third reverse duration maximum, and then determine the fourth reverse duration interval. If the voice segment to be matched is within the fourth reverse duration interval, the duration of the active segment to be matched is determined, so that the voice segment to be matched is divided into a third voice segment and a fourth voice segment.
  • The fourth voice segment is used to match the corresponding action segment according to the updated initial position of the start-stop identifier; that is, the fourth voice segment is used as the voice segment to be matched in the next round of audio-picture matching.
  • If the voice segment to be matched is within the fourth reverse duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the q-th active segment.
  • Specifically, take as an example the initial position of the start-stop identifier being the 38th frame of the image sequence (the end frame of the p-th active segment), the start frame of the p-th active segment being the 31st frame, the end frame of the (p-1)-th active segment being the 29th frame, and the duration of the voice segment to be matched being 15 frames.
  • From the foregoing formulas, the target reverse duration interval is [6.4, 12.5] and the first reverse duration maximum is 12.5 frames. The voice segment to be matched is therefore not within the target reverse duration interval and is greater than the first reverse duration maximum, so the q-th active segment needs to be obtained.
  • The third reverse duration minimum can be calculated by the following formula: scale_long * (Index - e_q + 1).
  • The third reverse duration maximum can be calculated by the following formula: scale_short * (Index - s_(q-1) + 1).
  • The fourth reverse duration interval can be calculated by the following formula: [scale_long * (Index - e_q + 1), scale_short * (Index - s_(q-1) + 1)].
  • The duration of the active segment to be matched can be calculated by the following formula: scale_long * (Index - e_q + 1) - 1.
  • Index represents the initial position of the start-stop identifier; e_q represents the end frame of the q-th active segment; s_(q-1) represents the start frame of the (q-1)-th active segment; scale_short represents the minimum zoom ratio; scale_long represents the maximum zoom ratio.
  • Take as an example the initial position of the start-stop identifier being the 38th frame of the image sequence (the end frame of the p-th active segment), the start frame of the (q-1)-th active segment being the 9th frame, and the end frame of the q-th active segment being the 29th frame.
  • From the foregoing formulas, the third reverse duration minimum is 12.5 frames and the third reverse duration maximum is 24 frames, so the fourth reverse duration interval is [12.5, 24]. If the duration of the voice segment to be matched is 15 frames, it lies within the fourth reverse duration interval. Further, from the foregoing formula, the duration of the active segment to be matched is 11.5 frames.
  • The third voice segment is calculated by the following formula: [1, scale_long * (Index - e_q + 1) - 1].
  • The duration of the fourth voice segment is calculated by the following formula: l_i = l_i - scale_long * (Index - e_q + 1).
  • Index represents the initial position of the start-stop identifier; e_q represents the end frame of the q-th active segment; scale_long represents the maximum zoom ratio; l_i represents the duration of the voice segment to be matched.
  • Therefore, from the foregoing formulas, the third voice segment spans frames 1 to 11.5, and the duration of the fourth voice segment is 2.5 frames.
  • Since the voice segment to be matched is within the fourth reverse duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the q-th active segment, that is, changed from the 38th frame to the 29th frame. The fourth voice segment obtained above can then match the corresponding action segment according to the updated position of the start-stop identifier.
  • The specific matching method is similar to that of the foregoing embodiments and is not repeated here.
  • In the embodiments of this application, a method for obtaining the active segment to be matched is provided.
  • In the above manner, the accuracy of matching can be improved, ensuring that the synthesized video segment has a more realistic visual effect: the scene of a person speaking in the video segment is more lifelike and close to the effect of a person speaking in a real scene, making it difficult to recognize that the voice and images in the video segment have been synthesized.
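  • The split described above can be sketched as follows (assumed names; durations in frames, possibly fractional): the head of the voice segment is matched this round against the span ending at e_q, and the remainder is carried into the next round. The print reproduces the 11.5/2.5-frame worked example:

    SCALE_LONG = 1.25  # maximum zoom ratio from the worked examples

    def split_overlong_reverse(index: int, e_q: int, li: float):
        covered = SCALE_LONG * (index - e_q + 1)  # voice duration consumed now
        third = (1, covered - 1)                  # third voice segment: frames 1..covered-1
        fourth = li - covered                     # fourth voice segment duration
        return third, fourth, e_q                 # cursor moves to e_q

    print(split_overlong_reverse(index=38, e_q=29, li=15))
    # ((1, 11.5), 2.5, 29) -- matches the example above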
  • FIG. 16 is a schematic diagram of an embodiment of the audio-visual matching device in an embodiment of this application.
  • the audio-visual matching device 20 includes:
  • the receiving module 201 is configured to obtain a voice sequence, where the voice sequence includes M voice segments, and M is an integer greater than or equal to 1;
  • the acquiring module 202 is configured to acquire a voice segment to be matched from a voice sequence, where the voice segment to be matched belongs to any voice segment in the voice sequence;
  • the acquiring module 202 is also used to acquire the initial position and the movement direction of the start-stop identifier from an image sequence, where the image sequence includes N active segments, each active segment includes action pictures of the object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1;
  • the acquiring module 202 is further configured to determine the active segment to be matched according to the initial position of the start and end identifier, the movement direction of the start and end identifier, and the voice segment to be matched;
  • the processing module 203 is configured to perform synthesis processing on the voice segment to be matched and the activity segment to be matched to obtain a video segment, where the video segment includes the action picture of the object and the voice of the object.
  • the audio-visual matching device 20 further includes an update module 204 and an adjustment module 205;
  • the acquiring module 202 is specifically configured to: when the movement direction of the start-stop identifier is forward and its initial position is less than or equal to the start frame of the j-th active segment, determine the first forward duration minimum according to the minimum zoom ratio, the initial position of the start-stop identifier, and the end frame of the j-th active segment, where j is an integer greater than or equal to 1 and less than or equal to (N+1); determine the first forward duration maximum according to the maximum zoom ratio, the initial position of the start-stop identifier, and the start frame of the (j+1)-th active segment; determine the target forward duration interval according to the first forward duration minimum and the first forward duration maximum; and
  • if the voice segment to be matched is within the target forward duration interval, determine the active segment to be matched according to at least one of the j-th active segment and the (j+1)-th active segment;
  • the update module 204 is configured to update the initial position of the start and end markers if the start and end marker position update conditions are met;
  • the adjustment module 205 is configured to adjust the movement direction of the start and end markers to reverse if the initial position of the updated start and end markers is greater than or equal to the position corresponding to the end frame of the Nth active segment.
  • the acquiring module 202 is specifically configured to determine the second forward duration minimum according to the initial position of the start-stop identifier and the end frame of the j-th active segment, where the second forward duration minimum is greater than the first forward duration minimum, and to determine the first forward duration interval according to the first forward duration minimum and the second forward duration minimum;
  • if the voice segment to be matched is within the first forward duration interval, the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the first forward duration interval;
  • the initial position of the start and end identifier is updated to the position corresponding to the end frame of the j-th active segment.
  • the acquiring module 202 is specifically configured to determine the second forward duration maximum according to the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment, where the second forward duration maximum is less than the first forward duration maximum, and to determine the second forward duration interval according to the first forward duration maximum and the second forward duration maximum;
  • if the voice segment to be matched is within the second forward duration interval, the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the second forward duration interval;
  • the initial position of the start and end identifier is updated to the position corresponding to the start frame of the (j+1)th active segment.
  • the acquiring module 202 is specifically configured to determine the second forward duration minimum according to the initial position of the start-stop identifier and the end frame of the j-th active segment, where the second forward duration minimum is greater than the first forward duration minimum; determine the second forward duration maximum according to the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment, where the second forward duration maximum is less than the first forward duration maximum; determine the third forward duration interval according to the second forward duration minimum and the second forward duration maximum; and, if the voice segment to be matched is within the third forward duration interval, determine the active segment to be matched according to the initial position of the start-stop identifier and the duration of the voice segment to be matched;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the third forward duration interval;
  • the initial position of the start and end identifier is updated to the position corresponding to the end frame of the active segment to be matched.
  • the acquiring module 202 is further configured to: if the voice segment to be matched is not within the target forward duration interval and its duration is less than the first forward duration minimum, determine the active segment to be matched according to the duration of the voice segment to be matched, the initial position of the start-stop identifier, and the moving radius;
  • or, if the voice segment to be matched is not within the target forward duration interval and its duration is less than the first forward duration minimum, determine the active segment to be matched according to the duration of the voice segment to be matched and the initial position of the start-stop identifier.
  • the audio-visual matching device 20 further includes a determining module 206 and a dividing module 207;
  • the acquiring module 202 is further configured to: if the voice segment to be matched is not within the target forward duration interval and its duration is greater than the first forward duration maximum, acquire the k-th active segment from the image sequence, where k is an integer greater than or equal to 1 and less than or equal to N;
  • the determining module 206 is configured to determine the third minimum value of the forward duration according to the maximum zoom ratio, the initial position of the start and end identifiers, and the start frame of the k-th active segment;
  • the determining module 206 is further configured to determine the third maximum forward duration according to the minimum zoom ratio, the initial position of the start and end markers, and the end frame of the (k+1)th active segment;
  • the determining module 206 is further configured to determine a fourth forward duration interval according to the third minimum forward duration and the third maximum forward duration;
  • the determining module 206 is further configured to determine the duration of the active segment to be matched according to the initial position of the start and end identifier, the maximum zoom ratio, and the start frame of the k-th active segment if the speech segment to be matched is within the fourth forward duration interval ;
  • the dividing module 207 is configured to divide the voice segment to be matched into a first voice segment and a second voice segment according to the duration of the active segment to be matched,
  • where the duration of the first voice segment is consistent with the duration of the active segment to be matched, and the second voice segment is used to match the corresponding action segment according to the updated initial position of the start-stop identifier;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the fourth forward duration interval;
  • the initial position of the start and end markers is updated to the position corresponding to the start frame of the k-th active segment.
  • the acquiring module 202 is specifically configured to: when the movement direction of the start-stop identifier is reverse and its initial position is greater than or equal to the start frame of the p-th active segment, determine the first reverse duration minimum according to the minimum zoom ratio, the initial position of the start-stop identifier, and the start frame of the p-th active segment, where p is an integer greater than or equal to 1 and less than or equal to N; determine the first reverse duration maximum according to the maximum zoom ratio, the initial position of the start-stop identifier, and the end frame of the (p-1)-th active segment; determine the target reverse duration interval according to the first reverse duration minimum and the first reverse duration maximum; and, if the voice segment to be matched is within the target reverse duration interval, determine the active segment to be matched according to at least one of the p-th active segment and the (p-1)-th active segment;
  • the update module 204 is further configured to update the initial position of the start and end markers if the start and end marker position update conditions are met;
  • the adjustment module is further configured to adjust the movement direction of the start and end markers to a positive direction if the initial position of the updated start and end markers is less than or equal to the position corresponding to the start frame of the first active segment.
  • the acquiring module 202 is specifically configured to determine the second reverse duration minimum according to the initial position of the start-stop identifier and the start frame of the p-th active segment, where the second reverse duration minimum is greater than the first reverse duration minimum, and to determine the first reverse duration interval according to the first reverse duration minimum and the second reverse duration minimum;
  • if the voice segment to be matched is within the first reverse duration interval, the duration between the start frame of the p-th active segment and the initial position of the start-stop identifier is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the first reverse duration interval;
  • the initial position of the start and end markers is updated to the position corresponding to the start frame of the p-th active segment.
  • the acquiring module 202 is specifically configured to determine the second reverse duration maximum according to the initial position of the start-stop identifier and the end frame of the (p-1)-th active segment, where the second reverse duration maximum is less than the first reverse duration maximum, and to determine the second reverse duration interval according to the first reverse duration maximum and the second reverse duration maximum;
  • if the voice segment to be matched is within the second reverse duration interval, the duration between the end frame of the (p-1)-th active segment and the initial position of the start-stop identifier is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the second reverse duration interval;
  • the initial position of the start and end markers is updated to the position corresponding to the end frame of the (p-1)th active segment.
  • the obtaining module 202 is specifically configured to determine the second reverse duration minimum according to the initial position of the start-stop identifier and the start frame of the p-th active segment, where the second reverse duration minimum is greater than the first reverse duration minimum; determine the second reverse duration maximum according to the initial position of the start-stop identifier and the end frame of the (p-1)-th active segment, where the second reverse duration maximum is less than the first reverse duration maximum; determine the third reverse duration interval according to the second reverse duration minimum and the second reverse duration maximum; and, if the voice segment to be matched is within the third reverse duration interval, determine the active segment to be matched according to the initial position of the start-stop identifier and the duration of the voice segment to be matched;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the third reverse duration interval;
  • the initial position of the start and end identifier is updated to the position corresponding to the start frame of the active segment to be matched.
  • the acquiring module 202 is further configured to: if the voice segment to be matched is not within the target reverse duration interval and its duration is less than the first reverse duration minimum, determine the active segment to be matched according to the duration of the voice segment to be matched, the initial position of the start-stop identifier, and the moving radius;
  • or, if the voice segment to be matched is not within the target reverse duration interval and its duration is less than the first reverse duration minimum, determine the active segment to be matched according to the duration of the voice segment to be matched and the initial position of the start-stop identifier.
  • the acquiring module 202 is further configured to: if the voice segment to be matched is not within the target reverse duration interval and its duration is greater than the first reverse duration maximum, obtain the q-th active segment from the image sequence, where q is an integer greater than or equal to 1 and less than or equal to N;
  • the determining module 206 is further configured to determine the third minimum reverse duration according to the maximum zoom ratio, the initial position of the start and end markers, and the end frame of the qth active segment;
  • the determining module 206 is further configured to determine the third maximum reverse duration according to the minimum zoom ratio, the initial position of the start and end markers, and the start frame of the (q-1)th active segment;
  • the determining module 206 is further configured to determine a fourth reverse duration interval according to the third minimum reverse duration and the third maximum reverse duration;
  • the determining module 206 is further configured to determine the duration of the active segment to be matched according to the initial position of the start and end identifier, the maximum zoom ratio, and the end frame of the qth active segment if the voice segment to be matched is within the fourth reverse duration interval;
  • the dividing module 207 is further configured to divide the voice segment to be matched into a third voice segment and a fourth voice segment according to the duration of the active segment to be matched,
  • where the duration of the third voice segment is consistent with the duration of the active segment to be matched,
  • and the fourth voice segment is used to match the corresponding action segment according to the updated initial position of the start-stop identifier;
  • the update module 204 is specifically configured to determine that the start-stop identifier position update condition is met if the voice segment to be matched is within the fourth reverse duration interval;
  • the initial position of the start and end markers is updated to the position corresponding to the end frame of the qth active segment.
  • the video segment is at least one of a virtual video segment, a composite video segment, and a clipped video segment;
  • the start-stop identifier is a cursor or a slider.
  • The embodiment of the present application also provides a terminal device, on which the audio-visual matching device may be deployed, as shown in FIG. 17.
  • The terminal device can be any terminal device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal device, a vehicle-mounted computer, and the like. The terminal device being a mobile phone is taken as an example:
  • FIG. 17 shows a block diagram of a part of the structure of a mobile phone related to a terminal device provided in an embodiment of the present application.
  • The mobile phone includes components such as a radio frequency (RF) circuit 310, a memory 320, an input unit 330, a display unit 340, a sensor 350, an audio circuit 360, a wireless fidelity (WiFi) module 370, a processor 380, and a power supply 390.
  • Those skilled in the art can understand that the structure of the mobile phone shown in FIG. 17 does not constitute a limitation on the mobile phone; it may include more or fewer components than shown, combine some components, or use a different arrangement of components.
  • the processor 380 included in the terminal device can perform the functions in the foregoing embodiment, and details are not described herein again.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to execute the steps performed by the audio-visual matching device in the methods described in the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product including a program, which when running on a computer, causes the computer to execute the steps performed by the audio-visual matching apparatus in the methods described in the foregoing embodiments.


Abstract

This application discloses a video audio-picture matching method, a related apparatus, and a storage medium, for use in the field of artificial intelligence. The method of this application includes: obtaining a voice sequence; obtaining a voice segment to be matched from the voice sequence; obtaining, from an image sequence, the initial position of a start-stop identifier and the movement direction of the start-stop identifier; determining an active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched; and synthesizing the voice segment to be matched with the active segment to be matched to obtain a video segment. In the process of synthesizing the video, this application uses the start-stop identifier to locate the positions of active segments in the image sequence, so that active segments containing motion are matched with voice segments, making the synthesized video segment better conform to the natural patterns of a person speaking and giving it better authenticity.

Description

Video audio-picture matching method, related apparatus, and storage medium

This application claims priority to Chinese Patent Application No. 2020103263061, filed with the Chinese Patent Office on April 23, 2020 and entitled "Video audio-picture matching method, related apparatus, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field

This application relates to the field of artificial intelligence, and in particular to audio-picture matching technology for video.
Background

With the continuous development of science and technology, computer vision technology has been widely applied in many fields such as digital entertainment, healthcare, and security monitoring. Synthesizing realistic visual content based on computer vision technology not only has great commercial value, but has also long been desired by the industry.

At present, the related art proposes a method of generating video through Generative Adversarial Networks (GAN): a neural network maps known image textures into an unseen scene, and the mapped images are repaired and completed, thereby generating the desired video content.

However, video content generated by a GAN includes only an image sequence and no speech content; moreover, limited by insufficient training data and the instability of training methods, the generated image sequence often has fairly obvious defects, resulting in poor authenticity of the generated video content.
Summary

Embodiments of this application provide a video audio-picture matching method, a related apparatus, and a storage medium. In the process of synthesizing a video, the start-stop identifier can be used to locate the positions of active segments in the image sequence, so that active segments containing motion are matched with voice segments. In this way, the synthesized video segment is guaranteed a more realistic visual effect: the scene of a person speaking shown in the video segment is more lifelike and close to the effect of a person speaking in a real scene, making it difficult to recognize that the voice and images in the video segment have been synthesized. In addition, using the movement direction of the start-stop identifier, voice segments and active segments can be matched in an orderly manner, which improves the consistency and continuity of motion and voice in the synthesized video segment.
In view of this, a first aspect of this application provides a video audio-picture matching method, including:

obtaining a voice sequence, where the voice sequence includes M voice segments and M is an integer greater than or equal to 1;

obtaining a voice segment to be matched from the voice sequence, where the voice segment to be matched is any voice segment in the voice sequence;

obtaining, from an image sequence, the initial position of a start-stop identifier and the movement direction of the start-stop identifier, where the image sequence includes N active segments, each active segment includes action pictures of an object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1;

determining an active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched; and

synthesizing the voice segment to be matched with the active segment to be matched to obtain a video segment, where the video segment includes the action pictures of the object and the voice of the object.
A second aspect of this application provides an audio-picture matching apparatus, including:

a receiving module, configured to obtain a voice sequence, where the voice sequence includes M voice segments and M is an integer greater than or equal to 1;

an obtaining module, configured to obtain a voice segment to be matched from the voice sequence, where the voice segment to be matched is any voice segment in the voice sequence;

the obtaining module being further configured to obtain, from an image sequence, the initial position of a start-stop identifier and the movement direction of the start-stop identifier, where the image sequence includes N active segments, each active segment includes action pictures of an object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1;

the obtaining module being further configured to determine an active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched; and

a processing module, configured to synthesize the voice segment to be matched with the active segment to be matched to obtain a video segment, where the video segment includes the action pictures of the object and the voice of the object.
A third aspect of this application provides a computer device, including a memory, a transceiver, a processor, and a bus system, where the memory is configured to store a program; the processor is configured to execute the program in the memory to implement the methods described in the foregoing aspects; and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.

A fourth aspect of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described in the foregoing aspects.

A fifth aspect of this application provides a computer program product including instructions that, when run on a computer, cause the computer to perform the methods described in the foregoing aspects.
As can be seen from the above technical solutions, the embodiments of this application have the following advantages:

In the video audio-picture matching method provided by the embodiments of this application, a voice sequence sent by a client is first received, and a voice segment to be matched is obtained from the voice sequence; the initial position and the movement direction of the start-stop identifier are obtained from an image sequence; the active segment to be matched is then determined according to the initial position of the start-stop identifier, its movement direction, and the voice segment to be matched; finally, the voice segment to be matched and the active segment to be matched are synthesized to obtain a video segment. In this way, in the process of synthesizing the video, the position of the start-stop identifier is used to locate the positions of active segments in the image sequence, so that active segments containing motion are matched with voice segments. This ensures that the synthesized video segment has a more realistic visual effect: the scene of a person speaking shown in the video segment is more lifelike and close to the effect of a person speaking in a real scene, making it difficult to recognize that the voice and images in the video segment have been synthesized. In addition, using the movement direction of the start-stop identifier, voice segments and active segments can be matched in an orderly manner, improving the consistency and continuity of voice and images in the synthesized video segment.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of a scene of generating a video based on the audio-picture matching method in an embodiment of this application;

FIG. 2 is a schematic architecture diagram of the audio-picture matching system in an embodiment of this application;

FIG. 3 is a schematic flowchart of the video audio-picture matching method in an embodiment of this application;

FIG. 4 is a schematic diagram of an embodiment of the video audio-picture matching method in an embodiment of this application;

FIG. 5 is a schematic diagram of an embodiment of a voice sequence in an embodiment of this application;

FIG. 6A is a schematic diagram of an embodiment of an image sequence in an embodiment of this application;

FIG. 6B is a schematic diagram of an embodiment of the initial position of the start-stop identifier in an embodiment of this application;

FIG. 6C is a schematic diagram of another embodiment of the initial position of the start-stop identifier in an embodiment of this application;

FIG. 7 is a schematic diagram of an embodiment of determining the active segment to be matched in an embodiment of this application;

FIGS. 8 to 15 are schematic diagrams of further embodiments of determining the active segment to be matched in embodiments of this application;

FIG. 16 is a schematic diagram of an embodiment of the audio-picture matching apparatus in an embodiment of this application;

FIG. 17 is a schematic diagram of an embodiment of a terminal device in an embodiment of this application.
Detailed Description

The audio-picture matching method provided by this application is applied to video synthesis scenarios: a video including both voice information and image information can be synthesized, and applications such as virtual idols, virtual commentators, or virtual teachers can be implemented based on such video. For example, in a virtual commentary scenario, a video including an image sequence with speaking actions can be obtained; a voice sequence is then obtained, which may be pre-recorded, collected in real time, or converted from text; then, using the audio-picture matching method provided by this application, the voice sequence is aligned with the image sequence in the video and the corresponding video is synthesized, thereby implementing virtual commentary.

For ease of understanding, taking the scenario of implementing a virtual broadcasting object as an example, the application scenarios of the audio-picture matching method provided by this application are introduced below with reference to FIG. 1, a schematic diagram of a scene of generating a video based on the audio-picture matching method in an embodiment of this application. As shown in (A) of FIG. 1, a user inputs a voice sequence through the microphone of a terminal device; the terminal device can obtain the voice segment to be matched from the input voice sequence, determine the active segment to be matched online, and then synthesize the voice segment to be matched with the active segment to be matched to obtain the video segment shown in (B) of FIG. 1, which includes the generated action pictures of the object and the object's voice. A video segment synthesized in this way better conforms to the natural patterns of a person speaking, so that the video segment displayed on the client of the terminal device has better authenticity. It can be understood that the application scenarios are not exhaustively listed here.

To improve the authenticity of video content in various application scenarios, this application proposes a video audio-picture matching method, which is applied to the video audio-picture matching system shown in FIG. 2, a schematic architecture diagram of the audio-picture matching system in an embodiment of this application. As shown in FIG. 2, the system includes a server and terminal devices, and the audio-picture matching apparatus may be deployed on the server or on a terminal device. In one exemplary manner, the terminal device obtains the voice sequence, obtains the voice segment to be matched from it, obtains the active segment to be matched from the image sequence according to the audio-picture matching method provided by this application, and synthesizes the voice segment to be matched with the active segment to be matched on the terminal device side to obtain a video segment, which the terminal device can play directly. In another exemplary manner, the terminal device obtains the voice sequence and sends it to the server; the server obtains the voice segment to be matched from the voice sequence, obtains the active segment to be matched from the image sequence according to the audio-picture matching method provided by this application, and synthesizes them on the server side to obtain a video segment, which the server feeds back to the terminal device for playback.

It should be noted that the server in FIG. 2 may be a single server, a server cluster composed of multiple servers, a cloud computing center, or the like, which is not limited here. Besides the tablet computer, laptop computer, palmtop computer, mobile phone, and personal computer (PC) shown in FIG. 2, the terminal device may also be another voice interaction device, including but not limited to smart speakers and smart home appliances.

Although FIG. 2 shows only five terminal devices and one server, it should be understood that the example in FIG. 2 is only for understanding this solution, and the specific numbers of terminal devices and servers should be determined flexibly according to actual situations.
The embodiments of this application can implement audio-picture matching based on artificial intelligence (AI) technology.

Based on this, the video audio-picture matching method is introduced below. Refer to FIG. 3, a schematic flowchart of the video audio-picture matching method in an embodiment of this application; as shown in FIG. 3, the method includes the following steps.

In step S1, the initial position of the start-stop identifier and the movement direction of the start-stop identifier are obtained from the image sequence.

In step S2, it is first determined whether, after the active segment to be matched is scaled, there is a case where its duration equals that of the voice segment to be matched. If so, step S3 is performed; if not, and the reason is that the voice segment to be matched is too short, step S4 is performed; if not, and the reason is that the voice segment to be matched is too long, step S5 is performed.

In step S3, the scaled active segment to be matched is directly matched with the voice segment to be matched, and a video segment is obtained.

In step S4, the active segment to be matched is generated with the start-stop identifier as the central origin and matched with the voice segment to be matched, and a video segment is obtained.

In step S5, a section of active segment to be matched is generated and matched with the voice segment to be matched, and the initial position and the movement direction of the start-stop identifier are then re-obtained.
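To make the branching of steps S2 to S5 concrete, the following runnable Python sketch reproduces the forward-direction decision using the zoom ratios and segment layout from the worked examples later in this application; all names (step, segments, and so on) are illustrative assumptions rather than anything defined by this application, and Python's 0-based list indices stand in for the 1-based segment numbering of the text.

    # Toy of the FIG. 3 decision flow, forward direction only (assumed names).
    SCALE_SHORT, SCALE_LONG = 0.8, 1.25

    def step(index, segments, j, li):
        # Decide how a voice segment of li frames is handled at cursor `index`,
        # where segments[j] = (s_j, e_j) and segments[j + 1] = (s_j1, e_j1).
        s_j, e_j = segments[j]
        s_j1, _ = segments[j + 1]
        lo = SCALE_SHORT * (e_j - index + 1)   # first forward duration minimum
        hi = SCALE_LONG * (s_j1 - index + 1)   # first forward duration maximum
        if lo <= li <= hi:
            return "scale and match directly (step S3)"
        if li < lo:
            return "swing around the cursor (step S4)"
        return "split the voice segment (step S5)"

    segs = [(10, 16), (18, 25)]
    print(step(10, segs, 0, 8))   # scale and match directly (step S3)
    print(step(10, segs, 0, 3))   # swing around the cursor (step S4)
    print(step(10, segs, 0, 25))  # split the voice segment (step S5)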
The solutions provided by the embodiments of this application involve computer vision technology. With reference to the above introduction, the video audio-picture matching method in this application is described below with the audio-picture matching apparatus in a computer device as the execution subject. Referring to FIG. 4, an embodiment of the video audio-picture matching method in the embodiments of this application includes:

101. Obtain a voice sequence, where the voice sequence includes M voice segments and M is an integer greater than or equal to 1.

In this embodiment, the audio-picture matching apparatus may receive a voice sequence sent by a client, the voice sequence including at least one voice segment. Specifically, the voice sequence sent by the client is input online by the client user; for example, the user inputs a piece of speech through a microphone and the corresponding voice sequence is generated, or the user inputs text content which is converted into a voice sequence. In addition, the audio-picture matching apparatus may also obtain the voice sequence from a database, the voice sequence likewise including at least one voice segment. This application does not limit the manner of obtaining the voice sequence.

It should be noted that the audio-picture matching apparatus may be deployed on any computer device, such as a server or a terminal device. In this application, an audio-picture matching apparatus deployed on a terminal device is taken as an example for description, but this should not be construed as a limitation on this application.

102. Obtain a voice segment to be matched from the voice sequence, where the voice segment to be matched is any voice segment in the voice sequence.

In this embodiment, the audio-picture matching apparatus may obtain one voice segment to be matched from the voice sequence. Specifically, the duration of the voice segment to be matched is l_i, where i is an integer greater than or equal to 1 and less than or equal to M. To match and align the voice segment to be matched with the active segment to be matched in the image sequence, this application may extract segments from the voice sequence and the image sequence at a rate of 30 frames per second.
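As a small illustration of this 30-frames-per-second convention, the sketch below (an assumed helper, not part of this application) converts a duration in seconds to the frame counts used throughout the examples.

    FPS = 30  # extraction rate assumed from the description above

    def seconds_to_frames(duration_s: float) -> float:
        # At 30 frames per second, durations in seconds map linearly to frames.
        return duration_s * FPS

    print(seconds_to_frames(0.2))  # 6.0 frames, e.g. the 6-frame segment D6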
For ease of understanding, refer to FIG. 5, a schematic diagram of an embodiment of a voice sequence in an embodiment of this application. As shown in FIG. 5, A0 indicates a voice sequence, where A1, A2, A3, A4, and A5 indicate different voice segments in the voice sequence; the voice segment to be matched may be any one of these five voice segments.

103. Obtain, from the image sequence, the initial position of the start-stop identifier and the movement direction of the start-stop identifier, where the image sequence includes N active segments, each active segment includes action pictures of an object, the initial position of the start-stop identifier is the start frame or the end frame of an active segment, and N is an integer greater than or equal to 1.

In this embodiment, the audio-picture matching apparatus needs to obtain an image sequence, which is a sequence composed of multiple frames of images and includes active segments and silent segments. Each active segment includes action pictures of the object, while a silent segment generally includes no action pictures of the object; for example, a silent segment may include only background images.

The audio-picture matching apparatus obtains the initial position and the movement direction of the start-stop identifier from the image sequence; the initial position may be the start frame or the end frame of an active segment. The start-stop identifier may be a cursor or a slider. A cursor can move forward or backward, so it can be regarded as a pointer that can designate any position in the image sequence or the voice sequence; a slider is similar to a cursor, can likewise move forward or backward, and can designate any position in the image sequence or the voice sequence. The start-stop identifier can therefore be expressed as a frame number in the image sequence, with total durations expressed in numbers of frames. The object in an active segment may be a virtual object, such as a virtual broadcaster, a virtual character, or a cartoon character, or a real object, for example, user A.

Specifically, refer to FIG. 6A, a schematic diagram of an embodiment of an image sequence in an embodiment of this application. As shown in FIG. 6A, B0 indicates an image sequence, where B1, B2, B3, B4, and B5 indicate different active segments in the image sequence. Refer to FIG. 6B, a schematic diagram of an embodiment of the initial position of the start-stop identifier: when the movement direction of the start-stop identifier is forward, B6 indicates the corresponding initial position of the start-stop identifier, which is the start frame of active segment B3. Refer to FIG. 6C, a schematic diagram of another embodiment of the initial position of the start-stop identifier: when the movement direction of the start-stop identifier is reverse, B7 indicates the corresponding initial position of the start-stop identifier, which is the end frame of active segment B3.
104. Determine the active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched.

In this embodiment, the audio-picture matching apparatus may determine the active segment to be matched, which includes action pictures of the object, according to the initial position of the start-stop identifier, its movement direction, and the voice segment to be matched. Specifically, assuming that A3 in FIG. 5 is the voice segment to be matched, the movement direction of the start-stop identifier is forward, and the initial position of the start-stop identifier is the position shown by B6 in FIG. 6B, the active segment to be matched can be determined to be active segment B3 in FIG. 6A, which includes action pictures of the object.

105. Synthesize the voice segment to be matched with the active segment to be matched to obtain a video segment, where the video segment includes the action pictures of the object and the voice of the object.

In this embodiment, the audio-picture matching apparatus synthesizes the voice segment to be matched with the active segment to be matched to obtain a video segment. Specifically, assuming that A3 in FIG. 5 is the voice segment to be matched and B3 in FIG. 6A is the active segment to be matched, A3 and B3 can be synthesized to obtain a video segment; since A3 includes the object's voice and B3 includes the object's action pictures, the video segment includes both the object's action pictures and the corresponding voice.

Optionally, to further improve the quality of the synthesized video, a neural network may also be used to synthesize the corresponding lip shapes according to the content of the speech, and the lip shapes are then stitched into the synthesized video segment.

It can be understood that the video segment includes but is not limited to a virtual video segment, a composite video segment, and a clipped video segment. When the video segment is a virtual video segment, it includes the action pictures and voice of a virtual object; when it is a composite video segment, it includes the action pictures and voice of an object; when it is a clipped video segment, it includes a partial segment clipped from a complete video, the segment including the action pictures and voice of an object.

In the embodiments of this application, a video audio-picture matching method is provided. In the above manner, in the process of synthesizing a video, the position of the start-stop identifier is used to locate the positions of active segments in the image sequence, so that active segments containing motion are matched with voice segments. This ensures that the synthesized video segment has a more realistic visual effect: the scene of a person speaking shown in the video segment is more lifelike and close to the effect of a person speaking in a real scene, making it difficult to recognize that the voice and images in the video segment have been synthesized. In addition, using the movement direction of the start-stop identifier, voice segments and active segments can be matched in an orderly manner, improving the consistency and continuity of voice and images in the synthesized video segment.
Optionally, on the basis of the embodiment corresponding to FIG. 4, in an optional embodiment of the video audio-picture matching method provided by the embodiments of this application, when the movement direction of the start-stop identifier is forward and the initial position of the start-stop identifier is less than or equal to the start frame of the j-th active segment (j being an integer greater than or equal to 1 and less than or equal to (N+1)), determining the active segment to be matched according to the initial position of the start-stop identifier, the movement direction of the start-stop identifier, and the voice segment to be matched may include:

determining the first forward duration minimum according to the minimum zoom ratio, the initial position of the start-stop identifier, and the end frame of the j-th active segment;

determining the first forward duration maximum according to the maximum zoom ratio, the initial position of the start-stop identifier, and the start frame of the (j+1)-th active segment;

determining the target forward duration interval according to the first forward duration minimum and the first forward duration maximum; and

if the voice segment to be matched is within the target forward duration interval, determining the active segment to be matched according to at least one of the j-th active segment and the (j+1)-th active segment.

The video audio-picture matching method further includes: if the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier; and if the updated initial position of the start-stop identifier is greater than or equal to the position corresponding to the end frame of the N-th active segment, adjusting the movement direction of the start-stop identifier to reverse.

In this embodiment, when the movement direction of the start-stop identifier is forward and its initial position is less than or equal to the start frame of the j-th active segment, the audio-picture matching apparatus may determine the first forward duration minimum and the first forward duration maximum, then determine the target forward duration interval; when the voice segment to be matched is within the target forward duration interval, the active segment to be matched can be determined.

Specifically, a minimum zoom ratio of 0.8 and a maximum zoom ratio of 1.25 are taken as an example. The first forward duration minimum can be calculated by the following formula:

scale_short * (e_j - Index + 1);

the first forward duration maximum can be calculated by the following formula:

scale_long * (s_(j+1) - Index + 1);

and the target forward duration interval can be calculated by the following formula:

[scale_short * (e_j - Index + 1), scale_long * (s_(j+1) - Index + 1)];

where Index represents the initial position of the start-stop identifier, scale_short represents the minimum zoom ratio, e_j represents the end frame of the j-th active segment, scale_long represents the maximum zoom ratio, and s_(j+1) represents the start frame of the (j+1)-th active segment.

For ease of understanding, when the movement direction of the start-stop identifier is forward, take as an example the initial position of the start-stop identifier being the 10th frame of the image sequence, the start frame of the j-th active segment being the 10th frame, the end frame of the j-th active segment being the 16th frame, and the start frame of the (j+1)-th active segment being the 18th frame; refer to FIG. 7, a schematic diagram of an embodiment of determining the active segment to be matched in an embodiment of this application. As shown in FIG. 7, C0 in (A) of FIG. 7 represents the initial position Index of the start-stop identifier, namely the 10th frame of the image sequence; C1 represents the start frame s_j of the j-th active segment, namely the 10th frame; C2 represents the end frame e_j of the j-th active segment, namely the 16th frame; C3 represents the start frame s_(j+1) of the (j+1)-th active segment, namely the 18th frame; C4 represents the length of the j-th active segment, and C5 represents the length of the (j+1)-th active segment.

From the foregoing formulas, the first forward duration minimum is 5.6 and the first forward duration maximum is 11.25, so the target forward duration interval is [5.6, 11.25]. If the duration of the voice segment to be matched lies within [5.6, 11.25], i.e., the voice segment to be matched C6 shown in (B) of FIG. 7, the active segment to be matched can be determined according to at least one of active segment C4 and active segment C5.

When the start-stop identifier position update condition is met, the audio-picture matching apparatus may also update the initial position of the start-stop identifier; when the updated initial position is greater than or equal to the position corresponding to the end frame of the N-th active segment, the movement direction of the start-stop identifier is adjusted to reverse. That is, if the movement direction of the start-stop identifier is forward and the updated initial position has passed the end frame of the last active segment in the image sequence, the movement direction needs to be changed to reverse and operations similar to the forward case are performed. By updating and adjusting the movement direction of the start-stop identifier between forward and reverse, a voice sequence input in real time can be matched, so that highly authentic video is generated in real time.

In the embodiments of this application, a method for determining the active segment to be matched is provided. In the above manner, when the movement direction of the start-stop identifier is forward, the active segment to be matched is determined from the initial position of the start-stop identifier and the start and end frames of the active segments, combined with the voice segment to be matched; the video thus synthesized better matches the scene of the object's actual speech, giving the video more authenticity. In addition, matching against an active segment and the next active segment makes the active segments to be matched that correspond to different voice segments connect end to end, improving the consistency and continuity of voice and images in the synthesized video segment.
Optionally, on the basis of the embodiment corresponding to FIG. 4, in another optional embodiment of the video audio-picture matching method provided by the embodiments of this application, determining the active segment to be matched according to at least one of the j-th active segment and the (j+1)-th active segment may include:

determining the second forward duration minimum according to the initial position of the start-stop identifier and the end frame of the j-th active segment, where the second forward duration minimum is greater than the first forward duration minimum;

determining the first forward duration interval according to the first forward duration minimum and the second forward duration minimum; and

if the voice segment to be matched is within the first forward duration interval, scaling the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment according to the duration of the voice segment to be matched, to obtain the active segment to be matched.

If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the first forward duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the j-th active segment.

In this embodiment, the audio-picture matching apparatus may determine the second forward duration minimum and then determine the first forward duration interval according to the first and second forward duration minimums. When the voice segment to be matched is within the first forward duration interval, the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched; the start-stop identifier position update condition is then met, and the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the j-th active segment.

Specifically, the second forward duration minimum can be calculated by the following formula:

e_j - Index + 1;

and the first forward duration interval can be calculated by the following formula:

[scale_short * (e_j - Index + 1), e_j - Index + 1];

where Index represents the initial position of the start-stop identifier, scale_short represents the minimum zoom ratio, and e_j represents the end frame of the j-th active segment.

For ease of understanding, take as an example the initial position of the start-stop identifier being the 10th frame, the start frame of the j-th active segment being the 10th frame, its end frame being the 16th frame, the start frame of the (j+1)-th active segment being the 18th frame, and the duration of the voice segment to be matched being 6 frames; refer to FIG. 8, a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of this application. As shown in FIG. 8, D0 in (A) of FIG. 8 represents the initial position Index of the start-stop identifier, which is the 10th frame; D1 represents the start frame s_j of the j-th active segment, namely the 10th frame of the image sequence; D2 represents the end frame e_j of the j-th active segment, namely the 16th frame; D3 represents the start frame s_(j+1) of the (j+1)-th active segment, namely the 18th frame; D4 represents the length of the j-th active segment, and D5 represents the length of the (j+1)-th active segment.

From the foregoing formulas, the first forward duration minimum is 5.6 frames and the second forward duration minimum is 7 frames, so the first forward duration interval is [5.6, 7]. Assuming that the voice segment to be matched D6 shown in (B) of FIG. 8 is 6 frames, its duration lies within the first forward duration interval. Therefore, according to the duration of D6, the duration between the initial position of the start-stop identifier and the end frame of the j-th active segment can be scaled, for example, the duration of the j-th active segment is scaled to 6 frames, so as to match the voice segment D6.

If the duration of the voice segment to be matched lies within the first forward duration interval, the start-stop identifier position update condition is met, so the initial position of the start-stop identifier needs to be updated to the position corresponding to the end frame of the j-th active segment, that is, changed from the 10th frame to the 16th frame.
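The "scaling" applied here — stretching or compressing the frames between two positions to the duration of the voice segment — can be pictured with the following minimal sketch; the nearest-neighbour index mapping is an assumption for illustration, not the implementation prescribed by this application.

    def scale_segment(first: int, last: int, target_len: int) -> list[int]:
        # Resample the frame span [first, last] to target_len frames by
        # nearest-neighbour index mapping (an assumed scaling scheme).
        if target_len == 1:
            return [first]
        src_len = last - first + 1
        return [first + round(i * (src_len - 1) / (target_len - 1))
                for i in range(target_len)]

    # Compress the 7-frame span [10, 16] to match the 6-frame voice segment D6:
    print(scale_segment(10, 16, 6))  # [10, 11, 12, 14, 15, 16]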
Optionally, on the basis of the embodiment corresponding to FIG. 4, in another optional embodiment of the video audio-picture matching method provided by the embodiments of this application, determining the active segment to be matched according to at least one of the j-th active segment and the (j+1)-th active segment may include:

determining the second forward duration maximum according to the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment, where the second forward duration maximum is less than the first forward duration maximum;

determining the second forward duration interval according to the first forward duration maximum and the second forward duration maximum; and

if the voice segment to be matched is within the second forward duration interval, scaling the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment according to the duration of the voice segment to be matched, to obtain the active segment to be matched.

If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the second forward duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the (j+1)-th active segment.

In this embodiment, the audio-picture matching apparatus may determine the second forward duration maximum and then determine the second forward duration interval according to the first and second forward duration maximums. When the voice segment to be matched is within the second forward duration interval, the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment is scaled according to the duration of the voice segment to be matched, to obtain the active segment to be matched; the initial position of the start-stop identifier is then updated to the position corresponding to the start frame of the (j+1)-th active segment.

Specifically, the second forward duration maximum can be calculated by the following formula:

s_(j+1) - Index + 1;

and the second forward duration interval can be calculated by the following formula:

[s_(j+1) - Index + 1, scale_long * (s_(j+1) - Index + 1)];

where Index represents the initial position of the start-stop identifier, scale_long represents the maximum zoom ratio, and s_(j+1) represents the start frame of the (j+1)-th active segment.

For ease of understanding, take as an example the initial position of the start-stop identifier being the 10th frame, the start frame of the j-th active segment being the 10th frame, its end frame being the 16th frame, the start frame of the (j+1)-th active segment being the 18th frame, and the duration of the voice segment to be matched being 10 frames; refer to FIG. 9, a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of this application. As shown in FIG. 9, E0 in (A) of FIG. 9 represents the initial position Index of the start-stop identifier, which is the 10th frame; E1 represents the start frame s_j of the j-th active segment, namely the 10th frame of the image sequence; E2 represents the end frame e_j, namely the 16th frame; E3 represents the start frame s_(j+1) of the (j+1)-th active segment, namely the 18th frame; E4 represents the length of the j-th active segment, and E5 represents the length of the (j+1)-th active segment.

From the foregoing formulas, the first forward duration maximum is 11.25 frames and the second forward duration maximum is 9 frames, so the second forward duration interval is [9, 11.25]. Assuming that the voice segment to be matched E6 shown in (B) of FIG. 9 is 10 frames, its duration lies within the second forward duration interval. Therefore, according to the duration of E6, the duration between the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment can be scaled, for example, the duration between E0 and E3 is scaled to 10 frames, so that an active segment to be matched with the same duration as E6 is obtained.

If the duration of the voice segment to be matched lies within the second forward duration interval, the start-stop identifier position update condition is met, so the initial position of the start-stop identifier needs to be updated to the position corresponding to the start frame of the (j+1)-th active segment, that is, changed from the 10th frame to the 18th frame.
Optionally, on the basis of the embodiment corresponding to FIG. 4, in another optional embodiment of the video audio-picture matching method provided by the embodiments of this application, determining the active segment to be matched according to at least one of the j-th active segment and the (j+1)-th active segment may include:

determining the second forward duration minimum according to the initial position of the start-stop identifier and the end frame of the j-th active segment, where the second forward duration minimum is greater than the first forward duration minimum;

determining the second forward duration maximum according to the initial position of the start-stop identifier and the start frame of the (j+1)-th active segment, where the second forward duration maximum is less than the first forward duration maximum;

determining the third forward duration interval according to the second forward duration minimum and the second forward duration maximum; and

if the voice segment to be matched is within the third forward duration interval, determining the active segment to be matched according to the initial position of the start-stop identifier and the duration of the voice segment to be matched.

If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the third forward duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the active segment to be matched.

In this embodiment, the audio-picture matching apparatus determines the second forward duration minimum and the second forward duration maximum, then determines the third forward duration interval according to these two values. When the voice segment to be matched is within the third forward duration interval, the active segment to be matched is determined according to the initial position of the start-stop identifier and the duration of the voice segment to be matched, and the initial position of the start-stop identifier is updated to the position corresponding to the end frame of the active segment to be matched.

Specifically, the third forward duration interval can be calculated by the following formula:

[e_j - Index + 1, s_(j+1) - Index + 1];

where Index represents the initial position of the start-stop identifier, e_j represents the end frame of the j-th active segment, and s_(j+1) represents the start frame of the (j+1)-th active segment.

For ease of understanding, take as an example the initial position of the start-stop identifier being the 10th frame, the start frame of the j-th active segment being the 10th frame, its end frame being the 16th frame, the start frame of the (j+1)-th active segment being the 18th frame, and the duration of the voice segment to be matched being 8 frames; refer to FIG. 10, a schematic diagram of another embodiment of obtaining the active segment to be matched in an embodiment of this application. As shown in FIG. 10, F0 in (A) of FIG. 10 represents the initial position Index of the start-stop identifier, which is the 10th frame; F1 represents the start frame s_j of the j-th active segment, namely the 10th frame of the image sequence; F2 represents the end frame e_j, namely the 16th frame; F3 represents the start frame s_(j+1) of the (j+1)-th active segment, namely the 18th frame; F4 represents the length of the j-th active segment, and F5 represents the length of the (j+1)-th active segment.

From the foregoing formulas, the second forward duration minimum is 7 frames and the second forward duration maximum is 9 frames, so the third forward duration interval is [7, 9]. Assuming that the voice segment to be matched F6 shown in (B) of FIG. 10 is 8 frames, its duration lies within the third forward duration interval. Therefore, the active segment to be matched can be determined from the initial position F0 of the start-stop identifier and the duration of F6 in the following manner:

[Index, Index + l_i - 1];

where Index represents the initial position of the start-stop identifier and l_i represents the duration of the voice segment to be matched. With a duration of 8 frames, the active segment to be matched is the segment between the 10th frame and the 17th frame.

If the duration of the voice segment to be matched lies within the third forward duration interval, the start-stop identifier position update condition is met, so the initial position of the start-stop identifier can be updated to the position corresponding to the end frame of the active segment to be matched, that is, changed from the 10th frame to the 17th frame, ensuring that the initial position of the start-stop identifier lies in a silent segment.

In the embodiments of this application, methods for determining the active segment to be matched are provided. In the above manner, the active segment to be matched can be determined in different ways for voice segments of different durations, improving the diversity of the matching algorithm. Moreover, the updated start-stop identifier falls in a silent segment, so that active segments are accompanied by voice, making the synthesized video appear more natural. Furthermore, the matching method provided by this application is computationally simple, can be used for real-time computation, and can synthesize video segments online.
Optionally, on the basis of the embodiment corresponding to FIG. 4, in another optional embodiment of the video audio-picture matching method provided by the embodiments of this application, after the target forward duration interval is determined according to the first forward duration minimum and the first forward duration maximum, the video audio-picture matching method further includes:

if the voice segment to be matched is not within the target forward duration interval and its duration is less than the first forward duration minimum, determining the active segment to be matched according to the duration of the voice segment to be matched, the initial position of the start-stop identifier, and the moving radius;

or,

if the voice segment to be matched is not within the target forward duration interval and its duration is less than the first forward duration minimum, determining the active segment to be matched according to the duration of the voice segment to be matched and the initial position of the start-stop identifier.

In this embodiment, when the voice segment to be matched is not within the target forward duration interval and its duration is less than the first forward duration minimum, the audio-picture matching apparatus may determine the active segment to be matched by swinging back and forth around the start-stop identifier, or may take several frames in the direction of the silent segment to determine the active segment to be matched.

Specifically, take as an example the initial position of the start-stop identifier being the 10th frame, the start frame of the j-th active segment being the 10th frame, its end frame being the 16th frame, the start frame of the (j+1)-th active segment being the 18th frame, and the duration of the voice segment to be matched being 3 frames. From the foregoing formulas, the target forward duration interval is [5.6, 11.25]; the voice segment to be matched is therefore not within the target forward duration interval and is less than the first forward duration minimum of 5.6. In this case, the initial position of the start-stop identifier can be taken as the center, and the active segment to be matched can be obtained by moving back and forth within the active segment with moving radius r.

It can be understood that the moving radius is usually an integer greater than or equal to 1 and less than or equal to 5, and the initial position of the start-stop identifier is not updated. Assuming the radius is 3 and the initial position of the start-stop identifier is the 10th frame, the frames 10, 11, 12, 11, 10, 9, 8, 9, 10, 11, 12, 11, and so on can be taken. The corresponding frames are then obtained in turn according to the duration of the voice segment to be matched; with a 3-frame voice segment, the first 3 frames of images are taken from the above sequence, namely the 10th, 11th, and 12th frames.

It can be understood that, in practice, there are two ways of designing active segments. In the first way, the first action frame of the active segment is taken as its start frame and the last action frame as its end frame, i.e., the active segment coincides with the visibly moving segment. In the other way, several silent frames before the first action frame are considered, and one of them is taken as the start frame of the active segment; similarly, one of the several silent frames after the last action frame is taken as its end frame. In that case, the active segment effectively includes a short silent portion at its head and tail, which is closer to actual industrial applications. Based on these two ways, this application provides the following two solutions.

For ease of understanding, refer to FIG. 11, a schematic diagram of another embodiment of determining the active segment to be matched in an embodiment of this application. (A) of FIG. 11 shows an active segment G1 whose head and tail include short silent portions, i.e., G1 may include both action frames and silent frames; with the initial position of the start-stop identifier as the center, the active segment to be matched is obtained by moving back and forth with moving radius r. In practice, the active segment to be matched usually includes several silent frames and may include a small number of active frames. (B) of FIG. 11 shows an active segment G2 that includes no silent frames, i.e., G2 includes only action frames; in this case, several frames can be taken directly from the initial position of the start-stop identifier toward the silent segment, the number of frames of the active segment to be matched being equal to that of the voice segment to be matched. That is, when the start-stop identifier is at the start frame of an active segment, frames are taken forward by the duration of the voice segment, for example, 3 frames from the 10th frame in the reverse direction (toward the silent segment); when the start-stop identifier is at the end frame of an active segment, frames are taken backward, for example, 3 frames from the 16th frame in the positive direction (toward the silent segment), thereby obtaining the active segment to be matched.

In the embodiments of this application, a method for determining the active segment to be matched is provided. In the above manner, an overly short voice segment can be matched with a silent segment, so that the synthesized video does not appear too abrupt, thereby improving the authenticity of the video.
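The back-and-forth sampling described above can be sketched as follows; the generator reproduces the worked sequence for a cursor at frame 10 with radius 3 (the reverse-direction example earlier in this document swings downward first, so the starting direction can be treated as a free choice). The helper name is an illustrative assumption.

    def swing_frames(center: int, r: int, count: int) -> list[int]:
        # Yield `count` frame numbers oscillating around `center`, covering r
        # frames upward (including the centre) and mirroring downward, as in
        # the worked example: centre 10, r = 3 -> 10, 11, 12, 11, 10, 9, 8, 9.
        lo, hi = center - (r - 1), center + (r - 1)
        cycle = (list(range(center, hi + 1)) +       # climb to centre + (r - 1)
                 list(range(hi - 1, lo - 1, -1)) +   # descend to centre - (r - 1)
                 list(range(lo + 1, center)))        # climb back toward centre
        return [cycle[i % len(cycle)] for i in range(count)]

    print(swing_frames(10, 3, 3))   # [10, 11, 12] -- a 3-frame voice segment
    print(swing_frames(10, 3, 12))  # [10, 11, 12, 11, 10, 9, 8, 9, 10, 11, 12, 11]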
Optionally, on the basis of the embodiment corresponding to FIG. 4, in another optional embodiment of the video audio-picture matching method provided by the embodiments of this application, after the target forward duration interval is determined according to the first forward duration minimum and the first forward duration maximum, the video audio-picture matching method further includes:

if the voice segment to be matched is not within the target forward duration interval and its duration is greater than the first forward duration maximum, obtaining the k-th active segment from the image sequence, where k is an integer greater than or equal to 1 and less than or equal to N;

determining the third forward duration minimum according to the maximum zoom ratio, the initial position of the start-stop identifier, and the start frame of the k-th active segment;

determining the third forward duration maximum according to the minimum zoom ratio, the initial position of the start-stop identifier, and the end frame of the (k+1)-th active segment;

determining the fourth forward duration interval according to the third forward duration minimum and the third forward duration maximum;

if the voice segment to be matched is within the fourth forward duration interval, determining the duration of the active segment to be matched according to the initial position of the start-stop identifier, the maximum zoom ratio, and the start frame of the k-th active segment; and

dividing the voice segment to be matched into a first voice segment and a second voice segment according to the duration of the active segment to be matched, where the duration of the first voice segment is consistent with the duration of the active segment to be matched, and the second voice segment is used to match the corresponding action segment according to the updated initial position of the start-stop identifier.

If the start-stop identifier position update condition is met, updating the initial position of the start-stop identifier may include: if the voice segment to be matched is within the fourth forward duration interval, the start-stop identifier position update condition is met; and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the k-th active segment.

In this embodiment, if the voice segment to be matched is too long, the audio-picture matching apparatus may obtain the k-th active segment from the image sequence, determine the third forward duration minimum and the third forward duration maximum, and then determine the fourth forward duration interval. When the voice segment to be matched is within the fourth forward duration interval, the duration of the active segment to be matched is determined according to the initial position of the start-stop identifier, the maximum zoom ratio, and the start frame of the k-th active segment, so that the voice segment to be matched is divided into a first voice segment and a second voice segment; the second voice segment is used to match the corresponding action segment according to the updated initial position of the start-stop identifier, that is, the second voice segment is used as the voice segment to be matched in the next round of audio-picture matching. If the voice segment to be matched is within the fourth forward duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier is updated to the position corresponding to the start frame of the k-th active segment.

Specifically, take as an example the initial position of the start-stop identifier being the 10th frame, the start frame of the j-th active segment being the 10th frame, its end frame being the 16th frame, the start frame of the (j+1)-th active segment being the 18th frame, and the duration of the voice segment to be matched being 25 frames. From the foregoing formulas, the target forward duration interval is [5.6, 11.25] and the first forward duration maximum is 11.25 frames; the voice segment to be matched is not within the target forward duration interval and is greater than the first forward duration maximum, so the k-th active segment needs to be obtained.

The third forward duration minimum can be calculated by the following formula:

scale_long * (s_k - Index + 1);

the third forward duration maximum can be calculated by the following formula:

scale_short * (e_(k+1) - Index + 1);

the fourth forward duration interval can be calculated by the following formula:

[scale_long * (s_k - Index + 1), scale_short * (e_(k+1) - Index + 1)];

and the duration of the active segment to be matched can be calculated by the following formula:

scale_long * (s_k - Index + 1) - 1;

where Index represents the initial position of the start-stop identifier, s_k represents the start frame of the k-th active segment, e_(k+1) represents the end frame of the (k+1)-th active segment, scale_short represents the minimum zoom ratio, and scale_long represents the maximum zoom ratio.

Take as an example the initial position of the start-stop identifier being the 10th frame, the start frame of the k-th active segment being the 26th frame, and the end frame of the (k+1)-th active segment being the 45th frame. From the foregoing formulas, the third forward duration minimum is 21.25 frames and the third forward duration maximum is 28.8 frames, so the fourth forward duration interval is [21.25, 28.8]. If the duration of the voice segment to be matched is 25 frames, it lies within the fourth forward duration interval; further, from the foregoing formula, the duration of the active segment to be matched is 20.25 frames.

The first voice segment is calculated by the following formula:

[1, scale_long * (s_k - Index + 1) - 1];

and the duration of the second voice segment is calculated by the following formula:

l_i = l_i - scale_long * (s_k - Index + 1);

where Index represents the initial position of the start-stop identifier, s_k represents the start frame of the k-th active segment, scale_long represents the maximum zoom ratio, and l_i represents the duration of the voice segment to be matched.

Therefore, from the foregoing formulas, the first voice segment spans frames 1 to 20.25 and the duration of the second voice segment is 3.75 frames. Since the voice segment to be matched is within the fourth forward duration interval, the start-stop identifier position update condition is met, and the initial position of the start-stop identifier can be updated to the position corresponding to the start frame of the k-th active segment, that is, changed from the 10th frame to the 26th frame; the second voice segment obtained above can match the corresponding action segment according to the updated initial position of the start-stop identifier. The specific matching method is similar to that of the foregoing embodiments and is not repeated here.

In the embodiments of this application, a method for obtaining the active segment to be matched is provided. In the above manner, the accuracy of matching can be improved, which improves the matching degree between voice segments and active segments in the video, thereby improving the authenticity of the video.
可选地,在上述图4对应的实施例的基础上,在本申请实施例提供的视频的音画匹配方法的另一个可选实施例中,当起止标识的移动方向为反向,且起止标识的初始位置大于或等于第p(p为大于或等于1,且小于或等于N的整数)个活动片段的起始帧时,根据起止标识的初始位置、起止标识的移动方向以及待匹配语音片段,获取待匹配活动片段,可以包括:
根据最小缩放比例、起止标识的初始位置以及第p个活动片段的起始帧,确定第一反向时长最小值;
根据最大缩放比例、起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第一反向时长最大值;
根据第一反向时长最小值以及第一反向时长最大值确定目标反向时长区间;
若待匹配语音片段在目标反向时长区间内,则根据第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定待匹配活动片段;
视频的音画匹配方法还包括:
若满足起止标识位置更新条件,则对起止标识的初始位置进行更新;
若更新后的起止标识的初始位置小于或等于第一个活动片段的起始帧所对应的位置,则将起止标识的移动方向调整为正向。
本实施例中,当起止标识的移动方向为反向,且起止标识的初始位置大于或等于第p个活动片段的起始帧时,音画匹配装置可以确定第一反向时长最小值以及第一反向时长最大值,再确定目标反向时长区间,当待匹配语音片段在目标反向时长区间内时,即可确定待匹配活动片段。若满足起止标识位置更新条件,则对起止标识的初始位置进行更新,若起止标识更新后的位置小于或等于第一个活动片段的起始帧所对应的位置,则将起止标识的移动 方向调整为正向。
具体地,以最小缩放比例为0.8,最大缩放比例为1.25作为示例进行说明。第一反向时长最小值可以通过下式进行计算:
scale_short*(Index-s_p+1);
第一反向时长最大值可以通过下式进行计算：
scale_long*(Index-e_(p-1)+1);
目标反向时长区间可以通过下式进行计算：
[scale_short*(Index-s_p+1), scale_long*(Index-e_(p-1)+1)];
其中，Index表示起止标识的初始位置，scale_short表示最小缩放比例，s_p表示第p个活动片段的起始帧，scale_long表示最大缩放比例，e_(p-1)表示第(p-1)个活动片段的结束帧。
为了便于理解，在起止标识的移动方向为反向时，以起止标识的初始位置为图像序列的第18帧，第p个活动片段的起始帧为第11帧，第(p-1)个活动片段的结束帧为第9帧作为示例进行说明，请参阅图12，图12为本申请实施例中确定待匹配活动片段的另一实施例示意图，如图12所示，图12中(A)图示出的H0表示起止标识的初始位置Index，即为图像序列的第18帧。H1表示第p个活动片段的起始帧s_p，即为图像序列的第11帧。H2表示第(p-1)个活动片段的结束帧e_(p-1)，即为图像序列的第9帧。H3表示第p个活动片段的长度，H4表示第(p-1)个活动片段的长度。
由前述公式可以得到第一反向时长最小值为6.4帧,而第一反向时长最大值为12.5帧,由此可以得到目标反向时长区间为[6.4,12.5]。若待匹配语音片段的时长处于[6.4,12.5],即如图12中(B)图示出的待匹配语音片段H5,则可以根据活动片段H3以及活动片段H4中至少一个活动片段确定待匹配活动片段。
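下面给出一个示意性的Python代码草图，按上文公式计算目标反向时长区间（target_backward_interval为说明用的假设名称，并非本申请的正式实现）：

```python
# 示意性草图：目标反向时长区间的计算与判断。
SCALE_SHORT, SCALE_LONG = 0.8, 1.25

def target_backward_interval(index, s_p, e_p1):
    lo = SCALE_SHORT * (index - s_p + 1)   # 第一反向时长最小值
    hi = SCALE_LONG * (index - e_p1 + 1)   # 第一反向时长最大值
    return lo, hi

# 对应上文示例：Index=18，s_p=11，e_(p-1)=9
lo, hi = target_backward_interval(18, 11, 9)
print(lo, hi)                              # 约为 6.4 和 12.5
```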
当满足起止标识位置更新条件时,音画匹配装置还可以对起止标识的初始位置进行更新,当更新后的起止标识的初始位置小于或等于第一个活动片段的起始帧所对应的位置,将起止标识的移动方向调整为正向。也就是说,如果起止标识的移动方向为反向,并且更新后的起止标识的初始位置已经超过了图像序列中第一个活动片段的第一帧,那么需要将起止标识的移动方向更改为正向,正向的匹配方法在前述实施例中已进行介绍,在此不再赘述。通过对起止标识的移动方向进行正向至反向的更新与调整,能够对实时输入的语音序列进行匹配,从而实时生成真实性较高的视频。
本申请实施例中,提供了另一种获取待匹配活动片段的方法,通过上述方式,所合成的视频更符合对象实际语音描述时的场景,从而视频更具有真实性。此外,通过活动片段与下一个活动片段的匹配使得不同待匹配语音片段对应的待匹配活动片段是首尾相连的,从而提升了合成的视频片段中语音与图像的一致性和连续性。
可选地,在上述图4对应的实施例的基础上,在本申请实施例提供的视频的音画匹配方法的另一个可选实施例中,根据第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定待匹配活动片段,可以包括:
根据起止标识的初始位置以及第p个活动片段的起始帧确定第二反向时长最小值,其中,第二反向时长最小值大于第一反向时长最小值;
根据第一反向时长最小值以及第二反向时长最小值确定第一反向时长区间;
若待匹配语音片段在第一反向时长区间内,则根据待匹配语音片段的时长,对第p个活动片段的起始帧至起止标识的初始位置之间的时长进行缩放处理,得到待匹配活动片段;
若满足起止标识位置更新条件,则对起止标识的初始位置进行更新,可以包括:
若待匹配语音片段在第一反向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第p个活动片段的起始帧所对应的位置。
本实施例中,音画匹配装置可以根据第一反向时长最小值以及第二反向时长最小值确定第一反向时长区间,若待匹配语音片段在第一反向时长区间内,根据待匹配语音片段的时长,对第p个活动片段的起始帧至起止标识的初始位置之间的时长进行缩放处理,得到待匹配活动片段。若待匹配语音片段在第一反向时长区间内,表示满足起止标识位置更新条件,并且可将起止标识的初始位置更新为第p个活动片段的起始帧所对应的位置。
具体地,第二反向时长最小值可以通过下式进行计算:
Index-s_p+1;
第一反向时长区间可以通过下式进行计算：
[scale_short*(Index-s_p+1), Index-s_p+1];
其中，Index表示起止标识的初始位置，scale_short表示最小缩放比例，s_p表示第p个活动片段的起始帧。
为了便于理解，以起止标识的初始位置为图像序列的第18帧(第p个活动片段的结束帧)，第p个活动片段的起始帧为第11帧，第(p-1)个活动片段的结束帧为第9帧，且待匹配语音片段的时长为7帧作为示例进行说明，请参阅图13，图13为本申请实施例中获取待匹配活动片段的另一实施例示意图，如图13所示，图13中(A)图示出的I0表示起止标识的初始位置Index，且起止标识的初始位置为第18帧。I1表示第p个活动片段的起始帧s_p，且为图像序列的第11帧。I2表示第(p-1)个活动片段的结束帧e_(p-1)，且为图像序列的第9帧。I3表示第p个活动片段的长度，I4表示第(p-1)个活动片段的长度。
由前述公式可以得到第一反向时长最小值为6.4帧，而第二反向时长最小值为8帧，由此可以得到第一反向时长区间为[6.4,8]。而图13中(B)图示出的待匹配语音片段I5的时长为7帧，即待匹配语音片段的时长处于第一反向时长区间内，由此，可以根据待匹配语音片段I5的时长，对第p个活动片段的起始帧至起止标识的初始位置之间的时长进行缩放处理，例如，将第p个活动片段的时长缩放至7帧，从而与待匹配语音片段I5进行匹配。
若待匹配语音片段的时长处于第一反向时长区间内，则满足起止标识位置更新条件，由此需要将起止标识的初始位置更新为第p个活动片段的起始帧所对应的位置，也就是将起止标识的初始位置从第18帧更改为第11帧。
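缩放处理的一种可能做法是对帧序列做均匀重采样。下面给出一个示意性草图（scale_frames为说明用的假设名称，具体缩放方式本申请并未限定）：

```python
# 示意性草图：将帧号区间 [start, end]（含端点）均匀重采样为 target_len 帧。
def scale_frames(start, end, target_len):
    src_len = end - start + 1
    return [start + round(i * (src_len - 1) / (target_len - 1))
            for i in range(target_len)]

# 对应上文示例：第11帧至第18帧共8帧，缩放至7帧
print(scale_frames(11, 18, 7))   # 例如 [11, 12, 13, 15, 16, 17, 18]
```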
可选地,在上述图4对应的实施例的基础上,在本申请实施例提供的视频的音画匹配方法的另一个可选实施例中,根据第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定待匹配活动片段,可以包括:
根据起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第二反向时长最大值,其中,第二反向时长最大值小于第一反向时长最大值;
根据第一反向时长最大值以及第二反向时长最大值确定第二反向时长区间;
若待匹配语音片段在第二反向时长区间内,则根据待匹配语音片段的时长,对第(p-1)个活动片段的结束帧至起止标识的初始位置之间的时长进行缩放处理,得到待匹配活动片段;
若满足起止标识位置更新条件,则对起止标识的初始位置进行更新,可以包括:
若待匹配语音片段在第二反向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第(p-1)个活动片段的结束帧所对应的位置。
本实施例中,音画匹配装置可以根据第一反向时长最大值以及第二反向时长最大值确定第二反向时长区间,若待匹配语音片段在第二反向时长区间内,则根据待匹配语音片段的时长,对第(p-1)个活动片段的结束帧至起止标识的初始位置之间的时长进行缩放处理,得到待匹配活动片段。若待匹配语音片段在第二反向时长区间内,则满足起止标识位置更新条件,然后可以将起止标识的初始位置更新为第(p-1)个活动片段的结束帧所对应的位置。
具体地,第二反向时长最大值可以通过下式进行计算:
Index-e_(p-1)+1;
第二反向时长区间可以通过下式进行计算：
[Index-e_(p-1)+1, scale_long*(Index-e_(p-1)+1)];
其中，Index表示起止标识的初始位置，scale_long表示最大缩放比例，e_(p-1)表示第(p-1)个活动片段的结束帧。
为了便于理解，以起止标识的初始位置为图像序列的第18帧(第p个活动片段的结束帧)，第p个活动片段的起始帧为第11帧，第(p-1)个活动片段的结束帧为第9帧，且待匹配语音片段的时长为11帧作为示例进行说明，请参阅图14，图14为本申请实施例中获取待匹配活动片段的另一实施例示意图，如图14所示，图14中(A)图示出的J0表示起止标识的初始位置Index，且起止标识的初始位置为第18帧。J1表示第p个活动片段的起始帧s_p，且为图像序列的第11帧。J2表示第(p-1)个活动片段的结束帧e_(p-1)，且为图像序列的第9帧。J3表示第p个活动片段的长度，J4表示第(p-1)个活动片段的长度。
由前述公式可以得到第一反向时长最大值为12.5帧，而第二反向时长最大值为10帧，由此可以得到第二反向时长区间为[10,12.5]。而图14中(B)图示出的待匹配语音片段J5的时长为11帧，即待匹配语音片段的时长处于第二反向时长区间内，由此，可以根据待匹配语音片段J5的时长，对第(p-1)个活动片段的结束帧至起止标识的初始位置之间的时长进行缩放处理，也就是将J2至J0之间的时长缩放至11帧，由此可以得到时长与待匹配语音片段J5相同的待匹配活动片段。
若待匹配语音片段的时长处于第二反向时长区间内,则满足起止标识位置更新条件,由此可以将起止标识的初始位置更新为第(p-1)个活动片段的结束帧所对应的位置,也就是将起止标识的初始位置从第18帧更改为第9帧。
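第二反向时长区间的计算同样可以写成简短的草图（second_backward_interval为说明用的假设名称，并非本申请的正式实现）：

```python
# 示意性草图：第二反向时长区间的计算与判断。
SCALE_LONG = 1.25

def second_backward_interval(index, e_p1):
    lo = index - e_p1 + 1                  # 第二反向时长最大值（区间下界）
    hi = SCALE_LONG * (index - e_p1 + 1)   # 第一反向时长最大值（区间上界）
    return lo, hi

# 对应上文示例：Index=18，e_(p-1)=9，语音时长11帧
lo, hi = second_backward_interval(18, 9)
print(lo, hi)                # 10 12.5
if lo <= 11 <= hi:           # 落入第二反向时长区间，需将10帧拉伸至11帧
    new_index = 9            # 起止标识更新为第(p-1)个活动片段的结束帧
```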
可选地,在上述图4对应的实施例的基础上,在本申请实施例提供的视频的音画匹配方法的另一个可选实施例中,根据第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定待匹配活动片段,可以包括:
根据起止标识的初始位置以及第p个活动片段的起始帧确定第二反向时长最小值,其中,第二反向时长最小值大于第一反向时长最小值;
根据起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第二反向时长最大值,其中,第二反向时长最大值小于第一反向时长最大值;
根据第二反向时长最小值与第二反向时长最大值确定第三反向时长区间;
若待匹配语音片段在第三反向时长区间内,则根据起止标识的初始位置以及待匹配语音片段的时长,确定待匹配活动片段;
若满足起止标识位置更新条件,则对起止标识的初始位置进行更新,可以包括:
若待匹配语音片段在第三反向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为待匹配活动片段的起始帧所对应的位置。
本实施例中，音画匹配装置根据第二反向时长最小值与第二反向时长最大值确定第三反向时长区间，若待匹配语音片段在第三反向时长区间内，则根据起止标识的初始位置以及待匹配语音片段的时长，确定待匹配活动片段。若待匹配语音片段在第三反向时长区间内，则表示满足起止标识位置更新条件，并且将起止标识的初始位置更新为待匹配活动片段的起始帧所对应的位置。
具体地,第三反向时长区间可以通过下式进行计算:
[Index-s_p+1, Index-e_(p-1)+1];
其中，Index表示起止标识的初始位置，s_p表示第p个活动片段的起始帧，e_(p-1)表示第(p-1)个活动片段的结束帧。
为了便于理解，以起止标识的初始位置为图像序列的第18帧(第p个活动片段的结束帧)，第p个活动片段的起始帧为第11帧，第(p-1)个活动片段的结束帧为第9帧，且待匹配语音片段的时长为9帧作为示例进行说明，请参阅图15，图15为本申请实施例中获取待匹配活动片段的另一实施例示意图，如图15所示，图15中(A)图示出的K0表示起止标识的初始位置Index，且起止标识的初始位置为第18帧。K1表示第p个活动片段的起始帧s_p，且为图像序列的第11帧。K2表示第(p-1)个活动片段的结束帧e_(p-1)，且为图像序列的第9帧。K3表示第p个活动片段的长度，K4表示第(p-1)个活动片段的长度。
由前述公式可以得到第二反向时长最小值为8帧,而第二反向时长最大值为10帧,由此可以得到第三反向时长区间为[8,10]。而图15中(B)图示出的待匹配语音片段K5的时长为9帧,即待匹配语音片段的时长处于第三反向时长区间内,由此,可以根据起止标识的初始位置K0以及待匹配语音片段K5的时长,采用如下方式确定待匹配活动片段:
[Index-l_i+1, Index];
其中，Index表示起止标识的初始位置，l_i表示待匹配语音片段的长度。假设待匹配语音片段的长度为9帧，即待匹配活动片段表示为第10帧至第18帧之间的活动片段。
若待匹配语音片段的时长处于第三反向时长区间内，则满足起止标识位置更新条件，由此可以将起止标识的初始位置更新为待匹配活动片段的起始帧所对应的位置，也就是将起止标识的初始位置从第18帧更改为第10帧，从而保证起止标识的初始位置处于静默片段里。
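对于第三反向时长区间内的直接取帧与起止标识更新，可参考如下示意性草图（match_third_backward为说明用的假设名称，并非本申请的正式实现）：

```python
# 示意性草图：按语音时长向反方向直接取帧，并把起止标识更新为
# 待匹配活动片段的起始帧。
def match_third_backward(index, speech_len):
    matched = (index - speech_len + 1, index)   # [Index-l_i+1, Index]
    new_index = matched[0]                      # 更新为待匹配活动片段的起始帧
    return matched, new_index

# 对应上文示例：Index=18，语音时长9帧 -> 活动片段(10, 18)，新位置第10帧
print(match_third_backward(18, 9))
```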
本申请实施例中，提供了另一种获取待匹配活动片段的方法，通过上述方式，可以在待匹配语音片段长度不同的情况下，采用不同的方式对待匹配活动片段进行获取，从而提升匹配算法的多样性；其次，更新后的起止标识的初始位置落在静默片段里，使得活动片段配有语音，从而使合成的视频显得更加自然。更进一步地，本申请提供的匹配方法计算简单，可用于实时计算，能够在线合成视频片段。
可选地,在上述图4对应的实施例的基础上,在本申请实施例提供的视频的音画匹配方法的另一个可选实施例中,根据第一反向时长最小值以及第一反向时长最大值确定目标反向时长区间之后,视频的音画匹配方法还包括:
若待匹配语音片段未在目标反向时长区间内，且待匹配语音片段的时长小于第一反向时长最小值，则根据待匹配语音片段的时长、起止标识的初始位置以及移动半径，确定待匹配活动片段；
或者,
若待匹配语音片段未在目标反向时长区间内,且待匹配语音片段的时长小于第一反向时长最小值,则根据待匹配语音片段的时长以及起止标识的初始位置,确定待匹配活动片段。
本实施例中,若待匹配语音片段未在目标反向时长区间内,且待匹配语音片段的时长小于第一反向时长最小值,音画匹配装置可以采用双向摆动取值的方式确定待匹配活动片段,或者,音画匹配装置可以朝静默片段的方向取若干帧,以得到待匹配活动片段。
为了便于理解,以起止标识的初始位置为第18帧(第p个活动片段的结束帧),第p个活动片段的起始帧为第11帧,第(p-1)个活动片段的结束帧为第9帧,且待匹配语音片段的时长为2帧作为示例进行说明,由前述公式可以得到目标反向时长区间为[6.4,12.5],因此,待匹配语音片段未在目标反向时长区间内,并且小于第一反向时长最小值6.4,于是可以将起止标识的初始位置作为中心,以移动半径为r在活动片段来回移动得到待匹配活动片段。
可以理解的是,移动半径通常为大于或等于1,且小于或等于5的整数,且起止标识的初始位置不进行更新。假设半径为2,起止标识的初始位置为第18帧,那么可以取到第18帧、第17帧、第18帧、第19帧、第18帧、第17帧、第18帧、第19帧、第18帧、第17帧、第18帧等,以此类推。再基于待匹配语音片段的时长,依次获取对应的帧,假设待匹配语音片段的时长为2帧,即从上述序列中取出前2帧图像,即第17帧和第18帧。
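沿用前文swing_frames草图（假设该函数已定义），反向情形只需先向反方向摆动即可：

```python
# 对应上文示例：半径为2、初始位置为第18帧、语音时长为2帧
print(swing_frames(18, 2, 2, forward_first=False))   # [18, 17]，即第17帧和第18帧
```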
可以理解的是,在实际应用中,有两种活动片段的设计方式,具体两种活动片段的设计方式与前述实施例中介绍的类似,在此不再赘述。
本申请实施例中,提供了一种获取待匹配活动片段的方法,通过上述方式,对于语音片段过短的情况,可以配以静默片段,从而不会显得合成后的视频过于突兀,由此提升视频的真实性。
可选地,在上述图4对应的实施例的基础上,在本申请实施例提供的视频的音画匹配方法的另一个可选实施例中,根据第一反向时长最小值以及第一反向时长最大值确定目标反向时长区间之后,视频的音画匹配方法还包括:
若待匹配语音片段未在目标反向时长区间内,且待匹配语音片段的时长大于第一反向时长最大值,则从图像序列中获取第q个活动片段,其中,q为大于或等于1,且小于或等于N的整数;
根据最大缩放比例、起止标识的初始位置以及第q个活动片段的结束帧确定第三反向时长最小值;
根据最小缩放比例、起止标识的初始位置以及第(q-1)个活动片段的起始帧确定第三反向时长最大值;
根据第三反向时长最小值以及第三反向时长最大值确定第四反向时长区间;
若待匹配语音片段在第四反向时长区间内,则根据起止标识的初始位置、最大缩放比例、第q个活动片段的结束帧,确定待匹配活动片段的时长;
根据待匹配活动片段的时长,将待匹配语音片段划分为第三语音片段以及第四语音片段,其中,第三语音片段的时长与待匹配活动片段的时长一致,第四语音片段用于根据起止标识更新后的位置匹配对应的动作片段;
若满足起止标识位置更新条件,则对起止标识的初始位置进行更新,可以包括:
若待匹配语音片段在第四反向时长区间内，则满足起止标识位置更新条件；
将起止标识的初始位置更新为第q个活动片段的结束帧所对应的位置。
本实施例中，如果待匹配语音片段过长，则音画匹配装置可以从图像序列中获取第q个活动片段，再确定第三反向时长最小值和第三反向时长最大值，进而确定第四反向时长区间，若待匹配语音片段在第四反向时长区间内，则确定待匹配活动片段的时长，以此将待匹配语音片段划分为第三语音片段以及第四语音片段，第四语音片段用于根据更新后的起止标识的初始位置匹配对应的动作片段，也就是将第四语音片段作为下一轮的待匹配语音片段再次进行音画匹配。当待匹配语音片段在第四反向时长区间内时，表示满足起止标识位置更新条件，可以将起止标识的初始位置更新为第q个活动片段的结束帧所对应的位置。
具体地，以起止标识的初始位置为图像序列的第38帧(第p个活动片段的结束帧)，第p个活动片段的起始帧为第31帧，第(p-1)个活动片段的结束帧为第29帧，且待匹配语音片段的时长为15帧作为示例进行说明，由前述公式可以得到目标反向时长区间为[6.4,12.5]，第一反向时长最大值为12.5帧，因此，待匹配语音片段未在目标反向时长区间内，且大于第一反向时长最大值，于是需要获取第q个活动片段。
第三反向时长最小值可以通过下式进行计算:
scale_long*(Index-e_q+1);
第三反向时长最大值可以通过下式进行计算：
scale_short*(Index-s_(q-1)+1);
第四反向时长区间可以通过下式进行计算：
[scale_long*(Index-e_q+1), scale_short*(Index-s_(q-1)+1)];
待匹配活动片段的时长可以通过下式进行计算：
scale_long*(Index-e_q+1)-1;
其中，Index表示起止标识的初始位置，e_q表示第q个活动片段的结束帧，s_(q-1)表示第(q-1)个活动片段的起始帧，scale_short表示最小缩放比例，scale_long表示最大缩放比例。
以起止标识的初始位置为图像序列的第38帧(第p个活动片段的结束帧),第(q-1)个活动片段的起始帧为第9帧,第q个活动片段的结束帧为第29帧作为示例进行说明,由前述公式可以得到第三反向时长最小值为12.5帧,第三反向时长最大值为24帧,根据第三反向时长最小值以及第三反向时长最大值确定第四反向时长区间为[12.5,24],若待匹配语音片段的时长为15帧,则该待匹配语音片段在第四反向时长区间内,进一步地,根据前述公式可以得到待匹配活动片段的时长为11.5帧。
第三语音片段可以通过下式进行计算：
[1, scale_long*(Index-e_q+1)-1];
第四语音片段的时长可以通过下式进行计算：
l_i=l_i-scale_long*(Index-e_q+1);
其中，Index表示起止标识的初始位置，e_q表示第q个活动片段的结束帧，scale_long表示最大缩放比例，l_i表示待匹配语音片段的时长。
因此，由前述公式可以得到第三语音片段为第1帧至第11.5帧，第四语音片段的时长为2.5帧。其次，由于待匹配语音片段在第四反向时长区间内，表示满足起止标识位置更新条件，因此可以将起止标识的初始位置更新为第q个活动片段的结束帧所对应的位置，也就是将起止标识的初始位置从第38帧更改为第29帧，而前述所得到的第四语音片段可以根据起止标识更新后的位置匹配对应的动作片段，具体匹配方法与前述实施例类似，在此不再赘述。
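反向情形下的语音划分与起止标识更新，同样可以写成简短的草图（split_speech_backward为说明用的假设名称，并非本申请的正式实现）：

```python
# 示意性草图：将待匹配语音片段划分为第三语音片段与第四语音片段，
# 并把起止标识更新为第q个活动片段的结束帧。
def split_speech_backward(index, e_q, speech_len, scale_long=1.25):
    matched_len = scale_long * (index - e_q + 1) - 1          # 待匹配活动片段的时长
    third_segment = (1, matched_len)                          # 第三语音片段
    fourth_len = speech_len - scale_long * (index - e_q + 1)  # 第四语音片段的时长
    new_index = e_q                                           # 更新为第q个活动片段的结束帧
    return third_segment, fourth_len, new_index

# 对应上文示例：((1, 11.5), 2.5, 29)
print(split_speech_backward(38, 29, 15))
```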
本申请实施例中，提供了一种获取待匹配活动片段的方法，通过上述方式，可以提升匹配的准确度，由此，可以保证合成的视频片段具有更真实的视觉效果，即视频片段中表现出的人物说话的场景更加逼真，与现实场景中人物说话的效果相贴近，难以让人识别出视频片段中的语音和图像是经过合成处理的。
下面对本申请中的音画匹配装置进行详细描述,请参阅图16,图16为本申请实施例中音画匹配装置一个实施例示意图,音画匹配装置20包括:
接收模块201,用于获取语音序列,其中,语音序列包括M个语音片段,M为大于或等于1的整数;
获取模块202,用于从语音序列中获取待匹配语音片段,其中,待匹配语音片段属于语音序列中的任意一个语音片段;
获取模块202,还用于从图像序列中获取起止标识的初始位置以及起止标识的移动方向,其中,图像序列包括N个活动片段,每个活动片段中包括对象的动作画面,起止标识的初始位置为活动片段的起始帧或者活动片段的结束帧,N为大于或等于1的整数;
获取模块202,还用于根据起止标识的初始位置、起止标识的移动方向 以及待匹配语音片段,确定待匹配活动片段;
处理模块203,用于对待匹配语音片段与待匹配活动片段进行合成处理,得到视频片段,其中,视频片段包括对象的动作画面以及对象的语音。
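为便于理解上述各模块的协作关系，下面给出一个示意性的Python代码草图，其中init_cursor、match_activity、synthesize等函数及数据结构均为说明用的假设接口，match_activity仅为占位实现，实际匹配逻辑见前述方法实施例：

```python
# 示意性草图：音画匹配装置的整体流程（接收、获取、匹配、合成）。

def init_cursor(segments):
    # 假设：起止标识初始指向第一个活动片段的起始帧，移动方向为正向(+1)
    return segments[0][0], +1

def match_activity(speech_len, segments, index, direction):
    # 占位实现：仅示意接口形态，实际应按前述各时长区间的判断进行匹配
    return (index, index + speech_len - 1), index + speech_len, direction

def synthesize(speech_segment, activity_segment):
    # 假设：将待匹配语音片段与待匹配活动片段合成为一个视频片段
    return {"audio": speech_segment, "video": activity_segment}

def sound_picture_match(speech_lens, segments):
    index, direction = init_cursor(segments)            # 获取模块：初始位置与移动方向
    clips = []
    for speech_len in speech_lens:                      # 接收/获取模块：依次取出待匹配语音片段
        activity, index, direction = match_activity(
            speech_len, segments, index, direction)     # 获取模块：确定待匹配活动片段
        clips.append(synthesize(speech_len, activity))  # 处理模块：合成视频片段
    return clips

print(sound_picture_match([3, 5], [(10, 16), (18, 24)]))
```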
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,音画匹配装置20还包括更新模块204以及调整模块205;
获取模块202,具体用于当起止标识的移动方向为正向,且起止标识的初始位置小于或等于第j个活动片段的起始帧时,根据最小缩放比例、起止标识的初始位置以及第j个活动片段的结束帧,确定第一正向时长最小值,其中,j为大于或等于1,且小于或等于(N+1)的整数;
根据最大缩放比例、起止标识的初始位置以及第(j+1)个活动片段的起始帧确定第一正向时长最大值;
根据第一正向时长最小值以及第一正向时长最大值确定目标正向时长区间;
若待匹配语音片段在目标正向时长区间内,则根据第j个活动片段以及第(j+1)个活动片段中至少一个活动片段,确定待匹配活动片段;
更新模块204,用于若满足起止标识位置更新条件,则对起止标识的初始位置进行更新;
调整模块205,用于若更新后的起止标识的初始位置大于或等于第N个活动片段的结束帧所对应的位置,则将起止标识的移动方向调整为反向。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,具体用于根据起止标识的初始位置以及第j个活动片段的结束帧确定第二正向时长最小值,其中,第二正向时长最小值大于第一正向时长最小值;
根据第一正向时长最小值以及第二正向时长最小值确定第一正向时长区间;
若待匹配语音片段在第一正向时长区间内,则根据待匹配语音片段的时长,对起止标识的初始位置至第j个活动片段的结束帧之间的时长进行缩放处理,得到待匹配活动片段;
更新模块204,具体用于若待匹配语音片段在第一正向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第j个活动片段的结束帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202，具体用于根据起止标识的初始位置以及第(j+1)个活动片段的结束帧确定第二正向时长最大值，其中，第二正向时长最大值小于第一正向时长最大值；
根据第一正向时长最大值以及第二正向时长最大值确定第二正向时长区间;
若待匹配语音片段在第二正向时长区间内,则根据待匹配语音片段的时长,对起止标识的初始位置至第(j+1)个活动片段的起始帧之间的时长进行缩放处理,得到待匹配活动片段;
更新模块204,具体用于若待匹配语音片段在第二正向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第(j+1)个活动片段的起始帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,具体用于根据起止标识的初始位置以及第j个活动片段的结束帧确定第二正向时长最小值,其中,第二正向时长最小值大于第一正向时长最小值;
根据起止标识的初始位置以及第(j+1)个活动片段的结束帧确定第二正向时长最大值,其中,第二正向时长最大值小于第一正向时长最大值;
根据第二正向时长最小值与第二正向时长最大值确定第三正向时长区间;
若待匹配语音片段在第三正向时长区间内,则根据起止标识的初始位置以及待匹配语音片段的时长,确定待匹配活动片段;
更新模块204,具体用于若待匹配语音片段在第三正向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为待匹配活动片段的结束帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,还用于若待匹配语音片段未在目标正向时长区间内,且待匹配语音片段的时长小于第一正向时长最小值,则根据待匹配语音片段的时长、起止标识的初始位置以及移动半径,确定待匹配活动片段;
或者,
获取模块202,还用于若待匹配语音片段未在目标正向时长区间内,且待匹配语音片段的时长小于第一正向时长最小值,则根据待匹配语音片段的时长以及起止标识的初始位置,确定待匹配活动片段。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,音画匹配装置20还包括确定模块206以及划分模块207;
获取模块202，还用于若待匹配语音片段未在目标正向时长区间内，且待匹配语音片段的时长大于第一正向时长最大值，则从图像序列中获取第k个活动片段，其中，k为大于或等于1，且小于或等于N的整数；
确定模块206,用于根据最大缩放比例、起止标识的初始位置以及第k个活动片段的起始帧确定第三正向时长最小值;
确定模块206,还用于根据最小缩放比例、起止标识的初始位置以及第(k+1)个活动片段的结束帧确定第三正向时长最大值;
确定模块206,还用于根据第三正向时长最小值以及第三正向时长最大值确定第四正向时长区间;
确定模块206,还用于若待匹配语音片段在第四正向时长区间内,则根据起止标识的初始位置、最大缩放比例、第k个活动片段的起始帧,确定待匹配活动片段的时长;
划分模块207,用于根据待匹配活动片段的时长,将待匹配语音片段划分为第一语音片段以及第二语音片段,其中,第一语音片段的时长与待匹配活动片段的时长一致,第二语音片段用于根据更新后的起止标识的初始位置匹配对应的动作片段;
更新模块204,具体用于若待匹配语音片段在第四正向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第k个活动片段的起始帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,具体用于当起止标识的移动方向为反向,且起止标识的初始位置大于或等于第p个活动片段的起始帧时,根据最小缩放比例、起止标识的初始位置以及第p个活动片段的起始帧,确定第一反向时长最小值,其中,p为大于或等于1,且小于或等于N的整数;
根据最大缩放比例、起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第一反向时长最大值;
根据第一反向时长最小值以及第一反向时长最大值确定目标反向时长区间;
若待匹配语音片段在目标反向时长区间内,则根据第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定待匹配活动片段;
更新模块204,还用于若满足起止标识位置更新条件,则对起止标识的初始位置进行更新;
调整模块205，还用于若更新后的起止标识的初始位置小于或等于第一个活动片段的起始帧所对应的位置，则将起止标识的移动方向调整为正向。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202，具体用于根据起止标识的初始位置以及第p个活动片段的起始帧确定第二反向时长最小值，其中，第二反向时长最小值大于第一反向时长最小值；
根据第一反向时长最小值以及第二反向时长最小值确定第一反向时长区间;
若待匹配语音片段在第一反向时长区间内,则根据待匹配语音片段的时长,对第p个活动片段的起始帧至起止标识的初始位置之间的时长进行缩放处理,得到待匹配活动片段;
更新模块204,具体用于若待匹配语音片段在第一反向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第p个活动片段的起始帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,具体用于根据起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第二反向时长最大值,其中,第二反向时长最大值小于第一反向时长最大值;
根据第一反向时长最大值以及第二反向时长最大值确定第二反向时长区间;
若待匹配语音片段在第二反向时长区间内,则根据待匹配语音片段的时长,对第(p-1)个活动片段的结束帧至起止标识的初始位置之间的时长进行缩放处理,得到待匹配活动片段;
更新模块204,具体用于若待匹配语音片段在第二反向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为第(p-1)个活动片段的结束帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,具体用于根据起止标识的初始位置以及第p个活动片段的起始帧确定第二反向时长最小值,其中,第二反向时长最小值大于第一反向时长最小值;
根据起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第二反向时长最大值,其中,第二反向时长最大值小于第一反向时长最大值;
根据第二反向时长最小值与第二反向时长最大值确定第三反向时长区间;
若待匹配语音片段在第三反向时长区间内,则根据起止标识的初始位置以及待匹配语音片段的时长,确定待匹配活动片段;
更新模块204,具体用于若待匹配语音片段在第三反向时长区间内,则满足起止标识位置更新条件;
将起止标识的初始位置更新为待匹配活动片段的起始帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,在本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,还用于若待匹配语音片段未在目标反向时长区间内,且待匹配语音片段的时长小于第一反向时长最小值,则根据待匹配语音片段的时长、起止标识的初始位置以及移动半径,确定待匹配活动片段;
或者,
获取模块202,还用于若待匹配语音片段未在目标反向时长区间内,且待匹配语音片段的时长小于第一反向时长最小值,则根据待匹配语音片段的时长以及起止标识的初始位置,确定待匹配活动片段。
可选地,在上述图16所对应的实施例的基础上,本申请实施例提供的音画匹配装置20的另一实施例中,
获取模块202,还用于若待匹配语音片段未在目标反向时长区间内,且待匹配语音片段的时长大于第一反向时长最大值,则从图像序列中获取第q个活动片段,其中,q为大于或等于1,且小于或等于N的整数;
确定模块206,还用于根据最大缩放比例、起止标识的初始位置以及第q个活动片段的结束帧确定第三反向时长最小值;
确定模块206,还用于根据最小缩放比例、起止标识的初始位置以及第(q-1)个活动片段的起始帧确定第三反向时长最大值;
确定模块206,还用于根据第三反向时长最小值以及第三反向时长最大值确定第四反向时长区间;
确定模块206,还用于若待匹配语音片段在第四反向时长区间内,则根据起止标识的初始位置、最大缩放比例、第q个活动片段的结束帧,确定待匹配活动片段的时长;
划分模块207,还用于根据待匹配活动片段的时长,将待匹配语音片段划分为第三语音片段以及第四语音片段,其中,第三语音片段的时长与待匹配活动片段的时长一致,第四语音片段用于根据更新后的起止标识的初始位置匹配对应的动作片段;
更新模块204，具体用于若待匹配语音片段在第四反向时长区间内，则满足起止标识位置更新条件；
将起止标识的初始位置更新为第q个活动片段的结束帧所对应的位置。
可选地,在上述图16所对应的实施例的基础上,本申请实施例提供的音画匹配装置20的另一实施例中,视频片段为虚拟视频片段、合成视频片段以及剪辑视频片段中的至少一种;
起止标识为游标或滑动杆。
本申请实施例还提供了一种终端设备，终端设备上可以部署有音画匹配装置，如图17所示，为了便于说明，仅示出了与本申请实施例相关的部分，具体技术细节未揭示的，请参照本申请实施例方法部分。该终端设备可以为手机、平板电脑、PDA(Personal Digital Assistant,个人数字助理)、POS(Point of Sales,销售终端设备)、车载电脑等任意终端设备，以终端设备为手机为例：
图17示出的是与本申请实施例提供的终端设备相关的手机的部分结构的框图。参考图17,手机包括:射频(Radio Frequency,RF)电路310、存储器320、输入单元330、显示单元340、传感器350、音频电路360、无线保真(wireless fidelity,WiFi)模块370、处理器380、以及电源390等部件。本领域技术人员可以理解,图17中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
在本申请实施例中,该终端设备所包括的处理器380可以执行前述实施例中的功能,此处不再赘述。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行如前述各个实施例描述的方法中音画匹配装置所执行的步骤。
本申请实施例中还提供一种包括程序的计算机程序产品,当其在计算机上运行时,使得计算机执行前述各个实施例描述的方法中音画匹配装置所执行的步骤。
以上实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (18)

  1. 一种视频的音画匹配方法,由计算机设备执行,包括:
    获取语音序列,其中,所述语音序列包括M个语音片段,所述M为大于或等于1的整数;
    从所述语音序列中获取待匹配语音片段,其中,所述待匹配语音片段属于所述语音序列中的任意一个语音片段;
    从图像序列中获取起止标识的初始位置以及所述起止标识的移动方向,其中,所述图像序列包括N个活动片段,每个活动片段中包括对象的动作画面,所述起止标识的初始位置为所述活动片段的起始帧或者所述活动片段的结束帧,所述N为大于或等于1的整数;
    根据所述起止标识的初始位置、所述起止标识的移动方向以及所述待匹配语音片段,确定待匹配活动片段;
    对所述待匹配语音片段与所述待匹配活动片段进行合成处理,得到视频片段,其中,所述视频片段包括所述对象的动作画面以及所述对象的语音。
  2. 根据权利要求1所述的音画匹配方法,当所述起止标识的移动方向为正向,且所述起止标识的初始位置小于或等于第j个活动片段的起始帧时,所述j为大于或等于1,且小于或等于(N+1)的整数;所述根据所述起止标识的初始位置、所述起止标识的移动方向以及所述待匹配语音片段,确定待匹配活动片段,包括:
    根据最小缩放比例、所述起止标识的初始位置以及所述第j个活动片段的结束帧,确定第一正向时长最小值;
    根据最大缩放比例、所述起止标识的初始位置以及第(j+1)个活动片段的起始帧确定第一正向时长最大值;
    根据所述第一正向时长最小值以及所述第一正向时长最大值确定目标正向时长区间;
    若所述待匹配语音片段在所述目标正向时长区间内,则根据所述第j个活动片段以及第(j+1)个活动片段中至少一个活动片段,确定所述待匹配活动片段;
    所述方法还包括:
    若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新;
    若更新后的所述起止标识的初始位置大于或等于第N个活动片段的结束帧所对应的位置,则将所述起止标识的移动方向调整为反向。
  3. 根据权利要求2所述的音画匹配方法,所述根据所述第j个活动片段以及第(j+1)个活动片段中至少一个活动片段,确定所述待匹配活动片段,包括:
    根据所述起止标识的初始位置以及所述第j个活动片段的结束帧确定第二正向时长最小值，其中，所述第二正向时长最小值大于所述第一正向时长最小值；
    根据所述第一正向时长最小值以及所述第二正向时长最小值确定第一正向时长区间;
    若所述待匹配语音片段在所述第一正向时长区间内,则根据所述待匹配语音片段的时长,对所述起止标识的初始位置至所述第j个活动片段的结束帧之间的时长进行缩放处理,得到所述待匹配活动片段;
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第一正向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述第j个活动片段的结束帧所对应的位置。
  4. 根据权利要求2所述的音画匹配方法,所述根据所述第j个活动片段以及第(j+1)个活动片段中至少一个活动片段,确定所述待匹配活动片段,包括:
    根据所述起止标识的初始位置以及所述第(j+1)个活动片段的结束帧确定第二正向时长最大值,其中,所述第二正向时长最大值小于所述第一正向时长最大值;
    根据所述第一正向时长最大值以及所述第二正向时长最大值确定第二正向时长区间;
    若所述待匹配语音片段在所述第二正向时长区间内,则根据所述待匹配语音片段的时长,对所述起止标识的初始位置至所述第(j+1)个活动片段的起始帧之间的时长进行缩放处理,得到所述待匹配活动片段;
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第二正向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述第(j+1)个活动片段的起始帧所对应的位置。
  5. 根据权利要求2所述的音画匹配方法,所述根据所述第j个活动片段以及第(j+1)个活动片段中至少一个活动片段,确定所述待匹配活动片段,包括:
    根据所述起止标识的初始位置以及所述第j个活动片段的结束帧确定第二正向时长最小值,其中,所述第二正向时长最小值大于所述第一正向时长最小值;
    根据所述起止标识的初始位置以及所述第(j+1)个活动片段的结束帧确定第二正向时长最大值，其中，所述第二正向时长最大值小于所述第一正向时长最大值；
    根据所述第二正向时长最小值与所述第二正向时长最大值确定第三正向时长区间;
    若所述待匹配语音片段在所述第三正向时长区间内,则根据所述起止标识的初始位置以及所述待匹配语音片段的时长,确定所述待匹配活动片段;
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第三正向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述待匹配活动片段的结束帧所对应的位置。
  6. 根据权利要求2所述的音画匹配方法,所述根据所述第一正向时长最小值以及所述第一正向时长最大值确定目标正向时长区间之后,所述方法还包括:
    若所述待匹配语音片段未在所述目标正向时长区间内,且所述待匹配语音片段的时长小于所述第一正向时长最小值,则根据所述待匹配语音片段的时长、所述起止标识的初始位置以及移动半径,确定所述待匹配活动片段;
    或者,
    若所述待匹配语音片段未在所述目标正向时长区间内,且所述待匹配语音片段的时长小于所述第一正向时长最小值,则根据所述待匹配语音片段的时长以及所述起止标识的初始位置,确定所述待匹配活动片段。
  7. 根据权利要求2或6所述的音画匹配方法,所述根据所述第一正向时长最小值以及所述第一正向时长最大值确定目标正向时长区间之后,所述方法还包括:
    若所述待匹配语音片段未在所述目标正向时长区间内,且所述待匹配语音片段的时长大于所述第一正向时长最大值,则从所述图像序列中获取第k个活动片段,其中,所述k为大于或等于1,且小于或等于N的整数;
    根据所述最大缩放比例、所述起止标识的初始位置以及所述第k个活动片段的起始帧确定第三正向时长最小值;
    根据所述最小缩放比例、所述起止标识的初始位置以及所述第(k+1)个活动片段的结束帧确定第三正向时长最大值;
    根据所述第三正向时长最小值以及所述第三正向时长最大值确定第四正向时长区间;
    若所述待匹配语音片段在所述第四正向时长区间内,则根据所述起止标识的初始位置、所述最大缩放比例、所述第k个活动片段的起始帧,确定所述待匹配活动片段的时长;
    根据所述待匹配活动片段的时长，将所述待匹配语音片段划分为第一语音片段以及第二语音片段，其中，所述第一语音片段的时长与所述待匹配活动片段的时长一致，所述第二语音片段用于根据更新后的所述起止标识的初始位置匹配对应的动作片段；
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第四正向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述第k个活动片段的起始帧所对应的位置。
  8. 根据权利要求1所述的音画匹配方法,当所述起止标识的移动方向为反向,且所述起止标识的初始位置大于或等于第p个活动片段的起始帧时,所述p为大于或等于1,且小于或等于N的整数;所述根据所述起止标识的初始位置、所述起止标识的移动方向以及所述待匹配语音片段,确定待匹配活动片段,包括:
    根据最小缩放比例、所述起止标识的初始位置以及所述第p个活动片段的起始帧,确定第一反向时长最小值;
    根据最大缩放比例、所述起止标识的初始位置以及第(p-1)个活动片段的结束帧确定第一反向时长最大值;
    根据所述第一反向时长最小值以及所述第一反向时长最大值确定目标反向时长区间;
    若所述待匹配语音片段在所述目标反向时长区间内,则根据所述第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定所述待匹配活动片段;
    所述方法还包括:
    若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新;
    若更新后的所述起止标识的初始位置小于或等于第一个活动片段的起始帧所对应的位置,则将所述起止标识的移动方向调整为正向。
  9. 根据权利要求8所述的音画匹配方法,所述根据所述第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定所述待匹配活动片段,包括:
    根据所述起止标识的初始位置以及所述第p个活动片段的起始帧确定第二反向时长最小值,其中,所述第二反向时长最小值大于所述第一反向时长最小值;
    根据所述第一反向时长最小值以及所述第二反向时长最小值确定第一反向时长区间;
    若所述待匹配语音片段在所述第一反向时长区间内，则根据所述待匹配语音片段的时长，对所述第p个活动片段的起始帧至所述起止标识的初始位置之间的时长进行缩放处理，得到所述待匹配活动片段；
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第一反向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述第p个活动片段的起始帧所对应的位置。
  10. 根据权利要求8所述的音画匹配方法,所述根据所述第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定所述待匹配活动片段,包括:
    根据所述起止标识的初始位置以及所述第(p-1)个活动片段的结束帧确定第二反向时长最大值,其中,所述第二反向时长最大值小于所述第一反向时长最大值;
    根据所述第一反向时长最大值以及所述第二反向时长最大值确定第二反向时长区间;
    若所述待匹配语音片段在所述第二反向时长区间内,则根据所述待匹配语音片段的时长,对所述第(p-1)个活动片段的结束帧至所述起止标识的初始位置之间的时长进行缩放处理,得到所述待匹配活动片段;
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第二反向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述第(p-1)个活动片段的结束帧所对应的位置。
  11. 根据权利要求8所述的音画匹配方法,所述根据所述第p个活动片段以及第(p-1)个活动片段中至少一个活动片段,确定所述待匹配活动片段,包括:
    根据所述起止标识的初始位置以及所述第p个活动片段的起始帧确定第二反向时长最小值,其中,所述第二反向时长最小值大于所述第一反向时长最小值;
    根据所述起止标识的初始位置以及所述第(p-1)个活动片段的结束帧确定第二反向时长最大值,其中,所述第二反向时长最大值小于所述第一反向时长最大值;
    根据所述第二反向时长最小值与所述第二反向时长最大值确定第三反向时长区间;
    若所述待匹配语音片段在所述第三反向时长区间内,则根据所述起止标识的初始位置以及所述待匹配语音片段的时长,确定所述待匹配活动片段;
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第三反向时长区间内,则满足所述起止标识位置更新条件;
    将所述起止标识的初始位置更新为所述待匹配活动片段的起始帧所对应的位置。
  12. 根据权利要求8所述的音画匹配方法,所述根据所述第一反向时长最小值以及所述第一反向时长最大值确定目标反向时长区间之后,所述方法还包括:
    若所述待匹配语音片段未在所述目标反向时长区间内,且所述待匹配语音片段的时长小于所述第一反向时长最小值,则根据所述待匹配语音片段的时长、所述起止标识的初始位置以及移动半径,确定所述待匹配活动片段;
    或者,
    若所述待匹配语音片段未在所述目标反向时长区间内,且所述待匹配语音片段的时长小于所述第一反向时长最小值,则根据所述待匹配语音片段的时长以及所述起止标识的初始位置,确定所述待匹配活动片段。
  13. 根据权利要求8或12所述的音画匹配方法,所述根据所述第一反向时长最小值以及所述第一反向时长最大值确定目标反向时长区间之后,所述方法还包括:
    若所述待匹配语音片段未在所述目标反向时长区间内,且所述待匹配语音片段的时长大于所述第一反向时长最大值,则从所述图像序列中获取第q个活动片段,其中,所述q为大于或等于1,且小于或等于N的整数;
    根据所述最大缩放比例、所述起止标识的初始位置以及所述第q个活动片段的结束帧确定第三反向时长最小值;
    根据所述最小缩放比例、所述起止标识的初始位置以及所述第(q-1)个活动片段的起始帧确定第三反向时长最大值;
    根据所述第三反向时长最小值以及所述第三反向时长最大值确定第四反向时长区间;
    若所述待匹配语音片段在所述第四反向时长区间内,则根据所述起止标识的初始位置、所述最大缩放比例、所述第q个活动片段的结束帧,确定所述待匹配活动片段的时长;
    根据所述待匹配活动片段的时长,将所述待匹配语音片段划分为第三语音片段以及第四语音片段,其中,所述第三语音片段的时长与所述待匹配活动片段的时长一致,所述第四语音片段用于根据更新后的所述起止标识的初始位置匹配对应的动作片段;
    所述若满足起止标识位置更新条件,则对所述起止标识的初始位置进行更新,包括:
    若所述待匹配语音片段在所述第四反向时长区间内，则满足所述起止标识位置更新条件；
    将所述起止标识的初始位置更新为所述第q个活动片段的结束帧所对应的位置。
  14. 根据权利要求1所述的音画匹配方法，所述视频片段为虚拟视频片段、合成视频片段以及剪辑视频片段中的至少一种；
    所述起止标识为游标或滑动杆。
  15. 一种音画匹配装置,包括:
    接收模块,用于获取语音序列,其中,所述语音序列包括M个语音片段,所述M为大于或等于1的整数;
    获取模块,用于从所述语音序列中获取待匹配语音片段,其中,所述待匹配语音片段属于所述语音序列中的任意一个语音片段;
    所述获取模块,还用于从图像序列中获取起止标识的初始位置以及所述起止标识的移动方向,其中,所述图像序列包括N个活动片段,每个活动片段中包括对象的动作画面,所述起止标识的初始位置为所述活动片段的起始帧或者所述活动片段的结束帧,所述N为大于或等于1的整数;
    所述获取模块,还用于根据所述起止标识的初始位置、所述起止标识的移动方向以及所述待匹配语音片段,确定待匹配活动片段;
    处理模块,用于对所述待匹配语音片段与所述待匹配活动片段进行合成处理,得到视频片段,其中,所述视频片段包括所述对象的动作画面以及所述对象的语音。
  16. 一种计算机设备，包括：存储器、收发器、处理器以及总线系统；
    其中,所述存储器用于存储程序;
    所述处理器用于执行所述存储器中的程序,以实现权利要求1至14中任一项所述的方法;
    所述总线系统用于连接所述存储器以及所述处理器，以使所述存储器以及所述处理器进行通信。
  17. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至14中任一项所述的方法。
  18. 一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至14任一项所述的方法。
PCT/CN2021/078367 2020-04-23 2021-03-01 一种视频的音画匹配方法、相关装置以及存储介质 WO2021213008A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21792639.3A EP4033769A4 (en) 2020-04-23 2021-03-01 METHOD FOR MATCHING SOUND AND PHOTOGRAPHY OF VIDEOS, ASSOCIATED DEVICE AND STORAGE MEDIA
US17/712,060 US11972778B2 (en) 2020-04-23 2022-04-01 Sound-picture matching method of video, related apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010326306.1A CN111225237B (zh) 2020-04-23 2020-04-23 一种视频的音画匹配方法、相关装置以及存储介质
CN202010326306.1 2020-04-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/712,060 Continuation US11972778B2 (en) 2020-04-23 2022-04-01 Sound-picture matching method of video, related apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2021213008A1 true WO2021213008A1 (zh) 2021-10-28

Family

ID=70828517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078367 WO2021213008A1 (zh) 2020-04-23 2021-03-01 一种视频的音画匹配方法、相关装置以及存储介质

Country Status (4)

Country Link
US (1) US11972778B2 (zh)
EP (1) EP4033769A4 (zh)
CN (1) CN111225237B (zh)
WO (1) WO2021213008A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111225237B (zh) 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 一种视频的音画匹配方法、相关装置以及存储介质
USD988358S1 (en) * 2021-01-19 2023-06-06 Fujitsu Limited Display screen with animated graphical user interface
US20220374637A1 (en) * 2021-05-20 2022-11-24 Nvidia Corporation Synthesizing video from audio using one or more neural networks
CN114466179A (zh) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 语音与图像同步性的衡量方法及装置
CN115065844B (zh) * 2022-05-24 2023-09-12 北京跳悦智能科技有限公司 一种主播肢体动作节奏的自适应调整方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101523498A (zh) * 2006-08-17 2009-09-02 奥多比公司 用于安置音频和视频片断的技术
US20130073961A1 (en) * 2011-09-20 2013-03-21 Giovanni Agnoli Media Editing Application for Assigning Roles to Media Content
WO2013086607A1 (en) * 2011-12-12 2013-06-20 Corel Corporation Media editing system and method with linked storyboard and timeline
CN108924617A (zh) * 2018-07-11 2018-11-30 北京大米科技有限公司 同步视频数据和音频数据的方法、存储介质和电子设备
CN110781349A (zh) * 2018-07-30 2020-02-11 优视科技有限公司 用于短视频生成的方法、设备、客户端装置及电子设备
CN111225237A (zh) * 2020-04-23 2020-06-02 腾讯科技(深圳)有限公司 一种视频的音画匹配方法、相关装置以及存储介质

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH099224A (ja) * 1995-06-19 1997-01-10 Matsushita Electric Ind Co Ltd リップシンク制御装置を用いた動画像および音声コーデック装置
US7512886B1 (en) * 2004-04-15 2009-03-31 Magix Ag System and method of automatically aligning video scenes with an audio track
CN101005574A (zh) * 2006-01-17 2007-07-25 上海中科计算技术研究所 视频虚拟人手语编辑系统
CN100476877C (zh) * 2006-11-10 2009-04-08 中国科学院计算技术研究所 语音和文本联合驱动的卡通人脸动画生成方法
CN101640057A (zh) * 2009-05-31 2010-02-03 北京中星微电子有限公司 一种音视频匹配方法及装置
CN101937570A (zh) * 2009-10-11 2011-01-05 上海本略信息科技有限公司 基于语音和文字识别的动漫口形自动匹配实现方法
US8655152B2 (en) * 2012-01-31 2014-02-18 Golden Monkey Entertainment Method and system of presenting foreign films in a native language
CN104902317A (zh) * 2015-05-27 2015-09-09 青岛海信电器股份有限公司 音视频同步方法及装置
JP6662063B2 (ja) * 2016-01-27 2020-03-11 ヤマハ株式会社 収録データ処理方法
CN106067989B (zh) * 2016-04-28 2022-05-17 江苏大学 一种人像语音视频同步校准装置及方法
US10740391B2 (en) * 2017-04-03 2020-08-11 Wipro Limited System and method for generation of human like video response for user queries
CN107333071A (zh) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 视频处理方法、装置、电子设备及存储介质
CN108305632B (zh) * 2018-02-02 2020-03-27 深圳市鹰硕技术有限公司 一种会议的语音摘要形成方法及系统
CN108600680A (zh) * 2018-04-11 2018-09-28 南京粤讯电子科技有限公司 视频处理方法、终端及计算机可读存储介质
CN110830852B (zh) * 2018-08-07 2022-08-12 阿里巴巴(中国)有限公司 一种视频内容的处理方法及装置
CN109447234B (zh) * 2018-11-14 2022-10-21 腾讯科技(深圳)有限公司 一种模型训练方法、合成说话表情的方法和相关装置
CN110062116A (zh) * 2019-04-29 2019-07-26 上海掌门科技有限公司 用于处理信息的方法和设备
CN110087014B (zh) * 2019-04-29 2022-04-19 努比亚技术有限公司 视频补全方法、终端及计算机可读存储介质
CN110070065A (zh) * 2019-04-30 2019-07-30 李冠津 基于视觉以及语音智能的手语系统以及通讯方法
CN110267113B (zh) * 2019-06-14 2021-10-15 北京字节跳动网络技术有限公司 视频文件加工方法、***、介质和电子设备
CN110493613B (zh) * 2019-08-16 2020-05-19 江苏遨信科技有限公司 一种视频音唇同步的合成方法及***
CN110688911B (zh) * 2019-09-05 2021-04-02 深圳追一科技有限公司 视频处理方法、装置、***、终端设备及存储介质
CN110781328A (zh) * 2019-09-09 2020-02-11 天脉聚源(杭州)传媒科技有限公司 基于语音识别的视频生成方法、系统、装置和存储介质
CN110991391B (zh) * 2019-09-17 2021-06-29 腾讯科技(深圳)有限公司 一种基于区块链网络的信息处理方法及装置
CN110534109B (zh) * 2019-09-25 2021-12-14 深圳追一科技有限公司 语音识别方法、装置、电子设备及存储介质
KR20220090586A (ko) * 2019-11-18 2022-06-29 구글 엘엘씨 오디오-비주얼 매칭을 사용한 자동 음성 인식 가설 재점수화
CN111010589B (zh) * 2019-12-19 2022-02-25 腾讯科技(深圳)有限公司 基于人工智能的直播方法、装置、设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101523498A (zh) * 2006-08-17 2009-09-02 奥多比公司 用于安置音频和视频片断的技术
US20130073961A1 (en) * 2011-09-20 2013-03-21 Giovanni Agnoli Media Editing Application for Assigning Roles to Media Content
WO2013086607A1 (en) * 2011-12-12 2013-06-20 Corel Corporation Media editing system and method with linked storyboard and timeline
CN108924617A (zh) * 2018-07-11 2018-11-30 北京大米科技有限公司 同步视频数据和音频数据的方法、存储介质和电子设备
CN110781349A (zh) * 2018-07-30 2020-02-11 优视科技有限公司 用于短视频生成的方法、设备、客户端装置及电子设备
CN111225237A (zh) * 2020-04-23 2020-06-02 腾讯科技(深圳)有限公司 一种视频的音画匹配方法、相关装置以及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4033769A4

Also Published As

Publication number Publication date
US20220223182A1 (en) 2022-07-14
CN111225237B (zh) 2020-08-21
CN111225237A (zh) 2020-06-02
EP4033769A1 (en) 2022-07-27
US11972778B2 (en) 2024-04-30
EP4033769A4 (en) 2023-05-03

Similar Documents

Publication Publication Date Title
WO2021213008A1 (zh) 一种视频的音画匹配方法、相关装置以及存储介质
CN109308731B (zh) 级联卷积lstm的语音驱动唇形同步人脸视频合成算法
Shlizerman et al. Audio to body dynamics
CN109462776B (zh) 一种视频特效添加方法、装置、终端设备及存储介质
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
US10776977B2 (en) Real-time lip synchronization animation
US11670015B2 (en) Method and apparatus for generating video
CN111935537A (zh) 音乐短片视频生成方法、装置、电子设备和存储介质
US20210027511A1 (en) Systems and Methods for Animation Generation
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
US20210390945A1 (en) Text-driven video synthesis with phonetic dictionary
WO2021196646A1 (zh) 交互对象的驱动方法、装置、设备以及存储介质
US20240212252A1 (en) Method and apparatus for training video generation model, storage medium, and computer device
US20210073611A1 (en) Dynamic data structures for data-driven modeling
CN112950640A (zh) 视频人像分割方法、装置、电子设备及存储介质
KR20160074958A (ko) 객체의 움직임 분석을 이용한 모션 효과 생성 장치 및 방법
RU2721180C1 (ru) Способ генерации анимационной модели головы по речевому сигналу и электронное вычислительное устройство, реализующее его
WO2019127940A1 (zh) 视频分类模型训练方法、装置、存储介质及电子设备
US20210390937A1 (en) System And Method Generating Synchronized Reactive Video Stream From Auditory Input
CN115578512A (zh) 语音播报视频的生成模型训练和使用方法、装置及设备
CN116528016A (zh) 音视频合成方法、服务器和可读存储介质
Gupta et al. Towards generating ultra-high resolution talking-face videos with lip synchronization
CN110415261A (zh) 一种分区域训练的表情动画转换方法及***
CN114879877B (zh) 状态数据同步方法、装置、设备及存储介质
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21792639

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021792639

Country of ref document: EP

Effective date: 20220422

NENP Non-entry into the national phase

Ref country code: DE