CN113115104B - Video processing method and device, electronic equipment and storage medium

Video processing method and device, electronic equipment and storage medium

Info

Publication number
CN113115104B
Authority
CN
China
Prior art keywords
sound spectrum
expression
coefficient sequence
expression coefficient
target
Prior art date
Legal status
Active
Application number
CN202110296780.9A
Other languages
Chinese (zh)
Other versions
CN113115104A (en)
Inventor
叶奎
黄旭为
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110296780.9A
Publication of CN113115104A
Application granted granted Critical
Publication of CN113115104B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The disclosure relates to a video processing method and device, electronic equipment and a storage medium, and belongs to the technical field of video processing. The method includes: acquiring a sound spectrum corresponding to a text to be processed; segmenting the sound spectrum to obtain a plurality of sound spectrum segments; performing prediction processing on the sound spectrum by using an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes a plurality of expression coefficients and the durations corresponding to the expression coefficients, and the expression coefficients correspond to the sound spectrum segments; generating the audio segments corresponding to the sound spectrum segments; and sending the expression coefficient sequence and the plurality of audio segments to the client to trigger the client to generate the target video. The method effectively improves the accuracy and timeliness of expression coefficient sequence generation, and because the sound spectrum is segmented and the audio segments corresponding to the sound spectrum segments are obtained and used for processing the target video, it can effectively assist in improving the response efficiency of subsequent video processing.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of the software and hardware technologies of intelligent terminals, methods for processing video on the intelligent terminal side are becoming more popular. For example, video may be processed in a text-driven manner, that is, according to a video of a target character and a piece of audio (the audio being synthesized from text), a video of the target character speaking in synchronization with the audio is generated, and an expression coefficient sequence and the audio synthesized from the text are used in the process of processing the video.
In the text-driven video processing methods in the related art, the generation quality of the expression coefficient sequence is not high, and the response efficiency of video processing is low.
Disclosure of Invention
The present disclosure provides a video processing method, an apparatus, an electronic device, a storage medium, and a computer program product, to at least solve the technical problems of low generation quality of an expression coefficient sequence and low response efficiency of video processing in a text-driven video processing method in related technologies.
The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video processing method, including: acquiring a sound spectrum corresponding to a text to be processed; carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections; predicting the sound spectrum by adopting an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, wherein the expression coefficient sequence comprises: a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments; generating an audio segment corresponding to the sound spectrum segment; and sending the expression coefficient sequence and the plurality of audio segments to a client to trigger the client to generate a target video.
In some embodiments of the present disclosure, the method for training the expression coefficient sequence generation model includes:
acquiring a plurality of sample sound spectrums and a marked expression coefficient sequence corresponding to the sample sound spectrums;
inputting the plurality of sample sound spectrums into a neural network model to obtain a predicted sample expression coefficient sequence output by the neural network model;
and training the neural network model according to the difference value between the sample expression coefficient sequence and the labeled expression coefficient sequence to obtain an expression coefficient sequence generation model.
In some embodiments of the present disclosure, the segmenting the sound spectrum to obtain a plurality of sound spectrum segments includes:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum;
determining the segmented duration according to the time scale information and the preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the segmentation duration.
In some embodiments of the present disclosure, the step of determining a segment duration according to the time scale information and a preset frame rate of the target video includes:
determining the ratio of a preset value to the preset frame rate;
and determining the least common multiple of the time scale information and the ratio, and taking the least common multiple as the segment duration.
In some embodiments of the disclosure, after the generating the audio segment corresponding to the sound spectrum segment, further includes:
aligning the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence;
dividing the target expression coefficient sequence into a plurality of expression coefficient subsequences, wherein the time length covered by the expression coefficients in the expression coefficient subsequences is the segmentation time length;
and sending the audio segment corresponding to the sound spectrum and the expression coefficient subsequence aligned to the sound spectrum to the client so as to trigger the client to generate the target video by adopting the expression coefficient subsequence and the audio segment.
In some embodiments of the present disclosure, the aligning the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence includes:
acquiring the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum;
when the number of the time scale information contained in the sound spectrum can be divided by the ratio, processing the sound spectrum into a first target sound spectrum, wherein a first target distribution duration corresponding to a first target sound spectrum characteristic carried by the first target sound spectrum is an integer division value of the distribution duration of the sound spectrum and the ratio;
when the quantity of the time scale information contained in the sound spectrum cannot be divided by the ratio, processing the sound spectrum into a second target sound spectrum, wherein a second target distribution duration corresponding to a second target sound spectrum feature carried by the second target sound spectrum is an addition value obtained by adding the division value and a reference value, and the division value is obtained by dividing the distribution duration of the sound spectrum and the ratio;
and taking the expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum as the target expression coefficient sequence.
According to a second aspect of the embodiments of the present disclosure, there is provided a video processing method, including: receiving an expression coefficient sequence and a plurality of audio segments, wherein the plurality of audio segments are generated according to a plurality of sound spectrum segments obtained by segmenting a sound spectrum corresponding to a text to be processed, and the expression coefficient sequence is obtained by performing prediction processing on the sound spectrum and comprises: a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments; and carrying out fusion processing on the expression coefficient sequence and the plurality of audio segments to obtain a target video.
In some embodiments of the disclosure, the fusing the expression coefficient sequence and the audio segments to obtain a target video includes:
acquiring mixed deformation of a plurality of facial expressions corresponding to the expression coefficients;
fusing the expression coefficients and the mixed deformation corresponding to the plurality of facial expressions to obtain facial expression images corresponding to the expression coefficients;
combining the plurality of facial expression images to obtain a composite video according to a preset frame rate;
and fusing the plurality of audio segments and the synthesized video to obtain the target video.
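Purely as a non-limiting illustration of the fusion steps above: the "mixed deformations" are commonly known as blend shapes, and a linear blendshape combination is one common way to realize the per-frame facial expression images. The sketch below assumes that interpretation; the names neutral_face, blendshapes and mux are hypothetical inputs and not part of the disclosure.

```python
import numpy as np

def render_target_video(coeff_seq, blendshapes, neutral_face, audio_segments, fps, mux):
    """Sketch of the client-side fusion: each expression coefficient vector weights the
    blend shapes ("mixed deformations") of the facial expressions to produce one facial
    expression image; the images are combined at the preset frame rate and then fused
    with the audio segments to obtain the target video."""
    frames = []
    for coeffs in coeff_seq:                      # coeffs: (n_e,) weights, one per blend shape
        # Assumed linear blendshape model: neutral face plus coefficient-weighted deformations.
        face = neutral_face + np.tensordot(coeffs, blendshapes, axes=1)
        frames.append(face)
    audio = np.concatenate(audio_segments)        # the received audio segments, in order
    return mux(frames, audio, fps)                # hypothetical muxer: video frames + audio -> video
```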
According to a third aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including: the acquisition module is configured to acquire a sound spectrum corresponding to the text to be processed; a segmentation module configured to perform segmentation processing on the sound spectrum to obtain a plurality of sound spectrum segments; the prediction module is configured to perform prediction processing on the sound spectrum by using an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, and the expression coefficient sequence comprises: a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments; a generation module configured to perform generating an audio segment corresponding to the sound spectrum segment; a sending module configured to execute sending the sequence of expression coefficients and the plurality of audio segments to a client to trigger the client to generate a target video.
In some embodiments of the present disclosure, further comprising:
the training module is configured to acquire a plurality of sample sound spectrums and labeled expression coefficient sequences corresponding to the sample sound spectrums, input the sample sound spectrums into a neural network model to obtain predicted sample expression coefficient sequences output by the neural network model, and train the neural network model according to a difference value between the sample expression coefficient sequences and the labeled expression coefficient sequences to obtain the expression coefficient sequence generation model.
In some embodiments of the disclosure, the segmentation module is configured to perform:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum;
determining a segmented duration according to the time scale information and a preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the segmentation duration.
In some embodiments of the disclosure, the segmentation module is configured to perform:
determining the ratio of a preset value to the preset frame rate;
and determining the least common multiple of the time scale information and the ratio, and taking the least common multiple as the segment duration.
In some embodiments of the present disclosure, further comprising:
and the alignment module is configured to perform alignment processing on the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence, divide the target expression coefficient sequence into a plurality of expression coefficient subsequences, wherein the time length covered by the expression coefficients in the expression coefficient subsequences is the segmentation time length, and send an audio segment corresponding to the sound spectrum and the expression coefficient subsequences aligned with the sound spectrum to the client so as to trigger the client to generate the target video by adopting the expression coefficient subsequences and the audio segment.
In some embodiments of the present disclosure, the alignment module is configured to perform:
acquiring the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum;
when the number of the time scale information contained in the sound spectrum can be divided by the ratio, processing the sound spectrum into a first target sound spectrum, wherein a first target distribution duration corresponding to a first target sound spectrum characteristic carried by the first target sound spectrum is an integer division value of the distribution duration of the sound spectrum and the ratio;
when the number of the time scale information contained in the sound spectrum cannot be divided by the ratio, processing the sound spectrum into a second target sound spectrum, wherein a second target distribution duration corresponding to a second target sound spectrum characteristic carried by the second target sound spectrum is an addition value obtained by adding the division value and a reference value, and the division value is obtained by dividing the distribution duration of the sound spectrum and the ratio;
and taking the expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum as the target expression coefficient sequence.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including: the receiving module is configured to perform receiving of an expression coefficient sequence and a plurality of audio segments, wherein the plurality of audio segments are generated by segmenting a sound spectrum corresponding to a text to be processed to obtain a plurality of sound spectrum segments and performing prediction processing on the sound spectrum according to the plurality of sound spectrum segments, and the expression coefficient sequence comprises: a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments; and the fusion module is configured to perform fusion processing on the expression coefficient sequence and the plurality of audio segments to obtain a target video.
In some embodiments of the disclosure, the fusion module is configured to perform:
acquiring mixed deformation of a plurality of facial expressions corresponding to the expression coefficients;
fusing the expression coefficients and the mixed deformation corresponding to the plurality of facial expressions to obtain facial expression images corresponding to the expression coefficients;
combining the plurality of facial expression images according to a preset frame rate to obtain a synthesized video;
and fusing the audio segments and the synthesized video to obtain the target video.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video processing method as previously described.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the video processing method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of obtaining a sound spectrum corresponding to a text to be processed, conducting segmentation processing on the sound spectrum to obtain a plurality of sound spectrum segments, conducting prediction processing on the sound spectrum by adopting an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, wherein the expression coefficient sequence comprises the following steps: the method comprises the steps of generating an audio segment corresponding to a sound spectrum segment by a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segment, generating an audio segment corresponding to the sound spectrum segment, and sending an expression coefficient sequence and a plurality of audio segments to a client to trigger the client to generate a target video.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flow diagram illustrating a video processing method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a video processing method according to another exemplary embodiment.
Fig. 3 is a flow chart illustrating a video processing method according to yet another exemplary embodiment.
Fig. 4 is a schematic view of an application scenario according to an embodiment of the present disclosure.
Fig. 5 is a flow chart illustrating a video processing method according to yet another exemplary embodiment.
Fig. 6 is a block diagram illustrating a video processing device according to an example embodiment.
Fig. 7 is a block diagram of a video processing apparatus according to another exemplary embodiment.
Fig. 8 is a block diagram illustrating a video processing apparatus according to yet another exemplary embodiment.
Fig. 9 is a block diagram illustrating an electronic device according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flow diagram illustrating a video processing method according to an example embodiment.
The present embodiment is described by taking a case where the video processing method is configured in a video processing apparatus as an example. The video processing apparatus may be provided in a server.
It should be noted that the execution body in the embodiment of the present disclosure may be, in terms of hardware, for example, a Central Processing Unit (CPU) of a server, and may be, in terms of software, for example, a related background service of the server, which is not limited herein.
As shown in fig. 1, the video processing method includes the following steps:
in step S101, a sound spectrum corresponding to the text to be processed is acquired.
When the text to be processed is used to drive the client to generate the target video, for example, the text to be processed may be processed to obtain corresponding audio segments, the audio segments may be fused with the target video, and the text to be processed may also be loaded and displayed in the target video, which is not limited herein.
The text to be processed may be input by a user at the client side. After the user inputs a piece of text to be processed at the client side, the text to be processed may be sent to a server, and the server performs corresponding processing on the text to be processed to obtain an expression coefficient sequence and audio segments, as described in detail below.
The text to be processed is, for example: "The weather is really good today!"
For example, after obtaining a piece of text to be processed, the server may pre-process the text to be processed, such as removing the punctuation in "The weather is really good today!", converting each character of "The weather is really good today" into an initial consonant, a final, a tone, and the like in pinyin, and then obtaining a corresponding word embedding vector according to the converted text to be processed, where the word embedding vector can be used for generating the sound spectrum corresponding to the text to be processed.
The sound spectrum can be used to describe the components contained in the sound and the distribution mode of the acoustic energy in the timbre, and can be regarded as a 'picture of the sound', and can be divided into a static sound spectrum and a dynamic sound spectrum.
In the embodiment of the present disclosure, after the text to be processed is preprocessed to obtain the corresponding word embedding vector, a sound spectrum determination manner in the related art may be adopted to obtain a sound spectrum corresponding to the text to be processed (the sound spectrum corresponds to the whole text to be processed), so that the sound spectrum can carry context information of the text to be processed.
For example, the word embedding vector corresponding to the text to be processed may be input into an acoustic model (e.g., an end-to-end speech synthesis model based on deep learning, such as Tacotron or Tacotron 2) to obtain a sound spectrum (e.g., a Mel spectrum or a cepstrum), which is not limited herein.
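As a minimal illustration of this text-to-spectrum stage, the sketch below assumes hypothetical callables g2p (character-to-pinyin-unit conversion), embed (word embedding lookup), and acoustic_model (a Tacotron-2-style network); none of these names come from the disclosure.

```python
import re
import numpy as np

def text_to_spectrum(text, g2p, embed, acoustic_model):
    """Illustrative pre-processing pipeline: text -> phonetic units -> embeddings -> sound spectrum."""
    clean = re.sub(r"\W", "", text)                # drop punctuation such as "!"
    units = g2p(clean)                             # assumed: initials / finals / tones per character
    vectors = np.stack([embed(u) for u in units])  # one word embedding vector per unit
    # The acoustic model maps the embeddings to a sound spectrum of shape (T_cep, n_cep);
    # each row covers one time-scale unit of the spectrum (e.g. 10 ms).
    return acoustic_model(vectors)
```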
In the embodiment of the present disclosure, taking the sound spectrum as a cepstrum as an example, the sound spectrum may be represented as (T_cep, n_cep), where T_cep describes the distribution duration of the sound spectrum and contains a plurality of pieces of time scale information, one piece of time scale information representing a certain duration (e.g., 10 ms), and n_cep represents the dimension of the sound spectrum (e.g., the cepstrum is 20-dimensional).
It should be noted that the sound spectrum is concretely presented as a matrix: T_cep represents the time distribution corresponding to the column vectors, the distribution duration occupied by each column vector may be referred to as time scale information, and n_cep represents the dimension of all the column vectors. If the distribution duration corresponding to one column vector is 10 ms, the time scale information of the sound spectrum is 10 ms, which is not limited hereto.
In the embodiment of the present disclosure, the sound spectrum corresponding to the whole text to be processed can generally carry context information of the text to be processed, and the context information can be used to describe semantics expressed by the text to be processed.
It can be understood that, based on the semantic differences expressed by different texts to be processed, different texts to be processed carry different or the same context information; based on the different or the same context information, the semantics, intonation, and duration corresponding to each character in a piece of text to be processed are the same or different, and these semantics, intonations, and durations are reflected in differences of the sound spectrum.
In the embodiment of the present disclosure, the sound spectrum corresponding to the text to be processed is obtained, so that the sound spectrum carries the context information of the text to be processed, that is, the global context information corresponding to the whole text to be processed.
In step S102, the sound spectrum is segmented to obtain a plurality of sound spectrum segments.
In some embodiments, time scale information of the sound spectrum may be determined, where the time scale information is used to describe the distribution duration corresponding to the sound spectrum features carried by the sound spectrum; a segment duration may be determined according to the time scale information and a preset frame rate of the target video, and the sound spectrum may be segmented to obtain a plurality of sound spectrum segments with the segment duration. In this way, the degree of coincidence between the plurality of sound spectrum segments obtained by division and the preset frame rate of the target video can be effectively improved, which effectively assists in improving the fusion processing effect of the target video.
For example, assuming that the sound spectrum is concretely presented as a matrix, T_cep represents the distribution duration corresponding to the column vectors (one column vector may correspond to one sound spectrum feature), the distribution duration occupied by each column vector may be referred to as time scale information, and n_cep represents the dimension of all the column vectors; if the distribution duration corresponding to one column vector is 10 ms, the time scale information of the sound spectrum is 10 ms, which is not limited herein.
The segment duration may be determined according to the time scale information and a preset frame rate of the target video (the preset frame rate may be, for example, the frames per second (FPS) of the target video), which is simple and convenient to implement and has good practicability.
For example, assuming that the calculated segmentation duration is 100ms, the sound spectrum may be divided into a plurality of sound spectrum segments with a duration of 100ms, and then, the subsequent steps may be triggered, or any other possible segmentation duration may be used to segment the sound spectrum, which is not limited to this.
In some other embodiments, the audio spectrum may also be segmented according to the actual rendering requirement to obtain a plurality of audio spectrum segments with different durations, which is not limited herein.
In step S103, the expression coefficient sequence generation model is used to perform prediction processing on the sound spectrum to obtain an expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes: a plurality of expression coefficients and the durations corresponding to the expression coefficients, and the expression coefficients correspond to the sound spectrum segments.
The expression coefficients can be used for representing components related to expressions in the shape of the face, and when a user expresses different sound spectrum segments, the user usually presents a rich expression, so that the expression coefficients can correspond to the sound spectrum segments, and when the target video is driven to be generated, the expression coefficients can be used for fusing audio segments corresponding to the sound spectrum segments.
The expression coefficient sequence may specifically include a plurality of expression coefficients, each having a corresponding duration. That is, because the expression coefficient sequence corresponds to the whole sound spectrum, and the sound spectrum may include a plurality of sound spectrum segments, each expression coefficient has a corresponding sound spectrum segment, and the duration corresponding to an expression coefficient may be understood as the duration for which the corresponding expression-related component in the face shape lasts, which is not limited hereto.
In the embodiment of the present disclosure, it is further considered that different texts to be processed have different semantic contents. When the user uses audio to present the semantic content corresponding to the text to be processed, different expressions may be made correspondingly. For example, when the text to be processed is "The weather is really good today!", the user usually presents an expression matching that content when expressing it; when the text to be processed is other content, the expression when the user expresses the corresponding audio may correspond to the actual context semantics of that other text to be processed.
In the embodiment of the present disclosure, the category of each expression in a face picture may be determined in advance based on a preset expression classification standard; then, for expressions of different categories, the expression coefficients of each category of facial expression in the face picture are obtained in different manners. Next, a plurality of sample sound spectra and the labeled expression coefficient sequences corresponding to the sample sound spectra are acquired, the plurality of sample sound spectra are input into a neural network model to obtain the predicted sample expression coefficient sequences output by the neural network model, and the neural network model is trained according to the difference value between the sample expression coefficient sequences and the labeled expression coefficient sequences. The trained neural network model is used as the expression coefficient sequence generation model, which has the function of predicting an expression coefficient sequence from a sound spectrum, so that the generation efficiency of the expression coefficient sequence can be effectively improved in an auxiliary manner. Because the expression coefficient sequence generation model is obtained in advance based on a large number of sample sound spectra and the corresponding labeled expression coefficient sequences, the generation accuracy of the expression coefficient sequence is also effectively improved.
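The disclosure does not fix a particular network architecture or loss. As a minimal sketch under explicit assumptions (a small convolution-plus-GRU regressor trained with an MSE objective, 10 ms cepstral frames mapped to 40 ms expression frames), the training described above might look like the following; all class and function names are illustrative.

```python
import torch
import torch.nn as nn

class ExpressionSeqModel(nn.Module):
    """Illustrative regressor from a sound spectrum (T_cep, 20) at 10 ms per frame
    to an expression coefficient sequence (T_cep // 4, 51) at 40 ms per frame."""
    def __init__(self, n_cep=20, n_e=51, hidden=128):
        super().__init__()
        self.down = nn.Conv1d(n_cep, hidden, kernel_size=4, stride=4)  # 4 x 10 ms -> one 40 ms frame
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)            # carries context along time
        self.out = nn.Linear(hidden, n_e)

    def forward(self, cep):                                  # cep: (batch, T_cep, n_cep)
        h = self.down(cep.transpose(1, 2)).transpose(1, 2)   # (batch, T_cep // 4, hidden)
        h, _ = self.rnn(h)
        return self.out(h)                                   # (batch, T_cep // 4, n_e)

def train(model, loader, epochs=10, lr=1e-3):
    """`loader` is assumed to yield (sample sound spectrum, labeled expression coefficient sequence) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()             # difference between predicted and labeled sequences
    for _ in range(epochs):
        for cep, labeled in loader:
            loss = loss_fn(model(cep), labeled)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```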
Of course, the training model may also adopt other artificial intelligence models, such as a machine learning model, and the like, which is not limited thereto.
Therefore, after the sound spectrum corresponding to the text to be processed is obtained, the sound spectrum can be input into the trained expression coefficient sequence generation model, the sound spectrum is subjected to prediction processing by adopting the expression coefficient sequence generation model, and the expression coefficient sequence corresponding to the sound spectrum is obtained.
Certainly, the expression coefficient sequence generation model is only one possible implementation manner for implementing generation of the expression coefficient sequence corresponding to the sound spectrum, and in an actual execution process, generation of the expression coefficient sequence corresponding to the sound spectrum may be implemented in any other possible manner, for example, the expression coefficient sequence may also be implemented by using a conventional programming technique (such as a simulation method and an engineering method), or may also be implemented by using a mathematical method.
The expression coefficient sequence in the embodiment of the present disclosure may be represented as a 2-dimensional matrix of the form (T_e, n_e), where T_e indicates the durations corresponding to the column vectors, the duration corresponding to each column vector may be referred to as time scale information, one piece of time scale information indicates the duration occupied by the corresponding column vector (for example, 40 ms; the time scale information of the expression coefficient sequence may correspond to the frames per second (FPS) of the target video), and n_e indicates the dimension of an expression coefficient (for example, 51 dimensions; other dimensions may also be used, which is not limited hereto).
In step S104, an audio segment corresponding to the sound spectrum segment is generated.
When generating the audio segment corresponding to a sound spectrum segment, the sound spectrum segment may be input into a vocoder, such as an LPCNet vocoder, a WaveNet vocoder, or a WaveRNN vocoder, and the corresponding audio segment is obtained through the corresponding mapping processing of the vocoder.
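As a minimal sketch of this segment-by-segment vocoding (the vocoder and send callables are placeholders, not a specific library API), each spectrum segment can be converted and pushed to the client as soon as it is ready, which is what keeps the first-segment response time short:

```python
def stream_audio_segments(spectrum_segments, vocoder, send):
    """Convert sound spectrum segments to audio one by one and push each result to the
    client as soon as it is ready, so later segments do not delay the first one."""
    for index, segment in enumerate(spectrum_segments):
        audio = vocoder(segment)   # placeholder call: one spectrum segment -> one audio waveform
        send(index, audio)         # e.g. transmitted with the matching expression coefficient subsequence
```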
In step S105, the sequence of expression coefficients and the plurality of audio segments are sent to the client to trigger the client to generate the target video.
After the sound spectrum is segmented to obtain a plurality of sound spectrum segments and the audio segments corresponding to the sound spectrum segments are generated, the obtained expression coefficient sequence and audio segments are used to trigger the client to generate the target video; for example, the obtained expression coefficient sequence and audio segments may be sent to the client in real time, or may be sent to the client when the current time reaches a set time point, which is not limited herein.
In this embodiment, a sound spectrum corresponding to a text to be processed is acquired, the sound spectrum is segmented to obtain a plurality of sound spectrum segments, and prediction processing is performed on the sound spectrum by using an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes a plurality of expression coefficients and the durations corresponding to the expression coefficients, and the expression coefficients correspond to the sound spectrum segments; audio segments corresponding to the sound spectrum segments are generated, and the expression coefficient sequence and the plurality of audio segments are sent to the client to trigger the client to generate the target video.
Fig. 2 is a flow chart illustrating a video processing method according to another exemplary embodiment.
As shown in fig. 2, the video processing method includes the following steps:
in step S201, a sound spectrum corresponding to the text to be processed is acquired.
For the description of step S201, reference may be made to the above embodiments, which are not described herein again.
In step S202, time scale information of the sound spectrum is determined, where the time scale information is used to describe distribution duration corresponding to sound spectrum features carried by the sound spectrum.
In combination with the above description, in the embodiment of the present disclosure, taking the sound spectrum as a cepstrum as an example, the sound spectrum may be represented as (T_cep, n_cep). If the sound spectrum is concretely presented as a matrix, T_cep represents the distribution duration corresponding to the column vectors (one column vector may correspond to one sound spectrum feature), the distribution duration occupied by each column vector may be referred to as time scale information, and n_cep represents the dimension of all the column vectors; if the distribution duration corresponding to one column vector is 10 ms, the time scale information of the sound spectrum is 10 ms, which is not limited herein.
In step S203, the segment duration is determined according to the time scale information and the preset frame rate of the target video.
The preset frame rate may be, for example, the FPS (frames per second) of the target video. FPS is a definition in the image field and refers to the number of frames per second; it measures the amount of information used to store and display a dynamic video, and the higher the FPS, the smoother the displayed motion. Typically, the minimum FPS for avoiding unsmooth motion is 25 frames/second.
The target video is, for example, a piece of video to be fused with the expression coefficient sequence and the audio segments, and it contains, for example, an avatar of a person A.
In this embodiment, the client may analyze the frames per second FPS of the target video, feed back the frames per second FPS of the target video to the server, and receive the frames per second FPS of the target video sent by the client through the server, or the client may also pre-store the frames per second FPS corresponding to the video type in the cloud server, and the local server directly obtains the frames per second FPS of the target video from the cloud server according to the video type of the target video, without limitation.
The frames per second FPS of the target video sent by the client may be, for example, FPS = 30 frames/second.
After the time scale information of the sound spectrum is determined and the FPS of the target video transmitted by the client is received, the segmentation duration can be determined according to the time scale information and the preset frame rate of the target video, so that the segmentation duration can be flexibly determined, the segmentation duration can be matched with the FPS of the target video, and the transmission and display effect of the target video is prevented from being influenced.
For example, the display requirement of the actual target video may be analyzed, and the calculation mode may be adaptively determined according to the display requirement, so that the segmentation duration is determined according to the calculation mode and the time scale information and the frames per second FPS, or the time scale information and the frames per second FPS may be input into a pre-trained calculation model, so as to obtain the segmentation duration output by the calculation model, which is not limited to this.
In this embodiment, the ratio between a preset value and the preset frame rate may be determined, the least common multiple of the time scale information and the ratio is determined, and the least common multiple is used as the segment duration, so that a minimum segment duration can be determined and the first-character response time (the time consumed until the first segment (audio + corresponding expression coefficient sequence) is transmitted to the client) can be shortened to the greatest extent.
The preset value may be determined according to actual display requirements, such as 1000.
For example, the shortest segment duration T_min is the least common multiple of K (the time scale information of the sound spectrum) and 1000/FPS. Specifically, if the time scale information of the sound spectrum is 10 ms and the FPS of the target video is 30 frames/second, then T_min = 100 ms.
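A minimal sketch of this computation (function and parameter names are illustrative, not from the disclosure): the segment duration is the smallest duration that is simultaneously a whole number of spectrum frames and a whole number of video frames, and the spectrum is then cut into segments of that duration.

```python
def segment_duration_ms(time_scale_ms, fps, limit_ms=10_000):
    """Least common multiple of the spectrum time scale (ms) and 1000 / fps (ms),
    found as the smallest multiple of time_scale_ms that is a whole number of video frames."""
    t = time_scale_ms
    while t <= limit_ms:
        if (t * fps) % 1000 == 0:          # t is an integer multiple of 1000 / fps
            return t
        t += time_scale_ms
    raise ValueError("no common multiple found below limit_ms")

def split_spectrum(spectrum, time_scale_ms, duration_ms):
    """Cut a (T_cep, n_cep) sound spectrum into consecutive segments covering duration_ms each."""
    rows = duration_ms // time_scale_ms
    return [spectrum[i:i + rows] for i in range(0, len(spectrum), rows)]

# Figures from the example above: 10 ms spectrum frames and a 30 FPS target video -> 100 ms segments.
assert segment_duration_ms(10, 30) == 100
```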
In step S204, the sound spectrum is segmented to obtain a plurality of sound spectrum segments with the segmentation duration.
In some embodiments, the segmentation duration may be determined, the sound spectrum may be segmented, and a plurality of sound spectrum segments with the segmentation duration are obtained, for example, the sound spectrum may be divided into a plurality of sound spectrum segments with 100ms as the segmentation duration, and then, the subsequent step may be triggered, or any other possible segmentation duration may be used to segment the sound spectrum, which is not limited to this.
In some other embodiments, the audio spectrum may also be segmented according to the actual rendering requirement to obtain a plurality of audio spectrum segments with different durations, which is not limited herein.
In step S205, the expression coefficient sequence generation model is used to perform prediction processing on the sound spectrum to obtain an expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes: a plurality of expression coefficients and the durations corresponding to the expression coefficients, and the expression coefficients correspond to the sound spectrum segments.
In step S206, an audio segment corresponding to the sound spectrum segment is generated.
For example, the sound spectrum segment can be input into a vocoder, such as an LPCNet vocoder, a WaveNet vocoder, or a WaveRNN vocoder, and the vocoder performs the corresponding mapping processing on the sound spectrum segment to obtain the corresponding audio segment.
In this method, after the sound spectrum corresponding to the text is obtained, the sound spectrum is segmented to obtain a plurality of sound spectrum segments and the audio segments corresponding to the sound spectrum segments are generated, so that the first segment of audio can be generated more quickly and the waiting time of the user is shortened; moreover, during the playing of the target video, the processing of the other sound spectrum segments is not affected, so that the playing and processing of the target video are seamlessly integrated, effectively assisting in improving the response efficiency of subsequent video processing.
In step S207, the sequence of expression coefficients and the plurality of audio segments are sent to the client to trigger the client to generate the target video.
The description of step S207 can refer to the above embodiments, and is not repeated herein.
In this embodiment, the sound spectrum can describe the global context information of the text, and the expression coefficient sequence is generated according to the sound spectrum, so that the accuracy of the expression coefficient sequence is effectively improved; the corresponding audio segments are obtained by segmenting the sound spectrum, and the expression coefficient sequence and the audio segments are used for processing the target video, so that the response efficiency of subsequent video processing can be effectively improved in an auxiliary manner. By determining the time scale information of the sound spectrum, where the time scale information is used to describe the distribution duration corresponding to the sound spectrum features carried by the sound spectrum, determining the segment duration according to the time scale information and the preset frame rate of the target video, and segmenting the sound spectrum to obtain a plurality of sound spectrum segments with the segment duration, the degree of coincidence between the plurality of sound spectrum segments obtained by division and the preset frame rate of the target video can be effectively improved, effectively assisting in improving the fusion processing effect of the target video. By determining the ratio of the preset value to the preset frame rate, determining the least common multiple of the time scale information and the ratio, and taking the least common multiple as the segment duration, the first-character response time can be effectively shortened, the waiting time of the user is effectively reduced, and the user experience is effectively improved; in addition, during the playing of the target video, the processing of the other sound spectrum segments is not affected, so that the playing and processing of the target video are seamlessly integrated, effectively assisting in improving the response efficiency of subsequent video processing.
Fig. 3 is a flow chart illustrating a video processing method according to yet another exemplary embodiment.
As shown in fig. 3, the video processing method includes the steps of:
in step S301, a sound spectrum corresponding to the text to be processed is acquired.
In step S302, the sound spectrum is segmented to obtain a plurality of sound spectrum segments.
In step S303, the expression coefficient sequence generation model is used to perform prediction processing on the sound spectrum, so as to obtain an expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes: a plurality of expression coefficients and the duration corresponding to the expression coefficients, the expression coefficients corresponding to the sound spectrum segments.
In step S304, an audio segment corresponding to the sound spectrum segment is generated.
For the description of step S301 and step S304, reference may be made to the above embodiments, which are not described herein again.
In step S305, the sound spectrum and the expression coefficient sequence are aligned to obtain a target expression coefficient sequence.
After the audio segment corresponding to the sound spectrum segment is generated, the sound spectrum and the expression coefficient sequence can be aligned to obtain a target expression coefficient sequence, so that the accuracy of processing a target video at a client side is guaranteed, and the processing effect is guaranteed.
For example, the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum may be used as a reference, and the sound spectrum and the expression coefficient sequence are aligned to obtain a target expression coefficient sequence.
Optionally, in some embodiments, the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum may be obtained. When the number of pieces of time scale information contained in the sound spectrum is divisible by the ratio, the sound spectrum is processed into a first target sound spectrum, where the first target distribution duration corresponding to a first target sound spectrum feature carried by the first target sound spectrum is the integer division value of the distribution duration of the sound spectrum and the ratio. When the number of pieces of time scale information contained in the sound spectrum is not divisible by the ratio, the sound spectrum is processed into a second target sound spectrum, where the second target distribution duration corresponding to a second target sound spectrum feature carried by the second target sound spectrum is the value obtained by adding the division value and a reference value, the division value being obtained by dividing the distribution duration of the sound spectrum by the ratio. The expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum is used as the target expression coefficient sequence. In this way, the reasonableness of the alignment processing is effectively guaranteed, the fusion processing of the expression coefficient sequence and the plurality of audio segments by the subsequent client is guaranteed, and the quality of the generated target video is improved.
It is to be understood that the sound spectrum may be represented as (T_cep, n_cep), where T_cep describes the distribution duration of the sound spectrum and contains a plurality of pieces of time scale information, and the number of pieces of time scale information contained in the sound spectrum specifically refers to the number of pieces of time scale information contained in T_cep.
For example, if the time scale information of the sound spectrum is K1 = 10 ms and the duration of the expression coefficient sequence (this duration may also be understood as its time scale information) is K2 = 40 ms, every 4 rows of the sound spectrum (in the form (T_cep, n_cep)) correspond to 1 row of the expression coefficient sequence, and the calculation method is: the ratio K2/K1 of the duration of the expression coefficient sequence to the time scale information of the sound spectrum. In the alignment process, if the number of pieces of time scale information contained in the sound spectrum (the number contained in T_cep) can be divided by 4, the sound spectrum is organized into the form of a first target sound spectrum (T_cep//4, 4, n_cep); if the number of pieces of time scale information contained in the sound spectrum (the number contained in T_cep) cannot be divided by 4, the last 4 rows of the sound spectrum are additionally taken and the sound spectrum is organized into a second target sound spectrum (T_cep//4 + 1, 4, n_cep), where T_e = T_cep//4 + 1 and "//" is the integer division symbol. In this way, the sound spectrum and the expression coefficient sequence are aligned based on the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum, and the target expression coefficient sequence is obtained.
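A minimal sketch of this grouping, assuming (as the example suggests but does not state explicitly) that when T_cep is not divisible by the ratio the final group is filled from the last spectrum rows; the function name and the NumPy representation are illustrative.

```python
import numpy as np

def align_spectrum(cep, ratio=4):
    """Group a (T_cep, n_cep) cepstrum into blocks of `ratio` rows so that each block lines up
    with one 40 ms row of the expression coefficient sequence (4 x 10 ms rows per block)."""
    t_cep, n_cep = cep.shape
    if t_cep % ratio == 0:
        t_e = t_cep // ratio                      # first target form: (T_cep // 4, 4, n_cep)
        return cep.reshape(t_e, ratio, n_cep)
    t_e = t_cep // ratio + 1                      # second target form: (T_cep // 4 + 1, 4, n_cep)
    padded = np.concatenate([cep, cep[-ratio:]])[: t_e * ratio]  # assumed: reuse the last rows as filler
    return padded.reshape(t_e, ratio, n_cep)
```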
In step S306, the target expression coefficient sequence is divided into a plurality of expression coefficient subsequences, and the time length covered by the expression coefficients in the expression coefficient subsequences is the segment time length.
For example, the duration of the expression coefficient sequence is K2=40ms, the total duration covered by the expression coefficient sequence may be understood as a product value of the number of column vectors in the expression coefficient sequence and 40ms (for example, if the number of column vectors is 10 columns, the total duration is 400 ms), and the segment duration is 100ms, the expression coefficient sequence is divided in the time dimension by using 100ms, so as to obtain four expression coefficient subsequences with durations of 100ms.
In step S307, the audio segment corresponding to the sound spectrum and the expression coefficient subsequence aligned with the sound spectrum are sent to the client, so as to trigger the client to generate the target video by using the expression coefficient subsequence and the audio segment.
The sound spectrum and the expression coefficient sequence are aligned by taking the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum as a reference to obtain a target expression coefficient sequence, the target expression coefficient sequence is divided according to the segment duration to obtain a plurality of expression coefficient subsequences, and then the audio segment corresponding to the sound spectrum and the expression coefficient subsequences aligned with the sound spectrum can be sent to the client to trigger the client to generate the target video by using the expression coefficient subsequences and the audio segment.
Referring to fig. 4, fig. 4 is a schematic view of an application scenario of an embodiment of the present disclosure. In fig. 4, first, a piece of text to be processed is input into an acoustic model, and the acoustic model processes the text to be processed to obtain a corresponding sound spectrum, where the sound spectrum can describe the global context information of the text to be processed. Then, the sound spectrum is input into the trained expression coefficient sequence generation model to obtain the expression coefficient sequence output by the model; meanwhile, the sound spectrum is segmented to obtain sound spectrum segments, and the sound spectrum segments are input into a vocoder segment by segment to obtain the corresponding audio segments (100 ms). Then, the expression coefficient subsequence corresponding to each audio segment (100 ms) is determined, and when transmitting to the client for rendering the target video, each audio segment (100 ms) and the expression coefficient subsequence corresponding to that audio segment are specifically transmitted to the client segment by segment.
As can be seen from the above, in the embodiment of the present disclosure, after the audio segments corresponding to the sound spectrum segments are generated, the sound spectrum and the expression coefficient sequence may be aligned to obtain the target expression coefficient sequence, which ensures the accuracy and the effect of processing the target video at the client side. Because the input of the expression coefficient sequence generation model is the sound spectrum, the prediction efficiency of the model is effectively improved; and because the sound spectrum can describe the global context information of the text to be processed, the accuracy of the generated expression coefficient sequence is better. In addition, when transmitting to the client side (for example, a mobile phone) to render the target video, each audio segment (e.g., 100 ms) and the expression coefficient subsequence corresponding to that audio segment are transmitted to the client segment by segment, which effectively shortens the first-character response time and improves the overall video processing efficiency.
Referring to Tables 1 and 2 below, Table 1 shows the prediction time of the expression coefficient sequence generation model, and Table 2 shows the first-character response time.

TABLE 1

  Text length      Time consumed (related art)    Time consumed (present disclosure)
  20 characters    125.9 ± 3.5 ms                 16.1 ± 0.5 ms

TABLE 2

  Text length      Time consumed (related art)    Time consumed (present disclosure)
  20 characters    1345 ± 41.4 ms                 460.9 ± 43.4 ms
In this embodiment, the sound spectrum can describe the global context information of the text, and the expression coefficient sequence is generated from the sound spectrum, so the accuracy of the expression coefficient sequence is effectively improved. The corresponding audio segments are obtained by segmenting the sound spectrum, and the expression coefficient sequence and the audio segments are used to process the target video, which helps to effectively improve the response efficiency of subsequent video processing. After the audio segments corresponding to the sound spectrum segments are generated, the sound spectrum and the expression coefficient sequence can be aligned to obtain the target expression coefficient sequence, which guarantees the accuracy and the effect of processing the target video at the client side. Because the input of the expression coefficient sequence generation model is the sound spectrum, the prediction efficiency of the model is effectively improved, and because the sound spectrum can describe the global context information of the text to be processed, the accuracy of the generated expression coefficient sequence is better. In addition, when transmitting to the mobile phone side to render the target video, each audio segment (e.g., 100 ms) and the expression coefficient subsequence corresponding to that audio segment are transmitted to the client segment by segment, which effectively shortens the first-character response time and improves the overall video processing efficiency. Using the expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum as the target expression coefficient sequence effectively guarantees the reasonableness of the alignment processing, ensures the effect of the subsequent fusion of the expression coefficient sequence and the plurality of audio segments at the client, and improves the quality of the generated target video.
Fig. 5 is a flow chart illustrating a video processing method according to yet another exemplary embodiment.
The present embodiment is described by taking the case where the video processing method is configured in a video processing apparatus as an example.
In this embodiment, the video processing method may be configured in a video processing apparatus, and the video processing apparatus may be arranged in a client running in an electronic device. A client (Client), also called a user side, refers to a program that corresponds to a server and provides local services for a user; it is usually installed on an ordinary client machine and needs to run in cooperation with the server. The client in the embodiment of the present disclosure may specifically be a client having video processing and playing functions, which is not limited thereto.
It should be noted that the execution body of the embodiment of the present disclosure may be, for example, a Central Processing Unit (CPU) of the electronic device in terms of hardware, or a related background service of the electronic device in terms of software, which is not limited thereto.
As shown in fig. 5, the video processing method includes the steps of:
In step S501, an expression coefficient sequence and a plurality of audio segments are received. The plurality of audio segments are generated from a plurality of sound spectrum segments obtained by segmenting the sound spectrum corresponding to the text to be processed, and the expression coefficient sequence is obtained by performing prediction processing on the sound spectrum. The expression coefficient sequence includes: a plurality of expression coefficients and durations corresponding to the expression coefficients, where the expression coefficients correspond to the sound spectrum segments.
For the explanation and description of the terms in step S501, reference may be made to the above embodiments, and details are not repeated here.
After the server generates the audio segment corresponding to the sound spectrum segment and sends the expression coefficient sequence and the plurality of audio segments to the client, the expression coefficient sequence and the plurality of audio segments can be received by the client.
In some embodiments, when the server transmits data to the client for generating the target video, each audio segment (e.g., 100 ms) and the expression coefficient subsequence corresponding to that audio segment are specifically transmitted to the client segment by segment, so that the expression coefficient sequence and the plurality of audio segments are received by the client segment by segment, which effectively shortens the first-character response time and improves the overall video processing efficiency.
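A minimal sketch of this segment-by-segment consumption on the client, assuming an iterable stream that yields (audio segment, expression coefficient subsequence) pairs as they arrive; the names render_frames and play are illustrative placeholders rather than interfaces defined by this disclosure:

    def consume_stream(stream, render_frames, play):
        # Each segment is rendered and played as soon as it arrives, instead of
        # waiting for the complete expression coefficient sequence and audio,
        # which is what shortens the first-character response time.
        for audio_segment, coeff_subsequence in stream:
            frames = render_frames(coeff_subsequence)
            play(frames, audio_segment)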
In step S502, the expression coefficient sequence and the plurality of audio segments are fused to obtain a target video.
Optionally, in some embodiments, when the expression coefficient sequence and the plurality of audio segments are fused to obtain the target video, the mixed deformations (blend shapes) of the plurality of facial expressions corresponding to the expression coefficients may be obtained, and the expression coefficients are fused with the mixed deformations of the corresponding facial expressions to obtain the facial expression images corresponding to the expression coefficients. The plurality of facial expression images are then combined into a composite video according to a preset frame rate, and the plurality of audio segments are fused with the composite video to obtain the target video, so that both the generation efficiency and the generation quality of the target video are effectively improved.
For example, suppose the client receives an audio segment and an expression coefficient sequence of the form (T_e, n_e), and for each expression coefficient vector of length n_e there are n_e corresponding mixed deformations (blend shapes) of facial expressions. The expression coefficients are then fused with the mixed deformations of the corresponding facial expressions; for example, the n_e expression coefficients are multiplied with the mixed deformations of the n_e facial expressions and the products are summed to obtain the facial expression image corresponding to that expression coefficient vector, so that T_e expression coefficient vectors yield T_e facial expression images. The plurality of facial expression images are then combined into a silent composite video according to the preset frame rate, and the composite video is fused with the plurality of audio segments by using the multimedia processing tool FFmpeg (an open-source set of computer programs that can be used to record and convert digital audio and video and turn them into streams), so as to obtain the target video.
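The weighted fusion described above can be sketched as follows, assuming the client holds the expression coefficients as a NumPy array of shape (T_e, n_e) and the n_e mixed deformations as an array of shape (n_e, H, W); these names and shapes are illustrative assumptions rather than the disclosure's actual data layout.

    import numpy as np

    def render_expression_frames(coeffs, blendshapes):
        # Multiply each of the n_e expression coefficients with its mixed
        # deformation and sum, producing one facial expression image per
        # expression coefficient vector; the result has shape (T_e, H, W).
        return np.einsum('te,ehw->thw', coeffs, blendshapes)

The resulting frames can then be written out at the preset frame rate and muxed with the audio, for example with an FFmpeg invocation along the lines of "ffmpeg -framerate 25 -i frame_%04d.png -i audio.wav -c:v libx264 -c:a aac -shortest output.mp4" (the exact arguments here are illustrative).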
Of course, in practical applications, any other feasible video fusion manner may also be adopted to fuse the expression coefficient sequence and the plurality of audio segments to obtain the target video, which is not limited herein.
In this embodiment, an expression coefficient sequence and a plurality of audio segments are received, where the plurality of audio segments are generated from a plurality of sound spectrum segments obtained by segmenting the sound spectrum corresponding to the text to be processed, the expression coefficient sequence is obtained by performing prediction processing on the sound spectrum, and the expression coefficient sequence includes: a plurality of expression coefficients and durations corresponding to the expression coefficients, where the expression coefficients correspond to the sound spectrum segments. The expression coefficient sequence and the plurality of audio segments are fused to obtain the target video.
Fig. 6 is a block diagram illustrating a video processing apparatus according to an exemplary embodiment.
Referring to fig. 6, the video processing apparatus 60 includes:
an obtaining module 601 configured to perform obtaining a sound spectrum corresponding to a text to be processed;
a segmentation module 602 configured to perform segmentation processing on a sound spectrum to obtain a plurality of sound spectrum segments;
the predicting module 603 is configured to perform prediction processing on the sound spectrum by using the expression coefficient sequence generation model, so as to obtain an expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes: a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments;
a generating module 604 configured to perform generating an audio segment corresponding to a sound spectrum segment;
a sending module 605 configured to execute sending the sequence of expression coefficients and the plurality of audio segments to the client to trigger the client to generate the target video.
In some embodiments of the present disclosure, as shown in fig. 7, fig. 7 is a block diagram of a video processing apparatus according to another exemplary embodiment, the video processing apparatus 60 further comprising:
the training module 606 is configured to perform obtaining of a plurality of sample sound spectrums and labeled expression coefficient sequences corresponding to the sample sound spectrums, input the plurality of sample sound spectrums to the neural network model, obtain predicted sample expression coefficient sequences output by the neural network model, and train the neural network model according to a difference value between the sample expression coefficient sequences and the labeled expression coefficient sequences, so as to obtain an expression coefficient sequence generation model.
In some embodiments of the present disclosure, the segmentation module 602 is configured to perform:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum;
determining the segmented duration according to the time scale information and the preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the time length being the segmentation time length.
In some embodiments of the present disclosure, the segmentation module 602 is configured to perform:
determining the ratio of a preset value to a preset frame rate;
and determining the minimum common divisor of the time scale information and the ratio, and taking the minimum common divisor as the segment time length.
In some embodiments of the present disclosure, as shown in fig. 7, further comprising:
the alignment module 607 is configured to perform alignment processing on the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence, divide the target expression coefficient sequence into a plurality of expression coefficient subsequences, wherein the time length covered by the expression coefficients in the expression coefficient subsequences is the segment time length, and send the audio segment corresponding to the sound spectrum and the expression coefficient subsequences aligned with the sound spectrum to the client, so as to trigger the client to generate the target video by using the expression coefficient subsequences and the audio segment.
In some embodiments of the present disclosure, the alignment module 607 is configured to perform:
acquiring the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum;
when the quantity of time scale information contained in the sound spectrum can be divided by a ratio, processing the sound spectrum into a first target sound spectrum, wherein the first target distribution duration corresponding to the first target sound spectrum characteristic carried by the first target sound spectrum is an integer division value of the distribution duration of the sound spectrum and the ratio;
when the quantity of the time scale information contained in the sound spectrum cannot be divided by the ratio, processing the sound spectrum into a second target sound spectrum, wherein the second target distribution duration corresponding to the second target sound spectrum characteristic carried by the second target sound spectrum is an addition value obtained by adding the integral division value and the reference value, and the integral division value is obtained by integrally dividing the distribution duration of the sound spectrum and the ratio;
and taking the expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum as a target expression coefficient sequence.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In this embodiment, the sound spectrum corresponding to the text to be processed is obtained, the sound spectrum is segmented to obtain a plurality of sound spectrum segments, and the sound spectrum is subjected to prediction processing by using the expression coefficient sequence generation model to obtain the expression coefficient sequence corresponding to the sound spectrum, where the expression coefficient sequence includes a plurality of expression coefficients and the durations corresponding to the expression coefficients, and the expression coefficients correspond to the sound spectrum segments. The audio segments corresponding to the sound spectrum segments are generated, and the expression coefficient sequence and the plurality of audio segments are sent to the client to trigger the client to generate the target video. Because the sound spectrum can describe the global context information of the text to be processed and the expression coefficient sequence is generated from the sound spectrum, the accuracy and the timeliness of the generation of the expression coefficient sequence are effectively improved.
Fig. 8 is a block diagram illustrating a video processing device according to yet another exemplary embodiment.
Referring to fig. 8, the video processing apparatus 80 includes:
the receiving module 801 is configured to perform receiving of an expression coefficient sequence and a plurality of audio segments, where the plurality of audio segments are obtained by segmenting a sound spectrum corresponding to a text to be processed to obtain a plurality of sound spectrum segments, and are generated according to the plurality of sound spectrum segments, the expression coefficient sequence is obtained by performing prediction processing on the sound spectrum, and the expression coefficient sequence includes: a plurality of expression coefficients and the duration corresponding to the expression coefficients, the expression coefficients corresponding to the sound spectrum segments.
And a fusion module 802 configured to perform fusion processing on the expression coefficient sequence and the plurality of audio segments to obtain a target video.
In some embodiments of the present disclosure, the fusion module 802, is configured to perform:
acquiring mixed deformation of a plurality of facial expressions corresponding to the expression coefficients;
fusing the expression coefficients and the mixed deformation of the plurality of corresponding facial expressions to obtain facial expression images corresponding to the expression coefficients;
combining a plurality of facial expression images to synthesize a synthesized video according to a preset frame rate;
and fusing the plurality of audio segments and the synthesized video to obtain the target video.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In this embodiment, an expression coefficient sequence and a plurality of audio segments are received, where the plurality of audio segments are generated from a plurality of sound spectrum segments obtained by segmenting the sound spectrum corresponding to the text to be processed, the expression coefficient sequence is obtained by performing prediction processing on the sound spectrum, and the expression coefficient sequence includes: a plurality of expression coefficients and durations corresponding to the expression coefficients, where the expression coefficients correspond to the sound spectrum segments. The expression coefficient sequence and the plurality of audio segments are fused to obtain the target video.
An electronic device is also provided in the disclosed embodiments, and fig. 9 is a block diagram of an electronic device shown in accordance with an exemplary embodiment. For example, the electronic device 900 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, electronic device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the electronic device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the electronic device 900. Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the electronic device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 900.
The multimedia component 908 includes a touch sensitive display screen that provides an output interface between the electronic device 900 and a user. In some embodiments, the touch display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 904 or transmitted via the communication component 916.
In some embodiments, audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status evaluations of various aspects of the electronic device 900. For example, sensor assembly 914 may detect an open/closed state of electronic device 900, the relative positioning of components, such as a display and keypad of electronic device 900, sensor assembly 914 may also detect a change in position of electronic device 900 or a component of electronic device 900, the presence or absence of user contact with electronic device 900, orientation or acceleration/deceleration of electronic device 900, and a change in temperature of electronic device 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the electronic device 900 and other devices. The electronic device 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described video processing methods.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the electronic device 900 to perform the above-described method is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the video processing method as described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A video processing method, comprising:
acquiring a sound spectrum corresponding to a text to be processed;
carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections;
predicting the sound spectrum by adopting an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, wherein the expression coefficient sequence comprises: a plurality of expression coefficients and duration corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments;
generating an audio segment corresponding to the sound spectrum segment;
sending the expression coefficient sequence and the plurality of audio segments to a client to trigger the client to generate a target video;
the step of performing segmentation processing on the sound spectrum to obtain a plurality of sound spectrum segments includes:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum;
determining the segmented duration according to the time scale information and the preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the segmentation duration.
2. The method of claim 1, wherein the method for training the expression coefficient sequence generation model comprises:
acquiring a plurality of sample sound spectrums and a marked expression coefficient sequence corresponding to the sample sound spectrums;
inputting the plurality of sample sound spectrums into a neural network model to obtain a predicted sample expression coefficient sequence output by the neural network model;
and training the neural network model according to the difference value between the sample expression coefficient sequence and the labeled expression coefficient sequence to obtain the expression coefficient sequence generation model.
3. The method of claim 1, wherein the step of determining the segment duration according to the time scale information and the preset frame rate of the target video comprises:
determining the ratio of a preset value to the preset frame rate;
and determining the minimum common divisor of the time scale information and the ratio, and taking the minimum common divisor as the segment time length.
4. The method of claim 1, after said generating the audio segment corresponding to the sound spectrum segment, further comprising:
aligning the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence;
dividing the target expression coefficient sequence into a plurality of expression coefficient subsequences, wherein the time length covered by the expression coefficients in the expression coefficient subsequences is the segmentation time length;
and sending the audio segment corresponding to the sound spectrum and the expression coefficient subsequence aligned with the sound spectrum to the client so as to trigger the client to generate the target video by adopting the expression coefficient subsequence and the audio segment.
5. The method according to claim 4, wherein the aligning the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence comprises:
acquiring the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum;
when the number of the time scale information contained in the sound spectrum can be evenly divided by the ratio, processing the sound spectrum into a first target sound spectrum, wherein the first target distribution duration corresponding to a first target sound spectrum characteristic carried by the first target sound spectrum is an integral division value of the distribution duration of the sound spectrum and the ratio;
when the number of the time scale information contained in the sound spectrum cannot be divided by the ratio, processing the sound spectrum into a second target sound spectrum, wherein a second target distribution duration corresponding to a second target sound spectrum characteristic carried by the second target sound spectrum is an addition value obtained by adding the division value and a reference value, and the division value is obtained by dividing the distribution duration of the sound spectrum and the ratio;
and taking the expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum as the target expression coefficient sequence.
6. A video processing method, comprising:
receiving an expression coefficient sequence and a plurality of audio segments, wherein the plurality of audio segments are generated by segmenting a sound spectrum corresponding to a text to be processed to obtain a plurality of sound spectrum segments and predicting the sound spectrum according to the plurality of sound spectrum segments, and the expression coefficient sequence comprises: a plurality of expression coefficients and duration corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments;
performing fusion processing on the expression coefficient sequence and the plurality of audio segments to obtain a target video;
the segmenting the sound spectrum corresponding to the text to be processed to obtain a plurality of sound spectrum segments comprises:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum; determining the segmented duration according to the time scale information and the preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the segmentation duration.
7. The method of claim 6, wherein the fusing the sequence of expression coefficients and the plurality of audio segments to obtain the target video comprises:
acquiring mixed deformation of a plurality of facial expressions corresponding to the expression coefficients;
fusing the expression coefficients and the mixed deformation corresponding to the plurality of facial expressions to obtain facial expression images corresponding to the expression coefficients;
combining the plurality of facial expression images to obtain a composite video according to a preset frame rate;
and fusing the plurality of audio segments and the synthesized video to obtain the target video.
8. A video processing apparatus, comprising:
the acquisition module is configured to acquire a sound spectrum corresponding to the text to be processed;
a segmentation module configured to perform segmentation processing on the sound spectrum to obtain a plurality of sound spectrum segments;
the prediction module is configured to perform prediction processing on the sound spectrum by using an expression coefficient sequence generation model to obtain an expression coefficient sequence corresponding to the sound spectrum, and the expression coefficient sequence includes: a plurality of expression coefficients and duration corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments;
a generation module configured to perform generating an audio segment corresponding to the sound spectrum segment;
a sending module configured to execute sending the expression coefficient sequence and the plurality of audio segments to a client to trigger the client to generate a target video;
the segmentation module configured to perform:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum;
determining the segmented duration according to the time scale information and the preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the time length being the segmentation time length.
9. The apparatus of claim 8, further comprising:
the training module is configured to acquire a plurality of sample sound spectrums and labeled expression coefficient sequences corresponding to the sample sound spectrums, input the sample sound spectrums into a neural network model to obtain predicted sample expression coefficient sequences output by the neural network model, and train the neural network model according to differences between the sample expression coefficient sequences and the labeled expression coefficient sequences to obtain an expression coefficient sequence generation model.
10. The apparatus of claim 8, wherein the segmentation module is configured to perform:
determining the ratio of a preset value to the preset frame rate;
and determining the minimum common divisor of the time scale information and the ratio, and taking the minimum common divisor as the segment time length.
11. The apparatus of claim 8, further comprising:
and the aligning module is configured to perform aligning processing on the sound spectrum and the expression coefficient sequence to obtain a target expression coefficient sequence, divide the target expression coefficient sequence into a plurality of expression coefficient subsequences, wherein the time length covered by the expression coefficients in the expression coefficient subsequences is the segmentation time length, and send an audio segment corresponding to the sound spectrum and the expression coefficient subsequences aligned with the sound spectrum to the client so as to trigger the client to generate the target video by adopting the expression coefficient subsequences and the audio segment.
12. The apparatus of claim 11, wherein the alignment module is configured to perform:
acquiring the ratio of the duration of the expression coefficient sequence to the time scale information of the sound spectrum;
when the number of the time scale information contained in the sound spectrum can be evenly divided by the ratio, processing the sound spectrum into a first target sound spectrum, wherein the first target distribution duration corresponding to a first target sound spectrum characteristic carried by the first target sound spectrum is an integral division value of the distribution duration of the sound spectrum and the ratio;
when the number of the time scale information contained in the sound spectrum cannot be divided by the ratio, processing the sound spectrum into a second target sound spectrum, wherein a second target distribution duration corresponding to a second target sound spectrum characteristic carried by the second target sound spectrum is an addition value obtained by adding the division value and a reference value, and the division value is obtained by dividing the distribution duration of the sound spectrum and the ratio;
and taking the expression coefficient sequence aligned with the first target sound spectrum or the second target sound spectrum as the target expression coefficient sequence.
13. A video processing apparatus, comprising:
the receiving module is configured to perform receiving of an expression coefficient sequence and a plurality of audio segments, wherein the plurality of audio segments are generated by segmenting a sound spectrum corresponding to a text to be processed to obtain a plurality of sound spectrum segments and performing prediction processing on the sound spectrum according to the plurality of sound spectrum segments, and the expression coefficient sequence comprises: a plurality of expression coefficients and time lengths corresponding to the expression coefficients, wherein the expression coefficients correspond to the sound spectrum segments;
the fusion module is configured to perform fusion processing on the expression coefficient sequence and the plurality of audio segments to obtain a target video;
the segmenting the sound spectrum corresponding to the text to be processed to obtain a plurality of sound spectrum segments comprises:
determining time scale information of the sound spectrum, wherein the time scale information is used for describing distribution duration corresponding to sound spectrum characteristics carried by the sound spectrum; determining a segmented duration according to the time scale information and a preset frame rate of the target video;
and carrying out segmentation processing on the sound spectrum to obtain a plurality of sound spectrum sections with the segmentation duration.
14. The apparatus of claim 13, wherein the fusion module is configured to perform:
acquiring mixed deformation of a plurality of facial expressions corresponding to the expression coefficients;
fusing the expression coefficients and the mixed deformation corresponding to the plurality of facial expressions to obtain facial expression images corresponding to the expression coefficients;
combining the plurality of facial expression images to obtain a composite video according to a preset frame rate;
and fusing the plurality of audio segments and the synthesized video to obtain the target video.
15. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1-7.
16. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-7.
CN202110296780.9A 2021-03-19 2021-03-19 Video processing method and device, electronic equipment and storage medium Active CN113115104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110296780.9A CN113115104B (en) 2021-03-19 2021-03-19 Video processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110296780.9A CN113115104B (en) 2021-03-19 2021-03-19 Video processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113115104A CN113115104A (en) 2021-07-13
CN113115104B true CN113115104B (en) 2023-04-07

Family

ID=76712058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110296780.9A Active CN113115104B (en) 2021-03-19 2021-03-19 Video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113115104B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843798A (en) * 2023-07-03 2023-10-03 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379409A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103650002B (en) * 2011-05-06 2018-02-23 西尔股份有限公司 Text based video generates
CN110941954B (en) * 2019-12-04 2021-03-23 深圳追一科技有限公司 Text broadcasting method and device, electronic equipment and storage medium
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379409A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing

Also Published As

Publication number Publication date
CN113115104A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107644646B (en) Voice processing method and device for voice processing
CN110097890B (en) Voice processing method and device for voice processing
CN110210310B (en) Video processing method and device for video processing
CN107291704B (en) Processing method and device for processing
CN108628813B (en) Processing method and device for processing
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN113099297B (en) Method and device for generating click video, electronic equipment and storage medium
CN108364635B (en) Voice recognition method and device
CN110730360A (en) Video uploading and playing methods and devices, client equipment and storage medium
CN113362812A (en) Voice recognition method and device and electronic equipment
CN114429611B (en) Video synthesis method and device, electronic equipment and storage medium
CN111160047A (en) Data processing method and device and data processing device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN110349577B (en) Man-machine interaction method and device, storage medium and electronic equipment
CN113115104B (en) Video processing method and device, electronic equipment and storage medium
CN113657101A (en) Data processing method and device and data processing device
CN112035651A (en) Sentence completion method and device and computer-readable storage medium
CN113923517B (en) Background music generation method and device and electronic equipment
CN114550691A (en) Multi-tone word disambiguation method and device, electronic equipment and readable storage medium
CN112951202B (en) Speech synthesis method, apparatus, electronic device and program product
CN110931013B (en) Voice data processing method and device
CN113709548A (en) Image-based multimedia data synthesis method, device, equipment and storage medium
CN113409766A (en) Recognition method, device for recognition and voice synthesis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant