CN113379875B - Cartoon character animation generation method, device, equipment and storage medium

Info

Publication number
CN113379875B
CN113379875B
Authority
CN
China
Prior art keywords
music
data
vector features
cartoon character
features
Prior art date
Legal status
Active
Application number
CN202110301883.XA
Other languages
Chinese (zh)
Other versions
CN113379875A (en)
Inventor
陈聪
侯翠琴
李剑锋
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110301883.XA
Publication of CN113379875A
Application granted
Publication of CN113379875B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the field of artificial intelligence, and discloses a method, a device, equipment and a storage medium for generating cartoon character animation, which are used for improving the correlation between music cartoon character animation and a music scene. The cartoon character animation generating method comprises the following steps: encoding the music text data in the music parameter data to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model; weighting micro-expression vector features, gesture vector features and limb motion vector features in basic vector features of music character image data through a neural network self-attention mechanism to generate a basic cartoon character image; respectively generating a target cartoon character image and target music voice based on a preset time sequence neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation. The invention also relates to blockchain technology, and music parameter data can be stored in the blockchain.

Description

Cartoon character animation generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for generating cartoon character animation.
Background
As material needs are increasingly satisfied, more and more people pursue spiritual fulfillment, and music culture, with its long history, fills this need. From the earliest chanted 'poems' to today's popular music, music as a form of expression directly conveys the thoughts and emotions of the musician. With the progress of technology and the times, the ways of popularizing and spreading music culture have become increasingly technology-driven, and one of the main ways is to spread music culture through cartoon character animation.
In music cartoon animation production, key frames of specified actions are usually drawn directly from the original pictures of existing music cartoon characters, and transition frames are then inserted by hand according to the differences between adjacent key frames to generate the corresponding music cartoon animation. However, the music cartoon character animation generated in this way has low correlation with the music scene.
Disclosure of Invention
The invention provides a cartoon character animation generation method, device, equipment and storage medium, which are used for improving the correlation between a music cartoon character animation and a music scene.
The first aspect of the present invention provides a method for generating cartoon character animation, comprising: acquiring music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model; extracting basic vector features of cartoon characters corresponding to music character image data in the music parameter data from a preset cartoon character generation model, carrying out weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features; inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and target music voice respectively based on the preset time sequence neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation.
Optionally, in a first implementation manner of the first aspect of the present invention, the obtaining music parameter data, encoding music text data in the music parameter data using a preset unicode character table to obtain music content data, and converting the music content data into music voice data using a voice generation model includes: acquiring music text data in the music parameter data, and extracting text characters in the music text data; searching standard characters identical to the text characters in a preset unicode character table, taking the byte codes corresponding to the standard characters as code data of the corresponding text characters, and determining the code data corresponding to the text characters in the music text data as music content data, wherein each standard character corresponds to one byte code; and converting the music content data into music voice data by adopting a voice generation model.
Optionally, in a second implementation manner of the first aspect of the present invention, the converting the music content data into music speech data using a speech generation model includes: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a voice generation model; segmenting the phoneme information by using a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by using an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting the phoneme duration of the aligned phonemes through the duration prediction model to obtain a predicted duration; and inputting the phoneme information and the predicted duration into an acoustic model in the voice generation model, generating a sound waveform corresponding to each text character, and splicing the plurality of sound waveforms to obtain music voice data.
Optionally, in a third implementation manner of the first aspect of the present invention, the extracting, in the preset cartoon character generating model, a basic vector feature of a cartoon character corresponding to the music character image data in the music parameter data, weighting, by a neural network self-attention mechanism, a micro-expression vector feature, a gesture vector feature and a limb motion vector feature in the basic vector feature, and calculating a summary vector feature of the basic vector feature, where generating a basic cartoon character image according to the summary vector feature includes: inputting the music character image data in the music parameter data into a preset cartoon character generation model, and extracting basic vector features in the music character image data from the preset cartoon character generation model, wherein the basic vector features at least comprise micro expression vector features, gesture vector features and limb motion vector features of the cartoon character; calculating the attention distribution of the basic vector features through a neural network self-attention mechanism in the preset cartoon character generation model; under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector feature, the gesture vector feature and the limb motion vector feature, summarizing the attention distribution of the basic vector feature by utilizing a summarization formula to obtain a summarization vector feature, wherein the summarization formula is as follows:
$s = \lambda_{1}\alpha_{1}x_{1} + \lambda_{2}\alpha_{2}x_{2} + \lambda_{3}\alpha_{3}x_{3} + \sum_{i=4}^{N} \lambda_{i}\alpha_{i}x_{i}$

wherein $s$ represents the summary vector feature; $\alpha_{1}$ represents the attention distribution value corresponding to the micro-expression vector feature, $\lambda_{1}$ the weighted attention distribution value corresponding to the micro-expression vector feature, and $x_{1}$ the micro-expression vector feature; $\alpha_{2}$, $\lambda_{2}$ and $x_{2}$ represent the corresponding values for the gesture vector feature; $\alpha_{3}$, $\lambda_{3}$ and $x_{3}$ represent the corresponding values for the limb motion vector feature; $\alpha_{i}$, $\lambda_{i}$ and $x_{i}$ represent the attention distribution value, the weighted attention distribution value and the feature of the $i$-th remaining vector feature, the remaining vector features being the basic vector features other than the micro-expression vector feature, the gesture vector feature and the limb motion vector feature; and $N$ is the total number of basic vector features; calculating a loss function value of the summary vector feature by adopting a cross entropy loss function, adjusting the summary vector feature through the loss function value, and generating a corresponding basic cartoon character image by utilizing the adjusted summary vector feature.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the calculating, by a neural network self-attention mechanism in the preset cartoon character generating model, an attention distribution of the basis vector feature includes: acquiring query vector features in the music character image data, wherein the query vector features are used for representing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
$\alpha_{i} = \mathrm{softmax}\left(s(x_{i}, q)\right) = \frac{\exp\left(s(x_{i}, q)\right)}{\sum_{j=1}^{N} \exp\left(s(x_{j}, q)\right)}$

wherein $\alpha_{i}$ represents the attention distribution value corresponding to the $i$-th basic vector feature, $\alpha_{i} \in [0, 1]$; $s(\cdot)$ represents the attention scoring function; $x_{i}$ represents the $i$-th basic vector feature; $x_{j}$ represents the $j$-th basic vector feature; $q$ represents the query vector; and $N$ is a positive integer.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the inputting the basic cartoon character image and the music voice data into a preset time-sequence neural network, and generating the target cartoon character image and the target music voice based on the preset time-sequence neural network respectively includes: respectively sequencing the basic cartoon character image and the music voice data according to a preset input time sequence, and integrating the sequenced basic cartoon character image and the music voice data into data to be predicted; acquiring data to be predicted at the previous moment and data to be predicted at the current moment, inputting the data to be predicted at the previous moment and the data to be predicted at the current moment into a hidden layer of a preset time sequence neural network, and performing convolution iterative calculation on the data to be predicted at the next moment through the hidden layer, the data to be predicted at the previous moment and the data to be predicted at the current moment; and merging the data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises target cartoon character images and target music voices.
Optionally, in a sixth implementation manner of the first aspect of the present invention, before the obtaining music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by using a voice generation model, the method for generating cartoon character animation further includes: and acquiring the music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon character generation model.
The second aspect of the present invention provides a cartoon character animation generating device, comprising: an acquisition module, configured to acquire music parameter data, encode music text data in the music parameter data by utilizing a preset unicode character table to obtain music content data, and convert the music content data into music speech data by adopting a speech generation model; a computing module, configured to extract basic vector features of cartoon characters corresponding to the music character image data in the music parameter data from a preset cartoon character generation model, carry out weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, compute summary vector features of the basic vector features, and generate a basic cartoon character image according to the summary vector features; a prediction module, configured to input the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generate a target cartoon character image and a target music voice respectively based on the preset time sequence neural network; and a combination module, configured to combine the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation.
Optionally, in a first implementation manner of the second aspect of the present invention, the acquiring module includes: an extracting unit, configured to obtain music text data in music parameter data, and extract text characters in the music text data; a determining unit, configured to find a standard character identical to the text character in a preset unicode character table, take a byte code corresponding to the standard character as code data of a corresponding text character, and determine code data corresponding to the text character in the music text data as music content data, where each standard character corresponds to one byte code; and the conversion unit is used for converting the music content data into music voice data by adopting a voice generation model.
Optionally, in a second implementation manner of the second aspect of the present invention, the conversion unit is specifically configured to: convert each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a voice generation model; segment the phoneme information by using a segmentation function in the voice generation model to obtain segmented phonemes, and align the segmented phonemes by using an alignment function in the voice generation model to obtain aligned phonemes; input the aligned phonemes into a duration prediction model in the speech generation model, and predict the phoneme duration of the aligned phonemes through the duration prediction model to obtain a predicted duration; and input the phoneme information and the predicted duration into an acoustic model in the voice generation model, generate a sound waveform corresponding to each text character, and splice the plurality of sound waveforms to obtain music voice data.
Optionally, in a third implementation manner of the second aspect of the present invention, the calculating module includes: the input unit is used for inputting the music character image data in the music parameter data into a preset cartoon character generation model, extracting basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro expression vector features, gesture vector features and limb motion vector features of the cartoon character; a calculating unit, configured to calculate an attention distribution of the basis vector feature through a neural network self-attention mechanism in the preset cartoon character generation model; and the summarizing unit is used for summarizing the attention distribution of the basic vector feature by utilizing a summarizing formula under the condition of increasing the weights occupied by the attention distribution of the micro-expression vector feature, the gesture vector feature and the limb motion vector feature to obtain a summarizing vector feature, wherein the summarizing formula is as follows:
$s = \lambda_{1}\alpha_{1}x_{1} + \lambda_{2}\alpha_{2}x_{2} + \lambda_{3}\alpha_{3}x_{3} + \sum_{i=4}^{N} \lambda_{i}\alpha_{i}x_{i}$

wherein $s$ represents the summary vector feature; $\alpha_{1}$, $\lambda_{1}$ and $x_{1}$ represent the attention distribution value, the weighted attention distribution value and the feature corresponding to the micro-expression vector feature; $\alpha_{2}$, $\lambda_{2}$ and $x_{2}$ represent the corresponding values for the gesture vector feature; $\alpha_{3}$, $\lambda_{3}$ and $x_{3}$ represent the corresponding values for the limb motion vector feature; $\alpha_{i}$, $\lambda_{i}$ and $x_{i}$ represent the attention distribution value, the weighted attention distribution value and the feature of the $i$-th remaining vector feature, the remaining vector features being the basic vector features other than the micro-expression vector feature, the gesture vector feature and the limb motion vector feature; and $N$ is the total number of basic vector features; and the adjusting unit is used for calculating a loss function value of the summarized vector feature by adopting a cross entropy loss function, adjusting the summarized vector feature through the loss function value, and generating a corresponding basic cartoon character image by utilizing the adjusted summarized vector feature.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the calculating unit is specifically configured to: acquiring query vector features in the music character image data, wherein the query vector features are used for representing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
$\alpha_{i} = \mathrm{softmax}\left(s(x_{i}, q)\right) = \frac{\exp\left(s(x_{i}, q)\right)}{\sum_{j=1}^{N} \exp\left(s(x_{j}, q)\right)}$

wherein $\alpha_{i}$ represents the attention distribution value corresponding to the $i$-th basic vector feature, $\alpha_{i} \in [0, 1]$; $s(\cdot)$ represents the attention scoring function; $x_{i}$ represents the $i$-th basic vector feature; $x_{j}$ represents the $j$-th basic vector feature; $q$ represents the query vector; and $N$ is a positive integer.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the prediction module is specifically configured to: respectively sequencing the basic cartoon character image and the music voice data according to a preset input time sequence, and integrating the sequenced basic cartoon character image and the music voice data into data to be predicted; acquiring data to be predicted at the previous moment and data to be predicted at the current moment, inputting the data to be predicted at the previous moment and the data to be predicted at the current moment into a hidden layer of a preset time sequence neural network, and performing convolution iterative calculation on the data to be predicted at the next moment through the hidden layer, the data to be predicted at the previous moment and the data to be predicted at the current moment; and merging the data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises target cartoon character images and target music voices.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the generating device of the cartoon character animation further includes: and the generating module is used for acquiring the music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon character generating model.
A third aspect of the present invention provides a cartoon character animation generating apparatus, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the cartoon character animation generating device to perform the cartoon character animation generating method described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of generating cartoon character animation described above.
In the technical scheme provided by the invention, music parameter data are obtained, a preset unicode character table is utilized to encode music text data in the music parameter data to obtain music content data, and a voice generation model is adopted to convert the music content data into music voice data; extracting basic vector features of cartoon roles corresponding to music role image data in the music parameter data from the preset cartoon role generation model, carrying out weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon role image according to the summary vector features; inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and target music voice respectively based on the preset time sequence neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation. According to the embodiment of the invention, the music parameter data are encoded and converted to generate the music content data and the music voice data, the micro-expression vector features, the gesture vector features and the limb motion vector features in the music parameter data are weighted by utilizing a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Drawings
FIG. 1 is a schematic diagram of one embodiment of a method for generating cartoon character animation in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a method for generating cartoon character animation in accordance with an embodiment of the invention;
FIG. 3 is a schematic diagram of an embodiment of a cartoon character animation generation apparatus in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a cartoon character animation generation apparatus in accordance with an embodiment of the invention;
FIG. 5 is a schematic diagram of an embodiment of a cartoon character animation generation apparatus in accordance with an embodiment of the invention.
Detailed Description
The embodiment of the invention provides a cartoon character animation generation method, device, equipment and storage medium, which are used for improving the correlation between a music cartoon character animation and a music scene.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present application is described below with reference to fig. 1, where one embodiment of a method for generating cartoon character animation in an embodiment of the present application includes:
101. the method comprises the steps of obtaining music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
it will be appreciated that the execution subject of the present application may be a cartoon character animation generation device, or may be a terminal or a server, and is not limited herein. The embodiment of the application is described by taking a server as an execution main body as an example.
The music parameter data acquired by the server specifically includes two types of data:
1. Music text data: specifically, data related to music whose content type is text.
2. Music character image data: specifically, data related to music whose content type is an image. The format of the music character image may be JPEG, TIFF, RAW or the like; the present application does not limit the format of the music character image.
After obtaining the music parameter data, the server needs to encode the music text data in the music parameter data by using a preset unicode character table, converting the music text data into characters that a computer can identify. The preset unicode character table is the character encoding table corresponding to Unicode, which sets a unified and unique binary code for each character in each language so as to meet the requirements of text conversion and processing across languages and platforms.
It should be noted that, after the server obtains the music content data, the music content data is converted into music voice data by using a voice generation model, where the voice generation model refers to text to speech (TTS), a technology capable of converting any input text into corresponding speech. The speech generation model mainly includes a front end and a back end. In the present application, the front end mainly analyzes the input music text data and extracts the information required for back-end modeling, for example: word segmentation, part-of-speech tagging, prosodic structure prediction and polyphonic word disambiguation of the music text data. After the front end has analyzed the music text data, the back end reads in the front end's analysis result, models the voice part in combination with that result, and during synthesis generates the output speech signal using the music text data and a pre-trained acoustic model.
It should be emphasized that, to further ensure the privacy and security of the music parameter data, the music parameter data may also be stored in a node of a blockchain.
102. Extracting basic vector features of cartoon characters corresponding to music character image data in music parameter data from a preset cartoon character generation model, carrying out weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
After the server obtains the music content data and the music voice data, the server needs to process the music character image data in the music parameter data, a preset cartoon character generation model is utilized, basic vector features in the music character image data are extracted from the preset cartoon character generation model, the attention distribution of the basic vector features is calculated through a neural network self-attention mechanism, the micro expression vector features, gesture vector features and limb motion vector features in the basic vector features are weighted in the calculating process, so that summarized vector features of the basic vector features are obtained through calculation, and finally the server generates a basic cartoon character image according to the summarized vector features.
It should be noted that, the basic vector features herein refer to pixel vector features in the image data of a music character, where a plurality of basic vector features exist in one image data of a music character, and when the server calculates the attention distribution by using the neural network self-attention mechanism, the purpose of weighting the micro-expression vector features, the gesture vector features and the limb motion vector features is to specifically analyze the cartoon character, so that the correlation between the basic cartoon character image obtained by calculation and the music scene is tighter.
103. Respectively inputting the basic cartoon character image and the music voice data into a preset time sequence neural network, and respectively generating a target cartoon character image and a target music voice based on the preset time sequence neural network;
The basic cartoon character image and the music voice data obtained by the server at this point carry no time ordering, so the server needs to use a preset time sequence neural network to generate a target cartoon character image and target music voice arranged in a certain time order. The preset time sequence neural network refers to a recurrent neural network (RNN), a neural network for processing time-series input. The time-series data input into the recurrent neural network vary in length and their contexts are related; convolution calculation is carried out on the input data through a plurality of hidden layers in the recurrent neural network, and the convolved data is finally output through an output layer, so that data arranged in a certain time order can be generated.
104. And combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation.
After the server acquires the target cartoon character image and the target music voice arranged in time order, it combines the music content data, the target cartoon character image and the target music voice together to obtain the music cartoon character animation. During playback of the music cartoon character animation, as each target cartoon character image is shown, the corresponding music content data is displayed and the corresponding target music voice is played at the same time.
According to the embodiment of the invention, the music parameter data are encoded and converted to generate the music content data and the music voice data, the micro-expression vector features, the gesture vector features and the limb motion vector features in the music parameter data are weighted by utilizing a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Referring to fig. 2, another embodiment of the method for generating cartoon character animation according to an embodiment of the present invention includes:
201. acquiring music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon character generation model;
The server needs to collect a large amount of music character animation data before processing the music parameter data, train the large amount of music character animation data, and generate a preset cartoon character generation model. The music character animation data includes at least music-themed animations such as 'Symphony Orchestra', '2000' and 'Golden Strings'.
When training the large amount of music character animation data, the method adopted is a neural network self-attention mechanism. The training produces the preset cartoon character generation model, which can generate a corresponding cartoon character image from the animation or image input into the model. The training process on the music character animation data is the same as that of step 203, so details are not repeated here.
202. The method comprises the steps of obtaining music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
it should be emphasized that, to further ensure the privacy and security of the music parameter data, the music parameter data may also be stored in a node of a blockchain.
It should be noted that, here, the preset unicode character table is used to record the byte codes corresponding to the standard characters, for example: the byte code corresponding to the standard character 'A' is '0x0041', and the byte code corresponding to the standard character '叶' (leaf) is '0x53F6'. The standard character identical to a text character in the music text data can therefore be looked up in the preset unicode character table, and after the server finds the standard character, the code data corresponding to the text character can be determined from the preset unicode character table, thereby converting the text characters in the music text data into a language the computer can read and write.
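The lookup described above behaves like the standard Unicode code-point mapping. The following minimal Python sketch illustrates the idea; the function name and the use of Python's built-in ord() as a stand-in for the preset unicode character table are illustrative assumptions, not the patented implementation.

    def encode_music_text(music_text: str) -> list[str]:
        """Map each text character to the byte code of its standard character.

        ord() is used here as a stand-in for the preset unicode character
        table: it returns the unique code point recorded for each character.
        """
        return [f"0x{ord(ch):04X}" for ch in music_text]

    # 'A' -> 0x0041 and '叶' -> 0x53F6, matching the byte codes cited above.
    print(encode_music_text("A叶"))  # ['0x0041', '0x53F6']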
The music content data is converted into music voice data by adopting speech synthesis technology; the speech synthesis is divided into 4 stages, with the specific steps as follows:
1. text-to-phoneme
The server inputs the music content data into the speech generation model, but because different languages exhibit the phenomenon of 'same character, different sound', each text character in the music content data needs to be converted into corresponding phoneme information by using a phonetic notation algorithm; that is, for Chinese text characters, the Chinese characters are converted into pinyin.
2. Audio segmentation
After the server obtains the phoneme information, it needs to segment the phoneme information with a segmentation function to determine the boundaries within the phoneme information, i.e., which phonemes together form the complete phonetic transcription of a character. Once the boundaries are determined, the server processes the segmented phonemes with an alignment function to obtain aligned phonemes, which facilitates the subsequent prediction of phoneme duration.
3. Phoneme duration prediction
The server inputs the aligned phonemes into the duration prediction model, which outputs the predicted duration corresponding to the aligned phonemes; the server computes the predicted duration to facilitate the subsequent generation of sound waveforms.
4. Acoustic model
The server inputs the phoneme information with predicted duration into an acoustic model, the acoustic model is equivalent to a vocoder and is used for converting the input phoneme information into corresponding sound waveforms, so that the sound waveforms corresponding to each text character can be obtained, and a plurality of sound waveforms are spliced together, so that music voice data can be obtained. Here, there are further improvements to the acoustic model, such as: increasing the number of network layers, increasing the number of residual channels, replacing up-sampling convolution with matrix multiplication, optimizing the CPU, optimizing the GPU, etc.
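To make the four-stage flow above concrete, the following Python sketch wires the stages together. Every component name (text_to_phonemes, duration_model, acoustic_model and so on) is a hypothetical placeholder for the phonetic notation algorithm, segmentation and alignment functions, duration prediction model and acoustic model named in the text, not a real library API.

    def synthesize(music_content,
                   text_to_phonemes,   # stage 1: phonetic notation algorithm
                   segment_phonemes,   # stage 2a: segmentation function
                   align_phonemes,     # stage 2b: alignment function
                   duration_model,     # stage 3: phoneme duration predictor
                   acoustic_model):    # stage 4: vocoder-like acoustic model
        """Sketch of the four-stage speech generation flow described above."""
        # 1. Text to phoneme: e.g. Chinese characters converted to pinyin.
        phonemes = text_to_phonemes(music_content)
        # 2. Audio segmentation: determine which phonemes form a complete
        #    character transcription, then align them.
        aligned = align_phonemes(segment_phonemes(phonemes))
        # 3. Phoneme duration prediction for each aligned phoneme.
        durations = [duration_model.predict(p) for p in aligned]
        # 4. Acoustic model: one sound waveform per text character, spliced.
        waveforms = [acoustic_model.generate(p, d)
                     for p, d in zip(aligned, durations)]
        return b"".join(waveforms)  # the music voice data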
203. Extracting basic vector features of cartoon characters corresponding to music character image data in music parameter data from a preset cartoon character generation model, carrying out weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
specifically, inputting the music character image data in the music parameter data into a preset cartoon character generation model, and extracting basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; calculating the attention distribution of the basic vector features through a neural network self-attention mechanism in a preset cartoon character generation model; under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, summarizing the attention distribution of the basic vector features by utilizing a summarization formula to obtain summarized vector features, wherein the summarization formula is as follows:
$s = \lambda_{1}\alpha_{1}x_{1} + \lambda_{2}\alpha_{2}x_{2} + \lambda_{3}\alpha_{3}x_{3} + \sum_{i=4}^{N} \lambda_{i}\alpha_{i}x_{i}$

wherein $s$ represents the summary vector feature; $\alpha_{1}$, $\lambda_{1}$ and $x_{1}$ represent the attention distribution value, the weighted attention distribution value and the feature corresponding to the micro-expression vector feature; $\alpha_{2}$, $\lambda_{2}$ and $x_{2}$ represent the corresponding values for the gesture vector feature; $\alpha_{3}$, $\lambda_{3}$ and $x_{3}$ represent the corresponding values for the limb motion vector feature; $\alpha_{i}$, $\lambda_{i}$ and $x_{i}$ represent the attention distribution value, the weighted attention distribution value and the feature of the $i$-th remaining vector feature, the remaining vector features being the basic vector features other than the micro-expression vector feature, the gesture vector feature and the limb motion vector feature; and $N$ is the total number of basic vector features; and calculating a loss function value of the summary vector feature by adopting a cross entropy loss function, adjusting the summary vector feature by the loss function value, and generating a corresponding basic cartoon character image by utilizing the adjusted summary vector feature.
The server calculates the attention distribution of the basic vector features through a neural network self-attention mechanism in a preset cartoon character generation model as follows: the server acquires query vector features in the music character image data, wherein the query vector features are used for representing basic vector features related to cartoon characters in the music character image; the server calculates the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in a preset cartoon character generation model, wherein the calculation formula is as follows:
$\alpha_{i} = \mathrm{softmax}\left(s(x_{i}, q)\right) = \frac{\exp\left(s(x_{i}, q)\right)}{\sum_{j=1}^{N} \exp\left(s(x_{j}, q)\right)}$

wherein $\alpha_{i}$ represents the attention distribution value corresponding to the $i$-th basic vector feature, $\alpha_{i} \in [0, 1]$; $s(\cdot)$ represents the attention scoring function; $x_{i}$ represents the $i$-th basic vector feature; $x_{j}$ represents the $j$-th basic vector feature; $q$ represents the query vector; and $N$ is a positive integer.
Here, the query vector feature in the music character image data is used to indicate information related to a query task, for example, in the present application, a query task refers to generating cartoon characters from the music character image data, that is, the query vector feature should be a vector feature related to the cartoon characters in the music character image data.
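Read together, the two formulas amount to a softmax attention followed by a re-weighted sum. A minimal NumPy sketch, assuming the basic vector features are stacked row-wise and that rows 0-2 hold the micro-expression, gesture and limb motion vector features (the boost factor on their weights is an illustrative assumption):

    import numpy as np

    def attention_summary(X: np.ndarray, q: np.ndarray,
                          boosted=(0, 1, 2), boost: float = 2.0) -> np.ndarray:
        """Summary vector feature s = sum_i lambda_i * alpha_i * x_i.

        X: (N, d) basic vector features x_1..x_N; rows 0..2 are assumed to
           be the micro-expression, gesture and limb motion vector features.
        q: (d,) query vector.
        """
        scores = X @ q                       # dot-product scoring s(x_i, q)
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                 # attention distribution (softmax)
        lam = np.ones(len(X))
        lam[list(boosted)] = boost           # increase the three key weights
        return (lam * alpha) @ X             # summary vector feature s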
Further, in the present application, the attention scoring function is a dot product model, i.e. $s(x_{i}, q) = x_{i}^{\top} q$. In addition, the attention scoring function may also be:
1. bilinear model:
$s(x_{i}, q) = x_{i}^{\top} W q$, wherein $s(\cdot)$ represents the attention scoring function, $x_{i}$ represents the $i$-th basic vector feature, $q$ represents the query vector, $W$ represents a learnable parameter matrix, and $i$ is a positive integer.
2. Scaled dot product model:
$s(x_{i}, q) = \frac{x_{i}^{\top} q}{\sqrt{d}}$
wherein $s(\cdot)$ represents the attention scoring function, $x_{i}$ represents the $i$-th basic vector feature, $q$ represents the query vector, $d$ represents the dimension of the basic vector feature, and $i$ is a positive integer.
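For comparison, the three scoring functions can be sketched side by side (NumPy, illustrative only):

    import numpy as np

    def dot_product_score(x_i, q):
        # s(x_i, q) = x_i^T q  (the model adopted in the present application)
        return x_i @ q

    def bilinear_score(x_i, q, W):
        # s(x_i, q) = x_i^T W q, with W a learnable parameter matrix
        return x_i @ W @ q

    def scaled_dot_product_score(x_i, q):
        # s(x_i, q) = x_i^T q / sqrt(d), d = basic vector feature dimension
        return (x_i @ q) / np.sqrt(x_i.shape[-1])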
204. Respectively inputting the basic cartoon character image and the music voice data into a preset time sequence neural network, and respectively generating a target cartoon character image and a target music voice based on the preset time sequence neural network;
Because the generated basic cartoon character image and music voice data are produced frame by frame and carry no corresponding time sequence, a coherent animation cannot be generated from them directly, so the server performs time-sequence processing on the basic cartoon character image and the music voice data by using the preset time sequence neural network. The specific process of the time-sequence processing is as follows:
input layer: carrying out convolution calculation on the data to be predicted at the previous moment and the current data to be predicted, and inputting the obtained first convolution result into the first hidden layer;
first hidden layer: carrying out convolution calculation on the first convolution results at the previous and next moments (skipping the current first convolution result in between), and inputting the obtained second convolution result into the second hidden layer;
second hidden layer: carrying out convolution calculation on the front and rear second convolution results spaced three results apart, and inputting the obtained third convolution result into the third hidden layer;
third hidden layer: carrying out convolution calculation on the front and back third convolution results spaced seven results apart, and inputting the obtained target prediction data into the output layer;
Output layer: and outputting the target prediction data.
That is, time-sequence processing is carried out on the basic cartoon character image and the music voice data respectively, and the obtained target cartoon character image and target music voice are combined to obtain the target prediction data.
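The layer-by-layer scheme above, where each layer convolves results lying progressively farther apart, matches a stack of causal dilated 1-D convolutions; reading the stated intervals as dilations 1, 2, 4 and 8 is an interpretive assumption on our part. A PyTorch-style sketch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TemporalPredictor(nn.Module):
        """Sketch of the preset time sequence network: each layer convolves
        results that lie progressively farther apart in time."""
        def __init__(self, channels: int):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
                for d in (1, 2, 4, 8)  # input layer + three hidden layers
            ])
            self.output = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time), base cartoon images and music
            # voice features integrated and ordered by input time sequence.
            for conv in self.layers:
                x = F.relu(conv(F.pad(x, (conv.dilation[0], 0))))  # causal pad
            return self.output(x)  # target prediction data per time step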
205. And combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation.
After the server acquires the target cartoon character image and the target music voice arranged in time order, it combines the music content data, the target cartoon character image and the target music voice together to obtain the music cartoon character animation. During playback of the music cartoon character animation, as each target cartoon character image is shown, the corresponding music content data is displayed and the corresponding target music voice is played at the same time.
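At the data level, the combination step can be sketched as zipping three time-ordered streams into one track; the frame rate and the record layout below are illustrative assumptions, since the text does not fix a container format.

    from dataclasses import dataclass

    @dataclass
    class AnimationFrame:
        time_s: float           # playback timestamp
        character_image: bytes  # target cartoon character image at this moment
        voice_chunk: bytes      # target music voice samples at this moment
        lyric_text: str         # music content data displayed with the frame

    def combine(images, voice_chunks, lyrics, fps: float = 25.0):
        """Zip the three streams into one cartoon character animation."""
        return [AnimationFrame(i / fps, img, v, txt)
                for i, (img, v, txt)
                in enumerate(zip(images, voice_chunks, lyrics))]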
According to the embodiment of the invention, the music parameter data are encoded and converted to generate the music content data and the music voice data, the micro-expression vector features, the gesture vector features and the limb motion vector features in the music parameter data are weighted by utilizing a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
The method for generating cartoon character animation in the embodiment of the present invention is described above, and the device for generating cartoon character animation in the embodiment of the present invention is described below, referring to fig. 3, one embodiment of the device for generating cartoon character animation in the embodiment of the present invention includes:
The obtaining module 301 is configured to obtain music parameter data, encode music text data in the music parameter data by using a preset unicode character table to obtain music content data, and convert the music content data into music voice data by using a voice generation model; the computing module 302 is configured to extract, in the preset cartoon character generation model, basic vector features of cartoon characters corresponding to the music character image data in the music parameter data, perform weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, compute summary vector features of the basic vector features, and generate a basic cartoon character image according to the summary vector features; the prediction module 303 is configured to input the basic cartoon character image and the music voice data into a preset time sequence neural network, and generate a target cartoon character image and a target music voice respectively based on the preset time sequence neural network; and the combination module 304 is used for combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation.
According to the embodiment of the invention, the music parameter data are encoded and converted to generate the music content data and the music voice data, the micro-expression vector features, the gesture vector features and the limb motion vector features in the music parameter data are weighted by utilizing a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Referring to fig. 4, another embodiment of the cartoon character animation generating apparatus of the present invention includes:
The obtaining module 301 is configured to obtain music parameter data, encode music text data in the music parameter data by using a preset unicode character table to obtain music content data, and convert the music content data into music voice data by using a voice generation model; the computing module 302 is configured to extract, in the preset cartoon character generation model, basic vector features of cartoon characters corresponding to the music character image data in the music parameter data, perform weighting processing on micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, compute summary vector features of the basic vector features, and generate a basic cartoon character image according to the summary vector features; the prediction module 303 is configured to input the basic cartoon character image and the music voice data into a preset time sequence neural network, and generate a target cartoon character image and a target music voice respectively based on the preset time sequence neural network; and the combination module 304 is used for combining the music content data, the target cartoon character image and the target music voice to obtain the cartoon character animation.
Optionally, the acquiring module 301 includes: an extracting unit 3011, configured to obtain music text data in the music parameter data, and extract text characters in the music text data; a determining unit 3012, configured to search a preset unicode character table for standard characters identical to the text characters, take a byte code corresponding to the standard characters as code data corresponding to the text characters, and determine code data corresponding to the text characters in the music text data as music content data, where each standard character corresponds to one byte code; a conversion unit 3013 for converting the music content data into music speech data using the speech generation model.
Optionally, the conversion unit 3013 is specifically configured to: convert each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a voice generation model; segment the phoneme information by using a segmentation function in the speech generation model to obtain segmented phonemes, and align the segmented phonemes by using an alignment function in the speech generation model to obtain aligned phonemes; input the aligned phonemes into a duration prediction model in the speech generation model, and predict the phoneme duration of the aligned phonemes through the duration prediction model to obtain a predicted duration; and input the phoneme information and the predicted duration into an acoustic model in the speech generation model, generate a sound waveform corresponding to each text character, and splice the plurality of sound waveforms to obtain music voice data.
Optionally, the computing module 302 includes: an input unit 3021 for inputting the music character image data in the music parameter data into a preset cartoon character generation model, extracting basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; a calculation unit 3022 for calculating the attention distribution of the basis vector features by a neural network self-attention mechanism in the preset cartoon character generation model; and a summarizing unit 3023, configured to summarize the attention distribution of the basic vector feature by using a summarizing formula under the condition of increasing the weights occupied by the attention distribution of the micro-expression vector feature, the gesture vector feature, and the limb motion vector feature, to obtain a summarized vector feature, where the summarizing formula is:
$s = \lambda_{1}\alpha_{1}x_{1} + \lambda_{2}\alpha_{2}x_{2} + \lambda_{3}\alpha_{3}x_{3} + \sum_{i=4}^{N} \lambda_{i}\alpha_{i}x_{i}$

wherein $s$ represents the summary vector feature; $\alpha_{1}$, $\lambda_{1}$ and $x_{1}$ represent the attention distribution value, the weighted attention distribution value and the feature corresponding to the micro-expression vector feature; $\alpha_{2}$, $\lambda_{2}$ and $x_{2}$ represent the corresponding values for the gesture vector feature; $\alpha_{3}$, $\lambda_{3}$ and $x_{3}$ represent the corresponding values for the limb motion vector feature; $\alpha_{i}$, $\lambda_{i}$ and $x_{i}$ represent the attention distribution value, the weighted attention distribution value and the feature of the $i$-th remaining vector feature, the remaining vector features being the basic vector features other than the micro-expression vector feature, the gesture vector feature and the limb motion vector feature; and $N$ is the total number of basic vector features; and an adjustment unit 3024, configured to calculate a loss function value of the summary vector feature by adopting a cross entropy loss function, adjust the summary vector feature through the loss function value, and generate a corresponding basic cartoon character image by utilizing the adjusted summary vector feature.
Optionally, the computing unit 3022 is specifically configured to: acquiring query vector features in the music character image data, wherein the query vector features are used for representing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in a preset cartoon character generation model, wherein the calculation formula is as follows:
$$\alpha_i = \mathrm{softmax}\big(f(x_i, q)\big) = \frac{\exp\big(f(x_i, q)\big)}{\sum_{j=1}^{n} \exp\big(f(x_j, q)\big)}$$

where $\alpha_i$ represents the attention distribution value corresponding to the $i$-th basic vector feature, $f(\cdot,\cdot)$ represents an attention scoring function, $x_i$ represents the $i$-th basic vector feature, $x_j$ represents the $j$-th basic vector feature, $q$ represents the query vector, and $n$ is a positive integer.
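The sketch below illustrates this attention distribution computation, using a dot product as the attention scoring function f — a common choice, though the patent leaves the scoring function unspecified.

```python
import numpy as np

def attention_distribution(features, query):
    # alpha_i = softmax(f(x_i, q)); here f(x_i, q) = x_i . q is assumed.
    scores = features @ query                  # scores for i = 1..n
    scores -= scores.max()                     # numerical stability
    e = np.exp(scores)
    return e / e.sum()                         # sums to 1 over the n features

features = np.random.randn(6, 128)             # basic vector features x_1..x_n
query = np.random.randn(128)                   # query vector q
alpha = attention_distribution(features, query)
```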
Optionally, the prediction module 303 is specifically configured to: sort the basic cartoon character image and the music voice data according to a preset input time sequence, and integrate the sorted basic cartoon character image and music voice data into data to be predicted; acquire the data to be predicted at the previous moment and at the current moment, input them into a hidden layer of a preset time-series neural network, and compute the data to be predicted at the next moment by convolution iteration over the hidden layer, the data at the previous moment and the data at the current moment; and merge the data to be predicted at the successive next moments to obtain target prediction data, where the target prediction data comprise a target cartoon character image and target music voice.
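A simplified stand-in for the time-series prediction, assuming a plain recurrent update in place of the patent's unspecified hidden-layer convolution; the point illustrated is how previous-moment and current-moment data are combined in the hidden layer to predict the next moment, and how the predictions are merged.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hdim = 64, 32
W_h = rng.normal(scale=0.1, size=(hdim, hdim))   # hidden-to-hidden weights
W_p = rng.normal(scale=0.1, size=(hdim, dim))    # previous-moment input weights
W_c = rng.normal(scale=0.1, size=(hdim, dim))    # current-moment input weights
W_o = rng.normal(scale=0.1, size=(dim, hdim))    # hidden-to-output weights

sequence = [rng.normal(size=dim) for _ in range(10)]  # time-ordered data
hidden = np.zeros(hdim)
predictions = []
for prev_x, cur_x in zip(sequence, sequence[1:]):
    # Hidden layer combines the previous-moment and current-moment data
    # to predict the data at the next moment.
    hidden = np.tanh(W_h @ hidden + W_p @ prev_x + W_c @ cur_x)
    predictions.append(W_o @ hidden)
target = np.stack(predictions)                    # merged target prediction data
```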
Optionally, the cartoon character animation generating device further comprises: a generating module 305, configured to acquire music character animation data, train on the music character animation data by using a neural network self-attention mechanism, and generate the preset cartoon character generation model.
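As a rough illustration of such training, the sketch below fits a single self-attention layer with a cross-entropy loss on randomly generated stand-in data; the PyTorch dependency, the label space, and the architecture are all assumptions for illustration, not the patent's disclosed setup.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=1, batch_first=True)
head = nn.Linear(128, 10)                      # 10 hypothetical pose classes
opt = torch.optim.Adam(list(attn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(8, 6, 128)                 # batch of basic vector features
    y = torch.randint(0, 10, (8,))             # stand-in animation labels
    out, _ = attn(x, x, x)                     # neural network self-attention
    logits = head(out.mean(dim=1))
    loss = loss_fn(logits, y)                  # cross entropy loss function
    opt.zero_grad()
    loss.backward()
    opt.step()
```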
According to the embodiment of the invention, the music parameter data are encoded and converted to generate the music content data and the music voice data, the micro-expression vector features, the gesture vector features and the limb motion vector features in the music parameter data are weighted by utilizing a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
The apparatus for generating cartoon character animation in the embodiment of the present invention is described in detail above in terms of modularized functional entities in fig. 3 and 4, and the apparatus for generating cartoon character animation in the embodiment of the present invention is described in detail below in terms of hardware processing.
Fig. 5 is a schematic structural diagram of a cartoon character animation generating device 500 according to an embodiment of the present invention. The cartoon character animation generating device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510 and a memory 520, as well as one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the cartoon character animation generating device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the cartoon character animation generating device 500.
The cartoon character animation generating device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration illustrated in Fig. 5 does not constitute a limitation of the cartoon character animation generating device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The invention also provides a cartoon character animation generating device, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the cartoon character animation generating method in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile or a volatile computer readable storage medium, in which instructions are stored; when run on a computer, the instructions cause the computer to perform the steps of the method for generating a cartoon character animation.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention may, in essence or in the part contributing to the prior art, or in whole or in part, be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for generating a cartoon character animation, characterized by comprising the following steps:
acquiring music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
extracting, in a preset cartoon character generation model, basic vector features of a cartoon character corresponding to music character image data in the music parameter data, weighting micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
Inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and target music voice respectively based on the preset time sequence neural network;
combining the music content data, the target cartoon character image and the target music voice to obtain a music cartoon character animation;
wherein the extracting, in the preset cartoon character generation model, basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, weighting the micro-expression vector features, the gesture vector features and the limb motion vector features in the basic vector features through the neural network self-attention mechanism, calculating the summary vector features of the basic vector features, and generating the basic cartoon character image according to the summary vector features comprises:
inputting the music character image data in the music parameter data into a preset cartoon character generation model, and extracting basic vector features in the music character image data from the preset cartoon character generation model, wherein the basic vector features at least comprise micro expression vector features, gesture vector features and limb motion vector features of the cartoon character;
Calculating the attention distribution of the basic vector features through a neural network self-attention mechanism in the preset cartoon character generation model;
under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector feature, the gesture vector feature and the limb motion vector feature, summarizing the attention distribution of the basic vector feature by utilizing a summarization formula to obtain a summarization vector feature, wherein the summarization formula is as follows:
$$s = (\alpha_1 + \Delta\alpha_1)\,x_1 + (\alpha_2 + \Delta\alpha_2)\,x_2 + (\alpha_3 + \Delta\alpha_3)\,x_3 + \sum_{i=4}^{n} \alpha_i\, x_i$$

wherein $s$ represents the summary vector feature; $\alpha_1$ and $\Delta\alpha_1$ represent the attention distribution value and the weighted attention distribution value corresponding to the micro-expression vector feature $x_1$; $\alpha_2$ and $\Delta\alpha_2$ represent the attention distribution value and the weighted attention distribution value corresponding to the gesture vector feature $x_2$; $\alpha_3$ and $\Delta\alpha_3$ represent the attention distribution value and the weighted attention distribution value corresponding to the limb motion vector feature $x_3$; $\alpha_i$ represents the attention distribution value corresponding to the $i$-th residual vector feature $x_i$; $n$ is a positive integer; and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features.
2. The method of generating cartoon character animation according to claim 1, wherein the obtaining music parameter data, encoding music text data in the music parameter data using a preset unicode character table to obtain music content data, and converting the music content data into music voice data using a voice generation model comprises:
acquiring music text data in the music parameter data, and extracting text characters in the music text data;
searching a preset unicode character table for the standard character identical to each text character, taking the byte code corresponding to the standard character as the encoded data of the corresponding text character, and determining the encoded data corresponding to the text characters in the music text data as the music content data, wherein each standard character corresponds to one byte code;
the music content data is converted into music voice data by adopting a voice generation model.
3. The method of generating cartoon character animation as claimed in claim 2, wherein the converting the music content data into music voice data using a voice generation model comprises:
converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a voice generation model;
segmenting the phoneme information by using a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by using an alignment function in the voice generation model to obtain aligned phonemes;
inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting the phoneme duration of the aligned phonemes through the duration prediction model to obtain predicted duration;
inputting the phoneme information and the prediction time length into an acoustic model in the voice generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music voice data.
4. The method for generating cartoon character animation according to claim 1, wherein after summarizing the attention distribution of the basic vector feature by using a summary formula to obtain a summary vector feature, further comprising:
and calculating a loss function value of the summary vector feature by adopting a cross entropy loss function, adjusting the summary vector feature by the loss function value, and generating a corresponding basic cartoon character image by utilizing the adjusted summary vector feature.
5. The method of generating cartoon character animation as claimed in claim 4, wherein the calculating the attention distribution of the basic vector features through the neural network self-attention mechanism in the preset cartoon character generation model comprises:
acquiring query vector features in the music character image data, wherein the query vector features are used for representing basic vector features related to cartoon characters in the music character image;
calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
$$\alpha_i = \mathrm{softmax}\big(f(x_i, q)\big) = \frac{\exp\big(f(x_i, q)\big)}{\sum_{j=1}^{n} \exp\big(f(x_j, q)\big)}$$

wherein $\alpha_i$ represents the attention distribution value corresponding to the $i$-th basic vector feature, $f(\cdot,\cdot)$ represents an attention scoring function, $x_i$ represents the $i$-th basic vector feature, $x_j$ represents the $j$-th basic vector feature, $q$ represents the query vector, and $n$ is a positive integer; the basic vector features include the micro-expression vector features, the gesture vector features, the limb motion vector features, and the residual vector features.
6. The method of generating cartoon character animation according to claim 1, wherein the inputting the basic cartoon character image and the music voice data into a preset time-series neural network, respectively, and generating the target cartoon character image and the target music voice based on the preset time-series neural network, respectively, comprises:
Respectively sequencing the basic cartoon character image and the music voice data according to a preset input time sequence, and integrating the sequenced basic cartoon character image and the music voice data into data to be predicted;
acquiring data to be predicted at the previous moment and data to be predicted at the current moment, inputting the data to be predicted at the previous moment and the data to be predicted at the current moment into a hidden layer of a preset time sequence neural network, and performing convolution iterative calculation on the data to be predicted at the next moment through the hidden layer, the data to be predicted at the previous moment and the data to be predicted at the current moment;
and merging the data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises target cartoon character images and target music voices.
7. The method of generating a cartoon character animation according to any one of claims 1 to 6, wherein before the obtaining of music parameter data, encoding music text data in the music parameter data using a preset unicode character table to obtain music content data, and converting the music content data into music voice data using a voice generation model, the method of generating a cartoon character animation further comprises:
acquiring music character animation data, training on the music character animation data by using a neural network self-attention mechanism, and generating the preset cartoon character generation model.
8. A cartoon character animation generating device, characterized in that the cartoon character animation generating device comprises:
the system comprises an acquisition module, a speech generation module and a processing module, wherein the acquisition module is used for acquiring music parameter data, encoding music text data in the music parameter data by utilizing a preset unicode character table to obtain music content data, and converting the music content data into music speech data by adopting a speech generation model;
the computing module is used for extracting, in a preset cartoon character generation model, basic vector features of a cartoon character corresponding to the music character image data in the music parameter data, weighting micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
the prediction module is used for inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
The combination module is used for combining the music content data, the target cartoon character image and the target music voice to obtain a music cartoon character animation;
the computing module comprises:
the input unit is used for inputting the music character image data in the music parameter data into a preset cartoon character generation model, extracting basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro expression vector features, gesture vector features and limb motion vector features of the cartoon character;
a calculating unit, configured to calculate an attention distribution of the basis vector feature through a neural network self-attention mechanism in the preset cartoon character generation model;
and the summarizing unit is used for summarizing the attention distribution of the basic vector feature by utilizing a summarizing formula under the condition of increasing the weights occupied by the attention distribution of the micro-expression vector feature, the gesture vector feature and the limb motion vector feature to obtain a summarizing vector feature, wherein the summarizing formula is as follows:
$$s = (\alpha_1 + \Delta\alpha_1)\,x_1 + (\alpha_2 + \Delta\alpha_2)\,x_2 + (\alpha_3 + \Delta\alpha_3)\,x_3 + \sum_{i=4}^{n} \alpha_i\, x_i$$

wherein $s$ represents the summary vector feature; $\alpha_1$ and $\Delta\alpha_1$ represent the attention distribution value and the weighted attention distribution value corresponding to the micro-expression vector feature $x_1$; $\alpha_2$ and $\Delta\alpha_2$ represent the attention distribution value and the weighted attention distribution value corresponding to the gesture vector feature $x_2$; $\alpha_3$ and $\Delta\alpha_3$ represent the attention distribution value and the weighted attention distribution value corresponding to the limb motion vector feature $x_3$; $\alpha_i$ represents the attention distribution value corresponding to the $i$-th residual vector feature $x_i$; $n$ is a positive integer; and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features.
9. A cartoon character animation generating apparatus, comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invoking the instructions in the memory to cause the cartoon character animation generation apparatus to perform the cartoon character animation generation method of any of claims 1-7.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement a method of generating a cartoon character animation according to any of claims 1-7.
CN202110301883.XA 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium Active CN113379875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110301883.XA CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110301883.XA CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113379875A CN113379875A (en) 2021-09-10
CN113379875B true CN113379875B (en) 2023-09-29

Family

ID=77569751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110301883.XA Active CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113379875B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111383307A (en) * 2018-12-29 2020-07-07 上海智臻智能网络科技股份有限公司 Video generation method and device based on portrait and storage medium
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679643B2 (en) * 2016-08-31 2020-06-09 Gregory Frederick Diamos Automatic audio captioning
CN109145315B (en) * 2018-09-05 2022-03-18 腾讯科技(深圳)有限公司 Text translation method, text translation device, storage medium and computer equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN111383307A (en) * 2018-12-29 2020-07-07 上海智臻智能网络科技股份有限公司 Video generation method and device based on portrait and storage medium
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network; Ryuhei Sakurai et al.; Journal of Korea Robotics Society; pp. 16-25 *
Speech-driven photo-realistic facial animation synthesis based on BLSTM-RNN; Yang Shan et al.; Journal of Tsinghua University (Science and Technology); pp. 250-256 *

Also Published As

Publication number Publication date
CN113379875A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN109874029B (en) Video description generation method, device, equipment and storage medium
CN112687259B (en) Speech synthesis method, device and readable storage medium
CN110781394A (en) Personalized commodity description generation method based on multi-source crowd-sourcing data
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
CN112185321B (en) Song generation
CN112802446B (en) Audio synthesis method and device, electronic equipment and computer readable storage medium
CN112735371B (en) Method and device for generating speaker video based on text information
CN113345431B (en) Cross-language voice conversion method, device, equipment and medium
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
US11727915B1 (en) Method and terminal for generating simulated voice of virtual teacher
US20220028367A1 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN112069809B (en) Missing text generation method and system
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
Hernandez-Olivan et al. Music boundary detection using convolutional neural networks: A comparative analysis of combined input features
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
CN113379875B (en) Cartoon character animation generation method, device, equipment and storage medium
CN116958343A (en) Facial animation generation method, device, equipment, medium and program product
CN115910032A (en) Phoneme alignment model training method, computer equipment and computer storage medium
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
KR102461454B1 (en) Document Summarization System And Summary Method Thereof
CN115083386A (en) Audio synthesis method, electronic device, and storage medium
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
CN113838445B (en) Song creation method and related equipment
CN114783402B (en) Variation method and device for synthetic voice, electronic equipment and storage medium
CN116645957B (en) Music generation method, device, terminal, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant