CN117152317B - Optimization method for digital human interface control - Google Patents

Optimization method for digital human interface control Download PDF

Info

Publication number
CN117152317B
CN117152317B (application CN202311436484.XA)
Authority
CN
China
Prior art keywords
feature
lip
scale
sequence
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311436484.XA
Other languages
Chinese (zh)
Other versions
CN117152317A (en)
Inventor
刘松国
范诗扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhijiang Laboratory Technology Holdings Co ltd
Original Assignee
Zhijiang Laboratory Technology Holdings Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhijiang Laboratory Technology Holdings Co ltd filed Critical Zhijiang Laboratory Technology Holdings Co ltd
Priority to CN202311436484.XA priority Critical patent/CN117152317B/en
Publication of CN117152317A publication Critical patent/CN117152317A/en
Application granted granted Critical
Publication of CN117152317B publication Critical patent/CN117152317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of intelligent control, and in particular discloses an optimization method for digital human interface control. The method uses deep learning to extract lip reference features from a lip language identification image set and audio semantic features from input audio data, establishes an association relationship between the audio semantic features and the lip reference features, and generates a lip action video corresponding to the input audio based on the associated features of the two. In this way, more accurate mouth shape driving can be achieved, the mouth shape of the virtual digital person appears more lifelike, and a more natural and fluent virtual human-machine interaction experience is provided.

Description

Optimization method for digital human interface control
Technical Field
The present application relates to the field of intelligent control technologies, and more particularly, to an optimization method for digital human interface control.
Background
A digital person is an avatar generated with computer technology. It has the appearance and behavior patterns of a human being, exists inside a computing device (such as a computer or a mobile phone), and is presented through a display device so that people can see it with their own eyes. In the usual sense, a digital person is a visual digital virtual human that integrates a number of leading artificial intelligence technologies, such as character image simulation, voice cloning, natural language processing and knowledge graph analysis.
In the field of digital human interface control, avatar speech-animation synthesis technology can generate the facial expression coefficients of a 3D avatar from input speech through rules or deep learning algorithms, thereby accurately driving the mouth shape of the 3D avatar and enabling virtual digital persons to be applied in fields such as news broadcasting and virtual customer service. Lip driving control directly affects the fidelity of the digital person, but with existing lip driving control technology the lip shape and the voice match poorly during voice interaction with the virtual digital person.
Therefore, an optimization method for digital human interface control is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiment of the application provides an optimization method for digital human interface control, which utilizes a deep learning technology to extract lip reference features from a lip recognition image set, extract audio semantic features from input audio data, simultaneously establish an association relationship between the audio semantic features and the lip reference features, and generate lip action videos corresponding to input audio based on the association features of the audio semantic features and the lip reference features. Therefore, more accurate mouth shape driving can be realized, so that the mouth shape of the virtual digital person is more vivid in appearance, and more natural and smooth virtual man-machine interaction experience is provided.
Accordingly, according to one aspect of the present application, there is provided a method of optimizing digital human interface control, comprising:
acquiring a lip language identification image set and audio data;
each lip language identification image in the lip language identification image set is respectively passed through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language action feature vectors;
the multiple multi-scale lip language action feature vectors pass through a two-way long-short term memory neural network model to obtain lip reference data feature vectors;
extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data, and arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multichannel sound spectrogram;
the multichannel sound spectrogram is subjected to a convolutional neural network model using a channel attention mechanism to obtain an audio feature vector;
performing association coding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix;
performing order parameterization based on feature engineering on the associated feature matrix to obtain an optimized associated feature matrix;
the optimized association feature matrix is passed through a video generator based on a countermeasure generation network to obtain a lip motion video corresponding to the input audio.
In the optimization method of digital human interface control, the multi-scale lip language image feature extractor comprises a first convolution layer, a second convolution layer parallel to the first convolution layer and a cascade layer connected with the first convolution layer and the second convolution layer, wherein the first convolution layer uses a two-dimensional convolution kernel with a first scale, and the second convolution layer uses a two-dimensional convolution kernel with a second scale.
In the above-mentioned optimization method of digital human interface control, passing each lip language identification image in the lip language identification image set through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language action feature vectors includes: using each layer of the first convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer, convolution processing based on the two-dimensional convolution kernel with the first scale, global average pooling processing and nonlinear activation processing to obtain a lip language action feature vector of the first scale; using each layer of the second convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer, convolution processing based on the two-dimensional convolution kernel with the second scale, global average pooling processing and nonlinear activation processing to obtain a lip language action feature vector of the second scale; and fusing the first-scale lip language action feature vector and the second-scale lip language action feature vector to obtain the multi-scale lip language action feature vector.
In the above-mentioned optimization method of digital human interface control, passing the multichannel sound spectrogram through a convolutional neural network model using a channel attention mechanism to obtain an audio feature vector includes: using each layer of the convolutional neural network to perform the following operations on the input data in the forward pass of that layer: performing convolution processing on the input data based on a three-dimensional convolution kernel to obtain a convolution feature map; performing global mean pooling on each feature matrix of the convolution feature map along the channel dimension to obtain a channel feature vector; calculating the ratio of the feature value at each position of the channel feature vector to the weighted sum of the feature values at all positions of the channel feature vector to obtain a channel-weighted feature vector; weighting the feature matrices of the convolution feature map along the channel dimension, using the feature value at each position of the channel-weighted feature vector as a weight, to obtain a channel attention feature map; performing global pooling on each feature matrix of the channel attention feature map along the channel dimension to obtain a pooled feature map; and performing activation processing on the pooled feature map to generate an activated feature map; wherein the output of the last layer of the convolutional neural network is the audio feature vector, and the input of the first layer of the convolutional neural network is the multichannel sound spectrogram.
In the above-mentioned optimization method of digital human interface control, performing association coding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix, including: performing association coding on the lip reference data feature vector and the audio feature vector by using the following association formula to obtain the association feature matrix;
wherein, the association formula is:
$M = V_1^{\top} \otimes V_2$
wherein $V_1$ represents the lip reference data feature vector, $V_1^{\top}$ represents a transpose of the lip reference data feature vector, $V_2$ represents the audio feature vector, $M$ represents the association feature matrix, and $\otimes$ represents vector multiplication.
In the above-mentioned optimization method of digital human interface control, the performing order parameterization based on feature engineering on the correlation feature matrix to obtain an optimized correlation feature matrix includes: performing feature matrix segmentation on the associated feature matrix to obtain a sequence of associated local feature matrices; passing the sequence of the associated local feature matrix through an order weight generator based on a Softmax function to obtain a sequence of order weight values; based on the sequence of the order weight values, sequencing the sequence of the associated local feature matrix to obtain a sequence of rearranged associated local feature matrix; performing feature flattening on the sequence of the rearranged associated local feature matrix to obtain a sequence of rearranged associated local feature vectors; passing the sequence of reordered associated local feature vectors through a context encoder based on a converter to obtain a sequence of context reordered associated local feature vectors; carrying out normalization processing based on the maximum value on the sequence of the order weight values to obtain a sequence of normalized order weight values; taking the normalized order weight value of each position in the sequence of normalized order weight values as weight, and respectively weighting the sequence of context rearrangement associated local feature vectors to obtain the sequence of optimized context rearrangement associated local feature vectors; and carrying out dimension reconstruction on the sequence of the optimization context rearrangement associated local feature vector to obtain the optimization associated feature matrix.
In the above-mentioned optimization method of digital human interface control, the video generator based on the countermeasure generation network includes a discriminator and a generator, wherein the generator is used for generating video, the discriminator is used for calculating the difference between the generated video and a reference video, and the network parameters of the generator are updated through gradient descent back-propagation to obtain a generator capable of generating accurate lip motion video; further, the optimized association feature matrix is input into the generator of the video generator based on the countermeasure generation network to obtain the lip motion video corresponding to the input audio.
According to another aspect of the present application, there is provided an optimization system of digital human interface control, comprising:
the data acquisition module is used for acquiring the lip language identification image set and the audio data;
the lip-shaped motion feature extraction module is used for enabling each lip-language identification image in the lip-language identification image set to pass through the multi-scale lip-language image feature extractor respectively so as to obtain a plurality of multi-scale lip-language motion feature vectors;
the lip-shaped motion forward and backward correlation feature extraction module is used for enabling the multiple multi-scale lip-language motion feature vectors to pass through a two-way long-short-term memory neural network model to obtain lip-shaped reference data feature vectors;
The audio spectrogram extraction module is used for extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data and arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multichannel sound spectrogram;
the audio feature extraction module is used for obtaining an audio feature vector from the multichannel sound spectrogram through a convolutional neural network model using a channel attention mechanism;
the association module is used for carrying out association coding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix;
the optimization module is used for parameterizing the order based on the feature engineering on the correlation feature matrix to obtain an optimized correlation feature matrix;
and the lip control result generation module is used for passing the optimized association feature matrix through a video generator based on a countermeasure generation network to obtain a lip action video corresponding to the input audio.
In the above-mentioned optimizing system for digital human interface control, the multi-scale lip language image feature extractor includes a first convolution layer, a second convolution layer parallel to the first convolution layer, and a cascade layer connected to the first convolution layer and the second convolution layer, wherein the first convolution layer uses a two-dimensional convolution kernel having a first scale, and the second convolution layer uses a two-dimensional convolution kernel having a second scale.
In the above-mentioned optimizing system of digital human interface control, the lip-shaped action feature extraction module includes: a first scale feature extraction unit, configured to use each layer of the first convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer, convolution processing based on the two-dimensional convolution kernel with the first scale, global average pooling processing and nonlinear activation processing to obtain a lip language action feature vector of the first scale; a second scale feature extraction unit, configured to use each layer of the second convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer, convolution processing based on the two-dimensional convolution kernel with the second scale, global average pooling processing and nonlinear activation processing to obtain a lip language action feature vector of the second scale; and a multi-scale feature fusion unit, configured to fuse the first-scale lip language action feature vector and the second-scale lip language action feature vector to obtain the multi-scale lip language action feature vector.
Compared with the prior art, the optimization method for digital human interface control provided by the application utilizes a deep learning technology to extract lip reference features from a lip recognition image set, extracts audio semantic features from input audio data, establishes an association relationship between the audio semantic features and the lip reference features, and generates lip action videos corresponding to input audio based on the association features of the audio semantic features and the lip reference features. Therefore, more accurate mouth shape driving can be realized, so that the mouth shape of the virtual digital person is more vivid in appearance, and more natural and smooth virtual man-machine interaction experience is provided.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a flow chart of a method of optimizing digital human interface control according to an embodiment of the present application.
Fig. 2 is a schematic architecture diagram of an optimization method of digital human interface control according to an embodiment of the present application.
Fig. 3 is a flowchart of a method for optimizing digital human interface control according to an embodiment of the present application, in which each of the lip recognition images in the lip recognition image set is respectively passed through a multi-scale lip image feature extractor to obtain a plurality of multi-scale lip motion feature vectors.
FIG. 4 is a block diagram of an optimization system for digital human interface control according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
FIG. 1 is a flow chart of a method of optimizing digital human interface control according to an embodiment of the present application. As shown in fig. 1, the optimization method of the digital human interface control according to the embodiment of the application includes the steps of: s110, acquiring a lip language identification image set and audio data; s120, respectively passing each lip language identification image in the lip language identification image set through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language action feature vectors; s130, passing the multiple multi-scale lip language action feature vectors through a two-way long-short term memory neural network model to obtain lip reference data feature vectors; s140, extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data, and arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel sound spectrogram; s150, the multichannel sound spectrogram is subjected to a convolutional neural network model using a channel attention mechanism to obtain an audio feature vector; s160, carrying out association coding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix; s170, performing order parameterization based on feature engineering on the associated feature matrix to obtain an optimized associated feature matrix; and S180, the optimized association characteristic matrix is passed through a video generator based on a countermeasure generation network to obtain lip motion video corresponding to the input audio.
Fig. 2 is a schematic architecture diagram of an optimization method of digital human interface control according to an embodiment of the present application. As shown in fig. 2, first, a lip language identification image set and audio data are acquired. Then, each lip language identification image in the lip language identification image set is passed through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language action feature vectors. Next, the plurality of multi-scale lip language action feature vectors are passed through a two-way long-short term memory neural network model to obtain a lip reference data feature vector. Meanwhile, a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data are extracted and arranged into a multi-channel sound spectrogram. The multichannel sound spectrogram is then passed through a convolutional neural network model using a channel attention mechanism to obtain an audio feature vector. The lip reference data feature vector and the audio feature vector are then associatively encoded to obtain an association feature matrix, and the association feature matrix is subjected to order parameterization based on feature engineering to obtain an optimized association feature matrix. Finally, the optimized association feature matrix is passed through a video generator based on a countermeasure generation network to obtain a lip motion video corresponding to the input audio.
In the above-mentioned optimization method of digital human interface control, in step S110, a lip language identification image set and audio data are obtained. Accordingly, in order to improve the matching degree of the lips of the digital person and the voice, lip reference data can be provided by acquiring a lip language identification image set, wherein the lip language identification image set is a data set containing lip shape and motion information of different speakers, different pronunciations and different expressions, and can be used for training mouth shapes and lip motions of virtual digital persons so as to generate mouth shape motions matched with input audio. Therefore, in the technical scheme of the application, the deep learning technology is utilized to extract lip reference features from the lip language identification image set, audio semantic features are extracted from the input audio data, the association relation between the audio semantic features and the lip reference features is established, and lip action videos corresponding to the input audio are generated based on the association features of the audio semantic features and the lip reference features. Therefore, more accurate mouth shape driving can be realized, so that the mouth shape of the virtual digital person is more vivid in appearance, and more natural and smooth virtual man-machine interaction experience is provided. Specifically, in the technical scheme of the application, first, a lip language identification image set and audio data are acquired.
In the above-mentioned optimization method for digital human interface control, in step S120, each of the lip recognition images in the lip recognition image set is passed through a multi-scale lip image feature extractor to obtain a plurality of multi-scale lip motion feature vectors. Considering that lip language is one way to understand and interpret language by observing the shape and movement of lips, different lip motion images have features of different scales in a high-dimensional feature space. In order to capture the lip motion information under different scales, a multi-scale lip image feature extractor is further used for respectively carrying out feature mining on each lip recognition image in the lip recognition image set. It should be appreciated that the multi-scale lipgraphic feature extractor captures details and overall features of lip movement by performing convolution operations on the lipgraphic recognition image using convolution kernels of different scales. For example, smaller scale convolution kernels may better capture fine movements of the lips, while larger scale convolution kernels may better capture overall shape changes of the lips. By extracting a plurality of multi-scale lip language action feature vectors, information under different scales can be comprehensively utilized, so that lip language actions can be more comprehensively described. In this way, multiscale lip motion feature vectors corresponding to the lip recognition images are extracted from the lip recognition image set, so that the expression capability of lip features is improved, and the accuracy of subsequent lip driving control is enhanced.
Accordingly, in one specific example, the multi-scale lip language image feature extractor includes a first convolution layer, a second convolution layer in parallel with the first convolution layer, and a concatenation layer connected to the first and second convolution layers, wherein the first convolution layer uses a two-dimensional convolution kernel having a first scale, and the second convolution layer uses a two-dimensional convolution kernel having a second scale.
Fig. 3 is a flowchart of passing each lip language identification image in the lip language identification image set through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language action feature vectors in the optimization method of digital human interface control according to an embodiment of the present application. As shown in fig. 3, the step S120 includes: S210, using each layer of the first convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer, convolution processing based on the two-dimensional convolution kernel with the first scale, global average pooling processing and nonlinear activation processing to obtain a lip language action feature vector of the first scale; S220, using each layer of the second convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer, convolution processing based on the two-dimensional convolution kernel with the second scale, global average pooling processing and nonlinear activation processing to obtain a lip language action feature vector of the second scale; and S230, fusing the first-scale lip language action feature vector and the second-scale lip language action feature vector to obtain the multi-scale lip language action feature vector.
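As a non-authoritative illustration of this multi-scale extractor, the following PyTorch sketch implements two parallel two-dimensional convolution branches followed by global average pooling, a nonlinear activation and concatenation; the kernel sizes (3 and 7), channel counts and class name are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class MultiScaleLipFeatureExtractor(nn.Module):
    """Two parallel 2D convolution branches at different scales, each followed by
    global average pooling and a nonlinear activation, fused by concatenation."""
    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        # First branch: small kernel captures fine lip movements.
        self.branch_small = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Second branch: large kernel captures overall lip-shape changes.
        self.branch_large = nn.Conv2d(in_channels, out_channels, kernel_size=7, padding=3)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.act = nn.ReLU()

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, C, H, W) -> one feature vector per branch, shape (B, out_channels)
        f_small = self.act(self.pool(self.branch_small(image)).flatten(1))
        f_large = self.act(self.pool(self.branch_large(image)).flatten(1))
        # Concatenation (cascade) layer fuses the two scales.
        return torch.cat([f_small, f_large], dim=1)  # (B, 2 * out_channels)
```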
In the above-mentioned optimization method of digital human interface control, step S130 passes the plurality of multi-scale lip language action feature vectors through a two-way long-short term memory neural network model to obtain a lip reference data feature vector. Consider that a lip language action is a time-varying sequence in which each time step contains action information of the lips. To better understand and represent lip language actions, the timing relationships in the action sequence need to be considered. Therefore, a two-way long-short term memory neural network model (bidirectional Long Short-Term Memory, Bi-LSTM) is further used to process the plurality of multi-scale lip language action feature vectors so as to capture the timing information and contextual relationships of the lip language actions. It should be understood that the long-short term memory neural network model updates the weights of the neural network through input gates, output gates and forgetting gates, and can dynamically change the weight scale of different channels while the parameters of the network model remain fixed, thereby avoiding the problems of gradient vanishing or gradient explosion. In particular, the bidirectional long-short term memory neural network model combines a forward LSTM and a backward LSTM and can model the lip language action features in both the forward and backward directions, so that the lip reference data feature vector obtained through the bidirectional long-short term memory neural network model learns more comprehensive action context semantic understanding information through the combination of bidirectional transmission, which helps to model the time sequence characteristics of the lip language actions more accurately.
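A minimal sketch of this step, assuming a single-layer bidirectional LSTM whose final forward and backward hidden states are concatenated into the lip reference data feature vector; the dimensions and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipReferenceEncoder(nn.Module):
    """Bi-LSTM over the sequence of multi-scale lip-motion feature vectors."""
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (B, T, feat_dim) -- one multi-scale vector per lip-language image
        _, (h_n, _) = self.bilstm(seq)               # h_n: (2, B, hidden_dim)
        # Concatenate the last forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (B, 2 * hidden_dim)
```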
In the above-mentioned optimization method of digital human interface control, in step S140, a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data are extracted and arranged into a multi-channel sound spectrogram. It should be appreciated that the audio data includes frequency and energy information of the speech signal, and the logarithmic mel spectrogram, cochlear spectrogram and constant Q transformation spectrogram of the audio data are extracted in order to effectively capture a spectral feature representation of the audio signal. The logarithmic mel spectrogram is obtained by converting the audio signal into a spectral representation on the mel scale and taking the logarithm, which conforms more closely to the nonlinear frequency perception of human hearing. The cochlear spectrogram simulates the frequency decomposition performed by the cochlea of the human ear; by converting the audio signal into a spectral representation on the cochlear scale, it can better simulate how the human ear perceives the audio signal. The constant Q transformation spectrogram is a spectral representation that uses different frequency resolutions in different frequency ranges: it has a higher frequency resolution in the low frequency range and a lower frequency resolution in the high frequency range, which enables it to better capture both the details and the overall characteristics of the audio signal. These spectral features are then arranged into a multi-channel sound spectrogram, so that the different feature representations are combined, the spectral features of the audio signal are better expressed, more information is provided for the subsequent association coding, and the matching effect of lips and voice is improved.
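The following sketch shows one way such a multi-channel sound spectrogram could be assembled with librosa. Note that librosa has no built-in cochleagram, so a mel-based stand-in is used for that channel (a true cochleagram would require a gammatone filterbank); the sampling rate, bin count and hop length are assumptions.

```python
import numpy as np
import librosa

def multichannel_spectrogram(wav_path: str, sr: int = 16000, n_bins: int = 80,
                             hop: int = 256) -> np.ndarray:
    """Stack log-mel, cochlea-like and constant-Q spectrograms into (3, n_bins, T)."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Channel 1: logarithmic mel spectrogram.
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop))

    # Channel 2: cochlea-like spectrogram (placeholder: low-frequency-weighted mel
    # filters stand in for a gammatone filterbank).
    coch = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop, fmax=sr // 4))

    # Channel 3: constant-Q transform spectrogram.
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, hop_length=hop)))

    # Crop to a common time length and arrange as a multi-channel "image".
    t = min(log_mel.shape[1], coch.shape[1], cqt.shape[1])
    return np.stack([log_mel[:, :t], coch[:, :t], cqt[:, :t]], axis=0)
```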
In the above-mentioned optimizing method of digital human interface control, in step S150, the audio feature vector is obtained by using a convolutional neural network model of a channel attention mechanism for the multichannel spectrogram. Considering that the importance of different spectral features is different in audio processing, in order to adaptively learn and weight the importance of different spectral features, the multi-channel sound spectrogram is further processed using a convolutional neural network model containing a channel attention mechanism. It should be appreciated that the channel attention mechanism is a variation of the attention mechanism for automatically learning the importance weights of each channel in the multi-channel input data. It dynamically adjusts the contribution of each channel by learning the channel attention weights so that the network can focus more on channels that are more helpful to the current task, while filtering out the interference of extraneous information and noise. In this way, by introducing a channel attention mechanism into the convolutional neural network model, the characteristic extraction and modeling are carried out on the multichannel sound spectrogram, and the weight of each frequency spectrum characteristic channel is automatically learned, so that more accurate and expressive audio characteristic vectors are obtained, and the modeling effect on audio information is improved.
Accordingly, in a specific example, the step S150 includes: using each layer of the convolutional neural network to perform the following operations on the input data in the forward pass of that layer: performing convolution processing on the input data based on a three-dimensional convolution kernel to obtain a convolution feature map; performing global mean pooling on each feature matrix of the convolution feature map along the channel dimension to obtain a channel feature vector; calculating the ratio of the feature value at each position of the channel feature vector to the weighted sum of the feature values at all positions of the channel feature vector to obtain a channel-weighted feature vector; weighting the feature matrices of the convolution feature map along the channel dimension, using the feature value at each position of the channel-weighted feature vector as a weight, to obtain a channel attention feature map; performing global pooling on each feature matrix of the channel attention feature map along the channel dimension to obtain a pooled feature map; and performing activation processing on the pooled feature map to generate an activated feature map; wherein the output of the last layer of the convolutional neural network is the audio feature vector, and the input of the first layer of the convolutional neural network is the multichannel sound spectrogram.
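A simplified PyTorch sketch of such a channel-attention convolutional block is given below. It follows the squeeze-and-excitation pattern (per-channel global mean pooling, a Softmax-style ratio, and channel-wise re-weighting) but condenses some of the per-layer steps described above; all channel counts and the output dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBlock(nn.Module):
    """Convolution followed by squeeze-and-excitation style channel attention."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x)                          # convolution feature map
        w = feat.mean(dim=(2, 3))                    # global mean pooling per channel
        w = F.softmax(w, dim=1)                      # ratio of each channel vs. all channels
        feat = feat * w.unsqueeze(-1).unsqueeze(-1)  # channel attention feature map
        feat = F.max_pool2d(feat, kernel_size=2)     # pooled feature map (spatial)
        return F.relu(feat)                          # activated feature map

class AudioEncoder(nn.Module):
    """Stack of channel-attention blocks mapping a (3, F, T) spectrogram to a vector."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.blocks = nn.Sequential(ChannelAttentionBlock(3, 32),
                                    ChannelAttentionBlock(32, 64))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, 3, F, T) multichannel sound spectrogram -> (B, out_dim) audio feature vector
        return self.head(self.blocks(spec))
```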
In the above-mentioned optimization method of digital human interface control, in step S160, the lip reference data feature vector and the audio feature vector are subjected to association coding to obtain an association feature matrix. It should be appreciated that by correlating the lip reference data feature vector and the audio feature vector, lip shape and motion information is combined with speech features to capture the correlation and consistency between lip shape and motion and speech, thereby more accurately modeling the speaker's mouth shape and lip motion. And, the correlation feature matrix provides a comprehensive feature representation in which each element contains information about lip shape, motion and audio features for subsequent model training and generation processes to enable the generation of more natural, realistic virtual digital person's mouth shapes and lip movements.
Accordingly, in a specific example, the step S160 includes: performing association coding on the lip reference data feature vector and the audio feature vector by using the following association formula to obtain the association feature matrix;
wherein, the association formula is:
$M = V_1^{\top} \otimes V_2$
wherein $V_1$ represents the lip reference data feature vector, $V_1^{\top}$ represents a transpose of the lip reference data feature vector, $V_2$ represents the audio feature vector, $M$ represents the association feature matrix, and $\otimes$ represents vector multiplication.
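Assuming the formula above denotes an outer product between the two vectors, the association coding can be sketched in one line of PyTorch (a batch dimension is included for illustration; the function name is hypothetical):

```python
import torch

def association_encode(lip_vec: torch.Tensor, audio_vec: torch.Tensor) -> torch.Tensor:
    """Outer-product association coding: lip_vec (B, D1) x audio_vec (B, D2) -> (B, D1, D2)."""
    return torch.einsum('bi,bj->bij', lip_vec, audio_vec)
```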
In the above-mentioned optimization method of digital human interface control, in step S170, order parameterization based on feature engineering is performed on the association feature matrix to obtain an optimized association feature matrix. The association feature matrix contains feature redundancy and noise, and there is a relationship and an order between its local features. If the contextual correlation between the local features and the order information implicit in the association feature matrix can be utilized, the association feature matrix can be modeled in an ordered way to improve the sparsity and certainty of its feature expression. Based on this, order parameterization based on feature engineering is carried out on the association feature matrix to obtain an optimized association feature matrix.
Specifically, first, the association feature matrix is segmented, based on its local feature distribution, to obtain a sequence of associated local feature matrices. The sequence of associated local feature matrices is then passed through an order weight generator based on a Softmax function to obtain a sequence of order weight values, where each order weight value represents the contribution of the corresponding associated local feature matrix to the association result in the class probability domain; accordingly, the sequence of order weight values constitutes the order information of the sequence of associated local feature matrices, i.e. the influence ranking of each associated local feature matrix in the class probability domain. Then, based on the sequence of order weight values, the sequence of associated local feature matrices is sorted to obtain a sequence of rearranged associated local feature matrices, that is, the associated local feature matrices are arranged from small to large or from large to small according to the order information provided by the sequence of order weight values, so as to enhance the order information of each local feature and reduce the information loss within an interval. Next, the sequence of rearranged associated local feature matrices is flattened to obtain a sequence of rearranged associated local feature vectors, and this sequence is passed through a context encoder based on a converter to obtain a sequence of context rearranged associated local feature vectors; that is, the contextual correlation information between the individual local features of the association feature matrix is captured based on a converter (Transformer) mechanism. Further, the sequence of order weight values is normalized based on its maximum value to obtain a sequence of normalized order weight values, and the sequence of context rearranged associated local feature vectors is weighted, using the normalized order weight value at each position as a weight, to obtain a sequence of optimized context rearranged associated local feature vectors; that is, the implicit order information and the context information are fused and superimposed in the high-dimensional feature space. Finally, dimension reconstruction is performed on the sequence of optimized context rearranged associated local feature vectors to obtain the optimized association feature matrix.
In this way, the order parameterization based on feature engineering is carried out on the association feature matrix so as to utilize the association information between each local feature in the association feature matrix and the order information of each local feature in a class probability domain, thereby reducing dimension and noise, increasing information quantity and interpretability, and improving the accuracy and generalization capability of the model, because the parameterized features can better reflect the real structure and rule of the data.
Accordingly, in a specific example, the step S170 includes: performing feature matrix segmentation on the associated feature matrix to obtain a sequence of associated local feature matrices; passing the sequence of the associated local feature matrix through an order weight generator based on a Softmax function to obtain a sequence of order weight values; based on the sequence of the order weight values, sequencing the sequence of the associated local feature matrix to obtain a sequence of rearranged associated local feature matrix; performing feature flattening on the sequence of the rearranged associated local feature matrix to obtain a sequence of rearranged associated local feature vectors; passing the sequence of reordered associated local feature vectors through a context encoder based on a converter to obtain a sequence of context reordered associated local feature vectors; carrying out normalization processing based on the maximum value on the sequence of the order weight values to obtain a sequence of normalized order weight values; taking the normalized order weight value of each position in the sequence of normalized order weight values as weight, and respectively weighting the sequence of context rearrangement associated local feature vectors to obtain the sequence of optimized context rearrangement associated local feature vectors; and carrying out dimension reconstruction on the sequence of the optimization context rearrangement associated local feature vector to obtain the optimization associated feature matrix.
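A hedged PyTorch sketch of this order parameterization is shown below; the block partitioning scheme (row blocks), the transformer depth and head count, and the class and parameter names are assumptions, and the sketch assumes the flattened block size equals the encoder dimension d_model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderParameterization(nn.Module):
    """Split the association matrix into local blocks, score them with a Softmax
    order-weight generator, reorder, context-encode with a transformer, re-weight
    with max-normalized weights, and reshape back to matrix form."""
    def __init__(self, d_model: int, n_blocks: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # order weight generator
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.n_blocks = n_blocks

    def forward(self, assoc: torch.Tensor) -> torch.Tensor:
        B, H, W = assoc.shape
        # 1. Split into a sequence of flattened local blocks (H must divide by n_blocks).
        blocks = assoc.reshape(B, self.n_blocks, (H // self.n_blocks) * W)
        # 2. Softmax-based order weights: contribution of each block.
        w = F.softmax(self.scorer(blocks).squeeze(-1), dim=1)              # (B, n_blocks)
        # 3. Reorder the blocks by ascending weight.
        idx = torch.argsort(w, dim=1)
        ordered = torch.gather(blocks, 1, idx.unsqueeze(-1).expand_as(blocks))
        # 4. Transformer-based context encoding of the reordered sequence.
        ctx = self.context(ordered)
        # 5. Max-normalize the (reordered) weights and re-weight the context features.
        w_norm = torch.gather(w, 1, idx) / w.max(dim=1, keepdim=True).values
        out = ctx * w_norm.unsqueeze(-1)
        # 6. Dimension reconstruction back to the optimized association matrix.
        return out.reshape(B, H, W)
```

For example, a 64 x 64 association matrix split into 8 row blocks gives a flattened block size of 8 x 64 = 512, so the module would be constructed as OrderParameterization(d_model=512, n_blocks=8).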
In the above-mentioned optimization method of digital human interface control, the step S180 is to pass the optimized correlation feature matrix through a video generator based on a countermeasure generation network to obtain a lip motion video corresponding to the input audio. It should be understood that the countermeasure generation network (GAN) is a framework of generators and discriminators by which realistic data is generated. The generator attempts to generate a lip action sequence that matches the input audio based on the input features. The discriminator is then used to evaluate the authenticity of the lip motion video generated by the generator, thereby providing a feedback signal for training of the generator. Finally, the optimized association characteristic matrix is input into a trained generator to generate lip motion video matched with input audio so as to simulate the mouth shape and lip motion of a virtual digital person.
Accordingly, in one specific example, the video generator based on the countermeasure generation network comprises a discriminator and a generator, wherein the generator is used for generating video, the discriminator is used for calculating the difference between the generated video and a reference video, and the network parameters of the generator are updated through gradient descent back-propagation to obtain a generator capable of generating accurate lip action video; further, the optimized association feature matrix is input into the generator of the video generator based on the countermeasure generation network to obtain the lip motion video corresponding to the input audio.
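The adversarial training loop can be sketched as follows; the generator and discriminator here are placeholder fully-connected shells (a practical video generator would emit a frame sequence rather than a single flattened frame), and the feature size, frame resolution and learning rates are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical generator / discriminator shells; the real architectures are not
# specified in the text and would be considerably larger.
generator = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 64 * 64 * 3))
discriminator = nn.Sequential(nn.Linear(64 * 64 * 3, 256), nn.ReLU(), nn.Linear(256, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(assoc_feat: torch.Tensor, real_frames: torch.Tensor) -> None:
    """One adversarial update: the discriminator scores real vs. generated lip frames,
    then the generator is updated by gradient-descent back-propagation."""
    fake = generator(assoc_feat.flatten(1))  # assumes 512 features after flattening

    # Discriminator step: real frames labelled 1, generated frames labelled 0.
    d_loss = bce(discriminator(real_frames.flatten(1)), torch.ones(real_frames.size(0), 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(fake.size(0), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to fool the discriminator.
    g_loss = bce(discriminator(fake), torch.ones(fake.size(0), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```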
In the optimization method of digital human interface control, the digital human interface adopts image coding compression technology, which reduces the amount of data needed to describe an image and saves image transmission time, processing time and memory capacity. In a digital human interface, image coding compression may be used to store and transmit the lip language identification image sets, lip reference data and other image data related to mouth shape and lip movement. The coding compression algorithm compresses the image data into a smaller file according to the statistical characteristics and redundant information of the image, thereby reducing the space and bandwidth required for storage and transmission as well as the delay and cost of data transmission. Efficient image data compression and decompression therefore reduces the storage space and transmission bandwidth of the image data while maintaining image quality, so as to support real-time applications and network transmission.
In summary, an optimization method of digital human interface control according to an embodiment of the present application is explained, which extracts lip-reference features from a lip-recognition image set using a deep learning technique, extracts audio semantic features from input audio data, establishes an association relationship between the audio semantic features and the lip-reference features, and generates a lip-action video corresponding to the input audio based on the association features of the two. Therefore, more accurate mouth shape driving can be realized, so that the mouth shape of the virtual digital person is more vivid in appearance, and more natural and smooth virtual man-machine interaction experience is provided.
FIG. 4 is a block diagram of an optimization system for digital human interface control according to an embodiment of the present application. As shown in fig. 4, an optimization system 100 for digital human interface control according to an embodiment of the present application includes: a data acquisition module 110 for acquiring a lip language identification image set and audio data; the lip-shaped motion feature extraction module 120 is configured to pass each lip-language identification image in the lip-language identification image set through a multi-scale lip-language image feature extractor to obtain a plurality of multi-scale lip-language motion feature vectors; the lip-motion forward-backward correlation feature extraction module 130 is configured to pass the multiple multi-scale lip-motion feature vectors through a two-way long-short term memory neural network model to obtain lip-reference data feature vectors; an audio spectrogram extraction module 140, configured to extract a logarithmic mel spectrogram, a cochlear spectrogram, and a constant Q transform spectrogram of the audio data, and arrange the logarithmic mel spectrogram, the cochlear spectrogram, and the constant Q transform spectrogram into a multi-channel sound spectrogram; an audio feature extraction module 150, configured to obtain an audio feature vector from the multi-channel spectrogram through a convolutional neural network model using a channel attention mechanism; the association module 160 is configured to perform association encoding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix; an optimization module 170, configured to perform order parameterization based on feature engineering on the correlation feature matrix to obtain an optimized correlation feature matrix; the lip control result generating module 180 passes the optimized correlation feature matrix through a video generator based on a countermeasure generation network to obtain a lip motion video corresponding to the input audio.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described optimizing system of the digital human interface control have been described in detail in the above description of the optimizing method of the digital human interface control with reference to fig. 1 to 3, and thus, repetitive descriptions thereof will be omitted.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, devices, systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, the devices, apparatuses, devices, systems may be connected, arranged, configured in any manner. Words such as "including," "comprising," "having," and the like are words of openness and mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "as used herein refer to and are used interchangeably with the term" and/or "unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to.
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (9)

1. A method for optimizing digital human interface control, comprising:
acquiring a lip language identification image set and audio data;
Each lip language identification image in the lip language identification image set is respectively passed through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language action feature vectors;
the multiple multi-scale lip language action feature vectors pass through a two-way long-short term memory neural network model to obtain lip reference data feature vectors;
extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data, and arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multichannel sound spectrogram;
the multichannel sound spectrogram is subjected to a convolutional neural network model using a channel attention mechanism to obtain an audio feature vector;
performing association coding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix;
performing order parameterization based on feature engineering on the associated feature matrix to obtain an optimized associated feature matrix;
passing the optimized associated feature matrix through a video generator based on a countermeasure generation network to obtain a lip motion video corresponding to the input audio;
wherein, carrying out order parameterization based on feature engineering on the associated feature matrix to obtain an optimized associated feature matrix, comprising:
Performing feature matrix segmentation on the associated feature matrix to obtain a sequence of associated local feature matrices;
passing the sequence of the associated local feature matrix through an order weight generator based on a Softmax function to obtain a sequence of order weight values;
based on the sequence of order weight values, reordering the sequence of associated local feature matrices to obtain a sequence of rearranged associated local feature matrices;
performing feature flattening on the sequence of the rearranged associated local feature matrix to obtain a sequence of rearranged associated local feature vectors;
passing the sequence of rearranged associated local feature vectors through a transformer-based context encoder to obtain a sequence of context rearrangement associated local feature vectors;
carrying out normalization processing based on the maximum value on the sequence of the order weight values to obtain a sequence of normalized order weight values;
taking the normalized order weight value of each position in the sequence of normalized order weight values as weight, and respectively weighting the sequence of context rearrangement associated local feature vectors to obtain the sequence of optimized context rearrangement associated local feature vectors;
and carrying out dimension reconstruction on the sequence of the optimization context rearrangement associated local feature vector to obtain the optimization associated feature matrix.
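For orientation, a minimal PyTorch sketch of the feature-engineering-based order parameterization recited in claim 1 is given below: segmentation into local feature matrices, a Softmax-based order weight generator, reordering, flattening, a transformer context encoder, maximum-based re-weighting, and dimension reconstruction. The patch size, embedding width, number of encoder layers, and the single linear layer standing in for the order weight generator are illustrative assumptions, not details fixed by the claim.

```python
# A minimal sketch of the feature-engineering-based order parameterization of
# claim 1; patch size, model width and the linear weight generator are assumptions.
import torch
import torch.nn as nn


class OrderParameterization(nn.Module):
    def __init__(self, patch: int = 8, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch, d_model)    # embed flattened patches
        self.score = nn.Linear(patch * patch, 1)         # Softmax order weight generator
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.unproj = nn.Linear(d_model, patch * patch)  # back to patch size

    def forward(self, m: torch.Tensor) -> torch.Tensor:  # m: (B, H, W), H, W divisible by patch
        B, H, W = m.shape
        p = self.patch
        # 1) feature matrix segmentation into a sequence of local feature matrices
        patches = m.unfold(1, p, p).unfold(2, p, p).reshape(B, -1, p * p)  # (B, N, p*p)
        # 2) Softmax-based order weight values over the sequence
        w = torch.softmax(self.score(patches).squeeze(-1), dim=1)          # (B, N)
        # 3) reorder the sequence by descending order weight
        idx = torch.argsort(w, dim=1, descending=True)
        patches = torch.gather(patches, 1, idx.unsqueeze(-1).expand_as(patches))
        w = torch.gather(w, 1, idx)
        # 4)-5) feature flattening (the patches are already vectors) and context encoding
        ctx = self.context(self.proj(patches))                             # (B, N, d_model)
        # 6)-7) maximum-based normalization of the weights, then re-weighting
        w_norm = w / (w.max(dim=1, keepdim=True).values + 1e-8)
        ctx = ctx * w_norm.unsqueeze(-1)
        # 8) dimension reconstruction into the optimized association feature matrix
        out = self.unproj(ctx).reshape(B, H // p, W // p, p, p)
        return out.permute(0, 1, 3, 2, 4).reshape(B, H, W)
```

Fed a batch of association feature matrices, the module returns optimized matrices of the same height and width, which is what the video generator step of claim 1 consumes.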
2. The method of optimizing digital human interface control of claim 1, wherein the multi-scale lip language image feature extractor comprises a first convolution layer, a second convolution layer in parallel with the first convolution layer, and a concatenation layer connected to the first and second convolution layers, wherein the first convolution layer uses a two-dimensional convolution kernel having a first scale and the second convolution layer uses a two-dimensional convolution kernel having a second scale.
3. The method for optimizing digital human interface control according to claim 2, wherein passing each of the set of lip language identification images through a multi-scale lip language image feature extractor to obtain a plurality of multi-scale lip language motion feature vectors, respectively, comprises:
using each layer of the first convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer: convolution processing with the two-dimensional convolution kernel of the first scale, global average pooling processing, and nonlinear activation processing, so as to obtain a lip language action feature vector of the first scale;
using each layer of the second convolution layer of the multi-scale lip language image feature extractor to perform, on the input data in the forward pass of that layer: convolution processing with the two-dimensional convolution kernel of the second scale, global mean pooling processing, and nonlinear activation processing, so as to obtain a lip language action feature vector of the second scale;
and fusing the lip language action feature vector of the first scale and the lip language action feature vector of the second scale to obtain the multi-scale lip language action feature vector.
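Claims 2 and 3 describe a two-branch extractor: each branch convolves the lip image with a two-dimensional kernel of its own scale, applies global average (or mean) pooling and a nonlinearity, and the two per-scale vectors are then fused. A minimal sketch under that reading follows; the 3x3 and 5x5 kernel scales, the channel count, the ReLU activation, and concatenation as the fusion step are assumptions for illustration only.

```python
# A minimal sketch of the two-branch multi-scale lip image feature extractor of
# claims 2-3; kernel scales, channel count and the fusion by concatenation are assumptions.
import torch
import torch.nn as nn


class MultiScaleLipFeatureExtractor(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        # first convolution layer: two-dimensional kernel of the first scale
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # second, parallel convolution layer: kernel of the second scale
        self.branch2 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.act = nn.ReLU()

    def forward(self, img: torch.Tensor) -> torch.Tensor:     # img: (B, C, H, W)
        # per branch: convolution -> global average pooling -> nonlinear activation
        v1 = self.act(self.branch1(img).mean(dim=(2, 3)))     # (B, out_ch)
        v2 = self.act(self.branch2(img).mean(dim=(2, 3)))     # (B, out_ch)
        # concatenation layer fusing the two scales
        return torch.cat([v1, v2], dim=1)                     # (B, 2 * out_ch)
```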
4. The method of optimizing digital human interface control according to claim 3, wherein passing the multichannel sound spectrogram through a convolutional neural network model using a channel attention mechanism to obtain an audio feature vector comprises: using each layer of the convolutional neural network to perform, on the input data in the forward pass of that layer:
performing convolution processing on the input data with a three-dimensional convolution kernel to obtain a convolution feature map;
performing global mean pooling on each feature matrix of the convolution feature map along the channel dimension to obtain a channel feature vector;
calculating the ratio of the feature value of each position in the channel feature vector to the weighted sum of the feature values of all positions of the channel feature vector to obtain a channel-weighted feature vector;
weighting the feature matrices of the convolution feature map along the channel dimension, with the feature value of each position in the channel-weighted feature vector as the weight, to obtain a channel attention feature map;
performing global pooling processing on each feature matrix of the channel attention feature map along the channel dimension to obtain a pooled feature map;
performing activation processing on the pooled feature map to generate an activated feature map;
the output of the last layer of the convolutional neural network is the audio feature vector, and the input of the first layer of the convolutional neural network is the multichannel sound spectrogram.
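Claim 4 reads as a squeeze-and-excitation-style channel attention block applied layer by layer to the multichannel spectrogram. The sketch below follows that reading; the kernel size, channel counts, and the use of Softmax to realize the "ratio to the weighted sum" step are assumptions, and the audio feature vector of the claim is taken here as the pooled output of the final block.

```python
# A minimal sketch of one channel-attention convolution block in the spirit of
# claim 4; kernel size, channel counts and the Softmax weighting are assumptions.
import torch
import torch.nn as nn


class ChannelAttentionBlock(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 32):
        super().__init__()
        # a Conv2d kernel over the multichannel spectrogram is a 3D (C x k x k) kernel
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, F, T)
        feat = self.conv(x)                                 # convolution feature map
        # global mean pooling of each per-channel feature matrix -> channel feature vector
        chan = feat.mean(dim=(2, 3))                        # (B, out_ch)
        # ratio of each value to the weighted sum over all positions (here via Softmax)
        weights = torch.softmax(chan, dim=1)                # channel-weighted feature vector
        # re-weight the feature matrices along the channel dimension
        attn = feat * weights.unsqueeze(-1).unsqueeze(-1)   # channel attention feature map
        return self.act(attn)                               # activated feature map

    @staticmethod
    def to_audio_vector(feat: torch.Tensor) -> torch.Tensor:
        # the audio feature vector: pooled output of the last block (an assumption)
        return feat.mean(dim=(2, 3))
```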
5. The method of optimizing digital human interface control according to claim 4, wherein performing association coding on the lip reference data feature vector and the audio feature vector to obtain an associated feature matrix comprises: performing association coding on the lip reference data feature vector and the audio feature vector using the following association formula to obtain the associated feature matrix;
wherein the association formula is: M = V1^T ⊗ V2, where V1 represents the lip reference data feature vector, V1^T represents the transpose vector of the lip reference data feature vector, V2 represents the audio feature vector, M represents the associated feature matrix, and ⊗ represents vector multiplication.
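Read literally, the association coding of claim 5 is an outer product: a column vector from the lip branch multiplied by a row vector from the audio branch gives a matrix whose (i, j) entry couples the i-th lip feature with the j-th audio feature. A minimal sketch under that reading, with an assumed batch layout, is:

```python
# A minimal sketch of the association coding M = V1^T (x) V2 of claim 5; the
# batched layout (B, N) / (B, M) and the function name are assumptions.
import torch

def associate(lip_vec: torch.Tensor, audio_vec: torch.Tensor) -> torch.Tensor:
    """lip_vec: (B, N), audio_vec: (B, M) -> associated feature matrix (B, N, M)."""
    # outer product: column vector (N, 1) times row vector (1, M)
    return lip_vec.unsqueeze(2) @ audio_vec.unsqueeze(1)
```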
6. The method of optimizing digital human interface control according to claim 5, wherein the video generator based on the generative adversarial network includes a generator for generating a video and a discriminator for calculating a difference between the generated video and a reference video, the network parameters of the generator being updated by a back-propagation algorithm with gradient descent to obtain a generator capable of generating an accurate lip motion video; further, the optimized correlation feature matrix is input into the generator of the video generator based on the generative adversarial network to obtain the lip motion video corresponding to the input audio.
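A heavily simplified sketch of the adversarial training step behind claim 6 follows. The binary cross-entropy losses, the optimizers, and the gen/disc module interfaces are assumptions; the claim only fixes that a discriminator scores generated video against reference video and that the generator is updated by gradient-descent back-propagation.

```python
# A minimal, simplified adversarial training step for the video generator of
# claim 6; architectures, losses and optimizer settings are assumptions.
import torch
import torch.nn as nn


def train_step(gen: nn.Module, disc: nn.Module,
               opt_g: torch.optim.Optimizer, opt_d: torch.optim.Optimizer,
               assoc_matrix: torch.Tensor, reference_video: torch.Tensor) -> None:
    bce = nn.BCEWithLogitsLoss()
    real = torch.ones(reference_video.size(0), 1)
    fake = torch.zeros(reference_video.size(0), 1)

    # discriminator step: score reference video as real, generated video as fake
    opt_d.zero_grad()
    generated = gen(assoc_matrix)
    d_loss = bce(disc(reference_video), real) + bce(disc(generated.detach()), fake)
    d_loss.backward()
    opt_d.step()

    # generator step: gradient-descent back-propagation through the adversarial loss
    opt_g.zero_grad()
    g_loss = bce(disc(generated), real)
    g_loss.backward()
    opt_g.step()
```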
7. An optimization system for digital human interface control, comprising:
the data acquisition module is used for acquiring the lip language identification image set and the audio data;
the lip-shaped motion feature extraction module is used for enabling each lip-language identification image in the lip-language identification image set to pass through the multi-scale lip-language image feature extractor respectively so as to obtain a plurality of multi-scale lip-language motion feature vectors;
the lip-shaped motion forward-and-backward correlation feature extraction module is used for passing the plurality of multi-scale lip language motion feature vectors through a bidirectional long short-term memory neural network model to obtain a lip reference data feature vector;
the audio spectrogram extraction module is used for extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram of the audio data and arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multichannel sound spectrogram;
the audio feature extraction module is used for obtaining an audio feature vector from the multichannel sound spectrogram through a convolutional neural network model using a channel attention mechanism;
the association module is used for carrying out association coding on the lip reference data feature vector and the audio feature vector to obtain an association feature matrix;
the optimization module is used for performing feature-engineering-based order parameterization on the correlation feature matrix to obtain an optimized correlation feature matrix;
the lip control result generation module is used for passing the optimized association feature matrix through a video generator based on a generative adversarial network so as to obtain a lip action video corresponding to the input audio;
wherein, the optimization module includes:
performing feature matrix segmentation on the associated feature matrix to obtain a sequence of associated local feature matrices;
passing the sequence of the associated local feature matrix through an order weight generator based on a Softmax function to obtain a sequence of order weight values;
based on the sequence of order weight values, reordering the sequence of associated local feature matrices to obtain a sequence of rearranged associated local feature matrices;
performing feature flattening on the sequence of the rearranged associated local feature matrix to obtain a sequence of rearranged associated local feature vectors;
passing the sequence of rearranged associated local feature vectors through a transformer-based context encoder to obtain a sequence of context rearrangement associated local feature vectors;
carrying out normalization processing based on the maximum value on the sequence of the order weight values to obtain a sequence of normalized order weight values;
Taking the normalized order weight value of each position in the sequence of normalized order weight values as weight, and respectively weighting the sequence of context rearrangement associated local feature vectors to obtain the sequence of optimized context rearrangement associated local feature vectors;
and carrying out dimension reconstruction on the sequence of the optimization context rearrangement associated local feature vector to obtain the optimization associated feature matrix.
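The audio spectrogram extraction module of claim 7 stacks three time-frequency views of the same waveform into channels. A rough librosa sketch is shown below for the log-mel and constant-Q channels; librosa has no cochleagram routine, so the cochlear channel here is a clearly labeled stand-in (a second mel configuration) that a real implementation would replace with a gammatone/ERB filterbank. Bin count, hop length, and sample rate are assumptions.

```python
# A rough sketch of assembling the multichannel sound spectrogram; the cochlear
# channel below is a STAND-IN (mel-based), not a true gammatone cochleagram.
import numpy as np
import librosa


def multichannel_spectrogram(y: np.ndarray, sr: int = 16000,
                             n_bins: int = 80, hop: int = 256) -> np.ndarray:
    # channel 1: logarithmic mel spectrogram
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=n_bins))
    # channel 2: stand-in for the cochlear spectrogram (assumption: a second mel
    # filterbank with a lower frequency ceiling; replace with a gammatone/ERB bank)
    coch = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop,
                                       n_mels=n_bins, fmax=4000))
    # channel 3: constant-Q transform spectrogram
    cqt = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y=y, sr=sr, hop_length=hop, n_bins=n_bins)))
    # align frame counts and stack into a (3, n_bins, T) multichannel spectrogram
    t = min(log_mel.shape[1], coch.shape[1], cqt.shape[1])
    return np.stack([log_mel[:, :t], coch[:, :t], cqt[:, :t]], axis=0)
```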
8. The digital human interface control optimization system of claim 7, wherein the multi-scale lip language image feature extractor comprises a first convolution layer, a second convolution layer in parallel with the first convolution layer, and a concatenation layer connected to the first and second convolution layers, wherein the first convolution layer uses a two-dimensional convolution kernel having a first scale and the second convolution layer uses a two-dimensional convolution kernel having a second scale.
9. The digital human interface control optimization system of claim 8, wherein the lip motion feature extraction module comprises:
the first scale feature extraction unit is used for performing, on the input data in the forward pass of each layer of the first convolution layer of the multi-scale lip language image feature extractor: convolution processing with the two-dimensional convolution kernel of the first scale, global average pooling processing, and nonlinear activation processing, so as to obtain a lip language action feature vector of the first scale;
the second scale feature extraction unit is used for performing, on the input data in the forward pass of each layer of the second convolution layer of the multi-scale lip language image feature extractor: convolution processing with the two-dimensional convolution kernel of the second scale, global mean pooling processing, and nonlinear activation processing, so as to obtain a lip language action feature vector of the second scale;
and the multi-scale feature fusion unit is used for fusing the first-scale lip language motion feature vector and the second-scale lip language motion feature vector to obtain the multi-scale lip language motion feature vector.
CN202311436484.XA 2023-11-01 2023-11-01 Optimization method for digital human interface control Active CN117152317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311436484.XA CN117152317B (en) 2023-11-01 2023-11-01 Optimization method for digital human interface control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311436484.XA CN117152317B (en) 2023-11-01 2023-11-01 Optimization method for digital human interface control

Publications (2)

Publication Number Publication Date
CN117152317A CN117152317A (en) 2023-12-01
CN117152317B true CN117152317B (en) 2024-02-13

Family

ID=88912544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311436484.XA Active CN117152317B (en) 2023-11-01 2023-11-01 Optimization method for digital human interface control

Country Status (1)

Country Link
CN (1) CN117152317B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN110942502A (en) * 2019-11-29 2020-03-31 中山大学 Voice lip fitting method and system and storage medium
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
CN111443852A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Digital human action control method and device, electronic equipment and storage medium
WO2022033556A1 (en) * 2020-08-14 2022-02-17 华为技术有限公司 Electronic device and speech recognition method therefor, and medium
CN113851131A (en) * 2021-08-17 2021-12-28 西安电子科技大学广州研究院 Cross-modal lip language identification method
CN114359785A (en) * 2021-12-06 2022-04-15 重庆邮电大学 Lip language identification method and device based on adaptive matrix feature fusion network and electronic equipment
CN114694255A (en) * 2022-04-01 2022-07-01 合肥工业大学 Sentence-level lip language identification method based on channel attention and time convolution network
CN115454287A (en) * 2022-08-16 2022-12-09 东风汽车集团股份有限公司 Virtual digital human interaction method, device, equipment and readable storage medium
CN116206607A (en) * 2023-02-08 2023-06-02 北京航空航天大学 Method and device for generating realistic virtual person based on voice driving

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Fangfang Yu et al. On the Issue of "Digital Human" in the Context of Digital Transformation. IEEE Xplore. 2022, full text. *
Ren Yuqiang; Tian Guodong; Zhou Xiangdong; Lyu Jiangjing; Zhou Xi. Research on a lip-reading recognition algorithm in a high-security face recognition system. Application Research of Computers. 2017, (04), full text. *
Lin Aihua; Zhang Wenjun; Wang Yimin; Zhao Guangjun. Implementation of speech-driven facial lip animation. Computer Engineering. 2007, (18), full text. *
Gao Xiang; Chen Zhi; Yue Wenjing; Gong Kai. A character semantic recognition model based on deep learning of video scenes. Computer Technology and Development. 2018, (06), full text. *

Also Published As

Publication number Publication date
CN117152317A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN111243576A (en) Speech recognition and model training method, device, equipment and storage medium
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN110060657B (en) SN-based many-to-many speaker conversion method
CN112581569B (en) Adaptive emotion expression speaker facial animation generation method and electronic device
CN112837669B (en) Speech synthesis method, device and server
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
WO2021203880A1 (en) Speech enhancement method, neural network training method, and related device
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN113823273B (en) Audio signal processing method, device, electronic equipment and storage medium
CN113555032A (en) Multi-speaker scene recognition and network training method and device
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
CN112786001A (en) Speech synthesis model training method, speech synthesis method and device
CN117152317B (en) Optimization method for digital human interface control
CN115223244B (en) Haptic motion simulation method, device, apparatus and storage medium
CN117173294B (en) Method and system for automatically generating digital person
Srikotr et al. Vector quantization of speech spectrum based on the vq-vae embedding space learning by gan technique
CN115116470A (en) Audio processing method and device, computer equipment and storage medium
CN112951270A (en) Voice fluency detection method and device and electronic equipment
CN117292437B (en) Lip language identification method, device, chip and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant