CN113556600A - Drive control method and device based on time sequence information, electronic equipment and readable storage medium


Info

Publication number: CN113556600A
Application number: CN202110788537.9A
Authority: CN (China)
Prior art keywords: video frame, key point, driving, training, processed
Legal status: Granted; active
Other languages: Chinese (zh)
Other versions: CN113556600B
Inventors: 钱立辉, 韩欣彤, 王法强, 董浩业
Current Assignee: Guangzhou Huya Technology Co Ltd
Original Assignee: Guangzhou Huya Technology Co Ltd
Events: application filed by Guangzhou Huya Technology Co Ltd; priority to CN202110788537.9A; publication of CN113556600A; application granted; publication of CN113556600B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The application provides a drive control method and apparatus based on time sequence information, an electronic device, and a readable storage medium. Multiple consecutive video frames containing a target object are acquired, the multiple frames comprising a video frame to be processed and adjacent video frames that precede it in time sequence. Key points of the target object are extracted from each video frame to obtain multiple groups of key point information; the multiple groups of key point information are then processed by a driving model obtained through pre-training, which outputs a driving signal corresponding to the target object in the video frame to be processed, and drive control is performed on a target avatar based on the driving signal. Because the driving signal for the video frame to be processed is derived jointly from that frame and its adjacent frames, the context information of consecutive video frames in the time sequence reduces the driving-signal error and effectively alleviates the problem of low driving-signal accuracy when the video frame to be processed is missing or jittery.

Description

Drive control method and device based on time sequence information, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of live broadcast technologies, and in particular, to a drive control method and apparatus based on timing information, an electronic device, and a readable storage medium.
Background
With the rapid development of computer vision technology, it has been widely applied in many fields. For example, in webcast scenarios, rendering an avatar on the live broadcast interface has become a popular way to make live streams more engaging. In this live broadcast mode, video images of the user can be captured and processed with computer vision techniques to obtain a driving signal that makes the avatar follow the user's motion.
In existing approaches, each single-frame video image is usually processed on its own to obtain the corresponding driving signal. Because each driving signal is derived independently from a single frame, it is difficult to obtain an accurate driving signal when a frame is missing or unstable due to jitter, which leads to a poor follow-up effect.
Disclosure of Invention
An object of the present application includes, for example, providing a driving control method, apparatus, electronic device and readable storage medium based on timing information, which can effectively alleviate the problem of low driving-signal accuracy when video frames are missing or jittery.
The embodiment of the application can be realized as follows:
in a first aspect, the present application provides a driving control method based on timing information, the method including:
acquiring continuous multi-frame video frames containing a target object, wherein the multi-frame video frames contain a video frame to be processed and an adjacent video frame positioned in front of the video frame to be processed in time sequence;
extracting key points of the target object in each video frame to obtain a plurality of groups of key point information;
processing the multiple groups of key point information by using a driving model obtained by pre-training, and outputting a driving signal corresponding to a target object in the video frame to be processed;
and performing drive control on the target virtual image by using the drive signal.
In an alternative embodiment, the keypoint information comprises coordinates and confidence levels of keypoints;
the step of processing the multiple groups of key point information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed comprises the following steps:
importing the multiple groups of key point information into a driving model obtained by pre-training, and reducing the weight occupied by the key point coordinates of the video frame to be processed and increasing the weight occupied by the key point coordinates of each adjacent video frame when the video frame to be processed has an unreliable key point, wherein the unreliable key point is a key point with the confidence coefficient lower than a preset value;
and outputting a driving signal corresponding to the target object in the video frame to be processed based on the multiple groups of key point information after the weight adjustment.
In an optional implementation manner, the step of decreasing the weight occupied by the keypoint coordinates of the video frame to be processed and increasing the weight occupied by the keypoint coordinates of each of the adjacent video frames includes:
determining a reference key point corresponding to the non-trusted key point in each adjacent video frame;
and reducing the weight occupied by the coordinates of the untrusted key points, and increasing the weight occupied by the coordinates of each reference key point.
In an optional embodiment, the driving model is obtained by training a constructed network model by using a training sample in advance;
the training samples comprise positive samples without key points with confidence degrees lower than a preset value and negative samples with key points with confidence degrees lower than the preset value, wherein the negative samples are obtained after the key point coordinates in the positive samples are randomly disturbed.
In an optional embodiment, the step of processing the multiple sets of keypoint information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed includes:
importing the multiple groups of key point information into a driving model obtained by pre-training;
aiming at any target video frame, obtaining a previous frame of the target video frame, and obtaining state characteristics corresponding to key point information of the previous frame, wherein the target video frame is any adjacent video frame or a video frame to be processed;
obtaining the state characteristics of the target video frame according to the state characteristics of the previous frame and the key point information of the target video frame;
and outputting a driving signal corresponding to a target object in the video frame to be processed according to the state characteristic of the video frame to be processed.
In an alternative embodiment, the multi-frame video frame further includes an adjacent video frame chronologically subsequent to the video frame to be processed;
the step of processing the multiple groups of key point information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed comprises the following steps:
importing the multiple groups of key point information into a driving model obtained by pre-training;
and processing the multiple groups of key point information, and outputting a driving signal corresponding to a target object in the video frame to be processed after the processing of an adjacent video frame behind the video frame to be processed is finished.
In an optional embodiment, the method is applied to a live broadcast provider;
the step of processing the plurality of groups of key point information by using the driving model obtained by pre-training comprises the following steps:
acquiring equipment performance information of the live broadcast provider;
determining a driving model adaptive to the live broadcast providing terminal from a plurality of driving models obtained by pre-training according to the equipment performance information;
and processing the plurality of groups of key point information by using the determined driving model.
In an alternative embodiment, the method further comprises the step of pre-training the resulting driving model, the step comprising:
constructing a first network model and a second network model, wherein the first network model is a network model with a calculated amount larger than a first calculated amount, the second network model is a network model with a calculated amount smaller than a second calculated amount, and the first calculated amount is larger than the second calculated amount;
respectively processing each obtained training sample by using the first network model and the second network model to obtain corresponding output results;
and adjusting the model parameters of the second network model to reduce the difference of the output results of the second network model and the first network model, and continuing training until a driving model obtained by optimizing the second network model is obtained when the preset requirement is met.
In an alternative embodiment, the method further comprises the step of pre-training the resulting driving model, the step comprising:
acquiring a plurality of training samples, wherein each training sample comprises continuous multi-frame training video frames, and each training sample has a corresponding real driving signal;
leading each training sample into the constructed network model for training to obtain a corresponding output driving signal;
and performing minimization processing on a time sequence loss function constructed by the real driving signal and the output driving signal, and obtaining a driving model obtained by optimizing the network model when multiple times of iterative training are carried out until a set requirement is met.
In an alternative embodiment, the real drive signal and the output drive signal comprise six-dimensional spatial information;
the time sequence loss function is constructed by any one or more of six-dimensional space information contained in a real driving signal and an output driving signal of the multi-frame training video frame, 2D key point coordinate information obtained based on six-dimensional space information projection, and 3D key point coordinate information obtained based on six-dimensional space information projection.
In a second aspect, the present application provides a driving control apparatus based on timing information, the apparatus including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring continuous multi-frame video frames containing a target object, and the multi-frame video frames contain a video frame to be processed and an adjacent video frame positioned in front of the video frame to be processed in time sequence;
the extraction module is used for extracting key points of the target object in each video frame to obtain a plurality of groups of key point information;
the processing module is used for processing the multiple groups of key point information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed;
and the driving module is used for driving and controlling the target virtual image by using the driving signal.
In a third aspect, the present application provides an electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions that, when the electronic device runs, are executed by the processors to perform the method steps of any one of the preceding embodiments.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon machine-executable instructions which, when executed, implement the method steps of any one of the preceding embodiments.
The beneficial effects of the embodiment of the application include, for example:
the application provides a drive control method, a device, electronic equipment and a readable storage medium based on time sequence information, wherein a plurality of continuous multi-frame video frames containing a target object are obtained, the multi-frame video frames contain a video frame to be processed and an adjacent video frame which is positioned in front of the video frame to be processed in time sequence, key points of the target object in each video frame are extracted to obtain a plurality of groups of key point information, then the plurality of groups of key point information are processed by utilizing a drive model obtained through pre-training, a drive signal corresponding to the target object in the video frame to be processed is output, and then the drive control is carried out on a target virtual image based on the drive signal. In the scheme, the driving signal corresponding to the video frame to be processed is obtained by combining the video frame to be processed and the adjacent video frame, the driving signal error can be reduced by utilizing the context information of the continuous video frames on the time sequence, and the problem of low accuracy of the driving signal under the condition of missing or shaking of the video frame to be processed can be effectively solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic view of an application scenario of a driving control method based on timing information according to an embodiment of the present application;
fig. 2 is a flowchart of a driving control method based on timing information according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a streaming video frame provided by an embodiment of the present application;
FIG. 4 is a flow chart of a method for pre-training a driver model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a driving model trained by distillation loss method according to an embodiment of the present application;
FIG. 6 is another flow chart of a method for pre-training a driver model according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating one of the sub-steps included in step S230 of FIG. 2;
FIG. 8 is a second flowchart illustrating the sub-steps included in step S230 of FIG. 2;
FIG. 9 is a schematic diagram of an LSTM network model provided by an embodiment of the present application;
fig. 10 is a schematic diagram of a fully-connected network model provided by an embodiment of the present application;
FIG. 11 is a third flowchart illustrating the sub-steps included in step S230 of FIG. 2;
FIG. 12 is a fourth flowchart illustrating the sub-steps included in step S230 of FIG. 2;
fig. 13 is a block diagram of an electronic device according to an embodiment of the present application;
fig. 14 is a functional block diagram of a drive control device based on timing information according to an embodiment of the present application.
Icon: 100-live broadcast providing terminal; 110-a storage medium; 120-a processor; 130-a drive control means based on timing information; 131-an acquisition module; 132-an extraction module; 133-a processing module; 134-a drive module; 140-a communication interface; 200-a live broadcast server; 300-live broadcast receiving end.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is noted that the terms "first", "second", and the like are used merely for distinguishing between descriptions and are not intended to indicate or imply relative importance.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
As shown in fig. 1, an application scenario of the timing information-based driving control method according to the embodiment of the present application is schematically illustrated, where the application scenario may include a live broadcast providing end 100, a live broadcast receiving end 300, and a live broadcast server 200 in communication connection with the live broadcast providing end 100 and the live broadcast receiving end 300 respectively.
The live broadcast providing terminal 100 may be a terminal device (such as a mobile phone, a tablet computer, a computer, etc.) used by a main broadcast during live broadcast, and the live broadcast receiving terminal 300 may be a terminal device (such as a mobile phone, a tablet computer, a computer, etc.) used by an audience during live broadcast watching.
In this embodiment, a video capture device for capturing video frames of the anchor may be further included in the scene, and the video capture device may be, but is not limited to, a camera, a lens of a digital camera, a monitoring camera, a webcam, or the like.
The video capture device may be directly installed or integrated in the live broadcast provider 100. For example, the video capture device may be a camera configured on the live provider 100, and other modules or components in the live provider 100 may receive videos, images, and the like sent from the video capture device via the internal bus. Alternatively, the video capture device may be independent of the live broadcast provider 100, and the two devices may communicate with each other in a wired or wireless manner.
The live video provider 100 may send the live video stream to the live server 200, and the viewer may access the live server 200 through the live receiver 300 to watch the live video.
With reference to fig. 2, an embodiment of the present application further provides a timing information-based driving control method applicable to an electronic device, for performing driving control on an avatar in a live video. The electronic device may be the live broadcast provider 100 or the live broadcast server 200. The method steps defined by the flow relating to the timing information based drive control method may be implemented by the electronic device. The specific process shown in fig. 2 will be described in detail below.
Step S210, acquiring continuous multi-frame video frames containing a target object, wherein the multi-frame video frames contain a video frame to be processed and an adjacent video frame positioned in front of the video frame to be processed in time sequence.
Step S220, performing key point extraction on the target object in each video frame to obtain multiple groups of key point information.
And step S230, processing the multiple groups of key point information by using a driving model obtained by pre-training, and outputting a driving signal corresponding to a target object in the video frame to be processed.
And step S240, performing driving control on the target avatar by using the driving signal.
In a live application scenario, a user may use the live broadcast provider 100 to perform a webcast, and the target object may be, for example, the anchor. The video capture device may continuously capture the anchor's live video stream. The captured live video stream is a large number of fast, consecutive video frames in sequence, i.e., streaming video data.
In this embodiment, an avatar, such as a cartoon character avatar or an animal avatar, may be rendered and displayed in the live broadcast picture. The target avatar is an avatar associated with the target object; that is, the target avatar can be driven and controlled to follow the body motion of the target object and perform the same motion.
Optionally, the video data acquired by the video acquisition device may be sent to the live broadcast providing terminal 100 for analysis and processing, and then the target avatar is controlled based on the obtained driving signal. In addition, the video data collected by the video collecting device can also be sent to the live broadcast server 200 for analysis and processing, and the live broadcast server can control the virtual image in the corresponding live broadcast room to perform limb movement by using the obtained driving signal.
In this embodiment, when analyzing and processing the video data, the video data may be subjected to framing processing to obtain multiple frames of video frames. When the driving signal corresponding to a certain frame of video frame needs to be obtained, the information of the previous frames of video frames of the video frame can be combined and utilized to comprehensively obtain the driving signal of the frame of video frame.
In detail, for the processing of the video frame to be processed, it is also possible to obtain a neighboring video frame before the video frame to be processed, which may be a previous neighboring video frame or several previous neighboring video frames, as shown in fig. 3.
Each video frame may be imported into a key point detection network obtained through pre-training to obtain key point information of the target object, where the key points may include, for example, key points for arms, such as shoulder key points, elbow key points, wrist key points, and the like. In addition, key points such as for legs, torso, etc. may also be included. Each video frame can obtain a corresponding group of key point information, and the key point information can reflect the body movement of the target object.
In this embodiment, a driving model is obtained through pre-training, and multiple groups of key point information of the video frame to be processed and the adjacent video frame can be imported into the driving model for processing. The driving model can be combined with the key point information of the adjacent video frames and the key point information of the video frames to be processed to comprehensively obtain the driving signal corresponding to the target object in the video frames to be processed.
Finally, the target avatar may be subjected to driving control according to the driving signal, wherein the driving signal may be control information for corresponding key points of the target avatar, such as elbow key points, shoulder key points, and the like. So that the arms, torso, legs, etc. of the avatar can be driven to perform the same actions as the target object.
In this embodiment, the timing information of multiple frames of video frames is integrated to obtain the driving signal corresponding to the video frame to be processed in combination with the processing of the video frame to be processed and the adjacent video frame in front of the video frame to be processed. By combining the context information of the video signal, the error of the driving signal can be reduced, the consistency of the driving signal can be improved, and the problem of low accuracy of the driving signal under the condition of missing or shaking of a video frame to be processed can be effectively solved.
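To make the overall flow of steps S210 to S240 concrete, a minimal sketch in Python is given below; the keypoint network, driving model and avatar interfaces used here are illustrative assumptions rather than names from the patent.

```python
import torch

def drive_avatar_step(frame_buffer, keypoint_net, driving_model, avatar):
    """Drive the avatar for the newest (to-be-processed) frame in the buffer.

    frame_buffer: consecutive video frames, oldest first; the last entry is the
    video frame to be processed, the earlier ones are its adjacent frames.
    """
    # Step S220: extract one group of key point information per frame.
    keypoints = [keypoint_net(frame) for frame in frame_buffer]   # each: (num_kpts, 3) = (x, y, confidence)
    kp_seq = torch.stack(keypoints).unsqueeze(0)                  # (1, T, num_kpts, 3)

    # Step S230: the driving model consumes the whole sequence but outputs the
    # driving signal only for the last (to-be-processed) frame.
    drive_signal = driving_model(kp_seq)                          # e.g. (1, 24, 6) six-dimensional joint rotations

    # Step S240: apply the driving signal to the target avatar's joints.
    avatar.apply_joint_rotations(drive_signal.squeeze(0))
```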
In this embodiment, the driving model may be obtained by training in advance, and the obtaining process of the driving model is described first below.
The driving model in this embodiment may be a lightweight LSTM (Long Short-Term Memory) network model or a lightweight fully-connected network model, where lightweight means a model whose computation cost is below a certain amount. Existing driving models for processing video frames usually require a large amount of computation and are therefore difficult to apply on devices with low processing performance, such as terminal devices. They also fall short in real-time performance, making it difficult to meet the requirement of real-time drive control of the avatar in a live broadcast application scenario.
In this embodiment, the lightweight network model is applicable to various consumer-level terminal devices, and real-time control over the avatar can be achieved.
Because the lightweight network model is adopted, the accuracy of the output result of the model can be reduced to a certain degree after the network structure is simplified. In order to avoid the decrease of the accuracy of the output result of the model, the embodiment adopts a distillation loss mode to solve the problem. Referring to fig. 4, in the present embodiment, the driving model can be obtained by training in advance in the following manner.
Step S110A, a first network model and a second network model are constructed. The first network model is a network model with a calculated amount larger than a first calculated amount, the second network model is a network model with a calculated amount smaller than a second calculated amount, and the first calculated amount is larger than the second calculated amount.
Step S120A, processing each acquired training sample by using the first network model and the second network model respectively, and obtaining a corresponding output result.
Step S130A, adjusting the model parameters of the second network model to reduce the difference between the output results of the second network model and the first network model, and continuing training until a driving model optimized by the second network model is obtained when a preset requirement is met.
The unit of the calculation amount of the network model is GFlops, and generally, if there is no other difference, the output result accuracy of the network model with a large calculation amount is higher than that of the network model with a small calculation amount. In this embodiment, the constructed first network model may be a large model with a large calculation amount, and the second network model before the optimization of the driving model in this embodiment may be a small model with a small calculation amount.
A plurality of training samples can be collected in advance, and each training sample can be respectively led into the first network model and the second network model to be processed. The first network model and the second network model can respectively output the output results of the same training sample. Before optimizing the second network model, the output results of the first network model and the second network model for the same training sample should be different due to the difference in network structure.
In this embodiment, the purpose of training the second network model is to make its output as consistent as possible with the output of the first network model. Referring to fig. 5, taking the case where both the second network model and the first network model are LSTM models as an example, the first network model includes a large LSTM layer 1, a large LSTM layer 2 and a large fully-connected layer 1, and the second network model may include an LSTM layer 1, an LSTM layer 2 and a fully-connected layer 1.
Each training sample contains consecutive multi-frame video frames, and the key point information may be two-dimensional coordinate information. If a sample contains four consecutive video frames and each video frame has 7 key points, then for the second network model the input of LSTM layer 1 is 56-dimensional information and its output is 256-dimensional feature information, and the input of LSTM layer 2 is that 256-dimensional feature information and its output is 512-dimensional feature information. For the first network model, the input of large LSTM layer 1 is the 56-dimensional information and its output is 1024-dimensional feature information, and the input of large LSTM layer 2 is that 1024-dimensional feature information and its output is 512-dimensional feature information.
In this embodiment, the second network model may further include a fully connected layer 2, but in order to unify the outputs of the first network model and the second network model, when training the second network model, the output of the fully connected layer 1 is taken as a model output result, and is compared with the output result of the large fully connected layer 1 of the first network model.
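Using the dimensions quoted above, the lightweight second network model might be sketched roughly as follows (a PyTorch-style illustration; the activation functions, the way the frame window maps onto the LSTM sequence axis, and the returned intermediate feature are assumptions made for clarity):

```python
import torch.nn as nn

class SmallDrivingModel(nn.Module):
    """Lightweight second network (4 frames x 7 key points x 2D coordinates = 56-dim input)."""
    def __init__(self, in_dim=56, out_dim=144):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, 256, batch_first=True)   # LSTM layer 1: 56 -> 256
        self.lstm2 = nn.LSTM(256, 512, batch_first=True)      # LSTM layer 2: 256 -> 512
        self.fc1 = nn.Linear(512, 512)                         # fully-connected layer 1 (distillation comparison point)
        self.fc2 = nn.Sequential(                              # fully-connected layer 2 (three sub-layers)
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):
        # x: (batch, seq_len, 56) key point inputs for a window of consecutive frames
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(h)
        feat = self.fc1(h[:, -1])           # 512-dim feature for the to-be-processed frame
        return self.fc2(feat), feat          # 144-dim driving signal + feature used for distillation
```

The first (large) network model would follow the same layout with the 1024-dimensional intermediate width quoted above, so that its large fully-connected layer 1 also emits a feature that can be compared against the student's fully-connected layer 1 output.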
In this embodiment, a distillation loss function based on the output results of the first network model and the second network model may be constructed, for example as shown in the following formula:

\[
L_{distill} = \frac{1}{N}\sum_{i=1}^{N}\left\| \hat{y}_i^{S} - \hat{y}_i^{T} \right\|_2
\]

where N is the total feature dimension, \(\hat{y}^{S}\) represents the output result of the second network model, and \(\hat{y}^{T}\) represents the output result of the first network model. The distillation loss function characterizes the two-norm distance between the output results of the first network model and the second network model.
Based on the constructed distillation loss function, the loss is minimized and the model parameters of the second network model are adjusted iteratively. Training of the second network model may be stopped when the distillation loss function converges and no longer decreases, or when the number of training iterations reaches a set maximum.
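Under those definitions, the distillation step could look roughly like the sketch below (PyTorch-style; the teacher model interface, the optimizer and the learning rate are assumptions, and the loss follows the two-norm form described above):

```python
import torch

def distill(student, teacher, loader, epochs=50, lr=1e-4):
    """Train the small second network so its features match the frozen large first network."""
    teacher.eval()                                    # first network model: fixed, large-capacity teacher
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for kp_seq in loader:                         # kp_seq: (batch, seq_len, 56) key point sequences
            with torch.no_grad():
                _, feat_t = teacher(kp_seq)           # output of the teacher's large fully-connected layer 1
            _, feat_s = student(kp_seq)               # output of the student's fully-connected layer 1
            # Two-norm distance between the two feature vectors, averaged over the batch.
            loss = (feat_s - feat_t).pow(2).sum(dim=-1).sqrt().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```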
By training the second network model by means of the first network model with a large calculation amount, the driving model obtained by training the second network model with a small calculation amount can also have high accuracy. The obtained lightweight driving model can be suitable for terminal equipment with lower processing performance such as a live broadcast terminal and the like, and can meet the requirement of high real-time performance in a live broadcast application scene.
In this example, driving models trained with and without distillation were compared, and the comparison results are shown in Table 1. PA-MPJPE (Mean Per Joint Position Error after Procrustes Analysis alignment) is used as the comparison index; a smaller pa-mpjpe indicates a smaller result error. As can be seen from the data in Table 1, the pa-mpjpe obtained with distillation is smaller, indicating that the error of the results is smaller when distillation is used.
TABLE 1

Training approach       pa-mpjpe
Without distillation    46.77
With distillation       45.21
In this embodiment, besides making its output approach the accuracy of the large network model through the comparison described above, the driving model itself can also be trained based on the real labels of the training samples. Referring to fig. 6, as a possible implementation manner, the pre-training of the driving model may further include the following steps:
step S110B, obtaining a plurality of training samples, each training sample including a plurality of consecutive training video frames, each training sample having a corresponding real driving signal.
And step S120B, each training sample is led into the constructed network model for training to obtain a corresponding output driving signal.
Step S130B, performing minimization processing on the timing loss function constructed by the real driving signal and the output driving signal, and obtaining a driving model optimized by the network model when multiple times of iterative training are performed until a set requirement is met.
In this embodiment, each pre-collected training sample is marked with a real driving signal. For example, suppose a training sample includes four training video frames arranged in time sequence: a first, a second, a third and a fourth training video frame. If the third training video frame is the video frame to be recognized, the real driving signal is the driving signal corresponding to the body motion of the user in that video frame to be recognized.
And processing each training sample by using the constructed network model, wherein the network model can obtain an output driving signal of the training sample, namely the output driving signal corresponding to the video frame to be recognized in the training sample.
The aim of training the network model is to make the output of the network model as consistent as possible with the real label of the training sample. Thus, a timing loss function consisting of the real drive signal and the output drive signal can be constructed. The time sequence loss function can be minimized, model parameters of the network model are continuously adjusted in the training process, and when the set requirement is met, the driving model obtained by optimizing the network model can be obtained. The setting requirement may be, for example, that the timing loss function is converged and is not reduced, or that the training duration or the training times reaches a set maximum value.
In this embodiment, the finally obtained output driving signal may be six-dimensional spatial information, the driving signal may be continuously represented by using the six-dimensional spatial information, and the problem of too large data redundancy caused by too high dimensionality may be avoided. Accordingly, the real driving signal of each training video frame is also six-dimensional spatial information.
In the traditional mode, because a single-frame video frame is often processed, the constructed loss function is also constructed based on the output result and the real result of the single-frame video frame, and the information of other video frames is difficult to be combined to construct a comprehensive loss function. In this embodiment, the constructed timing loss function not only considers the video frame to be identified, but also combines information of other adjacent video frames.
In addition, the final output result is six-dimensional spatial information, and 2D and 3D keypoint coordinate information may be obtained through projection based on the six-dimensional spatial information.
Therefore, the constructed timing loss function can be constructed by any one or more of six-dimensional spatial information contained in the real driving signal and the output driving signal of the multi-frame training video frame, 2D key point coordinate information obtained based on six-dimensional spatial information projection, and 3D key point coordinate information obtained based on six-dimensional spatial information projection.
For example, a timing loss function constructed from the six-dimensional spatial information alone may be represented as follows:

\[
L_{pose} = \sum_{t=1}^{T}\sum_{n=1}^{N}\left\| Pose^{G}_{t,n} - Pose^{P}_{t,n} \right\|_2
\]

where T represents the total number of training video frames, N represents the number of key points in the driving signal, \(Pose^{G}\) represents the six-dimensional spatial information of the real driving signal, and \(Pose^{P}\) represents the six-dimensional spatial information of the output driving signal.
In addition, the timing loss function constructed from the 2D key point coordinate information obtained by projecting the six-dimensional spatial information alone can be represented by the following formula:

\[
L_{2D} = \sum_{t=1}^{T}\sum_{n=1}^{N_{2D}}\left\| J^{G}_{2D,t,n} - J^{P}_{2D,t,n} \right\|_2
\]

where \(N_{2D}\) represents the total number of 2D key points, \(J^{G}_{2D}\) represents the 2D key point coordinate information obtained by projecting the six-dimensional spatial information of the real driving signal, and \(J^{P}_{2D}\) represents the 2D key point coordinate information obtained by projecting the six-dimensional spatial information of the output driving signal.
Similarly, the timing loss function constructed from the 3D key point coordinate information obtained by projecting the six-dimensional spatial information alone can be represented by the following formula:

\[
L_{3D} = \sum_{t=1}^{T}\sum_{n=1}^{N_{3D}}\left\| J^{G}_{3D,t,n} - J^{P}_{3D,t,n} \right\|_2
\]

where \(N_{3D}\) represents the total number of 3D key points, \(J^{G}_{3D}\) represents the 3D key point coordinate information obtained by projecting the six-dimensional spatial information of the real driving signal, and \(J^{P}_{3D}\) represents the 3D key point coordinate information obtained by projecting the six-dimensional spatial information of the output driving signal.
In this embodiment, the training of the driving model may be performed based on any one of the timing loss functions, or may be performed by using any two or three of the timing loss functions. For example, when two or three of the timing loss functions are utilized, different weights can be set for different timing loss functions and then the functions are superposed to obtain a comprehensive timing loss function for training.
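A compact sketch of such a combined timing loss is given below (Python, operating on PyTorch tensors; the projection functions, the weights and the exact normalization are placeholders rather than values taken from the patent):

```python
def timing_loss(pred_6d, real_6d, project_2d, project_3d,
                w_pose=1.0, w_2d=1.0, w_3d=1.0):
    """Combined timing loss over T consecutive training video frames.

    pred_6d / real_6d: tensors of shape (T, N, 6) holding the output / real
    six-dimensional joint rotations; project_2d / project_3d map them to
    2D / 3D key point coordinates.
    """
    def l2(a, b):
        # Sum of two-norm distances over all frames and key points.
        return ((a - b) ** 2).sum(dim=-1).sqrt().sum()

    loss_pose = l2(pred_6d, real_6d)                          # six-dimensional term
    loss_2d = l2(project_2d(pred_6d), project_2d(real_6d))    # projected 2D key point term
    loss_3d = l2(project_3d(pred_6d), project_3d(real_6d))    # projected 3D key point term
    # Any single term may be used alone; when several are combined, each term
    # gets its own weight and the weighted terms are summed.
    return w_pose * loss_pose + w_2d * loss_2d + w_3d * loss_3d
```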
In this embodiment, model training is performed by using a timing loss function of continuous multi-frame training video frames, and compared with a timing loss function constructed by using a single-frame video frame, an alignment effect of an optimized model is better.
In addition, on the basis of a driving signal of six-dimensional space information, 2D key point coordinate information and 3D key point coordinate information obtained by projection are added into a loss function, so that the projected key point alignment result can be further improved, and the alignment effect of the target virtual image and the target object on the action can be improved subsequently.
The network model was trained, on the one hand, with the integrated timing loss function constructed from the six-dimensional spatial information, 2D key point coordinate information and 3D key point coordinate information of consecutive multi-frame training video frames, and, on the other hand, with a loss function constructed only from the six-dimensional spatial information of a single training video frame. The pa-mpjpe of the driving models obtained in each case is shown in Table 2 below. As the data show, the timing loss function provided by this embodiment yields a smaller pa-mpjpe, i.e., a lower joint position error, than the single-frame loss function.
TABLE 2 (pa-mpjpe comparison between the timing loss function and the single-frame loss function; the numerical values appear only as an image in the original publication)
In this embodiment, the information input to the network model is substantially the key point information of each training video frame, and the key point information of the training video frame can be obtained by using the constructed key point detection model for identification in advance. The obtained key point information includes coordinates of key points and confidence degrees of the key points. The confidence of the keypoints can characterize the accuracy of the keypoints, for example, a lower confidence indicates a lower accuracy of the keypoints, whereas a higher confidence indicates a higher accuracy of the keypoints.
Because part of the user's body in a training video frame may be occluded, or the jitter may be too severe, the obtained key point information may not be completely accurate. To enable the model to make more use of information from the adjacent video frames and thereby compensate for the interference caused by inaccurate key points in the video frame to be processed, this embodiment adopts a data enhancement strategy when the driving model is pre-trained: such inaccurate key points are marked so that the model can be trained to handle them specifically.
In this embodiment, the driving model is obtained by training the constructed network model in advance using training samples, and in a possible implementation manner, the training samples include positive samples where there is no keypoint with a confidence lower than a preset value, and negative samples where there is a keypoint with a confidence lower than a preset value. And the negative sample is obtained by randomly disturbing the coordinates of the key points in the positive sample.
In this embodiment, in order to make the network model able to learn some samples with inaccurate key points, a way of obtaining an output result by using more information in neighboring video frames of the samples is adopted for such samples, so that interference of the inaccurate key points on the result can be avoided.
When the confidence of a key point is lower than a preset value, the key point can be regarded as an inaccurate key point. The key points in the collected training samples are all accurate, so to construct negative samples, the coordinates of key points in part of the collected positive samples can be randomly adjusted, converting them into inaccurate key points. For example, a perturbation may be performed on 30% of the collected positive samples, such as randomly selecting several key points in a sample and adjusting their coordinates. Generating negative samples by perturbing part of the positive samples alleviates the shortage of negative-sample data in real scenes and allows well-defined negative samples to be generated in a targeted manner.
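One possible sketch of this perturbation step is shown below; which frame and how many key points are perturbed, and the noise magnitude, are assumptions made for illustration, while only the 30% sample ratio comes from the example above.

```python
import numpy as np

def make_negative_sample(positive_kps, num_perturb=2, noise_scale=0.1, rng=None):
    """Turn one positive sample into a negative sample by perturbing random key points.

    positive_kps: array of shape (T, num_kpts, 3) holding (x, y, confidence) per frame.
    Returns perturbed key points plus per-key-point labels (0 = accurate, 1 = inaccurate).
    """
    rng = rng or np.random.default_rng()
    kps = positive_kps.copy()
    labels = np.zeros(kps.shape[:2], dtype=np.int64)          # (T, num_kpts)
    t = kps.shape[0] - 1                                       # perturb the to-be-processed frame here
    idx = rng.choice(kps.shape[1], size=num_perturb, replace=False)
    kps[t, idx, :2] += rng.normal(scale=noise_scale, size=(num_perturb, 2))  # shift coordinates
    kps[t, idx, 2] = 0.0                                       # perturbed points get low confidence
    labels[t, idx] = 1                                         # mark them as inaccurate key points
    return kps, labels
```

During dataset construction, roughly 30% of the positive samples would be passed through such a routine to produce the negative samples.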
Each key point may be labeled when positive and negative samples are input to the network model. For example, accurate key points (confidence not below the preset value) are labeled 0, and inaccurate key points (confidence below the preset value) are labeled 1. In this embodiment, when negative samples are used to train the network model, the learning target of the network is not changed, so that when a negative sample contains inaccurate key points the network model learns to rely more on information from the adjacent video frames. Alternatively, the network model may determine whether a key point is accurate from its label. The training results of the network model under the various situations are compared in Table 3 below.
TABLE 3

Network architecture                                                pa-mpjpe
Single-frame network                                                52.31
LSTM network model, with data enhancement strategy                  46.77
LSTM network model, without data enhancement strategy               48.65
Fully-connected network model, with data enhancement strategy       48.34
Fully-connected network model, without data enhancement strategy    49.81
As can be seen from the data in the table, the network model trained on single video frames has the largest pa-mpjpe, i.e., the largest error and the worst effect, whereas the network model trained on multi-frame video frames with timing information and with inaccurate key points marked has the smallest pa-mpjpe, i.e., the smallest error and the best effect.
The process is a process of obtaining the driving model through pre-training, and after the driving model is obtained, the driving model can be used in a live broadcast application scene and is used for driving the control of the virtual image in real time based on the body actions of the user.
As can be seen from the above, in this embodiment, the driving model may be an LSTM network model or a fully connected network model. The LSTM network model has a better extraction effect on the time sequence information, so that the overall performance result is better. However, the deployment compatibility of the LSTM network model is poor, and it is difficult to apply to various terminal devices. The fully-connected network model can also extract timing information, but the timing characteristics extracted are inferior to the LSTM network model. The fully connected network model facilitates deployment to a variety of end devices.
Therefore, referring to fig. 7, after a plurality of different driving models are obtained by training in advance, as a possible implementation manner, when the driving model is used to process the key point information in the step S230, the following manner may be implemented:
step S231A, acquiring the device performance information of the live broadcast provider 100.
Step S232A, determining a driving model adapted to the live broadcast provider 100 from a plurality of driving models obtained by pre-training according to the device performance information.
Step S233A, the sets of keypoint information are processed using the determined driving model.
In this embodiment, since the live broadcast providers 100 used by different anchors may have different performance, the corresponding optimal driving models also differ. In order for the driving model to run normally on the terminal device and achieve a good effect, so that the user gets the most suitable model without being aware of it, the adapted driving model may be determined based on the device performance information, and that adapted driving model is then used to process the key point information.
Optionally, the device performance information may include information such as the graphics card and the CPU. For example, if the terminal device has a graphics card above the Nvidia GTX 1050, or its CPU is an AMD Ryzen 5 3600 or an Intel Core i7-8700, the LSTM network model with better performance may be determined as the adapted model; for other terminal devices, the fully connected network model may be determined as the adapted model.
It should be noted that the device performance information and the adaptation rule for the terminal device are only examples, and may be set according to actual requirements during implementation, and the application is not limited thereto.
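As an illustration only, the adaptation rule from the example above might be sketched like this (the device lists and matching logic are placeholders, not part of the patent):

```python
def select_driving_model(device_info, lstm_model, fc_model):
    """Pick the driving model adapted to the live broadcast provider's hardware.

    device_info: e.g. {"gpu": "Nvidia GTX 1060", "cpu": "Intel Core i7-8700"}.
    """
    strong_gpus = ("gtx 1050", "gtx 1060", "gtx 1070", "rtx 2060", "rtx 3060")  # illustrative list
    strong_cpus = ("ryzen 5 3600", "i7-8700")                                   # illustrative list
    gpu = device_info.get("gpu", "").lower()
    cpu = device_info.get("cpu", "").lower()
    if any(g in gpu for g in strong_gpus) or any(c in cpu for c in strong_cpus):
        return lstm_model   # better timing-feature extraction, higher compute cost
    return fc_model         # easier to deploy, runs on weaker terminal devices
```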
In this embodiment, after obtaining the key point information of consecutive multi-frame video frames, the driving model may extract the time sequence information thereof for processing, referring to fig. 8, in a possible implementation manner, when the driving model is used to process the key point information in step S230, the following manner may be implemented:
step S231B, importing the multiple sets of key point information into a driving model obtained by training in advance.
Step S232B, for any target video frame, obtaining a previous frame of the target video frame, and obtaining a status feature corresponding to the key point information of the previous frame, where the target video frame is any adjacent video frame or a video frame to be processed.
Step S233B, obtaining the state feature of the target video frame according to the state feature of the previous frame and the key point information of the target video frame.
Step S234B, outputting a driving signal corresponding to the target object in the video frame to be processed according to the status feature of the video frame to be processed.
The driving model can be an LSTM model or a fully connected model; the following description takes the LSTM network model as an example, as shown in fig. 9. The LSTM model includes a plurality of network layers, for example a first LSTM layer, a second LSTM layer, a first fully-connected layer and a second fully-connected layer. The acquired consecutive multi-frame video frames include video frame t-2, video frame t-1, video frame t and video frame t+1, arranged sequentially in time, where video frame t may be the video frame to be processed. First, the key points of each video frame can be extracted with a key point identification model, giving 4 groups of key point information.
And importing the obtained 4 groups of key point information into an LSTM model, and processing each group of key point information by utilizing a first LSTM layer, a second LSTM layer, a first full-connection layer and a second full-connection layer respectively. When each network layer of the LSTM model performs processing, for example, for any target video frame (any adjacent video frame or video frame to be processed), the input of the network layer is the state feature of the previous frame of the target video frame, the key point information of the target video frame, and the output is the state feature of the target video frame. That is, each video frame needs to be processed in combination with the intermediate state of the previous frame, and the information of the previous frame is utilized. Therefore, the output result can be obtained by combining the context information of the multi-frame video frame, and the consistency of the result is improved.
In detail, the input of the first LSTM layer is 56-dimensional information (4 frames × 7 keypoints × two-dimensional coordinate information), and the output is 256-dimensional features. The input to the second LSTM layer is a 256-dimensional feature and the output is a 512-dimensional feature. The output of each LSTM layer is two groups of multi-dimensional feature vectors, and the two groups of feature vectors respectively represent the characteristics of a memory gate and a forgetting gate of the LSTM network, so that the LSTM network has long-term memory capacity.
In practice, for every video frame except the first, the input when the frame is processed includes the state feature of the previous frame; because the first video frame has no previous frame, its input state feature is 0.
The 512-dimensional state features output by the second LSTM layer carry timing information and are input into the first fully-connected layer. The core operation of the full connection layer is matrix vector multiplication, and the setting of matrix parameters in the network layer can be continuously adjusted by training the full connection layer, so that the obtained multiplication result is continuously close to real driving information. The input and output of the full connection layer are vectors, the stored parameters are network layer matrixes, and the actual operation can be simplified into that the input vectors are multiplied by the network layer matrixes to obtain output results.
The first fully connected layer outputs 512-dimensional abstract feature data, which is input into the second fully connected layer, and a 144-dimensional driving signal is finally output. The driving signal is six-dimensional spatial information per joint; for example, for the 24 joint points of the avatar, the obtained driving signal is a 24 × 6 signal.
The second fully connected layer has a three-layer structure: the first layer maps the 512-dimensional abstract feature data to 512-dimensional feature information, the second layer maps the 512-dimensional feature information to 256-dimensional feature information, and the third layer maps the 256-dimensional feature information to the 144-dimensional driving signal.
In this embodiment, when any video frame is processed in the above manner, the second LSTM layer outputs the state feature of the video frame to be processed. Based on this state feature, the driving signal corresponding to the target object in the video frame to be processed is output after processing by the first fully connected layer and the second fully connected layer.
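The following is a minimal sketch of such an LSTM driving model, written in PyTorch. It is an illustration under stated assumptions, not the patented implementation: the per-frame 7 × 2 key point vectors are assumed to be fed to the LSTM one frame at a time, the layer sizes follow the description above (LSTM 256, LSTM 512, then fully connected layers 512→512 and 512→512→256→144), and the ReLU activations and the class and parameter names are illustrative only.

import torch
import torch.nn as nn

class LSTMDrivingModel(nn.Module):
    def __init__(self, num_keypoints=7, num_joints=24):
        super().__init__()
        in_dim = num_keypoints * 2                         # 2D coordinates per key point
        self.lstm1 = nn.LSTM(in_dim, 256, batch_first=True)   # first LSTM layer
        self.lstm2 = nn.LSTM(256, 512, batch_first=True)       # second LSTM layer
        self.fc1 = nn.Linear(512, 512)                          # first fully connected layer
        self.fc2 = nn.Sequential(                               # second fully connected layer (3 sub-layers)
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 6),                     # 24 joints x 6D rotation = 144
        )

    def forward(self, keypoints):
        # keypoints: (batch, frames, 14); the initial state features default to zeros,
        # matching the rule that the first frame has no previous-frame state.
        h1, _ = self.lstm1(keypoints)
        h2, _ = self.lstm2(h1)                                  # state features carrying timing information
        feat = self.fc1(h2[:, -2])                              # state feature of the frame to be processed (frame t)
        return self.fc2(feat)                                   # 144-dimensional driving signal

model = LSTMDrivingModel()
frames = torch.randn(1, 4, 14)        # key points of video frames t-2, t-1, t, t+1
signal = model(frames)                # shape (1, 144)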
In addition, when the multi-frame video frames are processed by the fully connected model, referring to fig. 10, the fully connected model may include three fully connected layers. The model concatenates the input sets of key point information in time order, i.e. the input is 56-dimensional information. The three fully connected layers map 56 dimensions to 256, 256 dimensions to 256, and 256 dimensions to 144 respectively, and the final 144-dimensional output is the driving signal in six-dimensional spatial information.
It should be noted that the processing principle of the fully connected model is similar to that of the LSTM model, and therefore, the specific processing procedure may refer to the description of the processing procedure of the LSTM model, which is not described herein again.
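For reference, a minimal sketch of this fully connected variant in PyTorch is given below; the ReLU activations are an assumption, and only the layer widths follow the description above.

import torch
import torch.nn as nn

fc_model = nn.Sequential(
    nn.Linear(56, 256), nn.ReLU(),    # 4 frames x 7 key points x 2D coordinates, concatenated in time order
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 144),              # 24 joints x 6D rotation
)

frames = torch.randn(1, 4, 14)                    # key points of 4 consecutive frames
signal = fc_model(frames.flatten(start_dim=1))    # (1, 144) driving signal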
In this embodiment, the single-frame processing case, the timing-information-based LSTM model and the timing-information-based fully connected model are compared; see table 4. As shown in table 4, although the LSTM model and the fully connected model based on timing information take slightly more time per frame than single-frame processing, the time consumption still meets the real-time requirement, and the prediction accuracy of their output results is considerably better than that of single-frame processing.
TABLE 4
                           Single-frame model   Timing model (fully connected)   Timing model (LSTM)
PA-MPJPE                   53.23                48.34                            46.77
Average time per frame     12.3 ms              13.5 ms                          14.8 ms
Common driving signal representations include quaternions (4 dimensions), axis-angle (3 dimensions) and rotation matrices (9 dimensions). In this embodiment, driving the avatar joints requires rotations in three-dimensional space. Among representations of 3D rotations, at least 5 dimensions are required for the driving signal to be represented continuously; therefore six-dimensional spatial information is chosen for drive control in this embodiment. The six-dimensional representation converts conveniently to quaternions (and quaternions are convenient to pass to the rendering engine for avatar driving in practice), and the network performs better in practice. Moreover, since higher-dimensional representations carry more data redundancy, which makes model training unstable, this embodiment does not use higher-dimensional representations such as 9-dimensional information for driving.
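As an illustration of why the six-dimensional representation converts conveniently, the sketch below recovers a rotation matrix from a 6D vector by Gram-Schmidt orthogonalization and then a quaternion from that matrix. This is one common way such a continuous 6D rotation encoding is realised and is given only as an assumption; the patent does not fix the exact construction, and the scipy call is used purely for the matrix-to-quaternion step.

import numpy as np
from scipy.spatial.transform import Rotation

def six_d_to_matrix(d6):
    # d6: the 6 values of one joint from the 24 x 6 driving signal,
    # interpreted here as the first two columns of a rotation matrix.
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalise the first column
    a2 = a2 - np.dot(b1, a2) * b1           # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)            # second, orthonormal column
    b3 = np.cross(b1, b2)                   # third column completes a valid rotation
    return np.stack([b1, b2, b3], axis=1)

joint_6d = np.array([1.0, 0.1, 0.0, 0.0, 1.0, 0.2])               # illustrative values for one joint
quat = Rotation.from_matrix(six_d_to_matrix(joint_6d)).as_quat()  # quaternion for the render engine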
In this embodiment, in a live streaming application scenario, real-time performance is an important metric; however, if the smoothness of the driving control can be improved without the audience perceiving any delay, the viewing experience of the user is further improved. In view of this, referring to fig. 11, when the driving signal is output based on the driving model in step S230, the following steps may be performed:
step S231C, importing the multiple sets of key point information into a driving model obtained by training in advance.
Step S232C, processing the multiple sets of key point information, and outputting a driving signal corresponding to a target object in the video frame to be processed after processing of an adjacent video frame located after the video frame to be processed is completed.
In this embodiment, when a driving signal corresponding to a video frame to be processed needs to be obtained, several adjacent video frames preceding the video frame to be processed in time order and adjacent video frames following it may be imported into the pre-trained driving model together with the video frame to be processed.
In this embodiment, in order to improve the fluency of the final output, the driving signal of the video frame to be processed is not output to drive the avatar as soon as it is obtained; instead, it is output after the processing of the subsequent adjacent video frame is completed. Delaying the output for a short period in this way effectively improves the fluency between the output results of successive frames.
In order to prevent the user from perceiving the delay, the adjacent video frame after the video frame to be processed may be a single frame, i.e. the output may be delayed by one frame. For example, as shown in fig. 9, the multi-frame video frames input into the driving model may be video frame t-2, video frame t-1, video frame t and video frame t+1, where video frame t is the video frame to be processed, video frames t-2 and t-1 are adjacent video frames preceding it, and video frame t+1 is the adjacent video frame following it. After the processing of video frame t+1 is completed, the driving signal of video frame t may be output to drive the avatar.
Thus, the smoothness of the drive control can be improved while the user is not aware of the delay.
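A minimal sketch of this one-frame-delayed output is given below; the extract_keypoints, run_driving_model and drive_avatar helpers are hypothetical placeholders standing in for the key point model, the driving model and the render engine.

from collections import deque

# Hypothetical stand-ins for the key point model, the driving model and the render engine.
extract_keypoints = lambda frame: frame          # placeholder
run_driving_model = lambda window: window[-2]    # placeholder: returns the signal for frame t
drive_avatar = print                             # placeholder

window = deque(maxlen=4)                         # key points of frames t-2, t-1, t, t+1

def on_new_frame(frame):
    window.append(extract_keypoints(frame))
    if len(window) == window.maxlen:
        # The newest frame is t+1, so the signal emitted here belongs to frame t:
        # the avatar is driven with a one-frame delay the viewer does not notice.
        drive_avatar(run_driving_model(list(window)))

for f in range(8):                               # simulated frame stream
    on_new_frame(f)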
As can be seen from the above, when the driving model is trained in advance, positive samples and negative samples can be labeled, so that the model learns to use more information from the preceding and following frames for negative samples. Accordingly, when the driving model processes video frames in real time, it can rely more on the adjacent video frames to handle inaccurate key points in a video frame, avoiding their influence on the result. Based on this, referring to fig. 12, in a possible implementation, when the key point information is processed with the driving model in step S230, the following manner may be used:
Step S231D, importing the multiple sets of key point information into a pre-trained driving model, and, when there is an untrusted key point in the video frame to be processed, reducing the weight occupied by the key point coordinates of the video frame to be processed and increasing the weight occupied by the key point coordinates of each adjacent video frame. The untrusted key points are key points whose confidence is lower than a preset value.
Step S232D, based on the multiple sets of key point information after the weight adjustment, outputting a driving signal corresponding to the target object in the video frame to be processed.
In this embodiment, a key point detection model is first used to perform key point detection on each video frame, and the obtained key point information includes the coordinates and confidence of each key point. Key points whose confidence is lower than the preset value are untrusted key points, for example key points that may be occluded, or key points that are hard to locate accurately due to severe shaking.
After the key point information of the adjacent video frames and of the video frame to be processed is imported into the driving model, the driving model can identify the untrusted key points based on the confidence values in the key point information. When the video frame to be processed contains untrusted key points, the model can reduce the weight occupied by the key point coordinates of the video frame to be processed and increase the weight occupied by the key points of the adjacent video frames.
In this way, the driving model relies more on the key point information of the adjacent video frames when obtaining the driving signal, avoiding the influence of the untrusted key points in the video frame to be processed on the result.
Since each video frame contains a plurality of key points, possibly only some of them are untrusted. For example, the video frame to be processed contains 7 key points, 2 of which are untrusted. If the weights of all key point coordinates in the video frame to be processed were reduced, the remaining key points that are not untrusted would also be affected.
Based on this consideration, in this embodiment, when there is an untrusted key point in the video frame to be processed, the reference key point corresponding to the untrusted key point in each adjacent video frame may be determined, the weight occupied by the coordinates of the untrusted key point is reduced, and the weight occupied by the coordinates of the reference key points is increased.
In this way, the weights are adjusted only for the untrusted key points and their corresponding reference key points, which avoids needless interference with the other, accurate key points and further improves the accuracy of the output result.
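The weighting itself is something the trained driving model applies internally; the sketch below is only an explicit numpy illustration of the idea, under the assumptions that key points are given as (x, y, confidence) rows, the confidence threshold is 0.5 and the reduced weight is 0.2 (all illustrative values).

import numpy as np

def keypoint_weights(frames_kp, target_idx, threshold=0.5, low=0.2):
    # frames_kp: (num_frames, num_keypoints, 3) with columns x, y, confidence.
    weights = np.ones(frames_kp.shape[:2])
    untrusted = frames_kp[target_idx, :, 2] < threshold        # untrusted key points of frame t
    weights[target_idx, untrusted] = low                        # weight their coordinates down
    others = [i for i in range(frames_kp.shape[0]) if i != target_idx]
    # Spread the removed weight over the matching reference key points of the adjacent frames.
    weights[np.ix_(others, np.where(untrusted)[0])] += (1.0 - low) / max(len(others), 1)
    return weights

frames_kp = np.random.rand(4, 7, 3)            # frames t-2, t-1, t, t+1 with 7 key points each
w = keypoint_weights(frames_kp, target_idx=2)  # frame t is the video frame to be processed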
Referring to fig. 13, a schematic diagram of exemplary components of an electronic device according to an embodiment of the present disclosure is shown, where the electronic device may be the live broadcast provider 100 or the live broadcast server 200 shown in fig. 1, and the electronic device may include a storage medium 110, a processor 120, a driving control device 130 based on timing information, and a communication interface 140. In this embodiment, the storage medium 110 and the processor 120 are both located in the electronic device and are separately disposed. However, it should be understood that the storage medium 110 may be separate from the electronic device and may be accessed by the processor 120 through a bus interface. Alternatively, the storage medium 110 may be integrated into the processor 120, for example, may be a cache and/or general purpose registers.
The driving control device 130 based on the timing information may be understood as the electronic device or the processor 120 of the electronic device, or may be understood as a software functional module that is independent of the electronic device or the processor 120 and implements the driving control method based on the timing information under the control of the electronic device.
As shown in fig. 14, the timing information-based driving control apparatus 130 may include an obtaining module 131, an extracting module 132, a processing module 133, and a driving module 134, and the functions of the functional modules of the timing information-based driving control apparatus 130 are described in detail below.
An obtaining module 131, configured to obtain a continuous multi-frame video frame including a target object, where the multi-frame video frame includes a to-be-processed video frame and an adjacent video frame located before the to-be-processed video frame in time sequence;
it is understood that the obtaining module 131 may be configured to perform the step S210, and for a detailed implementation of the obtaining module 131, reference may be made to the content related to the step S210.
An extracting module 132, configured to perform key point extraction on the target object in each video frame to obtain multiple sets of key point information;
it is understood that the extracting module 132 can be used to execute the step S220, and for the detailed implementation of the extracting module 132, reference can be made to the above-mentioned contents related to the step S220.
The processing module 133 is configured to process the multiple sets of key point information by using a driving model obtained through pre-training, and output a driving signal corresponding to a target object in the video frame to be processed;
it is understood that the processing module 133 can be used to execute the step S230, and for the detailed implementation of the processing module 133, reference can be made to the content related to the step S230.
And a driving module 134 for performing driving control on the target avatar by using the driving signal.
It is understood that the driving module 134 may be configured to perform the step S240, and for the detailed implementation of the driving module 134, reference may be made to the content related to the step S240.
In a possible implementation manner, the keypoint information includes coordinates and confidence degrees of keypoints, and the processing module 133 may specifically be configured to:
importing the multiple groups of key point information into a driving model obtained by pre-training, and reducing the weight occupied by the key point coordinates of the video frame to be processed and increasing the weight occupied by the key point coordinates of each adjacent video frame when the video frame to be processed has an unreliable key point, wherein the unreliable key point is a key point with the confidence coefficient lower than a preset value;
and outputting a driving signal corresponding to the target object in the video frame to be processed based on the multiple groups of key point information after the weight adjustment.
In a possible implementation, the processing module 133 may specifically be configured to determine the adjustment weight by:
determining a reference key point corresponding to the non-trusted key point in each adjacent video frame;
and reducing the weight occupied by the coordinates of the untrusted key points, and increasing the weight occupied by the coordinates of each reference key point.
In a possible implementation manner, the driving model is obtained by training a constructed network model by using a training sample in advance;
the training samples comprise positive samples without key points with confidence degrees lower than a preset value and negative samples with key points with confidence degrees lower than the preset value, wherein the negative samples are obtained after the key point coordinates in the positive samples are randomly disturbed.
In a possible implementation manner, the processing module 133 may specifically be configured to:
importing the multiple groups of key point information into a driving model obtained by pre-training;
aiming at any target video frame, obtaining a previous frame of the target video frame, and obtaining state characteristics corresponding to key point information of the previous frame, wherein the target video frame is any adjacent video frame or a video frame to be processed;
obtaining the state characteristics of the target video frame according to the state characteristics of the previous frame and the key point information of the target video frame;
and outputting a driving signal corresponding to a target object in the video frame to be processed according to the state characteristic of the video frame to be processed.
In a possible implementation manner, the multi-frame video frame further includes an adjacent video frame chronologically after the video frame to be processed; the processing module 133 may specifically be configured to:
importing the multiple groups of key point information into a driving model obtained by pre-training;
and processing the multiple groups of key point information, and outputting a driving signal corresponding to a target object in the video frame to be processed after the processing of an adjacent video frame behind the video frame to be processed is finished.
In a possible implementation manner, the processing module 133 may specifically be configured to:
acquiring device performance information of the live broadcast provider 100;
determining a driving model adapted to the live broadcast providing terminal 100 from a plurality of driving models obtained through pre-training according to the device performance information;
and processing the plurality of groups of key point information by using the determined driving model.
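A minimal sketch of this device-adapted model selection follows; the performance fields, thresholds and model file names are hypothetical.

PRETRAINED_MODELS = {                          # hypothetical registry of pre-trained driving models
    "large": "lstm_driving_large.pt",
    "small": "fc_driving_small.pt",
}

def select_driving_model(device_info):
    # device_info: e.g. {"gpu": True, "cpu_cores": 8, "memory_gb": 16}
    if device_info.get("gpu") or device_info.get("cpu_cores", 0) >= 8:
        return PRETRAINED_MODELS["large"]      # higher-capacity model for strong devices
    return PRETRAINED_MODELS["small"]          # lighter model keeps inference real-time

model_path = select_driving_model({"gpu": False, "cpu_cores": 4, "memory_gb": 8})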
In a possible implementation manner, the driving control device 130 based on the timing information further includes a training module, and the training module may be configured to:
constructing a first network model and a second network model, wherein the first network model is a network model whose computation amount is larger than a first computation amount, the second network model is a network model whose computation amount is smaller than a second computation amount, and the first computation amount is larger than the second computation amount;
respectively processing each obtained training sample by using the first network model and the second network model to obtain corresponding output results;
and adjusting the model parameters of the second network model to reduce the difference of the output results of the second network model and the first network model, and continuing training until a driving model obtained by optimizing the second network model is obtained when the preset requirement is met.
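This large-to-small training resembles knowledge distillation; the sketch below is a minimal PyTorch illustration under the assumptions that the first (large) network model acts as a fixed teacher whose outputs the second (small) network model is pushed towards, and that the layer widths, optimizer and loss are illustrative only.

import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(56, 1024), nn.ReLU(), nn.Linear(1024, 144))  # first (large) network model
student = nn.Sequential(nn.Linear(56, 256), nn.ReLU(), nn.Linear(256, 144))    # second (small) network model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):                         # illustrative training loop
    batch = torch.randn(32, 56)                 # stand-in for key point training samples
    with torch.no_grad():
        target = teacher(batch)                 # output result of the first network model
    loss = nn.functional.mse_loss(student(batch), target)   # shrink the output difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()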
In one possible implementation, the training module may be further configured to:
acquiring a plurality of training samples, wherein each training sample comprises continuous multi-frame training video frames, and each training sample has a corresponding real driving signal;
leading each training sample into the constructed network model for training to obtain a corresponding output driving signal;
and performing minimization processing on a time sequence loss function constructed by the real driving signal and the output driving signal, and obtaining a driving model obtained by optimizing the network model when multiple times of iterative training are carried out until a set requirement is met.
In one possible implementation, the real drive signal and the output drive signal comprise six-dimensional spatial information;
the time sequence loss function is constructed by any one or more of six-dimensional space information contained in a real driving signal and an output driving signal of the multi-frame training video frame, 2D key point coordinate information obtained based on six-dimensional space information projection, and 3D key point coordinate information obtained based on six-dimensional space information projection.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Further, an embodiment of the present application also provides a computer-readable storage medium, where machine-executable instructions are stored, and when the machine-executable instructions are executed, the method for driving and controlling based on timing information provided in the foregoing embodiment is implemented.
Specifically, the computer-readable storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when executed, the computer program on the computer-readable storage medium can execute the above-described drive control method based on the timing information. With regard to the processes involved when the executable instructions in the computer-readable storage medium are executed, reference may be made to the related descriptions in the above method embodiments, which are not described in detail herein.
In summary, the present application provides a drive control method and apparatus based on timing information, an electronic device and a readable storage medium. Continuous multi-frame video frames containing a target object are obtained, the multi-frame video frames containing a video frame to be processed and adjacent video frames preceding it in time order; key points of the target object are extracted from each video frame to obtain multiple sets of key point information; the multiple sets of key point information are processed with a pre-trained driving model to output a driving signal corresponding to the target object in the video frame to be processed; and the target avatar is then drive-controlled based on the driving signal. In this scheme, the driving signal corresponding to the video frame to be processed is obtained by combining the video frame to be processed with its adjacent video frames, so the context information of consecutive video frames in time order can be used to reduce driving signal errors, which effectively alleviates the problem of low driving-signal accuracy in the case of missing or shaking key points in the video frame to be processed.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for driving control based on timing information, the method comprising:
acquiring continuous multi-frame video frames containing a target object, wherein the multi-frame video frames contain a video frame to be processed and an adjacent video frame positioned in front of the video frame to be processed in time sequence;
extracting key points of the target object in each video frame to obtain a plurality of groups of key point information;
processing the multiple groups of key point information by using a driving model obtained by pre-training, and outputting a driving signal corresponding to a target object in the video frame to be processed;
and performing drive control on the target virtual image by using the drive signal.
2. The timing information based drive control method according to claim 1, wherein the key point information includes coordinates and confidence degrees of key points;
the step of processing the multiple groups of key point information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed comprises the following steps:
importing the multiple groups of key point information into a driving model obtained by pre-training, and reducing the weight occupied by the key point coordinates of the video frame to be processed and increasing the weight occupied by the key point coordinates of each adjacent video frame when the video frame to be processed has an unreliable key point, wherein the unreliable key point is a key point with the confidence coefficient lower than a preset value;
and outputting a driving signal corresponding to the target object in the video frame to be processed based on the multiple groups of key point information after the weight adjustment.
3. The method according to claim 2, wherein the step of reducing the weight of the key point coordinates of the video frame to be processed and increasing the weight of the key point coordinates of each of the adjacent video frames comprises:
determining a reference key point corresponding to the non-trusted key point in each adjacent video frame;
and reducing the weight occupied by the coordinates of the untrusted key points, and increasing the weight occupied by the coordinates of each reference key point.
4. The drive control method based on the time sequence information as claimed in claim 2, wherein the drive model is obtained by training a constructed network model with a training sample in advance;
the training samples comprise positive samples without key points with confidence degrees lower than a preset value and negative samples with key points with confidence degrees lower than the preset value, wherein the negative samples are obtained after the key point coordinates in the positive samples are randomly disturbed.
5. The method according to claim 1, wherein the step of processing the plurality of sets of keypoint information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed comprises:
importing the multiple groups of key point information into a driving model obtained by pre-training;
aiming at any target video frame, obtaining a previous frame of the target video frame, and obtaining state characteristics corresponding to key point information of the previous frame, wherein the target video frame is any adjacent video frame or a video frame to be processed;
obtaining the state characteristics of the target video frame according to the state characteristics of the previous frame and the key point information of the target video frame;
and outputting a driving signal corresponding to a target object in the video frame to be processed according to the state characteristic of the video frame to be processed.
6. The driving control method based on timing information as claimed in claim 1, wherein the multi-frame video frame further includes an adjacent video frame located chronologically after the video frame to be processed;
the step of processing the multiple groups of key point information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed comprises the following steps:
importing the multiple groups of key point information into a driving model obtained by pre-training;
and processing the multiple groups of key point information, and outputting a driving signal corresponding to a target object in the video frame to be processed after the processing of an adjacent video frame behind the video frame to be processed is finished.
7. The driving control method based on the time sequence information as claimed in claim 1, wherein the method is applied to a live broadcast providing terminal;
the step of processing the plurality of groups of key point information by using the driving model obtained by pre-training comprises the following steps:
acquiring equipment performance information of the live broadcast provider;
determining a driving model adaptive to the live broadcast providing terminal from a plurality of driving models obtained by pre-training according to the equipment performance information;
and processing the plurality of groups of key point information by using the determined driving model.
8. The method for controlling driving based on timing information as claimed in claim 1, wherein the method further comprises a step of training a driving model in advance, the step comprising:
constructing a first network model and a second network model, wherein the first network model is a network model whose computation amount is larger than a first computation amount, the second network model is a network model whose computation amount is smaller than a second computation amount, and the first computation amount is larger than the second computation amount;
respectively processing each obtained training sample by using the first network model and the second network model to obtain corresponding output results;
and adjusting the model parameters of the second network model to reduce the difference of the output results of the second network model and the first network model, and continuing training until a driving model obtained by optimizing the second network model is obtained when the preset requirement is met.
9. The method for controlling driving based on timing information as claimed in claim 1, wherein the method further comprises a step of training a driving model in advance, the step comprising:
acquiring a plurality of training samples, wherein each training sample comprises continuous multi-frame training video frames, and each training sample has a corresponding real driving signal;
leading each training sample into the constructed network model for training to obtain a corresponding output driving signal;
and performing minimization processing on a time sequence loss function constructed by the real driving signal and the output driving signal, and obtaining a driving model obtained by optimizing the network model when multiple times of iterative training are carried out until a set requirement is met.
10. The timing information based drive control method according to claim 9, wherein the real drive signal and the output drive signal include six-dimensional spatial information;
the time sequence loss function is constructed by any one or more of six-dimensional space information contained in a real driving signal and an output driving signal of the multi-frame training video frame, 2D key point coordinate information obtained based on six-dimensional space information projection, and 3D key point coordinate information obtained based on six-dimensional space information projection.
11. A drive control apparatus based on timing information, characterized in that the apparatus comprises:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring continuous multi-frame video frames containing a target object, and the multi-frame video frames contain a video frame to be processed and an adjacent video frame positioned in front of the video frame to be processed in time sequence;
the extraction module is used for extracting key points of the target object in each video frame to obtain a plurality of groups of key point information;
the processing module is used for processing the multiple groups of key point information by using a driving model obtained by pre-training and outputting a driving signal corresponding to a target object in the video frame to be processed;
and the driving module is used for driving and controlling the target virtual image by using the driving signal.
12. An electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions that, when the electronic device runs, are executed by the processors to perform the method steps of any one of claims 1-10.
13. A computer-readable storage medium, characterized in that it stores machine-executable instructions which, when executed, implement the method steps of any one of claims 1-10.
CN202110788537.9A 2021-07-13 2021-07-13 Drive control method and device based on time sequence information, electronic equipment and readable storage medium Active CN113556600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110788537.9A CN113556600B (en) 2021-07-13 2021-07-13 Drive control method and device based on time sequence information, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110788537.9A CN113556600B (en) 2021-07-13 2021-07-13 Drive control method and device based on time sequence information, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113556600A true CN113556600A (en) 2021-10-26
CN113556600B CN113556600B (en) 2023-08-18

Family

ID=78131666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110788537.9A Active CN113556600B (en) 2021-07-13 2021-07-13 Drive control method and device based on time sequence information, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113556600B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023071801A1 (en) * 2021-10-29 2023-05-04 上海商汤智能科技有限公司 Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930505A (en) * 2012-09-09 2013-02-13 西南技术物理研究所 Circuit implementation method with rotation invariance of image point features
CN102999913A (en) * 2012-11-29 2013-03-27 清华大学深圳研究生院 Local three-dimensional matching method based on credible point spreading
CN103052973A (en) * 2011-07-12 2013-04-17 华为技术有限公司 Method and device for generating body animation
CN104778736A (en) * 2015-04-03 2015-07-15 北京航空航天大学 Three-dimensional garment animation generation method driven by single video content
CN105405150A (en) * 2015-10-21 2016-03-16 东方网力科技股份有限公司 Abnormal behavior detection method and abnormal behavior detection device based fused characteristics
CN107260179A (en) * 2017-06-08 2017-10-20 朱翔 Human body motion tracking method based on inertia and body-sensing sensing data quality evaluation
CN109584276A (en) * 2018-12-04 2019-04-05 北京字节跳动网络技术有限公司 Critical point detection method, apparatus, equipment and readable medium
CN110111351A (en) * 2019-05-10 2019-08-09 电子科技大学 Merge the pedestrian contour tracking of RGBD multi-modal information
CN110139115A (en) * 2019-04-30 2019-08-16 广州虎牙信息科技有限公司 Virtual image attitude control method, device and electronic equipment based on key point
CN110765967A (en) * 2019-10-30 2020-02-07 腾讯科技(深圳)有限公司 Action recognition method based on artificial intelligence and related device
CN110874865A (en) * 2019-11-14 2020-03-10 腾讯科技(深圳)有限公司 Three-dimensional skeleton generation method and computer equipment
CN111079695A (en) * 2019-12-30 2020-04-28 北京华宇信息技术有限公司 Human body key point detection and self-learning method and device
CN111523402A (en) * 2020-04-01 2020-08-11 车智互联(北京)科技有限公司 Video processing method, mobile terminal and readable storage medium
CN112800850A (en) * 2020-12-31 2021-05-14 上海商汤智能科技有限公司 Video processing method and device, electronic equipment and storage medium
CN112836085A (en) * 2021-02-08 2021-05-25 深圳市欢太科技有限公司 Weight adjusting method and device and storage medium
CN112989913A (en) * 2019-12-16 2021-06-18 辉达公司 Neural network based face analysis using facial markers and associated confidence values
CN113066069A (en) * 2021-03-31 2021-07-02 深圳中科飞测科技股份有限公司 Adjusting method and device, adjusting equipment and storage medium


Also Published As

Publication number Publication date
CN113556600B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
WO2019223782A1 (en) Game scene description method and apparatus, device, and storage medium
US20200302154A1 (en) Image processing method, apparatus, storage medium, and electronic device
US20230049533A1 (en) Image gaze correction method, apparatus, electronic device, computer-readable storage medium, and computer program product
US9473780B2 (en) Video transmission using content-based frame search
WO2019238114A1 (en) Three-dimensional dynamic model reconstruction method, apparatus and device, and storage medium
WO2022156622A1 (en) Sight correction method and apparatus for face image, device, computer-readable storage medium, and computer program product
US20230132407A1 (en) Method and device of video virtual background image processing and computer apparatus
WO2020253618A1 (en) Video jitter detection method and device
US20220329880A1 (en) Video stream processing method and apparatus, device, and medium
WO2020037881A1 (en) Motion trajectory drawing method and apparatus, and device and storage medium
CN112085775B (en) Image processing method, device, terminal and storage medium
WO2022148248A1 (en) Image processing model training method, image processing method and apparatus, electronic device, and computer program product
WO2016165614A1 (en) Method for expression recognition in instant video and electronic equipment
CN109274883A (en) Posture antidote, device, terminal and storage medium
WO2023138549A1 (en) Image processing method and apparatus, and electronic device and storage medium
WO2023169283A1 (en) Method and apparatus for generating binocular stereoscopic panoramic image, device, storage medium, and product
CN113556600A (en) Drive control method and device based on time sequence information, electronic equipment and readable storage medium
WO2022083118A1 (en) Data processing method and related device
WO2022041182A1 (en) Method and device for making music recommendation
KR102261544B1 (en) Streaming server and method for object processing in multi-view video using the same
WO2023169318A1 (en) Image quality determination method, apparatus, device, and storage medium
CN111292234B (en) Panoramic image generation method and device
CN111385481A (en) Image processing method and device, electronic device and storage medium
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
CN111417016A (en) Attitude estimation method, server and network equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant