CN113554034A - Key point detection model construction method, detection method, device, equipment and medium


Info

Publication number: CN113554034A
Authority: CN (China)
Prior art keywords: model, key point, time sequence, detection, frame
Legal status: Granted
Application number: CN202010332493.4A
Other languages: Chinese (zh)
Other versions: CN113554034B (en)
Inventors: 陈昕, 王华彦
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by: Beijing Dajia Internet Information Technology Co Ltd
Priority to: CN202010332493.4A
Publication of: CN113554034A
Application granted
Publication of: CN113554034B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The disclosure provides a method and a device for constructing a keypoint detection model, an electronic device, and a storage medium, and relates to the technical field of image processing. In the technical scheme, a single-frame detection model labels the keypoints in each frame image to obtain pseudo-labeled data, and the pseudo-labeled data is used to supervise the keypoint prediction results of a time sequence model, thereby constructing a keypoint detection model for detecting the keypoints of an object to be detected in a video image. The technical scheme provided by the disclosure can improve the efficiency of constructing the keypoint detection model and reduce the cost of keypoint detection in video images.

Description

Key point detection model construction method, detection method, device, equipment and medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method for constructing a keypoint detection model, a method and an apparatus for detecting keypoints in a video image, an electronic device, and a storage medium.
Background
With the development of image processing technology, a technology for detecting key points of objects such as faces and vehicles has emerged. For example, the detection of face key points in a video image is a very important technology, and the accuracy of the detection of the face key points is directly related to subsequent applications, such as expression analysis, three-dimensional animation, three-dimensional reconstruction of a face, and the like.
In the related art, detecting the keypoints of an object in a video image relies on a keypoint detection model obtained by training a CNN model on a large, manually labeled data set. This way of constructing the model requires substantial manpower and time to label and check the keypoints in the video images in order to guarantee the quality of the data set. In video applications, however, if every frame must be labeled with all the keypoints of the object to be detected, such as a human face, the time spent labeling a data set large enough for model training far exceeds the time spent training on single-frame images, so constructing a keypoint detection model for video images with this technology is inefficient.
Disclosure of Invention
The present disclosure provides a method for constructing a keypoint detection model, a method for detecting keypoints in a video image, an apparatus, an electronic device, and a storage medium, so as to at least solve the problem of low efficiency of constructing a keypoint detection model in a video image in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, a method for constructing a keypoint detection model is provided, including:
acquiring a video image sample containing an object to be detected;
inputting the video image sample into a single-frame detection model, respectively detecting key points of the object to be detected in each frame image of the video image sample by the single-frame detection model to obtain a first detection result, and taking the first detection result as a key point marking result corresponding to each frame image;
inputting the video image sample into a time sequence model, respectively carrying out key point detection on the object to be detected in each frame image of the video image sample by using the time sequence model to obtain a second detection result, and taking the second detection result as a key point prediction result corresponding to each frame image;
and calculating a model loss value of the time sequence model according to the key point marking result and the key point prediction result, and training the time sequence model according to the model loss value to obtain a key point detection model.
In an exemplary embodiment, the calculating a model loss value of the time sequence model according to the keypoint labeling result and the keypoint prediction result, and training the time sequence model according to the model loss value to obtain a keypoint detection model includes:
inputting the key point marking result and the key point prediction result into a loss function of the time sequence model, and calculating to obtain a model loss value of the time sequence model; if the model loss value is within the threshold value range, taking the time sequence model as the key point detection model; and if the model loss value is not within the threshold value range, adjusting the network weight of the time sequence model by adopting a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain the key point detection model.
In an exemplary embodiment, the adjusting the network weight of the time sequence model by using a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain the keypoint detection model includes:
based on the back propagation method, acquiring the gradient of the loss function to the network weight; based on the gradient, adjusting the network weight of the time sequence model by a gradient descending method; training the adjusted time sequence model until the model loss value of the adjusted time sequence model is within the threshold range, and taking the adjusted time sequence model as the key point detection model.
In an exemplary embodiment, the inputting the video image sample into a time sequence model, where the time sequence model respectively performs the keypoint detection on the object to be detected in each frame image of the video image sample to obtain a second detection result, and taking the second detection result as a keypoint prediction result corresponding to each frame image includes:
acquiring an initial hidden layer state of the time sequence model aiming at the current frame image; acquiring a hidden layer state of the time sequence model aiming at the current frame image according to the hidden layer weight, the input layer weight, the initial hidden layer state of the time sequence model and the image data of the current frame image; and acquiring the second detection result as a key point prediction result corresponding to each frame of image according to the hidden layer state and the output layer weight of the time sequence model.
In an exemplary embodiment, the obtaining the initial hidden layer state of the time-series model for the current frame image includes:
when the current frame image is a non-first frame image of the video image sample, taking a hidden layer state of a previous frame image of the current frame image as the initial hidden layer state; and when the current frame image is the first frame image of the video image sample, taking the all-zero vector as the initial hidden layer state.
In an exemplary embodiment, the inputting the video image sample into a single-frame detection model, where the single-frame detection model performs keypoint detection on the object to be detected in each frame image of the video image sample to obtain a first detection result, and takes the first detection result as a keypoint labeling result corresponding to each frame image, includes:
inputting the video image sample into the single-frame detection model to obtain the first detection result, and taking the first detection result as an initial labeling result corresponding to each frame of image; acquiring a time sequence interval between adjacent frame images; acquiring the position constraint relation of the adjacent frame images to the key points of the object to be detected according to the time sequence interval and the motion attribute of the object to be detected; and sequentially correcting the initial labeling result of each frame of image so that the corrected initial labeling result meets the position constraint relation, and obtaining a key point labeling result corresponding to each frame of image.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for detecting a keypoint of a video image, including:
acquiring a video image to be detected;
inputting the video image into a key point detection model, and acquiring a detection result of the key point of the object to be detected in the video image, which is output by the key point detection model; the key point detection model is obtained according to the construction method of the key point detection model.
According to a third aspect of the embodiments of the present disclosure, there is provided a device for constructing a keypoint detection model, including:
the sample acquisition module is used for acquiring a video image sample containing an object to be detected;
the result marking module is used for inputting the video image sample into a single-frame detection model, the single-frame detection model respectively detects key points of the object to be detected in each frame image of the video image sample to obtain a first detection result, and the first detection result is used as a key point marking result corresponding to each frame image;
the result prediction module is used for inputting the video image samples into a time sequence model, the time sequence model respectively detects key points of the object to be detected in each frame image of the video image samples to obtain second detection results, and the second detection results are used as key point prediction results corresponding to each frame image;
and the model construction module is used for calculating a model loss value of the time sequence model according to the key point marking result and the key point prediction result, and training the time sequence model according to the model loss value to obtain a key point detection model.
In an exemplary embodiment, the model building module is further configured to input the keypoint annotation result and the keypoint prediction result into a loss function of the time sequence model, and calculate a model loss value of the time sequence model; if the model loss value is within the threshold value range, taking the time sequence model as the key point detection model; and if the model loss value is not within the threshold value range, adjusting the network weight of the time sequence model by adopting a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain the key point detection model.
In an exemplary embodiment, the model building module is further configured to obtain a gradient of the loss function to the network weight based on the back propagation method; based on the gradient, adjusting the network weight of the time sequence model by a gradient descending method; training the adjusted time sequence model until the model loss value of the adjusted time sequence model is within the threshold range, and taking the adjusted time sequence model as the key point detection model.
In an exemplary embodiment, the result prediction module is further configured to obtain an initial hidden layer state of the temporal model for a current frame image; acquiring a hidden layer state of the time sequence model aiming at the current frame image according to the hidden layer weight, the input layer weight, the initial hidden layer state of the time sequence model and the image data of the current frame image; and acquiring the second detection result as a key point prediction result corresponding to each frame of image according to the hidden layer state and the output layer weight of the time sequence model.
In an exemplary embodiment, the result prediction module is further configured to use, when the current frame image is a non-first frame image of the video image sample, a concealment layer state of a previous frame image of the current frame image as the initial concealment layer state; and when the current frame image is the first frame image of the video image sample, taking the all-zero vector as the initial hidden layer state.
In an exemplary embodiment, the result labeling module is further configured to input the video image sample into the single-frame detection model to obtain the first detection result, and use the first detection result as an initial labeling result corresponding to each frame of image; acquiring a time sequence interval between adjacent frame images; acquiring the position constraint relation of the adjacent frame images to the key points of the object to be detected according to the time sequence interval and the motion attribute of the object to be detected; and sequentially correcting the initial labeling result of each frame of image so that the corrected initial labeling result meets the position constraint relation, and obtaining a key point labeling result corresponding to each frame of image.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a keypoint detection apparatus for a video image, comprising:
the image acquisition module is used for acquiring a video image to be detected;
the key point detection module is used for inputting the video image into a key point detection model and acquiring a detection result of the key point of the object to be detected in the video image, which is output by the key point detection model; the key point detection model is obtained according to the construction method of the key point detection model.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, causing the device to perform the method described in any one of the embodiments of the first aspect.
The technical scheme provided by the embodiments of the disclosure brings at least the following beneficial effects. A video image sample containing an object to be detected is obtained and input into a single-frame detection model to obtain a first detection result, which is taken as the keypoint labeling result corresponding to each frame image; the video image sample is also input into a time sequence model to obtain a second detection result, which is taken as the keypoint prediction result corresponding to each frame image. The keypoint labeling result is obtained without manual labeling and can serve as supervision information for the time sequence model's prediction of keypoints in the video image sample. A model loss value of the time sequence model is calculated from the keypoint labeling result and the keypoint prediction result, and the time sequence model is trained according to the model loss value to obtain the keypoint detection model. In this scheme, the single-frame detection model labels the keypoints in each frame image to produce pseudo-labeled data, and the pseudo-labeled data supervises the keypoint prediction results of the time sequence model, thereby constructing a keypoint detection model for detecting the keypoints of an object to be detected in a video image. The scheme can improve the efficiency of constructing the keypoint detection model and reduce the cost of keypoint detection in video images.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a diagram illustrating an application environment for a method of constructing a keypoint detection model, according to an example embodiment.
FIG. 2 is a flow chart illustrating a method of constructing a keypoint detection model, according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating steps for obtaining keypoint annotation results, according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating steps for obtaining a timing model in accordance with an exemplary embodiment.
FIG. 5 is a flowchart illustrating a method of constructing a keypoint detection model, according to an example embodiment.
Fig. 6 is a flowchart illustrating a method of keypoint detection for a video image according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a method of constructing a face keypoint detection model according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating an apparatus for constructing a keypoint detection model according to an exemplary embodiment.
Fig. 9 is a block diagram illustrating a keypoint detection apparatus for a video image according to an exemplary embodiment.
Fig. 10 is an internal block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The method for constructing the keypoint detection model provided by the present disclosure may be applied to an application environment as shown in fig. 1, where fig. 1 is an application environment diagram of a method for constructing the keypoint detection model according to an exemplary embodiment, and the application environment may include a terminal 110 and a server 120, where the terminal 110 may interact with the server 120 through a network. The terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 120 may be implemented by an independent server or a server cluster formed by a plurality of servers.
Specifically, the method for constructing the keypoint detection model provided by the present disclosure may be executed by the terminal 110 or the server 120 shown in fig. 1 alone. Taking execution by the server 120 alone as an example: the server 120 obtains a video image sample containing an object to be detected and inputs the video image sample into a single-frame detection model, which performs keypoint detection on the object to be detected in each frame image of the video image sample to obtain a first detection result that is taken as the keypoint labeling result corresponding to each frame image; the server 120 also inputs the video image sample into a time sequence model, which performs keypoint detection on the object to be detected in each frame image to obtain a second detection result that is taken as the keypoint prediction result corresponding to each frame image; finally, the server 120 calculates a model loss value of the time sequence model from the keypoint labeling result and the keypoint prediction result, and trains the time sequence model according to the model loss value to obtain the keypoint detection model.
In addition, the method for constructing the keypoint detection model provided by the present disclosure may also be executed by the terminal 110 and the server 120 cooperatively. Specifically, the terminal 110 may obtain a video image sample containing an object to be detected and send the video image sample to the server 120. The server 120 inputs the video image sample into the single-frame detection model, which performs keypoint detection on the object to be detected in each frame image of the video image sample to obtain a first detection result that is taken as the keypoint labeling result corresponding to each frame image, and inputs the video image sample into the time sequence model, which performs keypoint detection on the object to be detected in each frame image to obtain a second detection result that is taken as the keypoint prediction result corresponding to each frame image. The server 120 then calculates a model loss value of the time sequence model from the keypoint labeling result and the keypoint prediction result and trains the time sequence model according to the model loss value to obtain the keypoint detection model. The server 120 may further send the constructed keypoint detection model to the terminal 110, so that the terminal 110 can use it to detect the keypoints of the object to be detected in video images.
The following describes the construction method of the keypoint detection model provided by the present disclosure in detail with reference to the accompanying drawings and embodiments.
Fig. 2 is a flowchart illustrating a method for constructing a keypoint detection model according to an exemplary embodiment. The method may be applied to the server 120 shown in fig. 1 and, as shown in fig. 2, includes the following steps.
Step S201, acquiring a video image sample containing an object to be detected;
in this step, the server 120 may obtain a video image sample including an object to be detected, where the video image sample is a sample used for training the keypoint detection model. The video image sample may be collected by the server 120 through the terminal 110, or one or more video images may be obtained from a video image database local to the server 120 as the video image sample. The video image sample acquired by the server 120 needs to include an object to be detected, the object to be detected is determined according to an actual application scene, and the object to be detected may be an object such as a person, a pet, a vehicle, or the like, and may be used as the object to be detected. The video image sample comprises a plurality of frames of images with space-time relationship, each frame of image can comprise the object to be detected, the object to be detected has key points on the image, the key points on the image are pixel points used for marking the area of the object to be detected on the image, the human face is taken as the object to be detected, and the key points on the image can comprise pixel points corresponding to the five sense organs of the human face and the outline of the human face.
Step S202, inputting a video image sample into a single-frame detection model, respectively detecting key points of an object to be detected in each frame image of the video image sample by the single-frame detection model to obtain a first detection result, and taking the first detection result as a key point marking result corresponding to each frame image;
In this step, after obtaining the video image sample, the server 120 inputs it into a single-frame detection model, which performs keypoint detection on the object to be detected in each frame image of the video image sample to obtain a first detection result. The first detection result produced by the single-frame detection model is taken as the keypoint labeling result corresponding to each frame image in the video image sample, that is, each frame image corresponds to a keypoint labeling result from the single-frame detection model. In this step, the server 120 uses the single-frame detection model to detect each frame image independently, so the detection order need not be restricted; it is only necessary that keypoints be detected for all frame images. Since the single-frame detection model can be implemented by a pre-constructed CNN model, the server 120 does not actually depend on manual keypoint labeling of each frame image when performing keypoint detection on the frames of the video image sample. In other words, in this step the task of labeling keypoints in each frame image of the video image sample is completed by the pre-constructed single-frame detection model, and this kind of non-manual keypoint labeling result for each frame image can be referred to as pseudo-labeled data.
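As an illustration of this step, the following Python sketch shows how the first detection result could be produced by running a pre-constructed single-frame detection model over every frame. It assumes the single-frame model is available as a PyTorch module that maps a batch of frames to keypoint coordinates; the function name, tensor shapes, and variable names are illustrative assumptions, not part of the disclosure.

    import torch

    def generate_pseudo_labels(single_frame_model: torch.nn.Module,
                               frames: torch.Tensor) -> torch.Tensor:
        """Run a pre-constructed single-frame detector on every frame independently.

        frames: (T, C, H, W) tensor holding the T frame images of one video sample.
        Returns a (T, K, 2) tensor of pseudo-labeled coordinates, one (x, y) pair
        for each of the K keypoints of the object to be detected in each frame.
        """
        single_frame_model.eval()
        with torch.no_grad():                    # labeling only, no gradients needed
            # Frame order does not matter here; each frame is detected on its own.
            first_detection = single_frame_model(frames)   # (T, K, 2)
        return first_detection                   # keypoint labeling result (pseudo-labels)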
Step S203, inputting the video image sample into a time sequence model, respectively detecting key points of the object to be detected in each frame image of the video image sample by the time sequence model to obtain a second detection result, and taking the second detection result as a key point prediction result corresponding to each frame image;
In this step, the server 120 inputs the video image sample into the time sequence model, which may sequentially perform keypoint detection on the object to be detected in each frame image of the video image sample to obtain a second detection result. The second detection result produced by the time sequence model is taken as the keypoint prediction result corresponding to each frame image. For example, if the video image sample contains frame images numbered 1 to 100, the server 120 may use a time sequence model such as an RNN (Recurrent Neural Network) model or an LSTM (Long Short-Term Memory) model to detect the keypoints of the object to be detected in frame images 1 to 100, with each frame image corresponding to its own keypoint prediction result. Thus, after the server 120 detects the keypoints in each frame image with the time sequence model, the keypoint prediction result corresponding to each frame image is obtained.
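For concreteness, the sketch below shows one possible form of such a time sequence model: per-frame features extracted by a small convolutional backbone are fed through an LSTM, and a linear head regresses the keypoints of every frame. The architecture, layer dimensions, and class name are assumptions made purely for illustration.

    import torch
    import torch.nn as nn

    class TemporalKeypointModel(nn.Module):
        """Illustrative time sequence model: per-frame features -> LSTM -> keypoints."""
        def __init__(self, feat_dim: int = 256, hidden_dim: int = 128, num_kpts: int = 68):
            super().__init__()
            self.backbone = nn.Sequential(       # stand-in per-frame feature extractor
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU())
            self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_kpts * 2)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (T, C, H, W); the video sample is treated as one sequence of length T
            feats = self.backbone(frames).unsqueeze(0)     # (1, T, feat_dim)
            hidden, _ = self.rnn(feats)                    # hidden layer states, (1, T, hidden_dim)
            out = self.head(hidden).squeeze(0)             # (T, num_kpts * 2)
            return out.view(frames.size(0), -1, 2)         # (T, K, 2) keypoint prediction result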
It should be noted that the execution sequence of step S202 and step S203 in the server 120 is not limited, and the server 120 may execute step S202 first and then step S203, may also execute step S203 first and then step S202, or may execute step S202 and step S203 simultaneously.
Step S204, calculating a model loss value of the time sequence model according to the keypoint labeling result and the keypoint prediction result, and training the time sequence model according to the model loss value to obtain a keypoint detection model.
In this step, the server 120 may use the keypoint labeling result obtained in step S202 as the supervision information of the time sequence model; this supervision information determines whether the time sequence model can serve as the final keypoint detection model applied to detecting the keypoints of the object to be detected in video images. The server 120 may calculate a model loss value of the time sequence model from the keypoint labeling result and the keypoint prediction result, train the time sequence model according to the model loss value, and take the time sequence model as the keypoint detection model once the model loss value falls within the set threshold range.
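The following sketch illustrates how step S204 could be realized: the pseudo-labels from the single-frame model supervise the time sequence model, with an MSE loss and a fixed loss threshold used here purely as illustrative stand-ins for the unspecified loss function and threshold range.

    import torch
    import torch.nn as nn

    def train_temporal_model(temporal_model, frames, pseudo_labels,
                             loss_threshold: float = 1e-3, max_epochs: int = 100):
        """Train the time sequence model using the keypoint labeling result as supervision."""
        criterion = nn.MSELoss()                                 # illustrative loss function
        optimizer = torch.optim.SGD(temporal_model.parameters(), lr=1e-3)
        for _ in range(max_epochs):
            prediction = temporal_model(frames)                  # keypoint prediction result
            loss = criterion(prediction, pseudo_labels)          # model loss value
            if loss.item() < loss_threshold:                     # loss within threshold range
                break                                            # model becomes the detector
            optimizer.zero_grad()
            loss.backward()                                      # back propagation
            optimizer.step()                                     # gradient descent update
        return temporal_model                                    # keypoint detection model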
In the above method for constructing the keypoint detection model, a video image sample containing an object to be detected is obtained and input into a single-frame detection model to obtain a first detection result, which is taken as the keypoint labeling result corresponding to each frame image; the video image sample is also input into a time sequence model to obtain a second detection result, which is taken as the keypoint prediction result corresponding to each frame image. The keypoint labeling result is obtained without manual labeling and can serve as supervision information for the time sequence model's prediction of keypoints in the video image sample. A model loss value of the time sequence model is calculated from the keypoint labeling result and the keypoint prediction result, and the time sequence model is trained according to the model loss value to obtain the keypoint detection model. In this scheme, the single-frame detection model labels the keypoints in each frame image to produce pseudo-labeled data, and the pseudo-labeled data supervises the keypoint prediction results of the time sequence model, thereby constructing a keypoint detection model for detecting the keypoints of the object to be detected in a video image. The scheme can improve the efficiency of constructing the keypoint detection model and reduce the cost of keypoint detection in video images.
In an exemplary embodiment, as shown in fig. 3, fig. 3 is a flowchart of a step of obtaining a key point annotation result according to an exemplary embodiment, where the step S202 inputs a video image sample into a single-frame detection model, the single-frame detection model performs key point detection on an object to be detected in each frame image of the video image sample to obtain a first detection result, and the first detection result is used as a key point annotation result corresponding to each frame image, and the method may include:
step S301, inputting a video image sample into a single-frame detection model to obtain a first detection result, and taking the first detection result as an initial labeling result corresponding to each frame of image;
In this step, after the server 120 detects the keypoints in each frame image with the single-frame detection model, the first detection result obtained from the detection is taken as the initial labeling result corresponding to each frame image. Before the server 120 uses this initial labeling result as the supervision information of the time sequence model, it may be corrected through the following steps, so that the keypoint labeling result used as supervision information is more accurate.
Step S302, acquiring a time sequence interval between adjacent frame images;
In this step, the server 120 may obtain the frame interval of the video image as the time sequence interval between adjacent frame images, that is, the server 120 obtains how many seconds elapse between adjacent frame images in the video.
Step S303, acquiring a position constraint relation of adjacent frame images to key points of the object to be detected according to the time sequence interval and the motion attribute of the object to be detected;
the motion attribute of the object to be detected may be an attribute of a motion speed, a motion direction, and the like of the object to be detected in a scene shown in the video image. The server 120 may obtain a position constraint relationship of the key points of the object to be detected by the adjacent frame images according to the time sequence interval between the adjacent frame images and the motion attribute of the object to be detected, where the position constraint relationship is used to represent a position relationship that the positions of the key points of the object to be detected have in the adjacent frame images under the constraints of the time sequence interval and the motion attribute. For example, the moving speed of a person in a general scene has a certain upper limit, and the upper limit of the moving speed can be used as a motion attribute of the object to be detected in the scene of the video image, and the server 120 calculates a position change range of a key point, such as a human face, between adjacent frame images by combining the upper limit of the moving speed of the person and a time interval of the adjacent frame images, where the position change range can be used to represent a position range in which the key point on the frame image is allowed to appear on the next frame image under the aforementioned constraint condition.
Step S304, the initial labeling result of each frame of image is sequentially corrected, so that the corrected initial labeling result meets the position constraint relationship, and the key point labeling result corresponding to each frame of image is obtained.
In this step, the server 120 may sequentially correct the initial labeling result of each frame image based on the position constraint relation obtained in step S303. Starting from the initial labeling result of the first frame image, the server 120 corrects the initial labeling result of the second frame image in combination with the position constraint relation; for example, if the initial labeling result of the second frame image already satisfies the position constraint relation, it need not be adjusted. The initial labeling result of the third frame image is then corrected, in combination with the position constraint relation, based on the result of the second frame image: if it does not satisfy the position constraint relation, for example because some initially labeled keypoint exceeds the position change range specified by the relation, that initial labeling result is corrected so that the corrected result satisfies the relation. The initial labeling result of the fourth frame image is then corrected based on the corrected result of the third frame image and the position constraint relation, and so on, until the initial labeling result of every frame image has been corrected to satisfy the position constraint relation. In this way the server 120 obtains the keypoint labeling result corresponding to each frame image.
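One simple way to realize this correction is sketched below: the maximum allowed displacement between adjacent frames is taken as the speed upper limit multiplied by the time sequence interval, and any keypoint that moves farther than that is pulled back onto the boundary of the allowed range. The clamping strategy and parameter names are assumptions made for illustration.

    import numpy as np

    def correct_pseudo_labels(initial_labels: np.ndarray,
                              max_speed_px_per_s: float,
                              frame_interval_s: float) -> np.ndarray:
        """Sequentially correct initial labeling results to satisfy the position constraint.

        initial_labels: (T, K, 2) keypoint coordinates from the single-frame model.
        """
        max_disp = max_speed_px_per_s * frame_interval_s     # position change range
        corrected = initial_labels.copy()
        for t in range(1, len(corrected)):
            delta = corrected[t] - corrected[t - 1]           # per-keypoint displacement
            norm = np.linalg.norm(delta, axis=-1, keepdims=True)
            too_far = norm > max_disp                         # violates the constraint
            scale = np.where(too_far, max_disp / np.maximum(norm, 1e-8), 1.0)
            corrected[t] = corrected[t - 1] + delta * scale   # pull back inside the range
        return corrected                                      # keypoint labeling result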
In the solution of the foregoing embodiment, after the server 120 performs the preliminary labeling of each frame image with the single-frame detection model, it corrects the initial labeling result obtained from the single-frame detection model by combining the time sequence interval between adjacent frame images and the motion attribute of the object to be detected, so that the corrected keypoint labeling result is more accurate, which helps improve the accuracy of the constructed keypoint detection model.
In an exemplary embodiment, the inputting the video image sample into the time sequence model in step S203, where the time sequence model performs the keypoint detection on the object to be detected in each frame image of the video image sample to obtain the second detection result, and taking the second detection result as the keypoint prediction result corresponding to each frame image, may include:
acquiring an initial hidden layer state of a time sequence model aiming at a current frame image; acquiring a hidden layer state of the time sequence model aiming at the current frame image according to the hidden layer weight, the input layer weight, the initial hidden layer state and the image data of the current frame image of the time sequence model; and acquiring a second detection result as a key point prediction result corresponding to each frame of image according to the hidden layer state and the output layer weight of the time sequence model.
In this embodiment, the timing model may include an output layer, a hidden layer, and an input layer. Wherein the image data of each frame image can be input data of an input layer as a time sequence model. In the time sequence model, a hidden layer state corresponds to each frame of image, the hidden layer state corresponding to each frame of image can be calculated by the hidden layer weight and the input layer weight of the time sequence model and the initial hidden state and the input data corresponding to each frame of image, and the server 120 can calculate the hidden layer state of the time sequence model for each frame of image according to the hidden layer weight, the input layer weight, the initial hidden state and the input data.
In an exemplary embodiment, the step of obtaining the initial hidden layer state of the time-series model for the current frame image may include: when the current frame image is a non-first frame image of a video image sample, taking the hidden layer state of the previous frame image of the current frame image as an initial hidden layer state; and when the current frame image is the first frame image of the video image sample, taking the all-zero vector as the initial hidden layer state. Specifically, since the hidden layer state of the time sequence model for the current frame image is determined by the hidden layer state of the previous frame, when the current frame image is a non-first frame image of the video image sample, the server 120 may directly use the hidden layer state of the previous frame image of the current frame image as the initial hidden layer state, and when the current frame image is a first frame image of the video image sample, the server 120 may select the all-zero vector as the initial hidden layer state.
After obtaining the hidden layer state for the current frame image, the server 120 may calculate the second detection result of the keypoints of the current frame image from that hidden layer state and the output layer weight of the time sequence model; that is, the output of the output layer for the current frame image is computed from the output layer weight and the current hidden layer state. With the above solution of this embodiment, the server 120 can use the time sequence model to predict the keypoints in each frame image and thereby obtain the keypoint prediction result corresponding to each frame image.
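The recurrence described above can be written out explicitly as in the sketch below, using a tanh nonlinearity as an assumed activation; W_in, W_hidden, and W_out stand for the input layer, hidden layer, and output layer weights.

    import numpy as np

    def rnn_keypoint_prediction(frame_features, W_in, W_hidden, W_out):
        """Per-frame recurrence of a simple time sequence (RNN) model.

        h_t = tanh(W_in @ x_t + W_hidden @ h_{t-1})   # hidden layer state for frame t
        y_t = W_out @ h_t                             # keypoint prediction for frame t
        The first frame uses the all-zero vector as its initial hidden layer state;
        every later frame uses the hidden layer state of the previous frame.
        """
        hidden = np.zeros(W_hidden.shape[0])          # all-zero initial hidden layer state
        predictions = []
        for x_t in frame_features:                    # frames processed in time order
            hidden = np.tanh(W_in @ x_t + W_hidden @ hidden)
            predictions.append(W_out @ hidden)
        return np.stack(predictions)                  # second detection result per frame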
In an exemplary embodiment, as shown in fig. 4, fig. 4 is a flowchart illustrating the steps of training the time sequence model to obtain the keypoint detection model according to an exemplary embodiment, where step S204, calculating a model loss value of the time sequence model according to the keypoint labeling result and the keypoint prediction result and training the time sequence model according to the model loss value to obtain the keypoint detection model, may include:
step S401, inputting a key point marking result and a key point prediction result into a loss function of the time sequence model, and calculating to obtain a model loss value of the time sequence model;
step S402, judging whether the model loss value is in a threshold value range; if yes, go to step S403; if not, go to step S404;
step S403, if the model loss value is within the threshold value range, taking the time sequence model as a key point detection model;
and S404, if the model loss value is not within the threshold range, adjusting the network weight of the time sequence model by adopting a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain a key point detection model.
In this embodiment, the server 120 may input the keypoint labeling result and the keypoint prediction result into the loss function of the time sequence model, and use the loss function to calculate the model loss value of the time sequence model in the current training round. The server 120 then compares the model loss value with the set threshold range. If the model loss value is within the threshold range, the server 120 may directly take the time sequence model as the final keypoint detection model. If it is not, the server 120 continues training the time sequence model: based on the loss function, it adjusts the network weights of the time sequence model by a back propagation method, where the network weights may include the input layer weight, the hidden layer weight, and the output layer weight of the time sequence model, and then trains the adjusted time sequence model to obtain the keypoint detection model.
According to the technical scheme of the embodiment, the key point marking result of the single-frame detection model is used as the supervision information of the time sequence model, and the network weight of the time sequence model is adjusted based on the supervision information so as to train the time sequence model to obtain the key point detection model.
In an exemplary embodiment, in step S404, adjusting the network weight of the time sequence model by a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain the keypoint detection model, may specifically include:
acquiring the gradient of the loss function to the network weight based on a back propagation method; based on the gradient, adjusting the network weight of the time sequence model by a gradient descending method; and training the adjusted time sequence model until the model loss value of the adjusted time sequence model is within the threshold range, and taking the adjusted time sequence model as a key point detection model.
In this embodiment, the server 120 may train the time sequence model based on a back propagation method. Specifically, the server 120 may use a back propagation algorithm to obtain the gradients of the loss function with respect to the input layer weight, the hidden layer weight, and the output layer weight, and continuously update these weights by a gradient descent method based on the gradients. Training of the time sequence model continues after each weight update until the model loss value of the adjusted time sequence model falls within the set threshold range, that is, until training converges, at which point the server 120 takes the adjusted time sequence model as the keypoint detection model.
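A minimal sketch of one such update step is given below: the gradients of the loss with respect to the input layer, hidden layer, and output layer weights are obtained by back propagation, and each weight is moved against its gradient. The explicit weight tensors and learning rate are illustrative; in practice an optimizer such as SGD would wrap this logic.

    import torch

    def gradient_descent_step(loss: torch.Tensor,
                              w_input: torch.Tensor,
                              w_hidden: torch.Tensor,
                              w_output: torch.Tensor,
                              lr: float = 1e-3) -> None:
        """One network-weight update by back propagation plus gradient descent."""
        # Back propagation: gradient of the loss function w.r.t. each network weight.
        grads = torch.autograd.grad(loss, [w_input, w_hidden, w_output])
        with torch.no_grad():
            # Gradient descent: step each weight against its gradient.
            for weight, grad in zip([w_input, w_hidden, w_output], grads):
                weight -= lr * grad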
In the above embodiment, the time sequence model is trained continuously by updating its network weights under the supervision of the keypoint labeling results produced by the single-frame detection model, and the converged time sequence model can be used as the keypoint detection model, so that detection of the keypoints of the object to be detected in video images is more accurate.
In an exemplary embodiment, a method for constructing a keypoint detection model is provided, as shown in fig. 5, where fig. 5 is a flowchart illustrating a method for constructing a keypoint detection model according to an exemplary embodiment, and the method may include:
step S501, the server 120 obtains a video image sample containing an object to be detected;
step S502, the server 120 inputs the video image sample into the single-frame detection model to obtain a first detection result, and the first detection result is used as an initial labeling result corresponding to each frame of image;
step S503, the server 120 obtains a time interval between adjacent frame images;
step S504, the server 120 obtains the position constraint relation of the key points of the object to be detected by the adjacent frame images according to the time sequence interval and the motion attribute of the object to be detected;
step S505, the server 120 corrects the initial annotation result of each frame of image in sequence, so that the corrected initial annotation result satisfies the position constraint relationship, and obtains the key point annotation result corresponding to each frame of image;
step S506, the server 120 inputs the video image sample into a time sequence model, the time sequence model respectively detects key points of the object to be detected in each frame image of the video image sample to obtain a second detection result, and the second detection result is used as a key point prediction result corresponding to each frame image;
step S507, the server 120 inputs the key point annotation result and the key point prediction result into the loss function of the time sequence model, and calculates a model loss value of the time sequence model;
step S508, the server 120 determines whether the model loss value is within the threshold range; if yes, go to step S509; if not, go to step S510;
step S509, if the model loss value is within the threshold range, the server 120 uses the time sequence model as a key point detection model;
step S510, if the model loss value is not within the threshold range, the server 120 adjusts the network weight of the time sequence model by using a back propagation method based on the loss function of the time sequence model, and trains the adjusted time sequence model to obtain the key point detection model.
In the above method, the server 120 may first correct the initial labeling result obtained from the single-frame detection model based on the time sequence interval between adjacent frame images and the motion attribute of the object to be detected, so as to obtain an accurate keypoint labeling result for each frame image. The server 120 predicts the keypoints in each frame image with the time sequence model to obtain the keypoint prediction result corresponding to each frame image, and uses the keypoint labeling result as the supervision information of the time sequence model. The server 120 inputs the keypoint labeling result and the keypoint prediction result into the loss function of the time sequence model to calculate the model loss value. When the model loss value is within the threshold range, the server 120 directly takes the time sequence model as the keypoint detection model; when it is not, the server 120 adjusts the network weights of the time sequence model by a back propagation method based on the loss function and continues training the adjusted time sequence model to obtain the keypoint detection model.
In one embodiment, a method for detecting a key point of a video image is provided, as shown in fig. 6, where fig. 6 is a flowchart illustrating a method for detecting a key point of a video image according to an exemplary embodiment, the method may be applied to the server 120 shown in fig. 1, and may include the following steps:
step S601, acquiring a video image to be detected;
step S602, inputting a video image into a key point detection model, and acquiring a detection result of a key point of an object to be detected in the video image, which is output by the key point detection model; the key point detection model is obtained according to the construction method of the key point detection model in any one of the embodiments.
In this embodiment, the server 120 obtains a video image to be detected and inputs the video image into a pre-constructed keypoint detection model, where the keypoint detection model may be obtained by the construction method described in any one of the above embodiments. Specifically, the keypoint detection model may perform keypoint detection on the object to be detected in each frame image of the video image, and the server 120 obtains the detection result of the keypoints of the object to be detected in the video image output by the keypoint detection model. The method for detecting video image keypoints provided by this embodiment can improve the efficiency of keypoint detection in video images and reduce its cost.
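For illustration, the sketch below shows how a video image to be detected might be fed to the constructed keypoint detection model; reading frames with OpenCV and the fixed resize resolution are assumptions, and the model is expected to be a PyTorch module as in the earlier sketches.

    import cv2
    import torch

    def detect_video_keypoints(model: torch.nn.Module, video_path: str) -> torch.Tensor:
        """Run the constructed keypoint detection model on a video image to be detected."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.resize(frame, (128, 128))          # illustrative input size
            frames.append(torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0)
        cap.release()
        with torch.no_grad():
            # (T, C, H, W) frames -> per-frame keypoint detection results
            return model(torch.stack(frames))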
In order to more clearly illustrate the technical solution provided by the present disclosure, the technical solution is applied to construct a face keypoint detection model for description, referring to fig. 7, fig. 7 is a schematic diagram illustrating a principle of a method for constructing a face keypoint detection model according to an exemplary embodiment, and the method may specifically include the following steps:
the application example takes the face in the video image as the object to be detected of the video image, can be based on a single-frame face key point detection model which is constructed and aims at the single-frame image, the single-frame face key point detection model can be constructed and obtained based on a CNN (compressed natural number network) model, and each frame image of the video image can be input into the single-frame face key point detection model to obtain the face key point output of each frame image. Meanwhile, a time sequence deep learning model is also constructed, such as an RNN model, an LSTM model and the like can be used as the time sequence model, a human face key point prediction result of the time sequence model on each frame of image can be obtained by inputting a video image into the time sequence model, when the time sequence model is trained, each frame of image summarizes the human face key points obtained by a pre-constructed single frame of human face key point detection model and outputs the human face key points as a human face key point labeling result of each frame of image (namely, the single frame of human face key point detection model is used as a teacher model), the labeling result is compared with the prediction result obtained when the time sequence model is trained to obtain a loss function, the network weight of the time sequence model in each training process is updated by using the loss function, and thus the training of the time sequence model can be realized based on the labeling result of the single frame of human face key point detection model, and the accuracy of single frame output can be improved through the relation of the front frame and the rear frame of the time sequence model. Finally, the trained timing model (e.g., RNN model) can be used as a key point detection model for detecting key points of a face of a video image.
This method greatly reduces the time and labor cost of constructing a video face keypoint detection model, shortens the cycle and speeds up the training of the related models, and greatly reduces the cost of detecting face keypoints in video. It is quick to develop and easy to operate, and benefits subsequent applications such as facial expression analysis, three-dimensional animation, and three-dimensional face reconstruction.
It should be understood that, although the steps in the flowcharts of fig. 2 to 7 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 to 7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Fig. 8 is a block diagram illustrating an apparatus for constructing a keypoint detection model according to an exemplary embodiment. Referring to fig. 8, the apparatus 800 for constructing the keypoint detection model may include:
a sample acquiring module 801, configured to acquire a video image sample including an object to be detected;
a result labeling module 802, configured to input the video image sample into a single-frame detection model, where the single-frame detection model performs keypoint detection on the object to be detected in each frame image of the video image sample, respectively, to obtain a first detection result, and uses the first detection result as a keypoint labeling result corresponding to each frame image;
a result prediction module 803, configured to input the video image sample into a time sequence model, where the time sequence model performs keypoint detection on the object to be detected in each frame image of the video image sample to obtain a second detection result, and uses the second detection result as a keypoint prediction result corresponding to each frame image;
and a model construction module 804, configured to calculate a model loss value of the time sequence model according to the keypoint labeling result and the keypoint prediction result, and train the time sequence model according to the model loss value to obtain the keypoint detection model.
In an exemplary embodiment, the model construction module 804 is further configured to: input the keypoint labeling result and the keypoint prediction result into a loss function of the time sequence model, and calculate a model loss value of the time sequence model; if the model loss value is within the threshold range, take the time sequence model as the keypoint detection model; and if the model loss value is not within the threshold range, adjust the network weights of the time sequence model by a back propagation method based on the loss function of the time sequence model, and train the adjusted time sequence model to obtain the keypoint detection model.
In an exemplary embodiment, the model construction module 804 is further configured to: obtain the gradient of the loss function with respect to the network weights based on the back propagation method; adjust the network weights of the time sequence model by a gradient descent method based on the gradient; and train the adjusted time sequence model until the model loss value of the adjusted time sequence model is within the threshold range, taking the adjusted time sequence model as the keypoint detection model.
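As a concrete illustration of this "train until the loss falls inside the threshold range" logic, the hedged sketch below reuses the hypothetical `teacher`, `student`, and `criterion` objects from the earlier sketch; the threshold value, learning rate, step budget, and synthetic clips are illustrative assumptions, not values given in the disclosure.

```python
loss_threshold = 1e-3                          # assumed upper bound of the "threshold range"
optimizer = torch.optim.SGD(student.parameters(), lr=1e-3)   # plain gradient-descent update rule
clips = [torch.rand(1, 30, 3, 128, 128) for _ in range(8)]   # stand-in video image samples

for step in range(1000):                       # bounded budget so the sketch terminates
    video = clips[step % len(clips)]
    with torch.no_grad():
        pseudo_labels = teacher(video.squeeze(0)).unsqueeze(0)
    loss = criterion(student(video), pseudo_labels)
    if loss.item() <= loss_threshold:          # model loss value within the threshold range: stop
        break
    optimizer.zero_grad()
    loss.backward()                            # back-propagation: gradient of the loss w.r.t. the weights
    optimizer.step()                           # gradient descent adjusts the network weights

keypoint_detector = student                    # the trained timing model serves as the keypoint detection model
```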
In an exemplary embodiment, the result prediction module 803 is further configured to: obtain an initial hidden layer state of the time sequence model for the current frame image; obtain the hidden layer state of the time sequence model for the current frame image according to the hidden layer weight and input layer weight of the time sequence model, the initial hidden layer state, and the image data of the current frame image; and obtain the second detection result as the keypoint prediction result corresponding to each frame image according to the hidden layer state and the output layer weight of the time sequence model.
In an exemplary embodiment, the result prediction module 803 is further configured to: when the current frame image is not the first frame image of the video image sample, take the hidden layer state of the previous frame image as the initial hidden layer state; and when the current frame image is the first frame image of the video image sample, take an all-zero vector as the initial hidden layer state.
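The recurrence described in the two paragraphs above can be written compactly as h_t = tanh(W_hh·h_{t-1} + W_xh·x_t) and y_t = W_hy·h_t, with h_0 being an all-zero vector for the first frame. The bare-tensor sketch below is one possible rendering; the weight names, dimensions, and the tanh activation are assumptions rather than details given in the disclosure.

```python
import torch

num_keypoints, feat_dim, hidden_dim = 68, 64, 128    # assumed sizes
W_xh = torch.randn(hidden_dim, feat_dim) * 0.01      # input layer weight
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.01    # hidden layer weight
W_hy = torch.randn(num_keypoints * 2, hidden_dim) * 0.01  # output layer weight

def predict_keypoints(frame_features):               # frame_features: (T, feat_dim) per-frame image features
    h = torch.zeros(hidden_dim)                       # first frame: all-zero initial hidden layer state
    predictions = []
    for x in frame_features:                          # later frames reuse the previous frame's hidden state
        h = torch.tanh(W_hh @ h + W_xh @ x)           # hidden layer state for the current frame
        predictions.append(W_hy @ h)                  # prediction from hidden state and output layer weight
    return torch.stack(predictions).view(-1, num_keypoints, 2)

preds = predict_keypoints(torch.rand(30, feat_dim))   # (30, 68, 2): keypoint prediction for each frame
```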
In an exemplary embodiment, the result labeling module 802 is further configured to: input the video image sample into the single-frame detection model to obtain the first detection result, and take the first detection result as an initial labeling result corresponding to each frame image; acquire the time sequence interval between adjacent frame images; acquire, according to the time sequence interval and the motion attribute of the object to be detected, a position constraint relation that the keypoints of the object to be detected must satisfy between adjacent frame images; and correct the initial labeling result of each frame image in sequence so that the corrected initial labeling result satisfies the position constraint relation, thereby obtaining the keypoint labeling result corresponding to each frame image.
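One plausible way to realize this correction, sketched below, is to bound the displacement of each keypoint between adjacent frames by an assumed maximum speed multiplied by the time sequence interval, and to pull any initial label that exceeds that bound back within it. The speed value and the clamping strategy are assumptions for illustration only.

```python
import numpy as np

def correct_labels(initial_labels, frame_interval_s, max_speed_px_per_s=300.0):
    """initial_labels: (T, K, 2) per-frame keypoint coordinates from the single-frame model."""
    max_shift = max_speed_px_per_s * frame_interval_s    # constraint: motion attribute x time sequence interval
    corrected = initial_labels.astype(float)
    for t in range(1, len(corrected)):                   # correct frames in sequence
        delta = corrected[t] - corrected[t - 1]
        dist = np.linalg.norm(delta, axis=-1, keepdims=True)
        scale = np.minimum(1.0, max_shift / np.maximum(dist, 1e-8))
        corrected[t] = corrected[t - 1] + delta * scale  # clamp jumps that violate the position constraint
    return corrected

# Example: 30 frames at 30 fps, 68 keypoints in a 256x256 image
labels = correct_labels(np.random.rand(30, 68, 2) * 256, frame_interval_s=1 / 30)
```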
Fig. 9 is a block diagram illustrating a keypoint detection apparatus for a video image according to an exemplary embodiment. Referring to fig. 9, the apparatus 900 for detecting a keypoint of a video image may include:
an image obtaining module 901, configured to obtain a video image to be detected;
a key point detection module 902, configured to input the video image into the key point detection model, and obtain a detection result of the key point of the object to be detected in the video image, which is output by the key point detection model; the key point detection model is obtained according to the construction method of the key point detection model.
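For illustration, a hypothetical inference call with the trained detector (names follow the earlier sketches, in which `keypoint_detector` is the trained timing model) might look like:

```python
video_to_detect = torch.rand(1, 30, 3, 128, 128)     # video image to be detected (B, T, C, H, W)
with torch.no_grad():
    keypoints = keypoint_detector(video_to_detect)   # (1, 30, 68, 2): keypoint detection result per frame
```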
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is an internal block diagram of an electronic device according to an exemplary embodiment. For example, the device 1000 may be a server. Referring to fig. 10, the device 1000 includes a processing component 1020, which further includes one or more processors, and memory resources, represented by a memory 1022, for storing instructions executable by the processing component 1020, such as application programs. The application programs stored in the memory 1022 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1020 is configured to execute the instructions to perform the above-described methods.
The device 1000 may also include a power component 1024 configured to perform power management for the device 1000, a wired or wireless network interface 1026 configured to connect the device 1000 to a network, and an input/output (I/O) interface 1028. The device 1000 may operate based on an operating system stored in the memory 1022, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a storage medium including instructions, for example the memory 1022 including instructions, is also provided; the instructions are executable by a processor of the device 1000 to perform the above-described method. The storage medium may be a non-transitory computer-readable storage medium, for example a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided. The program product includes a computer program stored in a readable storage medium; at least one processor of a device reads the computer program from the readable storage medium and executes it, causing the device to perform the method described in any one of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A construction method of a key point detection model is characterized by comprising the following steps:
acquiring a video image sample containing an object to be detected;
inputting the video image sample into a single-frame detection model, respectively detecting key points of the object to be detected in each frame image of the video image sample by the single-frame detection model to obtain a first detection result, and taking the first detection result as a key point marking result corresponding to each frame image;
inputting the video image sample into a time sequence model, respectively carrying out key point detection on the object to be detected in each frame image of the video image sample by using the time sequence model to obtain a second detection result, and taking the second detection result as a key point prediction result corresponding to each frame image;
and calculating a model loss value of the time sequence model according to the key point marking result and the key point prediction result, and training the time sequence model according to the model loss value to obtain a key point detection model.
2. The method of claim 1, wherein the calculating a model loss value of the time sequence model according to the keypoint labeling result and the keypoint prediction result, and the training the time sequence model according to the model loss value to obtain a keypoint detection model comprises:
inputting the key point marking result and the key point prediction result into a loss function of the time sequence model, and calculating to obtain a model loss value of the time sequence model;
if the model loss value is within the threshold value range, taking the time sequence model as the key point detection model;
and if the model loss value is not within the threshold value range, adjusting the network weight of the time sequence model by adopting a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain the key point detection model.
3. The method of claim 2, wherein the adjusting the network weight of the time sequence model by using a back propagation method based on the loss function of the time sequence model, and training the adjusted time sequence model to obtain the keypoint detection model comprises:
based on the back propagation method, acquiring the gradient of the loss function to the network weight;
based on the gradient, adjusting the network weight of the time sequence model by a gradient descent method;
training the adjusted time sequence model until the model loss value of the adjusted time sequence model is within the threshold range, and taking the adjusted time sequence model as the key point detection model.
4. The method according to claim 1, wherein the inputting the video image sample into a time sequence model, the time sequence model respectively performing the keypoint detection on the object to be detected in each frame image of the video image sample to obtain a second detection result, and using the second detection result as the keypoint prediction result corresponding to each frame image comprises:
acquiring an initial hidden layer state of the time sequence model aiming at the current frame image;
acquiring a hidden layer state of the time sequence model aiming at the current frame image according to the hidden layer weight, the input layer weight, the initial hidden layer state of the time sequence model and the image data of the current frame image;
and acquiring the second detection result as a key point prediction result corresponding to each frame of image according to the hidden layer state and the output layer weight of the time sequence model.
5. The method according to claim 1, wherein the inputting the video image sample into a single-frame detection model, the single-frame detection model respectively performing keypoint detection on the object to be detected in each frame image of the video image sample to obtain a first detection result, and using the first detection result as a keypoint annotation result corresponding to each frame image, comprises:
inputting the video image sample into the single-frame detection model to obtain the first detection result, and taking the first detection result as an initial labeling result corresponding to each frame of image;
acquiring a time sequence interval between adjacent frame images;
acquiring the position constraint relation of the adjacent frame images to the key points of the object to be detected according to the time sequence interval and the motion attribute of the object to be detected;
and sequentially correcting the initial labeling result of each frame of image so that the corrected initial labeling result meets the position constraint relation, and obtaining a key point labeling result corresponding to each frame of image.
6. A method for detecting a key point of a video image is characterized by comprising the following steps:
acquiring a video image to be detected;
inputting the video image into a key point detection model, and acquiring a detection result of the key point of the object to be detected in the video image, which is output by the key point detection model; the key point detection model is obtained by the construction method of the key point detection model according to any one of claims 1 to 5.
7. A construction device of a key point detection model is characterized by comprising:
the sample acquisition module is used for acquiring a video image sample containing an object to be detected;
the result marking module is used for inputting the video image sample into a single-frame detection model, the single-frame detection model respectively detects key points of the object to be detected in each frame image of the video image sample to obtain a first detection result, and the first detection result is used as a key point marking result corresponding to each frame image;
the result prediction module is used for inputting the video image samples into a time sequence model, the time sequence model respectively detects key points of the object to be detected in each frame image of the video image samples to obtain second detection results, and the second detection results are used as key point prediction results corresponding to each frame image;
and the model construction module is used for calculating a model loss value of the time sequence model according to the key point marking result and the key point prediction result, and training the time sequence model according to the model loss value to obtain a key point detection model.
8. A key point detecting device for a video image, comprising:
the image acquisition module is used for acquiring a video image to be detected;
the key point detection module is used for inputting the video image into a key point detection model and acquiring a detection result of the key point of the object to be detected in the video image, which is output by the key point detection model; the key point detection model is obtained by the construction method of the key point detection model according to any one of claims 1 to 5.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.
10. A storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-6.
CN202010332493.4A 2020-04-24 Method, device, equipment and medium for constructing key point detection model Active CN113554034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010332493.4A CN113554034B (en) 2020-04-24 Method, device, equipment and medium for constructing key point detection model

Publications (2)

Publication Number Publication Date
CN113554034A true CN113554034A (en) 2021-10-26
CN113554034B CN113554034B (en) 2024-07-16

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203376A (en) * 2016-07-19 2016-12-07 北京旷视科技有限公司 Face key point localization method and device
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN109214343A (en) * 2018-09-14 2019-01-15 北京字节跳动网络技术有限公司 Method and apparatus for generating face critical point detection model
CN109389072A (en) * 2018-09-29 2019-02-26 北京字节跳动网络技术有限公司 Data processing method and device
CN109670474A (en) * 2018-12-28 2019-04-23 广东工业大学 A kind of estimation method of human posture based on video, device and equipment
CN110309706A (en) * 2019-05-06 2019-10-08 深圳市华付信息技术有限公司 Face critical point detection method, apparatus, computer equipment and storage medium
CN110532981A (en) * 2019-09-03 2019-12-03 北京字节跳动网络技术有限公司 Human body key point extracting method, device, readable storage medium storing program for executing and equipment
CN110674785A (en) * 2019-10-08 2020-01-10 中兴飞流信息科技有限公司 Multi-person posture analysis method based on human body key point tracking

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822254A (en) * 2021-11-24 2021-12-21 腾讯科技(深圳)有限公司 Model training method and related device
CN113822254B (en) * 2021-11-24 2022-02-25 腾讯科技(深圳)有限公司 Model training method and related device
WO2023093244A1 (en) * 2021-11-24 2023-06-01 腾讯科技(深圳)有限公司 Model training method and apparatus, device, medium and program product

Similar Documents

Publication Publication Date Title
WO2022213879A1 (en) Target object detection method and apparatus, and computer device and storage medium
US10296102B1 (en) Gesture and motion recognition using skeleton tracking
CN109389030B (en) Face characteristic point detection method and device, computer equipment and storage medium
CN109145781B (en) Method and apparatus for processing image
US11200404B2 (en) Feature point positioning method, storage medium, and computer device
CN110503074B (en) Information labeling method, device and equipment of video frame and storage medium
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
CN108197618B (en) Method and device for generating human face detection model
WO2021151336A1 (en) Road image target detection method based on attentional mechanism and related device
EP3885980A1 (en) Method and apparatus for processing information, device, medium and computer program product
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
CN111401192B (en) Model training method and related device based on artificial intelligence
CN113538441A (en) Image segmentation model processing method, image processing method and device
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN111507137A (en) Action understanding method and device, computer equipment and storage medium
AU2021204584A1 (en) Methods, apparatuses, devices and storage media for detecting correlated objects involved in image
CN111310595B (en) Method and device for generating information
CN113554034B (en) Method, device, equipment and medium for constructing key point detection model
CN113554034A (en) Key point detection model construction method, detection method, device, equipment and medium
CN116416678A (en) Method for realizing motion capture and intelligent judgment by using artificial intelligence technology
CN116959123A (en) Face living body detection method, device, equipment and storage medium
CN114782994A (en) Gesture recognition method and device, storage medium and electronic equipment
JP7239002B2 (en) OBJECT NUMBER ESTIMATING DEVICE, CONTROL METHOD, AND PROGRAM
CN113033377A (en) Character position correction method, character position correction device, electronic equipment and storage medium
CN113963202A (en) Skeleton point action recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant