CN108229418B - Human body key point detection method and apparatus, electronic device, storage medium, and program - Google Patents


Info

Publication number
CN108229418B
Authority
CN
China
Prior art keywords
image
human body
image block
network
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810055582.1A
Other languages
Chinese (zh)
Other versions
CN108229418A (en)
Inventor
刘文韬
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201810055582.1A priority Critical patent/CN108229418B/en
Publication of CN108229418A publication Critical patent/CN108229418A/en
Application granted granted Critical
Publication of CN108229418B publication Critical patent/CN108229418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a human body key point detection method and apparatus, an electronic device, a storage medium, and a program. The method includes: performing face detection on an image to obtain position information of a face in the image; determining position information of a human body center point corresponding to the face according to the position information of the face; and performing human body key point detection on the image according to the position information of the human body center point. The human body key points detected by the embodiments of the invention have higher position accuracy.

Description

Human body key point detection method and apparatus, electronic device, storage medium, and program
Technical Field
The present invention relates to artificial intelligence technology, and in particular, to a method and apparatus for detecting human body key points, an electronic device, a storage medium, and a program.
Background
The human body key point detection technology is the basis of human body video data automatic processing, human body behavior analysis and human-computer interaction, and can provide important technical support for video structuring.
A related human body key point detection technology is mainly realized based on a depth sensor, and a depth image acquired by the depth sensor is used as input to detect the position of a human body key point.
Another related human body key point detection technology is a human body key point detection system based on a red-green-blue (RGB) camera, which mainly includes two parts: human body localization and human body key point localization. The system takes an image collected by an ordinary RGB camera as input, obtains a human body bounding frame, and performs human body key point detection based on that frame.
Disclosure of Invention
The embodiment of the invention provides a technical scheme for detecting key points of a human body.
According to an aspect of an embodiment of the present invention, a method for detecting a key point of a human body is provided, including:
carrying out face detection on an image to obtain position information of a face in the image;
determining the position information of the human body central point corresponding to the face according to the position information of the face;
and detecting the key points of the human body of the image according to the position information of the central point of the human body.
Optionally, in the foregoing method embodiments of the present invention, the determining, according to the position information of the face, the position information of the human body center point corresponding to the face includes:
acquiring a first image block with a first preset size from the image according to the position information of the face, wherein the first image block comprises at least one part of a human body corresponding to the face;
and determining the position information of the human body central point corresponding to the face according to the first image block and the head-body mapping network.
Optionally, in each of the above method embodiments of the present invention, the position information of the face includes a center position of the face and size information of the face;
the acquiring a first image block with a first preset size from the image according to the position information of the face includes:
determining a normalization parameter of the image according to the size information of the face;
according to the normalization parameters of the image, carrying out size normalization processing on the image to obtain an image with a normalized size;
and intercepting the first image block with a first preset size from the image with the normalized size according to the central position of the face.
Optionally, in the foregoing method embodiments of the present invention, the determining a normalization parameter of the image according to the size information of the face includes:
and determining a normalization parameter corresponding to the size of the face scaled to a preset normalization face size according to the size information of the face.
Optionally, in the foregoing method embodiments of the present invention, the intercepting, according to the center position of the human face, the first image block having a first preset size from the image after the size normalization includes:
and intercepting an image block which takes the central position of the face as the center and has the size which is M times of the size of the normalized face from the image with the normalized size, and scaling the intercepted image block to the first preset size to obtain the first image block, wherein the value of M is more than 3 and less than 20.
Optionally, in the foregoing method embodiments of the present invention, the performing, according to the position information of the human body central point, human body key point detection on the image includes:
acquiring a second image block with a second preset size from the image according to the position information of the human body central point, wherein the second image block comprises at least one part of a human body corresponding to the human face;
and detecting the human key points in the second image block through a human key point detection network.
Optionally, in the above method embodiments of the present invention, the acquiring, according to the position information of the center point of the human body, a second image block having a second preset size from the image includes:
and intercepting the second image block with a second preset size by taking the position of the human body central point as the center from the image with the normalized size.
Optionally, in the above method embodiments of the present invention, the performing human key point detection in the second image block includes:
and detecting key points of the human body on the second image block to obtain the positions of the key points of the human body in the second image block.
Optionally, in the above method embodiments of the present invention, the performing human key point detection on the second image block to obtain positions of the human key points in the second image block includes:
performing key point detection on the human body in the second image block to acquire a confidence image of the second image block for each key point in at least one key point of the human body, wherein the confidence image of a key point comprises the confidence that each of at least one position in the second image block is judged to be the key point to which the confidence image belongs;
and determining the position with the maximum confidence in the confidence image of a key point as the position of that key point in the second image block.
Optionally, in each of the above method embodiments of the present invention, after obtaining the positions of the key points of the human body in the second image block, the method further includes:
and determining the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image.
Optionally, in the above method embodiments of the present invention, the determining, according to the position of the key point of the human body in the second image block and the position of the second image block in the image, the position of the key point of the human body in the image includes:
acquiring the positions of the key points of the human body in the image with the normalized size according to the positions of the key points of the human body in the second image block and the position of the second image block in the image with the normalized size;
and determining the positions of the key points of the human body in the image based on the positions of the key points of the human body in the image with the normalized size and the normalization parameters of the image.
Optionally, in each of the above method embodiments of the present invention, the human body key point detection network includes a plurality of convolutional neural networks;
the detecting key points of the human body in the second image block to obtain a confidence image of the second image block for each key point of at least one key point of the human body includes:
extracting, by each convolutional neural network of the plurality of convolutional neural networks, image features of the second image block, wherein the extracted image features of different convolutional neural networks of the plurality of convolutional neural networks have different scales;
splicing the image features of different scales extracted by the plurality of convolutional neural networks to obtain spliced features;
classifying the splicing features by using a classifier corresponding to each key point in the at least one key point of the human body to obtain a confidence image of the second image block for each key point in the at least one key point of the human body.
Optionally, in the above method embodiments of the present invention, a first convolutional neural network in the plurality of convolutional neural networks includes a plurality of convolutional layers respectively located at different network depths;
the extracting, by each convolutional neural network of the plurality of convolutional neural networks, the image feature of the second image block includes:
and performing feature fusion on a first feature output by a first convolution layer with a network depth of i and a second feature output by a second convolution layer with a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block output by the first convolutional neural network is obtained by processing the fusion feature, and 1 ≤ i < j.
Optionally, in the foregoing method embodiments of the present invention, the human body keypoint detection network includes M connected network blocks, each of the network blocks includes the multiple convolutional neural networks, the output of the p-th network block in the M network blocks is the splicing feature obtained by the multiple convolutional neural networks included in the p-th network block, and the splicing feature output by the p-th network block is input into the (p+1)-th network block, where M ≥ 2 and p = 1, …, M-1;
the classifying the stitching features by using the classifier corresponding to each key point of the plurality of key points of the human body to obtain a confidence image of the second image block for each key point of the plurality of key points of the human body includes:
and classifying the splicing features output by the Mth network block by using the classifier corresponding to each key point in the plurality of key points of the human body to obtain a confidence image of the second image block aiming at each key point in the plurality of key points of the human body.
Optionally, in the above method embodiments of the present invention, the human body key point detection network includes a plurality of convolution layers respectively located at different network depths;
the detecting key points of the human body in the second image block to obtain a confidence image of the second image block for each key point of at least one key point of the human body includes:
and performing feature fusion on a first feature output by a first convolution layer with a network depth of i and a second feature output by a second convolution layer with a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block is obtained by processing the fusion feature, and 1 ≤ i < j.
According to another aspect of the embodiments of the present invention, there is provided a human body key point detecting device, including:
the face detection module is used for carrying out face detection on the image to obtain the position information of the face in the image;
the head-body mapping network is used for determining the position information of the human body central point corresponding to the face according to the position information of the face;
and the human body key point detection network is used for detecting the human body key points of the image according to the position information of the human body center point.
Optionally, in each of the above apparatus embodiments of the present invention, further including:
the first intercepting module is used for acquiring a first image block with a first preset size from the image according to the position information of the face, wherein the first image block comprises at least one part of a human body corresponding to the face;
the head-body mapping network is specifically configured to determine, according to the first image block, position information of a human body center point corresponding to the face.
Optionally, in each of the above apparatus embodiments of the present invention, the position information of the face includes a center position of the face and size information of the face;
the first intercepting module is specifically configured to:
determining a normalization parameter of the image according to the size information of the face;
according to the normalization parameters of the image, carrying out size normalization processing on the image to obtain an image with a normalized size;
and intercepting the first image block with a first preset size from the image with the normalized size according to the central position of the face.
Optionally, in each of the apparatus embodiments of the present invention, when the first capture module determines the normalization parameter of the image according to the size information of the human face, the first capture module is specifically configured to: and determining a normalization parameter corresponding to the size of the face scaled to a preset normalization face size according to the size information of the face.
Optionally, in each of the above apparatus embodiments of the present invention, when the first intercepting module intercepts the first image block with the first preset size from the image after the size normalization according to the center position of the face, it is specifically configured to: intercept an image block which takes the center position of the face as the center and has a size that is M times the normalized face size from the image with the normalized size, and scale the intercepted image block to the first preset size to obtain the first image block, wherein the value of M is more than 3 and less than 20.
Optionally, in each of the above apparatus embodiments of the present invention, further including:
the second intercepting module is used for acquiring a second image block with a second preset size from the image according to the position information of the human body central point, wherein the second image block comprises at least one part of a human body corresponding to the human face;
the human body key point detection network is specifically used for detecting human body key points in the second image block through the human body key point detection network.
Optionally, in each of the above apparatus embodiments of the present invention, the second intercepting module is specifically configured to: and intercepting the second image block with a second preset size by taking the position of the human body central point as the center from the image with the normalized size.
Optionally, in each of the above device embodiments of the present invention, when the human key point detection network performs human key point detection in the second image block, the human key point detection network is specifically configured to: and detecting key points of the human body on the second image block to obtain the positions of the key points of the human body in the second image block.
Optionally, in each of the apparatus embodiments of the present invention, when the human key point detection network performs human key point detection on the second image block to obtain positions of the human key points in the second image block, the human key point detection network is specifically configured to:
performing key point detection on the human body in the second image block to acquire a confidence image of the second image block for each key point in at least one key point of the human body, wherein the confidence image of a key point comprises the confidence that each of at least one position in the second image block is judged to be the key point to which the confidence image belongs; and determining the position with the maximum confidence in the confidence image of a key point as the position of that key point in the second image block.
Optionally, in each of the above apparatus embodiments of the present invention, further including:
and the acquisition module is used for determining the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image.
Optionally, in each of the apparatus embodiments of the present invention, the obtaining module is specifically configured to:
acquiring the positions of the key points of the human body in the image with the normalized size according to the positions of the key points of the human body in the second image block and the position of the second image block in the image with the normalized size;
and determining the positions of the key points of the human body in the image based on the positions of the key points of the human body in the image with the normalized size and the normalization parameters of the image.
Optionally, in the above apparatus embodiments of the present invention, the human body key point detection network includes:
a plurality of convolutional networks, each configured to extract image features of the second image block, wherein the image features extracted by different convolutional networks in the plurality of convolutional networks have different scales;
the splicing unit is used for splicing the image features of different scales extracted by the plurality of convolutional networks to obtain spliced features;
and the classifiers corresponding to the key points in the at least one key point of the human body are respectively used for classifying the splicing features to obtain a confidence image of the second image block aiming at each key point in the at least one key point of the human body.
Optionally, in the above device embodiments of the present invention, each of the convolutional networks includes a plurality of convolutional layers respectively located at different network depths;
a first convolutional network of the plurality of convolutional networks is specifically configured to: perform feature fusion on a first feature output by a first convolution layer with a network depth of i and a second feature output by a second convolution layer with a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block output by the first convolutional network is obtained by processing the fusion feature, and 1 ≤ i < j.
Optionally, in each of the above apparatus embodiments of the present invention, the human body keypoint detection network includes M network blocks, each of the network blocks includes the plurality of convolutional networks, the output of the p-th network block in the M network blocks is the concatenation feature obtained by the plurality of convolutional networks included in the p-th network block, and the concatenation feature output by the p-th network block is input into the (p+1)-th network block, where M ≥ 2 and p = 1, …, M-1;
the classifiers corresponding to each key point in at least one key point of the human body are respectively and specifically used for: and classifying the splicing features output by the Mth network block to obtain a confidence image of the second image block aiming at each key point in the plurality of key points of the human body.
Optionally, in the above apparatus embodiments of the present invention, the human body key point detection network includes:
the plurality of convolution layers are respectively positioned at different network depths and are respectively used for carrying out feature extraction;
the fusion unit is used for performing feature fusion on a first feature output by a first convolution layer with a network depth of i and a second feature output by a second convolution layer with a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block is obtained by processing the fusion feature, and 1 ≤ i < j;
and the classifiers corresponding to the key points in the at least one key point of the human body are respectively used for classifying the image characteristics of the second image block to obtain a confidence image of the second image block for each key point in the at least one key point of the human body.
According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including:
a memory for storing executable instructions; and
a processor, configured to communicate with the memory to execute the executable instructions so as to complete the operations of the human body key point detection method according to any of the above embodiments of the present invention.
According to still another aspect of the embodiments of the present invention, there is provided a computer storage medium for storing computer readable instructions, which when executed, implement the operations of the human body key point detection method according to any one of the above embodiments of the present invention.
According to a further aspect of an embodiment of the present invention, there is provided a computer program product for storing computer readable instructions, which when executed, cause a computer to perform the human key point detection method described in any one of the above possible implementations.
In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
According to a further aspect of the embodiments of the present invention, there is provided a computer program, which includes computer readable instructions, and when the computer readable instructions are run in a device, a processor in the device executes executable instructions for implementing the steps in the human body key point detection method according to any one of the above embodiments of the present invention.
Based on the human body key point detection method and apparatus, the electronic device, the storage medium, and the program provided by the embodiments of the invention, face detection is performed on an image; after the position information of the face in the image is obtained, the position information of the human body center point corresponding to the face is determined according to the position information of the face; and human body key point detection is performed on the image according to the position information of the human body center point. The embodiments of the invention detect human body key points from the detected face position and human body center point position without outputting a complete human body bounding frame. Because the human body center point is generally located within the upper trunk of the human body, and the deformation of the upper trunk is relatively small, the embodiments of the invention are less affected by complex human body postures, and the positions of the detected human body key points are more accurate.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of a method for detecting key points in a human body according to the present invention.
FIG. 2 is a flowchart of another embodiment of a method for detecting human key points according to the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the human body key point detection device of the present invention.
Fig. 4 is a schematic structural diagram of another embodiment of the human body key point detection device of the invention.
Fig. 5 is a schematic structural diagram of an embodiment of a human body key point detection network according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of another embodiment of a human body key point detection network according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations, and with numerous other electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In the process of implementing the invention, the inventor finds that the prior art has at least the following problems through research:
the human body key point detection technology based on a depth sensor requires a depth sensor, which is expensive and not easy to install and deploy;
the human body key point detection system based on an RGB camera needs to accurately detect the position of the human body, and the human body position is strongly affected by the human body posture, so the positions of the detected human body key points are inaccurate.
FIG. 1 is a flowchart of an embodiment of a method for detecting key points in a human body according to the present invention. As shown in fig. 1, the human body key point detection method of the embodiment includes:
and 102, carrying out face detection on the image to obtain the position information of the face in the image.
The image in the embodiment of the present invention may be an image acquired by any camera, for example, an image acquired by an RGB camera or a depth camera; accordingly, the image may be a color image or the like, which is not limited in the embodiment of the present invention.
And 104, determining the position information of the human body central point (namely, the human body central position) corresponding to the human face according to the position information of the human face.
And 106, detecting key points of the human body on the image according to the position information of the central point of the human body.
Based on the human body key point detection method provided by the embodiment of the invention, the image is subjected to face detection, after the position information of the face in the image is obtained, the position information of the human body center point corresponding to the face is determined according to the position information of the face; and detecting key points of the human body of the image according to the position information of the central point of the human body. The embodiment of the invention can detect the key points of the human body by detecting the position of the face and the position of the central point of the human body, and because the position of the central point of the human body is generally positioned in the range of the upper trunk of the human body and the deformation of the upper trunk is relatively small, compared with other modes of detecting the key points according to a complete external frame of the human body, the embodiment of the invention is less influenced by the complex posture of the human body, and the position accuracy of the detected key points of the human body is higher.
When the embodiment of the invention detects human body key points from images collected by an RGB camera, no additional input such as a depth sensor is needed, so the equipment is easy to install and deploy and lower in cost.
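Purely as an illustration of the flow of operations 102-106, and not of any concrete implementation disclosed in this application, the pipeline can be summarized in a Python-style sketch; detect_face, head_body_mapping_network, and keypoint_detection_network are hypothetical placeholder functions standing in for the face detector, the head-body mapping network, and the human body key point detection network.

```python
# Illustrative sketch of operations 102-106; all function names are hypothetical
# placeholders, not part of the disclosed implementation.
def detect_human_keypoints(image):
    # Operation 102: face detection yields the face center and face size.
    face_center, face_size = detect_face(image)

    # Operation 104: map the face position to the human body center point.
    body_center = head_body_mapping_network(image, face_center, face_size)

    # Operation 106: detect human body key points around the body center point.
    keypoints = keypoint_detection_network(image, body_center)
    return keypoints
```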
In one implementation manner of each embodiment of the method for detecting a key point of a human body according to the embodiment of the present invention, the operation 104 may include: acquiring a first image block with a first preset size from the image according to the position information of the face, wherein the first image block can comprise at least one part of a human body corresponding to the face; and determining the position information of the human body central point corresponding to the human face according to the first image block and the head-body mapping network.
In an alternative of embodiments of the invention, the head-to-body mapping network may be a neural network, such as a deep neural network.
Optionally, the first image block includes at least a part of a human body, as an optional example, the first image block may include a central point of the human body and a part above the central point, or include an upper half of the human body, and so on.
In one or more implementation manners, the first image block may be input to a head-body mapping network, and the first image block is processed by the head-body mapping network to obtain the position information of the human body center point.
In one or more implementations of the embodiments of the present invention, the human body center point may include one point located at a center of a human body or a plurality of points located in a center area of a human body, and the specific implementation of the human body center point is not limited in the embodiments of the present application. Optionally, the position information of the face may be used to indicate the position of the face, and in one or more embodiments, the position information of the face may include a center position of the face, or positions of a plurality of boundary points of the face, and the like. As an alternative embodiment, the position information of the face may include a center position of the face and size information of the face. The size information of the face may include the size of the face in the image, or may further include other information, which is not limited in this embodiment of the application.
In an alternative embodiment, obtaining a first image block having a first preset size from the image according to the position information of the face may include:
determining normalization parameters of the image according to the size information of the face;
according to the normalization parameters of the image, carrying out size normalization processing on the image to obtain an image with a normalized size;
and intercepting a first image block with a first preset size from the image with the normalized size according to the central position of the face. For example, in one or more optional embodiments, the first image block with the first preset size and centered at the center position of the face is cut out from the image with the normalized size, or the first image block with the first preset size and centered at a position with a preset distance from the center position of the face may be cut out from the image with the normalized size.
Illustratively, determining the normalization parameters of the image according to the size information of the human face may include:
according to the size information of the human face, determining a normalization parameter corresponding to the size of the human face normalized (i.e. scaled) to a preset normalized human face size.
For example, the intercepting a first image block having a first preset size and centered at the center position of the face from the image after size normalization may include: and intercepting an image block which takes the central position of the human face as the center and has the size which is M times of the size of a preset normalized human face from the image with the normalized size, and zooming the intercepted image block to a first preset size to obtain the first image block, wherein the value of M is more than 3 and less than 20.
In the embodiment, according to the position information of the face, the first image block with the first preset size is firstly intercepted from the original image, and then the position information of the human body central point corresponding to the face is predicted through the head-body mapping network, so that the detection range of the human body central position is narrowed, the prediction efficiency and the prediction accuracy of the human body central position are improved, and the training efficiency of the head-body mapping network can be improved.
In another implementation manner of each embodiment of the human body key point detecting method according to the present invention, the operation 106 may include: acquiring a second image block with a second preset size from the original image according to the position information of the human body central point, wherein the second image block comprises at least one part of the human body corresponding to the human face; and detecting the human key points in the second image block through a human key point detection network.
In an alternative of embodiments of the invention, the human keypoint detection network may be a neural network, such as a deep neural network.
Optionally, the second image block includes at least a portion of a human body, as an optional example, the second image block may include a portion centered on the center point of the human body and having a second preset size, or include a portion of the upper half of the human body, or include portions of both the upper half and the lower half of the human body, and so on, and the embodiment of the present application is not limited to a specific implementation of the second image block.
In one or more implementations, the second image block may be input to a human keypoint detection network, and the human keypoint detection network is utilized to perform human keypoint detection on the second image block.
Optionally, in an example, obtaining a second image block having a second preset size from the original image according to the position information of the center point of the human body may include: and intercepting a second image block with a second preset size and centered at the position of the human body central point from the image with the normalized size.
Optionally, in the foregoing embodiment, the performing human body keypoint detection in the second image block may include: and detecting the key points of the human body on the second image block to obtain the positions of the key points of the human body in the second image block.
For example, performing human key point detection on the second image block to obtain the positions of the key points of the human body in the second image block may include:
detecting key points of the human body in the second image block, and respectively acquiring a confidence image of the second image block for each key point in at least one key point of the human body, wherein the confidence image of a key point comprises the confidence that each of at least one position in the second image block is judged to be the key point to which the confidence image belongs;
and determining the position with the maximum confidence in the confidence image of a key point as the position of that key point in the second image block.
Based on the foregoing embodiment, in another embodiment of the method for detecting human key points according to the present invention, after obtaining the positions of the key points of the human body in the second image block, the method may further include: and determining the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image.
In one optional embodiment, the positions of the key points of the human body in the image may be determined according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image by:
acquiring the positions of the key points of the human body in the image with the normalized size according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image with the normalized size;
and determining the position of the key point of the human body in the original image according to the position of the key point of the human body in the image with the normalized size and the normalization parameter.
In the above embodiment, the second image block with the second preset size is captured from the original image according to the position information of the human body central point, and then the human body key point detection network is used to perform the human body key point detection in the second image block, so that the key point detection range is reduced, and the key point detection efficiency and accuracy are improved.
FIG. 2 is a flowchart of another embodiment of a method for detecting human key points according to the present invention. As shown in fig. 2, the human body key point detection method of the embodiment includes:
202, performing face detection on the image to obtain position information of the face in the image, including the center position of the face and the size information of the face (i.e., the size of the face in the image).
The size information of the face may indicate the size of the face in the image, and as an example, the size information of the face may include a face length w and a face width h, that is, the length w of the face in the image and the width h of the face in the image. The center position of the face may refer to a position coordinate of the center of the face in the image, but this is not limited in this embodiment of the application.
And 204, calculating a normalization parameter corresponding to the size of the human face normalized to the preset normalized human face size through the size information of the human face.
The image may be normalized (i.e., size normalization processing) according to the size information of the face, so that the face size in the normalized image is the preset normalized face size. The preset normalized face size may specifically be a normalized face width, a normalized face length, or a sum of the normalized face length and the normalized face width, and the like, which is not limited in this embodiment of the present application. The value of the normalized face size may be set according to actual needs, for example, may be 29 pixels, and the specific implementation of the normalized face size is not limited in this embodiment.
Optionally, the normalization parameter used when the image is normalized may be determined according to the size information of the face and a preset normalized face size. In one alternative example of the embodiments of the present invention, the normalization parameter S may be determined as S = Wr/(w + h), where Wr is the preset normalized face size. When the normalized face size Wr is the normalized face width, computing the normalization parameter S by the above formula reduces the influence of the face angle on the face length and/or the face width.
And 206, carrying out size normalization processing on the image according to the normalization parameters to obtain an image with a normalized size.
As an alternative example, the image may be normalized by interpolation. For example, if the length and width of the original image are W and H, respectively, the original image may be normalized to an image of size (W·S) × (H·S) by performing interpolation on the original image, such as bilinear interpolation or another interpolation method, but this is not limited in this embodiment.
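A minimal sketch of operations 204 and 206, assuming OpenCV is available; the constant NORMALIZED_FACE_SIZE (set to the example value of 29 pixels from the text) and the variable names are illustrative assumptions, not part of the disclosed implementation.

```python
import cv2

NORMALIZED_FACE_SIZE = 29  # Wr: preset normalized face size in pixels (example value from the text)

def normalize_image(image, face_w, face_h):
    # Normalization parameter S = Wr / (w + h); using w + h reduces the influence
    # of the face angle on the estimated scale.
    s = NORMALIZED_FACE_SIZE / (face_w + face_h)
    H, W = image.shape[:2]
    # Size normalization by bilinear interpolation to an image of size (W*S) x (H*S).
    normalized = cv2.resize(image, (int(round(W * s)), int(round(H * s))),
                            interpolation=cv2.INTER_LINEAR)
    return normalized, s
```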
And 208, intercepting a first image block with a first preset size and centered at the center position of the face from the image with the normalized size.
In an optional example of the embodiments of the present invention, an image block with a size M times of the normalized face size and centered on the center position of the face may be intercepted from the image after size normalization, and the image block is scaled to the first preset size to obtain a first image block, where a value of M may be greater than 3 and smaller than 20.
The first preset size may be set according to actual needs, and in an optional example, the first preset size may be, for example, 256 × 256 pixels, but the embodiment of the present application does not limit a specific implementation of the first preset size.
Optionally, in consideration of the head-body ratio of the human, the value of M may be 6 to 11, in this case, the first image block may include a human body center point, and specifically may include the upper body of the human body, or a majority of the upper body of the human body, or include the upper body of the human body and a portion of the lower body of the human body, which may be different according to different actual situations, and this is not limited in this embodiment of the application.
As an alternative example, M = 9. The inventor found through data statistics that the length of a human body is about 8 times the length of a human head. When the value of M is 9, an image block centered at the center position of the face and with a side length of 9 times the normalized face size can be intercepted from the size-normalized image; this image block can include the complete upper half of the human body and can also include part of the background information in the image, which facilitates the subsequent image feature extraction and key point detection. In addition, since the first image block is used for predicting the center position of the human body, and that center position is generally located in the upper half of the human body, the lower part of the human body is not important for this prediction; the first image block therefore may not include parts of the lower body such as the legs, which reduces the amount of data processing and improves detection efficiency.
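A hedged sketch of operation 208, assuming the size-normalized image produced above, OpenCV for resizing, and the illustrative example values M = 9 and a first preset size of 256 × 256 pixels taken from the text; the boundary handling and helper names are assumptions.

```python
import cv2

NORMALIZED_FACE_SIZE = 29  # example normalized face size from the text
M = 9                      # crop side length as a multiple of the normalized face size
FIRST_PRESET_SIZE = 256    # first preset size in pixels (example value from the text)

def crop_first_image_block(normalized_image, face_center):
    # face_center: (x, y) of the face center in the size-normalized image.
    # Intercept an image block centered at the face center whose side is
    # M times the normalized face size, then scale it to the first preset size.
    cx, cy = face_center
    half = M * NORMALIZED_FACE_SIZE // 2
    h, w = normalized_image.shape[:2]
    x0, y0 = max(0, int(cx - half)), max(0, int(cy - half))
    x1, y1 = min(w, int(cx + half)), min(h, int(cy + half))
    block = normalized_image[y0:y1, x0:x1]
    block = cv2.resize(block, (FIRST_PRESET_SIZE, FIRST_PRESET_SIZE),
                       interpolation=cv2.INTER_LINEAR)
    return block, (x0, y0)  # keep the offset so positions can be mapped back later
```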
And 210, determining the position of the human body central point corresponding to the human face in the first image block through a head-body mapping network.
The first image block can be processed by using a head-body mapping network to obtain the position of the human body central point. Optionally, the first image block may be directly input to the head-body mapping network, or the first image block may be subjected to one or more kinds of preprocessing, and the preprocessed first image block is input to the head-body mapping network. Optionally, the position of the body center point may be included in the output of the head-body mapping network, or obtained by performing one or more kinds of processing on the output of the head-body mapping network, which is not limited in this embodiment of the present application.
Optionally, the head-body mapping network may be a neural network or another type of network, which is not limited in this embodiment of the present application.
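Operation 210 is not tied to a particular architecture in the text; purely as an assumption-laden sketch, a small convolutional regressor in PyTorch could map the first image block to the coordinates of the body center point. The layer sizes and the choice of a two-value regression output are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class HeadBodyMappingNet(nn.Module):
    """Illustrative head-body mapping network: regresses the (x, y) position of the
    human body center point from the first image block. Layer sizes are arbitrary."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(64, 2)  # (x, y) of the body center point

    def forward(self, first_image_block):
        x = self.features(first_image_block)
        return self.regressor(x.flatten(1))
```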
And 212, a second image block with a second preset size and centered at the position of the human body central point is cut out from the image with the normalized size.
In one optional example of the embodiments of the present invention, the first preset size may be the same as or different from the second preset size.
Optionally, the image with the normalized size may be an image obtained by normalizing the image by using the normalization parameter S, or an image obtained by normalizing the image by using another normalization parameter, which is not limited in this embodiment of the present application.
As an example, the process of truncating the second image block may refer to the above description of the process of truncating the first image block, and is not described herein again.
And 214, performing key point detection on the second image block through a human body key point detection network to obtain the positions of the key points of the human body in the second image block.
Based on the above embodiment, size normalization processing is performed on the image according to the size information of the face, so that the human body is normalized to a preset scale, and the image blocks are intercepted from the size-normalized image to predict the position of the human body center point and to perform key point detection. This improves the efficiency of predicting the human body center position and of the key point detection, and, when the embodiment is used for training, reduces the difficulty of image learning for the head-body mapping network and the human body key point detection network.
And 216, acquiring the positions of the key points of the human body in the image with the normalized size from the positions of the key points of the human body in the second image block, based on the positional relationship between the second image block and the image with the normalized size.
In one alternative example of the embodiments of the present invention, the operation 216 may be implemented as follows:
detecting key points of a human body in a second image block, and respectively acquiring a confidence image of the second image block for each key point in at least one key point of the human body, wherein the confidence image comprises the confidence of each position in the second image block being judged as the key point to which the confidence image belongs; and determining the position with the maximum confidence in the confidence image of a key point as the position of that key point in the second image block.
In some embodiments, the at least one keypoint may be part or all of a keypoint of the human body. Optionally, for the confidence image of a certain keypoint, the confidence image may include the confidence that each position in at least one position in the second image block is determined as the keypoint, where the at least one position may specifically refer to some or all positions or pixel points in the second image block, which is not limited in this embodiment of the present application.
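The maximum-confidence decision rule described above can be sketched as follows, under the assumption that the detection network yields one confidence image per key point as a NumPy array of shape (K, H, W); the function name and array layout are assumptions for illustration.

```python
import numpy as np

def keypoints_from_confidence_maps(confidence_maps):
    # confidence_maps: array of shape (K, H, W), one confidence image per key point.
    positions = []
    for conf in confidence_maps:
        # The position with the maximum confidence is taken as the key point
        # position in the second image block.
        y, x = np.unravel_index(np.argmax(conf), conf.shape)
        positions.append((x, y))
    return positions
```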
And 218, determining the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image (namely, the original image).
In one alternative example of the embodiments of the present invention, the operation 218 may be implemented as follows: acquiring the positions of the key points of the human body in the image with the normalized size according to the positions of the key points of the human body in the second image block and the position of the second image block in the image with the normalized size; and acquiring the positions of the key points of the human body in the original image according to the positions of the key points of the human body in the image with the normalized size and the normalization parameters.
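Operations 216 and 218 amount to two coordinate transforms: shift by the position of the second image block in the size-normalized image, then undo the size normalization. A sketch, assuming the second image block was intercepted at offset (x0, y0) in the size-normalized image, that no extra scaling was applied inside the block, and that s is the normalization parameter computed earlier:

```python
def keypoints_to_original_image(block_positions, block_offset, s):
    # block_positions: key point positions (x, y) inside the second image block.
    # block_offset: (x0, y0) of the second image block in the size-normalized image.
    # s: normalization parameter used to resize the original image.
    x0, y0 = block_offset
    original_positions = []
    for x, y in block_positions:
        xn, yn = x + x0, y + y0                       # position in the size-normalized image
        original_positions.append((xn / s, yn / s))   # position in the original image
    return original_positions
```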
In a first optional implementation manner of the embodiments of the present invention, the human body key point detection network may include a plurality of convolutional neural networks, each of which serves as a branch, so that the network has multiple branches; the convolutional neural networks of different branches may include different numbers of network layers, and/or one or more network layer parameters of the convolutional neural networks of different branches may differ. Accordingly, in this optional example, performing key point detection on the human body in the second image block, and acquiring a confidence image of the second image block for each key point of at least one key point of the human body, may include:
extracting the image features of the second image block through each convolutional neural network in the convolutional neural networks of the plurality of different branches; wherein, the image features extracted by different convolutional neural networks have different scales;
splicing the image features of different scales extracted by the convolutional neural networks of the different branches to obtain spliced features;
and classifying the splicing features through classifiers corresponding to the key points respectively to obtain a confidence image of the second image block for each key point in at least one key point of the human body.
This embodiment constructs a multi-branch network structure: the convolutional neural networks of different branches extract image features of the second image block at different scales, and these features are spliced, so that global information and detail information at different levels are captured simultaneously for the second image block. The key point decision is made based on the obtained spliced features rather than features of a single scale, which improves the accuracy of key point detection.
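The following PyTorch sketch illustrates, under assumed channel counts, kernel sizes and a two-branch layout, how branches of different depth may extract features of different scales that are then spliced and classified by one 1x1 classifier per key point. It is an illustrative reading of the structure described above, not the patented network.

```python
import torch
import torch.nn as nn

class MultiBranchKeypointHead(nn.Module):
    def __init__(self, in_ch=3, num_keypoints=14):
        super().__init__()
        self.branch_small = nn.Sequential(                  # shallow branch, fine detail
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.branch_large = nn.Sequential(                  # deeper branch, larger receptive field
            nn.Conv2d(in_ch, 32, 7, padding=3), nn.ReLU(),
            nn.Conv2d(32, 32, 7, padding=3), nn.ReLU())
        # One 1x1 "classifier" per key point, applied to the spliced features.
        self.classifiers = nn.ModuleList(
            [nn.Conv2d(64, 1, kernel_size=1) for _ in range(num_keypoints)])

    def forward(self, x):
        spliced = torch.cat([self.branch_small(x), self.branch_large(x)], dim=1)
        # Returns (N, K, H, W): one confidence image per key point.
        return torch.cat([clf(spliced) for clf in self.classifiers], dim=1)
```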
In an optional example of the foregoing embodiment, any one of the convolutional neural networks of the plurality of branches, referred to herein as the first convolutional neural network, may include a plurality of convolutional layers respectively located at different network depths. Accordingly, in this example, extracting the image feature of the second image block by the first convolutional neural network of the plurality of convolutional neural networks may include:
and performing feature fusion on a first feature output by a first convolutional layer with network depth i and a second feature output by a second convolutional layer with network depth j in the first convolutional neural network to obtain a fused feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolutional layer, the image feature of the second image block output by the first convolutional neural network is obtained by processing the fused feature, and 1 ≤ i < j.
The features extracted by convolutional layers at different network depths may be fused by splicing (concatenation) or by element-wise addition.
In some embodiments, the image feature output by the first convolutional neural network may be the fused feature, or may be obtained by performing any one or more processes on the fused feature, for example, the fused feature may be input to a subsequent convolutional layer and subjected to a feature extraction process by the subsequent convolutional layer, so as to obtain an output image feature, but the embodiment of the present application is not limited thereto.
In the embodiment of the present invention, the first convolutional layer and the second convolutional layer are only used for distinguishing any two convolutional layers with different network depths in the convolutional neural network, and do not represent specific convolutional layers, and the first convolutional layer and the second convolutional layer may be two adjacent convolutional layers or two convolutional layers separated by at least one convolutional layer.
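A minimal sketch of such a cross-layer connection inside one branch is given below. The layer sizes, the use of element-wise addition for the fusion, and the extra processing applied to the fused feature are assumptions chosen for illustration.

```python
import torch.nn as nn

class CrossLayerBranch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.conv_i = nn.Conv2d(ch, ch, 3, padding=1)   # first convolutional layer (depth i)
        self.mid = nn.Sequential(nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.conv_j = nn.Conv2d(ch, ch, 3, padding=1)   # second convolutional layer (depth j)
        self.post = nn.Conv2d(ch, ch, 3, padding=1)     # further processing of the fused feature

    def forward(self, x):
        first = self.conv_i(x)                  # first feature
        second = self.conv_j(self.mid(first))   # second feature, extracted from the first
        fused = first + second                  # feature fusion by addition
        return self.post(fused)                 # image feature output by this branch
```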
In a second optional implementation of the embodiments of the present invention, the human body key point detection network may include a plurality of network blocks, for example M network blocks, where the output of a previous network block serves as the input of the next network block. Each network block may include convolutional neural networks of a plurality of different branches; that is, the convolutional neural networks of a plurality of different branches included in the human body key point detection network in the first optional embodiment above constitute one network block in this example. Specifically, the output of the p-th network block of the M network blocks is the spliced features obtained by the plurality of convolutional neural networks included in the p-th network block, and the spliced features output by the p-th network block are input into the (p+1)-th network block, where M ≥ 2 and p = 1, ..., M-1. Correspondingly, in this embodiment, classifying the spliced features with a classifier corresponding to each of the plurality of key points of the human body to obtain a confidence image of the second image block for each key point may include: classifying the spliced features output by the M-th network block with the classifier corresponding to each of the plurality of key points of the human body to obtain a confidence image of the second image block for each of the plurality of key points of the human body.
In some embodiments, the feature output by each network block may be input into the subsequent network block, wherein the feature output by each network block may be a spliced feature obtained by splicing the features of the plurality of convolutional neural networks. At this time, optionally, the confidence images of the key points may be obtained by classifying the features (i.e., the stitching features) output by the last network block through a classifier.
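The chaining of network blocks can be sketched as follows. The number of blocks, the channel counts, and the stand-in content of each block are assumptions; only the general pattern (each block feeds the next, and the per-key-point classifiers see the features of the last block) reflects the description above.

```python
import torch.nn as nn

def make_block(in_ch, out_ch):
    # Stand-in for a multi-branch block; the real block splices several branches.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class StackedBlocksNetwork(nn.Module):
    def __init__(self, num_blocks=4, num_keypoints=14):
        super().__init__()
        self.blocks = nn.Sequential(
            make_block(3, 64), *[make_block(64, 64) for _ in range(num_blocks - 1)])
        self.classifier = nn.Conv2d(64, num_keypoints, kernel_size=1)

    def forward(self, x):
        return self.classifier(self.blocks(x))   # one confidence image per key point
```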
In a third optional implementation of the embodiments of the present invention, the human body key point detection network may include L nested network blocks. A nested network block may specifically be a nested Inception module or another type of nested module, which is not limited in this disclosure. A nested network block in the embodiments of the present invention includes network blocks of a plurality of branches; each network block of at least one of these branches in turn includes convolutional neural networks of a plurality of branches, and the features extracted by convolutional neural networks of different branches in the same network block have different scales. The value of L is an integer greater than 0. In one embodiment of the present invention, each nested network block, or at least one of them, includes network blocks of a plurality of branches, and the numbers of network blocks in different branches differ. In one embodiment of the present invention, convolutional neural networks of different branches may include different numbers of network layers, and/or one or more network layer parameters may differ between them. Accordingly, in this optional example, performing key point detection on the human body in the second image block and acquiring a confidence image of the second image block for each key point in at least one key point of the human body may include: extracting features of an input object through the network block of each of the plurality of branches included in a first nested network block to obtain first output features, where the input object is the second image block or the features output by the nested network block preceding the first nested network block. Specifically, when the first nested network block is the 1st of the L nested network blocks, the input object is the second image block; when the first nested network block is any of the 2nd to L-th nested network blocks, the input object is the features output by the preceding nested network block. The first output features output by the network blocks of the plurality of branches included in the first nested network block are spliced (concatenated) to obtain second output features; the second output features are classified with a classifier corresponding to each key point in at least one key point of the human body to obtain a confidence image of the second image block for each of the at least one key point, where the confidence image of a key point includes the confidence that each of at least one position in the image block is determined as the key point to which the confidence image belongs; and the position with the maximum confidence in the confidence image of a key point is determined as the position, in the image block, of the key point to which the confidence image belongs.
Further, optionally, in the third optional embodiment, the L nested network blocks may further include a second nested network block, where the input of the second nested network block is connected to the output of the first nested network block. Correspondingly, performing key point detection on the human body in the second image block to obtain a confidence image of the second image block for each key point in at least one key point of the human body may further include: performing feature extraction on the second output features output by the first nested network block through the second nested network block to obtain third output features. Correspondingly, the third output features are classified with a classifier corresponding to each key point in at least one key point of the human body to obtain a confidence image of the second image block for each of the at least one key point, where the confidence image of a key point includes the confidence that each of at least one position in the image block is determined as the key point to which the confidence image belongs; and the position with the maximum confidence in the confidence image of a key point is determined as the position, in the image block, of the key point to which the confidence image belongs.
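The nesting of blocks within blocks can be sketched as follows, with assumed branch counts and channel sizes: each outer branch is itself a multi-branch block, and the outer outputs are spliced. This mirrors the structure described above without reproducing the patented configuration.

```python
import torch
import torch.nn as nn

class InnerMultiBranch(nn.Module):
    """A network block whose branches are convolutional networks of different scale."""
    def __init__(self, in_ch, branch_ch=16):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.b2 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x)], dim=1)   # spliced inner features (32 channels)

class NestedBlock(nn.Module):
    """A nested block whose branches are themselves multi-branch blocks."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch_a = InnerMultiBranch(in_ch)
        self.branch_b = nn.Sequential(InnerMultiBranch(in_ch), InnerMultiBranch(32))

    def forward(self, x):
        # Second output features: splice of the first output features of all branches.
        return torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
```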
In a fourth alternative implementation of the embodiments of the present invention, the human keypoint detection network may include a plurality of convolutional layers respectively located at different network depths. Accordingly, in this optional example, performing keypoint detection on the human body in the second image block, and acquiring a confidence image of the second image block for each keypoint of at least one keypoint of the human body may include:
and performing feature fusion on a first feature output by a first convolutional layer with network depth i and a second feature output by a second convolutional layer with network depth j to obtain a fused feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolutional layer, the image feature of the second image block is obtained by processing the fused feature, and 1 ≤ i < j.
In the embodiment of the present invention, the first convolutional layer and the second convolutional layer are only used for distinguishing any two convolutional layers with different network depths in the convolutional neural network, and do not represent specific convolutional layers, and the first convolutional layer and the second convolutional layer may be two adjacent convolutional layers or two convolutional layers separated by at least one convolutional layer.
The features extracted by convolutional layers at different network depths may be fused by splicing (concatenation) or by element-wise addition.
In this embodiment, cross-layer connections are added and two or more features extracted by convolutional layers at different network depths are fused: the feature extracted by a lower convolutional layer is connected to the feature extracted by a higher convolutional layer, so that the fused feature carries both the information of the higher-layer feature and the detail information of the lower-layer feature. This provides additional detail features for the key point decision and helps to improve the accuracy of key point detection.
The embodiment of the invention realizes the human body key point detection network by using the convolutional neural network, and ensures the robustness and accuracy of key point detection. The embodiment of the invention further improves the accuracy of key point detection by constructing a multi-branch network structure and adding cross-layer connection.
In addition, the fourth optional embodiment may also be used in combination with the second or third optional embodiment. When they are combined, the above feature-fusion scheme may be adopted in one or more convolutional networks of different branches within one or more network blocks, and the features output by network blocks at different levels may also be connected across layers; for example, the feature output by the first network block may be fused with the feature output by the fifth network block to serve as the output feature of the fifth network block.
Further, before the flow of the human body key point detection method according to any of the above embodiments of the present invention, the method may further include:
training a head-body mapping network through a first sample image block, wherein the first sample image block is marked with the position of a human body central point; and/or
And training the human body key point detection network through a second sample image block, wherein the second sample image block is marked with the key point information of the human body.
In an alternative example of the embodiments of the present invention, the head-body mapping network includes a plurality of convolutional layers and a classification layer. The head-body mapping network is trained through a first sample image block which is marked with the human body central point position, and the training process may include the following steps:
by adopting the method of any embodiment of the present invention, size normalization is performed on all collected original images with the normalization parameter S, so that the long or wide edge of the face equals the normalized face size Wr, for example 29 pixels;
intercepting, from the size-normalized image, an image block centered on the face center position and M times the normalized face size, and scaling it to a first preset size; for example, an image block 9 times the normalized face size is cut out and scaled to 256 × 256 pixels; the image block is input into the head-body mapping network as the first sample image block, and the accurate human body central point position marked in the first sample image block serves as the supervision label;
the plurality of convolutional layers in the head-body mapping network sequentially extract features from the first sample image block; based on the features output by the convolutional layers, the classification layer determines the confidence that each position (namely, each pixel) in the first sample image block is the human body central point, and the position with the maximum confidence is selected as the predicted human body central point position;
training the head-body mapping network by stochastic gradient descent according to the difference between the human body central point position marked in the first sample image block and the predicted human body central point position, and adjusting the network parameter values of each network layer of the head-body mapping network until a preset condition is met (see the training sketch after the iteration note below).
The training process may be an iterative training process, that is: the above training process is repeatedly executed until a preset condition is met, for example, the number of training iterations reaches a preset number, or the difference between the human body central point position marked on the first sample image block and the predicted human body central point position is smaller than a first preset value.
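A minimal training sketch for the head-body mapping network is given below. It assumes the network outputs a confidence map for the human body central point, that the labelled center has been encoded as a target map by the data loader, and that mean-squared error measures the difference between the labelled and predicted center; these choices and all hyper-parameters are assumptions for illustration rather than the patented training procedure.

```python
import torch
import torch.nn as nn

def train_head_body_mapping(net, loader, epochs=10, lr=1e-3):
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)   # stochastic gradient descent
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for image_block, center_heatmap in loader:   # center_heatmap encodes the labelled center
            pred = net(image_block)                  # predicted confidence of the body center
            loss = loss_fn(pred, center_heatmap)     # difference between labelled and predicted
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net
```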
After the training of the head-body mapping network is completed, a test image can be input into the head-body mapping network, and whether the position of the human body central point in the image output by the head-body mapping network is correct or not is compared.
In another optional example of the embodiment of the present invention, the human body key point detection network includes a plurality of hierarchical network blocks, and an output of a network block of a previous hierarchy is used as an input of a network block of a next hierarchy; each network block comprises a convolutional network of a plurality of different branches.
The human body key point detection network is trained through the second sample image block, the second sample image block is marked with the key point information of the human body, and the training process may include, for example:
after the position of the human body central point and the normalization parameter S are obtained, firstly, scaling the original image by using the normalization parameter S to obtain a human body image with normalized size;
and cutting out an image block with a specific second preset size from the size-normalized image according to the position of the human body central point, which is input into the human body key point detection network as the second sample image block. The second preset size may be, for example, 256 × 256 pixels. For each key point of the human body, a confidence image of the key point at each position of the second sample image block is generated as the training supervision information of the human body key point detection network; the confidence image is generated with a Gaussian response function of the distance from each position in the second sample image block to the position marked as that key point (a sketch of this Gaussian supervision is given after the training description below);
and each network block sequentially performs feature extraction on the second sample image block. The convolutional networks of different branches in each network block extract image features of different scales and splice them to obtain spliced features; features of convolutional layers at different network depths, in different network blocks or within the same network block, are fused through cross-layer connections;
classifying the features finally output by the plurality of hierarchical network blocks with a classifier corresponding to each key point in at least one key point of the human body, so as to obtain a confidence image of the second sample image block for each key point of the human body; that is, a corresponding confidence image is output for each key point to be detected, and the confidence image includes the confidence that each position in the second sample image block is determined as that key point. Each key point of the human body corresponds to one classifier, which determines the confidence image of that key point over the positions of the second sample image block;

for each key point, selecting the position with the maximum confidence in its confidence image as the key point position, thereby obtaining the predicted position of each key point of the human body;

and training the human body key point detection network by stochastic gradient descent according to the differences between the human body key point positions determined by the training supervision information of the second sample image block and the predicted positions of the key points, and adjusting the network parameter values of each network layer of the human body key point detection network until a preset condition is met.
The training process may be an iterative training process, that is: the above training process is repeatedly executed until a preset condition is met, for example, the number of training iterations reaches a preset number, or the differences between the key point positions determined by the training supervision information and the predicted key point positions are smaller than a second preset value.
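The Gaussian-response supervision mentioned above can be sketched as follows; the block size and the standard deviation sigma are assumed values, and the function name is illustrative.

```python
import numpy as np

def gaussian_confidence_map(kp_x, kp_y, height=256, width=256, sigma=7.0):
    """Confidence image used as the training label for one key point."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist_sq = (xs - kp_x) ** 2 + (ys - kp_y) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))   # Gaussian response around the key point

# One map per labelled key point of the second sample image block:
# labels = np.stack([gaussian_confidence_map(x, y) for (x, y) in keypoints])
```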
After the training of the human body key point detection network is finished, a test image can be input into the human body key point detection network, and whether the predicted position of each key point of the human body in the image output by the human body key point detection network is correct or not is compared.
In a further optional example, during training of the human body key point detection network, a classifier corresponding to each key point of the human body may be added after the output of each network block. The features output by the current network block are classified to obtain a confidence image of each key point and thus a predicted position of each key point. The difference between the key point positions determined by the training supervision information of the second sample image block and the predicted positions obtained from the current network block is taken as the current difference of that block, and the human body key point detection network is trained by further combining the current differences of all network blocks, which improves the training efficiency and the training result.
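A minimal sketch of this intermediate supervision is given below. The choice of one 1x1 convolution per block as the added classifier, mean-squared error as the difference measure, and a simple sum of the per-block differences are assumptions made for illustration.

```python
import torch.nn as nn

def intermediate_supervision_loss(block_features, per_block_classifiers, target_maps):
    """block_features: list of feature maps, one per network block.
    per_block_classifiers: one classifier (e.g. a 1x1 conv) per network block.
    target_maps: the Gaussian confidence maps used as supervision."""
    loss_fn = nn.MSELoss()
    total = 0.0
    for feats, clf in zip(block_features, per_block_classifiers):
        total = total + loss_fn(clf(feats), target_maps)   # current difference of this block
    return total
```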
In the above training processes of the head-body mapping network and the human body key point detection network, the corresponding technical content may be implemented with any corresponding embodiment of the human body key point detection methods, and the details are not repeated here.
Any human body key point detection method provided by the embodiment of the invention can be executed by any appropriate device with data processing capability, including but not limited to: terminal equipment, a server and the like. Alternatively, any human body key point detection method provided by the embodiment of the present invention may be executed by a processor, for example, the processor may execute any human body key point detection method mentioned in the embodiment of the present invention by calling a corresponding instruction stored in a memory. And will not be described in detail below.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 3 is a schematic structural diagram of an embodiment of the human body key point detection device of the present invention. The human body key point detection device of the embodiment can be used for realizing the human body key point detection method embodiments of the invention. As shown in fig. 3, the human body key point detecting device of this embodiment includes: the system comprises a face detection module, a head-body mapping network and a human body key point detection network. Wherein:
and the face detection module is used for carrying out face detection on the image to obtain the position information of the face in the image.
And the head-body mapping network module is used for determining the position information of the human body central point corresponding to the human face according to the position information of the human face.
And the human body key point detection network module is used for detecting the human body key points of the image according to the position information of the human body center point.
Based on the human body key point detection device provided by the embodiment of the present invention, face detection is performed on the image; after the position information of the face in the image is obtained, the position information of the human body center point corresponding to the face is determined according to the position information of the face, and human body key point detection is performed on the image according to the position information of the human body center point. The embodiment of the present invention detects human body key points via the positions of the face and the human body center point without having to output a complete human body bounding box; since the human body center point generally lies within the upper torso, whose deformation is relatively small, the embodiment is less affected by complex human poses and the detected key point positions are more accurate.
Fig. 4 is a schematic structural diagram of another embodiment of the human body key point detection device of the invention. As shown in fig. 4, compared with the embodiment shown in fig. 3, the human body key point detecting device of this embodiment further includes: the first intercepting module is used for acquiring a first image block with a first preset size from the image (namely, an original image) according to the position information of the face, wherein the first image block comprises at least one part of a human body corresponding to the face. Correspondingly, in this embodiment, the head-body mapping network module is specifically configured to determine, according to the first image block, the position information of the human body center point corresponding to the human face.
In an alternative implementation of the embodiment of the human body key point detecting apparatus shown in fig. 4, the position information of the face includes a center position of the face and size information of the face. Accordingly, in this embodiment, the first truncation module is specifically configured to: determining normalization parameters of the image according to the size information of the face; according to the normalization parameters of the image, carrying out size normalization processing on the image to obtain an image with a normalized size; and intercepting a first image block with a first preset size from the image with the normalized size according to the center position of the face.
Illustratively, when the first truncation module determines the normalization parameter of the image according to the size information of the face, the first truncation module is specifically configured to: and according to the size information of the face, determining a normalization parameter corresponding to the size of the face to be scaled to a preset normalization face size.
Further exemplarily, the first truncation module, when truncating the first image block having the first preset size from the image after size normalization according to the center position of the human face, is specifically configured to: and intercepting an image block which takes the central position of the face as the center and has the size of M times of the size of the normalized face from the image with the normalized size, and scaling the intercepted image block to a first preset size to obtain a first image block, wherein the value of M is more than 3 and less than 20.
In addition, referring back to fig. 4, in another embodiment of the human body key point detection device, the device further includes: a second intercepting module, configured to acquire a second image block with a second preset size from the image according to the position information of the human body center point, wherein the second image block includes at least one part of the human body corresponding to the face. Correspondingly, in this embodiment, the human body key point detection network module is specifically configured to perform human body key point detection in the second image block.
In an optional implementation manner of the above human body key point detecting device embodiment, the second intercepting module is specifically configured to: and intercepting a second image block with a second preset size and centered at the position of the human body central point from the image with the normalized size.
Illustratively, when the human key point detection network module performs human key point detection in the second image block, it is specifically configured to: and detecting the key points of the human body on the second image block to obtain the positions of the key points of the human body in the second image block.
Further exemplarily, the human key point detection network module performs human key point detection on the second image block, and when the positions of the human key points in the second image block are obtained, the human key point detection network module is specifically configured to: detecting key points of the human body in the second image block, and acquiring a confidence image of the second image block aiming at each key point in at least one key point of the human body, wherein the confidence image of each key point comprises the confidence of each position in the second image block which is judged as the key point; and determining the position with the maximum confidence level in the confidence level images of the key points as the position of the key point to which the confidence level image belongs in the second image block.
In addition, referring to fig. 4 again, in yet another embodiment of the human body key point detection device, the device may further include: an acquisition module, configured to determine the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the position of the second image block in the image.
In one optional implementation, the obtaining module is specifically configured to: acquiring the positions of the key points of the human body in the image with the normalized size according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image with the normalized size; and determining the position of the key point of the human body in the image according to the position of the key point of the human body in the image with the normalized size and the normalization parameter.
Fig. 5 is a schematic structural diagram of an embodiment of a human body key point detection network module in the embodiment of the present invention. As shown in fig. 5, the human body key point detection network module of this embodiment includes: the system comprises a plurality of convolutional network modules, a splicing unit and classifiers corresponding to each key point of a human body. Wherein:
and the plurality of convolutional network modules are respectively used for extracting the image features of the second image block through each convolutional network module in the plurality of convolutional network modules, wherein the image features extracted by different convolutional network modules in the plurality of convolutional network modules have different scales.
And the splicing unit is used for splicing the image features of different scales extracted by the plurality of convolution network modules to obtain spliced features.
And the classifiers corresponding to the key points of the human body are respectively used for classifying the splicing characteristics to obtain a confidence image of the second image block aiming at each key point in at least one key point of the human body.
In an alternative implementation of the embodiment shown in fig. 5, each convolutional network module may include a plurality of convolutional layers located at different network depths. A first convolutional network module of the plurality of convolutional network modules is specifically configured to: perform feature fusion on a first feature output by a first convolutional layer with network depth i and a second feature output by a second convolutional layer with network depth j to obtain a fused feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolutional layer, the image feature of the second image block output by the first convolutional network module is obtained by processing the fused feature, and 1 ≤ i < j.
In one optional example, the human body key point detection network module may include a plurality of network blocks, for example M network blocks, with the output of the previous network block serving as the input of the next network block; each network block includes the plurality of convolutional network modules, namely: the output of the p-th network block of the M network blocks is the spliced features obtained by the plurality of convolutional network modules included in the p-th network block, and the spliced features output by the p-th network block are input into the (p+1)-th network block, where M ≥ 2 and p = 1, ..., M-1. Correspondingly, in this embodiment, the classifiers corresponding to the key points in the human body key point detection network module are specifically configured to classify the spliced features output by the M-th network block, and obtain a confidence image of the second image block for each key point in the plurality of key points of the human body.
Fig. 6 is a schematic structural diagram of another embodiment of the human body key point detection network module in the embodiment of the present invention. As shown in fig. 6, the human body key point detection network module of this embodiment includes: a plurality of convolutional layers respectively located at different network depths, a fusion unit, and classifiers corresponding to each key point of the human body. Wherein:
and the plurality of convolution layers respectively positioned at different network module depths are respectively used for sequentially carrying out feature extraction on the second image block.
And the fusion unit is used for performing feature fusion on a first feature output by a first convolution layer with the network depth of i and a second feature output by a second convolution layer with the network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block is obtained by processing the fusion feature, and i is more than or equal to 1 and less than j.
And the classifiers corresponding to the key points in the at least one key point of the human body are respectively used for classifying the image characteristics of the second image block to obtain a confidence image of the second image block for each key point in the at least one key point of the human body.
In addition, the embodiment of the invention also provides electronic equipment which comprises the human body key point detection device in any one of the embodiments of the invention.
In addition, another electronic device is provided in an embodiment of the present invention, including:
a memory for storing executable instructions; and
a processor for communicating with the memory to execute the executable instructions to perform the operations of the human keypoint detection method of any of the above embodiments of the invention.
Fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present invention. Referring now to fig. 7, shown is a schematic diagram of an electronic device suitable for implementing a terminal device or server of an embodiment of the present application. As shown in fig. 7, the electronic device includes one or more processors, a communication part, and the like, for example: one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs), etc., which may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or loaded from a storage section into a random access memory (RAM). The communication part may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory and/or the random access memory to execute the executable instructions, connect with the communication part through a bus, and communicate with other target devices through the communication part, so as to complete operations corresponding to any method provided by the embodiments of the present application, for example: performing face detection on an image to obtain position information of a face in the image; determining the position information of the human body center point corresponding to the face according to the position information of the face; and performing human body key point detection on the image according to the position information of the human body center point.
In addition, the RAM may also store various programs and data necessary for the operation of the apparatus. The CPU, ROM, and RAM are connected to each other via a bus. When a RAM is present, the ROM is an optional module. The RAM stores executable instructions, or executable instructions are written into the ROM at runtime, and the executable instructions cause the processor to execute operations corresponding to any one of the methods of the present invention. An input/output (I/O) interface is also connected to the bus. The communication part may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) connected to the bus link.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card, a modem, or the like. The communication section performs communication processing via a network such as the internet. The drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive as necessary, so that a computer program read out therefrom is mounted into the storage section as necessary.
It should be noted that the architecture shown in fig. 7 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 7 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication part may be separately set or integrated on the CPU or the GPU, and so on. These alternative embodiments are all within the scope of the present disclosure.
In addition, an embodiment of the present invention further provides a computer storage medium, configured to store a computer-readable instruction, where the instruction is executed to implement the operation of the human body key point detection method according to any one of the above embodiments of the present invention.
In addition, an embodiment of the present invention further provides a computer program, which includes computer readable instructions, and when the computer readable instructions are run in a device, a processor in the device executes executable instructions for implementing steps in the human body key point detection method according to any one of the above embodiments of the present invention.
In an alternative embodiment, the computer program is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
In one or more alternative embodiments, the present invention further provides a computer program product for storing computer readable instructions, which when executed, cause a computer to execute the human key point detection method described in any one of the above possible implementation manners.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative example, the computer program product is embodied as a computer storage medium, and in another alternative example, the computer program product is embodied as a software product, such as an SDK or the like.
In one or more optional implementation manners, embodiments of the present invention further provide a human body key point detection method, and a corresponding apparatus and electronic device, a computer storage medium, a computer program, and a computer program product, where the method includes: the first device sends a human body key point detection instruction to the second device, wherein the instruction causes the second device to execute the human body key point detection method in any possible embodiment; the first device receives the human body key point information sent by the second device.
In some embodiments, the human body key point detection instruction may be specifically a call instruction, and the first device may instruct, in a call manner, the second device to perform the detection of the human body key point, and accordingly, in response to receiving the call instruction, the second device may perform the steps and/or processes in any embodiment of the human body key point detection method.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as a computer software program. For example, embodiments of the present invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the method illustrated in the flowchart, where the program code may include instructions corresponding to executing steps of the method provided by embodiments of the present invention, for example, instructions for performing face detection on an image to obtain position information of a face in the image; determining the position information of the human body central point corresponding to the face according to the position information of the face; and carrying out human key point detection on the image according to the position information of the human center point.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The method and apparatus of the present invention may be implemented in a number of ways. For example, the methods and apparatus of the present invention may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form of the disclosed embodiments. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (32)

1. A human body key point detection method is characterized by comprising the following steps:
carrying out face detection on an image to obtain position information of a face in the image;
determining the position information of the human body central point corresponding to the face according to the position information of the face through a head-body mapping network;
and detecting the human key points of the image according to the position information of the human center point through a human key point detection network.
2. The method according to claim 1, wherein the determining the position information of the human body center point corresponding to the face according to the position information of the face through a head-body mapping network comprises:
acquiring a first image block with a first preset size from the image according to the position information of the face, wherein the first image block comprises at least one part of a human body corresponding to the face;
and determining the position information of the human body central point corresponding to the face according to the first image block and the head-body mapping network.
3. The method according to claim 2, wherein the position information of the face includes a center position of the face and size information of the face;
the acquiring a first image block with a first preset size from the image according to the position information of the face includes:
determining a normalization parameter of the image according to the size information of the face;
according to the normalization parameters of the image, carrying out size normalization processing on the image to obtain an image with a normalized size;
and intercepting the first image block with a first preset size from the image with the normalized size according to the central position of the face.
4. The method of claim 3, wherein determining the normalization parameter of the image according to the size information of the human face comprises:
and determining a normalization parameter corresponding to the size of the face scaled to a preset normalization face size according to the size information of the face.
5. The method according to claim 4, wherein the intercepting the first image block with the first preset size in the image after the size normalization according to the center position of the face comprises:
and intercepting an image block which takes the central position of the face as the center and has the size which is M times of the size of the normalized face from the image with the normalized size, and scaling the intercepted image block to the first preset size to obtain the first image block, wherein the value of M is more than 3 and less than 20.
6. The method according to any one of claims 1 to 5, wherein the performing, by the human body key point detection network, human body key point detection on the image according to the position information of the human body center point comprises:
acquiring a second image block with a second preset size from the image according to the position information of the human body central point, wherein the second image block comprises at least one part of a human body corresponding to the human face;
and detecting the human key points in the second image block through the human key point detection network.
7. The method according to claim 6, wherein the obtaining a second image block with a second preset size from the image according to the position information of the center point of the human body comprises:
and intercepting the second image block with a second preset size by taking the position of the human body central point as the center from the image with the normalized size.
8. The method of claim 6, wherein the performing human keypoint detection in the second image block comprises:
and detecting key points of the human body on the second image block to obtain the positions of the key points of the human body in the second image block.
9. The method according to claim 8, wherein the performing human key point detection on the second image block to obtain the locations of the human key points in the second image block comprises:
performing key point detection on the human body in the second image block to obtain a confidence image of the second image block for each key point in at least one key point of the human body, wherein the confidence image of each key point comprises the confidence that each of at least one position in the second image block is determined as the key point to which the confidence image belongs;
and determining the position with the maximum confidence level in the confidence level images of the key points as the position of the key point to which the confidence level image belongs in the second image block.
10. The method according to claim 8, wherein after said obtaining the locations of the key points of the human body in the second image block, the method further comprises:
and determining the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image.
11. The method according to claim 10, wherein determining the positions of the key points of the human body in the image according to the positions of the key points of the human body in the second image block and the positions of the second image block in the image comprises:
acquiring the positions of the key points of the human body in the image with the normalized size according to the positions of the key points of the human body in the second image block and the position of the second image block in the image with the normalized size;
and determining the positions of the key points of the human body in the image based on the positions of the key points of the human body in the image with the normalized size and the normalization parameters of the image.
12. The method of claim 9, wherein the human keypoint detection network comprises a plurality of convolutional neural networks;
the detecting key points of the human body in the second image block to obtain a confidence image of the second image block for each key point of at least one key point of the human body includes:
extracting, by each convolutional neural network of the plurality of convolutional neural networks, image features of the second image block, wherein the image features extracted by different convolutional networks of the plurality of convolutional neural networks have different scales;
splicing the image features of different scales extracted by the plurality of convolutional neural networks to obtain spliced features;
classifying the splicing features by using a classifier corresponding to each key point in the at least one key point of the human body to obtain a confidence image of the second image block for each key point in the at least one key point of the human body.
13. The method of claim 12, wherein a first convolutional neural network of the plurality of convolutional neural networks comprises a plurality of convolutional layers respectively located at different network depths;
the extracting, by each convolutional neural network of the plurality of convolutional neural networks, the image feature of the second image block includes:
and performing feature fusion on a first feature output by a first convolution layer with a network depth of i and a second feature output by a second convolution layer with a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block output by the first convolution neural network is obtained by processing the fusion feature, and i is more than or equal to 1 and less than j.
14. The method according to claim 12, wherein the human key point detection network comprises M network blocks connected, each network block comprises the plurality of convolutional neural networks, the output of a p-th network block in the M network blocks is a spliced feature obtained by the plurality of convolutional neural networks included in the p-th network block, and the spliced feature output by the p-th network block is input into a p + 1-th network block, wherein M is greater than or equal to 2, p is 1, …, M-1;
the classifying the stitching features by using the classifier corresponding to each key point of the plurality of key points of the human body to obtain a confidence image of the second image block for each key point of the plurality of key points of the human body includes:
and classifying the splicing features output by the Mth network block by using the classifier corresponding to each key point in the plurality of key points of the human body to obtain a confidence image of the second image block aiming at each key point in the plurality of key points of the human body.
15. The method of claim 9, wherein the human keypoint detection network comprises a plurality of convolutional layers respectively located at different network depths;
the detecting key points of the human body in the second image block to obtain a confidence image of the second image block for each key point of at least one key point of the human body includes:
and performing feature fusion on a first feature output by a first convolution layer with a network depth of i and a second feature output by a second convolution layer with a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolution layer, the image feature of the second image block is obtained by processing the fusion feature, and i is more than or equal to 1 and less than j.
16. A human key point detection device, comprising:
the face detection module is used for carrying out face detection on the image to obtain the position information of the face in the image;
the head-body mapping network module is used for determining the position information of the human body central point corresponding to the face according to the position information of the face;
and the human body key point detection network module is used for detecting the human body key points of the image according to the position information of the human body center point.
17. The apparatus of claim 16, further comprising:
a first intercepting module, configured to acquire a first image block of a first preset size from the image according to the position information of the face, wherein the first image block comprises at least a part of a human body corresponding to the face;
wherein the head-body mapping network module is specifically configured to determine, according to the first image block, the position information of the human body center point corresponding to the face.
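One plausible form of the head-body mapping network module is a small regressor from the first image block to the body center coordinates. The PyTorch sketch below is an assumption for illustration: the layer sizes and the (x, y) output parameterization are not specified by the claims.

```python
import torch
import torch.nn as nn

class HeadBodyMapper(nn.Module):
    """Illustrative regressor: first image block (face-centered crop) ->
    (x, y) of the human body center point in the crop's coordinate frame."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(32, 2)  # predicted (x, y) body center

    def forward(self, first_image_block):
        f = self.features(first_image_block).flatten(1)
        return self.regressor(f)
```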
18. The apparatus according to claim 17, wherein the position information of the face comprises a center position of the face and size information of the face;
the first intercepting module is specifically configured to:
determine a normalization parameter of the image according to the size information of the face;
perform size normalization processing on the image according to the normalization parameter of the image to obtain a size-normalized image; and
intercept the first image block of the first preset size from the size-normalized image according to the center position of the face.
19. The apparatus according to claim 18, wherein the first intercepting module, when determining the normalization parameter of the image according to the size information of the face, is specifically configured to: determine, according to the size information of the face, a normalization parameter corresponding to scaling the size of the face to a preset normalized face size.
20. The apparatus according to claim 19, wherein the first intercepting module, when intercepting the first image block of the first preset size from the size-normalized image according to the center position of the face, is specifically configured to: intercept, from the size-normalized image, an image block that is centered on the center position of the face and whose size is M times the normalized face size, and scale the intercepted image block to the first preset size to obtain the first image block, wherein M is greater than 3 and less than 20.
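A numeric sketch of claims 18-20, assuming the normalization parameter is a single scale factor and assuming OpenCV's resize is available. The preset normalized face size of 64 pixels, the multiplier 4 (within the claimed 3 < M < 20), and the 256-pixel first preset size are illustrative assumptions.

```python
import cv2
import numpy as np

def intercept_first_image_block(image: np.ndarray,
                                face_center: tuple,          # (x, y) in the original image
                                face_size: float,            # e.g. face box width
                                normalized_face_size: float = 64.0,  # assumed preset
                                multiplier: float = 4.0,             # assumed, 3 < M < 20
                                preset_size: int = 256):             # assumed first preset size
    # normalization parameter: scale that brings the face to the preset face size
    scale = normalized_face_size / face_size
    resized = cv2.resize(image, None, fx=scale, fy=scale)

    # face center in the size-normalized image
    cx, cy = face_center[0] * scale, face_center[1] * scale
    half = multiplier * normalized_face_size / 2.0

    # intercept a block of size M * normalized_face_size centered on the face
    x0, y0 = max(int(round(cx - half)), 0), max(int(round(cy - half)), 0)
    x1 = min(int(round(cx + half)), resized.shape[1])
    y1 = min(int(round(cy + half)), resized.shape[0])
    block = resized[y0:y1, x0:x1]

    # scale the intercepted block to the first preset size
    first_image_block = cv2.resize(block, (preset_size, preset_size))
    return first_image_block, scale, (x0, y0)
```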
21. The apparatus of any of claims 16-20, further comprising:
a second intercepting module, configured to acquire a second image block of a second preset size from the image according to the position information of the human body center point, wherein the second image block comprises at least a part of the human body corresponding to the face;
wherein the human body key point detection network module is specifically configured to perform human body key point detection in the second image block.
22. The apparatus according to claim 21, wherein the second intercepting module is specifically configured to: intercept, from the size-normalized image, the second image block of the second preset size centered on the position of the human body center point.
23. The apparatus according to claim 22, wherein the human body key point detection network module, when performing human body key point detection in the second image block, is specifically configured to: perform key point detection on the human body in the second image block to obtain positions of the human body key points in the second image block.
24. The apparatus according to claim 23, wherein the human body key point detection network module, when performing key point detection on the human body in the second image block to obtain the positions of the human body key points in the second image block, is specifically configured to:
perform key point detection on the human body in the second image block to acquire a confidence image of the second image block for each key point of at least one key point of the human body, wherein the confidence image of a key point comprises, for at least one position in the second image block, a confidence that the position is judged to be that key point; and determine the position with the maximum confidence in the confidence image of a key point as the position, in the second image block, of the key point to which the confidence image belongs.
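A small NumPy sketch of reading key point positions out of confidence images as claim 24 describes, by taking the per-keypoint maximum-confidence position. The (K, H, W) map layout is an assumption.

```python
import numpy as np

def keypoints_from_confidence_maps(conf_maps: np.ndarray) -> np.ndarray:
    """conf_maps: (K, H, W), one confidence image per key point.
    Returns (K, 2) array of (x, y) positions, each taken at the maximum
    confidence of its map."""
    K, H, W = conf_maps.shape
    flat_idx = conf_maps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=1)

# usage (illustrative):
# maps = np.random.rand(14, 64, 64)
# positions = keypoints_from_confidence_maps(maps)  # shape (14, 2)
```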
25. The apparatus of claim 23, further comprising:
an acquisition module, configured to determine the positions of the human body key points in the image according to the positions of the human body key points in the second image block and the position of the second image block in the image.
26. The apparatus according to claim 25, wherein the acquisition module is specifically configured to:
acquire the positions of the human body key points in the size-normalized image according to the positions of the human body key points in the second image block and the position of the second image block in the size-normalized image; and
determine the positions of the human body key points in the image based on the positions of the human body key points in the size-normalized image and the normalization parameter of the image.
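A minimal sketch of the coordinate mapping in claim 26, assuming the second image block was intercepted at a known offset (x0, y0) in the size-normalized image and that the normalization parameter is a single scale factor; both are assumptions for illustration.

```python
import numpy as np

def map_keypoints_to_original(kpts_in_block: np.ndarray,  # (K, 2) (x, y) in the second image block
                              block_offset: tuple,         # (x0, y0) of the block in the size-normalized image
                              scale: float) -> np.ndarray: # normalization parameter (assumed scale factor)
    # positions in the size-normalized image
    kpts_norm = kpts_in_block + np.asarray(block_offset, dtype=float)
    # undo the size normalization to get positions in the original image
    return kpts_norm / scale
```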
27. The apparatus according to claim 24, wherein the human body key point detection network module comprises:
a plurality of convolutional neural network modules, each configured to extract an image feature of the second image block, wherein the image features extracted by different convolutional neural network modules of the plurality of convolutional neural network modules have different scales;
a splicing unit, configured to splice the image features of different scales extracted by the plurality of convolutional neural network modules to obtain a spliced feature; and
classifiers respectively corresponding to the key points of the at least one key point of the human body, each configured to classify the spliced feature to obtain a confidence image of the second image block for the corresponding key point of the at least one key point of the human body.
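One illustrative way to obtain image features at different scales, as recited in claim 27, is an explicit image pyramid: each branch sees the second image block at a different resolution, and the splicing unit concatenates the branch outputs before the per-keypoint classifiers. The scale set, channel counts, and bilinear resampling below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatures(nn.Module):
    """Illustrative take on claim 27: per-scale branches, a splicing
    (concatenation) step, and per-keypoint 1x1-conv classifiers."""
    def __init__(self, in_ch=3, branch_ch=32, scales=(1.0, 0.5, 0.25),
                 num_keypoints=14):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 3, padding=1) for _ in scales]
        )
        # per-keypoint classifiers over the spliced feature
        self.classifiers = nn.ModuleList(
            [nn.Conv2d(branch_ch * len(scales), 1, 1) for _ in range(num_keypoints)]
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for s, branch in zip(self.scales, self.branches):
            xs = x if s == 1.0 else F.interpolate(
                x, scale_factor=s, mode='bilinear', align_corners=False)
            f = branch(xs)
            # bring every branch back to a common resolution before splicing
            feats.append(F.interpolate(f, size=(h, w), mode='bilinear',
                                       align_corners=False))
        spliced = torch.cat(feats, dim=1)
        return torch.cat([c(spliced) for c in self.classifiers], dim=1)
```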
28. The apparatus according to claim 27, wherein each of the convolutional neural network modules comprises a plurality of convolutional layers respectively located at different network depths;
a first convolutional neural network module of the plurality of convolutional neural network modules is specifically configured to: perform feature fusion on a first feature output by a first convolutional layer at a network depth of i and a second feature output by a second convolutional layer at a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolutional layer, the image feature of the second image block output by the first convolutional neural network module is obtained by processing the fusion feature, and 1 ≤ i < j.
29. The apparatus according to claim 27, wherein the human body key point detection network module comprises M network blocks connected in sequence, each network block comprises the plurality of convolutional neural network modules, the output of a p-th network block of the M network blocks is a spliced feature obtained by the plurality of convolutional neural network modules included in the p-th network block, and the spliced feature output by the p-th network block is input into a (p+1)-th network block, wherein M ≥ 2 and p = 1, …, M-1;
the classifiers respectively corresponding to the key points of the at least one key point of the human body are specifically configured to: classify the spliced feature output by the M-th network block to obtain a confidence image of the second image block for each key point of the at least one key point of the human body.
30. The apparatus according to claim 24, wherein the human body key point detection network module comprises:
a plurality of convolutional layers, respectively located at different network depths and each configured to perform feature extraction;
a fusion unit, configured to perform feature fusion on a first feature output by a first convolutional layer at a network depth of i and a second feature output by a second convolutional layer at a network depth of j to obtain a fusion feature, wherein the second feature is obtained by sequentially performing feature extraction on the first feature through at least one convolutional layer, the image feature of the second image block is obtained by processing the fusion feature, and 1 ≤ i < j; and
classifiers respectively corresponding to the key points of the at least one key point of the human body, each configured to classify the image feature of the second image block to obtain a confidence image of the second image block for the corresponding key point of the at least one key point of the human body.
31. An electronic device, comprising:
a memory, configured to store computer readable instructions; and a processor, wherein execution of the computer readable instructions by the processor causes the processor to perform the human body key point detection method according to any one of claims 1 to 15.
32. A computer storage medium storing computer readable instructions, wherein the computer readable instructions, when executed in a device, cause a processor in the device to perform the human body key point detection method according to any one of claims 1 to 15.
CN201810055582.1A 2018-01-19 2018-01-19 Human body key point detection method and apparatus, electronic device, storage medium, and program Active CN108229418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810055582.1A CN108229418B (en) 2018-01-19 2018-01-19 Human body key point detection method and apparatus, electronic device, storage medium, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810055582.1A CN108229418B (en) 2018-01-19 2018-01-19 Human body key point detection method and apparatus, electronic device, storage medium, and program

Publications (2)

Publication Number Publication Date
CN108229418A CN108229418A (en) 2018-06-29
CN108229418B true CN108229418B (en) 2021-04-02

Family

ID=62668235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810055582.1A Active CN108229418B (en) 2018-01-19 2018-01-19 Human body key point detection method and apparatus, electronic device, storage medium, and program

Country Status (1)

Country Link
CN (1) CN108229418B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614878B (en) * 2018-11-15 2020-11-27 新华三技术有限公司 Model training and information prediction method and device
CN109740426B (en) * 2018-11-23 2020-11-06 成都品果科技有限公司 Face key point detection method based on sampling convolution
CN111414922B (en) * 2019-01-07 2022-11-15 阿里巴巴集团控股有限公司 Feature extraction method, image processing method, model training method and device
CN110301934B (en) * 2019-08-14 2022-11-29 晓智未来(成都)科技有限公司 System and method for adjusting light field area of part to be shot based on key point detection
CN110222829A (en) * 2019-06-12 2019-09-10 北京字节跳动网络技术有限公司 Feature extracting method, device, equipment and medium based on convolutional neural networks
CN111294518B (en) * 2020-03-09 2021-04-27 Oppo广东移动通信有限公司 Portrait composition limb truncation detection method, device, terminal and storage medium
CN115633255B (en) * 2021-08-31 2024-03-22 荣耀终端有限公司 Video processing method and electronic equipment
CN113762221B (en) * 2021-11-05 2022-03-25 通号通信信息集团有限公司 Human body detection method and device


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014238731A (en) * 2013-06-07 2014-12-18 株式会社ソニー・コンピュータエンタテインメント Image processor, image processing system, and image processing method
KR101711736B1 (en) * 2015-05-26 2017-03-02 이화여자대학교 산학협력단 Feature extraction method for motion recognition in image and motion recognition method using skeleton information
CN104899575A (en) * 2015-06-19 2015-09-09 南京大学 Human body assembly dividing method based on face detection and key point positioning
CN105787439B (en) * 2016-02-04 2019-04-05 广州新节奏智能科技股份有限公司 A kind of depth image human synovial localization method based on convolutional neural networks
CN107239736A (en) * 2017-04-28 2017-10-10 北京智慧眼科技股份有限公司 Method for detecting human face and detection means based on multitask concatenated convolutional neutral net
CN107341517B (en) * 2017-07-07 2020-08-11 哈尔滨工业大学 Multi-scale small object detection method based on deep learning inter-level feature fusion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101406390A (en) * 2007-10-10 2009-04-15 三星电子株式会社 Method and apparatus for detecting part of human body and human, and method and apparatus for detecting objects
CN101576953A (en) * 2009-06-10 2009-11-11 北京中星微电子有限公司 Classification method and device of human body posture
CN103914691A (en) * 2014-04-15 2014-07-09 成都智引擎网络科技有限公司 Target group analysis system and method based on face recognition and height recognition method
KR101783453B1 (en) * 2015-10-05 2017-09-29 (주)감성과학연구센터 Method and Apparatus for extracting information of facial movement based on Action Unit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation; Jonathan Tompson et al.; arXiv; 2014-09-17; pp. 1-9 *
Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations; Lubomir Bourdev et al.; 2009 IEEE 12th International Conference on Computer Vision; 2009-10-02; pp. 1365-1372 *
Research on Recognition of Key Parts in Sensitive Images (敏感图像关键部位识别研究); Wang Shen; China Master's Theses Full-text Database, Information Science and Technology; 2010-01-15 (No. 1); I138-273 *

Also Published As

Publication number Publication date
CN108229418A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108229418B (en) Human body key point detection method and apparatus, electronic device, storage medium, and program
US11657602B2 (en) Font identification from imagery
CN109508681B (en) Method and device for generating human body key point detection model
CN108898186B (en) Method and device for extracting image
TWI773189B (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
CN108280455B (en) Human body key point detection method and apparatus, electronic device, program, and medium
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
CN108229353B (en) Human body image classification method and apparatus, electronic device, storage medium, and program
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN112052186B (en) Target detection method, device, equipment and storage medium
CN108231190B (en) Method of processing image, neural network system, device, and medium
JP6397379B2 (en) CHANGE AREA DETECTION DEVICE, METHOD, AND PROGRAM
CN110298281B (en) Video structuring method and device, electronic equipment and storage medium
CN108389172B (en) Method and apparatus for generating information
CN109389096B (en) Detection method and device
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN108509994B (en) Method and device for clustering character images
CN112381104A (en) Image identification method and device, computer equipment and storage medium
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CN112561879B (en) Ambiguity evaluation model training method, image ambiguity evaluation method and image ambiguity evaluation device
CN109345460B (en) Method and apparatus for rectifying image
CN115861400A (en) Target object detection method, training method and device and electronic equipment
CN109242882B (en) Visual tracking method, device, medium and equipment
CN110210314B (en) Face detection method, device, computer equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant