WO2022237688A1 - Pose estimation method and apparatus, computer device and storage medium - Google Patents

Pose estimation method and apparatus, computer device and storage medium

Info

Publication number
WO2022237688A1
WO2022237688A1 PCT/CN2022/091484 CN2022091484W WO2022237688A1 WO 2022237688 A1 WO2022237688 A1 WO 2022237688A1 CN 2022091484 W CN2022091484 W CN 2022091484W WO 2022237688 A1 WO2022237688 A1 WO 2022237688A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
feature
position information
features
Prior art date
Application number
PCT/CN2022/091484
Other languages
English (en)
Chinese (zh)
Inventor
贾配洋
侯俊
Original Assignee
影石创新科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 影石创新科技股份有限公司 filed Critical 影石创新科技股份有限公司
Publication of WO2022237688A1 publication Critical patent/WO2022237688A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Definitions

  • the present application relates to the technical field of computer vision, in particular to a pose estimation method, device, computer equipment and storage medium.
  • Pose estimation, as one of the important applications of computer vision, has developed rapidly and is widely used in fields such as object activity analysis, video surveillance, and object interaction.
  • Human body pose estimation, as a branch of pose estimation, can detect the various key points of a human body in an image containing a human body.
  • The facial features, limbs or joints of the human body can be obtained through human body pose estimation. Because of these capabilities, it is widely used in scenes such as stop-motion animation, collage dance, transparent people, walking stitching, and action classification.
  • a pose estimation method comprising: acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature;
  • performing feature expansion on the first extracted feature through an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature through an image feature compression network to obtain a compressed image feature; determining key point position information corresponding to the target object in the target image based on the compressed image feature, and performing pose estimation on the target object based on the key point position information.
  • the image feature expansion network includes a plurality of feature convolution channels
  • performing feature expansion on the first extracted feature through the image feature expansion network, and obtaining the expanded image feature includes:
  • the first extracted feature is respectively input into the plurality of feature convolution channels corresponding to the image feature expansion network, and each of the feature convolution channels uses a feature-dimension-preserving convolution kernel to convolve the first extracted feature, obtaining the convolution features output by each of the feature convolution channels; the expanded image features are obtained by combining the convolution features output by each of the feature convolution channels.
  • the determining of the key point position information corresponding to the target object in the target image based on the compressed image features includes: amplifying the compressed image features to obtain enlarged image features; performing convolution on the enlarged image features to obtain third extracted features; and determining, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • the acquiring of the target image to be subjected to pose estimation includes: acquiring an initial image; performing object detection on the initial image to obtain the probabilities that each of multiple candidate image regions in the initial image includes the target object; selecting, based on the probability that each candidate image region includes the target object, an object image region including the target object from the candidate image regions; and extracting an intercepted image area from the initial image according to the object image region, and using the intercepted image as the target image to be subjected to pose estimation.
  • the extracting of the intercepted image area from the initial image according to the object image area and using the intercepted image as the target image to be subjected to pose estimation includes: obtaining the center coordinates of the object image area; obtaining the area size corresponding to the object image area, and obtaining the area extension value based on the area size and the size expansion coefficient; extending, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value to obtain the extension coordinates; and using the image area located within the extension coordinates as the intercepted image area, and using the intercepted image as the target image to be subjected to pose estimation.
  • the method further includes: converting, according to the mapping relationship between key point position information and target point position information, each piece of key point position information into the corresponding target point position information;
  • the target point position information is the position information of the key point position information in the initial image; and performing, based on each piece of target position information, pose estimation on the target object to obtain the target pose corresponding to the target image.
  • a method for generating a target video, the method further comprising: acquiring a target action, and determining a gesture sequence corresponding to the target action, wherein the gestures in the gesture sequence are executed in order to obtain the target action; performing the above pose estimation method to obtain the target pose corresponding to each target image in a target image set; obtaining, from the target image set, the image corresponding to each target pose in the gesture sequence as a video frame image; and arranging the obtained video frame images according to the sorting of the poses in the gesture sequence to obtain the target video corresponding to the target action.
  • a pose estimation device, comprising: a target image acquisition module, configured to acquire a target image to be subjected to pose estimation, the target image including a target object to be processed; a first feature extraction module, configured to perform feature extraction based on the target image to obtain a first extracted feature; an expanded image feature obtaining module, configured to perform feature expansion on the first extracted feature through an image feature expansion network to obtain expanded image features; a second extracted feature obtaining module, configured to perform feature extraction on the expanded image features to obtain second extracted features; a compressed image feature obtaining module, configured to perform feature compression on the second extracted features through an image feature compression network to obtain compressed image features; and a key point position information determination module, configured to determine key point position information corresponding to the target object in the target image based on the compressed image features, and to perform pose estimation on the target object based on the key point position information.
  • the expanded image feature obtaining module is configured to input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network, where each feature convolution channel uses a feature-dimension-preserving convolution kernel to convolve the first extracted features to obtain the convolution features output by each feature convolution channel; and to combine the convolution features output by each feature convolution channel to obtain the expanded image features.
  • the key point position information determination module is configured to amplify the compressed image features to obtain enlarged image features; perform convolution on the enlarged image features to obtain third extracted features; and determine, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • the target image acquisition module is configured to acquire an initial image; perform object detection on the initial image to obtain the probabilities that multiple candidate image areas in the initial image respectively include the target object; select, based on the probability that each candidate image area includes the target object, the object image area including the target object from the candidate image areas; and extract the intercepted image area from the initial image according to the object image area, and use the intercepted image as the target image to be subjected to pose estimation.
  • the target image acquisition module is configured to acquire the center coordinates of the object image area; acquire the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient; extend, based on the center coordinates and the area extension value, in the extension direction corresponding to the area extension value to obtain the extension coordinates; and use the image area within the extension coordinates as the intercepted image area, and use the intercepted image as the target image to be subjected to pose estimation.
  • the target image acquisition module is configured to convert each piece of key point position information into the corresponding target point position information according to the mapping relationship between key point position information and target point position information, where the target point position information is the position information of the key point position information in the initial image; and to perform pose estimation on the target object based on each piece of target position information to obtain the target pose corresponding to the target image.
  • a target video generation device, configured to acquire a target action and determine the gesture sequence corresponding to the target action, wherein the gestures in the gesture sequence are executed in order to obtain the target action; acquire the target pose corresponding to each target image in a target image set; obtain, from the target image set, the image corresponding to each target pose in the gesture sequence as a video frame image; and arrange the obtained video frame images according to the sorting of the poses in the gesture sequence to obtain the target video corresponding to the target action.
  • a computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature through an image feature expansion network to obtain expanded image features; performing feature extraction on the expanded image features to obtain second extracted features; performing feature compression on the second extracted features through an image feature compression network to obtain compressed image features; and determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.
  • a computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed by a processor, the following steps are implemented: acquiring a target image to be subjected to pose estimation, the target image including a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature;
  • performing feature expansion on the first extracted feature through an image feature expansion network to obtain expanded image features; performing feature extraction on the expanded image features to obtain second extracted features; performing feature compression on the second extracted features through an image feature compression network to obtain compressed image features; and determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.
  • the above pose estimation method, device, computer equipment and storage medium acquire the target image to be subjected to pose estimation, where the target image includes the target object to be processed; perform feature extraction based on the target image to obtain the first extracted feature;
  • perform feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image features; perform feature extraction on the expanded image features to obtain the second extracted features; perform feature compression on the second extracted features through the image feature compression network to obtain the compressed image features; and determine, based on the compressed image features, the key point position information corresponding to the target object in the target image, and perform pose estimation on the target object based on the key point position information.
  • the image feature expansion network is used to expand the extracted first extracted features, so that as many image features as possible are available at the input end of the pose estimation network, and the second extracted features are then compressed through the image feature compression network; combining these features achieves the purpose of improving the efficiency and accuracy of pose estimation.
  • Fig. 1 is an application environment diagram of a posture estimation method in an embodiment
  • Fig. 2 is a schematic flow chart of a pose estimation method in an embodiment
  • Fig. 3 is a schematic flow chart of a pose estimation method in another embodiment
  • FIG. 4 is a schematic flow chart of a pose estimation method in another embodiment
  • FIG. 5 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 6 is a schematic flow chart of a pose estimation method in another embodiment
  • FIG. 7 is a schematic flow chart of a pose estimation method in another embodiment
  • Fig. 8 is a schematic flow chart of a method for generating a target video in an embodiment
  • Fig. 9 is a schematic diagram of a panoramic image including objects in one embodiment
  • Fig. 10 is a schematic diagram of detecting an object in an embodiment
  • Fig. 11 is a schematic diagram of intercepting object subgraphs in an embodiment
  • Fig. 12 is a schematic diagram of object key points in an embodiment
  • Fig. 13 is a schematic diagram of a human body posture model in an embodiment
  • Fig. 14 is a structural block diagram of a pose estimation device in an embodiment
  • Figure 15 is a diagram of the internal structure of a computer device in one embodiment.
  • the pose estimation method provided in this application can be applied to the application environment shown in FIG. 1, and specifically applied to a pose estimation system.
  • the pose estimation system includes an image acquisition device 102 and a terminal 104 , wherein the image acquisition device 102 and the terminal 104 are connected in communication.
  • the terminal 104 executes a pose estimation method.
  • the terminal 104 acquires the target image to be subjected to pose estimation transmitted from the image acquisition device 102, where the target image includes the target object to be processed; the terminal 104 performs feature extraction based on the target image to obtain the first extracted feature; performs feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image features; performs feature extraction on the expanded image features to obtain the second extracted features; performs feature compression on the second extracted features through the image feature compression network
  • to obtain the compressed image features; and determines the key point position information corresponding to the target object in the target image based on the compressed image features, and performs pose estimation on the target object based on the key point position information.
  • the image acquisition device 102 may be, but not limited to, various devices with an image acquisition function, and may be distributed outside the terminal 104 or inside the terminal 104 .
  • the terminal 104 can be, but is not limited to, various cameras, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. It can be understood that the method provided in the embodiment of the present application may also be executed by a server.
  • a pose estimation method is provided.
  • the method is applied to the terminal in FIG. 1 as an example for illustration, including the following steps:
  • Step 202 acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed.
  • pose estimation refers to the process of estimating the pose of the target object by detecting key points in the target object and describing one or more key points.
  • Key points refer to feature points that can describe the structural features of the target object. For example, the facial features, leg joints, or hand joints of the target object.
  • the target object refers to the object for pose estimation. For example, human body or animal etc.
  • the terminal can directly or indirectly obtain the target image to be subjected to pose estimation.
  • the terminal takes the received image including the target object to be processed and transmitted from the image acquisition device as the target image.
  • the terminal preprocesses the received image transmitted from the image acquisition device, and uses the preprocessed image as the target image.
  • the image collection device is a panoramic camera. After the panoramic camera collects the panoramic image, the panoramic image is used as a target image to be subjected to pose estimation, and the target image includes a target object to be processed.
  • the target object can be complete, only partially contained or occluded.
  • the terminal acquires a panoramic image by extracting frames from the panoramic video, and directly or after preprocessing the panoramic image, obtains a target image to be subjected to pose estimation.
  • the preprocessing includes normalizing the panoramic image or cropping the target object in the panoramic image.
  • Step 204 perform feature extraction based on the target image, and obtain first extracted features.
  • a feature is information representing a specific attribute of a target image, through which an object in the target image can be identified or the target image can be classified.
  • feature extraction may be performed on the target image through a feature extraction network to obtain the first extracted feature.
  • the feature extraction of the target image may be performed through a lightweight deep neural network to obtain the first extracted feature.
  • Step 206 performing feature expansion on the first extracted feature through the image feature expansion network to obtain expanded image features.
  • the image feature expansion network refers to a network that can increase the number of image features.
  • the expanded image feature refers to the image feature after the image feature is expanded.
  • the channels of the acquired first extracted features are expanded by point-by-point convolution, enriching the number of features and obtaining the expanded image features.
  • the image feature expansion network uses 1*1 point-by-point convolution to perform feature expansion on the first extracted features to obtain the expanded image features.
  • Step 208 perform feature extraction on the features of the expanded image to obtain second extracted features.
  • feature extraction may be performed on the expanded image features through convolution with fewer parameters to obtain the second extracted features.
  • the expanded image features may be down-sampled by preset convolution and activation functions, and feature extraction may be performed on the expanded image features to obtain the second extracted features.
  • the preset convolution can be 3*3 convolution and a ReLU (Rectified Linear Unit, rectified linear unit) activation function to perform feature extraction on the expanded image feature to obtain the second extracted feature.
  • the above activation function can be replaced with the Sigmoid function (S-type growth curve), ELU (Exponential Linear Unit), GELU (Gaussian Error Linear Unit), or other activation functions.
  • Step 210 perform feature compression on the second extracted features through the image feature compression network to obtain compressed image features.
  • the image feature compression network refers to a network that can reduce the number of image features.
  • Compressed image features refer to image features after image features are compressed.
  • the terminal may perform feature compression on the second extracted features, so as to increase the speed of the terminal's pose estimation.
  • the image feature compression network uses 1*1 point-by-point convolution to perform feature compression on the second extracted features, and obtains compressed image features after linear transformation.
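  • The expand-extract-compress structure of Steps 206-210 can be sketched as below. This is a minimal illustrative sketch, assuming PyTorch and example channel counts (64 expanded to 256 and compressed back to 64), not necessarily the exact network of this application.

```python
# Minimal sketch (not the patented implementation) of the expand -> extract -> compress
# structure described in Steps 206-210, with assumed channel counts.
import torch
import torch.nn as nn

class ExpandExtractCompress(nn.Module):
    def __init__(self, in_channels=64, expanded_channels=256, out_channels=64):
        super().__init__()
        # Image feature expansion network: 1*1 point-by-point convolution
        # that increases the number of feature channels.
        self.expand = nn.Sequential(
            nn.Conv2d(in_channels, expanded_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Feature extraction on the expanded features: 3*3 convolution + ReLU.
        self.extract = nn.Sequential(
            nn.Conv2d(expanded_channels, expanded_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Image feature compression network: 1*1 point-by-point convolution
        # applying a purely linear transformation (no activation).
        self.compress = nn.Conv2d(expanded_channels, out_channels, kernel_size=1)

    def forward(self, first_extracted_feature):
        expanded = self.expand(first_extracted_feature)   # expanded image features
        second = self.extract(expanded)                   # second extracted features
        compressed = self.compress(second)                # compressed image features
        return compressed

# Example usage with an assumed 64-channel first extracted feature map.
features = torch.randn(1, 64, 64, 64)
block = ExpandExtractCompress()
print(block(features).shape)  # torch.Size([1, 64, 64, 64])
```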
  • Step 212 determine key point position information corresponding to the target object in the target image based on the compressed image features, and perform pose estimation on the target object based on the key point position information.
  • the key point position information refers to information capable of determining the position of the key point in the target image. For example, information such as the coordinates, names or directions of the key points in the target image.
  • the terminal may obtain the key point position information corresponding to the target object through the correspondence between the compressed image feature and the key point position information corresponding to the target object.
  • the terminal stores a matching relationship table between compressed image features and key point location information. After obtaining the compressed image features, the terminal obtains the corresponding key point location information by traversing the above matching relationship table, and performs pose estimation on the target object according to the position coordinates and names in the key point position information. For example, if the key point position information obtained by the terminal after traversing the above matching relationship table is (200, 200, wrist joint), it indicates that the position coordinates of the key point are (200, 200) and the key point is the wrist joint. By combining the position information of multiple such key points, the pose can be estimated.
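  • As an illustration of the key point position information format described above (the names and coordinate values below are hypothetical), the estimated key points can be collected and grouped by name before pose estimation:

```python
# Hypothetical key point position information of the form (x, y, name),
# as in the (200, 200, wrist joint) example above.
keypoint_position_info = [
    (200, 200, "wrist joint"),
    (180, 150, "elbow joint"),
    (160, 100, "shoulder joint"),
]

# Group the key points by name so the pose of the target object can be described
# from the position coordinates of each named key point.
pose = {name: (x, y) for x, y, name in keypoint_position_info}
print(pose["wrist joint"])  # (200, 200)
```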
  • in this embodiment, the target image to be subjected to pose estimation is acquired, where the target image includes the target object to be processed; feature extraction is performed based on the target image to obtain the first extracted feature; feature expansion is performed on the first extracted feature through the image feature expansion network to obtain the expanded image features; feature extraction is performed on the expanded image features to obtain the second extracted features; feature compression is performed on the second extracted features through the image feature compression network to obtain the compressed image features; and based on the compressed image features,
  • the key point position information corresponding to the target object in the target image is determined, and pose estimation is performed on the target object based on the key point position information.
  • feature expansion of the extracted first extracted features is performed through the image feature expansion network, so that as many image features as possible are available at the input end of the pose estimation network, and the second extracted features are then compressed through the image feature compression network; combining these features achieves the purpose of improving the efficiency and accuracy of pose estimation.
  • the image feature expansion network includes a plurality of feature convolution channels, and the first extracted feature is subjected to feature expansion through the image feature expansion network, and the expanded image features obtained include:
  • Step 302 input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network, and each feature convolution channel uses a feature dimension-preserving convolution kernel to convolve the first extracted features to obtain each feature convolution The convolutional features output by the channel.
  • the feature dimension preserving convolution kernel refers to the convolution kernel that can keep the dimension of the image unchanged, and the dimension of the image refers to the number of channels of the image. For example, a convolution kernel of size 1*1.
  • by setting a feature-dimension-preserving convolution kernel and convolving the first extracted feature in each feature convolution channel, the terminal can obtain the convolution features output by each feature convolution channel with fewer parameters while keeping the scale of the first extracted feature unchanged.
  • the terminal performs convolution on the first extracted features in each feature convolution channel through convolution kernels whose number and size are set to preserve the feature dimension, and obtains the convolution features output by each feature convolution channel. For example, adding a convolution kernel with a size of 1*1 and 256 channels after a 3*3 convolutional network with 64 channels introduces 64*256 parameters, and the number of channels is expanded from 64 to 256.
  • Step 304 the convolution features output by each feature convolution channel are synthesized to obtain the expanded image features.
  • the feature dimension-preserving convolution kernel can linearly combine each pixel in the first extracted feature in the image on different channels to obtain the expanded image feature .
  • for example, the expansion network is composed by adding a 1*1, 28-channel convolution kernel behind a 3*3, 64-channel convolution kernel, which is equivalent to converting it into a 3*3, 28-channel convolution.
  • the original 64 channels are linearly combined across channels to become 28 channels, which realizes information interaction between channels, and the expanded image features are obtained from the convolution features output by each feature convolution channel.
  • in this embodiment, the convolution features output by each feature convolution channel are obtained, and the expanded image features are obtained from the convolution features output by each feature convolution channel, which achieves the purpose of obtaining the expanded image features with fewer parameters, thereby improving the efficiency of pose estimation.
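  • The cross-channel combination described above can be sketched as follows, assuming PyTorch; the layer sizes simply mirror the 3*3, 64-channel and 1*1, 28-channel example numbers in this embodiment.

```python
# Sketch of a 1*1 point-by-point convolution placed after a 3*3, 64-channel
# convolution to linearly combine channels (here 64 -> 28).
import torch
import torch.nn as nn

conv3x3_64 = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 3*3 convolution, 64 channels
pointwise_28 = nn.Conv2d(64, 28, kernel_size=1)          # 1*1, 28-channel convolution kernel

x = torch.randn(1, 3, 128, 128)
out = pointwise_28(conv3x3_64(x))
print(out.shape)  # torch.Size([1, 28, 128, 128])

# The 1*1 kernel linearly combines the original 64 channels into 28 channels,
# at a cost of roughly 64 * 28 weights (plus biases).
print(sum(p.numel() for p in pointwise_28.parameters()))  # 64*28 + 28 = 1820
```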
  • determining the key point position information corresponding to the target object in the target image based on the compressed image features includes:
  • Step 402 amplifying the compressed image features to obtain enlarged image features.
  • the enlarged image features are obtained by upsampling the features.
  • the terminal amplifies the compressed image features, and sets the numbers of input and output channels of the three-layer sampling network to (256, 128), (128, 64) and (64, 64), which can reduce the amount of network parameters and computation.
  • the terminal performs interpolation calculation on the compressed image features through an interpolation method to obtain enlarged image features. For example, on the basis of compressed image features, new elements are inserted between pixels using appropriate interpolation algorithms such as linear interpolation or bilinear interpolation.
  • Step 404 performing convolution on the enlarged image features to obtain a third extracted feature.
  • after the enlarged image features are obtained, in order to compensate for the reduction of nonlinear units during the enlargement of the compressed image features, the enlarged image features are convolved to obtain the third extracted features.
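  • Steps 402-404 can be sketched as below. This is a hedged sketch assuming PyTorch, bilinear interpolation for the enlargement, and the (256, 128), (128, 64), (64, 64) channel pairs mentioned for the three-layer sampling network; scale factors and kernel sizes are illustrative assumptions.

```python
# Sketch: enlarge compressed image features by interpolation, then convolve
# to compensate for the nonlinear units lost during enlargement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    def __init__(self):
        super().__init__()
        channel_pairs = [(256, 128), (128, 64), (64, 64)]
        self.convs = nn.ModuleList(
            [nn.Conv2d(cin, cout, kernel_size=3, padding=1) for cin, cout in channel_pairs]
        )

    def forward(self, compressed_features):
        x = compressed_features
        for conv in self.convs:
            # Insert new elements between pixels by bilinear interpolation,
            # doubling the spatial resolution at each layer (assumed factor).
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            # Convolution with a nonlinearity follows the enlargement.
            x = torch.relu(conv(x))
        return x

enlarged = Upsampler()(torch.randn(1, 256, 16, 16))
print(enlarged.shape)  # torch.Size([1, 64, 128, 128])
```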
  • Step 406 based on the third extracted feature, determine key point position information corresponding to the target object in the target image.
  • searching and filtering are performed on the third extracted features to determine key point position information corresponding to the target object in the target image.
  • the terminal stores a matching list of image features and key point position information. After obtaining the third extracted features, the terminal traverses the matching list to obtain the key point position information corresponding to the third extracted features, that is, the key point position information corresponding to the target object in the target image.
  • in this embodiment, amplifying the compressed image features and convolving the enlarged image features before determining the key point position information improves the image output quality, so that the key point position information corresponding to the target object can be obtained more accurately.
  • acquiring a target image to be subjected to pose estimation includes:
  • Step 502 acquiring an initial image.
  • the initial image refers to the unprocessed original image.
  • the original image is an image obtained directly by an image acquisition device or a terminal.
  • the terminal can collect the initial image through the connected image acquisition device, and the acquisition device transmits the acquired initial image to the terminal in real time; or the acquisition device temporarily stores the acquired initial image locally in the acquisition device, when When receiving an image acquisition instruction from the terminal, the locally stored initial image is transmitted to the terminal, and accordingly, the terminal can acquire the initial image.
  • the terminal collects the initial image through an internal image collection module, stores the collected image in the terminal memory, and obtains the initial image from the memory when the terminal needs to obtain the initial image.
  • Step 504 perform object detection on the initial image, and obtain probabilities that each of the multiple candidate image regions in the initial image includes the target object.
  • the initial image is divided into multiple image sub-regions as candidate image regions, and the probability that the target object appears in each candidate image region is detected. For example, if the image is divided into sub-region A, sub-region B and sub-region C, the probability of the target object in sub-region A is 0%, the probability in sub-region B is 10%, and the probability in sub-region C is 90%.
  • the terminal obtains the probability that each image sub-region includes the target object by gradually reducing the size of the image sub-regions.
  • Step 506 based on the probability that the candidate image region includes the target object, an object image region including the target object is selected from the candidate image regions.
  • the terminal may compare the probabilities of the candidate image regions to obtain a candidate image region whose probability is within a preset probability threshold range, and use that candidate image region as the object image region including the target object.
  • the terminal traverses the probability that the candidate image areas include the target object, obtains the maximum probability value among the probabilities, and uses the candidate image area corresponding to the maximum probability value as the target image area of the target object.
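  • The selection of the object image area can be illustrated with the sub-region example above (probability values hypothetical):

```python
# Traverse the probabilities that each candidate image area includes the target
# object and keep the candidate with the maximum probability value.
candidate_regions = {
    "sub-region A": 0.0,   # probability that the region contains the target object
    "sub-region B": 0.1,
    "sub-region C": 0.9,
}

object_image_area = max(candidate_regions, key=candidate_regions.get)
print(object_image_area)  # 'sub-region C'
```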
  • Step 508 according to the target image area, extract the intercepted image area from the initial image, and use the intercepted image as the target image to be subjected to pose estimation.
  • the image of the obtained object image area can be intercepted as the target image to be subjected to pose estimation, so as to reduce the amount of computation during pose estimation and improve pose estimation efficiency.
  • the terminal can extract the coordinate information of the object image area, and use the coordinate information to intercept and obtain the target image for pose estimation.
  • the frame-selected image area may be used as the target image area, and the frame-selected image area may be intercepted from the initial image as the target image to be pose estimated.
  • in this embodiment, object detection is performed on the initial image to obtain the probabilities that the multiple candidate image areas in the initial image respectively include the target object; based on the probability that each candidate image area includes the target object, the object image area including
  • the target object is selected from the candidate image areas; the intercepted image area is then extracted from the initial image according to the object image area, and the intercepted image is used as the target image for pose estimation, so as to achieve the purpose of accurately obtaining the target image from the initial image.
  • extracting the intercepted image area from the initial image according to the object image area, and using the intercepted image as the target image to be subjected to pose estimation, includes:
  • Step 602 acquiring the center coordinates in the object image area.
  • the center coordinates refer to the coordinates of the pixel at the center point of the object image area.
  • the coordinates are based on the coordinates of the pixel at the center of the object image area of the target object in the initial image, and can be determined according to the length and width of the initial image.
  • the terminal obtains the pixel point at the center of the target image area, and obtains the coordinates of the pixel point through the pixel point coordinate obtaining tool.
  • Step 604 acquire the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient.
  • the area size refers to the area length and area width of the target image area. For example, if the area length of the target image area is h, and the area width of the target image area is w, then the area size is w*h, and the size expansion coefficient refers to a coefficient that can increase the area size.
  • the area extension value refers to the growth value of the area size obtained by correcting the area size with the size expansion coefficient.
  • the terminal may then acquire the area size corresponding to the target image area, and obtain the area extension value through the functional relationship between the area size and the size expansion coefficient.
  • the terminal obtains the area size corresponding to the object image area through the image size measurement tool, and obtains the area extension value by using the product relationship between the area size and the size expansion coefficient.
  • for example, if the area width in the area size is w and the size expansion coefficient is exp_ratio, the area extension value for the area width is w*exp_ratio*1.2/2; similarly, the area extension value for the area length in the area size can be obtained through the corresponding size expansion coefficient.
  • Step 606 based on the center coordinates and the area extension value, extend to the extension direction corresponding to the area extension value to obtain the extension coordinates.
  • the extension direction refers to the direction in which the width and length increase corresponding to the area extension value.
  • the extension coordinates refer to the coordinates of the object image area obtained by extending the object image area in the extension direction with the center coordinates as a reference point.
  • the coordinates may be represented by the coordinates of the upper left corner and the lower right corner of the object image area.
  • the terminal may use the center coordinates as a reference point to expand the target image area by using the area extension value to obtain the extension coordinates corresponding to the target image area.
  • for example, the center coordinates are expressed as (x, y), and the extension coordinates are (x0, y0) and (x1, y1), where x0 and x1 are the coordinates obtained by applying the area extension value in the width extension direction of the image, and y0 and y1 are the coordinates obtained by applying the area extension value in the length extension direction of the image; the extension coordinates are obtained by offsetting the center coordinates by the corresponding area extension values.
  • when the area extension value in the width extension direction is less than or equal to 0, the extension value is set to zero; when the area extension value is greater than or equal to the width of the initial image, the width of the initial image is used as the area extension value.
  • when the area extension value in the length extension direction is less than or equal to zero, the area extension value is set to zero; when the area extension value is greater than or equal to the height of the initial image, the height of the initial image is used as the area extension value.
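  • A sketch of how the intercepted image area could be computed from the center coordinates, the area extension values, and the clamping described above; the exact formula is not reproduced in this text, so the offsets and clamping below are assumptions consistent with this embodiment.

```python
# Assumed form: extension coordinates are the center coordinates offset by the
# area extension values and clamped to the bounds of the initial image.
def intercepted_image_area(x, y, w, h, exp_ratio, image_width, image_height):
    # Area extension values derived from the area size and size expansion coefficient,
    # using the w * exp_ratio * 1.2 / 2 form given above for the width direction.
    ext_w = w * exp_ratio * 1.2 / 2
    ext_h = h * exp_ratio * 1.2 / 2
    # Extend from the center coordinates in the width and length directions,
    # then clamp the extension coordinates to the initial image.
    x0 = max(x - ext_w, 0)
    y0 = max(y - ext_h, 0)
    x1 = min(x + ext_w, image_width)
    y1 = min(y + ext_h, image_height)
    return x0, y0, x1, y1

print(intercepted_image_area(x=500, y=300, w=200, h=400, exp_ratio=1.0,
                             image_width=1920, image_height=1080))
# (380.0, 60.0, 620.0, 540.0)
```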
  • Step 608 taking the image area within the extended coordinates as the intercepted image area, and using the intercepted image as the target image to be subjected to pose estimation.
  • the terminal may use the image area within the extended coordinates as the intercepted image area, and use the intercepted image as the target image for pose estimation.
  • in this embodiment, the area extension value is obtained based on the area size and the size expansion coefficient; based on the center coordinates and the area extension value, the region is extended in the direction corresponding to the area extension value to obtain the extension coordinates; and
  • the image area located within the extension coordinates is used as the intercepted image area, and the intercepted image is used as the target image to be subjected to pose estimation, which achieves the purpose of accurately intercepting the target image, thereby improving the efficiency of pose estimation.
  • the method also includes:
  • Step 702 According to the mapping relationship between the key point position information and the target point position information, each key point position information is converted into corresponding target point position information; the target point position information is the position information of the key point position information in the initial image.
  • the location information refers to information related to the location that can reflect a certain location point.
  • the position point may be a key point, or other points having the same structure or function as the key point.
  • the position-related information may be coordinate information of a key point or information describing the position of the key point.
  • keypoint location information can be expressed as (100,100, eyes).
  • there is a one-to-one correspondence between the key point position information in the target image and the target point position information in the initial image, and they can be converted into each other. After the key point position information is known, the position information of the key point in the initial image can be correspondingly obtained, so that the position information obtained in the target image can be accurately reflected in the initial image.
  • for example, if the key point position information of the j-th key point is expressed as (x_keypoints_j, y_keypoints_j), the coordinates of the upper-left vertex of the i-th target image in the initial image are expressed as (x_person_i, y_person_i), and the coordinates of the key point in the initial image are expressed as (x_original_keypoints, y_original_keypoints), then the coordinates of the key point in the initial image are given by the formulas:
  • x_original_keypoints = x_person_i + x_keypoints_j
  • y_original_keypoints = y_person_i + y_keypoints_j
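  • A minimal sketch of this mapping (the coordinate values are hypothetical):

```python
# Map key point position information from the target image (the intercepted
# sub-image) back to the initial image, following
# x_original_keypoints = x_person_i + x_keypoints_j above.
def to_initial_image(keypoints, x_person_i, y_person_i):
    """keypoints: list of (x_keypoints_j, y_keypoints_j, name) in target-image coordinates."""
    return [(x_person_i + x, y_person_i + y, name) for x, y, name in keypoints]

target_image_keypoints = [(100, 100, "eye"), (200, 200, "wrist joint")]  # hypothetical values
print(to_initial_image(target_image_keypoints, x_person_i=380, y_person_i=60))
# [(480, 160, 'eye'), (580, 260, 'wrist joint')]
```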
  • Step 704 perform pose estimation on the target object based on each target position information, and obtain the target pose corresponding to the target image.
  • after determining each piece of target position information corresponding to the multiple key points, the terminal performs pose estimation through the correspondence between the specific type of each key point and its target position information, and obtains the target pose corresponding to the target image.
  • in this embodiment, the position information of each key point is converted into the corresponding target point position information, and pose estimation of the target object is performed based on each piece of target position information to obtain the target pose corresponding to the target image.
  • in this way, the purpose of obtaining the target pose in the target image can be achieved.
  • the target video generation method includes:
  • Step 802 acquire the target action, and determine the gesture sequence corresponding to the target action, wherein the gestures in the gesture sequence are executed in order to obtain the target action.
  • the target action refers to an action obtained after each gesture is executed in sequence.
  • Gesture refers to the individual sub-actions that make up an action.
  • the target action is an arm stretching movement, and the target action is composed of multiple sub-actions such as placing the arms flat, straightening the arms, and turning the arms together sideways.
  • Multiple gestures can form a gesture sequence according to the order of the front and back, and when the sequence of gestures is executed sequentially, the target action is obtained.
  • Step 804 acquire the target pose corresponding to each target image in the target image set.
  • the terminal can obtain various poses based on the above-mentioned pose estimation methods, and each pose constituting the target action is in a different target image, and the corresponding poses can be obtained from each target image.
  • for example, the target pose F exists in the target image E, the target pose H exists in the target image G, and so on.
  • the target pose corresponding to each target image is obtained in the target image set.
  • Step 806 acquire images corresponding to each target pose in the pose sequence from the target image set as video frame images.
  • after acquiring the target poses, the terminal obtains the images corresponding to the target poses, and uses each obtained image as a video frame image.
  • the time stamps corresponding to the images corresponding to the target poses are obtained, and the images carrying the respective time stamps are used as video frame images.
  • Step 808 arrange the obtained video frame images according to the pose sequence in the pose sequence, and obtain the target video corresponding to the target action.
  • the corresponding video frame images are arranged according to the pose sequence in the pose sequence to obtain the target video corresponding to the target action.
  • the gestures in the gesture sequence are sorted, that is, after obtaining the video frame images, the video frame images are arranged according to their time stamps, and the target video is obtained from the arranged video frame images.
  • in this embodiment, the image corresponding to each target pose in the pose sequence is obtained from the target image set as a video frame image, and the obtained video frame images are arranged according to the sorting of the poses
  • in the pose sequence to obtain the target video corresponding to the target action, which achieves the purpose of obtaining the target video corresponding to the target action through pose estimation, so that pose estimation can be realized and applied in practical applications.
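  • A small sketch of Steps 802-808 with hypothetical pose names and frame identifiers:

```python
# Given a pose sequence for the target action and a mapping from estimated
# target poses to the target images in which they appear, arrange the video
# frame images in pose-sequence order to form the target video.
pose_sequence = ["arms flat", "arms straightened", "arms turned sideways"]

# Target poses estimated for target images in the target image set (hypothetical).
pose_to_image = {
    "arms straightened": "frame_12.jpg",
    "arms flat": "frame_03.jpg",
    "arms turned sideways": "frame_27.jpg",
}

# Video frame images ordered by the sorting of the poses in the pose sequence.
target_video_frames = [pose_to_image[pose] for pose in pose_sequence]
print(target_video_frames)  # ['frame_03.jpg', 'frame_12.jpg', 'frame_27.jpg']
```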
  • in one embodiment, take the case where the terminal is a panoramic camera and the target object is a human body as an example.
  • the panoramic image needs to include a human target object for human pose estimation.
  • the human target object can be complete, or only contain a part of it or have occlusions.
  • the coordinate values of the human body bounding box B1 are obtained through a human body tracking or detection algorithm, and the coordinate values of the bounding box B2 obtained by expanding the human body bounding box B1 are then obtained.
  • the panoramic image is cropped to obtain a sub-panoramic image with a border of B2.
  • the sub-panoramic image is input into the trained human pose estimation model, as shown in Figure 12,
  • and the heat map of the first key point C is obtained;
  • the sub-panoramic image is likewise input into the trained human pose estimation model to obtain the heat map of the second key point, and so on; by obtaining multiple heat maps in turn, a preset number of key point heat maps are obtained.
  • the coordinates of the key points in the heat maps are mapped to the sub-panoramic image, and the positions of the key points in the sub-panoramic image are mapped to the panoramic image, so as to obtain the positions of the key points in the panoramic image, thereby estimating the pose of the human body.
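  • A hedged sketch, assuming NumPy, of reading a key point from a heat map and mapping it back through the sub-panoramic image to the panoramic image; the heat map size, sub-image size, and crop offset below are illustrative assumptions, not values from this application.

```python
import numpy as np

def keypoint_from_heatmap(heatmap, sub_image_size, sub_image_offset):
    # Key point location in the heat map: position of the maximum response.
    hy, hx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Map heat map coordinates to sub-panoramic image coordinates.
    sub_x = hx * sub_image_size[0] / heatmap.shape[1]
    sub_y = hy * sub_image_size[1] / heatmap.shape[0]
    # Map sub-panoramic image coordinates to panoramic image coordinates
    # by adding the offset of the cropped bounding box B2.
    return sub_x + sub_image_offset[0], sub_y + sub_image_offset[1]

heatmap = np.zeros((64, 48))
heatmap[32, 24] = 1.0  # hypothetical peak for key point C
print(keypoint_from_heatmap(heatmap, sub_image_size=(192, 256), sub_image_offset=(380, 60)))
# (476.0, 188.0)
```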
  • when the terminal performs normalization processing on the panoramic image or the sub-panoramic image, the pixel value of a pixel in the normalized image is obtained from the difference between the pixel value of that pixel in the original image and the average pixel value, scaled by a proportional coefficient.
  • if the pixel value of a certain pixel in the normalized image is expressed as X_normalization, the pixel value of the corresponding pixel in the panoramic image or sub-panoramic image is expressed as X, the average value of the pixel values of all pixels in the panoramic image or sub-panoramic image is expressed as mean, and the proportional coefficient is expressed as std, then X_normalization is expressed as the formula: X_normalization = (X - mean) / std.
  • std can be the variance of all pixels in the panoramic image or sub-panoramic image; a certain pixel in the panoramic image or sub-panoramic image can be a pixel of RGB (red, green and blue) three channels.
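  • A short sketch of this normalization, assuming NumPy and a single mean and proportional coefficient computed over all pixels (the standard deviation is used as std here for illustration):

```python
import numpy as np

def normalize(image):
    image = image.astype(np.float32)
    mean = image.mean()          # average value of all pixel values
    std = image.std()            # proportional coefficient (here, standard deviation)
    return (image - mean) / std  # X_normalization = (X - mean) / std

sub_panoramic_image = np.random.randint(0, 256, size=(256, 192, 3), dtype=np.uint8)
normalized = normalize(sub_panoramic_image)
print(normalized.mean().round(3), normalized.std().round(3))  # approximately 0.0 and 1.0
```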
  • the terminal may obtain the coordinate value of the bounding box of the human body through a human body detection algorithm.
  • for example, the human body detection algorithm can use Faster RCNN (Faster Region-CNN), the YOLO (You Only Look Once) series of algorithms, or the SSD (Single Shot MultiBox Detector) series of algorithms, and the tracking algorithm can be a Siamese-network-based tracking algorithm, among others.
  • the human pose estimation model can be lightened by reducing the number of image feature blocks between stages in HRNet, for example, by changing the number of down-sampled image feature blocks in the second stage to 1, so that the human pose estimation model reduces the amount of parameters and computation, thereby improving the efficiency of human pose estimation.
  • a pose estimation device 1400, including: a target image acquisition module 1402, a first feature extraction module 1404, an expanded image feature obtaining module 1406, a second extracted feature obtaining module 1408, a compressed image feature obtaining module 1410 and a key point position information determination module 1412, wherein: the target image acquisition module 1402 is used to acquire the target image to be subjected to pose estimation, the target image including the target object to be processed; the first feature extraction module 1404 is used to perform feature extraction based on the target image to obtain the first extracted feature; the expanded image feature obtaining module 1406 is used to perform feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image features; the second extracted feature obtaining module 1408 is used to perform feature extraction on the expanded image features to obtain the second extracted features; the compressed image feature obtaining module 1410 is used to perform feature compression on the second extracted features through the image feature compression network to obtain the compressed image features; and the key point position
  • information determination module 1412 is configured to determine the key point position information corresponding to the target object in the target image based on the compressed image features, and to perform pose estimation on the target object based on the key point position information.
  • the expanded image feature obtaining module 1406 is used to input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network, where each feature convolution channel uses a feature-dimension-preserving convolution kernel to convolve the first extracted features to obtain the convolution features output by each feature convolution channel; and to synthesize the convolution features output by each feature convolution channel to obtain the expanded image features.
  • the key point position information determination module 1412 is used to amplify the compressed image features to obtain enlarged image features; perform convolution on the enlarged image features to obtain the third extracted features; and determine, based on the third extracted features, the key point position information corresponding to the target object in the target image.
  • the target image acquisition module 1402 is used to acquire an initial image; perform object detection on the initial image to obtain the probabilities that a plurality of candidate image areas in the initial image respectively include the target object; select, based on the probability that each candidate image area includes the target object, an object image area including the target object from the candidate image areas; and, according to the object image area, extract an intercepted image area from the initial image and use the intercepted image as the target image to be pose estimated.
  • the target image acquisition module 1402 is used to acquire the center coordinates in the object image area; acquire the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient; based on the center coordinates and the area extension The value is extended to the extension direction corresponding to the area extension value to obtain the extension coordinates; the image area located within the extension coordinates is used as the intercepted image area, and the intercepted image is used as the target image for pose estimation.
  • the target image acquisition module 1402 is configured to convert each piece of key point position information into the corresponding target point position information according to the mapping relationship between the key point position information and the target point position information, where the target point position information is the position information of the key point position information in the initial image; and to perform pose estimation on the target object based on each piece of target position information to obtain the target pose corresponding to the target image.
  • the target video generation device is used to acquire a target action and determine the gesture sequence corresponding to the target action, where the gestures in the gesture sequence are executed in order to obtain the target action; acquire the target pose corresponding to each target image in the target image set; obtain the image corresponding to each target pose in the pose sequence from the target image set as a video frame image; and arrange the obtained video frame images according to the sorting of the poses in the pose sequence to obtain the target video corresponding to the target action.
  • Each module in the above-mentioned attitude estimation device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure may be as shown in FIG. 15 .
  • the computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through Wi-Fi, an operator network, NFC (Near Field Communication) or other technologies.
  • when the computer program is executed by the processor, a pose estimation method is realized.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the computer device may be a touch layer covering the display screen, or a button, trackball or touchpad provided on the casing of the computer device, and may also be an external keyboard, touchpad or mouse.
  • Figure 15 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation on the computer device to which the solution of this application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device is provided, including a memory and a processor, where a computer program is stored in the memory, and the processor implements the steps in the above method embodiments when executing the computer program.
  • a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps in the foregoing method embodiments are implemented.
  • Non-volatile memory can include read-only memory (ROM), tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM can come in many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.
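
A minimal sketch of the enlargement, convolution and key point decoding performed by the key point position information determination module 1412, as described above. The use of PyTorch, bilinear interpolation, the kernel size and the heatmap-argmax decoding are illustrative assumptions, not the disclosed implementation:

    import torch
    import torch.nn.functional as F

    def decode_keypoints(compressed, conv_weight, conv_bias, scale=2):
        # enlarge the compressed image feature (bilinear interpolation assumed)
        enlarged = F.interpolate(compressed, scale_factor=scale,
                                 mode="bilinear", align_corners=False)
        # convolve the enlarged feature to obtain the "third extracted feature":
        # here, one heatmap channel per key point
        heatmaps = F.conv2d(enlarged, conv_weight, conv_bias, padding=1)
        n, k, h, w = heatmaps.shape
        flat = heatmaps.reshape(n, k, -1).argmax(dim=-1)
        ys = torch.div(flat, w, rounding_mode="floor")
        xs = flat % w
        # key point position information, shape (n, k, 2), in heatmap coordinates
        return torch.stack([xs, ys], dim=-1)

    # conv_weight is expected to have shape (num_keypoints, channels, 3, 3)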
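
A minimal sketch of the target-image acquisition and cropping performed by module 1402 (candidate selection by probability, expansion of the object image area, and interception from the initial image). The NumPy representation, the box format and the expansion coefficient value are assumptions:

    import numpy as np

    def crop_target(initial_image, boxes, probs, expand_coeff=0.2):
        # boxes: (N, 4) candidate image areas as [x1, y1, x2, y2]
        # probs: (N,) probability that each candidate area contains the target object
        best = boxes[int(np.argmax(probs))]                          # object image area
        cx, cy = (best[0] + best[2]) / 2, (best[1] + best[3]) / 2    # centre coordinates
        w, h = best[2] - best[0], best[3] - best[1]                  # area size
        dw = w * (1 + expand_coeff) / 2                              # area extension values
        dh = h * (1 + expand_coeff) / 2
        H, W = initial_image.shape[:2]
        x1, y1 = max(int(cx - dw), 0), max(int(cy - dh), 0)          # extension coordinates
        x2, y2 = min(int(cx + dw), W), min(int(cy + dh), H)
        # intercepted (target) image plus its offset in the initial image
        return initial_image[y1:y2, x1:x2], (x1, y1)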
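
A minimal sketch of converting key point position information in the intercepted target image back into target point position information in the initial image; the simple offset-and-scale form of the mapping relationship shown here is an assumption:

    def to_initial_coords(keypoints, crop_offset, scale=1.0):
        # keypoints: iterable of (x, y) in target-image coordinates
        # crop_offset: (x, y) of the crop's top-left corner in the initial image
        ox, oy = crop_offset
        return [(x * scale + ox, y * scale + oy) for x, y in keypoints]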
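
A minimal sketch of the target video generation step: for each pose in the target action's pose sequence, an image with the closest estimated pose is drawn from the target image set, and the selected frames are ordered by the sequence. The vector pose representation and the Euclidean matching metric are assumptions:

    import numpy as np

    def build_action_video(pose_sequence, image_poses, images):
        # pose_sequence: ordered list of pose vectors defining the target action
        # image_poses:   (M, D) estimated target pose per image in the target image set
        # images:        list of the M target images
        frames = []
        for pose in pose_sequence:
            dists = np.linalg.norm(image_poses - np.asarray(pose), axis=1)
            frames.append(images[int(np.argmin(dists))])   # image whose pose matches best
        return frames   # video frame images in pose-sequence order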

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a pose estimation method and apparatus, a computer device and a storage medium. The method comprises: acquiring a target image to undergo pose estimation, the target image containing a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; performing feature expansion on the first extracted feature by means of an image feature expansion network to obtain an expanded image feature; performing feature extraction on the expanded image feature to obtain a second extracted feature; performing feature compression on the second extracted feature by means of an image feature compression network to obtain a compressed image feature; determining, based on the compressed image feature, key point position information corresponding to the target object in the target image, and performing pose estimation on the target object based on the key point position information. The method of the present application can improve the efficiency of pose estimation.
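
A minimal PyTorch-style sketch of the pipeline summarised in the abstract; the layer types, channel counts and class names below are illustrative assumptions and do not reproduce the claimed network:

    import torch
    import torch.nn as nn

    class PoseEstimator(nn.Module):
        def __init__(self, num_keypoints=17):
            super().__init__()
            # feature extraction on the target image -> first extracted feature
            self.extract1 = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
            # image feature expansion network -> expanded image feature
            self.expand = nn.Conv2d(64, 256, 1)
            # feature extraction on the expanded feature -> second extracted feature
            self.extract2 = nn.Sequential(
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
            # image feature compression network -> compressed image feature
            self.compress = nn.Conv2d(256, 64, 1)
            # key point head: one heatmap per key point
            self.head = nn.Conv2d(64, num_keypoints, 1)

        def forward(self, image):
            f1 = self.extract1(image)
            expanded = self.expand(f1)
            f2 = self.extract2(expanded)
            compressed = self.compress(f2)
            return self.head(compressed)   # key point heatmaps

    heatmaps = PoseEstimator()(torch.randn(1, 3, 256, 192))   # shape (1, 17, 128, 96)
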
PCT/CN2022/091484 2021-05-12 2022-05-07 Procédé et appareil d'estimation de pose, dispositif informatique et support de stockage WO2022237688A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110517805.3 2021-05-12
CN202110517805.3A CN113158974A (zh) 2021-05-12 2021-05-12 姿态估计方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2022237688A1 true WO2022237688A1 (fr) 2022-11-17

Family

ID=76874942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091484 WO2022237688A1 (fr) 2021-05-12 2022-05-07 Procédé et appareil d'estimation de pose, dispositif informatique et support de stockage

Country Status (2)

Country Link
CN (1) CN113158974A (fr)
WO (1) WO2022237688A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071785A (zh) * 2023-03-06 2023-05-05 合肥工业大学 一种基于多维空间交互的人体姿态估计方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158974A (zh) * 2021-05-12 2021-07-23 影石创新科技股份有限公司 姿态估计方法、装置、计算机设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062526A (zh) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 一种人体姿态估计方法及移动终端
US20190012790A1 (en) * 2017-07-05 2019-01-10 Canon Kabushiki Kaisha Image processing apparatus, training apparatus, image processing method, training method, and storage medium
CN112241731A (zh) * 2020-12-03 2021-01-19 北京沃东天骏信息技术有限公司 一种姿态确定方法、装置、设备及存储介质
CN112308950A (zh) * 2020-08-25 2021-02-02 北京沃东天骏信息技术有限公司 视频生成方法及装置
CN112614184A (zh) * 2020-12-28 2021-04-06 清华大学 基于2d检测的物体6d姿态估计方法、装置及计算机设备
CN113158974A (zh) * 2021-05-12 2021-07-23 影石创新科技股份有限公司 姿态估计方法、装置、计算机设备和存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626218B (zh) * 2020-05-28 2023-12-26 腾讯科技(深圳)有限公司 基于人工智能的图像生成方法、装置、设备及存储介质
CN112347861B (zh) * 2020-10-16 2023-12-05 浙江工商大学 一种基于运动特征约束的人体姿态估计方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190012790A1 (en) * 2017-07-05 2019-01-10 Canon Kabushiki Kaisha Image processing apparatus, training apparatus, image processing method, training method, and storage medium
CN108062526A (zh) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 一种人体姿态估计方法及移动终端
CN112308950A (zh) * 2020-08-25 2021-02-02 北京沃东天骏信息技术有限公司 视频生成方法及装置
CN112241731A (zh) * 2020-12-03 2021-01-19 北京沃东天骏信息技术有限公司 一种姿态确定方法、装置、设备及存储介质
CN112614184A (zh) * 2020-12-28 2021-04-06 清华大学 基于2d检测的物体6d姿态估计方法、装置及计算机设备
CN113158974A (zh) * 2021-05-12 2021-07-23 影石创新科技股份有限公司 姿态估计方法、装置、计算机设备和存储介质

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071785A (zh) * 2023-03-06 2023-05-05 合肥工业大学 一种基于多维空间交互的人体姿态估计方法

Also Published As

Publication number Publication date
CN113158974A (zh) 2021-07-23

Similar Documents

Publication Publication Date Title
JP6961749B2 (ja) インターリーブチャネルデータ用の構成可能な畳み込みエンジン
JP6789402B2 (ja) 画像内の物体の姿の確定方法、装置、設備及び記憶媒体
CN111598998B (zh) 三维虚拟模型重建方法、装置、计算机设备和存储介质
JP6560480B2 (ja) 画像処理システム、画像処理方法、及びプログラム
WO2019128508A1 (fr) Procédé et appareil de traitement d'image, support de mémoire et dispositif électronique
WO2020010979A1 (fr) Procédé et appareil d'apprentissage de modèle permettant la reconnaissance de points clés d'une main et procédé et appareil de reconnaissance de points clés d'une main
WO2020164270A1 (fr) Procédé, système et appareil de détection de piéton sur la base d'un apprentissage profond et support d'informations
US8861800B2 (en) Rapid 3D face reconstruction from a 2D image and methods using such rapid 3D face reconstruction
WO2022237688A1 (fr) Procédé et appareil d'estimation de pose, dispositif informatique et support de stockage
WO2020134528A1 (fr) Procédé de détection cible et produit associé
WO2020134818A1 (fr) Procédé de traitement d'images et produit associé
US20240153213A1 (en) Data acquisition and reconstruction method and system for human body three-dimensional modeling based on single mobile phone
TW202011284A (zh) 眼睛狀態檢測系統及眼睛狀態檢測系統的操作方法
WO2020223940A1 (fr) Procédé de prédiction de posture, dispositif informatique et support d'informations
WO2023216526A1 (fr) Procédé et appareil de détermination d'informations d'étalonnage et dispositif électronique
CN114640833A (zh) 投影画面调整方法、装置、电子设备和存储介质
EP4135317A2 (fr) Procédé d'acquisition d'images stéréoscopiques, dispositif électronique et support d'enregistrement
WO2022063321A1 (fr) Procédé et appareil de traitement d'image, dispositif et support de stockage
WO2019000464A1 (fr) Procédé et dispositif d'affichage d'image, support de stockage et terminal
CN116452745A (zh) 手部建模、手部模型处理方法、设备和介质
WO2024000233A1 (fr) Procédé et appareil de reconnaissance d'expression faciale, et dispositif et support de stockage lisible
CN116229130A (zh) 模糊图像的类型识别方法、装置、计算机设备和存储介质
CN113011250A (zh) 一种手部三维图像识别方法及***
JP6467994B2 (ja) 画像処理プログラム、画像処理装置、及び画像処理方法
CN112580544A (zh) 图像识别方法、装置和介质及其电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22806654

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22806654

Country of ref document: EP

Kind code of ref document: A1