CN112084958A - Method and device for recognizing human skeleton of multiple persons at mobile terminal - Google Patents

Publication number
CN112084958A
CN112084958A CN202010952910.5A CN202010952910A CN112084958A CN 112084958 A CN112084958 A CN 112084958A CN 202010952910 A CN202010952910 A CN 202010952910A CN 112084958 A CN112084958 A CN 112084958A
Authority
CN
China
Prior art keywords
human body
frame
mobile terminal
model
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010952910.5A
Other languages
Chinese (zh)
Other versions
CN112084958B (en)
Inventor
张德宇
章晋睿
许晓晖
贾富程
张尧学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010952910.5A
Publication of CN112084958A
Application granted
Publication of CN112084958B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for recognizing the human skeletons of multiple persons at a mobile terminal. The method comprises the following steps: S1, decoding a video to obtain image frames; S2, identifying human body regions in the image frames to generate human body subgraphs; and S3, scheduling skeleton recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons. The invention has the advantages of high recognition speed, low delay and high recognition accuracy.

Description

Method and device for recognizing human skeleton of multiple persons at mobile terminal
Technical Field
The invention relates to the field of mobile computing, in particular to a method and a device for recognizing human skeletons of multiple persons at a mobile terminal.
Background
With the improvement of mobile device performance, human pose estimation is being applied in more and more fields. One example is action recognition, which interprets human actions by tracking changes in a person's pose over a period of time; it can be used to detect whether a person has fallen or is ill, and also for automated instruction in fitness, sports and dance.
At present, deep-learning-based human pose estimation models struggle to achieve satisfactory results on resource-limited mobile devices. Although the latest models achieve high accuracy, they bring a huge computational load; running them on a mobile device causes long computation delays and a very poor user experience. For example, the single-person pose estimation model disclosed in SUN, K., XIAO, B., LIU, D., AND WANG, J. Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212 (2019) achieves high accuracy on the MPII dataset, with a PCKh value (head-normalized Percentage of Correct Keypoints) of 92.3. When the model is run on a mobile device, however, typical CNN processing of one video frame takes 600 ms even with the support of the mobile GPU, so it is difficult to deploy on a mobile terminal with good results.
PoseNet, a pose recognition model proposed by Google for mobile terminals, is by contrast a human pose recognition model made suitable for mobile terminals through deep learning model compression. In PoseNet, the conventional convolution operation (CNN) is replaced by depthwise separable convolution (Depthwise CNN), which requires less computation, and the depthwise separable convolutions are stacked layer by layer into a single-branch network structure. Running a single-person pose recognition model on a Snapdragon 845 SoC, PoseNet reduces the delay on the CPU to only 60 ms. However, this is only single-person pose estimation; for multi-person pose estimation the delay increases linearly with the number of people. The compressed model also loses considerable accuracy: the PCKh value of PoseNet on the MPII dataset is only 80.3.
Therefore, how to ensure the recognition speed, reduce the delay and ensure the recognition accuracy in the human body skeleton recognition at the mobile end, especially in the human body skeleton recognition of multiple people, is still a problem to be solved urgently.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides a method and a device for recognizing human skeletons of multiple persons at a mobile terminal, which have the advantages of high recognition speed, low delay and high recognition precision.
In order to solve the technical problems, the technical scheme provided by the invention is as follows: a method for recognizing human skeleton of multiple persons at a mobile terminal comprises the following steps:
S1, decoding a video to obtain image frames;
S2, identifying human body regions in the image frames to generate human body subgraphs;
and S3, scheduling skeleton recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons.
Further, the image frame in the step S1 includes a key frame and a tracking frame;
the key frame is a first predetermined frame in the video, and the tracking frame is a second predetermined frame in the video;
or: the key frame is based on a preset reference point, if the displacement of the reference point in the current image frame relative to the reference point in the previous key frame is greater than a preset threshold value, the current image frame is the key frame, otherwise, the current image frame is the tracking frame.
Further, the step S2 specifically includes: for the key frame, identifying the human body region with a fine-grained human body recognition model and generating the human body subgraph of the key frame; and for the tracking frame, identifying the human body region with a motion-vector-based human body recognition model and generating the human body subgraph of the tracking frame.
Further, the step S3 specifically includes:
when the number of the human body subgraphs to be recognized meets the condition that h is larger than or equal to k + a, monitoring the idle thread conditions of a CPU (central processing unit) and a GPU (graphics processing unit) of the mobile terminal, and when the idle thread exists, allocating one human body subgraph to be recognized for the idle thread to recognize the human body skeleton, wherein h is the number of the human body subgraphs to be recognized, k is a preset allocation parameter, and a is the maximum thread number of the CPU of the mobile terminal.
Further, the step S3 specifically includes:
when the number of human body subgraphs to be recognized satisfies h < k + a, first allocating
[Formula image BDA0002677618510000021; expression not recoverable from this text]
of the human body subgraphs to be recognized to the GPU;
then judging the number of remaining human body subgraphs to be recognized and, when that number satisfies
[Formula image BDA0002677618510000022; expression not recoverable from this text]
monitoring the idle threads of the CPU of the mobile terminal and, when an idle thread exists, allocating one human body subgraph to be recognized to the idle thread; otherwise, allocating one human body subgraph to be recognized to more than one thread.
Further, the preset distribution parameter is determined according to the ratio of the CPU operation delay and the GPU operation delay of the mobile terminal.
Further, the preset skeleton recognition model in step S3 is a convolutional neural network model, and a neural network layer of the convolutional neural network model includes at least 1 inverted residual convolutional layer and at least 1 enhanced inverted residual convolutional layer; the inverted residual convolutional layer uses a ReLU6 nonlinear activation function; the enhanced inverted residual convolutional layer uses a Squeeze-and-Excitation module and an H-Swish activation function.
Further, the innermost neural network layer of the convolutional neural network model comprises 2 or more enhanced inverted residual convolutional layers.
Further, a preprocessing layer is arranged between the input layer of the convolutional neural network and the first neural network layer; the preprocessing layer comprises at least 1 first inverted residual module and at least 1 second inverted residual module; the strides of the first inverted residual module and the second inverted residual module are not equal.
A multi-person human body skeleton recognition device at a mobile terminal comprises a processor and a memory, wherein the processor is used for executing an application program in the memory, and the memory stores the application program which can realize the human body skeleton recognition method when being executed.
Compared with the prior art, the invention has the advantages that:
1. By scheduling the skeleton recognition tasks for the human body subgraphs onto the CPU and/or the GPU, the invention makes effective use of the computing power of both processors. This parallel processing strategy effectively increases the speed of human skeleton recognition and reduces its delay, especially when the skeletons of multiple persons must be recognized.
2. In the convolutional neural network model, the neural network layer contains both inverted residual convolutional layers and enhanced inverted residual convolutional layers, so that the model achieves high accuracy and low delay while its amount of computation is effectively reduced.
3. The method divides the video into key frames and tracking frames, identifies the human body region in key frames with a fine-grained human body recognition model, and identifies the human body region in tracking frames with a motion-vector-based human body recognition model. It thereby combines the accuracy of the fine-grained model with the efficiency of the motion-vector-based model, determines human body regions quickly and efficiently across consecutive video frames, improves the efficiency and speed of framing the human body regions, and reduces the computation required for human body region recognition.
Drawings
FIG. 1 is a schematic process flow diagram of an embodiment of the present invention.
FIG. 2 is a diagram illustrating human body joints and macro block division according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating joint point determination through consecutive motion frames according to an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating comparative analysis of the delay effect according to an embodiment of the present invention.
FIG. 6 is a graph of a comparison analysis of recognition accuracy for an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
As shown in FIG. 1, the method of this embodiment for recognizing the human skeletons of multiple persons at a mobile terminal includes the following steps: S1, decoding a video to obtain image frames; S2, identifying human body regions in the image frames to generate human body subgraphs; and S3, scheduling skeleton recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons.
In this embodiment, the image frames in step S1 include key frames and tracking frames. Either the key frames are first preset frames in the video and the tracking frames are second preset frames in the video, or the key frames are determined from a preset reference point: if the displacement of the reference point in the current image frame relative to the reference point in the previous key frame is greater than a preset threshold, the current image frame is a key frame; otherwise, it is a tracking frame. In the first case, the first preset frames may be a specified start frame in the video together with the frames that follow it at a fixed step; for example, with a step of 10, frames 1, 11 and 21 are key frames and the remaining frames are tracking frames. In the second case, frame 1 is taken as a key frame and joint points of the human body, such as the wrist joints and ankle joints, are used as reference points; when, over several consecutive frames, the displacement of a joint point exceeds the preset threshold, the next image frame is taken as a key frame, and the other frames are tracking frames. Extensive experimental analysis in this embodiment shows that, with the human body region localization scheme of this embodiment, recognition accuracy is seriously affected once the displacement of the reference point reaches a certain distance; to ensure both recognition accuracy and recognition efficiency, the preset threshold is preferably a distance of 10 pixels.
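The two key-frame rules can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, a single reference joint per frame is assumed, and the displacement rule follows the wording in which the current frame itself becomes the key frame once the threshold is exceeded.

```python
import math

def fixed_step_key_frames(num_frames, start=1, step=10):
    """Return the 1-based numbers of the key frames under the fixed-step rule."""
    return list(range(start, num_frames + 1, step))

def classify_by_displacement(reference_points, threshold=10.0):
    """Label each frame 'key' or 'tracking' from reference-point displacement.

    reference_points: one (x, y) position of the reference joint per frame
    (assumed input layout). The first frame is always a key frame.
    """
    labels = []
    key_point = None  # reference-point position in the most recent key frame
    for point in reference_points:
        if key_point is None or math.dist(point, key_point) > threshold:
            labels.append("key")
            key_point = point
        else:
            labels.append("tracking")
    return labels

print(fixed_step_key_frames(25))
print(classify_by_displacement([(0, 0), (3, 4), (6, 8), (9, 12), (10, 12)]))
```

With a step of 10 the first call reproduces the frames 1, 11 and 21 mentioned in the text; the 10-pixel default mirrors the preferred threshold.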
In this embodiment, step S2 specifically includes: for a key frame, identifying the human body region with a fine-grained human body recognition model and generating the human body subgraph of the key frame; and for a tracking frame, identifying the human body region with a motion-vector-based human body recognition model and generating the human body subgraph of the tracking frame. The fine-grained human body recognition model of this embodiment has higher recognition accuracy than the motion-vector-based model and is preferably a MobileNet-V3-SSDLite model, which accurately locates the human body regions in an image frame so that human body subgraphs can be generated. Other models may of course be used as the fine-grained human body recognition model. Once the human body regions in a key frame have been accurately located by the fine-grained model, the position of each human body in the tracking frames that follow can be tracked from the motion vectors produced by the displacement of the human body between consecutive image frames, so that the human body regions in a tracking frame are located quickly and its human body subgraphs are generated.
In the motion-vector-based human body recognition model, a human skeleton joint point, such as the wrist joint or ankle joint shown in FIGS. 2 and 3, is taken as the center, and the motion vectors of the macroblocks around that center are used to predict the motion of the joint. This minimizes the adverse effect of motion vectors from wrong macroblocks, and it matches the way people move: the torso often stays still while limbs such as the palm and arm move about the joint points. As shown in FIG. 2, taking the left wrist as an example, four blocks are selected around the identified left wrist joint point, at the upper left, upper right, lower left and lower right, forming the region of interest of the left wrist joint point. The motion vectors in this region are then computed, the motion vectors that mostly move in the same direction are counted as effective motion vectors, and the moving direction and displacement of the joint point are obtained from the moving direction and magnitude of the effective motion vectors.
The motion vectors can be obtained through the hardware encoding interface of the Android system on a mobile phone, whose underlying layer may compute them using digital signal processing hardware; a motion vector describes the displacement of each macroblock (Macroblock) between two frames.
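The region-of-interest voting around a joint point can be illustrated with a short sketch. Several details here are assumptions rather than statements of the patent: a 16-pixel macroblock, motion vectors supplied as a dictionary keyed by macroblock grid index, "same direction" decided by the signs of the vector components, and the joint displacement taken as the mean of the effective vectors.

```python
from collections import defaultdict

BLOCK = 16  # assumed macroblock size in pixels

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def roi_blocks(joint, block=BLOCK):
    """Grid indices of the four macroblocks meeting at the grid corner nearest the joint."""
    cx, cy = round(joint[0] / block), round(joint[1] / block)
    return [(cx - 1, cy - 1), (cx, cy - 1), (cx - 1, cy), (cx, cy)]

def joint_displacement(joint, motion_vectors, block=BLOCK):
    """Estimate the (dx, dy) displacement of a joint from nearby motion vectors.

    motion_vectors: dict mapping (col, row) grid index -> (dx, dy); blocks
    missing from the dict are treated as stationary.
    """
    groups = defaultdict(list)
    for idx in roi_blocks(joint, block):
        dx, dy = motion_vectors.get(idx, (0, 0))
        groups[(sign(dx), sign(dy))].append((dx, dy))
    # The largest group of vectors sharing a direction is taken as "effective".
    effective = max(groups.values(), key=len)
    n = len(effective)
    return (sum(v[0] for v in effective) / n, sum(v[1] for v in effective) / n)

vectors = {(1, 1): (4, 0), (2, 1): (6, 0), (1, 2): (5, 0), (2, 2): (-3, 2)}
print(joint_displacement((34, 30), vectors))
```

In the example, three of the four blocks move to the right, so the outlier pointing up and to the left is discarded before averaging.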
In this embodiment, the joint points of the skeleton are determined by motion vectors and an adaptive regression algorithm. FIG. 3 shows 3 consecutive frames, taking a left ankle moving forward as an example. In FIG. 3, the light dots and arrows are the motion vectors of objects in the picture. The motion vector of a region without motion appears as a dot, with a value of 0; the ankle region is moving, so its motion vectors appear as arrows whose direction is the direction of ankle movement and whose length is the displacement of the ankle. In the picture for time t in FIG. 3, an enlarged view of the ankle is used for explanation: the dot below the middle is the joint point identified by the skeleton recognition model in the previous frame, which serves as the reference for predicting the joint point from the motion vectors of the next frame; the dot on the left side of the ankle is the joint point predicted by an autoregressive filter from the joint position in the previous frame and the motion vector displacement of the previous frame; and the dot on the right side is the joint position actually obtained from the motion vector displacement of the current frame. These dots are drawn in the figure as white, outlined and black-filled dots. Finally, the person is framed using the joint points obtained.
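Framing a person from the tracked joint points can be sketched as a padded bounding box. The function name and the fixed pixel margin are hypothetical; the text does not state how much padding, if any, is applied around the joints.

```python
def frame_person(joints, frame_w, frame_h, margin=8):
    """Return (left, top, right, bottom) of a box enclosing all joint points.

    joints: list of (x, y) joint positions for one person.
    margin: assumed padding in pixels, clipped to the frame boundaries.
    """
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    return (
        max(0, min(xs) - margin),
        max(0, min(ys) - margin),
        min(frame_w, max(xs) + margin),
        min(frame_h, max(ys) + margin),
    )

print(frame_person([(100, 50), (140, 50), (120, 250)], 640, 480))
```

The resulting box is what would be cropped from the tracking frame to form the human body subgraph for that person.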
In this embodiment, the positions of the tracked skeleton points are further preferably calibrated with a diffusion Kalman filter to further improve recognition accuracy. In this way, the high accuracy of the fine-grained human body recognition model and the high efficiency of the motion-vector-based human body recognition model are both put to use: the human body regions in each of the consecutive frames can be marked quickly and accurately and the human body subgraphs generated, which effectively reduces the total amount of computation and greatly improves efficiency and accuracy.
In this embodiment, step S3 specifically includes: when the number of the human body subgraphs to be identified meets the condition that h is larger than or equal to k + a, monitoring the idle thread conditions of a CPU (central processing unit) and a GPU (graphics processing unit) of the mobile terminal, and when the idle thread exists, allocating one human body subgraph to be identified for the idle thread to identify the human body skeleton, wherein h is the number of the human body subgraphs to be identified, k is a preset allocation parameter, and a is the maximum thread number of the CPU of the mobile terminal.
In this embodiment, the number of CPU threads is determined by the CPU cores used for skeleton recognition in this technical solution. For example, on an existing mobile phone with 4 big cores and 4 little cores, when only the 4 big cores are used for skeleton recognition, the maximum number of CPU threads is 4; likewise, for a CPU with cores of different sizes, when all cores are used for skeleton recognition, the maximum number of CPU threads is the number of CPU cores.
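The idle-thread rule for the case h ≥ k + a can be illustrated with a small simulation. It assumes a single GPU worker, a fixed per-subgraph latency for each kind of worker, and hypothetical latency values; the real scheduler would monitor actual threads rather than simulated clocks.

```python
import heapq

def dispatch_to_idle_workers(num_subgraphs, cpu_threads, cpu_latency_ms, gpu_latency_ms):
    """Simulate giving each pending subgraph to whichever worker is idle first.

    Returns (assignment, finish_ms), where assignment is a list of
    (subgraph index, worker name) pairs and finish_ms is the time at which
    the last subgraph completes.
    """
    # Heap entries: (time at which the worker becomes idle, name, latency per subgraph).
    workers = [(0.0, "gpu", gpu_latency_ms)]
    workers += [(0.0, f"cpu{i}", cpu_latency_ms) for i in range(cpu_threads)]
    heapq.heapify(workers)
    assignment = []
    finish_ms = 0.0
    for sub in range(num_subgraphs):
        idle_at, name, latency = heapq.heappop(workers)
        done = idle_at + latency
        assignment.append((sub, name))
        finish_ms = max(finish_ms, done)
        heapq.heappush(workers, (done, name, latency))
    return assignment, finish_ms

# Hypothetical figures: 2 CPU threads at 60 ms each, GPU at 20 ms per subgraph.
assignment, finish_ms = dispatch_to_idle_workers(5, 2, 60, 20)
print(assignment, finish_ms)
```

With these figures the GPU handles three subgraphs in the time each CPU thread handles one, so all five subgraphs finish together.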
In this embodiment, step S3 specifically includes: when the number of human body subgraphs to be recognized satisfies h < k + a, first allocating
[Formula image BDA0002677618510000051; expression not recoverable from this text]
of the human body subgraphs to be recognized to the GPU; then judging the number of remaining human body subgraphs to be recognized and, when that number satisfies
[Formula image BDA0002677618510000052; expression not recoverable from this text]
monitoring the idle threads of the CPU of the mobile terminal and, when an idle thread exists, allocating one human body subgraph to be recognized to the idle thread; otherwise, allocating one human body subgraph to be recognized to more than one thread.
In this embodiment, the preset allocation parameter is determined according to the ratio of the CPU operation delay to the GPU operation delay of the mobile terminal. Preferably, the CPU operation delay is the time a single CPU thread needs to perform human skeleton recognition on a single human body subgraph, and the GPU operation delay is the time the GPU needs to perform human skeleton recognition on a single human body subgraph. The allocation parameter may of course also be determined directly from other quantities related to the operation delay, such as the ratio of the computation speed of the CPU to that of the GPU.
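As a rough sketch, the allocation parameter k can be derived from measured per-subgraph latencies and then used to choose between the two scheduling cases. The helper names are hypothetical, and whether k is rounded to an integer is not stated in the text, so it is left as a float here.

```python
import time

def measure_latency_ms(run_once, repeats=5):
    """Average wall-clock time in milliseconds of calling run_once() repeatedly."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / repeats

def allocation_parameter(cpu_latency_ms, gpu_latency_ms):
    """k as the ratio of single-thread CPU latency to GPU latency per subgraph."""
    return cpu_latency_ms / gpu_latency_ms

def use_idle_dispatch(h, k, a):
    """True when h >= k + a, i.e. subgraphs are simply handed to idle threads."""
    return h >= k + a

k = allocation_parameter(60, 20)   # hypothetical latencies in ms
print(k, use_idle_dispatch(5, k, 2), use_idle_dispatch(4, k, 2))
```

Here `measure_latency_ms` would be called once with a function that runs the skeleton recognition model on one subgraph on a CPU thread and once with the GPU equivalent.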
In this embodiment, the skeleton recognition model preset in step S3 is a convolutional neural network model, and a neural network layer of the convolutional neural network model includes at least 1 inverted residual convolutional layer and at least 1 enhanced inverted residual convolutional layer; the inverted residual convolutional layer uses a ReLU6 nonlinear activation function; the enhanced inverted residual convolutional layer uses a Squeeze-and-Excitation module and an H-Swish activation function. Preferably, the innermost neural network layer of the convolutional neural network model comprises 2 or more enhanced inverted residual convolutional layers. Further preferably, a preprocessing layer is arranged between the input layer of the convolutional neural network and the first neural network layer; the preprocessing layer comprises at least 1 first inverted residual module and at least 1 second inverted residual module; the strides of the first inverted residual module and the second inverted residual module are not equal.
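For reference, the two activation functions and a Squeeze-and-Excitation step can be written in a few lines of plain Python. ReLU6 and H-Swish follow their standard definitions; the Squeeze-and-Excitation sketch omits biases and assumes a hard-sigmoid gate in the style of MobileNetV3, which the text does not specify.

```python
def relu6(x):
    """ReLU6: clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    """H-Swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

def h_sigmoid(x):
    """Hard sigmoid used here as the channel gate (an assumption)."""
    return relu6(x + 3.0) / 6.0

def squeeze_excite(feature_maps, w1, w2):
    """Rescale each channel by a gate computed from global channel statistics.

    feature_maps: list of C channels, each a list of H rows of W floats.
    w1: R x C weight rows (reduction), w2: C x R weight rows (expansion).
    """
    # Squeeze: global average pooling per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: fully connected + ReLU, then fully connected + gate.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [h_sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feature_maps, gates)]

x = [[[1.0, 3.0]], [[2.0, 2.0]]]          # 2 channels, 1 x 2 each
print(squeeze_excite(x, [[0.5, 0.5]], [[1.5], [-1.5]]))
```

In the example the first channel receives a gate of 1 and passes through unchanged, while the second receives a gate of 0 and is suppressed.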
In this embodiment, as shown in FIG. 4, the input of the convolutional neural network model is a 192 × 192 × 3 picture and the output is a 48 × 48 × 3 human keypoint heatmap. The overall network structure draws on the Hourglass and U-Net network structures: the picture is first reduced from high resolution to low resolution, and after features are extracted the low resolution is restored to high resolution. A network structure whose resolution first decreases and then increases can capture as much information as possible across image scales, which effectively improves the recognition accuracy of the model. In the convolutional neural network model, the input picture is reduced from high resolution to very low resolution through convolution and max pooling operations; the network then gradually restores the low resolution to high resolution through bilinear interpolation, and features at different scales are fused to obtain the final output. The features of the human skeleton points are extracted using inverted residual convolution as the basic convolution operation. An inverted residual convolution block consists of a 1x1 convolution with a ReLU6 nonlinear activation function, a 3x3 depthwise convolution and a 1x1 convolution with a linear activation function, connected in series, and requires far less computation than conventional CNN convolution. Because both the number of parameters and the amount of computation are reduced in the inverted residual convolution, its feature extraction accuracy is correspondingly reduced, which may lower the accuracy of the whole human pose recognition model.
To give the model both high accuracy and low delay, the innermost layer of the convolutional neural network model of this embodiment uses enhanced inverted residual convolution, that is, the Squeeze-and-Excitation module is incorporated into the inverted residual operation and ReLU6 is replaced with the H-Swish activation function. Owing to the structure of the human body, some parts, such as the ankles, are difficult to recognize, so the feature extraction capability needs to be further strengthened for these hard-to-recognize skeleton points; using enhanced inverted residual convolution in the innermost network layer allows feature information of the image at low resolution to be extracted. In this embodiment, a Layer Module (n, e) is defined as a feature extraction module of the model network, where n is the total number of convolution operations in that layer of the network, counting both inverted residual convolutions and enhanced inverted residual convolutions, and e is the number of enhanced inverted residual convolutions in that layer. In the high-resolution network, Layer Module (4, 1) is used for feature extraction, because joint points such as the head and neck are easy to recognize and their features are relatively distinct. The low-resolution network mainly recognizes joint points such as the wrists and ankles, whose features are harder to extract, so Layer Module (4, 2) is used there to improve the feature extraction capability of the model. In this way, the recognition accuracy of the model for each joint point of the human body is effectively improved while the amount of computation of the model is reduced.
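The Layer Module (n, e) notation can be made concrete with a small helper that lists the block types in one module. The helper name is hypothetical, and placing the enhanced blocks last is an assumption; the text gives only the counts n and e, not the order.

```python
def layer_module(n, e):
    """List the block types of a Layer Module (n, e).

    n: total number of convolution blocks in the module.
    e: how many of them are enhanced inverted residual blocks.
    """
    if not 0 <= e <= n:
        raise ValueError("e must lie between 0 and n")
    # Assumed ordering: plain inverted residual blocks first, enhanced blocks last.
    return ["inverted_residual"] * (n - e) + ["enhanced_inverted_residual"] * e

print(layer_module(4, 1))  # high-resolution branch
print(layer_module(4, 2))  # low-resolution (innermost) branch
```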
The multi-person human body skeleton recognition device of the mobile terminal in the embodiment comprises a processor and a memory, wherein the processor is used for executing an application program in the memory, and the memory stores the application program which can realize the human body skeleton recognition method when being executed.
In this embodiment, the technical solution was run on different models of mobile phone and compared with the prior art, leading to the following conclusions. As shown in FIG. 5, compared with solutions that run only on the CPU or only on the GPU, the technical solution of this embodiment clearly reduces delay on mobile platforms such as the vivo iQOO, Xiaomi Mi 8 and HiKey 970, and clearly improves recognition efficiency. Meanwhile, as shown in FIG. 6, the accuracy of the solution of this embodiment is substantially equivalent to that of the compared solutions, with some metrics improved.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. Computer program instructions can implement each flow and/or block of the flowcharts and/or block diagrams, such that the instructions, when executed by a processor of a computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A method for multi-person human skeleton recognition at a mobile terminal, characterized by comprising the following steps:
S1, decoding a video to obtain an image frame;
S2, identifying a human body region in the image frame to generate a human body subgraph;
S3, performing recognition scheduling on the human body subgraph: allocating the human body subgraph to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal for recognition, and running a preset skeleton recognition model on the CPU and/or the GPU, respectively, to recognize the human skeleton.
2. The method for multi-person human skeleton recognition at a mobile end according to claim 1, wherein: the image frame in the step S1 includes a key frame and a tracking frame;
the key frame is a first predetermined frame in the video, and the tracking frame is a second predetermined frame in the video;
or: the key frame is based on a preset reference point, if the displacement of the reference point in the current image frame relative to the reference point in the previous key frame is greater than a preset threshold value, the current image frame is the key frame, otherwise, the current image frame is the tracking frame.
3. The method for multi-person human skeleton recognition at a mobile end according to claim 2, wherein: the step S2 specifically includes: aiming at the key frame, identifying a human body region through a fine-grained human body identification model, and generating a human body subgraph of the key frame; and for the tracking frame, identifying a human body region through a human body identification model based on a motion vector, and generating a human body subgraph of the tracking frame.
4. The mobile-end multi-person human skeleton recognition method according to any one of claims 1 to 3, wherein: the step S3 specifically includes:
when the number of the human body subgraphs to be recognized meets the condition that h is larger than or equal to k + a, monitoring the idle thread conditions of a CPU (central processing unit) and a GPU (graphics processing unit) of the mobile terminal, and when the idle thread exists, allocating one human body subgraph to be recognized for the idle thread to recognize the human body skeleton, wherein h is the number of the human body subgraphs to be recognized, k is a preset allocation parameter, and a is the maximum thread number of the CPU of the mobile terminal.
5. The method for multi-person human skeleton recognition at the mobile end according to claim 4, wherein: the step S3 specifically includes:
when the number of the human body subgraphs to be identified satisfies h<k + a, first, the
Figure FDA0002677618500000011
Distributing the human body subgraphs to be identified to a GPU;
judging the number of the rest human body subgraphs to be identified when the number meets the requirement
Figure FDA0002677618500000012
And monitoring the idle thread condition of the CPU of the mobile terminal, distributing one human body subgraph to be identified to the idle thread when the idle thread exists, and otherwise, distributing one human body subgraph to be identified to more than one thread.
6. The method for multi-person human skeleton recognition at a mobile end according to claim 5, wherein: and the preset distribution parameters are determined according to the ratio of the CPU operation delay and the GPU operation delay of the mobile terminal.
7. The mobile-end multi-person human skeleton recognition method according to any one of claims 1 to 3, wherein: the preset skeleton recognition model in the step S3 is a convolutional neural network model, and a neural network layer of the convolutional neural network model includes at least 1 inverted residual convolutional layer and at least 1 enhanced inverted residual convolutional layer; the inverted residual convolutional layer adopts a ReLU6 nonlinear activation function; the enhanced inverted residual convolutional layer adopts a Squeeze-and-Excitation module and an H-Swish activation function.
8. The method for multi-person human skeleton recognition at a mobile end according to claim 7, wherein: the innermost neural network of the convolutional neural network model comprises 2 or more than 2 enhanced inverted residual convolutional layers.
9. The method for multi-person human skeleton recognition at a mobile end according to claim 8, wherein: a preprocessing layer is also arranged between the input layer of the convolutional neural network and the first neural network layer; the preprocessing layer comprises at least 1 first inverted residual model and at least 1 second inverted residual model; the step sizes of the first inverted residual error model and the second inverted residual error model are not equal.
10. A multi-person human skeleton recognition device at a mobile terminal comprises a processor and a memory, wherein the processor is used for executing an application program in the memory, and the device is characterized in that: the memory stores an application program which can realize the multi-person human skeleton recognition method of the mobile terminal according to any one of claims 1 to 9 when executed.
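As an illustration only, the displacement-based key frame rule of claim 2 and the scheduling condition of claim 4 can be sketched in Python as follows. This is a hedged sketch, not the claimed implementation: the function names, the thread labels, and the use of Euclidean distance for "displacement" are all assumptions. The h < k + a case of claim 5 is deliberately left out, because its allocation formulas appear only as figure references in this text.

```python
from collections import deque
from math import hypot
from typing import Deque, List, Tuple

Point = Tuple[float, float]


def is_key_frame(ref_now: Point, ref_prev_key: Point, threshold: float) -> bool:
    """Claim 2, second alternative: the current frame is a key frame if the
    reference point moved more than `threshold` since the previous key frame.
    Euclidean distance is an assumption; the claim only says "displacement".
    """
    dx = ref_now[0] - ref_prev_key[0]
    dy = ref_now[1] - ref_prev_key[1]
    return hypot(dx, dy) > threshold


def dispatch_when_saturated(pending: Deque[str], idle_threads: List[str],
                            k: int, a: int) -> List[Tuple[str, str]]:
    """Claim 4 case (h >= k + a): give each idle CPU/GPU thread one subgraph.

    h is the number of pending human body subgraphs, k the preset allocation
    parameter, and a the maximum number of CPU threads. Returns a list of
    (thread, subgraph) pairs; the h < k + a case is not handled here.
    """
    h = len(pending)
    if h < k + a:
        return []  # claim 5 applies instead; its formulas are not reproduced
    assignments: List[Tuple[str, str]] = []
    for thread in idle_threads:
        if not pending:
            break
        assignments.append((thread, pending.popleft()))
    return assignments


# Hypothetical usage: six pending subgraphs, k = 2, a = 4, two idle threads.
queue = deque(["p0", "p1", "p2", "p3", "p4", "p5"])
plan = dispatch_when_saturated(queue, ["gpu-0", "cpu-1"], k=2, a=4)
```

In this sketch each idle thread receives exactly one pending subgraph per call, which mirrors the "one subgraph per idle thread" wording of claim 4; a real scheduler would call it repeatedly as threads become idle.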
CN202010952910.5A 2020-09-11 2020-09-11 Multi-human skeleton recognition method and device for mobile terminal Active CN112084958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010952910.5A CN112084958B (en) 2020-09-11 2020-09-11 Multi-human skeleton recognition method and device for mobile terminal

Publications (2)

Publication Number Publication Date
CN112084958A true CN112084958A (en) 2020-12-15
CN112084958B CN112084958B (en) 2024-05-10

Family

ID=73736656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010952910.5A Active CN112084958B (en) 2020-09-11 2020-09-11 Multi-human skeleton recognition method and device for mobile terminal

Country Status (1)

Country Link
CN (1) CN112084958B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
CN109543549A (en) * 2018-10-26 2019-03-29 北京陌上花科技有限公司 Image processing method and device, mobile end equipment, server for more people's Attitude estimations
US20190278985A1 (en) * 2018-03-09 2019-09-12 Baidu Online Network Technology (Beijing) Co., Ltd . Method, system and terminal for identity authentication, and computer readable storage medium
WO2020088433A1 (en) * 2018-10-30 2020-05-07 腾讯科技(深圳)有限公司 Method and apparatus for recognizing postures of multiple persons, electronic device, and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANIIL OSOKIN: "Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose", 《ARXIV》, 29 November 2018 (2018-11-29), pages 1 - 5 *
RANA OSAMA等: "Realtime Multi-Person 2D Pose Estimation", 《INTERNATIONAL JOURNAL OF ADVANCED NETWORKING AND APPLICATIONS》, vol. 11, no. 6, 30 June 2020 (2020-06-30), pages 4501 - 4508 *
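Xiao Xianpeng et al.: "Real-time multi-person pose estimation based on depth images", 《Transducer and Microsystem Technologies》, vol. 39, no. 6, 2 June 2020 (2020-06-02), pages 26 - 29 *
Zhao Yuwen: "Human pose estimation in video based on deep learning", 《China Masters' Theses Full-text Database (Information Science and Technology)》, no. 2, 15 February 2020 (2020-02-15), pages 138 - 1199 *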

Also Published As

Publication number Publication date
CN112084958B (en) 2024-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant