CN114821654A - Human hand detection method based on a spatio-temporal graph network fusing local and depth information - Google Patents

Human hand detection method based on a spatio-temporal graph network fusing local and depth information

Info

Publication number
CN114821654A
Authority
CN
China
Prior art keywords
graph
image
frame
node
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210497768.9A
Other languages
Chinese (zh)
Inventor
陈飞
孙骞
刘莞玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210497768.9A priority Critical patent/CN114821654A/en
Publication of CN114821654A publication Critical patent/CN114821654A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human hand detection method based on a spatio-temporal graph network fusing local and depth information. First, a conventional detector is used to acquire multi-scale feature maps of the video images; then, target candidate frames with high confidence are selected from the feature maps, the image inside each target candidate frame is fed into the detector again to obtain its feature map, local key-information candidate frames are obtained from the activation values of sliding windows, a graph relation network is built for these candidate frames, and graph networks are constructed for training by fusing the depth information and the preceding frames of the video; finally, the node attributes obtained by the graph computation are added to the feature information of the original feature map, thereby enhancing recognition. The invention uses the graph relation network, depth information and the temporal context of the video to enhance image recognition, and solves the problem that a conventional detector cannot exploit image depth information, the temporal order of video frames, or the association relations between targets.

Description

Human hand detection method based on a spatio-temporal graph network fusing local and depth information
Technical Field
The invention relates to the technical field of image processing, and in particular to a human hand detection method based on a spatio-temporal graph network fusing local and depth information.
Background
Multi-target video detection is a core problem of computer vision. Human hand detection, as a key technology of human-computer interaction, is increasingly applied in fields such as virtual reality, rehabilitation and remote control. A real-time video image sequence is input and the correct hand targets are computed by a detector, so that the detection results can be applied accurately in human-computer interaction. However, because of camera shake, motion blur and similar effects, a single captured instantaneous image often does not allow the hand target to be detected accurately, so accurately detecting hand targets in video has become a difficult problem. Target detection refers to detecting the targets to be recognized in everyday images and is a core problem of computer vision; many successful algorithms have been proposed for it, such as Fast R-CNN and YOLOX. The mainstream approach in video detection uses CNNs to obtain video context information, but it only considers the contextual feature information and ignores the contextual relationship information between video targets. Therefore, how to exploit the relationship between video context information and the human hand targets, as well as the local information inside the hand targets on the video image, has long been a difficult problem for image detection algorithms.
Disclosure of Invention
In view of the above, the present invention provides a human hand detection method based on a spatio-temporal graph network fusing local and depth information, which uses a graph relation network, depth information and video context information to enhance image recognition, and solves the problem that a conventional detector cannot exploit image depth information, the temporal order of video frames, or the association relations between targets.
In order to achieve the above purpose, the invention adopts the following technical scheme: a human hand detection method based on a spatio-temporal graph network fusing local and depth information, characterized by comprising the following steps:
step S1: inputting an N-frame image sequence and labels containing the human hand positions into a conventional detector, and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the N frames of images;
step S2: performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
step S3: inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V into the conventional detector, and outputting the feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the P windows with the largest activation values as local candidate frames;
step S5: respectively constructing a graph relation network for each frame of image;
step S6: constructing a respective adjacency matrix for the graph relation network of each frame of image and carrying out graph calculation;
step S7: constructing an adjacency matrix from the graph relation network of the N frames of images and carrying out graph calculation;
step S8: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; updating the enhanced feature map through the loss computation of the conventional detector;
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous N-1 frames (or only the real-time image if the previous N-1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: and looping the steps S10-S12 until the real-time final prediction result of the video is output.
In a preferred embodiment: the step S5 specifically includes: and taking the target candidate frame with the confidence degree larger than the learnable threshold value V in the feature graph of each frame of image as a target graph node, taking the feature information of the candidate frame in the feature graph as the attribute of the target graph node, taking P local candidate frames in each target as graph local nodes, taking the feature information of the local candidate frames as the attribute of the local nodes, and constructing the node information of the graph relation network of each frame of image.
In a preferred embodiment: the step S6 specifically includes: converting the image into a depth map so as to obtain depth map information of the image; and combined with the pixel distance d between the nodes ij And relative depth m ij Generating a (K + L) × (K + L) adjacency matrix, wherein K is the number of target graph nodes in the image, and L is the number of local nodes in the imageAmount in which (K + L) × (K + L) represents the relationship between each graph node, and λ is set ij Representing the relationship between the areas represented by the i and j nodes:
λ ij =(w 1 d ij +w 2 m ij )
m ij =δ(ω(I),i,j)
Figure BDA0003633486260000041
wherein x is i 、y i And x j 、y j Coordinates of center points, w, representing two candidate frames, respectively 1 And w 2 For learnable parameters, ω (-) is a calculated depth map, δ (-) is a depth difference of a region where I and j are calculated, I is an input original image, and W and H are the width and height of the input image; let λ be between local nodes of different targets ij Defining the value as ∞ and performing thinning operation on the adjacent matrix: starting from the first row, each row λ is selected ij Constructing a sparse adjacent matrix of each frame of image by using the largest front F points; and using Graph Attention network Graph Attention Networks to describe the influence and propagation between each adjacent node in the Graph; attribute of each graph node h j Is updated to be h' i The updating process is as follows:
Figure BDA0003633486260000042
wherein alpha is ij Attention coefficients for graph nodes i and j, h j Is the attribute of j nodes around the i node, W' is a learnable parameter, σ (-) is expressed as the process of graph convolution calculation,
Figure BDA0003633486260000043
representing any node in the neighborhood of node i.
In a preferred embodiment: the step S7 specifically includes: converting each frame of image into a depth map and combining the pixel distance d between each node ij And a hierarchical distance k ij And relative depth m ij Generating an adjacency matrix of (K + L) × (K + L); in the adjacency matrix, (K + L) × (K + L) represents the relationship between each graph node, and γ is set ij Representing the relationship between the areas represented by the i and j nodes:
γ ij =(w 3 d ij +w 4 k ij +w 5 m ij )
k ij =Fra(i)-Fra(j)
wherein, w 3 、w 4 And w 5 Fra (-) is the frame number of the image where the computing node i or j is located; let λ be between local nodes of different targets ij Defining the value as ∞ and performing thinning operation on the adjacent matrix: starting from the first row, each row γ is selected ij Constructing a sparse adjacent matrix with hierarchical information from the largest first F points; and uses Graph Attention Networks to describe the impact and propagation between each adjacent node in the Graph.
Compared with the prior art, the invention has the following beneficial effects:
1) by introducing a graph relation network, depth information and local information, the relationships between human hand targets and the local information inside the hand targets are used reasonably to enhance recognition, solving the problem that a conventional detector recognizes inaccurately because it relies only on target features, and better enhancing the detection of hand targets;
2) by introducing the time sequence and reasonably combining the relationships between preceding and following video frames, the temporal context of the video can be used during detection to find the correct hand target more accurately, which further enhances the detection effect.
Drawings
Fig. 1 is a flowchart of the human hand detection method based on a spatio-temporal graph network fusing local and depth information in a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of the network architecture fusing local and depth spatio-temporal graphs according to a preferred embodiment of the present invention.
Fig. 3 compares the detection effect of the preferred embodiment of the present invention with that of other detection algorithms.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the human hand detection method based on a spatio-temporal graph network fusing local and depth information of the invention is implemented according to the following steps:
step S1: inputting an N-frame image sequence and labels containing the human hand positions into a conventional detector (such as YOLO), and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the N frames of images;
step S2: performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
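For illustration only, the Decoupled Head operation of step S2 may be sketched as follows (a minimal PyTorch sketch; the 256-channel 1×1 layer and the 3×3/1×1 layers are taken from the description above, while the arrangement of the layers into parallel confidence and position branches, the module name and the activation functions are assumptions):

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Illustrative sketch of step S2: a 1x1 conv reduces channels to 256, then
    3x3 convs followed by 1x1 convs output the confidence and the position of
    each target candidate frame."""
    def __init__(self, in_channels):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_channels, 256, 1), nn.ReLU(inplace=True))
        self.conf_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),   # confidence of each candidate frame
        )
        self.pos_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 4, 1),   # (x, y, w, h) of each candidate frame
        )

    def forward(self, feature_map):
        x = self.stem(feature_map)
        return self.conf_branch(x), self.pos_branch(x)
```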
step S3: inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V (initially 0.1) into the conventional detector, and outputting the feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the P windows with the largest activation values as local candidate frames;
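As an informal illustration of step S4 (a sketch assuming PyTorch; the aggregation A(·) is taken here as the channel-wise sum, and the non-maximum suppression over overlapping windows is replaced by a plain top-P selection for brevity — both are assumptions):

```python
import torch
import torch.nn.functional as F

def local_candidate_windows(feature_map, window_sizes, top_p):
    """Sketch of step S4: score every h x w sliding window by the mean of the
    aggregated channel values A(.) inside it, and keep the windows with the
    largest activation values as local candidate frames."""
    # feature_map: (C, H, W) feature map of one target candidate frame
    aggregated = feature_map.sum(dim=0)[None, None]                   # A(.): (1, 1, H, W)
    scored = []
    for (h, w) in window_sizes:
        act = F.avg_pool2d(aggregated, kernel_size=(h, w), stride=1)  # mean of A(.) per window
        out_w = act.shape[-1]
        vals, idx = act.flatten().topk(min(top_p, act.numel()))
        for v, i in zip(vals.tolist(), idx.tolist()):
            y, x = divmod(i, out_w)
            scored.append((v, (x, y, w, h)))                          # (activation value, window box)
    scored.sort(key=lambda s: s[0], reverse=True)                     # plain top-P instead of NMS
    return scored[:top_p]
```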
step S5: constructing a graph relation network for each frame of image, specifically: taking each target candidate frame whose confidence is greater than the learnable threshold V in the feature map of the frame as a target graph node, taking the feature information of that candidate frame in the feature map as the attribute of the target graph node, taking the P local candidate frames in each target as local graph nodes, taking the feature information of the local candidate frames as the attributes of the local nodes, and thereby constructing the node information of the graph relation network of each frame of image;
step S6: constructing an adjacency matrix for the graph relation network of each frame of image and performing graph computation, specifically: converting the image into a depth map so as to obtain the depth information of the image, and combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix (K is the number of target graph nodes in the image and L is the number of local nodes in the image); each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1 d_ij + w_2 m_ij
m_ij = δ(ω(I), i, j)
d_ij = sqrt( ((x_i − x_j)/W)² + ((y_i − y_j)/H)² )

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters (both initialized to 0.5), ω(·) computes the depth map, δ(·) computes the depth difference between the regions of nodes i and j, I is the input original image, and W and H are the width and height of the input image; λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest λ_ij in each row are kept to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i as follows:

h'_i = σ( Σ_{j∈N_i} α_ij W' h_j )

where α_ij is the attention coefficient between graph nodes i and j, h_j is the attribute of a node j around node i, W' is a learnable parameter, σ(·) denotes the graph convolution computation, and N_i denotes the neighborhood of node i;
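A rough sketch of how such an adjacency matrix could be assembled is given below (assuming PyTorch; the normalized form of d_ij, the per-node depth lookup and the variable names are assumptions). The attention update h'_i = σ(Σ α_ij W' h_j) over the resulting sparse graph could then be carried out with an off-the-shelf graph-attention layer.

```python
import math
import torch

def spatial_adjacency(centers, depths, target_of, is_local, w1, w2, keep_f, img_w, img_h):
    """Sketch of step S6: relation weights lambda_ij from pixel distance d_ij and
    relative depth m_ij, with local nodes of different targets set to infinity,
    followed by row-wise sparsification that keeps the F largest entries."""
    n = len(centers)
    lam = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            (xi, yi), (xj, yj) = centers[i], centers[j]
            d_ij = math.hypot((xi - xj) / img_w, (yi - yj) / img_h)  # assumed normalized distance
            m_ij = abs(depths[i] - depths[j])                        # depth difference delta(.)
            lam[i, j] = w1 * d_ij + w2 * m_ij
            if is_local[i] and is_local[j] and target_of[i] != target_of[j]:
                lam[i, j] = float("inf")                             # as defined in the text
    keep = lam.topk(min(keep_f, n), dim=1).indices                   # F largest lambda_ij per row
    adj = torch.zeros_like(lam)
    adj.scatter_(1, keep, 1.0)
    return adj
```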
step S7: constructing an adjacency matrix from the graph relation networks of the N frames of images and performing graph computation, specifically: converting each frame of image into a depth map and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3 d_ij + w_4 k_ij + w_5 m_ij
k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters (all initialized to 0.5) and Fra(·) is the frame number of the image containing node i or j; γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest γ_ij in each row are kept to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph;
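For comparison with step S6, the inter-frame relation weight of step S7 might look as follows (again only a sketch; the normalized distance and the per-node depth values are assumptions):

```python
import math

def temporal_relation(center_i, center_j, depth_i, depth_j, frame_i, frame_j,
                      w3, w4, w5, img_w, img_h):
    """Sketch of step S7: gamma_ij combines pixel distance d_ij, hierarchical
    (frame) distance k_ij and relative depth m_ij across the N input frames."""
    (xi, yi), (xj, yj) = center_i, center_j
    d_ij = math.hypot((xi - xj) / img_w, (yi - yj) / img_h)  # assumed normalized distance
    k_ij = frame_i - frame_j                                  # Fra(i) - Fra(j)
    m_ij = abs(depth_i - depth_j)                             # relative depth
    return w3 * d_ij + w4 * k_ij + w5 * m_ij
```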
step S8: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map, and updating the enhanced feature map through the loss computation of the conventional detector;
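Step S8 can be pictured with the following sketch (how a node's updated attribute is broadcast back over its region of the feature map is an assumption; the description only states that the attribute is added to the node's features):

```python
def enhance_feature_map(feature_map, node_boxes, node_attrs):
    """Sketch of step S8: add each graph node's updated attribute back onto the
    region of the feature map that the node was taken from."""
    # feature_map: (C, H, W); node_boxes[i]: (x0, y0, x1, y1); node_attrs[i]: (C,)
    enhanced = feature_map.clone()
    for (x0, y0, x1, y1), attr in zip(node_boxes, node_attrs):
        enhanced[:, y0:y1, x0:x1] += attr[:, None, None]   # broadcast over the node's region
    return enhanced
```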
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous N-1 frames (or only the real-time image if the previous N-1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: looping steps S10-S12 until the real-time final prediction result of the video is output.
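Putting steps S10-S13 together, the real-time inference loop could be organized as in the following sketch (detector.backbone, detector.head, detector.nms and graph_module are hypothetical names standing in for the conventional detector and the spatio-temporal graph computation described above):

```python
def detect_video(frames, detector, graph_module, n=3):
    """Sketch of steps S10-S13: feed the real-time frame together with the
    previous N-1 frames (when available), run the graph computation, and decode
    the enhanced features of the real-time frame."""
    results = []
    for t in range(len(frames)):
        clip = frames[max(0, t - (n - 1)): t + 1]       # real-time frame + previous N-1 frames
        feats = [detector.backbone(f) for f in clip]    # step S1: per-frame feature maps
        enhanced = graph_module(feats)                  # steps S11-S12: graph-enhanced features
        preds = detector.head(enhanced[-1])             # Decoupled Head on the real-time frame
        results.append(detector.nms(preds))             # non-maximum suppression
    return results
```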
The following is a specific embodiment of the present invention.
The application of the algorithm provided by the invention to target detection specifically comprises the following steps:
1. inputting a 3-frame image sequence and labels containing the human hand positions into the conventional detector YOLOX, and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the 3 frames of images;
2. performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
3. inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V (initially 0.1) into the conventional detector, and outputting the feature map of each target;
4. passing the feature map of each target through 5 groups of sliding windows of specific sizes ({3×3, 3×5, 6×6, 8×8, 7×10}, respectively), and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the 5 windows with the largest activation values as local candidate frames;
5. constructing a graph relation network for each frame of image, specifically: taking each target candidate frame whose confidence is greater than the learnable threshold V in the feature map of the frame as a target graph node, taking the feature information of that candidate frame in the feature map as the attribute of the target graph node, taking the 5 local candidate frames in each target as local graph nodes, averaging the feature values of each channel inside a local candidate frame to obtain the attribute of the corresponding local node, and thereby constructing the node information of the graph relation network of each frame of image;
6. constructing an adjacency matrix for the graph relation network of each frame of image and performing graph computation, specifically: converting the image into a depth map so as to obtain the depth information of the image, and combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix (K is the number of target graph nodes in the image and L is the number of local nodes in the image); each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1 d_ij + w_2 m_ij
m_ij = δ(ω(I), i, j)
d_ij = sqrt( ((x_i − x_j)/W)² + ((y_i − y_j)/H)² )

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters (both initialized to 0.5), ω(·) computes the depth map, δ(·) computes the depth difference between the regions of nodes i and j, I is the input original image, and W and H are the width and height of the input image (fixed to 1280×720); λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the 7 points with the largest λ_ij in each row are kept to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i as follows:

h'_i = σ( Σ_{j∈N_i} α_ij W' h_j )

where α_ij is the attention coefficient between graph nodes i and j, h_j is the attribute of a node j around node i, W' is a learnable parameter, σ(·) denotes the graph convolution computation, and N_i denotes the neighborhood of node i;
7. constructing an adjacency matrix from the graph relation networks of the 3 frames of images and performing graph computation, specifically: converting each frame of image into a depth map and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3 d_ij + w_4 k_ij + w_5 m_ij
k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters (all initialized to 0.5) and Fra(·) is the frame number of the image containing node i or j; γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the 8 points with the largest γ_ij in each row are kept to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph;
8. adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map, and updating the enhanced feature map through the loss computation of the conventional detector;
9. looping S1-S8 until the network training is completed;
10. inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous 2 frames (or only the real-time image if the previous 2 frames do not exist), so that 3 frames of images are input into the network;
11. outputting a graph network represented by each frame of image to perform graph calculation;
12. adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
13. looping steps S10-S12 until the real-time final prediction result of the video is output.
In order to verify the effectiveness of the method, it is compared with the popular detection algorithm YOLOX. The video image dataset uses the hand images proposed by Bambach et al. in EgoHands: A Dataset for Hands in Complex Egocentric Interactions, with a test image size of 1280×720. Fig. 3 shows the detection results of the algorithms on a sample image; the comparison shows that the method of the present invention, by exploiting the local information of the hand targets and the time-series information, achieves a better detection effect.
The above are preferred embodiments of the present invention; any changes that are made according to the technical scheme of the present invention and that produce functional effects without exceeding the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (4)

1. A human hand detection method based on a spatio-temporal graph network fusing local and depth information, characterized by comprising the following steps:
step S1: inputting an N-frame image sequence and labels containing the human hand positions into a conventional detector, and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the N frames of images;
step S2: performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
step S3: inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V into the conventional detector, and outputting the feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the P windows with the largest activation values as local candidate frames;
step S5: respectively constructing a graph relation network for each frame of image;
step S6: constructing a respective adjacency matrix for the graph relation network of each frame of image and carrying out graph calculation;
step S7: constructing an adjacency matrix from the graph relation network of the N frames of images and carrying out graph calculation;
step S8: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; updating the enhanced feature map through the loss computation of the conventional detector;
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous N-1 frames (or only the real-time image if the previous N-1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: and looping the steps S10-S12 until the real-time final prediction result of the video is output.
2. The human hand detection method based on a spatio-temporal graph network fusing local and depth information according to claim 1, characterized in that step S5 specifically includes: taking each target candidate frame whose confidence is greater than the learnable threshold V in the feature map of each frame of image as a target graph node, taking the feature information of that candidate frame in the feature map as the attribute of the target graph node, taking the P local candidate frames in each target as local graph nodes, taking the feature information of the local candidate frames as the attributes of the local nodes, and constructing the node information of the graph relation network of each frame of image.
3. The human hand detection method based on a spatio-temporal graph network fusing local and depth information according to claim 1, characterized in that step S6 specifically includes: converting the image into a depth map so as to obtain the depth information of the image, and combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix, where K is the number of target graph nodes in the image and L is the number of local nodes in the image; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1 d_ij + w_2 m_ij
m_ij = δ(ω(I), i, j)
d_ij = sqrt( ((x_i − x_j)/W)² + ((y_i − y_j)/H)² )

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters, ω(·) computes the depth map, δ(·) computes the depth difference between the regions of nodes i and j, I is the input original image, and W and H are the width and height of the input image; λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest λ_ij in each row are kept to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i as follows:

h'_i = σ( Σ_{j∈N_i} α_ij W' h_j )

where α_ij is the attention coefficient between graph nodes i and j, h_j is the attribute of a node j around node i, W' is a learnable parameter, σ(·) denotes the graph convolution computation, and N_i denotes the neighborhood of node i.
4. The human hand detection method based on a spatio-temporal graph network fusing local and depth information according to claim 1, characterized in that step S7 specifically includes: converting each frame of image into a depth map and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3 d_ij + w_4 k_ij + w_5 m_ij
k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters and Fra(·) is the frame number of the image containing node i or j; γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest γ_ij in each row are kept to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph.
CN202210497768.9A 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network Pending CN114821654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210497768.9A CN114821654A (en) 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210497768.9A CN114821654A (en) 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network

Publications (1)

Publication Number Publication Date
CN114821654A true CN114821654A (en) 2022-07-29

Family

ID=82513427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210497768.9A Pending CN114821654A (en) 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network

Country Status (1)

Country Link
CN (1) CN114821654A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977812A (en) * 2019-03-12 2019-07-05 南京邮电大学 A kind of Vehicular video object detection method based on deep learning
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111832393A (en) * 2020-05-29 2020-10-27 东南大学 Video target detection method and device based on deep learning
CN111931787A (en) * 2020-07-22 2020-11-13 杭州电子科技大学 RGBD significance detection method based on feature polymerization
CN112115783A (en) * 2020-08-12 2020-12-22 中国科学院大学 Human face characteristic point detection method, device and equipment based on deep knowledge migration

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Weihua, "Moving human hand detection and fingertip tracking based on depth images", Journal of Computer Applications, vol. 34, no. 5, 10 May 2014 (2014-05-10), pages 1442-1448 *
Ruan Hang; Wang Lichun, "Vehicle detection and classification based on feature maps", Computer Technology and Development, no. 11, 29 June 2018 (2018-06-29), pages 45-49 *

Similar Documents

Publication Publication Date Title
CN107529650B (en) Closed loop detection method and device and computer equipment
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN112149459B (en) Video saliency object detection model and system based on cross attention mechanism
CN112418095A (en) Facial expression recognition method and system combined with attention mechanism
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN112183501B (en) Depth counterfeit image detection method and device
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN110853074B (en) Video target detection network system for enhancing targets by utilizing optical flow
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN110378195B (en) Multi-target tracking method based on histogram cache method
CN111401207B (en) Human body action recognition method based on MARS depth feature extraction and enhancement
CN110909741A (en) Vehicle re-identification method based on background segmentation
CN108022254A (en) A kind of space-time contextual target tracking based on sign point auxiliary
CN107609571A (en) A kind of adaptive target tracking method based on LARK features
CN111027542A (en) Target detection method improved based on fast RCNN algorithm
CN111145221A (en) Target tracking algorithm based on multi-layer depth feature extraction
CN114821654A (en) Human hand detection method fusing local and depth space-time diagram network
CN116188555A (en) Monocular indoor depth estimation algorithm based on depth network and motion information
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN112070048B (en) Vehicle attribute identification method based on RDSNet
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN114067359A (en) Pedestrian detection method integrating human body key points and attention features of visible parts
CN110532960B (en) Target-assisted action recognition method based on graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination