CN114821654A - Human hand detection method fusing local and depth space-time diagram network - Google Patents
- Publication number: CN114821654A
- Application number: CN202210497768.9A
- Authority: CN (China)
- Prior art keywords: graph, image, frame, node, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 - Combinations of networks (G - Physics; G06 - Computing, calculating or counting; G06N - Computing arrangements based on specific computational models; G06N3/00 - based on biological models; G06N3/02 - Neural networks; G06N3/04 - Architecture, e.g. interconnection topology)
- G06N3/08 - Learning methods
Abstract
The invention provides a human hand detection method based on a space-time graph network fusing local and depth information. First, a multi-scale feature map of the video image is acquired with a conventional detector; then target candidate frames with high confidence are selected from the feature map, the image in each target candidate frame is input to the detector again to obtain its feature map, local key-information candidate frames are obtained from the activation values of sliding windows, a graph relation network is established over the candidate frames, and graph networks fusing the depth information and the preceding frames of the video are established for training; finally, the node attributes obtained by graph computation are added to the feature information of the original feature map, thereby enhancing recognition. The invention uses the graph relation network, depth information and the temporal context of the video to enhance image recognition, and solves the problem that a conventional detector cannot exploit image depth information, the order of video frames, or the association relations between targets.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a human hand detection method based on a space-time graph network fusing local and depth information.
Background
Multi-target video detection is a core problem of computer vision. Human hand detection, as a key technology of human-computer interaction, is increasingly applied in fields such as virtual reality, rehabilitation and remote control. A real-time video image sequence is input and the correct hand targets are computed by a detector, so that the detection data can be applied accurately to human-computer interaction. However, due to camera shake, motion blur and similar factors, a hand target cannot always be detected accurately from a single instantaneous image, so accurate hand detection in video remains a difficult point. Target detection refers to detecting the targets to be identified in everyday images and is a core problem of computer vision for which many successful algorithms have been produced, such as Faster R-CNN and YOLOX. The mainstream method in video detection uses CNNs to obtain video context information, but it only considers the feature information of the context and does not consider the contextual relationship information between video targets. Therefore, how to exploit the relations between video context information and hand targets, as well as the local information within hand targets on the video image, has long been a difficulty for image detection algorithms.
Disclosure of Invention
In view of the above, the present invention provides a human hand detection method based on a space-time graph network fusing local and depth information, which uses a graph relation network, depth information and video context information to enhance image recognition, and solves the problem that a conventional detector cannot exploit image depth information, the temporal order of video frames, or the association relations between targets.
In order to achieve the above purpose, the invention adopts the following technical scheme. A human hand detection method fusing a local and depth space-time graph network, characterized by comprising the following steps:
step S1: inputting the N frame sequence images and the label containing the hand position into a traditional detector, and acquiring a feature map of a first layer of a detection result output by the detector, namely acquiring the feature map of the N frame images;
step S2: performing the Decoupled Head operation independently on the feature map of each frame of image, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
step S3: inputting the image of each target candidate frame with the confidence coefficient larger than the learnable threshold V into a traditional detector, and outputting a feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value s of each sliding window as:

s = (1/(h·w)) · Σ_{(x,y)∈window} A(x,y)

where A(·) is the aggregated feature value of all channels at a point, and h and w are the height and width of the sliding window; the first P windows with the largest activation values are selected as local candidate frames using non-maximum suppression;
step S5: respectively constructing a graph relation network for each frame of image;
step S6: constructing respective adjacency matrixes for the graph relation network of each frame of image and carrying out graph calculation;
step S7: constructing an adjacency matrix from the graph relation network of the N frames of images and carrying out graph calculation;
step S8: adding the computed attribute of each graph node to the features of that node on the feature map to obtain an enhanced feature map; updating the enhanced feature map through the loss calculation of the conventional detector;
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the space-time graph relation network; each time, the real-time image and the previous N−1 frames are input (only the real-time image is input if the previous N−1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the computed attribute of each graph node to the features of that node on the feature map to obtain an enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head processing operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: and looping the steps S10-S12 until the real-time final prediction result of the video is output.
In a preferred embodiment: the step S5 specifically includes: taking the target candidate frames with confidence greater than the learnable threshold V in the feature map of each frame of image as target graph nodes, taking the feature information of each candidate frame in the feature map as the attribute of its target graph node, taking the P local candidate frames in each target as local graph nodes, and taking the feature information of each local candidate frame as the attribute of its local node, thereby constructing the node information of the graph relation network of each frame of image.
In a preferred embodiment: the step S6 specifically includes: converting the image into a depth map so as to obtain the depth information of the image; combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix, where K is the number of target graph nodes in the image and L is the number of local nodes in the image; each entry of the (K+L)×(K+L) matrix represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1·d_ij + w_2·m_ij

d_ij = √((x_i − x_j)² + (y_i − y_j)²) / √(W² + H²)

m_ij = δ(ω(I), i, j)

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters, ω(·) computes the depth map, δ(·) computes the depth difference of the regions where i and j are located, I is the input original image, and W and H are the width and height of the input image; the λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest λ_ij in each row are selected to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph. The attribute h_j of each graph node is updated to h′_i as follows:

h′_i = σ( Σ_{j∈N(i)} α_ij·W′·h_j )

where α_ij is the attention coefficient of graph nodes i and j, h_j is the attribute of a node j around node i, W′ is a learnable parameter, σ(·) denotes the graph convolution computation, and N(i) denotes the neighborhood of node i.
In a preferred embodiment: the step S7 specifically includes: converting each frame of image into a depth map, and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each entry of the adjacency matrix represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3·d_ij + w_4·k_ij + w_5·m_ij

k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters, and Fra(·) gives the frame number of the image where node i or j is located; the γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest γ_ij in each row are selected to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph.
Compared with the prior art, the invention has the following beneficial effects:
1) By introducing a graph relation network, depth information and local information, the relations between human hand targets and the local information within each hand target are used to enhance recognition, solving the problem that a conventional detector recognizes inaccurately from target features alone and improving the detection of hand targets;
2) By introducing the time sequence and combining the relations between preceding and following video frames, the correct hand target can be found more accurately using the temporal context of the video, further improving the detection effect.
Drawings
Fig. 1 is a flowchart of the human hand detection method based on a space-time graph network fusing local and depth information in a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of the space-time graph network architecture fusing local and depth information according to a preferred embodiment of the present invention.
FIG. 3 is a graph comparing the detection effect of the preferred embodiment of the present invention with other detection algorithms.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the human hand detection method based on a space-time graph network fusing local and depth information of the invention is implemented according to the following steps:
step S1: inputting N frames of sequence images and labels containing the human hand positions into a conventional detector (such as YOLO), and acquiring the feature map of the first layer of the detection result output by the detector, i.e. obtaining the feature maps of the N frames of images;
step S2: performing the Decoupled Head operation independently on the feature map of each frame of image, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
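The Decoupled Head step above can be sketched at shape level in plain NumPy. This is an illustrative sketch only: the weights are randomly initialized (untrained), the two 3×3 layers are shared as a trunk before the two 1×1 branches, and the helper names (conv1x1, conv3x3, decoupled_head) are assumptions, not the exact YOLOX head.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # 1x1 convolution = per-pixel linear map over channels: (C_in,H,W) -> (C_out,H,W)
    return np.tensordot(w, x, axes=([1], [0]))

def conv3x3(x, w):
    # naive 3x3 'same' convolution, stride 1, zero padding
    c_out, c_in, _, _ = w.shape
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, H, W))
    for i in range(3):
        for j in range(3):
            out += np.tensordot(w[:, :, i, j], xp[:, i:i+H, j:j+W], axes=([1], [0]))
    return out

def decoupled_head(feat):
    # feat: (C,H,W) feature map of one frame
    C, _, _ = feat.shape
    x = conv1x1(feat, rng.normal(0, 0.01, (256, C)))        # reduce channels to 256
    x = np.maximum(conv3x3(x, rng.normal(0, 0.01, (256, 256, 3, 3))), 0)
    x = np.maximum(conv3x3(x, rng.normal(0, 0.01, (256, 256, 3, 3))), 0)
    conf = 1 / (1 + np.exp(-conv1x1(x, rng.normal(0, 0.01, (1, 256)))))  # confidence branch
    pos = conv1x1(x, rng.normal(0, 0.01, (4, 256)))                      # position branch
    return conf, pos

conf, pos = decoupled_head(rng.normal(size=(512, 8, 8)))
```

With a (512, 8, 8) input the confidence branch yields a (1, 8, 8) map of values in (0, 1) and the position branch a (4, 8, 8) map of box regression outputs.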
step S3: inputting the image of each target candidate frame with the confidence coefficient larger than a learnable threshold V (initially 0.1) into a traditional detector, and outputting a feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value s of each sliding window as:

s = (1/(h·w)) · Σ_{(x,y)∈window} A(x,y)

where A(·) is the aggregated feature value of all channels at a point, and h and w are the height and width of the sliding window; the first P windows with the largest activation values are selected as local candidate frames using non-maximum suppression;
step S5: constructing a graph relation network for each frame of image, specifically: taking the target candidate frames with confidence greater than the learnable threshold V in the feature map of each frame of image as target graph nodes, taking the feature information of each candidate frame in the feature map as the attribute of its target graph node, taking the P local candidate frames in each target as local graph nodes, and taking the feature information of each local candidate frame as the attribute of its local node, thereby constructing the node information of the graph relation network of each frame of image;
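The node construction of step S5 can be sketched as plain Python data structures; the dictionary keys (targets, parts, box, feat) are illustrative assumptions, not the patent's notation.

```python
def build_graph_nodes(frames):
    # frames: list of per-frame dicts with detected targets and their local windows
    nodes = []
    for f_idx, frame in enumerate(frames):
        for t_idx, target in enumerate(frame["targets"]):
            # one target graph node per high-confidence candidate frame
            nodes.append({"kind": "target", "frame": f_idx, "target": t_idx,
                          "box": target["box"], "attr": target["feat"]})
            # P local graph nodes per target, one per local candidate frame
            for part in target["parts"]:
                nodes.append({"kind": "local", "frame": f_idx, "target": t_idx,
                              "box": part["box"], "attr": part["feat"]})
    return nodes
```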
step S6: constructing respective adjacency matrices for the graph relation network of each frame of image and carrying out graph calculation, specifically: converting the image into a depth map so as to obtain the depth information of the image; combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix (K is the number of target graph nodes in the image and L is the number of local nodes in the image); each entry of the adjacency matrix represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1·d_ij + w_2·m_ij

d_ij = √((x_i − x_j)² + (y_i − y_j)²) / √(W² + H²)

m_ij = δ(ω(I), i, j)

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters (initial values 0.5), ω(·) computes the depth map, δ(·) computes the depth difference of the regions where i and j are located, I is the input original image, and W and H are the width and height of the input image; the λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest λ_ij in each row are selected to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph. The attribute h_j of each graph node is updated to h′_i as follows:

h′_i = σ( Σ_{j∈N(i)} α_ij·W′·h_j )

where α_ij is the attention coefficient of graph nodes i and j, h_j is the attribute of a node j around node i, W′ is a learnable parameter, σ(·) denotes the graph convolution computation, and N(i) denotes the neighborhood of node i;
step S7: constructing an adjacency matrix from the graph relation networks of the N frames of images and carrying out graph calculation, specifically: converting each frame of image into a depth map, and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each entry of the adjacency matrix represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3·d_ij + w_4·k_ij + w_5·m_ij

k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters (initial values all 0.5), and Fra(·) gives the frame number of the image where node i or j is located; the γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest γ_ij in each row are selected to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph;
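The cross-frame relation γ_ij of step S7 can be sketched the same way; here k_ij is taken as the absolute frame-number difference so the matrix stays symmetric (the patent writes Fra(i) − Fra(j)), the distance and depth terms reuse the same simplifications as the single-frame sketch, and the weights default to the stated initial value 0.5.

```python
import numpy as np

def temporal_relation(centers, frames, depths, w3=0.5, w4=0.5, w5=0.5):
    # gamma_ij = w3*d_ij + w4*k_ij + w5*m_ij over nodes pooled from all N frames
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    k = np.abs(frames[:, None] - frames[None, :])   # hierarchical (frame-number) distance
    m = np.abs(depths[:, None] - depths[None, :])   # relative depth difference
    return w3 * d + w4 * k + w5 * m
```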
step S8: adding the computed attribute of each graph node to the features of that node on the feature map to obtain the enhanced feature map; updating the enhanced feature map through the loss calculation of the conventional detector;
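A minimal sketch of the feature-enhancement step: each node's graph-computed attribute (assumed here to be one value per channel) is added back onto the node's region of the feature map. Box coordinates are assumed to already be in feature-map units; both assumptions are illustrative.

```python
import numpy as np

def enhance(feat, nodes, updated):
    # feat: (C,H,W); nodes: dicts with a "box" = (x, y, w, h); updated: per-node (C,) attrs
    out = feat.copy()
    for node, attr in zip(nodes, updated):
        x, y, w, h = node["box"]
        # broadcast the per-channel attribute over the node's spatial region
        out[:, y:y+h, x:x+w] += attr[:, None, None]
    return out
```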
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the space-time graph relation network; each time, the real-time image and the previous N−1 frames are input (only the real-time image if the previous N−1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the computed attribute of each graph node to the features of that node on the feature map to obtain the enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head processing operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
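The final non-maximum suppression can be sketched in a few lines of Python; boxes are assumed to be (x1, y1, x2, y2) corner coordinates and the IoU threshold 0.5 is an illustrative default, not a value stated in the patent.

```python
def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    # keep the highest-scoring candidate frames, dropping overlapping lower-scored ones
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```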
step S13: and circulating the steps S10-S12 until the real-time final prediction result of the video is output.
The following is a specific embodiment of the present invention.
The application of the algorithm provided by the invention to the target detection specifically comprises the following steps:
1. Inputting a 3-frame sequence of images and labels containing the hand positions into the conventional detector YOLOX, and acquiring the feature map of the first layer of the detection result output by the detector, i.e. obtaining the feature maps of the 3 frames of images;
2. Performing the Decoupled Head operation independently on the feature map of each frame of image, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
3. inputting the image of each target candidate box with the confidence coefficient larger than a learnable threshold value V (initially 0.1) into a traditional detector, and outputting a feature map of each target;
4. Passing the feature map of each target through 5 groups of sliding windows of specific sizes ({3×3, 3×5, 6×6, 8×8, 7×10} respectively), and computing the activation value s of each sliding window as:

s = (1/(h·w)) · Σ_{(x,y)∈window} A(x,y)

where A(·) is the aggregated feature value of all channels at a point, and h and w are the height and width of the sliding window; the first 5 windows with the largest activation values are selected as local candidate frames using non-maximum suppression;
5. Constructing a graph relation network for each frame of image, specifically: taking the target candidate frames with confidence greater than the learnable threshold V in the feature map of each frame of image as target graph nodes, taking the feature information of each candidate frame in the feature map as the attribute of its target graph node, taking the 5 local candidate frames in each target as local graph nodes, and averaging the feature values of each channel in a local candidate frame as the attribute of that local node, thereby constructing the node information of the graph relation network of each frame of image;
6. Constructing respective adjacency matrices for the graph relation network of each frame of image and carrying out graph calculation, specifically: converting the image into a depth map so as to obtain the depth information of the image; combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix (K is the number of target graph nodes in the image and L is the number of local nodes in the image); each entry of the adjacency matrix represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1·d_ij + w_2·m_ij

d_ij = √((x_i − x_j)² + (y_i − y_j)²) / √(W² + H²)

m_ij = δ(ω(I), i, j)

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters (initial values 0.5), ω(·) computes the depth map, δ(·) computes the depth difference of the regions where i and j are located, I is the input original image, and W and H are the width and height of the input image (fixed to 1280×720); the λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the 7 points with the largest λ_ij in each row are selected to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph. The attribute h_j of each graph node is updated to h′_i as follows:

h′_i = σ( Σ_{j∈N(i)} α_ij·W′·h_j )

where α_ij is the attention coefficient of graph nodes i and j, h_j is the attribute of a node j around node i, W′ is a learnable parameter, σ(·) denotes the graph convolution computation, and N(i) denotes the neighborhood of node i;
7. Constructing an adjacency matrix from the graph relation networks of the 3 frames of images and carrying out graph calculation, specifically: converting each frame of image into a depth map, and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each entry of the adjacency matrix represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3·d_ij + w_4·k_ij + w_5·m_ij

k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters (initial values all 0.5), and Fra(·) gives the frame number of the image where node i or j is located; the γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the 8 points with the largest γ_ij in each row are selected to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph;
8. Adding the computed attribute of each graph node to the features of that node on the feature map to obtain the enhanced feature map; updating the enhanced feature map through the loss calculation of the conventional detector;
9. circulating S1-S8 until the network training is completed;
10. inputting a video into a detector of a fused space-time diagram relationship network, inputting a real-time image and a first 2 frames of images (only inputting the real-time image if the first 2 frames of images are not available) each time, and inputting 3 frames of images into the network;
11. outputting a graph network represented by each frame of image to perform graph calculation;
12. Adding the computed attribute of each graph node to the features of that node on the feature map to obtain the enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head processing operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
13. and looping the steps S10-S12 until the real-time final prediction result of the video is output.
In order to verify the effectiveness of the method, it is compared against the popular detection algorithm YOLOX. The video image dataset uses the hand images proposed by Bambach et al. in EgoHands: A Dataset for Hands in Complex Egocentric Interactions; the test image size is 1280×720. Fig. 3 shows the detection results of the algorithms on example images; the comparison shows that the method of the present invention achieves better detection by exploiting the local information of the hand targets and the time-series information.
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention that produce equivalent functional effects, without exceeding the scope of the technical scheme, belong to the protection scope of the present invention.
Claims (4)
1. A human hand detection method fusing a local and depth space-time graph network, characterized by comprising the following steps:
step S1: inputting the N frame sequence images and the label containing the hand position into a traditional detector, and acquiring a feature map of a first layer of a detection result output by the detector, namely acquiring the feature map of the N frame images;
step S2: performing the Decoupled Head operation independently on the feature map of each frame of image, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
step S3: inputting the image of each target candidate frame with the confidence coefficient larger than the learnable threshold V into a traditional detector, and outputting a feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value s of each sliding window as:

s = (1/(h·w)) · Σ_{(x,y)∈window} A(x,y)

where A(·) is the aggregated feature value of all channels at a point, and h and w are the height and width of the sliding window; the first P windows with the largest activation values are selected as local candidate frames using non-maximum suppression;
step S5: respectively constructing a graph relation network for each frame of image;
step S6: constructing respective adjacency matrixes for the graph relation network of each frame of image and carrying out graph calculation;
step S7: constructing an adjacency matrix from the graph relation network of the N frames of images and carrying out graph calculation;
step S8: adding the computed attribute of each graph node to the features of that node on the feature map to obtain an enhanced feature map; updating the enhanced feature map through the loss calculation of the conventional detector;
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into a detector of a fused space-time diagram relationship network, inputting a real-time image and a first N-1 frame image every time, only inputting the real-time image if the first N-1 frame image does not exist, and inputting the N frame image into the network;
step S11: outputting the graph network representation of each frame of image and performing graph calculation;
step S12: adding the calculated attributes of the graph nodes to the features of the corresponding nodes on the feature map to obtain an enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head processing operation, outputting the information of each candidate box, and outputting the real-time prediction result through non-maximum suppression;
step S13: looping steps S10-S12 until the final real-time prediction result of the video is output.
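The streaming input of steps S10-S13 (the real-time frame plus up to N-1 preceding frames) amounts to a bounded frame buffer. A sketch in plain Python; `stream_frames` is a hypothetical helper name, not from the patent:

```python
from collections import deque

def stream_frames(video_frames, N):
    """Sketch of the inference loop of steps S10-S13: each real-time frame
    is fed together with up to N-1 preceding frames; at the start of the
    video, fewer frames exist and only the available ones are used."""
    buffer = deque(maxlen=N)   # oldest frame is evicted automatically
    for frame in video_frames:
        buffer.append(frame)
        yield list(buffer)     # exactly N frames once the buffer is full
```

Each yielded batch would then go through the graph calculation (S11) and the decoupled head of the real-time frame (S12).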
2. The human hand detection method fusing local and depth space-time graph networks according to claim 1, characterized in that step S5 specifically comprises: taking each target candidate box whose confidence in the feature map of each frame of image is greater than the learnable threshold V as a target graph node, taking the feature information of the candidate box in the feature map as the attribute of the target graph node, taking the P local candidate boxes of each target as local graph nodes, taking the feature information of the local candidate boxes as the attributes of the local nodes, and thereby constructing the node information of the graph relation network of each frame of image.
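The node construction of claim 2 amounts to a flat list of typed nodes whose attributes are feature-map crops. A minimal sketch with hypothetical field names (`feat`, `locals` are illustrative, not from the patent):

```python
def build_frame_graph(targets):
    """Sketch of claim 2: each high-confidence target box becomes a
    target node, and each of its P local candidate boxes becomes a
    local node; the node attribute is the box's feature information."""
    nodes = []
    for t_id, target in enumerate(targets):
        nodes.append({"type": "target", "target": t_id, "attr": target["feat"]})
        for local_feat in target["locals"]:
            nodes.append({"type": "local", "target": t_id, "attr": local_feat})
    return nodes
```

With K targets and P local boxes each, this yields the K + L nodes (L = K·P) that the adjacency matrices of claims 3 and 4 are built over.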
3. The human hand detection method fusing local and depth space-time graph networks according to claim 1, characterized in that step S6 specifically comprises: converting the image into a depth map so as to obtain the depth information of the image; combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K + L) × (K + L) adjacency matrix, where K is the number of target graph nodes in the image and L is the number of local nodes in the image; in the adjacency matrix, each entry represents the relationship between a pair of graph nodes, and λ_ij denotes the relationship between the regions represented by nodes i and j:
λ_ij = w_1 · d_ij + w_2 · m_ij
m_ij = δ(ω(I), i, j)
wherein x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate boxes, w_1 and w_2 are learnable parameters, ω(·) computes the depth map, δ(·) computes the depth difference of the regions where i and j are located, I is the input original image, and W and H are the width and height of the input image; λ_ij between local nodes belonging to different targets is defined as ∞, and a sparsification operation is performed on the adjacency matrix: starting from the first row, the F points with the largest λ_ij in each row are selected to construct a sparse adjacency matrix for each frame of image; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i, the updating process being as follows:
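A minimal NumPy sketch of claim 3's adjacency construction and row-wise sparsification. The exact normalization of d_ij is not fully specified above (x_i, y_i, W, H are named but the formula is not reproduced), so a plain Euclidean distance between box centers is assumed, and the ∞ entries between cross-target local nodes are omitted:

```python
import numpy as np

def spatial_adjacency(centers, depths, w1, w2, F):
    """Sketch of claim 3: lambda_ij = w1*d_ij + w2*m_ij combines the
    pixel distance between node centers with the relative depth; each
    row then keeps only its F largest entries (sparsification)."""
    n = len(centers)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d_ij = np.linalg.norm(np.asarray(centers[i]) - np.asarray(centers[j]))
            m_ij = abs(depths[i] - depths[j])   # relative depth of the two regions
            adj[i, j] = w1 * d_ij + w2 * m_ij
    # keep the F largest entries of each row, zero out the rest
    sparse = np.zeros_like(adj)
    for i in range(n):
        keep = np.argsort(adj[i])[-F:]
        sparse[i, keep] = adj[i, keep]
    return sparse
```

The sparse matrix would then drive the Graph Attention Network update of the node attributes (whose formula is not reproduced in the claim text above).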
4. The human hand detection method fusing local and depth space-time graph networks according to claim 1, characterized in that step S7 specifically comprises: converting each frame of image into a depth map, and combining the pixel distance d_ij, the hierarchical (frame-level) distance k_ij and the relative depth m_ij between nodes to generate a (K + L) × (K + L) adjacency matrix; in the adjacency matrix, each entry represents the relationship between a pair of graph nodes, and γ_ij, denoting the relationship between the regions represented by nodes i and j, is:
γ_ij = w_3 · d_ij + w_4 · k_ij + w_5 · m_ij
k_ij = Fra(i) − Fra(j)
wherein w_3, w_4 and w_5 are learnable parameters, and Fra(·) is the frame number of the image in which node i or j is located; γ_ij between local nodes belonging to different targets is defined as ∞, and a sparsification operation is performed on the adjacency matrix: starting from the first row, the F points with the largest γ_ij in each row are selected to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph.
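Claim 4's cross-frame edge weight γ_ij differs from claim 3 only by the frame-distance term k_ij. A sketch, assuming |Fra(i) − Fra(j)| so the weight does not depend on node order (the claim writes a signed difference):

```python
import numpy as np

def temporal_weight(centers, depths, frames, w3, w4, w5):
    """Sketch of claim 4: gamma_ij = w3*d_ij + w4*k_ij + w5*m_ij,
    where k_ij is the frame-index distance between the frames the two
    nodes come from. Row-wise top-F sparsification (as in claim 3)
    is omitted here."""
    n = len(centers)
    gamma = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d_ij = np.linalg.norm(np.asarray(centers[i]) - np.asarray(centers[j]))
            k_ij = abs(frames[i] - frames[j])   # hierarchical (frame) distance
            m_ij = abs(depths[i] - depths[j])   # relative depth
            gamma[i, j] = w3 * d_ij + w4 * k_ij + w5 * m_ij
    return gamma
```

With w_4 learnable, the network can weigh how strongly temporally distant frames influence each other.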
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210497768.9A CN114821654A (en) | 2022-05-09 | 2022-05-09 | Human hand detection method fusing local and depth space-time diagram network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114821654A true CN114821654A (en) | 2022-07-29 |
Family
ID=82513427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210497768.9A Pending CN114821654A (en) | 2022-05-09 | 2022-05-09 | Human hand detection method fusing local and depth space-time diagram network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114821654A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109977812A (en) * | 2019-03-12 | 2019-07-05 | 南京邮电大学 | A kind of Vehicular video object detection method based on deep learning |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | 南京邮电大学 | Vehicle-mounted video target detection method based on deep learning |
CN111832393A (en) * | 2020-05-29 | 2020-10-27 | 东南大学 | Video target detection method and device based on deep learning |
CN111931787A (en) * | 2020-07-22 | 2020-11-13 | 杭州电子科技大学 | RGBD significance detection method based on feature polymerization |
CN112115783A (en) * | 2020-08-12 | 2020-12-22 | 中国科学院大学 | Human face characteristic point detection method, device and equipment based on deep knowledge migration |
Non-Patent Citations (2)
Title |
---|
LIU Weihua: "Moving human hand detection and fingertip tracking algorithm based on depth images", Journal of Computer Applications (计算机应用), vol. 34, no. 5, 10 May 2014 (2014-05-10), pages 1442-1448 *
RUAN Hang; WANG Lichun: "Vehicle detection and classification based on feature maps", Computer Technology and Development (计算机技术与发展), no. 11, 29 June 2018 (2018-06-29), pages 45-49 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107529650B (en) | Closed loop detection method and device and computer equipment | |
CN110097568B (en) | Video object detection and segmentation method based on space-time dual-branch network | |
CN112149459B (en) | Video saliency object detection model and system based on cross attention mechanism | |
CN112418095A (en) | Facial expression recognition method and system combined with attention mechanism | |
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion | |
CN112183501B (en) | Depth counterfeit image detection method and device | |
WO2019136591A1 (en) | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network | |
CN110853074B (en) | Video target detection network system for enhancing targets by utilizing optical flow | |
CN111401293B (en) | Gesture recognition method based on Head lightweight Mask scanning R-CNN | |
CN107833239B (en) | Optimization matching target tracking method based on weighting model constraint | |
CN112434608B (en) | Human behavior identification method and system based on double-current combined network | |
CN110378195B (en) | Multi-target tracking method based on histogram cache method | |
CN111401207B (en) | Human body action recognition method based on MARS depth feature extraction and enhancement | |
CN110909741A (en) | Vehicle re-identification method based on background segmentation | |
CN108022254A (en) | A kind of space-time contextual target tracking based on sign point auxiliary | |
CN107609571A (en) | A kind of adaptive target tracking method based on LARK features | |
CN111027542A (en) | Target detection method improved based on fast RCNN algorithm | |
CN111145221A (en) | Target tracking algorithm based on multi-layer depth feature extraction | |
CN114821654A (en) | Human hand detection method fusing local and depth space-time diagram network | |
CN116188555A (en) | Monocular indoor depth estimation algorithm based on depth network and motion information | |
CN112487927B (en) | Method and system for realizing indoor scene recognition based on object associated attention | |
CN112070048B (en) | Vehicle attribute identification method based on RDSNet | |
CN112084371B (en) | Movie multi-label classification method and device, electronic equipment and storage medium | |
CN114067359A (en) | Pedestrian detection method integrating human body key points and attention features of visible parts | |
CN110532960B (en) | Target-assisted action recognition method based on graph neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||