CN114821654A - Human hand detection method based on a spatio-temporal graph network fusing local and depth information - Google Patents

Human hand detection method based on a spatio-temporal graph network fusing local and depth information

Info

Publication number
CN114821654A
Authority
CN
China
Prior art keywords
graph
image
frame
node
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210497768.9A
Other languages
Chinese (zh)
Inventor
陈飞
孙骞
刘莞玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210497768.9A priority Critical patent/CN114821654A/en
Publication of CN114821654A publication Critical patent/CN114821654A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human hand detection method based on a spatio-temporal graph network fusing local and depth information. First, a conventional detector is used to acquire multi-scale feature maps of the video images; then, target candidate frames with high confidence are selected from the feature maps, the image inside each target candidate frame is fed into the detector again to obtain its feature map, local key-information candidate frames are obtained from the activation values of sliding windows, a graph relation network is built for these candidate frames, and graph networks are constructed for training by fusing the depth information and the preceding frames of the video; finally, the node attributes obtained by the graph computation are added to the feature information of the original feature map, thereby enhancing recognition. The invention uses the graph relation network, depth information and the temporal context of the video to enhance image recognition, and solves the problem that a conventional detector cannot exploit image depth information, the temporal order of video frames, or the association relations between targets.

Description

Human hand detection method based on a spatio-temporal graph network fusing local and depth information
Technical Field
The invention relates to the technical field of image processing, and in particular to a human hand detection method based on a spatio-temporal graph network fusing local and depth information.
Background
Multi-target video detection is a core problem of computer vision. Human hand detection, as a key technology of human-computer interaction, is increasingly applied in fields such as virtual reality, rehabilitation and remote control. A real-time video image sequence is input and the correct hand targets are computed by a detector, so that the detection results can be applied accurately in human-computer interaction. However, because of camera shake, motion blur and similar effects, a single captured instantaneous image often does not allow the hand target to be detected accurately, so accurately detecting hand targets in video has become a difficult problem. Target detection refers to detecting the targets to be recognized in everyday images and is a core problem of computer vision; many successful algorithms have been proposed for it, such as Fast R-CNN and YOLOX. The mainstream approach in video detection uses CNNs to obtain video context information, but it only considers the contextual feature information and ignores the contextual relationship information between video targets. Therefore, how to exploit the relationship between video context information and the human hand targets, as well as the local information inside the hand targets on the video image, has long been a difficult problem for image detection algorithms.
Disclosure of Invention
In view of the above, the present invention provides a human hand detection method based on a spatio-temporal graph network fusing local and depth information, which uses a graph relation network, depth information and video context information to enhance image recognition, and solves the problem that a conventional detector cannot exploit image depth information, the temporal order of video frames, or the association relations between targets.
In order to achieve the above purpose, the invention adopts the following technical scheme: a human hand detection method based on a spatio-temporal graph network fusing local and depth information, characterized by comprising the following steps:
step S1: inputting an N-frame image sequence and labels containing the human hand positions into a conventional detector, and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the N frames of images;
step S2: performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
step S3: inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V into the conventional detector, and outputting the feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the P windows with the largest activation values as local candidate frames;
step S5: respectively constructing a graph relation network for each frame of image;
step S6: constructing a respective adjacency matrix for the graph relation network of each frame of image and carrying out graph calculation;
step S7: constructing an adjacency matrix from the graph relation network of the N frames of images and carrying out graph calculation;
step S8: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; updating the enhanced feature map through the loss computation of the conventional detector;
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous N-1 frames (or only the real-time image if the previous N-1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: and looping the steps S10-S12 until the real-time final prediction result of the video is output.
In a preferred embodiment: the step S5 specifically includes: and taking the target candidate frame with the confidence degree larger than the learnable threshold value V in the feature graph of each frame of image as a target graph node, taking the feature information of the candidate frame in the feature graph as the attribute of the target graph node, taking P local candidate frames in each target as graph local nodes, taking the feature information of the local candidate frames as the attribute of the local nodes, and constructing the node information of the graph relation network of each frame of image.
In a preferred embodiment: the step S6 specifically includes: converting the image into a depth map so as to obtain depth map information of the image; and combined with the pixel distance d between the nodes ij And relative depth m ij Generating a (K + L) × (K + L) adjacency matrix, wherein K is the number of target graph nodes in the image, and L is the number of local nodes in the imageAmount in which (K + L) × (K + L) represents the relationship between each graph node, and λ is set ij Representing the relationship between the areas represented by the i and j nodes:
λ ij =(w 1 d ij +w 2 m ij )
m ij =δ(ω(I),i,j)
Figure BDA0003633486260000041
wherein x is i 、y i And x j 、y j Coordinates of center points, w, representing two candidate frames, respectively 1 And w 2 For learnable parameters, ω (-) is a calculated depth map, δ (-) is a depth difference of a region where I and j are calculated, I is an input original image, and W and H are the width and height of the input image; let λ be between local nodes of different targets ij Defining the value as ∞ and performing thinning operation on the adjacent matrix: starting from the first row, each row λ is selected ij Constructing a sparse adjacent matrix of each frame of image by using the largest front F points; and using Graph Attention network Graph Attention Networks to describe the influence and propagation between each adjacent node in the Graph; attribute of each graph node h j Is updated to be h' i The updating process is as follows:
Figure BDA0003633486260000042
wherein alpha is ij Attention coefficients for graph nodes i and j, h j Is the attribute of j nodes around the i node, W' is a learnable parameter, σ (-) is expressed as the process of graph convolution calculation,
Figure BDA0003633486260000043
representing any node in the neighborhood of node i.
In a preferred embodiment: the step S7 specifically includes: converting each frame of image into a depth map and combining the pixel distance d between each node ij And a hierarchical distance k ij And relative depth m ij Generating an adjacency matrix of (K + L) × (K + L); in the adjacency matrix, (K + L) × (K + L) represents the relationship between each graph node, and γ is set ij Representing the relationship between the areas represented by the i and j nodes:
γ ij =(w 3 d ij +w 4 k ij +w 5 m ij )
k ij =Fra(i)-Fra(j)
wherein, w 3 、w 4 And w 5 Fra (-) is the frame number of the image where the computing node i or j is located; let λ be between local nodes of different targets ij Defining the value as ∞ and performing thinning operation on the adjacent matrix: starting from the first row, each row γ is selected ij Constructing a sparse adjacent matrix with hierarchical information from the largest first F points; and uses Graph Attention Networks to describe the impact and propagation between each adjacent node in the Graph.
Compared with the prior art, the invention has the following beneficial effects:
1) by introducing a graph relation network, depth information and local information, the relationships between human hand targets and the local information inside the hand targets are used reasonably to enhance recognition, solving the problem that a conventional detector recognizes inaccurately because it relies only on target features, and better enhancing the detection of hand targets;
2) by introducing the time sequence and reasonably combining the relationships between preceding and following video frames, the temporal context of the video can be used during detection to find the correct hand target more accurately, which further enhances the detection effect.
Drawings
Fig. 1 is a flowchart of the human hand detection method based on a spatio-temporal graph network fusing local and depth information in a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of the network architecture fusing local and depth spatio-temporal graphs according to a preferred embodiment of the present invention.
Fig. 3 compares the detection effect of the preferred embodiment of the present invention with that of other detection algorithms.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the human hand detection method based on a spatio-temporal graph network fusing local and depth information of the invention is implemented according to the following steps:
step S1: inputting an N-frame image sequence and labels containing the human hand positions into a conventional detector (such as YOLO), and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the N frames of images;
step S2: performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
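For illustration only, the Decoupled Head operation of step S2 may be sketched as follows (a minimal PyTorch sketch; the 256-channel 1×1 layer and the 3×3/1×1 layers are taken from the description above, while the arrangement of the layers into parallel confidence and position branches, the module name and the activation functions are assumptions):

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Illustrative sketch of step S2: a 1x1 conv reduces channels to 256, then
    3x3 convs followed by 1x1 convs output the confidence and the position of
    each target candidate frame."""
    def __init__(self, in_channels):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_channels, 256, 1), nn.ReLU(inplace=True))
        self.conf_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),   # confidence of each candidate frame
        )
        self.pos_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 4, 1),   # (x, y, w, h) of each candidate frame
        )

    def forward(self, feature_map):
        x = self.stem(feature_map)
        return self.conf_branch(x), self.pos_branch(x)
```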
step S3: inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V (initially 0.1) into the conventional detector, and outputting the feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the P windows with the largest activation values as local candidate frames;
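As an informal illustration of step S4 (a sketch assuming PyTorch; the aggregation A(·) is taken here as the channel-wise sum, and the non-maximum suppression over overlapping windows is replaced by a plain top-P selection for brevity — both are assumptions):

```python
import torch
import torch.nn.functional as F

def local_candidate_windows(feature_map, window_sizes, top_p):
    """Sketch of step S4: score every h x w sliding window by the mean of the
    aggregated channel values A(.) inside it, and keep the windows with the
    largest activation values as local candidate frames."""
    # feature_map: (C, H, W) feature map of one target candidate frame
    aggregated = feature_map.sum(dim=0)[None, None]                   # A(.): (1, 1, H, W)
    scored = []
    for (h, w) in window_sizes:
        act = F.avg_pool2d(aggregated, kernel_size=(h, w), stride=1)  # mean of A(.) per window
        out_w = act.shape[-1]
        vals, idx = act.flatten().topk(min(top_p, act.numel()))
        for v, i in zip(vals.tolist(), idx.tolist()):
            y, x = divmod(i, out_w)
            scored.append((v, (x, y, w, h)))                          # (activation value, window box)
    scored.sort(key=lambda s: s[0], reverse=True)                     # plain top-P instead of NMS
    return scored[:top_p]
```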
step S5: constructing a graph relation network for each frame of image, specifically: taking each target candidate frame whose confidence is greater than the learnable threshold V in the feature map of the frame as a target graph node, taking the feature information of that candidate frame in the feature map as the attribute of the target graph node, taking the P local candidate frames in each target as local graph nodes, taking the feature information of the local candidate frames as the attributes of the local nodes, and thereby constructing the node information of the graph relation network of each frame of image;
step S6: constructing an adjacency matrix for the graph relation network of each frame of image and performing graph computation, specifically: converting the image into a depth map so as to obtain the depth information of the image, and combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix (K is the number of target graph nodes in the image and L is the number of local nodes in the image); each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1 d_ij + w_2 m_ij
m_ij = δ(ω(I), i, j)
d_ij = sqrt( ((x_i − x_j)/W)² + ((y_i − y_j)/H)² )

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters (both initialized to 0.5), ω(·) computes the depth map, δ(·) computes the depth difference between the regions of nodes i and j, I is the input original image, and W and H are the width and height of the input image; λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest λ_ij in each row are kept to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i as follows:

h'_i = σ( Σ_{j∈N_i} α_ij W' h_j )

where α_ij is the attention coefficient between graph nodes i and j, h_j is the attribute of a node j around node i, W' is a learnable parameter, σ(·) denotes the graph convolution computation, and N_i denotes the neighborhood of node i;
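A rough sketch of how such an adjacency matrix could be assembled is given below (assuming PyTorch; the normalized form of d_ij, the per-node depth lookup and the variable names are assumptions). The attention update h'_i = σ(Σ α_ij W' h_j) over the resulting sparse graph could then be carried out with an off-the-shelf graph-attention layer.

```python
import math
import torch

def spatial_adjacency(centers, depths, target_of, is_local, w1, w2, keep_f, img_w, img_h):
    """Sketch of step S6: relation weights lambda_ij from pixel distance d_ij and
    relative depth m_ij, with local nodes of different targets set to infinity,
    followed by row-wise sparsification that keeps the F largest entries."""
    n = len(centers)
    lam = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            (xi, yi), (xj, yj) = centers[i], centers[j]
            d_ij = math.hypot((xi - xj) / img_w, (yi - yj) / img_h)  # assumed normalized distance
            m_ij = abs(depths[i] - depths[j])                        # depth difference delta(.)
            lam[i, j] = w1 * d_ij + w2 * m_ij
            if is_local[i] and is_local[j] and target_of[i] != target_of[j]:
                lam[i, j] = float("inf")                             # as defined in the text
    keep = lam.topk(min(keep_f, n), dim=1).indices                   # F largest lambda_ij per row
    adj = torch.zeros_like(lam)
    adj.scatter_(1, keep, 1.0)
    return adj
```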
step S7: constructing an adjacency matrix from the graph relation networks of the N frames of images and performing graph computation, specifically: converting each frame of image into a depth map and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3 d_ij + w_4 k_ij + w_5 m_ij
k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters (all initialized to 0.5) and Fra(·) is the frame number of the image containing node i or j; γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest γ_ij in each row are kept to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph;
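For comparison with step S6, the inter-frame relation weight of step S7 might look as follows (again only a sketch; the normalized distance and the per-node depth values are assumptions):

```python
import math

def temporal_relation(center_i, center_j, depth_i, depth_j, frame_i, frame_j,
                      w3, w4, w5, img_w, img_h):
    """Sketch of step S7: gamma_ij combines pixel distance d_ij, hierarchical
    (frame) distance k_ij and relative depth m_ij across the N input frames."""
    (xi, yi), (xj, yj) = center_i, center_j
    d_ij = math.hypot((xi - xj) / img_w, (yi - yj) / img_h)  # assumed normalized distance
    k_ij = frame_i - frame_j                                  # Fra(i) - Fra(j)
    m_ij = abs(depth_i - depth_j)                             # relative depth
    return w3 * d_ij + w4 * k_ij + w5 * m_ij
```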
step S8: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map, and updating the enhanced feature map through the loss computation of the conventional detector;
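Step S8 can be pictured with the following sketch (how a node's updated attribute is broadcast back over its region of the feature map is an assumption; the description only states that the attribute is added to the node's features):

```python
def enhance_feature_map(feature_map, node_boxes, node_attrs):
    """Sketch of step S8: add each graph node's updated attribute back onto the
    region of the feature map that the node was taken from."""
    # feature_map: (C, H, W); node_boxes[i]: (x0, y0, x1, y1); node_attrs[i]: (C,)
    enhanced = feature_map.clone()
    for (x0, y0, x1, y1), attr in zip(node_boxes, node_attrs):
        enhanced[:, y0:y1, x0:x1] += attr[:, None, None]   # broadcast over the node's region
    return enhanced
```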
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous N-1 frames (or only the real-time image if the previous N-1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: looping steps S10-S12 until the real-time final prediction result of the video is output.
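Putting steps S10-S13 together, the real-time inference loop could be organized as in the following sketch (detector.backbone, detector.head, detector.nms and graph_module are hypothetical names standing in for the conventional detector and the spatio-temporal graph computation described above):

```python
def detect_video(frames, detector, graph_module, n=3):
    """Sketch of steps S10-S13: feed the real-time frame together with the
    previous N-1 frames (when available), run the graph computation, and decode
    the enhanced features of the real-time frame."""
    results = []
    for t in range(len(frames)):
        clip = frames[max(0, t - (n - 1)): t + 1]       # real-time frame + previous N-1 frames
        feats = [detector.backbone(f) for f in clip]    # step S1: per-frame feature maps
        enhanced = graph_module(feats)                  # steps S11-S12: graph-enhanced features
        preds = detector.head(enhanced[-1])             # Decoupled Head on the real-time frame
        results.append(detector.nms(preds))             # non-maximum suppression
    return results
```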
The following is a specific embodiment of the present invention.
The application of the algorithm provided by the invention to target detection specifically comprises the following steps:
1. inputting a 3-frame image sequence and labels containing the human hand positions into the conventional detector YOLOX, and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the 3 frames of images;
2. performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
3. inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V (initially 0.1) into the conventional detector, and outputting the feature map of each target;
4. passing the feature map of each target through 5 groups of sliding windows of specific sizes ({3×3, 3×5, 6×6, 8×8, 7×10}, respectively), and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the 5 windows with the largest activation values as local candidate frames;
5. constructing a graph relation network for each frame of image, specifically: taking each target candidate frame whose confidence is greater than the learnable threshold V in the feature map of the frame as a target graph node, taking the feature information of that candidate frame in the feature map as the attribute of the target graph node, taking the 5 local candidate frames in each target as local graph nodes, averaging the feature values of each channel inside a local candidate frame to obtain the attribute of the corresponding local node, and thereby constructing the node information of the graph relation network of each frame of image;
6. constructing an adjacency matrix for the graph relation network of each frame of image and performing graph computation, specifically: converting the image into a depth map so as to obtain the depth information of the image, and combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix (K is the number of target graph nodes in the image and L is the number of local nodes in the image); each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1 d_ij + w_2 m_ij
m_ij = δ(ω(I), i, j)
d_ij = sqrt( ((x_i − x_j)/W)² + ((y_i − y_j)/H)² )

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters (both initialized to 0.5), ω(·) computes the depth map, δ(·) computes the depth difference between the regions of nodes i and j, I is the input original image, and W and H are the width and height of the input image (fixed to 1280×720); λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the 7 points with the largest λ_ij in each row are kept to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i as follows:

h'_i = σ( Σ_{j∈N_i} α_ij W' h_j )

where α_ij is the attention coefficient between graph nodes i and j, h_j is the attribute of a node j around node i, W' is a learnable parameter, σ(·) denotes the graph convolution computation, and N_i denotes the neighborhood of node i;
7. constructing an adjacency matrix from the graph relation networks of the 3 frames of images and performing graph computation, specifically: converting each frame of image into a depth map and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3 d_ij + w_4 k_ij + w_5 m_ij
k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters (all initialized to 0.5) and Fra(·) is the frame number of the image containing node i or j; γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the 8 points with the largest γ_ij in each row are kept to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are then used to describe the influence and propagation between adjacent nodes in the graph;
8. adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map, and updating the enhanced feature map through the loss computation of the conventional detector;
9. looping S1-S8 until the network training is completed;
10. inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous 2 frames (or only the real-time image if the previous 2 frames do not exist), so that 3 frames of images are input into the network;
11. outputting a graph network represented by each frame of image to perform graph calculation;
12. adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain the enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
13. looping steps S10-S12 until the real-time final prediction result of the video is output.
In order to verify the effectiveness of the method, it is compared with the popular detection algorithm YOLOX. The video image dataset uses the hand images proposed by Bambach et al. in EgoHands: A Dataset for Hands in Complex Egocentric Interactions, with a test image size of 1280×720. Fig. 3 shows the detection results of the algorithms on a sample image; the comparison shows that the method of the present invention, by exploiting the local information of the hand targets and the time-series information, achieves a better detection effect.
The above are preferred embodiments of the present invention; any changes that are made according to the technical scheme of the present invention and that produce functional effects without exceeding the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (4)

1. A human hand detection method based on a spatio-temporal graph network fusing local and depth information, characterized by comprising the following steps:
step S1: inputting an N-frame image sequence and labels containing the human hand positions into a conventional detector, and acquiring the first-layer feature map of the detection result output by the detector, i.e., obtaining the feature maps of the N frames of images;
step S2: performing a Decoupled Head operation independently on the feature map of each frame, specifically: reducing the number of channels to 256 with a 1×1 convolutional layer, and obtaining the confidence and position information of each target candidate frame through two 3×3 convolutional layers and a 1×1 convolutional layer;
step S3: inputting the image of each target candidate frame whose confidence is greater than a learnable threshold V into the conventional detector, and outputting the feature map of each target;
step S4: passing the feature map of each target through 7 groups of sliding windows of specific sizes, and computing the activation value of each sliding window as

activation value = (1/(h×w)) Σ_{(x,y)∈window} A(x, y)

where A(·) is the aggregated feature value of all channels at a point and h and w are the length and width of the sliding window; non-maximum suppression is then used to select the P windows with the largest activation values as local candidate frames;
step S5: respectively constructing a graph relation network for each frame of image;
step S6: constructing a respective adjacency matrix for the graph relation network of each frame of image and carrying out graph calculation;
step S7: constructing an adjacency matrix from the graph relation network of the N frames of images and carrying out graph calculation;
step S8: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; updating the enhanced feature map through the loss computation of the conventional detector;
step S9: looping S1-S8 until network training is completed;
step S10: inputting a video into the detector fused with the spatio-temporal graph relation network, each time inputting the real-time image together with the previous N-1 frames (or only the real-time image if the previous N-1 frames do not exist), so that N frames of images are input into the network;
step S11: outputting a graph network represented by each frame of image to perform graph calculation;
step S12: adding the attribute of each graph node obtained by the graph computation to the features of the corresponding node on the feature map to obtain an enhanced feature map; taking the feature map of the real-time image, performing the Decoupled Head operation, outputting the information of each candidate frame, and outputting the real-time prediction result through non-maximum suppression;
step S13: and looping the steps S10-S12 until the real-time final prediction result of the video is output.
2. The human hand detection method based on a spatio-temporal graph network fusing local and depth information according to claim 1, characterized in that step S5 specifically includes: taking each target candidate frame whose confidence is greater than the learnable threshold V in the feature map of each frame of image as a target graph node, taking the feature information of that candidate frame in the feature map as the attribute of the target graph node, taking the P local candidate frames in each target as local graph nodes, taking the feature information of the local candidate frames as the attributes of the local nodes, and constructing the node information of the graph relation network of each frame of image.
3. The human hand detection method based on a spatio-temporal graph network fusing local and depth information according to claim 1, characterized in that step S6 specifically includes: converting the image into a depth map so as to obtain the depth information of the image, and combining the pixel distance d_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix, where K is the number of target graph nodes in the image and L is the number of local nodes in the image; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with λ_ij denoting the relationship between the regions represented by nodes i and j:

λ_ij = w_1 d_ij + w_2 m_ij
m_ij = δ(ω(I), i, j)
d_ij = sqrt( ((x_i − x_j)/W)² + ((y_i − y_j)/H)² )

where x_i, y_i and x_j, y_j are the center-point coordinates of the two candidate frames, w_1 and w_2 are learnable parameters, ω(·) computes the depth map, δ(·) computes the depth difference between the regions of nodes i and j, I is the input original image, and W and H are the width and height of the input image; λ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest λ_ij in each row are kept to construct the sparse adjacency matrix of each frame of image; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph; the attribute h_i of each graph node is updated to h'_i as follows:

h'_i = σ( Σ_{j∈N_i} α_ij W' h_j )

where α_ij is the attention coefficient between graph nodes i and j, h_j is the attribute of a node j around node i, W' is a learnable parameter, σ(·) denotes the graph convolution computation, and N_i denotes the neighborhood of node i.
4. The human hand detection method based on a spatio-temporal graph network fusing local and depth information according to claim 1, characterized in that step S7 specifically includes: converting each frame of image into a depth map and combining the pixel distance d_ij, the hierarchical distance k_ij and the relative depth m_ij between nodes to generate a (K+L)×(K+L) adjacency matrix; each of the (K+L)×(K+L) entries represents the relationship between a pair of graph nodes, with γ_ij denoting the relationship between the regions represented by nodes i and j:

γ_ij = w_3 d_ij + w_4 k_ij + w_5 m_ij
k_ij = Fra(i) − Fra(j)

where w_3, w_4 and w_5 are learnable parameters and Fra(·) is the frame number of the image containing node i or j; γ_ij between local nodes of different targets is defined as ∞, and the adjacency matrix is sparsified: starting from the first row, the F points with the largest γ_ij in each row are kept to construct a sparse adjacency matrix with hierarchical information; Graph Attention Networks are used to describe the influence and propagation between adjacent nodes in the graph.
CN202210497768.9A 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network Pending CN114821654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210497768.9A CN114821654A (en) 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210497768.9A CN114821654A (en) 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network

Publications (1)

Publication Number Publication Date
CN114821654A true CN114821654A (en) 2022-07-29

Family

ID=82513427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210497768.9A Pending CN114821654A (en) 2022-05-09 2022-05-09 Human hand detection method fusing local and depth space-time diagram network

Country Status (1)

Country Link
CN (1) CN114821654A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977812A (en) * 2019-03-12 2019-07-05 南京邮电大学 A kind of Vehicular video object detection method based on deep learning
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111832393A (en) * 2020-05-29 2020-10-27 东南大学 Video target detection method and device based on deep learning
CN111931787A (en) * 2020-07-22 2020-11-13 杭州电子科技大学 RGBD significance detection method based on feature polymerization
CN112115783A (en) * 2020-08-12 2020-12-22 中国科学院大学 Human face characteristic point detection method, device and equipment based on deep knowledge migration

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Weihua, "Moving human hand detection and fingertip tracking based on depth images", Journal of Computer Applications, vol. 34, no. 5, 10 May 2014 (2014-05-10), pages 1442-1448 *
Ruan Hang; Wang Lichun, "Vehicle detection and classification based on feature maps", Computer Technology and Development, no. 11, 29 June 2018 (2018-06-29), pages 45-49 *

Similar Documents

Publication Publication Date Title
CN107529650B (en) Closed loop detection method and device and computer equipment
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN112149459B (en) Video saliency object detection model and system based on cross attention mechanism
CN112418095A (en) Facial expression recognition method and system combined with attention mechanism
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN112183501B (en) Depth counterfeit image detection method and device
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN110853074B (en) Video target detection network system for enhancing targets by utilizing optical flow
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN110378195B (en) Multi-target tracking method based on histogram cache method
CN111401207B (en) Human body action recognition method based on MARS depth feature extraction and enhancement
CN110909741A (en) Vehicle re-identification method based on background segmentation
CN108022254A (en) A kind of space-time contextual target tracking based on sign point auxiliary
CN107609571A (en) A kind of adaptive target tracking method based on LARK features
CN111027542A (en) Target detection method improved based on fast RCNN algorithm
CN111145221A (en) Target tracking algorithm based on multi-layer depth feature extraction
CN114821654A (en) Human hand detection method fusing local and depth space-time diagram network
CN116188555A (en) Monocular indoor depth estimation algorithm based on depth network and motion information
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN112070048B (en) Vehicle attribute identification method based on RDSNet
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN114067359A (en) Pedestrian detection method integrating human body key points and attention features of visible parts
CN110532960B (en) Target-assisted action recognition method based on graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination