CN110110665B - Detection method for hand area in driving environment - Google Patents

Detection method for hand area in driving environment

Info

Publication number
CN110110665B
CN110110665B (application CN201910378179.7A)
Authority
CN
China
Prior art keywords
hand
convolution
driving environment
training
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910378179.7A
Other languages
Chinese (zh)
Other versions
CN110110665A (en)
Inventor
林相波
史明明
李一博
戴佐俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuangyuan Microsoft Co ltd
Dalian University of Technology
Original Assignee
Beijing Chuangyuan Microsoft Co ltd
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuangyuan Microsoft Co ltd, Dalian University of Technology
Priority to CN201910378179.7A
Publication of CN110110665A
Application granted
Publication of CN110110665B
Legal status: Active

Classifications

    • G06F 18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/044: Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Neural network learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 40/107: Static hand or arm (recognition of human or animal bodies, e.g. vehicle occupants; body parts, e.g. hands)


Abstract

The invention discloses a method for detecting hand regions in a driving environment, comprising the following steps: step 1) prepare a data set captured in a real driving environment by camera equipment mounted at different positions in the cab; divide it into a training image set and a test image set, apply data expansion, and then generate new hand bounding boxes; step 2) construct a hand-detection convolutional neural network structure that completes feature extraction and fusion using a multi-scale framework and feature information at different scales; step 3) perform end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set and stopping when the loss function L stabilizes; step 4) apply non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box; step 5) output the detection results. The method makes human hand-region detection easy to realize and is suitable for hand-region annotation in the cab environment.

Description

Detection method for hand area in driving environment
Technical Field
The invention belongs to the field of target detection of computer vision, and particularly relates to a method for detecting a hand area in a driving environment.
Background
Human hand detection, classification and tracking have been studied for many years and apply to many fields, such as virtual reality, human-machine interaction, and driver behavior monitoring. Because hand regions in natural images are disturbed by many factors, such as illumination change, occlusion, hand-shape change, viewing-angle change, and low hand resolution, detection of hand regions in natural images has not yet reached human-level accuracy, and many applications still rely on inefficient manual inspection. Research on accurate detection of human hand regions in natural environments is therefore very important. The aim of this work is to detect hand regions in static images of a vehicle-cab environment and to study a novel method based on deep learning, providing a technical basis for applications such as driver behavior detection.
Using skin-color information in hand detection is an effective strategy that achieves good results in many approaches. A two-stage approach is proposed in document [1] [A. Mittal, A. Zisserman, and P. H. S. Torr, "Hand detection using multiple proposals," in British Machine Vision Conference, 2011] that uses three complementary detectors (context, skin color, and sliding-window shape) to propose hand-region candidate boxes, and then assigns a confidence probability to each candidate box through a classifier. A disadvantage of this type of method is that, when detecting hand regions in natural images, the detection performance is greatly affected by skin-color variation under complex lighting conditions. Methods employing multimodal information can also yield good results in certain applications; for example, document [2] [E. Ohn-Bar, S. Martin, A. Tawari, and M. M. Trivedi, "Head, eye, and hand patterns for driver activity recognition," in ICPR, pp. 660-, 2014] combines head, eye and hand cues for driver activity recognition. However, that method does not reach high detection accuracy for the hand region because of the limitations of the selected HOG features. Document [3] [X. Zhu, X. Jia, and K.-Y. K. Wong, "Pixel-level hand detection with shape-aware structured forests," in Proceedings of Asian Conference on Computer Vision, Springer Press, 2014, pp. 64-78] adopts a shape-aware structured-forest algorithm to detect the hand region pixel by pixel; although it works well for first-person views, scanning the whole image pixel by pixel is too time-consuming. The chains model [4] [L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman, "The chains model for detecting parts by their context," in Proceedings of Computer Vision and Pattern Recognition, IEEE Press, 2010, pp. 25-32] is another hand-detection scheme that locates the hand by decomposing the human body into different parts, but when occlusion occurs the hand is difficult to detect. With the rapid development of deep learning, target detection based on convolutional neural networks has made great progress, for example the candidate-region-based convolutional network series (R-CNN, Fast R-CNN, R-FCN) and the YOLO series of detection networks. Although these achieve good results on objects such as cats, dogs, pedestrians, automobiles and sofas, when the target occupies a relatively small image region (e.g. a human hand) or is occluded, the original structures of these networks do not detect accurately, and a more efficient structure must be designed. Document [5] [Lu Ding, Yong Wang, et al., "Multi-scale representations for robust detection and classification," arXiv:1804.08220v1 [cs.CV], 2018] proposes a multi-scale R-FCN network structure comprising 5 convolution layers, which proposes hand-region candidate boxes at different scales, extracts and fuses feature maps from different layers, and then obtains the detected hand bounding box. Document [6] [T. Hoang Ngan Le, Kha Gia Quach, Chenchen Zhu, et al., "Robust Hand Detection and Classification in Vehicles and in the Wild," CVPRW 2018, pp. 39-46] also takes the R-FCN structure as its basic framework, fuses features of different layers in a multi-scale manner, and screens the hand regions among the candidate boxes.
A joint network for hand-region detection and hand-rotation estimation is designed in document [7] [Xiaoming Deng, Ye Yuan, Yinda Zhang, et al., "Joint Hand Detection and Rotation Estimation by Using CNN," arXiv:1612.02742v1 [cs.CV], 2016], where final hand-region detection is completed through feature sharing.
Disclosure of Invention
The invention aims to provide a method for detecting hand regions in a driving environment that serves as a new hand-detection network structure: no skin-color model needs to be established and no additional feature extractor is needed; the network model is trained on an RGB data set collected in the cab environment, realizes detection of human hand regions, and is suitable for hand-region annotation in the cab environment.
The technical scheme of the invention is as follows: a method for detecting hand regions in a driving environment, specifically comprising the following steps:
step 1) prepare a data set captured in a real driving environment by camera equipment mounted at different positions in the cab; divide the data set into a training image set and a test image set, apply data expansion, and then generate new hand bounding boxes;
step 2) construct a hand-detection convolutional neural network structure that completes feature extraction and fusion using a multi-scale framework and feature information at different scales;
step 3) perform end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set and stopping training when the loss function L stabilizes;
the loss function L is formulated as follows:
L = L_c + L_r (1)
wherein L_c evaluates whether pixels inside and outside the hand bounding box are correctly classified, and L_r evaluates whether the vertex positions of the hand bounding box are correctly regressed;
L_c = -α·p*·(1 - p)^γ·log(p) - (1 - α)·(1 - p*)·p^γ·log(1 - p) (2)
where p* denotes the true pixel classification result, p denotes the network-estimated probability that a pixel lies inside the hand bounding box, and α is the positive/negative sample balance factor:
[equation defining α, rendered as an image in the original]
γ is chosen empirically;
[equation (3), rendered as an image in the original: the regression loss L_r]
wherein C_i and C_i* respectively denote the regression result and the true value of the hand bounding-box coordinates;
[equation (4), rendered as an image in the original]
step 4) apply non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box;
and step 5) output the detection results.
As a preferred technical solution, the training image set in step 1) is randomly divided into a training subset and a validation subset at a ratio of 9:1.
As a preferred technical solution, the data expansion methods for the data set in step 1) include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blur and sharpening; after expansion the training data increases to at least 22000 images.
As a preferred technical solution, the data expansion in step 1) follows these rules:
Expansion rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Expansion rule 2: random edge cropping of 0-16 pixels, and horizontal flipping with 50% probability;
Expansion rule 3: vertical flipping in all cases, plus Gaussian blur with mean 0 and variance 3;
Expansion rule 4: random rotation with an upper angle limit of 45°, added Gaussian white noise at a 20% noise level, and random sharpening with 50% probability.
As a preferred technical solution, the new hand bounding box in step 1) is generated as follows: taking the four edges of the original hand bounding box as reference, each edge is shrunk inward by a specified length d = 0.2·l_min, where l_min is the shortest side length of the bounding box; the region inside the shrunken frame is labeled 1 and the region outside it is labeled 0.
As a preferred technical solution, the feature extraction and fusion in step 2) comprise three convolution modules and an up-sampling feature-fusion process, specifically as follows:
The input-layer image size is 256×256. The first convolution module ConvB_1 comprises two convolution layers and a max-pooling layer, with 3×3 convolution kernels and 64 channels; the second convolution module ConvB_2 comprises two convolution layers and a max-pooling layer, with 3×3 kernels and 128 channels; the third convolution module ConvB_3 comprises three convolution layers and a max-pooling layer, with 3×3 kernels and 256 channels; the pooling kernels are all 2×2 with stride 2;
the feature map output by the third convolution module ConvB_3 is up-sampled to double its size, 20% of the channels of the ConvB_2 output feature map are randomly removed with a Dropout mechanism, and the two are concatenated; the fused feature map FusF_1 is normalized and then fed into a cascaded 1×1 and 3×3 convolution group ConvC_1 with 128 channels in total; the output of this convolution passes through a 3×3 convolution layer with 32 kernels and is fed to the output layer; the output layer comprises two branches: branch 1 predicts, through a single-channel 1×1 convolution, the probability that each pixel lies inside the target region; branch 2 predicts the coordinate values of the target bounding-box vertices through a 4-channel 1×1 convolution.
As a preferred technical solution, the detection results in step 5) are evaluated with the following objective quantitative indexes: average precision AP, average recall AR, the comprehensive index F1-score, and detection speed FPS;
assuming that TP denotes a real target that is detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed, then:
AP = TP / (TP + FP)
AR = TP / (TP + FN)
F1 = 2·AP·AR / (AP + AR)
FPS is reported as a frame rate.
The invention has the advantages that:
1. The method for detecting hand regions in a driving environment has high accuracy, good applicability, low computational complexity, short running time, a simple and efficient training process, and a measured detection speed of 42 fps.
2. The invention establishes a hand detection model with a deep convolutional neural network structure, can extract more comprehensive hand-related features, and is robust to occlusion, uneven illumination, scale change, shape change and the like.
Drawings
The invention is further described with reference to the following figures and examples:
fig. 1 is a schematic diagram of detection results for different illumination, different hand shapes, different sizes of hands, and different numbers of hands.
Detailed Description
Example: Because the hand region varies considerably in size across images, feature maps of different depths are used to express hands of different sizes, with deeper feature maps attending to larger hand regions and shallower feature maps to smaller ones. To reduce computational cost, the invention adopts the idea of a U-shaped convolutional neural network structure and merges the feature maps step by step, specifically as follows:
Step 1) prepare a data set captured in a real driving environment by camera equipment mounted at different positions in the cab, intended for studying hand-region detection performance under cluttered backgrounds, complex lighting conditions and frequent occlusion; divide the data set into a training image set and a test image set, apply data expansion, and then generate new hand bounding boxes;
the data set comprises 5500 training images and 5500 testing images, and the image size is uniformly adjusted to 256 multiplied by 256 during training and testing; training image sets were as follows 9: the 1-scale random division is performed on a training subset and a verification subset, wherein the training subset comprises 4950 images, the verification subset comprises 550 images, and the test image set comprises 5500 images. The camera view includes: moving camera, fixed at left front camera driver, fixed at right front camera driver, fixed at back, fixed at right side of driver, fixed at top, worn on driver head, etc.
A deep neural network requires massive training data to obtain a good model, so the data set is expanded from the original data. The expansion methods include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blur and sharpening; after expansion the training data increases to at least 22000 images.
Data augmentation follows the rules below (an illustrative implementation is sketched after the list):
Expansion rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Expansion rule 2: random edge cropping of 0-16 pixels, and horizontal flipping with 50% probability;
Expansion rule 3: vertical flipping in all cases, plus Gaussian blur with mean 0 and variance 3;
Expansion rule 4: random rotation with an upper angle limit of 45°, added Gaussian white noise at a 20% noise level, and random sharpening with 50% probability.
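The four rules map directly onto common image operations. The sketch below is a minimal Python/Pillow illustration, not the patent's own code; the function name, the interpretation of "variance 3" as a blur standard deviation of √3, the 20% noise level as a noise standard deviation of 0.2×255, and the fixed paste offset for the translation are all our assumptions.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def expand_image(img: Image.Image) -> list:
    """Apply the four expansion rules once each; returns four new images."""
    out = []
    # Rule 1: brightness x1.2-1.5, scale x0.7-1.5, translate 40 px in x, 60 px in y
    bright = ImageEnhance.Brightness(img).enhance(random.uniform(1.2, 1.5))
    s = random.uniform(0.7, 1.5)
    scaled = bright.resize((int(img.width * s), int(img.height * s)))
    r1 = Image.new("RGB", img.size)
    r1.paste(scaled, (40, 60))  # translation realized as a paste offset (assumption)
    out.append(r1)
    # Rule 2: random edge crop of 0-16 px, then horizontal flip with p = 0.5
    m = random.randint(0, 16)
    r2 = img.crop((m, m, img.width - m, img.height - m)).resize(img.size)
    if random.random() < 0.5:
        r2 = r2.transpose(Image.FLIP_LEFT_RIGHT)
    out.append(r2)
    # Rule 3: always flip vertically, add Gaussian blur with mean 0, variance 3
    r3 = img.transpose(Image.FLIP_TOP_BOTTOM).filter(
        ImageFilter.GaussianBlur(radius=np.sqrt(3)))  # radius acts as the std dev
    out.append(r3)
    # Rule 4: random rotation up to 45 deg, white Gaussian noise, sharpen with p = 0.5
    r4 = img.rotate(random.uniform(-45, 45))
    arr = np.asarray(r4, dtype=np.float32)
    arr += np.random.normal(0, 0.2 * 255, arr.shape)  # "20% noise level" (assumption)
    r4 = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if random.random() < 0.5:
        r4 = r4.filter(ImageFilter.SHARPEN)
    out.append(r4)
    return out
```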
The hand bounding boxes provided by the original data set are given as vertex coordinates. The network output in this patent uses per-pixel probabilities of falling inside a bounding box, so the original boxes must be processed into a new form. The new hand bounding box is generated as follows: taking the four edges of the original box as reference, each edge is shrunk inward by a specified length d = 0.2·l_min, where l_min is the shortest side length of the box; the region inside the shrunken frame is labeled 1 and the region outside it is labeled 0, as sketched below.
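A short NumPy sketch of this labeling; the function and argument names are ours, and axis-aligned (x1, y1, x2, y2) boxes are assumed.

```python
import numpy as np

def make_score_map(boxes, height=256, width=256, shrink=0.2):
    """Per-pixel label map: 1 inside each shrunken hand box, 0 outside.

    boxes: iterable of (x1, y1, x2, y2) hand bounding boxes.
    Each edge moves inward by d = shrink * l_min, where l_min is the
    shorter side of the box, per the patent's generation rule.
    """
    score = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        l_min = min(x2 - x1, y2 - y1)
        d = int(round(shrink * l_min))
        sx1, sy1, sx2, sy2 = x1 + d, y1 + d, x2 - d, y2 - d
        if sx2 > sx1 and sy2 > sy1:           # skip boxes shrunk to nothing
            score[sy1:sy2, sx1:sx2] = 1.0
    return score
```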
Step 2) construct a hand-detection convolutional neural network structure that completes feature extraction and fusion using a multi-scale framework and feature information at different scales;
the feature extraction and fusion comprises three convolution modules and an up-sampling feature fusion process, and specifically comprises the following steps:
The input-layer image size is 256×256. The first convolution module ConvB_1 comprises two convolution layers and a max-pooling layer, with 3×3 convolution kernels and 64 channels; the second convolution module ConvB_2 comprises two convolution layers and a max-pooling layer, with 3×3 kernels and 128 channels; the third convolution module ConvB_3 comprises three convolution layers and a max-pooling layer, with 3×3 kernels and 256 channels; the pooling kernels are all 2×2 with stride 2;
the feature map output by the third convolution module ConvB_3 is up-sampled to double its size, 20% of the channels of the ConvB_2 output feature map are randomly removed with a Dropout mechanism, and the two are concatenated; the fused feature map FusF_1 is normalized and then fed into a cascaded 1×1 and 3×3 convolution group ConvC_1 with 128 channels in total; the output of this convolution passes through a 3×3 convolution layer with 32 kernels and is fed to the output layer; the output layer comprises two branches: branch 1 predicts, through a single-channel 1×1 convolution, the probability that each pixel lies inside the target region; branch 2 predicts the coordinate values of the target bounding-box vertices through a 4-channel 1×1 convolution (a model sketch follows).
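A minimal PyTorch sketch of the described structure. Layer names mirror the text (ConvB_1..3, FusF_1, ConvC_1); ReLU activations, BatchNorm as the "normalization processing", bilinear up-sampling, and the exact split of ConvC_1 into a 1×1 then a 3×3 convolution of 128 channels each are assumptions where the text is silent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_convs):
    """A ConvB module: n_convs 3x3 conv+ReLU layers followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

class HandDetNet(nn.Module):
    """Sketch of the patent's multi-scale fusion network (details assumed)."""
    def __init__(self):
        super().__init__()
        self.convb1 = conv_block(3, 64, 2)     # ConvB_1: 256 -> 128
        self.convb2 = conv_block(64, 128, 2)   # ConvB_2: 128 -> 64
        self.convb3 = conv_block(128, 256, 3)  # ConvB_3: 64 -> 32
        self.drop = nn.Dropout2d(0.2)          # randomly drops 20% of channels
        self.norm = nn.BatchNorm2d(256 + 128)  # normalization of fused map FusF_1
        self.convc1 = nn.Sequential(           # ConvC_1: cascaded 1x1 + 3x3, 128 ch
            nn.Conv2d(256 + 128, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True))
        self.conv32 = nn.Conv2d(128, 32, 3, padding=1)
        self.score_head = nn.Conv2d(32, 1, 1)  # branch 1: inside-box probability
        self.box_head = nn.Conv2d(32, 4, 1)    # branch 2: bounding-box vertices

    def forward(self, x):                      # x: (N, 3, 256, 256)
        f1 = self.convb1(x)
        f2 = self.convb2(f1)
        f3 = self.convb3(f2)
        up = F.interpolate(f3, scale_factor=2, mode='bilinear', align_corners=False)
        fused = torch.cat([up, self.drop(f2)], dim=1)   # FusF_1 at 64x64
        y = self.conv32(self.convc1(self.norm(fused)))
        return torch.sigmoid(self.score_head(y)), self.box_head(y)
```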
Step 3) perform end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set and stopping training when the loss function L stabilizes;
The loss function L is formulated as follows (an illustrative implementation appears after the equations):
L = L_c + L_r (1)
wherein L_c evaluates whether pixels inside and outside the hand bounding box are correctly classified, and L_r evaluates whether the vertex positions of the hand bounding box are correctly regressed;
L_c = -α·p*·(1 - p)^γ·log(p) - (1 - α)·(1 - p*)·p^γ·log(1 - p) (2)
where p* denotes the true pixel classification result, p denotes the network-estimated probability that a pixel lies inside the hand bounding box, and α is the positive/negative sample balance factor:
[equation defining α, rendered as an image in the original]
γ is chosen empirically; setting γ = 2 gave good results in the experiments;
[equation (3), rendered as an image in the original: the regression loss L_r]
wherein C_i and C_i* respectively denote the regression result and the true value of the hand bounding-box coordinates;
[equation (4), rendered as an image in the original]
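A hedged PyTorch sketch of the loss: the classification term implements equation (2) directly, while equations (3) and (4) survive only as images in the original, so the regression term below assumes a smooth L1 penalty over the predicted vertex coordinates C_i against the ground truth C_i*, averaged over pixels inside hand boxes. The value α = 0.5 and the learning rate in the training comment are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def classification_loss(p, p_star, alpha=0.5, gamma=2.0, eps=1e-6):
    """Equation (2): p is the predicted inside-box probability map, p_star the
    0/1 ground-truth map; gamma=2 follows the patent's empirical choice."""
    pos = -alpha * p_star * (1 - p).pow(gamma) * torch.log(p + eps)
    neg = -(1 - alpha) * (1 - p_star) * p.pow(gamma) * torch.log(1 - p + eps)
    return (pos + neg).mean()

def regression_loss(c, c_star, mask):
    """Assumed form of L_r: smooth L1 over 4-channel vertex-coordinate maps,
    averaged over the pixels inside hand boxes given by mask."""
    loss = F.smooth_l1_loss(c, c_star, reduction='none').sum(dim=1)
    return (loss * mask).sum() / (mask.sum() + 1e-6)

# End-to-end training with ADAM, stopping once L = L_c + L_r plateaus:
#   opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # lr is an assumption
#   loss = classification_loss(p, p_star) + regression_loss(c, c_star, p_star[:, 0])
#   loss.backward(); opt.step(); opt.zero_grad()
```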
Step 4) In the target-detection process, a large number of mutually overlapping candidate boxes arise at the same target position, each with a different confidence. Non-maximum suppression is applied to eliminate the redundant candidate boxes and obtain the optimal hand bounding box; a sketch follows.
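Standard non-maximum suppression can be sketched as follows; the IoU threshold of 0.5 is an assumption, since the patent does not state a value.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-confidence box, drop any remaining box whose IoU with
    it exceeds iou_thresh, and repeat on what is left."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # indices by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # discard heavily overlapping boxes
    return keep
```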
Step 5) output the detection results; the detection results are evaluated with the following objective quantitative indexes: average precision AP, average recall AR, the comprehensive index F1-score, and detection speed FPS;
assuming that TP denotes a real target that is detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed, then:
AP = TP / (TP + FP)
AR = TP / (TP + FN)
F1 = 2·AP·AR / (AP + AR)
FPS is reported as a frame rate; the sketch below illustrates the index computation.
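Given the TP/FP/FN counts defined above, the three indexes reduce to a few lines; this sketch assumes the counts have already been accumulated over the test set.

```python
def evaluate(tp: int, fp: int, fn: int):
    """AP, AR and F1 from accumulated detection counts (standard definitions)."""
    ap = tp / (tp + fp)            # fraction of detections that are real hands
    ar = tp / (tp + fn)            # fraction of real hands that were detected
    f1 = 2 * ap * ar / (ap + ar)   # harmonic mean of AP and AR
    return ap, ar, f1
```

With the test-set figures reported in Table 1 below (AP 98.3%, AR 86.7%), the F1 computed this way is about 92.1%, matching the tabulated 92.2 up to rounding.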
Performance on hand regions in RGB static images in the cab environment is evaluated by subjective visual inspection and by the objective quantitative indexes. Fig. 1 shows the detection results of a few typical examples; the method detects well under different illumination, different hand shapes, different hand sizes, and different numbers of hands.
The quantitative evaluation of the method on the test set is shown in Table 1, where its performance is compared with the best competition result on the VIVA data set.
Table 1. Quantitative evaluation indexes for hand-region detection on the test set

Method                          AP (%)   AR (%)   F1 (%)   FPS
This patent                     98.3     86.7     92.2     42
Background-art document [6]     94.8     74.7     -        4.65
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (7)

1. A method for detecting a hand area in a driving environment is characterized by comprising the following steps:
step 1) preparing a data set, wherein the data set is captured in a real driving environment by camera equipment mounted at different positions in the cab; dividing the data set into a training image set and a test image set, performing data expansion on the data set, and then generating new hand bounding boxes;
Step 2) constructing a hand detection convolutional neural network structure, and completing feature extraction and fusion by adopting a multi-scale framework and utilizing feature information on different scales;
step 3) performing end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set, and stopping training when the loss function L is stable;
the loss function L is formulated as follows:
L = L_c + L_r (1)
wherein L_c evaluates whether pixels inside and outside the hand bounding box are correctly classified, and L_r evaluates whether the vertex positions of the hand bounding box are correctly regressed;
L_c = -α·p*·(1 - p)^γ·log(p) - (1 - α)·(1 - p*)·p^γ·log(1 - p) (2)
where p* denotes the true pixel classification result, p denotes the network-estimated probability that a pixel lies inside the hand bounding box, and α is the positive/negative sample balance factor:
[equation defining α, rendered as an image in the original]
γ is chosen empirically;
[equation (3), rendered as an image in the original: the regression loss L_r]
wherein C_i and C_i* respectively denote the regression result and the true value of the hand bounding-box coordinates;
[equation (4), rendered as an image in the original]
step 4) applying non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box;
and step 5) outputting the detection results.
2. The method for detecting a hand region in a driving environment according to claim 1, wherein the training image set in step 1) is randomly divided into a training subset and a validation subset at a ratio of 9:1.
3. The method for detecting a hand region in a driving environment according to claim 1, wherein the data expansion methods for the data set in step 1) include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blur and sharpening, and the expanded training data increases to at least 22000 images.
4. The method for detecting a hand region in a driving environment according to claim 1, wherein the data expansion in step 1) follows these rules:
Expansion rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Expansion rule 2: random edge cropping of 0-16 pixels, and horizontal flipping with 50% probability;
Expansion rule 3: vertical flipping in all cases, plus Gaussian blur with mean 0 and variance 3;
Expansion rule 4: random rotation with an upper angle limit of 45°, added Gaussian white noise at a 20% noise level, and random sharpening with 50% probability.
5. The method for detecting a hand region in a driving environment according to claim 1, wherein the new hand bounding box in step 1) is generated as follows: taking the four edges of the original hand bounding box as reference, each edge is shrunk inward by a specified length d = 0.2·l_min, where l_min is the shortest side length of the bounding box; the region inside the shrunken frame is labeled 1 and the region outside it is labeled 0.
6. The method for detecting a hand region in a driving environment according to claim 1, wherein the feature extraction and fusion in step 2) comprise three convolution modules and an up-sampling feature-fusion process, specifically as follows:
The input-layer image size is 256×256. The first convolution module ConvB_1 comprises two convolution layers and a max-pooling layer, with 3×3 convolution kernels and 64 channels; the second convolution module ConvB_2 comprises two convolution layers and a max-pooling layer, with 3×3 kernels and 128 channels; the third convolution module ConvB_3 comprises three convolution layers and a max-pooling layer, with 3×3 kernels and 256 channels; the pooling kernels are all 2×2 with stride 2;
the feature map output by the third convolution module ConvB_3 is up-sampled to double its size, 20% of the channels of the ConvB_2 output feature map are randomly removed with a Dropout mechanism, and the two are concatenated; the fused feature map FusF_1 is normalized and then fed into a cascaded 1×1 and 3×3 convolution group ConvC_1 with 128 channels in total; the output of this convolution passes through a 3×3 convolution layer with 32 kernels and is fed to the output layer; the output layer comprises two branches: branch 1 predicts, through a single-channel 1×1 convolution, the probability that each pixel lies inside the target region; branch 2 predicts the coordinate values of the target bounding-box vertices through a 4-channel 1×1 convolution.
7. The method for detecting a hand region in a driving environment according to claim 1, wherein the detection results in step 5) are evaluated with the following objective quantitative indexes: average precision AP, average recall AR, the comprehensive index F1-score, and detection speed FPS;
assuming that TP denotes a real target that is detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed, then:
AP = TP / (TP + FP)
AR = TP / (TP + FN)
F1 = 2·AP·AR / (AP + AR)
FPS is reported as a frame rate.
CN201910378179.7A 2019-05-08 2019-05-08 Detection method for hand area in driving environment Active CN110110665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910378179.7A CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910378179.7A CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Publications (2)

Publication Number Publication Date
CN110110665A CN110110665A (en) 2019-08-09
CN110110665B true CN110110665B (en) 2021-05-04

Family

ID=67488704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910378179.7A Active CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Country Status (1)

Country Link
CN (1) CN110110665B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364805B * 2020-11-21 2023-04-18 Xi'an Jiaotong University Rotary palm image detection method
CN112686888A * 2021-01-27 2021-04-20 Shanghai Electric Group Co., Ltd. Method, system, equipment and medium for detecting cracks of concrete sleeper


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10996372B2 (en) * 2017-08-25 2021-05-04 Exxonmobil Upstream Research Company Geophysical inversion with convolutional neural networks
CN108875732B * 2018-01-11 2022-07-12 Beijing Megvii Technology Co., Ltd. Model training and instance segmentation method, device and system and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129673A * 2011-04-19 2011-07-20 Dalian University of Technology Color digital image enhancing and denoising method under random illumination
CN109086779A * 2018-07-28 2018-12-25 Tianjin University Attention target identification method based on convolutional neural networks
CN109711288A * 2018-12-13 2019-05-03 Xidian University Remote sensing ship detection method based on feature pyramid and distance-constrained FCN
CN109635750A * 2018-12-14 2019-04-16 Guangxi Normal University Compound convolutional neural network gesture-image recognition method under complex background

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HBE: Hand Branch Ensemble Network for Real-time 3D Hand Pose Estimation; Yidan Zhou et al.; ECCV 2018; 2018-12-31; pp. 1-16 *
Image recognition with adaptively enhanced convolutional neural networks (自适应增强卷积神经网络图像识别); Liu Wanjun et al.; Journal of Image and Graphics (中国图象图形学报); 2017-12-31; pp. 1723-1736 *

Also Published As

Publication number Publication date
CN110110665A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
Ali et al. Structural crack detection using deep convolutional neural networks
CN109886986B (en) Dermatoscope image segmentation method based on multi-branch convolutional neural network
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN107358258B (en) SAR image target classification based on NSCT double CNN channels and selective attention mechanism
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN107909005A (en) Personage's gesture recognition method under monitoring scene based on deep learning
CN106157303A (en) A kind of method based on machine vision to Surface testing
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN109753996B (en) Hyperspectral image classification method based on three-dimensional lightweight depth network
CN107808376A (en) A kind of detection method of raising one's hand based on deep learning
CN110490924B (en) Light field image feature point detection method based on multi-scale Harris
CN103903275A (en) Method for improving image segmentation effects by using wavelet fusion algorithm
CN110110665B (en) Detection method for hand area in driving environment
Li et al. Research on a product quality monitoring method based on multi scale PP-YOLO
Wang et al. Segmentation of corn leaf disease based on fully convolution neural network
CN104376312B (en) Face identification method based on bag of words compressed sensing feature extraction
Peng et al. Litchi detection in the field using an improved YOLOv3 model
Li et al. The research on traffic sign recognition based on deep learning
CN111832508B (en) DIE _ GA-based low-illumination target detection method
CN105930789A (en) Human body behavior recognition based on logarithmic Euclidean space BOW (bag of words) model
Nie et al. Analysis on DeepLabV3+ performance for automatic steel defects detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant