CN110110665B - Detection method for hand area in driving environment - Google Patents

Detection method for hand area in driving environment

Info

Publication number
CN110110665B
CN110110665B (application CN201910378179.7A)
Authority
CN
China
Prior art keywords
hand
convolution
driving environment
training
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910378179.7A
Other languages
Chinese (zh)
Other versions
CN110110665A (en)
Inventor
林相波
史明明
李一博
戴佐俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuangyuan Microsoft Co ltd
Dalian University of Technology
Original Assignee
Beijing Chuangyuan Microsoft Co ltd
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuangyuan Microsoft Co ltd, Dalian University of Technology
Priority to CN201910378179.7A
Publication of CN110110665A
Application granted
Publication of CN110110665B
Legal status: Active

Classifications

    • G06F 18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/044: Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Neural network learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 40/107: Static hand or arm (recognition of human or animal bodies, e.g. vehicle occupants; body parts, e.g. hands)


Abstract

The invention discloses a method for detecting hand regions in a driving environment, comprising the following steps: step 1) prepare a data set captured in a real driving environment by camera equipment mounted at different positions in the cab; divide it into a training image set and a test image set, apply data expansion, and then generate new hand bounding boxes; step 2) construct a hand-detection convolutional neural network structure that completes feature extraction and fusion using a multi-scale framework and feature information at different scales; step 3) perform end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set and stopping when the loss function L stabilizes; step 4) apply non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box; step 5) output the detection results. The method makes human hand-region detection easy to realize and is suitable for hand-region annotation in the cab environment.

Description

Detection method for hand area in driving environment
Technical Field
The invention belongs to the field of target detection of computer vision, and particularly relates to a method for detecting a hand area in a driving environment.
Background
Human hand detection, classification and tracking have been studied for many years and apply to many fields, such as virtual reality, human-machine interaction, and driver behavior monitoring. Because hand regions in natural images are disturbed by many factors, such as illumination change, occlusion, hand-shape change, viewing-angle change, and low hand resolution, detection of hand regions in natural images has not yet reached human-level accuracy, and many applications still rely on inefficient manual inspection. Research on accurate detection of human hand regions in natural environments is therefore very important. The aim of this work is to detect hand regions in static images of a vehicle-cab environment and to study a novel method based on deep learning, providing a technical basis for applications such as driver behavior detection.
Using skin-color information in hand detection is an effective strategy that achieves good results in many approaches. A two-stage approach is proposed in document [1] [A. Mittal, A. Zisserman, and P. H. S. Torr, "Hand detection using multiple proposals," in British Machine Vision Conference, 2011] that uses three complementary detectors (context, skin color, and sliding-window shape) to propose hand-region candidate boxes, and then assigns a confidence probability to each candidate box through a classifier. A disadvantage of this type of method is that, when detecting hand regions in natural images, the detection performance is greatly affected by skin-color variation under complex lighting conditions. Methods employing multimodal information can also yield good results in certain applications; for example, document [2] [E. Ohn-Bar, S. Martin, A. Tawari, and M. M. Trivedi, "Head, eye, and hand patterns for driver activity recognition," in ICPR, pp. 660-, 2014] combines head, eye and hand cues for driver activity recognition. However, that method does not reach high detection accuracy for the hand region because of the limitations of the selected HOG features. Document [3] [X. Zhu, X. Jia, and K.-Y. K. Wong, "Pixel-level hand detection with shape-aware structured forests," in Proceedings of Asian Conference on Computer Vision, Springer Press, 2014, pp. 64-78] adopts a shape-aware structured-forest algorithm to detect the hand region pixel by pixel; although it works well for first-person views, scanning the whole image pixel by pixel is too time-consuming. The chains model [4] [L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman, "The chains model for detecting parts by their context," in Proceedings of Computer Vision and Pattern Recognition, IEEE Press, 2010, pp. 25-32] is another hand-detection scheme that locates the hand by decomposing the human body into different parts, but when occlusion occurs the hand is difficult to detect. With the rapid development of deep learning, target detection based on convolutional neural networks has made great progress, for example the candidate-region-based convolutional network series (R-CNN, Fast R-CNN, R-FCN) and the YOLO series of detection networks. Although these achieve good results on objects such as cats, dogs, pedestrians, automobiles and sofas, when the target occupies a relatively small image region (e.g. a human hand) or is occluded, the original structures of these networks do not detect accurately, and a more efficient structure must be designed. Document [5] [Lu Ding, Yong Wang, et al., "Multi-scale representations for robust detection and classification," arXiv:1804.08220v1 [cs.CV], 2018] proposes a multi-scale R-FCN network structure comprising 5 convolution layers, which proposes hand-region candidate boxes at different scales, extracts and fuses feature maps from different layers, and then obtains the detected hand bounding box. Document [6] [T. Hoang Ngan Le, Kha Gia Quach, Chenchen Zhu, et al., "Robust Hand Detection and Classification in Vehicles and in the Wild," CVPRW 2018, pp. 39-46] also takes the R-FCN structure as its basic framework, fuses features of different layers in a multi-scale manner, and screens the hand regions among the candidate boxes.
A joint network for hand-region detection and hand-rotation estimation is designed in document [7] [Xiaoming Deng, Ye Yuan, Yinda Zhang, et al., "Joint Hand Detection and Rotation Estimation by Using CNN," arXiv:1612.02742v1 [cs.CV], 2016], where final hand-region detection is completed through feature sharing.
Disclosure of Invention
The invention aims to provide a method for detecting hand regions in a driving environment that serves as a new hand-detection network structure: no skin-color model needs to be established and no additional feature extractor is needed; the network model is trained on an RGB data set collected in the cab environment, realizes detection of human hand regions, and is suitable for hand-region annotation in the cab environment.
The technical scheme of the invention is as follows: a method for detecting hand regions in a driving environment, specifically comprising the following steps:
step 1) prepare a data set captured in a real driving environment by camera equipment mounted at different positions in the cab; divide the data set into a training image set and a test image set, apply data expansion, and then generate new hand bounding boxes;
step 2) construct a hand-detection convolutional neural network structure that completes feature extraction and fusion using a multi-scale framework and feature information at different scales;
step 3) perform end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set and stopping training when the loss function L stabilizes;
the loss function L is formulated as follows:
L = L_c + L_r (1)
wherein L_c evaluates whether pixels inside and outside the hand bounding box are correctly classified, and L_r evaluates whether the vertex positions of the hand bounding box are correctly regressed;
L_c = -α·p*·(1 - p)^γ·log(p) - (1 - α)·(1 - p*)·p^γ·log(1 - p) (2)
where p* denotes the true pixel classification result, p denotes the network-estimated probability that a pixel lies inside the hand bounding box, and α is the positive/negative sample balance factor:
[equation defining α, rendered as an image in the original]
γ is chosen empirically;
[equation (3), rendered as an image in the original: the regression loss L_r]
wherein C_i and C_i* respectively denote the regression result and the true value of the hand bounding-box coordinates;
[equation (4), rendered as an image in the original]
step 4) apply non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box;
and step 5) output the detection results.
As a preferred technical solution, the training image set in step 1) is randomly divided into a training subset and a validation subset at a ratio of 9:1.
As a preferred technical solution, the data expansion methods for the data set in step 1) include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blur and sharpening; after expansion the training data increases to at least 22000 images.
As a preferred technical solution, the data expansion in step 1) follows these rules:
Expansion rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Expansion rule 2: random edge cropping of 0-16 pixels, and horizontal flipping with 50% probability;
Expansion rule 3: vertical flipping in all cases, plus Gaussian blur with mean 0 and variance 3;
Expansion rule 4: random rotation with an upper angle limit of 45°, added Gaussian white noise at a 20% noise level, and random sharpening with 50% probability.
As a preferred technical solution, the new hand bounding box in step 1) is generated as follows: taking the four edges of the original hand bounding box as reference, each edge is shrunk inward by a specified length d = 0.2·l_min, where l_min is the shortest side length of the bounding box; the region inside the shrunken frame is labeled 1 and the region outside it is labeled 0.
As a preferred technical solution, the feature extraction and fusion in step 2) comprise three convolution modules and an up-sampling feature-fusion process, specifically as follows:
The input-layer image size is 256×256. The first convolution module ConvB_1 comprises two convolution layers and a max-pooling layer, with 3×3 convolution kernels and 64 channels; the second convolution module ConvB_2 comprises two convolution layers and a max-pooling layer, with 3×3 kernels and 128 channels; the third convolution module ConvB_3 comprises three convolution layers and a max-pooling layer, with 3×3 kernels and 256 channels; the pooling kernels are all 2×2 with stride 2;
the feature map output by the third convolution module ConvB_3 is up-sampled to double its size, 20% of the channels of the ConvB_2 output feature map are randomly removed with a Dropout mechanism, and the two are concatenated; the fused feature map FusF_1 is normalized and then fed into a cascaded 1×1 and 3×3 convolution group ConvC_1 with 128 channels in total; the output of this convolution passes through a 3×3 convolution layer with 32 kernels and is fed to the output layer; the output layer comprises two branches: branch 1 predicts, through a single-channel 1×1 convolution, the probability that each pixel lies inside the target region; branch 2 predicts the coordinate values of the target bounding-box vertices through a 4-channel 1×1 convolution.
As a preferred technical solution, the detection results in step 5) are evaluated with the following objective quantitative indexes: average precision AP, average recall AR, the comprehensive index F1-score, and detection speed FPS;
assuming that TP denotes a real target that is detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed, then:
AP = TP / (TP + FP)
AR = TP / (TP + FN)
F1 = 2·AP·AR / (AP + AR)
FPS is reported as a frame rate.
The invention has the advantages that:
1. The method for detecting hand regions in a driving environment has high accuracy, good applicability, low computational complexity, short running time, a simple and efficient training process, and a measured detection speed of 42 fps.
2. The invention establishes a hand detection model with a deep convolutional neural network structure, can extract more comprehensive hand-related features, and is robust to occlusion, uneven illumination, scale change, shape change and the like.
Drawings
The invention is further described with reference to the following figures and examples:
fig. 1 is a schematic diagram of detection results for different illumination, different hand shapes, different sizes of hands, and different numbers of hands.
Detailed Description
Example: Because the hand region varies considerably in size across images, feature maps of different depths are used to express hands of different sizes, with deeper feature maps attending to larger hand regions and shallower feature maps to smaller ones. To reduce computational cost, the invention adopts the idea of a U-shaped convolutional neural network structure and merges the feature maps step by step, specifically as follows:
Step 1) prepare a data set captured in a real driving environment by camera equipment mounted at different positions in the cab, intended for studying hand-region detection performance under cluttered backgrounds, complex lighting conditions and frequent occlusion; divide the data set into a training image set and a test image set, apply data expansion, and then generate new hand bounding boxes;
the data set comprises 5500 training images and 5500 testing images, and the image size is uniformly adjusted to 256 multiplied by 256 during training and testing; training image sets were as follows 9: the 1-scale random division is performed on a training subset and a verification subset, wherein the training subset comprises 4950 images, the verification subset comprises 550 images, and the test image set comprises 5500 images. The camera view includes: moving camera, fixed at left front camera driver, fixed at right front camera driver, fixed at back, fixed at right side of driver, fixed at top, worn on driver head, etc.
A deep neural network requires massive training data to obtain a good model, so the data set is expanded from the original data. The expansion methods include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blur and sharpening; after expansion the training data increases to at least 22000 images.
Data augmentation follows the rules below (an illustrative implementation is sketched after the list):
Expansion rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Expansion rule 2: random edge cropping of 0-16 pixels, and horizontal flipping with 50% probability;
Expansion rule 3: vertical flipping in all cases, plus Gaussian blur with mean 0 and variance 3;
Expansion rule 4: random rotation with an upper angle limit of 45°, added Gaussian white noise at a 20% noise level, and random sharpening with 50% probability.
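The four rules map directly onto common image operations. The sketch below is a minimal Python/Pillow illustration, not the patent's own code; the function name, the interpretation of "variance 3" as a blur standard deviation of √3, the 20% noise level as a noise standard deviation of 0.2×255, and the fixed paste offset for the translation are all our assumptions.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def expand_image(img: Image.Image) -> list:
    """Apply the four expansion rules once each; returns four new images."""
    out = []
    # Rule 1: brightness x1.2-1.5, scale x0.7-1.5, translate 40 px in x, 60 px in y
    bright = ImageEnhance.Brightness(img).enhance(random.uniform(1.2, 1.5))
    s = random.uniform(0.7, 1.5)
    scaled = bright.resize((int(img.width * s), int(img.height * s)))
    r1 = Image.new("RGB", img.size)
    r1.paste(scaled, (40, 60))  # translation realized as a paste offset (assumption)
    out.append(r1)
    # Rule 2: random edge crop of 0-16 px, then horizontal flip with p = 0.5
    m = random.randint(0, 16)
    r2 = img.crop((m, m, img.width - m, img.height - m)).resize(img.size)
    if random.random() < 0.5:
        r2 = r2.transpose(Image.FLIP_LEFT_RIGHT)
    out.append(r2)
    # Rule 3: always flip vertically, add Gaussian blur with mean 0, variance 3
    r3 = img.transpose(Image.FLIP_TOP_BOTTOM).filter(
        ImageFilter.GaussianBlur(radius=np.sqrt(3)))  # radius acts as the std dev
    out.append(r3)
    # Rule 4: random rotation up to 45 deg, white Gaussian noise, sharpen with p = 0.5
    r4 = img.rotate(random.uniform(-45, 45))
    arr = np.asarray(r4, dtype=np.float32)
    arr += np.random.normal(0, 0.2 * 255, arr.shape)  # "20% noise level" (assumption)
    r4 = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if random.random() < 0.5:
        r4 = r4.filter(ImageFilter.SHARPEN)
    out.append(r4)
    return out
```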
The hand bounding boxes provided by the original data set are given as vertex coordinates. The network output in this patent uses per-pixel probabilities of falling inside a bounding box, so the original boxes must be processed into a new form. The new hand bounding box is generated as follows: taking the four edges of the original box as reference, each edge is shrunk inward by a specified length d = 0.2·l_min, where l_min is the shortest side length of the box; the region inside the shrunken frame is labeled 1 and the region outside it is labeled 0, as sketched below.
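A short NumPy sketch of this labeling; the function and argument names are ours, and axis-aligned (x1, y1, x2, y2) boxes are assumed.

```python
import numpy as np

def make_score_map(boxes, height=256, width=256, shrink=0.2):
    """Per-pixel label map: 1 inside each shrunken hand box, 0 outside.

    boxes: iterable of (x1, y1, x2, y2) hand bounding boxes.
    Each edge moves inward by d = shrink * l_min, where l_min is the
    shorter side of the box, per the patent's generation rule.
    """
    score = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        l_min = min(x2 - x1, y2 - y1)
        d = int(round(shrink * l_min))
        sx1, sy1, sx2, sy2 = x1 + d, y1 + d, x2 - d, y2 - d
        if sx2 > sx1 and sy2 > sy1:           # skip boxes shrunk to nothing
            score[sy1:sy2, sx1:sx2] = 1.0
    return score
```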
Step 2) construct a hand-detection convolutional neural network structure that completes feature extraction and fusion using a multi-scale framework and feature information at different scales;
the feature extraction and fusion comprises three convolution modules and an up-sampling feature fusion process, and specifically comprises the following steps:
The input-layer image size is 256×256. The first convolution module ConvB_1 comprises two convolution layers and a max-pooling layer, with 3×3 convolution kernels and 64 channels; the second convolution module ConvB_2 comprises two convolution layers and a max-pooling layer, with 3×3 kernels and 128 channels; the third convolution module ConvB_3 comprises three convolution layers and a max-pooling layer, with 3×3 kernels and 256 channels; the pooling kernels are all 2×2 with stride 2;
the feature map output by the third convolution module ConvB_3 is up-sampled to double its size, 20% of the channels of the ConvB_2 output feature map are randomly removed with a Dropout mechanism, and the two are concatenated; the fused feature map FusF_1 is normalized and then fed into a cascaded 1×1 and 3×3 convolution group ConvC_1 with 128 channels in total; the output of this convolution passes through a 3×3 convolution layer with 32 kernels and is fed to the output layer; the output layer comprises two branches: branch 1 predicts, through a single-channel 1×1 convolution, the probability that each pixel lies inside the target region; branch 2 predicts the coordinate values of the target bounding-box vertices through a 4-channel 1×1 convolution (a model sketch follows).
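A minimal PyTorch sketch of the described structure. Layer names mirror the text (ConvB_1..3, FusF_1, ConvC_1); ReLU activations, BatchNorm as the "normalization processing", bilinear up-sampling, and the exact split of ConvC_1 into a 1×1 then a 3×3 convolution of 128 channels each are assumptions where the text is silent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_convs):
    """A ConvB module: n_convs 3x3 conv+ReLU layers followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

class HandDetNet(nn.Module):
    """Sketch of the patent's multi-scale fusion network (details assumed)."""
    def __init__(self):
        super().__init__()
        self.convb1 = conv_block(3, 64, 2)     # ConvB_1: 256 -> 128
        self.convb2 = conv_block(64, 128, 2)   # ConvB_2: 128 -> 64
        self.convb3 = conv_block(128, 256, 3)  # ConvB_3: 64 -> 32
        self.drop = nn.Dropout2d(0.2)          # randomly drops 20% of channels
        self.norm = nn.BatchNorm2d(256 + 128)  # normalization of fused map FusF_1
        self.convc1 = nn.Sequential(           # ConvC_1: cascaded 1x1 + 3x3, 128 ch
            nn.Conv2d(256 + 128, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True))
        self.conv32 = nn.Conv2d(128, 32, 3, padding=1)
        self.score_head = nn.Conv2d(32, 1, 1)  # branch 1: inside-box probability
        self.box_head = nn.Conv2d(32, 4, 1)    # branch 2: bounding-box vertices

    def forward(self, x):                      # x: (N, 3, 256, 256)
        f1 = self.convb1(x)
        f2 = self.convb2(f1)
        f3 = self.convb3(f2)
        up = F.interpolate(f3, scale_factor=2, mode='bilinear', align_corners=False)
        fused = torch.cat([up, self.drop(f2)], dim=1)   # FusF_1 at 64x64
        y = self.conv32(self.convc1(self.norm(fused)))
        return torch.sigmoid(self.score_head(y)), self.box_head(y)
```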
Step 3) perform end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set and stopping training when the loss function L stabilizes;
The loss function L is formulated as follows (an illustrative implementation appears after the equations):
L = L_c + L_r (1)
wherein L_c evaluates whether pixels inside and outside the hand bounding box are correctly classified, and L_r evaluates whether the vertex positions of the hand bounding box are correctly regressed;
L_c = -α·p*·(1 - p)^γ·log(p) - (1 - α)·(1 - p*)·p^γ·log(1 - p) (2)
where p* denotes the true pixel classification result, p denotes the network-estimated probability that a pixel lies inside the hand bounding box, and α is the positive/negative sample balance factor:
[equation defining α, rendered as an image in the original]
γ is chosen empirically; setting γ = 2 gave good results in the experiments;
[equation (3), rendered as an image in the original: the regression loss L_r]
wherein C_i and C_i* respectively denote the regression result and the true value of the hand bounding-box coordinates;
[equation (4), rendered as an image in the original]
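A hedged PyTorch sketch of the loss: the classification term implements equation (2) directly, while equations (3) and (4) survive only as images in the original, so the regression term below assumes a smooth L1 penalty over the predicted vertex coordinates C_i against the ground truth C_i*, averaged over pixels inside hand boxes. The value α = 0.5 and the learning rate in the training comment are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def classification_loss(p, p_star, alpha=0.5, gamma=2.0, eps=1e-6):
    """Equation (2): p is the predicted inside-box probability map, p_star the
    0/1 ground-truth map; gamma=2 follows the patent's empirical choice."""
    pos = -alpha * p_star * (1 - p).pow(gamma) * torch.log(p + eps)
    neg = -(1 - alpha) * (1 - p_star) * p.pow(gamma) * torch.log(1 - p + eps)
    return (pos + neg).mean()

def regression_loss(c, c_star, mask):
    """Assumed form of L_r: smooth L1 over 4-channel vertex-coordinate maps,
    averaged over the pixels inside hand boxes given by mask."""
    loss = F.smooth_l1_loss(c, c_star, reduction='none').sum(dim=1)
    return (loss * mask).sum() / (mask.sum() + 1e-6)

# End-to-end training with ADAM, stopping once L = L_c + L_r plateaus:
#   opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # lr is an assumption
#   loss = classification_loss(p, p_star) + regression_loss(c, c_star, p_star[:, 0])
#   loss.backward(); opt.step(); opt.zero_grad()
```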
Step 4) In the target-detection process, a large number of mutually overlapping candidate boxes arise at the same target position, each with a different confidence. Non-maximum suppression is applied to eliminate the redundant candidate boxes and obtain the optimal hand bounding box; a sketch follows.
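Standard non-maximum suppression can be sketched as follows; the IoU threshold of 0.5 is an assumption, since the patent does not state a value.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-confidence box, drop any remaining box whose IoU with
    it exceeds iou_thresh, and repeat on what is left."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # indices by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # discard heavily overlapping boxes
    return keep
```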
Step 5) output the detection results; the detection results are evaluated with the following objective quantitative indexes: average precision AP, average recall AR, the comprehensive index F1-score, and detection speed FPS;
assuming that TP denotes a real target that is detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed, then:
AP = TP / (TP + FP)
AR = TP / (TP + FN)
F1 = 2·AP·AR / (AP + AR)
FPS is reported as a frame rate; the sketch below illustrates the index computation.
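Given the TP/FP/FN counts defined above, the three indexes reduce to a few lines; this sketch assumes the counts have already been accumulated over the test set.

```python
def evaluate(tp: int, fp: int, fn: int):
    """AP, AR and F1 from accumulated detection counts (standard definitions)."""
    ap = tp / (tp + fp)            # fraction of detections that are real hands
    ar = tp / (tp + fn)            # fraction of real hands that were detected
    f1 = 2 * ap * ar / (ap + ar)   # harmonic mean of AP and AR
    return ap, ar, f1
```

With the test-set figures reported in Table 1 below (AP 98.3%, AR 86.7%), the F1 computed this way is about 92.1%, matching the tabulated 92.2 up to rounding.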
Performance on hand regions in RGB static images in the cab environment is evaluated by subjective visual inspection and by the objective quantitative indexes. Fig. 1 shows the detection results of a few typical examples; the method detects well under different illumination, different hand shapes, different hand sizes, and different numbers of hands.
The quantitative evaluation of the method on the test set is shown in Table 1, where its performance is compared with the best competition result on the VIVA data set.
Table 1. Quantitative evaluation indexes for hand-region detection on the test set

Method                          AP (%)   AR (%)   F1 (%)   FPS
This patent                     98.3     86.7     92.2     42
Background-art document [6]     94.8     74.7     -        4.65
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (7)

1. A method for detecting a hand area in a driving environment is characterized by comprising the following steps:
step 1) preparing a data set, wherein the data set is captured in a real driving environment by camera equipment mounted at different positions in the cab; dividing the data set into a training image set and a test image set, performing data expansion on the data set, and then generating new hand bounding boxes;
Step 2) constructing a hand detection convolutional neural network structure, and completing feature extraction and fusion by adopting a multi-scale framework and utilizing feature information on different scales;
step 3) performing end-to-end training with the ADAM (adaptive moment estimation) optimization algorithm, sampling randomly from the training image set, and stopping training when the loss function L is stable;
the loss function L is formulated as follows:
L = L_c + L_r (1)
wherein L_c evaluates whether pixels inside and outside the hand bounding box are correctly classified, and L_r evaluates whether the vertex positions of the hand bounding box are correctly regressed;
L_c = -α·p*·(1 - p)^γ·log(p) - (1 - α)·(1 - p*)·p^γ·log(1 - p) (2)
where p* denotes the true pixel classification result, p denotes the network-estimated probability that a pixel lies inside the hand bounding box, and α is the positive/negative sample balance factor:
[equation defining α, rendered as an image in the original]
γ is chosen empirically;
[equation (3), rendered as an image in the original: the regression loss L_r]
wherein C_i and C_i* respectively denote the regression result and the true value of the hand bounding-box coordinates;
[equation (4), rendered as an image in the original]
step 4) applying non-maximum suppression to eliminate redundant candidate boxes and obtain the optimal hand bounding box;
and step 5) outputting the detection results.
2. The method for detecting a hand region in a driving environment according to claim 1, wherein the training image set in step 1) is randomly divided into a training subset and a validation subset at a ratio of 9:1.
3. The method for detecting a hand region in a driving environment according to claim 1, wherein the data expansion methods for the data set in step 1) include horizontal flipping, vertical flipping, random-angle rotation, translation, Gaussian blur and sharpening, and the expanded training data increases to at least 22000 images.
4. The method for detecting a hand region in a driving environment according to claim 1, wherein the data expansion in step 1) follows these rules:
Expansion rule 1: brightness enhancement by a factor of 1.2-1.5, scaling by 0.7-1.5, translation of 40 pixels in the x direction and 60 pixels in the y direction;
Expansion rule 2: random edge cropping of 0-16 pixels, and horizontal flipping with 50% probability;
Expansion rule 3: vertical flipping in all cases, plus Gaussian blur with mean 0 and variance 3;
Expansion rule 4: random rotation with an upper angle limit of 45°, added Gaussian white noise at a 20% noise level, and random sharpening with 50% probability.
5. The method for detecting a hand region in a driving environment according to claim 1, wherein the new hand bounding box in step 1) is generated as follows: taking the four edges of the original hand bounding box as reference, each edge is shrunk inward by a specified length d = 0.2·l_min, where l_min is the shortest side length of the bounding box; the region inside the shrunken frame is labeled 1 and the region outside it is labeled 0.
6. The method for detecting a hand region in a driving environment according to claim 1, wherein the feature extraction and fusion in step 2) comprise three convolution modules and an up-sampling feature-fusion process, specifically as follows:
The input-layer image size is 256×256. The first convolution module ConvB_1 comprises two convolution layers and a max-pooling layer, with 3×3 convolution kernels and 64 channels; the second convolution module ConvB_2 comprises two convolution layers and a max-pooling layer, with 3×3 kernels and 128 channels; the third convolution module ConvB_3 comprises three convolution layers and a max-pooling layer, with 3×3 kernels and 256 channels; the pooling kernels are all 2×2 with stride 2;
the feature map output by the third convolution module ConvB_3 is up-sampled to double its size, 20% of the channels of the ConvB_2 output feature map are randomly removed with a Dropout mechanism, and the two are concatenated; the fused feature map FusF_1 is normalized and then fed into a cascaded 1×1 and 3×3 convolution group ConvC_1 with 128 channels in total; the output of this convolution passes through a 3×3 convolution layer with 32 kernels and is fed to the output layer; the output layer comprises two branches: branch 1 predicts, through a single-channel 1×1 convolution, the probability that each pixel lies inside the target region; branch 2 predicts the coordinate values of the target bounding-box vertices through a 4-channel 1×1 convolution.
7. The method for detecting a hand region in a driving environment according to claim 1, wherein the detection results in step 5) are evaluated with the following objective quantitative indexes: average precision AP, average recall AR, the comprehensive index F1-score, and detection speed FPS;
assuming that TP denotes a real target that is detected, FP denotes a detection that is not a real target, and FN denotes a real target that is missed, then:
AP = TP / (TP + FP)
AR = TP / (TP + FN)
F1 = 2·AP·AR / (AP + AR)
FPS is reported as a frame rate.
CN201910378179.7A 2019-05-08 2019-05-08 Detection method for hand area in driving environment Active CN110110665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910378179.7A CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910378179.7A CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Publications (2)

Publication Number Publication Date
CN110110665A CN110110665A (en) 2019-08-09
CN110110665B true CN110110665B (en) 2021-05-04

Family

ID=67488704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910378179.7A Active CN110110665B (en) 2019-05-08 2019-05-08 Detection method for hand area in driving environment

Country Status (1)

Country Link
CN (1) CN110110665B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364805B * 2020-11-21 2023-04-18 Xi'an Jiaotong University Rotary palm image detection method
CN112686888A * 2021-01-27 2021-04-20 Shanghai Electric Group Co., Ltd. Method, system, equipment and medium for detecting cracks of concrete sleeper


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10996372B2 (en) * 2017-08-25 2021-05-04 Exxonmobil Upstream Research Company Geophysical inversion with convolutional neural networks
CN108875732B * 2018-01-11 2022-07-12 Beijing Megvii Technology Co., Ltd. Model training and instance segmentation method, device and system and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129673A * 2011-04-19 2011-07-20 Dalian University of Technology Color digital image enhancing and denoising method under random illumination
CN109086779A * 2018-07-28 2018-12-25 Tianjin University Attention target identification method based on convolutional neural networks
CN109711288A * 2018-12-13 2019-05-03 Xidian University Remote sensing ship detection method based on feature pyramid and distance-constrained FCN
CN109635750A * 2018-12-14 2019-04-16 Guangxi Normal University Compound convolutional neural network gesture-image recognition method under complex background

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HBE: Hand Branch Ensemble Network for Real-time 3D Hand Pose Estimation; Yidan Zhou et al.; ECCV 2018; 2018-12-31; pp. 1-16 *
Image recognition with adaptively enhanced convolutional neural networks (自适应增强卷积神经网络图像识别); Liu Wanjun et al.; Journal of Image and Graphics (中国图象图形学报); 2017-12-31; pp. 1723-1736 *

Also Published As

Publication number Publication date
CN110110665A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
Ali et al. Structural crack detection using deep convolutional neural networks
CN109886986B (en) Dermatoscope image segmentation method based on multi-branch convolutional neural network
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN107358258B (en) SAR image target classification based on NSCT double CNN channels and selective attention mechanism
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN107909005A (en) Personage's gesture recognition method under monitoring scene based on deep learning
CN106157303A (en) A kind of method based on machine vision to Surface testing
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN109753996B (en) Hyperspectral image classification method based on three-dimensional lightweight depth network
CN107808376A (en) A kind of detection method of raising one's hand based on deep learning
CN110490924B (en) Light field image feature point detection method based on multi-scale Harris
CN103903275A (en) Method for improving image segmentation effects by using wavelet fusion algorithm
CN110110665B (en) Detection method for hand area in driving environment
Li et al. Research on a product quality monitoring method based on multi scale PP-YOLO
Wang et al. Segmentation of corn leaf disease based on fully convolution neural network
CN104376312B (en) Face identification method based on bag of words compressed sensing feature extraction
Peng et al. Litchi detection in the field using an improved YOLOv3 model
Li et al. The research on traffic sign recognition based on deep learning
CN111832508B (en) DIE _ GA-based low-illumination target detection method
CN105930789A (en) Human body behavior recognition based on logarithmic Euclidean space BOW (bag of words) model
Nie et al. Analysis on DeepLabV3+ performance for automatic steel defects detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant