CN107368182A - Gesture detection network training, gesture detection, and gesture control method and device - Google Patents

Gesture detection network training, gesture detection, and gesture control method and device

Info

Publication number
CN107368182A
CN107368182A (application CN201610696340.1A; granted as CN107368182B)
Authority
CN
China
Prior art keywords
image
information
human hand
convolutional neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610696340.1A
Other languages
Chinese (zh)
Other versions
CN107368182B (en)
Inventor
李全全
闫俊杰
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201610696340.1A (CN107368182B)
Priority to PCT/CN2017/098182 (WO2018033154A1)
Publication of CN107368182A
Application granted
Publication of CN107368182B
Active legal status (Critical, Current)
Anticipated expiration legal status (Critical)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G06V40/113 - Recognition of static hand signs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application disclose a gesture detection network training method, a gesture detection method, a gesture control method, and corresponding devices, relating to the field of image processing. The gesture detection network training includes: training a first convolutional neural network according to sample images containing human-hand annotation information, and obtaining the first convolutional neural network's prediction information for the human-hand candidate regions of the sample images; replacing the parameters of the second feature-extraction layer of a second convolutional neural network, which is used for gesture detection, with the parameters of the first feature-extraction layer of the trained first convolutional neural network; and training the parameters of the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images, keeping the second feature-extraction layer parameters unchanged during training. The scheme provided by the embodiments reduces the demands placed on the user in human-computer interaction and improves the user experience.

Description

Gesture detection network training, gesture detection, and gesture control method and device
Technical field
The present application relates to the field of image processing, and in particular to gesture detection network training, gesture detection, and gesture control methods and devices.
Background
With the rapid development of electronic technology, human-computer interaction is involved in more and more application scenarios. In human-computer interaction, a person's gesture must first be detected; only then can interactive operations be carried out according to the detected gesture.
In the prior art, gesture detection is mainly based on sensors, so the user must wear or hold a related device before gestures can be detected. Whether the user wears the device or holds it, the user must master the operating knowledge of that device, which places high demands on the user and results in a poor user experience.
Summary of the invention
The embodiments of the present application disclose a gesture detection and control scheme.
The embodiments of the present application disclose a gesture detection network training method, including:
training a first convolutional neural network according to sample images containing human-hand annotation information, and obtaining the first convolutional neural network's prediction information for the human-hand candidate regions of the sample images;
replacing the parameters of the second feature-extraction layer of a second convolutional neural network, which is used for gesture detection, with the parameters of the first feature-extraction layer of the trained first convolutional neural network;
training the parameters of the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images, while keeping the second feature-extraction layer parameters unchanged during training.
Optionally, training the parameters of the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images includes:
correcting the prediction information of the human-hand candidate regions;
training the parameters of the second convolutional neural network according to the corrected prediction information of the human-hand candidate regions and the sample images.
Optionally, the human-hand annotation information includes annotation information for the hand region.
Optionally, the human-hand annotation information includes annotation information for the gesture.
Optionally, the first convolutional neural network includes: a first input layer, a first feature-extraction layer, and a first classification output layer, where the first classification output layer is used to predict whether each of multiple candidate regions of a sample image is a human-hand candidate region.
Optionally, the second convolutional neural network includes: a second input layer, a second feature-extraction layer, and a second classification output layer, where the second classification output layer is used to output the gesture detection result for the sample image.
Optionally, the gesture detection result includes at least one of the following predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open hand, and closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
Optionally, correcting the prediction information of the human-hand candidate regions includes:
inputting multiple supplementary negative-sample images together with the prediction information of the human-hand candidate regions into a third convolutional neural network for classification, so as to filter out the negative samples among the human-hand candidate regions and obtain the corrected prediction information of the human-hand candidate regions.
Optionally, the difference between the number of human-hand candidate regions in the prediction information and the number of supplementary negative-sample images falls within a predetermined permissible range.
Optionally, the number of human-hand candidate regions in the prediction information equals the number of supplementary negative-sample images.
Optionally, the first convolutional neural network is an RPN, and/or the second convolutional neural network is an FRCNN.
Optionally, the third convolutional neural network is an FRCNN.
To achieve the above purpose, the embodiments of the present application disclose a gesture detection method, including:
detecting an image using a fourth convolutional neural network to obtain first feature information of the image and prediction information of the human-hand candidate regions, where the image is a still image or an image in a video;
using the first feature information and the prediction information of the human-hand candidate regions as the second feature information of a fifth convolutional neural network, and performing gesture detection on the image with the fifth convolutional neural network according to the second feature information, to obtain the gesture detection result of the image; the parameters of the fourth feature-extraction layer of the fourth convolutional neural network are identical to the parameters of the fifth feature-extraction layer of the fifth convolutional neural network.
Optionally, the fourth convolutional neural network includes: a fourth input layer, a fourth feature-extraction layer, and a fourth classification output layer, where the fourth classification output layer is used to detect whether each of the multiple candidate regions into which the image is divided is a human-hand candidate region.
Optionally, the fifth convolutional neural network includes: a fifth input layer, a fifth feature-extraction layer, and a fifth classification output layer, where the fifth classification output layer is used to output the gesture detection result of the image.
Optionally, the gesture detection result includes at least one of the predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open hand, and closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
To achieve the above purpose, the embodiments of the present application disclose a gesture control method, including:
detecting an image using a gesture detection network trained with the above gesture detection network training method, or detecting an image using the above gesture detection method, to obtain the gesture detection result of the image, where the image is a still image or an image in a video;
triggering a corresponding control operation according to at least the gesture detection result of the image.
Optionally, triggering a corresponding control operation according to at least the gesture detection result of the image includes:
recording the number of times the same gesture detection result is obtained from consecutive detections of images in a video within a period;
when the recorded number satisfies a predetermined condition, triggering the corresponding control operation according to the gesture detection result.
Optionally, triggering the corresponding control operation according to the gesture detection result includes:
determining the control instruction corresponding to the gesture detection result;
triggering the corresponding operation according to the control instruction.
To achieve the above purpose, the embodiments of the present application disclose a gesture detection network training device, including:
a first training module, configured to train a first convolutional neural network according to sample images containing human-hand annotation information and to obtain the first convolutional neural network's prediction information for the human-hand candidate regions of the sample images;
a parameter replacement module, configured to replace the parameters of the second feature-extraction layer of a second convolutional neural network, which is used for gesture detection, with the parameters of the first feature-extraction layer of the trained first convolutional neural network;
a second training module, configured to train the parameters of the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images, keeping the second feature-extraction layer parameters unchanged during training.
Optionally, the second training module includes:
a correction submodule, configured to correct the prediction information of the human-hand candidate regions;
a training submodule, configured to train the parameters of the second convolutional neural network according to the corrected prediction information of the human-hand candidate regions and the sample images, keeping the second feature-extraction layer parameters unchanged during training.
Optionally, the human-hand annotation information includes annotation information for the hand region.
Optionally, the human-hand annotation information includes annotation information for the gesture.
Optionally, the first convolutional neural network includes: a first input layer, a first feature-extraction layer, and a first classification output layer, where the first classification output layer is used to predict whether each of multiple candidate regions of a sample image is a human-hand candidate region.
Optionally, the second convolutional neural network includes: a second input layer, a second feature-extraction layer, and a second classification output layer, where the second classification output layer is used to output the gesture detection result for the sample image.
Optionally, the gesture detection result includes at least one of the predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open hand, and closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
Optionally, the correction submodule is specifically configured to input multiple supplementary negative-sample images together with the prediction information of the human-hand candidate regions into a third convolutional neural network for classification, so as to filter out the negative samples among the human-hand candidate regions and obtain the corrected prediction information of the human-hand candidate regions.
Optionally, the difference between the number of human-hand candidate regions in the prediction information and the number of supplementary negative-sample images falls within a predetermined permissible range.
Optionally, the number of human-hand candidate regions in the prediction information equals the number of supplementary negative-sample images.
Optionally, the first convolutional neural network is an RPN, and/or the second convolutional neural network is an FRCNN.
Optionally, the third convolutional neural network is an FRCNN.
To achieve the above purpose, the embodiments of the present application disclose a gesture detection device, including:
a first acquisition module, configured to detect an image using a fourth convolutional neural network, obtaining first feature information of the image and prediction information of the human-hand candidate regions, where the image is a still image or an image in a video;
a detection module, configured to use the first feature information and the prediction information of the human-hand candidate regions as the second feature information of a fifth convolutional neural network, and to perform gesture detection on the image with the fifth convolutional neural network according to the second feature information, obtaining the gesture detection result of the image; the parameters of the fourth feature-extraction layer of the fourth convolutional neural network are identical to the parameters of the fifth feature-extraction layer of the fifth convolutional neural network.
Optionally, the fourth convolutional neural network includes: a fourth input layer, a fourth feature-extraction layer, and a fourth classification output layer, where the fourth classification output layer is used to detect whether each of the multiple candidate regions into which the image is divided is a human-hand candidate region.
Optionally, the fifth convolutional neural network includes: a fifth input layer, a fifth feature-extraction layer, and a fifth classification output layer, where the fifth classification output layer is used to output the gesture detection result of the image.
Optionally, the gesture detection result includes at least one of the predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open hand, and closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
To achieve the above purpose, the embodiments of the present application disclose a gesture control device, including:
a second acquisition module, configured to detect an image using a gesture detection network trained by the above gesture detection network training device, or to detect an image using the above gesture detection device, obtaining the gesture detection result of the image, where the image is a still image or an image in a video;
a trigger module, configured to trigger a corresponding control operation according to at least the gesture detection result of the image.
Optionally, the trigger module includes:
a recording submodule, configured to record the number of times the same gesture detection result is obtained from consecutive detections of images in a video within a period;
a trigger submodule, configured to trigger the corresponding control operation according to the gesture detection result when the recorded number satisfies a predetermined condition.
Optionally, the trigger submodule includes:
a determining unit, configured to determine the control instruction corresponding to the gesture detection result when the recorded number satisfies the predetermined condition;
a trigger unit, configured to trigger the corresponding operation according to the control instruction.
To achieve the above purpose, the embodiments of the present application disclose an application program, which, when run, executes the above gesture detection network training method, the above gesture detection method, or the above gesture control method.
To achieve the above purpose, the embodiments of the present application disclose an electronic device, including: a housing, a processor, a memory, a circuit board, and a power supply circuit, where the circuit board is arranged inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit supplies power to each circuit or component of the device; the memory stores executable program code; and the processor, by reading the executable program code stored in the memory, runs the program corresponding to the executable program code, so as to perform the above gesture detection network training method, the above gesture detection method, or the above gesture control method.
As can be seen from the above, in the embodiments of the present application a first convolutional neural network is trained according to sample images containing human-hand annotation information, and the first network's prediction information for the human-hand candidate regions of the sample images is obtained; the parameters of the second feature-extraction layer of a second convolutional neural network, used for gesture detection, are replaced with the parameters of the first feature-extraction layer of the trained first network; and the parameters of the second convolutional neural network are trained according to the prediction information of the human-hand candidate regions and the sample images, with the second feature-extraction layer parameters kept unchanged during training. The schemes provided by the embodiments can thus train a gesture detection network, and the convolutional neural networks obtained by the above training can perform gesture detection without the user wearing or holding any device, and therefore without the user having to master the operating knowledge of such a device. This reduces the demands placed on the user in human-computer interaction and improves the user experience.
Brief description of the drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Clearly, the drawings described below are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a gesture detection network training method provided by an embodiment of the present application;
Fig. 2a is a schematic diagram of a gesture provided by an embodiment of the present application;
Fig. 2b is a schematic diagram of another gesture provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a gesture detection network training system provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of another gesture detection network training method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of a gesture detection method provided by an embodiment of the present application;
Fig. 6 is a schematic flowchart of a gesture control method provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a gesture detection network training device provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of another gesture detection network training device provided by an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a gesture detection device provided by an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a gesture control device provided by an embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Clearly, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a gesture detection network training method provided by an embodiment of the present application. The method includes:
S101: training a first convolutional neural network according to sample images containing human-hand annotation information, and obtaining the first convolutional neural network's prediction information for the human-hand candidate regions of the sample images.
The sample images may be images in RGB format or in other formats, for example YUV; the present application does not restrict this.
Those skilled in the art will understand that the higher the resolution of an image, the larger its data volume; gesture detection on it then requires more computing resources and runs more slowly. In view of this, in one specific implementation of the present application, the sample images may be images satisfying a preset resolution condition. For example, the preset resolution condition may be: the longest edge of the image does not exceed 640 pixels and the shortest edge does not exceed 480 pixels.
In addition, the sample images may be acquired by an image acquisition device. Since in practical applications the hardware parameters and settings of image acquisition devices differ, the acquired images may not satisfy the above preset resolution condition. To obtain target images that satisfy the preset resolution condition, in an optional implementation of the present application the acquired images may be scaled after acquisition to obtain the sample images.
Specifically, the human-hand annotation information may include annotation information for the hand region.
Optionally, the human-hand annotation information may also include annotation information for the gesture.
The present application only uses the above as examples and does not restrict the specific form of the human-hand annotation information in practical applications.
In one implementation of the present application, the first convolutional neural network may include: a first input layer, a first feature-extraction layer, and a first classification output layer, where the first classification output layer is used to predict whether each of multiple candidate regions of a sample image is a human-hand candidate region.
It should be noted that the layers of the first convolutional neural network are divided only by function. Specifically, the first feature-extraction layer may consist of convolutional layers, of convolutional layers and nonlinear transformation layers, or of convolutional layers, nonlinear transformation layers, and pooling layers. The output of the first classification output layer can be understood as a binary classification result; it can be implemented by a convolutional layer, but is not limited to that implementation.
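For illustration, a feature-extraction layer of the kind just described, built from convolution, nonlinear transformation, and pooling layers, might be sketched in PyTorch as follows (the channel counts and layer depths are arbitrary assumptions, not values fixed by the embodiments):

```python
import torch.nn as nn

# Hypothetical feature-extraction layer: convolution + nonlinearity + pooling.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
)
```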
When training the first convolutional neural network, the first input layer parameters, the first feature-extraction layer parameters, and the first classification output layer parameters may first be trained, and the first convolutional neural network then built from the obtained parameters.
Specifically, training the first convolutional neural network with sample images can be understood as training an initial model of the first convolutional neural network with the sample images to obtain the final first convolutional neural network. When training the initial model with the sample images, gradient descent and the back-propagation algorithm may be used.
The initial model of the first convolutional neural network may be determined by factors such as a manually set number of convolutional layers and the number of neurons in each convolutional layer; the number of layers, neurons, and so on can be determined according to actual requirements.
When annotating the sample images, manual annotation may be used to ensure accurate annotations. The hand region in a sample image may be the smallest rectangular region that covers the whole hand, and the hand gesture may be, for example, an open gesture or a closed gesture. Fig. 2a and Fig. 2b show schematic diagrams of two gestures; the grey rectangular box in each figure is the hand region, with Fig. 2a showing an open gesture and Fig. 2b a closed gesture.
In addition, to make the trained first convolutional neural network more accurate, sample images covering a variety of situations may be selected, including positive sample images that contain hands and negative sample images that do not.
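The following is a minimal sketch of the gradient-descent and back-propagation training just described, assuming a proposal model such as the feature-extractor sketch above with a binary classification head, and a `sample_loader` that yields annotated sample images (all names here are illustrative, not part of the embodiments):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.BCEWithLogitsLoss()

for images, hand_labels in sample_loader:   # labels derived from the hand annotations
    optimizer.zero_grad()
    scores = model(images)                  # per-candidate-region hand / not-hand scores
    loss = criterion(scores, hand_labels)   # binary classification loss
    loss.backward()                         # back-propagation
    optimizer.step()                        # gradient-descent update
```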
S102: replacing the parameters of the second feature-extraction layer of a second convolutional neural network, which is used for gesture detection, with the parameters of the first feature-extraction layer of the trained first convolutional neural network.
In one implementation of the present application, the second convolutional neural network may include: a second input layer, a second feature-extraction layer, and a second classification output layer, where the second classification output layer is used to output the gesture detection result for a sample image.
The second feature-extraction layer is similar to the first feature-extraction layer described above and is not described again here. The output of the second classification output layer can be understood as a multi-class classification result; it can be implemented by a fully connected layer, but is not limited to that implementation.
It is worth noting that in this step the first feature-extraction layer parameters of the trained first convolutional neural network are used directly as the second feature-extraction layer parameters, eliminating the need to train the feature-extraction layer of the second network. In other words, the first and second convolutional neural networks are trained jointly and share a feature-extraction layer, which greatly increases the training speed of the networks compared with training each one separately as in the prior art.
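A minimal sketch of this parameter sharing in PyTorch, assuming both networks expose their feature-extraction layer as a `.features` submodule of identical shape (an assumption of the sketch, not a requirement stated here):

```python
import torch

# Copy the trained first feature-extraction layer parameters into the second network.
second_net.features.load_state_dict(first_net.features.state_dict())

# Keep the shared feature-extraction parameters unchanged during S103.
for param in second_net.features.parameters():
    param.requires_grad = False

# Only the remaining (classification) parameters of the second network are trained.
optimizer = torch.optim.SGD(
    [p for p in second_net.parameters() if p.requires_grad], lr=0.001)
```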
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open hand, and closed hand.
In addition, the gesture detection result may also include a non-predetermined gesture type, which can be understood as a gesture type other than the predetermined gesture types, or as representing the "no gesture" case. This can further improve the gesture classification accuracy of the second convolutional neural network.
It should be noted that the present application only uses the above gesture types as examples; in practice the predetermined gesture types are not limited to those listed.
In one implementation of the present application, the first convolutional neural network is an RPN (Region Proposal Network) and/or the second convolutional neural network is an FRCNN (Fast R-CNN).
In addition, the first convolutional neural network may also be another binary or multi-class CNN (convolutional neural network), or a Multi-Box Network, YOLO, and so on; the second convolutional neural network may also be another multi-class CNN, a recurrent neural network, and so on. The present application does not restrict this.
S103: training the parameters of the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images, while keeping the second feature-extraction layer parameters unchanged during training.
When training the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images, the hand gestures in the sample images may be annotated again, that is, the gesture of each hand is annotated as open, closed, and so on; the initial model of the second convolutional neural network is then trained based on these annotations and the above prediction information to obtain the final second convolutional neural network.
Specifically, Fig. 3 shows a schematic structural diagram of a gesture detection network training system. When this system is used for gesture detection network training, a sample image is input into the first input layer as the input of the first convolutional neural network; after the first feature-extraction layer performs feature extraction, the first classification output layer performs classification, yielding the prediction information for the human-hand candidate regions of the sample image and the output parameters of the first convolutional neural network. The sample image and the above prediction information then serve as the input of the second convolutional neural network and are fed to the second input layer; after the second feature-extraction layer performs feature extraction, the second classification output layer performs classification and the second convolutional neural network parameters are obtained. At this point the training of the first and second convolutional neural networks is complete. The first feature-extraction layer parameters are identical to the second feature-extraction layer parameters.
As can be seen from the above, in this embodiment a first convolutional neural network is trained according to sample images containing human-hand annotation information, yielding its prediction information for the human-hand candidate regions of the sample images; the second feature-extraction layer parameters of a second convolutional neural network used for gesture detection are replaced with the first feature-extraction layer parameters of the trained first network; and the second network's parameters are trained according to the prediction information of the human-hand candidate regions and the sample images, with the second feature-extraction layer parameters kept unchanged during training. The scheme provided by this embodiment can thus train a gesture detection network, and the resulting convolutional neural networks can perform gesture detection without the user wearing or holding any device and without the user having to master the operating knowledge of such a device, reducing the demands placed on the user in human-computer interaction and improving the user experience.
In one specific implementation of the present application, Fig. 4 provides a schematic flowchart of another gesture detection network training method. Compared with the previous embodiment, in this embodiment training the second convolutional neural network parameters according to the prediction information of the human-hand candidate regions and the sample images while keeping the second feature-extraction layer parameters unchanged (S103) includes:
S103A: correcting the prediction information of the human-hand candidate regions.
Specifically, when correcting the prediction information of the human-hand candidate regions, multiple supplementary negative-sample images together with the prediction information may be input into a third convolutional neural network for classification, so as to filter out the negative samples among the human-hand candidate regions and obtain the corrected prediction information of the human-hand candidate regions.
It should be noted that the supplementary negative-sample images serve only as input to the third convolutional neural network, not as input to the first or second convolutional neural networks. A supplementary negative-sample image may be a blank image containing no hand, or an image that contains a hand-like region (which is not a hand) and is not annotated as containing a hand.
Specifically, the third convolutional neural network may be an FRCNN; of course, it may also be another binary or multi-class CNN.
Optionally, the difference between the number of human-hand candidate regions in the prediction information and the number of supplementary negative-sample images falls within a predetermined permissible range. When the difference falls within this range, the number of candidate regions and the number of supplementary negative-sample images can be considered equal or close; the permissible range is therefore generally small, with the specific value determined by actual conditions.
Preferably, the number of human-hand candidate regions in the prediction information equals the number of supplementary negative-sample images; in that case the positive-sample rate of the human-hand candidate regions produced by the third network is clearly improved substantially.
In addition, the prediction information of the human-hand candidate regions may also be corrected manually by annotators; the present application does not restrict this.
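A plain-Python sketch of this correction step, assuming `third_net` maps a batch of region crops to hand probabilities and that the supplementary negatives are drawn from a prepared pool (every name and the 0.5 threshold are illustrative assumptions):

```python
import random

def correct_candidates(third_net, candidate_crops, negative_pool, threshold=0.5):
    """Filter the predicted human-hand candidate regions with the third network.

    An equal number of supplementary negative-sample images is mixed into the
    classification pass, matching the 'equal quantities' option above.
    """
    negatives = random.sample(negative_pool, k=len(candidate_crops))
    scores = third_net(candidate_crops + negatives)   # probability of being a hand
    kept = [crop for crop, score in zip(candidate_crops, scores)
            if score >= threshold]                    # drop negatives among candidates
    return kept
```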
S103B: training the second convolutional neural network parameters according to the corrected prediction information of the human-hand candidate regions and the sample images.
Because the prediction results of the first convolutional neural network may contain considerable error, training the second convolutional neural network with the first network's prediction information gives poor accuracy, and the corrected prediction information of the human-hand candidate regions is much more accurate than the prediction information obtained from the first network. Therefore, the second convolutional neural network obtained by training with the corrected prediction information and the sample images has higher accuracy.
As can be seen from the above, in the scheme provided by this embodiment the prediction information of the human-hand candidate regions is corrected, and the second convolutional neural network parameters are then trained according to the corrected prediction information and the sample images, improving the accuracy of the trained second convolutional neural network.
Fig. 5 is a schematic flowchart of a gesture detection method provided by an embodiment of the present application. The method includes:
S501: detecting an image using a fourth convolutional neural network, obtaining first feature information of the image and prediction information of the human-hand candidate regions, where the image is a still image or an image in a video.
Specifically, the fourth convolutional neural network may include: a fourth input layer, a fourth feature-extraction layer, and a fourth classification output layer, where the fourth classification output layer is used to detect whether each of the multiple candidate regions into which the image is divided is a human-hand candidate region.
It should be noted that the layers of the fourth convolutional neural network are divided only by function. Specifically, the fourth feature-extraction layer may consist of convolutional layers, of convolutional layers and nonlinear transformation layers, or of convolutional layers, nonlinear transformation layers, and pooling layers. The output of the fourth classification output layer can be understood as a binary classification result; it can be implemented by a convolutional layer, but is not limited to that implementation.
The fourth convolutional neural network is an RPN; it may also be another binary or multi-class CNN, or a Multi-Box Network, YOLO, and so on.
In addition, the fourth convolutional neural network may be identical to the first convolutional neural network described above and is not detailed again here.
Specifically, the fourth convolutional neural network may include an input layer, an output layer, and multiple convolutional layers. Processing the image with the multiple convolutional layers amounts to performing feature extraction on the image. When the fourth convolutional neural network obtains the candidate hand regions in the image, it receives the image through the input layer, extracts the image's features through the convolutional layers, determines the candidate hand regions in the image by combining the extracted features, and then outputs the result through the output layer.
Put simply, the fourth convolutional neural network can be understood as performing binary classification on the regions of the image, distinguishing whether a region is a hand region; that is, candidate hand regions are found in the image and then classified into two classes.
S502: using the first feature information and the prediction information of the human-hand candidate regions as the second feature information of a fifth convolutional neural network, and performing gesture detection on the image with the fifth convolutional neural network according to the second feature information, to obtain the gesture detection result of the image; the parameters of the fourth feature-extraction layer of the fourth convolutional neural network are identical to the parameters of the fifth feature-extraction layer of the fifth convolutional neural network.
Specifically, the fifth convolutional neural network may include: a fifth input layer, a fifth feature-extraction layer, and a fifth classification output layer, where the fifth classification output layer is used to output the gesture detection result of the image.
The fifth feature-extraction layer is similar to the fourth feature-extraction layer described above and is not described again here. The output of the fifth classification output layer can be understood as a multi-class classification result; it can be implemented by a fully connected layer, but is not limited to that implementation.
The fifth convolutional neural network is an FRCNN; it may also be another multi-class CNN (convolutional neural network), a recurrent neural network, and so on.
In addition, the fifth convolutional neural network may be identical to the second convolutional neural network described above and is not detailed again here.
The fifth convolutional neural network may include: an input layer, an output layer, multiple convolutional layers, and multiple fully connected layers. The convolutional layers are mainly used for feature extraction, while the fully connected layers act as a classifier that classifies the features extracted by the convolutional layers. When the fifth convolutional neural network obtains the gesture detection result for the image, it receives the candidate hand regions through the input layer, extracts the features of the candidate hand regions through the convolutional layers, and classifies them through the fully connected layers according to those features, determining whether the image contains a hand and, if so, the hand's gesture; the classification result is finally output through the output layer.
Put simply, the fifth convolutional neural network is mainly used to solve a multi-class problem, namely distinguishing the type of a hand region, for example open, closed, or not a hand.
Specifically, the gesture detection result may include at least one of the predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open hand, and closed hand.
In addition, the gesture detection result may also include a non-predetermined gesture type, that is, a gesture type other than the predetermined gesture types; this can further improve the gesture classification accuracy of the network.
It should be noted that the present application only uses the above gesture types as examples; in practice the predetermined gesture types are not limited to those listed.
Specifically, the gesture detection result for an image may include: no hand, open gesture, closed gesture, and so on; the present application does not restrict this. When the gesture detection result indicates that a hand is present, the probabilities that the hand gesture is an open gesture or a closed gesture can be expressed: when the probability of the open gesture is high, the image can be considered to contain a hand with an open gesture; when the probability of the closed gesture is high, the image can be considered to contain a hand with a closed gesture.
Of course, in an optional implementation of the present application, the output of the fifth convolutional neural network model may include: the probability that a candidate hand region contains no hand, the probability that it contains a hand with an open gesture, the probability that it contains a hand with a closed gesture, and so on.
It should be noted that "multiple" in this specification can be understood as: at least two.
The above image-based gesture detection method is illustrated below with an RPN convolutional neural network and a Fast R-CNN convolutional neural network as examples.
The target image is processed with the RPN convolutional neural network, obtaining the first feature information in the target image and the prediction information of the human-hand candidate regions, and accordingly the positional information of the human-hand candidate regions. This positional information is then input into the Fast R-CNN convolutional neural network as the information to be examined; the last layer of the Fast R-CNN outputs feature values, according to which it is judged whether the region corresponding to the positional information is a hand region and, if it is a hand, whether the specific gesture is an open gesture or a closed gesture.
Specifically, the last layer of the Fast R-CNN convolutional neural network consists of a fully connected layer with two outputs followed by a softmax layer, forming a binary branch whose outputs can be taken as the probability of being a hand and the probability of not being a hand; the case corresponding to the larger of the two probabilities is taken as the judgment result.
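The two-way branch just described can be read off as follows; a sketch assuming the fully connected layer's two raw outputs are available as a tensor (the function name and output ordering are illustrative):

```python
import torch
import torch.nn.functional as F

def judge_region(fc_logits):
    """Map the two fully connected outputs through softmax and take the
    higher of [p(hand), p(not hand)] as the judgment result."""
    probs = F.softmax(fc_logits, dim=-1)
    idx = int(torch.argmax(probs))
    return idx == 0, float(probs[idx])    # (is_hand, confidence)

# Example: judge_region(torch.tensor([2.3, -1.1])) returns (True, ~0.97).
```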
As can be seen from the above, in each of the above embodiments an image is detected with a fourth convolutional neural network, obtaining the first feature information of the image and the prediction information of the human-hand candidate regions; the first feature information and the prediction information serve as the second feature information of a fifth convolutional neural network, which performs gesture detection on the image according to the second feature information, obtaining the gesture detection result of the image. With the schemes provided by these embodiments, gestures are detected without the user wearing or holding any device, and therefore without the user having to master the operating knowledge of such a device, reducing the demands placed on the user in human-computer interaction and improving the user experience.
Fig. 6 is a schematic flowchart of a gesture control method provided by an embodiment of the present application. The method includes:
S601: detecting an image using a gesture detection network trained with the above gesture detection network training method, or detecting an image using the above gesture detection method, obtaining the gesture detection result of the image, where the image is a still image or an image in a video.
S602: triggering a corresponding control operation according to at least the gesture detection result of the image.
Specifically, triggering a corresponding control operation according to the gesture detection result of the image can be understood as directly triggering the control operation without needing to look up a control instruction, i.e., there is a direct correspondence between gestures and control operations;
alternatively, it can be understood as: there is a correspondence between gestures and control operations, the control instruction corresponding to the gesture is first found according to that correspondence, and the control operation is then triggered by sending the control instruction.
It should be noted that the present application only uses the above as examples; they do not limit the present application.
In one implementation of the present application, when triggering a corresponding control operation according to at least the gesture detection result of the image, the number of times the same gesture detection result is obtained from consecutive detections of images in a video within a period may first be recorded; then, when the recorded number satisfies a predetermined condition, the corresponding control operation is triggered according to the gesture detection result.
The predetermined condition may be that the number of times the open gesture is detected within a certain time exceeds a preset value, that the number of times the closed gesture is detected within a certain time exceeds a preset threshold, and so on.
Specifically, when triggering the corresponding control operation according to the gesture detection result, the control instruction corresponding to the gesture detection result is first determined, and the corresponding operation is then triggered according to that control instruction.
Specifically, a mapping between control instructions and gestures may be established in advance, and the control instruction corresponding to a gesture detection result is then determined according to this pre-established mapping.
The mapping may be, for example (a sketch of such a table is given below):
an open gesture corresponds to opening the system;
a wave gesture corresponds to closing the system;
a scissors-hand gesture corresponds to a cut operation;
a fist gesture corresponds to a paste operation;
a thumbs-up gesture corresponds to a like operation;
a palm-up gesture corresponds to a select operation; and so on.
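A sketch of such a pre-established mapping in Python; the gesture keys and instruction names are placeholders, and `dispatch` stands in for whatever component issues the control instruction:

```python
GESTURE_TO_INSTRUCTION = {
    "open":      "open_system",
    "wave":      "close_system",
    "scissors":  "cut",
    "fist":      "paste",
    "thumbs_up": "like",
    "palm_up":   "select",
}

def trigger_control(gesture_result, dispatch):
    """Look up the control instruction for a gesture detection result and
    trigger the corresponding operation by sending that instruction."""
    instruction = GESTURE_TO_INSTRUCTION.get(gesture_result)
    if instruction is not None:
        dispatch(instruction)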
The present application only uses the above as examples and does not restrict the specific form of the mapping in practical applications.
After hand gesture information is detected, when human-computer interaction is further performed according to it, factors such as the precision of the detection algorithm are taken into account; typically the detection results for images acquired by the image acquisition device over a period of time, for example 2 or 3 seconds, are used as the basis for the interaction.
The above control method is illustrated below with a specific example.
Suppose the trigger action for human-computer interaction is: the hand is detected as continuously open for more than 2 seconds and then continuously closed for more than 2 seconds. If gestures can be detected in 20 frames per second, the corresponding trigger action is that the open gesture is detected in more than 40 consecutive frames and the closed gesture is then detected in more than 40 consecutive frames.
Detecting a target image may yield three kinds of result: no hand region, an open gesture, or a closed gesture.
If no hand region is detected in the target image, the open-gesture counter and the closed-gesture counter are reset, and detection continues with the next frame.
If a hand region is detected in the target image, the hand gesture is further determined. If it is an open gesture, the open-gesture counter is incremented by 1, the closed-gesture counter is reset, and gesture detection starts on the next frame. If it is a closed gesture, it is further determined whether the open-gesture counter has accumulated more than 40: if not, the open-gesture counter and the closed-gesture counter are reset and detection continues with the next frame; if so, the closed-gesture counter is incremented by 1, and it is judged whether the closed-gesture counter exceeds 40. If it exceeds 40, the human-computer interaction is triggered; if not, detection continues with the next frame.
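The counter logic of this example can be sketched directly (40 frames corresponds to the 2 seconds at 20 frames per second assumed above; the result labels are illustrative):

```python
FRAMES_PER_SECOND = 20
REQUIRED_FRAMES = 2 * FRAMES_PER_SECOND   # "more than 2 seconds" -> more than 40 frames

open_count = 0
closed_count = 0

def on_frame(result):
    """Update the counters for one frame's result ('open', 'closed', 'no_hand')
    and return True when the open-then-closed trigger condition is met."""
    global open_count, closed_count
    if result == "no_hand":
        open_count = closed_count = 0           # reset both counters
    elif result == "open":
        open_count += 1                          # accumulate open-gesture frames
        closed_count = 0                         # reset the closed-gesture counter
    elif result == "closed":
        if open_count <= REQUIRED_FRAMES:        # open phase not long enough
            open_count = closed_count = 0
        else:
            closed_count += 1
            if closed_count > REQUIRED_FRAMES:   # closed phase long enough: trigger
                return True
    return False
```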
As can be seen from the above, when control is performed with the above embodiments, an image is detected with a gesture detection network trained by the above gesture detection network training method, or with the above gesture detection method, the gesture detection result of the image is obtained, and a corresponding control operation is triggered according to at least that result. When operations are controlled with the schemes provided by these embodiments, the user does not need to wear or hold any device, and therefore does not need to master the operation and control of such a device, reducing the demands placed on the user in human-computer interaction and improving the user experience.
Corresponding to the above gesture detection network training method, an embodiment of the present application further provides a gesture detection network training device.
A kind of structural representation for gestures detection network training device that Fig. 7 provides for the embodiment of the present application, the device bag Include:
First training module 701, for training the first convolution nerve net according to the sample image containing human hand markup information Network, obtain information of forecasting of first convolutional neural networks for the human hand candidate region of the sample image;
Parameter replacement module 702, for by the second feature extract layer for the second convolutional neural networks of detection gesture Parameter, replace with the fisrt feature extraction layer parameter of first convolutional neural networks after training;
Second training module 703, trained for the information of forecasting according to the human hand candidate region and the sample image The second convolution neural network parameter, and the second feature extract layer parameter constant is kept in the training process.
Specifically, the human hand annotation information may include annotation information of a human hand region.
Specifically, the human hand annotation information may include annotation information of a gesture.
Specifically, the first convolutional neural network may include: a first input layer, a first feature extraction layer, and a first classification output layer, the first classification output layer being used to predict whether multiple candidate regions of the sample image are human hand candidate regions.
Specifically, the second convolutional neural network may include: a second input layer, a second feature extraction layer, and a second classification output layer, the second classification output layer being used to output the gesture detection result of the sample image.
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open, closed.
Specifically, the gesture detection result may also include: a non-predetermined gesture type.
Specifically, the first convolutional neural network may be an RPN, and/or the second convolutional neural network may be an FRCNN.
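To make the two-stage procedure concrete, a minimal PyTorch-style sketch follows. The layer sizes, module names, and optimizer are illustrative assumptions (the embodiment does not prescribe a framework); only the parameter copy and the freezing step mirror the modules described above:

    import torch
    import torch.nn as nn

    class FirstNet(nn.Module):       # stand-in for the first network (e.g., RPN-style)
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(            # first feature extraction layer
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
            self.cls_head = nn.Conv2d(128, 2, 1)      # hand / not-hand per region

    class SecondNet(nn.Module):      # stand-in for the second network (e.g., FRCNN-style)
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(            # second feature extraction layer
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
            self.gesture_head = nn.Conv2d(128, 11, 1) # 10 gesture types + "other"

    first_net = FirstNet()
    # ... stage 1: train first_net on the annotated sample images and collect its
    #     prediction information for the human hand candidate regions ...

    second_net = SecondNet()
    # Replace the second feature extraction layer parameters with the first
    # feature extraction layer parameters of the trained first network.
    second_net.features.load_state_dict(first_net.features.state_dict())

    # Keep the copied parameters unchanged while the rest of the network trains.
    for p in second_net.features.parameters():
        p.requires_grad = False
    optimizer = torch.optim.SGD(
        [p for p in second_net.parameters() if p.requires_grad], lr=1e-3)
    # ... stage 2: train second_net's remaining parameters on the candidate-region
    #     prediction information and the sample images using this optimizer ...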
As can be seen from the above, in this embodiment a first convolutional neural network is trained according to sample images containing human hand annotation information, and prediction information of the first convolutional neural network for human hand candidate regions of the sample images is obtained; second feature extraction layer parameters of a second convolutional neural network for gesture detection are replaced with first feature extraction layer parameters of the trained first convolutional neural network; and the second convolutional neural network parameters are trained according to the prediction information of the human hand candidate regions and the sample images, with the second feature extraction layer parameters kept unchanged during training. Thus, the solution provided by this embodiment can train a gesture detection network, and the convolutional neural networks obtained by this training can perform gesture detection without requiring the user to wear or hold any device, or to master the technical knowledge of operating the relevant device, which lowers the requirements on the user in human-computer interaction and improves user experience.
In a specific implementation of the present application, referring to Fig. 8, a schematic structural diagram of another gesture detection network training device is provided. Compared with the previous embodiment, in this embodiment the second training module 703 includes:
a correction submodule 703A, configured to correct the prediction information of the human hand candidate regions;
a training submodule 703B, configured to train the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.
Specifically, the correction submodule 703A is configured to input multiple supplementary negative sample images and the prediction information of the human hand candidate regions into a third convolutional neural network for classification, so as to filter out negative samples among the human hand candidate regions and obtain corrected prediction information of the human hand candidate regions.
Specifically, the difference between the number of human hand candidate regions in the prediction information and the number of supplementary negative sample images falls within a predetermined allowable range.
Specifically, the number of human hand candidate regions in the prediction information and the number of supplementary negative sample images are equal.
Specifically, the third convolutional neural network may be an FRCNN.
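A minimal sketch of this correction step follows; the third_net classifier interface, the acceptance threshold, and the batch composition are assumptions for illustration:

    import torch

    def correct_candidates(third_net, candidate_patches, negative_patches,
                           hand_threshold=0.5):
        # candidate_patches and negative_patches are tensors of equally sized
        # image crops; the number of supplementary negative images is chosen
        # close to (or equal to) the number of candidate regions.
        batch = torch.cat([candidate_patches, negative_patches], dim=0)
        with torch.no_grad():
            scores = third_net(batch).softmax(dim=1)[:, 1]   # P(hand) per crop
        # Keep only the candidate regions the classifier accepts as hands; the
        # rejected ones are the filtered-out negative samples.
        keep = scores[: len(candidate_patches)] > hand_threshold
        return candidate_patches[keep]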
As can be seen from the above, in the solution provided by this embodiment, the prediction information of the human hand candidate regions is corrected, and the second convolutional neural network parameters are then trained according to the corrected prediction information and the sample images, which improves the accuracy of the trained second convolutional neural network.
Corresponding to the above gesture detection method, an embodiment of the present application further provides a gesture detection device.
Fig. 9 is a schematic structural diagram of a gesture detection device provided by an embodiment of the present application. The device includes:
a first obtaining module 901, configured to detect an image using a fourth convolutional neural network, and to obtain first feature information of the image and prediction information of human hand candidate regions, the image including a still image or an image in a video;
a detection module 902, configured to use the first feature information and the prediction information of the human hand candidate regions as second feature information of a fifth convolutional neural network, and to perform gesture detection on the image according to the second feature information using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein fourth feature extraction layer parameters of the fourth convolutional neural network are identical to fifth feature extraction layer parameters of the fifth convolutional neural network.
Specifically, the fourth convolutional neural network may include: a fourth input layer, a fourth feature extraction layer, and a fourth classification output layer, the fourth classification output layer being used to detect whether multiple candidate regions into which the image is divided are human hand candidate regions.
Specifically, the fifth convolutional neural network may include: a fifth input layer, a fifth feature extraction layer, and a fifth classification output layer, the fifth classification output layer being used to output the gesture detection result of the image.
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: wave, scissors hand, fist, palm-up hand, thumbs-up, pistol hand, OK hand, heart hand, open, closed.
Specifically, the gesture detection result may also include: a non-predetermined gesture type.
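As a sketch of this inference path (the features, propose, and fifth_head names are hypothetical), the key point is that, because the feature extraction layer parameters are identical, the fifth network can consume the fourth network's feature map directly instead of re-extracting it:

    import torch

    def detect_gestures(fourth_net, fifth_head, image):
        # image is a tensor of shape (1, 3, H, W).
        with torch.no_grad():
            feature_map = fourth_net.features(image)      # first feature information
            candidates = fourth_net.propose(feature_map)  # candidate region predictions
            # Identical feature extraction parameters let the fifth network reuse
            # the same feature map as its second feature information.
            return fifth_head(feature_map, candidates)    # gesture result per region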
As can be seen from the above, in the above embodiments an image is detected using the fourth convolutional neural network to obtain first feature information of the image and prediction information of human hand candidate regions; the first feature information and the prediction information are used as second feature information of the fifth convolutional neural network, which performs gesture detection on the image according to the second feature information to obtain the gesture detection result of the image. Thus, when gestures are detected with the solutions provided by the above embodiments, the user does not need to wear or hold any device, or to master the technical knowledge of operating the relevant device, which lowers the requirements on the user in human-computer interaction and improves user experience.
Corresponding to the above gesture control method, an embodiment of the present application further provides a gesture control device.
Fig. 10 is a schematic structural diagram of a gesture control device provided by an embodiment of the present application. The device includes:
a second obtaining module 1001, configured to detect an image using a gesture detection network trained by any of the training devices provided in the embodiments of the present application, or to detect an image using any of the detection devices provided in the embodiments of the present application, to obtain a gesture detection result of the image, the image including a still image or an image in a video;
a trigger module 1002, configured to trigger a corresponding control operation according to at least the gesture detection result of the image.
Specifically, the trigger module 1002 may include:
a recording submodule, configured to record the number of times the same gesture detection result is consecutively obtained for images in a video within a period;
a triggering submodule, configured to trigger a corresponding control operation according to the gesture detection result when the recorded number meets a predetermined condition.
Specifically, the triggering submodule may include:
a determining unit, configured to determine, when the recorded number meets a predetermined condition, a control instruction corresponding to the gesture detection result;
a triggering unit, configured to trigger a corresponding operation according to the control instruction.
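A minimal sketch of this determine-then-trigger split; the gesture-to-instruction table, the count condition, and the execute stub are illustrative assumptions:

    # Hypothetical mapping from gesture detection results to control
    # instructions; the actual mapping is application-specific.
    INSTRUCTION_TABLE = {
        'wave': 'next_page',
        'fist': 'pause',
        'ok_hand': 'confirm',
    }

    def execute(instruction):
        # Placeholder for the device-specific control operation.
        print("executing", instruction)

    def on_gesture(result, consecutive_count, required_count=40):
        # Determining unit: once the same result has been recorded for enough
        # consecutive frames, resolve the corresponding control instruction.
        if consecutive_count >= required_count:
            instruction = INSTRUCTION_TABLE.get(result)
            # Triggering unit: carry out the operation for that instruction.
            if instruction is not None:
                execute(instruction)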
As can be seen from the above, when control is performed using the above embodiments, an image is detected by a gesture detection network trained with the above gesture detection network training method, or by the above gesture detection method, a gesture detection result of the image is obtained, and a corresponding control operation is triggered according to at least the gesture detection result of the image. Thus, when operations are controlled with the solutions provided by the above embodiments, the user does not need to wear or hold any device, or to master the operation of the relevant device, which lowers the requirements on the user in human-computer interaction and improves user experience.
An embodiment of the present application further provides an application program, which, when run, performs the foregoing gesture detection network training method, the foregoing gesture detection method, or the foregoing gesture control method.
The above gesture detection network training method includes:
training a first convolutional neural network according to sample images containing human hand annotation information, and obtaining prediction information of the first convolutional neural network for human hand candidate regions of the sample images;
replacing second feature extraction layer parameters of a second convolutional neural network for gesture detection with first feature extraction layer parameters of the trained first convolutional neural network;
training the second convolutional neural network parameters according to the prediction information of the human hand candidate regions and the sample images, and keeping the second feature extraction layer parameters unchanged during training.
The above gesture detection method includes:
detecting an image using a fourth convolutional neural network, and obtaining first feature information of the image and prediction information of human hand candidate regions, the image including a still image or an image in a video;
using the first feature information and the prediction information of the human hand candidate regions as second feature information of a fifth convolutional neural network, and performing gesture detection on the image according to the second feature information using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein fourth feature extraction layer parameters of the fourth convolutional neural network are identical to fifth feature extraction layer parameters of the fifth convolutional neural network.
The above gesture control method includes:
detecting an image using a gesture detection network trained by the above gesture detection network training method, or detecting an image using the above gesture detection method, to obtain a gesture detection result of the image, the image including a still image or an image in a video;
triggering a corresponding control operation according to at least the gesture detection result of the image.
The foregoing gesture detection network training method, gesture detection method, and gesture control method are only briefly described here; for details, refer to the foregoing embodiments, which are not repeated here.
Fig. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application, including: a housing 1101, a processor 1102, a memory 1103, a circuit board 1104, and a power circuit 1105, wherein the circuit board 1104 is placed in the space enclosed by the housing 1101, and the processor 1102 and the memory 1103 are arranged on the circuit board 1104; the power circuit 1105 is used to supply power to each circuit or device of the electronic device; the memory 1103 is used to store executable program code; and the processor 1102 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 1103, so as to perform the foregoing gesture detection network training method, the foregoing gesture detection method, or the foregoing gesture control method.
The above gesture detection network training method includes:
training a first convolutional neural network according to sample images containing human hand annotation information, and obtaining prediction information of the first convolutional neural network for human hand candidate regions of the sample images;
replacing second feature extraction layer parameters of a second convolutional neural network for gesture detection with first feature extraction layer parameters of the trained first convolutional neural network;
training the second convolutional neural network parameters according to the prediction information of the human hand candidate regions and the sample images, and keeping the second feature extraction layer parameters unchanged during training.
The above gesture detection method includes:
detecting an image using a fourth convolutional neural network, and obtaining first feature information of the image and prediction information of human hand candidate regions, the image including a still image or an image in a video;
using the first feature information and the prediction information of the human hand candidate regions as second feature information of a fifth convolutional neural network, and performing gesture detection on the image according to the second feature information using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein fourth feature extraction layer parameters of the fourth convolutional neural network are identical to fifth feature extraction layer parameters of the fifth convolutional neural network.
The above gesture control method includes:
detecting an image using a gesture detection network trained by the above gesture detection network training method, or detecting an image using the above gesture detection method, to obtain a gesture detection result of the image, the image including a still image or an image in a video;
triggering a corresponding control operation according to at least the gesture detection result of the image.
The foregoing gesture detection network training method, gesture detection method, and gesture control method are only briefly described here; for details, refer to the foregoing embodiments, which are not repeated here.
The above electronic device exists in various forms, including but not limited to:
(1) Mobile communication devices: characterized by mobile communication capability, with voice and data communication as the main goal. This type of terminal includes: smartphones (e.g., iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: belonging to the category of personal computers, with computing and processing functions and generally also mobile Internet access. This type of terminal includes: PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: capable of displaying and playing multimedia content. This type of device includes: audio and video players (e.g., iPod), handheld devices, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Servers: devices providing computing services. A server includes a processor, hard disk, memory, system bus, etc.; its architecture is similar to that of a general-purpose computer, but higher requirements are placed on processing capability, stability, reliability, security, scalability, and manageability because highly reliable services must be provided.
(5) Other electronic devices with data interaction functions.
An embodiment of the present application provides a storage medium for storing executable code, the executable code being used to perform the foregoing gesture detection network training method, the foregoing gesture detection method, or the foregoing gesture control method.
The above gesture detection network training method includes:
training a first convolutional neural network according to sample images containing human hand annotation information, and obtaining prediction information of the first convolutional neural network for human hand candidate regions of the sample images;
replacing second feature extraction layer parameters of a second convolutional neural network for gesture detection with first feature extraction layer parameters of the trained first convolutional neural network;
training the second convolutional neural network parameters according to the prediction information of the human hand candidate regions and the sample images, and keeping the second feature extraction layer parameters unchanged during training.
The above gesture detection method includes:
detecting an image using a fourth convolutional neural network, and obtaining first feature information of the image and prediction information of human hand candidate regions, the image including a still image or an image in a video;
using the first feature information and the prediction information of the human hand candidate regions as second feature information of a fifth convolutional neural network, and performing gesture detection on the image according to the second feature information using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein fourth feature extraction layer parameters of the fourth convolutional neural network are identical to fifth feature extraction layer parameters of the fifth convolutional neural network.
The above gesture control method includes:
detecting an image using a gesture detection network trained by the above gesture detection network training method, or detecting an image using the above gesture detection method, to obtain a gesture detection result of the image, the image including a still image or an image in a video;
triggering a corresponding control operation according to at least the gesture detection result of the image.
The foregoing gesture detection network training method, gesture detection method, and gesture control method are only briefly described here; for details, refer to the foregoing embodiments, which are not repeated here.
As for the device, application program, electronic device, and storage medium embodiments, since they are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
Those of ordinary skill in the art will appreciate that all or part of the steps in the above method embodiments can be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The foregoing is only the preferred embodiments of the present application and is not intended to limit the protection scope of the present application. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (10)

  1. A gesture detection network training method, characterized by comprising:
    training a first convolutional neural network according to sample images containing human hand annotation information, and obtaining prediction information of the first convolutional neural network for human hand candidate regions of the sample images;
    replacing second feature extraction layer parameters of a second convolutional neural network for gesture detection with first feature extraction layer parameters of the trained first convolutional neural network;
    training the second convolutional neural network parameters according to the prediction information of the human hand candidate regions and the sample images, and keeping the second feature extraction layer parameters unchanged during training.
  2. The method according to claim 1, characterized in that training the second convolutional neural network parameters according to the prediction information of the human hand candidate regions and the sample images comprises:
    correcting the prediction information of the human hand candidate regions;
    training the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate regions and the sample images.
  3. The method according to claim 1 or 2, characterized in that the human hand annotation information includes annotation information of a human hand region.
  4. The method according to claim 3, characterized in that the human hand annotation information includes annotation information of a gesture.
  5. A gesture detection method, characterized by comprising:
    detecting an image using a fourth convolutional neural network, and obtaining first feature information of the image and prediction information of human hand candidate regions, the image including a still image or an image in a video;
    using the first feature information and the prediction information of the human hand candidate regions as second feature information of a fifth convolutional neural network, and performing gesture detection on the image according to the second feature information using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein fourth feature extraction layer parameters of the fourth convolutional neural network are identical to fifth feature extraction layer parameters of the fifth convolutional neural network.
  6. A gesture control method, characterized by comprising:
    detecting an image using a gesture detection network trained by the method according to any one of claims 1-4, or detecting an image using the method according to claim 5, to obtain a gesture detection result of the image, the image including a still image or an image in a video;
    triggering a corresponding control operation according to at least the gesture detection result of the image.
  7. A gesture detection network training device, characterized by comprising:
    a first training module, configured to train a first convolutional neural network according to sample images containing human hand annotation information, and to obtain prediction information of the first convolutional neural network for human hand candidate regions of the sample images;
    a parameter replacement module, configured to replace second feature extraction layer parameters of a second convolutional neural network for gesture detection with first feature extraction layer parameters of the trained first convolutional neural network;
    a second training module, configured to train the second convolutional neural network parameters according to the prediction information of the human hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.
  8. A gesture detection device, characterized by comprising:
    a first obtaining module, configured to detect an image using a fourth convolutional neural network, and to obtain first feature information of the image and prediction information of human hand candidate regions, the image including a still image or an image in a video;
    a detection module, configured to use the first feature information and the prediction information of the human hand candidate regions as second feature information of a fifth convolutional neural network, and to perform gesture detection on the image according to the second feature information using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein fourth feature extraction layer parameters of the fourth convolutional neural network are identical to fifth feature extraction layer parameters of the fifth convolutional neural network.
  9. A gesture control device, characterized by comprising:
    a second obtaining module, configured to detect an image using a gesture detection network trained by the device according to claim 7, or to detect an image using the device according to claim 8, to obtain a gesture detection result of the image, the image including a still image or an image in a video;
    a trigger module, configured to trigger a corresponding control operation according to at least the gesture detection result of the image.
  10. An electronic device, characterized by comprising: a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is placed in the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power circuit is used to supply power to each circuit or device of the terminal; the memory is used to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the gesture detection network training method according to any one of claims 1-4, or the gesture detection method according to claim 5, or the gesture control method according to claim 6.
CN201610696340.1A 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device Active CN107368182B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610696340.1A CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device
PCT/CN2017/098182 WO2018033154A1 (en) 2016-08-19 2017-08-19 Gesture control method, device, and electronic apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610696340.1A CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device

Publications (2)

Publication Number Publication Date
CN107368182A 2017-11-21
CN107368182B CN107368182B (en) 2020-02-18

Family

ID=60303776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610696340.1A Active CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device

Country Status (1)

Country Link
CN (1) CN107368182B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140198031A1 (en) * 2013-01-16 2014-07-17 Huaixin XIONG Palm gesture recognition method and device as well as human-machine interaction method and apparatus
CN104346607A (en) * 2014-11-06 2015-02-11 上海电机学院 Face recognition method based on convolutional neural network
CN105353634A (en) * 2015-11-30 2016-02-24 北京地平线机器人技术研发有限公司 Household appliance and method for controlling operation by gesture recognition
CN105354574A (en) * 2015-12-04 2016-02-24 山东博昂信息科技有限公司 Vehicle number recognition method and device
CN105718878A (en) * 2016-01-19 2016-06-29 华南理工大学 Egocentric vision in-the-air hand-writing and in-the-air interaction method based on cascade convolution nerve network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YICHAO HUANG et al.: "A Pointing Gesture Based Egocentric Interaction System: Dataset, Approach", 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops *
任少卿 (Ren Shaoqing): "Efficient Object Detection Based on Feature Sharing", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108227912A (en) * 2017-11-30 2018-06-29 北京市商汤科技开发有限公司 Device control method and apparatus, electronic device, computer storage medium
CN108227912B (en) * 2017-11-30 2021-05-11 北京市商汤科技开发有限公司 Device control method and apparatus, electronic device, computer storage medium
CN108388859A (en) * 2018-02-11 2018-08-10 深圳市商汤科技有限公司 Object detection method, network training method, device and computer storage medium
CN108388859B (en) * 2018-02-11 2022-04-15 深圳市商汤科技有限公司 Object detection method, network training method, device and computer storage medium
US20190251340A1 (en) * 2018-02-15 2019-08-15 Wrnch Inc. Method and system for activity classification
US10949658B2 (en) * 2018-02-15 2021-03-16 Wrnch Inc. Method and system for activity classification
US20220240638A9 (en) * 2018-02-15 2022-08-04 Wrnch Inc. Method and system for activity classification
US11783183B2 (en) * 2018-02-15 2023-10-10 Hinge Health, Inc. Method and system for activity classification
CN108646919A (en) * 2018-05-10 2018-10-12 北京光年无限科技有限公司 Virtual-human-based visual interaction method and system
CN109740454A (en) * 2018-12-19 2019-05-10 贵州大学 Human body posture recognition method based on YOLO-V3
CN109858380A (en) * 2019-01-04 2019-06-07 广州大学 Extensible gesture recognition method, device, system, gesture recognition terminal and medium
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 Classification method and device for unlabeled corpora

Also Published As

Publication number Publication date
CN107368182B (en) 2020-02-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant