CN116189277A - Training method and device, gesture recognition method, electronic equipment and storage medium - Google Patents
- Publication number
- CN116189277A (application number CN202211536167.0A)
- Authority
- CN
- China
- Prior art keywords
- gesture detection
- detection network
- gesture
- training
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
Provided are a training method and device, a gesture recognition method, an electronic device and a storage medium. The training method comprises the following steps: enhancing the first gesture detection network to obtain a second gesture detection network; training the first gesture detection network and the second gesture detection network by using a training data set formed by gesture pictures to obtain a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network; and updating parameters of the first gesture detection network and/or the second gesture detection network according to the first loss and the second loss.
Description
Technical Field
The present disclosure relates to the field of gesture recognition, and more particularly, to a training method and apparatus, a gesture recognition method, an electronic device, and a storage medium.
Background
A lightweight gesture recognition network can be deployed on electronic devices such as mobile phones to recognize users' gestures. The training accuracy and training efficiency of lightweight gesture recognition networks have long been matters of concern.
Disclosure of Invention
The embodiment of the application provides a training method and device, a gesture recognition method, electronic equipment and a storage medium. Various aspects related to the present application are described below.
In a first aspect, a training method is provided, including: enhancing the first gesture detection network to obtain a second gesture detection network; training the first gesture detection network and the second gesture detection network by using a training data set formed by gesture pictures to obtain a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network; and updating parameters of the first gesture detection network and/or the second gesture detection network according to the first loss and the second loss.
With reference to the first aspect, in some implementations, the second gesture detection network shares model parameters of the first gesture detection network with the first gesture detection network.
With reference to the first aspect, in some implementations, the second gesture detection network is one of a plurality of second gesture detection networks obtained by enhancing the first gesture detection network, and the method further includes: and selecting one second gesture detection network from the plurality of second gesture detection networks to participate in each training process of the first gesture detection network.
With reference to the first aspect, in some implementations, the enhancing the first gesture detection network includes: the number of channels and/or neural network layers of the first gesture detection network is increased.
With reference to the first aspect, in some implementations, the training data set includes one or more of the following types of gesture pictures: gesture pictures of different scenes; gesture pictures of different illuminations; gesture pictures with different distances from a shooting lens; and hand gesture pictures with and without gloves.
In a second aspect, a gesture recognition method is provided, including: acquiring a gesture to be recognized; identifying the gesture with a first gesture detection network; wherein the first gesture detection network is trained based on the method as described in the first aspect or any implementation of the first aspect.
In a third aspect, there is provided a training device comprising: the enhancement module is used for enhancing the first gesture detection network to obtain a second gesture detection network; the training module is used for training the first gesture detection network and the second gesture detection network by using a training data set formed by gesture pictures to obtain a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network; and the updating module is used for updating the parameters of the first gesture detection network according to the first loss and the second loss.
In a fourth aspect, there is provided a training device comprising: a memory for storing codes; a processor for executing code stored in the memory, causing the training device to perform the method as described in the first aspect or any implementation of the first aspect.
In a fifth aspect, there is provided an electronic device comprising: a memory for storing codes; a processor configured to perform the gesture recognition method according to the second aspect.
In a sixth aspect, a computer readable storage medium is provided, on which code is stored, the code being for performing a method according to any one of the possible implementations of the first or second aspect.
In a seventh aspect, a computer program product is provided, comprising code for performing the method according to any one of the possible implementations of the first or second aspect.
By introducing a second gesture detection network (an enhancement network) and using it to supervise the training process of the first gesture detection network (a lightweight network), the accuracy of the first gesture detection network can be improved. In addition, since the first gesture detection network and the second gesture detection network are trained simultaneously, training overhead can be reduced.
Drawings
To describe the technical solutions in the embodiments or the background of the present application more clearly, the drawings required by the embodiments or the background are briefly introduced below.
FIG. 1 is a flow chart of a training method of a lightweight gesture detection network.
FIG. 2 is a flow chart of another training method of a lightweight gesture detection network.
Fig. 3 is a schematic flow chart of a training method according to an embodiment of the present application.
Fig. 4 is an exemplary diagram of a network enhancement provided in an embodiment of the present application.
Fig. 5 is an example diagram of gesture pictures taken in an embodiment of the present application.
Fig. 6 is a flowchart of a gesture recognition method according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a training device according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a training device according to another embodiment of the present application.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
With the development of neural network technology, gesture recognition based on gesture recognition networks is increasingly widely applied. For example, gesture recognition networks are deployed inside some electronic devices (such as mobile phones) so that the devices have gesture recognition functions. Based on such a network, a user can control the mobile phone through gestures, for example to hang up a call, like a short video, or turn a page.
Many electronic devices are limited by their own resource configurations and often require the deployment of a lightweight gesture recognition network. How to train a lightweight gesture recognition network has therefore become a hot research problem.
Related art 1 proposes a gesture detection method based on knowledge distillation and an attention mechanism. First, gesture pictures are acquired to construct a training data set, and an attention-based teacher network and a lightweight student network are constructed. The large teacher network is then trained on the constructed data set. Next, the data set is expanded by manual labeling, automatic generation of random data, and teacher-network prediction on unlabeled data. The student network is then trained by distillation using the expanded data set and the teacher network. Finally, a trained lightweight gesture detection network is obtained and used for prediction. The implementation flow of related art 1 is shown in fig. 1.
Related art 2 proposes a gesture detection method based on data augmentation and a lightweight backbone network. The implementation flow of related art 2 is shown in fig. 2: first, a lightweight gesture detection network is constructed with SqueezeNet as the backbone; gesture pictures and background pictures are acquired and then augmented to obtain a training data set; the lightweight gesture detection network is then trained on this data set to obtain the gesture detection network used for prediction.
The design objective of both related art 1 and related art 2 is a lightweight, deployable gesture detection network. Related art 2 directly adopts the lightweight SqueezeNet as the backbone network and trains it with data augmentation. Data augmentation has been shown on a large number of networks to improve model accuracy, but it often fails on lightweight networks: because a lightweight network's representation capability is weaker than a large model's, augmentation produces richer and more complex data than the lightweight network can effectively learn from. The drawback of related art 2 is therefore insufficient accuracy.
Related art 1 introduces additional supervision through knowledge distillation, which partly alleviates the accuracy problem of related art 2. However, related art 1 must first train a large teacher network and then use it to assist in training the student network. The drawback is that two training runs are required, and training the large model is particularly time-consuming. Thus, although related art 1 addresses the accuracy problem of related art 2 to some extent, its training is time-consuming and costly.
In view of the foregoing, an embodiment of the present application proposes a training method, and the training method is described in detail below with reference to fig. 3.
Referring to fig. 3, in step S310, the first gesture detection network is enhanced to obtain a second gesture detection network.
The first gesture detection network may be a lightweight gesture detection network, for example one that can be deployed on a handheld terminal device such as a mobile phone. The second gesture detection network may also be referred to as an enhancement network; here, "enhanced" means that the second gesture detection network has more model parameters than the first gesture detection network. In some embodiments below, the first gesture detection network is called the base network and the second gesture detection network is called the enhancement network.
The enhancement mode of the first gesture detection network is not particularly limited: the first gesture detection network may be enhanced in the width direction, in the depth direction, or by a combination of the two.
Enhancing the first gesture detection network in the depth direction may include increasing its number of neural network layers. For example, if the first gesture detection network includes N neural network layers, M neural network layers may be added to obtain a second gesture detection network containing N+M neural network layers.
Enhancing the first gesture detection network in the width direction may include increasing its number of channels. For example, referring to fig. 4, the first gesture detection network may include one or more convolution layers, which can be widened by increasing the number of convolution channels. A second gesture detection network obtained by widening incurs less training-time overhead than one obtained by deepening the first gesture detection network.
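As a minimal illustrative sketch (a hypothetical representation, not the patent's implementation), the two enhancement directions can be modeled on a network described only by its per-layer output-channel counts:

```python
# Hypothetical sketch: a network is described only by a list of
# per-layer output-channel counts.

def widen(channels, factor):
    """Width enhancement: multiply every layer's channel count by `factor`."""
    return [c * factor for c in channels]

def deepen(channels, extra_layers):
    """Depth enhancement: append `extra_layers` layers reusing the last width."""
    return channels + [channels[-1]] * extra_layers

base = [16, 32, 64]          # a base network with N = 3 layers
print(widen(base, 2))        # [32, 64, 128]
print(deepen(base, 2))       # [16, 32, 64, 64, 64], i.e. N + M = 5 layers
```

In a real framework, widening would mean enlarging each convolution's output-channel dimension and deepening would mean inserting additional layers; the list form above only captures the bookkeeping.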
In some embodiments, the second gesture detection network is one of a plurality of second gesture detection networks that are enhanced by the first gesture detection network. That is, after the first gesture detection network is enhanced in step S310, one second gesture detection network may be obtained, or a plurality of second gesture detection networks may be obtained.
For example, the largest enhancement network may be constructed first. Let the original width of a convolution operator, i.e. its number of output channels, be w. An enhancement factor r may be set, so that the width of the enhanced convolution operator is w×r. To simplify the operation, the enhancement factor r need not be set layer by layer; instead, the entire network may share one enhancement factor r.
After the maximum enhancement network is constructed, other enhancement networks may be obtained by sampling output channels from it. For example, another hyperparameter s, a slicing factor, may be set to linearly slice the width range. If the width of the base network (i.e. the first gesture detection network mentioned above) is w and the maximum enhancement-network width is w×r with r=3, setting s=2 yields the possible enhancement-network widths [w, 2×w, 3×w]. Different layers could use different slicing factors s, but to simplify the operation the slicing factor is not set layer by layer; the entire network shares one slicing factor s.
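The width schedule described above can be sketched as follows (a hypothetical helper; the patent does not give code, and the linear-slicing formula is an assumption consistent with the [w, 2×w, 3×w] example):

```python
def candidate_widths(w, r, s):
    """Linearly slice the range from base width w to max width w*r into s steps."""
    return [int(w * (1 + k * (r - 1) / s)) for k in range(s + 1)]

# With r = 3 and s = 2, as in the text, the widths are [w, 2*w, 3*w]:
print(candidate_widths(64, 3, 2))  # [64, 128, 192]
```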
In some embodiments, the second gesture detection network and the first gesture detection network may share the model parameters of the first gesture detection network. Sharing can be understood as the second gesture detection network containing the first gesture detection network; equivalently, the first gesture detection network can be obtained by sampling the second gesture detection network. If a plurality of second gesture detection networks are constructed in step S310, they may share model parameters with one another: after the largest enhancement network is constructed, the other gesture detection networks can be obtained by sampling it. Because the model parameters are shared between the first and second gesture detection networks, the total parameter storage is determined by the largest enhancement network's parameters; compared with storing each gesture detection network's parameters independently, this reduces computing resources, memory overhead, and training time. With shared parameters, the final lightweight gesture detection network can be obtained after training by sampling the enhancement network.
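Parameter sharing by sampling can be illustrated with a toy sketch (hypothetical; a real slimmable-style implementation would slice tensor storage so that sub-networks literally reuse the same weights, rather than copying Python lists):

```python
# Sketch: the largest enhancement network holds one weight list per layer;
# a narrower network is obtained by keeping the leading output channels
# of every layer, so its parameters come from the large network's storage.

def sample_subnetwork(max_weights, width):
    """Keep the first `width` output channels of every layer."""
    return [layer[:width] for layer in max_weights]

max_net = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # max width 3, two layers
base_net = sample_subnetwork(max_net, 2)       # base width 2
print(base_net)  # [[0.1, 0.2], [0.4, 0.5]]
```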
Prior to performing step S320 of fig. 3, a training data set formed by gesture pictures may be acquired. The training data set may contain one or more of the following types of gesture pictures: gesture pictures of different scenes; gesture pictures under different illumination; and gesture pictures at different distances from the shooting lens. Fig. 5 gives an example: it contains gesture pictures from various scenes and at various distances. In some embodiments, to handle complex gesture recognition scenarios, the training data set may also include gesture pictures with and without gloves. After the gesture pictures are acquired, they can be labeled to obtain the final training data set.
Next, in steps S320 to S330, the first gesture detection network and the second gesture detection network are trained using the training data set formed by the gesture pictures, obtaining a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network; the parameters of the first gesture detection network and/or the second gesture detection network are then updated according to the first loss and the second loss. That is, the first gesture detection network (base network) may be trained with the second gesture detection network (enhancement network) serving as an additional supervisory signal. The supervision may be implemented by taking the loss of the enhancement network into account when updating the base network.
There may be one or more enhancement networks. Taking a plurality of enhancement networks as an example, the loss function of the base network, accounting for all of them, can be expressed by the following formula:
L_aug = L(W_t) + a_1·L([W_t, W_1]) + a_2·L([W_t, W_2]) + … + a_i·L([W_t, W_i])
where [W_t, W_i] denotes an enhancement network containing the base network, W_t being the base part and W_i the extension. L(W_t) is the basic supervision, and a_1·L([W_t, W_1]) + a_2·L([W_t, W_2]) + … + a_i·L([W_t, W_i]) is the auxiliary supervision from the enhancement networks. Each a_i is a hyperparameter that controls the weight of the corresponding enhancement network's supervisory signal in the overall loss function.
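The loss formula can be sketched directly (hypothetical helper names; plain numbers stand in for computed network losses):

```python
def augmented_loss(base_loss, enhanced_losses, coeffs):
    """L_aug = L(W_t) + sum_i a_i * L([W_t, W_i])."""
    return base_loss + sum(a * l for a, l in zip(coeffs, enhanced_losses))

# Two enhancement networks with weights a_1 = 0.5 and a_2 = 0.25:
print(augmented_loss(1.0, [0.8, 0.6], [0.5, 0.25]))  # approximately 1.55
```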
When multiple enhancement networks are included, each requires an additional forward computation and back-propagation pass. Computing the loss functions of all the enhancement networks in every training step would therefore increase computational resource overhead and time cost. Thus, in some embodiments, during each training step only one enhancement network is randomly sampled for forward computation and back propagation as the auxiliary supervisory signal. The update of the base-network parameters can then be expressed as:
W_t' = W_t − lr · ∂(L(W_t) + a·L([W_t, W_i]))/∂W_t

where W_t' denotes the updated base network, W_t denotes the base network before updating, and lr denotes the learning rate, which may be a hyperparameter preset to a fixed value. a may be a hyperparameter used to control the weight of the sampled enhancement network's supervisory signal in the overall loss function.
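The per-step random sampling of one enhancement network can be sketched as follows (hypothetical names; the loss callables stand in for full forward passes of the base and enhancement networks):

```python
import random

def training_step(base_loss_fn, enhanced_loss_fns, a):
    """One step: base loss plus a * loss of ONE randomly sampled enhancement
    network, so only that network does forward/backward this step."""
    aux_loss_fn = random.choice(enhanced_loss_fns)
    return base_loss_fn() + a * aux_loss_fn()

loss = training_step(lambda: 1.0, [lambda: 0.9, lambda: 0.7], a=0.5)
print(loss)  # about 1.45 or 1.35, depending on the sampled enhancement network
```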
The embodiment of the application provides a training method for a lightweight gesture detection network (namely the first gesture detection network), which has the advantages of fast inference, small model size, and easy deployment on a mobile phone. In addition, the network enhancement technique provided by the embodiment of the application introduces the enhancement network as an auxiliary supervisory signal while keeping the deployed network lightweight, effectively improving the representation capability of the lightweight network and thereby guaranteeing its accuracy. In this technique, the base network (the first gesture detection network) and the enhancement network (the second gesture detection network) are trained together, so training needs to be completed only once, saving computing resources and time.
According to the invention, the base network is enhanced by network expansion to obtain a plurality of enhancement networks with stronger representation capability. The invention then trains with these enhancement networks cooperatively, using them as auxiliary supervisory signals to update the parameters of the base network, whose learning capacity is somewhat weaker. In this way, a lightweight base model with stronger representation capability can be learned in a shorter training time.
Fig. 6 is a flowchart of a gesture recognition method according to an embodiment of the present application. The gesture recognition method shown in fig. 6 may be performed based on the first gesture detection network trained by the training method described above. The method of fig. 6 includes step S610 and step S620.
In step S610, a gesture to be recognized is acquired. For example, gestures to be recognized may be acquired by a camera on the handset.
In step S620, the gesture is recognized using the first gesture detection network.
Method embodiments of the present application are described above in detail in connection with figs. 1 to 6, and apparatus embodiments are described below in connection with figs. 7 to 9. The description of the apparatus embodiments corresponds to that of the method embodiments; for details not repeated here, reference may be made to the preceding method embodiments.
Fig. 7 is a schematic structural diagram of a training device according to an embodiment of the present application. The training apparatus 700 shown in fig. 7 may include an enhancement module 710, a training module 720, and an update module 730.
The enhancement module 710 may be configured to enhance the first gesture detection network to obtain the second gesture detection network.
The training module 720 may be configured to train the first gesture detection network and the second gesture detection network by using a training data set formed by the gesture pictures, so as to obtain a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network.
The update module 730 may be configured to update the parameter of the first gesture detection network based on the first loss and the second loss.
Fig. 8 is a schematic structural diagram of a training device according to another embodiment of the present application. Training apparatus 800 of fig. 8 may comprise a memory 810 and a processor 820.
Memory 810 may be used to store code.
Processor 820 may be configured to execute code stored in memory to cause training device 800 to perform the training method described in any of the embodiments above.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 900 shown in fig. 9 may be, for example, a cell phone. The electronic device may include a memory 910 and a processor 920.
Memory 910 may be used to store code.
The processor 920 may be used to perform the gesture recognition method shown in fig. 6.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present disclosure, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a digital video disc (Digital Video Disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any changes or substitutions readily conceivable by a person skilled in the art within the technical scope of the disclosure shall be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (10)
1. A training method, comprising:
enhancing a first gesture detection network to obtain a second gesture detection network;
training the first gesture detection network and the second gesture detection network by using a training data set formed by gesture pictures to obtain a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network;
and updating parameters of the first gesture detection network and/or the second gesture detection network according to the first loss and the second loss.
2. The training method of claim 1, wherein the second gesture detection network shares model parameters with the first gesture detection network.
3. The training method of claim 1 or 2, wherein the second gesture detection network is one of a plurality of second gesture detection networks obtained by enhancing the first gesture detection network, the method further comprising:
selecting, for each training iteration of the first gesture detection network, one second gesture detection network from the plurality of second gesture detection networks to participate in the training.
4. The training method of claim 1 or 2, wherein the enhancing the first gesture detection network comprises:
increasing the number of channels and/or the number of neural network layers of the first gesture detection network.
5. The training method of claim 1 or 2, wherein the training data set contains one or more of the following types of gesture pictures:
gesture pictures of different scenes;
gesture pictures under different illumination;
gesture pictures taken at different distances from the shooting lens; and
gesture pictures with and without gloves.
6. A method of gesture recognition, comprising:
acquiring a gesture to be recognized;
identifying the gesture with a first gesture detection network; wherein the first gesture detection network is trained based on the method of any one of claims 1-5.
7. A training device, comprising:
an enhancement module configured to enhance a first gesture detection network to obtain a second gesture detection network;
a training module configured to train the first gesture detection network and the second gesture detection network using a training data set formed of gesture pictures, to obtain a first loss corresponding to the first gesture detection network and a second loss corresponding to the second gesture detection network; and
an updating module configured to update the parameters of the first gesture detection network according to the first loss and the second loss.
8. A training device, comprising:
a memory for storing code;
a processor for executing the code stored in the memory to cause the training device to perform the method of any one of claims 1 to 5.
9. An electronic device, comprising:
a memory for storing code;
a processor configured to execute the code stored in the memory to perform the gesture recognition method of claim 6.
10. A computer-readable storage medium having stored thereon code for performing the method of any one of claims 1 to 5 or the method of claim 6.
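Taken together, claims 1 to 4 describe a joint training scheme: an enhanced second network shares parameters with the first (deployed) network, both networks are trained on the same data, and the shared weights are updated from both losses. The following is a minimal NumPy sketch of that scheme; the gesture pictures are abstracted to random feature vectors, and all layer sizes, the squared-error losses, the learning rate, and the iteration count are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the training data set of "gesture pictures":
# 64 samples of 8 features, with 3 regression targets.
X = rng.normal(size=(64, 8))
T = X @ rng.normal(size=(8, 3)) + 0.1 * rng.normal(size=(64, 3))

# First gesture detection network: a single linear layer with weights W.
W = 0.1 * rng.normal(size=(8, 3))
# Second (enhanced) network: reuses W (claim 2, shared parameters)
# and adds an extra layer V (claim 4, more layers).
V = np.eye(3) + 0.01 * rng.normal(size=(3, 3))

lr = 0.05
first_losses, second_losses = [], []
for _ in range(200):
    y1 = X @ W          # first network's prediction
    y2 = (X @ W) @ V    # enhanced network's prediction (shared trunk W)

    e1, e2 = y1 - T, y2 - T
    first_losses.append((e1 ** 2).mean())    # first loss (claim 1)
    second_losses.append((e2 ** 2).mean())   # second loss (claim 1)

    # The shared W is updated from BOTH losses (claim 1: update the
    # parameters according to the first loss and the second loss);
    # the extra layer V is updated only from the second loss.
    gW = (2 / e1.size) * (X.T @ e1) + (2 / e2.size) * (X.T @ (e2 @ V.T))
    gV = (2 / e2.size) * ((X @ W).T @ e2)
    W -= lr * gW
    V -= lr * gV
```

Because W appears in both networks, gradients from the enhanced network's loss also refine the deployed first network, which is the effect the claims aim at: the small network benefits from training signal that only the larger, enhanced network produces.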
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211536167.0A CN116189277A (en) | 2022-12-01 | 2022-12-01 | Training method and device, gesture recognition method, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116189277A true CN116189277A (en) | 2023-05-30 |
Family
ID=86447733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211536167.0A Pending CN116189277A (en) | 2022-12-01 | 2022-12-01 | Training method and device, gesture recognition method, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116189277A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117612247A (en) * | 2023-11-03 | 2024-02-27 | 重庆利龙中宝智能技术有限公司 | Dynamic and static gesture recognition method based on knowledge distillation |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109829433B (en) | Face image recognition method and device, electronic equipment and storage medium | |
JP7073437B2 (en) | Hypernetwork training methods and equipment, electronic devices, storage media | |
CN110189246B (en) | Image stylization generation method and device and electronic equipment | |
TWI772668B (en) | Method, device and electronic apparatus for target object processing and storage medium thereof | |
CN110413812B (en) | Neural network model training method and device, electronic equipment and storage medium | |
US20220327385A1 (en) | Network training method, electronic device and storage medium | |
CN106471526A (en) | Process image using deep neural network | |
US20210089913A1 (en) | Information processing method and apparatus, and storage medium | |
CN112631947B (en) | Test control method and device for application program, electronic equipment and storage medium | |
CN110543849B (en) | Detector configuration method and device, electronic equipment and storage medium | |
CN112800276B (en) | Video cover determining method, device, medium and equipment | |
CN114416260B (en) | Image processing method, device, electronic equipment and storage medium | |
KR20210024631A (en) | Image processing method and device, electronic device and storage medium | |
CN112084959B (en) | Crowd image processing method and device | |
CN105760458A (en) | Picture processing method and electronic equipment | |
CN116189277A (en) | Training method and device, gesture recognition method, electronic equipment and storage medium | |
CN115953643A (en) | Knowledge distillation-based model training method and device and electronic equipment | |
CN109697083B (en) | Fixed-point acceleration method and device for data, electronic equipment and storage medium | |
CN109447258B (en) | Neural network model optimization method and device, electronic device and storage medium | |
CN111915689B (en) | Method, apparatus, electronic device, and computer-readable medium for generating an objective function | |
CN112259122A (en) | Audio type identification method and device and storage medium | |
CN111062914B (en) | Method, apparatus, electronic device and computer readable medium for acquiring facial image | |
CN115100492B (en) | Yolov3 network training and PCB surface defect detection method and device | |
CN116957006A (en) | Training method, device, equipment, medium and program product of prediction model | |
CN114510911B (en) | Text processing method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |