CN114895275B - Efficient multidimensional attention neural network-based radar micro gesture recognition method - Google Patents

Efficient multidimensional attention neural network-based radar micro gesture recognition method

Info

Publication number
CN114895275B
CN114895275B CN202210551031.0A CN202210551031A
Authority
CN
China
Prior art keywords
layer
attention
doppler
sequence
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210551031.0A
Other languages
Chinese (zh)
Other versions
CN114895275A (en)
Inventor
张文鹏
杨磊
姜卫东
张双辉
刘永祥
霍凯
高勋章
卢杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210551031.0A priority Critical patent/CN114895275B/en
Publication of CN114895275A publication Critical patent/CN114895275A/en
Application granted granted Critical
Publication of CN114895275B publication Critical patent/CN114895275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00
    • G01S7/41Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G01S7/417Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section involving the use of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Discrete Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a radar micro-gesture recognition method based on an efficient multidimensional attention neural network, together with computer equipment and a storage medium. The method comprises the following steps: constructing a multidimensional attention module from a global max-pooling layer, a global average-pooling layer and a split-splice convolution module, the multidimensional attention module comprising a spatial attention module, a channel attention module and a temporal attention module; constructing an efficient multidimensional attention block from the multidimensional attention module and a plurality of convolution blocks, constructing an efficient multidimensional attention neural network from the efficient multidimensional attention block together with a preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer, and training the efficient multidimensional attention neural network; and inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition. By adopting the method, the accuracy of radar micro-gesture recognition can be improved.

Description

Efficient multidimensional attention neural network-based radar micro gesture recognition method
Technical Field
The application relates to the technical field of radar target recognition, in particular to a radar micro-gesture recognition method, a computer device and a storage medium based on an efficient multidimensional attention neural network.
Background
Gestures carry rich information in daily life and have become a hot topic in the field of human-computer interaction. Gesture recognition has been applied in many areas, such as smart homes and virtual reality. Current gesture recognition technologies are mainly vision-based, wearable-device-based and radar-based. In vision-based techniques, hand motions are captured by an optical camera or a depth sensor, which generates red-green-blue (RGB) images or depth images for recognition; this approach, however, performs poorly in dark conditions and raises privacy concerns. Wearable-device-based technology requires users to wear designated sensors and devices to gather gesture data; wearable devices are generally expensive and the user experience is poor. Compared with these two, radar sensors are non-contact, unaffected by illumination and do not compromise user privacy. Radar-based gesture recognition methods have therefore attracted the attention of many researchers and are widely used. Micro-motion refers to movements of relatively small amplitude, such as rotation, vibration and acceleration of a target and its components during motion, so the micro-motion features of an object can be used to classify different objects or distinguish different actions. Since the amplitude of a gesture is small compared with the body, gesture recognition can be regarded as a kind of micro-motion recognition. Existing radar-based micro-gesture recognition methods mainly process radar echo data into feature maps and design two-dimensional convolutional neural networks to extract features from and recognize these maps, chiefly exploiting information such as the range and frequency of the gesture motion.
However, the radar echo data of a dynamic gesture simultaneously contains range, velocity and time information, and, owing to its structure, a two-dimensional convolutional neural network cannot fully extract the effective information in such data. A three-dimensional convolutional neural network alleviates the problem of insufficient gesture motion information, but it is still at an early stage in the field of radar target recognition: in complex scenes it makes insufficient use of the gesture echo data, carries a large number of parameters and cannot extract features effectively, so the accuracy of radar micro-gesture recognition remains low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a radar micro-gesture recognition method, a computer device and a storage medium based on an efficient multidimensional attention neural network that can improve the accuracy of radar micro-gesture recognition.
A radar micro-gesture recognition method based on an efficient multidimensional attention neural network, the method comprising:
acquiring radar echo data, the radar echo data comprising a plurality of gestures to be detected;
performing a two-dimensional Fourier transform and filtering on the radar echo data to obtain a range-Doppler map sequence of the gestures to be detected, and dividing the range-Doppler map sequence according to a preset ratio to obtain a training set and a test set;
constructing a multidimensional attention module from a global max-pooling layer, a global average-pooling layer and a split-splice convolution module, the multidimensional attention module comprising a spatial attention module, a channel attention module and a temporal attention module;
constructing an efficient multidimensional attention block from the multidimensional attention module and a plurality of convolution blocks, and constructing an efficient multidimensional attention neural network from the efficient multidimensional attention block together with a preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer;
training the efficient multidimensional attention neural network with the training set and the test set to obtain a trained efficient multidimensional attention neural network;
and inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition.
In one embodiment, the radar echo data comprises an intra-pulse fast time and an inter-pulse slow time, and performing the two-dimensional Fourier transform and filtering on the radar echo data to obtain the range-Doppler map sequence of the gesture to be detected comprises the following steps:
performing a two-dimensional Fourier transform over the intra-pulse fast time and the inter-pulse slow time of the radar echo data to obtain a function of target scattering point range and velocity;
and filtering out the zero-frequency component of the function of target scattering point range and velocity by mean filtering to obtain the range-Doppler map sequence of the gesture to be detected.
In one embodiment, performing the two-dimensional Fourier transform over the intra-pulse fast time and the inter-pulse slow time of the radar echo data to obtain the function of target scattering point range and velocity comprises:
performing a two-dimensional Fourier transform over the intra-pulse fast time and the inter-pulse slow time of the radar echo data to obtain the function of target scattering point range and velocity as
S_if(f_i, f_d) = Σ_{l=1}^{L} A_l · sinc[T_p(f_i − 2γR_l/c)] · sinc[N·T_p(f_d − 2v_l/λ)]
where N denotes that the slow-time FFT is taken over every N pulse repetition periods T_p, A_l denotes the intensity of the l-th scattering point, R_l the range from the l-th scattering point to the radar, v_l the velocity of the l-th scattering point, γ the chirp rate of the transmitted signal, λ = c/f_c the carrier wavelength, L the total number of target scattering points, and f_i and f_d the frequency-domain representations of the fast time t̂ and the slow time t_m after the Fourier transform, corresponding to range and velocity respectively.
In one embodiment, constructing the multidimensional attention module from the global max-pooling layer, the global average-pooling layer and the split-splice convolution module comprises:
constructing the channel attention module from the global max-pooling layer and the global average-pooling layer;
constructing the spatial attention module from the split-splice convolution module;
and constructing the temporal attention module from the global average-pooling layer.
In one embodiment, the efficient multidimensional attention neural network comprises an input layer, an intermediate layer and an output layer, and constructing the efficient multidimensional attention neural network from the efficient multidimensional attention block together with the preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer comprises:
constructing the input layer of the efficient multidimensional attention neural network from the convolution layer and the max-pooling layer;
constructing the intermediate layer of the efficient multidimensional attention neural network from a plurality of efficient multidimensional attention blocks containing different multidimensional attention modules;
and constructing the output layer of the efficient multidimensional attention neural network from the global average-pooling layer, the fully connected layer and the Softmax layer.
In one embodiment, inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition comprises:
inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network, preprocessing it in the input layer and passing the result to the intermediate layer for feature extraction to obtain a multidimensional feature map;
and, after convolving the multidimensional feature map, applying three-dimensional global average pooling to the convolved multidimensional feature map in the output layer according to the global average-pooling layer, and classifying the pooled multidimensional feature map with the Softmax layer to obtain the recognition result.
In one embodiment, inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network, preprocessing it in the input layer and passing it to the intermediate layer for feature extraction to obtain the multidimensional feature map comprises:
inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network, preprocessing it in the input layer, passing it to the intermediate layer and extracting features from it with the spatial attention module to obtain a multi-scale fusion feature map;
computing weights over the range-Doppler map sequence with the channel attention module to obtain the channel weights corresponding to the feature map;
extracting features from the range-Doppler map sequence with the temporal attention module to obtain a temporal feature map;
and multiplying the multi-scale fusion feature map point by point with the corresponding channel weights and then adding the temporal feature map to obtain the multidimensional feature map.
In one embodiment, extracting features from the range-Doppler map sequence with the spatial attention module to obtain the multi-scale fusion feature map comprises:
extracting features from the range-Doppler map sequence with the spatial attention module to obtain the multi-scale fusion feature map as
F_s = Conv(1×1×1, N→C)(F_s_all)
where F_s_all = Cat([F_s1, F_s2, …, F_sN]) and F_si = Conv(3×k_i×k_i, C'→1)(F_i), i = 1, 2, …, N; F_i denotes the i-th group of the range-Doppler map sequence, k_i the i-th split-splice convolution kernel size, F_si the feature map at the i-th scale, N the total number of split-splice convolution groups, C' the number of channels per group and C the number of input channels.
In one embodiment, computing weights over the range-Doppler map sequence with the channel attention module to obtain the channel weights corresponding to the feature map comprises:
applying global average pooling and global max pooling to the range-Doppler map sequence along the temporal and spatial dimensions to obtain an average-pooled feature map and a max-pooled feature map;
splicing the average-pooled feature map and the max-pooled feature map along the channel dimension to obtain a pooled feature map;
and fusing and exciting the spliced pooled feature map with two fully connected layers to obtain the channel weights corresponding to the feature map.
In one embodiment, extracting features from the range-Doppler map sequence with the temporal attention module to obtain the temporal feature map comprises:
extracting features from the range-Doppler map sequence with the temporal attention module to obtain the temporal feature map as
F_t = σ(g_t W_t1) W_t2
where g_t is obtained by globally average-pooling F over the spatial dimensions, F denotes the range-Doppler map sequence, H and W denote its height and width respectively, W_t1 and W_t2 denote fully connected layers, and σ denotes the GeLU operation.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Acquiring radar echo data; the radar echo data comprises a plurality of gestures to be detected;
performing a two-dimensional Fourier transform and filtering on the radar echo data to obtain a range-Doppler map sequence of the gestures to be detected, and dividing the range-Doppler map sequence according to a preset ratio to obtain a training set and a test set;
constructing a multidimensional attention module from a global max-pooling layer, a global average-pooling layer and a split-splice convolution module, the multidimensional attention module comprising a spatial attention module, a channel attention module and a temporal attention module;
constructing an efficient multidimensional attention block from the multidimensional attention module and a plurality of convolution blocks, and constructing an efficient multidimensional attention neural network from the efficient multidimensional attention block together with a preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer;
training the efficient multidimensional attention neural network with the training set and the test set to obtain a trained efficient multidimensional attention neural network;
and inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring radar echo data; the radar echo data comprises a plurality of gestures to be detected;
performing a two-dimensional Fourier transform and filtering on the radar echo data to obtain a range-Doppler map sequence of the gestures to be detected, and dividing the range-Doppler map sequence according to a preset ratio to obtain a training set and a test set;
constructing a multidimensional attention module from a global max-pooling layer, a global average-pooling layer and a split-splice convolution module, the multidimensional attention module comprising a spatial attention module, a channel attention module and a temporal attention module;
constructing an efficient multidimensional attention block from the multidimensional attention module and a plurality of convolution blocks, and constructing an efficient multidimensional attention neural network from the efficient multidimensional attention block together with a preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer;
training the efficient multidimensional attention neural network with the training set and the test set to obtain a trained efficient multidimensional attention neural network;
and inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition.
According to the radar micro-gesture recognition method based on the efficient multidimensional attention neural network, the radar echo data are processed by a two-dimensional Fourier transform into a range-Doppler map sequence that serves as the network input. A multidimensional attention neural network with joint spatial, channel and temporal attention then extracts this input effectively: multi-scale convolution kernels extract the multi-scale spatial features of the feature map; a squeeze-and-excitation mechanism applied in the channel dimension generates the channel attention weights, with a Softmax layer further establishing channel interaction; and the proposed temporal self-attention module models all frames in the time dimension to obtain global temporal cues. The radar echo data are thereby used effectively and fully, the insufficient use of gesture echo data in complex scenes and the large parameter counts of existing networks are overcome, and the method has significant engineering application value.
Drawings
FIG. 1 is a flow diagram of the radar micro-gesture recognition method based on the efficient multidimensional attention neural network in one embodiment;
FIG. 2 is a flow diagram of generating the radar gesture target range-Doppler map sequence in one embodiment;
FIG. 3 is a schematic diagram of the multidimensional attention (MDA) module structure in one embodiment;
FIG. 4 is a schematic diagram of the split-splice convolution (SCC) module structure in another embodiment;
FIG. 5 is a schematic diagram of the efficient multidimensional attention (EMDA) block structure in one embodiment;
FIG. 6 is a structural diagram of the efficient multidimensional attention residual network (EMDANet) in one embodiment;
FIG. 7 shows the range-Doppler maps corresponding to six gestures in one embodiment;
FIG. 8 shows the parameter counts and recognition accuracies of 2D-ResNet-50, C3D, P3D-60, 3D-ResNet-50 and EMDANet-50 when D_All_train and D_All_test are used as the training and test sets in one embodiment;
FIG. 9 shows the recognition rates of 2D-ResNet-50, C3D, P3D-60, 3D-ResNet-50 and EMDANet-50 when D_All_n_train and D_All_n_test are used as the training and test sets in one embodiment;
FIG. 10 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a radar micro-gesture recognition method based on an efficient multidimensional attention neural network is provided, comprising the following steps:
Step 102: acquire radar echo data, the radar echo data comprising a plurality of gestures to be detected; perform a two-dimensional Fourier transform and filtering on the radar echo data to obtain a range-Doppler map sequence of the gestures to be detected; and divide the range-Doppler map sequence according to a preset ratio into a training set and a test set.
Performing the two-dimensional Fourier transform and filtering on the radar echo data to obtain the range-Doppler map sequence of the gesture to be detected facilitates extraction of the gesture features contained in the echo data, and thereby improves the accuracy and efficiency of gesture recognition. As shown in fig. 7, (a)-(f) are the range-Doppler maps corresponding to six radar gestures: pushing forward, waving up, waving down, waving left, waving right and crossing both hands. The two-dimensional Fourier transform, filtering and data division proceed as follows:
S1.1 Obtain radar echo data:
S1.1.1 Assume the wideband radar transmits a linear frequency modulated (chirp) signal
s(t̂, t_m) = rect(t̂/T_p) · exp{j2π(f_c·t + γ·t̂²/2)}
where t̂ is the intra-pulse fast time, which records the propagation of the wave; t_m = mT_p (m = 0, 1, 2, …) is the inter-pulse slow time; t = t_m + t̂ is the full time; and f_c, T_p, γ and rect(·) are the center frequency, pulse repetition period, chirp rate and envelope of the transmitted signal, respectively. The signal is transmitted at time t_m and is a chirp within [t_m, t_m + T_p].
S1.1.2 Because the duration of a radar chirp is typically on the order of microseconds, the target radar echo can be modeled with a "stop-and-go" model. Let r_l(t_m) denote the radial distance between the l-th scattering center and the radar at the start of the m-th pulse; the radar echo signal of the target is then
s_r(t̂, t_m) = Σ_{l=1}^{L} σ_l(t_m) · rect[(t̂ − τ_l(t_m))/T_p] · exp{j2π[f_c(t − τ_l(t_m)) + γ(t̂ − τ_l(t_m))²/2]}
where σ_l(t_m) and τ_l(t_m) = 2r_l(t_m)/c are the echo amplitude and echo delay of the l-th scattering center, respectively.
S1.1.3 The echo signal is demodulated (dechirped), and a two-dimensional FFT is taken over the fast time t̂ and the slow time t_m, yielding the function of target scattering point range and velocity:
S_if(f_i, f_d) = Σ_{l=1}^{L} A_l · sinc[T_p(f_i − 2γR_l/c)] · sinc[N·T_p(f_d − 2v_l/λ)]
where N denotes that the slow-time FFT is taken over every N pulse repetition periods T_p, A_l denotes the intensity of the l-th scattering point, R_l the range from the l-th scattering point to the radar, v_l the velocity of the l-th scattering point, γ the chirp rate of the transmitted signal, λ = c/f_c the carrier wavelength, L the total number of target scattering points, and f_i and f_d the frequency-domain representations of the fast time t̂ and the slow time t_m after the Fourier transform, corresponding to range and velocity respectively.
S1.1.4 The zero-frequency component of the obtained S_if(f_i, f_d) is filtered out by mean filtering, giving the range-Doppler map F_rd of the gesture to be detected; applying this processing to every frame of the echo gives the range-Doppler map sequence F_rds of the gesture to be detected.
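The chain of S1.1.3-S1.1.4 can be illustrated with a short sketch. This is a minimal NumPy illustration under assumptions, not the patented implementation: the dechirped beat matrix `echo`, its layout and the log-magnitude scaling are choices made only for the example.

```python
import numpy as np

def range_doppler_frame(echo: np.ndarray) -> np.ndarray:
    """One range-Doppler map from a dechirped frame of shape
    (slow_time_pulses, fast_time_samples); names are illustrative."""
    # Mean filtering: subtracting the slow-time mean of every range bin
    # zeroes the zero-frequency (static clutter) Doppler component (S1.1.4).
    echo = echo - echo.mean(axis=0, keepdims=True)
    rd = np.fft.fft(echo, axis=1)                          # fast time -> range
    rd = np.fft.fftshift(np.fft.fft(rd, axis=0), axes=0)   # slow time -> Doppler
    return 20.0 * np.log10(np.abs(rd) + 1e-12)             # log-magnitude map

# Stacking one such map per frame yields the range-Doppler map sequence F_rds.
```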
S1.2 Preprocess the range-Doppler map sequence data:
From the dynamic-gesture range-Doppler map sequence F_rds obtained from the radar echo, each range-Doppler frame is scaled to 256×256, and 16 frames are randomly selected from F_rds as the input of the neural network.
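A sketch of this preprocessing step; assuming OpenCV for resizing, `rd_sequence` as a list of 2-D maps, and keeping the sampled frames in temporal order are illustrative choices rather than requirements of the embodiment.

```python
import numpy as np
import cv2  # assumed resizer; any image-resampling routine would do

def preprocess(rd_sequence, num_frames=16, size=(256, 256)):
    """Scale every RD map to 256x256 and randomly keep 16 frames."""
    frames = [cv2.resize(f, size) for f in rd_sequence]
    # sorted indices keep the selected frames in temporal order (a choice)
    idx = np.sort(np.random.choice(len(frames), num_frames, replace=False))
    return np.stack([frames[i] for i in idx])   # shape (16, 256, 256)
```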
S1.3 Divide the range-Doppler map sequence data into a training set and a test set (see the sketch after this list):
S1.3.1 Suppose p dynamic gestures are to be recognized, n subjects are measured in m scenes and each subject performs each gesture q times in each scene; the total number of measured gesture actions is then D_All = n×m×p×q.
S1.3.2 All measured gesture actions are randomly divided in a 7:3 ratio into a training set D_All_train and a test set D_All_test.
S1.3.3 The measured gesture actions are also randomly divided by subject, in a 7:3 ratio, into a training set D_All_n_train and a test set D_All_n_test.
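The 7:3 division of S1.3.2 can be sketched as follows; `samples` is an assumed list of (sequence, label) pairs, and the fixed seed exists only to make the example reproducible.

```python
import random

def split_7_3(samples, seed=0):
    """Randomly divide measured gesture samples into a 7:3 train/test split."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)
```

For the by-subject split of S1.3.3, the same helper would be applied to the list of subjects rather than to individual samples.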
Step 104: construct a multidimensional attention module from the global max-pooling layer, the global average-pooling layer and the split-splice convolution module; the multidimensional attention module comprises a spatial attention module, a channel attention module and a temporal attention module.
With the spatial and temporal attention modules, features are extracted comprehensively in the spatial and temporal dimensions, while the channel attention module applies a squeeze-and-excitation mechanism in the channel dimension to generate feature-map weights, so the radar echo data are used effectively and fully.
Step 106: construct an efficient multidimensional attention block from the multidimensional attention module and a plurality of convolution blocks, and construct an efficient multidimensional attention neural network from the efficient multidimensional attention block together with a preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer.
By introducing a residual structure, i.e. the two convolution blocks, the efficient multidimensional attention block is obtained from the multidimensional attention module. The resulting network is easy to optimize; the vanishing- and exploding-gradient problems of deep networks are alleviated and the network degradation problem is addressed, which speeds up training of the efficient multidimensional attention neural network and further improves gesture recognition accuracy. The degradation problem refers to slow network training in which gradients vanish or explode and the optimal solution is not reached, manifesting in gesture recognition as slow training and a low recognition rate.
Step 108: train the efficient multidimensional attention neural network with the training set and the test set to obtain a trained efficient multidimensional attention neural network.
The efficient multidimensional attention neural network is trained with the training set and the test set, and the initialization parameters of the network are set as follows. Considering the computing resources and the convergence of the network model, the mini-batch size is set to 16; the network loss function is the cross-entropy loss (CrossEntropyLoss); the optimization algorithm is stochastic gradient descent (SGD) with a learning rate of 0.00002, a momentum factor of 0.9 and a weight decay of 0.0005; and the learning rate is adjusted dynamically, decaying to 0.75 times its previous value every 10 iterations. With the training set D_All_train as input and the number of iterations set to 51, the test set D_All_test is evaluated after each iteration to obtain the recognition rate of gesture actions over all subjects. Likewise, with the training set D_All_n_train as input and 51 iterations, the test set D_All_n_test is evaluated after each iteration to obtain the recognition rate for subjects who did not participate in training. The trained efficient multidimensional attention neural network can then be used for radar micro-gesture recognition.
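The stated configuration maps directly onto a deep-learning framework. Below is a hedged PyTorch sketch: `EMDANet` refers to the network whose skeleton is sketched in the construction embodiment further below, and `train_loader` is an assumed DataLoader yielding mini-batches of 16 range-Doppler map sequences with labels.

```python
import torch
import torch.nn as nn

model = EMDANet(num_classes=6)              # six gestures, as in fig. 7
criterion = nn.CrossEntropyLoss()           # the stated loss function
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5,
                            momentum=0.9, weight_decay=5e-4)
# Learning rate decays to 0.75x its value every 10 iterations (epochs).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.75)

for epoch in range(51):                     # 51 iterations, as stated
    model.train()
    for x, y in train_loader:               # assumed loader, batch size 16
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
    # the corresponding test set is evaluated here after every iteration
```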
Step 110: input the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition.
In this radar micro-gesture recognition method based on the efficient multidimensional attention neural network, the radar echo data are processed by a two-dimensional Fourier transform into a range-Doppler map sequence that serves as the network input, and a multidimensional attention neural network with joint spatial, channel and temporal attention is designed to extract this input effectively: multi-scale convolution kernels extract the multi-scale spatial features of the feature map; a squeeze-and-excitation mechanism applied in the channel dimension generates the channel attention weights, and a Softmax layer further establishes channel interaction; and the proposed temporal self-attention module models all frames in the time dimension to obtain global temporal cues. The radar echo data are thereby used effectively and fully, the insufficient use of gesture echo data in complex scenes and the large parameter counts of existing networks are overcome, and the method has significant engineering application value.
In one embodiment, the radar echo data comprises an intra-pulse fast time and an inter-pulse slow time, and performing the two-dimensional Fourier transform and filtering on the radar echo data to obtain the range-Doppler map sequence of the gesture to be detected comprises the following steps:
performing a two-dimensional Fourier transform over the intra-pulse fast time and the inter-pulse slow time of the radar echo data to obtain a function of target scattering point range and velocity;
and filtering out the zero-frequency component of the function of target scattering point range and velocity by mean filtering to obtain the range-Doppler map sequence of the gesture to be detected.
In one embodiment, performing the two-dimensional Fourier transform over the intra-pulse fast time and the inter-pulse slow time of the radar echo data to obtain the function of target scattering point range and velocity comprises:
performing a two-dimensional Fourier transform over the intra-pulse fast time and the inter-pulse slow time of the radar echo data to obtain the function of target scattering point range and velocity as
S_if(f_i, f_d) = Σ_{l=1}^{L} A_l · sinc[T_p(f_i − 2γR_l/c)] · sinc[N·T_p(f_d − 2v_l/λ)]
where N denotes that the slow-time FFT is taken over every N pulse repetition periods T_p, A_l denotes the intensity of the l-th scattering point, R_l the range from the l-th scattering point to the radar, v_l the velocity of the l-th scattering point, γ the chirp rate of the transmitted signal, λ = c/f_c the carrier wavelength, L the total number of target scattering points, and f_i and f_d the frequency-domain representations of the fast time t̂ and the slow time t_m after the Fourier transform, corresponding to range and velocity respectively.
In one embodiment, constructing the multidimensional attention module from the global max-pooling layer, the global average-pooling layer and the split-splice convolution module comprises:
constructing the channel attention module from the global max-pooling layer and the global average-pooling layer;
constructing the spatial attention module from the split-splice convolution module;
and constructing the temporal attention module from the global average-pooling layer.
In a specific embodiment, fig. 3 shows the structure of the multidimensional attention (MDA) module, where GMP denotes the global max-pooling layer, GAP the global average-pooling layer and SCC the split-splice convolution module, and fig. 4 shows the structure of the split-splice convolution (SCC) module, which contains a plurality of multi-scale convolution kernels.
In one embodiment, the efficient multidimensional attention neural network comprises an input layer, an intermediate layer and an output layer, and constructing the efficient multidimensional attention neural network from the efficient multidimensional attention block together with the preset convolution layer, max-pooling layer, global average-pooling layer, fully connected layer and Softmax layer comprises:
constructing the input layer of the efficient multidimensional attention neural network from the convolution layer and the max-pooling layer;
constructing the intermediate layer of the efficient multidimensional attention neural network from a plurality of efficient multidimensional attention blocks containing different multidimensional attention modules;
and constructing the output layer of the efficient multidimensional attention neural network from the global average-pooling layer, the fully connected layer and the Softmax layer.
In a specific embodiment, as shown in fig. 6, where EMDA denotes an efficient multidimensional attention block, the specific steps of constructing the efficient multidimensional attention network (EMDANet) are as follows:
S4.1 Construct the input layer:
S4.1.1 The input range-Doppler map sequence of size 3×16×256×256 (the number of channels, number of frames, height and width of the feature-map sequence, respectively) is passed through a convolution layer with kernel size (3, 7, 7) and stride (1, 2, 2) that converts the number of channels to 64; the output size is 64×16×128×128.
S4.1.2 The output then passes through a max-pooling layer of size (2, 2, 2) with stride (1, 2, 2); the output size is 64×16×64×64.
S4.2 The intermediate layer consists of four stages of structurally different EMDA blocks, containing 3, 4, 6 and 3 EMDA blocks in sequence.
S4.2.1 Construct the EMDA1 blocks: the number of output channels of the first-part 1×1×1 convolution layer in the EMDA block is set to 64, that of the second-part MDA module to 64, that of the third-part 1×1×1 convolution layer to 256 and that of the identity-mapping part to 256; the output size is 256×16×64×64. The above procedure is repeated 2 times.
S4.2.2 Construct the EMDA2 blocks:
S4.2.2.1 The number of output channels of the first-part 1×1×1 convolution layer in the EMDA block is set to 128 and that of the second-part MDA module to 128, with downsampling at stride (2, 2, 2); the number of output channels of the third-part 1×1×1 convolution layer is set to 512 and that of the identity-mapping part to 512; the output size is 512×8×32×32.
S4.2.2.2 The number of output channels of the first-part 1×1×1 convolution layer is set to 128, that of the second-part MDA module to 128, that of the third-part 1×1×1 convolution layer to 512 and that of the identity-mapping part to 512; the output size is 512×8×32×32. The above procedure is repeated 2 times.
S4.2.3 Construct the EMDA3 blocks:
S4.2.3.1 The number of output channels of the first-part 1×1×1 convolution layer is set to 256 and that of the second-part MDA module to 256, with downsampling at stride (2, 2, 2); the number of output channels of the third-part 1×1×1 convolution layer is set to 1024 and that of the identity-mapping part to 1024; the output size is 1024×4×16×16.
S4.2.3.2 The number of output channels of the first-part 1×1×1 convolution layer is set to 256, that of the second-part MDA module to 256, that of the third-part 1×1×1 convolution layer to 1024 and that of the identity-mapping part to 1024; the output size is 1024×4×16×16. The above procedure is repeated 4 times.
S4.2.4 Construct the EMDA4 blocks:
S4.2.4.1 The number of output channels of the first-part 1×1×1 convolution layer is set to 512 and that of the second-part MDA module to 512, with downsampling at stride (2, 2, 2); the number of output channels of the third-part 1×1×1 convolution layer is set to 2048 and that of the identity-mapping part to 2048; the output size is 2048×2×8×8.
S4.2.4.2 The number of output channels of the first-part 1×1×1 convolution layer is set to 512, that of the second-part MDA module to 512, that of the third-part 1×1×1 convolution layer to 2048 and that of the identity-mapping part to 2048; the output size is 2048×2×8×8. The above procedure is repeated once.
S4.3 Construct the output layer:
S4.3.1 The result of S4.2 is passed through three-dimensional global average pooling over the temporal and spatial dimensions; the output size is 2048×1×1×1.
S4.3.2 A fully connected layer with p neurons is constructed, and a Softmax layer with p neurons is connected to the fully connected layer for classifying the p dynamic gestures.
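Putting S4.1-S4.3 together, the following PyTorch skeleton reproduces the stated tensor sizes. It is a sketch under assumptions: batch-norm/ReLU placement, padding values, the pooling kernel and the 3×3×3 convolution standing in for the MDA module (sketched in later embodiments) are not fixed by the embodiment and are chosen only so the skeleton is self-contained and runnable.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified EMDA block: 1x1x1 conv -> middle op -> 1x1x1 conv plus a
    shortcut; a plain 3x3x3 conv stands in for the MDA module here."""
    def __init__(self, c_in, c_mid, c_out, stride=(1, 1, 1)):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv3d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm3d(c_mid), nn.ReLU(inplace=True),
            nn.Conv3d(c_mid, c_mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm3d(c_mid), nn.ReLU(inplace=True),
            nn.Conv3d(c_mid, c_out, 1, bias=False), nn.BatchNorm3d(c_out))
        self.shortcut = (
            nn.Identity() if c_in == c_out and stride == (1, 1, 1) else
            nn.Sequential(nn.Conv3d(c_in, c_out, 1, stride=stride, bias=False),
                          nn.BatchNorm3d(c_out)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + self.shortcut(x))

def make_stage(c_in, c_mid, c_out, n_blocks, stride):
    blocks = [Bottleneck(c_in, c_mid, c_out, stride)]
    blocks += [Bottleneck(c_out, c_mid, c_out) for _ in range(n_blocks - 1)]
    return nn.Sequential(*blocks)

class EMDANet(nn.Module):
    """EMDANet skeleton per S4.1-S4.3: stem conv and max-pool, four stages
    of 3/4/6/3 blocks, 3-D global average pooling and a classifier."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 64, (3, 7, 7), stride=(1, 2, 2),
                      padding=(1, 3, 3), bias=False),        # 64x16x128x128
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            # pooling parameters chosen to reproduce the stated 64x16x64x64
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)))
        self.stages = nn.Sequential(
            make_stage(64, 64, 256, 3, (1, 1, 1)),      # 256x16x64x64
            make_stage(256, 128, 512, 4, (2, 2, 2)),    # 512x8x32x32
            make_stage(512, 256, 1024, 6, (2, 2, 2)),   # 1024x4x16x16
            make_stage(1024, 512, 2048, 3, (2, 2, 2)))  # 2048x2x8x8
        self.pool = nn.AdaptiveAvgPool3d(1)             # 3-D global average pool
        self.fc = nn.Linear(2048, num_classes)          # Softmax lives in the loss

    def forward(self, x):                               # x: (B, 3, 16, 256, 256)
        return self.fc(self.pool(self.stages(self.stem(x))).flatten(1))
```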
In one embodiment, inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network for gesture recognition comprises:
inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network, preprocessing it in the input layer and passing the result to the intermediate layer for feature extraction to obtain a multidimensional feature map;
and, after convolving the multidimensional feature map, applying three-dimensional global average pooling to the convolved multidimensional feature map in the output layer according to the global average-pooling layer, and classifying the pooled multidimensional feature map with the Softmax layer to obtain the recognition result.
In a specific embodiment, as shown in fig. 5, two convolution blocks are added to the multidimensional attention module to build the efficient multidimensional attention block. During feature extraction the feature map is fed into two branches: one branch consists of a 1×1×1 convolution layer, the multidimensional attention (MDA) module and another 1×1×1 convolution layer; the other branch is a shortcut connected to the output of the first. The convolved multidimensional feature map is formed by adding the outputs of the two branches, which addresses the degradation problem. The main steps are:
The input feature-map sequence F is passed in turn through a 1×1×1 convolution layer, the MDA module and a 1×1×1 convolution layer, yielding a feature-map sequence F' of the same size as F.
The input feature-map sequence F is identity-mapped through the shortcut connection and added element-wise to F', alleviating the degradation problem; the convolved multidimensional feature map finally output is
output' = F + F'
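The two-branch structure above reduces to a few lines. In this sketch the MDA module is injected as a parameter so the block stays self-contained; passing `nn.Identity()` exercises only the residual wiring.

```python
import torch.nn as nn

class EMDABlock(nn.Module):
    """Efficient multidimensional attention block per the steps above:
    1x1x1 conv -> MDA -> 1x1x1 conv, plus an identity shortcut, so that
    output' = F + F'. `mda` is any size-preserving module."""
    def __init__(self, channels, mid_channels, mda: nn.Module):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid_channels, 1)
        self.mda = mda
        self.expand = nn.Conv3d(mid_channels, channels, 1)

    def forward(self, f):
        f_prime = self.expand(self.mda(self.reduce(f)))  # F', same size as F
        return f + f_prime                               # element-wise shortcut add
```

For instance, `EMDABlock(256, 64, nn.Identity())` wires the reduce/expand pair of the EMDA1 stage around a placeholder MDA module.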
As shown in fig. 8, all measured data are divided into training and test sets in a 7:3 ratio, and the parameter counts and recognition rates of 2D-ResNet-50, C3D, P3D-60, 3D-ResNet-50 and EMDANet-50 are compared. Compared with the two-dimensional convolutional neural network, the three-dimensional networks add temporal information and the recognition rate rises markedly. Because attention mechanisms are added in the spatial, channel and temporal dimensions, EMDANet has 16.5% fewer parameters than 3D-ResNet-50 while its recognition rate reaches 95.2%; it requires fewer parameters, achieves higher recognition accuracy and makes full use of the radar echo data. Fig. 9 shows the recognition rates of 2D-ResNet-50, C3D, P3D-60, 3D-ResNet-50 and EMDANet-50 when all measured data are divided by subject in a 7:3 ratio into training and test sets; the recognition rate of EMDANet-50 is higher than that of the other networks, exceeding 90% at the 35th iteration.
In one embodiment, inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network, preprocessing it in the input layer and passing it to the intermediate layer for feature extraction to obtain the multidimensional feature map comprises:
inputting the range-Doppler map sequence of the gesture to be detected into the trained efficient multidimensional attention neural network, preprocessing it in the input layer, passing it to the intermediate layer and extracting features from it with the spatial attention module to obtain a multi-scale fusion feature map;
computing weights over the range-Doppler map sequence with the channel attention module to obtain the channel weights corresponding to the feature map;
extracting features from the range-Doppler map sequence with the temporal attention module to obtain a temporal feature map;
and multiplying the multi-scale fusion feature map point by point with the corresponding channel weights and then adding the temporal feature map to obtain the multidimensional feature map, as the sketch below illustrates.
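The fusion described in these steps can be written compactly; the three submodules are injected and are sketched in the embodiments that follow, so this wrapper is an illustration rather than the patented code.

```python
import torch.nn as nn

class MDA(nn.Module):
    """Multidimensional attention fusion: output = F_s * w_c + F_t, i.e. the
    multi-scale fusion map weighted point by point by the channel weights,
    plus the temporal map (all broadcastable to the input size)."""
    def __init__(self, spatial: nn.Module, channel: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial = spatial      # produces the multi-scale fusion map F_s
        self.channel = channel      # produces the channel weights w_c
        self.temporal = temporal    # produces the temporal feature map F_t

    def forward(self, f):
        return self.spatial(f) * self.channel(f) + self.temporal(f)
```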
In one embodiment, extracting features from the range-Doppler map sequence with the spatial attention module to obtain the multi-scale fusion feature map comprises:
extracting features from the range-Doppler map sequence with the spatial attention module to obtain the multi-scale fusion feature map as
F_s = Conv(1×1×1, N→C)(F_s_all)
where F_s_all = Cat([F_s1, F_s2, …, F_sN]) and F_si = Conv(3×k_i×k_i, C'→1)(F_i), i = 1, 2, …, N; F_i denotes the i-th group of the range-Doppler map sequence, k_i the i-th split-splice convolution kernel size, F_si the feature map at the i-th scale, N the total number of split-splice convolution groups, C' the number of channels per group and C the number of input channels.
In a specific embodiment, the specific process of extracting features from the range-Doppler map sequence with the spatial attention module is as follows:
S2.1.1 For an input range-Doppler map sequence F ∈ R^(C×T×H×W), where H, W, C and T denote its height, width, number of input channels and number of frames, respectively, F is divided equally into N parts along the channel dimension, denoted [F_1, F_2, …, F_N], each part having C' = C/N channels; the i-th group of feature-map sequences is denoted F_i ∈ R^(C'×T×H×W), i = 1, 2, …, N.
S2.1.2 Multi-scale convolution kernels are applied to extract multi-scale spatial features, and the number of output channels of each group of feature-map sequences is set to 1 to reduce the parameter count; the multi-scale feature maps are
F_si = Conv(3×k_i×k_i, C'→1)(F_i), i = 1, 2, …, N
where k_i and F_si ∈ R^(1×T×H×W) denote the i-th convolution kernel size and the feature map at the i-th scale, respectively.
S2.1.3 The obtained multi-scale feature maps are spliced:
F_s_all = Cat([F_s1, F_s2, …, F_sN])
where F_s_all ∈ R^(N×T×H×W).
S2.1.4 The different feature-map sequences are fused by a 1×1×1 convolution layer whose number of output channels is set to C:
F_s = Conv(1×1×1, N→C)(F_s_all)
where F_s ∈ R^(C×T×H×W).
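S2.1.1-S2.1.4 correspond to the following sketch; the kernel sizes k_i = 3, 5, 7, 9 and the choice of N = 4 groups are assumptions made for illustration, since the embodiment leaves N and k_i open.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Split-splice convolution (SCC) spatial module: split the input into N
    channel groups, convolve group i with a 3 x k_i x k_i kernel down to one
    channel, splice the N maps and fuse back to C channels with a 1x1x1 conv."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        n = len(kernel_sizes)
        assert channels % n == 0
        self.group = channels // n                 # C' = C / N
        self.convs = nn.ModuleList(
            nn.Conv3d(self.group, 1, (3, k, k), padding=(1, k // 2, k // 2))
            for k in kernel_sizes)
        self.fuse = nn.Conv3d(n, channels, 1)      # Conv(1x1x1, N -> C)

    def forward(self, f):                          # f: (B, C, T, H, W)
        groups = torch.split(f, self.group, dim=1)             # [F_1, ..., F_N]
        f_all = torch.cat([conv(g) for conv, g in zip(self.convs, groups)],
                          dim=1)                               # F_s_all
        return self.fuse(f_all)                                # F_s
```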
In one embodiment, computing weights over the range-Doppler map sequence with the channel attention module to obtain the channel weights corresponding to the feature map comprises:
applying global average pooling and global max pooling to the range-Doppler map sequence along the temporal and spatial dimensions to obtain an average-pooled feature map and a max-pooled feature map;
splicing the average-pooled feature map and the max-pooled feature map along the channel dimension to obtain a pooled feature map;
and fusing and exciting the spliced pooled feature map with two fully connected layers to obtain the channel weights corresponding to the feature map.
In a specific embodiment, the specific process of computing weights over the range-Doppler map sequence with the channel attention module to obtain the channel weights corresponding to the feature map is as follows:
S2.2.1 For the input feature-map sequence F, global average pooling along the temporal and spatial dimensions gives F_gc ∈ R^C.
S2.2.2 For the input feature-map sequence F, global max pooling along the temporal and spatial dimensions gives F_mc ∈ R^C.
S2.2.3 F_gc and F_mc are spliced along the channel dimension:
F_c = Cat[(F_gc, F_mc), C]
where F_c ∈ R^(2C).
S2.2.4 The spliced features are fused and excited by two fully connected layers to obtain the weights of the channel dimension of the feature-map sequence:
w_c = σ(W_2 δ(W_1(F_c)))
where δ denotes the ReLU operation, W_1 and W_2 denote the two fully connected layers, σ denotes the Sigmoid function and w_c ∈ R^C is the attention weight.
S2.2.5 w_c is replicated along the temporal and spatial dimensions into w_c' ∈ R^(C×T×H×W), matching the size of F.
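A sketch of S2.2.1-S2.2.5; the reduction ratio `r` of the two fully connected layers is an assumed hyperparameter, since the embodiment does not state their hidden width.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel module: global average and max pooling over the temporal and
    spatial dimensions, channel-wise splice, then two fully connected layers
    (ReLU between, Sigmoid after) give one weight per channel."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(2 * channels, channels // r)   # W_1 (squeeze)
        self.fc2 = nn.Linear(channels // r, channels)       # W_2 (excite)
        self.act = nn.ReLU(inplace=True)                    # delta

    def forward(self, f):                        # f: (B, C, T, H, W)
        f_gc = f.mean(dim=(2, 3, 4))             # global average pool -> (B, C)
        f_mc = f.amax(dim=(2, 3, 4))             # global max pool     -> (B, C)
        f_c = torch.cat([f_gc, f_mc], dim=1)     # splice along channels
        w_c = torch.sigmoid(self.fc2(self.act(self.fc1(f_c))))
        return w_c[:, :, None, None, None]       # broadcastable to F's size
```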
In one embodiment, extracting features from the range-Doppler map sequence with the temporal attention module to obtain the temporal feature map comprises:
extracting features from the range-Doppler map sequence with the temporal attention module to obtain the temporal feature map as
F_t = σ(g_t W_t1) W_t2
where g_t is obtained by globally average-pooling F over the spatial dimensions, F denotes the range-Doppler map sequence, H and W denote its height and width respectively, W_t1 and W_t2 denote fully connected layers with different weights, and σ denotes the GeLU operation.
In a specific embodiment, the temporal attention module captures cross-frame relationships with a multi-layer perceptron (MLP) to obtain a global temporal cue. The specific process of extracting features from the range-Doppler map sequence with the temporal attention module is as follows:
S2.3.1 For the input feature-map sequence F, global average pooling is applied over the spatial dimensions, giving g_t ∈ R^(C×T).
S2.3.2 Two fully connected layers of the same dimension (W_t1 and W_t2) mix g_t across frames, modeling the global temporal relationship in the time dimension:
F_t = σ(g_t W_t1) W_t2
where W_t1 and W_t2 denote the fully connected layers and σ denotes the GeLU operation.
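A sketch of S2.3.1-S2.3.2; reading "cross-frame hybrid sharing" as two T×T fully connected layers that mix features along the frame dimension is an assumption of this example.

```python
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal module: spatial global average pooling, then two equal-sized
    fully connected layers mix across the T frames, F_t = GeLU(g_t W_t1) W_t2."""
    def __init__(self, num_frames):
        super().__init__()
        self.fc1 = nn.Linear(num_frames, num_frames)   # W_t1
        self.fc2 = nn.Linear(num_frames, num_frames)   # W_t2
        self.act = nn.GELU()                           # sigma

    def forward(self, f):                   # f: (B, C, T, H, W)
        g_t = f.mean(dim=(3, 4))            # spatial GAP -> (B, C, T)
        f_t = self.fc2(self.act(self.fc1(g_t)))
        return f_t[:, :, :, None, None]     # broadcast over H and W
```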
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may comprise multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different moments, and whose order of execution is not necessarily sequential; they may be executed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 10. The computer device comprises a processor, a memory, a network interface, a display screen and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program implements the radar micro-gesture recognition method based on an efficient multidimensional attention neural network. The display screen of the computer device may be a liquid crystal display or an electronic-ink display, and the input device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.
In one embodiment, a computer storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application in detail and are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be determined by the appended claims.

Claims (10)

1. A radar micro gesture recognition method based on an efficient multidimensional attention neural network, the method comprising:
Acquiring radar echo data; the radar echo data comprises a plurality of gestures to be detected;
Performing two-dimensional Fourier transform and filtering processing on the radar echo data to obtain a range-Doppler graph sequence of the detected gesture; dividing the range-Doppler graph sequence data according to a preset ratio to obtain a training set and a test set;
Constructing a multidimensional attention module according to a global max pooling layer, a global average pooling layer and a split-splice convolution module; the multidimensional attention module comprises a spatial attention module, a channel attention module and a temporal attention module;
Constructing an efficient multidimensional attention block using the multidimensional attention module and a plurality of convolution blocks, and constructing an efficient multidimensional attention neural network according to the efficient multidimensional attention block, a preset convolution layer, a max pooling layer, a global average pooling layer, a fully connected layer and a Softmax layer;
Training the efficient multidimensional attention neural network with the training set and the test set to obtain a trained efficient multidimensional attention neural network;
and inputting the range-Doppler graph sequence of the detected gesture into the trained efficient multidimensional attention neural network for gesture recognition.
2. The method of claim 1, wherein the radar echo data includes an intra-pulse fast time and an intra-pulse slow time; and performing two-dimensional Fourier transform and filtering processing on the radar echo data to obtain a range-Doppler graph sequence of the detected gesture comprises:
performing two-dimensional Fourier transform on the intra-pulse fast time and the intra-pulse slow time of the radar echo data to obtain a function of the distance and velocity of the target scattering points;
and filtering out the zero-frequency component of the function of the distance and velocity of the target scattering points by mean filtering to obtain the range-Doppler graph sequence of the detected gesture.
3. The method of claim 1, wherein performing two-dimensional Fourier transform on the intra-pulse fast time and the intra-pulse slow time of the radar echo data to obtain a function of the distance and velocity of the target scattering points comprises:
performing two-dimensional Fourier transform on the intra-pulse fast time and the intra-pulse slow time of the radar echo data to obtain the function of the distance and velocity of the target scattering points as

G(f_i, f_d) = Σ_{l=1}^{L} A_l · sinc[T_p(f_i + 2γR_l/c)] · sinc[N·T_p(f_d + 2v_l/λ)]

wherein N denotes that the slow-time FFT is performed every N pulse repetition periods T_p, A_l denotes the intensity of the l-th scattering point, R_l is the distance from the l-th scattering point to the radar, v_l is the velocity of the l-th scattering point, γ is the chirp rate of the transmitted signal, L denotes the total number of target scattering points, c is the speed of light, λ is the carrier wavelength, and f_i and f_d denote the fast time t̂ and the slow time t_m in the frequency domain after Fourier transform, corresponding to distance and velocity, respectively.
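For illustration only, not part of the claims: the processing of claims 2 and 3 can be sketched in a few lines of NumPy. The frame length, array shapes, and the mean-subtraction form of the zero-frequency filtering below are editorial assumptions, not the patented parameters.

```python
import numpy as np

def range_doppler_map(frame):
    """One frame of dechirped echo, shape (N, M):
    N pulses (slow time) x M fast-time samples per pulse."""
    # Mean filtering (assumed form): subtracting the slow-time mean
    # suppresses the zero-Doppler (static clutter) component, i.e. the
    # zero-frequency line of the subsequent FFT.
    frame = frame - frame.mean(axis=0, keepdims=True)
    # 2-D FFT: fast time -> range axis (f_i), slow time -> Doppler axis (f_d).
    rd = np.fft.fft2(frame)
    return np.abs(np.fft.fftshift(rd, axes=0))

def rd_sequence(echo, n_pulses_per_frame):
    """Split a long pulse stream into frames of N pulses each and form
    the range-Doppler graph sequence."""
    n_frames = echo.shape[0] // n_pulses_per_frame
    echo = echo[: n_frames * n_pulses_per_frame]
    frames = echo.reshape(n_frames, n_pulses_per_frame, -1)
    return np.stack([range_doppler_map(f) for f in frames])

# Example: 640 pulses of 256 samples -> a sequence of 10 range-Doppler maps.
echo = np.random.randn(640, 256) + 1j * np.random.randn(640, 256)
maps = rd_sequence(echo, 64)          # shape (10, 64, 256)
```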
4. The method according to any one of claims 1 to 3, wherein constructing a multidimensional attention module according to the global max pooling layer, the global average pooling layer and the split-splice convolution module comprises:
constructing the channel attention module according to the global max pooling layer and the global average pooling layer;
constructing the spatial attention module according to the split-splice convolution module;
and constructing the temporal attention module according to the global average pooling layer.
5. The method of claim 4, wherein the efficient multidimensional attention neural network comprises an input layer, an intermediate layer and an output layer; and constructing the efficient multidimensional attention neural network according to the efficient multidimensional attention block, the preset convolution layer, the max pooling layer, the global average pooling layer, the fully connected layer and the Softmax layer comprises:
constructing the input layer of the efficient multidimensional attention neural network according to the convolution layer and the max pooling layer;
constructing the intermediate layer of the efficient multidimensional attention neural network using a plurality of efficient multidimensional attention blocks comprising different multidimensional attention modules;
and constructing the output layer of the efficient multidimensional attention neural network according to the global average pooling layer, the fully connected layer and the Softmax layer.
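For orientation only: a hypothetical PyTorch skeleton of the input / intermediate / output layering described in claim 5, followed by a dummy inference call in the spirit of claim 6. The channel counts, kernel sizes, and the identity placeholder standing in for the efficient multidimensional attention blocks are assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn

class EMANet(nn.Module):
    """Hypothetical skeleton of the efficient multidimensional attention
    network; all hyperparameters are illustrative."""
    def __init__(self, num_classes, ema_blocks):
        super().__init__()
        # Input layer: convolution layer + max pooling layer (claim 5).
        self.stem = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.MaxPool3d(kernel_size=2),
        )
        # Intermediate layer: stacked efficient multidimensional
        # attention blocks (placeholders here).
        self.body = nn.Sequential(*ema_blocks)
        # Output layer: global average pooling + fully connected + Softmax.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
            nn.Softmax(dim=1),
        )

    def forward(self, x):          # x: (batch, 1, time, height, width)
        return self.head(self.body(self.stem(x)))

# Dummy inference: a batch of range-Doppler graph sequences passes through
# the input, intermediate and output layers and yields class probabilities.
model = EMANet(num_classes=6, ema_blocks=[nn.Identity()])
x = torch.randn(2, 1, 32, 64, 64)     # dummy range-Doppler graph sequences
probs = model(x)                      # (2, 6) Softmax probabilities
```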
6. The method of claim 5, wherein inputting the range-Doppler graph sequence of the detected gesture into the trained efficient multidimensional attention neural network for gesture recognition comprises:
inputting the range-Doppler graph sequence of the detected gesture into the trained efficient multidimensional attention neural network, preprocessing it in the input layer, and outputting it to the intermediate layer for feature extraction to obtain a multidimensional feature graph;
and after convolving the multidimensional feature graph, performing three-dimensional global average pooling on the convolved multidimensional feature graph in the global average pooling layer of the output layer, and performing gesture classification on the pooled multidimensional feature graph with the Softmax layer to obtain a recognition result.
7. The method of claim 4, wherein inputting the range-Doppler graph sequence of the detected gesture into the trained efficient multidimensional attention neural network, preprocessing it in the input layer, and outputting it to the intermediate layer for feature extraction to obtain a multidimensional feature graph comprises:
inputting the range-Doppler graph sequence of the detected gesture into the trained efficient multidimensional attention neural network, preprocessing it in the input layer, outputting it to the intermediate layer, and performing feature extraction on the range-Doppler graph sequence with the spatial attention module to obtain a multi-scale fusion feature graph;
performing weight calculation on the range-Doppler graph sequence with the channel attention module to obtain channel weights corresponding to the feature graph;
performing feature extraction on the range-Doppler graph sequence with the temporal attention module to obtain a temporal feature graph;
and multiplying the multi-scale fusion feature graph by the corresponding channel weights point by point, then adding the temporal feature graph to obtain the multidimensional feature graph.
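Under broadcasting, the fusion step at the end of claim 7 reduces to a one-liner; the tensor shapes below are editorial assumptions.

```python
import torch

def fuse_multidimensional(f_spatial, channel_weights, f_temporal):
    """f_spatial: (B, C, T, H, W) multi-scale fusion feature graph;
    channel_weights: (B, C, 1, 1, 1) from the channel attention module;
    f_temporal: (B, C, T, 1, 1) from the temporal attention module.
    Broadcasting performs the point-by-point multiply, then the add."""
    return f_spatial * channel_weights + f_temporal

# Dummy tensors illustrating the assumed shapes.
fused = fuse_multidimensional(torch.randn(2, 32, 16, 32, 32),
                              torch.rand(2, 32, 1, 1, 1),
                              torch.randn(2, 32, 16, 1, 1))
```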
8. The method of claim 4, wherein performing feature extraction on the range-Doppler graph sequence with the spatial attention module to obtain a multi-scale fusion feature graph comprises:
performing feature extraction on the range-Doppler graph sequence with the spatial attention module to obtain the multi-scale fusion feature graph as

F_s = Conv(1×1×1, N→C)(F_s_all)

wherein F_s_all = Cat([F_s1, F_s2, …, F_sN]) and F_si = Conv(3×k_i×k_i, C'→1)(F_i), i = 1, 2, …, N; F_i denotes the range-Doppler graph sequence, k_i denotes the kernel size of the i-th split-splice convolution, F_si denotes the feature graphs at different scales, N denotes the total number of split-splice convolution modules, C' denotes the number of channels of each split, and C denotes the number of input channels.
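Read literally, the formula of claim 8 suggests the following hypothetical sketch: split the C input channels into N groups of C' channels, convolve group i down to one channel with a 3×k_i×k_i kernel (F_si), splice the N maps (F_s_all), and fuse back to C channels with a 1×1×1 convolution (F_s). The kernel sizes (3, 5, 7, 9) and the 4-way split are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SplitSpliceSpatialAttention(nn.Module):
    """Hypothetical sketch of F_s = Conv(1x1x1, N->C)(Cat([F_s1..F_sN]))."""
    def __init__(self, in_channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        n = len(kernel_sizes)
        assert in_channels % n == 0, "C must divide evenly into N splits"
        c_split = in_channels // n                # C' channels per split
        # Branch i: Conv(3 x k_i x k_i, C' -> 1), spatial size preserved.
        self.branches = nn.ModuleList(
            nn.Conv3d(c_split, 1, kernel_size=(3, k, k),
                      padding=(1, k // 2, k // 2))
            for k in kernel_sizes)
        self.fuse = nn.Conv3d(n, in_channels, kernel_size=1)   # N -> C

    def forward(self, x):                         # x: (B, C, T, H, W)
        splits = torch.chunk(x, len(self.branches), dim=1)
        maps = [conv(s) for conv, s in zip(self.branches, splits)]
        return self.fuse(torch.cat(maps, dim=1))  # splice, then 1x1x1 fuse

# Example: 32 channels split into 4 scales.
out = SplitSpliceSpatialAttention(32)(torch.randn(2, 32, 16, 32, 32))
```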
9. The method of claim 4, wherein performing weight calculation on the range-Doppler graph sequence with the channel attention module to obtain channel weights corresponding to the feature graph comprises:
performing global average pooling and global max pooling on the range-Doppler graph sequence along the time and space dimensions to obtain an average-pooled feature graph and a max-pooled feature graph;
splicing the average-pooled feature graph and the max-pooled feature graph along the channel dimension to obtain a pooled feature graph;
and fusing and exciting the spliced pooled feature graph with two fully connected layers to obtain the channel weights corresponding to the feature graph.
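A hypothetical sketch of the channel attention steps of claim 9: global average and global max pooling over the time and space dimensions, splicing along the channel dimension, then two fully connected layers for fusion and excitation. The reduction ratio and the sigmoid on the output are editorial choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Hypothetical sketch of the channel attention module of claim 9."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(2 * channels, channels // reduction)  # fusion
        self.fc2 = nn.Linear(channels // reduction, channels)      # excitation
        self.act = nn.GELU()

    def forward(self, x):                     # x: (B, C, T, H, W)
        avg = x.mean(dim=(2, 3, 4))           # global average pool over T, H, W
        mx = x.amax(dim=(2, 3, 4))            # global max pool over T, H, W
        pooled = torch.cat([avg, mx], dim=1)  # splice along channel dimension
        w = torch.sigmoid(self.fc2(self.act(self.fc1(pooled))))
        return w.view(w.size(0), -1, 1, 1, 1) # channel weights (B, C, 1, 1, 1)

weights = ChannelAttention(32)(torch.randn(2, 32, 16, 32, 32))
```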
10. The method of claim 4, wherein performing feature extraction on the range-Doppler graph sequence with the temporal attention module to obtain a temporal feature graph comprises:
performing feature extraction on the range-Doppler graph sequence with the temporal attention module to obtain the temporal feature graph as

F_t = σ(g_t W_t1) W_t2

wherein g_t is obtained by global average pooling of the range-Doppler graph sequence F over its spatial dimensions, F denotes the range-Doppler graph sequence, H and W denote the height and width of the range-Doppler graph sequence, W_t1 and W_t2 denote the fully connected layers, and σ denotes the GeLU operation.
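A hypothetical sketch of the temporal attention of claim 10: global average pooling over the spatial dimensions produces g_t, which passes through the two fully connected layers W_t1 and W_t2 with a GeLU in between. The hidden width and the output reshaping are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical sketch of F_t = GeLU(g_t @ W_t1) @ W_t2 (claim 10)."""
    def __init__(self, channels, hidden=16):
        super().__init__()
        self.w_t1 = nn.Linear(channels, hidden)    # W_t1
        self.w_t2 = nn.Linear(hidden, channels)    # W_t2
        self.act = nn.GELU()                       # sigma in the claim

    def forward(self, x):                          # x: (B, C, T, H, W)
        g_t = x.mean(dim=(3, 4))                   # spatial GAP over H and W
        g_t = g_t.transpose(1, 2)                  # (B, T, C) for the linears
        f_t = self.w_t2(self.act(self.w_t1(g_t)))  # (B, T, C)
        # Back to (B, C, T, 1, 1) so it broadcasts against the input.
        return f_t.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)

f_t = TemporalAttention(32)(torch.randn(2, 32, 16, 32, 32))
```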
CN202210551031.0A 2022-05-20 2022-05-20 Efficient multidimensional attention neural network-based radar micro gesture recognition method Active CN114895275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210551031.0A CN114895275B (en) 2022-05-20 2022-05-20 Efficient multidimensional attention neural network-based radar micro gesture recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210551031.0A CN114895275B (en) 2022-05-20 2022-05-20 Efficient multidimensional attention neural network-based radar micro gesture recognition method

Publications (2)

Publication Number Publication Date
CN114895275A CN114895275A (en) 2022-08-12
CN114895275B (en) 2024-06-14

Family

ID=82724596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210551031.0A Active CN114895275B (en) 2022-05-20 2022-05-20 Efficient multidimensional attention neural network-based radar micro gesture recognition method

Country Status (1)

Country Link
CN (1) CN114895275B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758631B (en) * 2023-06-13 2023-12-22 杭州追形视频科技有限公司 Big data driven behavior intelligent analysis method and system
CN116509382A (en) * 2023-07-03 2023-08-01 深圳市华翌科技有限公司 Human body activity intelligent detection method and health monitoring system based on millimeter wave radar

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3364342A1 (en) * 2017-02-17 2018-08-22 Cogisen SRL Method for image processing and video compression
US10726062B2 (en) * 2018-11-30 2020-07-28 Sony Interactive Entertainment Inc. System and method for converting image data into a natural language description
KR102228524B1 (en) * 2019-06-27 2021-03-15 한양대학교 산학협력단 Non-contact type gesture recognization apparatus and method
CN111091045B (en) * 2019-10-25 2022-08-23 重庆邮电大学 Sign language identification method based on space-time attention mechanism
WO2021068470A1 (en) * 2020-04-09 2021-04-15 浙江大学 Radar signal-based identity and gesture recognition method
CN112329525A (en) * 2020-09-27 2021-02-05 中国科学院软件研究所 Gesture recognition method and device based on space-time diagram convolutional neural network
CN113850135A (en) * 2021-08-24 2021-12-28 中国船舶重工集团公司第七0九研究所 Dynamic gesture recognition method and system based on time shift frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shahzad Ahmed, "Radar-Based Air-Writing Gesture Recognition Using a Novel Multistream CNN Approach," IEEE Internet of Things Journal, 2022-07-08, full text. *

Also Published As

Publication number Publication date
CN114895275A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN109949255B (en) Image reconstruction method and device
CN114895275B (en) Efficient multidimensional attention neural network-based radar micro gesture recognition method
Hazirbas et al. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
Khan et al. SD-Net: Understanding overcrowded scenes in real-time via an efficient dilated convolutional neural network
CN113874883A (en) Hand pose estimation
CN110222760B (en) Quick image processing method based on winograd algorithm
KR102140805B1 (en) Neural network learning method and apparatus for object detection of satellite images
CN111476806A (en) Image processing method, image processing device, computer equipment and storage medium
US20230153965A1 (en) Image processing method and related device
CN117079098A (en) Space small target detection method based on position coding
Jain et al. Encoded motion image-based dynamic hand gesture recognition
Fang et al. SCENT: A new precipitation nowcasting method based on sparse correspondence and deep neural network
CN113534678B (en) Migration method from simulation of operation question-answering task to physical system
KR102637342B1 (en) Method and apparatus of tracking target objects and electric device
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
CN111914809B (en) Target object positioning method, image processing method, device and computer equipment
CN112346056B (en) Resolution characteristic fusion extraction method and identification method of multi-pulse radar signals
He et al. From macro to micro: rethinking multi-scale pedestrian detection
Khan et al. Suspicious Activities Recognition in Video Sequences Using DarkNet-NasNet Optimal Deep Features.
CN113807330A (en) Three-dimensional sight estimation method and device for resource-constrained scene
Murata et al. Segmentation of Cell Membrane and Nucleus using Branches with Different Roles in Deep Neural Network.
CN113743189B (en) Human body posture recognition method based on segmentation guidance
JP2019125128A (en) Information processing device, control method and program
Ma et al. Har enhanced weakly-supervised semantic segmentation coupled with adversarial learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant