CN110929637B - Image recognition method and device, electronic equipment and storage medium - Google Patents

Image recognition method and device, electronic equipment and storage medium

Info

Publication number
CN110929637B
CN110929637B (application CN201911139594.3A)
Authority
CN
China
Prior art keywords
tensor
skeleton
layer
determining
human skeleton
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911139594.3A
Other languages
Chinese (zh)
Other versions
CN110929637A (en
Inventor
谷宇章
杨洪业
张晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN201911139594.3A
Publication of CN110929637A
Application granted
Publication of CN110929637B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image recognition method and apparatus, an electronic device, and a storage medium. A human skeleton image sequence is obtained; a corresponding relative coordinate set is determined for the skeleton joint point set of each frame of human skeleton image; a relative coordinate tensor is determined based on the relative coordinate sets, the number of skeleton joint points, and the number of frames in the human skeleton image sequence; a plurality of inter-frame difference value sets are determined; a time difference tensor is determined based on the plurality of inter-frame difference value sets, the number of skeleton joint points, and the number of frames in the human skeleton image sequence; an input tensor is determined based on the relative coordinate tensor and the time difference tensor; and motion recognition is performed on the input tensor by a trained motion recognition model to obtain the motion category corresponding to the human skeleton image sequence. By constructing the input tensor of a motion recognition model based on a graph convolutional network from human skeleton joint point information and performing motion recognition on it, the accuracy of human motion recognition can be improved.

Description

Image recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to an image recognition method, an image recognition device, an electronic device, and a storage medium.
Background
Understanding human behavior is one of the most important tasks in computer vision, as it facilitates a wide range of applications such as human-computer interaction, robotics, and game control. The skeleton, which consists of three-dimensional joint positions, provides a good representation for describing human behavior.
In recent years, with the rapid development of three-dimensional data acquisition devices such as the Microsoft Kinect, skeleton data have become easier to acquire. In addition, the skeleton itself is a high-level feature of the human body that is invariant to appearance, which eases the difficulty of representing and understanding different action categories. Most importantly, the skeleton is robust to noise and efficient in computation and storage. Therefore, skeleton-based action recognition has received increasing attention in recent years.
In many existing approaches, joint coordinate vectors are directly input to a recurrent neural network (Recurrent Neural Network, RNN), or the skeleton sequence is encoded into a pseudo-image and a convolutional neural network (Convolutional Neural Network, CNN) is used to model the spatio-temporal dynamics.
However, these approaches rarely explore the inherent dependencies between joints. To capture such dependencies, the skeleton data should be fully exploited. In terms of data structure, a skeleton is a special graph whose vertices are joints and whose edges are bones. Therefore, by using a graph convolutional network (Graph Convolutional Network, GCN) to mine the structural information of the human body, better performance can be obtained than with non-graph networks, and the accuracy of human action recognition can be improved.
Disclosure of Invention
The embodiment of the application provides an image recognition method, an image recognition device, electronic equipment and a storage medium, which can improve the accuracy of human action recognition.
In one aspect, an embodiment of the present application provides an image recognition method, including:
acquiring a human skeleton image sequence; the human skeleton image sequence comprises continuous multi-frame human skeleton images; skeleton joint points of each frame of human skeleton image are consistent;
determining a corresponding relative coordinate set in a skeleton node set of each frame of human skeleton image; the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton joint points in the skeleton joint point set;
determining a relative coordinate tensor based on the relative coordinate set, the number of skeleton joints and the number of frames of images in the human skeleton image sequence;
determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;
determining a time difference tensor based on the plurality of sets of inter-frame difference values, the number of skeleton nodes, and the number of frames of images in the human skeleton image sequence;
determining an input tensor based on the relative coordinate tensor and the time difference tensor;
and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence.
In another aspect, an embodiment of the present application provides an image recognition apparatus, including:
the first acquisition module is used for acquiring a human skeleton image sequence; the human skeleton image sequence comprises continuous multi-frame human skeleton images; skeleton joint points of each frame of human skeleton image are consistent;
the first determining module is used for determining a corresponding relative coordinate set in a skeleton node set of each frame of human skeleton image; the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton joint points in the skeleton joint point set;
the second determining module is used for determining a relative coordinate tensor based on the relative coordinate set, the number of skeleton joints and the number of frames of images in the human skeleton image sequence;
the third determining module is used for determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;
a fourth determining module, configured to determine a time difference tensor based on a plurality of inter-frame difference value sets, the number of skeleton nodes, and the number of frames of images in the human skeleton image sequence;
a fifth determining module for determining an input tensor based on the relative coordinate tensor and the time difference tensor;
and the action recognition module is used for carrying out action recognition on the input tensor based on the trained action recognition model to obtain an action category corresponding to the human skeleton image sequence.
In another aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a memory, and at least one instruction, at least one program, a code set, or an instruction set is stored in the memory, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the image recognition method described above.
In another aspect, an embodiment of the present application provides a computer readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored, where at least one instruction, at least one program, a code set, or an instruction set is loaded and executed by a processor to implement the image recognition method described above.
The image recognition method, the device, the electronic equipment and the storage medium provided by the embodiment of the application have the following beneficial effects:
acquiring a human skeleton image sequence, wherein the human skeleton image sequence comprises continuous multi-frame human skeleton images and the skeleton joint points of each frame of human skeleton image are consistent; determining a corresponding relative coordinate set for the skeleton joint point set of each frame of human skeleton image, wherein the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton joint points in the skeleton joint point set; determining a relative coordinate tensor based on the relative coordinate sets, the number of skeleton joint points, and the number of frames in the human skeleton image sequence; determining a plurality of inter-frame difference value sets from the plurality of relative coordinate sets corresponding to the human skeleton image sequence; determining a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton joint points, and the number of frames in the human skeleton image sequence; determining an input tensor based on the relative coordinate tensor and the time difference tensor; and performing motion recognition on the input tensor with the trained motion recognition model to obtain the motion category corresponding to the human skeleton image sequence. According to the method and the device, the input tensor of a motion recognition model based on a graph convolutional network is constructed using human skeleton joint point information and motion recognition is performed on it, so that the accuracy of human motion recognition can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of an image recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a human skeleton data set according to an embodiment of the present application;
FIG. 4 is a schematic view of a human skeleton provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a relative coordinate tensor according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an input tensor according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an action recognition model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a convolutional layer according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a trained adjacency matrix provided by an embodiment of the present application;
FIG. 10 is a flow chart of a spatiotemporal attention extraction operation provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of an image recognition device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and in the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the present application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application, which includes a data processing module 101 and an action recognition model 102. After a human skeleton image sequence is obtained, it passes through the data processing module 101 and the action recognition model 102 in sequence, and together they output the action category corresponding to the human skeleton image sequence.
The human skeleton image sequence is input into a data processing module 101; wherein the human skeleton image sequence comprises continuous multi-frame human skeleton images; the skeleton joint points of each frame of human skeleton image are consistent. The data processing module 101 determines a corresponding relative coordinate set in a skeleton node set of each frame of human skeleton image; the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton joint points in the skeleton joint point set; the data processing module 101 determines a relative coordinate tensor based on the relative coordinate set, the number of skeletal joints, and the number of frames of images in the sequence of human skeletal images. The data processing module 101 determines a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence, and determines a time difference tensor based on the plurality of inter-frame difference value sets, the number of skeleton nodes and the number of frames of images in the human skeleton image sequence; the data processing module 101 uses the relative coordinate tensor and the time difference tensor in series as input tensor, inputs the input tensor into the trained motion recognition model 102, and the motion recognition model 102 performs motion recognition on the input tensor to obtain a motion category corresponding to the human skeleton image sequence.
Optionally, in another application scenario, the data processing module 101 may also be used as a part of the motion recognition model 102, the human skeleton image sequence is used as an input of the motion recognition model 102, and the motion category corresponding to the human skeleton image sequence is output through the motion recognition model 102.
Alternatively, the data processing module 101 and the action recognition model 102 may be provided in the same device, such as a mobile terminal, a computer terminal, a server, or similar computing means; alternatively, the data processing module 101 and the action recognition model 102 may be provided in a plurality of devices, which are in one system; alternatively, the data processing module 101 and the action recognition model 102 may be provided on one platform. Therefore, the execution subject of the embodiments of the present application may be a mobile terminal, a computer terminal, a server, or a similar computing device; may be a system or a platform.
A specific embodiment of an image recognition method according to the present application is described below. Fig. 2 is a schematic flow chart of an image recognition method according to an embodiment of the present application. The present specification provides method operation steps as in the embodiment or flowchart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one possible execution order and does not represent the only order of execution. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be executed sequentially or in parallel (e.g., in a parallel-processor or multithreaded environment). As shown in fig. 2, the method may include:
S201: acquiring a human skeleton image sequence; the human skeleton image sequence comprises continuous multi-frame human skeleton images; the skeleton joint points of each frame of human skeleton image are consistent.
In this embodiment of the present application, the human skeleton image sequence may be acquired by a depth sensor (such as the Microsoft Kinect), and the data acquired by the depth sensor further includes the three-dimensional coordinate information of the skeleton joint points in each frame of human skeleton image.
At present, a large number of open-source human skeleton data sets are available for experimental verification. For example, the NTU RGB+D dataset was captured simultaneously by three Microsoft Kinect cameras; it covers 60 action categories and contains more than 50,000 action samples, with video, a depth image sequence, and three-dimensional skeleton data for each sample. Referring to fig. 3, fig. 3 is a schematic diagram of a human skeleton data set according to an embodiment of the present application; fig. 3(a) shows the three-dimensional skeleton data of the NTU RGB+D dataset, including three-dimensional coordinate information of 25 joint points. The three-dimensional coordinate information is obtained by the bone-tracking technique of the Kinect camera, which builds the coordinates of the individual joints of the human body by processing depth data: it can determine the individual parts of the human body, such as which parts are the hands, head, and torso, as well as where they are in space. Similarly, in addition to the NTU RGB+D dataset there is the HDM05 dataset, shown in fig. 3(b), whose three-dimensional skeleton data includes three-dimensional coordinate information of 31 joint points.
S203: determining a corresponding relative coordinate set in a skeleton node set of each frame of human skeleton image; the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton nodes in the skeleton node set.
In this embodiment of the present application, the number of skeleton joint points in each frame of human skeleton image may be determined according to a specific algorithm. For example, the number of skeleton joint points in each frame of human skeleton image acquired based on the NTU RGB+D dataset is 25, and the dataset further includes the three-dimensional coordinate information sets corresponding to these 25 joint points.
An alternative implementation way for determining the corresponding relative coordinate set in the skeleton node set of each frame of human skeleton image is that a root node is determined from the skeleton node set; determining the relative coordinates of each skeleton node in a skeleton node set in each frame of human skeleton image based on the root node in the skeleton node set, and obtaining a relative coordinate set.
The above will be described by way of a specific example. Referring to FIG. 4, FIG. 4 is a schematic view of a human skeleton provided in an embodiment of the present application; for ease of illustration, it is assumed here that the number of joint points determined by the algorithm is 5. Human skeleton images of 10 continuous frames are acquired through a depth sensor, and the three-dimensional coordinate information sets of the 5 skeleton joint points in each frame are acquired at the same time. For example, the three-dimensional coordinates of the 5 skeleton joint points in the 1st frame are: head joint point A1(90, 90, 90), hand joint point B1(100, 80, 60), hand joint point C1(80, 100, 60), leg joint point D1(100, 80, 0), and leg joint point E1(80, 100, 0); in the 2nd frame they are A2(90, 90, 92), B2(100, 80, 62), C2(80, 100, 62), D2(100, 80, 10), E2(80, 100, 10); and in the 10th frame they are A10(90, 90, 110), B10(100, 80, 80), C10(80, 100, 80), D10(100, 80, 50), E10(80, 100, 50). The head joint point A is determined as the root node from the 5 skeleton joint points A, B, C, D, and E, and the relative coordinates of the 5 skeleton joint points in each frame with respect to the head joint point A are determined to obtain the relative coordinate sets; the head joint point A in each frame of human skeleton image is therefore the origin (0, 0, 0). For example, the relative coordinate set of frame 1 comprises A'1(0, 0, 0), B'1(10, -10, -30), C'1(-10, 10, -30), D'1(10, -10, -90), E'1(-10, 10, -90); the relative coordinate set of frame 2 comprises A'2(0, 0, 0), B'2(10, -10, -30), C'2(-10, 10, -30), D'2(10, -10, -82), E'2(-10, 10, -82); and the relative coordinate set of frame 10 comprises A'10(0, 0, 0), B'10(10, -10, -30), C'10(-10, 10, -30), D'10(10, -10, -60), E'10(-10, 10, -60).
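The conversion to root-relative coordinates in the example above can be sketched as follows (a minimal illustration using the example's frame-1 values; the function name and list layout are ours, not the patent's):

```python
# Sketch of step S203: subtract the root joint (here the head, index 0)
# from every joint of a frame to obtain relative coordinates.
def relative_coordinates(frame, root_index=0):
    rx, ry, rz = frame[root_index]
    return [(x - rx, y - ry, z - rz) for (x, y, z) in frame]

# Frame 1 absolute coordinates for joints A, B, C, D, E from the example:
frame1 = [(90, 90, 90), (100, 80, 60), (80, 100, 60), (100, 80, 0), (80, 100, 0)]
print(relative_coordinates(frame1))
# A'=(0,0,0), B'=(10,-10,-30), C'=(-10,10,-30), D'=(10,-10,-90), E'=(-10,10,-90)
```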
S205: the relative coordinate tensor is determined based on the relative coordinate set, the number of skeleton joints, and the number of frames of the images in the sequence of human skeleton images.
In this embodiment of the present application, the human skeleton data are converted into a tensor of size C x T x V. Referring to fig. 5, fig. 5 is a schematic structural diagram of a relative coordinate tensor provided in this embodiment of the present application, where C represents the number of channels (in this application each human skeleton joint point is represented by three-dimensional coordinate information, i.e., the number of channels is 3), and the three channels x, y, z respectively carry the corresponding components of the relative coordinate set of the skeleton joint point set in each frame; T represents the frame (time) dimension of the human skeleton image sequence; and V represents the joint point dimension of the human skeleton.
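The C x T x V arrangement of Fig. 5 can be sketched with plain nested lists (an illustrative rearrangement under our own naming; the patent does not prescribe this code):

```python
# Sketch of step S205: arrange per-frame relative coordinates into a
# C x T x V tensor (C=3 channels x/y/z, T frames, V joints).
def to_tensor(sequence):
    """sequence: list of T frames, each a list of V (x, y, z) tuples."""
    T, V = len(sequence), len(sequence[0])
    # tensor[c][t][v] = channel-c component of joint v in frame t
    return [[[sequence[t][v][c] for v in range(V)] for t in range(T)]
            for c in range(3)]

seq = [[(0, 0, 0), (10, -10, -30)],   # frame 1: joints A', B'
       [(0, 0, 0), (10, -10, -28)]]   # frame 2
tensor = to_tensor(seq)
# the z channel of joint B' over time:
print([tensor[2][t][1] for t in range(2)])   # [-30, -28]
```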
S207: and determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence.
In the embodiment of the application, after the three-dimensional coordinate information set corresponding to the skeleton joint point set in each frame is converted into a relative coordinate set with respect to the root node, a plurality of inter-frame difference value sets are determined channel by channel (x channel, y channel, and z channel) from the plurality of relative coordinate sets corresponding to the human skeleton image sequence; an inter-frame difference value is the relative displacement of a given skeleton joint point between two adjacent frames.
The description is continued based on the above example. The inter-frame difference of joint point D between the 2nd frame and the 1st frame, computed channel by channel, is 0 on the x channel, 0 on the y channel, and 8 on the z channel, which indicates that the leg of the human body moves only in the z-axis direction. Likewise, the inter-frame differences of joint point B between the 2nd frame and the 1st frame are all 0 on the x, y, and z channels, which indicates that the hand of the human body does not move in any direction.
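The inter-frame differencing just described can be sketched as follows (illustrative names; the values reproduce the D and B joints of the example):

```python
# Sketch of step S207: channel-wise inter-frame difference of relative
# coordinates between two consecutive frames.
def frame_difference(prev_frame, next_frame):
    """Per-joint (dx, dy, dz) between two consecutive frames."""
    return [(x2 - x1, y2 - y1, z2 - z1)
            for (x1, y1, z1), (x2, y2, z2) in zip(prev_frame, next_frame)]

# Relative coordinates of joints A..E in frames 1 and 2 (from the example):
f1 = [(0, 0, 0), (10, -10, -30), (-10, 10, -30), (10, -10, -90), (-10, 10, -90)]
f2 = [(0, 0, 0), (10, -10, -30), (-10, 10, -30), (10, -10, -82), (-10, 10, -82)]
print(frame_difference(f1, f2))
# joint D moves 8 along z only: (0, 0, 8); joint B does not move: (0, 0, 0)
```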
S209: a time difference tensor is determined based on the plurality of sets of inter-frame difference values, the number of skeleton nodes, and the number of frames of images in the sequence of human skeleton images.
S211: an input tensor is determined based on the relative coordinate tensor and the time difference tensor.
In this embodiment of the present application, in addition to the relative coordinate tensor constructed above to capture the characteristics of human motion in the spatial domain, a time difference tensor is constructed from the inter-frame difference value sets to extract the characteristics of human motion in the time domain, and the time difference tensor and the relative coordinate tensor are concatenated as the input tensor C x T x V. Referring to fig. 6, fig. 6 is a schematic structural diagram of an input tensor according to an embodiment of the present application, where the three channels x, y, z respectively carry the relative coordinate sets of the skeleton joint point set in each frame, and the other three channels Δx, Δy, and Δz respectively carry the inter-frame difference value sets, i.e., the change of the relative coordinate set of each frame with respect to the previous frame on the corresponding x, y, and z channels.
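The channel-wise concatenation into a 6-channel input tensor can be sketched as follows (a shape-only illustration; the function name and nested-list layout are ours):

```python
# Sketch of step S211: concatenate the relative-coordinate tensor (3 x T x V)
# and the time-difference tensor (3 x T x V) along the channel axis,
# yielding the 6-channel input tensor [x, y, z, dx, dy, dz] of Fig. 6.
def build_input_tensor(coord_tensor, diff_tensor):
    # both are nested lists indexed [channel][frame][joint]
    return coord_tensor + diff_tensor  # list concat over the channel dim

T, V = 10, 5  # frame count and joint count from the running example
coord = [[[0.0] * V for _ in range(T)] for _ in range(3)]
diff = [[[0.0] * V for _ in range(T)] for _ in range(3)]
inp = build_input_tensor(coord, diff)
print(len(inp), len(inp[0]), len(inp[0][0]))   # 6 10 5
```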
S213: and performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence.
In an embodiment of the present application, the action recognition model may be a network model improved on the basis of a graph convolutional network model, and may include: an input layer, 1 batch normalization layer (Batch Normalization, BN), 10 convolution layers, 1 global average pooling layer (Global Average Pooling, GAP), 1 fully connected layer (Fully Connected, FC), and an output layer. Each convolution layer comprises 1 pseudo-graph convolution module, 1 spatio-temporal attention extraction module, and 1 temporal convolution module. The acquired human skeleton image sequence is fed to the input layer of the motion recognition model, which determines the input tensor and passes it to the batch normalization layer; normalizing the input of the motion recognition model in this way helps avoid vanishing and exploding gradients and improves the training speed. Feature tensors are then extracted from the normalized input tensor by the 10 convolution layers in sequence, and the feature tensor output by the final convolution layer is fed into the global average pooling layer, whose purpose is to reduce the feature dimensions. The tensor output by the global average pooling layer is then fed into the fully connected layer to obtain the classification scores of the human skeleton image sequence, and finally human action classification and recognition are completed by the Softmax classification module of the output layer.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an action recognition model provided in the embodiment of the present application, which is sequentially an input layer, a batch normalization layer, a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a fifth convolution layer, a sixth convolution layer, a seventh convolution layer, an eighth convolution layer, a ninth convolution layer, a tenth convolution layer, a global average pooling layer, a full connection layer, and an output layer. In one specific example:
the function of the input layer may be the function of data processing, and steps S201 to S211 are performed to acquire a human skeleton image sequence and determine an input tensor.
The function of the batch normalization layer is to normalize the data, and this technique is common knowledge of those skilled in the art, and will not be described here again.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a convolution layer provided in the embodiment of the present application; each of the 10 convolution layers in the embodiment of the present application may follow this structure, comprising 1 pseudo-graph convolution module, 1 spatio-temporal attention extraction module, and 1 temporal convolution module. Since the input tensor has 6 channels, the pseudo-graph convolution module of the first convolution layer has 6 input channels and 64 output channels. The pseudo-graph convolution modules of the second, third, and fourth convolution layers each have 64 input channels and 64 output channels. The pseudo-graph convolution module of the fifth convolution layer has 64 input channels and 128 output channels. The pseudo-graph convolution modules of the sixth and seventh convolution layers each have 128 input channels and 128 output channels. The pseudo-graph convolution module of the eighth convolution layer has 128 input channels and 256 output channels. The ninth and tenth convolution layers each have 256 input channels and 256 output channels. The stride of the fifth and eighth convolution layers may be set to 2.
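The per-layer channel plan described above can be written out as data, which also makes it easy to check that the channel counts chain correctly (the tuple layout and layer list are our own sketch of the text, not the patent's code):

```python
# Channel configuration of the 10 convolution layers described in the text:
# (in_channels, out_channels, stride); strides of 2 on layers 5 and 8.
LAYERS = [
    (6,   64, 1), (64,  64, 1), (64,  64, 1), (64,  64, 1),
    (64, 128, 2), (128, 128, 1), (128, 128, 1),
    (128, 256, 2), (256, 256, 1), (256, 256, 1),
]
assert len(LAYERS) == 10
# each layer's input channel count must equal the previous layer's output
assert all(LAYERS[i][1] == LAYERS[i + 1][0] for i in range(9))
print(LAYERS[-1][1])  # 256-dimensional features feed the global average pooling
```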
The output tensor of the tenth convolution layer is fed into the global average pooling layer, which extracts a 256-dimensional feature vector for the human skeleton image sequence.
After the global average pooling layer, the output tensor is fed into the fully connected layer to obtain the action classification scores corresponding to the human skeleton image sequence, and human action classification is then completed by the Softmax classification module of the output layer.
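The pooling-plus-classification head described above can be sketched in NumPy as follows (a minimal sketch under assumptions; the function name and weight shapes are illustrative, not from the patent):

```python
import numpy as np

def classify(features, fc_weight, fc_bias):
    """Global average pooling over time and joints, a fully connected
    layer, then Softmax scores, as the head described above.

    features:  (256, T, V) output of the tenth convolution layer
    fc_weight: (num_classes, 256) fully connected layer weights (assumed)
    fc_bias:   (num_classes,) fully connected layer bias (assumed)
    """
    pooled = features.mean(axis=(1, 2))      # 256-dimensional feature vector
    scores = fc_weight @ pooled + fc_bias    # per-class classification scores
    exp = np.exp(scores - scores.max())      # numerically stable Softmax
    return exp / exp.sum()
```

The returned vector sums to 1, and the index of its maximum is the predicted action category.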
In the embodiment of the present application, each convolution layer comprises 1 pseudo-graph convolution module, 1 spatio-temporal attention extraction module, and 1 temporal convolution module. The pseudo-graph convolution modules of the 10 convolution layers each acquire a trained adjacency matrix and perform a pseudo-graph convolution operation based on the product of the tensor output by the previous convolution layer and the adjacency matrix, obtaining a spatial feature tensor. The spatio-temporal attention extraction module then performs a spatio-temporal attention extraction operation based on the spatial feature tensor to obtain a spatio-temporal calibration feature tensor, which comprises a plurality of feature planes of different weights. Finally, the temporal convolution module performs a temporal convolution operation on the spatio-temporal calibration feature tensor to obtain the output tensor of the convolution layer.
In an alternative implementation of performing the pseudo-graph convolution operation based on the product of the tensor output by the previous convolution layer and the adjacency matrices, the spatial feature tensor may be determined according to formula (1):

$$f_{out} = \sum_{i=1}^{N} W_i\, f_{in}\, \hat{A}_i \tag{1}$$

where $f_{out}$ represents the spatial feature tensor; $W_i$ represents the weight; $f_{in}$ represents the tensor output by the previous convolution layer; $\hat{A}_i$ represents the $i$-th trained adjacency matrix in each pseudo-graph convolution module; and $N$ represents the number of adjacency matrices in each layer.
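Formula (1) can be sketched in NumPy as follows, treating each $W_i$ as a 1×1 convolution (a channel-mixing matrix) over a $C\times T\times V$ feature tensor. This is a sketch under assumptions: the argument shapes and function name are illustrative, not taken from the patent.

```python
import numpy as np

def pseudo_graph_conv(f_in, weights, adjacency):
    """Sketch of formula (1): f_out = sum_i W_i * f_in * A_i.

    f_in:      (C_in, T, V) tensor output by the previous layer
    weights:   (N, C_out, C_in) one 1x1-convolution weight W_i per matrix
    adjacency: (N, V, V) the N trained, learnable adjacency matrices A_i
    """
    f_out = None
    for i in range(adjacency.shape[0]):
        # aggregate joint features through A_i, then mix channels with W_i
        aggregated = np.einsum('ctv,vw->ctw', f_in, adjacency[i])
        term = np.einsum('oc,ctv->otv', weights[i], aggregated)
        f_out = term if f_out is None else f_out + term
    return f_out
```

With a 6-channel input and 64 output channels this reproduces the channel change of the first convolution layer's pseudo-graph convolution module.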
In this embodiment of the present application, the trained adjacency matrices corresponding to the pseudo-graph convolution modules of the respective convolution layers differ from one another. Referring to fig. 9, fig. 9 is a schematic diagram of the trained adjacency matrices according to an embodiment of the present application. The first row of fig. 9 shows the original adjacency matrices obtained by NTU-RGB+D Cross-Subject training in the prior art; these matrices represent only the physically direct connection relationships between the joints of the human skeleton and remain fixed after training. The second and third rows of fig. 9 show the 10 adjacency matrices obtained by NTU-RGB+D Cross-Subject training in the embodiment of the present application: the first column of the second row may be the matrix in the pseudo-graph convolution module of the first convolution layer, and the last column of the third row may be the matrix in the pseudo-graph convolution module of the tenth convolution layer. The adjacency matrices provided in the embodiment of the present application are obtained by learning both the physically direct and the non-physically direct connection relationships between the joint points; training yields 10 mutually different adjacency matrices that act on the 10 convolution layers respectively, so that multi-level semantic information can be extracted and the flexibility of the model is enhanced. Because the adjacency matrix is learnable and independent of the predefined graph and normalized adjacency matrix of the prior art, the module is called a pseudo-graph convolution module.
In the embodiment of the present application, it is considered that the positions and motion states of the skeleton joints in the three spatial directions contribute differently to action classification, and that certain frames containing salient features play an important role in distinguishing action types. Therefore, in the convolution layer provided by the embodiment of the present application, the spatio-temporal attention extraction module performs a spatio-temporal attention extraction operation to obtain the spatio-temporal calibration feature tensor.
An alternative embodiment of performing the spatio-temporal attention extraction operation to obtain the spatio-temporal calibration feature tensor is described below. Referring to fig. 10, fig. 10 is a schematic flow chart of the spatio-temporal attention extraction operation according to an embodiment of the present application. First, information is extracted channel by channel through global average pooling; the number of channels is then reduced through a fully connected layer and a ReLU nonlinear operation layer and restored through another fully connected layer and ReLU nonlinear operation layer, thereby calibrating the spatial features. To recalibrate the temporal features, the channel axis and the time axis are first swapped to obtain a tensor of shape T×C×V, the same operations as above are applied, and after the recalibration the feature tensor is changed back to its original shape. A Hadamard product mixes the spatial and temporal features; the mixed tensor is reshaped to V×T×C, a 1×1 convolution is applied to extract the spatio-temporal attention tensor, and the original input tensor is multiplied by the spatio-temporal attention tensor to obtain the spatio-temporal calibration feature tensor.
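The flow just described can be sketched in NumPy as below. This is a simplified sketch under assumptions: the weight shapes, the helper `channel_recalibration`, and the use of ReLU after both fully connected layers follow the text above, while reduction ratios and other details of the embodiment may differ.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def channel_recalibration(x, w_reduce, w_restore):
    """Gate the leading axis of a 3-D tensor x: global-average-pool each
    slice, shrink with w_reduce + ReLU, restore with w_restore + ReLU."""
    squeezed = x.mean(axis=(1, 2))                       # one value per slice
    gate = relu(w_restore @ relu(w_reduce @ squeezed))
    return x * gate[:, None, None]                       # broadcast the gate

def spatio_temporal_attention(x, ws_r, ws_u, wt_r, wt_u, w_point):
    """Sketch of fig. 10 for an input x of shape (C, T, V)."""
    spatial = channel_recalibration(x, ws_r, ws_u)       # calibrate channels
    temporal = channel_recalibration(                    # swap to (T, C, V),
        x.transpose(1, 0, 2), wt_r, wt_u).transpose(1, 0, 2)  # then swap back
    mixed = spatial * temporal                           # Hadamard product
    # reshape to (V, T, C) and apply a 1x1 convolution over the C axis
    attention = np.einsum('vtc,cd->vtd', mixed.transpose(2, 1, 0), w_point)
    return x * attention.transpose(2, 1, 0)              # calibrated features
```

The output keeps the input shape C×T×V, so the temporal convolution module can consume it directly.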
Optionally, the temporal calibration may recalibrate the features obtained after the temporal convolution operation of the previous convolution layer; it may also increase the weight of certain important frames within the current convolution layer before its temporal convolution operation, which facilitates high-quality feature extraction.
In the embodiment of the present application, all experiments in the training process of the action recognition model may be performed on the PyTorch deep learning framework. Stochastic gradient descent (SGD) with Nesterov momentum is used for optimization, with the learning rate, momentum, and weight decay set to 0.1, 0.9, and 0.0001, respectively. Dropout with a probability of 0.2 is used to mitigate overfitting during training.
All elements of the learnable adjacency matrices $\hat{A}_i$ are initialized to 1. Cross entropy is chosen as the loss function for the back-propagated gradient.
The methods provided by embodiments of the present application are compared to several other skeleton-based motion recognition methods based on the NTU-rgb+d dataset and the HDM05 dataset, respectively.
For the NTU-RGB+D dataset, each sample contains at most two people, and the maximum number of frames per sample is 300. Samples with fewer than 300 frames are repeated until they reach 300 frames. The batch size was set to 32. The learning rate was set to 0.1 and divided by 10 at the 20th and 40th epochs; training ends at the 60th epoch. The method proposed in the present application is trained on two common benchmarks: Cross-Subject and Cross-View. During the test phase, the top-1 classification accuracy was calculated for 16 other methods and for the method of the present application (PGCN-TCA). Table 1 shows the comparison: on the Cross-Subject benchmark, PGCN-TCA achieves a top-1 accuracy of 88.0%, second only to 2s-AGCN but superior to most existing methods; on the Cross-View benchmark it achieves 93.6%, again second only to 2s-AGCN but superior to most existing methods.
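The stepped learning-rate schedule described above can be written as a small helper (the function name and signature are illustrative assumptions; only the base rate 0.1 and the drop epochs come from the embodiment):

```python
def learning_rate(epoch, base_lr=0.1, drop_epochs=(20, 40)):
    """Divide the base learning rate by 10 at each listed epoch.

    NTU-RGB+D uses drops at epochs 20 and 40; for HDM05 the single drop
    at epoch 100 described later corresponds to drop_epochs=(100,).
    """
    lr = base_lr
    for drop in drop_epochs:
        if epoch >= drop:
            lr /= 10.0
    return lr
```

So training starts at 0.1, runs at 0.01 from the 20th epoch, and at 0.001 from the 40th epoch until it ends at the 60th.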
Table 1: comparison of accuracy of top-1 classification of Cross-Subject and Cross-View on NTU-RGB+D dataset
NO. Methods Cross-Subject(%) Cross-View(%)
1 Lie Group 50.1 52.8
2 H-RNN 59.1 64.0
3 Deep LSTM 60.7 67.3
4 ST-LSTM+TS 69.2 77.7
5 Temporal Conv 74.3 83.1
6 Visualize CNN 76.0 82.6
7 Visualize CNN 79.6 84.8
8 ST-GCN 81.5 88.3
9 MANs 82.7 93.2
10 DPRL 83.5 89.8
11 SR-TSL 84.8 92.4
12 HCN 86.5 91.1
13 PB-GCN 87.5 93.2
14 RA-GCN 85.9 93.5
15 AS-GCN 86.8 94.2
16 2s-AGCN 88.5 95.1
17 PGCN-TCA 88.0 93.6
For the HDM05 dataset, the maximum number of frames per sample is 901. Samples with fewer than 901 frames are repeated until they reach 901 frames. The batch size was set to 16. The learning rate was likewise set to 0.1 and divided by 10 at the 100th epoch; training ends at the 120th epoch. Ten evaluations were performed with random splits, each randomly selecting half of the sequences in the dataset for training and using the remaining sequences for testing. In each evaluation, the top-1 classification accuracy was calculated for 7 other methods and for the method provided in the embodiment of the present application (PGCN-TCA). Table 2 shows the comparison: the method provided in the embodiment of the present application is second only to PB-GCN and superior to most existing methods.
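The frame-repetition padding used for both datasets can be sketched as follows (the function name and array layout are assumptions; the scheme itself — looping short samples to 300 or 901 frames — is stated above):

```python
import numpy as np

def pad_by_repetition(sample, target_frames):
    """Repeat a (T, ...) skeleton sequence along the time axis until it
    covers target_frames, then truncate to exactly target_frames."""
    t = sample.shape[0]
    repeats = -(-target_frames // t)   # ceiling division
    tiled = np.tile(sample, (repeats,) + (1,) * (sample.ndim - 1))
    return tiled[:target_frames]
```

For example, a 4-frame sequence padded to 10 frames cycles its frames as 0,1,2,3,0,1,2,3,0,1.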
Table 2: top-1 classification accuracy comparison on HDM05 dataset
According to the above experimental results, the image recognition method provided by the embodiment of the present application achieves high accuracy in classifying actions determined from a human skeleton image sequence.
The embodiment of the application also provides an image recognition device, and fig. 11 is a schematic structural diagram of the image recognition device provided in the embodiment of the application, as shown in fig. 11, the device includes:
a first acquisition module 1101, configured to acquire a human skeleton image sequence, where the human skeleton image sequence comprises consecutive multiple frames of human skeleton images and the skeleton joint points are consistent across the frames of human skeleton images;
a first determining module 1102, configured to determine a corresponding relative coordinate set for the skeleton joint point set of each frame of human skeleton image, where the relative coordinates in the relative coordinate set correspond one-to-one to the skeleton joint points in the skeleton joint point set;
a second determining module 1103, configured to determine a relative coordinate tensor based on the relative coordinate set, the number of skeleton nodes, and the number of frames of the images in the human skeleton image sequence;
a third determining module 1104, configured to determine a plurality of inter-frame differential value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;
a fourth determining module 1105, configured to determine a time difference tensor based on a plurality of sets of inter-frame difference values, a number of skeleton nodes, and a number of frames of images in the human skeleton image sequence;
a fifth determination module 1106 for determining an input tensor based on the relative coordinate tensor and the time difference tensor;
the motion recognition module 1107 is configured to perform motion recognition on the input tensor based on the trained motion recognition model, so as to obtain a motion category corresponding to the human skeleton image sequence.
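The work of the determining modules above — relative coordinates, inter-frame differences, and the combined input tensor — can be sketched as follows (a sketch under assumptions: the root joint index, array layout, and function name are illustrative, not fixed by the embodiment):

```python
import numpy as np

def build_input_tensor(joints, root_index=0):
    """Build the 6-channel input tensor from absolute 3-D joint positions.

    joints: (T, V, 3) array — T frames, V skeleton joint points.
    Channels 0-2: coordinates relative to an assumed root joint
                  (the relative coordinate tensor).
    Channels 3-5: inter-frame differences of those relative coordinates,
                  zero for the last frame (the time difference tensor).
    Returns the (6, T, V) input tensor.
    """
    relative = joints - joints[:, root_index:root_index + 1, :]   # (T, V, 3)
    motion = np.zeros_like(relative)
    motion[:-1] = relative[1:] - relative[:-1]                    # frame deltas
    stacked = np.concatenate([relative, motion], axis=-1)         # (T, V, 6)
    return stacked.transpose(2, 0, 1)                             # (6, T, V)
```

The 6-channel result matches the input-channel count of the first convolution layer's pseudo-graph convolution module described earlier.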
The apparatus and method embodiments in the embodiments of the present application are based on the same application concept.
An embodiment of the present application provides an electronic device comprising a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored in the memory and is loaded and executed by the processor to implement the image recognition method described above.
Embodiments of the present application also provide a storage medium that may be disposed in a server to store at least one instruction, at least one program, a code set, or an instruction set related to implementing the image recognition method of the method embodiments, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the image recognition method described above.
Alternatively, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Alternatively, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
As can be seen from the above embodiments of the image recognition method, apparatus, electronic device, and storage medium provided in the present application: a human skeleton image sequence is acquired, comprising consecutive multiple frames of human skeleton images with consistent skeleton joint points; a corresponding relative coordinate set is determined for the skeleton joint point set of each frame, the relative coordinates corresponding one-to-one to the skeleton joint points; a relative coordinate tensor is determined based on the relative coordinate sets, the number of skeleton joint points, and the number of frames in the sequence; a plurality of inter-frame difference value sets are determined from the plurality of relative coordinate sets; a time difference tensor is determined based on the inter-frame difference value sets, the number of skeleton joint points, and the number of frames; an input tensor is determined based on the relative coordinate tensor and the time difference tensor; and action recognition is performed on the input tensor by the trained action recognition model to obtain the action category corresponding to the human skeleton image sequence. By constructing the input tensor of the graph-convolutional-network-based action recognition model from the human skeleton joint point information and performing action recognition in this way, the present application can improve the accuracy of human action recognition.
It should be noted that the ordering of the embodiments of the present application is for description only and does not indicate relative merit. Specific embodiments of this specification have been described above; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible and may be advantageous.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to mutually, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (10)

1. An image recognition method, comprising:
acquiring a human skeleton image sequence; the human skeleton image sequence comprises continuous multi-frame human skeleton images; skeleton joint points of each frame of human skeleton image are consistent;
determining a corresponding relative coordinate set in a skeleton node set of each frame of human skeleton image; the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton joint points in the skeleton joint point set;
determining a relative coordinate tensor based on the relative coordinate set, the number of skeleton joints, and the number of frames of images in the human skeleton image sequence;
determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;
determining a time difference tensor based on the plurality of sets of inter-frame difference values, the number of skeleton nodes, and the number of frames of images in the sequence of human skeleton images;
determining an input tensor based on the relative coordinate tensor and the time difference tensor;
Performing action recognition on the input tensor based on the trained action recognition model to obtain an action category corresponding to the human skeleton image sequence;
the motion recognition model comprises a plurality of convolution layers for extracting feature tensors, wherein each convolution layer comprises a pseudo-graph convolution module, a spatio-temporal attention extraction module, and a temporal convolution module; the pseudo-graph convolution module is used for obtaining a trained adjacency matrix and performing a pseudo-graph convolution operation based on the product of the tensor output by a previous convolution layer and the adjacency matrix to obtain a spatial feature tensor; the adjacency matrix is a learnable adjacency matrix obtained by training to learn the physical direct connection relationships and the non-physical direct connection relationships between the joint points; the spatio-temporal attention extraction module is used for performing a spatio-temporal attention extraction operation based on the spatial feature tensor to obtain a spatio-temporal calibration feature tensor; the spatio-temporal calibration feature tensor comprises a plurality of feature planes of different weights; and the temporal convolution module is used for performing a temporal convolution operation on the spatio-temporal calibration feature tensor to obtain the output tensor of the convolution layer.
2. The method of claim 1, wherein determining a corresponding set of relative coordinates in a set of skeletal nodes of each frame of human skeletal image comprises:
Determining a coordinate information set of a skeleton node set of each frame of human skeleton image in the human skeleton image sequence;
determining a root node from the skeleton node set;
and determining the relative coordinates of each skeleton joint point in the skeleton joint point set in each frame of human skeleton image based on the root node in the skeleton joint point set to obtain the relative coordinate set.
3. The method of claim 1, wherein the action recognition model comprises:
input layer, 1 batch normalization layer, 10 convolution layers, 1 global average pooling layer, 1 fully connected layer, and output layer.
4. A method according to claim 3, wherein said performing motion recognition on said input tensor based on a trained motion recognition model comprises:
acquiring a trained adjacency matrix;
performing a pseudo-graph convolution operation based on the product of the input tensor and the adjacency matrix, and outputting a spatial feature tensor;
extracting information from the spatial feature tensor C multiplied by T multiplied by V channel by channel through global averaging pooling, reducing the number of channels through a full connection layer and a ReLU nonlinear operation layer, and recovering the number of channels through a full connection layer and the ReLU nonlinear operation layer so as to calibrate the spatial features;
Exchanging channel axes and time axes of the spatial feature tensors to obtain tensors T multiplied by C multiplied by V, extracting information from the tensors T multiplied by C multiplied by V channel by channel through global average pooling, reducing the number of channels through a full connection layer and a ReLU nonlinear operation layer, and recovering the number of channels through the full connection layer and the ReLU nonlinear operation layer so as to calibrate the time features; changing the tensor t×c×v back to the original shape c×t×v;
mixing spatial features and time features by Hadamard products, changing the mixed tensor into V x T x C, adopting 1 x 1 convolution to extract a space-time attention tensor, and multiplying the spatial feature tensor C x T x V of the original input by the space-time attention tensor to obtain a space-time calibration feature tensor;
performing time convolution operation on the space-time calibration characteristic tensor to obtain an output tensor;
wherein the trained adjacency matrix corresponding to each of the 10 convolutional layers is different from each other.
5. An image recognition apparatus, comprising:
the first acquisition module is used for acquiring a human skeleton image sequence; the human skeleton image sequence comprises continuous multi-frame human skeleton images; skeleton joint points of each frame of human skeleton image are consistent;
The first determining module is used for determining a corresponding relative coordinate set in a skeleton node set of each frame of human skeleton image; the relative coordinates in the relative coordinate set are in one-to-one correspondence with the skeleton joint points in the skeleton joint point set;
the second determining module is used for determining a relative coordinate tensor based on the relative coordinate set, the number of the skeleton joints and the number of frames of images in the human skeleton image sequence;
the third determining module is used for determining a plurality of inter-frame difference value sets according to a plurality of relative coordinate sets corresponding to the human skeleton image sequence;
a fourth determining module, configured to determine a time difference tensor based on the plurality of sets of inter-frame difference values, the number of skeleton nodes, and the number of frames of images in the human skeleton image sequence;
a fifth determining module for determining an input tensor based on the relative coordinate tensor and the time difference tensor;
the motion recognition module is used for performing motion recognition on the input tensor based on the trained motion recognition model to obtain a motion category corresponding to the human skeleton image sequence;
the motion recognition model comprises a plurality of convolution layers for extracting feature tensors, wherein each convolution layer comprises a pseudo-graph convolution module, a spatio-temporal attention extraction module, and a temporal convolution module; the pseudo-graph convolution module is used for obtaining a trained adjacency matrix and performing a pseudo-graph convolution operation based on the product of the tensor output by a previous convolution layer and the adjacency matrix to obtain a spatial feature tensor; the adjacency matrix is a learnable adjacency matrix obtained by training to learn the physical direct connection relationships and the non-physical direct connection relationships between the joint points; the spatio-temporal attention extraction module is used for performing a spatio-temporal attention extraction operation based on the spatial feature tensor to obtain a spatio-temporal calibration feature tensor; the spatio-temporal calibration feature tensor comprises a plurality of feature planes of different weights; and the temporal convolution module is used for performing a temporal convolution operation on the spatio-temporal calibration feature tensor to obtain the output tensor of the convolution layer.
6. The apparatus of claim 5, wherein:
the first determining module is further configured to determine a coordinate information set of a skeleton node set of each frame of human skeleton image in the human skeleton image sequence; determining a root node from the skeleton node set; and determining the relative coordinates of each skeleton joint point in the skeleton joint point set in each frame of human skeleton image based on the root node in the skeleton joint point set to obtain the relative coordinate set.
7. The apparatus of claim 5, wherein the action recognition model comprises:
input layer, 1 batch normalization layer, 10 convolution layers, 1 global average pooling layer, 1 fully connected layer, and output layer.
8. The apparatus of claim 7, wherein:
the action recognition module is further configured to acquire a trained adjacency matrix; perform a pseudo-graph convolution operation based on the product of the input tensor and the adjacency matrix, and output a spatial feature tensor; extract information from the spatial feature tensor C×T×V channel by channel through global average pooling, reduce the number of channels through a fully connected layer and a ReLU nonlinear operation layer, and restore the number of channels through a fully connected layer and a ReLU nonlinear operation layer so as to calibrate the spatial features; exchange the channel axis and the time axis of the spatial feature tensor to obtain a tensor T×C×V, extract information from the tensor T×C×V channel by channel through global average pooling, reduce the number of channels through a fully connected layer and a ReLU nonlinear operation layer, and restore the number of channels through a fully connected layer and a ReLU nonlinear operation layer so as to calibrate the temporal features; change the tensor T×C×V back to the original shape C×T×V; mix the spatial features and the temporal features by a Hadamard product, change the mixed tensor into V×T×C, adopt a 1×1 convolution to extract a spatio-temporal attention tensor, and multiply the originally input spatial feature tensor C×T×V by the spatio-temporal attention tensor to obtain a spatio-temporal calibration feature tensor; and perform a temporal convolution operation on the spatio-temporal calibration feature tensor to obtain an output tensor; wherein the trained adjacency matrices corresponding to the 10 convolution layers are different from each other.
9. An electronic device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the image recognition method of any one of claims 1-4.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the image recognition method of any one of claims 1 to 4.
CN201911139594.3A 2019-11-20 2019-11-20 Image recognition method and device, electronic equipment and storage medium Active CN110929637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911139594.3A CN110929637B (en) 2019-11-20 2019-11-20 Image recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911139594.3A CN110929637B (en) 2019-11-20 2019-11-20 Image recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110929637A CN110929637A (en) 2020-03-27
CN110929637B true CN110929637B (en) 2023-05-16

Family

ID=69850365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911139594.3A Active CN110929637B (en) 2019-11-20 2019-11-20 Image recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110929637B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539290B (en) * 2020-04-16 2023-10-20 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN111695523B (en) * 2020-06-15 2023-09-26 浙江理工大学 Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN112380955B (en) * 2020-11-10 2023-06-16 浙江大华技术股份有限公司 Action recognition method and device
CN112598021A (en) * 2020-11-27 2021-04-02 西北工业大学 Graph structure searching method based on automatic machine learning
CN113572981B (en) * 2021-01-19 2022-07-19 腾讯科技(深圳)有限公司 Video dubbing method and device, electronic equipment and storage medium
CN113128425A (en) * 2021-04-23 2021-07-16 上海对外经贸大学 Semantic self-adaptive graph network method for human action recognition based on skeleton sequence
CN113552855B (en) * 2021-07-23 2023-06-06 重庆英科铸数网络科技有限公司 Industrial equipment dynamic threshold setting method and device, electronic equipment and storage medium
CN113780075B (en) * 2021-08-05 2024-04-23 深兰科技(上海)有限公司 Skeleton action diagram generation method, skeleton action diagram generation device, computer equipment and medium
CN114581843B (en) * 2022-02-22 2024-04-26 华南理工大学 Escalator passenger dangerous behavior identification method based on deep learning

Citations (8)

Publication number Priority date Publication date Assignee Title
CN109086659A (en) * 2018-06-13 2018-12-25 深圳市感动智能科技有限公司 Human behavior recognition method and apparatus based on multi-modal channel feature fusion
CN109902583A (en) * 2019-01-28 2019-06-18 电子科技大学 Skeleton action recognition method based on a bidirectional independent recurrent neural network
CN110110624A (en) * 2019-04-24 2019-08-09 江南大学 Human behavior recognition method based on a DenseNet network with frame-difference feature input
CN110110613A (en) * 2019-04-19 2019-08-09 北京航空航天大学 Rail transit abnormal personnel detection method based on action recognition
CN110119703A (en) * 2019-05-07 2019-08-13 福州大学 Human action recognition method fusing an attention mechanism and spatio-temporal graph convolutional neural networks for security surveillance scenes
CN110222611A (en) * 2019-05-27 2019-09-10 中国科学院自动化研究所 Human skeleton action recognition method, system, and device based on a graph convolutional network
CN110348321A (en) * 2019-06-18 2019-10-18 杭州电子科技大学 Human action recognition method based on skeleton spatio-temporal features and a long short-term memory network
CN110427834A (en) * 2019-07-10 2019-11-08 上海工程技术大学 Action recognition system and method based on skeleton data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8929600B2 (en) * 2012-12-19 2015-01-06 Microsoft Corporation Action recognition based on depth maps
US9489570B2 (en) * 2013-12-31 2016-11-08 Konica Minolta Laboratory U.S.A., Inc. Method and system for emotion and behavior recognition
US10679044B2 (en) * 2018-03-23 2020-06-09 Microsoft Technology Licensing, Llc Human action data set generation in a machine learning system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Sijie Yan, et al. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 2-3, 5. *
吴婷婷 (Wu Tingting). Research on Abnormal Target Behavior Detection in Video under Spatio-Temporal Relationships. Wanfang Database, 2019, full text. *
龚玉婷 (Gong Yuting). Research on Group Behavior Recognition Methods Based on Attention Mechanisms and Deep Learning Networks. Wanfang Database, 2019, full text. *

Also Published As

Publication number Publication date
CN110929637A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN110929637B (en) Image recognition method and device, electronic equipment and storage medium
KR102302725B1 (en) Room Layout Estimation Methods and Techniques
CN111667399B (en) Training method of a style transfer model, and video style transfer method and device
CN110033003B (en) Image segmentation method and image processing device
CN109584337B (en) Image generation method for generating countermeasure network based on condition capsule
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
Pathak et al. Context encoders: Feature learning by inpainting
CN113196289B (en) Human body action recognition method, human body action recognition system and equipment
CN112308200B (en) Searching method and device for neural network
CN108108751B (en) Scene recognition method based on convolution multi-feature and deep random forest
CN113486708B (en) Human body posture estimation method, model training method, electronic device and storage medium
CN111914997B (en) Method for training neural network, image processing method and device
CN111402130A (en) Data processing method and data processing device
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
JP2007128195A (en) Image processing system
CN109919085B (en) Human-human interaction behavior identification method based on light-weight convolutional neural network
CN111881804B (en) Posture estimation model training method, system, medium and terminal based on joint training
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN111161314B (en) Target object position area determination method and device, electronic equipment and storage medium
CN112288011A (en) Image matching method based on self-attention deep neural network
CN111462274A (en) Human body image synthesis method and system based on the SMPL model
CN117541632A (en) Multi-mode image registration method based on feature enhancement and multi-scale correlation
CN115346091B (en) Method and device for generating Mura defect image data set
CN116665300A (en) Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network
CN112183315A (en) Motion recognition model training method and motion recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant