CN117292435A - Action recognition method and device and computer equipment - Google Patents

Action recognition method and device and computer equipment

Info

Publication number
CN117292435A
Authority
CN
China
Prior art keywords
action
identified
category
motion
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311267919.2A
Other languages
Chinese (zh)
Inventor
李雪
薄拾
刘博
赵瑞刚
王凯
曹盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Tianhe Defense Technology Co ltd
Original Assignee
Xi'an Tianhe Defense Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Tianhe Defense Technology Co ltd filed Critical Xi'an Tianhe Defense Technology Co ltd
Priority to CN202311267919.2A priority Critical patent/CN117292435A/en
Publication of CN117292435A publication Critical patent/CN117292435A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of action recognition and provides an action recognition method comprising the following steps: acquiring video data and inertial data of the same action to be identified; identifying, with a first recognition model, a first action category corresponding to the video data of the action to be identified; identifying, with a second recognition model, a second action category corresponding to the inertial data of the action to be identified; and determining a target action category of the action to be identified according to the consistency of the first action category and the second action category. The method adopts a multi-modal recognition approach that combines video data and inertial data, acquiring information from the two different perspectives of vision and motion sensing to perform action recognition and determining the final recognition result from the comparison of the two. This reduces the misrecognition that a single modality may cause, improves the accuracy and reliability of action recognition, and provides a more accurate and reliable recognition result.

Description

Action recognition method and device and computer equipment
Technical Field
The application belongs to the technical field of action recognition, and particularly relates to an action recognition method, an action recognition device, and computer equipment.
Background
With the development of artificial intelligence and computer vision, human action recognition has become a research direction of great interest and is widely applied in human-computer interaction, motion analysis, medical assistance, virtual reality, and other fields.
At present, action recognition methods based on computer vision are relatively mature in application and rely mainly on visual information. However, such recognition methods place high demands on environmental conditions and are easily affected by factors such as illumination, occlusion, and viewing angle, which reduces recognition accuracy.
Therefore, how to improve the accuracy of action recognition is a problem that needs to be solved.
Disclosure of Invention
The present application aims to provide an action recognition method, an action recognition device, and computer equipment that can improve the accuracy of action recognition.
In a first aspect, an embodiment of the present application provides an action recognition method, where the method includes: acquiring video data and inertial data of the same action to be identified, wherein the video data of the action to be identified comprises image frames recording the state of the action to be identified at each moment, and the inertial data of the action to be identified represents the motion parameters of the action to be identified during the action; identifying a first action category corresponding to the video data of the action to be identified by using a first recognition model; identifying a second action category corresponding to the inertial data of the action to be identified by using a second recognition model; and determining a target action category of the action to be identified according to the consistency of the first action category and the second action category.
In a possible implementation manner of the first aspect, the determining the target action category of the action to be identified according to the consistency of the first action category and the second action category includes: when the first action category is consistent with the second action category, determining the first action category or the second action category as a target action category of the action to be identified; and when the first action category is inconsistent with the second action category, re-identifying the action category of the action to be identified.
In a possible implementation manner of the first aspect, the method further includes: and when the first action category is inconsistent with the second action category, updating the first recognition model and/or the second recognition model according to the manual calibration action category of the action to be recognized.
In a possible implementation manner of the first aspect, the first recognition model includes a target detection network, a pose estimation network, and an action recognition network, and recognizing, by using the first recognition model, the first action category corresponding to the video data of the action to be identified includes: locating the human body position in each frame of image in the video data of the action to be identified by using the target detection network; locating human skeleton point coordinates based on the human body position by using the pose estimation network, and generating a skeleton sequence; and predicting the action category corresponding to the skeleton sequence by using the action recognition network, and determining the action category corresponding to the skeleton sequence as the first action category.
In a possible implementation manner of the first aspect, the object detection network is a YOLO Nano network, the pose estimation network is an SNHRNet network, and the motion recognition network is an ST-GCN network.
In a possible implementation manner of the first aspect, the SNHRNet network is obtained by fusing an attention mechanism module into the NHRNet network and training the resulting network.
In a possible implementation manner of the first aspect, the second recognition model includes a feature extraction network and a classifier, and identifying, by using the second recognition model, the second action category corresponding to the inertial data of the action to be identified includes: performing feature extraction and feature selection on the inertial data of the action to be identified by using the feature extraction network to obtain an action feature vector corresponding to the inertial data of the action to be identified; and predicting the action category corresponding to the action feature vector by using the classifier according to a pre-learned mapping relation, and determining the action category corresponding to the action feature vector as the second action category.
In a possible implementation manner of the first aspect, when the inertial data of the action to be identified includes initial inertial data acquired by using a plurality of sensors, then feature extraction and feature selection are performed on the inertial data of the action to be identified by using a feature extraction network, so as to obtain an action feature vector, where the steps include: respectively carrying out feature extraction and feature selection on the initial inertial data to obtain a plurality of groups of action features, wherein each group of action features corresponds to one sensor; and carrying out fusion processing on the multiple groups of action characteristics to obtain action characteristic vectors corresponding to the inertial data of the action to be identified.
In a second aspect, an embodiment of the present application provides an action recognition apparatus, including: the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring video data and inertial data of the same action to be recognized, the video data of the action to be recognized comprise image frames for recording the states of the action to be recognized at all moments, and the inertial data of the action to be recognized are used for representing the motion parameters of the action to be recognized during the action; the recognition module is used for recognizing a first action category corresponding to the video data of the action to be recognized by using the first recognition model and recognizing a second action category corresponding to the inertia data of the action to be recognized by using the second recognition model; and the analysis module is used for determining a target action category of the action to be identified according to the consistency of the first action category and the second action category.
In a third aspect, embodiments of the present application provide a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, which when executed by the processor causes the computer device to implement any one of the implementations of the first and second aspects described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, which when executed by a computer device implements any one of the implementations of the first and second aspects described above.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a computer device, causes the computer device to perform the implementation of any of the first aspects.
Compared with the prior art, the embodiments of the application have the following beneficial effects: a multi-modal action recognition approach is adopted, in which video data and inertial data are combined so that information is acquired from the two different perspectives of vision and motion sensing for action recognition, and the final recognition result is determined according to the comparison of the two results; this reduces the misrecognition that a single modality may cause, improves the accuracy and reliability of action recognition, and provides a more accurate and reliable recognition result.
Drawings
Fig. 1 is a flowchart of a motion recognition method according to an embodiment of the present application.
Fig. 2 is a flowchart of a motion recognition method according to another embodiment of the present application.
Fig. 3 is a flowchart of an action recognition method according to another embodiment of the present application.
Fig. 4 is a flowchart of a motion recognition method according to another embodiment of the present application.
Fig. 5 is a schematic structural diagram of an action recognition device according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Fig. 1 shows a schematic flow chart of an action recognition method according to an embodiment of the present application.
S101, acquiring video data and inertial data of the same action to be identified.
The action to be identified refers to an action which needs to be identified in executing an action identification task, that is, an action which needs to be classified or identified by the action identification method of the application.
By way of example and not limitation, the action to be identified may be standing, walking, running, turning around, waving, or another human action.
Video data, i.e. a video sequence, refers to a series of time-ordered video frames/image frames that record the state of the action to be identified at each moment. When the image frames are arranged in time order, the execution process of the action to be identified can be captured and presented. For example, the video data may include publicly available web video, project-accumulated data, laboratory-recorded data, or the like.
Inertial data refers to information about the motion state of the human body acquired with sensors, and is used to represent the motion parameters of the action to be identified during the action. In human motion analysis, inertial data can provide important information such as the motion trajectory, posture changes, and acceleration changes of the human body, which can be applied, together with video data, to action recognition scenarios.
It should be understood that the video data and the inertial data are two types of data for identifying and analyzing the same action to be identified: the former is recorded by shooting equipment such as a camera while the human body performs the action to be identified, and captures visual characteristics; the latter is acquired by inertial sensors while the human body performs the action to be identified, and captures motion characteristics.
When a certain action to be identified is recognized with this scheme, video data and inertial data of the action to be identified are acquired simultaneously. It should be understood that "simultaneously" here does not refer to the same instant in time, but rather means that both the video data and the inertial data are required jointly for the recognition process and its conclusion.
In one implementation, inertial data for the motion to be identified may be acquired by inertial sensors deployed on various joints of the human body. By way of example and not limitation, inertial sensors may include accelerometer, gyroscope, magnetometer, etc. sensors.
S102, recognizing a first action category corresponding to the video data of the action to be recognized by using a first recognition model.
It should be appreciated that the first recognition model may be a deep learning model (e.g., a recurrent or convolutional neural network) or other machine learning model.
It should also be appreciated that the first recognition model is not limited to one model, but may be a combination or set of multiple models. By way of example and not limitation, the first recognition model may be formed by sequential concatenation of multiple models using a "model cascade" approach. In a first recognition model with multiple layers, each layer accepts the output of the previous layer as input, and each layer can use different feature extraction methods, model structures or learning algorithms, so that the overall model can learn and extract abstract features of data through multiple layers.
After the video data of the action to be identified is obtained, the video data is input into the first recognition model. The first recognition model processes the input video data, extracts and analyzes the visual features therein, performs action recognition, and outputs a prediction of the action category performed in the video data (i.e., of the action to be identified), namely the first action category (or first action result). That is, the output of the first recognition model is a prediction of the action category performed in the input video data, which may be one of a set of predefined action categories, such as "walk", "raise hand", etc.
S103, identifying a second action category corresponding to the inertial data of the action to be identified by using a second identification model.
After the inertial data of the action to be identified is acquired, the inertial data is input into the second recognition model. The second recognition model processes the input inertial data, extracts and analyzes its motion characteristics, performs action recognition, and outputs a prediction of the action category corresponding to the inertial data (i.e., of the action to be identified), namely the second action category (or second action result). That is, the output of the second recognition model is a prediction of the action category corresponding to the input inertial data, which may be one of the predefined action categories, such as "walk", "raise hand", etc.
S104, determining a target action category of the action to be identified according to the consistency of the first action category and the second action category.
After the visual features and the motion features are extracted from the video data and the inertial data of the same action to be identified, respectively, these features of two different dimensions and viewpoints are recognized by two different recognition models, yielding two results, namely the first action category and the second action category.
By comparing the consistency of the first and second action categories, it is determined to which action category the action to be identified belongs (i.e. the target action category).
According to the method, a multi-modal action recognition approach is adopted: by combining video data and inertial data, information is acquired from the two different perspectives of vision and motion sensing to perform action recognition, and the final recognition result is determined according to the comparison of the two results. This reduces the limitations of a single modality and the misrecognition it may cause, improves the accuracy and reliability of action recognition, and provides a more accurate and reliable recognition result.
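By way of example and not limitation, the consistency-based decision of steps S102 to S104 can be sketched in a minimal Python illustration as follows. The predict interfaces of the two recognition models and the manual_calibration callback are hypothetical placeholders and are not part of the claimed method.

    def recognize_action(video_data, inertial_data, video_model, imu_model,
                         manual_calibration=None):
        """Multi-modal recognition: accept the result only when both branches agree."""
        first_category = video_model.predict(video_data)     # S102: visual branch
        second_category = imu_model.predict(inertial_data)   # S103: inertial branch

        if first_category == second_category:                # S104: consistency check
            return first_category

        # Inconsistent results: hand over to re-recognition or manual correction.
        if manual_calibration is not None:
            return manual_calibration(video_data, inertial_data)
        return None  # caller decides how to re-identify the action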
It should be noted that existing action recognition schemes include single-modality video recognition and single-modality inertial-sensor recognition, each of which has drawbacks:
For complex or non-standard actions, the accuracy of a video-only recognition technique may fall short of the user's expectations, and it may be affected by factors such as illumination changes, viewing-angle changes, and occlusion, which reduce accuracy.
Inertial sensors are highly robust to illumination and viewing-angle changes and are suitable for action recognition in complex environments. However, a single inertial-sensor recognition technique may not provide enough information to accurately recognize certain complex actions, particularly those involving multiple joints and motion patterns. In addition, its accuracy may also be affected by noise, drift, and error accumulation.
Based on the characteristics and shortcomings of these single-modality techniques, the application creatively combines the two recognition techniques, comprehensively utilizes the information provided by different sensors, and realizes multi-modal data fusion, thereby overcoming the limitations of a single technique. Meanwhile, in practical application scenarios, factors such as illumination changes, viewing-angle changes, and occlusion can cause action recognition errors, so a correction mechanism is introduced to correct these errors and obtain a more accurate action recognition result.
The scheme is widely applied to various fields, such as man-machine interaction, somatosensory games, motion analysis, rehabilitation training and the like. By improving the accuracy and correction capability of motion recognition, the requirements of users for accurate and reliable motion recognition in practical application are met.
In one embodiment, determining a target action category for an action to be identified based on the consistency of the first action category and the second action category may include the following steps.
S105, when the first action category and the second action category are consistent, determining the first action category or the second action category as a target action category of the action to be identified.
When the first action category and the second action category are consistent, i.e., the recognition results obtained by the two recognition approaches operating on different dimensions and different data are the same, the first action category or the second action category is determined as the target action category of the action to be identified. For example, when the first action category obtained in step S102 is "turn right" and the second action category obtained in step S103 is also "turn right", the action to be identified is determined to be "turn right", completing the recognition of the action to be identified.
S106, when the first action category and the second action category are inconsistent, the action category of the action to be identified is identified again.
In one implementation, the video data and the inertial data of the motion to be identified may be re-input into the corresponding identification model for secondary identification.
In one implementation, the process can switch to manual correction: a worker manually judges which action category the action to be identified belongs to, thereby obtaining a manually calibrated action category, which is then determined as the target action category of the action to be identified.
In one example, when the first action category and the second action category do not coincide, a correction reminder may be issued, which is used to prompt the staff for manual correction.
And S107, when the first action category and the second action category are inconsistent, updating the first recognition model and/or the second recognition model according to the manual calibration action category of the action to be recognized.
Following S106, after the manually calibrated action category is obtained through the worker's manual analysis, the first recognition model and/or the second recognition model are corrected and updated according to the manually calibrated action category.
In one implementation, when the manual calibration result is consistent with the first action category, the second recognition model is corrected; for example, the manually calibrated sample may be used as training data to train the second recognition model. Similarly, when the manual calibration result is consistent with the second action category, the first recognition model is corrected; for example, the manually calibrated sample may be used as training data to train the first recognition model. When the manual calibration result is inconsistent with both the first action category and the second action category, both recognition models are corrected; for example, the manually calibrated sample may be used as training data to train the first and second recognition models, respectively.
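By way of example and not limitation, this correction rule can be sketched as follows. The fine_tune method is a hypothetical placeholder for whatever training routine updates a recognition model with manually calibrated samples.

    def update_models(manual_label, first_category, second_category,
                      first_model, second_model, video_sample, imu_sample):
        """Correct whichever recognition model disagrees with the manual calibration."""
        if manual_label != first_category:
            # The visual branch was wrong: add the calibrated sample to its training set.
            first_model.fine_tune([(video_sample, manual_label)])
        if manual_label != second_category:
            # The inertial branch was wrong: add the calibrated sample to its training set.
            second_model.fine_tune([(imu_sample, manual_label)])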
Through this embodiment, manual intervention is integrated into the recognition system, which ensures efficient automatic recognition while retaining the ability to adjudicate when the results are inconsistent, effectively improving the reliability and practicality of the overall system. It also provides a fast and effective feedback mechanism for continuous optimization of the models.
Fig. 2 shows a schematic flow chart of an action recognition method according to another embodiment of the present application.
It should be appreciated that fig. 2 may be regarded as one specific example of steps S101-S107, which may include the following steps.
S201, inputting video data of actions to be identified into a first identification model.
S202, identifying and outputting a result 1 through the first identification model.
Result 1 is one example of a first action category.
S203, inputting inertial data of the action to be identified into the second identification model.
S204, identifying and outputting a result 2 through the second identification model.
Result 2 is one example of a second action category.
S205, when the result 1 and the result 2 are consistent, determining the target action category of the action to be identified as the result 1 or the result 2.
S206, when the result 1 and the result 2 are inconsistent, performing manual correction to obtain a result 3, and determining the target action category of the action to be identified as the result 3.
Result 3 is one example of a manual calibration action category.
S207, correcting the first recognition model and the second recognition model by using the result 3.
This embodiment gives a detailed implementation of the present application. The use of multi-modal data, a correction mechanism, and model correction can effectively improve the recognition capability of the overall recognition system and yield a more accurate recognition result.
Fig. 3 shows a schematic flow chart of an action recognition method according to another embodiment of the present application.
It should be appreciated that fig. 3 may be regarded as an example of step S102 in fig. 1.
In this embodiment, the first recognition model includes an object detection network, a pose estimation network, and a motion recognition network.
Fig. 3 includes the following steps.
S301, positioning the human body position on each frame of image in the video data of the action to be recognized by utilizing a target detection network.
The object detection network is used to identify and locate the position of the human body in the video data. In this process, the object detection network processes each frame of video data to identify the human body contained therein, e.g., a bounding box or polygon may be provided for each detected human body to identify its location.
In one implementation, the object detection network may be a YOLO Nano network.
YOLO Nano (You Only Look Once Nano) is a compact target detection model suitable for mobile devices, built using a human-machine collaborative design strategy that combines principled network design with machine-driven design exploration. The construction method may comprise the following steps: first, a human body detection network is built on the basis of the original YOLO network; second, prior boxes for human body detection are preset and a loss function for human body detection is constructed; finally, human body detection models with different prior boxes are trained, and the model with the highest accuracy is selected through experimental comparison, completing the construction of the human body detection model.
In one implementation, the method may include inputting the video data into the YOLO Nano network, processing it, and outputting video data with human body position identifiers, as follows.
(1) Frame splitting: the input video data is decomposed into a series of image frames.
(2) Target detection: the above target detection model is applied to each frame of image. The model identifies the human bodies in the image and provides a bounding box or polygon for each detected human body to identify its location.
(3) Position identification: the output of the target detection model (i.e., the bounding boxes carrying human body position information) is combined with the original frame to generate an image with human body position markers.
(4) Video reconstruction: all processed image frames are recombined into a video sequence in the original temporal order.
(5) Result output: the final output is a video in which each frame is an image identifying the location of the person.
In one example, the output video maintains the frame rate of the original video, and a bounding box or polygon of the detected body position is drawn on each frame, so the detection results of the model can be clearly displayed throughout the video.
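By way of example and not limitation, the per-frame pipeline of steps (1) to (5) can be sketched with OpenCV as follows. The detector callable stands in for the YOLO Nano human detection model, and its box format is an assumed interface.

    import cv2

    def annotate_video(input_path, output_path, detector):
        """Split a video into frames, draw detected person boxes, and re-assemble it."""
        cap = cv2.VideoCapture(input_path)                        # (1) frame splitting
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(output_path,
                                 cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            boxes = detector(frame)                               # (2) target detection
            for x1, y1, x2, y2 in boxes:                          # (3) position identification
                cv2.rectangle(frame, (int(x1), int(y1)),
                              (int(x2), int(y2)), (0, 255, 0), 2)
            writer.write(frame)                                   # (4) video reconstruction
        cap.release()
        writer.release()                                          # (5) result output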
S302, locating human skeleton point coordinates based on the human body position by using the pose estimation network, and generating a skeleton sequence.
In one implementation, the pose estimation network may be an SNHRNet network.
The application constructs a pose estimation network based on SNHRNet and provides a construction method: first, an NHRNet (Nested Human Keypoint Detection Network) network is built on the basis of the HRNet (High-Resolution Network) network; the new NHRNet structure reduces the model size but removes some of the network's attention mechanisms. Second, the SNHRNet network is constructed on the basis of NHRNet by fusing an SE (Squeeze-and-Excitation) module to compensate for NHRNet's missing attention mechanism. Finally, the SNHRNet network is trained to obtain the SNHRNet-based human body pose estimation network.
HRNet is a deep neural network model focused on processing high resolution images, employing a multi-scale information fusion strategy, allowing the network to maintain high quality features at various resolutions, which makes HRNet advantageous when processing large-size or high resolution images.
NHRNet is an extended version of HRNet, and through a nested network structure, the network can transmit and fuse information on multiple layers and multiple resolutions, so that the processing capacity of the network on multi-scale information is further enhanced.
To address the problem that the NHRNet network lacks some attention mechanisms, the application fuses an SE (attention mechanism) module into the Basic residual module of NHRNet to build a Basic SE module, thereby adding an attention mechanism to the NHRNet network and constructing the SNHRNet network model.
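By way of example and not limitation, a generic Squeeze-and-Excitation block fused into a basic residual block can be sketched in PyTorch as follows. This is one possible form of such a "Basic SE" module under common SE conventions and is not the patented network definition.

    import torch.nn as nn

    class SEBlock(nn.Module):
        """Squeeze-and-Excitation: re-weight channels using globally pooled statistics."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):
            n, c, _, _ = x.shape
            w = x.mean(dim=(2, 3))             # squeeze: global average pooling
            w = self.fc(w).view(n, c, 1, 1)    # excitation: per-channel weights
            return x * w

    class BasicSEBlock(nn.Module):
        """A basic residual block with an SE module appended (one possible 'Basic SE')."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels))
            self.se = SEBlock(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.se(self.body(x)))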
The video data with human body position identifiers obtained in step S301 is input into the SNHRNet network, which locates the position coordinates of the key human skeleton points (for example, 17 skeleton points) in each frame of image; the key skeleton point coordinates of each frame are then arranged in temporal order to form a sequence (i.e., a skeleton sequence), which is output.
It should be understood that a skeleton sequence refers to a sequence of data in which the skeleton structure of a human body is continuously sampled over a period of time to obtain its motion state. The skeleton sequence comprises a series of frames, and each frame records the position coordinate information of key skeleton points of the human body, and can be used for describing and analyzing the information of the gesture, the action, the motion track and the like of the human body in the motion process.
S303, predicting action categories corresponding to the skeleton sequences by utilizing the action recognition network, and determining the action categories corresponding to the skeleton sequences as first action categories.
In one implementation, first, a skeleton space-time diagram is constructed from a skeleton sequence.
The skeleton space-time diagram is a three-dimensional (or four-dimensional) array formed by combining multi-frame data in a skeleton sequence, wherein the three-dimensional array comprises information of time, space coordinates and key skeleton points.
That is, the skeleton space-time diagram can be regarded as an extension of the skeleton sequence, which introduces a time dimension on the basis of the skeleton sequence, so that the space-time variation of the motion can be represented in a three-dimensional coordinate system. Thus, the skeleton space-time diagram may provide more comprehensive, global motion information.
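By way of example and not limitation, assembling a skeleton sequence into a channels-by-time-by-joints array can be sketched as follows; the (C, T, V) axis ordering is an assumption that matches common skeleton-based recognition practice.

    import numpy as np

    def build_skeleton_tensor(skeleton_sequence):
        """Stack per-frame keypoints into a (C, T, V) space-time array.

        skeleton_sequence: list of T frames, each an array of shape (V, C),
        e.g. V = 17 skeleton points with C = 2 (x, y) or 3 (x, y, confidence).
        """
        frames = np.stack(skeleton_sequence, axis=0)   # (T, V, C)
        tensor = frames.transpose(2, 0, 1)             # (C, T, V): channels, time, joints
        return tensor.astype(np.float32)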
In one implementation, the action recognition network may be an ST-GCN network.
ST-GCN (Spatial-Temporal Graph Convolutional Network) is a deep learning network model for action recognition on skeleton sequences. It applies GCN (Graph Convolutional Network) to the skeleton-based action recognition task and implicitly learns the informative features of the skeleton sequence through graph convolution instead of manually extracted features, so the model is simple and performs well.
The application provides a construction method for it: first, a spatial graph convolution is constructed by defining a sampling function and a weight function; second, a new partition strategy is proposed, with which the neighbor set of a joint node can be divided into a fixed number of subsets, each subset being given a numeric label, which simplifies the construction of the weight function; finally, the ST-GCN network is constructed under the new partition strategy and trained to obtain the human action recognition model.
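By way of example and not limitation, the core spatial graph convolution of an ST-GCN layer under a partition strategy with K subsets can be sketched as follows; the adjacency matrices are assumed to be normalized per subset beforehand, and this is a generic sketch rather than the claimed network.

    import torch
    import torch.nn as nn

    class SpatialGraphConv(nn.Module):
        """Spatial graph convolution over V skeleton joints with K partition subsets."""
        def __init__(self, in_channels, out_channels, num_subsets):
            super().__init__()
            self.k = num_subsets
            # A 1x1 convolution produces one group of output channels per subset.
            self.conv = nn.Conv2d(in_channels, out_channels * num_subsets, kernel_size=1)

        def forward(self, x, adjacency):
            # x: (N, C_in, T, V); adjacency: (K, V, V), already normalized per subset.
            n, _, t, v = x.shape
            x = self.conv(x).view(n, self.k, -1, t, v)          # (N, K, C_out, T, V)
            # Aggregate neighbour features subset by subset and sum over subsets.
            return torch.einsum("nkctv,kvw->nctw", x, adjacency)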
The embodiment provides a scheme for performing action recognition according to video data. The target detection network can identify the position of the human body in the video frame, the gesture estimation network can further locate coordinate information of skeleton points of the human body on the position of the human body, and the action recognition network can predict action types of actions to be recognized according to the coordinate information of the skeleton points.
Fig. 4 shows a schematic flow chart of an action recognition method according to another embodiment of the present application.
It should be understood that fig. 4 may be regarded as one example of step S103 in fig. 1.
The second recognition model in this embodiment includes a feature extraction network and a classifier.
Fig. 4 includes the following steps.
S401, preprocessing inertial data.
Inertial data acquired by the inertial sensor during human body movement can comprise information such as acceleration, angular velocity and the like. Before being used for identification, the data is preprocessed, such as denoising, filtering, normalization, outlier rejection and the like, so as to reduce noise and maintain data consistency.
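By way of example and not limitation, this preprocessing can be sketched as follows; the filter order, cut-off frequency, and clipping threshold are illustrative values, not parameters taken from the application.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def preprocess_inertial(samples, fs=100.0, cutoff=20.0, clip_sigma=4.0):
        """Low-pass filter, clip outliers, and z-score normalize raw IMU samples.

        samples: array of shape (T, D), e.g. D = 6 for 3-axis accel + 3-axis gyro.
        """
        b, a = butter(4, cutoff / (fs / 2.0), btype="low")   # 4th-order Butterworth
        filtered = filtfilt(b, a, samples, axis=0)           # zero-phase denoising

        mean = filtered.mean(axis=0)
        std = filtered.std(axis=0) + 1e-8
        clipped = np.clip(filtered, mean - clip_sigma * std, mean + clip_sigma * std)

        return (clipped - mean) / std                        # per-channel normalization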
S402, performing feature extraction and feature selection on the preprocessed inertial data by using a feature extraction network in the second recognition model to obtain an action feature vector.
Action features are extracted from the preprocessed data, and the more critical ones are selected to form the action feature vector, which is used as the input to the classifier.
By way of example and not limitation, the action features may be time domain features (e.g., mean, variance, time domain statistics), frequency domain features (e.g., fourier transformed spectral information), or time-frequency domain features (e.g., wavelet transform coefficients), etc.
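By way of example and not limitation, a simple time-domain and frequency-domain feature extractor can be sketched as follows; the specific feature set is illustrative.

    import numpy as np

    def extract_features(window):
        """Build an action feature vector from one window of preprocessed IMU data.

        window: array of shape (T, D); returns a 1-D feature vector.
        """
        feats = []
        feats.append(window.mean(axis=0))                      # time domain: mean
        feats.append(window.std(axis=0))                       # time domain: standard deviation
        feats.append(window.max(axis=0) - window.min(axis=0))  # time domain: range

        spectrum = np.abs(np.fft.rfft(window, axis=0))         # frequency domain
        feats.append(spectrum.mean(axis=0))                    # average spectral magnitude
        feats.append(spectrum.argmax(axis=0).astype(float))    # dominant frequency bin

        return np.concatenate(feats)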
S403, predicting the action category corresponding to the action feature vector by using a classifier in the second recognition model according to the pre-learned mapping relation, and determining the action category as a second action category.
In one implementation, an appropriate classifier is selected and trained in advance. After the action feature vector is obtained through the feature extraction network, it is fed into the trained classifier, which predicts which action category the action feature vector belongs to; that category is determined as the second action category.
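By way of example and not limitation, learning the mapping relation and predicting the second action category can be sketched with scikit-learn as follows; the choice of an SVM classifier is an assumption, since the application does not fix a particular classifier.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def train_imu_classifier(feature_vectors, labels):
        """Learn the mapping from action feature vectors to action categories."""
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
        clf.fit(feature_vectors, labels)
        return clf

    def predict_second_category(clf, feature_vector):
        """S403: predict the second action category for one feature vector."""
        return clf.predict([feature_vector])[0]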
In one embodiment, when the inertial data of the action to be identified includes initial inertial data acquired with a plurality of sensors, step S402 may include the following steps.
S501, respectively carrying out feature extraction and feature selection on the initial inertial data to obtain a plurality of groups of action features.
In one implementation, multiple sensors may be attached to different joints of the human body, i.e., inertial data is collected in a multi-point wearing manner. The initial inertial data acquired by each sensor is obtained and input into the feature extraction network of the second recognition model for feature extraction and feature selection, yielding multiple groups of action features, where each group of action features corresponds to one sensor.
S502, fusing the multiple groups of action features to obtain action feature vectors corresponding to the inertial data of the action to be identified.
The manner in which the groups of action features are fused may include, but is not limited to, feature concatenation, cascading, weighted combination, and the like.
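By way of example and not limitation, feature-level fusion by concatenation or weighted combination can be sketched as follows; the equal default weights are illustrative.

    import numpy as np

    def fuse_features(per_sensor_features, weights=None, mode="concat"):
        """Fuse per-sensor feature groups into one action feature vector.

        per_sensor_features: list of 1-D arrays, one per inertial sensor.
        """
        if mode == "concat":
            # Feature concatenation: keep every sensor's features side by side.
            return np.concatenate(per_sensor_features)
        # Weighted combination: requires equal-length feature groups per sensor.
        stacked = np.stack(per_sensor_features, axis=0)
        if weights is None:
            weights = np.full(len(per_sensor_features), 1.0 / len(per_sensor_features))
        return np.average(stacked, axis=0, weights=weights)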
According to this embodiment, feature-level fusion is performed: the action features corresponding to inertial data from different sensors are integrated together, yielding a more comprehensive and richer feature representation, which improves the model's ability to resolve complex tasks and the accuracy of the recognition results.
In one embodiment, a decision-level fusion may also be performed, i.e., the output results of multiple classifiers are fused, such as by voting or weighted averaging, or by some logic rules.
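By way of example and not limitation, decision-level fusion by majority voting or by weighted averaging of class probabilities can be sketched as follows; the classifier output formats are assumed interfaces.

    import numpy as np
    from collections import Counter

    def fuse_decisions_by_vote(predicted_labels):
        """Majority vote over the categories predicted by several classifiers."""
        return Counter(predicted_labels).most_common(1)[0][0]

    def fuse_decisions_by_weighting(probability_rows, weights):
        """Weighted average of per-classifier class-probability vectors."""
        probs = np.average(np.stack(probability_rows, axis=0), axis=0, weights=weights)
        return int(np.argmax(probs))   # index of the winning action category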
The foregoing mainly describes the action recognition method provided by the embodiments of the present application with reference to the accompanying drawings. It should also be understood that, although the steps in the flowcharts of the above embodiments are shown in order, these steps are not necessarily performed in the order shown in the figures. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed alternately with at least some of the sub-steps or stages of other steps.
An apparatus according to an embodiment of the present application is described below with reference to the accompanying drawings. For brevity, the description of the apparatus is abridged where appropriate; for relevant content, refer to the corresponding description in the method above, which will not be repeated.
Corresponding to the method described in the above embodiments, fig. 5 shows a block diagram of the motion recognition device provided in the embodiment of the present application, and for convenience of explanation, only the portion relevant to the embodiment of the present application is shown.
Referring to fig. 5, the apparatus includes:
the acquiring module 510 is configured to acquire video data and inertial data of the same motion to be identified.
The video data of the motion to be identified comprises image frames for recording the states of the motion to be identified at each moment.
The inertial data of the action to be identified is used to represent the motion parameters of the action to be identified during the action.
The identifying module 520 is configured to identify a first action category corresponding to the video data of the action to be identified by using the first identifying model.
The recognition module 520 is further configured to recognize a second action category corresponding to the inertial data of the action to be recognized by using the second recognition model.
An analysis module 530 for determining a target action category of the action to be identified based on the consistency of the first action category and the second action category.
In one embodiment, the identification module 520 may also be used to perform steps S201-S204.
In one embodiment, the identification module 520 may also be used to perform steps S301-S303.
In one embodiment, the identification module 520 may also be used to perform steps S402-S403.
In one embodiment, the identification module 520 may also be used to perform steps S501-S502.
In one embodiment, the analysis module 530 may also be used to perform steps S105-S107.
In one embodiment, the analysis module 530 may also be used to perform steps S205-S207.
In one embodiment, the acquisition module 510 may include a visual sensor and an inertial sensor. The visual sensor can be shooting equipment such as a video camera, a mobile phone, a monitoring camera and the like, and the inertial sensor can be a nine-axis inertial sensor (a three-axis accelerometer, a three-axis gyroscope and a three-axis magnetometer). In one example, the MPU9250 inertial sensor may be utilized as an inertial measurement unit, and the ESP32 may be selected as a data transmission module according to the real-time requirements of the human body posture tracking system.
It should be understood that the above-mentioned data acquisition devices such as the visual sensor and the inertial sensor may be integrated into the acquisition module 510, or may be independent from the acquisition module 510, which is not limited in this application.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The embodiment of the application also provides a computer device, which comprises: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps that may implement the various method embodiments described above.
Embodiments of the present application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform steps that may be performed in the various method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of the method of the above embodiments, and may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing device/terminal apparatus, recording medium, computer Memory, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), electrical carrier signals, telecommunications signals, and software distribution media. Such as a U-disk, removable hard disk, magnetic or optical disk, etc. In some jurisdictions, computer readable media may not be electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 6, the computer device 1000 includes: at least one processor 1003 (only one is shown in fig. 6), a memory 1001, and a computer program 1002 stored in the memory 1001 and executable on the processor 1003. The processor 1003, when executing the computer program 1002, implements steps S101 to S104 in the method embodiment of fig. 1 described above; alternatively, the processor 1003, when executing the computer program 1002, implements the functions of modules 510 to 530 in the embodiment of the apparatus 500 of fig. 5 described above.
The processor 1003 may be a central processing unit (Central Processing Unit, CPU), the processor 1003 may also be another general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 1001 may in some embodiments be an internal storage unit of the computer device 1000, such as a hard disk or a memory of the computer device 1000. The memory 1001 may also be an external storage device of the computer device 1000 in other embodiments, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (FC) or the like, which are provided on the computer device 1000. Further, the memory 1001 may also include both an internal storage unit and an external storage device of the computer device 1000. The memory 1001 is used for storing an operating system, an application program, a Boot Loader (BL), data, other programs, and the like, for example, program codes of the computer program, and the like. The memory 1001 may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes a computer device to implement the steps in the respective method embodiments described above.
In the description above, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. An action recognition method, comprising:
acquiring video data and inertial data of the same action to be identified, wherein the video data of the action to be identified comprise image frames recording the state of the action to be identified at each moment, and the inertial data of the action to be identified represent motion parameters of the action to be identified during the action period;
identifying a first action category corresponding to the video data of the action to be identified by using a first identification model;
identifying a second action category corresponding to the inertial data of the action to be identified by using a second identification model;
and determining the target action category of the action to be identified according to the consistency of the first action category and the second action category.
2. The method of claim 1, wherein the determining the target action category of the action to be identified according to the consistency of the first action category and the second action category comprises:
when the first action category and the second action category are consistent, determining the first action category or the second action category as a target action category of the action to be identified;
and re-identifying the action category of the action to be identified when the first action category and the second action category are inconsistent.
3. The method according to claim 2, wherein the method further comprises:
and when the first action category is inconsistent with the second action category, updating the first recognition model and/or the second recognition model according to a manually calibrated action category of the action to be identified.
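By way of illustration only, the decision flow recited in claims 1 to 3 can be summarized in the following minimal Python sketch. The model objects, their `predict` interfaces, and the `reidentify` fallback are hypothetical placeholders introduced for readability; the claims do not prescribe any particular implementation.

```python
# Minimal sketch of the dual-modality decision flow (claims 1-3).
# All names below are hypothetical placeholders, not part of the disclosure.

def recognize_action(video_frames, inertial_samples,
                     first_model, second_model, reidentify=None):
    """Return the target action category, or None when the branches disagree
    and no fallback re-identification routine is supplied."""
    # Identify the action from the image frames (first recognition model).
    first_category = first_model.predict(video_frames)
    # Identify the same action from the inertial data (second recognition model).
    second_category = second_model.predict(inertial_samples)

    # Consistent results: either category is the target category (claim 2).
    if first_category == second_category:
        return first_category
    # Inconsistent results: re-identify the action (claim 2); claim 3 further
    # updates the models with a manually calibrated category in this case.
    return reidentify(video_frames, inertial_samples) if reidentify else None
```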
4. A method according to any one of claims 1 to 3, wherein the first recognition model comprises a target detection network, a pose estimation network and an action recognition network, and wherein the identifying a first action category corresponding to the video data of the action to be identified using the first recognition model comprises:
locating the human body position in each image frame of the video data of the action to be identified by using the target detection network;
locating human skeleton point coordinates based on the human body position by using the pose estimation network, and generating a skeleton sequence;
and predicting the action category corresponding to the skeleton sequence by using the action recognition network, and determining the action category corresponding to the skeleton sequence as the first action category.
5. The method of claim 4, wherein the object detection network is a YOLO Nano network, the pose estimation network is an SNHRNet network, and the action recognition network is an ST-GCN network.
6. The method of claim 5, wherein the SNHRNet network is obtained by training an NHRNet network into which an attention mechanism module is fused.
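The video branch of claims 4 to 6 can likewise be outlined as a short sketch. The `detect`, `estimate`, and `predict` method names are assumptions made for illustration; the concrete interfaces of YOLO Nano, SNHRNet, and ST-GCN implementations are not specified here.

```python
# Sketch of the video-branch pipeline (claims 4-6):
# detection -> pose estimation -> skeleton sequence -> action recognition.
# The network objects and method names are hypothetical stand-ins.

def classify_video(frames, detector, pose_net, action_net):
    skeleton_sequence = []
    for frame in frames:
        # Locate the human body position in each image frame.
        person_box = detector.detect(frame)
        if person_box is None:
            continue  # no person found in this frame
        # Locate human skeleton point coordinates within the detected box.
        keypoints = pose_net.estimate(frame, person_box)
        skeleton_sequence.append(keypoints)
    # Predict the action category corresponding to the skeleton sequence.
    return action_net.predict(skeleton_sequence)
```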
7. A method according to any one of claims 1 to 3, wherein the second recognition model comprises a feature extraction network and a classifier, and wherein the identifying a second action category corresponding to the inertial data of the action to be identified using the second recognition model comprises:
performing feature extraction and feature selection on the inertial data of the action to be identified by using the feature extraction network to obtain an action feature vector corresponding to the inertial data of the action to be identified;
and predicting the action category corresponding to the action feature vector by using the classifier according to a pre-learned mapping relation, and determining the action category corresponding to the action feature vector as the second action category.
8. The method of claim 7, wherein, when the inertial data of the action to be identified include initial inertial data acquired by a plurality of sensors, the performing feature extraction and feature selection on the inertial data of the action to be identified by using the feature extraction network to obtain an action feature vector comprises:
respectively performing feature extraction and feature selection on the initial inertial data to obtain a plurality of groups of action features, wherein each group of action features corresponds to one sensor;
and fusing the plurality of groups of action features to obtain the action feature vector corresponding to the inertial data of the action to be identified.
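The inertial branch of claims 7 and 8 admits a similar sketch. The statistical features and the concatenation-based fusion below are assumptions chosen for illustration; the claims fix neither the feature set, the feature selection step, nor the fusion method, and `classifier` stands for any pre-trained classifier with a scikit-learn-style `predict` interface.

```python
# Sketch of the inertial branch (claims 7-8): per-sensor feature extraction,
# fusion into one action feature vector, and classification.
# Feature choices and fusion by concatenation are illustrative assumptions.
import numpy as np

def extract_features(samples):
    """samples: array of shape (T, C) holding one sensor's inertial readings."""
    return np.concatenate([samples.mean(axis=0),          # mean per channel
                           samples.std(axis=0),           # variability per channel
                           samples.max(axis=0) - samples.min(axis=0)])  # range

def classify_inertial(per_sensor_samples, classifier):
    # One group of action features per sensor (claim 8).
    feature_groups = [extract_features(s) for s in per_sensor_samples]
    # Fuse the groups into a single action feature vector (claim 8).
    action_feature_vector = np.concatenate(feature_groups)
    # Map the feature vector to an action category (claim 7).
    return classifier.predict(action_feature_vector.reshape(1, -1))[0]
```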
9. An action recognition device, comprising:
an acquisition module, configured to acquire video data and inertial data of the same action to be identified, wherein the video data of the action to be identified comprise image frames recording the state of the action to be identified at each moment, and the inertial data of the action to be identified represent motion parameters of the action to be identified during the action period;
a recognition module, configured to identify a first action category corresponding to the video data of the action to be identified by using a first recognition model;
the recognition module being further configured to identify a second action category corresponding to the inertial data of the action to be identified by using a second recognition model;
and an analysis module, configured to determine the target action category of the action to be identified according to the consistency of the first action category and the second action category.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer device to implement the method of any one of claims 1 to 8.
CN202311267919.2A 2023-09-27 2023-09-27 Action recognition method and device and computer equipment Pending CN117292435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311267919.2A CN117292435A (en) 2023-09-27 2023-09-27 Action recognition method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311267919.2A CN117292435A (en) 2023-09-27 2023-09-27 Action recognition method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN117292435A true CN117292435A (en) 2023-12-26

Family

ID=89244055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311267919.2A Pending CN117292435A (en) 2023-09-27 2023-09-27 Action recognition method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN117292435A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination