CN117671787A - Rehabilitation action evaluation method based on Transformer - Google Patents

Rehabilitation action evaluation method based on Transformer

Info

Publication number: CN117671787A
Application number: CN202311655843.0A
Authority: CN (China)
Prior art keywords: data, feature, sequence, Transformer, frame
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 陈鹏, 孟维庆, 章军, 郑春厚, 夏懿, 王兵
Current Assignee: Anhui University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Anhui University
Application filed by Anhui University
Publication of CN117671787A


Abstract

The application discloses a Transformer-based rehabilitation action evaluation method, which relates to the technical field of action quality assessment and comprises the following steps in sequence: S1, designing a Transformer-based prediction model; S2, training the model with the KIMORE dataset. The invention evaluates motion quality more scientifically and can detect subtle differences between similar test motions. It analyzes and models skeleton-point coordinate information with a deep-learning algorithm, monitors, identifies, and predicts the quality of human motion in real time, and provides users with accurate motion guidance and quality improvement. An intelligent evaluation system scores the patient's motions and returns the score as feedback; this interaction can stimulate the patient's active participation and improve the training effect.

Description

Rehabilitation action evaluation method based on Transformer
Technical Field
The application relates to the technical field of action quality assessment, and in particular to a Transformer-based rehabilitation action evaluation method.
Background
In recent years, computer vision technology has developed rapidly. Clinical evidence shows that computer-aided rehabilitation training can provide safe, reliable, targeted, and adaptive training for patients with motor dysfunction caused by stroke or spinal cord injury. Home rehabilitation is increasingly popular as a way to ensure that patients receive sufficient rehabilitation exercise; compared with hospitalization it shows no difference in outcomes while imposing fewer constraints of time and space. For home rehabilitation assessment, using the joint position and rotation information collected by devices such as Kinect and Vicon to assess motion quality has become the dominant approach in current motion-assessment research. An intelligent evaluation system scores the patient's motions and returns the score as feedback, and this interaction can stimulate active participation and improve the training effect. It is therefore important to assess intelligently whether a patient's movements meet the standard of the rehabilitation exercise. However, a system that scientifically evaluates action quality is still lacking. The difficulty lies in the fact that the differences between different scores of the same test action are subtle, and recognizing such fine-grained actions is the hard part of this work. There are two main approaches to action-quality assessment: evaluation based on video data and evaluation based on skeleton-point data.
The prior art also has the following problems:
1. High computational resource requirements. The drawbacks of video-based evaluation are obvious: it demands substantial computational resources and cannot achieve both prediction accuracy and inference speed. The main reasons are: (1) Large data size: video data is typically much larger than text data because it consists of consecutive frames. Processing large-scale video datasets requires a great deal of memory and computing resources such as GPUs, which adds complexity and cost to training and inference. (2) Difficult modeling of temporal and spatial relationships: video data has temporal and spatial relationships that must be modeled. Conventional deep-learning models may require preprocessing of the video, such as segmentation into short clips or extraction of key frames. At the same time, modeling must take temporal order and spatial context into account, which increases model complexity and, indirectly, the computational cost. (3) High-dimensional feature representation: the feature representation of video data is typically high-dimensional, spanning both temporal and spatial dimensions. The model therefore needs more parameters to process the data, which increases its complexity and training difficulty as well as the computational cost.
2. Low evaluation accuracy. Existing skeleton-based evaluation methods have several shortcomings. For example, using an autoencoder during skeleton-feature extraction can help the model learn features faster, but it may cause the model to overfit the training data, making it difficult to learn a more general low-dimensional representation. Beyond autoencoders, common preprocessing steps such as dimensionality reduction and windowing also have drawbacks: low-dimensional mapping of skeleton data loses key information and cannot extract finer motion features. When skeleton data is partitioned with a time window and the windows are used as model input, the window size affects model performance and must be tuned empirically from experimental results, which does not scale. Moreover, with window data as input, the temporal features between the frames inside a window are hard for the model to learn, which reduces accuracy.
3. Small dataset size is another important cause of low accuracy. The existing datasets available for rehabilitation action-quality evaluation are all small, and models trained on small datasets do not generalize well.
Disclosure of Invention
The Transformer-based rehabilitation action evaluation method of this application provides a rehabilitation action-quality assessment approach that addresses the difficulty of learning temporal relationships across long feature sequences. Compared with video-based evaluation, using skeleton-point data greatly reduces the consumption of computing resources. To address the small dataset size, three data-augmentation methods are proposed, which effectively mitigate the drawbacks of training on a small dataset.
The application provides a Transformer-based rehabilitation action evaluation method, which comprises the following steps in sequence:
s1, designing a Transformer-based prediction model;
s2, training a model by using the KIMORE data set;
s3, acquiring bone point data by using a KinectV2 depth camera;
s4, inputting a bone point sequence of rehabilitation actions to be evaluated into a trained prediction model for prediction;
s5, outputting a prediction result.
Further, S1 specifically refers to a Transformer-based prediction model comprising a data layer for sampling skeleton data, a position-encoding layer for extracting the temporal features of the skeleton-frame sequence, and a feature-extraction layer for extracting skeleton coordinate information. The data layer contains a convolution layer mainly used for sampling the skeleton data; the position-encoding layer contains cosine and sine encoding functions mainly used for encoding the position of each feature vector within the whole feature sequence; the feature-extraction layer contains a Transformer encoder module and a multi-layer perceptron module, where the Transformer encoder module comprises 12 Transformer encoder layers, and the multi-layer perceptron module comprises two fully-connected layers, two batch-normalization layers, and one ELU activation function.
Further, the step S2 includes the following sequential steps:
s21, preprocessing data;
s22, data expansion;
s23, dividing the data set.
Further, S21 refers to the joint-direction data captured by the depth camera used in our experiments. The data is captured at a fixed frame rate, assumed to be T; the dimension of the feature vector obtained per frame is fixed and can be set to D. If the number of captured joints is J, then the feature-vector dimension D = J × 3, and one action is denoted as A. Since the speed of an action affects its completion time (not every action finishes within 1 second), the number of frames acquired for one action is k × T, where k > 0, so the sequence length varies. We therefore sample the data in the KIMORE dataset to a fixed length of 750 frames, with a feature-vector dimension of 75 per frame, i.e., the three-dimensional coordinates of 25 joints.
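The fixed-length sampling step described above can be sketched as follows. The function name and the uniform-index strategy are illustrative assumptions, since the patent does not print the exact sampling algorithm; only the target shape (750 frames × 75 features) comes from the text.

```python
import numpy as np

TARGET_FRAMES = 750   # fixed sequence length used in the patent
FEATURE_DIM = 75      # 25 joints x 3 coordinates

def resample_sequence(frames: np.ndarray, target: int = TARGET_FRAMES) -> np.ndarray:
    """Uniformly resample an (n, D) frame sequence to (target, D) by index selection."""
    n = frames.shape[0]
    idx = np.linspace(0, n - 1, num=target).round().astype(int)
    return frames[idx]
```

A recording of any duration (e.g. k × T frames for a slow or fast repetition) thus maps to the same fixed-size input expected by the model.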
Further, S22 indicates that one frame of skeleton-point data represents the spatial feature information of the skeleton joints at a given moment, much as an image represents the spatial features of an object, so some data-augmentation methods from image research can be applied to skeleton-point data. Meanwhile, an action sequence composed of multiple frames is essentially another representation of the action video: it is the feature sequence of the action. Inspired by research methods in image processing, we adopt three data-augmentation methods: feature masking, frame masking, and feature-noise addition.
Further, the feature-masking method randomly selects J feature points from the feature vector (J can take different values to obtain more data) and sets the selected points to 0, which is analogous to adding vertical stripe noise to an image; it preserves the features of the active joints to a greater extent. Frame masking, a common augmentation method, effectively expands the data: similar to adding horizontal stripe noise in an image-processing task, M frames are randomly selected and set entirely to 0, simulating frame jitter during data acquisition. Feature-noise addition is likewise analogous to image processing: M frames are randomly selected in the feature sequence, and in each of them J feature points are randomly selected and set to 0, so the resulting motion features resemble an image with random noise added.
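The three augmentation methods above can be sketched in NumPy as follows. Function names are illustrative; the patent only specifies the zeroing behavior, not an implementation.

```python
import numpy as np

def feature_mask(seq, j, rng):
    """Zero j randomly chosen feature columns (vertical-stripe analogue)."""
    out = seq.copy()
    cols = rng.choice(seq.shape[1], size=j, replace=False)
    out[:, cols] = 0.0
    return out

def frame_mask(seq, m, rng):
    """Zero m randomly chosen frames (horizontal-stripe analogue)."""
    out = seq.copy()
    rows = rng.choice(seq.shape[0], size=m, replace=False)
    out[rows, :] = 0.0
    return out

def feature_noise(seq, m, j, rng):
    """In m random frames, zero j random feature points per frame (random-noise analogue)."""
    out = seq.copy()
    rows = rng.choice(seq.shape[0], size=m, replace=False)
    for r in rows:
        cols = rng.choice(seq.shape[1], size=j, replace=False)
        out[r, cols] = 0.0
    return out
```

Each call returns a new augmented copy, so one original sequence can yield many training samples by varying j, m, and the random seed.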
Further, S23 splits the data of each action type into a training set and a test set at a ratio of 8:2, and then splits the training set into training and validation sets, again at a ratio of 8:2.
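The two-stage 8:2 split can be sketched as follows; per-class application and the shuffling strategy are illustrative assumptions.

```python
import numpy as np

def split_8_2(indices, rng):
    """Shuffle the given indices and split them 80/20."""
    idx = rng.permutation(indices)
    cut = int(0.8 * len(idx))
    return idx[:cut], idx[cut:]

def make_splits(n_samples, rng):
    """8:2 train/test, then the training part again 8:2 into train/validation."""
    train_full, test = split_8_2(np.arange(n_samples), rng)
    train, val = split_8_2(train_full, rng)
    return train, val, test
```

For 100 samples of one action type this yields 64 training, 16 validation, and 20 test samples.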
Further, one of the core components of the Transformer-based prediction model is the Transformer encoder layer, which performs feature extraction and encoding of the input sequence. A simplified description follows:
Multi-Head Self-Attention Layer: given an input sequence X, map it to vector representations of the query (Q), key (K), and value (V), then compute attention scores that capture the correlation between different positions, and apply the attention weights to the value vectors to obtain the self-attention output. The specific operation is given by equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (1)
wherein Q, K, V respectively denote the query, key, and value vectors, and d_k denotes the dimension of the vectors;
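Equation (1) can be sketched in NumPy as follows, single-head for clarity (the model itself uses multi-head attention; this minimal form only illustrates the scaled dot-product step).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights
```

Each output row is a weighted mixture of the value vectors, with the weights expressing how strongly each frame attends to every other frame.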
Feed-Forward Neural Network Layer: performs a nonlinear transformation and feature extraction on the self-attention output. The layer combines two linear transformations with a ReLU activation, applying a linear transformation and a nonlinear activation to the normalized output to increase the nonlinear capacity of the model. The specific operation is given by equation (2):
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (2)
wherein x denotes the self-attention output and W_1, W_2, b_1, b_2 are learnable weights and biases;
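Equation (2) translates directly into code; the hidden dimension is an illustrative choice.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

The same weights are applied independently at every position of the sequence, which is why the layer is called position-wise.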
The above is a high-level description of one Transformer encoder layer. A complete encoder is constructed by stacking 12 encoder layers. Each encoder layer processes all positions of the input sequence in parallel and performs feature extraction and encoding through the self-attention mechanism and the feed-forward network, thereby capturing both the global dependencies and the local patterns in the input sequence. For an action sequence, each collected frame of skeleton data can be regarded as the feature representation of the action at one moment, and arranging all frames in temporal order yields the feature representation of the whole action. Seen as a time series, the frames are strongly correlated. Inspired by Transformer research in text translation, where the positional relationship between words is a very important attribute of text, we express this correlation equivalently as a positional relationship: when processing skeleton data, the temporal relationship between skeleton-data frames is conceptually represented as a positional relationship, and the two are equivalent;
we use a series of frame sequences as input, the input feature vectors being given by the set S, where each S i Representing the ith skeleton data frame, the specific operation is given by equation (3):
To strengthen the temporal relationship between skeleton data frames, positional encoding is adopted to superimpose the positional information of each data frame onto the original data. The specific operation is given by equation (4):
PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D)) (4)
where pos is the frame index, i indexes the feature dimension, and the encoding is added element-wise to the input;
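The sin/cos positional encoding above can be sketched as follows. The standard sinusoidal form is an assumption here, since the patent states that sine and cosine functions are used but does not print the exact formula.

```python
import numpy as np

def positional_encoding(n_frames, d_model):
    """Sinusoidal positional encoding: sin on even columns, cos on odd columns."""
    pos = np.arange(n_frames)[:, None]                  # (n, 1) frame indices
    i = np.arange(0, d_model, 2)[None, :]               # (1, ceil(d/2)) even dims
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_frames, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])     # handle odd d_model
    return pe
```

Adding `positional_encoding(750, 75)` to the sampled frame sequence gives each frame a unique, smoothly varying position signature.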
This work adopts the frame sequence as input, with each frame fed directly into the model as a feature representation, so that the model can learn the positional relationships between frames more fully. Unlike adjacent rows of an image, whose pixels are essentially the same, the change between consecutive frames of skeleton data is relatively pronounced, so the temporal relationship between frames matters more, and using the frame sequence as input preserves more temporal features.
To further verify the prediction performance of the trained Transformer-based prediction model, it is comprehensively evaluated on the test set, using the mean absolute error (MAE) as the evaluation metric given by equation (5):
MAE = (1/m) Σ_{i=1}^{m} |f(x_i) − y_i| (5)
wherein m denotes the number of test samples, x denotes the feature sequence, and y denotes the true score. During testing we train for 1000 rounds, repeat the experiment five times, and take the worst of the five results.
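The MAE metric of equation (5) is a one-liner:

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error between predicted and ground-truth scores."""
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))
```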
The technical scheme provided by the application has at least the following technical effects or advantages:
1. The invention evaluates motion quality more scientifically and can detect subtle differences between similar test motions. It analyzes and models skeleton-point coordinate information with a deep-learning algorithm, monitors, identifies, and predicts the quality of human motion in real time, and provides users with accurate motion guidance and quality improvement.
2. The intelligent evaluation system avoids the problems caused by using an autoencoder during skeleton-feature extraction: although an autoencoder helps a model learn features faster, it makes the model overfit the training data and prevents it from learning a more general low-dimensional representation. The system scores the patient's motions and returns the score as feedback, and this interaction can stimulate the patient's active participation and improve the training effect.
3. Compared with methods that evaluate motion quality from video, the proposed data-augmentation methods effectively address the fact that the datasets available for rehabilitation action-quality evaluation are all small and that models trained on small datasets do not generalize.
Drawings
FIG. 1 is a flow chart of a method of the present application;
FIG. 2 is a network structure diagram of the Transformer-based prediction model in the present application;
FIG. 3 is a graph showing the variation of the test set loss during the experiment of the present application.
Detailed Description
Against the high computational-resource demands of video-based methods, the invention analyzes and models skeleton-point coordinate information with a deep-learning algorithm, monitors, identifies, and predicts the quality of human motion in real time, and provides users with accurate motion guidance and quality improvement. Against the overfitting caused by autoencoders, which makes a model struggle to learn a general low-dimensional representation, the intelligent evaluation system scores the patient's motions and returns the score as feedback, stimulating active participation and improving the training effect. Compared with evaluating action quality through video, where models trained on small datasets do not generalize, the method integrates acquisition and analysis and greatly reduces the consumption of computing resources.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
Examples
Referring to fig. 1, a Transformer-based rehabilitation action evaluation method includes the following steps in sequence:
s1, designing a Transformer-based prediction model;
s2, training a model by using the KIMORE data set;
s3, acquiring bone point data by using a KinectV2 depth camera;
s4, inputting a bone point sequence of rehabilitation actions to be evaluated into a trained prediction model for prediction;
s5, outputting a prediction result.
S1 specifically refers to a Transformer-based prediction model comprising a data layer for sampling skeleton data, a position-encoding layer for extracting the temporal features of the skeleton-frame sequence, and a feature-extraction layer for extracting skeleton coordinate information. The data layer contains a convolution layer mainly used for sampling the skeleton data; the position-encoding layer contains cosine and sine encoding functions mainly used for encoding the position of each feature vector within the whole feature sequence; the feature-extraction layer contains a Transformer encoder module and a multi-layer perceptron module, where the Transformer encoder module comprises 12 Transformer encoder layers, and the multi-layer perceptron module comprises two fully-connected layers, two batch-normalization layers, and one ELU activation function.
S2 comprises the following sequential steps: s21, preprocessing data; s22, data expansion; s23, dividing the data set.
S21 refers to the joint-direction data captured by the depth camera used in our experiments. The data is captured at a fixed frame rate, assumed to be T; the dimension of the feature vector acquired per frame is fixed and can be set to D. If the number of captured joints is J, then D = J × 3, and one action is denoted as A. Since the speed of an action affects its completion time (not every action finishes within 1 second), the number of frames acquired for one action is k × T, where k > 0, so the sequence length varies. The data in the KIMORE dataset is therefore sampled to a fixed length of 750 frames, with a feature-vector dimension of 75 per frame, i.e., the three-dimensional coordinates of 25 joints.
S22: one frame of skeleton-point data represents the spatial feature information of the skeleton joints at a given moment, much as an image represents the spatial features of an object, so some data-augmentation methods from image research can be applied to skeleton-point data. Meanwhile, an action sequence composed of multiple frames is essentially another representation of the action video, i.e., the feature sequence of the action. Inspired by research methods in image processing, three augmentation methods are used: feature masking, frame masking, and feature-noise addition. Feature masking randomly selects J feature points from the feature vectors (J can take different values to obtain more data) and sets the selected points to 0; compared with adding vertical stripe noise to an image, it preserves the features of the active joints to a greater extent. Frame masking is a common augmentation method that effectively expands the data: similar to adding horizontal stripe noise in an image-processing task, M frames are randomly selected and set to 0 to simulate frame jitter during data acquisition. Feature-noise addition is likewise analogous to image processing: M frames are randomly selected in the feature sequence, and in each frame J feature points are randomly selected and set to 0, so the resulting motion features resemble an image with random noise added.
S23 splits the data of each action type into a training set and a test set at a ratio of 8:2, and then splits the training set into training and validation sets, again at a ratio of 8:2.
The composition of the action categories contained in the KIMORE dataset is shown in Table 1 below:
TABLE 1
Action number    Specific action
Action 1         Lifting of the arms
Action 2         Lateral tilt of the trunk with arms extended
Action 3         Trunk rotation
Action 4         Pelvis rotation on the transverse plane
Action 5         Squatting
The KIMORE dataset was collected from 78 subjects divided into two groups: 44 healthy subjects and 34 with motor dysfunction. The data were collected with a single RGB-D sensor during different rehabilitation exercises and comprise RGB video, depth video, and skeleton data for five physical exercises. For each exercise the dataset provides the most clinically relevant features and a clinical score, all given by specialists.
Training uses the divided training set, which is fed into the Transformer-based prediction model with a batch size of 8, 1000 training rounds, an initial learning rate of 0.01, and cross entropy as the loss function; the weights of the Transformer-based prediction model that perform best on the test set are saved.
S3 specifically refers to: S31, acquiring the coordinate information of human skeleton points with a KinectV2 depth camera: the KinectV2 depth camera is connected to a computer and tested, the camera angle and distance are adjusted so that it directly faces the tester, and once the camera records normally, the tester performs the corresponding action 3 times and the computer stores the collected skeleton-point information; S32, preprocessing the acquired skeleton-point data: noise frames are removed, and the data is then sampled to a fixed length following the data-preprocessing method of S2;
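The S32 preprocessing can be sketched as follows. Treating all-zero frames (joints lost by the sensor) as the noise frames to remove is an assumption, since the patent does not define its noise criterion.

```python
import numpy as np

def preprocess(frames, target=750):
    """Drop untracked (all-zero) frames, then resample to a fixed length."""
    frames = np.asarray(frames)
    keep = ~np.all(frames == 0, axis=1)      # discard empty/noise frames
    frames = frames[keep]
    idx = np.linspace(0, len(frames) - 1, num=target).round().astype(int)
    return frames[idx]
```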
as shown in fig. 2, a transducer is a loop-avoidance model structure that relies on the attention mechanism to learn the global dependencies of inputs and outputs. It breaks the limitation that RNN models cannot compute in parallel, and the number of operations required to compute the association between two locations does not increase with increasing distance compared to CNN. The introduction of self-attention makes the model more interpretable. The transducer has solved many of the problems in image recognition and text translation and achieved good results. Therefore, we propose a transducer-based rehabilitation exercise quality assessment prediction model in the present invention, and the structure of the model is shown in fig. 2. The model consists of three main parts: the device comprises a data layer for sampling bone data, a position coding layer for extracting time sequence characteristics of a bone data frame sequence and a characteristic extraction layer for extracting bone coordinate information. In the first and third parts of the model, ELU is used as an activation function. Furthermore, for the transducer encoder module in the feature extraction layer, a gel is employed as an activation function, which also preserves the features of the data to a greater extent. And finally, the fully connected module in the feature extraction layer is used for transforming the output of the top transformer encoder module through spatial mapping, and activating by adopting a sigmoid function to obtain the final prediction score.
One of the core components of the Transformer-based prediction model is the Transformer encoder layer, which performs feature extraction and encoding of the input sequence. A simplified description follows:
Multi-Head Self-Attention Layer: given an input sequence X, map it to vector representations of the query (Q), key (K), and value (V), then compute attention scores that capture the correlation between different positions, and apply the attention weights to the value vectors to obtain the self-attention output. The specific operation is given by equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k)V (1)
wherein Q, K, V respectively denote the query, key, and value vectors, and d_k denotes the dimension of the vectors;
Feed-Forward Neural Network Layer: performs a nonlinear transformation and feature extraction on the self-attention output. The layer combines two linear transformations with a ReLU activation, applying a linear transformation and a nonlinear activation to the normalized output to increase the nonlinear capacity of the model. The specific operation is given by equation (2):
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (2)
wherein x denotes the self-attention output and W_1, W_2, b_1, b_2 are learnable weights and biases;
The above is a high-level description of one Transformer encoder layer. The Transformer model constructs a complete encoder by stacking 12 encoder layers. Each encoder layer processes all positions of the input sequence in parallel and performs feature extraction and encoding through the self-attention mechanism and the feed-forward network, thereby capturing both the global dependencies and the local patterns in the input sequence. For an action sequence, each collected frame of skeleton data can be regarded as the feature representation of the action at one moment, and arranging all frames in temporal order yields the feature representation of the whole action. Seen as a time series, the frames are strongly correlated. Inspired by Transformer research in text translation, where the positional relationship between words is a very important attribute of text, we express this correlation equivalently as a positional relationship: when processing skeleton data, the temporal relationship between skeleton-data frames is conceptually represented as a positional relationship, and the two are equivalent.
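The stacking of encoder layers described above can be sketched at shape level as follows: each layer applies self-attention and a feed-forward network, each followed by a residual connection and layer normalization, and 12 such layers are composed. Single-head attention and untrained random weights are simplifying assumptions; this only illustrates the structure, not the trained model.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_layer(x, rng):
    """One encoder layer: attention + residual + norm, then FFN + residual + norm."""
    d = x.shape[-1]
    Wq, Wk, Wv, W1, W2 = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(5))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                   # softmax attention weights
    x = layer_norm(x + w @ V)                          # attention sub-layer
    x = layer_norm(x + np.maximum(0.0, x @ W1) @ W2)   # feed-forward sub-layer
    return x

def encoder(x, n_layers=12, seed=0):
    """Stack n_layers encoder layers, as in the 12-layer module described above."""
    rng = np.random.default_rng(seed)
    for _ in range(n_layers):
        x = encoder_layer(x, rng)
    return x
```

The sequence shape is preserved through all 12 layers, so the top layer's output can be pooled and fed to the multi-layer perceptron head for the final score.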
We use a sequence of frames as input; the input feature vectors are given by the set S, where each s_i represents the i-th skeleton data frame. The specific operation is given by equation (3):
S = {s_1, s_2, …, s_n} (3)
To strengthen the temporal relationship between skeleton data frames, positional encoding is adopted to superimpose the positional information of each data frame onto the original data. The specific operation is given by equation (4):
PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D)) (4)
where pos is the frame index, i indexes the feature dimension, and the encoding is added element-wise to the input.
the frame sequence is used as input, each frame is directly input into the model to be used as characteristic representation, so that the position relation among the frames can be fully learned by the model, the characteristics basically same as those of pixels of adjacent line images are different, the inter-frame conversion of skeleton data is relatively obvious, the inter-frame time relation is more important, and more time characteristics can be reserved by using the frame sequence as input.
To further verify the prediction performance of the trained Transformer-based prediction model, it is comprehensively evaluated on the test set, using the mean absolute error (MAE) as the evaluation metric given by equation (5):
MAE = (1/m) Σ_{i=1}^{m} |f(x_i) − y_i| (5)
wherein m denotes the number of test samples, x denotes the feature sequence, and y denotes the true score. During testing we train for 1000 rounds, repeat the experiment five times, and take the worst of the five results. To verify the effectiveness of the model, we also performed comparative experiments against existing models; the results are shown in Table 2.
TABLE 2 Mean absolute error of motion quality scores on the KIMORE dataset
It can be seen that the prediction performance of the Transformer-based prediction model is superior to existing models, as is its inference speed. FIG. 3 shows the variation of the MAE on the test set after each training round.
In summary, the invention evaluates motion quality more scientifically and can detect subtle differences between similar test motions. It analyzes and models skeleton-point coordinate information with a deep-learning algorithm, monitors, identifies, and predicts the quality of human motion in real time, and provides users with accurate motion guidance and quality improvement. The intelligent evaluation system scores the patient's motions and returns the score as feedback, and this interaction can stimulate the patient's active participation and improve the training effect. Compared with evaluating action quality through video, the method integrates acquisition and analysis and greatly reduces the consumption of computing resources.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing is merely a preferred embodiment of the present application, and the scope of the present application is not limited thereto; any equivalent substitutions or modifications readily conceived by a person skilled in the art within the technical scope disclosed herein shall fall within the scope of protection of the present application.

Claims (8)

1. A Transformer-based rehabilitation action evaluation method, characterized by comprising the following steps in sequence:
S1, designing a Transformer-based prediction model;
S2, training the model by using the KIMORE data set;
S3, acquiring skeleton point data by using a Kinect V2 depth camera;
S4, inputting the skeleton point sequence of the rehabilitation action to be evaluated into the trained prediction model for prediction;
S5, outputting the prediction result.
2. The Transformer-based rehabilitation action evaluation method according to claim 1, wherein in S1 the Transformer-based prediction model comprises a data layer for sampling skeleton data, a position coding layer for extracting the temporal features of the skeleton data frame sequence, and a feature extraction layer for extracting skeleton coordinate information; the data layer comprises a convolution layer mainly used for sampling the skeleton data; the position coding layer comprises cos and sin coding functions mainly used for coding the position at which a feature vector appears in the whole feature sequence; the feature extraction layer comprises a Transformer coding module and a multi-layer perceptron module, wherein the Transformer coding module comprises 12 Transformer coding layers, and the multi-layer perceptron module comprises two fully-connected layers, two regularization layers and one ELU activation function.
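The multi-layer perceptron module described in this claim (two fully-connected layers, two regularization layers, one ELU activation) can be sketched in plain NumPy. The hidden width of 32 and the use of layer normalization as the regularization layers are illustrative assumptions, not specified by the patent:

```python
import numpy as np

# Sketch of the MLP head from claim 2: two fully-connected layers, two
# regularization layers and one ELU activation. Hidden width (32) and
# the choice of layer normalization are assumptions for illustration.

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def mlp_head(x, w1, b1, w2, b2):
    # normalize -> linear -> ELU -> normalize -> linear -> one score
    h = layer_norm(x)
    h = elu(h @ w1 + b1)
    h = layer_norm(h)
    return h @ w2 + b2

rng = np.random.default_rng(0)
d_model = 75                              # 25 joints x 3 coordinates
x = rng.normal(size=(1, d_model))         # pooled feature of one action
w1 = rng.normal(size=(d_model, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.normal(size=(32, 1)) * 0.1
b2 = np.zeros(1)
score = mlp_head(x, w1, b1, w2, b2)
print(score.shape)                        # one predicted quality score
```

A pooled 75-dimensional sequence feature goes in; a single scalar quality score comes out.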
3. The Transformer-based rehabilitation action evaluation method according to claim 1, wherein S2 comprises the following steps in sequence:
S21, preprocessing the data;
S22, expanding the data;
S23, dividing the data set.
4. The Transformer-based rehabilitation action evaluation method according to claim 3, wherein S21 refers to preprocessing the joint data captured by the depth camera. The camera captures at a fixed frame rate, assumed to be T, and the dimension of the feature vector obtained from each frame is fixed and can be set to D; if the number of captured joints is J, then D = J × 3. Since the speed of an action affects its completion time, that is, not all actions can be completed within 1 second, the number of frames acquired for one complete action is k × T, where k > 0. We therefore sample the data in the KIMORE data set to 750 frames, each frame having a feature vector dimension of 75, i.e., the three-dimensional coordinates of 25 skeleton nodes.
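The fixed-length sampling described in S21 can be sketched as linear interpolation of each feature channel from a variable-length capture (k × T frames) to exactly 750 frames; the helper name `resample_sequence` is hypothetical:

```python
import numpy as np

# Sketch of the fixed-length sampling in S21: linearly interpolate a
# variable-length skeleton sequence (k*T frames, 75 features per frame)
# to exactly 750 frames. The helper name is hypothetical.

def resample_sequence(frames, target_len=750):
    n, d = frames.shape
    src = np.linspace(0.0, 1.0, n)
    dst = np.linspace(0.0, 1.0, target_len)
    out = np.empty((target_len, d))
    for j in range(d):                    # interpolate each feature channel
        out[:, j] = np.interp(dst, src, frames[:, j])
    return out

seq = np.random.default_rng(1).normal(size=(430, 75))  # 430 captured frames
fixed = resample_sequence(seq)
print(fixed.shape)                        # (750, 75)
```

Endpoint frames are preserved exactly; intermediate frames are blended from their temporal neighbours.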
5. The Transformer-based rehabilitation action evaluation method according to claim 3, wherein in S22 a frame of skeleton point data represents the spatial feature information of the skeleton nodes at a certain moment, similar to an image representing the spatial features of an object, so some data expansion methods from the image research field can be applied to skeleton point data. An action sequence consisting of multiple frames, which is essentially another representation of an action video, is an action feature sequence. Inspired by research methods in the image processing field, the data are expanded through three data enhancement methods: feature masking, frame masking and adding feature noise.
6. The Transformer-based rehabilitation action evaluation method according to claim 5, wherein the feature masking method randomly selects J feature points from the feature vector, where J can take different values to obtain more data, and sets the values of the selected points to 0; compared with adding longitudinal stripe noise to an image, this preserves the features of the movable nodes to a greater extent. Frame masking is a common data expansion method that can effectively expand the data; similar to adding transverse stripe noise in an image processing task, M frames are randomly selected and set entirely to 0, the purpose being to simulate frame jitter during data acquisition. Adding feature noise is likewise similar to the image processing task: M frames are randomly selected from the feature sequence, and in each of these frames J feature points are randomly selected and set to 0; the resulting motion features resemble an image with random noise added.
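The three enhancement methods can be sketched in NumPy as follows; the function names are hypothetical, and each returns a masked copy so the original sequence is preserved:

```python
import numpy as np

# Sketch of the three data-expansion methods in claim 6 (function names
# are hypothetical). Each returns a masked copy of the input sequence.

rng = np.random.default_rng(42)

def feature_mask(seq, j):
    # zero j random feature columns ("longitudinal stripe" analogue)
    out = seq.copy()
    out[:, rng.choice(seq.shape[1], size=j, replace=False)] = 0.0
    return out

def frame_mask(seq, m):
    # zero m random frames ("transverse stripe" / frame-jitter analogue)
    out = seq.copy()
    out[rng.choice(seq.shape[0], size=m, replace=False), :] = 0.0
    return out

def feature_noise(seq, m, j):
    # zero j random feature points in each of m random frames
    out = seq.copy()
    for r in rng.choice(seq.shape[0], size=m, replace=False):
        out[r, rng.choice(seq.shape[1], size=j, replace=False)] = 0.0
    return out

seq = np.ones((750, 75))
print(int((feature_mask(seq, 5) == 0).sum()))       # 5 columns * 750 frames
print(int((frame_mask(seq, 10) == 0).sum()))        # 10 frames * 75 features
print(int((feature_noise(seq, 10, 5) == 0).sum()))  # 10 frames * 5 points
```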
7. The Transformer-based rehabilitation action evaluation method according to claim 3, wherein in S23 the data of each type of action are divided into a training set and a test set in a ratio of 8:2, and the training set is further divided into a training set and a validation set in a ratio of 8:2.
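The two-stage 8:2 division in S23 can be sketched as a repeated shuffled split; the helper `split_80_20` is hypothetical:

```python
import numpy as np

# Sketch of the two-stage 8:2 division in claim 7: each action type is
# split 8:2 into training and test sets, and the training part is split
# 8:2 again into training and validation sets. Helper name hypothetical.

def split_80_20(indices, rng):
    idx = rng.permutation(indices)
    cut = int(round(len(idx) * 0.8))
    return idx[:cut], idx[cut:]

rng = np.random.default_rng(0)
samples = np.arange(100)                  # e.g. 100 sequences of one action
train_val, test = split_80_20(samples, rng)
train, val = split_80_20(train_val, rng)
print(len(train), len(val), len(test))    # 64 16 20
```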
8. The method according to claim 1, wherein one of the core components of the Transformer-based prediction model is the Transformer encoder layer, which performs feature extraction and encoding of the input sequence; a simplified description is as follows:
Multi-Head Self-Attention Layer: given an input sequence X, map it to query (Q), key (K) and value (V) vector representations, then compute attention scores to capture the correlations between different positions, and apply the attention weights to the value vectors to obtain the self-attention output; the specific operation is given by equation (1):
Attention(Q, K, V) = softmax(QK^T/√d_k)V (1)
wherein Q, K, V respectively represent the query, key and value vectors, and d_k represents the dimension of the vectors;
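Equation (1), scaled dot-product attention, can be sketched in NumPy for a single head; multi-head attention repeats this with separate learned projections per head and concatenates the results:

```python
import numpy as np

# NumPy sketch of equation (1) for a single attention head.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise relevance, (n, n)
    weights = softmax(scores)              # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(3)
n, d_k = 4, 8                              # 4 frames, toy dimension
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                           # (4, 8)
print(w.sum(axis=1))                       # attention rows sum to 1
```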
Feed-Forward Neural Network Layer: performs non-linear transformation and feature extraction on the self-attention output. The layer comprises two linear transformations combined with a ReLU activation function, applying a linear transformation and a non-linear activation to the normalized output to increase the non-linear capability of the model; the specific operation is given by equation (2):
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (2)
wherein x represents the self-attention output, and W_1, W_2, b_1 and b_2 are learnable weights and biases;
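Equation (2) maps directly to two matrix products with a ReLU in between; the inner width d_ff = 256 below is an illustrative assumption, not stated in the patent:

```python
import numpy as np

# NumPy sketch of equation (2). The inner width d_ff = 256 is an
# illustrative assumption; the patent does not state it.

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(7)
d_model, d_ff = 75, 256
x = rng.normal(size=(750, d_model))        # one 750-frame sequence
W1 = rng.normal(size=(d_model, d_ff)) * 0.05
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.05
b2 = np.zeros(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)                             # same shape as the input
```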
the method comprises the steps that the high-level description of a transducer encoder layer is adopted, a complete encoder is constructed by stacking 12 encoder layers, each encoder layer can process all positions in an input sequence in parallel, and feature extraction and encoding are carried out through a self-attention mechanism and a feedforward neural network, so that global dependency and local modes in the input sequence are captured, for an action sequence, each frame of skeleton data collected can be regarded as a feature representation of action at a certain moment, all frames are arranged in time sequence to obtain a feature representation of the whole action, from the time sequence, strong relevance exists among the frames, the relevance is inspired by the transducer research in the text translation field, the relevance is equivalently expressed as a position relationship, the position relationship among words is a very important attribute of text, and similar to the position relationship, when skeleton data are processed, the time relationship among skeleton point data frames is notionally expressed as a position relationship, and the two are equivalent;
We use a series of frame sequences as input; the input feature vectors are given by the set S, where each s_i represents the i-th skeleton data frame, as given by equation (3):
S = {s_1, s_2, …, s_n} (3)
in order to increase the time relation between the skeleton data frames, a position coding technology is adopted to superimpose the position information of the data frames with the original data, and the specific operation is given by a formula (4):
the work adopts a frame sequence as input, each frame is directly input into a model as characteristic representation, so that the position relation among the frames can be more fully learned by the model, the characteristics basically the same as those of the pixels of the adjacent line images are different, the inter-frame conversion of skeleton data is relatively obvious, the time relation among the frames is more important, more time characteristics can be reserved by adopting the frame sequence as input,
In order to further verify the prediction effect of the trained Transformer-based prediction model, the model is comprehensively evaluated on the test set, adopting the mean absolute error (MAE) as the evaluation index, given by equation (5):
MAE = (1/m)∑|f(x_i) − y_i| (5)
wherein m represents the number of test samples, x represents the feature sequence, and y represents the true score; during testing the model is trained for 1000 rounds, the experiment is repeated five times, and the worst of the five results is taken.
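Equation (5) can be checked with a few lines of NumPy; the predicted and true scores below are made-up illustration values, not results from the patent:

```python
import numpy as np

# Quick check of equation (5), MAE = (1/m) * sum_i |f(x_i) - y_i|.
# The score values are made-up illustration data.

def mae(y_pred, y_true):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

pred = [48.0, 45.5, 50.0]
true = [50.0, 44.0, 47.0]
print(mae(pred, true))                     # (2.0 + 1.5 + 3.0) / 3
```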
CN202311655843.0A 2023-07-26 2023-12-05 Rehabilitation action evaluation method based on transducer Pending CN117671787A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023109255791 2023-07-26
CN202310925579 2023-07-26

Publications (1)

Publication Number Publication Date
CN117671787A true CN117671787A (en) 2024-03-08

Family

ID=90070898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311655843.0A Pending CN117671787A (en) 2023-07-26 2023-12-05 Rehabilitation action evaluation method based on transducer

Country Status (1)

Country Link
CN (1) CN117671787A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015708A (en) * 2024-04-08 2024-05-10 华侨大学 Diving movement quality assessment method, device and equipment based on judge score learning
CN118053591A (en) * 2024-04-16 2024-05-17 中国科学院苏州生物医学工程技术研究所 Intelligent rehabilitation evaluation method and system based on time-varying graph convolution network and graph similarity calculation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination