CN116630851A - Man-machine cooperation method, man-machine cooperation device and storage medium - Google Patents

Man-machine cooperation method, man-machine cooperation device and storage medium Download PDF

Info

Publication number
CN116630851A
Authority
CN
China
Prior art keywords
video
features
prediction
action
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310561856.5A
Other languages
Chinese (zh)
Inventor
王秀杰
周雨熙
李欣
张勇
盛明
任鹏
李宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Tianjin University of Technology
Beijing Tsinghua Changgeng Hospital
Original Assignee
Tsinghua University
Tianjin University of Technology
Beijing Tsinghua Changgeng Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Tianjin University of Technology, and Beijing Tsinghua Changgeng Hospital
Priority to CN202310561856.5A
Publication of CN116630851A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/0895 - Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/30 - ICT specially adapted for therapies or health-improving plans relating to physical therapies or activities, e.g. physiotherapy, acupressure or exercising
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Physical Education & Sports Medicine (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)

Abstract

A man-machine cooperation method, a man-machine cooperation device and a storage medium. The method includes obtaining a training set and a testing set; extracting video frame features of the training set and the testing set by using a preprocessing model, and storing the video frame features as a tensor array; obtaining an action positioning model based on the tensor array, and obtaining pseudo labels of the actions of people in the videos of the training set and the testing set through the action positioning model; inputting the pseudo labels of the actions into a first recognition and prediction model, and inputting the tensor array into a second recognition and prediction model, to obtain a recognition result and a prediction result of the actions of the person in the video; and judging the action intention of the person in the video by combining the action recognition result and the prediction result with a fusion strategy algorithm, so as to assist the intelligent health system in providing human-machine interaction support. The method can predict a person's action intention, provide early-warning guidance for patients with chronic diseases such as arthritis, provide auxiliary support for rehabilitation exercise, and promote the development of intelligent medical care.

Description

Man-machine cooperation method, man-machine cooperation device and storage medium
Technical Field
The disclosure relates to a human-computer collaboration method, a human-computer collaboration device and a storage medium for intelligent exercise rehabilitation based on video intention understanding.
Background
Video understanding aims to automatically identify and parse the content of videos through intelligent analysis technology. With the widespread deployment of video surveillance, faster video transmission, and growing video storage capacity in recent years, a large amount of video data has accumulated in medical scenarios, creating a need for tools that can effectively manage, analyze, and process video. Video understanding algorithms answer this need and have therefore attracted increasing attention in recent years.
For patients suffering from knee osteoarthritis, conservative treatment is of limited effect, while osteoarthritis is a chronic disease that progresses year by year and has a high disability rate. The most important way to slow its progression is correct, personalized rehabilitation exercise that strengthens the quadriceps (the four muscles around the knee), protects the knee joint, and relieves pain. If the exercise is performed improperly, however, the knee joint can be injured further, the disease can progress faster, and the risk of surgery can even increase. For rehabilitation therapy of such chronic diseases, having a dedicated caregiver accompany every session is costly.
Disclosure of Invention
At least one embodiment of the present disclosure provides a human-machine collaboration method for intelligent exercise rehabilitation based on video intent understanding, comprising: obtaining a training set and a testing set, wherein video is received from a video source as the training set and the testing set; extracting video frame features of the training set and the testing set by using a preprocessing model, and storing the video frame features as a tensor array; obtaining an action positioning model based on the tensor array, and obtaining pseudo labels of actions of people in the videos of the training set and the testing set through the action positioning model; inputting the pseudo labels of the actions obtained by the action positioning model into a first recognition and prediction model, and inputting the tensor array obtained by using the preprocessing model into a second recognition and prediction model, to obtain a recognition result and a prediction result of the actions of the person in the video; and judging the action intention of the person in the video by combining the action recognition result and the prediction result with a fusion strategy algorithm, so as to assist the intelligent health system in providing human-machine interaction support.
For example, in the man-machine collaboration method provided in at least one embodiment of the present disclosure, the first recognition and prediction model is a recognition and prediction model with coarse granularity at a semantic level, and the second recognition and prediction model is a recognition and prediction model with fine granularity at a feature level.
For example, in the man-machine collaboration method provided in at least one embodiment of the present disclosure, the video is an RGB video, an RGB-D video, or a gray-scale video, the resolution of the video is greater than or equal to 256×256, the format of the video is avi, wmv, mpeg, or mp4, and the size of the video does not exceed 3 GB.
For example, in a human-machine collaboration method provided by at least one embodiment of the present disclosure, extracting video frame features of the training set and the test set using a preprocessing model includes: calling OpenCV library functions and splitting the videos of the training set and the testing set into sets of video frames; dividing the video frame set into N non-overlapping frame segments and extracting RGB features and optical flow features of the N segments by applying the preprocessing model, wherein N is an integer greater than 0; concatenating the RGB features and the optical flow features along the channel dimension; and feeding the concatenated RGB and optical flow features to a temporal convolution layer and activating them using a ReLU activation function.
For example, in a human-computer collaboration method provided in at least one embodiment of the present disclosure, obtaining, by the action positioning model, pseudo tags of actions of a person in the videos of the training set and the test set includes: determining an embedded feature X and applying a fully connected (FC) layer to predict a temporal Class Activation Sequence (CAS); determining an attention module and outputting, through the attention module, attention weights for each time step of the video, wherein the two weight values of each time step are normalized by a softmax operation to obtain a foreground attention weight and a background attention weight, respectively; combining the temporal Class Activation Sequence (CAS) with the attention weights to obtain an attention-weighted temporal class activation sequence; and generating a video-level classification score according to a multi-instance learning (MIL) formulation through a top-k mean strategy, to obtain the video-level action probability.
For example, in the man-machine collaboration method provided in at least one embodiment of the present disclosure, the fine-grained convolution of the fine-grained recognition and prediction model models at the video frame level, and the prediction result at the video-frame feature level is obtained from the frame-level local spatio-temporal features through the feature-level fine-grained recognition and prediction network:

ŷ_feat = FeatureNet(F) ∈ R^(C+1),

wherein C represents the number of classes, C+1 represents the total number of classes including the background class, FeatureNet() represents the feature-level fine-grained recognition and prediction network, F represents the tensor array, the frame-level local spatio-temporal features include the RGB features and the optical flow features, and ŷ_feat represents the prediction result at the video-frame feature level;

wherein the coarse-grained convolution of the coarse-grained recognition and prediction model models at the action semantic level, and the semantic-level prediction result is obtained from the action-level semantic features through the coarse-grained recognition and prediction network:

ŷ_pseudo = PseudoLabelNet(c_i) ∈ R^(C+1),

wherein PseudoLabelNet(c_i) represents the coarse-grained recognition and prediction network, c_i represents the action-level semantic features, the action-level semantic features include the pseudo tags obtained through action positioning, and ŷ_pseudo represents the prediction result based on the pseudo-tag information obtained by the PseudoLabelNet() network.
For example, in the human-computer collaboration method provided in at least one embodiment of the present disclosure, the fusion strategy algorithm includes early fusion, mid-term fusion, and late fusion.

In the early fusion, the RGB features and the optical flow features are concatenated, and the concatenated RGB and optical flow features are sent to FeatureNet() to obtain:

ŷ_early = FeatureNet(concat(F_rgb, F_flow)), concat(F_rgb, F_flow) ∈ R^(T×D),

wherein F_rgb and F_flow represent the RGB feature tensor array and the optical flow feature tensor array, concat(F_rgb, F_flow) represents the concatenation of the RGB features and the optical flow features, T represents the number of segments of the video frame sequence, D represents the total feature dimension, and ŷ_early represents the prediction result based on the RGB features and the optical flow features obtained by the feature-level FeatureNet().

In the mid-term fusion, the RGB features, the optical flow features, and the pseudo tags are fused using an attention mechanism before being input into the fully connected (FC) layer of FeatureNet(), to obtain:

ŷ_mid = FC(Attention(FeatureNet_withoutfc(F_rgb), FeatureNet_withoutfc(F_flow), c_i)),

wherein Attention() represents the attention mechanism, FeatureNet_withoutfc(F_rgb) indicates that FeatureNet() is applied without its last fully connected (FC) layer, FeatureNet_withoutfc(F_rgb) and FeatureNet_withoutfc(F_flow) represent the branch outputs obtained for the RGB features and the optical flow features by FeatureNet_withoutfc(), respectively, and ŷ_mid represents the result of first computing the attention-weighted fusion of the branches and then feeding it into the fully connected (FC) layer.

In the late fusion, after the prediction probabilities are obtained, the RGB features, the optical flow features, and the pseudo tags are fused using the attention mechanism to obtain:

ŷ_late = Attention(ŷ_rgb, ŷ_flow, ŷ_pseudo),

wherein ŷ_late represents the decision-level prediction result obtained by applying the attention mechanism to the two feature-level prediction results ŷ_rgb and ŷ_flow and to the pseudo-tag-based prediction result ŷ_pseudo.
The present disclosure also provides, in at least one embodiment, a human-computer collaboration device for intelligent exercise rehabilitation based on video intent understanding, including: a first obtaining unit configured to obtain a training set and a test set, wherein video is to be received from a video source as the training set and the test set; a second obtaining unit configured to extract video frame features of the training set and the test set using a preprocessing model, and store the video frame features as a tensor array; the third obtaining unit is configured to obtain an action positioning model based on the tensor array, and obtain pseudo labels of actions of people in videos of the training set and the testing set through the action positioning model; the recognition and prediction unit is configured to input the pseudo tag of the motion obtained through the motion positioning model into a first recognition and prediction model, and input a tensor array obtained through the preprocessing model into a second recognition and prediction model so as to obtain a recognition result and a prediction result of the motion of the person in the video; and the judging unit is configured to judge the action intention of the person in the video by utilizing the action recognition result and the prediction result and combining a fusion strategy algorithm so as to assist the intelligent health system to make man-machine interaction support.
At least one embodiment of the present disclosure also provides a storage medium storing non-transitory computer readable instructions that, when executed by a computer, implement the human-machine collaboration method provided by any of the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments are briefly described below. It is apparent that the drawings in the following description relate only to some embodiments of the present invention and are not limiting of the present invention.
Fig. 1 is a schematic diagram of a human-computer collaboration method for intelligent exercise rehabilitation based on video intent understanding according to at least one embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a human-computer collaboration method for intelligent exercise rehabilitation based on video intention understanding according to at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a human-machine system device according to at least one embodiment of the present disclosure;
fig. 4 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The words "comprising," "comprises," and the like mean that the elements or items preceding the word encompass the elements or items listed after the word and their equivalents, but do not exclude other elements or items. The terms "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," etc. are used merely to indicate relative positional relationships, which may change when the absolute position of the object being described changes.
With the popularization of monitoring systems, the patient's actions can be monitored during rehabilitation exercises such as wall squats and pedaling an exercise bicycle. When an emergency occurs, for example when the patient has squatted too long and cannot stand up, or has fallen from the bicycle and cannot get up, the present disclosure can understand the person's action intention in time and issue an early warning, thereby avoiding secondary injury to the patient's joints. The method and device can predict the intention of people in the video, provide early-warning guidance for patients with chronic diseases such as knee osteoarthritis, provide auxiliary support for rehabilitation exercise, and promote the development of intelligent medical care.
At least one embodiment of the present disclosure provides a human-machine collaboration method for intelligent exercise rehabilitation based on video intent understanding, comprising: obtaining a training set and a testing set, wherein video is received from a video source as the training set and the testing set; extracting video frame features of the training set and the testing set by using a preprocessing model, and storing the video frame features as a tensor array; obtaining an action positioning model based on the tensor array, and obtaining pseudo labels of actions of people in the videos of the training set and the testing set through the action positioning model; inputting the pseudo labels of the actions obtained by the action positioning model into a first recognition and prediction model, and inputting the tensor array obtained by using the preprocessing model into a second recognition and prediction model, to obtain a recognition result and a prediction result of the actions of the person in the video; and judging the action intention of the person in the video by combining the action recognition result and the prediction result with a fusion strategy algorithm, so as to assist the intelligent health system in providing human-machine interaction support.
In the whole prediction process, the present disclosure designs three fusion strategies: early fusion, mid-term fusion, and late fusion. The three fusion strategies help select the branch scores with high confidence and complete the intention decision, and the generated intention decision result is forwarded through a cloud server to the intelligent terminals of doctors, caregivers, or relatives, so as to give early warning of accidents the patient may encounter.
The man-machine cooperation method provided by the present disclosure uses the surveillance video of a knee osteoarthritis patient, analyzes the patient's behavior in the video online, judges whether unexpected situations such as falling or squatting too long without being able to stand up have occurred, and predicts whether the patient intends to ask for help. The prediction result is sent through the server to the intelligent terminal of a doctor, caregiver, or relative, providing early warning of emergencies the patient may face and thereby realizing intelligent auxiliary support for rehabilitation exercise.
Currently, among mainstream video-based deep learning methods, action recognition algorithms are relatively mature. Action recognition detects and classifies human actions from a temporal sequence (video) that contains the complete execution of the action. In contrast, action prediction algorithms automatically determine the action that is about to occur from a temporally incomplete sequence (video frames). Human action recognition typically requires observing an entire video and inferring an action label from it, and is generally used in non-emergency situations, such as assessing a patient's motor ability. Action prediction infers a result before the action is completed; it can predict a person's intention without observing the whole video or the whole action, and thus can issue an early warning. It is generally applied in scenarios with real-time requirements. For example, in the falling scenario of a knee osteoarthritis patient, the patient's video can be monitored, the patient's movements predicted, and the patient's intention understood, so as to avoid dangerous accidents.
The datasets of current action prediction and intent understanding algorithms typically consist of trimmed videos, in which the complete video is processed, according to the actions it contains, into clips that contain only the action against a noise-free background (e.g., a clean environment without other people). Such video clips are unnatural, require a great deal of manual annotation, make dataset construction expensive, and the resulting models cannot be used directly on natural surveillance video. The action prediction and intent understanding algorithms of the present disclosure are based on untrimmed video, which is natural and readily available. In particular, the present disclosure replaces manual annotation with an action localization algorithm that locates the boundaries and names of the actions in the video, thereby saving a great deal of manual labeling cost. The intent understanding algorithm of the present disclosure considers both fine-grained features at the video frame level and coarse-grained features at the action semantic level, and achieves comparable or better performance than similar methods based on trimmed video while saving manual annotation.
Embodiments of the present disclosure and some examples thereof are described in detail below with reference to the attached drawings.
Fig. 1 is a schematic diagram of a human-computer collaboration method for intelligent exercise rehabilitation based on video intent understanding according to at least one embodiment of the present disclosure; fig. 2 is a schematic flowchart of a human-computer collaboration method for intelligent exercise rehabilitation based on untrimmed-video intention understanding according to at least one embodiment of the present disclosure. As shown in fig. 1, the method includes steps S1-S5. The man-machine cooperation method provided by the embodiments of the present disclosure is described below with reference to fig. 1 and fig. 2.
Step S1: a training set and a test set are obtained, wherein video is to be received from a video source as the training set and the test set.
For example, the video source may include a monitoring system or another video source. As shown in fig. 2, video from a monitoring system or another video source may be received as the training set and the test set. The video may be an RGB video, an RGB-D video, a gray-scale video, or the like; the resolution may be, for example, 256×256 or higher, and the format may be avi, wmv, mpeg, mp4, or the like. The size of the video does not exceed 3 GB.
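These constraints can be checked automatically before a video enters the training or test set. The following is a minimal sketch assuming OpenCV is used to probe the resolution; the function name and thresholds are illustrative and are not part of the disclosure.

```python
# Minimal sketch (assumption, not part of the disclosure): reject videos that do not
# satisfy the format, size, and resolution constraints described above.
import os
import cv2

ALLOWED_EXTENSIONS = {".avi", ".wmv", ".mpeg", ".mp4"}
MAX_SIZE_BYTES = 3 * 1024 ** 3       # "does not exceed 3 GB"
MIN_RESOLUTION = 256                 # resolution >= 256 x 256

def is_acceptable_video(path: str) -> bool:
    """Return True if the file satisfies the format, size, and resolution constraints."""
    if os.path.splitext(path)[1].lower() not in ALLOWED_EXTENSIONS:
        return False
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        return False
    capture = cv2.VideoCapture(path)
    try:
        width = capture.get(cv2.CAP_PROP_FRAME_WIDTH)
        height = capture.get(cv2.CAP_PROP_FRAME_HEIGHT)
        return capture.isOpened() and width >= MIN_RESOLUTION and height >= MIN_RESOLUTION
    finally:
        capture.release()
```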
Step S2: and extracting video frame characteristics of the training set and the testing set by using the preprocessing model, and storing the video frame characteristics as a tensor array.
For example, the step S2 includes:
calling opencv library functions, and splitting videos of the training set and the testing set into video frame sets;
dividing the video frame set into N non-overlapping frame fragments, and extracting RGB and optical flow characteristics by applying a preprocessing model;
connecting the RGB and optical flow features along a channel dimension;
the RGB and optical-flow features are fed to a temporal convolution layer and activated using a ReLU activation function.
Specifically, referring to fig. 2, the OpenCV library function is invoked to first split the video into video frames:
Images=Video2Frame(Video) (1)
wherein Video2Frame() represents an OpenCV library function that splits a video into video frames, Video represents a video of the training set or the test set, and Images represents the resulting sequence of video frames.
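A minimal sketch of the Video2Frame() step in equation (1) is given below, assuming the OpenCV Python bindings; the function name mirrors the formula, while the implementation details are an assumption.

```python
# Minimal sketch of equation (1): decode the video frame by frame into a list of images.
import cv2

def video_to_frames(video_path: str):
    """Split a video file into a list of frames (Video2Frame in equation (1))."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()   # returns (success_flag, frame)
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames
```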
The untrimmed video frame set is then divided into 16 non-overlapping frame segments, and the RGB and optical flow features are extracted using an I3D model pre-trained on Kinetics-400:

F_rgb = I3D_rgb(Images) ∈ R^(T×D_1),  (2)

F_flow = I3D_flow(Images) ∈ R^(T×D_2),  (3)

wherein I3D_rgb() and I3D_flow() represent the backbone neural networks of the Kinetics-400 pre-trained I3D model used to extract the RGB features and the optical flow features, T represents the number of segments of the video frame sequence, D_1 represents the dimension of the RGB features, D_2 represents the dimension of the optical flow features, and F_rgb and F_flow represent the RGB features and the optical flow features extracted by the Kinetics-400 pre-trained I3D model, respectively.
Then, the RGB and optical flow features are concatenated:

F′ = concat(F_rgb, F_flow) ∈ R^(T×D), D = D_1 + D_2,  (4)

wherein D represents the total feature dimension, concat(F_rgb, F_flow) represents the concatenation of the RGB and optical flow features along the channel dimension, and F′ represents the concatenated RGB and optical flow features. The feature F′ is fed to a temporal convolution layer and activated using ReLU:

F = ReLU(conv(F′)) ∈ R^(T×D),  (5)

wherein ReLU() is a nonlinear activation function and conv() represents a convolution operation. F ∈ R^(T×D) is the tensor array finally stored in step S2.
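The preprocessing of equations (2)-(5) can be sketched in PyTorch as follows, assuming i3d_rgb and i3d_flow stand for pre-trained I3D backbones that return per-segment features of dimensions D_1 and D_2; the temporal convolution and ReLU correspond to equation (5). This is an illustrative reconstruction, not the exact patented implementation.

```python
# Minimal PyTorch sketch of equations (2)-(5); backbone modules and layer sizes are assumptions.
import torch
import torch.nn as nn

class FeaturePreprocessor(nn.Module):
    def __init__(self, i3d_rgb: nn.Module, i3d_flow: nn.Module, d1: int, d2: int):
        super().__init__()
        self.i3d_rgb = i3d_rgb
        self.i3d_flow = i3d_flow
        # Temporal convolution over the segment axis, keeping dimension D = D1 + D2.
        self.temporal_conv = nn.Conv1d(d1 + d2, d1 + d2, kernel_size=3, padding=1)

    def forward(self, rgb_segments: torch.Tensor, flow_segments: torch.Tensor) -> torch.Tensor:
        f_rgb = self.i3d_rgb(rgb_segments)        # (T, D1), equation (2)
        f_flow = self.i3d_flow(flow_segments)     # (T, D2), equation (3)
        f = torch.cat([f_rgb, f_flow], dim=-1)    # (T, D),  equation (4)
        f = f.transpose(0, 1).unsqueeze(0)        # (1, D, T) layout expected by Conv1d
        f = torch.relu(self.temporal_conv(f))     # equation (5)
        return f.squeeze(0).transpose(0, 1)       # back to (T, D): the stored tensor array F
```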
Step S3: and obtaining an action positioning model based on the tensor array, and obtaining pseudo labels of actions of the people in the videos of the training set and the testing set through the action positioning model.
For example, this step S3 includes:
determining an embedded feature X and applying a fully connected (FC) layer to predict a temporal Class Activation Sequence (CAS);
determining an attention module and outputting, through the attention module, attention weights for each time step of the untrimmed video, wherein the two weight values of each time step are normalized by a softmax operation to obtain the foreground and background attention weights, respectively;
combining the CAS and the attention weights to obtain an attention weighted time-class activation sequence;
and generating a video level classification score according to a multi-instance learning (MIL) formula and through a top-k mean strategy, and obtaining the action probability of the video level.
Specifically, as shown in fig. 2, these steps include the following. Given the embedded feature X, a fully connected (FC) layer is applied to predict the temporal Class Activation Sequence (CAS):

CAS = FC(F) ∈ R^(T×(C+1)),  (6)

where C represents the number of categories of all actions, C+1 additionally includes the background category, and FC() represents a linear mapping of F through a fully connected layer.
To better distinguish foreground segments from background segments, an additional attention module is introduced, which outputs an attention weight matrix A ∈ R^(T×2) for each time step of the untrimmed video. The two weight values of each time step are normalized by a softmax operation to obtain the foreground and background attention weights, respectively. Finally, the temporal Class Activation Sequence (CAS) is combined with the attention weights to obtain the attention-weighted temporal class activation sequences:

CAS_fg = A_fg ⊙ CAS, CAS_bg = A_bg ⊙ CAS,  (7)

wherein c denotes the class index, ⊙ denotes element-wise multiplication, fg denotes the foreground, bg denotes the background, A_fg represents the foreground attention matrix, and A_bg represents the background attention matrix. According to the multiple instance learning (MIL) formulation, a video-level classification score is generated by a top-k mean strategy: for each class c, the k maxima of the weighted temporal Class Activation Sequence (CAS) are taken and their average is computed:

s_c = Top-k(CAS_fg(:, c)),  (8)

wherein Top-k() represents the top-k mean strategy.
Softmax normalization is then performed over all categories to obtain the attention-weighted video-level action probabilities:

p = Softmax(s) ∈ R^(C+1),  (9)

where Softmax() represents softmax normalization and s denotes the vector of video-level classification scores s_c.
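Equations (6)-(9) can be sketched as follows: an FC layer predicts the CAS, the attention module produces foreground/background weights, and a top-k mean followed by softmax yields the video-level action probabilities. The layer sizes and the value of k are illustrative assumptions.

```python
# Minimal PyTorch sketch of equations (6)-(9); sizes and k are assumptions.
import torch
import torch.nn as nn

class WeakLocalizationHead(nn.Module):
    def __init__(self, feature_dim: int, num_classes: int, k: int = 8):
        super().__init__()
        self.cas_fc = nn.Linear(feature_dim, num_classes + 1)   # C + 1 with background
        self.attention_fc = nn.Linear(feature_dim, 2)           # foreground / background
        self.k = k

    def forward(self, features: torch.Tensor):
        # features: (T, D) embedded features
        cas = self.cas_fc(features)                              # (T, C+1), equation (6)
        attn = torch.softmax(self.attention_fc(features), -1)    # (T, 2), normalized weights
        a_fg, a_bg = attn[:, 0:1], attn[:, 1:2]
        cas_fg = cas * a_fg                                      # attention-weighted CAS, foreground
        cas_bg = cas * a_bg                                      # attention-weighted CAS, background
        # MIL top-k mean over time for each class, equation (8)
        k = min(self.k, cas_fg.shape[0])
        video_scores = cas_fg.topk(k, dim=0).values.mean(dim=0)  # (C+1,)
        video_probs = torch.softmax(video_scores, dim=0)         # equation (9)
        return cas_fg, cas_bg, video_probs
```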
Step S4: and inputting the pseudo tag of the motion obtained by the motion positioning model into a first recognition and prediction model, and inputting the tensor array obtained by using the preprocessing model into a second recognition and prediction model to obtain a recognition result and a prediction result of the motion of the person in the video.
For example, in some examples, the first recognition and prediction model is a semantic-level coarse-grained recognition and prediction model and the second recognition and prediction model is a feature-level fine-grained recognition and prediction model. The step S4 includes:
modeling at the video frame level with the fine-grained convolution of the fine-grained recognition and prediction model, and obtaining the prediction result at the video-frame feature level from the frame-level local spatio-temporal features through the feature-level fine-grained recognition and prediction network:

ŷ_feat = FeatureNet(F) ∈ R^(C+1),

wherein C represents the number of classes, C+1 represents the total number of classes including the background class, FeatureNet() represents the feature-level fine-grained recognition and prediction network, F represents the tensor array, and the frame-level local spatio-temporal features include the RGB features and the optical flow features;

wherein the coarse-grained convolution of the coarse-grained recognition and prediction model models at the action semantic level, and uses the action-level semantic features (e.g., the pseudo-tag information) to obtain the semantic-level prediction result through the coarse-grained recognition and prediction network:

ŷ_pseudo = PseudoLabelNet(c_i) ∈ R^(C+1),

wherein PseudoLabelNet(c_i) represents the coarse-grained recognition and prediction network, and c_i represents the action-level semantic features, the action-level semantic features including the pseudo tags obtained through action positioning.
Specifically, for example, the RGB features and optical flow features extracted by the feature preprocessing program are used as inputs. The fine-grained branch feeds the RGB features and the optical flow features into the action recognition and prediction network to perform the prediction task; the coarse-grained branch feeds the pseudo tags produced by the dynamic action localization algorithm into the prediction network to perform the prediction task.
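A minimal sketch of the two branches is given below: FeatureNet() consumes the frame-level tensor array F, while PseudoLabelNet() consumes the action-level (pseudo-tag) semantic features produced by the localization step. The layer shapes are assumptions, not the patented architecture.

```python
# Minimal sketch of the fine-grained and coarse-grained branches; layer sizes are assumptions.
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Fine-grained (feature-level) branch operating on frame-level features F of shape (T, D)."""
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.temporal_conv = nn.Conv1d(feature_dim, feature_dim, 3, padding=1)
        self.fc = nn.Linear(feature_dim, num_classes + 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.temporal_conv(f.t().unsqueeze(0)))  # (1, D, T)
        h = h.mean(dim=-1).squeeze(0)                           # temporal pooling -> (D,)
        return self.fc(h)                                       # (C+1,) class scores

class PseudoLabelNet(nn.Module):
    """Coarse-grained (semantic-level) branch operating on pseudo-label features c_i."""
    def __init__(self, semantic_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(semantic_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes + 1),
        )

    def forward(self, c_i: torch.Tensor) -> torch.Tensor:
        return self.fc(c_i)                                     # (C+1,) class scores
```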
Step S5: and judging the action intention of the person in the video by combining the action recognition result and the prediction result with a fusion strategy algorithm so as to assist the intelligent health system to make man-machine interaction support.
In the whole prediction process, three fusion strategies are designed: early fusion, mid-term fusion, and late fusion. The three fusion strategies help select the branch scores with high confidence and complete the intention decision. The individual modules are described as follows:
(1) Pseudo tag generation mechanism based on semi-supervised learning
The method adopts a semi-supervised time action positioning algorithm to automatically position actions, and uses the result of the positioning algorithm as a pseudo tag to guide training. The semi-supervision mode can greatly reduce the manual labeling work, obtain the characteristics of action semantic level and guide the prediction from the action semantic level.
Specifically, the present disclosure adds pseudo-tag supervision to the attention weights through a semi-supervised learning mechanism and generates a set of action segments for a video through the dynamic action boundary localization network ActionLocNet:

{(s_i, e_i, c_i, q_i)} = ActionLocNet(F),

wherein s_i and e_i are the start and end times of the i-th segment, and c_i and q_i are the corresponding class prediction and confidence score; c_i may also be regarded as the recognized current action class. The pseudo labels of the segments whose classification scores are greater than the classification threshold γ are fed into the classification network for training.
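The pseudo-label selection rule can be sketched as follows: the localization network yields segments (s_i, e_i, c_i, q_i), and only segments whose confidence exceeds the classification threshold γ are kept as pseudo labels for training. The data structure is an illustrative assumption.

```python
# Minimal sketch of the confidence-threshold filtering of localized segments.
from dataclasses import dataclass
from typing import List

@dataclass
class ActionSegment:
    start: float        # s_i, start time of the segment
    end: float          # e_i, end time of the segment
    action_class: int   # c_i, predicted action class
    confidence: float   # q_i, confidence score

def select_pseudo_labels(segments: List[ActionSegment], gamma: float) -> List[ActionSegment]:
    """Keep only segments whose confidence exceeds the classification threshold gamma."""
    return [seg for seg in segments if seg.confidence > gamma]
```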
(2) Coarse-fine granularity coordinated action time course prediction
The present disclosure proposes a binary convolutional neural network that enables a model to capture coarse-grained and fine-grained features simultaneously.
The fine-grained convolution models at the video frame level; the prediction result at the video-frame feature level is obtained from the frame-level local spatio-temporal features (RGB and optical flow) through the feature-level recognition and prediction network FeatureNet():

ŷ_feat = FeatureNet(F) ∈ R^(C+1),

where C represents the number of categories, C+1 represents the total number of categories including the background category, and FeatureNet() represents the fine-grained recognition and prediction network.

The coarse-grained convolution models at the high-level action semantic level; the semantic-level prediction result is obtained through the action-semantic-level recognition and prediction network PseudoLabelNet() using the action-level semantic features (the pseudo-tag information generated in step S3):

ŷ_pseudo = PseudoLabelNet(c_i) ∈ R^(C+1).
the above two aspects realize the modeling of the space-time dependency relationship and the semantic dependency relationship in the action level context in the vision at the same time.
(3) Uncertainty-oriented feature fusion mechanism
The present disclosure introduces an uncertainty mechanism that dynamically and preferentially selects the prediction result of each modality to make the intent decision. Specifically, the present disclosure designs three modality fusion strategies.
Early fusion: the RGB features and the optical flow features are fused in series before being put into the prediction network, i.e., the features are first concatenated and then fed into FeatureNet():

ŷ_early = FeatureNet(concat(F_rgb, F_flow)).
Mid-term fusion: for the RGB features, the optical flow features, and the pseudo-tag features, attention-based fusion is performed before the last fully connected layer of the prediction network (denoted FeatureNet_withoutfc):

ŷ_mid = FC(Attention(FeatureNet_withoutfc(F_rgb), FeatureNet_withoutfc(F_flow), c_i)),

wherein FeatureNet_withoutfc() indicates that FeatureNet() is applied without its last fully connected (FC) layer, and ŷ_mid represents the result of first computing the attention-weighted fusion of the branch representations and then feeding it into the fully connected (FC) layer.
Late fusion: for the RGB features, the optical flow features, and the pseudo-tag features, the attention mechanism is used for fusion after the prediction probabilities have been obtained:

ŷ_late = Attention(ŷ_rgb, ŷ_flow, ŷ_pseudo),

wherein Attention() represents the attention mechanism, ŷ_rgb and ŷ_flow are the two feature-level prediction results, and ŷ_pseudo is the pseudo-tag-based prediction result.
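The three fusion strategies can be sketched as follows, reusing the FeatureNet and PseudoLabelNet modules sketched earlier; the attention weighting is shown here as a learned softmax over branch outputs, which is an assumption about the unspecified Attention() operator.

```python
# Minimal sketch of the early / mid-term / late fusion strategies; the attention
# operator is assumed to be a learned softmax over branch weights.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, hidden_dim: int, num_classes: int, num_branches: int = 3):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes + 1)
        # One scalar weight per branch, normalized with softmax (attention assumption).
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))

    def early_fusion(self, feature_net, f_rgb, f_flow):
        # Concatenate RGB and optical-flow features, then apply FeatureNet() to the joint tensor.
        return feature_net(torch.cat([f_rgb, f_flow], dim=-1))

    def mid_fusion(self, h_rgb, h_flow, h_pseudo):
        # h_* are branch representations before the last FC layer (FeatureNet_withoutfc outputs
        # and the pseudo-tag representation); fuse them with attention, then apply the FC layer.
        w = torch.softmax(self.branch_logits, dim=0)
        return self.fc(w[0] * h_rgb + w[1] * h_flow + w[2] * h_pseudo)

    def late_fusion(self, p_rgb, p_flow, p_pseudo):
        # p_* are per-branch prediction probabilities; fuse them at the decision level.
        w = torch.softmax(self.branch_logits, dim=0)
        return w[0] * p_rgb + w[1] * p_flow + w[2] * p_pseudo
```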
For example, the obtained action intention of the person is sent by the server, through a cloud server, to the networked intelligent terminal of a doctor, caregiver, or relative, so as to give early warning of possible accidents of the patient.
Fig. 3 is a schematic block diagram of a human-machine collaboration device for intelligent exercise rehabilitation based on video intent understanding according to at least one embodiment of the present disclosure. For example, in the example shown in fig. 3, the human-machine collaboration device 100 includes a first acquisition unit 110, a second acquisition unit 120, a third acquisition unit 130, a recognition and prediction unit 140, and a judgment unit 150. For example, these units may be implemented by hardware (e.g., circuit) modules or software modules; the same applies to the following embodiments, which will not be repeated. For example, these units may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a field-programmable gate array (FPGA), or another form of processing unit having data processing and/or instruction execution capabilities, together with corresponding computer instructions.
The first acquisition unit 110 is configured to acquire a training set and a test set. For example, video will be received from a video source as a training set and a test set. For example, the first obtaining unit 110 may implement step S1, and a specific implementation method thereof may refer to a description related to step S1, which is not described herein.
The second obtaining unit 120 is configured to extract video frame features of the training set and the test set using a preprocessing model, and store the video frame features as a tensor array. For example, the second obtaining unit 120 may implement step S2, and a specific implementation method thereof may refer to a description related to step S2, which is not described herein.
The third obtaining unit 130 is configured to obtain an action positioning model based on the tensor array, and obtain pseudo tags of actions of the person in the videos of the training set and the testing set through the action positioning model. For example, the third obtaining unit 130 may implement step S3, and a specific implementation method thereof may refer to a description related to step S3, which is not described herein.
The recognition and prediction unit 140 is configured to input the pseudo labels of the actions obtained by the action positioning model into the first recognition and prediction model, and to input the tensor array into the second recognition and prediction model, so as to obtain the recognition result and the prediction result of the actions of the person in the video. For example, the recognition and prediction unit 140 may implement step S4; for a specific implementation, reference may be made to the related description of step S4, which is not repeated here.
The judging unit 150 is configured to judge the action intention of the person in the video by using the action recognition result and the prediction result in combination with a fusion strategy algorithm, so as to assist the intelligent health system in providing human-machine interaction support. For example, the judging unit 150 may implement step S5; for a specific implementation, reference may be made to the related description of step S5, which is not repeated here.
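The five units can be organized in software as sketched below; the unit names mirror the description above, and the callables they wrap are placeholders standing in for the models of steps S1-S5, i.e., an assumption rather than the disclosed implementation.

```python
# Minimal sketch of the device 100 as a software object; the wrapped callables are assumptions.
class HumanMachineCollaborationDevice:
    def __init__(self, preprocess_model, localization_model,
                 coarse_model, fine_model, fusion_strategy):
        self.preprocess_model = preprocess_model      # second acquisition unit 120
        self.localization_model = localization_model  # third acquisition unit 130
        self.coarse_model = coarse_model              # recognition and prediction unit 140 (coarse branch)
        self.fine_model = fine_model                  # recognition and prediction unit 140 (fine branch)
        self.fusion_strategy = fusion_strategy        # judging unit 150

    def infer_intention(self, video_path: str):
        """Run steps S2-S5 on a single video and return the intention decision."""
        features = self.preprocess_model(video_path)         # tensor array F, step S2
        pseudo_labels = self.localization_model(features)    # step S3
        coarse_pred = self.coarse_model(pseudo_labels)        # step S4, coarse-grained branch
        fine_pred = self.fine_model(features)                 # step S4, fine-grained branch
        return self.fusion_strategy(coarse_pred, fine_pred)   # step S5, fused intention decision
```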
It should be noted that, in the embodiments of the present disclosure, the human-machine collaboration device 100 may include more or fewer circuits or units, and the connection relationships between these circuits or units are not limited and may be determined according to actual requirements. The specific configuration of each circuit is not limited, and each circuit may be composed of analog devices, digital chips, or other suitable components according to the circuit principle.
At least one embodiment of the present disclosure also provides a storage medium. Fig. 4 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure. For example, as shown in fig. 4, the storage medium 400 non-transitory stores computer readable instructions 401 that, when executed by a computer (including a processor), may perform the human-machine collaboration method provided by any of the embodiments of the present disclosure.
For example, the storage medium may be any combination of one or more computer-readable storage media, such as one containing computer-readable program code for obtaining a training set and a testing set, another containing computer-readable program code for extracting video frame features of the training set and the testing set using a preprocessing model and storing the video frame features as tensor arrays, and another containing computer-readable program code for obtaining a motion localization model based on the tensor arrays and obtaining pseudo tags of human motions in the videos of the training set and the testing set by the motion localization model. For example, when the program code is read by a computer, the computer may execute the program code stored in the computer storage medium, performing a human-machine collaboration method such as provided by any of the embodiments of the present disclosure.
For example, the storage medium may include a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a flash memory, any combination of the foregoing, or other suitable storage media.
The following points need to be described:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the disclosure, which is defined by the appended claims.

Claims (9)

1. A human-machine collaboration method for intelligent exercise rehabilitation based on video intent understanding, comprising:
obtaining a training set and a testing set, wherein videos received from a video source are used as the training set and the testing set;
extracting video frame characteristics of the training set and the testing set by using a preprocessing model, and storing the video frame characteristics as a tensor array;
based on the tensor array, obtaining an action positioning model, and obtaining pseudo labels of actions of people in videos of the training set and the testing set through the action positioning model;
inputting the pseudo tags of the actions obtained by the action positioning model into a first recognition and prediction model, and inputting the tensor array into a second recognition and prediction model, to obtain a recognition result and a prediction result of the actions of the person in the video;
and judging the action intention of the person in the video by combining the action recognition result and the prediction result with a fusion strategy algorithm so as to assist the intelligent health system to make man-machine interaction support.
2. The human-machine collaboration method of claim 1, wherein the first recognition and prediction model is a semantic-level coarse-grained recognition and prediction model and the second recognition and prediction model is a feature-level fine-grained recognition and prediction model.
3. The human-machine collaboration method of claim 1, wherein the video is an RGB video, an RGB-D video, or a gray-scale video, the resolution of the video is greater than or equal to 256×256, the format of the video is avi, wmv, mpeg, or mp4, and the size of the video does not exceed 3 GB.
4. A human-machine collaboration method as claimed in claim 3 wherein extracting video frame features of the training set and the test set using a preprocessing model comprises:
calling opencv library functions, and splitting videos of the training set and the testing set into video frame sets;
dividing the video frame set into N non-overlapping frame fragments, and extracting RGB features and optical flow features of the N frame fragments by applying the preprocessing model, wherein N is an integer greater than 0;
connecting the RGB features and the optical flow features along a channel dimension;
the RGB features and the optical flow features are fed to a temporal convolution layer and activated using a ReLU activation function.
5. A human-machine collaboration method as claimed in claim 3 wherein deriving pseudo tags of human actions in the videos of the training set and the test set by the action localization model comprises:
determining an embedded feature X and applying a fully connected (FC) layer to predict a temporal Class Activation Sequence (CAS);
determining an attention module and outputting an attention weight for each time step of the video through the attention module, wherein two weight values of each time step are normalized through softmax operation to obtain a foreground attention weight and a background attention weight respectively;
combining the time-class activation sequence and the attention weights to obtain an attention-weighted time-class activation sequence;
and generating a video level classification score according to a multi-instance learning (MIL) formula and through a top-k mean strategy, and obtaining the action probability of the video level.
6. The human-machine collaboration method of claim 4, wherein the fine-grained convolution of the fine-grained recognition and prediction model models at the video frame level, and the prediction result at the video-frame feature level is obtained from the frame-level local spatio-temporal features through the feature-level fine-grained recognition and prediction network:

ŷ_feat = FeatureNet(F) ∈ R^(C+1),

wherein C represents the number of classes, C+1 represents the total number of classes including the background class, FeatureNet() represents the feature-level fine-grained recognition and prediction network, F represents the tensor array, and the frame-level local spatio-temporal features include the RGB features and the optical flow features;

wherein the coarse-grained convolution of the coarse-grained recognition and prediction model models at the action semantic level, and the semantic-level prediction result is obtained from the action-level semantic features through the coarse-grained recognition and prediction network:

ŷ_pseudo = PseudoLabelNet(c_i) ∈ R^(C+1),

wherein PseudoLabelNet(c_i) represents the coarse-grained recognition and prediction network, c_i represents the action-level semantic features, the action-level semantic features include the pseudo tags obtained through action positioning, and ŷ_pseudo represents the prediction result based on the pseudo-tag information obtained by the coarse-grained recognition and prediction network.
7. The human-machine collaboration method of claim 4, wherein the fusion strategy algorithm comprises: early fusion, mid-term fusion, and late fusion;

wherein in the early fusion, the RGB features and the optical flow features are concatenated, and the concatenated RGB features and optical flow features are sent to the feature-level fine-grained recognition and prediction network FeatureNet() to obtain:

ŷ_early = FeatureNet(concat(F_rgb, F_flow)), concat(F_rgb, F_flow) ∈ R^(T×D),

wherein F_rgb and F_flow represent the RGB feature tensor array and the optical flow feature tensor array, concat(F_rgb, F_flow) represents the concatenation of the RGB features and the optical flow features, T represents the number of segments of the video frame sequence, D represents the total feature dimension, and ŷ_early represents the prediction result based on the RGB features and the optical flow features obtained by the feature-level fine-grained recognition and prediction network;

in the mid-term fusion, before the input of the fully connected (FC) layer of the feature-level fine-grained recognition and prediction network FeatureNet(), the RGB features, the optical flow features, and the pseudo tags are fused using an attention mechanism to obtain the following formula:

ŷ_mid = FC(Attention(FeatureNet_withoutfc(F_rgb), FeatureNet_withoutfc(F_flow), c_i)),

wherein Attention() represents the attention mechanism, FeatureNet_withoutfc(F_rgb) indicates that the FeatureNet() does not use its last fully connected (FC) layer, FeatureNet_withoutfc(F_rgb) and FeatureNet_withoutfc(F_flow) represent the branch outputs obtained for the RGB features and the optical flow features by FeatureNet_withoutfc(), respectively, and ŷ_mid represents the result of first computing the attention-weighted fusion of the branches and then feeding it into the fully connected (FC) layer;

in the late fusion, after the prediction probabilities are obtained, the RGB features, the optical flow features, and the pseudo tags are fused using the attention mechanism to obtain the following formula:

ŷ_late = Attention(ŷ_rgb, ŷ_flow, ŷ_pseudo),

wherein ŷ_late represents the decision-level prediction result obtained by applying the attention mechanism to the two feature-level prediction results ŷ_rgb and ŷ_flow and to the pseudo-tag-based prediction result ŷ_pseudo.
8. A video intent understanding-based intelligent sports rehabilitation human-machine collaboration device, comprising:
a first obtaining unit configured to obtain a training set and a test set, wherein video received from a video source is taken as the training set and the test set;
a second obtaining unit configured to extract video frame features of the training set and the test set using a preprocessing model, and store the video frame features as a tensor array;
the third obtaining unit is configured to obtain an action positioning model based on the tensor array, and obtain pseudo labels of actions of people in videos of the training set and the testing set through the action positioning model;
the identification and prediction unit is configured to input the pseudo tag of the action obtained through the action positioning model into a first identification and prediction model, and input the tensor array into a second identification and prediction model so as to obtain an identification result and a prediction result of the action of the person in the video;
and the judging unit is configured to judge the action intention of the person in the video by utilizing the action recognition result and the prediction result in combination with a fusion strategy algorithm so as to assist the intelligent health system to make man-machine interaction support.
9. A storage medium storing non-transitory computer readable instructions which, when executed by a computer, implement the human-machine co-ordination method as claimed in any one of claims 1 to 7.
CN202310561856.5A 2023-05-18 2023-05-18 Man-machine cooperation method, man-machine cooperation device and storage medium Pending CN116630851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310561856.5A CN116630851A (en) 2023-05-18 2023-05-18 Man-machine cooperation method, man-machine cooperation device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310561856.5A CN116630851A (en) 2023-05-18 2023-05-18 Man-machine cooperation method, man-machine cooperation device and storage medium

Publications (1)

Publication Number Publication Date
CN116630851A 2023-08-22

Family

ID=87612734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310561856.5A Pending CN116630851A (en) 2023-05-18 2023-05-18 Man-machine cooperation method, man-machine cooperation device and storage medium

Country Status (1)

Country Link
CN (1) CN116630851A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117423032A (en) * 2023-10-20 2024-01-19 大连理工大学 Time sequence dividing method for human body action with space-time fine granularity, electronic equipment and computer readable storage medium
CN117423032B (en) * 2023-10-20 2024-05-10 大连理工大学 Time sequence dividing method for human body action with space-time fine granularity, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN111310562B (en) Vehicle driving risk management and control method based on artificial intelligence and related equipment thereof
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN111770317B (en) Video monitoring method, device, equipment and medium for intelligent community
CN116630851A (en) Man-machine cooperation method, man-machine cooperation device and storage medium
CN111914676A (en) Human body tumbling detection method and device, electronic equipment and storage medium
WO2023040146A1 (en) Behavior recognition method and apparatus based on image fusion, and electronic device and medium
CN112330624A (en) Medical image processing method and device
CN113569627A (en) Human body posture prediction model training method, human body posture prediction method and device
Pérez-García et al. Transfer learning of deep spatiotemporal networks to model arbitrarily long videos of seizures
CN114187561A (en) Abnormal behavior identification method and device, terminal equipment and storage medium
Dutta Facial Pain Expression Recognition in Real‐Time Videos
De Melo et al. Facial expression analysis using decomposed multiscale spatiotemporal networks
Yoon et al. Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning
CN115292439A (en) Data processing method and related equipment
CN116453226A (en) Human body posture recognition method and device based on artificial intelligence and related equipment
CN117011932A (en) Running behavior detection method, electronic device and storage medium
CN116778579A (en) Multi-person gesture recognition method and device, storage medium and electronic equipment
CN116719904A (en) Information query method, device, equipment and storage medium based on image-text combination
CN113239814B (en) Facial expression recognition method, device, equipment and medium based on optical flow reconstruction
Yee et al. Apex frame spotting using attention networks for micro-expression recognition system
CN111582404B (en) Content classification method, device and readable storage medium
WO2022159214A2 (en) Fusion-based sensing intelligence and reporting
CN111275035B (en) Method and system for identifying background information
Qiu et al. Driving anomaly detection using conditional generative adversarial network
Balasubramanian et al. A hybrid deep learning for patient activity recognition (PAR): Real time body wearable sensor network from healthcare monitoring system (HMS)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination