CN114118167B - Action sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning - Google Patents


Info

Publication number
CN114118167B
CN114118167B (application CN202111471435.0A)
Authority
CN
China
Prior art keywords
sample
data
self
sensor
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111471435.0A
Other languages
Chinese (zh)
Other versions
CN114118167A (en)
Inventor
肖春静
陈世名
韩艳会
康红霞
王一凡
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202111471435.0A priority Critical patent/CN114118167B/en
Publication of CN114118167A publication Critical patent/CN114118167A/en
Application granted granted Critical
Publication of CN114118167B publication Critical patent/CN114118167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F2218/08 Feature extraction (aspects of pattern recognition specially adapted for signal processing)
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 Combinations of networks (neural network architecture)
    • G06N3/08 Neural network learning methods
    • G06F2218/12 Classification; Matching

Abstract

The invention belongs to the technical field of sensor-based action recognition, and discloses an action sequence segmentation method for behavior recognition based on self-supervised few-sample learning, comprising the following steps: constructing a self-supervised few-sample action sequence segmentation framework, SFTSeg; the framework is based on a twin neural network and takes a large number of labeled samples from the source sensors, a small number of labeled samples from the target sensor, and unlabeled samples from the target sensor as input data; a cross-entropy loss function, a consistency regularization loss function, and a self-supervised loss function are respectively constructed to train the twin neural network; the trained SFTSeg then performs state-label prediction and activity segmentation. The invention achieves good activity segmentation across different sensors in different scenes, requiring only a few labeled samples from the target sensor.

Description

Action sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning
Technical Field
The invention belongs to the technical field of sensor-based action recognition, and particularly relates to an action sequence segmentation method for behavior recognition based on self-supervised few-sample learning.
Background
Human activity recognition is considered a key component of many emerging Internet-of-Things applications, such as smart homes and healthcare, and its effectiveness hinges on activity segmentation. Continuously received sensor data are typically divided into sub-sequences, each corresponding to a single activity, before activity classification; the segmentation results are then fed into a classification model for behavior recognition and therefore have a significant impact on classification performance. Accordingly, a great deal of research has been conducted on activity segmentation, covering both unsupervised methods and supervised models.
For unsupervised methods in the activity segmentation task: both CPD (change point detection) and threshold-based approaches require a threshold to distinguish activity boundaries. However, choosing the optimal threshold demands considerable user experience and depends on the actual scene. Furthermore, time-shape-based methods (such as FLOSS) require problem-specific information to determine temporal constraint parameters and are therefore relatively environment-dependent.
For supervised methods in the activity segmentation task: although they can alleviate the subjectivity and environment-dependence problems, they require a large number of labeled target-sensor samples to train the model, which is time-consuming and labor-intensive and not always feasible under practical constraints.
Disclosure of Invention
To address the problems that existing activity segmentation methods depend on the environment and require a large number of labeled samples for model training, the invention provides an action sequence segmentation method for behavior recognition based on self-supervised few-sample learning, which achieves good activity segmentation across different sensors and scenes while requiring only a few labeled samples from the target sensor.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a motion sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning comprises the following steps:
step 1: constructing a self-supervision small sample action sequence segmentation framework SFTSeg; the frame is based on a twin neural network, and takes marked samples of a large number of source sensors, marked samples of a small number of target sensors and unmarked samples of the target sensors as input data; the marking sample of the source sensor and the marking sample of the target sensor are respectively provided with four state labels, namely a static state, a starting state, a moving state and an ending state; the samples refer to a sequence of actions derived from the sensor data;
Step 2: constructing a cross entropy loss function for a marked sample of the source sensor to train a twin neural network;
step 3: for the marked sample of the target sensor, taking the marked sample of the source sensor as disturbance, injecting the disturbance into the marked sample of the target sensor as enhancement data, and constructing a consistency regularization loss function to train the twin neural network;
step 4: constructing a positive sample pair and a negative sample pair based on unlabeled samples of the target sensor, and training a twin neural network based on the constructed self-supervision loss function so that the twin neural network can capture characteristics of unlabeled samples of the target sensor;
step 5: and (3) obtaining a trained SFTSeg through the steps 1-4, inputting a sample of the target sensor as a test sample into the trained SFTSeg, firstly predicting a state label of the test sample by the trained SFTSeg, and then carrying out activity segmentation on the test sample according to the predicted state label.
Further, the step 3 includes:
the enhancement data is constructed according to the following rules:
A. the compressed marked sample of the source sensor used as a disturbance is of the same class as the marked sample of the target sensor;
B. Adding the compressed marked sample of the source sensor to the marked sample of the target sensor according to the warping path; the warp path is generated by a dynamic time warping algorithm.
Further, the step 4 includes:
discretizing the action sequence into overlapping windows of fixed size w by using a sliding window with sliding step length l;
two windows are considered to be positive sample pairs if they meet the following constraints: two windows are adjacent; the two windows contain the same number of change points, and the difference of the two windows does not contain any change points;
two windows are considered to be a negative-sample pair if they meet the following constraints: the two windows are spaced apart by more than a given minimum time distance; the two windows contain different numbers of change points; the change point is a time point when the action sequence behavior suddenly changes.
Further, the step 4 further includes:
for positive sample pairs, firstly calculating SEP scores of a difference set of the positive sample pairs, and then filtering the positive sample pairs according to the SEP scores;
for a negative sample pair, dividing each sample of the negative sample pair into h disjoint parts, and then calculating SEP scores of all two continuous parts to obtain the highest SEP score of each sample of the negative sample pair; then, calculating a dissimilarity score of the negative sample pair based on the highest SEP score of each sample of the negative sample pair; negative sample pairs with lower dissimilarity scores are rejected.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a self-supervision small sample action sequence segmentation framework SFTSeg for segmenting activities on action sequence data, and a twin neural network is used for realizing small sample learning and classification. The prior activity segmentation method is often based on the same sensor, the invention can realize the enhancement of the identification accuracy of the target sensor data by utilizing the source sensor data, and can realize good activity segmentation and identification effects by using few target sensor mark samples. The method realizes the technology of small sample activity segmentation, and adopts a twin neural network as a main realization method of small sample learning. Aiming at three different data, the invention designs different loss functions to enhance the training effect: constructing a cross entropy loss function to forcedly input samples to corresponding categories aiming at marked samples of a source sensor; in order to enhance the generalization capability of the target sensor data, a consistency regularization method is introduced, a marked sample of a source sensor is used as disturbance, and the disturbance is injected into the marked sample of the target sensor as enhancement data, so that model training is performed by using the marked sample of the limited target sensor; in order to alleviate a large amount of offset between the source domain and the target domain, self-supervision learning is introduced, and a positive sample pair and a negative sample pair are constructed based on unlabeled samples of the target sensor to train the twin neural network, so that the twin neural network can capture the characteristics of target data.
The invention resolves the environment dependence and designer subjectivity of unsupervised methods (such as change-point- and threshold-based detection) in the activity segmentation task, achieving good segmentation across different sensors in different scenes. It also resolves the supervised methods' need for large amounts of labeled target-sensor data (costly and constrained by many conditions), achieving good activity segmentation with only a few labeled target-sensor samples.
Drawings
FIG. 1 is a diagram illustrating four motion states extracted from a sequence of actions;
FIG. 2 is a flow chart of a method for segmenting an action sequence based on self-supervision and less sample learning for behavior recognition according to an embodiment of the invention;
FIG. 3 is a diagram illustrating the difference between the warping path and the shortest path;
FIG. 4 is a diagram of positive (negative) sample pairs extracted from an action sequence;
FIG. 5 is a diagram showing an example of detection of an activity starting point;
FIG. 6 is a line graph of segmentation performance (F1-score) for different sizes of labeled target data.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
the activity segmentation aims at determining the start and end times of an activity, which is the first step in human activity recognition. Because it is difficult to collect a large amount of tag data from the target sensor, an unsupervised method such as a CPD-based method and a threshold-based method is widely adopted for activity segmentation. However, these methods all face experience and environmental dependent problems. Thus, we translate the activity segmentation task into a classification problem by: (1) Firstly discretizing continuous time sequence data into windows with equal size; (2) then classifying each window into four status categories: a stationary state, a start state, a motion state, and an end state; (3) Finally, the start and end points of the activity are identified from these status tags. Here, four status tags are defined as: stationary state: the window is filled with time series data without activity; start state: the window contains the start of the activity; motion state: the window is full of time series data of human body activities; end state: the window contains the end point of the activity. An example of these four states extracted from an activity is shown in fig. 1, which illustrates the first order difference in the amplitude of WiFi channel state information (Channel State Information, CSI) for one subcarrier as a function of time. The vertical dashed lines here are the actual start and end points of the activity.
Thus, the segmentation result depends to a large extent on the quality of state inference, and the activity segmentation problem becomes one of designing a suitable state inference model to predict state labels for discretized data from a target sensor. Because the number of labeled samples is limited, we introduce few-sample learning as the state inference model. Assuming it is feasible to collect many labeled samples from source sensors, our goal becomes how to construct a robust few-sample learning model for state inference on sensor data from three types of input: a large number of labeled samples from the source sensors, a small number of labeled samples from the target sensor, and unlabeled samples from the target sensor. In practical applications, the style and characteristics of the target data may differ considerably, as they may be collected under different scenarios (e.g., different people, environments, and sensor devices). Therefore, the few-sample learning model should be able to handle large differences between the source and target data distributions.
Specifically, the action sequence segmentation method based on self-supervision less sample learning aiming at behavior recognition comprises the following steps:
Step 1: constructing a self-supervision small sample action sequence segmentation framework SFTSeg; the framework is based on a twin neural Network (Siamese Network), and takes a marked sample of a source sensor, a marked sample of a target sensor and an unmarked sample of the target sensor as input data; the marking sample of the source sensor and the marking sample of the target sensor are respectively provided with four state labels, namely a static state, a starting state, a moving state and an ending state; the samples refer to a sequence of actions derived from the sensor data; specifically, the framework is a state inference model based on a twin neural network;
step 2: constructing a cross entropy loss function for a marked sample of the source sensor to train a twin neural network;
step 3: for the marked sample of the target sensor, taking the marked sample of the source sensor as disturbance, injecting the disturbance into the marked sample of the target sensor as enhancement data, and constructing a consistency regularization loss function to train the twin neural network;
step 4: constructing positive and negative sample pairs based on unlabeled samples of the target sensor, constructing a self-supervised loss function based on these pairs, and training the twin neural network so that it can capture the characteristics of the unlabeled target-sensor samples;
Step 5: and (3) obtaining a trained SFTSeg through the steps 1-4, inputting a sample of the target sensor as a test sample into the trained SFTSeg, firstly predicting a state label of the test sample by the trained SFTSeg, and then carrying out activity segmentation on the test sample according to the predicted state label.
Based on the above embodiment, the present invention further provides another method for dividing an action sequence based on self-supervision and less sample learning for behavior recognition, which specifically includes:
A. status inference model overview
Specifically, to address subjectivity, environmental dependence, and insufficient labeled target-sensor data, we introduce a few-sample learning model to predict the state labels of discretized data and further use these labels for activity segmentation. However, unlike general few-sample learning, in the activity segmentation scenario there is a large offset between the source and target domains while the class labels are shared. For this purpose we propose SFTSeg, a self-supervised few-sample action sequence segmentation framework, which is in essence a state inference model based on a twin neural network, as shown in fig. 2.
Specifically, the framework is based on a twin neural network with three types of input data: a large number of labeled samples from the source sensors, a small number of labeled samples from the target sensor, and unlabeled samples from the target sensor, corresponding respectively to the classification loss, the consistency regularization loss, and the self-supervised loss. First, since the labeled samples of the source and target sensors share the same four categories, we construct a classification (cross-entropy) loss L_cl from the labeled source-sensor samples. Second, since a model trained on source sensor data alone may not accurately capture the features of the target sensor data, we develop a consistency regularization loss L_cr from the limited labeled target-sensor samples to enhance the smoothness of the model: a labeled source-sensor sample is scaled down and injected as a disturbance into a labeled target-sensor sample to generate enhancement data, and the resulting sample pairs are used to construct L_cr. Third, to enhance generalization to the target sensor data, we design a self-supervised loss over sample pairs extracted from the unlabeled target data by an auxiliary task we design for time series, and further propose an adaptive weighting method to strengthen this loss.
Specifically, we realize few-sample learning through a twin neural network. A common twin neural network employs a convolutional neural network trained on a large amount of labeled source-sensor data to extract feature vectors, and classifies new samples by measuring the distance between them and the few samples of each class from the target sensor. A twin neural network consists of two branches whose parameters are shared; each branch uses the same architecture, such as a convolutional neural network (CNN). In general, a twin network is trained by minimizing a contrastive loss over sample pairs. Given an input sample pair (x_1, x_2) and its feature-vector pair (f(x_1), f(x_2)), the distance between the feature vectors in the latent space is calculated as

d_e = \|f(x_1) - f(x_2)\|   (1)

The contrastive loss function L_{ct} is defined as

L_{ct} = (1 - y)\, d_e^2 + y \, \max(0,\; m - d_e)^2   (2)

where y is the binary label assigned to the pair, i.e. y = 0 if x_1 and x_2 belong to the same class and y = 1 otherwise, and m is the margin.
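The feature distance and the standard contrastive loss, consistent with the definitions of y and m above, can be sketched in plain Python; the feature vectors here are stand-ins for the twin network's outputs:

```python
import math

def euclidean(f1, f2):
    # d_e = ||f(x1) - f(x2)||, the latent-space distance of Eq. (1)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def contrastive_loss(f1, f2, y, m=1.0):
    # Standard contrastive loss: pulls same-class pairs (y = 0) together and
    # pushes different-class pairs (y = 1) apart up to the margin m.
    d = euclidean(f1, f2)
    return (1 - y) * d ** 2 + y * max(0.0, m - d) ** 2

# A pair at feature distance 0.5:
same_class = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0)    # d_e^2
diff_class = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1)    # (m - d_e)^2
```

For a same-class pair the loss grows with the squared distance; for a different-class pair it is nonzero only while the pair is closer than the margin.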
B. Enhancing few-sample learning with consistency regularization
Here we have attempted to enhance the twin neural network model in two ways using the marker data. First, for labeled source data, we use cross entropy (classification) loss for model training because they have the same four categories as the target data. Second, for labeled target data, since their number is very small, we propose a line-level data enhancement method and design a consistency regularization penalty, which forces the enhancement data and the original data to have the same label distribution to enhance model smoothness.
Classification loss. Unlike typical few-sample learning tasks, for activity segmentation the source and target sensor data share the same four categories: stationary, start, motion, and end. Thus, to exploit the labeled source-sensor data, we train the neural network with a cross-entropy loss rather than the general few-sample learning loss, which enhances the classification capability of the network. Let D_{ls} = \{(x_i^{ls}, y_i)\}_{i=1}^{N} be the set of labeled source-sensor samples, where y_i is the state label of x_i^{ls}. The classifier f is a function that maps the input feature space to the label space. Considering all labeled source-sensor samples D_{ls}, the cross-entropy (classification) loss is

L_{cl}(\theta) = -\sum_{i=1}^{N} \sum_{j} y_{ij} \log f_j(x_i^{ls})   (3)

where \theta denotes the model parameters, y_{ij} is the j-th element of the one-hot label of sample x_i^{ls}, and f_j is the j-th element of f.
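A single sample's contribution to the cross-entropy loss described above can be illustrated directly; the probabilities are assumed softmax-normalized classifier outputs:

```python
import math

def cross_entropy(probs, onehot):
    """One sample's term of L_cl: -sum_j y_ij * log f_j(x_i).
    `probs` is the classifier output over the four state classes."""
    return -sum(y * math.log(p) for y, p in zip(onehot, probs) if y > 0)

# Four state classes: stationary, start, motion, end; true label is "start":
loss = cross_entropy([0.1, 0.7, 0.1, 0.1], [0, 1, 0, 0])
```

Only the probability assigned to the true class contributes, so a confident correct prediction yields a small loss.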
Consistency regularization. Due to the large offset between the source and target domains, a model trained based solely on the source sensor data may not fully capture the characteristics of the target sensor data and, accordingly, may not be able to effectively predict the status tag of the target sensor data. Therefore, we introduce a consistency regularization here to enhance the generalization ability of the model with limited labeled target sensor data. In other words, we design a line-level data enhancement method for action sequence data that will generate enhancement data to build a consistency regularization penalty.
Consistency regularization aims to ensure that the classifier assigns the same class label to unlabeled samples that are injected with disturbances. Although widely used perturbation methods (such as random noise, gaussian noise, and attenuated noise) can be effectively applied to image and natural language data processing, they are not suitable for time series data due to their inherent nature. For example, the perturbation method of an image is mainly to generate a change at the pixel level, whereas time-series data needs to be linearly changed because the time-series data is a waveform that changes with time. Furthermore, the augmentation data should have a similar style to the target sensor data, which is advantageous for inferring the characteristics of the model learning the target sensor data.
To this end, we narrow down the marked sample of the source sensor as a disturbance and inject the disturbance into the marked sample of the target sensor to generate enhancement data. The raw sample (the marker sample of the target sensor) and the enhancement data are then input into the twin neural network, and the twin neural network is trained by minimizing the distance between the respective features of the raw sample and the enhancement sample.
Specifically, to generate enhancement data in the style of the target sensor data, we construct it according to two rules: (i) the labeled source-sensor sample compressed as a disturbance should be of the same class as the labeled target-sensor sample; (ii) the compressed labeled source-sensor sample is injected into the labeled target-sensor sample along the warping path. Here, the warping path, generated by the dynamic time warping (DTW) algorithm, maps the elements of two data sequences so as to minimize the distance between them. Fig. 3 shows an example of the warping path and the shortest path for two action sequence samples: the black and gray solid lines represent the waveforms of the two samples, and the gray dotted lines represent the warping path in fig. 3(a) and the shortest path in fig. 3(b). When the black solid line is used as a disturbance, adding it to the gray solid line along the shortest path severely distorts the waveform, as shown in fig. 3(c), whereas adding it along the warping path, as in fig. 3(d), preserves the basic shape while changing the waveform to some extent.
Thus, the enhancement data \tilde{x}^t is calculated as

\tilde{x}^t = \mathrm{Aggregate}(x^t, h(x^{ls}))   (4)

where Aggregate(x, x') sums the two sequences according to the warping path, and h(x) is a function that shrinks the magnitude of the sensor data, e.g. h(x) = \gamma x, with \gamma \in (0, 1) a hyper-parameter that adjusts the degree of shrinkage.
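The augmentation rules above can be sketched as follows: compute a classic DTW warping path, shrink the source sample with h(x) = γx, and add it to the target sample along the path. The function names and the element-wise combination are illustrative assumptions, not the patent's exact Aggregate:

```python
def dtw_path(a, b):
    """Classic dynamic-time-warping alignment path between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda p: cost[p[0]][p[1]])
    path.append((0, 0))
    return path[::-1]

def augment(target, source, gamma=0.3):
    """Inject the shrunk source sample h(x) = gamma * x into the target
    sample along the DTW warping path (a sketch of rule (ii))."""
    out = list(target)
    for i, j in dtw_path(target, source):
        out[i] = target[i] + gamma * source[j]   # assumed element-wise Aggregate
    return out

enhanced = augment([0, 1, 0], [0, 2, 0], gamma=0.5)
```

Because the disturbance is added where the warping path aligns the two waveforms, peaks reinforce peaks rather than landing mid-slope, which is what preserves the basic shape.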
Thus, to penalize inconsistency between the features of the original sample x^t and those of the enhancement data \tilde{x}^t, the consistency regularization loss is calculated as

L_{cr} = \sum_{x^t \in D_t} \| f(x^t) - f(\tilde{x}^t) \|^2   (5)

where f(x) denotes the feature vector produced by the twin neural network and D_t is the labeled dataset from the target sensor.
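A minimal sketch of this consistency term, assuming the squared feature-distance form described above (the feature lists stand in for twin-network outputs):

```python
def consistency_loss(feats_orig, feats_aug):
    """Hypothetical L_cr: squared feature distance between each labeled
    target sample and its augmented counterpart, summed over the dataset."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(f, f_aug))
        for f, f_aug in zip(feats_orig, feats_aug)
    )
```

Minimizing this term drives the network to map a sample and its perturbed version to the same point in feature space, which is the smoothness property the text asks for.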
C. Facilitating few-sample learning by self-supervision
To further enable the inference model to learn the characteristics of the target data, we incorporate a self-supervised technique into the few-sample learning, using a large amount of unlabeled target data for model training. To this end, we propose an auxiliary task for time series data to build the self-supervised loss, and further strengthen this loss by designing adaptive weights that adjust the importance of each training sample pair.
Self-supervising loss. To train a twin network using a self-supervision method, we need to design an auxiliary task for the twin network based on unlabeled data. Although there are some effective auxiliary tasks in the fields of computer vision and natural language processing (e.g., image rotation, morphing, and cropping), they are not applicable to continuous time series data. For example, an image rotation task widely used in computer vision aims at assigning the same label to the rotated image as the original image. However, for active segmentation, when the sequence with the start state tag is rotated 180 °, it is easily confused with the sequence in the end state. As shown in fig. 1, the data with the start state tag is rotated 180 deg. and has a shape very similar to the data with the end state.
To this end, we propose an auxiliary task for time series data that trains the twin network by constructing many positive and negative sample pairs from unlabeled target data. A positive sample pair means that both samples have the same state label; a negative sample pair, the opposite. We treat two consecutive windows of similar shape as a positive pair and two well-separated windows of different shape as a negative pair. Specifically, we use a sliding window to discretize the action sequence into overlapping windows of size w with sliding step l. Two windows are considered a positive sample pair if they satisfy the following constraints: (i) the two windows are adjacent; (ii) they contain the same number of change points, and the difference between the two windows contains no change points. Accordingly, two windows are considered a negative sample pair if: (i) the two windows are sufficiently far apart, i.e. separated by more than a given minimum time distance (e.g., 2w); (ii) they contain different numbers of change points, i.e. one window contains a change point and the other none. Here, a change point is a time point at which the behavior in the action sequence changes abruptly; for time series containing motion data, an activity transition can be regarded as a change point. Thus, two consecutive windows containing the same number of change points should share a state label and form a positive pair, while two windows with different numbers of change points should have different state labels and form a negative pair.
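The pair-construction constraints above can be sketched as follows. This is an illustration only: change-point locations are given as input, and the SEP-based confidence filtering is omitted:

```python
def count_cps(window, change_points):
    """Number of change points falling inside the half-open window [lo, hi)."""
    lo, hi = window
    return sum(lo <= cp < hi for cp in change_points)

def build_pairs(seq_len, w, l, change_points, min_dist):
    """Enumerate positive/negative window pairs per the stated constraints."""
    windows = [(s, s + w) for s in range(0, seq_len - w + 1, l)]
    pos, neg = [], []
    for i in range(len(windows)):
        for j in range(i + 1, len(windows)):
            wi, wj = windows[i], windows[j]
            same = count_cps(wi, change_points) == count_cps(wj, change_points)
            if j == i + 1 and same:
                # adjacent, same change-point count, and no change point in
                # either difference region wi \ wj or wj \ wi
                if count_cps((wi[0], wj[0]), change_points) == 0 \
                        and count_cps((wi[1], wj[1]), change_points) == 0:
                    pos.append((wi, wj))
            elif wj[0] - wi[0] > min_dist and not same:
                neg.append((wi, wj))
    return pos, neg

# 100-sample sequence, w = 20, step l = 10, one change point at t = 55,
# minimum negative-pair distance 2*w = 40:
pos, neg = build_pairs(100, 20, 10, [55], 40)
```

In this toy run the adjacent windows straddling t = 55 form a positive pair (each contains the change point), while a stationary window far from it pairs negatively with a window containing it.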
We use the density-ratio-based method SEP (S. Aminikhanghahi, T. Wang, and D. J. Cook, "Real-time change point detection with application to smart home time series data," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 1010-1023, 2019) to detect change points. The method identifies a change point by comparing a probability metric and a change score against corresponding thresholds, achieving good performance.
Fig. 4 gives an example of positive and negative sample pairs, where the vertical dashed lines are the actual start and end points of the activity; these are also the change points, as they are the transition points of the activity. As shown in fig. 4, the first pair of windows is a positive sample pair: the two windows are adjacent, each contains exactly one change point, and the two difference sets contain no change points. Accordingly, the second pair of windows is a negative sample pair: the two windows are far apart and contain different numbers of change points.
According to these rules, a large number of positive and negative pairs can be obtained from unlabeled target data. To improve the sample quality, we further culled samples with low confidence. In particular, a larger SEP score means a greater probability of the presence of a change point. For positive pairs of samples, due to the differential set of pairs of samplesAnd->There should be no change point, so we discardSample pairs with higher SEP scores in the variance set. To this end, we first compute the SEP scores of the difference set for one sample pair and then filter the sample pairs according to their scores. For a differential set of a sample pair, the differential set is divided equally into two parts: x is x t-1 And x t Each length is s, and then we calculate their density ratio as follows:
where f_{t-1}(x) and f_t(x) are the estimated probability densities of the two parts, respectively. The SEP change-point score is then constructed from this density ratio as follows:
In this way, the SEP values of the two difference sets of a positive pair can be calculated. Then, to ensure the quality of the training samples and avoid overfitting, we exclude 10% of the positive sample pairs:
where f_drop expresses the rejection condition, the compared quantity is the average of the two difference-set SEP scores, and ε is a threshold determined by ranking the SEP values of all positive sample pairs according to the culling rate.
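A minimal sketch of this culling step, assuming each positive pair carries the SEP scores of its two difference sets (the quantile-based threshold ε below is our reading of the rank-and-rate rule):

```python
import numpy as np

def cull_positive_pairs(pairs, diff_sep_scores, drop_rate=0.10):
    """Discard the positive pairs whose averaged difference-set SEP score is
    highest: a large SEP score suggests a change point hides in a difference
    set, so such pairs are low-confidence positives."""
    means = np.array([(a + b) / 2.0 for a, b in diff_sep_scores])
    eps = np.quantile(means, 1.0 - drop_rate)   # threshold from the ranked scores
    return [pair for pair, mu in zip(pairs, means) if mu <= eps]
```

With ten pairs scored 0 through 9, a 10% culling rate drops only the highest-scoring pair.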
For a negative sample pair, one sample is expected to contain one change point while the other contains none. To meet this requirement, we filter out pairs whose change points are insufficiently pronounced. To this end, we design a dissimilarity score based on the SEP score to reject negative pairs with low confidence. Specifically, each sample of a negative pair is divided into h disjoint parts, and the SEP score of every two consecutive parts is calculated using formula (7), giving one score for the j-th and (j+1)-th parts. The highest of these SEP scores is:
Thus, a maximum SEP score can be calculated for each sample of a negative pair; the two maxima correspond to the sample with one change point and the sample with no change point, respectively. The dissimilarity score of the pair is then calculated as:
Negative sample pairs with lower dissimilarity scores are removed; equation (8) is still employed as the filtering method, with the dissimilarity score taking the place of the averaged SEP score.
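The negative-pair screening can be sketched as below; `sep_fn` stands in for formula (7), and taking the dissimilarity as the gap between the two maximum scores is our assumption about the exact combination:

```python
import numpy as np

def max_sep(sample, h, sep_fn):
    """Split a sample into h disjoint parts and return the highest SEP score
    over all pairs of consecutive parts."""
    parts = np.array_split(np.asarray(sample, dtype=float), h)
    return max(sep_fn(parts[j], parts[j + 1]) for j in range(h - 1))

def dissimilarity(sample_one_cp, sample_no_cp, h, sep_fn):
    """Dissimilarity of a negative pair: a sample that truly contains a change
    point should score much higher than one that contains none."""
    return max_sep(sample_one_cp, h, sep_fn) - max_sep(sample_no_cp, h, sep_fn)
```

For example, with a toy `sep_fn` that compares segment means, a step-shaped sample scores well above a flat one, so the pair survives the low-dissimilarity cull.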
After discarding the 10% of positive and negative sample pairs with low confidence, the remaining pairs are used to train the twin neural network with the self-supervised loss:
where d_e is the distance between the feature vectors of the two input samples; y is the label assigned to the sample pair, i.e., y = 0 if the two samples have the same status label and y = 1 otherwise; and m is a hyper-parameter for the margin.
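A sketch of the loss at this step, assuming the standard Siamese contrastive form (which matches the definitions of d_e, y, and m above):

```python
import numpy as np

def contrastive_loss(f1, f2, y, m=1.0):
    """Self-supervised contrastive loss for a twin network: y = 0 (positive
    pair, same status label) pulls the feature vectors together; y = 1
    (negative pair) pushes them at least the margin m apart."""
    d_e = float(np.linalg.norm(np.asarray(f1) - np.asarray(f2)))
    return (1 - y) * d_e ** 2 + y * max(0.0, m - d_e) ** 2
```

A positive pair mapped far apart is penalized quadratically, while a negative pair already separated by more than m incurs no loss.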
Adaptive weighting. After screening the sample pairs, the remaining positive and negative pairs have higher confidence. However, different pairs may provide different cues for learning the data representation. In general, samples without any activity data provide fewer cues for learning the data representation, while samples containing activity data provide more cues and play a more important role in model training. Fig. 4 shows an example with two sample pairs: since the positive pair contains activity data while the negative pair does not, the positive pair deserves more attention during model training.
Since the amplitude range of the sensor data is much larger when an activity is occurring than when none is, a sample pair with a large variation range is likely to contain activity data and should be emphasized in model training. We use the amplitude variance of the samples in a pair to estimate the fluctuation amplitude, which can be described by the following formula:
where the two terms are the amplitude variances of the two samples in the pair. V_pair is then used as a weight to adjust the importance of the pair during training. Taking this weight into account, the self-supervised loss in formula (11) becomes:
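Assuming V_pair averages the two amplitude variances (the precise combination in the patent's formula may differ), the weighted loss can be sketched as:

```python
import numpy as np

def weighted_self_loss(x1, x2, f1, f2, y, m=1.0):
    """Adaptive weighting sketch: high-fluctuation sample pairs (likely to
    contain activity data) receive a larger weight V_pair, so they matter
    more in training; flat, activity-free pairs are down-weighted."""
    v_pair = (np.var(x1) + np.var(x2)) / 2.0     # fluctuation-amplitude proxy
    d_e = float(np.linalg.norm(np.asarray(f1) - np.asarray(f2)))
    base = (1 - y) * d_e ** 2 + y * max(0.0, m - d_e) ** 2
    return v_pair * base
```

A perfectly flat (activity-free) pair thus contributes nothing, while a fluctuating pair keeps its full contrastive loss.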
Finally, combining the classification loss in equation (3), the consistency loss in equation (5), and the weighted self-supervised loss in equation (13), the final loss function is described as follows:
in this loss, the consistency loss is based on the marked data of the target sensor, while the self-supervised contrast loss is based on the unmarked data of the target sensor. Thus, models trained with these losses can effectively capture characteristics of the target sensor data and apply to the target sensor. The model adopts an Adam algorithm with default super parameters as an optimization method.
D. Activity segmentation
After obtaining a trained state inference model, we predict the state labels of a given action sequence from the target sensor: we compare the distance between the target sample vector and the sample vectors of each class, and assign the target sample to the class whose samples are nearest. The activity is then partitioned according to the inferred state labels. Specifically, the start and end points of an activity are detected as follows. First, a sliding window divides the continuous action sequence (sensor data stream) into overlapping windows of length w, with a sliding step size of 1. Second, the state inference model infers the state label of each window. Finally, the start and end points of the activity are identified by observing changes in the mode of a list of window labels, where the mode is the most frequently occurring value in the list. In other words, if the mode changes from 1 (rest state) to 2 (start state), the corresponding window is considered the start of an activity; if the mode changes from 4 (end state) to 1 (rest state), the window is considered the end of the activity.
For a more intuitive illustration, Fig. 5 shows an example of detecting an activity start point by observing the mode change, where the length m of the window label list used to compute the mode is set to 10. In this figure there are 18 data points; points 1 to 13 are stationary data and points 14 to 18 are activity data. The data are first divided into 11 overlapping windows, each of length w = 8. Second, for each window, the trained state inference model infers its state label: the labels of w1 to w6 are the stationary state, 1, and the labels of w7 to w11 are the start state, 2. Finally, the windows are traversed to detect the start and end points of the activity. When checking w10, the mode of the state label list from w1 to w10 is 1. When checking w11, the index of the current data point i is 18 and the mode from w2 to w11 becomes 2, i.e., the mode changes from 1 (stationary state) to 2 (start state), indicating an activity start point. When several values occur with the same frequency, the mode is set to the largest of them. The start point t_start is set to i - m/2 + 1; thus, in this example, with i = 18 and m = 10, t_start equals the actual start point 14. After human activities are segmented, the data can be used for activity classification.
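The mode-change traversal of Fig. 5 can be sketched as follows (the 1-based indexing and tie-to-largest rule follow the example above; function names are ours):

```python
from collections import Counter

def detect_activity_starts(window_labels, m=10, w=8):
    """Scan the per-window state labels (1 = rest, 2 = start); when the mode
    of the last m labels flips from 1 to 2, report the activity start at
    t_start = i - m/2 + 1, where i is the index of the newest data point."""
    def mode(seq):
        counts = Counter(seq)
        top = max(counts.values())
        return max(v for v, c in counts.items() if c == top)  # tie -> largest value

    starts, prev = [], None
    for k in range(m, len(window_labels) + 1):   # k = 1-based index of newest window
        cur = mode(window_labels[k - m:k])
        if prev == 1 and cur == 2:               # rest -> start transition
            i = k + w - 1                        # last data point covered by window k
            starts.append(int(i - m / 2 + 1))
        prev = cur
    return starts
```

On the Fig. 5 example (six rest-labeled windows followed by five start-labeled ones), this yields t_start = 14, matching the actual activity onset.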
To verify the effect of the invention, the following experiments were performed:
we used four data sets and evaluated the validity of SFTSeg based on different sensor devices, users and circumstances. In addition, the contribution of the individual components and the impact of training data size on performance were also studied.
A. Experimental data and settings
Experimental data. We conducted experiments on four behavior recognition datasets from different types of sensors, such as WiFi devices, smartphones, and RFID tags.
HandGesture: the dataset comprises twelve hand motion activities performed by two subjects and captured by an inertial measurement unit. The activities comprise windowing, closing windows, drinking water, watering flowers, shearing, splitting, stirring, reading books, tennis hands-on, ball-catching and the like. And the activity is continuous.
USC-HAD: the dataset consisted of twelve human activities, each recorded in 14 categories using a 3-axis accelerometer and a 3-axis gyroscope, respectively. The activities of each category are repeated five times, including walking forward, walking left, walking right, going upstairs, going downstairs, running forward, jumping, sitting down, standing, sleeping, elevator up, and elevator down. Since these data are discontinuous, the active set is manually randomly stitched for segmentation.
RFID: the experimental dataset contained data from six people, each posing between a wall and an RFID antenna, with nine passive RFID tags placed on the wall. RFID data is a discontinuous data set in that it is connected by twelve poses of each of the six subjects and still manually stitched data for the purpose of experimental performance.
Wifi action: the dataset consisted of ten activities performed by five persons using WiFi devices and collected by Channel State Information (CSI) collection tools [ D.Halperin, W.Hu, A.Sheth, and d.well, "Tool release: linking 802.11n traces with channel state information," sigcom comp. These activities were continuous, including 1500 samples of approximately 5 fine grain activities (hand swing, hand lift, push, drawing O, and drawing X) and 5 coarse grain activities (boxing, pick up, running, squat, walk).
Since the source data and target data may be collected under different sensor devices, personnel, and environments, in the following experiments the WiFiAction data are taken as the source data when evaluating performance on the other datasets, and the HandGesture data are taken as the source data when evaluating performance on WiFiAction. Furthermore, by default we select three labeled samples from each category of the target data for model training, i.e., the following experiments are conducted in a three-shot learning scenario. For the data from the target sensor, 80% of the unlabeled data are selected for model training and the remaining 20% are used as the test set.
Evaluation metrics. Two metrics are used to evaluate our proposed SFTSeg and the baseline models: (i) F1-score: the F1-score is the harmonic mean of precision and recall. A predicted segmentation point is considered a true positive when it lies within a specified time window of a true boundary, and a false positive when it falls outside the time windows of all true boundaries. The specified time windows of the WiFiAction and HandGesture datasets are set to 0.3 and 0.5 seconds, respectively, and those of the other datasets to 2 seconds, according to the sampling rates of the sensors. (ii) RMSE: the root mean square error (RMSE) is calculated from the deviation between the true boundary times and the predicted boundary times. Here the RMSE is normalized to [0,1] by the duration of the time series.
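A hedged sketch of the two metrics; the greedy one-to-one matching between predicted and true boundaries is our assumption, since the text does not spell it out:

```python
import numpy as np

def f1_and_rmse(pred, truth, tol, duration):
    """Score predicted boundary times against true ones: a prediction within
    `tol` seconds of an unmatched true boundary is a true positive; RMSE over
    the matched deviations is normalized by the sequence duration."""
    remaining = list(truth)
    tp, errs = 0, []
    for p in pred:
        if not remaining:
            break
        j = int(np.argmin([abs(p - t) for t in remaining]))
        if abs(p - remaining[j]) <= tol:
            tp += 1
            errs.append(p - remaining.pop(j))    # match each true boundary once
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    rmse = float(np.sqrt(np.mean(np.square(errs)))) / duration if errs else 0.0
    return f1, rmse
```

For example, with true boundaries at 10, 21, and 50 seconds, predictions at 10, 20, and 35 with a 2-second tolerance give precision = recall = 2/3.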
Experimental details. The learning rate of the model is set to 0.001 and the mini-batch size is 60. The shrinkage factor γ used to compute h(x) in formula (4), the margin m in formula (11), and the window size w are set to 0.05, 1, and 120, respectively, based on experimental experience. The CNN architecture in the twin network is the same as in our previous work [C. Xiao, Y. Lei, Y. Ma, F. Zhou, and Z. Qin, "DeepSeg: Deep-learning-based activity segmentation framework for activity recognition using WiFi," IEEE Internet of Things Journal, vol. 8, no. 7, pp. 5669-5681, 2021].
B. Baseline method
To demonstrate the effectiveness and superiority of SFTSeg, eight segmentation methods based on different techniques are chosen as baselines, including the threshold-based WiAG and Wi-Multi, the change-point-detection (CPD) based AR1seg, SEPseg, and IGTS, the time-shape-based FLOSS and ESPRESSO, and the supervised method DeepSeg.
WiAG: a typical threshold-based gesture extraction segmentation method. The method identifies the beginning and end of a gesture by comparing the amplitude of a principal component in the data stream to a given threshold.
Wi-Multi: a novel activity segmentation algorithm in a multi-subject complex environment. The algorithm can eliminate potential false detection by calculating the maximum eigenvalues of the amplitude and calibration phase correlation matrix and can improve accuracy in noisy environments or/and scenes of multiple objects.
AR1seg: a typical change point detection method in the statistical field. This method uses a first order autoregressive process to infer the change point [ S.Chakar, E.Lebarbier, C.Lvy-Leduc, and S.Robin, "A robust approach for estimating change-points in the mean of an AR (1) process," Bernoulli, vol.23, no.2, pp.1408-1447,05 2017].
Sepreg: an innovative method for detecting a time series data change point. This algorithm has been used to effectively identify activity boundaries and to identify human daily activities [ S.Aminikhanghahi, T.Wang, and d.j.cook, "Real-time change point detection with application to smart home time series data," IEEE Transactions on Knowledge and Data Engineering, vol.31, no.5, pp.1010-1023,2019 ].
IGTS: a segmentation method based on information gain. This approach maximizes the information gain of the constituent parts by estimating action boundaries using a dynamic programming approach.
FLOSS: a shape-based segmentation method. The method segments activity data based on the fact that: patterns of similar shape should be associated with the same category and occur within close temporal proximity of each other.
Esponsso: an entropy and shape perception time series segmentation method. The method utilizes entropy and time shape characteristics of the time sequence to actively divide the multidimensional time sequence.
Deep seg: an activity segmentation method based on supervised learning. The framework employs CNNs as a state inference model to predict state labels for discrete data and then identifies active boundaries from the state labels.
C. Performance of activity segmentation
Table 1 shows the segmentation performance of the different methods on the four datasets, with the best results highlighted in bold. By analyzing the performance of these methods, we make the following observations.
First, our proposed SFTSeg consistently outperforms the baseline segmentation methods on all four datasets. Specifically, SFTSeg improves on the best baseline, DeepSeg, by 2.45%, 5.82%, 8.23%, and 1.92% in F1-score on the HandGesture, USC-HAD, RFID, and WiFiAction datasets, respectively. The results show that SFTSeg can capture the characteristics of the target data through our proposed consistency regularization and self-supervised loss, and thus perform accurate activity segmentation on the target data based on only a few labeled samples.
Second, the supervised method DeepSeg does not show a significant advantage over the unsupervised methods, especially on the RFID data. The main reason is that DeepSeg is designed for scenarios where the source data and target data share the same distribution; with only a few labeled target samples, its competitiveness decreases. This also explains why most prior work segments activities in an unsupervised way when labeled data are scarce. SFTSeg, in contrast, copes with the limited labeling of target data and achieves better performance than both the supervised and unsupervised methods.
Third, among the unsupervised baselines, different datasets require different methods to achieve good segmentation results. For example, IGTS is superior to the other unsupervised methods on the RFID data, but its segmentation results are significantly worse than ESPRESSO's on the HandGesture data. These results support our claim that unsupervised segmentation methods are often affected by environment-related problems. In contrast, our proposed SFTSeg consistently achieves better performance across all datasets.
Table 1: comparison of segmentation Performance
D. Ablation experiments
Here, we study the contribution of the basic components designed in SFTSeg, namely the consistency regularization loss, the self-supervised loss, and the adaptive weighting. We examine the roles of the different components: (i) SFTSeg-Base is the basic twin network model, optimizing the classification loss on the labeled source data given in equation (3). (ii) SFTSeg-Consis is the twin model with our designed consistency regularization loss, as shown in equation (5). (iii) SFTSeg-Self is the twin network model with the self-supervised loss but without the adaptive weights in equation (11). (iv) SFTSeg-Weight is the twin network model with the self-supervised loss and adaptive weights, as shown in equation (13). (v) SFTSeg-Full is our proposed model containing all components. The segmentation results on the four datasets are shown in Table 2, with the best results highlighted in bold. The observations from this table are as follows. First, SFTSeg-Full achieves the best performance while SFTSeg-Base performs the worst, which shows that our designed components greatly improve segmentation performance. Second, SFTSeg-Consis achieves better results than SFTSeg-Base when the consistency regularization is combined. This is because our design augments the limited labeled samples from the target domain, which helps the model improve its generalization to the target domain. Third, SFTSeg-Self and SFTSeg-Weight outperform SFTSeg-Base by an even larger margin. This result verifies the main motivation behind SFTSeg: the self-supervised loss based on unlabeled target data enables the model to capture features of the target domain and further improve segmentation performance.
Table 2: considering performance when different components
E. Effect of target data size
SFTSeg attempts to exploit the target data to mitigate the large distribution shift between the source and target domains. We therefore discuss here the effect of the target data size on segmentation performance. In particular, we vary the amount of unlabeled target data and study the results under 1-shot, 3-shot, and 5-shot settings (n in n-shot refers to the number of labeled samples per action category).
Fig. 6 shows the F1-score results when different proportions of unlabeled target data are selected, where (a)-(d) are the F1-score results on the HandGesture, USC-HAD, RFID, and WiFiAction data, respectively. The RMSE results are not shown because they follow the same trend. Fig. 6 shows that the segmentation performance of SFTSeg increases steadily with the amount of unlabeled data on all four datasets. This suggests that the unlabeled data size plays an important role in segmentation performance, and that our designed self-supervised task can effectively exploit unlabeled target data to enhance the model. In addition, 5-shot performance is clearly better than 1-shot. The reason is that more labeled target samples benefit not only the consistency regularization in the training stage but also the distance computation between test samples and labeled target samples in the test stage. Overall, the above results indicate that SFTSeg can effectively utilize both labeled and unlabeled target data to improve segmentation performance.
In summary, the present invention proposes a self-supervised few-sample action sequence segmentation framework, SFTSeg, to segment activities in action sequence data. Whereas prior action segmentation methods usually target a single sensor, SFTSeg uses source sensor data to enhance segmentation accuracy on target sensor data and achieves good activity segmentation and recognition with only a few labeled target sensor samples. A twin neural network is adopted as the main framework for few-sample learning, realizing few-sample activity segmentation. For the three kinds of input data, the invention designs different loss functions to strengthen training: for labeled samples of the source sensor, a cross-entropy loss function forces input samples into their corresponding categories; to enhance generalization to the target sensor data, a consistency regularization method is introduced in which a labeled source sensor sample is shrunk and injected into a labeled target sensor sample as a perturbation, and the resulting enhanced data are used to train the model; and to address the large difference between the source and target data distributions, self-supervised learning is introduced, constructing positive and negative sample pairs from unlabeled target sensor samples to train the twin neural network so that it can capture the characteristics of the target data and improve inference performance.
The invention addresses the environment dependence and designer subjectivity of unsupervised methods (such as change-point- and threshold-based detection) in the activity segmentation task, and achieves good activity segmentation under different sensors in different scenarios. The invention also addresses the need of supervised methods for a large amount of labeled target data (which is costly and constrained by various conditions), achieving good activity segmentation with only a few labeled target sensor samples.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims (4)

1. A motion sequence segmentation method based on self-supervision less sample learning aiming at behavior recognition is characterized by comprising the following steps:
step 1: constructing a self-supervised few-sample action sequence segmentation framework SFTSeg; the framework is based on a twin neural network and takes labeled samples of a source sensor, labeled samples of a target sensor, and unlabeled samples of the target sensor as input data; the labeled samples of the source sensor and of the target sensor each carry one of four state labels, namely the stationary state, start state, motion state, and end state; a sample refers to an action sequence derived from sensor data;
Step 2: constructing a cross entropy loss function for a marked sample of the source sensor to train a twin neural network;
step 3: for the marked sample of the target sensor, taking the marked sample of the source sensor as disturbance, injecting the disturbance into the marked sample of the target sensor as enhancement data, and constructing a consistency regularization loss function to train the twin neural network;
step 4: constructing a positive sample pair and a negative sample pair based on unlabeled samples of the target sensor, and training a twin neural network based on the constructed self-supervision loss function so that the twin neural network can capture characteristics of unlabeled samples of the target sensor;
step 5: obtaining a trained SFTSeg through steps 1-4, inputting a sample of the target sensor as a test sample into the trained SFTSeg, which first predicts the state label of the test sample and then performs activity segmentation on the test sample according to the predicted state label.
2. A method of motion sequence segmentation based on self-supervised less sample learning for behavior recognition as recited in claim 1, wherein step 3 comprises:
the enhancement data is constructed according to the following rules:
A. the compressed labeled sample of the source sensor used as the disturbance is of the same class as the labeled sample of the target sensor;
B. adding the compressed marked sample of the source sensor to the marked sample of the target sensor according to the warping path; the warp path is generated by a dynamic time warping algorithm.
3. A method of motion sequence segmentation based on self-supervised less sample learning for behavior recognition as recited in claim 1, wherein step 4 comprises:
dispersing the action sequence into an overlapped window with a fixed size w by adopting a sliding window, wherein the sliding step length is l;
two windows are considered to be positive sample pairs if they meet the following constraints: two windows are adjacent; the two windows contain the same number of change points, and the difference of the two windows does not contain any change points;
two windows are considered to be a negative-sample pair if they meet the following constraints: the two windows are spaced apart by more than a given minimum time distance; the two windows contain different numbers of change points; the change point is a time point when the action sequence behavior suddenly changes.
4. A method of motion sequence segmentation based on self-supervised less sample learning for behavior recognition as recited in claim 3, wherein step 4 further comprises:
For positive sample pairs, firstly calculating SEP scores of a difference set of the positive sample pairs, and then filtering the positive sample pairs according to the SEP scores;
for a negative sample pair, dividing each sample of the negative sample pair into h disjoint parts, and then calculating SEP scores of all two continuous parts to obtain the highest SEP score of each sample of the negative sample pair; then, calculating a dissimilarity score of the negative sample pair based on the highest SEP score of each sample of the negative sample pair; negative sample pairs with lower dissimilarity scores are rejected.
Publications (2)

Publication Number Publication Date
CN114118167A CN114118167A (en) 2022-03-01
CN114118167B (en) 2024-02-27
