CN114359790A - Video time sequence behavior detection method based on weak supervised learning - Google Patents
Abstract
The invention provides a video temporal behavior detection method based on weakly supervised learning. Adopting an adversarial idea, a refinement layer is added to segment behavior boundaries at the segment level, reducing the redundant information in the behavior instances produced by temporal detection. A GCN explicitly models the similarity relations among segments, and an internal-external contrast loss over fused class segments is proposed to supervise the intermediate representation of video features: by increasing the feature distance between foreground and background and reducing the feature distance within the same category, the context-confusion problem is alleviated; behavior proposals are obtained by threshold fusion, achieving structural integrity of behavior instances and independent localization of their content. Adopting a complementary idea, to address the loss of video information during feature learning and relational reasoning, the invention adds a global node to a complementary learning layer; the learned features are concatenated in temporal order and measured for similarity against the global node, ensuring the integrity of video information and the accuracy of behavior recognition.
Description
Technical Field
The invention relates to a video behavior localization method, and in particular to a video temporal behavior detection method based on weakly supervised learning.
Background
With the rapid growth of electronic capture devices and video data, localizing temporal behaviors in video requires a large amount of annotation for training and learning, and accurate temporal boundary annotation is extremely expensive and error-prone, which greatly limits the application of temporal behavior detection algorithms. Weakly supervised behavior localization technology uses only video-level labels during training, which further reduces the waste of human resources and time as well as annotation errors, and offers good flexibility.
Current weakly supervised behavior localization methods fall into two main categories. One treats weakly supervised temporal behavior localization as a video recognition task: a foreground-background separation attention mechanism is introduced to construct video-level features, and a behavior classifier is then applied to recognize the video. The other treats the problem as multiple-instance learning (MIL): the whole untrimmed video is regarded as a bag of instances and divided into multiple time segments; each segment is classified, the segment-level predictions are merged, and MIL yields the final video-level classification. Although existing methods achieve a certain effect, two problems at the present stage remain unsolved. (1) Behavior-integrity modeling: predicting a complete behavior becomes exceptionally complex under the weakly supervised setting. As shown in Fig. 1(a), where the Gt interval represents the ground-truth behavior range and the Pred interval represents the model's prediction, one complete swimming behavior is predicted as several smaller behavior intervals that cannot be regarded as a complete whole. (2) Action-context confusion: the problem of how to distinguish behavior from highly relevant context using only video-level labels. A video-level classifier learns correlations between videos with the same label which, as shown in Fig. 1(b), include not only the common behavior but also closely related context background; the model cannot separate behavior from context, leading to erroneous predictions.
For these problems, existing solutions include random erasure, class-agnostic attention modeling, discriminative feature learning, and the like. These methods either over-attend to highly discriminative segments while ignoring segments of low discriminability; or provide training supervision only through feature similarity without modeling feature relations for prediction; or leave redundant information in the fused behavior instances because of the segmentation strategy. Moreover, they lack a process for verifying the integrity of video information: behavior information lost during the sequence of feature learning and relational reasoning after video partitioning causes behavior recognition errors, so the models cannot achieve good detection performance.
Disclosure of Invention
The invention aims to solve the problem of accurately separating the different behavior instances and the background in a video when the behavior instances in a long video carry no start-end boundary annotation, thereby realizing temporal behavior detection for long videos.
The invention is realized by adopting the following technical scheme: a video time sequence behavior detection method based on weak supervised learning is characterized by comprising the following steps:
step A, performing spatio-temporal feature extraction on an untrimmed video through a two-stream inflated 3D convolutional network (I3D), and inputting the extracted features into a boundary regression layer for fine segmentation of segment-level boundaries;
inputting the extracted features into the boundary regression layer includes: stacking three identical temporal convolution blocks for temporal convolution filtering, each temporal convolution block having 2048 convolution kernels, a BN layer, and a ReLU layer, and finally adding one temporal convolution block that outputs boundary regression values for fine segmentation;
step B, taking the segmented features as nodes of a graph convolutional network (GCN) for relational reasoning, designing an Internal-External Contrast (IEC) loss over fused class segments to supervise the intermediate representation of video features, increasing the feature distances between foreground and background and between different categories, and applying threshold fusion to obtain behavior proposals;
and step C, obtaining a set of category confidences through a multiple-instance learning classifier, and taking their mAP as the performance evaluation index.
Further, when performing relational reasoning in step B, the segment features are input into the graph convolutional network for relational learning over video segment features to obtain the segment feature output Z, computed by weighting Xc with the softmax-normalized affinity matrix of G and the weight matrix W, where Xc represents the set of video segment features; G is essentially a weighted combination, according to similarity, of each segment feature in Xc. To compute G, a δ function is designed to incorporate the similarity and dissimilarity of appearance and motion features between segments into the weight-learning process; each node of the graph convolutional network integrates information from its neighborhood to generate its own distinctive feature, so that nodes more similar under δ have higher weight, where δ(x) = wx + b, w and b denote the learned weight and bias term, δ(x) denotes the feature-learning function, and x denotes a segment feature in Xc.
Further, the features of temporally continuous segments output by the GCN are concatenated and measured for similarity against a global node (the basic behavior information of the whole video, extracted and preserved by I3D). A threshold decides whether behavior information has been lost: if the similarity is below the threshold, too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation, training, and learning; otherwise the result is output through the classifier.
Further, in step B, the IEC loss that supervises the feature representation is expressed as follows:
where T is the total number of video segments, t is the segment index, Ft^c represents the spatio-temporal feature of time period t, the class confidence of category i at time t also appears in the loss, j and k index two video segments, f_i^j is a foreground feature, and f_i^k is a background feature; the similarity is measured by cosine distance, with a similarity offset of 0.5 — experiments with interval-valued search and two-sided refinement show that an offset of 0.5 works best. Adding non-linearity improves the generalization capability of the function.
Further, in step B, threshold fusion is carried out through the formula CAS = MLP(Xc, θcas).
Further, the similarity measure adopts cosine similarity, defined as sim(x, y) = (x·y)/(‖x‖‖y‖).
compared with the prior art, the invention has the advantages and positive effects that:
the invention introduces the countermeasure idea, adds the boundary regression layer to segment-level behavior boundary segmentation, and reduces the redundant information of the sequential detection downlink behavior example; adopting GCN to take the divided segments as graph nodes, and explicitly modeling the similarity relation of the segments; intermediate representations of IEC loss, surveillance video features are proposed. And increasing the characteristic distance between the foreground and the background, reducing the characteristic distance between similar categories and increasing the characteristic distance between different categories, and then performing threshold fusion to obtain a behavior proposal to ensure the integrity and the independence of behavior examples.
Adopting a complementary idea, to address the loss of video behavior information during feature learning and relational reasoning, the invention adds a global node to a complementary learning layer; the learned features are concatenated according to temporal continuity and measured for similarity against the global node, ensuring the integrity of video information and the accuracy of behavior recognition.
Drawings
FIG. 1 is an illustration of a prior art weakly supervised behavioral localization problem;
FIG. 2 is an exemplary diagram of an uncut video;
FIG. 3 is a diagram of the overall network architecture of the video temporal behavior detection method based on weakly supervised learning according to the present invention;
FIG. 4 is a schematic diagram of the complementary countermeasure concept of the present invention;
FIG. 5 shows the canchor regression process of the present invention;
FIG. 6 is a schematic diagram of the GCN structure of the present invention;
FIG. 7 is a diagram showing the effect of the present invention.
Detailed Description
The general idea of the invention is as follows:
In the absence of fine-grained temporal boundary annotation for untrimmed video, it becomes very difficult to detect complete and accurate behavior instances. Therefore, the extracted spatio-temporal features are input into the boundary regression layer to finely segment the segment-level behavior boundaries, which reduces redundant information in the behavior instances and guarantees independence of their content. The segmented features serve as nodes of a graph convolutional network (GCN), which infers the local correlation between each node and its neighbors and generates the node's distinctive feature; the IEC loss supervises the video feature representation (an untrimmed video contains not only behavior instances but also highly similar context — the background and unwanted behaviors are collectively referred to as context; the foreground and background are representative sums of intermediate features over video time periods).
Considering that behavior instances in an untrimmed video vary in length — as short as a few seconds or as long as an hour — the series of convolution and pooling operations in feature learning may lose behavior information, which in turn causes learning errors and affects the detection result. To ensure that no information of the whole video is lost, the invention adds a global node to a complementary learning layer: the features produced by feature learning and relational reasoning are concatenated along the time dimension and measured for similarity against the global node; if the similarity is within a certain threshold range, the overall information has not been lost, ensuring the integrity of video information and the accuracy of behavior recognition.
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The overall framework of the technical scheme of the invention is shown in Fig. 3. First, I3D is used for spatio-temporal feature extraction, and the features are input to the boundary regression layer; as shown in Fig. 3(a), three identical temporal convolution blocks (kernel size 3, stride 1, padding 1, each followed by a BN layer and a ReLU layer) are stacked first, and with temporal convolution filtering and iterative regression training, the predicted segment-level behavior boundaries are finely segmented, enhancing the independence of behavior-instance content. The segmented segment features are then used as GCN nodes for relational reasoning, as shown in Fig. 3(b), and behavior proposals are obtained through threshold fusion. Finally, a set of category confidences is obtained through a multiple-instance learning classifier, and its mAP is used as the performance evaluation index. The complementary-adversarial idea of the invention is shown in Fig. 4: viewed from top to bottom, mainly under the guidance of the global node (basic information of the whole video, providing a global view), the relations and distinctions between segments are learned through the adversarial mechanism while redundant information of the temporal behavior is removed, ensuring the relative independence of the behavior proposals.
The respective sections will be described in detail next.
(1) Feature extraction network
To prevent memory consumption from degrading model computation, the method first divides the untrimmed video into T consecutive, equal-length, non-overlapping segments by uniform sampling, then passes the divided segments and the untrimmed video to the I3D network for spatio-temporal feature extraction.
The two-stream inflated 3D convolutional network (I3D) takes a state-of-the-art image classification model as its basic structure, extending the 2D convolution and pooling kernels of image classification to 3D so as to learn spatio-temporal features seamlessly. The backbone network consists of a spatial stream that accepts RGB input and a temporal stream that accepts optical-flow input. For each stream, Inception components and batch normalization are used, as shown in the Fig. 5 structure. Finally, the spatial and temporal features extracted for each segment are concatenated into a 2048-channel feature vector.
Specifically, for each video V and for each segment t, the spatial stream (RGB) and temporal stream (Flow) respectively encode static scene features F_RGB^i(t) ∈ R^1024 and motion features F_flow^i(t) ∈ R^1024. Through a concatenation operation, the static scene feature F_RGB^i(t) and motion feature F_flow^i(t) are combined into the clip-level feature F_c^i(t) = [F_RGB^i(t), F_flow^i(t)]. Finally, all clip-level features are stacked to form the video's pre-trained features F_c ∈ R^(T×2048). Similarly, for each video V, its video-level features are extracted: the spatial stream (RGB) and temporal stream (Flow) respectively encode static scene features F_RGB(t) ∈ R^1024 and motion features F_flow(t) ∈ R^1024, which are concatenated into the video-level spatio-temporal feature F_g(t) = [F_RGB(t), F_flow(t)] ∈ R^(1×2048); then Global Average Pooling (GAP), X_g = Pool(F_g(t)), preserves the complete behavior features of the video.
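As an illustrative sketch (not part of the patent's implementation), the concatenation of per-segment RGB and Flow features into clip-level vectors and the global average pooling that forms the global node can be written in plain Python; the toy feature dimensions here are assumptions for brevity, standing in for the 1024-dim I3D outputs:

```python
def build_segment_features(rgb_feats, flow_feats):
    """Concatenate per-segment RGB and Flow features into clip-level vectors,
    mirroring F_c(t) = [F_RGB(t), F_flow(t)]."""
    assert len(rgb_feats) == len(flow_feats)
    return [r + f for r, f in zip(rgb_feats, flow_feats)]  # list concat per segment

def global_average_pool(feature_seq):
    """Average a sequence of segment-level vectors into one global vector,
    mirroring X_g = Pool(F_g(t))."""
    t, dim = len(feature_seq), len(feature_seq[0])
    return [sum(v[d] for v in feature_seq) / t for d in range(dim)]

# Toy example: T = 3 segments, 2-dim "RGB" and 2-dim "Flow" features.
rgb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flow = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
F_c = build_segment_features(rgb, flow)   # T x 4 clip-level features
X_g = global_average_pool(F_c)            # 4-dim global node
```

In the actual network the per-segment dimension is 2048 (1024 RGB + 1024 Flow) and the stacked features form F_c ∈ R^(T×2048).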
(2) Boundary regression layer
The spatio-temporal features extracted by I3D are input to the boundary regression layer. The adopted method is coordinate parameterization; to fit the I3D feature-extraction pattern and reduce computational cost, the invention adopts segment-level coordinate regression, as shown in figure 6. First, three identical temporal convolution blocks are stacked; the features and motion intensity at each temporal position can be viewed as a function of the temporal convolution filtering.
Each temporal convolution block has 2048 convolution kernels with kernel size 3, stride 1, and padding 1. After each temporal convolution block there is a BN layer and a ReLU layer.
Xc=RELU(BN(Conv(Xc,θ))) (1)
Xc represents the video segment features, Conv the convolution operation, BN the batch normalization operation, and ReLU the non-linear activation.
Finally, one more temporal convolution block is added to output the boundary regression values for fine segmentation.
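The temporal convolution with kernel size 3, stride 1, padding 1 followed by ReLU described above can be sketched for a single channel in plain Python (a minimal illustration only; the real layer has 2048 kernels plus batch normalization, which are omitted here):

```python
def temporal_conv1d(x, kernel, bias=0.0):
    """1D temporal convolution over a single channel.
    x: list of scalars over time; kernel: length-3 weights; padding=1, stride=1."""
    padded = [0.0] + x + [0.0]  # zero padding of 1 on each side
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(x))]

def relu(x):
    """Element-wise non-linear activation."""
    return [max(0.0, v) for v in x]

signal = [1.0, -2.0, 3.0, -4.0]
out = relu(temporal_conv1d(signal, [0.0, 1.0, 0.0]))  # identity kernel, then ReLU
```

With the identity kernel, output length equals input length (as guaranteed by padding 1 with kernel 3), and ReLU zeroes the negative positions.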
The present invention initializes M canchors (clip anchors) at different scales based on the typical behavior duration of a given dataset (e.g., initialization scales [1, 2, 4, 8, 16, 32] for the THUMOS'14 dataset and [16, 32, 64, 128, 256] for the ActivityNet1.2 dataset). At the time position s_x of each segment, M canchors of different scales are predicted. An example of a canchor is shown in fig. 5.
The following process is performed iteratively. ① Among its M canchor predictions, the behavior-segment center position c_x = s_x + w_a·t_c and temporal length w = w_a·exp(t_w) are computed through parameterized offsets, where t_c indicates how to move the canchor's center position and t_w indicates how to scale the canchor's length. ② If the behavior value at a time position is lower than 0.1 (0.1 indicates that the position is most likely a non-action background class — background features exist but no behavior; a somewhat larger value may indicate behavior of small motion amplitude with noise interference, so 0.1 fits best), all predictions corresponding to that time position are discarded. ③ For each remaining position, only the prediction with the least loss is retained, i.e., the most probable canchor. ④ Among the retained predictions, those with loss larger than a certain threshold are deleted, and finally Non-Maximum Suppression (NMS) is performed on all retained segments to obtain accurate boundaries. The last temporal convolution block yields the boundary x_1 = c_x − w/2, x_2 = c_x + w/2 for precise segmentation of the segments.
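A minimal sketch of the canchor decoding and the final NMS step described above (the loss-based filtering of the intermediate steps is omitted, and the function names are illustrative, not the patent's):

```python
import math

def decode_anchor(s_x, w_a, t_c, t_w):
    """Parameterized offset regression: shift the center, rescale the width."""
    c_x = s_x + w_a * t_c          # behavior segment center position
    w = w_a * math.exp(t_w)        # behavior segment temporal length
    return (c_x - w / 2.0, c_x + w / 2.0)   # boundaries (x1, x2)

def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(segments, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep high-score segments, drop overlaps."""
    order = sorted(range(len(segments)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) < thresh for k in keep):
            keep.append(i)
    return [segments[i] for i in keep]
```

For example, an anchor of width 4 at position 10 with offsets t_c = 0.5, t_w = 0 decodes to the segment (10.0, 14.0).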
(3) Explicitly modeling similarity relations between behavior segments and obtaining "behavior proposals" through threshold fusion
The GCN is a graph-based network for modeling inter-segment similarity, providing spatial topology and semantic appearance features. It can infer the local correlation between related neighbor nodes and aggregate information from the neighborhood, enhancing the distinctiveness of each node's own feature. The GCN takes the segments finely divided by the canchors as nodes for graph reasoning, explicitly models the similarity relations between video segments, and then applies threshold fusion to obtain "behavior proposals" (a "behavior proposal" refers to an independent and complete temporal behavior in the untrimmed video).
For the GCN input Xc:
Z, of dimension T×dout, is the output of the graph convolution; W, of dimension 2048×dout, is a weight matrix learned by back-propagation; and the softmax-normalized affinity matrix of G has dimension T×T.
To compute G, the invention designs a δ function to incorporate the similarity and dissimilarity of appearance and motion features between segments into the weight-learning process, and the GCN aggregates information within each node's neighborhood to generate the node's distinctive feature, so that nodes more similar under δ have higher weights.
δ(x)=wx+b (3)
w and b represent the learned weight and bias term, and x represents a segment feature in the set Xc. The similarity measure is cosine similarity, defined as sim(x, y) = (x·y)/(‖x‖‖y‖) (4).
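The cosine similarity of formula (4) and one softmax-normalized graph propagation step can be sketched as follows; the learned weight matrix W and the δ(x) = wx + b transform are replaced by the identity here, which is a simplifying assumption made only for brevity:

```python
import math

def cosine_sim(x, y):
    """Formula (4): sim(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def softmax(row):
    """Numerically stable softmax over one affinity row."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def graph_propagate(X):
    """One graph-convolution step: each node's feature becomes a
    similarity-weighted combination of all segment features (rows of the
    softmax-normalized affinity matrix act as the combination weights)."""
    G = [softmax([cosine_sim(xi, xj) for xj in X]) for xi in X]
    return [[sum(G[i][j] * X[j][d] for j in range(len(X)))
             for d in range(len(X[0]))] for i in range(len(X))]
```

Each output row is a convex combination of the input segment features, so similar segments are pulled together in feature space, as the description states.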
firstly, in the relation reasoning process, a dynamic fusion threshold value (for weak supervision, the dynamic fusion threshold value is set classification, the mean value of confidence degrees of all classes is used as a performance evaluation index, one classification confidence degree is called a class score, a plurality of classes are class activation sequences CAS) is obtained through a formula (5), then, an IEC loss supervision characteristic expression is designed, such as a formula (6), and a behavior proposal is obtained according to threshold fusion. And finally, by MIL constraint, mAP is used as a performance evaluation index.
IEC: the foreground and background are representative sums of the features in the middle of the video segment. The invention increases the characteristic distance between the foreground and the background, and solves the problem of confusion of action context; and the characteristic distance between different classes is increased, the characteristic distance between nodes in the same class is reduced, and the integrity and the independence of the behavior examples are ensured.
CAS=MLP(Xc,θcas) (5)
where T is the total number of video segments, t is the segment index, F_t^c represents the feature of time period t, the class confidence of category i at time t also appears in the loss, j and k index two video segments, f_i^j is a foreground feature, and f_i^k is a background feature; the similarity is measured by cosine distance, with a similarity offset of 0.5 — experiments with interval-valued search and two-sided refinement show that an offset of 0.5 works best and increases the non-linearity of the function.
The IEC loss is designed to supervise the feature representation; together with the boundary regression layer, it forms an adversarial mechanism that ensures the integrity and independence of behavior instances.
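The threshold fusion over the class activation sequence described above can be sketched as follows — a simplified reading in which consecutive segments whose class score exceeds the threshold are merged into one proposal (the actual dynamic threshold comes from formula (5); this fixed-threshold version is an assumption for illustration):

```python
def threshold_fusion(cas, thresh):
    """Merge consecutive segments whose class score >= thresh into
    (start, end) behavior proposals; end index is exclusive."""
    proposals, start = [], None
    for t, score in enumerate(cas):
        if score >= thresh and start is None:
            start = t                       # open a new proposal
        elif score < thresh and start is not None:
            proposals.append((start, t))    # close the current proposal
            start = None
    if start is not None:                   # proposal runs to the video end
        proposals.append((start, len(cas)))
    return proposals
```

For a CAS of [0.1, 0.8, 0.9, 0.2, 0.7] with threshold 0.5 this yields the two proposals (1, 3) and (4, 5).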
G is essentially a weighted combination, according to similarity relations, of each segment feature x in the video's segment-feature set Xc, corresponding to a regular fully connected layer without bias terms. A complementary learning layer is introduced before the graph layer feeds the classification layer; its main purpose is to verify whether behavior information has been lost during the sequence of feature learning and relational reasoning, ensuring the integrity of video information and preventing behavior recognition errors. The process is as follows: the invention concatenates the temporally continuous segment features output by the GCN, concat(Xc), and measures their similarity with the global node X_g using formula (4). If the value is smaller than the threshold (experiments show the effect is best when the threshold θ = 0.6), too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation, training, and learning; otherwise the detection result is output through the classifier. By designing the complementary learning layer, the integrity of video information is ensured and behavior recognition errors caused by feature loss are prevented.
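The complementary learning layer's check can be sketched as follows; in this toy example the concatenated segment features and the global node are given matching lengths, an assumption made only so the cosine measure of formula (4) applies directly:

```python
import math

def cosine_sim(x, y):
    """Formula (4): cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def information_preserved(segment_feats, global_node, theta=0.6):
    """Concatenate segment features in temporal order and compare with the
    global node; below theta the video must be re-segmented and re-trained."""
    concat = [v for f in segment_feats for v in f]  # concat(Xc)
    return cosine_sim(concat, global_node) >= theta
```

When the result is False, the pipeline feeds back to re-segmentation; when True, the features proceed to the classifier.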
To optimize the model so that it achieves better performance and improved recognition accuracy, the invention designs a total objective function:
Ltol = λ1LMIL + λ2LGS + λ3LIEC (7)
where λ1, λ2, λ3 are learned weighting parameters, Ltol is the total objective function, LMIL the multiple-instance learning loss, LGS the graph sparsity loss, and LIEC the internal-external contrast loss.
Multiple-instance learning loss: the method directly maps the weakly supervised video temporal behavior detection problem into a multiple-instance learning task. The video is divided into multiple time periods, each period is classified, the segment-level predictions are merged, and multiple-instance learning yields the final video-level classification.
In the loss, the predicted confidence that segment j is of category i is compared with the ground-truth confidence that segment j is of category i; n represents the number of behavior instances in the video, and nc represents the total number of categories.
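The MIL aggregation from segment-level to video-level class scores can be sketched with top-k pooling; the patent does not fix the exact pooling operator, so top-k averaging here is an assumption (a common MIL choice), not the patent's specified method:

```python
def video_level_scores(segment_scores, k=2):
    """Aggregate per-segment class scores (T x n_c) into one video-level
    score per class by averaging the top-k segment scores of that class."""
    n_classes = len(segment_scores[0])
    video = []
    for c in range(n_classes):
        col = sorted((s[c] for s in segment_scores), reverse=True)
        video.append(sum(col[:k]) / min(k, len(col)))
    return video

# Toy example: 3 segments, 2 classes.
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
v = video_level_scores(scores, k=2)  # one confidence per class
```

The video-level scores can then be compared with the video-level labels to form the MIL classification loss.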
GCN graph sparsity loss: to ensure the sparsity of the graph and speed up network training. In summary, G can pull similar segment features x together and push dissimilar segment features x apart. A G with near-uniform edge weights can be difficult to train, because the distinctiveness of the features x is averaged out. To prevent this, the invention adds the LGS loss on G to guarantee the edge sparsity of G:
t is the total number of video segments, i, j is the intra-video segment index, Gi,jThe similarity relationship between the segment i and the segment j in the video is shown.
The internal-external contrast loss is designed to supervise the video feature representation: increasing the feature distance between foreground and background solves the behavior-context confusion problem, and reducing the feature distance between similar fused segments ensures the independence and completeness of behavior instances.
This differs from other methods, which focus only on highly discriminative segments and are constrained by temporal proximity; lacking a global view, they suffer from insufficient modeling information and poor detection performance.
The present invention uses the GCN to explicitly model similarity relations between video segments. In summary, the GCN treats each input element as a node in a graph with weighted edges. Through several operations, the feature of each node changes from X to Z (as shown in fig. 6). However, the connection relation between nodes, i.e., G (the affinity matrix), is shared no matter how many layers lie in between. Node edges are weighted by their similarity; in this way, related time segments can be pulled together while unrelated ones are pushed apart in the feature space, achieving instance clustering.
Video context is a key clue for detecting behaviors. Segments far from a behavior but containing similar semantic content can provide indicative cues for detecting it. For example, background frames of a sports field indicate what may happen on the field (e.g., "long jump") rather than elsewhere (e.g., "shopping"), because the video context is adaptive.
FIG. 7 shows qualitative results of video temporal behavior detection, where Gt denotes the ground truth, Pred the detection result of the invention, and No-F the result without GCN segment modeling. The outer boxes show that the method of the invention can locate a wider range of behaviors, learns a more general behavior localization model, and can locate more behavior instances.
The performance of the weakly supervised temporal behavior detection of the present invention was evaluated on the THUMOS'14 dataset using mAP at different temporal overlap thresholds (tIoU) as the metric, denoted mAP@tIoU, with tIoU set to [0.1, 0.2, 0.3, 0.4, 0.5], following the standard evaluation protocol, and was compared against several recent weakly supervised methods, as follows.
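The tIoU criterion underlying this metric is straightforward to sketch. The snippet below (an illustrative simplification, not the official THUMOS'14 evaluation code) computes temporal IoU between predicted and ground-truth intervals and a simple recall at each threshold; full mAP would additionally rank predictions by confidence and average precision per class:

```python
def t_iou(a, b):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(preds, gts, thresh):
    """Fraction of ground-truth instances matched by some prediction
    at the given tIoU threshold (a simplified stand-in for mAP)."""
    hit = sum(any(t_iou(p, g) >= thresh for p in preds) for g in gts)
    return hit / len(gts)

# Toy intervals: two ground-truth behaviors, two predictions
gts = [(2.0, 6.0), (10.0, 14.0)]
preds = [(2.5, 6.5), (11.0, 13.0)]
for t in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print(t, recall_at_tiou(preds, gts, t))
```

At tIoU 0.5 both toy predictions still count as matches, which illustrates why performance reported at higher thresholds is the stricter figure.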
TABLE 1 test results on THUMOS' 14 test set
As shown in table 1, the proposed method achieves good results under weak video-level labels, improving over the compared methods by 1.47 percentage points on average, with a maximum single improvement of 2.28 percentage points.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention to other forms. Any person skilled in the art may use the technical content disclosed above to make equivalent embodiments with equivalent changes; any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical essence of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (7)
1. A video time sequence behavior detection method based on weak supervised learning is characterized by comprising the following steps:
step A, extracting spatio-temporal features from an untrimmed video through a two-stream inflated convolutional network, and inputting the extracted features into a boundary regression layer: first stacking three identical temporal convolution blocks that perform convolutional filtering along time, each temporal convolution block having 2048 convolution kernels, one BN layer and one ReLU layer, and finally adding one temporal convolution block that outputs a boundary regression value for fine segment-level refinement of behavior boundaries;
step B, using the segmented-segment features as nodes of a graph convolutional network for relation reasoning; designing an internal-external contrast loss for category segment fusion, supervising the intermediate representation of the video features, increasing the feature distances between foreground and background and between different categories, and fusing with a threshold to obtain behavior proposals;
and step C, obtaining the confidence of the set categories through a multi-instance learning classifier, and taking the mean of the category confidences as the performance evaluation index.
2. The weak supervised learning based video temporal behavior detection method according to claim 1, wherein: when the relation reasoning is carried out in step B, the segment features are input into the graph convolutional network for relation learning of the video segment features, and the obtained segment features are output as Z = Ĝ·Xc·W, where W is a weight matrix, Ĝ is the affinity matrix G normalized by softmax, Xc represents the set of video segments, and G is the affinity matrix of Xc; in order to compute G, a δ function is designed that combines the similarity and dissimilarity of appearance and motion features among the segments into the learning of the weights; the nodes of the graph convolutional network aggregate information from their neighborhood to generate their own unique features, so that nodes with more similar δ have higher weight, δ(x) = wx + b, where δ(x) represents the feature learning function, w and b represent the learned weight and bias term, and x represents a segment feature in Xc.
3. The weak supervised learning based video temporal behavior detection method according to claim 2, wherein: in step B, the segment features output by the graph convolutional network over continuous time are concatenated, and a similarity measure is computed between them and the global node; a set threshold determines whether behavior information has been lost: if the similarity is below the threshold, too much behavior information has been lost because the segments into which the video was divided are too large or too small, and this is fed back for re-segmentation training and learning; otherwise the result is output through the classifier.
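The feedback check in claim 3 can be illustrated with a short sketch. This is a hypothetical rendering under assumed choices (cosine similarity, mean feature as the global node, threshold 0.5), not the claimed implementation:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def check_segmentation(seg_feats, threshold=0.5):
    """Compare each segment feature to the global node (here: the mean
    of all segment features).  If any similarity falls below the
    threshold, flag the video for re-segmentation; otherwise pass the
    features on to the classifier."""
    global_node = seg_feats.mean(axis=0)
    sims = [cosine(f, global_node) for f in seg_feats]
    decision = "resegment" if min(sims) < threshold else "classify"
    return decision, sims

feats = np.ones((4, 8))              # toy features, all identical
decision, sims = check_segmentation(feats)
print(decision)                      # classify (all similarities are 1.0)
```

Heterogeneous segment features would drive some similarity below the threshold and trigger the re-segmentation path instead.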
4. The weak supervised learning based video temporal behavior detection method according to claim 1, wherein: in step B, the internal-external contrast loss of category segment fusion used to supervise the features is expressed as follows:
where T is the total number of segments in each video, t is the video segment index, F_t^c represents the spatio-temporal feature of time period t, a confidence term gives the confidence of category i at time period t, j and k index two video segments, f_i^j is a foreground feature, the paired term is a background feature, the distance used is the cosine distance, and 0.5 represents the similarity offset between segments.
5. The weak supervised learning based video temporal behavior detection method according to claim 1, wherein: in step B, threshold fusion is performed by the formula CAS = MLP(Xc, θcas), where CAS is the class activation sequence representing the confidence of the set categories; MLP denotes mapping the video segment features to the action category space by the multi-layer perceptron, obtaining classification scores of the behaviors over time; θcas represents the trainable parameters of the class activation sequence, and Xc represents the set of video segment features.
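A minimal sketch of the CAS = MLP(Xc, θcas) mapping follows; the two-layer perceptron, the softmax over classes, and mean pooling over time are illustrative assumptions standing in for the claimed multi-layer perceptron and multi-instance pooling:

```python
import numpy as np

def mlp(X, W1, b1, W2, b2):
    """Two-layer perceptron mapping segment features to class logits."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

def class_activation_sequence(Xc, params):
    """CAS = MLP(Xc): per-segment scores over action classes.
    The video-level confidence is pooled over time (here, a simple mean,
    standing in for multi-instance pooling)."""
    logits = mlp(Xc, *params)                       # (T, num_classes)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    cas = e / e.sum(axis=1, keepdims=True)          # per-segment class confidence
    return cas, cas.mean(axis=0)                    # CAS and video-level score

rng = np.random.default_rng(2)
T, D, C = 10, 16, 5                                 # segments, feature dim, classes
Xc = rng.standard_normal((T, D))
params = (rng.standard_normal((D, 32)), np.zeros(32),
          rng.standard_normal((32, C)), np.zeros(C))
cas, video_score = class_activation_sequence(Xc, params)
print(cas.shape, video_score.shape)                 # (10, 5) (5,)
```

Thresholding the per-segment rows of the CAS then yields the candidate behavior proposals that the fusion step operates on.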
6. The weak supervised learning based video temporal behavior detection method according to claim 2, wherein:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111534859.7A CN114359790A (en) | 2021-12-15 | 2021-12-15 | Video time sequence behavior detection method based on weak supervised learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111534859.7A CN114359790A (en) | 2021-12-15 | 2021-12-15 | Video time sequence behavior detection method based on weak supervised learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114359790A true CN114359790A (en) | 2022-04-15 |
Family
ID=81099261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111534859.7A Pending CN114359790A (en) | 2021-12-15 | 2021-12-15 | Video time sequence behavior detection method based on weak supervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114359790A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114842402A (en) * | 2022-05-26 | 2022-08-02 | 重庆大学 | Weakly supervised time sequence behavior positioning method based on counterstudy |
CN114842402B (en) * | 2022-05-26 | 2024-05-31 | 重庆大学 | Weak supervision time sequence behavior positioning method based on countermeasure learning |
CN116226443A (en) * | 2023-05-11 | 2023-06-06 | 山东建筑大学 | Weak supervision video clip positioning method and system based on large-scale video corpus |
CN116226443B (en) * | 2023-05-11 | 2023-07-21 | 山东建筑大学 | Weak supervision video clip positioning method and system based on large-scale video corpus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ramachandra et al. | A survey of single-scene video anomaly detection | |
CN108805170B (en) | Forming data sets for fully supervised learning | |
Wei et al. | Boosting deep attribute learning via support vector regression for fast moving crowd counting | |
Wang et al. | Correspondence-free activity analysis and scene modeling in multiple camera views | |
Zhang et al. | Mining semantic context information for intelligent video surveillance of traffic scenes | |
US8660368B2 (en) | Anomalous pattern discovery | |
CN101482923B (en) | Human body target detection and sexuality recognition method in video monitoring | |
US11640714B2 (en) | Video panoptic segmentation | |
CN111767847B (en) | Pedestrian multi-target tracking method integrating target detection and association | |
CN114359790A (en) | Video time sequence behavior detection method based on weak supervised learning | |
CN108133172A (en) | Method, the analysis method of vehicle flowrate and the device that Moving Objects are classified in video | |
Wu et al. | Two stage shot boundary detection via feature fusion and spatial-temporal convolutional neural networks | |
CN110378911B (en) | Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier | |
Luo et al. | Traffic analytics with low-frame-rate videos | |
Bennett et al. | Enhanced tracking and recognition of moving objects by reasoning about spatio-temporal continuity | |
Maag et al. | Two video data sets for tracking and retrieval of out of distribution objects | |
Qin et al. | Application of video scene semantic recognition technology in smart video | |
KR102110375B1 (en) | Video watch method based on transfer of learning | |
Pillai et al. | Transformer based self-context aware prediction for few-shot anomaly detection in videos | |
Tang et al. | Graph-based motion prediction for abnormal action detection | |
Yang et al. | Video anomaly detection for surveillance based on effective frame area | |
CN115187884A (en) | High-altitude parabolic identification method and device, electronic equipment and storage medium | |
Ahmed et al. | Localization of region of interest in surveillance scene | |
Veluchamy et al. | Detection and localization of abnormalities in surveillance video using timerider-based neural network | |
Yao et al. | Weakly supervised graph learning for action recognition in untrimmed video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||