CN114359790A - Video time sequence behavior detection method based on weak supervised learning - Google Patents

Video time sequence behavior detection method based on weak supervised learning

Info

Publication number
CN114359790A
CN114359790A
Authority
CN
China
Prior art keywords
video
behavior
segment
time
learning
Prior art date
Legal status
Pending
Application number
CN202111534859.7A
Other languages
Chinese (zh)
Inventor
闫春娟
王静
王传旭
Current Assignee
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202111534859.7A priority Critical patent/CN114359790A/en
Publication of CN114359790A publication Critical patent/CN114359790A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a video temporal behavior detection method based on weakly supervised learning. Adopting an adversarial idea, a refinement (boundary regression) layer is added to segment behavior boundaries at the clip level, reducing the redundant information of behavior instances in temporal detection. A graph convolutional network (GCN) explicitly models the similarity relations between clips, and an internal-external contrast loss of class-clip fusion is proposed to supervise the intermediate representation of the video features; by increasing the feature distance between foreground and background and reducing the feature distance within the same class, the context-confusion problem is solved, and behavior proposals are obtained by threshold fusion, achieving structural integrity and content independence of the behavior instances. Adopting a complementary idea, to address the loss of video information during feature learning and relation reasoning, the invention proposes adding a global node in a complementary learning layer; the learned features are concatenated according to temporal continuity and compared with the global node by a similarity measure, ensuring the integrity of the video information and the accuracy of behavior recognition.

Description

Video time sequence behavior detection method based on weak supervised learning
Technical Field
The invention relates to a video behavior localization method, and in particular to a video temporal behavior detection method based on weakly supervised learning.
Background
With the rapid growth of electronic capture devices and video data, localizing temporal behaviors in video requires a large amount of annotation for training and learning, and accurate temporal boundary annotation is extremely expensive and error-prone, which greatly limits the application of temporal behavior detection algorithms; weakly supervised temporal behavior localization has emerged in response. Weakly supervised behavior localization uses only video-level labels during training, which further reduces the waste of human resources and time as well as annotation errors, and offers good flexibility.
Current weakly supervised behavior localization methods fall into two main categories. One treats weakly supervised temporal behavior localization as a video recognition task, introducing a foreground-background separating attention mechanism to construct video-level features and then applying a behavior classifier to recognize the video. The other treats it as a multiple-instance learning (MIL) problem: the whole un-clipped video is regarded as a bag of instances, the bag is divided into multiple time segments, each segment is classified, the segment-level predictions are merged, and MIL is used to obtain the final video-level classification. Although existing methods achieve a certain effect, two problems at the present stage remain unsolved. (1) The behavior-integrity modeling problem: predicting a complete behavior becomes exceptionally complex under the weakly supervised setting. As shown in fig. 1(a), the interval marked Gt represents the ground-truth behavior range and the Pred interval represents the model's prediction range; a complete swimming behavior is predicted as several smaller behavior intervals, and these sections cannot be regarded as one complete whole. (2) Action-context confusion: the problem of how to distinguish behavior from highly relevant context using only video-level labels. A video-level classifier learns the correlation between videos with the same label, as shown in fig. 1(b), which covers not only the common behavior but also the closely related context background; the model cannot separate the behavior from the context, resulting in erroneous predictions.
For these problems, existing solutions include random erasure, class-agnostic attention modeling, discriminative feature learning and the like. These methods either focus only on highly discriminative segments and ignore segments with low discriminability; or provide training supervision only through feature similarity without modeling feature relations for prediction; or leave redundant information in the fused behavior instances because of the segmentation strategy. They also lack a process for verifying the integrity of the video information, so behavior information lost during the series of feature-learning and relation-reasoning steps after video division leads to recognition errors, and the models therefore cannot achieve good detection performance.
Disclosure of Invention
The invention aims to solve the problem of how to accurately segment the different behavior instances and the background in a video when the behavior instances in a long video have no start/end boundary annotations, so as to realize temporal behavior detection for long videos.
The invention is realized by adopting the following technical scheme: a video time sequence behavior detection method based on weak supervised learning is characterized by comprising the following steps:
step A, performing spatio-temporal feature extraction on an un-clipped video through a two-stream inflated 3D convolutional network (I3D), and inputting the extracted features into a boundary regression layer for fine segmentation of clip-level boundaries;
inputting the extracted features into the boundary regression layer includes: stacking three identical temporal convolution blocks for temporal convolution filtering, each temporal convolution block having 2048 convolution kernels followed by a BN layer and a ReLU layer, and finally adding one more temporal convolution block to output boundary regression values for the fine segmentation;
step B, taking the segmented clip features as the nodes of a graph convolutional network (GCN) for relation reasoning, designing an Internal-External Contrast (IEC) loss of class-clip fusion to supervise the intermediate representation of the video features, increasing the feature distances between foreground and background and between different classes, and obtaining a behavior proposal by threshold fusion;
and step C, obtaining the confidence of the set of classes through a multi-instance learning classifier, and taking its mAP as the performance evaluation index.
Further, when the relation reasoning is carried out in step B, the segment features are input into the graph convolutional network for video-segment feature-relation learning, and the obtained segment feature output is
Z = Ĝ X_c W,
where W is a weight matrix, Ĝ is the affinity matrix G normalized by softmax, and X_c denotes the set of video segments; G is essentially a weighted combination of the segment features in X_c according to their similarity. In order to compute G, a δ function is designed to incorporate the similarity and dissimilarity of the appearance and motion features between segments into the weight-learning process, and each node of the graph convolutional network aggregates information within its neighborhood to generate its own distinctive feature, so that nodes that are more similar under δ have higher weights, with δ(X) = wX + b, where w and b denote the learning weight and the bias term, δ(X) denotes the feature-learning function, and X denotes a segment feature in X_c.
Further, the features of the temporally continuous segments output by the GCN are concatenated and compared with a global node (the basic behavior information of the whole video, extracted and retained by I3D) using a similarity measure; whether behavior information has been lost is judged by a set threshold: if the similarity is smaller than the threshold, too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation training and learning; otherwise the result is output through the classifier.
Further, in step B, the IEC loss supervises the feature representation; in its expression, T is the total number of video segments, t is the segment index, F_t^c denotes the spatio-temporal feature of time period t, a confidence of category i is defined at each time t, j and k index two video segments, f_i^j denotes a foreground feature and its counterpart the background feature, and the cosine distance measures the distance between them; 0.5 denotes the similarity offset, whose value was found to give the best effect in experiments by taking values over intervals and refining on both sides. The added non-linearity improves the generalization ability of the function.
Further, in step B, threshold fusion is carried out by the formula CAS = MLP(X_c, θ_cas).
Further, an L_GS loss is added to G to guarantee the edge sparsity of G.
further, the similarity measure adopts cosine similarity, which is defined as follows:
Figure BDA0003412240840000033
compared with the prior art, the invention has the advantages and positive effects that:
the invention introduces the countermeasure idea, adds the boundary regression layer to segment-level behavior boundary segmentation, and reduces the redundant information of the sequential detection downlink behavior example; adopting GCN to take the divided segments as graph nodes, and explicitly modeling the similarity relation of the segments; intermediate representations of IEC loss, surveillance video features are proposed. And increasing the characteristic distance between the foreground and the background, reducing the characteristic distance between similar categories and increasing the characteristic distance between different categories, and then performing threshold fusion to obtain a behavior proposal to ensure the integrity and the independence of behavior examples.
According to the method, by means of complementary thought, aiming at the problem of loss of video behavior information in the processes of feature learning and relationship reasoning, the invention provides that a global node is added into a complementary learning layer, the learned features are cascaded according to time continuity, and similarity measurement is carried out on the global node, so that the integrity of the video information and the accuracy of behavior recognition are ensured.
Drawings
FIG. 1 is an illustration of a prior art weakly supervised behavioral localization problem;
FIG. 2 is an exemplary diagram of an un-clipped video;
FIG. 3 is a diagram of the overall network architecture of the video temporal behavior detection method based on weakly supervised learning according to the present invention;
FIG. 4 is a schematic diagram of the complementary countermeasure concept of the present invention;
FIG. 5 shows the canchor regression process of the present invention;
FIG. 6 is a schematic diagram of the GCN structure of the present invention;
FIG. 7 is a diagram showing the effect of the present invention.
Detailed Description
The general idea of the invention is as follows:
In the absence of fine-grained temporal boundary annotation for un-clipped video, detecting complete and accurate behavior instances becomes very difficult. Therefore, the extracted spatio-temporal features are input into the boundary regression layer to finely segment the clip-level behavior boundaries, which reduces the redundant information of the behavior instances and guarantees independence of their content. The segmented clips serve as the nodes of a graph convolutional network (GCN), which infers the local correlation between each node and its neighbor nodes and generates the node's own distinctive feature, while the IEC loss supervises the video feature representation (an un-clipped video contains not only behavior instances but also highly similar context, where the background and undesired behaviors are collectively referred to as context; the foreground and background are representative sums of the intermediate features over the video time periods).
Considering that the behavior instances in an un-clipped video vary in length, from as short as a few seconds to as long as an hour, behavior information can be lost during feature learning through the series of convolution and pooling operations, which in turn causes behavior-learning errors and affects the detection result. To ensure that the information of the whole video is not lost, the invention proposes adding a global node in a complementary learning layer: the features obtained after feature learning and relation reasoning are concatenated along the time dimension and compared with the global node by a similarity measure; if the similarity is within a certain threshold range, the overall information has not been lost, thereby ensuring the integrity of the video information and the accuracy of behavior recognition.
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The overall framework of the technical scheme of the invention is shown in figure 3. First, I3D is used for spatio-temporal feature extraction, and the features are input into the boundary regression layer, as shown in figure 3(a): three identical temporal convolution blocks come first, with kernel size 3, stride 1 and padding 1, each followed by a BN layer and a ReLU layer; with temporal convolution filtering and iterative regression training, the predicted clip-level behavior boundaries are finely segmented, enhancing the independence of the behavior-instance content. Then the segmented clip features serve as the nodes of the GCN for relation reasoning, as shown in figure 3(b), and the behavior proposal is obtained by threshold fusion. Finally, the confidence of the set of classes is obtained through a multi-instance learning classifier, and its mAP is taken as the performance evaluation index. The complementary-adversarial idea of the invention is shown in figure 4; viewed from top to bottom, mainly under the guidance of the global node (the basic information of the whole video, which has a global view), the adversarial mechanism learns the relations and distinctions between segments while removing redundant information of the temporal behavior, ensuring the relative independence of the behavior proposals.
The respective sections will be described in detail next.
(1) Feature extraction network
To prevent poor model computation caused by memory consumption, the method first divides the un-clipped video into T continuous, equal-length, non-overlapping segments by uniform sampling, and then feeds the divided segments together with the un-clipped video into the I3D network for spatio-temporal feature extraction.
The two-stream inflated 3D convolutional network (I3D) takes a state-of-the-art image-classification model as its basic structure and inflates the 2D convolution and pooling kernels of image classification to 3D so as to learn spatio-temporal features seamlessly. The backbone network consists of a spatial stream that accepts RGB input and a temporal stream that accepts optical-flow input. For each stream, Inception components and batch normalization are used, as shown in the structure of fig. 5. Finally, the spatial and temporal features extracted for each segment are concatenated into a feature vector of 2048 channels.
Specifically, for each video V_i and for each segment S_i(t), t = 1, …, T, the spatial stream (RGB) and the temporal stream (Flow) respectively encode the static scene feature F_i^RGB(t) ∈ R^1024 and the motion feature F_i^flow(t) ∈ R^1024. Through a concatenation operation, the static scene feature F_i^RGB(t) and the motion feature F_i^flow(t) are combined into the clip-level feature F_i^c(t) = [F_i^RGB(t), F_i^flow(t)]. Finally, all clip-level features are stacked to form the video pre-trained features F_c ∈ R^(T×2048). Similarly, for each video V its video-level features are extracted: the spatial stream (RGB) and the temporal stream (Flow) respectively encode the static scene feature F^RGB(t) ∈ R^1024 and the motion feature F^flow(t) ∈ R^1024; through concatenation they are combined into the video-level spatio-temporal feature F_g(t) = [F^RGB(t), F^flow(t)] ∈ R^(1×2048), and then global average pooling (GAP) X_g = Pool(F_g(t)) preserves the complete behavior features of the video.
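As a rough illustration of this feature assembly, a minimal PyTorch sketch is given below; it assumes the RGB and Flow streams of I3D have already produced 1024-dimensional per-segment features, the tensor names and random stand-ins are illustrative rather than part of the patent, and the global node is approximated here by pooling the clip-level features, whereas the patent pools separately extracted video-level features.

import torch

def assemble_features(rgb_feats: torch.Tensor, flow_feats: torch.Tensor):
    """rgb_feats, flow_feats: (T, 1024) per-segment stream features from I3D."""
    clip_feats = torch.cat([rgb_feats, flow_feats], dim=1)   # F_c: (T, 2048) clip-level features
    global_node = clip_feats.mean(dim=0)                     # X_g approximated by GAP over the clips
    return clip_feats, global_node

clip_feats, global_node = assemble_features(torch.randn(100, 1024), torch.randn(100, 1024))
print(clip_feats.shape, global_node.shape)                   # torch.Size([100, 2048]) torch.Size([2048])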
(2) Boundary regression layer
The spatio-temporal features extracted by I3D are input into the boundary regression layer. The adopted method is coordinate parameterization; to fit the I3D feature-extraction scheme and reduce computational cost, the invention adopts clip-level coordinate regression, as shown in figure 6. First, three identical temporal convolution blocks are stacked; the feature and motion intensity at each temporal position can be viewed as a function of the temporal convolution filtering.
Each temporal convolution block has 2048 convolution kernels, kernel size 3, stride 1 and padding 1, and is followed by a BN layer and a ReLU layer:
X_c = ReLU(BN(Conv(X_c, θ)))    (1)
where X_c denotes the video segment features, Conv the convolution operation, BN the batch-normalization operation, and ReLU the non-linear activation.
Finally, one more temporal convolution block is added to output the boundary regression values for fine segmentation.
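The description above pins down the block structure (2048 kernels, kernel size 3, stride 1, padding 1, each block followed by BN and ReLU, three blocks plus a final regression block); a minimal PyTorch sketch under those assumptions is given below, where the output of the final block, two values (t_c, t_w) per canchor scale and time step, is an assumption based on the regression targets described next.

import torch
import torch.nn as nn

class BoundaryRegressionLayer(nn.Module):
    """Three identical temporal convolution blocks (2048 kernels, k=3, s=1, p=1, each
    followed by BN and ReLU), plus a final temporal convolution block that outputs
    boundary regression values at every time step."""
    def __init__(self, in_channels=2048, num_scales=6):
        super().__init__()
        blocks = []
        for _ in range(3):
            blocks += [nn.Conv1d(in_channels, 2048, kernel_size=3, stride=1, padding=1),
                       nn.BatchNorm1d(2048), nn.ReLU(inplace=True)]
            in_channels = 2048
        self.blocks = nn.Sequential(*blocks)
        self.regress = nn.Conv1d(2048, num_scales * 2, kernel_size=3, stride=1, padding=1)

    def forward(self, x):                      # x: (B, 2048, T) stacked segment features
        return self.regress(self.blocks(x))    # (B, num_scales * 2, T) regression values

offsets = BoundaryRegressionLayer()(torch.randn(1, 2048, 100))
print(offsets.shape)                           # torch.Size([1, 12, 100])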
The present invention initializes M canchors (clip anchors) of different scales according to the typical behavior durations of the given dataset (e.g., the initialization scales are [1, 2, 4, 8, 16, 32] for the THUMOS'14 dataset and [16, 32, 64, 128, 256] for the ActivityNet1.2 dataset). At the time position s_x of each segment, canchors of M different scales are predicted. An example of a canchor is shown in fig. 5.
The following process is performed iteratively: (1) among its M canchor predictions, the behavior-segment centre position c_x = s_x + w_a · t_c and the temporal length w = w_a · exp(t_w) are computed from the parameterized offsets, where t_c indicates how to shift the centre position of the canchor and t_w indicates how to scale the length of the canchor. (2) If the behavior value at a temporal position is lower than 0.1, all predictions corresponding to that position are discarded (a value below 0.1 indicates that the position is most likely a non-action background class: background features exist but no behavior, whereas a larger value may correspond to a behavior with small motion amplitude or to noise interference, so 0.1 is the most suitable threshold). (3) For each remaining position, only the prediction with the smallest loss is kept, i.e. the most probable canchor. (4) For the kept prediction positions, the method deletes predictions whose loss exceeds a certain threshold, and finally applies non-maximum suppression (NMS) to all remaining segments to obtain accurate boxes. The last temporal convolution block yields the boundaries x_1 = c_x − w/2 and x_2 = c_x + w/2, which are used for precise segmentation of the clips.
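A rough sketch of this decoding procedure is given below, assuming per-position actionness scores and per-scale offsets (t_c, t_w) as inputs; the 0.1 actionness threshold and the anchor scales follow the text above, while the greedy one-dimensional NMS, its IoU threshold and the function name are illustrative, and step (3), which requires the training losses, is omitted.

import torch

def decode_canchors(actionness, offsets, anchor_widths, pos_thresh=0.1, nms_iou=0.5):
    """actionness: (T,) behavior values; offsets: (T, M, 2) with (t_c, t_w) per scale;
    anchor_widths: (M,) initial canchor scales. Returns kept (x1, x2, score) intervals."""
    proposals = []
    for sx in range(actionness.shape[0]):
        if actionness[sx] < pos_thresh:              # step (2): drop non-action positions
            continue
        for m, wa in enumerate(anchor_widths):
            tc, tw = offsets[sx, m]
            cx = sx + wa * tc                        # step (1): shifted centre position
            w = wa * torch.exp(tw)                   # step (1): scaled length
            proposals.append((float(cx - w / 2), float(cx + w / 2), float(actionness[sx])))
    proposals.sort(key=lambda p: p[2], reverse=True) # step (4): greedy non-maximum suppression
    kept = []
    for x1, x2, s in proposals:
        ious = [max(min(x2, k2) - max(x1, k1), 0.0) / (max(x2, k2) - min(x1, k1))
                for k1, k2, _ in kept]
        if all(iou < nms_iou for iou in ious):
            kept.append((x1, x2, s))
    return kept

kept = decode_canchors(torch.rand(100), torch.randn(100, 6, 2) * 0.1,
                       torch.tensor([1., 2., 4., 8., 16., 32.]))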
(3) Explicitly modeling the similarity relations between behavior segments and obtaining the "behavior proposal" by threshold fusion
The GCN is a graph-based network for modeling the similarity between segments, providing both spatial topology and semantic appearance features. It can infer the local correlation between related neighbor nodes and aggregate information from the neighborhood, enhancing the distinctiveness of each node's own features. The GCN takes the clips finely segmented by the canchors as nodes for graph reasoning, explicitly models the similarity relations between video segments, and then obtains the "behavior proposal" by threshold fusion (a "behavior proposal" refers to an independent and complete temporal behavior in the un-clipped video).
The GCN takes X_c as input and computes
Z = Ĝ X_c W    (2)
where Z, of dimension T × d_out, is the output of the graph convolution; W, of dimension 2048 × d_out, is a weight matrix learned by back-propagation; and Ĝ, of dimension T × T, is the affinity matrix G normalized by softmax.
In order to compute G, the invention designs a δ function that incorporates the similarity and dissimilarity of the appearance and motion features between segments into the weight-learning process; the GCN aggregates information within each node's neighborhood to generate the node's own distinctive feature, so that nodes that are more similar under δ receive higher weights:
δ(x) = wx + b    (3)
where w and b denote the learning weight and the bias term, and x denotes a segment feature in the set X_c. The similarity measure is the cosine similarity, defined as
cos(X_i, X_j) = (X_i · X_j) / (‖X_i‖ ‖X_j‖)    (4)
where X_i and X_j denote the features of segments i and j within a video.
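Putting formulas (2)-(4) together, a minimal sketch of one such graph layer could look as follows: a learned linear δ embedding, a softmax-normalized cosine affinity, and the propagation Z = Ĝ X_c W (the output dimension d_out and the module name are assumptions made for illustration).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentGCNLayer(nn.Module):
    """One graph-convolution step over segment features: G from the cosine similarity
    of delta-embedded segments, then Z = softmax(G) @ X_c @ W."""
    def __init__(self, in_dim=2048, out_dim=512):
        super().__init__()
        self.delta = nn.Linear(in_dim, in_dim)     # delta(x) = w x + b, formula (3)
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, xc):                         # xc: (T, 2048) segment features
        d = F.normalize(self.delta(xc), dim=1)
        G = d @ d.t()                              # cosine affinity, formula (4)
        G_hat = F.softmax(G, dim=1)                # row-wise softmax normalization
        Z = G_hat @ self.W(xc)                     # formula (2)
        return Z, G

Z, G = SegmentGCNLayer()(torch.randn(100, 2048))
print(Z.shape, G.shape)                            # torch.Size([100, 512]) torch.Size([100, 100])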
First, during the relation reasoning, a dynamic fusion threshold is obtained through formula (5) (under weak supervision this is the classification over the set of classes, with the mean of the confidences over all classes taken as the performance evaluation index; a single classification confidence is called a class score, and the scores over multiple classes form the class activation sequence CAS); then the IEC loss supervising the feature representation is designed, as in formula (6), and the behavior proposal is obtained by threshold fusion. Finally, under the MIL constraint, mAP is used as the performance evaluation index.
IEC: the foreground and background are representative sums of the features in the middle of the video segment. The invention increases the characteristic distance between the foreground and the background, and solves the problem of confusion of action context; and the characteristic distance between different classes is increased, the characteristic distance between nodes in the same class is reduced, and the integrity and the independence of the behavior examples are ensured.
CAS = MLP(X_c, θ_cas)    (5)
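Formula (5) maps each segment feature to class scores; a small sketch is given below, with the number of classes, the hidden width and the fusion threshold as assumed parameters.

import torch
import torch.nn as nn

class CASHead(nn.Module):
    """CAS = MLP(X_c, theta_cas): per-segment class activation scores, as in formula (5)."""
    def __init__(self, in_dim=512, num_classes=20):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, z):                     # z: (T, in_dim) segment features
        return self.mlp(z)                    # (T, num_classes) class activation sequence

cas = CASHead()(torch.randn(100, 512))
# Threshold fusion (sketch): keep the segments whose score for a class exceeds a threshold
# and merge temporally adjacent kept segments into a behavior proposal.
keep = (torch.sigmoid(cas[:, 3]) > 0.5).nonzero().squeeze(1)   # class index 3 and threshold 0.5 assumed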
In formula (6), T is the total number of video segments, t is the segment index, F_t^c denotes the feature of time period t, a confidence of class i is defined at each time t, j and k index two video segments, f_i^j is a foreground feature and its counterpart the background feature, and the cosine distance measures the distance between them; 0.5 denotes the similarity offset, whose value was found to give the best effect in experiments by taking values over intervals and refining on both sides, and it increases the non-linearity of the function.
The IEC loss is designed to supervise the feature representation; together with the boundary regression layer it forms an adversarial mechanism that guarantees the integrity and independence of the behavior instances.
G is essentially, for the set of segment features X_c of each video, a weighted combination of each segment feature x according to their similarity relation, and corresponds to a regular fully connected layer without a bias term. A complementary learning layer is introduced before the graph layer is passed to the classification layer; it is mainly used to verify whether behavior information has been lost during the series of feature-learning and relation-reasoning steps, ensuring the integrity of the video information and preventing behavior-recognition errors. The process is as follows: the invention concatenates the temporally continuous segment features output by the GCN, concat(X_c), and measures their similarity with the global node X_g using formula (4); if the value is smaller than the threshold (experiments show that the effect is best when the threshold θ = 0.6), too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation training and learning; otherwise the detection result is output through the classifier. By designing the complementary learning layer, the integrity of the video information can be ensured and behavior-recognition errors caused by feature loss can be prevented.
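A minimal sketch of this check follows; it assumes the concatenated GCN features are aggregated and projected back to the 2048-dimensional space of the global node X_g so that the cosine measure of formula (4) is well defined (the aggregation and projection are assumptions, as the text does not spell out how the dimensions are matched), and it uses the threshold θ = 0.6 stated above.

import torch
import torch.nn.functional as F

def complementary_check(z, global_node, proj, theta=0.6):
    """z: (T, d_out) temporally ordered GCN outputs; global_node: (2048,) GAP feature X_g.
    Returns True if enough behavior information is retained; False triggers re-segmentation."""
    pooled = z.mean(dim=0)                                   # aggregate the cascaded segment features
    mapped = proj(pooled)                                    # assumed projection d_out -> 2048
    sim = F.cosine_similarity(mapped, global_node, dim=0)    # similarity with the global node
    return bool(sim >= theta)

proj = torch.nn.Linear(512, 2048)
ok = complementary_check(torch.randn(100, 512), torch.randn(2048), proj)
print("keep segmentation" if ok else "re-segment and retrain")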
In order to optimize the model so that it achieves better performance and higher recognition accuracy, the invention designs a total objective function:
L_tol = λ1·L_MIL + λ2·L_GS + λ3·L_IEC    (7)
where λ1, λ2 and λ3 are learned weighting parameters, L_tol is the total objective function, L_MIL the multi-instance learning loss, L_GS the graph sparsity loss, and L_IEC the internal-external contrast loss.
Multi-instance learning loss: the method directly casts the weakly supervised video temporal behavior detection problem as a multi-instance learning task. The prediction is divided into multiple time periods, each time period is classified, the segment-level predictions are merged, and multi-instance learning is used to obtain the final video-level classification.
L_MIL = −(1/n) Σ_{j=1..n} Σ_{i=1..n_c} y_j^i · log(p_j^i)    (8)
where p_j^i is the predicted confidence that segment j belongs to category i, y_j^i is the ground-truth confidence that segment j belongs to category i, n denotes the number of behavior instances in the video, and n_c denotes the total number of categories.
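A sketch of such a multi-instance classification loss is given below; it assumes the segment-level predictions are merged by top-k mean pooling into video-level scores before a cross-entropy-style comparison with the video label (the pooling choice, the value of k and the binary-cross-entropy form are assumptions, since the text only states that segment-level predictions are merged into a video-level classification).

import torch
import torch.nn.functional as F

def mil_loss(cas, video_label, k=8):
    """cas: (T, C) class activation sequence; video_label: (C,) multi-hot video-level label.
    Merge segment-level predictions by top-k mean pooling over time, then compare
    the resulting video-level scores with the video-level label."""
    k = min(k, cas.shape[0])
    video_scores = cas.topk(k, dim=0).values.mean(dim=0)        # (C,) merged video-level logits
    return F.binary_cross_entropy_with_logits(video_scores, video_label)

label = torch.zeros(20)
label[3] = 1.0                                                  # the video contains class 3
loss_mil = mil_loss(torch.randn(100, 20), label)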
GCN graph sparsity loss: this term guarantees the sparsity of the graph and speeds up network training. In general, G should group similar segment features x together and push dissimilar segment features x apart; a graph G whose edge weights are all similar is difficult to train in the network, because the distinctiveness of the features x is averaged out. To prevent this, the invention adds an L_GS loss on G to guarantee the edge sparsity of G, where T is the total number of video segments, i and j are segment indices within the video, and G_i,j denotes the similarity relation between segment i and segment j in the video.
The internal-external contrast loss is designed to supervise the video feature representation: increasing the feature distance between foreground and background solves the behavior-context confusion problem, while reducing the feature distance between fused segments of the same class guarantees the independence and completeness of the behavior instances.
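The exact expression of formula (6) is not reproduced here; purely as a hedged illustration of the idea just described, a margin-based contrastive term consistent with the description (attention-weighted foreground/background features, cosine distance, 0.5 offset) could be sketched as follows, showing only the foreground-background separation part; the formulation actually used in the patent may differ.

import torch
import torch.nn.functional as F

def iec_loss(feats, attn, margin=0.5):
    """feats: (T, D) segment features F_t^c; attn: (T, C) per-class confidences.
    Builds class-wise foreground/background features as attention-weighted sums of the
    segment features and pushes foreground away from background by a cosine-distance
    margin; an assumed form, illustrating the separation term only."""
    fg = (attn.t() @ feats) / (attn.sum(dim=0).unsqueeze(1) + 1e-6)              # (C, D) foreground
    bg = ((1 - attn).t() @ feats) / ((1 - attn).sum(dim=0).unsqueeze(1) + 1e-6)  # (C, D) background
    d_fg_bg = 1 - F.cosine_similarity(fg, bg, dim=1)     # cosine distance per class
    return F.relu(margin - d_fg_bg).mean()               # penalize pairs closer than the margin

loss_iec = iec_loss(torch.randn(100, 512), torch.rand(100, 20))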
This differs from other methods, which focus only on highly discriminative segments and are constrained by temporal proximity; without a global view, their modeling information is insufficient and their detection performance suffers.
The present invention uses the GCN to explicitly model the similarity relations between video segments. In short, the GCN treats each input element as a node in a graph with weighted edges. Through several operations, the feature of each node is transformed from X to Z (as shown in fig. 6); however, the connection relation between the nodes, i.e. G (the affinity matrix), is shared no matter how many layers lie in between. The node edges are weighted by their similarity. In this way, relevant time segments can be pulled together while irrelevant time segments are pushed apart in the feature space, for the purpose of instance clustering.
Video context is a key clue for detecting behavior. Segments that are farther from the behavior but contain similar semantic content can provide indicative cues for detecting it; for example, background frames of a sports field indicate what may happen on the sports field (e.g., "long jump") rather than elsewhere (e.g., "shopping"), because the video context is adaptive.
FIG. 7 shows qualitative results of video temporal behavior detection, where the ground truth is denoted Gt, the detection result of the present invention is denoted Pred, and the result without GCN segment modeling is denoted No-F. The surrounding boxes indicate that the method of the invention can localize a wider range of behaviors, learns a more general behavior-localization model, and can localize more behavior instances.
The performance of the weakly supervised temporal behavior detection of the invention was evaluated on the THUMOS'14 dataset using mAP at different overlap thresholds (IoU) as the metric, denoted mAP@tIoU, with t-IoU set to [0.1, 0.2, 0.3, 0.4, 0.5], following the standard evaluation protocol, and compared with several recent weakly supervised methods, as follows.
TABLE 1 test results on THUMOS' 14 test set
As shown in table 1, the method provided by the invention achieves a good effect under weak video-level labels; compared with other methods, it improves by 1.47 percentage points on average, with a maximum single improvement of 2.28 percentage points.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in other forms. Any person skilled in the art may use the disclosed technical content to make equivalent modifications or changes into equivalent embodiments; however, any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical essence of the present invention, without departing from it, still falls within the protection scope of the technical solution of the present invention.

Claims (7)

1. A video temporal behavior detection method based on weakly supervised learning, characterized by comprising the following steps:
step A, performing spatio-temporal feature extraction on an un-clipped video through a two-stream inflated 3D convolutional network, and inputting the extracted features into a boundary regression layer: first stacking three identical temporal convolution blocks for temporal convolution filtering, each temporal convolution block having 2048 convolution kernels, a BN layer and a ReLU layer, and finally adding one more temporal convolution block to output boundary regression values for fine segmentation of segment-level behavior boundaries;
step B, taking the segmented segment features as the nodes of a graph convolutional network for relation reasoning, designing an internal-external contrast loss of class-segment fusion to supervise the intermediate representation of the video features, increasing the feature distances between foreground and background and between different classes, and obtaining a behavior proposal by threshold fusion;
and step C, obtaining the confidence of the set of classes through a multi-instance learning classifier, and taking the mean of the class confidences as the performance evaluation index.
2. The weakly supervised learning based video temporal behavior detection method according to claim 1, wherein: when the relation reasoning is carried out in step B, the segment features are input into the graph convolutional network for video-segment feature-relation learning, and the obtained segment feature output is
Z = Ĝ X_c W,
where W is a weight matrix, Ĝ is the affinity matrix G normalized by softmax, X_c denotes the set of video segments, and G is a weighted combination of the segment features in X_c according to their similarity; in order to compute G, a δ function is designed to incorporate the similarity and dissimilarity of the appearance and motion features between segments into the weight-learning process, and the nodes of the graph convolutional network aggregate information from their neighborhoods to generate their own distinctive features, so that nodes that are more similar under δ have higher weights, with δ(X) = wX + b, where δ(X) denotes the feature-learning function, w and b denote the learning weight and the bias term, and X denotes a segment feature in X_c.
3. The weakly supervised learning based video temporal behavior detection method according to claim 2, wherein: in step B, the temporally continuous segment features output by the graph convolutional network are concatenated and a similarity measure is computed with the global node; whether behavior information has been lost is judged by a set threshold: if the similarity is smaller than the threshold, too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation training and learning; otherwise the result is output through the classifier.
4. The weakly supervised learning based video temporal behavior detection method according to claim 1, wherein: in step B, the internal-external contrast loss of class-segment fusion supervises the feature representation; in its expression, T is the total number of segments in each video, t is the video segment index, F_t^c denotes the spatio-temporal feature of time period t, a confidence of category i is defined at each time period t, j and k index two video segments, f_i^j denotes a foreground feature and its counterpart the background feature, the cosine distance measures the distance between them, and 0.5 denotes the similarity offset between the segments.
5. The weakly supervised learning based video temporal behavior detection method according to claim 1, wherein: in step B, threshold fusion is carried out by the formula CAS = MLP(X_c, θ_cas), where CAS is the class activation sequence representing the confidence of the set of classes, MLP indicates that a multi-layer perceptron maps the video segment features to the action-class space to obtain the classification scores of behaviors over time, θ_cas denotes the trainable parameters of the action-class sequence, and X_c denotes the set of video segment features.
6. The weakly supervised learning based video temporal behavior detection method according to claim 2, wherein: an L_GS loss is added to the affinity matrix G to guarantee the edge sparsity of G, where T is the total number of video segments, i and j are segment indices within the video, and G_i,j denotes the similarity relation between segment i and segment j in the video.
7. The weakly supervised learning based video temporal behavior detection method according to claim 2, wherein: the similarity measure adopts the cosine similarity, defined as
cos(X_i, X_j) = (X_i · X_j) / (‖X_i‖ ‖X_j‖),
where X_i and X_j denote the features of the segments indexed i and j within a video.
CN202111534859.7A 2021-12-15 2021-12-15 Video time sequence behavior detection method based on weak supervised learning Pending CN114359790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111534859.7A CN114359790A (en) 2021-12-15 2021-12-15 Video time sequence behavior detection method based on weak supervised learning


Publications (1)

Publication Number Publication Date
CN114359790A true CN114359790A (en) 2022-04-15

Family

ID=81099261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111534859.7A Pending CN114359790A (en) 2021-12-15 2021-12-15 Video time sequence behavior detection method based on weak supervised learning

Country Status (1)

Country Link
CN (1) CN114359790A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842402A (en) * 2022-05-26 2022-08-02 重庆大学 Weakly supervised time sequence behavior positioning method based on counterstudy
CN114842402B (en) * 2022-05-26 2024-05-31 重庆大学 Weak supervision time sequence behavior positioning method based on countermeasure learning
CN116226443A (en) * 2023-05-11 2023-06-06 山东建筑大学 Weak supervision video clip positioning method and system based on large-scale video corpus
CN116226443B (en) * 2023-05-11 2023-07-21 山东建筑大学 Weak supervision video clip positioning method and system based on large-scale video corpus


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination