CN114359790A - Video time sequence behavior detection method based on weak supervised learning - Google Patents
Abstract
The invention provides a video temporal behavior detection method based on weakly supervised learning. Adopting an adversarial idea, a refinement layer is added to segment behavior boundaries at the segment level, reducing the redundant information in the behavior instances produced by temporal detection. A GCN explicitly models the similarity relations among segments, and an internal-external contrast loss over fused class segments is proposed to supervise the intermediate representation of video features: by increasing the feature distance between foreground and background and reducing the feature distance within the same category, the context-confusion problem is alleviated; behavior proposals are obtained by threshold fusion, achieving structural integrity of behavior instances and independent localization of their content. Adopting a complementary idea, to address the loss of video information during feature learning and relational reasoning, the invention adds a global node to a complementary learning layer; the learned features are concatenated in temporal order and measured for similarity against the global node, ensuring the integrity of video information and the accuracy of behavior recognition.
Description
Technical Field
The invention relates to a video behavior localization method, and in particular to a video temporal behavior detection method based on weakly supervised learning.
Background
With the rapid growth of electronic capture devices and video data, localizing temporal behaviors in video requires a large amount of annotation for training and learning, and accurate temporal boundary annotation is extremely expensive and error-prone, which greatly limits the application of temporal behavior detection algorithms. Weakly supervised behavior localization technology uses only video-level labels during training, which further reduces the waste of human resources and time as well as annotation errors, and offers good flexibility.
Current weakly supervised behavior localization methods fall into two main categories. One treats weakly supervised temporal behavior localization as a video recognition task: a foreground-background separation attention mechanism is introduced to construct video-level features, and a behavior classifier is then applied to recognize the video. The other treats the problem as multiple-instance learning (MIL): the whole untrimmed video is regarded as a bag of instances and divided into multiple time segments; each segment is classified, the segment-level predictions are merged, and MIL yields the final video-level classification. Although existing methods achieve a certain effect, two problems at the present stage remain unsolved. (1) Behavior-integrity modeling: predicting a complete behavior becomes exceptionally complex under the weakly supervised setting. As shown in Fig. 1(a), where the Gt interval represents the ground-truth behavior range and the Pred interval represents the model's prediction, one complete swimming behavior is predicted as several smaller behavior intervals that cannot be regarded as a complete whole. (2) Action-context confusion: the problem of how to distinguish behavior from highly relevant context using only video-level labels. A video-level classifier learns correlations between videos with the same label which, as shown in Fig. 1(b), include not only the common behavior but also closely related context background; the model cannot separate behavior from context, leading to erroneous predictions.
For these problems, existing solutions include random erasure, class-agnostic attention modeling, discriminative feature learning, and the like. These methods either over-attend to highly discriminative segments while ignoring segments of low discriminability; or provide training supervision only through feature similarity without modeling feature relations for prediction; or leave redundant information in the fused behavior instances because of the segmentation strategy. Moreover, they lack a process for verifying the integrity of video information: behavior information lost during the sequence of feature learning and relational reasoning after video partitioning causes behavior recognition errors, so the models cannot achieve good detection performance.
Disclosure of Invention
The invention aims to solve the problem of accurately separating the different behavior instances and the background in a video when the behavior instances in a long video carry no start-end boundary annotation, thereby realizing temporal behavior detection for long videos.
The invention is realized by adopting the following technical scheme: a video time sequence behavior detection method based on weak supervised learning is characterized by comprising the following steps:
step A, performing spatio-temporal feature extraction on an untrimmed video through a two-stream inflated 3D convolutional network (I3D), and inputting the extracted features into a boundary regression layer for fine segmentation of segment-level boundaries;
inputting the extracted features into the boundary regression layer includes: stacking three identical temporal convolution blocks for temporal convolution filtering, each temporal convolution block having 2048 convolution kernels, a BN layer, and a ReLU layer, and finally adding one temporal convolution block that outputs boundary regression values for fine segmentation;
step B, taking the segmented features as nodes of a graph convolutional network (GCN) for relational reasoning, designing an Internal-External Contrast (IEC) loss over fused class segments to supervise the intermediate representation of video features, increasing the feature distances between foreground and background and between different categories, and applying threshold fusion to obtain behavior proposals;
and step C, obtaining a set of category confidences through a multiple-instance learning classifier, and taking their mAP as the performance evaluation index.
Further, when performing relational reasoning in step B, the segment features are input into the graph convolutional network for relational learning over video segment features to obtain the segment feature output Z, computed by weighting Xc with the softmax-normalized affinity matrix of G and the weight matrix W, where Xc represents the set of video segment features; G is essentially a weighted combination, according to similarity, of each segment feature in Xc. To compute G, a δ function is designed to incorporate the similarity and dissimilarity of appearance and motion features between segments into the weight-learning process; each node of the graph convolutional network integrates information from its neighborhood to generate its own distinctive feature, so that nodes more similar under δ have higher weight, where δ(x) = wx + b, w and b denote the learned weight and bias term, δ(x) denotes the feature-learning function, and x denotes a segment feature in Xc.
Further, the features of temporally continuous segments output by the GCN are concatenated and measured for similarity against a global node (the basic behavior information of the whole video, extracted and preserved by I3D). A threshold decides whether behavior information has been lost: if the similarity is below the threshold, too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation, training, and learning; otherwise the result is output through the classifier.
Further, in step B, the IEC loss that supervises the feature representation is expressed as follows:
where T is the total number of video segments, t is the segment index, Ft^c represents the spatio-temporal feature of time period t, the class confidence of category i at time t also appears in the loss, j and k index two video segments, f_i^j is a foreground feature, and f_i^k is a background feature; the similarity is measured by cosine distance, with a similarity offset of 0.5 — experiments with interval-valued search and two-sided refinement show that an offset of 0.5 works best. Adding non-linearity improves the generalization capability of the function.
Further, in step B, threshold fusion is carried out through the formula CAS = MLP(Xc, θcas).
Further, the similarity measure adopts cosine similarity, defined as sim(x, y) = (x·y)/(‖x‖‖y‖).
compared with the prior art, the invention has the advantages and positive effects that:
the invention introduces the countermeasure idea, adds the boundary regression layer to segment-level behavior boundary segmentation, and reduces the redundant information of the sequential detection downlink behavior example; adopting GCN to take the divided segments as graph nodes, and explicitly modeling the similarity relation of the segments; intermediate representations of IEC loss, surveillance video features are proposed. And increasing the characteristic distance between the foreground and the background, reducing the characteristic distance between similar categories and increasing the characteristic distance between different categories, and then performing threshold fusion to obtain a behavior proposal to ensure the integrity and the independence of behavior examples.
Adopting a complementary idea, to address the loss of video behavior information during feature learning and relational reasoning, the invention adds a global node to a complementary learning layer; the learned features are concatenated according to temporal continuity and measured for similarity against the global node, ensuring the integrity of video information and the accuracy of behavior recognition.
Drawings
FIG. 1 is an illustration of a prior art weakly supervised behavioral localization problem;
FIG. 2 is an exemplary diagram of an uncut video;
FIG. 3 is a diagram of the overall network architecture of the video temporal behavior detection method based on weakly supervised learning according to the present invention;
FIG. 4 is a schematic diagram of the complementary countermeasure concept of the present invention;
FIG. 5 shows the canchor regression process of the present invention;
FIG. 6 is a schematic diagram of the GCN structure of the present invention;
FIG. 7 is a diagram showing the effect of the present invention.
Detailed Description
The general idea of the invention is as follows:
In the absence of fine-grained temporal boundary annotation for untrimmed video, it becomes very difficult to detect complete and accurate behavior instances. Therefore, the extracted spatio-temporal features are input into the boundary regression layer to finely segment the segment-level behavior boundaries, which reduces redundant information in the behavior instances and guarantees independence of their content. The segmented features serve as nodes of a graph convolutional network (GCN), which infers the local correlation between each node and its neighbors and generates the node's distinctive feature; the IEC loss supervises the video feature representation (an untrimmed video contains not only behavior instances but also highly similar context — the background and unwanted behaviors are collectively referred to as context; the foreground and background are representative sums of intermediate features over video time periods).
Considering that behavior instances in an untrimmed video vary in length — as short as a few seconds or as long as an hour — the series of convolution and pooling operations in feature learning may lose behavior information, which in turn causes learning errors and affects the detection result. To ensure that no information of the whole video is lost, the invention adds a global node to a complementary learning layer: the features produced by feature learning and relational reasoning are concatenated along the time dimension and measured for similarity against the global node; if the similarity is within a certain threshold range, the overall information has not been lost, ensuring the integrity of video information and the accuracy of behavior recognition.
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The overall framework of the technical scheme of the invention is shown in Fig. 3. First, I3D is used for spatio-temporal feature extraction, and the features are input to the boundary regression layer; as shown in Fig. 3(a), three identical temporal convolution blocks (kernel size 3, stride 1, padding 1, each followed by a BN layer and a ReLU layer) are stacked first, and with temporal convolution filtering and iterative regression training, the predicted segment-level behavior boundaries are finely segmented, enhancing the independence of behavior-instance content. The segmented segment features are then used as GCN nodes for relational reasoning, as shown in Fig. 3(b), and behavior proposals are obtained through threshold fusion. Finally, a set of category confidences is obtained through a multiple-instance learning classifier, and its mAP is used as the performance evaluation index. The complementary-adversarial idea of the invention is shown in Fig. 4: viewed from top to bottom, mainly under the guidance of the global node (basic information of the whole video, providing a global view), the relations and distinctions between segments are learned through the adversarial mechanism while redundant information of the temporal behavior is removed, ensuring the relative independence of the behavior proposals.
The respective sections will be described in detail next.
(1) Feature extraction network
To prevent memory consumption from degrading model computation, the method first divides the untrimmed video into T consecutive, equal-length, non-overlapping segments by uniform sampling, then passes the divided segments and the untrimmed video to the I3D network for spatio-temporal feature extraction.
The two-stream inflated 3D convolutional network (I3D) takes a state-of-the-art image classification model as its basic structure, extending the 2D convolution and pooling kernels of image classification to 3D so as to learn spatio-temporal features seamlessly. The backbone network consists of a spatial stream that accepts RGB input and a temporal stream that accepts optical-flow input. For each stream, Inception components and batch normalization are used, as shown in the Fig. 5 structure. Finally, the spatial and temporal features extracted for each segment are concatenated into a 2048-channel feature vector.
Specifically, for each video V and for each segment t, the spatial stream (RGB) and temporal stream (Flow) respectively encode static scene features F_RGB^i(t) ∈ R^1024 and motion features F_flow^i(t) ∈ R^1024. Through a concatenation operation, the static scene feature F_RGB^i(t) and motion feature F_flow^i(t) are combined into the clip-level feature F_c^i(t) = [F_RGB^i(t), F_flow^i(t)]. Finally, all clip-level features are stacked to form the video's pre-trained features F_c ∈ R^(T×2048). Similarly, for each video V, its video-level features are extracted: the spatial stream (RGB) and temporal stream (Flow) respectively encode static scene features F_RGB(t) ∈ R^1024 and motion features F_flow(t) ∈ R^1024, which are concatenated into the video-level spatio-temporal feature F_g(t) = [F_RGB(t), F_flow(t)] ∈ R^(1×2048); then Global Average Pooling (GAP), X_g = Pool(F_g(t)), preserves the complete behavior features of the video.
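As an illustrative sketch (not part of the patent's implementation), the concatenation of per-segment RGB and Flow features into clip-level vectors and the global average pooling that forms the global node can be written in plain Python; the toy feature dimensions here are assumptions for brevity, standing in for the 1024-dim I3D outputs:

```python
def build_segment_features(rgb_feats, flow_feats):
    """Concatenate per-segment RGB and Flow features into clip-level vectors,
    mirroring F_c(t) = [F_RGB(t), F_flow(t)]."""
    assert len(rgb_feats) == len(flow_feats)
    return [r + f for r, f in zip(rgb_feats, flow_feats)]  # list concat per segment

def global_average_pool(feature_seq):
    """Average a sequence of segment-level vectors into one global vector,
    mirroring X_g = Pool(F_g(t))."""
    t, dim = len(feature_seq), len(feature_seq[0])
    return [sum(v[d] for v in feature_seq) / t for d in range(dim)]

# Toy example: T = 3 segments, 2-dim "RGB" and 2-dim "Flow" features.
rgb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flow = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
F_c = build_segment_features(rgb, flow)   # T x 4 clip-level features
X_g = global_average_pool(F_c)            # 4-dim global node
```

In the actual network the per-segment dimension is 2048 (1024 RGB + 1024 Flow) and the stacked features form F_c ∈ R^(T×2048).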
(2) Boundary regression layer
The spatio-temporal features extracted by I3D are input to the boundary regression layer. The adopted method is coordinate parameterization; to fit the I3D feature-extraction pattern and reduce computational cost, the invention adopts segment-level coordinate regression, as shown in figure 6. First, three identical temporal convolution blocks are stacked; the features and motion intensity at each temporal position can be viewed as a function of the temporal convolution filtering.
Each temporal convolution block has 2048 convolution kernels with kernel size 3, stride 1, and padding 1. After each temporal convolution block there is a BN layer and a ReLU layer.
Xc=RELU(BN(Conv(Xc,θ))) (1)
Xc represents the video segment features, Conv the convolution operation, BN the batch normalization operation, and ReLU the non-linear activation.
Finally, one more temporal convolution block is added to output the boundary regression values for fine segmentation.
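The temporal convolution with kernel size 3, stride 1, padding 1 followed by ReLU described above can be sketched for a single channel in plain Python (a minimal illustration only; the real layer has 2048 kernels plus batch normalization, which are omitted here):

```python
def temporal_conv1d(x, kernel, bias=0.0):
    """1D temporal convolution over a single channel.
    x: list of scalars over time; kernel: length-3 weights; padding=1, stride=1."""
    padded = [0.0] + x + [0.0]  # zero padding of 1 on each side
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(x))]

def relu(x):
    """Element-wise non-linear activation."""
    return [max(0.0, v) for v in x]

signal = [1.0, -2.0, 3.0, -4.0]
out = relu(temporal_conv1d(signal, [0.0, 1.0, 0.0]))  # identity kernel, then ReLU
```

With the identity kernel, output length equals input length (as guaranteed by padding 1 with kernel 3), and ReLU zeroes the negative positions.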
The present invention initializes M canchors (clip anchors) at different scales based on the typical behavior duration of a given dataset (e.g., initialization scales [1, 2, 4, 8, 16, 32] for the THUMOS'14 dataset and [16, 32, 64, 128, 256] for the ActivityNet1.2 dataset). At the time position s_x of each segment, M canchors of different scales are predicted. An example of a canchor is shown in fig. 5.
The following process is performed iteratively. ① Among its M canchor predictions, the behavior-segment center position c_x = s_x + w_a·t_c and temporal length w = w_a·exp(t_w) are computed through parameterized offsets, where t_c indicates how to move the canchor's center position and t_w indicates how to scale the canchor's length. ② If the behavior value at a time position is lower than 0.1 (0.1 indicates that the position is most likely a non-action background class — background features exist but no behavior; a somewhat larger value may indicate behavior of small motion amplitude with noise interference, so 0.1 fits best), all predictions corresponding to that time position are discarded. ③ For each remaining position, only the prediction with the least loss is retained, i.e., the most probable canchor. ④ Among the retained predictions, those with loss larger than a certain threshold are deleted, and finally Non-Maximum Suppression (NMS) is performed on all retained segments to obtain accurate boundaries. The last temporal convolution block yields the boundary x_1 = c_x − w/2, x_2 = c_x + w/2 for precise segmentation of the segments.
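A minimal sketch of the canchor decoding and the final NMS step described above (the loss-based filtering of the intermediate steps is omitted, and the function names are illustrative, not the patent's):

```python
import math

def decode_anchor(s_x, w_a, t_c, t_w):
    """Parameterized offset regression: shift the center, rescale the width."""
    c_x = s_x + w_a * t_c          # behavior segment center position
    w = w_a * math.exp(t_w)        # behavior segment temporal length
    return (c_x - w / 2.0, c_x + w / 2.0)   # boundaries (x1, x2)

def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(segments, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep high-score segments, drop overlaps."""
    order = sorted(range(len(segments)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) < thresh for k in keep):
            keep.append(i)
    return [segments[i] for i in keep]
```

For example, an anchor of width 4 at position 10 with offsets t_c = 0.5, t_w = 0 decodes to the segment (10.0, 14.0).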
(3) Explicitly modeling similarity relations between behavior segments and obtaining "behavior proposals" through threshold fusion
The GCN is a graph-based network for modeling inter-segment similarity, providing spatial topology and semantic appearance features. It can infer the local correlation between related neighbor nodes and aggregate information from the neighborhood, enhancing the distinctiveness of each node's own feature. The GCN takes the segments finely divided by the canchors as nodes for graph reasoning, explicitly models the similarity relations between video segments, and then applies threshold fusion to obtain "behavior proposals" (a "behavior proposal" refers to an independent and complete temporal behavior in the untrimmed video).
For the GCN input Xc:
Z, of dimension T×dout, is the output of the graph convolution; W, of dimension 2048×dout, is a weight matrix learned by back-propagation; and the softmax-normalized affinity matrix of G has dimension T×T.
To compute G, the invention designs a δ function to incorporate the similarity and dissimilarity of appearance and motion features between segments into the weight-learning process, and the GCN aggregates information within each node's neighborhood to generate the node's distinctive feature, so that nodes more similar under δ have higher weights.
δ(x)=wx+b (3)
w and b represent the learned weight and bias term, and x represents a segment feature in the set Xc. The similarity measure is cosine similarity, defined as sim(x, y) = (x·y)/(‖x‖‖y‖) (4).
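The cosine similarity of formula (4) and one softmax-normalized graph propagation step can be sketched as follows; the learned weight matrix W and the δ(x) = wx + b transform are replaced by the identity here, which is a simplifying assumption made only for brevity:

```python
import math

def cosine_sim(x, y):
    """Formula (4): sim(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def softmax(row):
    """Numerically stable softmax over one affinity row."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def graph_propagate(X):
    """One graph-convolution step: each node's feature becomes a
    similarity-weighted combination of all segment features (rows of the
    softmax-normalized affinity matrix act as the combination weights)."""
    G = [softmax([cosine_sim(xi, xj) for xj in X]) for xi in X]
    return [[sum(G[i][j] * X[j][d] for j in range(len(X)))
             for d in range(len(X[0]))] for i in range(len(X))]
```

Each output row is a convex combination of the input segment features, so similar segments are pulled together in feature space, as the description states.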
firstly, in the relation reasoning process, a dynamic fusion threshold value (for weak supervision, the dynamic fusion threshold value is set classification, the mean value of confidence degrees of all classes is used as a performance evaluation index, one classification confidence degree is called a class score, a plurality of classes are class activation sequences CAS) is obtained through a formula (5), then, an IEC loss supervision characteristic expression is designed, such as a formula (6), and a behavior proposal is obtained according to threshold fusion. And finally, by MIL constraint, mAP is used as a performance evaluation index.
IEC: the foreground and background are representative sums of the features in the middle of the video segment. The invention increases the characteristic distance between the foreground and the background, and solves the problem of confusion of action context; and the characteristic distance between different classes is increased, the characteristic distance between nodes in the same class is reduced, and the integrity and the independence of the behavior examples are ensured.
CAS=MLP(Xc,θcas) (5)
where T is the total number of video segments, t is the segment index, F_t^c represents the feature of time period t, the class confidence of category i at time t also appears in the loss, j and k index two video segments, f_i^j is a foreground feature, and f_i^k is a background feature; the similarity is measured by cosine distance, with a similarity offset of 0.5 — experiments with interval-valued search and two-sided refinement show that an offset of 0.5 works best and increases the non-linearity of the function.
The IEC loss is designed to supervise the feature representation; together with the boundary regression layer, it forms an adversarial mechanism that ensures the integrity and independence of behavior instances.
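The threshold fusion over the class activation sequence described above can be sketched as follows — a simplified reading in which consecutive segments whose class score exceeds the threshold are merged into one proposal (the actual dynamic threshold comes from formula (5); this fixed-threshold version is an assumption for illustration):

```python
def threshold_fusion(cas, thresh):
    """Merge consecutive segments whose class score >= thresh into
    (start, end) behavior proposals; end index is exclusive."""
    proposals, start = [], None
    for t, score in enumerate(cas):
        if score >= thresh and start is None:
            start = t                       # open a new proposal
        elif score < thresh and start is not None:
            proposals.append((start, t))    # close the current proposal
            start = None
    if start is not None:                   # proposal runs to the video end
        proposals.append((start, len(cas)))
    return proposals
```

For a CAS of [0.1, 0.8, 0.9, 0.2, 0.7] with threshold 0.5 this yields the two proposals (1, 3) and (4, 5).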
G is essentially a weighted combination, according to similarity relations, of each segment feature x in the video's segment-feature set Xc, corresponding to a regular fully connected layer without bias terms. A complementary learning layer is introduced before the graph layer feeds the classification layer; its main purpose is to verify whether behavior information has been lost during the sequence of feature learning and relational reasoning, ensuring the integrity of video information and preventing behavior recognition errors. The process is as follows: the invention concatenates the temporally continuous segment features output by the GCN, concat(Xc), and measures their similarity with the global node X_g using formula (4). If the value is smaller than the threshold (experiments show the effect is best when the threshold θ = 0.6), too much behavior information has been lost because the video was divided into segments that are too large or too small, and this is fed back for re-segmentation, training, and learning; otherwise the detection result is output through the classifier. By designing the complementary learning layer, the integrity of video information is ensured and behavior recognition errors caused by feature loss are prevented.
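The complementary learning layer's check can be sketched as follows; in this toy example the concatenated segment features and the global node are given matching lengths, an assumption made only so the cosine measure of formula (4) applies directly:

```python
import math

def cosine_sim(x, y):
    """Formula (4): cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def information_preserved(segment_feats, global_node, theta=0.6):
    """Concatenate segment features in temporal order and compare with the
    global node; below theta the video must be re-segmented and re-trained."""
    concat = [v for f in segment_feats for v in f]  # concat(Xc)
    return cosine_sim(concat, global_node) >= theta
```

When the result is False, the pipeline feeds back to re-segmentation; when True, the features proceed to the classifier.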
To optimize the model so that it achieves better performance and improved recognition accuracy, the invention designs a total objective function:
Ltol = λ1LMIL + λ2LGS + λ3LIEC (7)
where λ1, λ2, λ3 are learned weighting parameters, Ltol is the total objective function, LMIL the multiple-instance learning loss, LGS the graph sparsity loss, and LIEC the internal-external contrast loss.
Multiple-instance learning loss: the method directly maps the weakly supervised video temporal behavior detection problem into a multiple-instance learning task. The video is divided into multiple time periods, each period is classified, the segment-level predictions are merged, and multiple-instance learning yields the final video-level classification.
In the loss, the predicted confidence that segment j is of category i is compared with the ground-truth confidence that segment j is of category i; n represents the number of behavior instances in the video, and nc represents the total number of categories.
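The MIL aggregation from segment-level to video-level class scores can be sketched with top-k pooling; the patent does not fix the exact pooling operator, so top-k averaging here is an assumption (a common MIL choice), not the patent's specified method:

```python
def video_level_scores(segment_scores, k=2):
    """Aggregate per-segment class scores (T x n_c) into one video-level
    score per class by averaging the top-k segment scores of that class."""
    n_classes = len(segment_scores[0])
    video = []
    for c in range(n_classes):
        col = sorted((s[c] for s in segment_scores), reverse=True)
        video.append(sum(col[:k]) / min(k, len(col)))
    return video

# Toy example: 3 segments, 2 classes.
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
v = video_level_scores(scores, k=2)  # one confidence per class
```

The video-level scores can then be compared with the video-level labels to form the MIL classification loss.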
GCN graph sparsity loss: to ensure the sparsity of the graph and speed up network training. In summary, G can pull similar segment features x together and push dissimilar segment features x apart. A G with near-uniform edge weights can be difficult to train, because the distinctiveness of the features x is averaged out. To prevent this, the invention adds the LGS loss on G to guarantee the edge sparsity of G:
t is the total number of video segments, i, j is the intra-video segment index, Gi,jThe similarity relationship between the segment i and the segment j in the video is shown.
The internal-external contrast loss is designed to supervise the video feature representation: increasing the feature distance between foreground and background solves the behavior-context confusion problem, and reducing the feature distance between similar fused segments ensures the independence and completeness of behavior instances.
This differs from other methods, which focus only on highly discriminative segments and are constrained by temporal proximity; lacking a global view, they suffer from insufficient modeling information and poor detection performance.
The present invention uses the GCN to explicitly model similarity relations between video segments. In summary, the GCN treats each input element as a node in a graph with weighted edges. Through several operations, the feature of each node changes from X to Z (as shown in fig. 6). However, the connection relation between nodes, i.e., G (the affinity matrix), is shared no matter how many layers lie in between. Node edges are weighted by their similarity; in this way, related time segments can be pulled together while unrelated ones are pushed apart in the feature space, achieving instance clustering.
Video context is a key clue for detecting behaviors. Segments far from a behavior but containing similar semantic content can provide indicative cues for detecting it. For example, background frames of a sports field indicate what may happen on the field (e.g., "long jump") rather than elsewhere (e.g., "shopping"), because the video context is adaptive.
FIG. 7 shows qualitative results of video temporal behavior detection, where Gt denotes the ground truth, Pred the detection result of the invention, and No-F the result without GCN segment modeling. The outer boxes show that the method of the invention can locate a wider range of behaviors, learns a more general behavior localization model, and can locate more behavior instances.
The performance of the weakly supervised temporal behavior detection of the present invention was evaluated on the THUMOS'14 dataset using mAP at different temporal overlap thresholds (tIoU) as the metric, denoted mAP@tIoU, with tIoU set to [0.1, 0.2, 0.3, 0.4, 0.5], following the standard evaluation protocol, and was compared against several recent weakly supervised methods, as follows.
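The tIoU criterion underlying this metric is straightforward to sketch. The snippet below (an illustrative simplification, not the official THUMOS'14 evaluation code) computes temporal IoU between predicted and ground-truth intervals and a simple recall at each threshold; full mAP would additionally rank predictions by confidence and average precision per class:

```python
def t_iou(a, b):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(preds, gts, thresh):
    """Fraction of ground-truth instances matched by some prediction
    at the given tIoU threshold (a simplified stand-in for mAP)."""
    hit = sum(any(t_iou(p, g) >= thresh for p in preds) for g in gts)
    return hit / len(gts)

# Toy intervals: two ground-truth behaviors, two predictions
gts = [(2.0, 6.0), (10.0, 14.0)]
preds = [(2.5, 6.5), (11.0, 13.0)]
for t in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print(t, recall_at_tiou(preds, gts, t))
```

At tIoU 0.5 both toy predictions still count as matches, which illustrates why performance reported at higher thresholds is the stricter figure.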
TABLE 1 test results on THUMOS' 14 test set
As shown in table 1, the proposed method achieves good results under weak video-level labels, improving over the compared methods by 1.47 percentage points on average, with a maximum single improvement of 2.28 percentage points.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention to other forms. Any person skilled in the art may use the technical content disclosed above to make equivalent embodiments with equivalent changes; any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical essence of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (7)
1. A video time sequence behavior detection method based on weak supervised learning is characterized by comprising the following steps:
step A, extracting spatio-temporal features from an untrimmed video through a two-stream inflated convolutional network, and inputting the extracted features into a boundary regression layer: first stacking three identical temporal convolution blocks that perform convolutional filtering along time, each temporal convolution block having 2048 convolution kernels, one BN layer and one ReLU layer, and finally adding one temporal convolution block that outputs a boundary regression value for fine segment-level refinement of behavior boundaries;
step B, using the segmented-segment features as nodes of a graph convolutional network for relation reasoning; designing an internal-external contrast loss for category segment fusion, supervising the intermediate representation of the video features, increasing the feature distances between foreground and background and between different categories, and fusing with a threshold to obtain behavior proposals;
and step C, obtaining the confidence of the set categories through a multi-instance learning classifier, and taking the mean of the category confidences as the performance evaluation index.
2. The weak supervised learning based video temporal behavior detection method according to claim 1, wherein: when the relation reasoning is carried out in step B, the segment features are input into the graph convolutional network for relation learning of the video segment features, and the obtained segment features are output as Z = Ĝ·Xc·W, where W is a weight matrix, Ĝ is the affinity matrix G normalized by softmax, Xc represents the set of video segments, and G is the affinity matrix of Xc; in order to compute G, a δ function is designed that combines the similarity and dissimilarity of appearance and motion features among the segments into the learning of the weights; the nodes of the graph convolutional network aggregate information from their neighborhood to generate their own unique features, so that nodes with more similar δ have higher weight, δ(x) = wx + b, where δ(x) represents the feature learning function, w and b represent the learned weight and bias term, and x represents a segment feature in Xc.
3. The weak supervised learning based video temporal behavior detection method according to claim 2, wherein: in step B, the segment features output by the graph convolutional network over continuous time are concatenated, and a similarity measure is computed between them and the global node; a set threshold determines whether behavior information has been lost: if the similarity is below the threshold, too much behavior information has been lost because the segments into which the video was divided are too large or too small, and this is fed back for re-segmentation training and learning; otherwise the result is output through the classifier.
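The feedback check in claim 3 can be illustrated with a short sketch. This is a hypothetical rendering under assumed choices (cosine similarity, mean feature as the global node, threshold 0.5), not the claimed implementation:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def check_segmentation(seg_feats, threshold=0.5):
    """Compare each segment feature to the global node (here: the mean
    of all segment features).  If any similarity falls below the
    threshold, flag the video for re-segmentation; otherwise pass the
    features on to the classifier."""
    global_node = seg_feats.mean(axis=0)
    sims = [cosine(f, global_node) for f in seg_feats]
    decision = "resegment" if min(sims) < threshold else "classify"
    return decision, sims

feats = np.ones((4, 8))              # toy features, all identical
decision, sims = check_segmentation(feats)
print(decision)                      # classify (all similarities are 1.0)
```

Heterogeneous segment features would drive some similarity below the threshold and trigger the re-segmentation path instead.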
4. The weak supervised learning based video temporal behavior detection method according to claim 1, wherein: in step B, the internal-external contrast loss of category segment fusion used to supervise the features is expressed as follows:
where T is the total number of segments in each video, t is the video segment index, F_t^c represents the spatio-temporal feature of time period t, a confidence term gives the confidence of category i at time period t, j and k index two video segments, f_i^j is a foreground feature, the paired term is a background feature, the distance used is the cosine distance, and 0.5 represents the similarity offset between segments.
5. The weak supervised learning based video temporal behavior detection method according to claim 1, wherein: in step B, threshold fusion is performed by the formula CAS = MLP(Xc, θcas), where CAS is the class activation sequence representing the confidence of the set categories; MLP denotes mapping the video segment features to the action category space by the multi-layer perceptron, obtaining classification scores of the behaviors over time; θcas represents the trainable parameters of the class activation sequence, and Xc represents the set of video segment features.
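A minimal sketch of the CAS = MLP(Xc, θcas) mapping follows; the two-layer perceptron, the softmax over classes, and mean pooling over time are illustrative assumptions standing in for the claimed multi-layer perceptron and multi-instance pooling:

```python
import numpy as np

def mlp(X, W1, b1, W2, b2):
    """Two-layer perceptron mapping segment features to class logits."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

def class_activation_sequence(Xc, params):
    """CAS = MLP(Xc): per-segment scores over action classes.
    The video-level confidence is pooled over time (here, a simple mean,
    standing in for multi-instance pooling)."""
    logits = mlp(Xc, *params)                       # (T, num_classes)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    cas = e / e.sum(axis=1, keepdims=True)          # per-segment class confidence
    return cas, cas.mean(axis=0)                    # CAS and video-level score

rng = np.random.default_rng(2)
T, D, C = 10, 16, 5                                 # segments, feature dim, classes
Xc = rng.standard_normal((T, D))
params = (rng.standard_normal((D, 32)), np.zeros(32),
          rng.standard_normal((32, C)), np.zeros(C))
cas, video_score = class_activation_sequence(Xc, params)
print(cas.shape, video_score.shape)                 # (10, 5) (5,)
```

Thresholding the per-segment rows of the CAS then yields the candidate behavior proposals that the fusion step operates on.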
6. The weak supervised learning based video temporal behavior detection method according to claim 2, wherein:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111534859.7A CN114359790A (en) | 2021-12-15 | 2021-12-15 | Video time sequence behavior detection method based on weak supervised learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111534859.7A CN114359790A (en) | 2021-12-15 | 2021-12-15 | Video time sequence behavior detection method based on weak supervised learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114359790A true CN114359790A (en) | 2022-04-15 |
Family
ID=81099261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111534859.7A Pending CN114359790A (en) | 2021-12-15 | 2021-12-15 | Video time sequence behavior detection method based on weak supervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114359790A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114842402A (en) * | 2022-05-26 | 2022-08-02 | 重庆大学 | Weakly supervised time sequence behavior positioning method based on counterstudy |
CN114842402B (en) * | 2022-05-26 | 2024-05-31 | 重庆大学 | Weak supervision time sequence behavior positioning method based on countermeasure learning |
CN116226443A (en) * | 2023-05-11 | 2023-06-06 | 山东建筑大学 | Weak supervision video clip positioning method and system based on large-scale video corpus |
CN116226443B (en) * | 2023-05-11 | 2023-07-21 | 山东建筑大学 | Weak supervision video clip positioning method and system based on large-scale video corpus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ramachandra et al. | A survey of single-scene video anomaly detection | |
CN108805170B (en) | Forming data sets for fully supervised learning | |
Wei et al. | Boosting deep attribute learning via support vector regression for fast moving crowd counting | |
Wang et al. | Correspondence-free activity analysis and scene modeling in multiple camera views | |
Zhang et al. | Mining semantic context information for intelligent video surveillance of traffic scenes | |
US8660368B2 (en) | Anomalous pattern discovery | |
CN101482923B (en) | Human body target detection and sexuality recognition method in video monitoring | |
US11640714B2 (en) | Video panoptic segmentation | |
CN111767847B (en) | Pedestrian multi-target tracking method integrating target detection and association | |
CN114359790A (en) | Video time sequence behavior detection method based on weak supervised learning | |
CN108133172A (en) | Method, the analysis method of vehicle flowrate and the device that Moving Objects are classified in video | |
Wu et al. | Two stage shot boundary detection via feature fusion and spatial-temporal convolutional neural networks | |
CN110378911B (en) | Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier | |
Luo et al. | Traffic analytics with low-frame-rate videos | |
Bennett et al. | Enhanced tracking and recognition of moving objects by reasoning about spatio-temporal continuity | |
Maag et al. | Two video data sets for tracking and retrieval of out of distribution objects | |
Qin et al. | Application of video scene semantic recognition technology in smart video | |
KR102110375B1 (en) | Video watch method based on transfer of learning | |
Pillai et al. | Transformer based self-context aware prediction for few-shot anomaly detection in videos | |
Tang et al. | Graph-based motion prediction for abnormal action detection | |
Yang et al. | Video anomaly detection for surveillance based on effective frame area | |
CN115187884A (en) | High-altitude parabolic identification method and device, electronic equipment and storage medium | |
Ahmed et al. | Localization of region of interest in surveillance scene | |
Veluchamy et al. | Detection and localization of abnormalities in surveillance video using timerider-based neural network | |
Yao et al. | Weakly supervised graph learning for action recognition in untrimmed video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||