CN110633645A - Video behavior detection method based on enhanced three-stream architecture - Google Patents

Video behavior detection method based on enhanced three-stream architecture

Info

Publication number
CN110633645A
CN110633645A
Authority
CN
China
Prior art keywords
flow
behavior
detection
behavior detection
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910764109.5A
Other languages
Chinese (zh)
Inventor
王瀚漓
吴雨唐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201910764109.5A
Publication of CN110633645A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video behavior detection method based on an enhanced three-stream architecture, comprising the following steps: a data generation step, in which an input video is acquired and optical flow maps and human pose maps are derived from the original frames; a behavior detection step, in which the original frames, optical flow maps and human pose maps serve as the input of an enhanced appearance stream, the optical flow maps serve as the input of a motion stream, and the human pose maps serve as the input of a pose stream, each stream processing its input at every time step to produce a detection result; a three-stream fusion step, in which the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the two together yield a set of regressed behavior detection tubelets with class scores; and a behavior tube construction step, in which the detection tubelets of successive time steps are linked over time to build the 3D behavior detection tubes of the input video. Compared with the prior art, the method offers high classification accuracy, precise localization, and convenient, fast operation.

Description

Video behavior detection method based on enhanced three-stream architecture
Technical Field
The invention relates to a video behavior detection method, in particular to a video behavior detection method based on an enhanced three-stream architecture.
Background
With the growing interest in video understanding in recent years, video behavior detection has received increasing attention and is widely used in video applications such as surveillance. Video behavior detection aims to classify and localize behaviors throughout a video. Compared with well-studied visual tasks such as image classification, progress on video behavior detection is still limited by insufficient training data, noisy video backgrounds and a large spatio-temporal search space.
The introduction of convolutional neural networks (CNNs) has brought significant progress to object detection. Given the similarity between behavior detection and object detection, most current state-of-the-art behavior detection methods combine CNN-based object detectors with a two-stream architecture.
In these methods, each stream produces a frame-level or sequence-level detection result at every time step, the two-stream results are fused, and the per-time-step results are linked over time to obtain 3D behavior detection tubes for the whole video. The three-stream behavior detection architecture adds the human pose as a visual cue in a third stream, which helps to characterize the spatial location of a behavior more clearly by suppressing the background and highlighting the acting person. However, these works process only a single visual cue per stream, i.e. the original image, the optical flow map or the human pose map, and ignore the interaction between the visual cues.
Chinese patent application CN109284667A discloses a three-stream spatial-domain detection method for human motion behaviors in video. Optical flow maps and human pose maps are derived from the original frames to form an RGB stream fed with the original images, a Flow stream fed with the optical flow maps, and a Pose stream fed with the human pose maps; at each time step a detector on each stream produces its detection result. The method processes a single visual cue per stream, that is, the original image, the optical flow map or the human pose map is handled on the RGB, Flow and Pose stream respectively. The original image is generally considered to carry the richest visual information, yet this conventional approach cannot fully exploit the appearance stream fed with the original images in behavior detection, so the detection performance still leaves considerable room for improvement.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art and provide a video behavior detection method based on an enhanced three-stream architecture.
The object of the invention can be achieved by the following technical solution:
A video behavior detection method based on an enhanced three-stream architecture comprises the following steps:
a data generation step, in which an input video is acquired and the corresponding optical flow maps and human pose maps are derived from the original frames;
a behavior detection step, in which an enhanced appearance stream is built from the original frames, optical flow maps and human pose maps, the optical flow maps form a motion stream and the human pose maps form a pose stream, yielding a three-stream input; at each time step the enhanced appearance stream, the motion stream and the pose stream each perform detection and produce a result comprising class scores and anchor-cube coordinate regression values;
a three-stream fusion step, in which the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the fused score is combined with the fused regression values to obtain a set of regressed behavior detection tubelets with class scores;
and a behavior tube construction step, in which the behavior detection tubelets of successive time steps are linked over time to construct the 3D behavior detection tubes of the input video.
Further, the optical flow maps are obtained as follows:
the optical flow between every two adjacent original frames is computed with the Brox optical flow algorithm; the horizontal component, the vertical component and the magnitude of the resulting flow form the three channels of an image, and the value of each channel at every pixel is quantized to the range [0, 255], giving an optical flow map with the same resolution as the original frame.
Further, the human pose maps are obtained as follows:
the Fast-Net network is fine-tuned on the J-HMDB behavior recognition dataset; the fine-tuned Fast-Net assigns a label to every pixel of the original frame, and each label is mapped to a preset RGB value, giving a semantic segmentation image of human body parts, i.e. a human pose map, with the same resolution as the original frame.
Further, in the behavior detection step, at each time step the enhanced appearance stream takes as input D consecutive original frames, the D × 5 consecutive optical flow maps corresponding to those frames and the D corresponding human pose maps; the motion stream takes the D × 5 consecutive optical flow maps as input; and the pose stream takes the D human pose maps as input.
Further, in the behavior detection step, the enhanced appearance stream extracts features from the original frame, optical flow maps and human pose map of each frame of the input sequence with SSD network layers; for every frame the feature maps of the three inputs are fused by convolution, the feature maps coming from the same feature layer of every frame are then stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
Further, in the behavior detection step, the motion stream and the pose stream extract features from their input image of every frame of the input sequence with SSD network layers; the feature maps coming from the same feature layer of every frame are stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
Further, when the streams are trained, the total loss of the detection network is computed and the network weights of each stream are adjusted by stochastic gradient descent (SGD), the total loss L of the detection network being expressed as:

L = \frac{1}{N}\left(L_{conf} + L_{loc}\right)

where L_{conf} denotes the classification loss, L_{loc} denotes the localization loss and N denotes the total number of successfully matched anchor cubes.
Further, in the three-stream fusion step, the extended weighted fusion of the class scores of the three streams is specifically:

S = \frac{\alpha_1 S_1 + \alpha_2 S_2 + \alpha_3 S_3}{\alpha_1 + \alpha_2 + \alpha_3}

where S, S_1, S_2 and S_3 denote, for a given anchor cube, the fused class score on class l, the class score from the enhanced appearance stream, the class score from the motion stream and the class score from the pose stream, all computed with the network parameters \Theta; \alpha_1, \alpha_2, \alpha_3 are the weight coefficients assigned to the corresponding streams, the size of a coefficient representing the degree of contribution of that stream to the final detection result.
Further, the ratio of the weight coefficients \alpha_1 : \alpha_2 : \alpha_3 is 3:3:1.
Further, in the behavior tube construction step, the behavior detection tubelets of successive time steps are linked over time with a real-time greedy algorithm to construct the 3D behavior detection tubes of the input video, specifically:
the regressed behavior detection tubelets generated at each time step are filtered by non-maximum suppression, keeping the top A tubelets;
the tube construction proceeds class by class; for each behavior class, the active behavior tubes already linked up to time f-1 are sorted by tube score from high to low (the tube score being the average of the class scores of the detection tubelets that compose the tube), and for each active tube in turn the best-matching tubelet among the A detection tubelets generated at time f is selected for linking, the matching rule being: the overlap between the candidate tubelet and the last tubelet member of the active tube is higher than τ, and its class score is the highest among the tubelets not yet linked; a linked detection tubelet cannot be linked to any other active tube; a detection tubelet left unlinked at time f becomes a new active tube at the next time step, and an active tube that fails to be linked to a detection tubelet for D-1 consecutive frames is considered finished;
after the tubelet linking finishes, a temporal smoothing strategy averages the detection boxes of different detection tubelets that fall on the same video frame, converting each linked active tube into a smooth behavior tube and giving the 3D behavior detection tubes of the input video.
Compared with the prior art, the invention has the following advantages:
(1) High classification accuracy: the invention proposes an enhanced appearance stream that takes three forms of visual cue as input; by taking the optical flow maps as an input, the enhanced appearance stream can capture the motion of human behaviors and correctly classify behaviors that are hard to distinguish from static frames alone, so that the overall classification accuracy improves after the extended three-stream fusion;
(2) Accurate localization: by taking the human pose maps as an input, the enhanced appearance stream can suppress background interference and capture the precise extent of the moving body, so that the overall localization accuracy improves after the extended three-stream fusion; in addition, the weight coefficients designed for the three streams further improve the detection precision;
(3) Convenient and fast operation: the method is designed on the SSD framework and performs end-to-end detection from input to output, which makes it highly practical; the behavior tube construction uses a real-time greedy algorithm, ensuring the efficiency of the whole detection pipeline.
Drawings
FIG. 1 is a schematic diagram of the enhanced three-stream behavior tubelet detector model of the present invention;
FIG. 2 is a schematic diagram of the operation of the enhanced appearance stream in the present invention;
FIG. 3 is a schematic diagram of the operation of a conventional appearance stream in the present invention;
FIG. 4 compares the detection results of the enhanced appearance stream of the present invention and the conventional appearance stream on the 10 categories of the UCF-Sports dataset;
FIG. 5 compares the frame-level detection performance of models with different fusion settings on the UCF-Sports and J-HMDB datasets.
Detailed Description
The invention is described in detail below with reference to the figures and a specific embodiment. Note that the following description of the embodiment is essentially an example; the present invention is not limited to this application or use, nor to the following embodiment.
Examples
The embodiment provides a video behavior detection method based on an enhanced three-stream architecture, as shown in fig. 1, including the steps of:
1) Data generation step: given an input video, the optical flow is computed with the Brox optical flow algorithm and converted into optical flow maps, and a Fast-Net network generates human semantic segmentation images of the same size as the input frames, i.e. human pose maps, giving the three forms of input.
In the data generation step, the optical flow maps are generated as follows: given the input video, the optical flow between every two adjacent original frames is computed with the Brox optical flow algorithm; the horizontal component, the vertical component and the magnitude of the flow form the three channels of an image, and the value of each channel at every pixel is quantized to [0, 255], giving an optical flow map with the same resolution as the original frame. The flow map computed from two adjacent frames is assigned to the earlier of the two frames, and the flow map of the last frame is copied from the one before it.
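For illustration only, the sketch below builds such a three-channel flow map from a pair of frames and quantizes each channel to [0, 255]. It substitutes OpenCV's Farnebäck estimator for the Brox algorithm named above (a plain-OpenCV stand-in, not the patented pipeline), and the per-channel min-max quantization is an assumption, since the exact scaling is not specified here.

```python
import cv2
import numpy as np

def flow_map(prev_bgr, next_bgr):
    """Approximate flow map: channels (dx, dy, |flow|), each quantized to [0, 255].
    Farneback is used as a stand-in for the Brox algorithm mentioned in the text."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(dx ** 2 + dy ** 2)
    channels = []
    for c in (dx, dy, mag):
        c = c - c.min()                       # assumed quantization: per-channel min-max
        c = c / (c.max() + 1e-8) * 255.0
        channels.append(c.astype(np.uint8))
    return np.stack(channels, axis=-1)        # same resolution as the input frames
```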
In the data generation step, the human pose maps are generated as follows: the network parameters of Fast-Net, originally designed for road segmentation, are fine-tuned on the J-HMDB behavior recognition dataset; the fine-tuned Fast-Net processes every original frame of the input video and assigns one label to each pixel (15 body-part labels plus 1 background label), and each label is mapped to a preset RGB value, giving a semantic segmentation image of human body parts, i.e. a human pose map, with the same resolution as the original frame.
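The label-to-colour mapping of this step can be sketched as follows; the 16-entry palette is an illustrative assumption, and the fine-tuned Fast-Net itself is not reproduced here.

```python
import numpy as np

# Hypothetical palette: 1 background label + 15 body-part labels -> preset RGB values.
PALETTE = np.array(
    [[0, 0, 0]] + [[(i * 83) % 256, (i * 157) % 256, (i * 211) % 256] for i in range(1, 16)],
    dtype=np.uint8)

def pose_map_from_labels(label_map):
    """label_map: (H, W) int array of per-pixel labels produced by the fine-tuned
    segmentation network (0 = background, 1..15 = body parts).
    Returns an RGB pose map with the same resolution as the original frame."""
    return PALETTE[label_map]
```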
2) Behavior detection step: the original frames, optical flow maps and human pose maps serve as the input of the enhanced appearance stream, the optical flow maps serve as the input of the motion stream, and the human pose maps serve as the input of the pose stream, forming the three-stream input; at each time step the enhanced appearance stream, the motion stream and the pose stream each process their input and produce a detection result comprising class scores and anchor-cube coordinate regression values.
As shown in FIG. 2 and FIG. 3, the conventional appearance stream takes only the original frames as input, whereas the enhanced appearance stream takes the original frames, optical flow maps and human pose maps as input; the motion stream takes the optical flow maps as input and the pose stream takes the human pose maps as input.
The three streams are trained separately. In this embodiment, the training process is as follows:
21) For each stream, the image sequence of every input is augmented (horizontal flipping, photometric distortion, random cropping, etc.) and every image is scaled to 300 × 300. The enhanced appearance stream takes as input D = 6 consecutive original frames {f_k, f_{k+1}, ..., f_{k+D-1}}, the D × 5 consecutive optical flow maps corresponding to those frames, (o_k, o_{k+1}, ..., o_{k+4}), (o_{k+1}, o_{k+2}, ..., o_{k+5}), ..., (o_{k+D-1}, o_{k+D}, ..., o_{k+D+3}), and the D corresponding human pose maps {p_k, p_{k+1}, ..., p_{k+D-1}}. The motion stream takes the same D × 5 consecutive optical flow maps as input (each original frame corresponds to five consecutive optical flow maps), and the pose stream takes the D human pose maps {p_k, p_{k+1}, ..., p_{k+D-1}} as input.
22) For the enhanced appearance stream, the original frame f_i, the optical flow maps (o_i, o_{i+1}, ..., o_{i+4}) and the human pose map p_i of each of the D frames are passed through SSD convolutional layers for feature extraction; the feature maps of the three inputs are concatenated, and a 1 × 1 convolution fuses the three visual cues at every spatial location of the concatenated feature map; the feature maps of the same feature layer from the D frames are then stacked and fed simultaneously to a classification convolutional layer and a regression convolutional layer, the classification layer producing N + 1 softmax scores (N behavior classes plus 1 background class) for every preset anchor cube and the regression layer producing D × 4 regression values (four per frame) for every preset anchor cube.
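A minimal PyTorch sketch of the per-frame cue fusion described in step 22): the three per-cue feature maps are concatenated along the channel axis and mixed by a 1 × 1 convolution at every spatial location. The channel counts and spatial size are illustrative, not the actual SSD configuration of the patent.

```python
import torch
import torch.nn as nn

class CueFusion(nn.Module):
    """Concatenate appearance / flow / pose feature maps and fuse them with a 1x1 conv."""
    def __init__(self, c_rgb, c_flow, c_pose, c_out):
        super().__init__()
        self.fuse = nn.Conv2d(c_rgb + c_flow + c_pose, c_out, kernel_size=1)

    def forward(self, f_rgb, f_flow, f_pose):
        # all three maps share the same spatial size (H, W)
        return self.fuse(torch.cat([f_rgb, f_flow, f_pose], dim=1))

# e.g. fusing hypothetical 512-channel features of one frame from each cue
fusion = CueFusion(512, 512, 512, 512)
fused = fusion(torch.randn(1, 512, 38, 38), torch.randn(1, 512, 38, 38), torch.randn(1, 512, 38, 38))
```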
23) For the motion stream (and likewise the pose stream), the detection principle is similar to that of the conventional appearance stream shown in FIG. 3: the optical flow maps (o_i, o_{i+1}, ..., o_{i+4}) (or the human pose map p_i) of each of the D frames are passed through SSD convolutional layers for feature extraction; the feature maps of the same feature layer from the D frames are stacked and fed simultaneously to the classification and regression convolutional layers, which produce N + 1 softmax scores (N behavior classes plus 1 background class) and D × 4 regression values (four per frame) for every preset anchor cube.
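The shared head structure of steps 22) and 23) can be sketched as follows, assuming the per-frame feature maps of one feature layer have already been stacked along the channel axis: one convolution outputs (N + 1) class scores per anchor cube, another outputs D × 4 regression values per anchor cube. The channel count, anchors per location and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    """Per-feature-layer head: class scores and per-frame box regressions for each anchor cube."""
    def __init__(self, c_per_frame, d_frames, n_classes, anchors_per_loc):
        super().__init__()
        c_in = c_per_frame * d_frames            # D per-frame feature maps stacked on channels
        self.cls = nn.Conv2d(c_in, anchors_per_loc * (n_classes + 1), kernel_size=3, padding=1)
        self.reg = nn.Conv2d(c_in, anchors_per_loc * d_frames * 4, kernel_size=3, padding=1)

    def forward(self, stacked_feats):
        return self.cls(stacked_feats), self.reg(stacked_feats)

head = TubeletHead(c_per_frame=512, d_frames=6, n_classes=24, anchors_per_loc=4)
scores, regs = head(torch.randn(1, 512 * 6, 38, 38))
```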
24) The training loss of each stream is computed, and the parameters of the detection network are adjusted with stochastic gradient descent (SGD) to minimize it, so that higher detection accuracy is obtained at test time. An anchor cube whose overlap with at least one ground-truth behavior tubelet exceeds the threshold τ is regarded as a positive anchor cube. Let Pos denote the set of positive anchor cubes and Neg the set of the remaining (negative) anchor cubes. Let x_{ij}^l \in \{0, 1\} indicate whether anchor cube i is successfully matched to the ground-truth tubelet j of class l: x_{ij}^l = 1 if it is, and 0 otherwise. The total loss L during training of each stream consists of the classification loss L_{conf} and the localization loss L_{loc}:

L = \frac{1}{N}\left(L_{conf} + L_{loc}\right)

where N = \sum_{i,j,l} x_{ij}^l is the total number of successfully matched anchor cubes. The classification loss L_{conf} is based on the softmax loss:

L_{conf} = -\sum_{i \in Pos} x_{ij}^l \log \hat{s}_i^{\,l} - \sum_{i \in Neg} \log \hat{s}_i^{\,0}

where x denotes the anchor cubes, s the confidence scores, \hat{s}_i^{\,l} the softmax score of anchor cube i on class l, and \hat{s}_i^{\,0} its score on the background class. The localization loss L_{loc} is the smooth-L1 loss between the offsets of the ground-truth tubelet (g) relative to the anchor cube and the regression values (p) predicted by the network. L_{loc} is averaged over the D frames, and the offsets and regression values are expressed in terms of the width (w), height (h) and center coordinates (cx, cy) of each anchor box of the anchor cube (a):

L_{loc} = \frac{1}{D} \sum_{d=1}^{D} \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^l \, \mathrm{smooth}_{L1}\!\left(p_{i,d}^{\,m} - \hat{g}_{j,d}^{\,m}\right)

\hat{g}_j^{\,cx} = \frac{g_j^{\,cx} - a_i^{\,cx}}{a_i^{\,w}}, \quad \hat{g}_j^{\,cy} = \frac{g_j^{\,cy} - a_i^{\,cy}}{a_i^{\,h}}, \quad \hat{g}_j^{\,w} = \log\frac{g_j^{\,w}}{a_i^{\,w}}, \quad \hat{g}_j^{\,h} = \log\frac{g_j^{\,h}}{a_i^{\,h}}
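A condensed sketch of the loss of step 24), assuming anchor matching has already been done and the positive samples gathered; it follows the SSD-style formulation reconstructed above, with the hard-negative mining that SSD normally applies to the classification term omitted for brevity.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, pos_mask, reg_pred, reg_target):
    """cls_logits: (A, N+1) scores for all anchor cubes; labels: (A,) long class ids (0 = background);
    pos_mask: (A,) bool, anchor cubes matched to a ground-truth tubelet;
    reg_pred / reg_target: (P, D, 4) regression values and encoded offsets for the positives."""
    n_matched = pos_mask.sum().clamp(min=1).float()
    # classification: softmax cross-entropy over all anchor cubes (no hard-negative mining here)
    l_conf = F.cross_entropy(cls_logits, labels, reduction='sum')
    # localization: smooth-L1 over the positives, averaged over the D frames of each tubelet
    l_loc = F.smooth_l1_loss(reg_pred, reg_target, reduction='none').sum(dim=-1).mean(dim=1).sum()
    return (l_conf + l_loc) / n_matched
```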
in the behavior detection step, a trained detection network is used for detection, at each time step, the enhanced appearance flow takes D frames of continuous original images, Dx 5 continuous light flow graphs corresponding to the original images and D human posture graphs corresponding to the original images as input, the kinetic flow takes Dx 5 continuous light flow graphs corresponding to the original images as input, and the posture flow takes D human posture graphs corresponding to the original images as input. For each workflow, each image of the input sequence is scaled to a size of 300 × 300, and the corresponding input, after processing by the respective workflow, will generate N +1 softmax scores (N behavior classes and 1 background class) and D × 4 anchor cube coordinate regression values (four regression values per frame) for each preset anchor cube.
3) Three-stream fusion step: the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the fused score is combined with the fused regression values to obtain a set of regressed behavior detection tubelets with class scores.
The three-stream fusion step specifically comprises:
31) For each anchor cube on behavior class l, the class score S_1 from the enhanced appearance stream, the class score S_2 from the motion stream and the class score S_3 from the pose stream, all computed with the network parameters \Theta, are combined by extended weighted fusion into the fused class score S:

S = \frac{\alpha_1 S_1 + \alpha_2 S_2 + \alpha_3 S_3}{\alpha_1 + \alpha_2 + \alpha_3}

where \alpha_1, \alpha_2, \alpha_3 are the weight coefficients assigned to the corresponding streams; each coefficient may take the value 1, 2 or 3, representing a low, medium or high contribution of that stream to the final detection result. The extended weighted fusion remedies a shortcoming of the original weighted fusion of the three-stream architecture by also covering the case where different streams contribute equally, giving 3 × 3 × 3 = 27 possible ratios of the weight coefficients among the enhanced appearance stream, the motion stream and the pose stream.
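A small sketch of the extended weighted fusion of step 31), under the assumption (used in the reconstruction above) that the fused score is the coefficient-weighted average of the three per-stream scores; enumerating the coefficients over {1, 2, 3} gives the 27 ratios mentioned in the text.

```python
import itertools
import numpy as np

def fuse_scores(s_app, s_motion, s_pose, alphas=(3, 3, 1)):
    """Weighted fusion of per-class scores from the three streams, shape (n_anchors, n_classes)."""
    a1, a2, a3 = alphas
    return (a1 * s_app + a2 * s_motion + a3 * s_pose) / (a1 + a2 + a3)

# enumerate all 3 x 3 x 3 = 27 coefficient ratios evaluated in the experiments
all_ratios = list(itertools.product((1, 2, 3), repeat=3))
```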
32) For each anchor cube on each behavior class, the coordinate regression values from the enhanced appearance stream are taken as the fused regression values.
33) For each anchor cube on each behavior class, the fused score is combined with the fused regression values to convert the anchor cube, whose spatial extent is fixed over time, into a regressed behavior detection tubelet that carries a class score and whose spatial extent varies from frame to frame.
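Step 33) turns each spatially rigid anchor cube into a regressed tubelet by applying the per-frame regression values. The inverse transform is not spelled out here, so the sketch below assumes the usual SSD-style decoding that matches the offset encoding used in the localization loss above.

```python
import numpy as np

def decode_tubelet(anchor, regs):
    """anchor: (4,) = (cx, cy, w, h), shared by all D frames of the anchor cube;
    regs: (D, 4) per-frame regression values (dcx, dcy, dw, dh).
    Returns (D, 4) per-frame boxes (cx, cy, w, h) of the regressed tubelet."""
    cx = anchor[0] + regs[:, 0] * anchor[2]
    cy = anchor[1] + regs[:, 1] * anchor[3]
    w = anchor[2] * np.exp(regs[:, 2])
    h = anchor[3] * np.exp(regs[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```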
4) Behavior tube construction step: the behavior detection tubelets of successive time steps are linked over time with a real-time greedy algorithm, and a temporal smoothing strategy is applied to build the final 3D behavior detection tubes of the input video.
The behavior tube construction step specifically comprises:
41) The regressed behavior detection tubelets generated at each time step are filtered by non-maximum suppression (NMS), keeping the top A tubelets.
42) The tube construction proceeds class by class. For each behavior class, the active behavior tubes linked up to time f-1 are sorted by tube score from high to low (the tube score being the average of the class scores of the detection tubelets that compose the tube), and for each active tube in turn the best-matching tubelet among the A detection tubelets generated at time f is selected for linking. The matching rule is: the overlap between the candidate tubelet and the last tubelet member of the active tube is higher than τ, and its class score is the highest among the tubelets not yet linked. Note that a linked detection tubelet cannot be linked to any other active tube, a detection tubelet left unlinked at time f becomes a new active tube at the next time step, and an active tube that fails to be linked to a detection tubelet for D-1 consecutive frames is considered finished.
43) After the tubelet linking finishes, a temporal smoothing strategy averages the detection boxes of different detection tubelets that fall on the same video frame, converting each linked active tube into a smooth behavior tube. The result is a set of 3D behavior detection tubes for the input video.
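The linking rule of steps 42) and 43) can be outlined as below in a simplified, single-class form; the IoU threshold τ, the per-step candidate count A and the D-1 frame termination window follow the text, while the data structures and the externally supplied iou() function are assumptions.

```python
def link_tubelets(active_tubes, candidates, tau, iou):
    """active_tubes: list of dicts {'members': [tubelet, ...], 'score': float, 'missed': int};
    candidates: the top-A regressed tubelets of the current time step, each a dict with a 'score';
    iou(a, b): overlap between the last box of tubelet a and the first box of tubelet b."""
    used = set()
    for tube in sorted(active_tubes, key=lambda t: t['score'], reverse=True):
        best, best_score = None, -1.0
        for i, cand in enumerate(candidates):
            if i in used or iou(tube['members'][-1], cand) < tau:
                continue
            if cand['score'] > best_score:          # highest-scoring unlinked candidate above tau
                best, best_score = i, cand['score']
        if best is not None:
            used.add(best)
            tube['members'].append(candidates[best])
            tube['missed'] = 0
            # tube score = mean class score of the member tubelets
            tube['score'] = sum(m['score'] for m in tube['members']) / len(tube['members'])
        else:
            tube['missed'] += 1                     # terminated after D-1 consecutive misses
    # candidates left unlinked start new active tubes at the next time step
    for i, cand in enumerate(candidates):
        if i not in used:
            active_tubes.append({'members': [cand], 'score': cand['score'], 'missed': 0})
    return active_tubes
```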
To verify the performance of the above method, the following experiment was designed.
Experimental validation was performed on three public datasets (UCF-Sports, J-HMDB and UCF-101). The UCF-Sports dataset contains 150 videos in 10 sports-related categories; every video is trimmed so that each frame contains a behavior, and the experiments use the standard training/testing split. The J-HMDB dataset contains 928 trimmed videos covering 21 behavior classes; it is usually divided into 3 training/testing splits and the final result is the average over the three. The videos of the UCF-101 dataset are untrimmed and cover 24 behavior labels in 3207 videos used for detection, making it harder than the first two datasets; the experiments use the results of its first split.
The evaluation metrics used in the experiments include frame-level metrics (frame-AP, frame-MABO, frame-CLASSIF) and the video-level metric video-AP. frame-AP, frame-MABO and frame-CLASSIF respectively measure the overall performance, the localization accuracy and the classification accuracy of the frame-level results, while video-AP evaluates the effectiveness of the video-level results. Note that an mAP value is the average of the AP values over all categories. A detection is counted as a true positive if and only if it is classified correctly and its overlap with a ground-truth result exceeds a threshold δ.
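At the frame level, the true-positive criterion above reduces to a class check plus an IoU test against the ground truth; a minimal version:

```python
def box_iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def is_true_positive(pred_label, pred_box, gt_label, gt_box, delta=0.5):
    # a detection counts as a true positive iff the class is right and IoU exceeds delta
    return pred_label == gt_label and box_iou(pred_box, gt_box) > delta
```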
To evaluate the effectiveness of the enhanced appearance stream, FIG. 4 compares the detection results of the enhanced appearance stream and the conventional appearance stream on the 10 categories of the UCF-Sports dataset. The enhanced appearance stream is clearly better on certain behavior categories (e.g. "Skateboard" and "Walk"). This shows that, by incorporating the additional optical flow and human pose inputs, the appearance stream can learn more useful information for predicting behavior classes. The frame-level results on the UCF-Sports dataset are listed in Table 1; the enhanced appearance stream outperforms the conventional appearance stream on every metric, and on some metrics it even surpasses the conventional three-stream detector that fuses the results of three streams.
TABLE 1 Comparison of frame-level metrics on the UCF-Sports dataset

Method                               frame-mAP   frame-MABO   frame-CLASSIF
Conventional appearance stream       82.56       82.41        79.46
Conventional three-stream detector   87.93       85.66        82.05
Enhanced appearance stream           91.34       85.29        85.48
To evaluate the extended weighted fusion of the three streams, FIG. 5 compares the frame-level results of the enhanced behavior detector on the UCF-Sports and J-HMDB datasets under all 3 × 3 × 3 = 27 weight-coefficient ratios. The extended weighted fusion adds the case, missed by the conventional method, in which different streams contribute equally to the result. The results show that changing the coefficient ratio significantly affects frame-mAP and frame-CLASSIF but not frame-MABO, indicating that the weighting of the three streams mainly influences classification rather than localization. The curves show that detection improves markedly when the enhanced appearance stream and the motion stream dominate. Considering the overall performance across the datasets, the final weight ratio used by the method is 3:3:1; the frame-level results of the detector on the J-HMDB dataset under this ratio are shown in Table 2.
TABLE 2 Comparison of frame-level metrics on the J-HMDB dataset

Method                               frame-mAP   frame-MABO   frame-CLASSIF
Conventional three-stream detector   67.1        85.1         64.8
Enhanced three-stream detector       68.0        87.6         65.8
Tables 3, 4 and 5 compare the frame-level and video-level metrics of the method on the UCF-Sports, J-HMDB and UCF-101 datasets with current state-of-the-art behavior detection methods.
TABLE 3 Comparison of methods on the UCF-Sports dataset
TABLE 4 Comparison of methods on the J-HMDB dataset
TABLE 5 Comparison of methods on the UCF-101 dataset
The experiments show that, on the three public datasets, the video behavior detection method based on the enhanced three-stream architecture classifies accurately, localizes precisely and runs conveniently and quickly, and it has clear advantages and application prospects among currently known video behavior detection models.
The above embodiment is merely an example and does not limit the scope of the present invention. It may be implemented in various other manners, and various omissions, substitutions and changes may be made without departing from the technical spirit of the present invention.

Claims (10)

1. A video behavior detection method based on an enhanced three-stream architecture, characterized by comprising the following steps:
a data generation step, in which an input video is acquired and the corresponding optical flow maps and human pose maps are derived from the original frames;
a behavior detection step, in which an enhanced appearance stream is built from the original frames, optical flow maps and human pose maps, the optical flow maps form a motion stream and the human pose maps form a pose stream, yielding a three-stream input; at each time step the enhanced appearance stream, the motion stream and the pose stream each perform detection and produce a result comprising class scores and anchor-cube coordinate regression values;
a three-stream fusion step, in which the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the fused score is combined with the fused regression values to obtain a set of regressed behavior detection tubelets with class scores;
and a behavior tube construction step, in which the behavior detection tubelets of successive time steps are linked over time to construct the 3D behavior detection tubes of the input video.
2. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that the optical flow maps are obtained as follows:
the optical flow between every two adjacent original frames is computed with the Brox optical flow algorithm; the horizontal component, the vertical component and the magnitude of the resulting flow form the three channels of an image, and the value of each channel at every pixel is quantized to the range [0, 255], giving an optical flow map with the same resolution as the original frame.
3. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that the human pose maps are obtained as follows:
the Fast-Net network is fine-tuned on the J-HMDB behavior recognition dataset; the fine-tuned Fast-Net assigns a label to every pixel of the original frame, and each label is mapped to a preset RGB value, giving a semantic segmentation image of human body parts, i.e. a human pose map, with the same resolution as the original frame.
4. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior detection step, at each time step the enhanced appearance stream takes as input D consecutive original frames, the D × 5 consecutive optical flow maps corresponding to those frames and the D corresponding human pose maps; the motion stream takes the D × 5 consecutive optical flow maps as input; and the pose stream takes the D human pose maps as input.
5. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior detection step, the enhanced appearance stream extracts features from the original frame, optical flow maps and human pose map of each frame of the input sequence with SSD network layers; for every frame the feature maps of the three inputs are fused by convolution, the feature maps coming from the same feature layer of every frame are then stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
6. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior detection step, the motion stream and the pose stream extract features from their input image of every frame of the input sequence with SSD network layers; the feature maps coming from the same feature layer of every frame are stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
7. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, when the streams are trained, the total loss of the detection network is computed and the network weights of each stream are adjusted by stochastic gradient descent, the total loss L of the detection network being expressed as:
L = \frac{1}{N}\left(L_{conf} + L_{loc}\right)
where L_{conf} denotes the classification loss, L_{loc} denotes the localization loss and N denotes the total number of successfully matched anchor cubes.
8. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the three-stream fusion step, the extended weighted fusion of the class scores of the three streams is specifically:
S = \frac{\alpha_1 S_1 + \alpha_2 S_2 + \alpha_3 S_3}{\alpha_1 + \alpha_2 + \alpha_3}
where S, S_1, S_2 and S_3 denote, for a given anchor cube, the fused class score on class l, the class score from the enhanced appearance stream, the class score from the motion stream and the class score from the pose stream, all computed with the network parameters \Theta, and \alpha_1, \alpha_2, \alpha_3 are the weight coefficients assigned to the corresponding streams.
9. The video behavior detection method based on the enhanced three-stream architecture according to claim 8, characterized in that the ratio of the weight coefficients \alpha_1 : \alpha_2 : \alpha_3 is 3:3:1.
10. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior tube construction step, the behavior detection tubelets of successive time steps are linked over time with a real-time greedy algorithm to construct the 3D behavior detection tubes of the input video, specifically:
the regressed behavior detection tubelets generated at each time step are filtered by non-maximum suppression, keeping the top A tubelets;
the tube construction proceeds class by class; for each behavior class, the active behavior tubes linked up to time f-1 are sorted by tube score from high to low, and for each active tube in turn the best-matching tubelet among the A detection tubelets generated at time f is selected for linking, the matching rule being: the overlap between the candidate tubelet and the last tubelet member of the active tube is higher than τ, and its class score is the highest among the tubelets not yet linked; a linked detection tubelet cannot be linked to any other active tube; a detection tubelet left unlinked at time f becomes a new active tube at the next time step, and an active tube that fails to be linked to a detection tubelet for D-1 consecutive frames is considered finished;
after the tubelet linking finishes, a temporal smoothing strategy averages the detection boxes of different detection tubelets that fall on the same video frame, converting each linked active tube into a smooth behavior tube and giving the 3D behavior detection tubes of the input video.
CN201910764109.5A 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture Pending CN110633645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910764109.5A CN110633645A (en) 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910764109.5A CN110633645A (en) 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture

Publications (1)

Publication Number Publication Date
CN110633645A true CN110633645A (en) 2019-12-31

Family

ID=68970425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910764109.5A Pending CN110633645A (en) 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture

Country Status (1)

Country Link
CN (1) CN110633645A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112651291A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based posture estimation method, device, medium and electronic equipment
CN112906549A (en) * 2021-02-07 2021-06-04 同济大学 Video behavior detection method based on space-time capsule network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109284667A (en) * 2018-07-26 2019-01-29 同济大学 A kind of three streaming human motion action space area detecting methods towards video
CN109377555A (en) * 2018-11-14 2019-02-22 江苏科技大学 Autonomous underwater robot prospect visual field three-dimensional reconstruction target's feature-extraction recognition methods
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109284667A (en) * 2018-07-26 2019-01-29 同济大学 A kind of three streaming human motion action space area detecting methods towards video
CN109377555A (en) * 2018-11-14 2019-02-22 江苏科技大学 Autonomous underwater robot prospect visual field three-dimensional reconstruction target's feature-extraction recognition methods
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Karen Simonyan et al., "Two-Stream Convolutional Networks for Action Recognition in Videos", Advances in Neural Information Processing Systems *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN111931602B (en) * 2020-07-22 2023-08-08 北方工业大学 Attention mechanism-based multi-flow segmented network human body action recognition method and system
CN112651291A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based posture estimation method, device, medium and electronic equipment
CN112906549A (en) * 2021-02-07 2021-06-04 同济大学 Video behavior detection method based on space-time capsule network

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN110633645A (en) Video behavior detection method based on enhanced three-stream architecture
CN109598735A (en) Method using the target object in Markov D-chain trace and segmented image and the equipment using this method
CN111709410B (en) Behavior identification method for strong dynamic video
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN110222604B (en) Target identification method and device based on shared convolutional neural network
CN111274921A (en) Method for recognizing human body behaviors by utilizing attitude mask
KR20200075114A (en) System and Method for Matching Similarity between Image and Text
CN109670405A (en) A kind of complex background pedestrian detection method based on deep learning
CN111696136B (en) Target tracking method based on coding and decoding structure
CN107944437B (en) A kind of Face detection method based on neural network and integral image
CN111339908A (en) Group behavior identification method based on multi-mode information fusion and decision optimization
CN113361542A (en) Local feature extraction method based on deep learning
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN109670401B (en) Action recognition method based on skeletal motion diagram
CN116630608A (en) Multi-mode target detection method for complex scene
CN110751226A (en) Crowd counting model training method and device and storage medium
CN109272577A (en) A kind of vision SLAM method based on Kinect
CN115661777A (en) Semantic-combined foggy road target detection algorithm
CN109284667B (en) Three-stream type human motion behavior space domain detection method facing video
CN109740672B (en) Multi-stream feature distance fusion system and fusion method
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
Tian et al. High-speed tiny tennis ball detection based on deep convolutional neural networks
CN111738099B (en) Face automatic detection method based on video image scene understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191231