CN110633645A - Video behavior detection method based on enhanced three-stream architecture - Google Patents

Video behavior detection method based on enhanced three-stream architecture

Info

Publication number
CN110633645A
CN110633645A
Authority
CN
China
Prior art keywords
flow
behavior
detection
behavior detection
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910764109.5A
Other languages
Chinese (zh)
Inventor
王瀚漓
吴雨唐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201910764109.5A
Publication of CN110633645A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video behavior detection method based on an enhanced three-stream architecture, comprising the following steps: a data generation step, in which an input video is acquired and optical flow maps and human pose maps are derived from the original frames; a behavior detection step, in which the original frames, optical flow maps and human pose maps serve as the input of an enhanced appearance stream, the optical flow maps serve as the input of a motion stream, and the human pose maps serve as the input of a pose stream, each stream processing its input at every time step to produce a detection result; a three-stream fusion step, in which the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the two together yield a set of regressed behavior detection tubelets with class scores; and a behavior tube construction step, in which the detection tubelets of successive time steps are linked over time to build the 3D behavior detection tubes of the input video. Compared with the prior art, the method offers high classification accuracy, precise localization, and convenient, fast operation.

Description

Video behavior detection method based on enhanced three-stream architecture
Technical Field
The invention relates to a video behavior detection method, in particular to a video behavior detection method based on an enhanced three-stream architecture.
Background
With the growing interest in video understanding in recent years, video behavior detection has received increasing attention and is widely used in video applications such as surveillance. Video behavior detection aims to classify and localize behaviors throughout a video. Compared with well-studied visual tasks such as image classification, progress on video behavior detection is still limited by insufficient training data, noisy video backgrounds and a large spatio-temporal search space.
The introduction of convolutional neural networks (CNNs) has brought significant progress to object detection. Given the similarity between behavior detection and object detection, most current state-of-the-art behavior detection methods combine CNN-based object detectors with a two-stream architecture.
In these methods, each stream produces a frame-level or sequence-level detection result at every time step, the two-stream results are fused, and the per-time-step results are linked over time to obtain 3D behavior detection tubes for the whole video. The three-stream behavior detection architecture adds the human pose as a visual cue in a third stream, which helps to characterize the spatial location of a behavior more clearly by suppressing the background and highlighting the acting person. However, these works process only a single visual cue per stream, i.e. the original image, the optical flow map or the human pose map, and ignore the interaction between the visual cues.
Chinese patent application CN109284667A discloses a three-stream spatial-domain detection method for human motion behaviors in video. Optical flow maps and human pose maps are derived from the original frames to form an RGB stream fed with the original images, a Flow stream fed with the optical flow maps, and a Pose stream fed with the human pose maps; at each time step a detector on each stream produces its detection result. The method processes a single visual cue per stream, that is, the original image, the optical flow map or the human pose map is handled on the RGB, Flow and Pose stream respectively. The original image is generally considered to carry the richest visual information, yet this conventional approach cannot fully exploit the appearance stream fed with the original images in behavior detection, so the detection performance still leaves considerable room for improvement.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art and provide a video behavior detection method based on an enhanced three-stream architecture.
The object of the invention can be achieved by the following technical solution:
A video behavior detection method based on an enhanced three-stream architecture comprises the following steps:
a data generation step, in which an input video is acquired and the corresponding optical flow maps and human pose maps are derived from the original frames;
a behavior detection step, in which an enhanced appearance stream is built from the original frames, optical flow maps and human pose maps, the optical flow maps form a motion stream and the human pose maps form a pose stream, yielding a three-stream input; at each time step the enhanced appearance stream, the motion stream and the pose stream each perform detection and produce a result comprising class scores and anchor-cube coordinate regression values;
a three-stream fusion step, in which the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the fused score is combined with the fused regression values to obtain a set of regressed behavior detection tubelets with class scores;
and a behavior tube construction step, in which the behavior detection tubelets of successive time steps are linked over time to construct the 3D behavior detection tubes of the input video.
Further, the optical flow maps are obtained as follows:
the optical flow between every two adjacent original frames is computed with the Brox optical flow algorithm; the horizontal component, the vertical component and the magnitude of the resulting flow form the three channels of an image, and the value of each channel at every pixel is quantized to the range [0, 255], giving an optical flow map with the same resolution as the original frame.
Further, the human pose maps are obtained as follows:
the Fast-Net network is fine-tuned on the J-HMDB behavior recognition dataset; the fine-tuned Fast-Net assigns a label to every pixel of the original frame, and each label is mapped to a preset RGB value, giving a semantic segmentation image of human body parts, i.e. a human pose map, with the same resolution as the original frame.
Further, in the behavior detection step, at each time step the enhanced appearance stream takes as input D consecutive original frames, the D × 5 consecutive optical flow maps corresponding to those frames and the D corresponding human pose maps; the motion stream takes the D × 5 consecutive optical flow maps as input; and the pose stream takes the D human pose maps as input.
Further, in the behavior detection step, the enhanced appearance stream extracts features from the original frame, optical flow maps and human pose map of each frame of the input sequence with SSD network layers; for every frame the feature maps of the three inputs are fused by convolution, the feature maps coming from the same feature layer of every frame are then stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
Further, in the behavior detection step, the motion stream and the pose stream extract features from their input image of every frame of the input sequence with SSD network layers; the feature maps coming from the same feature layer of every frame are stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
Further, when the streams are trained, the total loss of the detection network is computed and the network weights of each stream are adjusted by stochastic gradient descent (SGD), the total loss L of the detection network being expressed as:

L = \frac{1}{N}\left(L_{conf} + L_{loc}\right)

where L_{conf} denotes the classification loss, L_{loc} denotes the localization loss and N denotes the total number of successfully matched anchor cubes.
Further, in the three-stream fusion step, the extended weighted fusion of the class scores of the three streams is specifically:

S = \frac{\alpha_1 S_1 + \alpha_2 S_2 + \alpha_3 S_3}{\alpha_1 + \alpha_2 + \alpha_3}

where S, S_1, S_2 and S_3 denote, for a given anchor cube, the fused class score on class l, the class score from the enhanced appearance stream, the class score from the motion stream and the class score from the pose stream, all computed with the network parameters \Theta; \alpha_1, \alpha_2, \alpha_3 are the weight coefficients assigned to the corresponding streams, the size of a coefficient representing the degree of contribution of that stream to the final detection result.
Further, the ratio of the weight coefficients \alpha_1 : \alpha_2 : \alpha_3 is 3:3:1.
Further, in the behavior tube construction step, the behavior detection tubelets of successive time steps are linked over time with a real-time greedy algorithm to construct the 3D behavior detection tubes of the input video, specifically:
the regressed behavior detection tubelets generated at each time step are filtered by non-maximum suppression, keeping the top A tubelets;
the tube construction proceeds class by class; for each behavior class, the active behavior tubes already linked up to time f-1 are sorted by tube score from high to low (the tube score being the average of the class scores of the detection tubelets that compose the tube), and for each active tube in turn the best-matching tubelet among the A detection tubelets generated at time f is selected for linking, the matching rule being: the overlap between the candidate tubelet and the last tubelet member of the active tube is higher than τ, and its class score is the highest among the tubelets not yet linked; a linked detection tubelet cannot be linked to any other active tube; a detection tubelet left unlinked at time f becomes a new active tube at the next time step, and an active tube that fails to be linked to a detection tubelet for D-1 consecutive frames is considered finished;
after the tubelet linking finishes, a temporal smoothing strategy averages the detection boxes of different detection tubelets that fall on the same video frame, converting each linked active tube into a smooth behavior tube and giving the 3D behavior detection tubes of the input video.
Compared with the prior art, the invention has the following advantages:
(1) High classification accuracy: the invention proposes an enhanced appearance stream that takes three forms of visual cue as input; by taking the optical flow maps as an input, the enhanced appearance stream can capture the motion of human behaviors and correctly classify behaviors that are hard to distinguish from static frames alone, so that the overall classification accuracy improves after the extended three-stream fusion;
(2) Accurate localization: by taking the human pose maps as an input, the enhanced appearance stream can suppress background interference and capture the precise extent of the moving body, so that the overall localization accuracy improves after the extended three-stream fusion; in addition, the weight coefficients designed for the three streams further improve the detection precision;
(3) Convenient and fast operation: the method is designed on the SSD framework and performs end-to-end detection from input to output, which makes it highly practical; the behavior tube construction uses a real-time greedy algorithm, ensuring the efficiency of the whole detection pipeline.
Drawings
FIG. 1 is a schematic diagram of the enhanced three-stream behavior tubelet detector model of the present invention;
FIG. 2 is a schematic diagram of the operation of the enhanced appearance stream in the present invention;
FIG. 3 is a schematic diagram of the operation of a conventional appearance stream in the present invention;
FIG. 4 compares the detection results of the enhanced appearance stream of the present invention and the conventional appearance stream on the 10 categories of the UCF-Sports dataset;
FIG. 5 compares the frame-level detection performance of models with different fusion settings on the UCF-Sports and J-HMDB datasets.
Detailed Description
The invention is described in detail below with reference to the figures and a specific embodiment. Note that the following description of the embodiment is essentially an example; the present invention is not limited to this application or use, nor to the following embodiment.
Examples
The embodiment provides a video behavior detection method based on an enhanced three-stream architecture, as shown in fig. 1, including the steps of:
1) Data generation step: given an input video, the optical flow is computed with the Brox optical flow algorithm and converted into optical flow maps, and a Fast-Net network generates human semantic segmentation images of the same size as the input frames, i.e. human pose maps, giving the three forms of input.
In the data generation step, the optical flow maps are generated as follows: given the input video, the optical flow between every two adjacent original frames is computed with the Brox optical flow algorithm; the horizontal component, the vertical component and the magnitude of the flow form the three channels of an image, and the value of each channel at every pixel is quantized to [0, 255], giving an optical flow map with the same resolution as the original frame. The flow map computed from two adjacent frames is assigned to the earlier of the two frames, and the flow map of the last frame is copied from the one before it.
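For illustration only, the sketch below builds such a three-channel flow map from a pair of frames and quantizes each channel to [0, 255]. It substitutes OpenCV's Farnebäck estimator for the Brox algorithm named above (a plain-OpenCV stand-in, not the patented pipeline), and the per-channel min-max quantization is an assumption, since the exact scaling is not specified here.

```python
import cv2
import numpy as np

def flow_map(prev_bgr, next_bgr):
    """Approximate flow map: channels (dx, dy, |flow|), each quantized to [0, 255].
    Farneback is used as a stand-in for the Brox algorithm mentioned in the text."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(dx ** 2 + dy ** 2)
    channels = []
    for c in (dx, dy, mag):
        c = c - c.min()                       # assumed quantization: per-channel min-max
        c = c / (c.max() + 1e-8) * 255.0
        channels.append(c.astype(np.uint8))
    return np.stack(channels, axis=-1)        # same resolution as the input frames
```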
In the data generation step, the human pose maps are generated as follows: the network parameters of Fast-Net, originally designed for road segmentation, are fine-tuned on the J-HMDB behavior recognition dataset; the fine-tuned Fast-Net processes every original frame of the input video and assigns one label to each pixel (15 body-part labels plus 1 background label), and each label is mapped to a preset RGB value, giving a semantic segmentation image of human body parts, i.e. a human pose map, with the same resolution as the original frame.
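The label-to-colour mapping of this step can be sketched as follows; the 16-entry palette is an illustrative assumption, and the fine-tuned Fast-Net itself is not reproduced here.

```python
import numpy as np

# Hypothetical palette: 1 background label + 15 body-part labels -> preset RGB values.
PALETTE = np.array(
    [[0, 0, 0]] + [[(i * 83) % 256, (i * 157) % 256, (i * 211) % 256] for i in range(1, 16)],
    dtype=np.uint8)

def pose_map_from_labels(label_map):
    """label_map: (H, W) int array of per-pixel labels produced by the fine-tuned
    segmentation network (0 = background, 1..15 = body parts).
    Returns an RGB pose map with the same resolution as the original frame."""
    return PALETTE[label_map]
```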
2) Behavior detection step: the original frames, optical flow maps and human pose maps serve as the input of the enhanced appearance stream, the optical flow maps serve as the input of the motion stream, and the human pose maps serve as the input of the pose stream, forming the three-stream input; at each time step the enhanced appearance stream, the motion stream and the pose stream each process their input and produce a detection result comprising class scores and anchor-cube coordinate regression values.
As shown in FIG. 2 and FIG. 3, the conventional appearance stream takes only the original frames as input, whereas the enhanced appearance stream takes the original frames, optical flow maps and human pose maps as input; the motion stream takes the optical flow maps as input and the pose stream takes the human pose maps as input.
The three streams are trained separately. In this embodiment, the training process is as follows:
21) For each stream, the image sequence of every input is augmented (horizontal flipping, photometric distortion, random cropping, etc.) and every image is scaled to 300 × 300. The enhanced appearance stream takes as input D = 6 consecutive original frames {f_k, f_{k+1}, ..., f_{k+D-1}}, the D × 5 consecutive optical flow maps corresponding to those frames, (o_k, o_{k+1}, ..., o_{k+4}), (o_{k+1}, o_{k+2}, ..., o_{k+5}), ..., (o_{k+D-1}, o_{k+D}, ..., o_{k+D+3}), and the D corresponding human pose maps {p_k, p_{k+1}, ..., p_{k+D-1}}. The motion stream takes the same D × 5 consecutive optical flow maps as input (each original frame corresponds to five consecutive optical flow maps), and the pose stream takes the D human pose maps {p_k, p_{k+1}, ..., p_{k+D-1}} as input.
22) For the enhanced appearance stream, the original frame f_i, the optical flow maps (o_i, o_{i+1}, ..., o_{i+4}) and the human pose map p_i of each of the D frames are passed through SSD convolutional layers for feature extraction; the feature maps of the three inputs are concatenated, and a 1 × 1 convolution fuses the three visual cues at every spatial location of the concatenated feature map; the feature maps of the same feature layer from the D frames are then stacked and fed simultaneously to a classification convolutional layer and a regression convolutional layer, the classification layer producing N + 1 softmax scores (N behavior classes plus 1 background class) for every preset anchor cube and the regression layer producing D × 4 regression values (four per frame) for every preset anchor cube.
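A minimal PyTorch sketch of the per-frame cue fusion described in step 22): the three per-cue feature maps are concatenated along the channel axis and mixed by a 1 × 1 convolution at every spatial location. The channel counts and spatial size are illustrative, not the actual SSD configuration of the patent.

```python
import torch
import torch.nn as nn

class CueFusion(nn.Module):
    """Concatenate appearance / flow / pose feature maps and fuse them with a 1x1 conv."""
    def __init__(self, c_rgb, c_flow, c_pose, c_out):
        super().__init__()
        self.fuse = nn.Conv2d(c_rgb + c_flow + c_pose, c_out, kernel_size=1)

    def forward(self, f_rgb, f_flow, f_pose):
        # all three maps share the same spatial size (H, W)
        return self.fuse(torch.cat([f_rgb, f_flow, f_pose], dim=1))

# e.g. fusing hypothetical 512-channel features of one frame from each cue
fusion = CueFusion(512, 512, 512, 512)
fused = fusion(torch.randn(1, 512, 38, 38), torch.randn(1, 512, 38, 38), torch.randn(1, 512, 38, 38))
```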
23) For the motion stream (and likewise the pose stream), the detection principle is similar to that of the conventional appearance stream shown in FIG. 3: the optical flow maps (o_i, o_{i+1}, ..., o_{i+4}) (or the human pose map p_i) of each of the D frames are passed through SSD convolutional layers for feature extraction; the feature maps of the same feature layer from the D frames are stacked and fed simultaneously to the classification and regression convolutional layers, which produce N + 1 softmax scores (N behavior classes plus 1 background class) and D × 4 regression values (four per frame) for every preset anchor cube.
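The shared head structure of steps 22) and 23) can be sketched as follows, assuming the per-frame feature maps of one feature layer have already been stacked along the channel axis: one convolution outputs (N + 1) class scores per anchor cube, another outputs D × 4 regression values per anchor cube. The channel count, anchors per location and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    """Per-feature-layer head: class scores and per-frame box regressions for each anchor cube."""
    def __init__(self, c_per_frame, d_frames, n_classes, anchors_per_loc):
        super().__init__()
        c_in = c_per_frame * d_frames            # D per-frame feature maps stacked on channels
        self.cls = nn.Conv2d(c_in, anchors_per_loc * (n_classes + 1), kernel_size=3, padding=1)
        self.reg = nn.Conv2d(c_in, anchors_per_loc * d_frames * 4, kernel_size=3, padding=1)

    def forward(self, stacked_feats):
        return self.cls(stacked_feats), self.reg(stacked_feats)

head = TubeletHead(c_per_frame=512, d_frames=6, n_classes=24, anchors_per_loc=4)
scores, regs = head(torch.randn(1, 512 * 6, 38, 38))
```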
24) The training loss of each stream is computed, and the parameters of the detection network are adjusted with stochastic gradient descent (SGD) to minimize it, so that higher detection accuracy is obtained at test time. An anchor cube whose overlap with at least one ground-truth behavior tubelet exceeds the threshold τ is regarded as a positive anchor cube. Let Pos denote the set of positive anchor cubes and Neg the set of the remaining (negative) anchor cubes. Let x_{ij}^l \in \{0, 1\} indicate whether anchor cube i is successfully matched to the ground-truth tubelet j of class l: x_{ij}^l = 1 if it is, and 0 otherwise. The total loss L during training of each stream consists of the classification loss L_{conf} and the localization loss L_{loc}:

L = \frac{1}{N}\left(L_{conf} + L_{loc}\right)

where N = \sum_{i,j,l} x_{ij}^l is the total number of successfully matched anchor cubes. The classification loss L_{conf} is based on the softmax loss:

L_{conf} = -\sum_{i \in Pos} x_{ij}^l \log \hat{s}_i^{\,l} - \sum_{i \in Neg} \log \hat{s}_i^{\,0}

where x denotes the anchor cubes, s the confidence scores, \hat{s}_i^{\,l} the softmax score of anchor cube i on class l, and \hat{s}_i^{\,0} its score on the background class. The localization loss L_{loc} is the smooth-L1 loss between the offsets of the ground-truth tubelet (g) relative to the anchor cube and the regression values (p) predicted by the network. L_{loc} is averaged over the D frames, and the offsets and regression values are expressed in terms of the width (w), height (h) and center coordinates (cx, cy) of each anchor box of the anchor cube (a):

L_{loc} = \frac{1}{D} \sum_{d=1}^{D} \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^l \, \mathrm{smooth}_{L1}\!\left(p_{i,d}^{\,m} - \hat{g}_{j,d}^{\,m}\right)

\hat{g}_j^{\,cx} = \frac{g_j^{\,cx} - a_i^{\,cx}}{a_i^{\,w}}, \quad \hat{g}_j^{\,cy} = \frac{g_j^{\,cy} - a_i^{\,cy}}{a_i^{\,h}}, \quad \hat{g}_j^{\,w} = \log\frac{g_j^{\,w}}{a_i^{\,w}}, \quad \hat{g}_j^{\,h} = \log\frac{g_j^{\,h}}{a_i^{\,h}}
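A condensed sketch of the loss of step 24), assuming anchor matching has already been done and the positive samples gathered; it follows the SSD-style formulation reconstructed above, with the hard-negative mining that SSD normally applies to the classification term omitted for brevity.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, pos_mask, reg_pred, reg_target):
    """cls_logits: (A, N+1) scores for all anchor cubes; labels: (A,) long class ids (0 = background);
    pos_mask: (A,) bool, anchor cubes matched to a ground-truth tubelet;
    reg_pred / reg_target: (P, D, 4) regression values and encoded offsets for the positives."""
    n_matched = pos_mask.sum().clamp(min=1).float()
    # classification: softmax cross-entropy over all anchor cubes (no hard-negative mining here)
    l_conf = F.cross_entropy(cls_logits, labels, reduction='sum')
    # localization: smooth-L1 over the positives, averaged over the D frames of each tubelet
    l_loc = F.smooth_l1_loss(reg_pred, reg_target, reduction='none').sum(dim=-1).mean(dim=1).sum()
    return (l_conf + l_loc) / n_matched
```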
in the behavior detection step, a trained detection network is used for detection, at each time step, the enhanced appearance flow takes D frames of continuous original images, Dx 5 continuous light flow graphs corresponding to the original images and D human posture graphs corresponding to the original images as input, the kinetic flow takes Dx 5 continuous light flow graphs corresponding to the original images as input, and the posture flow takes D human posture graphs corresponding to the original images as input. For each workflow, each image of the input sequence is scaled to a size of 300 × 300, and the corresponding input, after processing by the respective workflow, will generate N +1 softmax scores (N behavior classes and 1 background class) and D × 4 anchor cube coordinate regression values (four regression values per frame) for each preset anchor cube.
3) Three-stream fusion step: the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the fused score is combined with the fused regression values to obtain a set of regressed behavior detection tubelets with class scores.
The three-stream fusion step specifically comprises:
31) For each anchor cube on behavior class l, the class score S_1 from the enhanced appearance stream, the class score S_2 from the motion stream and the class score S_3 from the pose stream, all computed with the network parameters \Theta, are combined by extended weighted fusion into the fused class score S:

S = \frac{\alpha_1 S_1 + \alpha_2 S_2 + \alpha_3 S_3}{\alpha_1 + \alpha_2 + \alpha_3}

where \alpha_1, \alpha_2, \alpha_3 are the weight coefficients assigned to the corresponding streams; each coefficient may take the value 1, 2 or 3, representing a low, medium or high contribution of that stream to the final detection result. The extended weighted fusion remedies a shortcoming of the original weighted fusion of the three-stream architecture by also covering the case where different streams contribute equally, giving 3 × 3 × 3 = 27 possible ratios of the weight coefficients among the enhanced appearance stream, the motion stream and the pose stream.
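A small sketch of the extended weighted fusion of step 31), under the assumption (used in the reconstruction above) that the fused score is the coefficient-weighted average of the three per-stream scores; enumerating the coefficients over {1, 2, 3} gives the 27 ratios mentioned in the text.

```python
import itertools
import numpy as np

def fuse_scores(s_app, s_motion, s_pose, alphas=(3, 3, 1)):
    """Weighted fusion of per-class scores from the three streams, shape (n_anchors, n_classes)."""
    a1, a2, a3 = alphas
    return (a1 * s_app + a2 * s_motion + a3 * s_pose) / (a1 + a2 + a3)

# enumerate all 3 x 3 x 3 = 27 coefficient ratios evaluated in the experiments
all_ratios = list(itertools.product((1, 2, 3), repeat=3))
```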
32) For each anchor cube on each behavior class, the coordinate regression values from the enhanced appearance stream are taken as the fused regression values.
33) For each anchor cube on each behavior class, the fused score is combined with the fused regression values to convert the anchor cube, whose spatial extent is fixed over time, into a regressed behavior detection tubelet that carries a class score and whose spatial extent varies from frame to frame.
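Step 33) turns each spatially rigid anchor cube into a regressed tubelet by applying the per-frame regression values. The inverse transform is not spelled out here, so the sketch below assumes the usual SSD-style decoding that matches the offset encoding used in the localization loss above.

```python
import numpy as np

def decode_tubelet(anchor, regs):
    """anchor: (4,) = (cx, cy, w, h), shared by all D frames of the anchor cube;
    regs: (D, 4) per-frame regression values (dcx, dcy, dw, dh).
    Returns (D, 4) per-frame boxes (cx, cy, w, h) of the regressed tubelet."""
    cx = anchor[0] + regs[:, 0] * anchor[2]
    cy = anchor[1] + regs[:, 1] * anchor[3]
    w = anchor[2] * np.exp(regs[:, 2])
    h = anchor[3] * np.exp(regs[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```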
4) Behavior tube construction step: the behavior detection tubelets of successive time steps are linked over time with a real-time greedy algorithm, and a temporal smoothing strategy is applied to build the final 3D behavior detection tubes of the input video.
The behavior tube construction step specifically comprises:
41) The regressed behavior detection tubelets generated at each time step are filtered by non-maximum suppression (NMS), keeping the top A tubelets.
42) The tube construction proceeds class by class. For each behavior class, the active behavior tubes linked up to time f-1 are sorted by tube score from high to low (the tube score being the average of the class scores of the detection tubelets that compose the tube), and for each active tube in turn the best-matching tubelet among the A detection tubelets generated at time f is selected for linking. The matching rule is: the overlap between the candidate tubelet and the last tubelet member of the active tube is higher than τ, and its class score is the highest among the tubelets not yet linked. Note that a linked detection tubelet cannot be linked to any other active tube, a detection tubelet left unlinked at time f becomes a new active tube at the next time step, and an active tube that fails to be linked to a detection tubelet for D-1 consecutive frames is considered finished.
43) After the tubelet linking finishes, a temporal smoothing strategy averages the detection boxes of different detection tubelets that fall on the same video frame, converting each linked active tube into a smooth behavior tube. The result is a set of 3D behavior detection tubes for the input video.
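The linking rule of steps 42) and 43) can be outlined as below in a simplified, single-class form; the IoU threshold τ, the per-step candidate count A and the D-1 frame termination window follow the text, while the data structures and the externally supplied iou() function are assumptions.

```python
def link_tubelets(active_tubes, candidates, tau, iou):
    """active_tubes: list of dicts {'members': [tubelet, ...], 'score': float, 'missed': int};
    candidates: the top-A regressed tubelets of the current time step, each a dict with a 'score';
    iou(a, b): overlap between the last box of tubelet a and the first box of tubelet b."""
    used = set()
    for tube in sorted(active_tubes, key=lambda t: t['score'], reverse=True):
        best, best_score = None, -1.0
        for i, cand in enumerate(candidates):
            if i in used or iou(tube['members'][-1], cand) < tau:
                continue
            if cand['score'] > best_score:          # highest-scoring unlinked candidate above tau
                best, best_score = i, cand['score']
        if best is not None:
            used.add(best)
            tube['members'].append(candidates[best])
            tube['missed'] = 0
            # tube score = mean class score of the member tubelets
            tube['score'] = sum(m['score'] for m in tube['members']) / len(tube['members'])
        else:
            tube['missed'] += 1                     # terminated after D-1 consecutive misses
    # candidates left unlinked start new active tubes at the next time step
    for i, cand in enumerate(candidates):
        if i not in used:
            active_tubes.append({'members': [cand], 'score': cand['score'], 'missed': 0})
    return active_tubes
```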
To verify the performance of the above method, the following experiment was designed.
Experimental validation was performed on three public datasets (UCF-Sports, J-HMDB and UCF-101). The UCF-Sports dataset contains 150 videos in 10 sports-related categories; every video is trimmed so that each frame contains a behavior, and the experiments use the standard training/testing split. The J-HMDB dataset contains 928 trimmed videos covering 21 behavior classes; it is usually divided into 3 training/testing splits and the final result is the average over the three. The videos of the UCF-101 dataset are untrimmed and cover 24 behavior labels in 3207 videos used for detection, making it harder than the first two datasets; the experiments use the results of its first split.
The evaluation metrics used in the experiments include frame-level metrics (frame-AP, frame-MABO, frame-CLASSIF) and the video-level metric video-AP. frame-AP, frame-MABO and frame-CLASSIF respectively measure the overall performance, the localization accuracy and the classification accuracy of the frame-level results, while video-AP evaluates the effectiveness of the video-level results. Note that an mAP value is the average of the AP values over all categories. A detection is counted as a true positive if and only if it is classified correctly and its overlap with a ground-truth result exceeds a threshold δ.
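At the frame level, the true-positive criterion above reduces to a class check plus an IoU test against the ground truth; a minimal version:

```python
def box_iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def is_true_positive(pred_label, pred_box, gt_label, gt_box, delta=0.5):
    # a detection counts as a true positive iff the class is right and IoU exceeds delta
    return pred_label == gt_label and box_iou(pred_box, gt_box) > delta
```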
To evaluate the effectiveness of the enhanced appearance stream, FIG. 4 compares the detection results of the enhanced appearance stream and the conventional appearance stream on the 10 categories of the UCF-Sports dataset. The enhanced appearance stream is clearly better on certain behavior categories (e.g. "Skateboard" and "Walk"). This shows that, by incorporating the additional optical flow and human pose inputs, the appearance stream can learn more useful information for predicting behavior classes. The frame-level results on the UCF-Sports dataset are listed in Table 1; the enhanced appearance stream outperforms the conventional appearance stream on every metric, and on some metrics it even surpasses the conventional three-stream detector that fuses the results of three streams.
TABLE 1 Comparison of frame-level metrics on the UCF-Sports dataset

Method                               frame-mAP   frame-MABO   frame-CLASSIF
Conventional appearance stream       82.56       82.41        79.46
Conventional three-stream detector   87.93       85.66        82.05
Enhanced appearance stream           91.34       85.29        85.48
To evaluate the extended weighted fusion of the three streams, FIG. 5 compares the frame-level results of the enhanced behavior detector on the UCF-Sports and J-HMDB datasets under all 3 × 3 × 3 = 27 weight-coefficient ratios. The extended weighted fusion adds the case, missed by the conventional method, in which different streams contribute equally to the result. The results show that changing the coefficient ratio significantly affects frame-mAP and frame-CLASSIF but not frame-MABO, indicating that the weighting of the three streams mainly influences classification rather than localization. The curves show that detection improves markedly when the enhanced appearance stream and the motion stream dominate. Considering the overall performance across the datasets, the final weight ratio used by the method is 3:3:1; the frame-level results of the detector on the J-HMDB dataset under this ratio are shown in Table 2.
TABLE 2 Comparison of frame-level metrics on the J-HMDB dataset

Method                               frame-mAP   frame-MABO   frame-CLASSIF
Conventional three-stream detector   67.1        85.1         64.8
Enhanced three-stream detector       68.0        87.6         65.8
Tables 3, 4 and 5 compare the frame-level and video-level metrics of the method on the UCF-Sports, J-HMDB and UCF-101 datasets with current state-of-the-art behavior detection methods.
TABLE 3 Comparison of methods on the UCF-Sports dataset
TABLE 4 Comparison of methods on the J-HMDB dataset
TABLE 5 Comparison of methods on the UCF-101 dataset
The experiments show that, on the three public datasets, the video behavior detection method based on the enhanced three-stream architecture classifies accurately, localizes precisely and runs conveniently and quickly, and it has clear advantages and application prospects among currently known video behavior detection models.
The above embodiment is merely an example and does not limit the scope of the present invention. It may be implemented in various other manners, and various omissions, substitutions and changes may be made without departing from the technical spirit of the present invention.

Claims (10)

1. A video behavior detection method based on an enhanced three-stream architecture, characterized by comprising the following steps:
a data generation step, in which an input video is acquired and the corresponding optical flow maps and human pose maps are derived from the original frames;
a behavior detection step, in which an enhanced appearance stream is built from the original frames, optical flow maps and human pose maps, the optical flow maps form a motion stream and the human pose maps form a pose stream, yielding a three-stream input; at each time step the enhanced appearance stream, the motion stream and the pose stream each perform detection and produce a result comprising class scores and anchor-cube coordinate regression values;
a three-stream fusion step, in which the class scores of the three streams are combined by extended weighted fusion to obtain a fused score, the coordinate regression values of the enhanced appearance stream are taken as the fused regression values, and the fused score is combined with the fused regression values to obtain a set of regressed behavior detection tubelets with class scores;
and a behavior tube construction step, in which the behavior detection tubelets of successive time steps are linked over time to construct the 3D behavior detection tubes of the input video.
2. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that the optical flow maps are obtained as follows:
the optical flow between every two adjacent original frames is computed with the Brox optical flow algorithm; the horizontal component, the vertical component and the magnitude of the resulting flow form the three channels of an image, and the value of each channel at every pixel is quantized to the range [0, 255], giving an optical flow map with the same resolution as the original frame.
3. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that the human pose maps are obtained as follows:
the Fast-Net network is fine-tuned on the J-HMDB behavior recognition dataset; the fine-tuned Fast-Net assigns a label to every pixel of the original frame, and each label is mapped to a preset RGB value, giving a semantic segmentation image of human body parts, i.e. a human pose map, with the same resolution as the original frame.
4. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior detection step, at each time step the enhanced appearance stream takes as input D consecutive original frames, the D × 5 consecutive optical flow maps corresponding to those frames and the D corresponding human pose maps; the motion stream takes the D × 5 consecutive optical flow maps as input; and the pose stream takes the D human pose maps as input.
5. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior detection step, the enhanced appearance stream extracts features from the original frame, optical flow maps and human pose map of each frame of the input sequence with SSD network layers; for every frame the feature maps of the three inputs are fused by convolution, the feature maps coming from the same feature layer of every frame are then stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
6. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior detection step, the motion stream and the pose stream extract features from their input image of every frame of the input sequence with SSD network layers; the feature maps coming from the same feature layer of every frame are stacked and fed to a convolutional layer for classification and a convolutional layer for regression, giving the class scores and coordinate regression values of the preset anchor cubes.
7. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, when the streams are trained, the total loss of the detection network is computed and the network weights of each stream are adjusted by stochastic gradient descent, the total loss L of the detection network being expressed as:
L = \frac{1}{N}\left(L_{conf} + L_{loc}\right)
where L_{conf} denotes the classification loss, L_{loc} denotes the localization loss and N denotes the total number of successfully matched anchor cubes.
8. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the three-stream fusion step, the extended weighted fusion of the class scores of the three streams is specifically:
S = \frac{\alpha_1 S_1 + \alpha_2 S_2 + \alpha_3 S_3}{\alpha_1 + \alpha_2 + \alpha_3}
where S, S_1, S_2 and S_3 denote, for a given anchor cube, the fused class score on class l, the class score from the enhanced appearance stream, the class score from the motion stream and the class score from the pose stream, all computed with the network parameters \Theta, and \alpha_1, \alpha_2, \alpha_3 are the weight coefficients assigned to the corresponding streams.
9. The video behavior detection method based on the enhanced three-stream architecture according to claim 8, characterized in that the ratio of the weight coefficients \alpha_1 : \alpha_2 : \alpha_3 is 3:3:1.
10. The video behavior detection method based on the enhanced three-stream architecture according to claim 1, characterized in that, in the behavior tube construction step, the behavior detection tubelets of successive time steps are linked over time with a real-time greedy algorithm to construct the 3D behavior detection tubes of the input video, specifically:
the regressed behavior detection tubelets generated at each time step are filtered by non-maximum suppression, keeping the top A tubelets;
the tube construction proceeds class by class; for each behavior class, the active behavior tubes linked up to time f-1 are sorted by tube score from high to low, and for each active tube in turn the best-matching tubelet among the A detection tubelets generated at time f is selected for linking, the matching rule being: the overlap between the candidate tubelet and the last tubelet member of the active tube is higher than τ, and its class score is the highest among the tubelets not yet linked; a linked detection tubelet cannot be linked to any other active tube; a detection tubelet left unlinked at time f becomes a new active tube at the next time step, and an active tube that fails to be linked to a detection tubelet for D-1 consecutive frames is considered finished;
after the tubelet linking finishes, a temporal smoothing strategy averages the detection boxes of different detection tubelets that fall on the same video frame, converting each linked active tube into a smooth behavior tube and giving the 3D behavior detection tubes of the input video.
CN201910764109.5A 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture Pending CN110633645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910764109.5A CN110633645A (en) 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910764109.5A CN110633645A (en) 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture

Publications (1)

Publication Number Publication Date
CN110633645A true CN110633645A (en) 2019-12-31

Family

ID=68970425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910764109.5A Pending CN110633645A (en) 2019-08-19 2019-08-19 Video behavior detection method based on enhanced three-stream architecture

Country Status (1)

Country Link
CN (1) CN110633645A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112651291A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based posture estimation method, device, medium and electronic equipment
CN112906549A (en) * 2021-02-07 2021-06-04 同济大学 Video behavior detection method based on space-time capsule network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109284667A (en) * 2018-07-26 2019-01-29 同济大学 A kind of three streaming human motion action space area detecting methods towards video
CN109377555A (en) * 2018-11-14 2019-02-22 江苏科技大学 Autonomous underwater robot prospect visual field three-dimensional reconstruction target's feature-extraction recognition methods
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN109284667A (en) * 2018-07-26 2019-01-29 同济大学 A kind of three streaming human motion action space area detecting methods towards video
CN109377555A (en) * 2018-11-14 2019-02-22 江苏科技大学 Autonomous underwater robot prospect visual field three-dimensional reconstruction target's feature-extraction recognition methods
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Karen Simonyan et al., "Two-Stream Convolutional Networks for Action Recognition in Videos", Advances in Neural Information Processing Systems *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN111931602B (en) * 2020-07-22 2023-08-08 北方工业大学 Attention mechanism-based multi-flow segmented network human body action recognition method and system
CN112651291A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based posture estimation method, device, medium and electronic equipment
CN112906549A (en) * 2021-02-07 2021-06-04 同济大学 Video behavior detection method based on space-time capsule network

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN110633645A (en) Video behavior detection method based on enhanced three-stream architecture
CN109598735A (en) Method using the target object in Markov D-chain trace and segmented image and the equipment using this method
CN111709410B (en) Behavior identification method for strong dynamic video
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN110222604B (en) Target identification method and device based on shared convolutional neural network
CN111274921A (en) Method for recognizing human body behaviors by utilizing attitude mask
KR20200075114A (en) System and Method for Matching Similarity between Image and Text
CN109670405A (en) A kind of complex background pedestrian detection method based on deep learning
CN111696136B (en) Target tracking method based on coding and decoding structure
CN107944437B (en) A kind of Face detection method based on neural network and integral image
CN111339908A (en) Group behavior identification method based on multi-mode information fusion and decision optimization
CN113361542A (en) Local feature extraction method based on deep learning
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN109670401B (en) Action recognition method based on skeletal motion diagram
CN116630608A (en) Multi-mode target detection method for complex scene
CN110751226A (en) Crowd counting model training method and device and storage medium
CN109272577A (en) A kind of vision SLAM method based on Kinect
CN115661777A (en) Semantic-combined foggy road target detection algorithm
CN109284667B (en) Three-stream type human motion behavior space domain detection method facing video
CN109740672B (en) Multi-stream feature distance fusion system and fusion method
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
Tian et al. High-speed tiny tennis ball detection based on deep convolutional neural networks
CN111738099B (en) Face automatic detection method based on video image scene understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191231