CN112800934B - Behavior recognition method and device for multi-class engineering vehicle - Google Patents

Behavior recognition method and device for multi-class engineering vehicle

Info

Publication number
CN112800934B
CN112800934B (application number CN202110098578.5A)
Authority
CN
China
Prior art keywords
frame
detection model
behavior recognition
target detection
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110098578.5A
Other languages
Chinese (zh)
Other versions
CN112800934A (en)
Inventor
汪霖
李一荻
曹世闯
汪照阳
胡莎
刘成
陈晓璇
姜博
李艳艳
周延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202110098578.5A priority Critical patent/CN112800934B/en
Publication of CN112800934A publication Critical patent/CN112800934A/en
Application granted granted Critical
Publication of CN112800934B publication Critical patent/CN112800934B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

According to the behavior recognition method and device for multi-category engineering vehicles, the video to be recognized is input into a trained target detection model, so that the trained target detection model recognizes the video to be recognized and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it encloses. The images within the prediction frame range are then input into a trained behavior recognition network in the form of continuous frames; the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of each engineering vehicle target, obtaining the category to which the behavior of each engineering vehicle target belongs. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and different behaviors of a plurality of engineering vehicles can be recognized in real time.

Description

Behavior recognition method and device for multi-class engineering vehicle
Technical Field
The invention belongs to the technical field of video image recognition, and particularly relates to a behavior recognition method and device for a multi-class engineering vehicle.
Background
In the field of video behavior recognition, existing methods are mainly divided into two categories. The first category comprises behavior recognition methods based on video frame image information, such as the two-stream method and the three-dimensional convolution method. In the two-stream method, an optical flow map and a video frame are fed into a convolutional neural network (Convolutional Neural Networks, CNN) and jointly trained to obtain the behavior category; in the three-dimensional convolution method, time-dimension information is added to the video frame sequence and three-dimensional convolution is applied directly to the sequence to obtain the behavior category. The second category comprises skeleton-based behavior recognition methods, which first estimate key nodes from RGB images and then perform behavior prediction with a recurrent neural network (Recurrent Neural Network, RNN) or a Long Short-Term Memory (LSTM) network; however, such methods are mostly suitable for scenes with a fixed skeleton, such as human behavior recognition.
In existing behavior recognition methods based on video frame image information, when a video segment is input for recognition, only one object and one action type of that object can be recognized. Skeleton-based behavior recognition methods can recognize multiple targets, but because a fixed skeleton structure must be encoded into vectors and input into the network for motion classification, such methods struggle to recognize objects whose motion varies greatly.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a behavior recognition method and device for multi-class engineering vehicles. The technical problems to be solved by the invention are addressed by the following technical scheme:
In a first aspect, the behavior recognition method for a multi-class engineering vehicle provided by the invention includes:
acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
Inputting images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behaviors of the targets of the engineering vehicles, and the categories of the behaviors of the targets of the engineering vehicles in the video to be recognized are obtained;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
Optionally, the trained target detection model is obtained by the following steps:
Step 1: acquiring original image data;
step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using a real frame;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k priori frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s×s lattices;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model with a back-propagation algorithm based on that prior frame and the confidence of the grid cell containing the center position of the object, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cutoff condition includes: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
step 10: and determining the preset target detection model with the minimum loss function as a trained target detection model.
Optionally, the step 7 includes:
inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1) based on that prior frame and the confidence of the grid cell containing the center position of the object, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
wherein b_x and b_y represent the abscissa and ordinate of the center of the prediction frame; b_w represents the width offset of the prediction frame predicted by the preset target detection model relative to the prior frame with the maximum IoU with the real frame, and b_h the corresponding height offset; p_w and p_h represent the width and height of the current prior frame; c_x and c_y represent the coordinates of the upper-left corner of the grid cell; σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of the grid cell containing it; t_w and t_h represent the width and height offsets of the prior frame predicted by the preset target detection model relative to the real frame; and σ represents the Sigmoid function, which is used to quantize the coordinate offsets into the interval (0, 1).
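For illustration only, the following is a minimal Python sketch of this decoding step; the b_w and b_h terms assume the standard YOLOv3 exponential scaling of the prior frame width and height, which is not spelled out in formula (1), and the variable names follow the formula rather than any particular implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_prediction(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert the raw offsets predicted for one prior frame into a prediction frame.

    (c_x, c_y) is the upper-left corner of the grid cell, (p_w, p_h) the prior
    frame width and height; all values are in grid units, as in formula (1).
    """
    b_x = sigmoid(t_x) + c_x      # center x, constrained to the current cell
    b_y = sigmoid(t_y) + c_y      # center y, constrained to the current cell
    b_w = p_w * math.exp(t_w)     # assumed exponential scaling of the prior width
    b_h = p_h * math.exp(t_h)     # assumed exponential scaling of the prior height
    return b_x, b_y, b_w, b_h

# example: offsets predicted for the cell whose upper-left corner is (6, 4)
print(decode_prediction(0.2, -0.1, 0.3, 0.05, c_x=6, c_y=4, p_w=3.2, p_h=1.8))
```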
Wherein the loss function is:
loss=lbox+lcls+lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λ_coord represents the weight of the position loss, S is the number of generated grid cells, and B is the number of prior frames set for each grid cell; an indicator value denotes whether the prediction frame is responsible for the object, being 1 if it is and 0 otherwise; x_i and y_i represent the coordinates of the real frame, w_i and h_i its width and height, and the corresponding predicted quantities represent the coordinates and the width and height of the prediction frame; lcls represents the class loss and λ_class its weight, the class loss being calculated with a cross-entropy loss function in which p_i(c) is 1 if the category c predicted by the prediction frame is identical to the real category and 0 otherwise, together with the probability of being predicted as category c; lobj represents the confidence loss, λ_noobj the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does, an indicator value is 1 if the prediction frame at position (i, j) does not contain an engineering vehicle target and 0 if it does, c_i represents the confidence of the prediction frame, and the corresponding predicted quantity represents the predicted confidence of the prediction frame.
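As a hedged illustration of how the three terms can be combined, the sketch below implements a simplified composite loss of the form loss = lbox + lcls + lobj in PyTorch; the λ default values, the squared-error position term and the binary cross-entropy confidence term are assumptions standing in for the exact expressions of the patent, whose formula images are not reproduced here.

```python
import torch
import torch.nn.functional as F

def yolo_like_loss(pred_box, true_box, pred_cls, true_cls, pred_conf, obj_mask,
                   lambda_coord=5.0, lambda_class=1.0, lambda_obj=1.0, lambda_noobj=0.5):
    """Simplified composite loss: loss = lbox + lcls + lobj.

    pred_box/true_box: (N, 4) tensors of (x, y, w, h) for matched prior frames;
    pred_cls: (N, C) raw class scores; true_cls: (N,) class indices;
    pred_conf: (N,) raw confidence scores; obj_mask: (N,) 1 where the prediction
    frame is responsible for a real engineering-vehicle target, else 0.
    """
    obj = obj_mask.float()
    # position loss: squared error on coordinates and width/height for responsible boxes
    lbox = lambda_coord * (obj.unsqueeze(1) * (pred_box - true_box) ** 2).sum()
    # class loss: cross entropy between predicted class scores and the real category
    lcls = lambda_class * (obj * F.cross_entropy(pred_cls, true_cls, reduction="none")).sum()
    # confidence loss: responsible boxes should predict 1, all others 0
    bce = F.binary_cross_entropy_with_logits(pred_conf, obj, reduction="none")
    lobj = (lambda_obj * obj * bce + lambda_noobj * (1 - obj) * bce).sum()
    return lbox + lcls + lobj
```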
Optionally, the trained behavior recognition network is obtained by the following steps:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain behavior categories recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: comparing the behavior type of the sample identified by the preset behavior identification network with the real behavior type of the sample for each sample, and calculating a loss function of the preset behavior identification network;
step 5: repeating the steps 2 to 4 until the preset behavior recognition network reaches a second training cut-off condition;
Wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
step 6: and determining the preset behavior recognition network reaching the second training cut-off condition as a trained behavior recognition network.
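A minimal sketch of the iterative procedure of steps 2 to 6 is given below, assuming a generic PyTorch classifier trained with cross-entropy; the data loader, optimizer settings and threshold value are placeholders rather than components prescribed by the patent.

```python
import torch
import torch.nn as nn

def train_behavior_network(network, loader, epochs=50, second_threshold=1e-3, lr=1e-3):
    """Iterate steps 2-4 until the loss no longer changes or drops below a threshold."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    previous_loss = None
    for epoch in range(epochs):
        epoch_loss = 0.0
        for clips, labels in loader:          # step 2: feed each second sample in turn
            logits = network(clips)           # recognized behavior categories
            loss = criterion(logits, labels)  # step 4: compare with the real behavior category
            optimizer.zero_grad()
            loss.backward()                   # step 3: adjust the network parameters
            optimizer.step()
            epoch_loss += loss.item()
        # step 5: second training cut-off condition
        if epoch_loss < second_threshold or (previous_loss is not None
                                             and abs(previous_loss - epoch_loss) < 1e-6):
            break
        previous_loss = epoch_loss
    return network                            # step 6: the trained behavior recognition network
```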
Optionally, the preset behavior recognition network is a temporal segment network (TSN), a TSM time shift module is connected between the residual layers of the TSN network, the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0.
Optionally, the step in which the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, includes:
the TSM time shift module of each layer divides the dimension feature maps output by the previous residual layer into 3 groups according to the time sequence of the video frames;
shifting the dimension feature maps of the first group one position to the left according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0;
and shifting the dimension feature maps of the second group one position to the right according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0.
Optionally, before inputting the prediction frame into the trained behavior recognition network in the form of continuous frames, the behavior recognition method further includes:
equally dividing the images within the prediction frame range into segments according to the image time sequence, randomly extracting one frame from each sub-segment as a key frame, and stacking all the key frames to obtain the divided image data;
and inputting the image data into the trained behavior recognition network.
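The key-frame extraction described above can be sketched as follows; the frame counts, array shapes and the handling of the last sub-segment are illustrative assumptions.

```python
import random
import numpy as np

def sample_key_frames(frames, k):
    """Divide the cropped frames into k equal sub-segments in temporal order,
    randomly draw one key frame from each sub-segment and stack the results.

    frames: list of H x W x 3 arrays cropped from the prediction frame range.
    """
    assert len(frames) >= k, "need at least k frames"
    segment_length = len(frames) // k
    key_frames = []
    for i in range(k):
        start = i * segment_length
        # the last sub-segment absorbs any remaining frames
        end = len(frames) if i == k - 1 else (i + 1) * segment_length
        key_frames.append(frames[random.randrange(start, end)])
    return np.stack(key_frames)   # (k, H, W, 3) input for the behavior recognition network

# usage: stacked = sample_key_frames(cropped_frames, k=8)
```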
Optionally, the recognition result output by the trained behavior recognition model is:
Output = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
wherein (T_1, T_2, ..., T_k) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k, w) denotes the action of a convolutional network with parameters w on frame T_k, the function F returning the scores of T_k for all categories; G is a segment consensus function that combines the category scores of the several T_k and outputs an overall category prediction; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
In a second aspect, the present invention provides a behavior recognition device for a multi-class engineering vehicle, including:
an acquisition module, which is used for acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
the detection module is used for inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the method comprises the steps that a prediction frame comprises an engineering vehicle target in a video to be recognized, the prediction frame where the engineering vehicle target is located corresponds to the position coordinate and the category of the engineering vehicle target, a trained target detection model is obtained by obtaining a first training set, the first training set comprises a plurality of first samples, the engineering vehicle target in each first sample is marked by a real frame, the first training set is clustered to obtain k priori frames, the priori frames are input into a preset target detection model, so that the preset target detection model determines the priori frame with the largest intersection ratio with the real frame, the offset between the prediction frame and the priori frame is calculated, a prediction frame comprising the target is output, and the preset target detection model is iteratively trained until a first training cut-off condition is reached;
The recognition module is used for inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behaviors of the targets of the engineering vehicles, and the categories of the behaviors of the targets of the engineering vehicles in the video to be recognized are obtained;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
According to the behavior recognition method for multi-category engineering vehicles provided by the invention, the video to be recognized is input into the trained target detection model, so that the trained target detection model recognizes the video to be recognized and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it encloses. The images within the prediction frame range are then input into the trained behavior recognition network in the form of continuous frames, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of each engineering vehicle target, obtaining the category to which the behavior of each engineering vehicle target belongs. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimension feature maps output by each layer are grouped according to the time sequence of the input images with the difference in the number of feature maps between groups kept to a minimum, shifting each group of dimension feature maps according to the serial number of its group and padding the resulting gaps in the feature vectors with 0, and iteratively training the preset behavior recognition network until the second training cut-off condition is reached. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and different behaviors of a plurality of engineering vehicles can be recognized in real time.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a flowchart of a behavior recognition method of a multi-class engineering vehicle provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the training process of the target detection model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a DarkNet53 network architecture;
FIG. 4 is a schematic diagram of the calculation of the prior frame and prediction frame offsets;
FIG. 5 is a schematic diagram of a TSN architecture;
FIG. 6 is a schematic diagram of the TSN architecture with the time shift module inserted;
fig. 7 is a block diagram of a behavior recognition device of a multi-class engineering vehicle according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
Example 1
As shown in fig. 1, the behavior recognition method of the multi-class engineering vehicle provided by the invention comprises the following steps:
s1, acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
s2, inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
The prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
s3, inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes behaviors of the targets of the engineering vehicles, and the categories of the behaviors of the targets of the engineering vehicles in the video to be recognized are obtained;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
According to the behavior recognition method for multi-category engineering vehicles provided by the invention, the video to be recognized is input into the trained target detection model, so that the trained target detection model recognizes the video to be recognized and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it encloses. The images within the prediction frame range are then input into the trained behavior recognition network in the form of continuous frames, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of each engineering vehicle target, obtaining the category to which the behavior of each engineering vehicle target belongs. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimension feature maps output by each layer are grouped according to the time sequence of the input images with the difference in the number of feature maps between groups kept to a minimum, shifting each group of dimension feature maps according to the serial number of its group and padding the resulting gaps in the feature vectors with 0, and iteratively training the preset behavior recognition network until the second training cut-off condition is reached. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and different behaviors of a plurality of engineering vehicles can be recognized in real time.
Example two
As an alternative embodiment of the present invention, the trained object detection model is obtained by:
step 1: acquiring original image data;
because engineering vehicles come in different types, such as excavators, muck trucks and bulldozers, their skeleton structures and action modes differ and they exhibit a variety of action behaviors such as bulldozing, excavating and dumping, video data containing multiple types of engineering vehicles is taken as the original data. First, a number of frames are extracted from the original video data as target detection data, divided into a training set, a testing set and a verification set, and the video frames are annotated with a marking tool. In order to prevent overfitting and improve detection accuracy, Gaussian noise is added before target detection and the data are randomly mirrored and rotated to obtain a data enhancement effect.
Step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using a real frame;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k priori frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
Step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s×s lattices;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model with a back-propagation algorithm based on that prior frame and the confidence of the grid cell containing the center position of the object, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cutoff condition includes: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
The first threshold may be preset according to practical experience.
Step 10: and determining the preset target detection model with the minimum loss function as a trained target detection model.
Wherein the loss function is:
loss=lbox+lcls+lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λ_coord represents the weight of the position loss, S is the number of generated grid cells, and B is the number of prior frames set for each grid cell; an indicator value denotes whether the prediction frame is responsible for the object, being 1 if it is and 0 otherwise; x_i and y_i represent the coordinates of the real frame, w_i and h_i its width and height, and the corresponding predicted quantities represent the coordinates and the width and height of the prediction frame; lcls represents the class loss and λ_class its weight, the class loss being calculated with a cross-entropy loss function in which p_i(c) is 1 if the category c predicted by the prediction frame is identical to the real category and 0 otherwise, together with the probability of being predicted as category c; lobj represents the confidence loss, λ_noobj the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does, an indicator value is 1 if the prediction frame at position (i, j) does not contain an engineering vehicle target and 0 if it does, c_i represents the confidence of the prediction frame, and the corresponding predicted quantity represents the predicted confidence of the prediction frame.
Referring to fig. 2, the embodiment of the present invention may use the YOLO algorithm for the target detection part, where the backbone network adopts DarkNet53 and the prior frame scales are obtained by clustering on the training set. The prior frames are the several shapes and sizes that occur most often in the training set, clustered from all of its real annotation frames; adding this statistical prior experience to the model in advance helps the model converge quickly.
The number of preselected frames is set to k, and the k most suitable prior frame scale values are obtained with a k-means clustering algorithm; the k scale values are normalized relative to the length and width of the image, so that the k frames represent the shapes of the real objects in the data set as well as possible. During clustering, the evaluation criterion is the distance d(box, centroid) = 1 - IoU(box, centroid) between two borders, i.e. the intersection-over-union (Intersection over Union, IoU) between the prior frames and the real frames is used as the standard to measure the quality of a set of preselected frames.
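A minimal sketch of this clustering step is shown below; it assumes the real frames are given as normalized (width, height) pairs and uses the shared-center IoU that is customary for prior-frame clustering, which may differ in detail from the patent's implementation.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids comparing widths/heights only, i.e. the
    frames are treated as if they shared the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, None, 0] * boxes[:, None, 1] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_prior_frames(boxes, k, iterations=100, seed=0):
    """Cluster the (width, height) pairs of all real frames with the
    d = 1 - IoU distance to obtain k prior-frame scales."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iterations):
        assignment = np.argmax(iou_wh(boxes, centroids), axis=1)   # smallest 1 - IoU
        new_centroids = np.array([boxes[assignment == j].mean(axis=0)
                                  if np.any(assignment == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids   # k prior-frame (width, height) scales, normalized like the input

# usage: anchors = kmeans_prior_frames(normalized_wh, k=9)
```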
The offsets of the prior frames from the real objects are then predicted. The enhanced video frames are resized to 416×416 and divided into S×S grid cells, prior frames with the different scales obtained by clustering are set on this grid, and the position of the object is predicted on this basis. The prior frame information (x, y, w, h) consists of the coordinates of the center position of the object and the width and height of the prior frame, all normalized by the width and height of the image. For each prior frame of each grid cell, the DarkNet53 network predicts one confidence score and c category probabilities. The confidence is expressed as Pr(Object) multiplied by the IoU between the prediction frame and the real object, where Pr(Object) indicates whether the grid cell contains the center point of a real object: if the center position coordinate of an object falls into a certain grid cell, Pr(Object) of that cell is 1, indicating that the object is detected.
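For clarity, a small sketch of how the responsible grid cell and the confidence label can be computed is given below; the normalized-coordinate convention is an assumption consistent with the description above.

```python
def responsible_cell(x_center, y_center, s):
    """Return the (row, col) of the S x S grid cell that contains the object's
    normalized center coordinates; that cell has Pr(Object) = 1 for this object."""
    col = min(int(x_center * s), s - 1)
    row = min(int(y_center * s), s - 1)
    return row, col

def target_confidence(pr_object, iou_pred_truth):
    """Confidence label: Pr(Object) times the IoU between the prediction frame
    and the real object."""
    return pr_object * iou_pred_truth

# usage: responsible_cell(0.63, 0.27, s=13) -> (3, 8); target_confidence(1, 0.72) -> 0.72
```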
As shown in fig. 3, in the YOLOv3 network structure, DarkNet53 adds upsampling at different layers and performs a channel splicing (Concat) operation so that deep and shallow feature maps are fused at the output, finally producing feature maps of three sizes: 13×13, 26×26 and 52×52. The deep feature maps are small in size and have a large receptive field, which is favourable for detecting large-scale objects, while the shallow feature maps are favourable for detecting small objects.
The target detection network is trained with the above architecture so that the loss value of the loss function decreases continuously until convergence, and the model is verified with the test set data. The network structure and parameters are continuously optimized until the output is optimal, and the final optimized model is the model responsible for the target detection part of the system. Inputting video data into this model yields the position coordinates and category information of each engineering vehicle.
Example III
As an alternative embodiment of the present invention, the step 7 includes:
inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1) based on that prior frame and the confidence of the grid cell containing the center position of the object, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
wherein b_x and b_y represent the abscissa and ordinate of the center of the prediction frame; b_w represents the width offset of the prediction frame predicted by the preset target detection model relative to the prior frame with the maximum IoU with the real frame, and b_h the corresponding height offset; p_w and p_h represent the width and height of the current prior frame; c_x and c_y represent the coordinates of the upper-left corner of the grid cell; σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of the grid cell containing it; t_w and t_h represent the width and height offsets of the prior frame predicted by the preset target detection model relative to the real frame; and σ represents the Sigmoid function, which quantizes the coordinate offsets into the interval (0, 1), so that the obtained prediction frame center coordinates b_x, b_y are limited to the current grid cell, ensuring that each cell predicts only objects whose center points lie in that cell, which is favourable for model convergence. The whole prediction process consists of inputting the prior frames into the target detection model and obtaining t_w, t_h, t_x and t_y through model calculation.
Referring to fig. 4, the video frames and prior frame information are input into the DarkNet53 network; the grid cell containing the center point of the real object is found first, then among all prior frames generated for that cell the one with the largest IoU with the real frame is selected, the offsets between the prior frame and the real frame are predicted by the network, the prediction frame is obtained from these offset values, and the model internally calculates the final output prediction frame.
Example IV
As an alternative embodiment of the present invention, the trained behavior recognition network is obtained by:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain behavior categories recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: comparing the behavior type of the sample identified by the preset behavior identification network with the real behavior type of the sample for each sample, and calculating a loss function of the preset behavior identification network;
step 5: repeating the steps 2 to 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
the second threshold is a preset value, and can be obtained according to industry experience.
Step 6: and determining the preset behavior recognition network reaching the second training cut-off condition as a trained behavior recognition network.
Example five
As an optional embodiment of the present invention, the preset behavior recognition network is a temporal segment network (TSN), a TSM time shift module is connected between the residual layers of the TSN network, the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0.
Referring to fig. 5, behavior recognition is based on a temporal segment network (Temporal Segment Networks, TSN). The video stream data first passes through the target detection model, then the position information of each type of engineering vehicle is sequentially input into the behavior recognition network in the form of bounding boxes, and the TSN architecture is adopted for key frame extraction and behavior recognition.
Example six
As an optional embodiment of the present invention, the step in which the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, includes:
the TSM time shift module of each layer divides the dimension feature maps output by the previous residual layer into 3 groups according to the time sequence of the video frames;
shifting the dimension feature maps of the first group one position to the left according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0;
and shifting the dimension feature maps of the second group one position to the right according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0.
Since behavior recognition relies on temporal modeling, a TSM (Temporal Shift Module) is added on the basis of the TSN architecture to perform temporal modeling. Each time shift module divides the batch_size×segment×channel×H×W feature map generated by the intermediate network layer into 3 groups with an equal number of channels, and simulates time-domain information by moving different groups of feature vectors left and right along the channel dimension. If the moving proportion is too large, the spatial feature modeling capability is weakened and the image information of the original frame may be damaged; if it is too small, the temporal modeling capability of the model is affected. Therefore the 3 groups of feature maps are respectively shifted left by one position, shifted right by one position and left unshifted to simulate a temporal receptive field, and the feature vector positions left empty after the shift are filled with 0. This operation moves some channels between adjacent frames along the time dimension, so inter-frame information is exchanged and time-domain information is further fused, making the model more efficient for behavior recognition.
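A minimal PyTorch-style sketch of such a shift operation is shown below; the (batch×segments, channels, H, W) tensor layout and the shift_div parameter are assumptions, while the 3-group split, the left/right shift by one segment and the zero padding follow the description above.

```python
import torch

def temporal_shift(x, n_segments, shift_div=3):
    """Shift part of the channels of a (batch*segments, C, H, W) feature map along
    the segment (time) dimension, padding the vacated positions with 0."""
    nt, c, h, w = x.size()
    x = x.view(nt // n_segments, n_segments, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                    # group 1: shift left by one segment
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # group 2: shift right by one segment
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # group 3: not shifted
    return out.view(nt, c, h, w)
```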
The 2D ConvNet in FIG. 5 employs a conventional image classification network, such as ResNet50, ResNet101 or BN-Inception; the network adopted in the present invention is ResNet50, a 50-layer stack of residual blocks. A TSM time shift module is inserted into each residual block of ResNet50 in the manner shown in fig. 6. The first layer on branch 1 of each residual structure performs the time shift operation, and the rest of the structure and the calculation of the residual block are unchanged. In this way, the original frame information is retained on branch 2 while inter-frame information is exchanged on branch 1, and each residual block fuses the two kinds of information, making the network more suitable for behavior recognition. The time-shifted residual blocks are connected to serve as the infrastructure of the behavior recognition network, and finally a fully connected layer is added for classification, so that the behaviors of multi-class targets can be recognized.
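Reusing the temporal_shift sketch above, the following simplified residual block illustrates where the shift sits relative to the two branches of fig. 6; it uses a basic two-convolution block rather than the bottleneck blocks actually found in ResNet50, so it is an architectural sketch, not the patent's network.

```python
import torch.nn as nn

class ShiftedResidualBlock(nn.Module):
    """Illustrative residual block: branch 1 applies the temporal shift before its
    first convolution, branch 2 (the identity shortcut) keeps the original frame
    information, and the output fuses the two."""

    def __init__(self, channels, n_segments):
        super().__init__()
        self.n_segments = n_segments
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        shortcut = x                                      # branch 2: original frame information
        out = temporal_shift(x, self.n_segments)          # branch 1: exchange inter-frame information
        out = self.relu(self.bn1(self.conv1(out)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + shortcut)                  # fuse the two kinds of information
```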
Example seven
As an alternative embodiment of the present invention, before inputting the prediction frame into the trained behavior recognition network in the form of continuous frames, the behavior recognition method further includes:
step 1: equally dividing images in a prediction frame range according to an image time sequence, randomly extracting one frame from each subframe section to serve as a key frame, and stacking all the key frames to obtain divided image data;
Step 2: and inputting the image data into the trained behavior recognition network.
The recognition result output by the trained behavior recognition model is as follows:
Output = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
wherein (T_1, T_2, ..., T_k) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k, w) denotes the action of a convolutional network with parameters w on frame T_k, the function F returning the scores of T_k for all categories; G is a segment consensus function that combines the category scores of the several T_k and outputs an overall category prediction; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
TSN is a behavior recognition network architecture whose core is segmentation in the time domain. Given a video V containing m objects whose behavior is to be detected, the m objects are extracted by the method of step S2 and then sequentially input into the TSN network in the form of continuous frames. Taking a certain engineering vehicle target to be tested as an example, it is divided into k segments {S_1, S_2, ..., S_k} at equal frame intervals, and the output result of behavior recognition is therefore:
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
Output = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)}
wherein (T_1, T_2, ..., T_k) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k, w) denotes the action of a convolutional network with parameters w on frame T_k, the function F returning the scores of T_k for all categories; G is a segment consensus function that combines the category scores of the several T_k and outputs an overall category prediction, generally obtained by taking the maximum of the k predicted results; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
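A small sketch of the consensus and prediction step follows; the max consensus and the per-target loop mirror the description above, while the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def tsn_prediction(frame_scores):
    """Segment consensus followed by softmax.

    frame_scores: (k, C) tensor of per-key-frame category scores F(T_k, w).
    G is taken as the maximum over the k predictions, H as the softmax over categories.
    """
    consensus, _ = frame_scores.max(dim=0)    # G: combine the k category scores
    return F.softmax(consensus, dim=0)        # H: probability of each behavior category

def multi_target_output(per_target_scores):
    """Apply the same prediction to each of the m detected engineering-vehicle targets."""
    return [tsn_prediction(scores) for scores in per_target_scores]

# usage: probs = tsn_prediction(torch.randn(8, 5))  # 8 key frames, 5 behavior categories
```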
Training is carried out with this network, the network structure and model parameters are optimized, and the results for the various tested behaviors are improved, finally yielding the behavior recognition network. Inputting the engineering vehicle targets of each class in the video frames into this network finally yields the behavior of each class of engineering vehicle target.
Example eight
As shown in fig. 7, the behavior recognition device for a multi-class engineering vehicle provided by the invention includes:
an acquisition module 71 for acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
the detection module 72 is configured to input the video to be identified into a trained target detection model, so that the trained target detection model identifies the video to be identified, and output a prediction frame;
The prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
the recognition module 73 is configured to input the images within the prediction frame range into a trained behavior recognition network in a continuous frame manner, so that the behavior recognition network performs key frame extraction on the video to be recognized and recognition on the behavior of the engineering truck target, and obtains a category to which the behavior of the engineering truck target in the video to be recognized belongs;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (8)

1. The behavior recognition method of the multi-class engineering vehicle is characterized by comprising the following steps of:
acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
inputting the images within the prediction frame range into a trained behavior recognition network as consecutive frames, so that the behavior recognition network extracts key frames from the video to be recognized and recognizes the behavior of the engineering vehicle targets, thereby obtaining the categories to which the behaviors of the engineering vehicle targets in the video to be recognized belong;
wherein the trained behavior recognition network is obtained as follows: a second training set is acquired, wherein the second training set comprises a plurality of second samples and each second sample includes the real behavior category of an engineering vehicle target; the second samples are input into a preset behavior recognition network, so that the feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible; each group of feature maps is shifted according to the serial number of the group, and the vacated positions in the feature vectors corresponding to the shifted feature maps are padded with 0; and the preset behavior recognition network is iteratively trained until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network;
the trained target detection model is obtained through the following steps:
Step 1: acquiring original image data;
step 2: dividing the original image data into a training set, a test set and a validation set;
step 3: marking the engineering vehicle targets in the training set, the test set and the validation set with real frames;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each enhanced sample into S×S grids;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model by using a back-propagation algorithm based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cut-off condition comprises: the loss function value of the preset target detection model no longer changes or falls below a first threshold;
step 10: determining a preset target detection model with the minimum loss function as a trained target detection model;
the step 7 comprises: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest IoU with the real frame, calculates the offset between the prediction frame and the prior frame by using the following formula (1), based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, and outputs the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w·e^(t_w)
b_h = p_h·e^(t_h)
wherein b_x represents the abscissa of the prediction frame, b_y represents the ordinate of the prediction frame, b_w represents the width of the prediction frame predicted by the preset target detection model, obtained from the prior frame having the largest IoU with the real frame, b_h represents the height of the prediction frame obtained from that prior frame, p_w represents the width of the current prior frame, and p_h represents the height of the current prior frame; c_x and c_y represent the upper-left corner coordinates of the grid in which the center point C of the prediction frame is located, σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of that grid, t_w and t_h represent the width and height offsets predicted by the preset target detection model relative to the prior frame, and σ denotes the Sigmoid function, which normalizes the coordinate offsets into the (0, 1) interval.
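For illustration only, the following Python sketch decodes raw network outputs according to formula (1). The exponential scaling of the prior-frame width and height (b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)) is the standard YOLO-style reading implied by the definitions above; it is an assumption, not an explicit recitation of the claim.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_prediction(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw network outputs t_* into a prediction frame, per formula (1)."""
    b_x = sigmoid(t_x) + c_x    # centre x inside the grid cell whose corner is (c_x, c_y)
    b_y = sigmoid(t_y) + c_y    # centre y
    b_w = p_w * np.exp(t_w)     # width: prior-frame width scaled by exp(t_w)  (assumed form)
    b_h = p_h * np.exp(t_h)     # height: prior-frame height scaled by exp(t_h) (assumed form)
    return b_x, b_y, b_w, b_h
```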
2. The behavior recognition method of claim 1, wherein the loss function is:
loss = lbox + lcls + lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λ_coord represents the weight of the position loss, S represents the number of generated grids, and B represents the number of prior frames of each grid; 1_ij^obj is a judgment value indicating whether the prediction frame at position (i, j) contains an object, equal to 1 if it does and 0 otherwise; x_i, y_i represent the coordinates of the real frame, w_i, h_i represent the width and height of the real frame, and x̂_i, ŷ_i, ŵ_i, ĥ_i represent the coordinates and the width and height of the prediction frame; lcls represents the class loss, λ_class represents the weight of the class loss, and the class loss is calculated by a cross-entropy loss function; p_i(c) equals 1 if the category c predicted by the prediction frame is identical to the real category and 0 otherwise, and p̂_i(c) represents the probability of being predicted as category c; lobj represents the confidence loss, λ_noobj represents the weight applied when the prediction frame does not contain an actual engineering vehicle target, λ_obj represents the weight applied when the prediction frame contains an actual engineering vehicle target, 1_ij^noobj equals 1 if the prediction frame at position (i, j) contains no engineering vehicle target and 0 otherwise, c_i represents the confidence of the prediction frame, and ĉ_i represents the predicted confidence of the prediction frame.
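A minimal NumPy sketch of the three-part loss of claim 2 is given below for illustration. The array layout, the default weight values (λ_coord = 5.0, λ_noobj = 0.5, etc.) and the squared-error form of the position and confidence terms are assumptions chosen for concreteness, not values fixed by the claim.

```python
import numpy as np

def detection_loss(pred, true, obj_mask, lambda_coord=5.0, lambda_class=1.0,
                   lambda_obj=1.0, lambda_noobj=0.5):
    """pred / true: dicts of arrays over the S*S*B prior frames:
         'xywh' -> (N, 4) box coordinates and width/height
         'conf' -> (N,)   confidence
         'cls'  -> (N, C) class probabilities (true['cls'] is one-hot)
       obj_mask: (N,) 1 where a prior frame is responsible for a real target, else 0."""
    noobj_mask = 1.0 - obj_mask

    # lbox: position loss over prior frames that contain a target
    lbox = lambda_coord * np.sum(obj_mask[:, None] * (pred['xywh'] - true['xywh']) ** 2)

    # lcls: cross-entropy class loss over prior frames that contain a target
    lcls = -lambda_class * np.sum(obj_mask[:, None] * true['cls'] * np.log(pred['cls'] + 1e-9))

    # lobj: confidence loss, weighted differently for object / no-object frames
    conf_err = (pred['conf'] - true['conf']) ** 2
    lobj = lambda_obj * np.sum(obj_mask * conf_err) + lambda_noobj * np.sum(noobj_mask * conf_err)

    return lbox + lcls + lobj
```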
3. The behavior recognition method of claim 1, wherein the trained behavior recognition network is obtained by:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain behavior categories recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: comparing the behavior type of the sample identified by the preset behavior identification network with the real behavior type of the sample for each sample, and calculating a loss function of the preset behavior identification network;
Step 5: repeating the steps 2 to 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cut-off condition comprises: the loss function value of the preset behavior recognition network no longer changes or falls below a second threshold;
step 6: and determining the preset behavior recognition network reaching the second training cut-off condition as a trained behavior recognition network.
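For illustration, a training loop implementing the second training cut-off condition (loss no longer changing, or below a threshold) might look as follows; the run_epoch function and the patience parameter are hypothetical.

```python
def train_until_cutoff(network, samples, run_epoch, threshold, patience=5):
    """run_epoch(network, samples) is a hypothetical function that performs one
    training pass and returns the epoch loss; training stops when the loss falls
    below `threshold` or stops changing over `patience` epochs."""
    history = []
    while True:
        loss = run_epoch(network, samples)
        history.append(loss)
        if loss < threshold:
            break
        if len(history) > patience and abs(history[-1] - history[-1 - patience]) < 1e-6:
            break  # loss value no longer changes
    return network
```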
4. The behavior recognition method according to claim 3, wherein the preset behavior recognition network is a temporal segment network (TSN)-based network, a TSM temporal shift module is connected between the residual layers of the TSN, the TSM temporal shift module of each layer shifts the feature maps output by the previous residual layer according to the serial number of their group, and the vacated positions in the feature vectors corresponding to the shifted feature maps are padded with 0.
5. The behavior recognition method according to claim 4, wherein shifting, by the TSM temporal shift module of each layer, the feature maps output by the previous residual layer according to the serial number of the group, and padding the vacated positions in the feature vectors corresponding to the shifted feature maps with 0, comprises:
the TSM temporal shift module of each layer divides the feature maps output by the previous residual layer into 3 groups according to the time sequence of the video frames;
shifting the feature maps of the first group one position to the left according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0;
and shifting the feature maps of the second group one position to the right according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0.
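The temporal shift of claims 4 and 5 can be illustrated with the following NumPy sketch. Splitting the feature maps into three near-equal groups, shifting the first group one frame earlier and the second one frame later, and zero-padding the vacated positions is one plausible reading of the claim; the exact grouping rule is an assumption.

```python
import numpy as np

def temporal_shift(features):
    """features: array of shape (T, C, H, W) holding feature maps for T frames.
    Split the maps into 3 near-equal groups, shift group 1 one frame earlier,
    group 2 one frame later, leave group 3 in place; vacated slots stay zero."""
    T, C, H, W = features.shape
    g = C // 3
    out = np.zeros_like(features)
    out[:-1, :g] = features[1:, :g]            # group 1: shifted left along time
    out[1:, g:2 * g] = features[:-1, g:2 * g]  # group 2: shifted right along time
    out[:, 2 * g:] = features[:, 2 * g:]       # group 3: unshifted
    return out
```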
6. The behavior recognition method of claim 1, wherein before inputting the images within the prediction frame range into the trained behavior recognition network as consecutive frames, the behavior recognition method further comprises:
equally dividing the images within the prediction frame range into segments according to the image time sequence, randomly extracting one frame from each sub-segment to serve as a key frame, and stacking all the key frames to obtain the divided image data;
and inputting the image data into the trained behavior recognition network.
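An illustrative sketch of the key-frame sampling of claim 6 (equal division into sub-segments and random extraction of one frame per sub-segment) is given below; it assumes the number of cropped frames is at least the number of segments.

```python
import random

def sample_key_frames(frames, num_segments):
    """Divide the cropped frames into num_segments equal sub-segments and draw
    one random key frame from each sub-segment."""
    assert len(frames) >= num_segments, "need at least one frame per sub-segment"
    seg_len = len(frames) // num_segments
    key_frames = []
    for k in range(num_segments):
        start = k * seg_len
        key_frames.append(frames[random.randrange(start, start + seg_len)])
    return key_frames  # stacked key frames form the input to the recognition network
```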
7. The behavior recognition method according to claim 6, wherein the recognition result output by the trained behavior recognition network is:
TSN(T_1, T_2, ..., T_K) = H(G(F(T_1; W), F(T_2; W), ..., F(T_K; W)))
wherein (T_1, T_2, ..., T_K) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k; W) denotes the action, on the frame T_k, of a convolutional network with W as its parameters, the function F returning the scores of T_k relative to all categories; G is a segment consensus function that combines the category scores of the multiple T_k and outputs an overall category prediction value; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
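The segment consensus of claim 7 can be illustrated as follows. The per-frame scoring function F is a hypothetical stand-in for the convolutional network, and averaging is assumed for the consensus function G, which the claim does not fix.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tsn_predict(key_frames, F, W):
    """F(T_k, W) is a hypothetical per-frame scoring function (e.g. a ConvNet with
    parameters W) returning the scores of frame T_k relative to all categories."""
    scores = np.stack([F(t_k, W) for t_k in key_frames])  # shape (K, num_classes)
    consensus = scores.mean(axis=0)   # G: segment consensus, assumed to be averaging
    return softmax(consensus)         # H: softmax over the consensus scores
```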
8. A behavior recognition device for a multi-class engineering vehicle, comprising:
the method comprises the steps of obtaining a model, wherein the model is used for obtaining a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
the detection module is used for inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the method comprises the steps that a prediction frame comprises an engineering vehicle target in a video to be recognized, the prediction frame where the engineering vehicle target is located corresponds to the position coordinate and the category of the engineering vehicle target, a trained target detection model is obtained by obtaining a first training set, the first training set comprises a plurality of first samples, the engineering vehicle target in each first sample is marked by a real frame, the first training set is clustered to obtain k priori frames, the priori frames are input into a preset target detection model, so that the preset target detection model determines the priori frame with the largest intersection ratio with the real frame, the offset between the prediction frame and the priori frame is calculated, a prediction frame comprising the target is output, and the preset target detection model is iteratively trained until a first training cut-off condition is reached;
a recognition module, which is used for inputting the images within the prediction frame range into a trained behavior recognition network as consecutive frames, so that the behavior recognition network extracts key frames from the video to be recognized and recognizes the behavior of the engineering vehicle targets, obtaining the categories to which the behaviors of the engineering vehicle targets in the video to be recognized belong;
wherein the trained behavior recognition network is obtained as follows: a second training set is acquired, wherein the second training set comprises a plurality of second samples and each second sample includes the real behavior category of an engineering vehicle target; the second samples are input into a preset behavior recognition network, so that the feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible; each group of feature maps is shifted according to the serial number of the group, and the vacated positions in the feature vectors corresponding to the shifted feature maps are padded with 0; and the preset behavior recognition network is iteratively trained until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network;
the trained target detection model is obtained through the following steps:
Step 1: acquiring original image data;
step 2: dividing the original image data into a training set, a test set and a validation set;
step 3: marking the engineering vehicle targets in the training set, the test set and the validation set with real frames;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each enhanced sample into S×S grids;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model by using a back-propagation algorithm based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cut-off condition comprises: the loss function value of the preset target detection model no longer changes or falls below a first threshold;
step 10: determining a preset target detection model with the minimum loss function as a trained target detection model;
the step 7 comprises: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest IoU with the real frame, calculates the offset between the prediction frame and the prior frame by using the following formula (1), based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, and outputs the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w·e^(t_w)
b_h = p_h·e^(t_h)
wherein b_x represents the abscissa of the prediction frame, b_y represents the ordinate of the prediction frame, b_w represents the width of the prediction frame predicted by the preset target detection model, obtained from the prior frame having the largest IoU with the real frame, b_h represents the height of the prediction frame obtained from that prior frame, p_w represents the width of the current prior frame, and p_h represents the height of the current prior frame; c_x and c_y represent the upper-left corner coordinates of the grid in which the center point C of the prediction frame is located, σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of that grid, t_w and t_h represent the width and height offsets predicted by the preset target detection model relative to the prior frame, and σ denotes the Sigmoid function, which normalizes the coordinate offsets into the (0, 1) interval.
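For illustration, the k-means clustering of prior-frame scales recited in step 4 of claims 1 and 8 might be sketched as below; assignment to the anchor of highest IoU (equivalently, a 1 − IoU distance) is a common choice and is assumed here, not recited in the claims.

```python
import numpy as np

def kmeans_prior_frames(boxes_wh, k, iters=100, seed=0):
    """boxes_wh: (N, 2) array of real-frame (width, height) pairs.
    Returns k prior-frame (width, height) scales."""
    rng = np.random.default_rng(seed)
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every anchor, treating them as co-centred
        inter = np.minimum(boxes_wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(boxes_wh[:, None, 1], anchors[None, :, 1])
        union = (boxes_wh[:, 0] * boxes_wh[:, 1])[:, None] + \
                (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)  # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes_wh[assign == j].mean(axis=0)
    return anchors
```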
CN202110098578.5A 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle Active CN112800934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098578.5A CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle


Publications (2)

Publication Number Publication Date
CN112800934A CN112800934A (en) 2021-05-14
CN112800934B true CN112800934B (en) 2023-08-08

Family

ID=75811658


Country Status (1)

Country Link
CN (1) CN112800934B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361519B (en) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113255616B (en) * 2021-07-07 2021-09-21 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114419508A (en) * 2022-01-19 2022-04-29 北京百度网讯科技有限公司 Recognition method, training method, device, equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
WO2020206861A1 (en) * 2019-04-08 2020-10-15 江西理工大学 Yolo v3-based detection method for key object at transportation junction
CN111950583A (en) * 2020-06-05 2020-11-17 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM clustering
CN112084890A (en) * 2020-08-21 2020-12-15 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM and CQFL


Non-Patent Citations (1)

Title
Multi-type cooperative target detection with an improved YOLOv2 convolutional neural network; Wang Jianlin; Fu Xuesong; Huang Zhanchao; Guo Yongqi; Wang Rutong; Zhao Liqiang; Optics and Precision Engineering (01); full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant