CN112800934B - Behavior recognition method and device for multi-class engineering vehicle - Google Patents

Behavior recognition method and device for multi-class engineering vehicle

Info

Publication number
CN112800934B
CN112800934B (application number CN202110098578.5A)
Authority
CN
China
Prior art keywords
frame
detection model
behavior recognition
target detection
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110098578.5A
Other languages
Chinese (zh)
Other versions
CN112800934A (en)
Inventor
汪霖
李一荻
曹世闯
汪照阳
胡莎
刘成
陈晓璇
姜博
李艳艳
周延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202110098578.5A priority Critical patent/CN112800934B/en
Publication of CN112800934A publication Critical patent/CN112800934A/en
Application granted granted Critical
Publication of CN112800934B publication Critical patent/CN112800934B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

According to the behavior recognition method and device for multi-category engineering vehicles, the video to be recognized is input into a trained target detection model, so that the trained target detection model recognizes the video to be recognized and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it encloses. The images within the prediction frame range are then input into a trained behavior recognition network in the form of continuous frames; the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of each engineering vehicle target, obtaining the category to which the behavior of each engineering vehicle target belongs. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and different behaviors of a plurality of engineering vehicles can be recognized in real time.

Description

Behavior recognition method and device for multi-class engineering vehicle
Technical Field
The invention belongs to the technical field of video image recognition, and particularly relates to a behavior recognition method and device for a multi-class engineering vehicle.
Background
In the field of video behavior recognition, existing methods are mainly divided into two categories. The first category comprises behavior recognition methods based on video frame image information, such as the two-stream method and the three-dimensional convolution method. In the two-stream method, an optical flow map and a video frame are fed into a convolutional neural network (Convolutional Neural Networks, CNN) and jointly trained to obtain the behavior category; in the three-dimensional convolution method, time-dimension information is added to the video frame sequence and three-dimensional convolution is applied directly to the sequence to obtain the behavior category. The second category comprises skeleton-based behavior recognition methods, which first estimate key nodes from RGB images and then perform behavior prediction with a recurrent neural network (Recurrent Neural Network, RNN) or a Long Short-Term Memory (LSTM) network; however, such methods are mostly suitable for scenes with a fixed skeleton, such as human behavior recognition.
In existing behavior recognition methods based on video frame image information, when a video segment is input for recognition, only one object and one action type of that object can be recognized. Skeleton-based behavior recognition methods can recognize multiple targets, but because a fixed skeleton structure must be encoded into vectors and input into the network for motion classification, such methods struggle to recognize objects whose motion varies greatly.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a behavior recognition method and device for multi-class engineering vehicles. The technical problems to be solved by the invention are addressed by the following technical scheme:
In a first aspect, the behavior recognition method for a multi-class engineering vehicle provided by the invention includes:
acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
Inputting images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behaviors of the targets of the engineering vehicles, and the categories of the behaviors of the targets of the engineering vehicles in the video to be recognized are obtained;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
Optionally, the trained target detection model is obtained by the following steps:
Step 1: acquiring original image data;
step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using a real frame;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k priori frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s×s lattices;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model with a back-propagation algorithm based on that prior frame and the confidence of the grid cell containing the center position of the object, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cutoff condition includes: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
step 10: and determining the preset target detection model with the minimum loss function as a trained target detection model.
Optionally, the step 7 includes:
inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1) based on that prior frame and the confidence of the grid cell containing the center position of the object, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
wherein b_x and b_y represent the abscissa and ordinate of the center of the prediction frame; b_w represents the width offset of the prediction frame predicted by the preset target detection model relative to the prior frame with the maximum IoU with the real frame, and b_h the corresponding height offset; p_w and p_h represent the width and height of the current prior frame; c_x and c_y represent the coordinates of the upper-left corner of the grid cell; σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of the grid cell containing it; t_w and t_h represent the width and height offsets of the prior frame predicted by the preset target detection model relative to the real frame; and σ represents the Sigmoid function, which is used to quantize the coordinate offsets into the interval (0, 1).
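For illustration only, the following is a minimal Python sketch of this decoding step; the b_w and b_h terms assume the standard YOLOv3 exponential scaling of the prior frame width and height, which is not spelled out in formula (1), and the variable names follow the formula rather than any particular implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_prediction(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert the raw offsets predicted for one prior frame into a prediction frame.

    (c_x, c_y) is the upper-left corner of the grid cell, (p_w, p_h) the prior
    frame width and height; all values are in grid units, as in formula (1).
    """
    b_x = sigmoid(t_x) + c_x      # center x, constrained to the current cell
    b_y = sigmoid(t_y) + c_y      # center y, constrained to the current cell
    b_w = p_w * math.exp(t_w)     # assumed exponential scaling of the prior width
    b_h = p_h * math.exp(t_h)     # assumed exponential scaling of the prior height
    return b_x, b_y, b_w, b_h

# example: offsets predicted for the cell whose upper-left corner is (6, 4)
print(decode_prediction(0.2, -0.1, 0.3, 0.05, c_x=6, c_y=4, p_w=3.2, p_h=1.8))
```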
Wherein the loss function is:
loss=lbox+lcls+lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λ_coord represents the weight of the position loss, S is the number of generated grid cells, and B is the number of prior frames set for each grid cell; an indicator value denotes whether the prediction frame is responsible for the object, being 1 if it is and 0 otherwise; x_i and y_i represent the coordinates of the real frame, w_i and h_i its width and height, and the corresponding predicted quantities represent the coordinates and the width and height of the prediction frame; lcls represents the class loss and λ_class its weight, the class loss being calculated with a cross-entropy loss function in which p_i(c) is 1 if the category c predicted by the prediction frame is identical to the real category and 0 otherwise, together with the probability of being predicted as category c; lobj represents the confidence loss, λ_noobj the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does, an indicator value is 1 if the prediction frame at position (i, j) does not contain an engineering vehicle target and 0 if it does, c_i represents the confidence of the prediction frame, and the corresponding predicted quantity represents the predicted confidence of the prediction frame.
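As a hedged illustration of how the three terms can be combined, the sketch below implements a simplified composite loss of the form loss = lbox + lcls + lobj in PyTorch; the λ default values, the squared-error position term and the binary cross-entropy confidence term are assumptions standing in for the exact expressions of the patent, whose formula images are not reproduced here.

```python
import torch
import torch.nn.functional as F

def yolo_like_loss(pred_box, true_box, pred_cls, true_cls, pred_conf, obj_mask,
                   lambda_coord=5.0, lambda_class=1.0, lambda_obj=1.0, lambda_noobj=0.5):
    """Simplified composite loss: loss = lbox + lcls + lobj.

    pred_box/true_box: (N, 4) tensors of (x, y, w, h) for matched prior frames;
    pred_cls: (N, C) raw class scores; true_cls: (N,) class indices;
    pred_conf: (N,) raw confidence scores; obj_mask: (N,) 1 where the prediction
    frame is responsible for a real engineering-vehicle target, else 0.
    """
    obj = obj_mask.float()
    # position loss: squared error on coordinates and width/height for responsible boxes
    lbox = lambda_coord * (obj.unsqueeze(1) * (pred_box - true_box) ** 2).sum()
    # class loss: cross entropy between predicted class scores and the real category
    lcls = lambda_class * (obj * F.cross_entropy(pred_cls, true_cls, reduction="none")).sum()
    # confidence loss: responsible boxes should predict 1, all others 0
    bce = F.binary_cross_entropy_with_logits(pred_conf, obj, reduction="none")
    lobj = (lambda_obj * obj * bce + lambda_noobj * (1 - obj) * bce).sum()
    return lbox + lcls + lobj
```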
Optionally, the trained behavior recognition network is obtained by the following steps:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain behavior categories recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: comparing the behavior type of the sample identified by the preset behavior identification network with the real behavior type of the sample for each sample, and calculating a loss function of the preset behavior identification network;
step 5: repeating the steps 2 to 4 until the preset behavior recognition network reaches a second training cut-off condition;
Wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
step 6: and determining the preset behavior recognition network reaching the second training cut-off condition as a trained behavior recognition network.
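A minimal sketch of the iterative procedure of steps 2 to 6 is given below, assuming a generic PyTorch classifier trained with cross-entropy; the data loader, optimizer settings and threshold value are placeholders rather than components prescribed by the patent.

```python
import torch
import torch.nn as nn

def train_behavior_network(network, loader, epochs=50, second_threshold=1e-3, lr=1e-3):
    """Iterate steps 2-4 until the loss no longer changes or drops below a threshold."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    previous_loss = None
    for epoch in range(epochs):
        epoch_loss = 0.0
        for clips, labels in loader:          # step 2: feed each second sample in turn
            logits = network(clips)           # recognized behavior categories
            loss = criterion(logits, labels)  # step 4: compare with the real behavior category
            optimizer.zero_grad()
            loss.backward()                   # step 3: adjust the network parameters
            optimizer.step()
            epoch_loss += loss.item()
        # step 5: second training cut-off condition
        if epoch_loss < second_threshold or (previous_loss is not None
                                             and abs(previous_loss - epoch_loss) < 1e-6):
            break
        previous_loss = epoch_loss
    return network                            # step 6: the trained behavior recognition network
```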
Optionally, the preset behavior recognition network is a temporal segment network (TSN), a TSM time shift module is connected between the residual layers of the TSN network, the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0.
Optionally, the step in which the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, includes:
the TSM time shift module of each layer divides the dimension feature maps output by the previous residual layer into 3 groups according to the time sequence of the video frames;
shifting the dimension feature maps of the first group one position to the left according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0;
and shifting the dimension feature maps of the second group one position to the right according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0.
Optionally, before inputting the prediction frame into the trained behavior recognition network in the form of continuous frames, the behavior recognition method further includes:
equally dividing the images within the prediction frame range into segments according to the image time sequence, randomly extracting one frame from each sub-segment as a key frame, and stacking all the key frames to obtain the divided image data;
and inputting the image data into the trained behavior recognition network.
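The key-frame extraction described above can be sketched as follows; the frame counts, array shapes and the handling of the last sub-segment are illustrative assumptions.

```python
import random
import numpy as np

def sample_key_frames(frames, k):
    """Divide the cropped frames into k equal sub-segments in temporal order,
    randomly draw one key frame from each sub-segment and stack the results.

    frames: list of H x W x 3 arrays cropped from the prediction frame range.
    """
    assert len(frames) >= k, "need at least k frames"
    segment_length = len(frames) // k
    key_frames = []
    for i in range(k):
        start = i * segment_length
        # the last sub-segment absorbs any remaining frames
        end = len(frames) if i == k - 1 else (i + 1) * segment_length
        key_frames.append(frames[random.randrange(start, end)])
    return np.stack(key_frames)   # (k, H, W, 3) input for the behavior recognition network

# usage: stacked = sample_key_frames(cropped_frames, k=8)
```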
Optionally, the recognition result output by the trained behavior recognition model is:
Output = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
wherein (T_1, T_2, ..., T_k) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k, w) denotes the action of a convolutional network with parameters w on frame T_k, the function F returning the scores of T_k for all categories; G is a segment consensus function that combines the category scores of the several T_k and outputs an overall category prediction; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
In a second aspect, the present invention provides a behavior recognition device for a multi-class engineering vehicle, including:
an acquisition module, which is used for acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
the detection module is used for inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the method comprises the steps that a prediction frame comprises an engineering vehicle target in a video to be recognized, the prediction frame where the engineering vehicle target is located corresponds to the position coordinate and the category of the engineering vehicle target, a trained target detection model is obtained by obtaining a first training set, the first training set comprises a plurality of first samples, the engineering vehicle target in each first sample is marked by a real frame, the first training set is clustered to obtain k priori frames, the priori frames are input into a preset target detection model, so that the preset target detection model determines the priori frame with the largest intersection ratio with the real frame, the offset between the prediction frame and the priori frame is calculated, a prediction frame comprising the target is output, and the preset target detection model is iteratively trained until a first training cut-off condition is reached;
The recognition module is used for inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behaviors of the targets of the engineering vehicles, and the categories of the behaviors of the targets of the engineering vehicles in the video to be recognized are obtained;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
According to the behavior recognition method for multi-category engineering vehicles provided by the invention, the video to be recognized is input into the trained target detection model, so that the trained target detection model recognizes the video to be recognized and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it encloses. The images within the prediction frame range are then input into the trained behavior recognition network in the form of continuous frames, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of each engineering vehicle target, obtaining the category to which the behavior of each engineering vehicle target belongs. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimension feature maps output by each layer are grouped according to the time sequence of the input images with the difference in the number of feature maps between groups kept to a minimum, shifting each group of dimension feature maps according to the serial number of its group and padding the resulting gaps in the feature vectors with 0, and iteratively training the preset behavior recognition network until the second training cut-off condition is reached. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and different behaviors of a plurality of engineering vehicles can be recognized in real time.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a flowchart of a behavior recognition method of a multi-class engineering vehicle provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the training process of the target detection model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a DarkNet53 network architecture;
FIG. 4 is a schematic diagram of the calculation of the prior frame and prediction frame offsets;
FIG. 5 is a schematic diagram of a TSN architecture;
FIG. 6 is a schematic diagram of the TSN architecture with the time shift module inserted;
fig. 7 is a block diagram of a behavior recognition device of a multi-class engineering vehicle according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
Example 1
As shown in fig. 1, the behavior recognition method of the multi-class engineering vehicle provided by the invention comprises the following steps:
s1, acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
s2, inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
The prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
s3, inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes behaviors of the targets of the engineering vehicles, and the categories of the behaviors of the targets of the engineering vehicles in the video to be recognized are obtained;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
According to the behavior recognition method for multi-category engineering vehicles provided by the invention, the video to be recognized is input into the trained target detection model, so that the trained target detection model recognizes the video to be recognized and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it encloses. The images within the prediction frame range are then input into the trained behavior recognition network in the form of continuous frames, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of each engineering vehicle target, obtaining the category to which the behavior of each engineering vehicle target belongs. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimension feature maps output by each layer are grouped according to the time sequence of the input images with the difference in the number of feature maps between groups kept to a minimum, shifting each group of dimension feature maps according to the serial number of its group and padding the resulting gaps in the feature vectors with 0, and iteratively training the preset behavior recognition network until the second training cut-off condition is reached. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and different behaviors of a plurality of engineering vehicles can be recognized in real time.
Example two
As an alternative embodiment of the present invention, the trained object detection model is obtained by:
step 1: acquiring original image data;
because engineering vehicles come in different types, such as excavators, muck trucks and bulldozers, their skeleton structures and action modes differ and they exhibit a variety of action behaviors such as bulldozing, excavating and dumping, video data containing multiple types of engineering vehicles is taken as the original data. First, a number of frames are extracted from the original video data as target detection data, divided into a training set, a testing set and a verification set, and the video frames are annotated with a marking tool. In order to prevent overfitting and improve detection accuracy, Gaussian noise is added before target detection and the data are randomly mirrored and rotated to obtain a data enhancement effect.
Step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using a real frame;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k priori frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
Step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s×s lattices;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model with a back-propagation algorithm based on that prior frame and the confidence of the grid cell containing the center position of the object, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cutoff condition includes: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
The first threshold may be preset according to practical experience.
Step 10: and determining the preset target detection model with the minimum loss function as a trained target detection model.
Wherein the loss function is:
loss=lbox+lcls+lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λ_coord represents the weight of the position loss, S is the number of generated grid cells, and B is the number of prior frames set for each grid cell; an indicator value denotes whether the prediction frame is responsible for the object, being 1 if it is and 0 otherwise; x_i and y_i represent the coordinates of the real frame, w_i and h_i its width and height, and the corresponding predicted quantities represent the coordinates and the width and height of the prediction frame; lcls represents the class loss and λ_class its weight, the class loss being calculated with a cross-entropy loss function in which p_i(c) is 1 if the category c predicted by the prediction frame is identical to the real category and 0 otherwise, together with the probability of being predicted as category c; lobj represents the confidence loss, λ_noobj the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does, an indicator value is 1 if the prediction frame at position (i, j) does not contain an engineering vehicle target and 0 if it does, c_i represents the confidence of the prediction frame, and the corresponding predicted quantity represents the predicted confidence of the prediction frame.
Referring to fig. 2, the embodiment of the present invention may use the YOLO algorithm for the target detection part, where the backbone network adopts DarkNet53 and the prior frame scales are obtained by clustering on the training set. The prior frames are the several shapes and sizes that occur most often in the training set, clustered from all of its real annotation frames; adding this statistical prior experience to the model in advance helps the model converge quickly.
The number of preselected frames is set to k, and the k most suitable prior frame scale values are obtained with a k-means clustering algorithm; the k scale values are normalized relative to the length and width of the image, so that the k frames represent the shapes of the real objects in the data set as well as possible. During clustering, the evaluation criterion is the distance d(box, centroid) = 1 - IoU(box, centroid) between two borders, i.e. the intersection-over-union (Intersection over Union, IoU) between the prior frames and the real frames is used as the standard to measure the quality of a set of preselected frames.
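A minimal sketch of this clustering step is shown below; it assumes the real frames are given as normalized (width, height) pairs and uses the shared-center IoU that is customary for prior-frame clustering, which may differ in detail from the patent's implementation.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids comparing widths/heights only, i.e. the
    frames are treated as if they shared the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, None, 0] * boxes[:, None, 1] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_prior_frames(boxes, k, iterations=100, seed=0):
    """Cluster the (width, height) pairs of all real frames with the
    d = 1 - IoU distance to obtain k prior-frame scales."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iterations):
        assignment = np.argmax(iou_wh(boxes, centroids), axis=1)   # smallest 1 - IoU
        new_centroids = np.array([boxes[assignment == j].mean(axis=0)
                                  if np.any(assignment == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids   # k prior-frame (width, height) scales, normalized like the input

# usage: anchors = kmeans_prior_frames(normalized_wh, k=9)
```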
The offsets of the prior frames from the real objects are then predicted. The enhanced video frames are resized to 416×416 and divided into S×S grid cells, prior frames with the different scales obtained by clustering are set on this grid, and the position of the object is predicted on this basis. The prior frame information (x, y, w, h) consists of the coordinates of the center position of the object and the width and height of the prior frame, all normalized by the width and height of the image. For each prior frame of each grid cell, the DarkNet53 network predicts one confidence score and c category probabilities. The confidence is expressed as Pr(Object) multiplied by the IoU between the prediction frame and the real object, where Pr(Object) indicates whether the grid cell contains the center point of a real object: if the center position coordinate of an object falls into a certain grid cell, Pr(Object) of that cell is 1, indicating that the object is detected.
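For clarity, a small sketch of how the responsible grid cell and the confidence label can be computed is given below; the normalized-coordinate convention is an assumption consistent with the description above.

```python
def responsible_cell(x_center, y_center, s):
    """Return the (row, col) of the S x S grid cell that contains the object's
    normalized center coordinates; that cell has Pr(Object) = 1 for this object."""
    col = min(int(x_center * s), s - 1)
    row = min(int(y_center * s), s - 1)
    return row, col

def target_confidence(pr_object, iou_pred_truth):
    """Confidence label: Pr(Object) times the IoU between the prediction frame
    and the real object."""
    return pr_object * iou_pred_truth

# usage: responsible_cell(0.63, 0.27, s=13) -> (3, 8); target_confidence(1, 0.72) -> 0.72
```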
As shown in fig. 3, in the YOLOv3 network structure, DarkNet53 adds upsampling at different layers and performs a channel splicing (Concat) operation so that deep and shallow feature maps are fused at the output, finally producing feature maps of three sizes: 13×13, 26×26 and 52×52. The deep feature maps are small in size and have a large receptive field, which is favourable for detecting large-scale objects, while the shallow feature maps are favourable for detecting small objects.
The target detection network is trained with the above architecture so that the loss value of the loss function decreases continuously until convergence, and the model is verified with the test set data. The network structure and parameters are continuously optimized until the output is optimal, and the final optimized model is the model responsible for the target detection part of the system. Inputting video data into this model yields the position coordinates and category information of each engineering vehicle.
Example III
As an alternative embodiment of the present invention, the step 7 includes:
inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame with the maximum intersection-over-union (IoU) with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1) based on that prior frame and the confidence of the grid cell containing the center position of the object, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
wherein b_x and b_y represent the abscissa and ordinate of the center of the prediction frame; b_w represents the width offset of the prediction frame predicted by the preset target detection model relative to the prior frame with the maximum IoU with the real frame, and b_h the corresponding height offset; p_w and p_h represent the width and height of the current prior frame; c_x and c_y represent the coordinates of the upper-left corner of the grid cell; σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of the grid cell containing it; t_w and t_h represent the width and height offsets of the prior frame predicted by the preset target detection model relative to the real frame; and σ represents the Sigmoid function, which quantizes the coordinate offsets into the interval (0, 1), so that the obtained prediction frame center coordinates b_x, b_y are limited to the current grid cell, ensuring that each cell predicts only objects whose center points lie in that cell, which is favourable for model convergence. The whole prediction process consists of inputting the prior frames into the target detection model and obtaining t_w, t_h, t_x and t_y through model calculation.
Referring to fig. 4, the video frames and prior frame information are input into the DarkNet53 network; the grid cell containing the center point of the real object is found first, then among all prior frames generated for that cell the one with the largest IoU with the real frame is selected, the offsets between the prior frame and the real frame are predicted by the network, the prediction frame is obtained from these offset values, and the model internally calculates the final output prediction frame.
Example IV
As an alternative embodiment of the present invention, the trained behavior recognition network is obtained by:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain behavior categories recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: comparing the behavior type of the sample identified by the preset behavior identification network with the real behavior type of the sample for each sample, and calculating a loss function of the preset behavior identification network;
step 5: repeating the steps 2 to 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
the second threshold is a preset value, and can be obtained according to industry experience.
Step 6: and determining the preset behavior recognition network reaching the second training cut-off condition as a trained behavior recognition network.
Example five
As an optional embodiment of the present invention, the preset behavior recognition network is a temporal segment network (TSN), a TSM time shift module is connected between the residual layers of the TSN network, the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0.
Referring to fig. 5, behavior recognition is based on a temporal segment network (Temporal Segment Networks, TSN). The video stream data first passes through the target detection model, then the position information of each type of engineering vehicle is sequentially input into the behavior recognition network in the form of bounding boxes, and the TSN architecture is adopted for key frame extraction and behavior recognition.
Example six
As an optional embodiment of the present invention, the step in which the TSM time shift module of each layer shifts the dimension feature maps output by the previous residual layer according to the serial number of their group, and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, includes:
the TSM time shift module of each layer divides the dimension feature maps output by the previous residual layer into 3 groups according to the time sequence of the video frames;
shifting the dimension feature maps of the first group one position to the left according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0;
and shifting the dimension feature maps of the second group one position to the right according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0.
Since behavior recognition relies on temporal modeling, a TSM (Temporal Shift Module) is added on the basis of the TSN architecture to perform temporal modeling. Each time shift module divides the batch_size×segment×channel×H×W feature map generated by the intermediate network layer into 3 groups with an equal number of channels, and simulates time-domain information by moving different groups of feature vectors left and right along the channel dimension. If the moving proportion is too large, the spatial feature modeling capability is weakened and the image information of the original frame may be damaged; if it is too small, the temporal modeling capability of the model is affected. Therefore the 3 groups of feature maps are respectively shifted left by one position, shifted right by one position and left unshifted to simulate a temporal receptive field, and the feature vector positions left empty after the shift are filled with 0. This operation moves some channels between adjacent frames along the time dimension, so inter-frame information is exchanged and time-domain information is further fused, making the model more efficient for behavior recognition.
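A minimal PyTorch-style sketch of such a shift operation is shown below; the (batch×segments, channels, H, W) tensor layout and the shift_div parameter are assumptions, while the 3-group split, the left/right shift by one segment and the zero padding follow the description above.

```python
import torch

def temporal_shift(x, n_segments, shift_div=3):
    """Shift part of the channels of a (batch*segments, C, H, W) feature map along
    the segment (time) dimension, padding the vacated positions with 0."""
    nt, c, h, w = x.size()
    x = x.view(nt // n_segments, n_segments, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                    # group 1: shift left by one segment
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # group 2: shift right by one segment
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # group 3: not shifted
    return out.view(nt, c, h, w)
```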
The 2D ConvNet in FIG. 5 employs a conventional image classification network, such as ResNet50, ResNet101 or BN-Inception; the network adopted in the present invention is ResNet50, a 50-layer stack of residual blocks. A TSM time shift module is inserted into each residual block of ResNet50 in the manner shown in fig. 6. The first layer on branch 1 of each residual structure performs the time shift operation, and the rest of the structure and the calculation of the residual block are unchanged. In this way, the original frame information is retained on branch 2 while inter-frame information is exchanged on branch 1, and each residual block fuses the two kinds of information, making the network more suitable for behavior recognition. The time-shifted residual blocks are connected to serve as the infrastructure of the behavior recognition network, and finally a fully connected layer is added for classification, so that the behaviors of multi-class targets can be recognized.
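Reusing the temporal_shift sketch above, the following simplified residual block illustrates where the shift sits relative to the two branches of fig. 6; it uses a basic two-convolution block rather than the bottleneck blocks actually found in ResNet50, so it is an architectural sketch, not the patent's network.

```python
import torch.nn as nn

class ShiftedResidualBlock(nn.Module):
    """Illustrative residual block: branch 1 applies the temporal shift before its
    first convolution, branch 2 (the identity shortcut) keeps the original frame
    information, and the output fuses the two."""

    def __init__(self, channels, n_segments):
        super().__init__()
        self.n_segments = n_segments
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        shortcut = x                                      # branch 2: original frame information
        out = temporal_shift(x, self.n_segments)          # branch 1: exchange inter-frame information
        out = self.relu(self.bn1(self.conv1(out)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + shortcut)                  # fuse the two kinds of information
```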
Example seven
As an alternative embodiment of the present invention, before inputting the prediction frame into the trained behavior recognition network in the form of continuous frames, the behavior recognition method further includes:
step 1: equally dividing images in a prediction frame range according to an image time sequence, randomly extracting one frame from each subframe section to serve as a key frame, and stacking all the key frames to obtain divided image data;
Step 2: and inputting the image data into the trained behavior recognition network.
The recognition result output by the trained behavior recognition model is as follows:
Output = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
wherein (T_1, T_2, ..., T_k) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k, w) denotes the action of a convolutional network with parameters w on frame T_k, the function F returning the scores of T_k for all categories; G is a segment consensus function that combines the category scores of the several T_k and outputs an overall category prediction; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
TSN is a behavior recognition network architecture whose core is segmentation in the time domain. Given a video V containing m objects whose behavior is to be detected, the m objects are extracted by the method of step S2 and then sequentially input into the TSN network in the form of continuous frames. Taking a certain engineering vehicle target to be tested as an example, it is divided into k segments {S_1, S_2, ..., S_k} at equal frame intervals, and the output result of behavior recognition is therefore:
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
Output = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)}
wherein (T_1, T_2, ..., T_k) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k, w) denotes the action of a convolutional network with parameters w on frame T_k, the function F returning the scores of T_k for all categories; G is a segment consensus function that combines the category scores of the several T_k and outputs an overall category prediction, generally obtained by taking the maximum of the k predicted results; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
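A small sketch of the consensus and prediction step follows; the max consensus and the per-target loop mirror the description above, while the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def tsn_prediction(frame_scores):
    """Segment consensus followed by softmax.

    frame_scores: (k, C) tensor of per-key-frame category scores F(T_k, w).
    G is taken as the maximum over the k predictions, H as the softmax over categories.
    """
    consensus, _ = frame_scores.max(dim=0)    # G: combine the k category scores
    return F.softmax(consensus, dim=0)        # H: probability of each behavior category

def multi_target_output(per_target_scores):
    """Apply the same prediction to each of the m detected engineering-vehicle targets."""
    return [tsn_prediction(scores) for scores in per_target_scores]

# usage: probs = tsn_prediction(torch.randn(8, 5))  # 8 key frames, 5 behavior categories
```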
Training is carried out with this network, the network structure and model parameters are optimized, and the results for the various tested behaviors are improved, finally yielding the behavior recognition network. Inputting the engineering vehicle targets of each class in the video frames into this network finally yields the behavior of each class of engineering vehicle target.
Example eight
As shown in fig. 7, the behavior recognition device for a multi-class engineering vehicle provided by the invention includes:
an acquisition module 71 for acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
the detection module 72 is configured to input the video to be identified into a trained target detection model, so that the trained target detection model identifies the video to be identified, and output a prediction frame;
The prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
the recognition module 73 is configured to input the images within the prediction frame range into a trained behavior recognition network in a continuous frame manner, so that the behavior recognition network performs key frame extraction on the video to be recognized and recognition on the behavior of the engineering truck target, and obtains a category to which the behavior of the engineering truck target in the video to be recognized belongs;
the trained behavior recognition network is obtained by acquiring a second training set, the second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target, inputting the second samples into a preset behavior recognition network so that the dimension feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the difference in the number of dimension feature maps between the groups kept to a minimum, each group of dimension feature maps is shifted according to the serial number of its group and the gaps in the feature vectors corresponding to the shifted dimension feature maps are padded with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (8)

1. The behavior recognition method of the multi-class engineering vehicle is characterized by comprising the following steps of:
acquiring a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which an engineering vehicle target is located corresponds to the position coordinates and the category of that engineering vehicle target; the trained target detection model is obtained by acquiring a first training set, the first training set comprising a plurality of first samples in which every engineering vehicle target is marked with a real frame, clustering the first training set to obtain k prior frames, inputting the prior frames into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union (IoU) with the real frame, calculates the offset between the prediction frame and that prior frame and outputs a prediction frame containing the target, and iteratively training the preset target detection model until a first training cut-off condition is reached;
inputting the images within the prediction frame range into a trained behavior recognition network as consecutive frames, so that the behavior recognition network extracts key frames from the video to be recognized and recognizes the behavior of the engineering vehicle targets, thereby obtaining the categories to which the behaviors of the engineering vehicle targets in the video to be recognized belong;
wherein the trained behavior recognition network is obtained as follows: a second training set is acquired, wherein the second training set comprises a plurality of second samples and each second sample includes the real behavior category of an engineering vehicle target; the second samples are input into a preset behavior recognition network, so that the feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible; each group of feature maps is shifted according to the serial number of the group, and the vacated positions in the feature vectors corresponding to the shifted feature maps are padded with 0; and the preset behavior recognition network is iteratively trained until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network;
the trained target detection model is obtained through the following steps:
Step 1: acquiring original image data;
step 2: dividing the original image data into a training set, a test set and a validation set;
step 3: marking the engineering vehicle targets in the training set, the test set and the validation set with real frames;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each enhanced sample into S×S grids;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model by using a back-propagation algorithm based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cut-off condition comprises: the loss function value of the preset target detection model no longer changes or falls below a first threshold;
step 10: determining a preset target detection model with the minimum loss function as a trained target detection model;
the step 7 comprises: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest IoU with the real frame, calculates the offset between the prediction frame and the prior frame by using the following formula (1), based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, and outputs the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w·e^(t_w)
b_h = p_h·e^(t_h)
wherein b_x represents the abscissa of the prediction frame, b_y represents the ordinate of the prediction frame, b_w represents the width of the prediction frame predicted by the preset target detection model, obtained from the prior frame having the largest IoU with the real frame, b_h represents the height of the prediction frame obtained from that prior frame, p_w represents the width of the current prior frame, and p_h represents the height of the current prior frame; c_x and c_y represent the upper-left corner coordinates of the grid in which the center point C of the prediction frame is located, σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of that grid, t_w and t_h represent the width and height offsets predicted by the preset target detection model relative to the prior frame, and σ denotes the Sigmoid function, which normalizes the coordinate offsets into the (0, 1) interval.
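For illustration only, the following Python sketch decodes raw network outputs according to formula (1). The exponential scaling of the prior-frame width and height (b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)) is the standard YOLO-style reading implied by the definitions above; it is an assumption, not an explicit recitation of the claim.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_prediction(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw network outputs t_* into a prediction frame, per formula (1)."""
    b_x = sigmoid(t_x) + c_x    # centre x inside the grid cell whose corner is (c_x, c_y)
    b_y = sigmoid(t_y) + c_y    # centre y
    b_w = p_w * np.exp(t_w)     # width: prior-frame width scaled by exp(t_w)  (assumed form)
    b_h = p_h * np.exp(t_h)     # height: prior-frame height scaled by exp(t_h) (assumed form)
    return b_x, b_y, b_w, b_h
```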
2. The behavior recognition method of claim 1, wherein the loss function is:
loss = lbox + lcls + lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λ_coord represents the weight of the position loss, S represents the number of generated grids, and B represents the number of prior frames of each grid; 1_ij^obj is a judgment value indicating whether the prediction frame at position (i, j) contains an object, equal to 1 if it does and 0 otherwise; x_i, y_i represent the coordinates of the real frame, w_i, h_i represent the width and height of the real frame, and x̂_i, ŷ_i, ŵ_i, ĥ_i represent the coordinates and the width and height of the prediction frame; lcls represents the class loss, λ_class represents the weight of the class loss, and the class loss is calculated by a cross-entropy loss function; p_i(c) equals 1 if the category c predicted by the prediction frame is identical to the real category and 0 otherwise, and p̂_i(c) represents the probability of being predicted as category c; lobj represents the confidence loss, λ_noobj represents the weight applied when the prediction frame does not contain an actual engineering vehicle target, λ_obj represents the weight applied when the prediction frame contains an actual engineering vehicle target, 1_ij^noobj equals 1 if the prediction frame at position (i, j) contains no engineering vehicle target and 0 otherwise, c_i represents the confidence of the prediction frame, and ĉ_i represents the predicted confidence of the prediction frame.
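A minimal NumPy sketch of the three-part loss of claim 2 is given below for illustration. The array layout, the default weight values (λ_coord = 5.0, λ_noobj = 0.5, etc.) and the squared-error form of the position and confidence terms are assumptions chosen for concreteness, not values fixed by the claim.

```python
import numpy as np

def detection_loss(pred, true, obj_mask, lambda_coord=5.0, lambda_class=1.0,
                   lambda_obj=1.0, lambda_noobj=0.5):
    """pred / true: dicts of arrays over the S*S*B prior frames:
         'xywh' -> (N, 4) box coordinates and width/height
         'conf' -> (N,)   confidence
         'cls'  -> (N, C) class probabilities (true['cls'] is one-hot)
       obj_mask: (N,) 1 where a prior frame is responsible for a real target, else 0."""
    noobj_mask = 1.0 - obj_mask

    # lbox: position loss over prior frames that contain a target
    lbox = lambda_coord * np.sum(obj_mask[:, None] * (pred['xywh'] - true['xywh']) ** 2)

    # lcls: cross-entropy class loss over prior frames that contain a target
    lcls = -lambda_class * np.sum(obj_mask[:, None] * true['cls'] * np.log(pred['cls'] + 1e-9))

    # lobj: confidence loss, weighted differently for object / no-object frames
    conf_err = (pred['conf'] - true['conf']) ** 2
    lobj = lambda_obj * np.sum(obj_mask * conf_err) + lambda_noobj * np.sum(noobj_mask * conf_err)

    return lbox + lcls + lobj
```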
3. The behavior recognition method of claim 1, wherein the trained behavior recognition network is obtained by:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain behavior categories recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: comparing the behavior type of the sample identified by the preset behavior identification network with the real behavior type of the sample for each sample, and calculating a loss function of the preset behavior identification network;
Step 5: repeating the steps 2 to 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cut-off condition comprises: the loss function value of the preset behavior recognition network no longer changes or falls below a second threshold;
step 6: and determining the preset behavior recognition network reaching the second training cut-off condition as a trained behavior recognition network.
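For illustration, a training loop implementing the second training cut-off condition (loss no longer changing, or below a threshold) might look as follows; the run_epoch function and the patience parameter are hypothetical.

```python
def train_until_cutoff(network, samples, run_epoch, threshold, patience=5):
    """run_epoch(network, samples) is a hypothetical function that performs one
    training pass and returns the epoch loss; training stops when the loss falls
    below `threshold` or stops changing over `patience` epochs."""
    history = []
    while True:
        loss = run_epoch(network, samples)
        history.append(loss)
        if loss < threshold:
            break
        if len(history) > patience and abs(history[-1] - history[-1 - patience]) < 1e-6:
            break  # loss value no longer changes
    return network
```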
4. The behavior recognition method according to claim 3, wherein the preset behavior recognition network is a temporal segment network (TSN)-based network, a TSM temporal shift module is connected between the residual layers of the TSN, the TSM temporal shift module of each layer shifts the feature maps output by the previous residual layer according to the serial number of their group, and the vacated positions in the feature vectors corresponding to the shifted feature maps are padded with 0.
5. The behavior recognition method according to claim 4, wherein shifting, by the TSM temporal shift module of each layer, the feature maps output by the previous residual layer according to the serial number of the group, and padding the vacated positions in the feature vectors corresponding to the shifted feature maps with 0, comprises:
the TSM temporal shift module of each layer divides the feature maps output by the previous residual layer into 3 groups according to the time sequence of the video frames;
shifting the feature maps of the first group one position to the left according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0;
and shifting the feature maps of the second group one position to the right according to the time sequence of the images, and padding the vacated positions in the corresponding feature vectors with 0.
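The temporal shift of claims 4 and 5 can be illustrated with the following NumPy sketch. Splitting the feature maps into three near-equal groups, shifting the first group one frame earlier and the second one frame later, and zero-padding the vacated positions is one plausible reading of the claim; the exact grouping rule is an assumption.

```python
import numpy as np

def temporal_shift(features):
    """features: array of shape (T, C, H, W) holding feature maps for T frames.
    Split the maps into 3 near-equal groups, shift group 1 one frame earlier,
    group 2 one frame later, leave group 3 in place; vacated slots stay zero."""
    T, C, H, W = features.shape
    g = C // 3
    out = np.zeros_like(features)
    out[:-1, :g] = features[1:, :g]            # group 1: shifted left along time
    out[1:, g:2 * g] = features[:-1, g:2 * g]  # group 2: shifted right along time
    out[:, 2 * g:] = features[:, 2 * g:]       # group 3: unshifted
    return out
```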
6. The behavior recognition method of claim 1, wherein before inputting the images within the prediction frame range into the trained behavior recognition network as consecutive frames, the behavior recognition method further comprises:
equally dividing the images within the prediction frame range into segments according to the image time sequence, randomly extracting one frame from each sub-segment to serve as a key frame, and stacking all the key frames to obtain the divided image data;
and inputting the image data into the trained behavior recognition network.
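An illustrative sketch of the key-frame sampling of claim 6 (equal division into sub-segments and random extraction of one frame per sub-segment) is given below; it assumes the number of cropped frames is at least the number of segments.

```python
import random

def sample_key_frames(frames, num_segments):
    """Divide the cropped frames into num_segments equal sub-segments and draw
    one random key frame from each sub-segment."""
    assert len(frames) >= num_segments, "need at least one frame per sub-segment"
    seg_len = len(frames) // num_segments
    key_frames = []
    for k in range(num_segments):
        start = k * seg_len
        key_frames.append(frames[random.randrange(start, start + seg_len)])
    return key_frames  # stacked key frames form the input to the recognition network
```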
7. The behavior recognition method according to claim 6, wherein the recognition result output by the trained behavior recognition network is:
TSN(T_1, T_2, ..., T_K) = H(G(F(T_1; W), F(T_2; W), ..., F(T_K; W)))
wherein (T_1, T_2, ..., T_K) represents the sequence of video key frames, each key frame T_k being obtained by random sampling from its corresponding video segment S_k; F(T_k; W) denotes the action, on the frame T_k, of a convolutional network with W as its parameters, the function F returning the scores of T_k relative to all categories; G is a segment consensus function that combines the category scores of the multiple T_k and outputs an overall category prediction value; and H is a softmax prediction function used to predict the probability that the whole video belongs to each behavior category.
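The segment consensus of claim 7 can be illustrated as follows. The per-frame scoring function F is a hypothetical stand-in for the convolutional network, and averaging is assumed for the consensus function G, which the claim does not fix.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tsn_predict(key_frames, F, W):
    """F(T_k, W) is a hypothetical per-frame scoring function (e.g. a ConvNet with
    parameters W) returning the scores of frame T_k relative to all categories."""
    scores = np.stack([F(t_k, W) for t_k in key_frames])  # shape (K, num_classes)
    consensus = scores.mean(axis=0)   # G: segment consensus, assumed to be averaging
    return softmax(consensus)         # H: softmax over the consensus scores
```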
8. A behavior recognition device for a multi-class engineering vehicle, comprising:
the method comprises the steps of obtaining a model, wherein the model is used for obtaining a video to be identified;
the video to be identified comprises a plurality of frames of images, and each frame of image comprises a plurality of engineering truck targets;
the detection module is used for inputting the video to be identified into a trained target detection model so that the trained target detection model identifies the video to be identified and outputs a prediction frame;
the method comprises the steps that a prediction frame comprises an engineering vehicle target in a video to be recognized, the prediction frame where the engineering vehicle target is located corresponds to the position coordinate and the category of the engineering vehicle target, a trained target detection model is obtained by obtaining a first training set, the first training set comprises a plurality of first samples, the engineering vehicle target in each first sample is marked by a real frame, the first training set is clustered to obtain k priori frames, the priori frames are input into a preset target detection model, so that the preset target detection model determines the priori frame with the largest intersection ratio with the real frame, the offset between the prediction frame and the priori frame is calculated, a prediction frame comprising the target is output, and the preset target detection model is iteratively trained until a first training cut-off condition is reached;
a recognition module, which is used for inputting the images within the prediction frame range into a trained behavior recognition network as consecutive frames, so that the behavior recognition network extracts key frames from the video to be recognized and recognizes the behavior of the engineering vehicle targets, obtaining the categories to which the behaviors of the engineering vehicle targets in the video to be recognized belong;
wherein the trained behavior recognition network is obtained as follows: a second training set is acquired, wherein the second training set comprises a plurality of second samples and each second sample includes the real behavior category of an engineering vehicle target; the second samples are input into a preset behavior recognition network, so that the feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible; each group of feature maps is shifted according to the serial number of the group, and the vacated positions in the feature vectors corresponding to the shifted feature maps are padded with 0; and the preset behavior recognition network is iteratively trained until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network;
the trained target detection model is obtained through the following steps:
Step 1: acquiring original image data;
step 2: dividing the original image data into a training set, a test set and a validation set;
step 3: marking the engineering vehicle targets in the training set, the test set and the validation set with real frames;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising a scale of the prior frame, the scale comprising a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each enhanced sample into S×S grids;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
step 7: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest intersection-over-union (IoU) with the real frame, adjusting the parameters of the preset target detection model by using a back-propagation algorithm based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating the steps 7 to 8 until a first training cut-off condition is reached;
wherein the first training cut-off condition comprises: the loss function value of the preset target detection model no longer changes or falls below a first threshold;
step 10: determining a preset target detection model with the minimum loss function as a trained target detection model;
the step 7 comprises: inputting the prior frame information and the coordinates of the center position of the object into a preset target detection model, so that the preset target detection model determines the prior frame having the largest IoU with the real frame, calculates the offset between the prediction frame and the prior frame by using the following formula (1), based on the prior frame having the largest IoU with the real frame and on the confidence of the grid in which the center position of the object is located, and outputs the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w·e^(t_w)
b_h = p_h·e^(t_h)
wherein b_x represents the abscissa of the prediction frame, b_y represents the ordinate of the prediction frame, b_w represents the width of the prediction frame predicted by the preset target detection model, obtained from the prior frame having the largest IoU with the real frame, b_h represents the height of the prediction frame obtained from that prior frame, p_w represents the width of the current prior frame, and p_h represents the height of the current prior frame; c_x and c_y represent the upper-left corner coordinates of the grid in which the center point C of the prediction frame is located, σ(t_x) and σ(t_y) represent the distances between the center point C of the prediction frame and the upper-left corner of that grid, t_w and t_h represent the width and height offsets predicted by the preset target detection model relative to the prior frame, and σ denotes the Sigmoid function, which normalizes the coordinate offsets into the (0, 1) interval.
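For illustration, the k-means clustering of prior-frame scales recited in step 4 of claims 1 and 8 might be sketched as below; assignment to the anchor of highest IoU (equivalently, a 1 − IoU distance) is a common choice and is assumed here, not recited in the claims.

```python
import numpy as np

def kmeans_prior_frames(boxes_wh, k, iters=100, seed=0):
    """boxes_wh: (N, 2) array of real-frame (width, height) pairs.
    Returns k prior-frame (width, height) scales."""
    rng = np.random.default_rng(seed)
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every anchor, treating them as co-centred
        inter = np.minimum(boxes_wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(boxes_wh[:, None, 1], anchors[None, :, 1])
        union = (boxes_wh[:, 0] * boxes_wh[:, 1])[:, None] + \
                (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)  # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes_wh[assign == j].mean(axis=0)
    return anchors
```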
CN202110098578.5A 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle Active CN112800934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098578.5A CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle


Publications (2)

Publication Number Publication Date
CN112800934A CN112800934A (en) 2021-05-14
CN112800934B true CN112800934B (en) 2023-08-08

Family

ID=75811658


Country Status (1)

Country Link
CN (1) CN112800934B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361519B (en) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113255616B (en) * 2021-07-07 2021-09-21 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114419508A (en) * 2022-01-19 2022-04-29 北京百度网讯科技有限公司 Recognition method, training method, device, equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
WO2020206861A1 (en) * 2019-04-08 2020-10-15 江西理工大学 Yolo v3-based detection method for key object at transportation junction
CN111950583A (en) * 2020-06-05 2020-11-17 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM clustering
CN112084890A (en) * 2020-08-21 2020-12-15 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM and CQFL


Non-Patent Citations (1)

Title
Multi-type cooperative target detection with an improved YOLOv2 convolutional neural network; Wang Jianlin; Fu Xuesong; Huang Zhanchao; Guo Yongqi; Wang Rutong; Zhao Liqiang; Optics and Precision Engineering (01); full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant