CN109190444B - Method for realizing video-based toll lane vehicle feature recognition system - Google Patents

Method for realizing video-based toll lane vehicle feature recognition system

Info

Publication number
CN109190444B
Authority
CN
China
Prior art keywords
vehicle
target
feature
feature map
video
Prior art date
Legal status
Active
Application number
CN201810705071.XA
Other languages
Chinese (zh)
Other versions
CN109190444A (en
Inventor
阮雅端
赵博睿
陈林凯
葛嘉琦
陈启美
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201810705071.XA
Publication of CN109190444A
Application granted
Publication of CN109190444B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G07: CHECKING-DEVICES
    • G07B: TICKET-ISSUING APPARATUS; FARE-REGISTERING APPARATUS; FRANKING APPARATUS
    • G07B15/00: Arrangements or apparatus for collecting fares, tolls or entrance fees at one or more control points
    • G07B15/06: Arrangements for road pricing or congestion charging of vehicles or vehicle users, e.g. automatic toll systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for realizing a video-based toll lane vehicle feature recognition system comprising three modules: a vehicle detection module, a vehicle tracking module and a vehicle feature recognition module. The invention uses an SSD target detector for detection, tracks vehicles by comparing feature map histograms and positions, and performs vehicle feature recognition on the feature maps with a convolutional neural network. The method recognizes vehicle features effectively, runs in real time, reduces repeated consumption of computing resources and improves the accuracy of the system.

Description

Method for realizing video-based toll lane vehicle feature recognition system
Technical Field
The invention belongs to the technical field of image processing and computer vision detection, relates to the application of target detection and deep learning algorithms to vehicle detection, and discloses a method for realizing a video-based toll lane vehicle feature recognition system.
Background
Expressway construction in China has developed rapidly, and expressway transportation has become one of the main modes of land freight. It offers advantages such as high speed and stability. However, toll evasion on expressway toll lanes is increasingly serious: many vehicles are clearly buses, yet they are fitted with ETC toll devices registered for cars and are charged at the car rate when passing through a toll lane. As deep learning and target detection technologies mature, automatic detection and feature recognition of toll lane vehicles has become an important research topic for intelligent traffic systems; in expressway toll lane management it can effectively reduce manpower and efficiently curb toll evasion. However, vehicle detection and feature recognition in toll lanes place high demands on the real-time performance and accuracy of the system. If real-time performance is inadequate, the system cannot be used normally; if accuracy is inadequate, the system easily produces a large number of misjudgments and disrupts the normal operation of the toll lane. Improving real-time performance and accuracy simultaneously is therefore very important; it is also a major direction of current research, with significant value for toll lane intelligent traffic systems.
At present, most vehicle feature recognition systems model the background of the toll lane video with a Gaussian mixture background subtraction algorithm (GMBSD) to realize vehicle detection and tracking, but this method has low accuracy under congestion and lacks generality. Many deep-learning target detection algorithms, such as Fast R-CNN and SSD, achieve better detection accuracy, but their real-time performance is low and they cannot be deployed on a large scale effectively and economically. Moreover, without a subsequent vehicle tracking algorithm, such a system easily performs repeated feature recognition on the same vehicle; even if tracking and feature recognition algorithms are simply appended, the real-time performance of the system remains low and large-scale deployment is still difficult.
Disclosure of Invention
The problem the invention aims to solve is: for toll lane vehicle feature recognition, the methods adopted by existing systems cannot balance accuracy, real-time performance and economy, and cannot meet the requirements of large-scale deployment with accurate, real-time recognition. The invention aims to improve the real-time performance of existing vehicle feature recognition systems without sacrificing accuracy; to realize a target tracking method for the vehicle feature recognition task that reduces repeated feature recognition; and to perform vehicle feature recognition directly on the feature maps obtained during detection, further improving the real-time performance of the system.
The technical scheme of the invention is as follows: a toll lane vehicle feature recognition method based on videos comprises the following three steps of vehicle detection, vehicle tracking and vehicle feature recognition:
step S1, vehicle detection is carried out on the toll lane video based on the deep learning method; the feature map of each detected vehicle is normalized, pooled and stored, and the position and category information of each vehicle is stored at the same time:
s1.1) training a convolutional neural network for vehicle detection, and classifying the detected vehicles into 3 types, namely a bus, a truck and a car;
s1.2) each frame of the toll lane video is detected with the convolutional neural network; the detection output comprises the position and category of each vehicle, where the position refers to the vehicle's center-point coordinates, width and height, and the category is one of the 3 classes;
s1.3) the feature maps of the video images of the detected vehicles are normalized and pooled to obtain sub-feature maps, the vehicle positions and the vehicle types are saved as detection information, an ID is used as an index for each vehicle, and the saved information is expressed as:
content(id)={featuremap,loc,class} (1)
in the formula, featuremap represents the feature map, a 3x3x256-dimensional tensor; loc = (x, y, w, h) represents the position information, the four items being the center-point abscissa, center-point ordinate, vehicle width and vehicle height, all valued between 0 and 1; class = (cls1, cls2, cls3) represents the vehicle category, the three items being the cumulative number of frames so far in which the target has been identified as a car, a bus and a truck, respectively;
step S2, comparing the feature map similarity and the position of the detection information of the current frame and the detection information of the previous frame, marking the vehicles with similar comparison results as the same vehicle, and realizing the vehicle tracking function:
s2.1) the vehicle detection information of the previous frame is compared one by one with that of the current frame; a target whose feature map similarity and position distance both meet the set thresholds is regarded as the same vehicle target. For such a target, the vehicle's ID in the current frame is changed to its ID in the previous frame, and the record is updated with the current frame's detection information; this continues until the vehicle target no longer appears in the video frames, realizing target tracking. At that point the same vehicle corresponds to the same ID across multiple frames, and its detection information is that of the last video frame in which it was detected. If a target in the current frame did not appear in the previous frame, it is regarded as a vehicle newly appearing in the video, its ID in the current frame is taken as the vehicle ID, and a new track is started;
s2.2) carrying out weighted average on the current category and the categories of all historical frames belonging to the same vehicle target to obtain the final category of the vehicle, wherein the category average method is represented as follows:
cls=argmax(cls1,cls2,cls3) (3)
where argmax returns the index of the maximum value;
step S3, when a tracked vehicle passes through a polygonal region of interest marked in the video in advance, the normalized sub-feature map corresponding to the vehicle target is input into two deep learning sub-networks for vehicle type recognition and color recognition respectively, and all feature information is stored, realizing the toll lane vehicle feature recognition function:
s3.1) the position information of all vehicle targets in the current frame is examined; if a target lies in the region of interest, the sub-feature map corresponding to that target is extracted. Whether a target lies in the region of interest is judged as follows: traverse the vertices of the region-of-interest polygon in order; if the total area of the sub-triangles formed by the region-of-interest vertices and the vehicle center point equals the area of the polygon itself, the point lies inside the region of interest; otherwise it lies outside. The discriminant is expressed as:
area = Area(P, R_1, R_2) + Area(P, R_2, R_3) + … + Area(P, R_n, R_1)    (4)
area′ = Area(R_1, R_2, R_3) + Area(R_1, R_3, R_4) + … + Area(R_1, R_{n-1}, R_n)
in the formula, Area denotes the area of a triangle, P denotes the center point of the target, R_i denotes the ith vertex of the polygon in clockwise order, and n denotes the number of polygon vertices; if area equals area′, the target lies inside the polygon;
s3.2) the obtained sub-feature map is passed through two convolutional neural networks to respectively obtain color information and vehicle type classification information, and the two convolutional neural networks use the collected toll lane video as training data for identifying the vehicle color and vehicle type information;
and S3.3) the color and vehicle type information are stored under the corresponding vehicle ID; if a target with the same ID appears in subsequent frames, the color and vehicle type information will not be stored again, completing the recognition of toll lane vehicle features.
Preferably, the deep learning method used in step S1 is specifically:
carrying out vehicle detection on the frame images of the toll lane video with a Single Shot MultiBox Detector algorithm, the input being a 300x300 color image, wherein the convolutional neural network structure is specifically:
(1) the feature map scales used for detection are 10x10, 5x5, 3x3 and 1x1;
(2) convolution kernels of sizes 5x5, 3x3 and 1x1 are used for detection and connected in parallel; the kernels at the three scales are zero-padded so that the feature maps after convolution have the same size, the corresponding zero-padding scales being 2, 1 and 0 respectively;
(3) the loss function used in training is divided into a location loss and a category loss, and the loss function is expressed as:
loss = loss_loc * 0.8 + loss_class * 0.2    (5)
In the formula, loss_loc represents the position regression loss, computed with the smoothL1 function; loss_class represents the classification loss, computed with the SoftMax function.
Further, the method for normalizing and pooling the feature map used in step S1 specifically includes:
firstly, the feature map with scale 38x38 is selected as the reference feature map, and the vehicle size is mapped onto it to obtain a sub-feature map; the mapped sub-feature map is pooled with a variable pooling step and pooling kernel so that the pooled output feature maps have a unified size of 3x3; the pooling step and pooling kernel size are uniquely determined by the size of the sub-feature map, determined as:
s_w = ⌊W/3⌋, s_h = ⌊H/3⌋    (6)
In the formula, W and H are the width and height of the sub-feature map; the horizontal pooling step equals the pooling kernel width, both being s_w; the vertical pooling step equals the pooling kernel height, both being s_h; ⌊ ⌋ denotes rounding down.
Further, in S2.1), the comparison for target tracking includes feature map similarity comparison and position distance comparison. The feature map similarity comparison computes the feature histogram distance: the smaller the distance, the higher the similarity. The position distance is the Euclidean distance between center points, expressed as:
d = √((x1 − x2)² + (y1 − y2)²)    (2)
In the formula, (x1, y1) and (x2, y2) are the center-point coordinates of a vehicle in the current frame and of a vehicle in the previous frame, respectively.
The invention has the following beneficial effects:
To recognize the important features of vehicles passing through a toll lane, a deep-learning target detection algorithm is combined with a target tracking algorithm based on feature map comparison to obtain a sub-feature map of each vehicle, and features such as color and category are recognized from the sub-feature map with deep learning algorithms;
The invention improves the vehicle detection algorithm: in the SSD target detection algorithm, feature maps with low utilization are removed, saving detection time and improving the real-time performance of the system; the training loss and the convolution kernel scales are modified for this task, improving the accuracy of the system.
The method considers the real-time performance and the accuracy of the system together: redundant parts of the target detector are cut out, and the accuracy of vehicle detection is improved by modifying the network structure and the loss function; a tracking algorithm for detected vehicles is realized by combining feature map histogram comparison with position comparison, giving better robustness; finally, the system recognizes the color, vehicle type and other features of each vehicle with a unique ID, and the input used is not a picture but the sub-feature map obtained from the detection network, improving the real-time performance of the system and the utilization of the network parameters. The system therefore has good real-time performance and effectiveness.
Drawings
FIG. 1 is a system framework diagram of the present invention.
Fig. 2 is a schematic diagram of the deep learning method used in step S1, i.e., the SSD network structure.
FIG. 3 is a schematic diagram of the structure of the SSD detection convolution kernel in step S1 according to the present invention.
FIG. 4 is a schematic diagram of the normalized pooling algorithm of the present invention.
FIG. 5 is a schematic diagram of a method for determining a relationship between a point and a polygon, where (a) is a point outside a graph and (b) is a point inside a graph.
FIG. 6 is a schematic diagram of a convolutional neural network for color and vehicle type recognition in the present invention.
FIG. 7 illustrates the effect of each step of the invention: (a) input image; (b) detection result; (c) tracking result; (d) pictures corresponding to the vehicle sub-feature maps; (e) vehicle feature recognition result.
Detailed Description
The invention provides a method for realizing a video-based toll lane vehicle feature recognition system, which can effectively realize vehicle detection and tracking, avoid repeated feature recognition on the same vehicle and further improve the accuracy and the real-time performance of vehicle feature recognition.
The invention is further illustrated with reference to the figures and examples.
The technical scheme of the invention is an implementation method for a video-based toll lane vehicle feature recognition system. As shown in fig. 1, the method comprises three parts, vehicle detection, vehicle tracking and vehicle feature recognition, with the following steps:
step S1: as shown in fig. 7(a), vehicle detection based on the deep learning method is performed on each frame; the feature maps of detected vehicles are normalized, pooled and stored, together with each vehicle's position information:
s1.1) frames are captured and vehicles annotated from part of the collected toll lane videos; the resulting picture data are used to train a convolutional neural network, which divides detected vehicles into 3 types: buses, trucks and cars;
s1.2) each frame is detected with the trained convolutional neural network; the result is the position and category of each vehicle, where the position is represented by the center-point coordinates, width and height, and the category is one of the 3 classes, as shown in FIG. 7(b);
s1.3) the feature maps of the video images of detected vehicles are normalized and pooled to obtain sub-feature maps; the vehicle position and category are saved as detection information, with an ID as the index for each vehicle. The saved information is expressed as:
content(id)={featuremap,loc,class} (1)
in the formula, featuremap represents the feature map, a 3x3x256-dimensional tensor; loc = (x, y, w, h) represents the position information, the four items being the center-point abscissa, center-point ordinate, vehicle width and vehicle height, all valued between 0 and 1; class = (cls1, cls2, cls3) represents the vehicle category, the three items being the cumulative number of frames so far in which the target has been identified as a car, a bus and a truck, respectively.
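For illustration only, a minimal Python sketch of how the record of formula (1) might be organized; the names VehicleRecord and save_detection, and the dictionary keyed by tracking ID, are hypothetical, not part of the patent:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class VehicleRecord:
    """Mirrors content(id) = {featuremap, loc, class} from formula (1)."""
    featuremap: np.ndarray   # pooled sub-feature map, 3x3x256
    loc: tuple               # (x, y, w, h), all normalized to [0, 1]
    cls_counts: np.ndarray = field(default_factory=lambda: np.zeros(3))  # frames seen as [car, bus, truck]

records: dict = {}           # tracking ID -> VehicleRecord

def save_detection(vid, featuremap, loc, cls_index):
    """Create or refresh the record for vehicle `vid` with the current frame."""
    if vid not in records:
        records[vid] = VehicleRecord(featuremap, loc)
    rec = records[vid]
    rec.featuremap, rec.loc = featuremap, loc   # keep the latest detection
    rec.cls_counts[cls_index] += 1              # accumulate one class vote
```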
Step S2: the detected-vehicle pictures are ordered according to the video, and vehicle tracking compares the current frame with the previous frame: the feature map similarity and position of the two frames' detection information are compared, and vehicles with similar comparison results are marked as the same vehicle, realizing the vehicle tracking function as shown in fig. 7(c);
s2.1) the vehicle detection information of the previous frame is compared one by one with that of the current frame; a target whose feature map similarity and position distance both meet the set thresholds is regarded as the same vehicle target. For such a target, the vehicle's ID in the current frame is changed to its ID in the previous frame, and the record is updated with the current frame's detection information; this continues until the vehicle target no longer appears in the video frames, realizing target tracking. At that point the same vehicle corresponds to the same ID across multiple frames, and its detection information is that of the last video frame in which it was detected. If a target in the current frame did not appear in the previous frame, it is regarded as a vehicle newly appearing in the video, its ID in the current frame is taken as the vehicle ID, and a new track is started;
the feature map similarity contrast method is to calculate the distance of the feature histogram, and the smaller the distance, the higher the similarity, the feature histogram statistical method is similar to the color histogram statistics, and the difference is that the statistical channel component is changed from the color three-channel value to the 256-channel feature value. The distance calculation method is Euclidean distance and is expressed as:
d = √((x1 − x2)² + (y1 − y2)²)    (2)
In the formula, (x1, y1) and (x2, y2) are the center-point coordinates of a vehicle in the current frame and of a vehicle in the previous frame, respectively. After the comparison results are obtained, targets whose similarity and position comparisons both meet the set thresholds are regarded as the same vehicle target;
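A sketch of the matching test, reusing the hypothetical VehicleRecord above; the histogram binning (one bin per feature channel) and the threshold values are assumptions, since the patent specifies neither:

```python
import numpy as np

def feature_histogram(featuremap):
    """Aggregate the 3x3x256 sub-feature map into a 256-bin histogram,
    one bin per feature channel (analogous to a color histogram with 256
    channels instead of 3; the exact binning is an assumption)."""
    hist = featuremap.reshape(-1, featuremap.shape[-1]).sum(axis=0)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def is_same_vehicle(prev, cur, hist_thresh=0.5, dist_thresh=0.1):
    """Match two detections when both the histogram distance and the
    center-point Euclidean distance of formula (2) fall below thresholds
    (the threshold values here are placeholders)."""
    hist_dist = np.linalg.norm(feature_histogram(prev.featuremap)
                               - feature_histogram(cur.featuremap))
    d = np.hypot(prev.loc[0] - cur.loc[0], prev.loc[1] - cur.loc[1])
    return hist_dist < hist_thresh and d < dist_thresh
```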
s2.2) carrying out weighted average on the current category and the categories of all historical frames belonging to the same target to obtain the final category of the vehicle, wherein the category average method is represented as:
cls=argmax(cls1,cls2,cls3) (3)
in the formula, argmax returns the index of the maximum value, and cls1, cls2 and cls3 are the three components of class in formula (1).
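A one-line realization of formula (3) over the accumulated frame counts, with hypothetical names:

```python
import numpy as np

def final_class(cls_counts):
    """Formula (3): the final label is the class with the most frame votes,
    cls = argmax(cls1, cls2, cls3)."""
    return ("car", "bus", "truck")[int(np.argmax(cls_counts))]

# e.g. a target seen as a bus in 17 frames and as a truck in 2:
print(final_class(np.array([0, 17, 2])))   # -> "bus"
```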
Step S3: when the tracked vehicle passes through a polygonal region of interest marked manually in advance, the normalized sub-feature map (step S1.3) corresponding to the vehicle target is input into two deep learning sub-networks for vehicle type recognition and color recognition respectively, and all feature information is stored. This realizes the toll lane vehicle feature recognition function:
and S3.1) the position information of all vehicle targets in the current frame is examined; if a target lies in the region of interest, the sub-feature map corresponding to that target is extracted (the original image corresponding to a sub-feature map is shown in FIG. 7(d)); the sub-feature map is regarded as a feature representation of the vehicle information, analogous to a color histogram. As shown in fig. 5, whether the target lies in the region of interest is judged by computing sub-triangle areas: traverse the vertices of the region-of-interest polygon in order; if the total area of the sub-triangles formed by the polygon vertices and the vehicle center point equals the area of the polygon, the point lies inside the region of interest, otherwise outside. The discriminant is expressed as:
area = Area(P, R_1, R_2) + Area(P, R_2, R_3) + … + Area(P, R_n, R_1)    (4)
area′ = Area(R_1, R_2, R_3) + Area(R_1, R_3, R_4) + … + Area(R_1, R_{n-1}, R_n)
In the formula, Area denotes the area of a triangle, P the center point of the target, R_i the ith vertex of the polygon in clockwise order, and n the number of polygon vertices. If area equals area′, the target lies inside the polygon;
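A sketch of the triangle-area test; the shoelace area formula is standard, and the test as stated assumes a convex region of interest:

```python
def triangle_area(a, b, c):
    """Area of triangle abc via the cross-product (shoelace) formula."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def inside_region(p, poly, eps=1e-9):
    """Formula (4): sum the fan of triangles (P, R_i, R_{i+1}) over all edges
    and compare with the polygon's own area area'; equality (within eps)
    means P lies inside."""
    n = len(poly)
    area = sum(triangle_area(p, poly[i], poly[(i + 1) % n]) for i in range(n))
    area_ref = sum(triangle_area(poly[0], poly[i], poly[i + 1])
                   for i in range(1, n - 1))
    return abs(area - area_ref) < eps
```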
and S3.2) as shown in FIG. 7(e), the obtained sub-feature map is passed through two convolutional neural networks to obtain color information and vehicle type classification information respectively; both networks are trained on the collected toll lane video. The procedure is similar to S1.1, except that the inputs of these two convolutional neural networks are not images but feature maps. The convolutional neural network structure is shown in fig. 6. There are 8 color types: black, white, red, yellow, blue, green, brown and silver; and 76 vehicle types: BMW, Volkswagen, etc. (a sketch of such a sub-network follows);
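A minimal sketch of one such sub-network, consuming the 3x3x256 sub-feature map rather than an image; the layer sizes are illustrative assumptions, since fig. 6 is not reproduced here:

```python
import torch
import torch.nn as nn

class FeatureClassifier(nn.Module):
    """Classifies a pooled 3x3x256 sub-feature map; num_classes = 8 for the
    color network and 76 for the vehicle-type network. Hidden sizes are
    assumptions (the actual structure is given in fig. 6)."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(256, 128, kernel_size=1),  # mix channels at each cell
            nn.ReLU(inplace=True),
            nn.Flatten(),                        # 128 * 3 * 3 = 1152 features
            nn.Linear(128 * 3 * 3, num_classes),
        )

    def forward(self, x):                        # x: (N, 256, 3, 3)
        return self.net(x)

color_net = FeatureClassifier(num_classes=8)     # black, white, red, ...
type_net = FeatureClassifier(num_classes=76)     # BMW, Volkswagen, ...
```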
s3.3) all feature information is stored, each vehicle having a unique ID; if targets with the same ID appear in subsequent frames, they will not be stored again.
Further, in the above scheme, the deep learning algorithm used in step S1 is specifically as follows:
A Single Shot MultiBox Detector (SSD) algorithm performs vehicle detection on the toll lane video frame images. The input to the SSD is a 300x300 color image. As shown in fig. 2, the SSD network structure is modified for the toll lane vehicle detection problem as follows:
(1) The feature map scales used for detection are 10x10, 5x5, 3x3 and 1x1; the 19x19 and 38x38 feature maps originally used are deleted. Toll lane vehicle feature recognition only needs to detect vehicles passing through the toll lane, and these vehicles occupy a dominant part of the camera's view, so their scale is generally large. The 19x19 and 38x38 feature maps serve small-target detection, so they can be deleted in this vehicle detection task, improving the real-time performance of the model.
(2) As shown in fig. 3, the convolution kernels used for detection are changed to parallel 5x5, 3x3 and 1x1 kernels; padding of different sizes is applied to the three kernel scales so that the feature maps after convolution have the same size and can be fused. The corresponding padding for the three scales is 2, 1 and 0 respectively. This structure is similar to the Inception structure; its purpose is to better extract feature information over multiple receptive fields and improve the accuracy of the network (a code sketch follows item (3) below).
(3) The loss function used in training is divided into a position loss and a category loss, and the weight of the position loss is increased so that the detected positions are more accurate. The redefined loss function can be expressed as:
loss = loss_loc * 0.8 + loss_class * 0.2    (5)
In the formula, loss_loc represents the position regression loss, computed with the smoothL1 function; loss_class represents the classification loss, computed with the SoftMax function.
Since there are only 3 classification targets, the classification task is simpler than the position regression task; appropriately reducing the weight of the classification loss therefore does not reduce detection accuracy, while the increased weight of the position loss improves localization accuracy.
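For concreteness, a minimal PyTorch sketch of modifications (2) and (3); the channel widths are illustrative assumptions, and the matching of default boxes to ground truth is assumed to happen upstream, as in standard SSD training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelDetectionHead(nn.Module):
    """Modification (2): Inception-like head with 5x5, 3x3 and 1x1 branches
    in parallel, zero-padded by 2, 1 and 0 so every branch keeps the input's
    spatial size and the outputs can be fused by channel concatenation."""
    def __init__(self, in_ch=256, branch_ch=128):
        super().__init__()
        self.branch5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.branch3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.branch1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1, padding=0)

    def forward(self, x):
        # identical spatial sizes, so fusion is a simple channel concat
        return torch.cat([self.branch5(x), self.branch3(x), self.branch1(x)], dim=1)

def detection_loss(loc_pred, loc_gt, cls_logits, cls_gt):
    """Modification (3), formula (5): loss = loss_loc * 0.8 + loss_class * 0.2,
    with smoothL1 for position regression and SoftMax cross-entropy for
    classification."""
    loss_loc = F.smooth_l1_loss(loc_pred, loc_gt)
    loss_cls = F.cross_entropy(cls_logits, cls_gt)   # softmax + NLL
    return 0.8 * loss_loc + 0.2 * loss_cls

# a 10x10 detection feature map keeps its 10x10 size on every branch:
y = ParallelDetectionHead()(torch.randn(1, 256, 10, 10))   # shape (1, 384, 10, 10)
```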
The method for normalizing and pooling the feature map used in the step S1 specifically comprises the following steps:
Firstly, the feature map with scale 38x38 is selected as the reference feature map, and the size of the target, i.e. the vehicle size, is mapped onto the reference feature map to obtain a sub-feature map. This scale is chosen because: (1) the semantic level of this feature map is low, so vehicles of the same category can still be effectively distinguished; (2) the largest feature map used for detection is 10x10, and taking the 38x38 feature map as reference ensures that the sub-feature map corresponding to a target is no smaller than 3x3. The mapped sub-feature map is pooled with a variable pooling step and pooling kernel so that the pooled output feature maps have a unified size of 3x3. As shown in fig. 4, the pooling step and pooling kernel size are uniquely determined by the size of the sub-feature map, as follows:
s_w = ⌊W/3⌋, s_h = ⌊H/3⌋    (6)
In the formula, W and H are the width and height of the sub-feature map; the horizontal pooling step equals the pooling kernel width, both being s_w; the vertical pooling step equals the pooling kernel height, both being s_h; ⌊ ⌋ denotes rounding down.
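A sketch of the normalized pooling under formula (6); max pooling is an assumption, since the patent does not name the pooling operator, and the axis layout (W, H, 256) is illustrative:

```python
import numpy as np

def normalized_pool(sub_map):
    """Formula (6): pool a (W, H, 256) sub-feature map down to (3, 3, 256)
    with stride equal to kernel size, s_w = floor(W/3), s_h = floor(H/3).
    The reference-map choice above guarantees W, H >= 3."""
    W, H, C = sub_map.shape
    sw, sh = W // 3, H // 3
    out = np.empty((3, 3, C), dtype=sub_map.dtype)
    for i in range(3):
        for j in range(3):
            win = sub_map[i * sw:(i + 1) * sw, j * sh:(j + 1) * sh, :]
            out[i, j, :] = win.max(axis=(0, 1))   # max pooling over the window
    return out

# e.g. a vehicle covering 7x11 cells of the 38x38 reference map:
print(normalized_pool(np.random.rand(7, 11, 256)).shape)   # (3, 3, 256)
```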
Through the above implementation, vehicle feature recognition for the toll lane video is realized.

Claims (4)

1. A method for realizing a video-based toll lane vehicle feature recognition system is characterized by comprising the following three steps of vehicle detection, vehicle tracking and vehicle feature recognition:
step S1, vehicle detection is carried out on the toll lane video based on the deep learning method; the feature map of each detected vehicle is normalized, pooled and stored, and the position and category information of each vehicle is stored at the same time:
s1.1) training a convolutional neural network for vehicle detection, and classifying the detected vehicles into 3 types, namely a bus, a truck and a car;
s1.2) each frame of the toll lane video is detected with the convolutional neural network; the detection output comprises the position and category of each vehicle, where the position refers to the vehicle's center-point coordinates, width and height, and the category is one of the 3 classes;
s1.3) the feature maps of the video images of the detected vehicles are normalized and pooled to obtain sub-feature maps, the vehicle positions and the vehicle types are saved as detection information, an ID is used as an index for each vehicle, and the saved information is expressed as:
content(id)={featuremap,loc,class} (1)
in the formula, featuremap represents the feature map, a 3x3x256-dimensional tensor; loc = (x, y, w, h) represents the position information, the four items being the center-point abscissa, center-point ordinate, vehicle width and vehicle height, all valued between 0 and 1; class = (cls1, cls2, cls3) represents the vehicle category, the three items being the cumulative number of frames so far in which the target has been identified as a car, a bus and a truck, respectively;
step S2, comparing the feature map similarity and the position of the detection information of the current frame and the detection information of the previous frame, marking the vehicles with similar comparison results as the same vehicle, and realizing the vehicle tracking function:
s2.1) the vehicle detection information of the previous frame is compared one by one with that of the current frame; a target whose feature map similarity and position distance both meet the set thresholds is regarded as the same vehicle target. For such a target, the vehicle's ID in the current frame is changed to its ID in the previous frame, and the record is updated with the current frame's detection information; this continues until the vehicle target no longer appears in the video frames, realizing target tracking. At that point the same vehicle corresponds to the same ID across multiple frames, and its detection information is that of the last video frame in which it was detected. If a target in the current frame did not appear in the previous frame, it is regarded as a vehicle newly appearing in the video, its ID in the current frame is taken as the vehicle ID, and a new track is started;
s2.2) carrying out weighted average on the current category and the categories of all historical frames belonging to the same vehicle target to obtain the final category of the vehicle, wherein the category average method is represented as follows:
cls=argmax(cls1,cls2,cls3) (3)
where argmax returns the index of the maximum value;
step S3, when a tracked vehicle passes through a polygonal region of interest marked in the video in advance, the normalized sub-feature map corresponding to the vehicle target is input into two deep learning sub-networks for vehicle type recognition and color recognition respectively, and all feature information is stored, realizing the toll lane vehicle feature recognition function:
s3.1) the position information of all vehicle targets in the current frame is examined; if a target lies in the region of interest, the sub-feature map corresponding to that target is extracted; whether the target lies in the region of interest is judged as follows: traverse the vertices of the region-of-interest polygon in order; if the total area of the sub-triangles formed by the region-of-interest vertices and the vehicle center point equals the area′ of the polygon, the point lies inside the region of interest, otherwise outside; the discriminant is expressed as:
area = Area(P, R_1, R_2) + Area(P, R_2, R_3) + … + Area(P, R_n, R_1)    (4)
area′ = Area(R_1, R_2, R_3) + Area(R_1, R_3, R_4) + … + Area(R_1, R_{n-1}, R_n)
where Area denotes the area of a triangle, P denotes the center point of the target, R_i denotes the ith vertex of the polygon in clockwise order, and n denotes the number of polygon vertices; if area equals area′, the target lies inside the polygon;
s3.2) the obtained sub-feature map is passed through two convolutional neural networks to respectively obtain color information and vehicle type classification information, and the two convolutional neural networks use the collected toll lane video as training data for identifying the vehicle color and vehicle type information;
and S3.3) the color and vehicle type information are stored under the corresponding vehicle ID; if a target with the same ID appears in subsequent frames, the color and vehicle type information will not be stored again, completing the recognition of toll lane vehicle features.
2. The method for implementing a video-based toll lane vehicle feature recognition system as claimed in claim 1, wherein the deep learning method used in step S1 is specifically:
carrying out vehicle detection on the frame images of the toll lane video with a Single Shot MultiBox Detector algorithm, the input being a 300x300 color image, wherein the convolutional neural network structure is specifically:
(1) the feature map scales used for detection are 10x10, 5x5, 3x3 and 1x1;
(2) convolution kernels of sizes 5x5, 3x3 and 1x1 are used for detection and connected in parallel; the kernels at the three scales are zero-padded so that the feature maps after convolution have the same size, the corresponding zero-padding scales being 2, 1 and 0 respectively;
(3) the loss function used in training is divided into a position regression loss and a classification loss, and the loss function is expressed as:
loss = loss_loc * 0.8 + loss_class * 0.2    (5)
In the formula, loss_loc represents the position regression loss, computed with the smoothL1 function; loss_class represents the classification loss, computed with the SoftMax function.
3. The method for implementing a video-based toll lane vehicle feature recognition system as claimed in claim 1, wherein the feature map normalization pooling method used in step S1 is specifically:
firstly, the feature map with scale 38x38 is selected as the reference feature map, and the vehicle size is mapped onto it to obtain a sub-feature map; the mapped sub-feature map is pooled with a variable pooling step and pooling kernel so that the pooled output feature maps have a unified size of 3x3; the pooling step and pooling kernel size are uniquely determined by the size of the sub-feature map, determined as:
s_w = ⌊W/3⌋, s_h = ⌊H/3⌋    (6)
In the formula, W and H are the width and height of the sub-feature map; the horizontal pooling step equals the pooling kernel width, both being s_w; the vertical pooling step equals the pooling kernel height, both being s_h; ⌊ ⌋ denotes rounding down.
4. The method for implementing the video-based toll lane vehicle feature recognition system according to claim 1, wherein in S2.1) the comparison for target tracking includes feature map similarity comparison and position distance comparison; the feature map similarity comparison computes the feature histogram distance, where the smaller the distance, the higher the similarity; the position distance is the Euclidean distance between center points, expressed as:
d = √((x1 − x2)² + (y1 − y2)²)    (2)
In the formula, (x1, y1) and (x2, y2) are the center-point coordinates of a vehicle in the current frame and of a vehicle in the previous frame, respectively.
CN201810705071.XA 2018-07-02 2018-07-02 Method for realizing video-based toll lane vehicle feature recognition system Active CN109190444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810705071.XA CN109190444B (en) 2018-07-02 2018-07-02 Method for realizing video-based toll lane vehicle feature recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810705071.XA CN109190444B (en) 2018-07-02 2018-07-02 Method for realizing video-based toll lane vehicle feature recognition system

Publications (2)

Publication Number Publication Date
CN109190444A CN109190444A (en) 2019-01-11
CN109190444B true CN109190444B (en) 2021-05-18

Family

ID=64948776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810705071.XA Active CN109190444B (en) 2018-07-02 2018-07-02 Method for realizing video-based toll lane vehicle feature recognition system

Country Status (1)

Country Link
CN (1) CN109190444B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886312B (en) * 2019-01-28 2023-06-06 同济大学 Bridge vehicle wheel detection method based on multilayer feature fusion neural network model
CN109902733A (en) * 2019-02-22 2019-06-18 北京三快在线科技有限公司 The method, apparatus and storage medium of typing Item Information
CN110223279B (en) * 2019-05-31 2021-10-08 上海商汤智能科技有限公司 Image processing method and device and electronic equipment
CN110516703A (en) * 2019-07-18 2019-11-29 平安科技(深圳)有限公司 Vehicle identification method, device and storage medium based on artificial intelligence
CN110517293A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN110555867B (en) * 2019-09-05 2023-07-07 杭州智爱时刻科技有限公司 Multi-target object tracking method integrating object capturing and identifying technology
CN110569785B (en) * 2019-09-05 2023-07-11 杭州智爱时刻科技有限公司 Face recognition method integrating tracking technology
CN111523419A (en) * 2020-04-13 2020-08-11 北京巨视科技有限公司 Video detection method and device for motor vehicle exhaust emission
CN112668497B (en) * 2020-12-30 2022-05-20 南京佑驾科技有限公司 Vehicle accurate positioning and identification method and system
CN113033449A (en) * 2021-04-02 2021-06-25 上海国际汽车城(集团)有限公司 Vehicle detection and marking method and system and electronic equipment
CN113371035B (en) * 2021-08-16 2021-11-23 山东矩阵软件工程股份有限公司 Train information identification method and system
TWI783723B (en) * 2021-10-08 2022-11-11 瑞昱半導體股份有限公司 Character recognition method, character recognition device and non-transitory computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2532075A (en) * 2014-11-10 2016-05-11 Lego As System and method for toy recognition and detection based on convolutional neural networks
CN105868700A (en) * 2016-03-25 2016-08-17 哈尔滨工业大学深圳研究生院 Vehicle type recognition and tracking method and system based on monitoring video
CN107066953A (en) * 2017-03-22 2017-08-18 北京邮电大学 It is a kind of towards the vehicle cab recognition of monitor video, tracking and antidote and device
CN107133974A (en) * 2017-06-02 2017-09-05 南京大学 The vehicle type classification method that Gaussian Background modeling is combined with Recognition with Recurrent Neural Network
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157441B2 (en) * 2016-12-27 2018-12-18 Automotive Research & Testing Center Hierarchical system for detecting object with parallel architecture and hierarchical method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2532075A (en) * 2014-11-10 2016-05-11 Lego As System and method for toy recognition and detection based on convolutional neural networks
CN105868700A (en) * 2016-03-25 2016-08-17 哈尔滨工业大学深圳研究生院 Vehicle type recognition and tracking method and system based on monitoring video
CN107066953A (en) * 2017-03-22 2017-08-18 北京邮电大学 It is a kind of towards the vehicle cab recognition of monitor video, tracking and antidote and device
CN107133974A (en) * 2017-06-02 2017-09-05 南京大学 The vehicle type classification method that Gaussian Background modeling is combined with Recognition with Recurrent Neural Network
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep Fusion Feature for Vehicle Classification and Recognition; Hong Qiao et al.; 2018 2nd IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference; 2018-05-27; 1364-1371 *
Moving Vehicle Video Detection Method Based on Convolutional Neural Networks; 陈林凯 et al.; Proceedings of the 2016 National Conference on Communications Software; 2016-06-24; 52-57 *
Transfer Learning Algorithm for Visual Vehicle Recognition; 蔡英凤 et al.; Journal of Southeast University (Natural Science Edition); 2015-04-30; 275-280 *

Also Published As

Publication number Publication date
CN109190444A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109190444B (en) Method for realizing video-based toll lane vehicle feature recognition system
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN110069986B (en) Traffic signal lamp identification method and system based on hybrid model
CN109977782B (en) Cross-store operation behavior detection method based on target position information reasoning
CN110097044B (en) One-stage license plate detection and identification method based on deep learning
WO2017190574A1 (en) Fast pedestrian detection method based on aggregation channel features
CN105809184B (en) Method for real-time vehicle identification and tracking and parking space occupation judgment suitable for gas station
CN109902806A (en) Method is determined based on the noise image object boundary frame of convolutional neural networks
WO2021238019A1 (en) Real-time traffic flow detection system and method based on ghost convolutional feature fusion neural network
CN109726717B (en) Vehicle comprehensive information detection system
CN103824081B (en) Method for detecting rapid robustness traffic signs on outdoor bad illumination condition
CN106682586A (en) Method for real-time lane line detection based on vision under complex lighting conditions
Zhang et al. Study on traffic sign recognition by optimized Lenet-5 algorithm
CN107315998B (en) Vehicle class division method and system based on lane line
Shi et al. A vision system for traffic sign detection and recognition
Li et al. Robust vehicle detection in high-resolution aerial images with imbalanced data
CN112651293B (en) Video detection method for road illegal spreading event
CN111915583A (en) Vehicle and pedestrian detection method based on vehicle-mounted thermal infrared imager in complex scene
CN112560852A (en) Single-stage target detection method with rotation adaptive capacity based on YOLOv3 network
CN112381870A (en) Ship identification and navigational speed measurement system and method based on binocular vision
CN111860509A (en) Coarse-to-fine two-stage non-constrained license plate region accurate extraction method
CN114049572A (en) Detection method for identifying small target
Xu et al. Convolutional neural network based traffic sign recognition system
Liu et al. A large-scale benchmark for vehicle logo recognition
Wang et al. Vehicle key information detection algorithm based on improved SSD

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant