CN111862147A - Method for tracking multiple vehicles and multiple pedestrian targets in a video - Google Patents


Info

Publication number
CN111862147A
Authority
CN
China
Prior art keywords
target
tracking
frame
targets
matching
Prior art date
Legal status
Granted
Application number
CN202010496840.7A
Other languages
Chinese (zh)
Other versions
CN111862147B (en)
Inventor
许银翠
范圣印
熊敏
单丰武
姜筱华
陈立伟
朱祖伟
弥博文
龚朋朋
Current Assignee
Jiangxi Jiangling Group New Energy Automobile Co Ltd
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Jiangxi Jiangling Group New Energy Automobile Co Ltd
Beijing Yihang Yuanzhi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jiangxi Jiangling Group New Energy Automobile Co Ltd, Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202010496840.7A
Publication of CN111862147A
Application granted
Publication of CN111862147B
Legal status: Active
Anticipated expiration


Classifications

    • G06T 7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/277: Image analysis; analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/20024: Special algorithmic details; filtering details
    • Y02T 10/40: Road transport; engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A method and device for tracking multiple vehicle and pedestrian targets in a video are provided. A multi-feature apparent modeling method based on the target category constructs different feature extraction operators from the detected target categories to obtain an accurate apparent description of each target. A hierarchical progressive feature extraction algorithm and a high-threshold matching algorithm quickly reduce the dimensionality of the association matrix, which reduces the number of depth feature extractions and guarantees timeliness of the computation. The relative orientation constraint relation between targets assists the apparent features in completing target matching, effectively distinguishing targets with similar appearance and reducing the mismatching rate. Missed detections are recovered using the targets' apparent features and motion prediction, greatly improving the recovery rate of missed targets. The method has high accuracy, strong adaptability and high efficiency.

Description

Method for tracking multiple vehicles and multiple pedestrian targets in a video
Technical Field
The invention relates to the technical field of computer vision, in particular to a method and a device for tracking multiple vehicles and multiple pedestrian targets in a video.
Background
Video multi-target tracking is a research hotspot in the field of computer vision and an important component of many intelligent vision systems. Its main tasks are locating targets and maintaining target IDs. The video sequences used for target tracking are obtained by projecting the 3D real world onto a 2D image plane through a video acquisition device, which inevitably introduces information loss, and the quality of the video sequence is further affected by illumination changes, scene changes and noise during imaging. Besides the blurred target appearance caused by video degradation, video multi-target tracking also faces challenges such as complex and changing backgrounds, varying target poses, frequent occlusion between targets, and difficulty in distinguishing targets with similar appearance. Therefore, designing a video multi-target tracking algorithm that fully exploits target appearance information and the information between consecutive frames, adapts to complex environments and meets diverse applications has important theoretical significance and practical application value.
An online video multi-target tracking algorithm does not rely on information from subsequent video frames and can directly output a multi-target tracking result for the currently input frame. Such tracking algorithms formulate the association between consecutive frames as a bipartite-graph matching problem and solve it with association algorithms such as the Hungarian algorithm.
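By way of illustration only (this sketch is not part of the patent text), such a bipartite formulation can be solved with SciPy's implementation of the Hungarian algorithm; the cost matrix below is hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost matrix: rows are tracks from the previous frame,
# columns are detections in the current frame; lower cost = better match.
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.6],
])

track_idx, det_idx = linear_sum_assignment(cost)  # Hungarian / Kuhn-Munkres
for t, d in zip(track_idx, det_idx):
    print(f"track {t} -> detection {d} (cost {cost[t, d]:.2f})")
```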
To assess the state of the art, existing patents and papers were searched, compared and analyzed. The technical schemes most relevant to the present invention are as follows:
Technical scheme 1: patent CN108932730A, a video multi-target tracking method and system based on data association, mainly consists of the following four steps: 1) calculating the similarity between each target in the current frame image and each target in the previous frame image and establishing a similarity matrix; 2) establishing a cost matrix with the targets of the current frame and of the previous frame as rows or columns and initializing its elements to 0; 3) setting a similarity threshold and assigning the cost matrix according to the comparison between the elements of the similarity matrix and the threshold; 4) judging the state of each target in the two frames according to the value of each element of the cost matrix. Although this data-association-based method can simply and effectively judge the appearance, disappearance, merging and separation of targets, it cannot handle the case where frequent occlusion makes the similarity too low for targets to be matched correctly, and it cannot recover targets missed by the detection algorithm.
Technical scheme 2: patent CN109859238A, an online multi-target tracking method based on multi-feature optimal association, mainly extracts apparent features of a target through a CNN, extracts depth features through a deep network, predicts motion features with a Kalman filter tracker, computes the similarity between the detection sequence set and the tracking sequence set based on the multi-feature model, constructs an association matrix with a hierarchical strategy, and solves and updates the optimal association matrix to achieve multi-target tracking. The method fuses multiple target features and improves multi-target tracking accuracy and precision under relative motion, but because depth feature extraction is time-consuming, tracking timeliness is hard to guarantee once the number of targets in the video increases, and the method cannot recover targets missed by the detection algorithm.
Technical scheme 3: patent CN109919981A discloses a multi-target tracking method based on Kalman-filter-assisted multi-feature fusion. The method judges the occlusion of a target from the coordinates of its center point and its size, and handles targets with different degrees of occlusion separately: 1) if the occluded part is small or absent, the centroid coordinates of the detection frame and the preprocessed video frame are input into a pre-trained convolutional neural network, shallow and deep semantic information of the target is extracted and concatenated into a feature matrix, and similarity estimation between the feature matrices of the two frames yields the optimal track; 2) if the detected target is severely occluded, the centroid coordinates of the detection frame are input into a Kalman filter, the position of the target in the next frame is estimated from its previous motion state, and the estimated coordinates are compared with the actual detection result to obtain the optimal track. Handling differently occluded targets separately alleviates the occlusion problem, but using a deep neural network as the feature extractor makes the computation expensive and the tracking efficiency low. Moreover, the method matches targets between consecutive frames directly with the Hungarian algorithm, which can cause mismatches when targets look similar. In addition, like technical schemes 1 and 2, it has no strategy for dealing with missed detections and cannot recover targets missed by the detection algorithm.
Technical scheme 4: the paper Structural constraint data association for online tracking proposes realizing multi-target tracking by using structured position constraints between targets to assist data association. Its data association consists of 3 steps: 1) determining every possible target association combination from the positional relation of the targets in the two frames; 2) for each association combination, selecting each target in turn as an anchor point, recovering the positions of the other targets in the current frame from the anchor and the structured position constraints, computing the matching cost of that anchor, and finally fusing the matching costs of all anchors as the matching cost of the association combination; 3) comparing the matching costs of all association combinations and selecting the one with the minimum cost as the final target matching result. By constraining and predicting target positions with the structured positions between targets, the method can cope with overall target offset caused by camera shake; however, when the motion differences between targets are large, directly using the structured topology of the previous frame cannot accurately predict the position of a missed target in the current frame, so the recovery rate of missed detections is low. The method also extracts only color histogram information, which is insufficient for complicated target and background changes.
Technical schemes 1, 2 and 3 establish associations between targets using apparent features, motion features and the like, which solves part of the multi-target tracking problem, but they ignore the positional association between targets across frames and have no recovery strategy when the detection algorithm misses a target. Technical schemes 2 and 3 integrate depth features and further improve tracking precision, but frequently extracting depth features for every target severely affects the speed of the tracking algorithm. Technical scheme 4 uses structured position constraints between targets to assist data association, which alleviates overall target offset and missed detection to some extent, but the recovery of missed targets is not ideal when the motion differences between targets are large; moreover, it extracts only color features, which are sensitive to illumination, so the robustness of the algorithm is low.
In automatic driving, application scenes are complex and changeable and the vehicle travels fast, so the target and the background change rapidly and irregularly, which significantly increases the difficulty of detecting and tracking multiple vehicles and multiple pedestrians in video; existing methods find it hard to achieve tracking accuracy and real-time performance simultaneously. A new method is therefore needed that guarantees tracking precision, adapts to complicated target and background changes, and at the same time guarantees real-time performance without extra computational overhead.
Disclosure of Invention
In view of these technical problems, the invention aims to design a multi-target tracking method with high accuracy and strong adaptability that efficiently tracks pedestrian and vehicle targets in video. To address the difficulty that a single feature cannot cope with complex target and background changes, the invention provides a multi-feature apparent modeling method based on the target class, extracting different apparent features for different classes of targets so as to improve the descriptive power of the features. To address the low timeliness of tracking caused by time-consuming depth feature extraction, the invention provides a hierarchical progressive feature extraction algorithm that reduces the number of depth feature extractions as far as possible while preserving tracking accuracy. To address the difficulty of distinguishing targets with similar appearance, the invention proposes using the relative orientation relation between targets to further assist the apparent features in completing correct matching. To address missed detections produced by the detection algorithm, the invention proposes recovering missed targets with the target apparent features and motion prediction under a correlation-filter framework, improving the recovery rate of missed targets.
The invention provides a method for tracking pedestrian and vehicle targets in a video, in which matching of targets between consecutive frames is completed using multiple apparent features together with the orientation constraint relation between targets, a hierarchical progressive matching strategy is adopted to reduce the number of feature extractions, and missed targets are recovered based on the idea of correlation-filter tracking, finally forming a method for tracking pedestrian and vehicle targets in video with high accuracy and strong adaptability.
To solve the above technical problem, according to an aspect of the present invention, there is provided a method for tracking multiple vehicles and multiple pedestrian targets in a video, comprising the steps of:
step 1), data acquisition: acquiring video data;
step 2), target matching: aiming at the video data, establishing an incidence matrix of a detection result and a tracking result, extracting apparent characteristics based on target categories from the detection result, and performing target matching in a layered and progressive manner;
step 3), auxiliary matching: further matching unsuccessfully matched apparent similar targets in target matching by using the orientation constraint relation between the targets;
step 4), recovering the missed detection target: recovering undetected targets of the frame by utilizing motion prediction and apparent characteristics;
step 5), outputting a tracking result: and maintaining a tracking chain, updating the orientation constraint relation between the targets, and outputting the tracking result of the current frame.
Preferably, the acquiring video data comprises acquiring video data in real time.
Preferably, the acquiring video data includes reading the video data from a file.
Preferably, the acquiring video data comprises capturing video data with a camera mounted on the autonomous vehicle.
Preferably, the target matching further comprises:
Step 1.1), respectively establishing a vehicle target incidence matrix and a pedestrian target incidence matrix according to the target types.
Preferably, the step 1.1) of respectively establishing the vehicle target correlation matrix and the pedestrian target correlation matrix according to the target categories includes:
the current-frame detection target sequence set D shown in formula (1)
D = {d1, d2, …, di, …, dM-1, dM} (1)
is taken as the rows (or columns), and the previous-frame tracking target sequence set T shown in formula (2)
T = {t1, t2, …, tj, …, tN-1, tN} (2)
is taken as the columns (or rows); an M × N correlation matrix is established, M and N being natural numbers. Each element Aij of the correlation matrix represents the association result between the detection target di shown in formula (3)
di = {typei, xi, yi, wi, hi} (i = 1, 2, …, M) (3)
and the tracking target tj shown in formula (4)
tj = {idj, xj, yj, wj, hj, Δxj, Δyj, Δwj, Δhj} (j = 1, 2, …, N) (4)
(initialized as Aij = 1; Aij = 1 means that di and tj are associated, otherwise they are unassociated), wherein typei is the class of di, {xi, yi} are the coordinates of the center point of the target frame of di, {wi, hi} are the width and height of the target frame of di, i.e. the size of the target frame, idj is the ID of tj, {xj, yj} are the coordinates of the center point of the target frame of tj, {wj, hj} are the width and height of the target frame of tj, {Δxj, Δyj} is the movement speed of tj, and {Δwj, Δhj} is the width and height variation of tj.
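As an informal sketch (not part of the patent text), the detection and tracking records of formulas (3) and (4) and the per-category correlation matrix could be represented roughly as follows; all field and function names are illustrative:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:           # di = {type, x, y, w, h}, formula (3)
    type: str              # "vehicle" or "pedestrian"
    x: float
    y: float               # target-frame center point
    w: float
    h: float               # target-frame size

@dataclass
class Track:               # tj = {id, x, y, w, h, dx, dy, dw, dh}, formula (4)
    id: int
    x: float
    y: float
    w: float
    h: float
    dx: float
    dy: float              # movement speed
    dw: float
    dh: float              # width/height variation

def build_association_matrix(dets, tracks):
    """One M x N matrix per target category, initialised to 1 (associated)."""
    return np.ones((len(dets), len(tracks)), dtype=np.uint8)
```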
Preferably, the target matching further comprises:
step 1.2), obtaining the predicted position in the current frame of each target from the previous frame by using Kalman filtering motion prediction.
Preferably, the obtaining of the predicted position of each target in the previous frame in the current frame by using Kalman filtering motion prediction comprises:
the predicted position (x̂j, ŷj) of each previous-frame target tj (j = 1, 2, …, N) in the current frame is obtained by Kalman filtering motion prediction; a circular correlation gate is established with (x̂j, ŷj) as its center and the radius R given by formula (5); the m (m ≤ M) detection targets d1, …, dm whose target-frame center coordinates {xi, yi} fall within the correlation gate are associated to the tracking target tj, i.e. for the j-th column the corresponding elements Aij are kept at 1 and the remaining elements of the column are set to 0, thereby sparsifying the matrix.
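A minimal sketch of this gating step follows (illustrative only; the exact radius of formula (5) is not reproduced in this text, so a size-based placeholder radius is assumed). It reuses the Detection/Track records sketched above:

```python
import numpy as np

def gate_matrix(A, dets, tracks, predicted, radius_fn=None):
    """Zero out associations whose detection center falls outside the circular
    correlation gate centered on the track's Kalman-predicted position."""
    if radius_fn is None:
        # Placeholder gate radius: half the diagonal of the predicted box.
        # Formula (5) of the patent is not reproduced here; this is an assumption.
        radius_fn = lambda t: 0.5 * np.hypot(t.w + t.dw, t.h + t.dh)
    for j, t in enumerate(tracks):
        px, py = predicted[j]                     # Kalman-predicted center of tj
        R = radius_fn(t)
        for i, d in enumerate(dets):
            if np.hypot(d.x - px, d.y - py) > R:  # detection outside the gate
                A[i, j] = 0
    return A
```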
Preferably, the target matching further comprises:
step 1.3), for every target association pair with Aij ≠ 0, calculating the target frame similarity Fs and the target frame overlap degree Fiou of the corresponding detection target di and tracking target tj, and obtaining the target frame overall similarity Fbox shown in formula (6):
Fbox = λbox·Fs + (1 – λbox)·Fiou, (6)
wherein λbox is the weight of the target frame similarity Fs in the target frame overall similarity Fbox;
each association pair in the correlation matrix is updated as shown in formula (7):
Aij = 1, if Fbox ≥ Tbox; Aij = 0, otherwise, (7)
where Tbox is the target frame overall similarity threshold. For each association pair satisfying Aij = 1, the number of elements equal to 1 in the i-th row and in the j-th column is counted; if Aij is the only element equal to 1 in both its row and its column, the detection target di and the tracking target tj have been successfully matched; the tracking target tj is updated with the target frame center point coordinates {xi, yi} and size {wi, hi} of the detection target di, the updated target tj is saved, and the i-th row and the j-th column of the correlation matrix are deleted.
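The high-threshold confirmation used at each layer might be sketched as below (assuming the binary thresholding reading of formula (7); not part of the patent text):

```python
import numpy as np

def confirm_unique_matches(A, similarity, threshold):
    """Keep Aij = 1 only where the similarity exceeds the threshold, then confirm
    pairs that are the only remaining candidate in both their row and column."""
    A = (A != 0) & (similarity >= threshold)
    confirmed = []
    for i, j in zip(*np.nonzero(A)):
        if A[i, :].sum() == 1 and A[:, j].sum() == 1:   # unique in row and column
            confirmed.append((int(i), int(j)))
    # Confirmed rows/columns are removed before the next, more expensive layer.
    for i, j in confirmed:
        A[i, :] = 0
        A[:, j] = 0
    return confirmed, A
```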
Preferably, the target frame similarity Fs in step 1.3) is calculated as shown in formula (8).
Preferably, the target frame overlap degree Fiou in step 1.3) is calculated as shown in formula (9):
Fiou = area(di ∩ tj) / (area(di) + area(tj) – area(di ∩ tj)), (9)
wherein area(di) denotes the area of the target frame of the detection target di, area(tj) denotes the area of the target frame of the tracking target tj, and area(di ∩ tj) denotes the area of their overlapping region.
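An illustrative sketch of the box-level similarity fusion of formula (6) follows; since the exact form of Fs in formula (8) is not reproduced here, a simple size-ratio similarity is assumed in its place:

```python
def iou(d, t):
    """Overlap Fiou between a detection box and a track box given as center/size."""
    dx1, dy1, dx2, dy2 = d.x - d.w / 2, d.y - d.h / 2, d.x + d.w / 2, d.y + d.h / 2
    tx1, ty1, tx2, ty2 = t.x - t.w / 2, t.y - t.h / 2, t.x + t.w / 2, t.y + t.h / 2
    iw = max(0.0, min(dx2, tx2) - max(dx1, tx1))
    ih = max(0.0, min(dy2, ty2) - max(dy1, ty1))
    inter = iw * ih
    return inter / (d.w * d.h + t.w * t.h - inter + 1e-9)

def box_similarity(d, t):
    """Assumed stand-in for Fs (formula (8)): similarity of box sizes."""
    return (min(d.w, t.w) / max(d.w, t.w)) * (min(d.h, t.h) / max(d.h, t.h))

def f_box(d, t, lam_box=0.5):
    """Fbox = lam_box * Fs + (1 - lam_box) * Fiou, formula (6)."""
    return lam_box * box_similarity(d, t) + (1 - lam_box) * iou(d, t)
```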
Preferably, the target matching further comprises:
step 1.4), for the detection targets di (i = 1, 2, …, m) remaining in the correlation matrix, computing the apparent feature a1, and fusing the target frame overall similarity Fbox and the apparent similarity Fa1 into the comprehensive similarity Fc1 shown in formula (10):
Fc1 = λ1·Fbox + (1 – λ1)·Fa1, (10)
wherein λ1 is the weight of the target frame overall similarity Fbox in the comprehensive similarity Fc1.
Preferably, each association pair in the correlation matrix is updated as shown in formula (11):
Aij = 1, if Fc1 ≥ Tc1; Aij = 0, otherwise, (11)
where Tc1 is a comprehensive similarity threshold. For each association pair satisfying Aij = 1, the number of elements equal to 1 in the i-th row and in the j-th column is counted; if Aij is the only element equal to 1 in both its row and its column, the detection target di and the tracking target tj have been successfully matched; the tracking target tj is updated with the target frame center point coordinates {xi, yi} and size {wi, hi} of the detection target di, the updated target tj is saved, the i-th row and the j-th column of the correlation matrix are deleted, and the apparent feature a1 of the target is updated.
Preferably, the apparent feature a1 and its similarity measure are calculated as follows:
the gradient feature is an important apparent feature for pedestrian and vehicle targets, so the apparent feature a1 of the present invention is selected as the Histogram of Oriented Gradients (HOG) feature, which has strong descriptive power. An image slice is cut from the detection target frame of the current frame image and input into an HOG feature extractor to obtain the HOG feature of the detection target; a point-multiplication operation with the HOG feature of the previous-frame tracking target yields a feature response map; the maximum pixel value in the feature response map represents the HOG feature similarity of the detection target and the tracking target, and is normalized to obtain the apparent similarity Fa1.
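A rough sketch of the HOG-based appearance similarity Fa1 (illustrative only): here skimage's HOG descriptor is used and the response is approximated by a normalized correlation of the two descriptors, whereas the patent computes a full response map:

```python
import cv2
import numpy as np
from skimage.feature import hog

def hog_descriptor(patch, size=(64, 128)):
    """HOG descriptor of an image slice cut from a target frame (assumed shape)."""
    patch = cv2.resize(patch, size)
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def hog_similarity(det_patch, track_hog):
    """Approximate Fa1: normalized correlation between the detection's HOG
    descriptor and the stored HOG descriptor of the tracked target."""
    det_hog = hog_descriptor(det_patch)
    denom = np.linalg.norm(det_hog) * np.linalg.norm(track_hog) + 1e-9
    return float(det_hog @ track_hog / denom)
```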
Preferably, the target matching further comprises:
step 1.5), calculating the apparent feature a2 of the detection targets di (i = 1, 2, …, m) remaining in the correlation matrix, where m is the total number of detection targets, and fusing the comprehensive similarity Fc1 and the apparent similarity Fa2 into the comprehensive similarity Fc2 shown in formula (12):
Fc2 = λ2·Fc1 + (1 – λ2)·Fa2, (12)
wherein λ2 is the weight of the comprehensive similarity Fc1 in the comprehensive similarity Fc2.
Preferably, each association pair in the correlation matrix is updated as shown in formula (13):
Aij = 1, if Fc2 ≥ Tc2; Aij = 0, otherwise, (13)
where Tc2 is a comprehensive similarity threshold. For each association pair satisfying Aij = 1, the number of elements equal to 1 in the i-th row and in the j-th column is counted; if Aij is the only element equal to 1 in both its row and its column, the detection target di and the tracking target tj have been successfully matched; the tracking target tj is updated with the target frame center point coordinates {xi, yi} and size {wi, hi} of the detection target di, the updated target tj is saved, the i-th row and the j-th column of the correlation matrix are deleted, and the apparent features a1 and a2 of the target are updated.
Preferably, the apparent feature a2 and its similarity measure are calculated as follows:
owing to the inter-class differences between pedestrian and vehicle targets, different apparent features a2 are extracted for vehicle targets and pedestrian targets respectively according to the target class. Preferably, the apparent feature a2 is selected as the depth feature learned by a Resnet18 deep neural network; the Resnet18 network is trained separately for vehicle targets and for pedestrian targets to capture the differences between distinct targets of the same class and the property that the same target remains consistent over time. An image slice is cut from the detection target frame of the current frame image and input into the depth feature extractor to obtain the depth feature vector of the detection target; the Euclidean distance between this vector and the depth feature vector of the previous-frame tracking target is calculated and normalized to obtain the apparent similarity Fa2.
Preferably, the base network of the depth feature extractor is Resnet18 with the last fully connected classification layer removed, and is used to extract the global feature of the target.
Preferably, the pedestrian target feature map is horizontally cut into upper-body and lower-body parts to obtain local features.
Preferably, the vehicle target feature map is horizontally cut into upper and lower halves and vertically cut into left and right halves to obtain local features; the global feature and the local features are then subjected to global maximum pooling and global mean pooling, where global maximum pooling captures the salient characteristics of the target and global mean pooling captures the background and contour information of the image slice.
Preferably, the global features obtained from global maximum pooling and global mean pooling are added to give a 2048-dimensional feature, which is reduced to 256 dimensions by a one-dimensional convolution; the local features obtained from global maximum pooling and global mean pooling are added to give a 1024-dimensional feature, which is likewise reduced to 256 dimensions by a one-dimensional convolution.
Preferably, for the pedestrian target the dimension-reduced features are concatenated into a depth feature vector of 3 × 256 = 768 dimensions.
Preferably, for the vehicle target the dimension-reduced features are concatenated into a depth feature vector of 5 × 256 = 1280 dimensions.
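A schematic PyTorch sketch of this category-specific depth feature extractor follows; the part splitting and the 256-dimensional reductions follow the text, while the backbone channel count and pooling details are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PartDepthFeature(nn.Module):
    """Global + part-level depth features, each reduced to 256-d and concatenated:
    3 x 256 = 768-d for pedestrians, 5 x 256 = 1280-d for vehicles."""

    def __init__(self, target_class="pedestrian"):
        super().__init__()
        backbone = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.target_class = target_class
        c = 512  # channel count of the last Resnet18 feature map (assumption)
        self.reduce_global = nn.Conv2d(c, 256, kernel_size=1)
        self.reduce_local = nn.Conv2d(c, 256, kernel_size=1)

    def _pool(self, fmap):
        # Sum of global maximum pooling (salient parts) and global mean pooling
        # (background / contour information), as described in the text.
        gmp = torch.amax(fmap, dim=(2, 3), keepdim=True)
        gap = torch.mean(fmap, dim=(2, 3), keepdim=True)
        return gmp + gap

    def forward(self, x):                       # x: image slice, (B, 3, H, W)
        fmap = self.backbone(x)                 # (B, 512, h, w)
        feats = [self.reduce_global(self._pool(fmap))]
        h, w = fmap.shape[2], fmap.shape[3]
        parts = [fmap[:, :, : h // 2], fmap[:, :, h // 2 :]]             # upper / lower
        if self.target_class == "vehicle":
            parts += [fmap[:, :, :, : w // 2], fmap[:, :, :, w // 2 :]]  # left / right
        feats += [self.reduce_local(self._pool(p)) for p in parts]
        return torch.cat([f.flatten(1) for f in feats], dim=1)  # 768-d or 1280-d
```

The 2048-/1024-dimensional intermediate sizes quoted above would correspond to a wider backbone; the sketch keeps Resnet18's native 512 channels and only preserves the 256-dimensional reduced parts and the final 768-/1280-dimensional concatenation.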
Preferably, the auxiliary matching further comprises:
further matching the apparently similar targets using the inter-target orientation constraint relation.
Preferably, the further matching of the apparently similar targets using the inter-target orientation constraint comprises:
for the N tracking targets tj (j = 1, 2, …, N) in the video sequence, each target encodes and maintains its relative orientation relation with the other N – 1 targets
RPj = {rp1, …, rpj-1, rpj+1, …, rpN}, (14)
wherein rpj-1 represents the relative orientation of the target tj and the target tj-1.
Preferably, the association pairs in the correlation matrix are updated according to formula (15), with Ta used as an association threshold to screen out the association pairs with lower apparent similarity:
Aij = 1, if Fc2 ≥ Ta; Aij = 0, otherwise. (15)
Preferably, the targets whose elements Aij in the correlation matrix are not equal to 0 are regarded as apparently similar targets; the relative orientations of the n apparently similar targets are encoded, the orientation matching cost Crp is calculated, and the matching result with the minimum orientation matching cost is taken as the final matching result.
Preferably, the orientation matching cost Crp is calculated as shown in formula (17):
Crp = Crp1 + Crp2 + … + Crpn, (17)
wherein Crpj is the orientation matching cost between the target tj and the other n – 1 apparently similar targets, calculated as shown in formula (18) by comparing RPj, the relative orientation relation of the previous-frame tracking target tj with the other N – 1 targets, with RP'j, the relative orientation relation of the current-frame tracking target tj with the other N – 1 targets.
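An illustrative sketch of the orientation-constraint check follows; the 8-way direction encoding and the counting form of formulas (17) and (18) are assumptions (FIG. 3 defines the actual encoding), and the target records are assumed to expose center coordinates x and y as in the earlier sketches:

```python
import math
from itertools import permutations

def relative_orientation(a, b, bins=8):
    """Assumed encoding: discretise the direction from target a to target b."""
    ang = math.atan2(b.y - a.y, b.x - a.x) % (2 * math.pi)
    return int(ang / (2 * math.pi / bins))

def orientation_cost(prev_targets, curr_targets, assignment):
    """Assumed Crp for one candidate assignment of apparently similar targets:
    number of pairwise relative orientations that change between the previous
    frame and the current frame under this assignment."""
    cost, n = 0, len(prev_targets)
    for j in range(n):
        for k in range(n):
            if j == k:
                continue
            prev_rp = relative_orientation(prev_targets[j], prev_targets[k])
            curr_rp = relative_orientation(curr_targets[assignment[j]],
                                           curr_targets[assignment[k]])
            cost += int(prev_rp != curr_rp)
    return cost

def best_orientation_match(prev_targets, curr_targets):
    """Enumerate candidate assignments and keep the one with minimum Crp."""
    n = len(prev_targets)
    return min(permutations(range(n)),
               key=lambda a: orientation_cost(prev_targets, curr_targets, a))
```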
Preferably, for each unmatched tracking target tj in the correlation matrix that satisfies formula (19), i.e. whose column contains no remaining associated detection,
A1j + A2j + … + AMj = 0, (19)
the missed detection target is recovered by combining motion prediction with the apparent features.
Preferably, the predicted position (x̂j, ŷj) of the target tj (j = 1, 2, …, n) in the current frame is calculated, and the missed detection target is further recovered at this position in combination with the apparent features.
Preferably, a candidate frame is determined with the predicted position (x̂j, ŷj) of the target tj as its center point, with a width given by formula (20) and a height given by formula (21).
Preferably, the image slice inside the candidate frame of the current frame image is cut out and input into the HOG feature extractor to obtain the HOG feature of the target in the candidate frame; a point-multiplication operation with the HOG feature of the unmatched tracking target yields a feature response map; the maximum pixel value in the feature response map represents the HOG feature similarity of the detection target and the tracking target, and is compared with an HOG feature similarity threshold to determine the target state.
Preferably, if the HOG feature similarity is greater than the HOG feature similarity threshold, the coordinate position of the maximum pixel value in the feature response map is mapped back to the coordinate position (x*j, y*j) in the original image; the tracking target tj is updated with this coordinate position and the size wj + Δwj, hj + Δhj, and the apparent features a1 and a2 of the target are updated at the same time.
Preferably, if the HOG feature similarity is smaller than the HOG feature similarity threshold, the tracking target tj is considered to be occluded or to have disappeared, and the target tj is not updated.
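A condensed sketch of this recovery step (illustrative only; it reuses the hog_similarity helper assumed earlier, stores a track's HOG descriptor in a hypothetical track.hog attribute, replaces formulas (20) and (21) with a simple enlarged search window, and simplifies the maximum-response location to the predicted center):

```python
def recover_missed_target(frame, track, predicted_center,
                          hog_sim_threshold=0.6, pad=1.5):
    """Search an enlarged candidate box around the Kalman-predicted position and
    accept it only if the HOG response is high enough; otherwise the target is
    treated as occluded or disappeared and left unchanged."""
    px, py = predicted_center
    w, h = track.w + track.dw, track.h + track.dh
    cw, ch = pad * w, pad * h                              # assumed search-window size
    x1, y1 = int(max(px - cw / 2, 0)), int(max(py - ch / 2, 0))
    x2, y2 = int(px + cw / 2), int(py + ch / 2)
    patch = frame[y1:y2, x1:x2]                            # frame: numpy image (H, W, 3)
    if patch.size == 0:
        return None
    sim = hog_similarity(patch, track.hog)                 # response of the candidate box
    if sim < hog_sim_threshold:
        return None                                        # occluded or disappeared
    track.x, track.y, track.w, track.h = px, py, w, h      # recover the missed target
    return track
```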
Preferably, processing the tracking chain of the unmatched tracking target and the unmatched detection target, updating the orientation constraint relation between the targets to prepare for the next frame of tracking, and outputting the target tracking result of the current frame.
Preferably, a tracking target tj whose HOG feature similarity is lower than the HOG feature similarity threshold is marked as a target to be confirmed and added to the correlation matrix of the next frame; if it is successfully matched in the next frame, the target is updated, otherwise the target is considered occluded and its count of consecutive unmatched frames is increased; once the count of consecutive unmatched frames reaches the consecutive-unmatched threshold, the target is considered to have disappeared and is removed from the correlation matrix.
Preferably, each unmatched detection target di that satisfies formula (22), i.e. whose row contains no remaining associated tracking target,
Ai1 + Ai2 + … + AiN = 0, (22)
is marked as a target to be initialized and added to the correlation matrix of the next frame; if it is successfully matched in the next frame, it is initialized as a new target, otherwise it is considered a false-alarm target and removed from the correlation matrix.
Preferably, the orientation constraint relation between the targets is updated, and the tracking result of the current frame is output.
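A compact sketch of the tracking-chain bookkeeping described above (state names, counters and the threshold value are illustrative):

```python
def maintain_tracking_chain(unmatched_tracks, unmatched_detections, max_misses=3):
    """Unmatched tracks become 'to be confirmed' and are dropped after repeated
    misses; unmatched detections become 'to be initialised' and are promoted to
    new targets only if they match again in the next frame."""
    survivors, candidates = [], []
    for t in unmatched_tracks:
        t.misses = getattr(t, "misses", 0) + 1
        t.state = "to_be_confirmed"
        if t.misses < max_misses:          # otherwise the target is considered gone
            survivors.append(t)
    for d in unmatched_detections:
        d.state = "to_be_initialised"      # false alarm unless matched next frame
        candidates.append(d)
    return survivors, candidates
```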
In order to solve the above technical problem, according to another aspect of the present invention, there is provided an apparatus for tracking multiple vehicles and multiple human targets in a video, comprising:
a data acquisition device: for acquiring video data;
a target matching device: aiming at the video data, establishing an incidence matrix of a detection result and a tracking result, extracting apparent characteristics based on target categories from the detection result, and performing target matching in a layered and progressive manner;
Auxiliary matching device: further matching unsuccessfully matched apparent similar targets in target matching by using the orientation constraint relation between the targets;
missed detection target recovery means: recovering undetected targets of the frame by utilizing motion prediction and apparent characteristics;
a tracking result output device: and maintaining a tracking chain, updating the orientation constraint relation between the targets, and outputting the tracking result of the current frame.
Preferably, the acquiring video data comprises acquiring video data in real time.
Preferably, the acquiring video data includes reading the video data from a file.
Preferably, the acquiring video data comprises capturing video data with a camera mounted on the autonomous vehicle.
Preferably, the target matching further comprises:
and respectively establishing a vehicle target incidence matrix and a pedestrian target incidence matrix according to the target types.
Preferably, the respectively establishing a vehicle target correlation matrix and a pedestrian target correlation matrix according to the target categories includes:
the current-frame detection target sequence set D shown in formula (1)
D = {d1, d2, …, di, …, dM-1, dM} (1)
is taken as the rows (or columns), and the previous-frame tracking target sequence set T shown in formula (2)
T = {t1, t2, …, tj, …, tN-1, tN} (2)
is taken as the columns (or rows); an M × N correlation matrix is established, M and N being natural numbers. Each element Aij of the correlation matrix represents the association result between the detection target di shown in formula (3)
di = {typei, xi, yi, wi, hi} (i = 1, 2, …, M) (3)
and the tracking target tj shown in formula (4)
tj = {idj, xj, yj, wj, hj, Δxj, Δyj, Δwj, Δhj} (j = 1, 2, …, N) (4)
(initialized as Aij = 1; Aij = 1 means that di and tj are associated, otherwise they are unassociated), wherein typei is the class of di, {xi, yi} are the coordinates of the center point of the target frame of di, {wi, hi} are the width and height of the target frame of di, i.e. the size of the target frame, idj is the ID of tj, {xj, yj} are the coordinates of the center point of the target frame of tj, {wj, hj} are the width and height of the target frame of tj, {Δxj, Δyj} is the movement speed of tj, and {Δwj, Δhj} is the width and height variation of tj.
Preferably, it is characterized in that the first and second parts,
the target matching further comprises:
and obtaining the predicted position of each target in the previous frame in the current frame by using Kalman filtering motion prediction.
Preferably, the obtaining of the predicted position of each target in the previous frame in the current frame by using Kalman filtering motion prediction comprises:
the predicted position (x̂j, ŷj) of each previous-frame target tj (j = 1, 2, …, N) in the current frame is obtained by Kalman filtering motion prediction; a circular correlation gate is established with (x̂j, ŷj) as its center and the radius R given by formula (5); the m (m ≤ M) detection targets d1, …, dm whose target-frame center coordinates {xi, yi} fall within the correlation gate are associated to the tracking target tj, i.e. for the j-th column the corresponding elements Aij are kept at 1 and the remaining elements of the column are set to 0, thereby sparsifying the matrix.
Preferably, the target matching further comprises:
for every target association pair with Aij ≠ 0, calculating the target frame similarity Fs and the target frame overlap degree Fiou of the corresponding detection target di and tracking target tj, and obtaining the target frame overall similarity Fbox shown in formula (6):
Fbox = λbox·Fs + (1 – λbox)·Fiou, (6)
wherein λbox is the weight of the target frame similarity Fs in the target frame overall similarity Fbox;
each association pair in the correlation matrix is updated as shown in formula (7):
Aij = 1, if Fbox ≥ Tbox; Aij = 0, otherwise, (7)
where Tbox is the target frame overall similarity threshold. For each association pair satisfying Aij = 1, the number of elements equal to 1 in the i-th row and in the j-th column is counted; if Aij is the only element equal to 1 in both its row and its column, the detection target di and the tracking target tj have been successfully matched; the tracking target tj is updated with the target frame center point coordinates {xi, yi} and size {wi, hi} of the detection target di, the updated target tj is saved, and the i-th row and the j-th column of the correlation matrix are deleted.
Preferably, the target frame similarity Fs is calculated as shown in formula (8).
Preferably, the target frame overlap degree Fiou is calculated as shown in formula (9):
Fiou = area(di ∩ tj) / (area(di) + area(tj) – area(di ∩ tj)), (9)
wherein area(di) denotes the area of the target frame of the detection target di, area(tj) denotes the area of the target frame of the tracking target tj, and area(di ∩ tj) denotes the area of their overlapping region.
Preferably, the target matching further comprises:
for the detection targets di (i = 1, 2, …, m) remaining in the correlation matrix, computing the apparent feature a1, and fusing the target frame overall similarity Fbox and the apparent similarity Fa1 into the comprehensive similarity Fc1 shown in formula (10):
Fc1 = λ1·Fbox + (1 – λ1)·Fa1, (10)
wherein λ1 is the weight of the target frame overall similarity Fbox in the comprehensive similarity Fc1.
Preferably, each association pair in the correlation matrix is updated as shown in formula (11):
Aij = 1, if Fc1 ≥ Tc1; Aij = 0, otherwise, (11)
where Tc1 is a comprehensive similarity threshold. For each association pair satisfying Aij = 1, the number of elements equal to 1 in the i-th row and in the j-th column is counted; if Aij is the only element equal to 1 in both its row and its column, the detection target di and the tracking target tj have been successfully matched; the tracking target tj is updated with the target frame center point coordinates {xi, yi} and size {wi, hi} of the detection target di, the updated target tj is saved, the i-th row and the j-th column of the correlation matrix are deleted, and the apparent feature a1 of the target is updated.
Preferably, the apparent feature a1 and its similarity measure are calculated as follows:
the gradient feature is an important apparent feature for pedestrian and vehicle targets, so the apparent feature a1 of the present invention is selected as the Histogram of Oriented Gradients (HOG) feature, which has strong descriptive power. An image slice is cut from the detection target frame of the current frame image and input into an HOG feature extractor to obtain the HOG feature of the detection target; a point-multiplication operation with the HOG feature of the previous-frame tracking target yields a feature response map; the maximum pixel value in the feature response map represents the HOG feature similarity of the detection target and the tracking target, and is normalized to obtain the apparent similarity Fa1.
Preferably, the target matching further comprises:
calculating the apparent feature a2 of the detection targets di (i = 1, 2, …, m) remaining in the correlation matrix, where m is the total number of detection targets, and fusing the comprehensive similarity Fc1 and the apparent similarity Fa2 into the comprehensive similarity Fc2 shown in formula (12):
Fc2 = λ2·Fc1 + (1 – λ2)·Fa2, (12)
wherein λ2 is the weight of the comprehensive similarity Fc1 in the comprehensive similarity Fc2.
Preferably, each association pair in the correlation matrix is updated as shown in formula (13):
Aij = 1, if Fc2 ≥ Tc2; Aij = 0, otherwise, (13)
where Tc2 is a comprehensive similarity threshold. For each association pair satisfying Aij = 1, the number of elements equal to 1 in the i-th row and in the j-th column is counted; if Aij is the only element equal to 1 in both its row and its column, the detection target di and the tracking target tj have been successfully matched; the tracking target tj is updated with the target frame center point coordinates {xi, yi} and size {wi, hi} of the detection target di, the updated target tj is saved, the i-th row and the j-th column of the correlation matrix are deleted, and the apparent features a1 and a2 of the target are updated.
Preferably, the apparent feature a2 and its similarity measure are calculated as follows:
owing to the inter-class differences between the two classes of targets, different apparent features a2 are extracted for vehicle targets and pedestrian targets respectively according to the target class. Preferably, the apparent feature a2 is selected as the depth feature learned by a Resnet18 deep neural network; the Resnet18 network is trained separately for vehicle targets and for pedestrian targets to capture the differences between distinct targets of the same class and the property that the same target remains consistent over time. An image slice is cut from the detection target frame of the current frame image and input into the depth feature extractor to obtain the depth feature vector of the detection target; the Euclidean distance between this vector and the depth feature vector of the previous-frame tracking target is calculated and normalized to obtain the apparent similarity Fa2.
Preferably, the base network of the depth feature extractor is Resnet18 with the last fully connected classification layer removed, and is used to extract the global feature of the target.
Preferably, the pedestrian target feature map is horizontally cut into upper-body and lower-body parts to obtain local features.
Preferably, the vehicle target feature map is horizontally cut into upper and lower halves and vertically cut into left and right halves to obtain local features; the global feature and the local features are then subjected to global maximum pooling and global mean pooling, where global maximum pooling captures the salient characteristics of the target and global mean pooling captures the background and contour information of the image slice.
Preferably, the global features obtained from global maximum pooling and global mean pooling are added to give a 2048-dimensional feature, which is reduced to 256 dimensions by a one-dimensional convolution; the local features obtained from global maximum pooling and global mean pooling are added to give a 1024-dimensional feature, which is likewise reduced to 256 dimensions by a one-dimensional convolution.
Preferably, for the pedestrian target the dimension-reduced features are concatenated into a depth feature vector of 3 × 256 = 768 dimensions.
Preferably, for the vehicle target the dimension-reduced features are concatenated into a depth feature vector of 5 × 256 = 1280 dimensions.
Preferably, the auxiliary matching further comprises:
further matching the apparently similar targets using the inter-target orientation constraint relation.
Preferably, the further matching of the apparently similar targets using the inter-target orientation constraint comprises:
for the N tracking targets tj (j = 1, 2, …, N) in the video sequence, each target encodes and maintains its relative orientation relation with the other N – 1 targets
RPj = {rp1, …, rpj-1, rpj+1, …, rpN}, (14)
wherein rpj-1 represents the relative orientation of the target tj and the target tj-1.
Preferably, the association pairs in the correlation matrix are updated according to formula (15), with Ta used as an association threshold to screen out the association pairs with lower apparent similarity:
Aij = 1, if Fc2 ≥ Ta; Aij = 0, otherwise. (15)
Preferably, the targets whose elements Aij in the correlation matrix are not equal to 0 are regarded as apparently similar targets; the relative orientations of the n apparently similar targets are encoded, the orientation matching cost Crp is calculated, and the matching result with the minimum orientation matching cost is taken as the final matching result.
Preferably, the orientation matching cost Crp is calculated as shown in formula (17):
Crp = Crp1 + Crp2 + … + Crpn, (17)
wherein Crpj is the orientation matching cost between the target tj and the other n – 1 apparently similar targets, calculated as shown in formula (18) by comparing RPj, the relative orientation relation of the previous-frame tracking target tj with the other N – 1 targets, with RP'j, the relative orientation relation of the current-frame tracking target tj with the other N – 1 targets.
Preferably, for each unmatched tracking target tj in the correlation matrix that satisfies formula (19), i.e. whose column contains no remaining associated detection,
A1j + A2j + … + AMj = 0, (19)
the missed detection target is recovered by combining motion prediction with the apparent features.
Preferably, the predicted position (x̂j, ŷj) of the target tj (j = 1, 2, …, n) in the current frame is calculated, and the missed detection target is further recovered at this position in combination with the apparent features.
Preferably, a candidate frame is determined with the predicted position (x̂j, ŷj) of the target tj as its center point, with a width given by formula (20) and a height given by formula (21).
Preferably, the image slice inside the candidate frame of the current frame image is cut out and input into the HOG feature extractor to obtain the HOG feature of the target in the candidate frame; a point-multiplication operation with the HOG feature of the unmatched tracking target yields a feature response map; the maximum pixel value in the feature response map represents the HOG feature similarity of the detection target and the tracking target, and is compared with an HOG feature similarity threshold to determine the target state.
Preferably, if the HOG feature similarity is greater than the HOG feature similarity threshold, the coordinate position of the maximum pixel value in the feature response map is mapped back to the coordinate position (x*j, y*j) in the original image; the tracking target tj is updated with this coordinate position and the size wj + Δwj, hj + Δhj, and the apparent features a1 and a2 of the target are updated at the same time.
Preferably, if the HOG feature similarity is smaller than the HOG feature similarity threshold, the tracking target tj is considered to be occluded or to have disappeared, and the target tj is not updated.
Preferably, processing the tracking chain of the unmatched tracking target and the unmatched detection target, updating the orientation constraint relation between the targets to prepare for the next frame of tracking, and outputting the target tracking result of the current frame.
Preferably, a tracking target tj whose HOG feature similarity is lower than the HOG feature similarity threshold is marked as a target to be confirmed and added to the correlation matrix of the next frame; if it is successfully matched in the next frame, the target is updated, otherwise the target is considered occluded and its count of consecutive unmatched frames is increased; once the count of consecutive unmatched frames reaches the consecutive-unmatched threshold, the target is considered to have disappeared and is removed from the correlation matrix.
Preferably, each unmatched detection target di that satisfies formula (22), i.e. whose row contains no remaining associated tracking target,
Ai1 + Ai2 + … + AiN = 0, (22)
is marked as a target to be initialized and added to the correlation matrix of the next frame; if it is successfully matched in the next frame, it is initialized as a new target, otherwise it is considered a false-alarm target and removed from the correlation matrix.
Preferably, the orientation constraint relation between the targets is updated, and the tracking result of the current frame is output.
The invention has the beneficial effects that:
1. The invention provides a multi-feature apparent modeling method based on the target category: different apparent description methods are used for different target categories, the changes of the target and the background are fully considered, and different feature extraction operators are constructed from the detected target category to obtain an accurate apparent description of the target, so that complicated target and background changes can be handled;
2. the invention provides a hierarchical progressive data association method: a correlation matrix is established per target category and a high-threshold matching algorithm quickly reduces its dimensionality, which reduces the number of depth feature extractions and improves the time efficiency of the algorithm, increasing tracking speed while guaranteeing tracking precision;
3. the invention provides a method that uses the relative orientation constraint relation between targets to assist the apparent features in completing target matching: by encoding the relative orientations of the targets in the previous frame and combining them with the apparent features of the current frame, apparently similar targets can be matched correctly, effectively reducing the mismatching rate;
4. the invention provides a method for recovering missed detection targets under a correlation filter tracking framework by using the target apparent features and motion prediction, which greatly improves the recovery rate of missed detection targets.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention. The above and other objects, features and advantages of the present invention will become more apparent from the detailed description of the embodiments of the present invention when taken in conjunction with the accompanying drawings.
FIG. 1 is an overall framework diagram of the present invention;
FIG. 2 is a flow chart of a hierarchical progressive data association algorithm;
FIG. 3 is a relative orientation encoding diagram;
FIG. 4 is a graph of a matched set of two apparently similar targets.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The invention aims to provide a method for tracking pedestrian and vehicle targets in a video. Fig. 1 depicts the overall framework of the invention, comprising the following steps:
firstly, establishing an incidence matrix of a detection result and a tracking result, extracting apparent characteristics based on target categories from the detection result, and performing target matching in a layered and progressive manner;
secondly, further matching the unsuccessfully matched apparent similar targets in the first step by using the orientation constraint relation between the targets;
thirdly, recovering undetected targets of the frame by utilizing motion prediction and apparent characteristics;
and fourthly, maintaining a tracking chain, updating the orientation constraint relation between the targets and outputting the tracking result of the frame.
(1) Completing target matching with hierarchical progressive data association
Data association is the process of associating uncertain detection results with existing target tracks so as to match the targets of consecutive frames. Most existing schemes directly extract multiple features for every target, fuse several feature similarities into a matching cost for every association pair, exhaustively compute the matching cost of every possible global association, and take the global association with the lowest matching cost as the target matching result. Such schemes make a "hard" decision on data association, do not fully consider the possibility that a pairing is a false match, and require the matching cost to be designed very accurately. To address this, the invention establishes correlation matrices separately by target category to reduce redundant similarity computations, and uses a hierarchical progressive strategy to reduce the number of feature extractions, guaranteeing accuracy while improving the time efficiency of the algorithm. FIG. 2 is a flow chart of the hierarchical progressive data association algorithm.
(1.1) A vehicle target correlation matrix and a pedestrian target correlation matrix are established separately according to target type. The detection target sequence set of the current frame shown in formula (1)

D = {d_1, d_2, …, d_i, …, d_{M-1}, d_M} (1)

is taken as the rows (or columns), the tracking target sequence set of the previous frame shown in formula (2)

T = {t_1, t_2, …, t_j, …, t_{N-1}, t_N} (2)

is taken as the columns (or rows), and an M × N correlation matrix is established, M and N being natural numbers. Each element A_ij of the correlation matrix represents the association result of the detection target d_i shown in formula (3)

d_i = {type_i, x_i, y_i, w_i, h_i} (i = 1, 2, …, M) (3)

and the tracking target t_j shown in formula (4)

t_j = {id_j, x_j, y_j, w_j, h_j, Δx_j, Δy_j, Δw_j, Δh_j} (j = 1, 2, …, N) (4)

(initialized to A_ij = 1; A_ij = 1 indicates that d_i and t_j are associated, otherwise they are unassociated), wherein type_i is the class of d_i, {x_i, y_i} are the coordinates of the center point of the target frame of d_i, {w_i, h_i} are the width and height of the target frame of d_i, i.e. the size of the target frame, id_j is the ID of t_j, {x_j, y_j} are the coordinates of the center point of the target frame of t_j, {w_j, h_j} are the width and height of the target frame of t_j, {Δx_j, Δy_j} is the motion speed of t_j, and {Δw_j, Δh_j} is the width and height variation of t_j.
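By way of illustration only, the following Python sketch shows one possible realization of step (1.1) under assumed detection and track record structures (dictionaries with the fields listed above); it is not the implementation of the invention.

```python
# A minimal sketch of step (1.1): detections and tracks are split by class
# ("vehicle" / "pedestrian") and, for each class, an M x N correlation matrix
# initialized to 1 is created (rows = detections of the current frame,
# columns = tracking targets of the previous frame).
import numpy as np


def build_association_matrices(detections, tracks):
    """detections: list of dicts with keys type, x, y, w, h;
    tracks: list of dicts with keys id, type, x, y, w, h, dx, dy, dw, dh."""
    matrices = {}
    for cls in ("vehicle", "pedestrian"):
        dets = [d for d in detections if d["type"] == cls]
        trks = [t for t in tracks if t["type"] == cls]
        # A_ij = 1 means detection i and track j are still considered associated
        matrices[cls] = (dets, trks, np.ones((len(dets), len(trks)), dtype=np.uint8))
    return matrices
```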
(1.2) The predicted position (x̂_j, ŷ_j) in the current frame of each target t_j (j = 1, 2, …, N) of the previous frame is obtained by Kalman filtering motion prediction. A circular association gate centered at (x̂_j, ŷ_j) with radius R as given by formula (5) is established, and the m (m ≤ M) detection targets d_1, …, d_m whose target-frame center point coordinates {x_i, y_i} fall within the association gate are associated to the tracking target t_j; that is, for the j-th column, A_ij is kept equal to 1 for these m detection targets and the remaining elements are set to 0, so that the matrix is sparsified.
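A minimal sketch of this gating step is given below; the constant-velocity center prediction stands in for the Kalman filter prediction, and the gate radius used (half the predicted box diagonal) is an illustrative assumption, since the actual radius R is defined by formula (5).

```python
# A minimal sketch of step (1.2): predict each track's center and zero out
# correlation-matrix entries for detections outside the circular gate.
import numpy as np


def predict_center(trk):
    # x_j + Δx_j, y_j + Δy_j : constant-velocity prediction of the center
    return trk["x"] + trk["dx"], trk["y"] + trk["dy"]


def apply_association_gate(dets, trks, A):
    for j, trk in enumerate(trks):
        px, py = predict_center(trk)
        R = 0.5 * np.hypot(trk["w"] + trk["dw"], trk["h"] + trk["dh"])  # assumed radius
        for i, det in enumerate(dets):
            if np.hypot(det["x"] - px, det["y"] - py) > R:
                A[i, j] = 0          # detection outside the gate: not associated
    return A
```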
(1.3) For each target association pair A_ij from step (1.2) satisfying A_ij ≠ 0, the target frame similarity F_s and the target frame overlap F_iou of the corresponding detection target d_i and tracking target t_j are calculated, and the overall target frame similarity F_box shown in formula (6) is obtained:

F_box = λ_box·F_s + (1 − λ_box)·F_iou, (6)

wherein λ_box is the weight of the target frame similarity F_s in the overall target frame similarity F_box.

Each association pair in the correlation matrix is updated as shown in formula (7):

A_ij = 1, if F_box > T_box; A_ij = 0, otherwise, (7)

wherein T_box is the overall target frame similarity threshold. For each association pair satisfying A_ij = 1, the number of elements equal to 1 in row i and in column j is counted; if both counts equal 1, the detection target d_i and the tracking target t_j have been successfully matched: the tracking target t_j is updated with the target frame center point coordinates {x_i, y_i} and size {w_i, h_i} of the detection target d_i, the state of target t_j is calculated and saved, and row i and column j of the correlation matrix are deleted.
In the method as described above, the target frame similarity F_s in step (1.3) is calculated as shown in formula (8).

In the method as described above, the target frame overlap F_iou in step (1.3) is calculated as shown in formula (9):

F_iou = area(d_i ∩ t_j) / (area(d_i) + area(t_j) − area(d_i ∩ t_j)), (9)

wherein area(d_i) denotes the area of the target frame of the detection target d_i, area(t_j) denotes the area of the target frame of the tracking target t_j, and area(d_i ∩ t_j) denotes the area of the overlap of the two target frames.
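The following Python sketch illustrates, under assumed box conventions, the box-level association layer of step (1.3): the IoU of formula (9), a fused box similarity following formula (6) (with an illustrative size-similarity stand-in for F_s, since formula (8) is not reproduced here), thresholding of A_ij, and the row/column uniqueness test that accepts a match only when a detection and a track are each other's sole remaining candidate. The same threshold-and-uniqueness routine can be reused in steps (1.4) and (1.5) with F_c1 and F_c2 as the score function.

```python
# A minimal sketch of the box-level matching layer; boxes are dicts with
# center coordinates x, y and size w, h.
import numpy as np


def iou(d, t):
    x1 = max(d["x"] - d["w"] / 2, t["x"] - t["w"] / 2)
    y1 = max(d["y"] - d["h"] / 2, t["y"] - t["h"] / 2)
    x2 = min(d["x"] + d["w"] / 2, t["x"] + t["w"] / 2)
    y2 = min(d["y"] + d["h"] / 2, t["y"] + t["h"] / 2)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (d["w"] * d["h"] + t["w"] * t["h"] - inter + 1e-12)


def box_similarity(d, t, lam_box=0.5):
    # assumed F_s: size similarity from width/height ratios (stand-in for formula (8))
    f_s = min(d["w"], t["w"]) / max(d["w"], t["w"]) * \
          min(d["h"], t["h"]) / max(d["h"], t["h"])
    return lam_box * f_s + (1.0 - lam_box) * iou(d, t)


def threshold_and_match(dets, trks, A, score_fn, thr):
    # threshold every surviving association pair (formula (7) style update)
    for i, j in zip(*np.nonzero(A)):
        A[i, j] = 1 if score_fn(dets[i], trks[j]) > thr else 0
    matches = []
    for i, j in zip(*np.nonzero(A)):
        if A[i, :].sum() == 1 and A[:, j].sum() == 1:   # unique in row and column
            matches.append((i, j))                      # d_i <-> t_j matched
    # matched rows/columns would then be removed before the next (deeper) layer
    return matches
```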
(1.4) For the detection targets d_i (i = 1, 2, …, m) remaining in the correlation matrix, the apparent feature a_1 is computed, and the overall target frame similarity F_box is fused with the apparent similarity F_a1 to give the integrated similarity F_c1 shown in formula (10):

F_c1 = λ_1·F_box + (1 − λ_1)·F_a1, (10)

wherein λ_1 is the weight of the overall target frame similarity F_box in the integrated similarity F_c1.

Each association pair in the correlation matrix is updated as shown in formula (11):

A_ij = 1, if F_c1 > T_c1; A_ij = 0, otherwise, (11)

wherein T_c1 is the integrated similarity threshold. For each association pair satisfying A_ij = 1, the number of elements equal to 1 in row i and in column j is counted; if both counts equal 1, the detection target d_i and the tracking target t_j have been successfully matched: the tracking target t_j is updated with the target frame center point coordinates {x_i, y_i} and size {w_i, h_i} of the detection target d_i, the state of target t_j is calculated and saved, row i and column j of the correlation matrix are deleted, and the apparent feature a_1 of the target is updated.
In the method as described above, the apparent feature a_1 of step (1.4) and its similarity measure are calculated as follows:

Gradient features are important apparent features for both pedestrian and vehicle targets, so the apparent feature a_1 of the invention is chosen as the Histogram of Oriented Gradients (HOG) feature, which has strong descriptive power. An image slice is cut from the detection target frame of the current frame image and input to a HOG feature extractor to obtain the HOG feature of the detection target; a dot-product (correlation) operation with the HOG feature of the tracking target of the previous frame yields a feature response map; the maximum pixel value of the response map represents the HOG feature similarity between the detection target and the tracking target, and is normalized to give the apparent similarity F_a1.
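As an illustration only, the following Python sketch shows one way to realize this HOG response-map similarity, assuming grayscale image patches, a scikit-image HOG extractor, and frequency-domain cross-correlation; the patch size, cell size, and normalization are illustrative assumptions rather than the parameters of the invention.

```python
# A minimal sketch of the HOG-based apparent similarity of step (1.4): HOG maps
# of the detection slice and the tracked template are cross-correlated and the
# normalized peak of the response map is taken as F_a1.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize


def hog_map(patch, cell=8):
    """Return an (H, W, C) HOG feature map for a grayscale float image patch."""
    feat = hog(patch, orientations=9, pixels_per_cell=(cell, cell),
               cells_per_block=(1, 1), feature_vector=False)
    # skimage returns (cells_y, cells_x, 1, 1, orientations); squeeze block dims
    return feat.reshape(feat.shape[0], feat.shape[1], -1)


def hog_response_peak(det_patch, tmpl_patch, size=(64, 64)):
    """Cross-correlate HOG maps of a detection slice and a tracked template;
    the normalized peak of the response map approximates F_a1."""
    f_det = hog_map(resize(det_patch, size))
    f_tmp = hog_map(resize(tmpl_patch, size))
    resp = np.zeros(f_det.shape[:2])
    for c in range(f_det.shape[2]):
        # frequency-domain cross-correlation, summed over HOG channels
        resp += np.real(np.fft.ifft2(np.fft.fft2(f_det[..., c]) *
                                     np.conj(np.fft.fft2(f_tmp[..., c]))))
    denom = np.linalg.norm(f_det) * np.linalg.norm(f_tmp) + 1e-12
    return float(resp.max() / denom)  # normalized similarity, roughly in [0, 1]
```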
(1.5) For the detection targets d_i (i = 1, 2, …, m) remaining in the correlation matrix, m being the number of remaining detection targets, the apparent feature a_2 is computed, and the integrated similarity F_c1 is fused with the apparent similarity F_a2 to give the integrated similarity F_c2 shown in formula (12):

F_c2 = λ_2·F_c1 + (1 − λ_2)·F_a2, (12)

wherein λ_2 is the weight of the integrated similarity F_c1 in the integrated similarity F_c2.

Each association pair in the correlation matrix is updated as shown in formula (13):

A_ij = 1, if F_c2 > T_c2; A_ij = 0, otherwise, (13)

wherein T_c2 is the integrated similarity threshold. For each association pair satisfying A_ij = 1, the number of elements equal to 1 in row i and in column j is counted; if both counts equal 1, the detection target d_i and the tracking target t_j have been successfully matched: the tracking target t_j is updated with the target frame center point coordinates {x_i, y_i} and size {w_i, h_i} of the detection target d_i, the state of target t_j is calculated and saved, row i and column j of the correlation matrix are deleted, and the apparent features a_1 and a_2 of the target are updated.
In the method as described above, the apparent feature a_2 of step (1.5) and its similarity measure are calculated as follows:

Because of the inter-class differences between pedestrian and vehicle targets, the invention extracts different apparent features a_2 for vehicle targets and pedestrian targets based on the target class. Preferably, the apparent feature a_2 is chosen as a depth feature learned by a ResNet18 deep neural network. The ResNet18 network is trained separately for vehicle targets and for pedestrian targets, so as to capture the differences between distinct targets of the same class and the invariance of the same target over time. An image slice is cut from the detection target frame of the current frame image and input to the depth feature extractor to obtain the depth feature vector of the detection target; the Euclidean distance to the depth feature vector of the tracking target of the previous frame is computed and normalized to give the apparent similarity F_a2.

Specifically, the backbone network of the depth feature extractor is ResNet18 with the final fully connected classification layer removed, and it is used to extract the global feature of the target. For a pedestrian target, the upper and lower parts have distinctive characteristics, so the pedestrian feature map is cut horizontally into upper and lower halves to form local features. For a vehicle target, the upper and lower parts have distinctive characteristics while the left and right parts are frequently occluded, so the vehicle feature map is cut horizontally into upper and lower halves and vertically into left and right halves to obtain local features. The global and local features are then subjected to global max pooling and global mean pooling: global max pooling captures the salient characteristics of the target, while global mean pooling captures the background and contour information of the image slice. The max-pooled and mean-pooled global features are added to obtain a 2048-dimensional feature, which is reduced to 256 dimensions by a one-dimensional convolution; the max-pooled and mean-pooled local features are added to obtain 1024-dimensional features, each reduced to 256 dimensions by a one-dimensional convolution. The reduced features are concatenated, forming a 3 × 256 = 768-dimensional depth feature vector for a pedestrian target and a 5 × 256 = 1280-dimensional depth feature vector for a vehicle target.
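By way of illustration, the following PyTorch sketch outlines such a class-specific part-pooled feature extractor under stated assumptions: a standard torchvision ResNet18 backbone (512 output channels, so the 2048/1024-dimensional intermediate features quoted above, which imply a wider backbone, are not reproduced), global plus part pooling with both max and mean pooling, 1 × 1 convolutions down to 256 dimensions, and concatenation into 768 dimensions for pedestrians or 1280 dimensions for vehicles. The mapping from Euclidean distance to F_a2 at the end is one possible normalization, not the one claimed by the patent.

```python
# A minimal sketch, not the implementation of the invention.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class PartFeatureExtractor(nn.Module):
    def __init__(self, target_class="pedestrian", embed_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep everything up to (and including) the last residual stage
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.target_class = target_class
        c = 512  # ResNet18 output channels
        n_parts = 3 if target_class == "pedestrian" else 5  # global + local parts
        self.reduce = nn.ModuleList(nn.Conv2d(c, embed_dim, kernel_size=1)
                                    for _ in range(n_parts))

    @staticmethod
    def _pool(fmap):
        # sum of global max pooling (saliency) and global mean pooling (context)
        return (torch.amax(fmap, dim=(2, 3), keepdim=True) +
                torch.mean(fmap, dim=(2, 3), keepdim=True))

    def forward(self, x):                      # x: (B, 3, H, W) image slices
        fmap = self.backbone(x)                # (B, 512, h, w)
        h, w = fmap.shape[2], fmap.shape[3]
        parts = [fmap,                         # global
                 fmap[:, :, : h // 2, :],      # upper half
                 fmap[:, :, h // 2:, :]]       # lower half
        if self.target_class == "vehicle":
            parts += [fmap[:, :, :, : w // 2],  # left half
                      fmap[:, :, :, w // 2:]]   # right half
        feats = [conv(self._pool(p)).flatten(1)            # (B, 256) each
                 for conv, p in zip(self.reduce, parts)]
        f = torch.cat(feats, dim=1)            # (B, 768) or (B, 1280)
        return nn.functional.normalize(f, dim=1)


# One possible mapping from Euclidean distance to an apparent similarity F_a2
def appearance_similarity(f_det, f_trk):
    d = torch.norm(f_det - f_trk, dim=1)       # in [0, 2] for unit-norm features
    return 1.0 - d / 2.0
```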
(2) Further matching of apparently similar targets using inter-target orientation constraints

For the N tracking targets t_j (j = 1, 2, …, N) in the video sequence, each target encodes and maintains its relative orientation relation with the other N − 1 targets, as shown in formula (14):

RP_j = {rp_1, …, rp_{j−1}, rp_{j+1}, …, rp_N}, (14)

wherein rp_{j−1} represents the relative orientation of target t_j with respect to target t_{j−1}. Each relative orientation is a two-bit code rc, rp = {rc}, whose value is given by formula (15); the specific coding mode is shown in fig. 3.
(2.1) The association pairs in the correlation matrix are updated according to formula (16), with T_a as the association threshold, so that association pairs with lower apparent similarity are screened out.
(2.2) The targets with A_ij ≠ 0 remaining in the correlation matrix are regarded as apparently similar targets; the relative orientations of the n apparently similar targets are encoded, the orientation matching cost C_rp is calculated, and the association result with the minimum orientation matching cost is taken as the final matching result.
In the method as described above, the encoding of the relative orientation in step (2.2) is shown in fig. 3.

FIG. 3(a) is the relative orientation relation code table. Taking the 9 target positions in FIG. 3(b) as an example, the relative orientation code of target 1 with respect to target 2 is {00}, with respect to target 3 is {00}, with respect to target 4 is {01}, with respect to target 5 is {01}, with respect to target 6 is {11}, with respect to target 7 is {11}, with respect to target 8 is {10}, and with respect to target 9 is {10}; therefore,

RP_1 = {00, 00, 01, 01, 11, 11, 10, 10}.
In the method as described above, the orientation matching cost C_rp in step (2.2) is calculated as shown in formula (17):

C_rp = Σ_{j=1…n} C_rp_j, (17)

wherein C_rp_j is the orientation matching cost of target t_j with respect to the other n − 1 apparently similar targets, calculated as shown in formula (18): C_rp_j is the number of code bits in which RP_j and RP′_j differ, wherein RP_j represents the relative orientation relation of the previous-frame tracking target t_j with the other N − 1 targets, and RP′_j represents the relative orientation relation of the current-frame tracking target t_j, under the candidate association, with the other N − 1 targets.
For example, when target t_1 and target t_2 are apparently similar, the tabular form of the correlation matrix is shown in table 1.

TABLE 1 Correlation matrix of two apparently similar targets

The positional relationship of the previous-frame targets t_1 and t_2 is shown in FIG. 4(a); FIGS. 4(b) and 4(c) show the two possible global association results between the apparently similar detection targets d_1 and d_2 of the current frame and the tracking targets t_1 and t_2.

For FIG. 4(a), RP_1 = {rp_2} = {11} and RP_2 = {rp_1} = {00}. For the association of FIG. 4(b), RP′_1 = {11} and RP′_2 = {00}, so C_rp_b = 0; for the association of FIG. 4(c), RP′_1 = {00} and RP′_2 = {11}, so C_rp_c = 4. The global association with the minimum orientation matching cost is taken as the final target matching, and the matching result is d_1 → t_1, d_2 → t_2.
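A minimal Python sketch of this orientation constraint is given below, under the assumption that the two-bit code is a quadrant code (one bit for the horizontal relation, one for the vertical relation); the exact code table of FIG. 3(a) is not reproduced, so the bit convention, helper names, and example coordinates are illustrative.

```python
# A minimal sketch of section (2): encode relative orientations as two-bit codes
# and pick, among the ambiguous targets, the assignment with the minimum
# orientation matching cost (number of differing code bits).
from itertools import permutations


def rel_code(p, q):
    """Two-bit relative orientation code of the target at p with respect to q."""
    bx = 1 if q[0] >= p[0] else 0   # assumed horizontal bit
    by = 1 if q[1] >= p[1] else 0   # assumed vertical bit
    return (bx, by)


def orientation_cost(prev_pos, cur_pos, assignment):
    """Sum over targets of the number of differing code bits between the
    previous-frame codes RP_j and the codes RP'_j induced by `assignment`
    (assignment[j] = index of the detection matched to track j)."""
    n = len(prev_pos)
    cost = 0
    for j in range(n):
        for k in range(n):
            if k == j:
                continue
            rp = rel_code(prev_pos[j], prev_pos[k])
            rp_new = rel_code(cur_pos[assignment[j]], cur_pos[assignment[k]])
            cost += (rp[0] != rp_new[0]) + (rp[1] != rp_new[1])
    return cost


# Usage: choose the association of the apparently similar targets with minimum cost.
tracks = [(10.0, 10.0), (30.0, 20.0)]   # previous-frame centers of t1, t2
dets = [(12.0, 11.0), (31.0, 21.0)]     # current-frame centers of d1, d2
best = min(permutations(range(len(dets))),
           key=lambda a: orientation_cost(tracks, dets, a))
print(best)  # (0, 1): d1 -> t1, d2 -> t2
```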
(3) Recovery of missed-detection targets by combining motion prediction and apparent features

For a tracking target t_j that remains unmatched in the correlation matrix, i.e. whose column contains no association as expressed by formula (19), the target may have been missed by the detector or occluded; the invention combines motion prediction and apparent features to recover such missed targets. Step (1.2) has already calculated the predicted position (x̂_j, ŷ_j) of target t_j (j = 1, 2, …, n) in the current frame, and the missed target is further recovered by combining the apparent features.
(3.1) A candidate box is determined, centered at the predicted position (x̂_j, ŷ_j) of target t_j, with the width shown in formula (20) and the height shown in formula (21).
(3.2) An image slice is cut from the candidate box of the current frame image and input to the HOG feature extractor to obtain the HOG feature of the target within the candidate box. A dot-product (correlation) operation with the HOG feature of the unmatched tracking target yields a feature response map; the maximum pixel value of the response map represents the HOG feature similarity between the target in the candidate box and the tracking target, and is compared with the HOG feature similarity threshold to determine the target state.
In the method, if the HOG feature similarity in step (3.2) is greater than the HOG feature similarity threshold, the coordinate position of the maximum pixel value in the feature response map is mapped back to the coordinate position (x*_j, y*_j) in the original image, and the tracking target t_j is updated with this coordinate position and the size {w_j + Δw_j, h_j + Δh_j}; the apparent features a_1 and a_2 of the target are updated at the same time.
In the method, if the HOG feature similarity in step (3.2) is smaller than the threshold, the tracking target t_j is considered to be occluded or to have disappeared, and target t_j is not updated.
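The following Python sketch, assuming a grayscale frame array and a track record structure holding the predicted center, size, and stored template, illustrates this recovery decision; the candidate-box scale factor and the similarity threshold are illustrative assumptions (formulas (20) and (21) define the actual candidate size), and the similarity function can be, for example, the hog_response_peak() sketched for step (1.4).

```python
# A minimal sketch of the missed-detection recovery of section (3).
def recover_missed_target(frame, track, similarity_fn, hog_threshold=0.4, scale=2.0):
    """frame: grayscale image array; track: dict with predicted center, size and a
    stored template patch; similarity_fn: e.g. a HOG response-peak similarity.
    Returns an updated box dict, or None if the target is occluded/disappeared."""
    cx, cy = track["pred_x"], track["pred_y"]        # Kalman-predicted center
    w = track["w"] + track["dw"]                     # w_j + Δw_j
    h = track["h"] + track["dh"]                     # h_j + Δh_j
    cw, ch = scale * w, scale * h                    # assumed candidate-box size
    x0, y0 = int(cx - cw / 2), int(cy - ch / 2)
    candidate = frame[max(y0, 0): y0 + int(ch), max(x0, 0): x0 + int(cw)]
    if candidate.size == 0:
        return None
    score = similarity_fn(candidate, track["template"])
    if score > hog_threshold:
        # the peak location of the response map would be mapped back to image
        # coordinates here; for brevity the predicted center is used instead
        return {"x": cx, "y": cy, "w": w, "h": h}
    return None                                      # keep the track as occluded
```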
(4) Maintaining a tracking chain, updating the orientation constraint relation between targets, and outputting a tracking result
For tracking targets that have been successfully matched or recovered, the target update has been completed in the preceding steps. The tracking chains of the unmatched tracking targets and the unmatched detection targets are further processed, the orientation constraint relation between targets is updated in preparation for tracking the next frame, and the target tracking result of the current frame is output.
(4.1) Maintaining the tracking chain: a tracking target t_j that falls below the threshold in step (3.2) is marked as a target to be confirmed and added to the correlation matrix of the next frame. If it is successfully matched in the next frame, the target is updated; otherwise the target is considered occluded and its consecutive unmatched count is increased. When the consecutive unmatched count reaches the consecutive unmatched threshold, the target is considered to have disappeared and is removed from the correlation matrix.
(4.2) Maintaining the tracking chain: a detection target d_i that remains unmatched, i.e. whose row of the correlation matrix contains no association as expressed by formula (22), is marked as a target to be initialized and added to the correlation matrix of the next frame. If it is successfully matched in the next frame, it is initialized as a new target; otherwise it is considered a false-alarm target and removed from the correlation matrix.
And (4.3) updating the orientation constraint relation between the targets and outputting the tracking result of the current frame.
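A minimal sketch of this tracking-chain maintenance, under assumed track and detection record structures and an assumed consecutive-miss threshold, is given below; it is illustrative only.

```python
# A minimal sketch of section (4): unmatched tracks become "to confirm" with a
# consecutive-miss counter, unmatched detections become "to initialize", and
# tracks that miss too many consecutive frames are dropped.
MAX_MISSES = 5  # assumed consecutive unmatched threshold


def maintain_tracking_chain(tracks, detections, matches):
    """tracks/detections: lists of dicts; matches: set of (det_idx, trk_idx) pairs."""
    matched_trks = {j for _, j in matches}
    matched_dets = {i for i, _ in matches}
    next_tracks = []
    for j, trk in enumerate(tracks):
        if j in matched_trks:
            trk["misses"] = 0
            next_tracks.append(trk)
        else:
            trk["misses"] = trk.get("misses", 0) + 1
            trk["state"] = "to_confirm"                  # occluded: keep for now
            if trk["misses"] < MAX_MISSES:               # else: disappeared, drop
                next_tracks.append(trk)
    for i, det in enumerate(detections):
        if i not in matched_dets:
            det["state"] = "to_initialize"               # becomes a new track only
            next_tracks.append(det)                      # if matched in next frame
    return next_tracks
```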
Therefore, the method for tracking multi-vehicle and multi-pedestrian targets in a video adopts a class-based multi-feature apparent modelling method that fully considers target and background changes, constructs different feature extraction operators according to the detected target class to obtain an accurate apparent description of the target, improves the descriptive power of the features, and overcomes the technical problem that a single feature can hardly cope with complex target and background changes; the proposed hierarchical progressive feature extraction algorithm establishes the correlation matrix according to target class, completes fast dimensionality reduction of the correlation matrix with a high-threshold matching algorithm, reduces the number of depth feature extractions, and improves the time efficiency of the algorithm, effectively solving the technical problem of low timeliness of tracking algorithms caused by time-consuming depth feature extraction and improving the tracking speed while maintaining tracking accuracy; the matching of apparently similar targets is completed by encoding the relative orientations of the targets of the previous frame and using the orientation constraint relations between targets, assisted by the apparent features of the current frame, which realizes effective discrimination of targets with similar appearance and effectively reduces the mismatching rate; the recovery of missed-detection targets is completed under a correlation-filtering tracking framework by using the target apparent features and motion prediction, which greatly improves the recovery rate of missed targets. The method has high accuracy, strong adaptability and high detection efficiency, reduces computational complexity, reduces the occupation of system computing resources, and provides strong system reliability, as well as strong robustness to illumination changes, scene changes, and noise in the imaging process. It can be effectively applied to the tracking of multiple vehicle and pedestrian targets in automatic driving, where background changes are complex, target poses vary, targets frequently occlude one another, and similar appearances are difficult to distinguish.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings, but it should be understood by those skilled in the art that the above embodiments are only for clearly illustrating the present invention, and not for limiting the scope of the present invention, and it is apparent that the scope of the present invention is not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A method for tracking multiple vehicles and multiple pedestrian targets in a video is characterized by comprising the following steps:
step 1), data acquisition: acquiring video data;
step 2), target matching: for the video data, establishing a correlation matrix of a detection result and a tracking result, extracting apparent characteristics based on target categories from the detection result, and performing target matching in a layered and progressive manner;
step 3), auxiliary matching: further matching unsuccessfully matched apparent similar targets in target matching by using the orientation constraint relation between the targets;
step 4), recovering the missed detection target: recovering undetected targets of the frame by utilizing motion prediction and apparent characteristics;
Step 5), outputting a tracking result: and maintaining a tracking chain, updating the orientation constraint relation between the targets, and outputting the tracking result of the current frame.
2. The method of claim 1, wherein the capturing video data comprises capturing video data in real time.
3. The method of claim 1, wherein the obtaining video data comprises reading the video data from a file.
4. The method of claim 1 or 2, wherein the acquiring video data comprises capturing video data with a camera mounted on an autonomous vehicle.
5. The method for tracking multiple vehicles and multiple pedestrian targets in a video according to claim 1, wherein
the target matching further comprises:
step 1.1), respectively establishing a vehicle target correlation matrix and a pedestrian target correlation matrix according to the target types.
6. The method for tracking multiple vehicles and multiple pedestrian targets in a video according to claim 5,
wherein the step 1.1) of respectively establishing a vehicle target correlation matrix and a pedestrian target correlation matrix according to the target types comprises:
taking the detection target sequence set of the current frame shown in formula (1)

D = {d_1, d_2, …, d_i, …, d_{M-1}, d_M} (1)

as the rows (or columns), and the tracking target sequence set of the previous frame shown in formula (2)

T = {t_1, t_2, …, t_j, …, t_{N-1}, t_N} (2)

as the columns (or rows), and establishing an M × N correlation matrix, M and N being natural numbers, each element A_ij of the correlation matrix representing the association result of the detection target d_i shown in formula (3)

d_i = {type_i, x_i, y_i, w_i, h_i} (i = 1, 2, …, M) (3)

and the tracking target t_j shown in formula (4)

t_j = {id_j, x_j, y_j, w_j, h_j, Δx_j, Δy_j, Δw_j, Δh_j} (j = 1, 2, …, N) (4)

(initialized to A_ij = 1; A_ij = 1 indicates that d_i and t_j are associated, otherwise they are unassociated), wherein type_i is the class of d_i, {x_i, y_i} are the coordinates of the center point of the target frame of d_i, {w_i, h_i} are the width and height of the target frame of d_i, i.e. the size of the target frame, id_j is the ID of t_j, {x_j, y_j} are the coordinates of the center point of the target frame of t_j, {w_j, h_j} are the width and height of the target frame of t_j, {Δx_j, Δy_j} is the motion speed of t_j, and {Δw_j, Δh_j} is the width and height variation of t_j.
7. The method for tracking multiple vehicles and multiple pedestrian targets in a video according to claim 1 or 6,
characterized in that
the target matching further comprises:
step 1.2), obtaining the predicted position in the current frame of each target of the previous frame by using Kalman filtering motion prediction.
8. The method for tracking multiple vehicles and multiple pedestrian targets in a video according to claim 6 or 7,
wherein the obtaining of the predicted position in the current frame of each target of the previous frame by using Kalman filtering motion prediction comprises:
obtaining the predicted position (x̂_j, ŷ_j) in the current frame of each tracking target t_j (j = 1, 2, …, N) of the previous frame by Kalman filtering motion prediction; establishing a circular association gate centered at (x̂_j, ŷ_j) with radius R as given by formula (5); and associating the m (m ≤ M) detection targets d_1, …, d_m whose target-frame center point coordinates {x_i, y_i} fall within the association gate to the tracking target t_j, i.e. for the j-th column, keeping A_ij = 1 for these m detection targets and setting the remaining elements to 0, thereby sparsifying the matrix.
9. The method for tracking multiple vehicles and multiple pedestrian targets in a video according to claim 8, wherein
the target matching further comprises:
step 1.3), for each target association pair A_ij satisfying A_ij ≠ 0, calculating the target frame similarity F_s and the target frame overlap F_iou of the corresponding detection target d_i and tracking target t_j, and obtaining the overall target frame similarity F_box shown in formula (6)

F_box = λ_box·F_s + (1 − λ_box)·F_iou, (6)

wherein λ_box is the weight of the target frame similarity F_s in the overall target frame similarity F_box;
updating each association pair in the correlation matrix as shown in formula (7)

A_ij = 1, if F_box > T_box; A_ij = 0, otherwise, (7)

wherein T_box is the overall target frame similarity threshold; and, for each association pair satisfying A_ij = 1, counting the number of elements equal to 1 in row i and in column j; if both counts equal 1, the detection target d_i and the tracking target t_j have been successfully matched, the tracking target t_j is updated with the target frame center point coordinates {x_i, y_i} and size {w_i, h_i} of the detection target d_i, the state of target t_j is calculated and saved, and row i and column j of the correlation matrix are deleted.
10. An apparatus for tracking multiple vehicles and multiple human targets in a video, comprising:
a data acquisition device: for acquiring video data;
a target matching device: for the video data, establishing a correlation matrix of a detection result and a tracking result, extracting apparent characteristics based on target categories from the detection result, and performing target matching in a layered and progressive manner;
auxiliary matching device: further matching unsuccessfully matched apparent similar targets in target matching by using the orientation constraint relation between the targets;
missed detection target recovery means: recovering undetected targets of the frame by utilizing motion prediction and apparent characteristics;
a tracking result output device: and maintaining a tracking chain, updating the orientation constraint relation between the targets, and outputting the tracking result of the current frame.
CN202010496840.7A 2020-06-03 2020-06-03 Tracking method for multiple vehicles and multiple pedestrian targets in video Active CN111862147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010496840.7A CN111862147B (en) 2020-06-03 2020-06-03 Tracking method for multiple vehicles and multiple pedestrian targets in video


Publications (2)

Publication Number Publication Date
CN111862147A true CN111862147A (en) 2020-10-30
CN111862147B CN111862147B (en) 2024-01-23

Family

ID=72984949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010496840.7A Active CN111862147B (en) 2020-06-03 Tracking method for multiple vehicles and multiple pedestrian targets in video

Country Status (1)

Country Link
CN (1) CN111862147B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160132728A1 (en) * 2014-11-12 2016-05-12 Nec Laboratories America, Inc. Near Online Multi-Target Tracking with Aggregated Local Flow Descriptor (ALFD)
CN105894022A (en) * 2016-03-30 2016-08-24 南京邮电大学 Adaptive hierarchical association multi-target tracking method
CN106296742A (en) * 2016-08-19 2017-01-04 华侨大学 A kind of online method for tracking target of combination Feature Points Matching
CN108447080A (en) * 2018-03-02 2018-08-24 哈尔滨工业大学深圳研究生院 Method for tracking target, system and storage medium based on individual-layer data association and convolutional neural networks
CN109859238A (en) * 2019-03-14 2019-06-07 郑州大学 One kind being based on the optimal associated online multi-object tracking method of multiple features
CN110728702A (en) * 2019-08-30 2020-01-24 深圳大学 High-speed cross-camera single-target tracking method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fang Lan; Yu Fengqin: "Hierarchical association multi-target tracking with adaptive online discriminative appearance learning" *
Mei Lixue; Wang Zhaodong; Zhang Puzhe: "A multi-target tracking algorithm combining adjacent-frame matching and Kalman filtering" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509011A (en) * 2021-02-08 2021-03-16 广州市玄武无线科技股份有限公司 Static commodity statistical method, terminal equipment and storage medium thereof
CN112509011B (en) * 2021-02-08 2021-05-25 广州市玄武无线科技股份有限公司 Static commodity statistical method, terminal equipment and storage medium thereof
CN113792634A (en) * 2021-09-07 2021-12-14 北京易航远智科技有限公司 Target similarity score calculation method and system based on vehicle-mounted camera

Also Published As

Publication number Publication date
CN111862147B (en) 2024-01-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant