CN115063836A - Pedestrian tracking and re-identification method based on deep learning - Google Patents
Abstract
A pedestrian tracking and re-identification method based on deep learning comprises the following steps. Step 1: perform pedestrian target detection on the video images frame by frame. Step 2: use a Deepsort model to extract features from the pedestrians detected in each frame in step 1 and generate an npy file. Step 3: use Fastreid for pedestrian re-identification detection, extracting features from a preset pedestrian picture gallery to generate an npy file. Step 4: compute the cosine similarity between the feature extraction result of each pedestrian target and the feature extraction result of the specific-pedestrian gallery; if the similarity is greater than a threshold γ, the target is judged to be the specific pedestrian to be re-identified and is tracked, otherwise no target tracking is performed. The invention can accurately locate specific pedestrians across time, regions and cameras, supports inference and detection on real-time video, reaches its best performance through a series of improvements, and has been deployed in practice; it can be widely applied in intelligent monitoring, intelligent security and similar systems.
Description
Technical Field
The invention belongs to the technical field of intelligent monitoring and security protection, and particularly relates to a pedestrian tracking and re-identification method based on deep learning.
Background
With the development of science and technology, surveillance video is widely used in commerce, security, search and other fields and plays an important role in people's daily lives. Pedestrian re-identification has grown alongside face recognition into one of the main directions of computer vision. Although face recognition is relatively mature, it cannot achieve ideal results in dense crowds, with low-resolution capture cameras, or when the camera angle is oblique; in such conditions pedestrian re-identification can still play an important role, locating and identifying specific pedestrians in surveillance video in time, which is of great significance for criminal investigation, search and rescue of missing persons, and similar tasks.
To date, enterprises in the artificial intelligence field at home and abroad have researched pedestrian re-identification intensively; the current technology still faces the following difficulties and problems:
(1) Real-world pedestrians appear under complex and changeable conditions such as occlusion by obstacles, day-to-night transitions and clothing changes, so the accuracy achieved by experimental algorithms is difficult to reach in practice.
(2) Cross-region identification raises safety and privacy concerns, data sets are difficult to obtain, and building a highly robust model under sample imbalance is extremely challenging.
(3) When tracking across cameras, lighting, occlusion and camera sharpness all change with the camera; still identifying the same target without being limited to a fixed tracking range is an urgent problem for pedestrian re-identification technology.
Disclosure of Invention
To overcome these technical problems, the invention provides a pedestrian tracking and re-identification method based on deep learning that combines an improved YOLOv5-Lite target detection algorithm with an improved Deepsort target tracking algorithm; it can accurately locate specific pedestrians across time, regions and cameras, supports inference and detection on real-time video, and reaches its best performance through a series of improvements.
In order to achieve the purpose, the invention adopts the following technical scheme:
a pedestrian tracking and re-identification method based on deep learning comprises the following steps;
Step 1: perform pedestrian target detection on the video images frame by frame with the improved YOLOv5-Lite model;
Step 2: use the Deepsort model to extract features from the pedestrians detected in each frame in step 1 and generate an npy file;
Step 3: use Fastreid for pedestrian re-identification detection, extracting features from a preset pedestrian picture gallery to generate an npy file;
Step 4: compute the cosine similarity between the feature extraction result of each pedestrian target in step 2 and the feature extraction results of the specific-pedestrian gallery in step 3, as shown in formula (1), where x_1 and x_2 are two non-zero vectors:
cos(θ) = (x_1 · x_2) / (||x_1|| ||x_2||) (1)
If the similarity is greater than the threshold γ, the target is judged to be the specific pedestrian to be re-identified and is tracked with the tracking strategy of the improved Deepsort model; otherwise no target tracking is performed.
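As a sketch, the step-4 matching rule of formula (1) can be written directly with NumPy; the default threshold value 0.85 for γ and the helper names are illustrative assumptions, not values fixed by the invention:

```python
import numpy as np

def cosine_similarity(x1, x2):
    """Cosine similarity of two non-zero feature vectors, as in formula (1)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def is_target_pedestrian(track_feature, gallery_features, gamma=0.85):
    """Step-4 decision: True if the track matches any gallery feature above gamma."""
    return any(cosine_similarity(track_feature, g) > gamma for g in gallery_features)
```

In practice the two feature vectors would be the npy embeddings produced by Deepsort (step 2) and Fastreid (step 3).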
Further, step 1 comprises the following sub-steps:
Step 1.1: input the data-set pictures into the improved YOLOv5-Lite model network structure. A BiFPN module is added on the basis of YOLOv5-Lite; BiFPN combines cross-scale bidirectional connections with fast normalization, weighting the input features and letting the network learn the weights itself. The weights are normalized to between 0 and 1 by Softmax-based fusion, as shown in formula (2):
O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i (2)
where w_i and w_j are learnable weights and I_i is the i-th input feature;
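A minimal sketch of the Softmax-based weight fusion of formula (2), assuming all input feature maps share one shape; the function name and shapes are illustrative:

```python
import numpy as np

def softmax_fusion(features, weights):
    """Fuse equally-shaped feature maps with softmax-normalized learnable weights.

    Each normalized weight lies in (0, 1) and the weights sum to 1, as in formula (2).
    """
    w = np.exp(np.asarray(weights, dtype=float))
    w = w / w.sum()
    maps = [np.asarray(f, dtype=float) for f in features]
    return sum(wi * f for wi, f in zip(w, maps))
```

During training the raw weights would be learned parameters; here they are plain floats.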
and step 1.2, extracting the features of the picture by using the convolutional neural network, then outputting a feature graph, dividing the picture into small blocks and generating an anchor frame, associating the labeled prediction frame with the feature graph, and finally establishing a loss function and starting end-to-end training, wherein the loss function is shown as a formula (3).
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv (3)
where ρ(b, b^gt) represents the Euclidean distance between the centre points of the prediction box b and the ground-truth box b^gt, c represents the diagonal length of the smallest enclosing region that contains both the prediction box and the ground-truth box, v measures the consistency of the aspect ratios of the two rectangular boxes, and α is a weighting factor:
v = (4/π²)(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v) (4)
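The CIoU loss of formula (3) can be sketched for axis-aligned boxes given as (x1, y1, x2, y2); this is a plain-Python illustration of the standard CIoU terms (IoU, centre-distance penalty ρ²/c², aspect-ratio term v with weight α), not the patent's training code:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss, formula (3), for boxes in (x1, y1, x2, y2) form."""
    # Intersection over union
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)
    # rho^2: squared distance between box centres; c^2: squared enclosing-box diagonal
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # v: aspect-ratio consistency term (formula (4)); alpha: its trade-off weight
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is 0; for disjoint boxes it exceeds 1 because the centre-distance penalty is added to 1 - IoU.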
further, the step 3 comprises the following sub-steps:
step 3.1: after an input picture is preprocessed, a pre-training model ResNet50 is called as a backhaul, an output feature graph represents an object through a feature vector in a polymerization mode through Gem Pooling, the feature vector obtained in the front is changed to a certain extent through a Bnneck module, and finally triple loss is defined to learn similarity in classification and discrimination in classification, so that direct discrimination between different feature vectors is more obvious, and the same feature vectors converge.
The triple loss input is a triple including an Anchor (Anchor) example, a Positive (Positive) example and a Negative (Negative) example, similarity calculation between samples is realized by optimizing the distance between the Anchor example and the Positive example to be smaller than the distance between the Anchor example and the Negative example, and a: anchor, anchor example; p: positive, a sample of the same class as a; n: negative, a sample of a different class than a; margin is a constant greater than 0:
L = max(d(a, p) - d(a, n) + margin, 0) (5)
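A minimal sketch of the triplet loss in formula (5), taking d as the Euclidean distance; the margin value used in the test is an illustrative assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Formula (5): L = max(d(a, p) - d(a, n) + margin, 0)."""
    def d(u, v):
        return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))
    return max(d(anchor, positive) - d(anchor, negative) + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is exactly the separation the text describes.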
further, the tracking strategy in step 4 includes the following sub-steps:
step 4.1: tracking the specific pedestrian target selected for re-identification by using an NSA Kalman filtering algorithm, specifically updating the appearance state of the ith track at the frame t in an exponential moving average manner
Wherein f is i t Is the appearance embedding of the current match detection and α ═ 0.9 is the momentum term.
Adaptive noise is added at the same time to enhance tracking robustness; the adaptive noise covariance R̃_k is shown in formula (7):
R̃_k = (1 - c_k) R_k (7)
where R_k is the preset constant measurement-noise covariance and c_k is the detection confidence score in state k. The matching process no longer uses the appearance feature distance alone but considers both appearance and motion information;
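The EMA appearance update and the NSA adaptive covariance described above can be sketched as follows; re-normalizing the embedding to unit length is a common convention assumed here, not stated in the text:

```python
import numpy as np

ALPHA = 0.9  # EMA momentum term stated in the text

def update_appearance(e_prev, f_det, alpha=ALPHA):
    # Formula (6): e_t = alpha * e_{t-1} + (1 - alpha) * f_t,
    # re-normalized to unit length (assumed convention for appearance features).
    e = alpha * np.asarray(e_prev, dtype=float) + (1 - alpha) * np.asarray(f_det, dtype=float)
    return e / np.linalg.norm(e)

def nsa_noise_covariance(R_k, c_k):
    # Formula (7): a high-confidence detection (c_k near 1) shrinks the
    # measurement-noise covariance, so the Kalman filter trusts it more.
    return (1.0 - c_k) * np.asarray(R_k, dtype=float)
```

The shrunken covariance would then replace the constant R_k in the Kalman update step.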
the matching cascade is replaced by a common global linear distribution, wherein the distribution matrix C is the appearance cost A a And cost of action A m Weighted sum of (c):
C=λA a +(1-λ)A m (8)
wherein the weighting factor λ is set to 0.98;
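The weighted assignment cost of formula (8) is a one-line elementwise combination; in practice the resulting matrix C would be handed to a global linear assignment solver (e.g. the Hungarian algorithm), which is omitted in this sketch:

```python
import numpy as np

LAMBDA = 0.98  # weighting factor stated in the text

def combined_cost(appearance_cost, motion_cost, lam=LAMBDA):
    # Formula (8): C = lam * A_a + (1 - lam) * A_m, computed elementwise
    # over the (tracks x detections) cost matrices.
    return (lam * np.asarray(appearance_cost, dtype=float)
            + (1 - lam) * np.asarray(motion_cost, dtype=float))
```

With λ = 0.98 the appearance cost dominates, which matches the text's emphasis on appearance features while still letting motion break ties.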
Step 4.2: the Kalman filter predicts a track for the current frame; if the predicted track is confirmed (as a pedestrian or vehicle), detection is run on the current frame, the detection boxes are cascade-matched with the confirmed track boxes, and the tracked detection boxes are updated once matching completes;
If track matching fails, IoU matching is performed; if that succeeds, the update is carried out and the predict-observe-update tracking cycle repeats. IoU matching failures fall into observation-matching failures and track-matching failures. For an observation-matching failure, a new track is created and examined over three frames; if the target (pedestrian or vehicle) is still real, the track is confirmed. For a track-matching failure, it is first checked whether the track was ever confirmed as a pedestrian or vehicle; if not, the track is deleted. Otherwise a threshold is applied: if the number of frames since the last match exceeds max_age, the track is deleted and the target is considered to have left the observation range; if it is below the threshold, the track is examined over three frames again and returns to the initial state.
The invention has the following beneficial effects.
The invention realizes real-time tracking and re-identification of specific pedestrian targets. Compared with the algorithm before improvement, the YOLOv5-Lite detection module keeps its mean average precision while raising recognition precision by 3%, reaching a detection precision of 92%; the Deepsort tracking module improves to varying degrees on each metric used to evaluate tracking performance, obtaining a better tracking effect; the feature extraction logic of the Fastreid re-identification module is optimized, greatly increasing the algorithm's speed; and the overall model achieves high precision in real-time detection, meeting the requirements of practical video surveillance, with broad application prospects.
Description of the drawings:
FIG. 1 overall flow of the pedestrian re-identification system.
FIG. 2 improved YOLOv5-Lite model network structure.
FIG. 3 improvement strategy of the Deepsort tracking algorithm.
FIG. 4 pictures of the pedestrians to be detected, named bag and red from left to right.
FIG. 5 re-identification result of pedestrian bag in area 1.
FIG. 6 re-identification result of pedestrian red in area 1.
FIG. 7 re-identification result of pedestrian bag in area 2.
FIG. 8 re-identification result of pedestrian red in area 2.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings and the attached tables in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
A pedestrian tracking and re-identification method based on deep learning comprises the following steps;
Step 1: perform pedestrian target detection on the video images frame by frame with the improved YOLOv5-Lite model;
Step 2: use the Deepsort model to extract features from the pedestrians detected in each frame in step 1 and generate npy files;
Step 3: use Fastreid for pedestrian re-identification detection, extracting features from a preset pedestrian picture gallery to generate an npy file;
Step 4: compute the cosine similarity between the feature extraction result of each pedestrian target in step 2 and the feature extraction results of the specific-pedestrian gallery in step 3, as shown in formula (1):
cos(θ) = (x_1 · x_2) / (||x_1|| ||x_2||) (1)
If the similarity is greater than the threshold γ, the target is judged to be the specific pedestrian to be re-identified and is tracked with the tracking strategy of the improved Deepsort model; otherwise no target tracking is performed.
Step 1 comprises the following sub-steps:
Step 1.1: the data-set pictures are input into the modified YOLOv5-Lite model network structure, as shown in fig. 2. The invention adds a BiFPN module (a weighted bidirectional feature pyramid network) on the basis of the original YOLOv5-Lite, effectively enhancing feature extraction. BiFPN combines cross-scale bidirectional connections with fast normalization, weighting the input features and letting the network learn the weights itself; the weights are normalized to between 0 and 1 by Softmax-based fusion, as shown in formula (2):
O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i (2)
where w_i and w_j are learnable weights and I_i is the i-th input feature.
Step 1.2: the convolutional neural network extracts features from the picture and outputs a feature map; the picture is divided into small blocks and anchor boxes are generated; the labelled prediction boxes are associated with the feature map; finally the loss function is built and end-to-end training begins. The loss function is shown in formula (3):
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv (3)
where ρ(b, b^gt) represents the Euclidean distance between the centre points of the prediction box b and the ground-truth box b^gt, c represents the diagonal length of the smallest enclosing region that contains both the prediction box and the ground-truth box, v measures the consistency of the aspect ratios of the two rectangular boxes, and α is a weighting factor:
v = (4/π²)(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v) (4)
the step 3 comprises the following substeps:
step 3.1: after an input picture is preprocessed, a pre-training model ResNet50 is called as a backhaul, an output feature graph represents an object through a feature vector in a polymerization mode through Gem Pooling, the feature vector obtained in the front is changed to a certain extent through a Bnneck module, and finally triple loss is defined to learn similarity in classification and discrimination in classification, so that direct discrimination between different feature vectors is more obvious, and the same feature vectors converge.
The triple loss input is a triple including an Anchor (Anchor) example, a Positive (Positive) example and a Negative (Negative) example, similarity calculation between samples is realized by optimizing the distance between the Anchor example and the Positive example to be smaller than the distance between the Anchor example and the Negative example, and a: anchor, anchor example; p: positive, a sample of the same class as a; n: negative, a sample of a different class than a; margin is a constant greater than 0:
L = max(d(a, p) - d(a, n) + margin, 0) (5)
the invention improves the tracking strategy in Deepsort, and the tracking strategy in the step 4 comprises the following substeps:
step 4.1: and tracking the specific pedestrian target selected for re-identification by using an NSA Kalman filtering algorithm. In particular, the appearance state of the ith track at the frame t is updated in an exponential moving average mode
Wherein f is i t Is the appearance embedding of the current match detection and α ═ 0.9 is the momentum term.
Adaptive noise is added into the algorithm at the same time to enhance the robustness of tracking. Wherein the covariance of the noise is adaptedAs shown in equation (7):
wherein R is k Is a predetermined constant measured noise covariance, c k Is the detection confidence score in state k and no longer uses only the appearance feature distance in the matching process, but considers both appearance and motion information.
In order to solve the problem that the matching precision is limited by the additional prior constraint, common global linear distribution is adopted to replace matching cascade. Wherein the allocation matrix C is the appearance cost A a And cost of action A m Weighted sum of (c):
C=λA a +(1-λ)A m (8)
with the weighting factor lambda set to 0.98.
Step 4.2: after the tracking track is predicted through a Kalman filtering algorithm, a track is predicted for a current frame, if confirmation (pedestrians or vehicles) is predicted, detection is carried out on the current frame, then the detection frame and the confirmed track frame are in cascade matching, and the tracked detection frame is updated after matching is completed.
If the track matching fails, IoU matching is carried out again, if the matching is successful, then updating is carried out again, and then the tracking process of prediction-observation-updating is repeated. IoU matching failures are divided into observation matching failures and trajectory matching failures: for observation matching failure, a method of establishing a new track is adopted, then three times of investigation is carried out, and if the target (pedestrian or vehicle) is still an actual target, the confirmation is carried out; and judging whether the track matching fails, judging whether the track matching is confirmed to be a pedestrian or a vehicle, if the track matching is not confirmed, deleting the track matching, otherwise, setting a threshold value for the track matching, if the track matching is larger than the threshold value max _ age, deleting the track matching, considering that the track matching is moved out of the observation range, and if the track matching is smaller than the threshold value, carrying out three times of investigation on the track matching again and returning to the initial stage.
Example (b):
as shown in fig. 1: firstly, intercepting a picture of a target pedestrian, then, performing feature extraction on an intercepted pedestrian base through a Fastreid feature extraction model to generate a corresponding npy file, reading in a video to be detected, detecting all pedestrians in a current video frame by using a YOLOv5-Lite target detection algorithm, then, performing feature extraction on the detected pedestrians by using a Deepsort algorithm to generate a npy file, calculating cosine similarity of the two generated npy files, judging the pedestrians as the target pedestrians if the similarity is greater than a threshold gamma, tracking the target pedestrians by using the Deepsort algorithm, and finally, displaying the whole flow through simple visualization, wherein the similarity is less than the threshold gamma and is a non-target pedestrian.
As shown in fig. 2: the Concat modules of the original network header are all replaced with BiFPN _ Concat modules.
As shown in fig. 3, the ordinary Kalman filter is replaced by the NSA Kalman filter algorithm, an ordinary global linear assignment replaces the matching cascade, and the track appearance is updated by an exponential moving average (EMA).
Fig. 4 shows a picture of a pedestrian to be searched, which is captured in advance.
The search results of the pedestrian to be searched in the area 1 are shown in fig. 5-6.
Fig. 7-8 show the search result of the pedestrian to be searched in the area 2.
TABLE 1 comparison of indices before and after improvement of the YOLOv5-Lite Algorithm
TABLE 2 comparison of indices before and after Deepsort algorithm improvement
TABLE 3 comparison of indices before and after Deepsort algorithm improvement (continued)
As shown in table 1, with a picture input size of 640 × 640, the overall size of the improved model increases slightly; with mAP@0.5 and mAP@0.5:0.95 essentially level before and after the improvement, and with Recall and frame rate (FPS) dropping slightly, the precision of the model rises by 3%, from 0.89 to 0.92. This indicates that the improved model gains a certain amount of accuracy; compared with other algorithms, it improves considerably in both model size and precision and performs better on the test set.
As shown in tables 2-3, IDR rises from 21.7 to 24.9, IDP from 71.8 to 74.7 and IDF1 from 33.3 to 37.4, showing that both the recall and the precision of correct identifications improve markedly. Rcll rises from 27.2 to 31.3 and Prcn from 89.9 to 94.0, showing that the improved Deepsort algorithm is clearly more precise. FAR falls from 0.63 to 0.42, i.e. fewer false identifications occur per frame after the improvement. MT rises from 25 to 30 and ML falls from 339 to 307, meaning more ground-truth tracks are successfully tracked for over 80% of their frames and fewer for under 20%. The number of false positives (FP) falls from 3352 to 2214 and the number of missed detections (FN) from 80411 to 75817. IDs rises from 218 to 239, so ID switches occur somewhat more often after the change; FM rises from 1121 to 1190, reflecting the algorithm's improved ability to re-acquire and continue tracking targets that reappear after occlusion. MOTA rises from 23.9 to 29.1 and MOTP from 78.4 to 78.5, showing gains in both detection quality and tracking accuracy. Overall, the analysis shows that the algorithm improvements greatly increase Deepsort's tracking performance and accuracy, outperforming the original Deepsort on the same data set.
The innovation points of the invention are as follows:
firstly, aiming at the branch model v5Lite-g of YOLOv5-Lite, the network header is modified, and the Concat is completely replaced by BiFPN _ Concat.
The second improvement replaces the ordinary Kalman filter with the NSA Kalman filter algorithm and introduces an adaptive noise covariance calculation:
R̃_k = (1 - c_k) R_k (1)
where R_k is the preset constant measurement-noise covariance and c_k is the detection confidence score in state k; the matching process no longer uses the appearance feature distance alone but considers both appearance and motion information.
The cost matrix C is the weighted sum of the appearance cost A_a and the motion cost A_m:
C = λ A_a + (1 - λ) A_m (2)
where the weighting factor λ is set to 0.98. In addition, to remove the extra prior constraint that limits matching precision, an ordinary global linear assignment replaces the matching cascade. The track appearance is updated by an exponential moving average:
e_i^t = α e_i^{t-1} + (1 - α) f_i^t (3)
where f_i^t is the appearance embedding of the currently matched detection and α = 0.9 is the momentum term.
Third, the Fastreid model file with the .pth suffix is converted into a model file with the .onnx suffix.
Fourth, pedestrian detection is changed to interval-frame detection: the YOLOv5-Lite detection model detects all pedestrians in the video once every other frame; a frame-rate display module is also added to the real-time video visualization interface.
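The fourth change (running the detector only every other frame and reusing the last detections in between) can be sketched as:

```python
def detect_every_other_frame(frames, detector):
    """Run the detector on even-indexed frames only and carry the previous
    detections forward for the skipped frames (assumed simplification)."""
    results, last = [], []
    for i, frame in enumerate(frames):
        if i % 2 == 0:
            last = detector(frame)
        results.append(last)
    return results
```

Halving the number of detector invocations roughly halves the detection cost per second of video, at the price of boxes lagging one frame on skipped frames.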
Claims (4)
1. A pedestrian tracking and re-identification method based on deep learning, characterized by comprising the following steps:
Step 1: perform pedestrian target detection on the video images frame by frame with the improved YOLOv5-Lite model;
Step 2: use the Deepsort model to extract features from the pedestrians detected in each frame in step 1 and generate an npy file;
Step 3: use Fastreid for pedestrian re-identification detection, extracting features from a preset pedestrian picture gallery to generate an npy file;
Step 4: compute the cosine similarity between the feature extraction result of each pedestrian target in step 2 and the feature extraction results of the specific-pedestrian gallery in step 3, as shown in formula (1):
cos(θ) = (x_1 · x_2) / (||x_1|| ||x_2||) (1)
where x_1 and x_2 are two non-zero vectors; if the similarity is greater than the threshold γ, the target is judged to be the specific pedestrian to be re-identified and is tracked with the tracking strategy of the improved Deepsort model; otherwise no target tracking is performed.
2. The deep learning-based pedestrian tracking and re-identification method according to claim 1, wherein the step 1 comprises the following sub-steps:
Step 1.1: input the data-set pictures into the improved YOLOv5-Lite model network structure; a BiFPN module is added on the basis of YOLOv5-Lite; BiFPN combines cross-scale bidirectional connections with fast normalization, weighting the input features and letting the network learn the weights itself; the weights are normalized to between 0 and 1 by Softmax-based fusion, as shown in formula (2):
O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i (2)
where w_i and w_j are learnable weights and I_i is the i-th input feature;
Step 1.2: use the convolutional neural network to extract features from the picture and output a feature map; divide the picture into small blocks and generate anchor boxes; associate the labelled prediction boxes with the feature map; finally build the loss function and start end-to-end training, the loss function being shown in formula (3):
L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv (3)
where ρ(b, b^gt) represents the Euclidean distance between the centre points of the prediction box and the ground-truth box, c represents the diagonal length of the smallest enclosing region that contains both the prediction box and the ground-truth box, v measures the consistency of the aspect ratios of the two rectangular boxes, and α is a weighting factor:
v = (4/π²)(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v) (4)
3. the deep learning-based pedestrian tracking and re-identification method according to claim 1, wherein the step 3 comprises the following sub-steps:
Step 3.1: after the input picture is preprocessed, call the pre-trained ResNet50 model as the backbone; aggregate the output feature map with GeM Pooling into a feature vector that represents the object; adjust the feature vector with a BNNeck module; finally define a triplet loss to learn inter-class separation and intra-class similarity, so that different feature vectors become more clearly separated and feature vectors of the same identity converge;
the triplet-loss input is a triplet consisting of an anchor example, a positive example and a negative example; similarity between samples is learned by constraining the distance between the anchor and the positive example to be smaller than the distance between the anchor and the negative example, where a is the anchor example, p is a sample of the same class as a, n is a sample of a different class from a, and margin is a constant greater than 0:
L=max(d(a,p)-d(a,n)+margin,0)(5)。
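Formula (5) in plain code; the Euclidean distance for d and the margin value 0.3 are illustrative assumptions (the claim only requires margin > 0).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Formula (5): L = max(d(a, p) - d(a, n) + margin, 0)."""
    d_ap = float(np.linalg.norm(anchor - positive))  # anchor-positive distance
    d_an = float(np.linalg.norm(anchor - negative))  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so already well-separated triplets contribute no gradient.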
4. The deep learning-based pedestrian tracking and re-identification method according to claim 1, wherein the tracking strategy in the step 4 comprises the following sub-steps:
step 4.1: tracking the specific pedestrian target selected for re-identification with the NSA Kalman filtering algorithm; specifically, the appearance state e_i^t of the i-th track at frame t is updated in an exponential moving average manner, as shown in formula (6):

e_i^t = α e_i^{t−1} + (1 − α) f_i^t (6)

where f_i^t is the appearance embedding of the currently matched detection and α = 0.9 is the momentum term.
Adaptive noise is added at the same time to enhance the robustness of tracking, with the adaptive noise covariance R̃_k as shown in formula (7):

R̃_k = (1 − c_k) R_k (7)

where R_k is the preset constant measurement noise covariance and c_k is the detection confidence score in state k; the matching process no longer uses the appearance feature distance alone, but considers both appearance and motion information;
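Formulas (6) and (7) can be sketched as follows; the unit-length renormalization of the embedding is an assumption borrowed from common DeepSORT-style implementations, not stated in the claim.

```python
import numpy as np

def ema_update(e_prev, f_det, alpha=0.9):
    """Formula (6): e_i^t = alpha * e_i^(t-1) + (1 - alpha) * f_i^t."""
    e = alpha * e_prev + (1 - alpha) * f_det
    return e / np.linalg.norm(e)  # assumed: keep the embedding unit-length

def nsa_noise(R_k, c_k):
    """Formula (7): adaptive covariance (1 - c_k) * R_k; a confident
    detection (c_k near 1) shrinks the measurement noise."""
    return (1 - c_k) * R_k
```

A detection with confidence 1.0 drives the measurement noise to zero, so the Kalman update trusts it completely.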
the matching cascade is replaced by a common global linear distribution, wherein the distribution matrix C is the appearance cost A a And cost of action A m Weighted sum of (c):
C=λA a +(1-λ)A m (8)
wherein the weighting factor λ is set to 0.98;
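The fused assignment cost of formula (8) is a one-liner; in practice the resulting matrix C would then be fed to a global linear-assignment solver such as `scipy.optimize.linear_sum_assignment` (not shown here).

```python
import numpy as np

def fused_cost(A_a, A_m, lam=0.98):
    """Formula (8): C = lam * A_a + (1 - lam) * A_m, with lam = 0.98."""
    return lam * A_a + (1 - lam) * A_m
```

With λ = 0.98 the appearance cost dominates, and the motion cost mostly serves to break ties between visually similar candidates.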
step 4.2: after prediction by the Kalman filtering algorithm, a track is predicted for the current frame; if a confirmed object (pedestrian or vehicle) is predicted, detection is performed on the current frame, the detection boxes are then cascade-matched with the confirmed track boxes, and the tracked detection boxes are updated after matching is completed;
if track matching fails, IoU matching is performed again, and if this matching succeeds the update is performed again, after which the predict-observe-update tracking cycle is repeated; IoU matching failure is divided into observation matching failure and track matching failure: for observation matching failure, a new track is established and then inspected three times, and if the target (pedestrian or vehicle) is still a real target it is confirmed; for track matching failure, it is judged whether the track has been confirmed as a pedestrian or vehicle: if it has not been confirmed it is deleted; otherwise a threshold is set for it, and if the number of missed frames is larger than the threshold max_age the track is deleted and considered to have moved out of the observation range, while if it is smaller than the threshold the track is inspected three more times and returned to the initial stage.
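The confirm/delete lifecycle of step 4.2 can be sketched as a small state machine; the class name, the `MAX_AGE = 30` value, and the method names are illustrative assumptions (the claim only fixes the three inspections and an unspecified max_age threshold).

```python
class Track:
    """Sketch of the predict-observe-update lifecycle in step 4.2."""
    N_INIT = 3    # "three inspections" before a new track is confirmed
    MAX_AGE = 30  # assumed max_age value; the claim leaves it open

    def __init__(self):
        self.state = "tentative"
        self.hits = 0
        self.misses = 0

    def mark_matched(self):
        """Detection matched (cascade or IoU): update and maybe confirm."""
        self.hits += 1
        self.misses = 0
        if self.state == "tentative" and self.hits >= self.N_INIT:
            self.state = "confirmed"

    def mark_missed(self):
        """Both cascade and IoU matching failed for this frame."""
        self.misses += 1
        if self.state == "tentative":
            self.state = "deleted"   # unconfirmed tracks are dropped at once
        elif self.misses > self.MAX_AGE:
            self.state = "deleted"   # assumed to have left the observed area
```

A tentative track that is matched three frames in a row becomes confirmed; a tentative track that misses once is deleted immediately, mirroring the claim's handling of unconfirmed tracks.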
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210657848.6A CN115063836A (en) | 2022-06-10 | 2022-06-10 | Pedestrian tracking and re-identification method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115063836A true CN115063836A (en) | 2022-09-16 |
Family
ID=83200418
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115620242A (en) * | 2022-12-19 | 2023-01-17 | 城云科技(中国)有限公司 | Multi-line person target re-identification method, device and application |
CN116453103A (en) * | 2023-06-15 | 2023-07-18 | 松立控股集团股份有限公司 | Vehicle cross-mirror tracking license plate recognition method, system and electronic equipment |
CN116766213A (en) * | 2023-08-24 | 2023-09-19 | 烟台大学 | Bionic hand control method, system and equipment based on image processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||