CN112614159B - Cross-camera multi-target tracking method for warehouse scene

Info

Publication number
CN112614159B
CN112614159B (application CN202011530890.9A)
Authority
CN
China
Prior art keywords
target
tracking
frame
pool
main channel
Prior art date
Legal status
Active
Application number
CN202011530890.9A
Other languages
Chinese (zh)
Other versions
CN112614159A (en)
Inventor
张森镇
周韬
吴均峰
陈积明
史治国
贺诗波
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202011530890.9A
Publication of CN112614159A
Application granted
Publication of CN112614159B
Legal status: Active
Anticipated expiration

Classifications

    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06V10/751: Image or video pattern matching; comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06T2207/10016: Image acquisition modality; video; image sequence
    • G06T2207/20081: Special algorithmic details; training; learning
    • G06T2207/30241: Subject of image; trajectory
    • G06V2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The invention discloses a cross-camera multi-target tracking method for warehouse scenes, comprising three main parts: target detection, multi-target tracking, and trajectory mapping. The target detector achieves accurate detection even when targets occlude one another; the multi-target tracker predicts the position of each target in the current frame from that target's template frame; target states and tracking-pool information are updated by optimally matching detection boxes against multi-target tracking boxes; cross-camera tracking is achieved by exploiting the spatio-temporal relations among cameras together with target appearance features; and target trajectories are projected onto a unified warehouse map to visualize the tracking process. The method enables robust cross-camera detection and tracking in warehouse scenes.

Description

Cross-camera multi-target tracking method for warehouse scene
Technical Field
The invention relates to the field of computer vision, and in particular to a cross-camera multi-target tracking method for warehouse scenes.
Background
In recent years, advances in computer vision and Artificial Intelligence (AI) have driven rapid growth in the security industry: the number of deployed cameras keeps rising, AI applications built on cameras as infrastructure are expanding quickly, and enterprise demand for intelligent camera capabilities is surging. Among computer vision tasks, target tracking is widely applied to surveillance video, with pedestrian tracking and vehicle tracking being the most common applications.
Target tracking has been studied in computer vision for many years and can be divided by task into: 1. single-camera single-target tracking; 2. single-camera multi-target tracking; 3. cross-camera multi-target tracking. The invention addresses the third task, Multi-Target Multi-Camera Tracking (MTMC Tracking). Most multi-target tracking methods proposed at home and abroad center on the design of an overall tracking framework. One mainstream framework is based on target detection plus globally optimal target matching: a feature extraction model extracts target appearance features, which are combined with target position and motion information to achieve the best matching effect. This scheme has the following drawbacks: 1) when different targets look very similar, mismatches occur easily; 2) because targets may occlude one another, the extracted appearance features are unreliable; 3) it is slow: the feature extraction model often costs as much time as the detection model, and in dense scenes the large number of targets further increases the cost of global matching.
in summary, the present invention finally solves the following problems: 1) Target identity switching caused by insufficient degree of distinction of target appearance; 2) Target tracking loss caused by mutual shielding of targets; 3) The tracking algorithm overhead due to the introduction of the additional feature extraction model is increased.
Disclosure of Invention
The invention aims to provide, in view of the deficiencies of the prior art, a cross-camera multi-target tracking method for warehouse scenes that achieves robust cross-camera detection and tracking of multiple targets.
The object of the invention is achieved by the following technical scheme: a cross-camera multi-target tracking method for warehouse scenes, comprising the following steps:
each camera in the warehouse monitors one shelf channel, and the cameras together cover all channels between shelves; each camera partitions tracking pools according to the number and positions of channel entrances and exits in its field of view, namely 1 main-channel survival tracking pool, 1 main-channel death tracking pool, and several entrance/exit death tracking pools;
acquiring video streams in which the cameras in the warehouse capture the inter-shelf channels from an overhead view;
detecting target objects in the video streams, frame by frame or with frame skipping, using a deep-learning-based target detector;
when there is no tracking target in the first frame or in the main-channel survival tracking pool, using a detection box produced by the target detector as the template frame, creating a new target, and initializing the main-channel survival tracking pool;
when a tracking target exists in a subsequent frame or in the main-channel survival tracking pool, predicting the positions of the target objects in the current frame, using a Siamese-network-based multi-target tracker, from the targets' template frames in the main-channel survival tracking pool;
in the subsequent tracking process, performing optimal IoU matching between the tracking boxes predicted by the multi-target tracker and the detection boxes produced by the target detector on the current frame, and updating the targets in each tracking pool according to the matching result: if a target's tracking box matches a detection box, updating that detection box as the target's latest template frame; if a target's tracking box goes unmatched for several consecutive frames, moving the target from the main-channel survival tracking pool into the corresponding death tracking pool according to the position of its last tracking box; if a new target appears, either recovering a target from the corresponding death tracking pool by appearance-feature matching, according to the new target's position, or creating a new target, thereby achieving cross-camera tracking and re-tracking of targets after disappearance;
and mapping the tracking trajectories of the targets in the video streams onto a unified map by projective mapping.
Furthermore, each camera, relying on its corresponding death tracking pools, hands targets over to adjacent cameras according to their positional relation, thereby achieving cross-camera tracking.
Furthermore, a tracked target object carries information including its template frame, the number of consecutive abnormal tracking frames, appearance features, historical tracking boxes, and historical frame indices.
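For concreteness, this per-target record can be sketched as follows (a minimal sketch in Python; all field and type names here are illustrative assumptions, not terms fixed by the invention):

from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class Target:
    track_id: int
    template_box: np.ndarray                 # latest template frame, as box (x1, y1, x2, y2)
    lost_frames: int = 0                     # consecutive abnormal (unmatched) tracking frames
    appearance: Optional[np.ndarray] = None  # appearance feature vector
    history_boxes: List[np.ndarray] = field(default_factory=list)   # historical tracking boxes
    history_frame_ids: List[int] = field(default_factory=list)      # historical frame indices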
Furthermore, the deep-learning-based target detector predicts not only each target's position but also the probability that it is occluded; detections are classified as occluded or non-occluded against a probability threshold, and non-maximum suppression is applied to each class separately as post-processing to obtain the final detection boxes of the targets.
Furthermore, the multi-target tracker is built on a Siamese network: a search region is determined for each target in the current frame, centered on that target's template frame in the main-channel survival tracking pool, and the tracker predicts each target's tracking box in the current frame from its template frame and its search region.
Further, according to the matching result between detection boxes and tracking boxes: if a detection box matches a tracking box, the target is being tracked normally; its detection box becomes the latest template frame and its consecutive-abnormal-frame counter is reset to 0; if a detection box has no matching tracking box, it is a new target, and either a new tracking target is created with the detection box as its template frame, or a target is recovered from the corresponding death tracking pool and added back to the main-channel survival tracking pool; if a tracking box has no matching detection box, the target may have disappeared, so its consecutive-abnormal-frame counter is incremented by 1 and its template frame is not updated.
Further, when a target's number of consecutive abnormal tracking frames in the main-channel survival tracking pool exceeds a threshold, the target is moved from the survival pool into the appropriate death tracking pool according to the position of its last tracking box: if the target did not disappear in an entrance/exit area, it is added to the main-channel death tracking pool; if it disappeared in some entrance/exit area, it is added to that entrance/exit's death tracking pool.
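A minimal sketch of this routing rule, assuming axis-aligned entrance/exit regions and the Target record sketched above (the region layout and threshold value are illustrative assumptions):

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def retire_target(target, survival_pool, main_death_pool, gate_death_pools,
                  gate_regions, max_lost=30):
    # Move a target out of the main-channel survival pool once its consecutive
    # abnormal-frame counter exceeds the threshold, routing it by the position
    # of its last tracking box.
    if target.lost_frames <= max_lost:
        return
    survival_pool.remove(target)
    cx, cy = box_center(target.history_boxes[-1])
    for gate_id, (gx1, gy1, gx2, gy2) in gate_regions.items():
        if gx1 <= cx <= gx2 and gy1 <= cy <= gy2:
            gate_death_pools[gate_id].append(target)   # vanished in an entrance/exit area
            return
    main_death_pool.append(target)                     # vanished in the main channel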
Further, a newly appearing target box is first tested against the entrance/exit areas: if it lies in some entrance/exit area and a dead target exists in the corresponding entrance/exit death tracking pool of an adjacent camera or of this camera, the target is recovered from that death tracking pool by appearance-feature matching; if no such target exists, a new target is created with the detection box as its template; if the new box does not lie in any entrance/exit area and dead targets exist in the main-channel death tracking pool, the target is recovered from the main-channel death tracking pool by appearance-feature matching, and otherwise a new target is created with the detection box as its template.
Further, to recover a target from a death tracking pool, the appearance-feature correlation between the new target and every target in the pool is computed, the dead target with the highest correlation is selected for recovery, the new target's detection box becomes the recovered target's template frame, and its consecutive-abnormal-frame counter is reset to 0.
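The patent does not fix the correlation measure; the sketch below uses cosine similarity between appearance feature vectors as one plausible choice, with an assumed minimum-similarity gate:

import numpy as np

def recover_from_pool(new_feature, new_box, death_pool, min_similarity=0.5):
    # Return the dead target whose appearance feature correlates best with the
    # new detection, or None if nothing in the pool is similar enough.
    best, best_sim = None, min_similarity
    for target in death_pool:
        a, b = target.appearance, new_feature
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        if sim > best_sim:
            best, best_sim = target, sim
    if best is not None:
        death_pool.remove(best)
        best.template_box = new_box   # the new detection box becomes the template frame
        best.lost_frames = 0          # reset the consecutive abnormal-frame counter
    return best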
Further, a 3x3 projective transformation matrix is established between each camera's image coordinate system and the unified map coordinate system by projective mapping; it transforms the foot-point coordinates of tracked targets in each camera's field of view from the image coordinate system onto the unified map, so that tracking trajectories can be visualized.
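A sketch of this projective mapping with OpenCV; the four calibration point pairs below are illustrative placeholders for a per-camera hand calibration:

import cv2
import numpy as np

# Four image points (e.g. floor markings seen by one camera) and their known
# positions on the unified warehouse map; the values are illustrative only.
img_pts = np.float32([[100, 700], [1180, 700], [900, 300], [380, 300]])
map_pts = np.float32([[0, 0], [400, 0], [400, 1200], [0, 1200]])
H = cv2.getPerspectiveTransform(img_pts, map_pts)    # the 3x3 projection matrix

def foot_point_to_map(box, H):
    # Project the bottom-center (foot point) of a tracking box onto the map.
    x1, y1, x2, y2 = box
    foot = np.float32([[[(x1 + x2) / 2.0, y2]]])     # shape (1, 1, 2) as OpenCV expects
    return cv2.perspectiveTransform(foot, H)[0, 0]   # (map_x, map_y)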
The beneficial effects of the invention are: 1) it provides a multi-target tracking method based on target detection and Siamese-network multi-target tracking; 2) the method remains applicable in tracking scenes where targets look alike; 3) by optimizing the detector's performance under occlusion, it improves tracking robustness when targets cross and occlude one another; 4) the Siamese-network multi-target tracker performs multi-target tracking and target feature extraction simultaneously, improving tracking efficiency; 5) it achieves cross-camera tracking of targets by exploiting the spatio-temporal relations among the cameras.
Drawings
FIG. 1 is a framework diagram of the cross-camera multi-target tracking method for warehouse scenes provided by an embodiment of the invention;
FIG. 2 is a single-camera multi-target tracking framework diagram provided by an embodiment of the present invention;
FIG. 3 is a flowchart of updating tracking-target information according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the composition and updating of a camera's tracking pools provided by an embodiment of the present invention;
FIG. 5 is a flowchart of the target detector provided by an embodiment of the present invention;
FIG. 6 is a flowchart of the multi-target tracker provided by an embodiment of the present invention;
fig. 7 is a schematic view of a warehouse scenario to which the present invention is applied.
Detailed Description
To better understand the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments herein without creative effort fall within the protection scope of the present application.
The terminology used in the embodiments of the present application is for describing particular embodiments only and is not intended to limit the application. As used in the embodiments of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As shown in fig. 1, an embodiment of the invention provides a cross-camera multi-target tracking method for warehouse scenes, mainly comprising multiple cameras, a target detector, a multi-target tracker, and a trajectory visualization component. Video streams produced by the cameras are acquired first. The target detector detects, frame by frame or with frame skipping, the detection boxes of all targets appearing in the current frame of each camera's video stream; the multi-target tracker predicts the tracking boxes, in the current frame, of the targets in each camera's main-channel survival tracking pool; the tracking boxes and detection boxes of the current frame are optimally matched by IoU (Intersection over Union), and the matching result is analyzed to update the camera's tracking pools; adjacent cameras associate targets according to their spatio-temporal relations and the correlation of target appearance features, achieving cross-camera tracking; finally, the trajectories of all tracked targets in the camera video streams are projected onto a unified warehouse map to visualize them.
As shown in fig. 2, the single-camera multi-target tracking process comprises the following steps:
Step 2001, a video stream is input, and the detection boxes of the current frame are obtained by the target detector.
Step 2002, judging whether a tracking target exists in the main-channel survival tracking pool; if not, executing step 2003; if so, executing step 2004;
Step 2003, creating tracking targets, with the detection boxes of the target detector as template frames, to initialize the main-channel survival tracking pool, and returning to step 2001;
Step 2004, the Siamese-network-based multi-target tracker predicts the positions of the target objects in the current frame from their template frames in the main-channel survival tracking pool;
Step 2005, calculating the IoU between the current frame's tracking boxes and detection boxes, and optimally matching the detection boxes with the tracking boxes;
Step 2006, analyzing the 'births' and 'deaths' of the tracking targets according to the matching result, updating the camera's overall tracking pools, ending processing of the current frame, and returning to step 2001.
As shown in fig. 3, the specific process of updating tracking-target information from the detection and tracking results of the current frame comprises:
Step 3001, calculating the IoU distance matrix between the detection boxes and the tracking boxes;
Step 3002, optimally matching the detection boxes with the tracking boxes using an optimal target association algorithm;
Step 3003, the matching result falls into three cases: 1) a tracking box matched with a detection box means the target is tracked normally: the detection box is updated as the target's template frame and its consecutive-abnormal-frame counter is reset to 0; 2) a detection box with no matching tracking box means the detected target is not in this camera's main-channel survival tracking pool, so a target must be recovered from the corresponding death tracking pool or newly created; 3) a tracking box with no matching detection box has its target's consecutive-abnormal-frame counter incremented by 1.
As shown in fig. 4, a camera's tracking pools comprise 1 main-channel survival tracking pool, 1 main-channel death tracking pool, and several entrance/exit death tracking pools. The pools are updated as follows: 1) when a target in the main-channel 'survival' tracking pool exceeds the abnormal-frame threshold and its last tracking box lies in the main-channel area, it is moved from the survival pool to the main-channel 'death' tracking pool; 2) when such a target's last tracking box lies in an entrance/exit area, it is moved to the corresponding entrance/exit 'death' tracking pool; 3) when a newly detected target in the main-channel area is associated by appearance features with a target in the main-channel 'death' tracking pool, that dead target is recovered into the main-channel 'survival' tracking pool; 4) when a newly detected target in an entrance/exit area is associated by appearance features with a target in the corresponding entrance/exit 'death' tracking pool, that dead target is recovered into the main-channel 'survival' tracking pool.
As shown in fig. 5, the detection process of the target detector comprises the following steps:
Step 5001, preprocessing the video-stream image, mainly by resizing and normalizing it;
Step 5002, feeding the preprocessed image to the target detector, which comprises a feature extraction network, a feature fusion network, a box regression network, and a box classification network. The feature extraction network compresses the image into feature maps at several scales; the feature fusion network further fuses these multi-scale feature maps; the image is gridded according to each feature map's size, and several anchor boxes of different scales and aspect ratios are generated at each grid point; the box regression network regresses the positions of all anchor boxes and predicts their occlusion probability; the box classification network predicts the classification confidence of all anchor boxes;
Step 5003, keeping the anchor boxes whose confidence exceeds a threshold as valid prediction boxes;
Step 5004, splitting the valid prediction boxes into two classes against an occlusion-probability threshold: non-occluded prediction boxes and occluded prediction boxes;
Step 5005, applying non-maximum suppression to the non-occluded and occluded prediction boxes separately, and merging the results to obtain the final prediction boxes.
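Steps 5003-5005 amount to confidence filtering followed by occlusion-aware, class-wise non-maximum suppression; the sketch below assumes NumPy arrays of boxes, scores, and occlusion probabilities, and all threshold values are illustrative assumptions:

import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    # Standard greedy non-maximum suppression; returns indices of kept boxes.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        ious = inter / (areas[i] + areas[rest] - inter + 1e-12)
        order = rest[ious <= iou_thr]
    return keep

def postprocess(boxes, scores, occ_probs, conf_thr=0.3, occ_thr=0.5):
    valid = scores > conf_thr                  # step 5003: confidence filter
    boxes, scores, occ = boxes[valid], scores[valid], occ_probs[valid]
    occluded = occ > occ_thr                   # step 5004: split by occlusion probability
    final = []
    for mask in (~occluded, occluded):         # step 5005: NMS per class, then merge
        idx = np.where(mask)[0]
        if idx.size:
            final.extend(boxes[idx[k]] for k in nms(boxes[idx], scores[idx]))
    return final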
As shown in fig. 6, the tracking process of the multi-target tracker comprises the following steps:
Step 6001, determining each target's search region in the current frame of the video stream from the template frames of the 'surviving' main-channel tracking targets, with each search region centered on its template patch;
Step 6002, feeding the template patches and their corresponding search regions into the Siamese network in parallel to obtain each tracked target's template feature map and search-region feature map;
Step 6003, convolving the search-region feature map with the template feature map as the convolution kernel to obtain a response map;
Step 6004, decoding the tracking boxes of all tracked targets in the current frame from the peak positions of the response maps and the original search regions.
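A sketch of the correlation head in steps 6002-6004, assuming PyTorch: a grouped convolution correlates each target's template feature map against its own search-region feature map in one batched call. The backbone producing the feature maps and the network stride value are assumptions outside this sketch:

import torch
import torch.nn.functional as F

def cross_correlate(template_feat, search_feat):
    # template_feat: (B, C, th, tw); search_feat: (B, C, sh, sw).
    # Each target's template acts as the convolution kernel over its own
    # search region; returns response maps of shape (B, 1, sh-th+1, sw-tw+1).
    B, C, th, tw = template_feat.shape
    search = search_feat.reshape(1, B * C, *search_feat.shape[2:])
    resp = F.conv2d(search, template_feat, groups=B)   # (1, B, H', W')
    return resp.permute(1, 0, 2, 3)

def peaks_to_boxes(resp, search_origins, template_sizes, net_stride=8):
    # Decode each response-map peak back to a tracking box in image coordinates;
    # search_origins[b] is the top-left corner of target b's search region, and
    # template_sizes[b] its template (w, h) in image pixels.
    B, _, _, W = resp.shape
    flat = resp.reshape(B, -1).argmax(dim=1)
    boxes = []
    for b in range(B):
        py, px = divmod(flat[b].item(), W)
        ox, oy = search_origins[b]
        tw, th = template_sizes[b]
        x1 = ox + px * net_stride                      # peak offset, in image pixels
        y1 = oy + py * net_stride
        boxes.append((x1, y1, x1 + tw, y1 + th))
    return boxes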
As shown in fig. 7, which is a schematic view of the warehouse in the invention, each camera in the warehouse monitors one shelf channel, and the cameras together cover all the main channels between shelves.
The above description covers only preferred embodiments of the present disclosure and is not intended to limit its scope; any modification, equivalent substitution, or improvement made within the spirit and principles of the present disclosure shall fall within its protection scope.

Claims (1)

1. A cross-camera multi-target tracking method for warehouse scenes is characterized by comprising the following steps:
each camera in the warehouse monitors one shelf channel, and the cameras together cover all channels between shelves; each camera partitions tracking pools according to the number and positions of channel entrances and exits in its field of view, namely 1 main-channel survival tracking pool, 1 main-channel death tracking pool, and several entrance/exit death tracking pools;
acquiring video streams in which the cameras in the warehouse capture the inter-shelf channels from an overhead view; each camera, relying on its corresponding death tracking pools, hands targets over to adjacent cameras according to their positional relation, thereby achieving cross-camera tracking;
detecting target objects in the video streams, frame by frame or with frame skipping, using a deep-learning-based target detector; the detector predicts not only each target's position but also the probability that it is occluded, classifies detections as occluded or non-occluded against a probability threshold, and applies non-maximum suppression to each class separately to obtain the final detection boxes of the targets;
when there is no tracking target in the first frame or in the main-channel survival tracking pool, using a detection box produced by the target detector as the template frame, creating a new target, and initializing the main-channel survival tracking pool;
when a tracking target exists in a subsequent frame or in the main-channel survival tracking pool, predicting the positions of the target objects in the current frame, using a Siamese-network-based multi-target tracker, from the targets' template frames in the main-channel survival tracking pool; a search region is determined for each target in the current frame, centered on that target's template frame, and the tracker predicts each target's tracking box in the current frame from its template frame and its corresponding search region;
in the subsequent tracking process, performing optimal IoU matching between the tracking boxes predicted by the multi-target tracker and the detection boxes produced by the target detector on the current frame, and updating the targets in each tracking pool according to the matching result: if a detection box matches a tracking box, the target is being tracked normally; its detection box becomes the latest template frame and its consecutive-abnormal-frame counter is reset to 0; if a detection box has no matching tracking box, it is a new target, and either a new tracking target is created with the detection box as its template frame, or a target is recovered from the corresponding death tracking pool and added back to the main-channel survival tracking pool; if a tracking box has no matching detection box, the target may have disappeared, so its consecutive-abnormal-frame counter is incremented by 1 and its template frame is not updated;
if a target's tracking box goes unmatched for several consecutive frames, moving the target from the main-channel survival tracking pool into the corresponding death tracking pool according to the position of its last tracking box; if a new target appears, either recovering a target from the corresponding death tracking pool by appearance-feature matching, according to the new target's position, or creating a new target, thereby achieving cross-camera tracking and re-tracking of targets after disappearance; when a target's consecutive abnormal tracking frames in the main-channel survival tracking pool exceed a threshold, the target is moved from the survival pool into the appropriate death tracking pool according to the position of its last tracking box: if the target did not disappear in an entrance/exit area, it is added to the main-channel death tracking pool, and if it disappeared in some entrance/exit area, it is added to that entrance/exit's death tracking pool;
if a newly appearing target box lies in some entrance/exit area and a dead target exists in the corresponding entrance/exit death tracking pool of an adjacent camera or of this camera, the target is recovered from that death tracking pool by appearance-feature matching; if no such target exists, a new target is created with the detection box as its template; if the new box does not lie in any entrance/exit area and dead targets exist in the main-channel death tracking pool, the target is recovered from the main-channel death tracking pool by appearance-feature matching, and otherwise a new target is created with the detection box as its template;
when recovering a target from a death tracking pool, the appearance-feature correlation between the new target and every target in the pool is computed, the dead target with the highest correlation is selected for recovery, the new target's detection box becomes the recovered target's template frame, and its consecutive-abnormal-frame counter is reset to 0;
a tracked target object carries information including its template frame, the number of consecutive abnormal tracking frames, appearance features, historical tracking boxes, and historical frame indices;
and establishing a 3x3 projective transformation matrix between each camera's image coordinate system and the unified map coordinate system by projective mapping, which transforms the foot-point coordinates of tracked targets in each camera's field of view from the image coordinate system onto the unified map, so that tracking trajectories can be visualized.
CN202011530890.9A 2020-12-22 2020-12-22 Cross-camera multi-target tracking method for warehouse scene Active CN112614159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011530890.9A CN112614159B (en) 2020-12-22 2020-12-22 Cross-camera multi-target tracking method for warehouse scene

Publications (2)

Publication Number Publication Date
CN112614159A CN112614159A (en) 2021-04-06
CN112614159B (en) 2023-04-07

Family

ID=75245360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011530890.9A Active CN112614159B (en) 2020-12-22 2020-12-22 Cross-camera multi-target tracking method for warehouse scene

Country Status (1)

Country Link
CN (1) CN112614159B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114379592B (en) * 2022-01-10 2024-06-14 中国第一汽车股份有限公司 Target association method, device, electronic equipment and storage medium
CN114757972B (en) * 2022-04-15 2023-10-10 中国电信股份有限公司 Target tracking method, device, electronic equipment and computer readable storage medium
CN115546192B (en) * 2022-11-03 2023-03-21 中国平安财产保险股份有限公司 Livestock quantity identification method, device, equipment and storage medium
CN116402857B (en) * 2023-04-14 2023-11-07 北京天睿空间科技股份有限公司 Moving target cross-lens tracking method based on three-dimensional calibration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717403A (en) * 2019-09-16 2020-01-21 国网江西省电力有限公司电力科学研究院 Face multi-target tracking method
CN111145213A (en) * 2019-12-10 2020-05-12 ***股份有限公司 Target tracking method, device and system and computer readable storage medium
CN111292355A (en) * 2020-02-12 2020-06-16 江南大学 Nuclear correlation filtering multi-target tracking method fusing motion information
CN111696128A (en) * 2020-05-27 2020-09-22 南京博雅集智智能技术有限公司 High-speed multi-target detection tracking and target image optimization method and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2545900B (en) * 2015-12-21 2020-08-12 Canon Kk Method, device, and computer program for re-identification of objects in images obtained from a plurality of cameras
CN108198200B (en) * 2018-01-26 2022-03-08 福州大学 Method for tracking specified pedestrian on line under cross-camera scene
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
CN112102372A (en) * 2020-09-16 2020-12-18 上海麦图信息科技有限公司 Cross-camera track tracking system for airport ground object

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717403A (en) * 2019-09-16 2020-01-21 国网江西省电力有限公司电力科学研究院 Face multi-target tracking method
CN111145213A (en) * 2019-12-10 2020-05-12 ***股份有限公司 Target tracking method, device and system and computer readable storage medium
CN111292355A (en) * 2020-02-12 2020-06-16 江南大学 Nuclear correlation filtering multi-target tracking method fusing motion information
CN111696128A (en) * 2020-05-27 2020-09-22 南京博雅集智智能技术有限公司 High-speed multi-target detection tracking and target image optimization method and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Research on Pedestrian Re-identification Technology Based on Sparse Representation"; 李小宝; China Excellent Master's Theses Full-text Database, Information Science and Technology; full text *
"Research on Cross-Camera Multi-Target Tracking"; 熊月; China Excellent Master's Theses Full-text Database, Information Science and Technology; full text *

Also Published As

Publication number Publication date
CN112614159A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112614159B (en) Cross-camera multi-target tracking method for warehouse scene
Manafifard et al. A survey on player tracking in soccer videos
US10268900B2 (en) Real-time detection, tracking and occlusion reasoning
Yang et al. Real-time multiple objects tracking with occlusion handling in dynamic scenes
CN109872341B (en) High-altitude parabolic detection method and system based on computer vision
US10339386B2 (en) Unusual event detection in wide-angle video (based on moving object trajectories)
Brown et al. Performance evaluation of surveillance systems under varying conditions
Gabriel et al. The state of the art in multiple object tracking under occlusion in video sequences
Cucchiara et al. The Sakbot system for moving object detection and tracking
CN101344965A (en) Tracking system based on binocular camera shooting
CN111161309B (en) Searching and positioning method for vehicle-mounted video dynamic target
Denman et al. Multi-spectral fusion for surveillance systems
CN110610150A (en) Tracking method, device, computing equipment and medium of target moving object
CN113223045A (en) Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation
CN113763427A (en) Multi-target tracking method based on coarse-fine shielding processing
Kumar et al. Real time target tracking with pan tilt zoom camera
Mousse et al. People counting via multiple views using a fast information fusion approach
CN109816700B (en) Information statistical method based on target identification
CN114071015A (en) Method, device, medium and equipment for determining linkage snapshot path
Ellis Multi-camera video surveillance
Xu et al. Smart video surveillance system
CN111369578B (en) Intelligent tracking method and system for cradle head transaction
Almomani et al. Segtrack: A novel tracking system with improved object segmentation
CN109815861B (en) User behavior information statistical method based on face recognition
Sincan et al. Moving object detection by a mounted moving camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant