CN114743130A - Multi-target pedestrian tracking method and system - Google Patents

Multi-target pedestrian tracking method and system

Info

Publication number
CN114743130A
Authority
CN
China
Prior art keywords
target detection
track
target
result
detection frame
Prior art date
Legal status
Pending
Application number
CN202210264036.5A
Other languages
Chinese (zh)
Inventor
刘海英
郑太恒
邓立霞
孙凤乾
王超平
Current Assignee
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202210264036.5A priority Critical patent/CN114743130A/en
Publication of CN114743130A publication Critical patent/CN114743130A/en
Pending legal-status Critical Current

Classifications

    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/04 — Neural networks; architecture, e.g. interconnection topology
    • G06N3/08 — Neural networks; learning methods


Abstract

The invention discloses a multi-target pedestrian tracking method and a multi-target pedestrian tracking system, wherein the method comprises the following steps: acquiring a video to be processed; marking a plurality of target pedestrians in a first frame of a video to be processed; carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame; extracting the features of the image in the target detection frame; performing state prediction and track generation on the target detection frame; determining a correlation cost based on the feature extraction result, the state prediction result and the track generation result; matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrian. The model weight size is reduced while maintaining accuracy.

Description

Multi-target pedestrian tracking method and system
Technical Field
The invention relates to the technical field of multi-target tracking, in particular to a multi-target pedestrian tracking method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Target tracking has long been a challenging research direction in the field of machine vision, and in recent years multi-target tracking has become a key research object for many researchers. Multi-target tracking assigns a corresponding ID to each distinct object in a video and tracks it through all subsequent frames; different objects have different IDs, and in theory the ID of a given object never changes. Unlike target detection, target tracking can accurately locate the same object in subsequent frames and can also predict the object's trajectory; these characteristics give multi-target tracking a large application space in areas such as automatic driving and intelligent monitoring.
In recent years, with the continuous updating of GPU hardware, deep learning has become a research hotspot, and target tracking based on deep learning offers higher accuracy and better real-time performance than traditional methods. The classic DeepSort multi-target tracking algorithm has been applied in many areas and provides a series of solutions for problems in multi-target tracking such as ID switches and poor real-time performance.
The traditional DeepSort detector and feature extractor adopt large-scale neural networks, which bring advantages such as high precision, strong real-time performance, a low miss rate, and few ID switches. However, the usage cost is high: small devices and mobile terminals with poor hardware conditions lack the storage space, GPU power, and heat dissipation needed to support the algorithm.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-target pedestrian tracking method and a multi-target pedestrian tracking system. The target detection part uses the latest yolov5, and target tracking uses DeepSort; in the tracker, the feature extraction network of the traditional DeepSort is replaced with the lightweight ShuffleNetV2, so that the model weight is reduced while accuracy is maintained.
In a first aspect, the invention provides a multi-target pedestrian tracking method;
the multi-target pedestrian tracking method comprises the following steps:
acquiring a video to be processed; marking a plurality of target pedestrians in a first frame of a video to be processed;
carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame;
extracting the features of the image in the target detection frame;
performing state prediction and track generation on the target detection frame;
determining a correlation cost based on the feature extraction result, the state prediction result and the track generation result;
matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a primary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrians.
In a second aspect, the present invention provides a multi-target pedestrian tracking system;
a multi-target pedestrian tracking system, comprising:
an acquisition module configured to: acquiring a video to be processed; marking a plurality of target pedestrians in a first frame of a video to be processed;
an object detection module configured to: carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame;
a feature extraction module configured to: extracting the features of the image in the target detection frame;
a state prediction and trajectory generation module configured to: performing state prediction and track generation on the target detection frame;
an association cost determination module configured to: determining an association cost based on the feature extraction result, the state prediction result and the track generation result;
a tracking module configured to: matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrian.
In a third aspect, the present invention further provides an electronic device, including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention also provides a storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the instructions of the method of the first aspect.
In a fifth aspect, the invention also provides a computer program product comprising a computer program for implementing the method of the first aspect when run on one or more processors.
Compared with the prior art, the invention has the beneficial effects that:
the method uses the ShuffleNet V2 network proposed based on the principle to be combined with the traditional Deepsort to replace a feature extraction network in a Deepsort tracker, so that the complexity of the model and the weight parameter are greatly reduced, the ShuffleNet V2 makes a large amount of modification on the basis of the ShuffleNet V1, and the operation of point-by-point volume and bottleneck structure, which can increase the memory access cost, and the like are modified. The modified DeepSort can be operated on the low-performance embedded terminal equipment with poor hardware equipment, and the applicability of the algorithm is increased.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the present invention;
FIGS. 2(a) and 2(b) are schematic diagrams of block and downsampling layer structures in the network structure of ShuffleNet V2;
FIGS. 3(a) and 3(b) are size comparisons of two network models;
fig. 4 is a schematic diagram of the final detection effect.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and used lawfully, in compliance with laws and regulations and with user consent.
Example one
The embodiment provides a multi-target pedestrian tracking method;
as shown in fig. 1, the multi-target pedestrian tracking method includes:
s101: acquiring a video to be processed; marking a plurality of target pedestrians of a first frame of a video to be processed;
s102: carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame;
s103: extracting the features of the image in the target detection frame;
s104: performing state prediction and track generation on the target detection frame;
s105: determining a correlation cost based on the feature extraction result, the state prediction result and the track generation result;
s106: matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrian.
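The per-frame flow of steps S101-S106 can be sketched as follows. The detector, feature extractor, and association step are hypothetical stubs standing in for Yolov5s, ShuffleNetV2, and the cascade matching described later; only the overall control flow is meant to be faithful:

```python
def detect(frame):
    # stand-in for the Yolov5s detector: one BoundingBox (x, y, w, h)
    return [(10, 10, 30, 60)]

def extract_features(frame, boxes):
    # stand-in for the ShuffleNetV2 feature extractor
    return [[0.0] * 8 for _ in boxes]

def center_dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def track_video(frames):
    tracks = {}     # track id -> last BoundingBox
    next_id = 1
    results = []
    for i, frame in enumerate(frames):
        if i == 0:
            continue                              # S101: first frame only marked
        boxes = detect(frame)                     # S102: target detection
        feats = extract_features(frame, boxes)    # S103: feature extraction
        for box, feat in zip(boxes, feats):
            # S104-S106 collapsed: associate with the nearest existing
            # track, otherwise start a new one
            tid = min(tracks, key=lambda t: center_dist(tracks[t], box),
                      default=None)
            if tid is None:
                tid = next_id
                next_id += 1
            tracks[tid] = box
            results.append((i, tid))
    return results
```

With a single stationary target, frames 2 and 3 both resolve to track id 1.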
Further, the S102: carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame; specifically, target detection is performed using a trained Yolov5s target detection network.
Further, the Yolov5s target detection network comprises a CSPNet network for feature extraction and a PANet network for feature fusion, connected in sequence.
Further, the training process of the trained Yolov5s target detection network includes:
constructing a first training set; the first training set is a video of a known target detection frame;
and inputting the first training set into a Yolov5s target detection network, and training the network to obtain a trained Yolov5s target detection network.
Illustratively, the Yolov5s target detection algorithm is adopted as the detector of the tracking system and is used to obtain the BoundingBox of each target; using this algorithm ensures the accuracy, speed, and reliability of the tracking system;
the yolov5s is adopted to reduce the width (width _ multiple) and depth (depth _ multiple) of the default network structure of yolov5 by reducing the number of Conv modules and CSPNet modules in the whole neural network, so that the detector is lighter and operates faster. Compared with detectors based on artificial feature extraction or detectors based on deep learning, such as two-stage, yolov3 and improved yolov3, adopted by a traditional tracking algorithm, the yolov5s used in the invention has better performance and higher speed, and can meet the requirements of a pedestrian tracking system on various light-weight and embedded scenes.
The network structure of yolov5s is divided into a backbone network (backbone) and a feature fusion network (head). The backbone uses CSPNet (Cross Stage Partial Network) modules, which have low model complexity and achieve rich gradient combinations with a smaller amount of computation. The yolov5s feature fusion network uses a Path Aggregation Network (PANet) to further process the information from the backbone layers, enhancing the detection capability for targets at unusual scales. In the present invention, the modules of the neural network are assembled by configuring the model's yaml file in PyTorch, and the final usable detector is obtained through deep learning training.
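The width/depth scaling that turns the full-size yolov5 layout into yolov5s can be illustrated as follows. The multiples 0.33 and 0.50 and the round-up-to-multiple-of-8 rule follow the published yolov5s configuration; treat this as a sketch of the scaling rule, not of the detector itself:

```python
import math

# yolov5s derives its slimmer layout from the default yolov5 layout by
# scaling every module's channel count by width_multiple and every
# repeated block count by depth_multiple (values from yolov5s.yaml).
DEPTH_MULTIPLE = 0.33   # depth_multiple in yolov5s.yaml
WIDTH_MULTIPLE = 0.50   # width_multiple in yolov5s.yaml

def make_divisible(x, divisor=8):
    # channel counts are rounded up to a multiple of 8
    return math.ceil(x / divisor) * divisor

def scaled_channels(c):
    # fewer channels per Conv/CSPNet module -> narrower network
    return make_divisible(c * WIDTH_MULTIPLE)

def scaled_repeats(n):
    # fewer repeats per CSP stage -> shallower network
    return max(round(n * DEPTH_MULTIPLE), 1) if n > 1 else n
```

For example, a 1024-channel head in the full-width model shrinks to 512 channels, and a CSP stage with 9 repeats shrinks to 3.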
Further, the step S103: extracting the features of the image in the target detection frame; the method specifically comprises the following steps:
and (3) extracting the features of the images in the target detection frame by adopting the trained feature extraction network Shufflenet V2.
Further, the trained feature extraction network ShuffleNet V2; the training process comprises the following steps:
constructing a second training set; wherein the second training set is an image of a known feature label;
and inputting the second training set into the feature extraction network ShuffleNet V2, and training the network to obtain a trained feature extraction network ShuffleNet V2.
Illustratively, ShuffleNetV2 is used as the backbone network to replace the feature extraction network in DeepSort's original ReID network, performing feature extraction on the image inside the target BoundingBox obtained from the detector.
Furthermore, the ShuffleNet V2 network is formed by connecting Stage1-Stage7 in sequence;
stage1 consists of a convolution layer with a 3×3 kernel and stride 2, followed by a max pooling layer with stride 2;
stage2 consists of one downsampled layer and three Block layers;
stage3 consists of one layer of downsampling and seven layers of blocks;
stage4 is composed of one layer of downsampling and three layers of blocks;
stage5 consists of convolution layers with convolution kernel size of 1 x 1;
stage6 consists of a global pooling layer;
stage7 consists of a fully connected layer.
Fig. 2(a) shows the Block structure in the ShuffleNetV2 network.
The Block layer introduces the Channel Split operation: after receiving the output of the previous layer, the input of c channels is divided into two branches with c' and c − c' channels respectively. One branch is an identity function; the other consists of three convolutions: two 1×1 convolutions and one channel-by-channel (depthwise) convolution. Finally, Concat splices the two branches so that the number of channels stays unchanged, and a Channel Shuffle operation is performed so that information can be exchanged between the two branches.
As shown in fig. 2(b), the downsampling layer is obtained by modifying the Block layer: the Channel Split operation is removed, the two stride-2 branches (one with a channel-by-channel convolution layer followed by a 1×1 convolution layer, the other with 1×1 convolution layers around a channel-by-channel convolution layer) are spliced by Concat, doubling the number of channels, and a Channel Shuffle is then performed.
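The Channel Split and Channel Shuffle operations can be illustrated on a plain list of channel indices; this is a pure-Python sketch of the index permutation, whereas real implementations operate on tensors:

```python
def channel_split(channels, c_prime):
    # Channel Split: divide the c input channels into two branches of
    # c' and c - c' channels (c' = c // 2 in the ShuffleNetV2 Block)
    return channels[:c_prime], channels[c_prime:]

def channel_shuffle(channels, groups=2):
    # Channel Shuffle: reshape (groups, c // groups), transpose, flatten,
    # so information mixes between the concatenated branches
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]
```

With 6 channels and 2 groups, shuffling interleaves the two halves: [0, 1, 2, 3, 4, 5] becomes [0, 3, 1, 4, 2, 5].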
Further, the S104: performing state prediction and track generation on the target detection frame; the method specifically comprises the following steps:
performing state prediction on the target detection frame by adopting a Kalman filtering algorithm;
and combining the result of the Kalman filtering algorithm to generate a track of the target detection frame.
Further, performing state prediction on the target detection frame by adopting a Kalman filtering algorithm; the method specifically comprises the following steps:
An eight-dimensional state space (u, v, γ, h, u̇, v̇, γ̇, ḣ) is defined, where (u, v) is the center coordinate of the BoundingBox, γ is the aspect ratio, h is the height of the BoundingBox, and (u̇, v̇, γ̇, ḣ) are the corresponding velocities in image coordinates. The BoundingBox coordinates are taken as direct measurements of the object state, and a Kalman filter is used to complete the state estimation of the target. Input of the Kalman filter: the mean and variance of each track. Output of the Kalman filter: the projected mean and covariance matrix for the given state estimate.
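The predict step of a constant-velocity Kalman model over this eight-dimensional state can be sketched as follows. This is a minimal illustration assuming dt = 1 and an isotropic process noise q, not the full DeepSort filter (whose noise terms scale with the box height):

```python
import numpy as np

# Constant-velocity predict over (u, v, gamma, h, du, dv, dgamma, dh).
def predict(mean, cov, dt=1.0, q=1e-2):
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)   # position part += velocity * dt
    Q = q * np.eye(8)            # simplified process noise (assumption)
    mean = F @ mean
    cov = F @ cov @ F.T + Q      # uncertainty grows between updates
    return mean, cov
```

A box at u = 10 moving with u̇ = 2 is predicted at u = 12 in the next frame, with a larger covariance.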
Further, combining the result of the Kalman filtering algorithm to generate a track of the target detection frame; the method specifically comprises the following steps:
counting the number of frames a of each track from the last matching successkWhen the kalman filter predicts the position of the trajectory (track) in the next frame, ak=ak+1, if a track is associated with the detected position information and appearance feature in the next frame, then akAnd setting 0.
Setting a predefined maximum life value AmaxWhen a isk>AmaxWhen so, deleting the track; when a isk≤AmaxWhile, the trace is preserved.
When the detected position information and appearance characteristics can not be matched with the track, temporarily defining the track as a new track, and the trial period is 3 frames, and if the matched detection is not carried out in the 3 frames, deleting the track.
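The track bookkeeping above can be sketched as a small class; a_k, A_max, the 3-frame trial period, and the deletion rules follow the description, while the class name and fields are illustrative:

```python
# Sketch of DeepSort-style track lifecycle management.
class Track:
    def __init__(self, a_max=30, probation=3):
        self.a_k = 0                # frames since last successful match
        self.hits = 0               # successful matches so far
        self.a_max = a_max          # predefined maximum life value A_max
        self.probation = probation  # 3-frame trial period
        self.confirmed = False

    def predict(self):
        # Kalman filter predicted the track's next position: a_k += 1
        self.a_k += 1

    def update(self):
        # track associated with a detection: a_k is reset to 0
        self.a_k = 0
        self.hits += 1
        if self.hits >= self.probation:
            self.confirmed = True

    def should_delete(self):
        if self.a_k > self.a_max:
            return True             # a_k > A_max: track is too stale
        # a tentative track that misses a match during its trial is dropped
        # (checked after the update phase of each frame)
        return (not self.confirmed) and self.a_k >= 1
```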
Further, the step S105: determining a correlation cost based on the feature extraction result, the state prediction result and the track generation result; the method specifically comprises the following steps:
calculating a first distance between the prediction state and the target detection frame;
calculating a second distance between the stored feature vector in the track and the feature vector of the image in the target detection frame;
and carrying out weighted summation on the first distance and the second distance, and taking the summation result as the associated cost.
Further, the first distance is a mahalanobis distance; the second distance is a cosine distance.
Exemplarily, the association cost is determined based on the feature extraction result, the state prediction result and the trajectory generation result; the method specifically comprises the following steps:
(1) When combining motion information, the Mahalanobis distance between the Kalman-predicted state and the newly detected target frame is calculated:
d^(1)(i, j) = (d_j − y_i)^T S_i^(−1) (d_j − y_i)
where i denotes the i-th track, j denotes the j-th detection, y_i and S_i are the mean and covariance of the i-th track's predicted state, and d_j is the j-th detection. The Mahalanobis distance accounts for the uncertainty of the state estimate by measuring how many standard deviations the detection lies from the mean track position. The threshold is taken from the inverse chi-squared distribution, excluding unlikely associations.
(2) When combining appearance information, the minimum cosine distance between the track and the detection is calculated in appearance space:
d^(2)(i, j) = min { 1 − r_j^T r_k^(i) | r_k^(i) ∈ R_i }
The cosine distance is computed between each detection's feature vector r_j, extracted by ShuffleNetV2, and the feature vectors stored for the track. The number of feature vectors stored per track is set by the budget parameter (default 100), so each detection has up to budget cosine distances to a given track; the minimum is taken as the cosine distance between the detection and the track.
(3) The cost function of the association problem is a weighted sum of the two metrics:
c_(i,j) = λ d^(1)(i, j) + (1 − λ) d^(2)(i, j)
When motion uncertainty is low, the Mahalanobis distance works well; but since the linear-motion Kalman prediction only provides a rough estimate of the state distribution in image space, the parameter λ can be set close to zero when computing the association cost, so that association relies essentially on appearance information alone. Nevertheless, both the appearance-based cosine distance and the motion-based Mahalanobis distance must remain below their prescribed thresholds.
Further, the step S106: matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; finally, determining a tracking result and completing a tracking task of the multi-target pedestrian; the method specifically comprises the following steps:
matching by adopting a Hungarian algorithm to obtain a primary matching result;
matching the unmatched tracks and the unmatched target detection frames again using an Intersection over Union (IOU) matching algorithm; finally, the tracking result is determined and the multi-target pedestrian tracking task is completed.
Illustratively, cascade matching uses the cosine distance of appearance features and the Mahalanobis distance as its metric. In cascade matching, the uncertainty of a track increases with the number of frames since it was last matched, so the most recently matched tracks have higher matching priority than other tracks.
The cost matrix built by applying the Mahalanobis and cosine distances to the detections and tracks is used as the input of the Hungarian algorithm to obtain a linear matching result;
IOU matching is then performed again on the unmatched tracks, the unmatched detections, and the unconfirmed Kalman prediction results. Finally, the tracking result is determined and the multi-target tracking process is completed.
A track that has been matched to detections in three or more consecutive frames is converted to a confirmed state; a track matched fewer than three times is considered unconfirmed.
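The two-stage matching can be illustrated with a brute-force stand-in for the Hungarian algorithm (adequate for the tiny matrices of an example; a real implementation would use e.g. scipy.optimize.linear_sum_assignment) plus an IOU function:

```python
from itertools import permutations

def min_cost_assignment(cost):
    # brute-force minimum-cost assignment over a square cost matrix;
    # stands in for the Hungarian algorithm in this sketch
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(enumerate(best))   # (track index, detection index) pairs

def iou(a, b):
    # Intersection over Union for boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0
```

Leftover tracks and detections from the first stage would be re-paired whenever their IOU exceeds a threshold.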
Illustratively, the method further comprises: making a data set for training a detector and a feature extractor in an algorithm and preprocessing the data set;
the data set used when the algorithm trains the ReiD feature extraction network is a Market-1501 data set for pedestrian re-identification. The data set was collected by the university of Qinghua, and included 1501 pedestrians, where the training set included 751 people, including 12936 images, and the testing machine included 750 people, including 19732 images;
Data preprocessing is then performed on the data set. In the original data set, the directory holds individual picture files without any id-based organization; a data preprocessing script file places images with the same id (the same person) into the same folder, which is named after the person id. For example, all pictures under bounding_box_train in Market-1501 whose names begin with 0001 are placed under the preprocessed 0001 folder, and the picture files under bounding_box_test are handled similarly;
after data processing, the data set is roughly divided into a training set, a verification set, Query and Gallery. The model obtained after training through the training set can extract features from images in Query and Gallery and calculate similarity, and a similar image is found in Gallery for each Query.
Illustratively, the method further comprises: configuring Python and pytorech programming environments for neural network model training and testing:
a virtual environment is created by Anaconda, Pycharm as an integrated development environment. Anaconda is an open-source Python release version containing a large number of installed software packages available for deep learning development. Wherein conda is an open source package, environment manager, which can realize the functions of conveniently installing different software packages on one machine, using multiple environments and freely switching among multiple environments.
The PyTorch deep learning framework is used. conda is used to create an environment with Python 3.7, named torch1.7; after activating the environment, PyTorch 1.7, CUDA, CUDNN, and the dependency packages required to run the program are installed;
the design and training of the correlation algorithm are carried out by an NVIDIA RTX3060 GPU. In order to solve the problem that a large amount of parallel computing is accelerated in operation speed, a parallel computing framework CUDA (compute unified device architecture) which is provided by NVIDIA and used for a self-contained GPU (graphics processing unit) needs to be installed, and a GPU acceleration library CUDNN used for a deep neural network is installed at the same time. Both the CUDA and CUDNN versions used in the present invention are 11.0.
Illustratively, the method comprises: the lightweight model is tested, so that the effect is real and effective;
adopting a preprocessed data set;
modifying the REID_CKPT path in the deepsort.yaml file to the weight address obtained after retraining with the new model;
modifying the model in the feature extractor to ShuffleNetV2, changing parameters such as model_path and the weight path, and running the executable file to observe the experimental result.
The yolov5s model is used as the detector of DeepSort, and the improved ShuffleNetV2 replaces the feature extraction network in the DeepSort tracker; the size of the weight file is reduced while accuracy is preserved, so that small devices with limited memory, no GPU, and poor heat dissipation can achieve real-time tracking.
Illustratively, collection and preprocessing of the data set: the feature extraction network in the tracker is retrained with the Market-1501 data set; first, the pictures in the data set are divided into a training set and a test set using the dataset preprocessing script.
Illustratively, configuring the virtual environment and installing dependency packages: a virtual environment is created through Anaconda, and PyTorch, CUDA, CUDNN, and the related dependencies required by the program are installed in it. PyCharm is used as the IDE and invokes the torch1.7 virtual environment created by conda.
Illustratively, training of the detector and tracker: the improved ShuffleNetV2 is used as the feature extraction network in the DeepSort tracker; the new model is imported, the original model is removed, and the new feature extraction network is trained with the training script file.
Illustratively, testing the model effect: the modified ShuffleNetV2 model is imported into the feature extractor's script file, the required feature extraction network weight path in the feature extractor is modified, and the modified DeepSort is then used for multi-target pedestrian tracking to test the model effect and observe the ID switch behavior.
Illustratively, the platform for training and testing is a Redmi G laptop, with an NVIDIA GeForce RTX3060 (6G) GPU and an AMD Ryzen 7 5800H with Radeon Graphics CPU.
Illustratively, the experiment uses Python as the programming language and the PyTorch deep learning framework. The overall project code structure mainly comprises the deep, sort, configs, and yolov5 directories and the demo entry file. The sort folder stores tools needed by the tracker: kalman_filter.py holds the Kalman filter code, nn_matching.py computes nearest-neighbor distances via the Mahalanobis and cosine distances, track.py stores track information, and so on. Under the configs directory, a yaml file stores the important parameters of the DeepSort algorithm. The deep folder stores the feature extraction network model structure, the feature extractor, the feature extraction network training script, and the like. requirements.txt lists the dependency packages needed to run the program.
Firstly, data preprocessing is carried out on the data set. The directory structure of the original data set is a flat collection of picture files; by means of the data preprocessing script file dataset.py, images with the same id (the same person) are placed in the same folder. For example, all pictures under bounding_box_train in Market-1501 whose names begin with 0001 are placed under a preprocessed 0001 folder. After data processing, the data set is roughly divided into a training set, a validation set, Query and Gallery. The model obtained after training on the training set can extract features from the images in Query and Gallery and calculate similarity, finding a similar image in Gallery for each Query image. After the data preprocessing script is run, a folder named pytorch is generated in the data set folder, containing the train, val, query and gallery folders used for training and testing the new network model; this folder is moved into the deep folder of DeepSORT.
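The grouping step described above can be sketched as follows. This is a minimal illustration, not the actual dataset.py script from the project; the only assumption is the Market-1501 naming convention, in which the person id is the prefix of the file name before the first underscore.

```python
import os
import shutil

def group_by_person_id(src_dir: str, dst_dir: str) -> None:
    """Copy Market-1501-style images (e.g. 0001_c1s1_000151_00.jpg) into
    one folder per person id, mirroring the preprocessing described above."""
    for name in os.listdir(src_dir):
        if not name.endswith(".jpg"):
            continue
        person_id = name.split("_")[0]          # "0001" from "0001_c1s1_..."
        target = os.path.join(dst_dir, person_id)
        os.makedirs(target, exist_ok=True)
        shutil.copy(os.path.join(src_dir, name), os.path.join(target, name))
```

Running this over bounding_box_train would, for instance, collect every image starting with 0001 under a single 0001 folder, as in the example above.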
Open the train.py file in the deep folder, import the ShuffleNet V2 model structure, change the data set path to the path of the folder named pytorch generated by the data preprocessing, set the number of training epochs to 100, the batch-size to 24, and num_workers to 0, and run python train.py at the terminal to start training.
Find the weight file named ckpt.t8 in the deep/checkpoint folder, and modify the REID_CKPT weight file path in the deep_sort.yaml file under configs in the root directory to the path of ckpt.t8. Then edit the feature_extractor.py file under the deep folder, import the ShuffleNet V2 model, and change parameters such as model_path and the weight path. Run the test executable file to obtain the experimental result, and compare it with the result of training the network before the improvement.
FIGS. 3(a) and 3(b) are size comparisons of the two network models; FIG. 4 is a schematic diagram of the final detection effect.
Table 1 schematic network structure of ShuffleNet V2
[Table 1 is reproduced as an image in the original publication.]
Example two
The embodiment provides a multi-target pedestrian tracking system;
a multi-target pedestrian tracking system, comprising:
an acquisition module configured to: acquiring a video to be processed; marking a plurality of target pedestrians in a first frame of a video to be processed;
an object detection module configured to: carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame;
a feature extraction module configured to: extracting the features of the image in the target detection frame;
a state prediction and trajectory generation module configured to: performing state prediction and track generation on the target detection frame;
an association cost determination module configured to: determining an association cost based on the feature extraction result, the state prediction result and the track generation result;
a tracking module configured to: matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrian.
It should be noted here that the above-mentioned obtaining module, the target detecting module, the feature extracting module, the state predicting and trajectory generating module, the associated cost determining module and the tracking module correspond to steps S101 to S106 in the first embodiment, and the above-mentioned modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the contents disclosed in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so that the electronic device performs the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The multi-target pedestrian tracking method is characterized by comprising the following steps:
acquiring a video to be processed; marking a plurality of target pedestrians in a first frame of a video to be processed;
carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame;
extracting the features of the image in the target detection frame;
performing state prediction and track generation on the target detection frame;
determining a correlation cost based on the feature extraction result, the state prediction result and the track generation result;
matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrians.
2. The multi-target pedestrian tracking method according to claim 1, wherein target detection is performed on a non-first frame of the video to be processed to obtain a target detection frame; target detection is carried out by adopting a trained Yolov5s target detection network; the Yolov5s target detection network comprises: a CSPNet network used for feature extraction and a PANet network used for feature fusion, connected in sequence.
3. The multi-target pedestrian tracking method according to claim 1, wherein the image in the target detection frame is subjected to feature extraction; the method specifically comprises the following steps:
adopting the trained feature extraction network ShuffleNet V2 to extract the features of the images in the target detection frame;
the ShuffleNet V2 network is formed by connecting Stage1-Stage7 in sequence; Stage1 consists of a convolution layer with convolution kernel size 3 × 3 and stride 2 and a maximum pooling layer with stride 2; Stage2 consists of one downsampling layer and three Block layers; Stage3 consists of one downsampling layer and seven Block layers; Stage4 consists of one downsampling layer and three Block layers; Stage5 consists of a convolution layer with convolution kernel size 1 × 1; Stage6 consists of a global pooling layer; Stage7 consists of a fully connected layer;
the Block layer introduces the Channel Split operation: after the Block layer receives the output from the previous layer, the input of c channels is divided into two branches with c′ channels and c − c′ channels respectively; one branch is an identity mapping, and the other branch consists of three convolutions: two 1 × 1 convolutions and one channel-by-channel convolution; the two branches are finally spliced by Concat, so that the number of channels remains unchanged, and finally a Channel Shuffle operation is carried out to ensure that information can be exchanged between the two branches;
the downsampling layer is obtained by modifying the Block layer: the Channel Split operation is deleted, and one branch consisting of a channel-by-channel convolution layer followed by a 1 × 1 convolution layer is spliced by Concat with the other branch consisting of a 1 × 1 convolution layer, a channel-by-channel convolution layer and a 1 × 1 convolution layer, after which Channel Shuffle is carried out.
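The Channel Split and Channel Shuffle operations of claim 3 can be illustrated on a plain list of channel indices. This is a schematic sketch of the index permutation only (no tensors and no PyTorch), assuming the usual ShuffleNet V2 convention of shuffling across two groups:

```python
def channel_split(channels, c_prime):
    """Split c input channels into two branches of c' and c - c' channels."""
    return channels[:c_prime], channels[c_prime:]

def channel_shuffle(channels, groups=2):
    """Interleave channels across groups so the concatenated branches
    can exchange information in the next layer."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]
```

For example, splitting six channels with c′ = 3 gives branches [0, 1, 2] and [3, 4, 5]; shuffling the concatenation with two groups interleaves them as [0, 3, 1, 4, 2, 5].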
4. The multi-target pedestrian tracking method according to claim 1, characterized in that the state prediction and the trajectory generation are performed for the target detection frame; the method specifically comprises the following steps:
performing state prediction on the target detection frame by adopting a Kalman filtering algorithm;
combining the result of the Kalman filtering algorithm to generate a track of the target detection frame;
performing state prediction on the target detection frame by adopting a Kalman filtering algorithm; the method specifically comprises the following steps:
defining an eight-dimensional state space (u, v, γ, h, u̇, v̇, γ̇, ḣ), wherein (u, v) is the center coordinate of the bounding box, γ is the aspect ratio, h is the height of the bounding box, and (u̇, v̇, γ̇, ḣ) are the corresponding velocities in image coordinates; the bounding box coordinates are taken as direct measurements of the object state, and the state estimation of the target is completed with a Kalman filter; the input of the Kalman filter is the mean and variance of each track; the output of the Kalman filter is the projected mean and covariance matrix of the given state estimate;
combining the result of the Kalman filtering algorithm to generate a track of the target detection frame; the method specifically comprises the following steps:
counting the number of frames a_k of each track since its last successful match: when the Kalman filter predicts the position of the track in the next frame, a_k = a_k + 1; if a track is associated with the detected position information and appearance features in the next frame, a_k is reset to 0; a predefined maximum age A_max is set: when a_k > A_max, the track is deleted; when a_k ≤ A_max, the track is retained; when detected position information and appearance features cannot be matched with any track, a new track is temporarily defined with a trial period of 3 frames, and if no matching detection occurs within those 3 frames, the track is deleted.
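The track-age bookkeeping of claim 4 can be sketched as a small state machine. The class below is an illustrative reconstruction, not code from the patent; A_max and the 3-frame trial period are exposed as parameters whose default values are assumptions.

```python
class Track:
    """Track lifecycle: age a_k since last match, A_max deletion, 3-frame trial period."""

    def __init__(self, max_age=70, n_init=3):
        self.age_since_update = 0   # a_k: frames since the last successful match
        self.hits = 0               # matched frames while still tentative
        self.max_age = max_age      # A_max
        self.n_init = n_init        # trial period for new tracks (3 frames)
        self.state = "tentative"

    def predict(self):
        """Kalman prediction for the next frame: a_k = a_k + 1."""
        self.age_since_update += 1

    def update(self):
        """A detection was associated with this track: reset a_k to 0."""
        self.age_since_update = 0
        self.hits += 1
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def should_delete(self):
        """Delete when a_k > A_max, or when a tentative track misses during its trial."""
        if self.state == "tentative" and self.age_since_update > 0:
            return True
        return self.age_since_update > self.max_age
```

In each frame, the tracker would call predict() on every track, update() on the matched ones, and then prune tracks for which should_delete() is true.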
5. The multi-target pedestrian tracking method according to claim 1, wherein the association cost is determined based on a feature extraction result, a state prediction result, and a trajectory generation result; the method specifically comprises the following steps:
calculating a first distance between the prediction state and the target detection frame;
calculating a second distance between the feature vector stored in the track and the feature vector of the image in the target detection frame;
carrying out weighted summation on the first distance and the second distance, and taking the summation result as the correlation cost;
the first distance is a mahalanobis distance; the second distance is a cosine distance.
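The weighted association cost of claim 5 can be written out directly. The sketch below uses a squared Mahalanobis distance with a diagonal covariance and a cosine distance between appearance vectors; the weight λ is an assumed parameter, since the patent does not specify its value.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def mahalanobis_sq(detection, mean, var):
    """Squared Mahalanobis distance between a detection and a predicted state,
    assuming a diagonal covariance (one variance per dimension)."""
    return sum((d - m) ** 2 / v for d, m, v in zip(detection, mean, var))

def association_cost(maha, cos, lam=0.5):
    """Weighted sum of the motion (Mahalanobis) and appearance (cosine) distances."""
    return lam * maha + (1.0 - lam) * cos
```

In DeepSORT-style trackers this cost is evaluated for every track-detection pair to build the cost matrix that the subsequent matching step operates on.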
6. The multi-target pedestrian tracking method according to claim 1, wherein the tracks with the associated costs greater than the set threshold value and the target detection frames are matched to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; finally, determining a tracking result and completing a tracking task of the multi-target pedestrian; the method specifically comprises the following steps:
matching by adopting a Hungarian algorithm to obtain a primary matching result;
matching the unmatched tracks and the unmatched target detection frames again by adopting an intersection-over-union (IOU) matching algorithm; and finally, determining the tracking result and completing the tracking task of the multi-target pedestrians.
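The intersection-over-union used in the second matching stage of claim 6 can be computed as follows. Boxes are assumed to be in (x1, y1, x2, y2) corner format, which is an illustrative choice; the Hungarian algorithm of the first stage is typically taken from a library such as scipy.optimize.linear_sum_assignment rather than hand-written.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Unmatched tracks and detections whose IoU exceeds a threshold can then be paired in this second stage, rescuing associations that the appearance-based matching missed.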
7. The multi-target pedestrian tracking method according to claim 1, characterized by further comprising: configuring Python and pytorech programming environments for neural network model training and testing:
configuring a virtual environment and installing dependency packages: creating the virtual environment through Anaconda, and installing PyTorch, CUDA, cuDNN and the related dependencies required by the running program in the virtual environment; PyCharm is used as the IDE and calls the virtual environment torch1.7 created by conda.
8. Multi-target pedestrian tracking system, characterized by includes:
an acquisition module configured to: acquiring a video to be processed; marking a plurality of target pedestrians in a first frame of a video to be processed;
an object detection module configured to: carrying out target detection on a non-first frame of a video to be processed to obtain a target detection frame;
a feature extraction module configured to: extracting the features of the image in the target detection frame;
a state prediction and trajectory generation module configured to: performing state prediction and track generation on the target detection frame;
an association cost determination module configured to: determining a correlation cost based on the feature extraction result, the state prediction result and the track generation result;
a tracking module configured to: matching the track with the correlation cost larger than the set threshold value with the target detection frame to obtain a preliminary matching result; matching the unmatched track and the unmatched target detection frame again; and finally, determining a tracking result and completing a tracking task of the multi-target pedestrian.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the instructions of the method of any one of claims 1-7.
CN202210264036.5A 2022-03-17 2022-03-17 Multi-target pedestrian tracking method and system Pending CN114743130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210264036.5A CN114743130A (en) 2022-03-17 2022-03-17 Multi-target pedestrian tracking method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210264036.5A CN114743130A (en) 2022-03-17 2022-03-17 Multi-target pedestrian tracking method and system

Publications (1)

Publication Number Publication Date
CN114743130A true CN114743130A (en) 2022-07-12

Family

ID=82277239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210264036.5A Pending CN114743130A (en) 2022-03-17 2022-03-17 Multi-target pedestrian tracking method and system

Country Status (1)

Country Link
CN (1) CN114743130A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523970A (en) * 2023-07-05 2023-08-01 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching
CN116523970B (en) * 2023-07-05 2023-10-20 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching
CN117576167A (en) * 2024-01-16 2024-02-20 杭州华橙软件技术有限公司 Multi-target tracking method, multi-target tracking device, and computer storage medium
CN117576167B (en) * 2024-01-16 2024-04-12 杭州华橙软件技术有限公司 Multi-target tracking method, multi-target tracking device, and computer storage medium

Similar Documents

Publication Publication Date Title
Akyon et al. Slicing aided hyper inference and fine-tuning for small object detection
Zuraimi et al. Vehicle detection and tracking using YOLO and DeepSORT
Jana et al. YOLO based Detection and Classification of Objects in video records
US20180129742A1 (en) Natural language object tracking
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN114743130A (en) Multi-target pedestrian tracking method and system
CN112734803B (en) Single target tracking method, device, equipment and storage medium based on character description
CN111461212A (en) Compression method for point cloud target detection model
Ciberlin et al. Object detection and object tracking in front of the vehicle using front view camera
CN110598586A (en) Target detection method and system
CN112132130B (en) Real-time license plate detection method and system for whole scene
KR20230123880A (en) System and method for dual-value attention and instance boundary aware regression in computer vision system
CN114998601B (en) On-line update target tracking method and system based on Transformer
KR20220098312A (en) Method, apparatus, device and recording medium for detecting related objects in an image
US20220300774A1 (en) Methods, apparatuses, devices and storage media for detecting correlated objects involved in image
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN114792401A (en) Training method, device and equipment of behavior recognition model and storage medium
CN116958873A (en) Pedestrian tracking method, device, electronic equipment and readable storage medium
Park et al. Intensity classification background model based on the tracing scheme for deep learning based CCTV pedestrian detection
CN116168438A (en) Key point detection method and device and electronic equipment
CN113609948B (en) Method, device and equipment for detecting video time sequence action
Albouchi et al. Deep Learning-Based Object Detection Approach for Autonomous Vehicles
CN114612520A (en) Multi-target tracking method, device and system
CN114329070A (en) Video feature extraction method and device, computer equipment and storage medium
Jokela Person counter using real-time object detection and a small neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination