CN110969112B - Pedestrian identity alignment method under camera-crossing scene


Info

Publication number
CN110969112B
Authority
CN
China
Prior art keywords
pedestrian
layer
convolution
convolutional
layers
Prior art date
Legal status
Active
Application number
CN201911189515.XA
Other languages
Chinese (zh)
Other versions
CN110969112A (en)
Inventor
余春艳
钟诗俊
赖奇嵘
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911189515.XA priority Critical patent/CN110969112B/en
Publication of CN110969112A publication Critical patent/CN110969112A/en
Application granted granted Critical
Publication of CN110969112B publication Critical patent/CN110969112B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/30 Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian identity alignment method for a cross-camera scene, which solves the pedestrian association problem across cameras. On the basis of continuously and correctly tracking a target pedestrian under a single camera, the target enters a blind area after leaving the camera's field of view and tracking is interrupted; when the target reappears under a camera, the pedestrian is re-identified, its identity is kept unchanged, and tracking continues. When a detector detects a new pedestrian, the pedestrian is added to a candidate pool to be associated; a pair of pedestrians is then selected from the candidate pool, image preprocessing is completed with the F-CCT model, and the processed images are used as input to the SAM-Dets model to obtain the appearance suitability of the two pedestrians. After the appearance suitability of every pair of pedestrians in the candidate pool has been computed, a minimum-cost flow graph model is built from the pairwise suitability values combined with the spatio-temporal relationship, and the optimal pedestrian association is solved. Finally, according to the association result, each pedestrian either keeps its original identity or is assigned a new identity.

Description

Pedestrian identity alignment method under camera-crossing scene
Technical Field
The invention belongs to the field of machine vision and intelligent security, and particularly relates to a pedestrian identity alignment method in a camera-crossing scene.
Background
With the continuous development of the economy and society, the demand for safety keeps growing. The field of intelligent security therefore continues to be studied and developed, its range of applications keeps expanding, and technologies such as pedestrian detection and tracking have become hot research topics. For the pedestrian tracking problem, research under cross-camera settings has clear practical significance and must handle multiple pedestrians. Although pedestrian tracking under a single camera is relatively mature, with multiple cameras, and especially with non-overlapping fields of view, blind areas make the spatio-temporal information of a target unreliable, which greatly complicates identifying, tracking and retrieving the same target in different cameras at different times and places. This drives research on pedestrian tracking across camera scenes, and the most important part is how to match pedestrian identities under different cameras.
Cross-camera pedestrian identity alignment mainly takes pedestrians as the research object and focuses on multi-camera multi-target tracking with non-overlapping fields of view. The common solution has two steps: first, a detection and tracking algorithm obtains the trajectory of each target under a single camera; second, an association algorithm integrates the independent pedestrian trajectories across cameras, yielding the complete motion trajectory of each target. This mechanism is limited to offline data, is essentially suited to retrieval scenarios, and cannot support online tracking. The reason is that after a target pedestrian leaves the current camera's field of view, the blind area causes its spatio-temporal information to be lost by the time it enters the next camera's field of view, which makes it harder to correctly hand the target over from one camera to the next. The mechanism also has the side effect of making the cross-camera tracking result heavily dependent on the single-camera tracking quality.
The key to pedestrian cross-camera identity alignment is to correctly associate the same target pedestrian across different fields of view. Most existing cross-camera pedestrian tracking algorithms have limited ability to learn pedestrian features and cannot learn sufficiently robust representations; this ultimately degrades the accuracy of the subsequent pedestrian similarity measure, produces undesirable data association results, and makes them ill-suited to the complex environment of cross-camera pedestrian tracking.
Although the existing research related to cross-camera identity alignment can effectively solve some pedestrian tracking on offline data, the requirement for instant online tracking cannot be met, and effective tracking cannot be performed when an unknown pedestrian enters or exits an area.
Disclosure of Invention
In order to overcome the gaps and defects of the prior art, the invention aims to solve the pedestrian association problem across cameras: a newly entered target is either assigned a new identity, or, when it is successfully associated with a target that previously left, it keeps that target's original identity. The method assumes a target pedestrian is continuously and correctly tracked under a single camera; after the target leaves the camera's field of view it enters a blind area and tracking is interrupted, but when the target reappears under a camera, the pedestrian is re-identified, its identity is kept unchanged, and tracking continues. When a detector detects a new pedestrian, the pedestrian is added to a candidate pool to be associated; a pair of pedestrians is then selected from the candidate pool, image preprocessing is completed with the F-CCT model, and the processed images are used as input to the SAM-Dets model to obtain the appearance suitability of the two pedestrians. After the appearance suitability of every pair of pedestrians in the candidate pool has been computed, a minimum-cost flow graph model is built from the pairwise suitability values combined with the spatio-temporal relationship, and the optimal pedestrian association is solved. Finally, according to the association result, each pedestrian either keeps its original identity or is assigned a new identity, and is handed to a tracker for continued tracking.
The invention specifically adopts the following technical scheme:
a pedestrian identity alignment method under a camera-crossing scene is characterized by comprising the following steps:
step S1: the multiple cameras respectively add pedestrian images detected by the cameras through the detectors into a candidate pool to be associated;
step S2: calculating the appearance adaptation degree of two pedestrian images belonging to different cameras in the candidate pool to be associated;
step S3: after the pedestrian images in the candidate pool to be associated are pairwise paired and the appearance adaptation degrees are calculated, according to the appearance adaptation degrees of each pair of pedestrians, a minimum cost flow diagram model is established by combining the space-time relationship, and the optimal pedestrian association solution is solved;
step S4: according to the association result of step S3, the pedestrian either keeps its original identity or is assigned a new identity.
Preferably, in step S1, the detector is Faster R-CNN.
Preferably, in step S2, the step of calculating the degree of appearance suitability specifically includes the following steps:
step A21: completing image preprocessing using the fuzzy C-means clustering F-CCT model, and letting the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N};
Step A22: taking the images processed in step A21 as input data of the pedestrian association model SAM-Dets with fused fine-grained representation: with X as the input vector and Y as the weight vector, the local fine-grained feature f_1 of pedestrian A is encoded by the model; with Y as the input vector and X as the weight vector, the local fine-grained feature f_2 of pedestrian B is encoded by the model;
Step A23: taking f_s = (f_1 - f_2)^2 as the input of two convolution layers C with kernel size 1 × 1 × 4096, and using softmax as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world; this probability is taken as the appearance suitability.
Preferably, in step A22, the structure of the pedestrian association model SAM-Dets with fused fine-grained representation includes: K attention branches and a splicing layer; each attention branch includes the following six layers, wherein:
the first layer is a convolution layer A and is used for extracting high-level features of the input pedestrian overall features;
the second layer is an activation layer, and the activation function is softmax;
the third layer is a dimension expansion layer;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer for reducing feature dimension;
the sixth layer is a full connection layer and is used for completing the inner product calculation of the input vector and the weight vector in the weight matrix;
and the splicing layer splices the results obtained by the K attention branches according to channels and outputs the local fine-grained characteristics of pedestrians.
Preferably, the convolution kernel size of convolution layer A is 1 × 1, and the step size is 1; the dimension expansion layer expands the channel dimension to 512 dimensions; the size of the global average pooling layer is 1 × 1, and the step size is 1.
Preferably, in step S2, the step of calculating the degree of appearance suitability specifically includes the following steps:
step B21: completing image preprocessing using the fuzzy C-means clustering F-CCT model, and letting the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N};
Step B22: extracting abstract pedestrian features from pedestrian image A and pedestrian image B through the DR-ResNet base network;
step B23: further extracting high-level features of the target pedestrians with convolution layer B, and using them as input data of a classification model and of the pedestrian association model SAM-Dets with fused fine-grained representation; the classification model outputs the identity labels of pedestrian image A and pedestrian image B respectively, and the SAM-Dets model outputs the local fine-grained feature f_1 of pedestrian A and the local fine-grained feature f_2 of pedestrian B;
Step B24: taking f_s = (f_1 - f_2)^2 as the input of two convolution layers C with kernel size 1 × 1 × 4096, and using softmax as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world; this probability is taken as the appearance suitability.
Preferably, the DR-ResNet base network comprises two identical weight-sharing deep convolutional twin (Siamese) neural base network modules R-ResNet; the structure of the deep convolutional twin neural base network module R-ResNet comprises forty-nine convolutional layers, three parallel convolutional layers and an end convolutional layer:
wherein the convolution kernel size of the first convolutional layer is (7,7, 64), the max-pooling is (3,3), and the sliding step size is 2;
the sizes of convolution kernels of the second convolution layer to the fourth convolution layer are (1,1,64), (3,3,64) and (1,1,256), and the ReLu function is adopted as the activation function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the second convolutional layer and also used as the input value of the third layer of the activation function of the convolutional block; the fifth to seventh convolutional layers, and the eighth to tenth convolutional layers all adopt the same structures as the second to fourth convolutional layers;
the sizes of convolution kernels of the eleventh convolution layer to the thirteenth convolution layer are (1,1, 128), (3,3, 128) and (1,1, 512), and the ReLu function is adopted as the activation function; the three layers of convolution layers and the activation function form a convolution block, and the input value of the convolution block is used as the input value of the convolution layer of the eleventh layer and also used as the input value of the activation function of the third layer of the convolution block; the fourteenth to sixteenth convolutional layers, the seventeenth to nineteenth convolutional layers, and the twentieth to twenty second convolutional layers all adopt the same structures as the eleventh to thirteenth convolutional layers;
the sizes of the convolution kernels of the twenty-third to twenty-fifth convolutional layers are (1,1,256), (3,3, 256) and (1,1, 1024), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the twenty-third convolutional layer and also as the input value of the third-layer activation function of the block; the twenty-sixth to twenty-eighth, twenty-ninth to thirty-first, thirty-second to thirty-fourth, thirty-fifth to thirty-seventh, and thirty-eighth to fortieth convolutional layers all adopt the same structure as the twenty-third to twenty-fifth convolutional layers;
the sizes of the convolution kernels of the forty-first to forty-third convolutional layers are (1,1, 512), (3,3, 512) and (1,1, 2048), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the forty-first convolutional layer and also as the input value of the third-layer activation function of the block; the forty-fourth to forty-sixth and forty-seventh to forty-ninth convolutional layers adopt the same structure as the forty-first to forty-third convolutional layers;
after the forty-ninth convolutional layer there are three parallel convolutional layers; each parallel convolutional layer uses 2048 convolution kernels, the sizes of the first to third parallel convolutional layers are (3, 1024), (5, 1024) and (7, 1024) respectively, the channels of the three parallel convolutional layers are combined through a connection layer, and the subsequent max-pooling is (4, 4);
the last layer is the end convolution layer with size (2, 2048) using 1024 convolution kernels;
the convolutional layer B uses 2 convolutional kernels and has the size of (1,1, 4096);
the structure of the pedestrian correlation model SAM-Dets fused with the fine-grained representation comprises: k attention branches and splice layers; each of the attention branches includes the following six layers, wherein:
the first layer is convolution layer A, with a convolution kernel size of 1 × 1 and a step size of 1, and is used for extracting high-level features of the input overall pedestrian features;
the second layer is an activation layer, and the activation function is softmax;
the third layer is a dimension expanding layer, and the dimension of the channel is expanded into 512 dimensions;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer with a size of 1 × 1 and a step size of 1, used for reducing the feature dimension;
the sixth layer is a full connection layer and is used for finishing inner product calculation of the input vector and the weight vector in the weight matrix;
and the splicing layer splices the results obtained by the K attention branches according to channels and outputs the local fine-grained characteristics of pedestrians.
Preferably, in step S3, according to the degree of adaptation of each pair of pedestrians, the method combines the spatio-temporal relationship to establish a minimum cost flow graph model, and solves the optimal pedestrian association solution, including the following steps:
step S31: let G_(t_p-1) be the cost flow graph at time t_p-1 after the instant alignment at that time has been completed; at time t_p, for each target in the in-view pedestrian set and the out-of-view pedestrian set, an entry node and an exit node are newly added, and the directed edges connecting the newly added nodes with the source and the sink are updated;
step S32: according to the pairwise pedestrian appearance suitability between targets in the in-view pedestrian set and the out-of-view pedestrian set, the directed edges between the corresponding nodes are updated to obtain the new cost flow graph G_(t_p) at time t_p;
step S33: all aligned target nodes are deleted, the unaligned target nodes in the in-view pedestrian set remain and are treated as newly entered target pedestrians, yielding the cost flow graph G'_(t_p) that waits to be updated and aligned at the next time instant.
Preferably, in step S4, after the pedestrian is subjected to the operation of keeping the original identity or giving a new identity, the pedestrian is handed to the tracker for continuous tracking; the tracker adopts a KCF algorithm for tracking; the KCF algorithm assigns a tracker to each pedestrian.
Preferably, the tracking visual field area of the camera is divided into a core area and a critical area, and in step S1, only pedestrians in the critical area are detected by the detector.
Compared with the prior art, the invention and its preferred schemes achieve online cross-camera pedestrian tracking with accurate identification and high efficiency, and are not affected when a target leaves a camera's field of view, enters a blind area and tracking is temporarily interrupted.
Pedestrian detection is provided by the mature Faster R-CNN detector, and online pedestrian tracking by the KCF algorithm; the core of the invention and its preferred schemes is to build a minimum-cost flow graph model from the spatio-temporal information and the pedestrian similarity values so as to complete the pedestrian identity alignment task instantly, and to integrate the F-CCT model with the SAM-Dets model, or the SAM-Dets model with the DR-ResNet network model, thereby solving the measurement of pedestrian appearance suitability.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic view of the overall flow of example 1 of the present invention;
FIG. 2 is a diagram of a SAM-Dets model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a SAM-Dets model network structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fine-grained association model of a pedestrian according to embodiment 2 of the present invention;
fig. 5 is a schematic diagram 1 illustrating the effect of a pedestrian identity alignment model in a cross-camera scene in embodiment 2 of the present invention;
fig. 6 is a schematic diagram 2 illustrating the effect of the pedestrian identity alignment model in the cross-camera scene in embodiment 2 of the present invention.
Detailed Description
In order to make the features and advantages of this patent more comprehensible, 2 embodiments accompanied with figures are described in detail below:
as shown in fig. 1, in a first embodiment of the present invention, an overall scheme for implementing identity alignment of pedestrians in a camera-crossing scene includes the following steps:
step S1: when the pedestrian enters the visual field of the cameras, the multiple cameras respectively add images of the pedestrian detected by the cameras through the detectors into the candidate pool to be associated.
In this embodiment, the detector adopts Faster R-CNN, a representative deep-learning object detection method. When generating candidate boxes, its RPN shares the convolutional features of the whole image with the rest of the network, so that the classification and regression tasks operate on the same convolutional features.
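By way of illustration only, the following Python sketch shows how detections from an off-the-shelf Faster R-CNN could feed the candidate pool of step S1. The torchvision implementation and the 0.8 score threshold are assumptions of this sketch, not details given in the patent.

import torch
import torchvision

# Pre-trained Faster R-CNN from torchvision (trained on COCO, where label 1 is "person").
# Older torchvision versions use pretrained=True instead of weights="DEFAULT".
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_pedestrians(frame_tensor, camera_id, candidate_pool, score_thr=0.8):
    """Run the detector on one frame (CHW float tensor in [0, 1]) and add pedestrian
    crops to the candidate pool to be associated."""
    with torch.no_grad():
        output = detector([frame_tensor])[0]          # dict with boxes, labels, scores
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if label.item() == 1 and score.item() >= score_thr:   # keep confident "person" boxes
            x1, y1, x2, y2 = [int(v) for v in box.tolist()]
            candidate_pool.append({"camera": camera_id,
                                   "box": (x1, y1, x2, y2),
                                   "crop": frame_tensor[:, y1:y2, x1:x2]})
    return candidate_pool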
Step S2: and calculating the appearance adaptation degree of two pedestrian images belonging to different cameras in the candidate pool to be associated.
In an embodiment, the calculating the appearance suitability specifically includes the following steps:
step A21: complete image preprocessing using the fuzzy C-means clustering F-CCT model, and let the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N}; the fuzzy clustering algorithm divides each image into clustering domains, local colour-luminance transfer between clustering domains is realized by matching the clustering domains of the source image and the target image, and a membership factor is introduced to improve the colour-luminance transfer effect.
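For illustration, the numpy sketch below implements this kind of membership-weighted, per-cluster colour-luminance transfer. The cluster count k, the centre-distance rule used to match clustering domains, and working directly in RGB are assumptions of the sketch; the patent's F-CCT model is not reproduced exactly.

import numpy as np

def fuzzy_cmeans(pixels, k=3, m=2.0, iters=20, seed=0):
    """Plain fuzzy C-means on pixel colours; returns (cluster centres, memberships)."""
    rng = np.random.default_rng(seed)
    u = rng.random((pixels.shape[0], k))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        w = u ** m
        centers = (w.T @ pixels) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2) + 1e-8
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
    return centers, u

def fcct_like_transfer(source, target, k=3):
    """Cluster both images, match clustering domains by centre colour, then apply a
    per-cluster mean/std colour-luminance transfer weighted by the membership degrees."""
    s = source.reshape(-1, 3).astype(np.float64)
    t = target.reshape(-1, 3).astype(np.float64)
    cs, us = fuzzy_cmeans(s, k)
    ct, ut = fuzzy_cmeans(t, k)
    pairing = np.argmin(np.linalg.norm(cs[:, None, :] - ct[None, :, :], axis=2), axis=1)
    out = np.zeros_like(s)
    for i in range(k):
        ws, wt = us[:, i:i + 1], ut[:, pairing[i]:pairing[i] + 1]
        mu_s, mu_t = (ws * s).sum(0) / ws.sum(), (wt * t).sum(0) / wt.sum()
        std_s = np.sqrt((ws * (s - mu_s) ** 2).sum(0) / ws.sum()) + 1e-8
        std_t = np.sqrt((wt * (t - mu_t) ** 2).sum(0) / wt.sum()) + 1e-8
        out += ws * ((s - mu_s) / std_s * std_t + mu_t)   # membership-weighted transfer
    return np.clip(out, 0, 255).reshape(source.shape).astype(np.uint8)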
Step A22: the images processed in step A21 are taken as input data of the pedestrian association model SAM-Dets with fused fine-grained representation: with X as the input vector and Y as the weight vector, the local fine-grained feature f_1 of pedestrian A is encoded by the model; with Y as the input vector and X as the weight vector, the local fine-grained feature f_2 of pedestrian B is encoded by the model.
As shown in fig. 2, in this embodiment the pedestrian association model SAM-Dets with fused fine-grained representation is composed of a plurality of attention branches; each branch has the same functional modules, and the input data are the global pedestrian features extracted by the base network. Each attention branch passes through a local detector, a global pooling module and a linear embedding module once, and finally the results of the K branches are spliced by channel to obtain the complete output of the attention model.
In the local detector module, high-level features of the input are first acquired and normalized with a softmax function to obtain attention weights that lie in a probability-distribution value interval; the dimensionality of the high-level features is then expanded for the subsequent summation operation, and finally the attention probability distribution of the corresponding features is obtained with a weighted summation. In the global pooling module and the linear embedding module, the attention probability distribution is used to screen and retain the corresponding pedestrian features, and finally high-level pedestrian features carrying the attention distribution are output.
As shown in fig. 3, specifically, the network structure of the pedestrian association model SAM-Dets fusing fine-grained representation includes: k attention branches and splice layers; each attention branch comprises the following six layers, wherein:
the first layer is a convolution layer A and is used for extracting high-level features of the input pedestrian overall features;
the second layer is an active layer, and the active function is softmax;
the third layer is a dimension expansion layer;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer for reducing feature dimension;
the sixth layer is a full connection layer and is used for completing the inner product calculation of the input vector and the weight vector in the weight matrix;
and the splicing layer splices the results obtained by the K attention branches according to channels and outputs the local fine-grained characteristics of pedestrians.
The convolution kernel size of convolution layer A is 1 × 1, and the step size is 1; the dimension expansion layer expands the channel dimension to 512 dimensions; the size of the global average pooling layer is 1 × 1, and the step size is 1. A sketch of one such attention branch follows.
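The PyTorch sketch below is one possible reading of a single attention branch and of the K-branch splicing. The 512-channel input, the way the other image's feature enters layer six as the weight vector, and the branch count are assumptions left open by the text above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """One SAM-Dets attention branch following the six layers described above."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, 1, kernel_size=1, stride=1)  # layer 1: convolution layer A
        self.fc = nn.Linear(channels, channels)                        # layer 6: fully connected layer

    def forward(self, x, weight_vec=None):
        b, c, h, w = x.shape
        a = self.conv_a(x)                                          # layer 1: high-level features
        a = F.softmax(a.view(b, 1, -1), dim=-1).view(b, 1, h, w)    # layer 2: softmax activation
        a = a.expand(-1, c, -1, -1)                                 # layer 3: expand channel dimension
        y = x + a                                                   # layer 4: sum with overall features
        y = F.adaptive_avg_pool2d(y, 1).flatten(1)                  # layer 5: global average pooling
        if weight_vec is not None:                                  # layer 6: use the other image's
            return y * weight_vec                                   # feature as the weight vector
        return self.fc(y)

class SAMDets(nn.Module):
    """K attention branches whose results are spliced by channel."""
    def __init__(self, k=8, channels=512):
        super().__init__()
        self.branches = nn.ModuleList(AttentionBranch(channels) for _ in range(k))

    def forward(self, x, weight_vec=None):
        return torch.cat([branch(x, weight_vec) for branch in self.branches], dim=1)

With k = 8 branches of 512 dimensions each, the spliced output is 4096-dimensional, which would line up with the 1 × 1 × 4096 convolution used in step A23; this correspondence is an inference of the sketch, not something the patent states.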
Step A23: converting the similarity value of a pair of pedestrians to be calculated and input into a pair f 1 And f 2 Similarity comparison of features. Introducing a parameter-free layer Square layer to pair f 1 And f 2 The feature solution squared error as f 1 And f 2 And (3) comparing the similarity, and recording the Square layer as follows: f. of s =(f 1 -f 2 ) 2 (ii) a Will f is s =(f 1 -f 2 ) 2 As input values of two convolution layers C with a kernel size of 1 × 1 × 4096, softmax is used as an output function to output a two-dimensional vector (q) 1 ,q 2 ) The probability value indicating that two objects belong to the same person in the real world is input as the appearance suitability.
Further, the obtained similarity probability of each pair of pedestrians is used as an edge weight of a graph; the newly entered pedestrians and the target pedestrians waiting to be associated are taken as two vertex sets, and a weighted matching graph is built. Solving the maximum-weight matching problem on this graph yields the data association between the newly entered pedestrians and the target pedestrians waiting to be associated.
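By way of illustration, such a bipartite maximum-weight matching can be solved with the Hungarian algorithm; the sketch below uses scipy's linear_sum_assignment, and the 0.5 acceptance threshold is an assumption.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(similarity, threshold=0.5):
    """similarity[i, j]: appearance suitability between newly entered pedestrian i and
    waiting target pedestrian j. Returns accepted matches and unmatched new entries."""
    rows, cols = linear_sum_assignment(-similarity)          # maximize total similarity
    matches = [(r, c) for r, c in zip(rows, cols) if similarity[r, c] >= threshold]
    matched_rows = {r for r, _ in matches}
    new_identities = [r for r in range(similarity.shape[0]) if r not in matched_rows]
    return matches, new_identities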
Step S3: after the appearance adaptation degrees of the pedestrian images in the candidate pool to be associated are pairwise paired and calculated, a minimum cost flow graph model is established according to the appearance adaptation degrees of each pair of pedestrians and by combining the space-time relationship, and an optimal pedestrian association solution is solved.
The method specifically comprises the following steps:
step S31: let G_(t_p-1) be the cost flow graph at time t_p-1 after the instant alignment at that time has been completed; at time t_p, for each target in the in-view pedestrian set and the out-of-view pedestrian set, an entry node and an exit node are newly added, and the directed edges connecting the newly added nodes with the source and the sink are updated;
step S32: according to the pairwise pedestrian appearance suitability between targets in the in-view pedestrian set and the out-of-view pedestrian set, the directed edges between the corresponding nodes are updated to obtain the new cost flow graph G_(t_p) at time t_p;
step S33: all aligned target nodes are deleted, the unaligned target nodes in the in-view pedestrian set remain and are treated as newly entered target pedestrians, yielding the cost flow graph G'_(t_p) that waits to be updated and aligned at the next time instant.
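The networkx sketch below illustrates one possible construction and solution of such a cost flow graph at a single time instant; the -log(suitability) edge cost, the integer scaling, and the acceptance threshold are assumptions of the sketch rather than details fixed by the patent.

import math
import networkx as nx

def align_once(exited, entered, suitability, threshold=0.5):
    """exited: identities that left some camera's view; entered: newly detected candidates;
    suitability[(i, j)]: appearance suitability between exited i and entered j.
    Returns the (exited, entered) pairs selected by the minimum-cost flow."""
    if not exited or not entered:
        return []
    G = nx.DiGraph()
    for i in exited:
        G.add_edge("source", ("out", i), capacity=1, weight=0)
    for j in entered:
        G.add_edge(("in", j), "sink", capacity=1, weight=0)
    for i in exited:
        for j in entered:
            q = suitability[(i, j)]
            if q >= threshold:                       # only plausible associations get an edge
                G.add_edge(("out", i), ("in", j), capacity=1,
                           weight=int(-1000 * math.log(max(q, 1e-6))))
    flow = nx.max_flow_min_cost(G, "source", "sink")
    return [(i, j) for i in exited for j in entered
            if flow.get(("out", i), {}).get(("in", j), 0) == 1]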
Step S4: according to the association result of step S3, the pedestrian either keeps its original identity or is assigned a new identity, and is then handed to a tracker for continued tracking; the tracker uses the KCF algorithm, and one KCF tracker is assigned to each pedestrian. The algorithm forms a circulant matrix over the target area, exploits properties of circulant matrices such as diagonalization in the Fourier domain, and obtains a general prediction formula through ridge regression.
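As an illustration of the one-tracker-per-pedestrian arrangement, the sketch below uses OpenCV's KCF implementation; it requires the opencv-contrib package, and the factory function is named cv2.TrackerKCF_create or cv2.legacy.TrackerKCF_create depending on the OpenCV version.

import cv2

class PedestrianTrackers:
    """Maintains one KCF tracker per pedestrian identity."""
    def __init__(self):
        self.trackers = {}                            # identity -> KCF tracker

    def add(self, identity, frame, bbox):
        """bbox is (x, y, w, h) in pixel coordinates."""
        factory = getattr(cv2, "TrackerKCF_create", None) or cv2.legacy.TrackerKCF_create
        tracker = factory()
        tracker.init(frame, tuple(int(v) for v in bbox))
        self.trackers[identity] = tracker

    def update(self, frame):
        """Returns {identity: bbox} for every pedestrian still tracked in this frame."""
        results = {}
        for identity, tracker in list(self.trackers.items()):
            ok, bbox = tracker.update(frame)
            if ok:
                results[identity] = bbox
            else:
                del self.trackers[identity]           # lost target: stop tracking it here
        return results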
Meanwhile, in this embodiment, the tracking field of view of each camera is divided into a core area and a critical area, and in step S1 only pedestrians in the critical area are detected by the detector. The idea is that any target entering or leaving the view must pass through the critical area, so only pedestrians in the critical area are considered to have a reasonable spatial transfer relation, which preserves the generality of the subsequent pedestrian alignment solving to the greatest extent.
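A minimal geometric test for this core/critical division might look as follows; the 15% border width is an assumption, since the patent does not specify how the two areas are split.

def in_critical_area(bbox, frame_w, frame_h, margin=0.15):
    """True if the pedestrian box centre lies in the critical (border) strip of the view."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    mx, my = frame_w * margin, frame_h * margin
    return cx < mx or cx > frame_w - mx or cy < my or cy > frame_h - my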
In the second embodiment of the present invention, as shown in fig. 4, another preferred implementation of step S2 (calculating the appearance suitability of two pedestrian images belonging to different cameras in the candidate pool to be associated) is provided, which specifically includes the following steps:
step B21: complete image preprocessing using the fuzzy C-means clustering F-CCT model, and let the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N};
Step B22: extracting pedestrian abstract characteristics from the pedestrian image A and the pedestrian image B through a DR-ResNet basic network;
step B23: further extracting high-level features of the target pedestrians with convolution layer B, and using them as input data of a classification model and of the pedestrian association model SAM-Dets with fused fine-grained representation; the classification model outputs the identity labels of pedestrian image A and pedestrian image B respectively, and the SAM-Dets model outputs the local fine-grained feature f_1 of pedestrian A and the local fine-grained feature f_2 of pedestrian B;
Step B24: taking f_s = (f_1 - f_2)^2 as the input of two convolution layers C with kernel size 1 × 1 × 4096, and using softmax as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world; this probability is taken as the appearance suitability.
The DR-ResNet base network comprises two identical weight-sharing deep convolutional twin (Siamese) neural base network modules R-ResNet; the structure of the deep convolutional twin neural base network module R-ResNet comprises forty-nine convolutional layers, three parallel convolutional layers and an end convolutional layer:
wherein the convolution kernel size of the first convolutional layer is (7,7, 64), the max-pooling is (3,3), and the sliding step size is 2;
the sizes of convolution kernels of the second convolution layer to the fourth convolution layer are (1,1,64), (3,3,64) and (1,1,256), and the ReLu function is adopted as the activation function; the three layers of convolution layers and the activation function form a convolution block, and the input value of the convolution block is used as the input value of the second convolution layer and the input value of the third layer of activation function of the convolution block; the fifth to seventh convolutional layers, and the eighth to tenth convolutional layers all adopt the same structures as the second to fourth convolutional layers;
the sizes of the convolution kernels of the eleventh to thirteenth convolutional layers are (1,1, 128), (3,3, 128) and (1,1, 512), and the ReLu function is adopted as the activation function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the eleventh convolutional layer and also as the input value of the third-layer activation function of the block; the fourteenth to sixteenth, seventeenth to nineteenth, and twentieth to twenty-second convolutional layers all adopt the same structure as the eleventh to thirteenth convolutional layers;
the sizes of the convolution kernels of the twenty-third to twenty-fifth convolutional layers are (1,1,256), (3,3, 256) and (1,1, 1024), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the twenty-third convolutional layer and also as the input value of the third-layer activation function of the block; the twenty-sixth to twenty-eighth, twenty-ninth to thirty-first, thirty-second to thirty-fourth, thirty-fifth to thirty-seventh, and thirty-eighth to fortieth convolutional layers all adopt the same structure as the twenty-third to twenty-fifth convolutional layers;
the sizes of the convolution kernels of the forty-first to forty-third convolutional layers are (1,1, 512), (3,3, 512) and (1,1, 2048), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the forty-first convolutional layer and also as the input value of the third-layer activation function of the block; the forty-fourth to forty-sixth and forty-seventh to forty-ninth convolutional layers adopt the same structure as the forty-first to forty-third convolutional layers;
after the forty-ninth convolutional layer there are three parallel convolutional layers; each parallel convolutional layer uses 2048 convolution kernels, the sizes of the first to third parallel convolutional layers are (3, 1024), (5, 1024) and (7, 1024) respectively, the channels of the three parallel convolutional layers are combined through a connection layer, and the subsequent max-pooling is (4, 4);
the last layer is the end convolution layer with size (2, 2048) using 1024 convolution kernels;
convolutional layer B uses 2 convolutional kernels, with a size of (1,1, 4096).
As shown in fig. 5 and fig. 6, the scheme of this embodiment achieves cross-camera tracking performance comparable to existing offline cross-camera tracking schemes, with the difference that this scheme performs the tracking directly online.
The present invention is not limited to the above preferred embodiments, and other various methods for aligning the identity of a pedestrian across a camera scene can be derived by anyone based on the teaching of the present invention.

Claims (1)

1. A pedestrian identity alignment method under a camera-crossing scene is characterized by comprising the following steps:
step S1: the cameras respectively add the pedestrian images detected by the cameras through the detector into a candidate pool to be associated;
step S2: calculating the appearance adaptation degree of two pedestrian images belonging to different cameras in a candidate pool to be associated;
step S3: after the pedestrian images in the candidate pool to be associated are pairwise paired and the appearance adaptation degrees are calculated, according to the appearance adaptation degrees of each pair of pedestrians, a minimum cost flow diagram model is established by combining the space-time relationship, and the optimal pedestrian association solution is solved;
step S4: according to the correlation result of the step S3, the operation of keeping the original identification or giving a new identification is carried out on the pedestrian;
in step S1, the detector is Faster R-CNN;
in step S2, the specific steps of calculating the appearance suitability are step A21 to step A23, or step B21 to step B24:
wherein the step A21-the step A23 specifically comprises the following steps:
step A21: image preprocessing is completed using the fuzzy C-means clustering F-CCT model, and the overall feature of pedestrian image A is set as X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B as Y = {y_1, y_2, ..., y_N};
Step A22: the images processed in step A21 are taken as input data of the pedestrian correlation model SAM-Dets fused with fine-grained representation: with X as the input vector and Y as the weight vector, the local fine-grained feature f_1 of pedestrian A is encoded by the pedestrian correlation model SAM-Dets fused with fine-grained representation; with Y as the input vector and X as the weight vector, the local fine-grained feature f_2 of pedestrian B is encoded by the pedestrian correlation model SAM-Dets fused with fine-grained representation;
Step A23: f_s = (f_1 - f_2)^2 is taken as the input of two convolution layers C with kernel size 1 × 1 × 4096, and softmax is used as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world, which is taken as the appearance suitability;
in step A22, the structure of the pedestrian correlation model SAM-Dets fused with fine-grained representation includes: K attention branches and a splicing layer; each of the attention branches includes the following six layers, wherein:
the first layer is a convolution layer A and is used for extracting high-level features of the input pedestrian overall features;
the second layer is an activation layer, and the activation function is softmax;
the third layer is a dimension expansion layer;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer for reducing feature dimension;
the sixth layer is a full connection layer and is used for finishing inner product calculation of the input vector and the weight vector in the weight matrix;
the splicing layer splices results obtained by the K attention branches according to channels and outputs the local fine-grained characteristic of the pedestrian;
the convolution kernel size of the convolution layer A is 1 × 1, and the step size is 1; the dimension expansion layer expands the channel dimension to 512 dimensions; the size of the global average pooling layer is 1 × 1, and the step size is 1;
the step B21-the step B24 are specifically:
step B21: image preprocessing is completed using the fuzzy C-means clustering F-CCT model, and the overall feature of pedestrian image A is set as X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B as Y = {y_1, y_2, ..., y_N};
Step B22: extracting pedestrian abstract characteristics from the pedestrian image A and the pedestrian image B through a DR-ResNet basic network;
step B23: further extracting high-level features of the target pedestrians with convolution layer B, and using them as input data of a classification model and of the pedestrian correlation model SAM-Dets fused with fine-grained representation; the classification model outputs the identity labels of pedestrian image A and pedestrian image B respectively, and the pedestrian correlation model SAM-Dets fused with fine-grained representation outputs the local fine-grained feature f_1 of pedestrian A and the local fine-grained feature f_2 of pedestrian B;
Step B24: f_s = (f_1 - f_2)^2 is taken as the input of two convolution layers C with kernel size 1 × 1 × 4096, and softmax is used as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world, which is taken as the appearance suitability;
in step B22, the DR-ResNet base network comprises two identical weight-sharing deep convolutional twin (Siamese) neural base network modules R-ResNet; the structure of the deep convolutional twin neural base network module R-ResNet comprises forty-nine convolutional layers, three parallel convolutional layers and an end convolutional layer:
wherein the convolution kernel size of the first convolutional layer is (7,7, 64), the max-pooling is (3,3), and the sliding step size is 2;
the sizes of convolution kernels of the second convolution layer to the fourth convolution layer are (1,1,64), (3,3,64) and (1,1,256), and the ReLu function is adopted as the activation function; forming a convolution block by the three layers of convolution layers and the activation function, and taking an input value of the convolution block as an input value of a second convolution layer and an input value of a third layer of activation function of the convolution block; the fifth to seventh convolutional layers, and the eighth to tenth convolutional layers all adopt the same structures as the second to fourth convolutional layers;
the sizes of the convolution kernels of the eleventh to thirteenth convolutional layers are (1,1, 128), (3,3, 128) and (1,1, 512), and the ReLu function is adopted as the activation function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the eleventh convolutional layer and also as the input value of the third-layer activation function of the block; the fourteenth to sixteenth, seventeenth to nineteenth, and twentieth to twenty-second convolutional layers all adopt the same structure as the eleventh to thirteenth convolutional layers;
the sizes of the convolution kernels of the twenty-third to twenty-fifth convolutional layers are (1,1,256), (3,3, 256) and (1,1, 1024), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the twenty-third convolutional layer and also as the input value of the third-layer activation function of the block; the twenty-sixth to twenty-eighth, twenty-ninth to thirty-first, thirty-second to thirty-fourth, thirty-fifth to thirty-seventh, and thirty-eighth to fortieth convolutional layers all adopt the same structure as the twenty-third to twenty-fifth convolutional layers;
the sizes of the convolution kernels of the forty-first to forty-third convolutional layers are (1,1, 512), (3,3, 512) and (1,1, 2048), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the forty-first convolutional layer and also as the input value of the third-layer activation function of the block; the forty-fourth to forty-sixth and forty-seventh to forty-ninth convolutional layers adopt the same structure as the forty-first to forty-third convolutional layers;
after the forty-ninth convolutional layer there are three parallel convolutional layers; each parallel convolutional layer uses 2048 convolution kernels, the sizes of the first to third parallel convolutional layers are (3, 1024), (5, 1024) and (7, 1024) respectively, the channels of the three parallel convolutional layers are combined through a connection layer, and the subsequent max-pooling is (4, 4);
the last layer is the end convolution layer with size (2, 2048) using 1024 convolution kernels;
the convolutional layer B uses 2 convolutional kernels and has the size of (1,1, 4096);
the structure of the pedestrian correlation model SAM-Dets fused with the fine-grained representation comprises the following steps: k attention branches and splice layers; each of the attention branches includes the following six layers, wherein:
the first layer is convolution layer A, with a convolution kernel size of 1 × 1 and a step size of 1, and is used for extracting high-level features of the input overall pedestrian features;
the second layer is an active layer, and the active function is softmax;
the third layer is a dimension expanding layer, and the dimension of the channel is expanded into 512 dimensions;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer with a size of 1 × 1 and a step size of 1, used for reducing the feature dimension;
the sixth layer is a full connection layer and is used for finishing inner product calculation of the input vector and the weight vector in the weight matrix;
the splicing layer splices results obtained by the K attention branches according to channels and outputs the local fine-grained characteristic of the pedestrian;
in step S3, according to the degree of adaptation of each pair of pedestrians, a minimum cost flow graph model is established in combination with the spatio-temporal relationship, and a specific process of solving an optimal pedestrian association solution includes the following steps:
step S31: let G_(t_p-1) be the cost flow graph at time t_p-1 after the instant alignment at that time has been completed; at time t_p, for each target in the in-view pedestrian set and the out-of-view pedestrian set, an entry node and an exit node are newly added, and the directed edges connecting the newly added nodes with the source and the sink are updated;
step S32: according to the pairwise pedestrian appearance suitability between targets in the in-view pedestrian set and the out-of-view pedestrian set, the directed edges between the corresponding nodes are updated to obtain the new cost flow graph G_(t_p) at time t_p;
step S33: all aligned target nodes are deleted, the unaligned target nodes in the in-view pedestrian set remain and are treated as newly entered target pedestrians, yielding the cost flow graph G'_(t_p) that waits to be updated and aligned at the next time instant;
in step S4, the pedestrian either keeps the original identity or is given a new identity, and is then handed to a tracker for continued tracking; the tracker uses the KCF algorithm for tracking; the KCF algorithm allocates one tracker to each pedestrian;
the tracking visual field area of the camera is divided into a core area and a critical area, and in step S1, only pedestrians in the critical area are detected by the detector.
CN201911189515.XA 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene Active CN110969112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911189515.XA CN110969112B (en) 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911189515.XA CN110969112B (en) 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene

Publications (2)

Publication Number Publication Date
CN110969112A CN110969112A (en) 2020-04-07
CN110969112B true CN110969112B (en) 2022-08-16

Family

ID=70031971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911189515.XA Active CN110969112B (en) 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene

Country Status (1)

Country Link
CN (1) CN110969112B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969571A (en) * 2019-11-29 2020-04-07 福州大学 Method and system for specified self-adaptive illumination migration in camera-crossing scene
CN112287868B (en) * 2020-11-10 2021-07-13 上海依图网络科技有限公司 Human body action recognition method and device
CN112950954B (en) * 2021-02-24 2022-05-20 电子科技大学 Intelligent parking license plate recognition method based on high-position camera
CN113947782B (en) * 2021-10-14 2024-06-07 哈尔滨工程大学 Pedestrian target alignment method based on attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056581A (en) * 2016-05-23 2016-10-26 北京航空航天大学 Method of extracting infrared pedestrian object by utilizing improved fuzzy clustering algorithm
CN108198200A (en) * 2018-01-26 2018-06-22 福州大学 The online tracking of pedestrian is specified under across camera scene
CN108257158A (en) * 2018-03-27 2018-07-06 福州大学 A kind of target prediction and tracking based on Recognition with Recurrent Neural Network
CN110428448A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 Target detection tracking method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8842881B2 (en) * 2012-04-26 2014-09-23 General Electric Company Real-time video tracking system
PL3209033T3 (en) * 2016-02-19 2020-08-10 Nokia Technologies Oy Controlling audio rendering

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056581A (en) * 2016-05-23 2016-10-26 北京航空航天大学 Method of extracting infrared pedestrian object by utilizing improved fuzzy clustering algorithm
CN108198200A (en) * 2018-01-26 2018-06-22 福州大学 The online tracking of pedestrian is specified under across camera scene
CN108257158A (en) * 2018-03-27 2018-07-06 福州大学 A kind of target prediction and tracking based on Recognition with Recurrent Neural Network
CN110428448A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 Target detection tracking method, device, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An Equalized Global Graph Model-Based Approach for Multicamera Object Tracking; W. Chen et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2016-07-11; full text *
余春艳 et al. Discriminative feature learning model for instant pedestrian alignment across cameras. Journal of Computer-Aided Design & Computer Graphics. 2019, Vol. 31(4), pp. 602-611. *
Cross-camera pedestrian tracking based on deep learning and spatio-temporal constraints; 夏天 et al.; Computer & Digital Engineering; 2017-11-30; Vol. 45(11); full text *
Video pedestrian re-identification method combining multi-level deep feature representation and ordered weighted distance fusion; 孙锐 et al.; Acta Optica Sinica; 2019-09-30; Vol. 39(9); full text *
Discriminative feature learning model for instant pedestrian alignment across cameras; 余春艳 et al.; Journal of Computer-Aided Design & Computer Graphics; 2019-04-15; Vol. 31(4); Sections 1-4 *

Also Published As

Publication number Publication date
CN110969112A (en) 2020-04-07

Similar Documents

Publication Publication Date Title
CN110969112B (en) Pedestrian identity alignment method under camera-crossing scene
CN108198200B (en) Method for tracking specified pedestrian on line under cross-camera scene
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
Yin et al. Recurrent convolutional network for video-based smoke detection
CN111666843B (en) Pedestrian re-recognition method based on global feature and local feature splicing
CN106778604B (en) Pedestrian re-identification method based on matching convolutional neural network
CN109101915B (en) Face, pedestrian and attribute recognition network structure design method based on deep learning
CN109325546B (en) Step-by-step footprint identification method combining features of step method
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN111723645A (en) Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN109214263A (en) A kind of face identification method based on feature multiplexing
CN111860291A (en) Multi-mode pedestrian identity recognition method and system based on pedestrian appearance and gait information
CN110765839B (en) Multi-channel information fusion and artificial intelligence emotion monitoring method for visible light facial image
CN109902585A (en) A kind of three modality fusion recognition methods of finger based on graph model
CN108229435B (en) Method for pedestrian recognition
CN111259837B (en) Pedestrian re-identification method and system based on part attention
CN111401113A (en) Pedestrian re-identification method based on human body posture estimation
CN111241963A (en) First-person visual angle video interactive behavior identification method based on interactive modeling
CN113269099B (en) Vehicle re-identification method under heterogeneous unmanned system based on graph matching
CN114463340A (en) Edge information guided agile remote sensing image semantic segmentation method
Nguyen et al. Attention-based shape and gait representations learning for video-based cloth-changing person re-identification
CN111160115A (en) Video pedestrian re-identification method based on twin double-flow 3D convolutional neural network
Nguyen et al. How feature fusion can help to improve multi-shot person re-identification performance?
Nguyen et al. Robust person re-identification through the combination of metric learning and late fusion techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant