CN110969112B - Pedestrian identity alignment method under camera-crossing scene


Info

Publication number
CN110969112B
Authority
CN
China
Prior art keywords
pedestrian
layer
convolution
convolutional
layers
Prior art date
Legal status
Active
Application number
CN201911189515.XA
Other languages
Chinese (zh)
Other versions
CN110969112A (en)
Inventor
余春艳
钟诗俊
赖奇嵘
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911189515.XA priority Critical patent/CN110969112B/en
Publication of CN110969112A publication Critical patent/CN110969112A/en
Application granted granted Critical
Publication of CN110969112B publication Critical patent/CN110969112B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/30 Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian identity alignment method for a cross-camera scene, which solves the pedestrian association problem across cameras. On the basis of continuously and correctly tracking a target pedestrian under a single camera, the target enters a blind area after leaving the camera's field of view and tracking is interrupted; when the target reappears under a camera, the pedestrian is re-identified, its identity is kept unchanged, and tracking continues. When a detector detects a new pedestrian, the pedestrian is added to a candidate pool to be associated; a pair of pedestrians is then selected from the candidate pool, image preprocessing is completed with the F-CCT model, and the processed images are used as input to the SAM-Dets model to obtain the appearance suitability of the two pedestrians. After the appearance suitability of every pair of pedestrians in the candidate pool has been computed, a minimum-cost flow graph model is built from the pairwise suitability values combined with the spatio-temporal relationship, and the optimal pedestrian association is solved. Finally, according to the association result, each pedestrian either keeps its original identity or is assigned a new identity.

Description

Pedestrian identity alignment method under camera-crossing scene
Technical Field
The invention belongs to the field of machine vision and intelligent security, and particularly relates to a pedestrian identity alignment method in a camera-crossing scene.
Background
With the continuous development of the economy and society, the demand for safety keeps growing. The field of intelligent security therefore continues to be studied and developed, its range of applications keeps expanding, and technologies such as pedestrian detection and tracking have become hot research topics. For the pedestrian tracking problem, research under cross-camera settings has clear practical significance and must handle multiple pedestrians. Although pedestrian tracking under a single camera is relatively mature, with multiple cameras, and especially with non-overlapping fields of view, blind areas make the spatio-temporal information of a target unreliable, which greatly complicates identifying, tracking and retrieving the same target in different cameras at different times and places. This drives research on pedestrian tracking across camera scenes, and the most important part is how to match pedestrian identities under different cameras.
Cross-camera pedestrian identity alignment mainly takes pedestrians as the research object and focuses on multi-camera multi-target tracking with non-overlapping fields of view. The common solution has two steps: first, a detection and tracking algorithm obtains the trajectory of each target under a single camera; second, an association algorithm integrates the independent pedestrian trajectories across cameras, yielding the complete motion trajectory of each target. This mechanism is limited to offline data, is essentially suited to retrieval scenarios, and cannot support online tracking. The reason is that after a target pedestrian leaves the current camera's field of view, the blind area causes its spatio-temporal information to be lost by the time it enters the next camera's field of view, which makes it harder to correctly hand the target over from one camera to the next. The mechanism also has the side effect of making the cross-camera tracking result heavily dependent on the single-camera tracking quality.
The key to pedestrian cross-camera identity alignment is to correctly associate the same target pedestrian across different fields of view. Most existing cross-camera pedestrian tracking algorithms have limited ability to learn pedestrian features and cannot learn sufficiently robust representations; this ultimately degrades the accuracy of the subsequent pedestrian similarity measure, produces undesirable data association results, and makes them ill-suited to the complex environment of cross-camera pedestrian tracking.
Although the existing research related to cross-camera identity alignment can effectively solve some pedestrian tracking on offline data, the requirement for instant online tracking cannot be met, and effective tracking cannot be performed when an unknown pedestrian enters or exits an area.
Disclosure of Invention
In order to overcome the gaps and defects of the prior art, the invention aims to solve the pedestrian association problem across cameras: a newly entered target is either assigned a new identity, or, when it is successfully associated with a target that previously left, it keeps that target's original identity. The method assumes a target pedestrian is continuously and correctly tracked under a single camera; after the target leaves the camera's field of view it enters a blind area and tracking is interrupted, but when the target reappears under a camera, the pedestrian is re-identified, its identity is kept unchanged, and tracking continues. When a detector detects a new pedestrian, the pedestrian is added to a candidate pool to be associated; a pair of pedestrians is then selected from the candidate pool, image preprocessing is completed with the F-CCT model, and the processed images are used as input to the SAM-Dets model to obtain the appearance suitability of the two pedestrians. After the appearance suitability of every pair of pedestrians in the candidate pool has been computed, a minimum-cost flow graph model is built from the pairwise suitability values combined with the spatio-temporal relationship, and the optimal pedestrian association is solved. Finally, according to the association result, each pedestrian either keeps its original identity or is assigned a new identity, and is handed to a tracker for continued tracking.
The invention specifically adopts the following technical scheme:
a pedestrian identity alignment method under a camera-crossing scene is characterized by comprising the following steps:
step S1: the multiple cameras respectively add pedestrian images detected by the cameras through the detectors into a candidate pool to be associated;
step S2: calculating the appearance adaptation degree of two pedestrian images belonging to different cameras in the candidate pool to be associated;
step S3: after the pedestrian images in the candidate pool to be associated are pairwise paired and the appearance adaptation degrees are calculated, according to the appearance adaptation degrees of each pair of pedestrians, a minimum cost flow diagram model is established by combining the space-time relationship, and the optimal pedestrian association solution is solved;
step S4: according to the association result of step S3, the pedestrian either keeps its original identity or is assigned a new identity.
Preferably, in step S1, the detector is Faster R-CNN.
Preferably, in step S2, the step of calculating the degree of appearance suitability specifically includes the following steps:
step A21: completing image preprocessing using the fuzzy C-means clustering F-CCT model, and letting the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N};
Step A22: taking the images processed in step A21 as input data of the pedestrian association model SAM-Dets with fused fine-grained representation: with X as the input vector and Y as the weight vector, the local fine-grained feature f_1 of pedestrian A is encoded by the model; with Y as the input vector and X as the weight vector, the local fine-grained feature f_2 of pedestrian B is encoded by the model;
Step A23: taking f_s = (f_1 - f_2)^2 as the input of two convolution layers C with kernel size 1 × 1 × 4096, and using softmax as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world; this probability is taken as the appearance suitability.
Preferably, in step A22, the structure of the pedestrian association model SAM-Dets with fused fine-grained representation includes: K attention branches and a splicing layer; each attention branch includes the following six layers, wherein:
the first layer is a convolution layer A and is used for extracting high-level features of the input pedestrian overall features;
the second layer is an activation layer, and the activation function is softmax;
the third layer is a dimension expansion layer;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer for reducing feature dimension;
the sixth layer is a full connection layer and is used for completing the inner product calculation of the input vector and the weight vector in the weight matrix;
and the splicing layer splices the results obtained by the K attention branches according to channels and outputs the local fine-grained characteristics of pedestrians.
Preferably, the convolution kernel size of convolution layer A is 1 × 1, and the step size is 1; the dimension expansion layer expands the channel dimension to 512 dimensions; the size of the global average pooling layer is 1 × 1, and the step size is 1.
Preferably, in step S2, the step of calculating the degree of appearance suitability specifically includes the following steps:
step B21: completing image preprocessing using the fuzzy C-means clustering F-CCT model, and letting the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N};
Step B22: extracting abstract pedestrian features from pedestrian image A and pedestrian image B through the DR-ResNet base network;
step B23: further extracting high-level features of the target pedestrians with convolution layer B, and using them as input data of a classification model and of the pedestrian association model SAM-Dets with fused fine-grained representation; the classification model outputs the identity labels of pedestrian image A and pedestrian image B respectively, and the SAM-Dets model outputs the local fine-grained feature f_1 of pedestrian A and the local fine-grained feature f_2 of pedestrian B;
Step B24: taking f_s = (f_1 - f_2)^2 as the input of two convolution layers C with kernel size 1 × 1 × 4096, and using softmax as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world; this probability is taken as the appearance suitability.
Preferably, the DR-ResNet base network comprises two identical weight-sharing deep convolutional twin (Siamese) neural base network modules R-ResNet; the structure of the deep convolutional twin neural base network module R-ResNet comprises forty-nine convolutional layers, three parallel convolutional layers and an end convolutional layer:
wherein the convolution kernel size of the first convolutional layer is (7,7, 64), the max-pooling is (3,3), and the sliding step size is 2;
the sizes of convolution kernels of the second convolution layer to the fourth convolution layer are (1,1,64), (3,3,64) and (1,1,256), and the ReLu function is adopted as the activation function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the second convolutional layer and also used as the input value of the third layer of the activation function of the convolutional block; the fifth to seventh convolutional layers, and the eighth to tenth convolutional layers all adopt the same structures as the second to fourth convolutional layers;
the sizes of convolution kernels of the eleventh convolution layer to the thirteenth convolution layer are (1,1, 128), (3,3, 128) and (1,1, 512), and the ReLu function is adopted as the activation function; the three layers of convolution layers and the activation function form a convolution block, and the input value of the convolution block is used as the input value of the convolution layer of the eleventh layer and also used as the input value of the activation function of the third layer of the convolution block; the fourteenth to sixteenth convolutional layers, the seventeenth to nineteenth convolutional layers, and the twentieth to twenty second convolutional layers all adopt the same structures as the eleventh to thirteenth convolutional layers;
the sizes of the convolution kernels of the twenty-third to twenty-fifth convolutional layers are (1,1,256), (3,3, 256) and (1,1, 1024), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the twenty-third convolutional layer and also as the input value of the third-layer activation function of the block; the twenty-sixth to twenty-eighth, twenty-ninth to thirty-first, thirty-second to thirty-fourth, thirty-fifth to thirty-seventh, and thirty-eighth to fortieth convolutional layers all adopt the same structure as the twenty-third to twenty-fifth convolutional layers;
the sizes of the convolution kernels of the forty-first to forty-third convolutional layers are (1,1, 512), (3,3, 512) and (1,1, 2048), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the forty-first convolutional layer and also as the input value of the third-layer activation function of the block; the forty-fourth to forty-sixth and forty-seventh to forty-ninth convolutional layers adopt the same structure as the forty-first to forty-third convolutional layers;
after the forty-ninth convolutional layer there are three parallel convolutional layers; each parallel convolutional layer uses 2048 convolution kernels, the sizes of the first to third parallel convolutional layers are (3, 1024), (5, 1024) and (7, 1024) respectively, the channels of the three parallel convolutional layers are combined through a connection layer, and the subsequent max-pooling is (4, 4);
the last layer is the end convolution layer with size (2, 2048) using 1024 convolution kernels;
the convolutional layer B uses 2 convolutional kernels and has the size of (1,1, 4096);
the structure of the pedestrian correlation model SAM-Dets fused with the fine-grained representation comprises: k attention branches and splice layers; each of the attention branches includes the following six layers, wherein:
the first layer is convolution layer A, with a convolution kernel size of 1 × 1 and a step size of 1, and is used for extracting high-level features of the input overall pedestrian features;
the second layer is an activation layer, and the activation function is softmax;
the third layer is a dimension expanding layer, and the dimension of the channel is expanded into 512 dimensions;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer with a size of 1 × 1 and a step size of 1, used for reducing the feature dimension;
the sixth layer is a full connection layer and is used for finishing inner product calculation of the input vector and the weight vector in the weight matrix;
and the splicing layer splices the results obtained by the K attention branches according to channels and outputs the local fine-grained characteristics of pedestrians.
Preferably, in step S3, according to the degree of adaptation of each pair of pedestrians, the method combines the spatio-temporal relationship to establish a minimum cost flow graph model, and solves the optimal pedestrian association solution, including the following steps:
step S31: let G_(t_p-1) be the cost flow graph at time t_p-1 after the instant alignment at that time has been completed; at time t_p, for each target in the in-view pedestrian set and the out-of-view pedestrian set, an entry node and an exit node are newly added, and the directed edges connecting the newly added nodes with the source and the sink are updated;
step S32: according to the pairwise pedestrian appearance suitability between targets in the in-view pedestrian set and the out-of-view pedestrian set, the directed edges between the corresponding nodes are updated to obtain the new cost flow graph G_(t_p) at time t_p;
step S33: all aligned target nodes are deleted, the unaligned target nodes in the in-view pedestrian set remain and are treated as newly entered target pedestrians, yielding the cost flow graph G'_(t_p) that waits to be updated and aligned at the next time instant.
Preferably, in step S4, after the pedestrian is subjected to the operation of keeping the original identity or giving a new identity, the pedestrian is handed to the tracker for continuous tracking; the tracker adopts a KCF algorithm for tracking; the KCF algorithm assigns a tracker to each pedestrian.
Preferably, the tracking visual field area of the camera is divided into a core area and a critical area, and in step S1, only pedestrians in the critical area are detected by the detector.
Compared with the prior art, the invention and its preferred schemes achieve online cross-camera pedestrian tracking with accurate identification and high efficiency, and are not affected when a target leaves a camera's field of view, enters a blind area and tracking is temporarily interrupted.
Pedestrian detection is provided by the mature Faster R-CNN detector, and online pedestrian tracking by the KCF algorithm; the core of the invention and its preferred schemes is to build a minimum-cost flow graph model from the spatio-temporal information and the pedestrian similarity values so as to complete the pedestrian identity alignment task instantly, and to integrate the F-CCT model with the SAM-Dets model, or the SAM-Dets model with the DR-ResNet network model, thereby solving the measurement of pedestrian appearance suitability.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic view of the overall flow of example 1 of the present invention;
FIG. 2 is a diagram of a SAM-Dets model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a SAM-Dets model network structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fine-grained association model of a pedestrian according to embodiment 2 of the present invention;
fig. 5 is a schematic diagram 1 illustrating the effect of a pedestrian identity alignment model in a cross-camera scene in embodiment 2 of the present invention;
fig. 6 is a schematic diagram 2 illustrating the effect of the pedestrian identity alignment model in the cross-camera scene in embodiment 2 of the present invention.
Detailed Description
In order to make the features and advantages of this patent more comprehensible, 2 embodiments accompanied with figures are described in detail below:
as shown in fig. 1, in a first embodiment of the present invention, an overall scheme for implementing identity alignment of pedestrians in a camera-crossing scene includes the following steps:
step S1: when the pedestrian enters the visual field of the cameras, the multiple cameras respectively add images of the pedestrian detected by the cameras through the detectors into the candidate pool to be associated.
In this embodiment, the detector adopts Faster R-CNN, a representative deep-learning object detection method. When generating candidate boxes, its RPN shares the convolutional features of the whole image with the rest of the network, so that the classification and regression tasks operate on the same convolutional features.
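By way of illustration only, the following Python sketch shows how detections from an off-the-shelf Faster R-CNN could feed the candidate pool of step S1. The torchvision implementation and the 0.8 score threshold are assumptions of this sketch, not details given in the patent.

import torch
import torchvision

# Pre-trained Faster R-CNN from torchvision (trained on COCO, where label 1 is "person").
# Older torchvision versions use pretrained=True instead of weights="DEFAULT".
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_pedestrians(frame_tensor, camera_id, candidate_pool, score_thr=0.8):
    """Run the detector on one frame (CHW float tensor in [0, 1]) and add pedestrian
    crops to the candidate pool to be associated."""
    with torch.no_grad():
        output = detector([frame_tensor])[0]          # dict with boxes, labels, scores
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if label.item() == 1 and score.item() >= score_thr:   # keep confident "person" boxes
            x1, y1, x2, y2 = [int(v) for v in box.tolist()]
            candidate_pool.append({"camera": camera_id,
                                   "box": (x1, y1, x2, y2),
                                   "crop": frame_tensor[:, y1:y2, x1:x2]})
    return candidate_pool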
Step S2: and calculating the appearance adaptation degree of two pedestrian images belonging to different cameras in the candidate pool to be associated.
In an embodiment, the calculating the appearance suitability specifically includes the following steps:
step A21: complete image preprocessing using the fuzzy C-means clustering F-CCT model, and let the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N}; the fuzzy clustering algorithm divides each image into clustering domains, local colour-luminance transfer between clustering domains is realized by matching the clustering domains of the source image and the target image, and a membership factor is introduced to improve the colour-luminance transfer effect.
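For illustration, the numpy sketch below implements this kind of membership-weighted, per-cluster colour-luminance transfer. The cluster count k, the centre-distance rule used to match clustering domains, and working directly in RGB are assumptions of the sketch; the patent's F-CCT model is not reproduced exactly.

import numpy as np

def fuzzy_cmeans(pixels, k=3, m=2.0, iters=20, seed=0):
    """Plain fuzzy C-means on pixel colours; returns (cluster centres, memberships)."""
    rng = np.random.default_rng(seed)
    u = rng.random((pixels.shape[0], k))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        w = u ** m
        centers = (w.T @ pixels) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2) + 1e-8
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
    return centers, u

def fcct_like_transfer(source, target, k=3):
    """Cluster both images, match clustering domains by centre colour, then apply a
    per-cluster mean/std colour-luminance transfer weighted by the membership degrees."""
    s = source.reshape(-1, 3).astype(np.float64)
    t = target.reshape(-1, 3).astype(np.float64)
    cs, us = fuzzy_cmeans(s, k)
    ct, ut = fuzzy_cmeans(t, k)
    pairing = np.argmin(np.linalg.norm(cs[:, None, :] - ct[None, :, :], axis=2), axis=1)
    out = np.zeros_like(s)
    for i in range(k):
        ws, wt = us[:, i:i + 1], ut[:, pairing[i]:pairing[i] + 1]
        mu_s, mu_t = (ws * s).sum(0) / ws.sum(), (wt * t).sum(0) / wt.sum()
        std_s = np.sqrt((ws * (s - mu_s) ** 2).sum(0) / ws.sum()) + 1e-8
        std_t = np.sqrt((wt * (t - mu_t) ** 2).sum(0) / wt.sum()) + 1e-8
        out += ws * ((s - mu_s) / std_s * std_t + mu_t)   # membership-weighted transfer
    return np.clip(out, 0, 255).reshape(source.shape).astype(np.uint8)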
Step A22: the images processed in step A21 are taken as input data of the pedestrian association model SAM-Dets with fused fine-grained representation: with X as the input vector and Y as the weight vector, the local fine-grained feature f_1 of pedestrian A is encoded by the model; with Y as the input vector and X as the weight vector, the local fine-grained feature f_2 of pedestrian B is encoded by the model.
As shown in fig. 2, in this embodiment the pedestrian association model SAM-Dets with fused fine-grained representation is composed of a plurality of attention branches; each branch has the same functional modules, and the input data are the global pedestrian features extracted by the base network. Each attention branch passes through a local detector, a global pooling module and a linear embedding module once, and finally the results of the K branches are spliced by channel to obtain the complete output of the attention model.
In the local detector module, high-level features of the input are first acquired and normalized with a softmax function to obtain attention weights that lie in a probability-distribution value interval; the dimensionality of the high-level features is then expanded for the subsequent summation operation, and finally the attention probability distribution of the corresponding features is obtained with a weighted summation. In the global pooling module and the linear embedding module, the attention probability distribution is used to screen and retain the corresponding pedestrian features, and finally high-level pedestrian features carrying the attention distribution are output.
As shown in fig. 3, specifically, the network structure of the pedestrian association model SAM-Dets fusing fine-grained representation includes: k attention branches and splice layers; each attention branch comprises the following six layers, wherein:
the first layer is a convolution layer A and is used for extracting high-level features of the input pedestrian overall features;
the second layer is an active layer, and the active function is softmax;
the third layer is a dimension expansion layer;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer for reducing feature dimension;
the sixth layer is a full connection layer and is used for completing the inner product calculation of the input vector and the weight vector in the weight matrix;
and the splicing layer splices the results obtained by the K attention branches according to channels and outputs the local fine-grained characteristics of pedestrians.
The convolution kernel size of convolution layer A is 1 × 1, and the step size is 1; the dimension expansion layer expands the channel dimension to 512 dimensions; the size of the global average pooling layer is 1 × 1, and the step size is 1. A sketch of one such attention branch follows.
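The PyTorch sketch below is one possible reading of a single attention branch and of the K-branch splicing. The 512-channel input, the way the other image's feature enters layer six as the weight vector, and the branch count are assumptions left open by the text above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """One SAM-Dets attention branch following the six layers described above."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, 1, kernel_size=1, stride=1)  # layer 1: convolution layer A
        self.fc = nn.Linear(channels, channels)                        # layer 6: fully connected layer

    def forward(self, x, weight_vec=None):
        b, c, h, w = x.shape
        a = self.conv_a(x)                                          # layer 1: high-level features
        a = F.softmax(a.view(b, 1, -1), dim=-1).view(b, 1, h, w)    # layer 2: softmax activation
        a = a.expand(-1, c, -1, -1)                                 # layer 3: expand channel dimension
        y = x + a                                                   # layer 4: sum with overall features
        y = F.adaptive_avg_pool2d(y, 1).flatten(1)                  # layer 5: global average pooling
        if weight_vec is not None:                                  # layer 6: use the other image's
            return y * weight_vec                                   # feature as the weight vector
        return self.fc(y)

class SAMDets(nn.Module):
    """K attention branches whose results are spliced by channel."""
    def __init__(self, k=8, channels=512):
        super().__init__()
        self.branches = nn.ModuleList(AttentionBranch(channels) for _ in range(k))

    def forward(self, x, weight_vec=None):
        return torch.cat([branch(x, weight_vec) for branch in self.branches], dim=1)

With k = 8 branches of 512 dimensions each, the spliced output is 4096-dimensional, which would line up with the 1 × 1 × 4096 convolution used in step A23; this correspondence is an inference of the sketch, not something the patent states.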
Step A23: converting the similarity value of a pair of pedestrians to be calculated and input into a pair f 1 And f 2 Similarity comparison of features. Introducing a parameter-free layer Square layer to pair f 1 And f 2 The feature solution squared error as f 1 And f 2 And (3) comparing the similarity, and recording the Square layer as follows: f. of s =(f 1 -f 2 ) 2 (ii) a Will f is s =(f 1 -f 2 ) 2 As input values of two convolution layers C with a kernel size of 1 × 1 × 4096, softmax is used as an output function to output a two-dimensional vector (q) 1 ,q 2 ) The probability value indicating that two objects belong to the same person in the real world is input as the appearance suitability.
Further, the obtained similarity probability of each pair of pedestrians is used as an edge weight of a graph; the newly entered pedestrians and the target pedestrians waiting to be associated are taken as two vertex sets, and a weighted matching graph is built. Solving the maximum-weight matching problem on this graph yields the data association between the newly entered pedestrians and the target pedestrians waiting to be associated.
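By way of illustration, such a bipartite maximum-weight matching can be solved with the Hungarian algorithm; the sketch below uses scipy's linear_sum_assignment, and the 0.5 acceptance threshold is an assumption.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(similarity, threshold=0.5):
    """similarity[i, j]: appearance suitability between newly entered pedestrian i and
    waiting target pedestrian j. Returns accepted matches and unmatched new entries."""
    rows, cols = linear_sum_assignment(-similarity)          # maximize total similarity
    matches = [(r, c) for r, c in zip(rows, cols) if similarity[r, c] >= threshold]
    matched_rows = {r for r, _ in matches}
    new_identities = [r for r in range(similarity.shape[0]) if r not in matched_rows]
    return matches, new_identities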
Step S3: after the appearance adaptation degrees of the pedestrian images in the candidate pool to be associated are pairwise paired and calculated, a minimum cost flow graph model is established according to the appearance adaptation degrees of each pair of pedestrians and by combining the space-time relationship, and an optimal pedestrian association solution is solved.
The method specifically comprises the following steps:
step S31: let G_(t_p-1) be the cost flow graph at time t_p-1 after the instant alignment at that time has been completed; at time t_p, for each target in the in-view pedestrian set and the out-of-view pedestrian set, an entry node and an exit node are newly added, and the directed edges connecting the newly added nodes with the source and the sink are updated;
step S32: according to the pairwise pedestrian appearance suitability between targets in the in-view pedestrian set and the out-of-view pedestrian set, the directed edges between the corresponding nodes are updated to obtain the new cost flow graph G_(t_p) at time t_p;
step S33: all aligned target nodes are deleted, the unaligned target nodes in the in-view pedestrian set remain and are treated as newly entered target pedestrians, yielding the cost flow graph G'_(t_p) that waits to be updated and aligned at the next time instant.
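The networkx sketch below illustrates one possible construction and solution of such a cost flow graph at a single time instant; the -log(suitability) edge cost, the integer scaling, and the acceptance threshold are assumptions of the sketch rather than details fixed by the patent.

import math
import networkx as nx

def align_once(exited, entered, suitability, threshold=0.5):
    """exited: identities that left some camera's view; entered: newly detected candidates;
    suitability[(i, j)]: appearance suitability between exited i and entered j.
    Returns the (exited, entered) pairs selected by the minimum-cost flow."""
    if not exited or not entered:
        return []
    G = nx.DiGraph()
    for i in exited:
        G.add_edge("source", ("out", i), capacity=1, weight=0)
    for j in entered:
        G.add_edge(("in", j), "sink", capacity=1, weight=0)
    for i in exited:
        for j in entered:
            q = suitability[(i, j)]
            if q >= threshold:                       # only plausible associations get an edge
                G.add_edge(("out", i), ("in", j), capacity=1,
                           weight=int(-1000 * math.log(max(q, 1e-6))))
    flow = nx.max_flow_min_cost(G, "source", "sink")
    return [(i, j) for i in exited for j in entered
            if flow.get(("out", i), {}).get(("in", j), 0) == 1]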
Step S4: according to the association result of step S3, the pedestrian either keeps its original identity or is assigned a new identity, and is then handed to a tracker for continued tracking; the tracker uses the KCF algorithm, and one KCF tracker is assigned to each pedestrian. The algorithm forms a circulant matrix over the target area, exploits properties of circulant matrices such as diagonalization in the Fourier domain, and obtains a general prediction formula through ridge regression.
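As an illustration of the one-tracker-per-pedestrian arrangement, the sketch below uses OpenCV's KCF implementation; it requires the opencv-contrib package, and the factory function is named cv2.TrackerKCF_create or cv2.legacy.TrackerKCF_create depending on the OpenCV version.

import cv2

class PedestrianTrackers:
    """Maintains one KCF tracker per pedestrian identity."""
    def __init__(self):
        self.trackers = {}                            # identity -> KCF tracker

    def add(self, identity, frame, bbox):
        """bbox is (x, y, w, h) in pixel coordinates."""
        factory = getattr(cv2, "TrackerKCF_create", None) or cv2.legacy.TrackerKCF_create
        tracker = factory()
        tracker.init(frame, tuple(int(v) for v in bbox))
        self.trackers[identity] = tracker

    def update(self, frame):
        """Returns {identity: bbox} for every pedestrian still tracked in this frame."""
        results = {}
        for identity, tracker in list(self.trackers.items()):
            ok, bbox = tracker.update(frame)
            if ok:
                results[identity] = bbox
            else:
                del self.trackers[identity]           # lost target: stop tracking it here
        return results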
Meanwhile, in this embodiment, the tracking field of view of each camera is divided into a core area and a critical area, and in step S1 only pedestrians in the critical area are detected by the detector. The idea is that any target entering or leaving the view must pass through the critical area, so only pedestrians in the critical area are considered to have a reasonable spatial transfer relation, which preserves the generality of the subsequent pedestrian alignment solving to the greatest extent.
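A minimal geometric test for this core/critical division might look as follows; the 15% border width is an assumption, since the patent does not specify how the two areas are split.

def in_critical_area(bbox, frame_w, frame_h, margin=0.15):
    """True if the pedestrian box centre lies in the critical (border) strip of the view."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    mx, my = frame_w * margin, frame_h * margin
    return cx < mx or cx > frame_w - mx or cy < my or cy > frame_h - my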
In the second embodiment of the present invention, as shown in fig. 4, another preferred implementation of step S2 (calculating the appearance suitability of two pedestrian images belonging to different cameras in the candidate pool to be associated) is provided, which specifically includes the following steps:
step B21: complete image preprocessing using the fuzzy C-means clustering F-CCT model, and let the overall feature of pedestrian image A be X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B be Y = {y_1, y_2, ..., y_N};
Step B22: extracting pedestrian abstract characteristics from the pedestrian image A and the pedestrian image B through a DR-ResNet basic network;
step B23: further extracting high-level features of the target pedestrians with convolution layer B, and using them as input data of a classification model and of the pedestrian association model SAM-Dets with fused fine-grained representation; the classification model outputs the identity labels of pedestrian image A and pedestrian image B respectively, and the SAM-Dets model outputs the local fine-grained feature f_1 of pedestrian A and the local fine-grained feature f_2 of pedestrian B;
Step B24: taking f_s = (f_1 - f_2)^2 as the input of two convolution layers C with kernel size 1 × 1 × 4096, and using softmax as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world; this probability is taken as the appearance suitability.
The DR-ResNet base network comprises two identical weight-sharing deep convolutional twin (Siamese) neural base network modules R-ResNet; the structure of the deep convolutional twin neural base network module R-ResNet comprises forty-nine convolutional layers, three parallel convolutional layers and an end convolutional layer:
wherein the convolution kernel size of the first convolutional layer is (7,7, 64), the max-pooling is (3,3), and the sliding step size is 2;
the sizes of convolution kernels of the second convolution layer to the fourth convolution layer are (1,1,64), (3,3,64) and (1,1,256), and the ReLu function is adopted as the activation function; the three layers of convolution layers and the activation function form a convolution block, and the input value of the convolution block is used as the input value of the second convolution layer and the input value of the third layer of activation function of the convolution block; the fifth to seventh convolutional layers, and the eighth to tenth convolutional layers all adopt the same structures as the second to fourth convolutional layers;
the sizes of the convolution kernels of the eleventh to thirteenth convolutional layers are (1,1, 128), (3,3, 128) and (1,1, 512), and the ReLu function is adopted as the activation function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the eleventh convolutional layer and also as the input value of the third-layer activation function of the block; the fourteenth to sixteenth, seventeenth to nineteenth, and twentieth to twenty-second convolutional layers all adopt the same structure as the eleventh to thirteenth convolutional layers;
the sizes of the convolution kernels of the twenty-third to twenty-fifth convolutional layers are (1,1,256), (3,3, 256) and (1,1, 1024), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the twenty-third convolutional layer and also as the input value of the third-layer activation function of the block; the twenty-sixth to twenty-eighth, twenty-ninth to thirty-first, thirty-second to thirty-fourth, thirty-fifth to thirty-seventh, and thirty-eighth to fortieth convolutional layers all adopt the same structure as the twenty-third to twenty-fifth convolutional layers;
the sizes of the convolution kernels of the forty-first to forty-third convolutional layers are (1,1, 512), (3,3, 512) and (1,1, 2048), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the forty-first convolutional layer and also as the input value of the third-layer activation function of the block; the forty-fourth to forty-sixth and forty-seventh to forty-ninth convolutional layers adopt the same structure as the forty-first to forty-third convolutional layers;
after the forty-ninth convolutional layer there are three parallel convolutional layers; each parallel convolutional layer uses 2048 convolution kernels, the sizes of the first to third parallel convolutional layers are (3, 1024), (5, 1024) and (7, 1024) respectively, the channels of the three parallel convolutional layers are combined through a connection layer, and the subsequent max-pooling is (4, 4);
the last layer is the end convolution layer with size (2, 2048) using 1024 convolution kernels;
convolutional layer B uses 2 convolutional kernels, with a size of (1,1, 4096).
As shown in fig. 5 and fig. 6, the scheme of this embodiment achieves cross-camera tracking performance comparable to existing offline cross-camera tracking schemes, with the difference that this scheme performs the tracking directly online.
The present invention is not limited to the above preferred embodiments, and other various methods for aligning the identity of a pedestrian across a camera scene can be derived by anyone based on the teaching of the present invention.

Claims (1)

1. A pedestrian identity alignment method under a camera-crossing scene is characterized by comprising the following steps:
step S1: the cameras respectively add the pedestrian images detected by the cameras through the detector into a candidate pool to be associated;
step S2: calculating the appearance adaptation degree of two pedestrian images belonging to different cameras in a candidate pool to be associated;
step S3: after the pedestrian images in the candidate pool to be associated are pairwise paired and the appearance adaptation degrees are calculated, according to the appearance adaptation degrees of each pair of pedestrians, a minimum cost flow diagram model is established by combining the space-time relationship, and the optimal pedestrian association solution is solved;
step S4: according to the correlation result of the step S3, the operation of keeping the original identification or giving a new identification is carried out on the pedestrian;
in step S1, the detector is Faster R-CNN;
in step S2, the specific steps of calculating the appearance suitability are step A21 to step A23, or step B21 to step B24:
wherein the step A21-the step A23 specifically comprises the following steps:
step A21: image preprocessing is completed using the fuzzy C-means clustering F-CCT model, and the overall feature of pedestrian image A is set as X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B as Y = {y_1, y_2, ..., y_N};
Step A22: the images processed in step A21 are taken as input data of the pedestrian correlation model SAM-Dets fused with fine-grained representation: with X as the input vector and Y as the weight vector, the local fine-grained feature f_1 of pedestrian A is encoded by the pedestrian correlation model SAM-Dets fused with fine-grained representation; with Y as the input vector and X as the weight vector, the local fine-grained feature f_2 of pedestrian B is encoded by the pedestrian correlation model SAM-Dets fused with fine-grained representation;
Step A23: f_s = (f_1 - f_2)^2 is taken as the input of two convolution layers C with kernel size 1 × 1 × 4096, and softmax is used as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world, which is taken as the appearance suitability;
in step A22, the structure of the pedestrian correlation model SAM-Dets fused with fine-grained representation includes: K attention branches and a splicing layer; each of the attention branches includes the following six layers, wherein:
the first layer is a convolution layer A and is used for extracting high-level features of the input pedestrian overall features;
the second layer is an activation layer, and the activation function is softmax;
the third layer is a dimension expansion layer;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer for reducing feature dimension;
the sixth layer is a full connection layer and is used for finishing inner product calculation of the input vector and the weight vector in the weight matrix;
the splicing layer splices results obtained by the K attention branches according to channels and outputs the local fine-grained characteristic of the pedestrian;
the convolution kernel size of the convolution layer A is 1 × 1, and the step size is 1; the dimension expansion layer expands the channel dimension to 512 dimensions; the size of the global average pooling layer is 1 × 1, and the step size is 1;
the step B21-the step B24 are specifically:
step B21: image preprocessing is completed using the fuzzy C-means clustering F-CCT model, and the overall feature of pedestrian image A is set as X = {x_1, x_2, ..., x_N} and the overall feature of pedestrian image B as Y = {y_1, y_2, ..., y_N};
Step B22: extracting pedestrian abstract characteristics from the pedestrian image A and the pedestrian image B through a DR-ResNet basic network;
step B23: further extracting high-level features of the target pedestrians with convolution layer B, and using them as input data of a classification model and of the pedestrian correlation model SAM-Dets fused with fine-grained representation; the classification model outputs the identity labels of pedestrian image A and pedestrian image B respectively, and the pedestrian correlation model SAM-Dets fused with fine-grained representation outputs the local fine-grained feature f_1 of pedestrian A and the local fine-grained feature f_2 of pedestrian B;
Step B24: f_s = (f_1 - f_2)^2 is taken as the input of two convolution layers C with kernel size 1 × 1 × 4096, and softmax is used as the output function to output a two-dimensional vector (q_1, q_2) representing the probability that the two input targets belong to the same person in the real world, which is taken as the appearance suitability;
in step B22, the DR-ResNet base network comprises two identical weight-sharing deep convolutional twin (Siamese) neural base network modules R-ResNet; the structure of the deep convolutional twin neural base network module R-ResNet comprises forty-nine convolutional layers, three parallel convolutional layers and an end convolutional layer:
wherein the convolution kernel size of the first convolutional layer is (7,7, 64), the max-pooling is (3,3), and the sliding step size is 2;
the sizes of convolution kernels of the second convolution layer to the fourth convolution layer are (1,1,64), (3,3,64) and (1,1,256), and the ReLu function is adopted as the activation function; forming a convolution block by the three layers of convolution layers and the activation function, and taking an input value of the convolution block as an input value of a second convolution layer and an input value of a third layer of activation function of the convolution block; the fifth to seventh convolutional layers, and the eighth to tenth convolutional layers all adopt the same structures as the second to fourth convolutional layers;
the sizes of the convolution kernels of the eleventh to thirteenth convolutional layers are (1,1, 128), (3,3, 128) and (1,1, 512), and the ReLu function is adopted as the activation function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the eleventh convolutional layer and also as the input value of the third-layer activation function of the block; the fourteenth to sixteenth, seventeenth to nineteenth, and twentieth to twenty-second convolutional layers all adopt the same structure as the eleventh to thirteenth convolutional layers;
the sizes of the convolution kernels of the twenty-third to twenty-fifth convolutional layers are (1,1,256), (3,3, 256) and (1,1, 1024), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the twenty-third convolutional layer and also as the input value of the third-layer activation function of the block; the twenty-sixth to twenty-eighth, twenty-ninth to thirty-first, thirty-second to thirty-fourth, thirty-fifth to thirty-seventh, and thirty-eighth to fortieth convolutional layers all adopt the same structure as the twenty-third to twenty-fifth convolutional layers;
the sizes of the convolution kernels of the forty-first to forty-third convolutional layers are (1,1, 512), (3,3, 512) and (1,1, 2048), and the activation functions all adopt the ReLu function; the three convolutional layers and the activation function form a convolutional block, and the input value of the convolutional block is used as the input value of the forty-first convolutional layer and also as the input value of the third-layer activation function of the block; the forty-fourth to forty-sixth and forty-seventh to forty-ninth convolutional layers adopt the same structure as the forty-first to forty-third convolutional layers;
after the forty-ninth convolutional layer there are three parallel convolutional layers; each parallel convolutional layer uses 2048 convolution kernels, the sizes of the first to third parallel convolutional layers are (3, 1024), (5, 1024) and (7, 1024) respectively, the channels of the three parallel convolutional layers are combined through a connection layer, and the subsequent max-pooling is (4, 4);
the last layer is the end convolution layer with size (2, 2048) using 1024 convolution kernels;
the convolutional layer B uses 2 convolutional kernels and has the size of (1,1, 4096);
the structure of the pedestrian correlation model SAM-Dets fused with the fine-grained representation comprises the following steps: k attention branches and splice layers; each of the attention branches includes the following six layers, wherein:
the first layer is convolution layer A, with a convolution kernel size of 1 × 1 and a step size of 1, and is used for extracting high-level features of the input overall pedestrian features;
the second layer is an active layer, and the active function is softmax;
the third layer is a dimension expanding layer, and the dimension of the channel is expanded into 512 dimensions;
the fourth layer is a summation layer, and the overall pedestrian characteristics are added with the result obtained by the third layer;
the fifth layer is a global average pooling layer with a size of 1 × 1 and a step size of 1, used for reducing the feature dimension;
the sixth layer is a full connection layer and is used for finishing inner product calculation of the input vector and the weight vector in the weight matrix;
the splicing layer splices results obtained by the K attention branches according to channels and outputs the local fine-grained characteristic of the pedestrian;
in step S3, according to the degree of adaptation of each pair of pedestrians, a minimum cost flow graph model is established in combination with the spatio-temporal relationship, and a specific process of solving an optimal pedestrian association solution includes the following steps:
step S31: let G_(t_p-1) be the cost flow graph at time t_p-1 after the instant alignment at that time has been completed; at time t_p, for each target in the in-view pedestrian set and the out-of-view pedestrian set, an entry node and an exit node are newly added, and the directed edges connecting the newly added nodes with the source and the sink are updated;
step S32: according to the pairwise pedestrian appearance suitability between targets in the in-view pedestrian set and the out-of-view pedestrian set, the directed edges between the corresponding nodes are updated to obtain the new cost flow graph G_(t_p) at time t_p;
step S33: all aligned target nodes are deleted, the unaligned target nodes in the in-view pedestrian set remain and are treated as newly entered target pedestrians, yielding the cost flow graph G'_(t_p) that waits to be updated and aligned at the next time instant;
in step S4, the pedestrian either keeps the original identity or is given a new identity, and is then handed to a tracker for continued tracking; the tracker uses the KCF algorithm for tracking; the KCF algorithm allocates one tracker to each pedestrian;
the tracking visual field area of the camera is divided into a core area and a critical area, and in step S1, only pedestrians in the critical area are detected by the detector.
CN201911189515.XA 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene Active CN110969112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911189515.XA CN110969112B (en) 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911189515.XA CN110969112B (en) 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene

Publications (2)

Publication Number Publication Date
CN110969112A CN110969112A (en) 2020-04-07
CN110969112B true CN110969112B (en) 2022-08-16

Family

ID=70031971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911189515.XA Active CN110969112B (en) 2019-11-28 2019-11-28 Pedestrian identity alignment method under camera-crossing scene

Country Status (1)

Country Link
CN (1) CN110969112B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969571A (en) * 2019-11-29 2020-04-07 福州大学 Method and system for specified self-adaptive illumination migration in camera-crossing scene
CN112287868B (en) * 2020-11-10 2021-07-13 上海依图网络科技有限公司 Human body action recognition method and device
CN112950954B (en) * 2021-02-24 2022-05-20 电子科技大学 Intelligent parking license plate recognition method based on high-position camera
CN113947782B (en) * 2021-10-14 2024-06-07 哈尔滨工程大学 Pedestrian target alignment method based on attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056581A (en) * 2016-05-23 2016-10-26 北京航空航天大学 Method of extracting infrared pedestrian object by utilizing improved fuzzy clustering algorithm
CN108198200A (en) * 2018-01-26 2018-06-22 福州大学 The online tracking of pedestrian is specified under across camera scene
CN108257158A (en) * 2018-03-27 2018-07-06 福州大学 A kind of target prediction and tracking based on Recognition with Recurrent Neural Network
CN110428448A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 Target detection tracking method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8842881B2 (en) * 2012-04-26 2014-09-23 General Electric Company Real-time video tracking system
PL3209033T3 (en) * 2016-02-19 2020-08-10 Nokia Technologies Oy Controlling audio rendering

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056581A (en) * 2016-05-23 2016-10-26 北京航空航天大学 Method of extracting infrared pedestrian object by utilizing improved fuzzy clustering algorithm
CN108198200A (en) * 2018-01-26 2018-06-22 福州大学 The online tracking of pedestrian is specified under across camera scene
CN108257158A (en) * 2018-03-27 2018-07-06 福州大学 A kind of target prediction and tracking based on Recognition with Recurrent Neural Network
CN110428448A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 Target detection tracking method, device, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An Equalized Global Graph Model-Based Approach for Multicamera Object Tracking; W. Chen et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2016-07-11; full text *
余春艳 et al. Discriminative feature learning model for instant pedestrian alignment across cameras. Journal of Computer-Aided Design & Computer Graphics. 2019, Vol. 31(4), pp. 602-611. *
Cross-camera pedestrian tracking based on deep learning and spatio-temporal constraints; 夏天 et al.; Computer & Digital Engineering; 2017-11-30; Vol. 45(11); full text *
Video pedestrian re-identification method combining multi-level deep feature representation and ordered weighted distance fusion; 孙锐 et al.; Acta Optica Sinica; 2019-09-30; Vol. 39(9); full text *
Discriminative feature learning model for instant pedestrian alignment across cameras; 余春艳 et al.; Journal of Computer-Aided Design & Computer Graphics; 2019-04-15; Vol. 31(4); Sections 1-4 *

Also Published As

Publication number Publication date
CN110969112A (en) 2020-04-07

Similar Documents

Publication Publication Date Title
CN110969112B (en) Pedestrian identity alignment method under camera-crossing scene
CN108198200B (en) Method for tracking specified pedestrian on line under cross-camera scene
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
Yin et al. Recurrent convolutional network for video-based smoke detection
CN111666843B (en) Pedestrian re-recognition method based on global feature and local feature splicing
CN106778604B (en) Pedestrian re-identification method based on matching convolutional neural network
CN109101915B (en) Face, pedestrian and attribute recognition network structure design method based on deep learning
CN109325546B (en) Step-by-step footprint identification method combining features of step method
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN111723645A (en) Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN109214263A (en) A kind of face identification method based on feature multiplexing
CN111860291A (en) Multi-mode pedestrian identity recognition method and system based on pedestrian appearance and gait information
CN110765839B (en) Multi-channel information fusion and artificial intelligence emotion monitoring method for visible light facial image
CN109902585A (en) A kind of three modality fusion recognition methods of finger based on graph model
CN108229435B (en) Method for pedestrian recognition
CN111259837B (en) Pedestrian re-identification method and system based on part attention
CN111401113A (en) Pedestrian re-identification method based on human body posture estimation
CN111241963A (en) First-person visual angle video interactive behavior identification method based on interactive modeling
CN113269099B (en) Vehicle re-identification method under heterogeneous unmanned system based on graph matching
CN114463340A (en) Edge information guided agile remote sensing image semantic segmentation method
Nguyen et al. Attention-based shape and gait representations learning for video-based cloth-changing person re-identification
CN111160115A (en) Video pedestrian re-identification method based on twin double-flow 3D convolutional neural network
Nguyen et al. How feature fusion can help to improve multi-shot person re-identification performance?
Nguyen et al. Robust person re-identification through the combination of metric learning and late fusion techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant