CN114694089A - Novel multi-mode fusion pedestrian re-recognition algorithm - Google Patents

Novel multi-mode fusion pedestrian re-recognition algorithm

Info

Publication number
CN114694089A
CN114694089A (application CN202210190938.9A)
Authority
CN
China
Prior art keywords
pedestrian
mode
modal
image
modes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210190938.9A
Other languages
Chinese (zh)
Inventor
王晓嫚
崔前进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202210190938.9A
Publication of CN114694089A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a novel multi-modal fusion pedestrian re-recognition algorithm, which comprises: acquiring pedestrian images in three modalities (RGB, NI and TI); and inputting the pedestrian images of the three modalities into a pre-trained multi-modal fusion pedestrian re-recognition network to obtain a prediction classification result. The multi-modal fusion pedestrian re-recognition algorithm fuses the image feature information of the three modalities, and an attention module is introduced into the network branch of each modality, so that image feature information is better extracted, the different modalities learn towards a common direction, noise is effectively suppressed, the differences between the modalities are reduced, and information complementation between the modalities is realized. Through the cross-modal hard sample triplet loss, samples of different pedestrians can be pulled apart across modalities and samples of the same pedestrian can be pulled closer across modalities, effectively achieving cross-modal feature clustering. The method has broad application prospects in the field of pedestrian re-identification.

Description

Novel multi-mode fusion pedestrian re-recognition algorithm
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a novel multi-mode fusion pedestrian re-recognition algorithm.
Background
Pedestrian re-identification (Re-ID) is an important image recognition technology. It aims to retrieve a specific pedestrian from a massive image or video library across non-overlapping cameras, using the pedestrian's visual appearance and motion characteristics. The identities of pedestrians captured by different networked cameras are associated so that the motion trajectory of a specific pedestrian can be obtained in time. Pedestrian re-identification is an important foundation of intelligent video analysis and has gradually attracted close attention from researchers in the field of computer vision.
With the wide application of deep learning in computer vision, pedestrian re-identification based on deep learning has become the mainstream approach, and its performance far exceeds that of schemes based on traditional machine learning. Because pedestrian images captured by infrared cameras, depth cameras and the like are also very common in real scenes, cross-modal pedestrian re-identification has been proposed to retrieve images matching the same pedestrian from image libraries spanning multiple modalities. Effectively solving the cross-modal pedestrian re-identification problem is therefore of great significance for public safety, crime prevention, criminal investigation and related applications.
In recent years, a great deal of research work and many related frameworks for cross-modal pedestrian re-identification have emerged. However, there are still gaps in its practical application. The main difficulties and challenges at present are:
1) There are large differences between images captured in different modalities. For example, an RGB image has three channels containing visible-light color information (red, green and blue), whereas an infrared image has only one channel containing the intensity of near-infrared light; from the perspective of imaging principles, their wavelength ranges also differ; and the effects of sharpness and lighting conditions on the two types of images can be quite different.
2) Intra-modal differences also exist, as in conventional pedestrian re-identification. Because the monitoring environment changes, pedestrian videos and images are captured at different places and times, so the viewing angle, illumination, pose and so on vary, and the feature information of the same pedestrian can deviate greatly, which negatively affects effective identification.
Therefore, the key research questions in the field of cross-modal pedestrian re-identification are how to effectively reduce the differences between modal images and how to learn robust features shared between the modalities, so that information from the modalities can complement each other and network performance can be improved.
Disclosure of Invention
In order to solve the above problems, a novel multi-modal fusion pedestrian re-identification algorithm is provided.
The object of the invention is achieved in the following way:
a novel multi-mode fused pedestrian re-identification algorithm comprises,
acquiring pedestrian images of three modes including RGB, NI and TI;
inputting the pedestrian images of the three modes into a pre-trained multi-mode fusion pedestrian re-recognition network to obtain a prediction classification result;
wherein the multi-modal pedestrian re-recognition network is configured as follows: it comprises three branches that capture the features of the person image in each modality, yielding image features χ^RGB, χ^NI and χ^TI that represent the RGB, NI and TI modalities respectively;
the image features of the RGB, NI and TI modalities are horizontally split into p blocks, and after global average pooling (GAP) p partial column-vector features g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality;
the p partial column-vector features of each of the three modalities are then fed into classifiers composed of a fully connected (FC) layer and a softmax function, yielding the identity prediction vector of the input pedestrian image for each modality;
the p prediction vectors of each modality are first concatenated to generate a feature vector for that modality; the three per-modality feature vectors are then concatenated to obtain the fused feature vector X_{1×3Kp}, and this feature is passed through a classification layer to obtain the prediction classification result.
Further, the training process of the multi-modal pedestrian re-recognition network comprises the following steps:
S1, initialize the network layer weights, typically with random initialization;
S2, single-modality feature extraction: for each pedestrian image, extract features for the three modalities RGB, near infrared (NI) and thermal infrared (TI) separately; feed the RGB, NI and TI inputs through the forward propagation of each layer (convolution layer, normalization layer, average pooling layer, etc.) of a ResNet50 convolutional neural network containing the SCA attention module, obtaining image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
S3, single-modality image feature processing: first, for each modality, in order to obtain local information of the person image, a part-based scheme is adopted: each tensor χ is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; then each partial feature vector g_i of each modality is fed into a classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity prediction vector of the input pedestrian image for that modality; then the difference between each modality's pedestrian ID prediction vector and the real label is computed, and the sum of the cross-entropies of the p classification layers is used as the single-modality loss function to optimize the network;
S4, set up a virtual branch so that the embedding-layer features of the three modal images are learned jointly, realizing information fusion between the modalities and making the features of the three different modalities learn towards a common virtual mean vector, and exit the virtual branch when the single-modality cross-entropy classification loss is small enough; the difference between each modality's embedding-layer features and the virtual mean is computed, and the cosine distance is used as the loss function to optimize the network;
S5, concatenate the per-modality pedestrian ID prediction vectors obtained in S3 and propagate them forward through a classification layer to obtain the multi-modal fusion feature output value, i.e. the prediction classification result; the classification layer consists of a fully connected layer and a softmax classifier;
S6, compute the KL divergence loss between the multi-modal fusion feature output value and the target value, and the cross-modal hard sample triplet loss;
S7, obtain the final multi-modal fusion global loss: add the cross-entropy loss generated by single-modality feature processing, the cosine-distance loss generated by the multi-modal fusion virtual branch, the KL divergence loss from multi-modal fusion feature processing, and the cross-modal hard sample triplet loss to form the final multi-modal global loss that participates in network training;
S8, propagate the multi-modal fusion global loss backwards through the network, obtaining in turn the back-propagation error of each layer: the classification-layer classifier, the fully connected layer FC, the pooling layer GAP, and the ResNet50 structure with attention;
s9, adjusting all weight coefficients in the network according to the back propagation errors of each layer, namely updating the weight;
s10, selecting new image data again randomly, entering S2, and carrying out network forward propagation to obtain an output value;
S11, iterate repeatedly, and end training when the error between the network output value and the target value (label) falls below a certain threshold or the number of iterations exceeds a certain limit;
and S12, storing the trained network parameters of all layers.
Furthermore, the ResNet50 convolutional neural network containing the attention module comprises five parts: the first part mainly performs convolution, regularization, activation-function and max-pooling computations on the input, while the second to fifth parts introduce residual blocks, i.e. shortcut connections are added to the network so that the original input information can be passed directly to later layers, and each residual block contains three convolution layers; the SCA attention module is added after the second and third parts of ResNet50, and, except for the last residual block of ResNet50, the network parameters of the three modalities are shared, thereby obtaining the image feature information χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively.
further, the S3 specifically includes: step 1: averaging multi-modal image features
A feature column vector fusing the three modalities is obtained by taking the mean of the embedding-layer features of the modalities:
ḡ_i = (g_i^RGB + g_i^NI + g_i^TI) / 3,  i = 1, …, p.
Step 2: compute the loss between each modality's embedding-layer features and the fused feature vector.
The cosine distance is chosen as the loss function, and the sum of the cosine distances over all parts is used as the loss between a modality's features and the fused feature vector, i.e. for modality m ∈ {RGB, NI, TI}:
L_vir^m = Σ_{i=1}^{p} (1 − cos(g_i^m, ḡ_i)).
Step 3: exit the virtual branch.
Given a threshold α, when the classification-layer loss value of every modality falls below the threshold, the virtual branch is exited.
Further, the S6 specifically includes: calculating the difference between the predicted classification result and the real label, wherein the KL divergence is used as a loss function to optimize the network, and the formula is as follows:
L_KL = Σ_{j=1}^{K} y_j log(y_j / ŷ_j),
where y denotes the real label of the pedestrian (shared by the three modal images) and is a K-dimensional vector, y_j denotes the j-th value of y, and ŷ_j denotes the j-th value of the fused prediction vector;
the cross-modal hard sample triplet loss specifically includes:
Step 1: in each training batch, randomly draw P_1 pedestrians in the RGB modality with K_1 images per pedestrian, P_2 pedestrians in the NI modality with K_2 images per pedestrian, and P_3 pedestrians in the TI modality with K_3 images per pedestrian, giving P_1K_1 + P_2K_2 + P_3K_3 sample pictures in total;
Step 2: for a given pedestrian image, across the three modalities, images with the same ID as that pedestrian image are defined as positive sample images, and images with a different ID are defined as negative sample images;
Step 3: the image features are used to represent the pedestrian images in the RGB, NI and TI modalities, the distance between two images is the distance between their image features, and the distance function between images is defined as the Euclidean distance d(x, y) = ‖x − y‖₂;
Step 4: for a fixed anchor picture a, assumed to be an RGB-modality picture, first compute the distance d(a, p_RGB) between the anchor picture and each of its positive sample pictures p_RGB in the RGB modality, and the distance d(a, n_RGB) to each of its negative sample pictures n_RGB;
Step 5: compute the distance d(a, p_NI) between the anchor picture and each positive sample p_NI in the NI modality, and the distance d(a, n_NI) to each negative sample n_NI;
Step 6: compute the distance d(a, p_TI) between the anchor picture and each positive sample p_TI in the TI modality, and the distance d(a, n_TI) to each negative sample n_TI;
Step 7: define the cross-modal hard sample triplet loss.
When the anchor picture is in the RGB modality, its hard sample triplet loss L_T-RGB is built from the hardest positive and hardest negative distances computed in Steps 4–6 for the three modalities, combined with weights α, β and γ and a margin m, where m is a boundary-value parameter, α + β + γ = 1, and α, β, γ ≥ 0;
similarly, when the anchor pictures are in the NI and TI modalities, the hard sample triplet losses L_T-NI and L_T-TI are obtained;
therefore, the cross-modal hard sample triplet loss is:
L_cross-modal = L_T-RGB + L_T-NI + L_T-TI   (12)
Through this loss, positive sample pairs can be pulled closer together and negative sample pairs pushed farther apart across modalities.
Furthermore, the SCA attention module captures the spatial and channel feature information of the image mainly through pooling operations: the input image features first aggregate feature information through average pooling and max pooling respectively; the two pooled features are then concatenated, reduced in dimension by a convolution, and, after normalization and nonlinear activation, the spatial feature is generated; then a max-pooling operation is performed on each channel along the horizontal and vertical directions using H×1 and 1×W pooling kernels respectively, and the attention weights shared by all channels, after activation by a sigmoid function, are fused with the original input to obtain the final output feature.
An electronic device comprises:
a memory for storing a program;
a processor for implementing the above method by executing the program stored in the memory.
A computer-readable storage medium comprising a program executable by a processor to implement the above-described method.
The invention has the following beneficial effects: the multi-modal fusion pedestrian re-recognition algorithm fuses the image feature information of three modalities, and an attention module is introduced into the network branch of each modality, so that image feature information is better extracted, the different modalities learn towards a common direction, noise is effectively suppressed, the differences between the modalities are reduced, and information complementation between the modalities is realized. Through the cross-modal hard sample triplet loss, samples of different pedestrians can be pulled apart across modalities and samples of the same pedestrian can be pulled closer across modalities, effectively achieving cross-modal feature clustering. The method has broad application prospects in the field of pedestrian re-identification.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of the multi-modal fusion pedestrian re-recognition network of the present invention.
Fig. 3 is a diagram of a ResNet50 network architecture.
FIG. 4 is a diagram of an SCA attention module.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same technical meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
A novel multi-mode fused pedestrian re-recognition algorithm,
acquiring pedestrian images of three modes including RGB, Near Infrared (NI) and Thermal Infrared (TI);
inputting the pedestrian images of the three modes into a pre-trained multi-mode fusion pedestrian re-recognition network to obtain a prediction classification result;
wherein the multi-modal pedestrian re-recognition network is configured as follows: it comprises three branches that capture the features of the person image in each modality, yielding image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively; the image features of the RGB, NI and TI modalities are horizontally split into p blocks, and after global average pooling (GAP) p partial column-vector features g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; the p partial column-vector features of the three modalities are then fed into classifiers composed of a fully connected (FC) layer and a softmax function, yielding the identity (ID) prediction vector of the input pedestrian image for each modality; the p prediction vectors of each modality are first concatenated to generate a feature vector for that modality, the three per-modality feature vectors are then concatenated to obtain the fused feature vector X_{1×3Kp}, and this feature is passed through a classifier to obtain the prediction classification result.
As is well known, the multi-modal fusion pedestrian re-identification network must be trained until it converges after it is built, which yields the trained network weights. During inference, the trained weight coefficients are loaded in advance and used to perform the final classification of the input data.
As shown in fig. 1, the training process of the multi-modal pedestrian re-recognition network includes:
S1, initialize the network layer weights, typically with random initialization;
S2, single-modality feature extraction: for each pedestrian image, extract features for the three modalities RGB, near infrared (NI) and thermal infrared (TI) separately; feed the RGB, NI and TI inputs through the forward propagation of each layer (convolution layer, normalization layer, average pooling layer, etc.) of a ResNet50 convolutional neural network containing the SCA attention module, obtaining image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
S3, single-modality image feature processing: first, for each modality, in order to obtain local information of the person image, a part-based scheme is adopted: each tensor χ is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; then each partial feature vector g_i of each modality is fed into a classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity (ID) prediction vector of the input pedestrian image for that modality; then the difference between each modality's pedestrian ID prediction vector and the real label is computed, and the sum of the cross-entropies of the p classification layers is used as the single-modality loss function to optimize the network;
S4, set up a virtual branch so that the embedding-layer features of the three modal images are learned jointly, realizing information fusion between the modalities and making the features of the three different modalities learn towards a common virtual mean vector, and exit the virtual branch when the single-modality cross-entropy classification loss is small enough; the difference between each modality's embedding-layer features and the virtual mean is computed, and the cosine distance is used as the loss function to optimize the network;
S5, concatenate the per-modality pedestrian ID prediction vectors obtained in S3 and propagate them forward through the classification layer to obtain the multi-modal fusion feature output value, i.e. the prediction classification result;
S6, solving KL divergence loss between the multi-modal fusion feature output value and a target value and trans-modal hard sample triplet loss;
s7, obtaining the final multi-mode fusion global loss: adding a cross entropy loss function generated by single-mode feature processing, a Euclidean distance loss function generated by multi-mode fusion virtual branches, a KL divergence loss function related to multi-mode fusion feature processing and a trans-mode hard sample triplet loss function to form a final multi-mode global loss participating in network training;
S8, propagate the multi-modal fusion global loss backwards through the network, obtaining in turn the back-propagation error of each layer: the classification-layer classifier, the fully connected layer FC, the pooling layer GAP, and the layers of the ResNet50 structure with attention (convolution, pooling, normalization and activation layers);
S9, adjusting all weight coefficients in the network according to the back propagation errors of each layer, namely updating the weight;
S10, randomly select new image data again, return to S2, and carry out a network forward pass to obtain the output value;
S11, iterate repeatedly, and end training when the error between the network output value and the target value (label) falls below a certain threshold or the number of iterations exceeds a certain limit;
S12, store the trained parameters of all network layers. A condensed sketch of this training loop follows.
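The following PyTorch-style sketch condenses steps S1–S11 into one loop. It is only illustrative: `model`, `fused_global_loss`, `train_loader`, the optimizer choice and all hyper-parameter values are hypothetical placeholders, not identifiers or settings from the original disclosure.

```python
import torch

# Hypothetical training-loop sketch for steps S1-S11 (all names and values are illustrative only).
def train(model, train_loader, fused_global_loss, epochs=60, lr=3e-4,
          loss_threshold=1e-3, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # S1: weights were randomly initialised inside `model`
    for epoch in range(epochs):                                # S11: stop after a fixed iteration budget
        for rgb, ni, ti, labels in train_loader:               # one mini-batch of three-modality images
            rgb, ni, ti = rgb.to(device), ni.to(device), ti.to(device)
            labels = labels.to(device)
            outputs = model(rgb, ni, ti)                       # S2-S5: forward pass of the three branches + fusion
            loss = fused_global_loss(outputs, labels)          # S6-S7: multi-modal fusion global loss
            optimizer.zero_grad()
            loss.backward()                                    # S8: back-propagate the global loss
            optimizer.step()                                   # S9: update all weight coefficients
            if loss.item() < loss_threshold:                   # S11: early stop when the error is small enough
                return model
    return model                                               # S12: the caller saves model.state_dict()
```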
Specifically, as shown in fig. 2, the multi-modal fused pedestrian re-identification network provided by the present invention has the following structure:
First: single-modality feature extraction
For each pedestrian image, there are three modalities, RGB, Near Infrared (NI) and Thermal Infrared (TI).
First, feature representations of the respective modalities are extracted. To obtain a high-quality feature representation of a single modality, three branches are designed based on a convolutional neural network (the invention chooses ResNet50) to capture the features of the human image in each modality respectively.
The network structure of ResNet50 is shown in FIG. 3. The ResNet50 structure is divided into seven parts: the first part mainly performs convolution, regularization, activation-function and max-pooling computations on the input, while the second, third, fourth and fifth parts introduce residual blocks, i.e. shortcut connections are added to the network so that the original input information can be passed directly to later layers, and each residual block contains three convolution layers. After the convolution computation of the first five parts, a pooling layer converts the result into a feature vector, and finally the classifier operates on this feature vector and outputs the class probabilities.
To better capture image feature information, we propose a new spatial-channel attention module, SCA (Spatial-Channel Attention), whose structure is shown in fig. 4. For the input feature map F ∈ R^{H×W×C}, we first generate a spatial attention feature using the spatial relationships between features. Since pooling along the channel axis can highlight informative regions, we aggregate the channel information of the feature map with two pooling operations (average pooling and max pooling) to generate two two-dimensional feature maps for computing spatial attention. These are then concatenated, reduced in dimension by a convolution, and passed through normalization and a nonlinear activation function to produce the spatial attention feature F_s ∈ R^{H×W}, which encodes the locations to be emphasized or suppressed. The computation is:
F_s = σ(BN(f_{1×1×2}([AvgPool(F); MaxPool(F)])))
where [·; ·] denotes the concatenation operation, f_{1×1×2} is a convolution with a 1×1×2 kernel, BN is batch normalization (BatchNorm), and σ represents the nonlinear activation function.
Then, for channel attention, we use global max pooling to encode the spatial information globally. To enable the attention block to capture long-range spatial interactions with precise positional information, each channel is pooled along the horizontal and vertical coordinates using H×1 and 1×W pooling kernels respectively, generating a pair of direction-aware feature vectors: g_h, with one value per row obtained by pooling along the width, and g_w, with one value per column obtained by pooling along the height. These two transformations allow the attention block to capture long-range dependencies along one spatial direction while preserving precise positional information along the other, which helps the network locate the regions of interest more accurately. After sigmoid activation, g_h and g_w are fused with the original input as attention weights shared by all channels, so the final output of the SCA module is expressed as:
F'_c(i, j) = F_c(i, j) × δ(g_h(i)) × δ(g_w(j))
where δ is the sigmoid activation function, F_c(i, j) is the pixel value at row i, column j of channel c of the original feature map, and F'_c(i, j) is the corresponding pixel value of the output feature map.
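The following PyTorch sketch shows one way the SCA module described above could be realised. It is an interpretation of the text, not the patent's reference implementation: the original pooling equations appear only as images, so it is assumed here that g_h and g_w are obtained by max pooling over the width and height respectively and are shared across channels, that the nonlinearity is a ReLU, and that the spatial map F_s is applied multiplicatively before the directional weights.

```python
import torch
import torch.nn as nn

class SCA(nn.Module):
    """Spatial-Channel Attention sketch (an assumed reading of the SCA description)."""
    def __init__(self):
        super().__init__()
        # spatial branch: concat(avg-pool, max-pool over channels) -> 1x1 conv over 2 maps -> BN -> nonlinearity
        self.conv = nn.Conv2d(2, 1, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(1)
        self.act = nn.ReLU(inplace=True)   # nonlinearity assumed to be ReLU

    def forward(self, x):                        # x: (B, C, H, W)
        # --- spatial attention F_s ---
        avg = x.mean(dim=1, keepdim=True)         # average pooling along the channel axis -> (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)        # max pooling along the channel axis     -> (B, 1, H, W)
        f_s = self.act(self.bn(self.conv(torch.cat([avg, mx], dim=1))))   # (B, 1, H, W)
        x = x * f_s                               # emphasise / suppress spatial locations (assumed combination)
        # --- directional attention weights shared by all channels ---
        g_h, _ = x.max(dim=1, keepdim=True)       # collapse channels first (assumption)
        g_h, _ = g_h.max(dim=3, keepdim=True)     # max over the width  -> one value per row    (B, 1, H, 1)
        g_w, _ = x.max(dim=1, keepdim=True)
        g_w, _ = g_w.max(dim=2, keepdim=True)     # max over the height -> one value per column (B, 1, 1, W)
        # F'_c(i, j) = F_c(i, j) * sigmoid(g_h(i)) * sigmoid(g_w(j))
        return x * torch.sigmoid(g_h) * torch.sigmoid(g_w)
```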
The SCA attention module is added after the second and third parts of ResNet50 respectively, and the images of each modality are then fed into the first five parts of this attention-equipped ResNet50, with the subsequent average pooling (avg pool) and fully connected (fc) layers removed. To reduce the number of network parameters, the network parameters of the three modalities are shared except for the last residual block of ResNet50, resulting in the image feature information χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively.
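A minimal sketch of the three-branch feature extractor, assuming torchvision's ResNet-50 as the backbone and reusing the `SCA` class from the previous sketch. The exact insertion points (here: after `layer1` and `layer2`, read as the "second and third parts") and the sharing of everything except `layer4` follow the description, but the wiring details remain assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiModalBackbone(nn.Module):
    """Three-branch ResNet-50 feature extractor with SCA attention (illustrative sketch).
    Single-channel NI / TI images are assumed to have been replicated to 3 channels beforehand."""
    def __init__(self):
        super().__init__()
        base = resnet50()
        # shared stem + stages, with SCA after the second and third parts (layer1 / layer2)
        self.shared = nn.Sequential(
            base.conv1, base.bn1, base.relu, base.maxpool,   # part 1
            base.layer1, SCA(),                              # part 2 + attention
            base.layer2, SCA(),                              # part 3 + attention
            base.layer3,                                     # part 4
        )
        # the last residual block (part 5) is NOT shared: one copy per modality
        # (instantiating three backbones only to borrow layer4 is wasteful but keeps the sketch short)
        self.layer4 = nn.ModuleDict({m: resnet50().layer4 for m in ("rgb", "ni", "ti")})

    def forward(self, x_rgb, x_ni, x_ti):
        out = {}
        for name, x in (("rgb", x_rgb), ("ni", x_ni), ("ti", x_ti)):
            out[name] = self.layer4[name](self.shared(x))    # χ^RGB, χ^NI, χ^TI: (B, 2048, h, w)
        return out["rgb"], out["ni"], out["ti"]
```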
Second: single-modality feature processing
First, for each modality, in order to obtain local information of the person image, we adopt a part-based scheme: each tensor χ^RGB, χ^NI, χ^TI is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality.
Then each partial feature vector g_i of each modality is fed into its own classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity (ID) prediction vector of the input pedestrian image for that modality. Note that the classifiers of the different modalities do not share parameters.
The difference between the pedestrian ID prediction of each modality and the real label is then computed, and the sum of the cross-entropy losses of the p classification layers is used as the loss function of that single modality to optimize the network:
L_CE^m = − Σ_{i=1}^{p} Σ_{j=1}^{K} y_{i,j} log(ŷ_{i,j}^m),  m ∈ {RGB, NI, TI},
where K is the number of pedestrian identity (ID) classes, ŷ_{i,j}^m is the predicted probability that the i-th local part belongs to the j-th pedestrian ID, and y_{i,j} denotes the corresponding real label.
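The part-based processing of a single modality might look as follows in PyTorch. It is a sketch only: the number of parts, the feature dimension and the number of identities are assumed values, and the way the p stripes are produced (an adaptive pooling to p×1) is a simplification consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleModalityHead(nn.Module):
    """Split the feature map into p horizontal parts, GAP each part,
    and classify each part with its own FC + softmax classifier (sketch)."""
    def __init__(self, in_channels=2048, num_parts=6, num_ids=1000):   # assumed hyper-parameters
        super().__init__()
        self.num_parts = num_parts
        self.gap = nn.AdaptiveAvgPool2d((num_parts, 1))                 # p horizontal stripes -> p column vectors g_i
        self.classifiers = nn.ModuleList(
            [nn.Linear(in_channels, num_ids) for _ in range(num_parts)]
        )

    def forward(self, feat_map):                                        # feat_map: (B, C, H, W), e.g. χ^RGB
        parts = self.gap(feat_map).squeeze(-1)                          # (B, C, p)
        embeddings = [parts[:, :, i] for i in range(self.num_parts)]    # g_1 ... g_p, each (B, C)
        logits = [clf(g) for clf, g in zip(self.classifiers, embeddings)]
        return embeddings, logits

def single_modality_loss(logits, labels):
    """Sum of the cross-entropies of the p classification layers (softmax is folded into cross_entropy)."""
    return sum(F.cross_entropy(l, labels) for l in logits)
```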
Thirdly, the method comprises the following steps: setting up multi-modal fused virtual branches
The image information of the three modes is beneficial to pedestrian re-identification, but each mode contains more noise, the wavelengths, the definition, the illumination conditions and the like of the different modes have obvious differences, and in order to effectively inhibit the noise, reduce the difference between the modes and realize the complementation of the information between the modes, a virtual branch is created, so that the image characteristics of the three modes are learned towards a common direction, and the convergence of the model is accelerated.
A virtual branch is set up so that the embedding-layer features of the three modal images are learned jointly, implemented as follows:
Step 1: average the multi-modal image features.
A feature column vector fusing the three modalities is obtained by taking the mean of the embedding-layer features of the modalities:
ḡ_i = (g_i^RGB + g_i^NI + g_i^TI) / 3,  i = 1, …, p.
Step 2: compute the loss between each modality's embedding-layer features and the fused feature vector.
The cosine distance is chosen as the loss function, and the sum of the cosine distances over all parts is used as the loss between a modality's features and the fused feature vector, i.e. for modality m ∈ {RGB, NI, TI}:
L_vir^m = Σ_{i=1}^{p} (1 − cos(g_i^m, ḡ_i)).
Step 3: exit the virtual branch.
Given a threshold α, when the classification-layer loss value of every modality falls below the threshold, the virtual branch is exited.
Through the virtual branch, information fusion between the modalities is realized and the features of the three different modalities learn towards a common virtual mean vector, which effectively suppresses noise, reduces the inter-modality differences and accelerates convergence. When the classification loss is small enough, the virtual branch is exited, reducing the computational cost.
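A sketch of the virtual branch under the reading given above: the virtual target is the per-part mean of the three modalities' embedding features, the loss is the summed cosine distance, and the branch is dropped once every modality's classification loss falls below the threshold α. Because the original formulas appear only as images, this is an assumed instantiation, and the value of α shown is illustrative.

```python
import torch
import torch.nn.functional as F

def virtual_branch_loss(parts_rgb, parts_ni, parts_ti):
    """parts_*: lists of p tensors of shape (B, C) - the embedding-layer features g_i of each modality.
    Returns the summed cosine-distance loss of every modality to the virtual mean vector."""
    loss = parts_rgb[0].new_zeros(())
    for g_rgb, g_ni, g_ti in zip(parts_rgb, parts_ni, parts_ti):
        g_mean = (g_rgb + g_ni + g_ti) / 3.0                      # virtual mean vector for part i
        for g in (g_rgb, g_ni, g_ti):
            loss = loss + (1.0 - F.cosine_similarity(g, g_mean, dim=1)).mean()
    return loss

def should_exit_virtual_branch(cls_losses, alpha=0.1):
    """Exit once the classification loss of every modality is below the threshold alpha (value assumed)."""
    return all(l < alpha for l in cls_losses)
```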
Fourthly: and (3) multi-modal fusion feature processing:
in order to further fuse the features of each mode and enhance the generalization capability of the network, p features of each mode are connected to generate a feature vector under each mode
Figure BDA0003525038530000134
And
Figure BDA0003525038530000135
then, the three eigenvectors are connected to obtain a fused eigenvector X1×3KpThen, the feature is passed through a classification layer to obtain the prediction classification result
Figure BDA0003525038530000136
I.e. the ID prediction probability for re-recognition, which is finally generated by the classification layer classifierAnd (5) vector quantity. Such as: the vector is [0.01, 0.53, 0.24]It means that the pedestrian ID is 0 pedestrian with a probability of 0.01, 1 pedestrian with a probability of 0.53.
The difference between the predicted classification result and the true label is then calculated, where we use KL (Kullback-Leibler) divergence as a loss function to optimize the network, as follows:
Figure BDA0003525038530000141
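The fusion of the per-part ID prediction vectors and the KL-divergence loss could be sketched as follows. The layer sizes and the small-ε smoothing of the one-hot label (needed to keep log(y_j) finite) are assumptions of this sketch, since the original formula is reproduced only as an image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Concatenate the p softmax prediction vectors of the three modalities into X (size 3Kp)
    and classify the fused vector (illustrative sketch)."""
    def __init__(self, num_ids, num_parts):
        super().__init__()
        self.fc = nn.Linear(3 * num_ids * num_parts, num_ids)

    def forward(self, logits_rgb, logits_ni, logits_ti):          # each argument: list of p tensors (B, K)
        preds = [F.softmax(l, dim=1) for l in logits_rgb + logits_ni + logits_ti]
        fused = torch.cat(preds, dim=1)                           # X of size (B, 3Kp)
        return self.fc(fused)                                     # fused ID logits

def kl_fusion_loss(fused_logits, labels, eps=1e-6):
    """KL(y || ŷ) between the (smoothed one-hot) true-label distribution y and the fused prediction ŷ."""
    num_ids = fused_logits.size(1)
    y = F.one_hot(labels, num_ids).float().clamp(min=eps)
    y = y / y.sum(dim=1, keepdim=True)                            # renormalise after clamping
    log_pred = F.log_softmax(fused_logits, dim=1)
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_pred, y, reduction="batchmean")
```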
Fifth: computing the cross-modal hard sample triplet loss function
In order to cluster pedestrian image samples across modalities, so that samples of different pedestrians are pulled apart and samples of the same pedestrian are pulled closer across modalities, we design a novel cross-modal hard sample triplet loss (Triplet loss) function.
The input of the conventional triplet loss function is three pictures: a fixed anchor picture a (anchor), a positive sample picture p (positive) and a negative sample picture n (negative). Picture a and picture p form a positive sample pair, and picture a and picture n form a negative sample pair. The corresponding triplet loss function is expressed as:
L_t = (d_{a,p} − d_{a,n} + α)_+   (6)
where (z)_+ denotes max(z, 0), d_{a,p} is the distance between the positive sample pair, d_{a,n} is the distance between the negative sample pair, and α is a margin parameter set according to actual needs.
The triplet loss function pulls positive sample pairs closer and pushes negative sample pairs apart, so that image features with the same label are clustered in the feature space.
To enhance the generalization ability of the network and let it learn better features, for each fixed anchor picture a, the farthest positive sample picture and the closest negative sample picture within a training batch are selected to train the network; this is called the hard sample triplet loss. The formula is:
L_th = (max d_{a,p} − min d_{a,n} + α)_+   (7)
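Formula (7) corresponds to the usual batch-hard triplet loss; a compact PyTorch version is sketched below for reference (the margin value is an assumption).

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """L_th = (max d(a, p) - min d(a, n) + margin)_+ averaged over anchors.
    embeddings: (N, D) features of one batch; labels: (N,) pedestrian IDs."""
    dist = torch.cdist(embeddings, embeddings, p=2)                    # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)               # (N, N) boolean mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values           # farthest positive per anchor
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same_id, inf, dist).min(dim=1).values    # closest negative per anchor
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()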
Accordingly, we propose the cross-modal hard sample triplet loss, implemented as follows:
Step 1: in each training batch, randomly draw P_1 pedestrians in the RGB modality with K_1 images per pedestrian, P_2 pedestrians in the NI modality with K_2 images per pedestrian, and P_3 pedestrians in the TI modality with K_3 images per pedestrian, giving P_1K_1 + P_2K_2 + P_3K_3 sample pictures in total.
Step 2: for a given pedestrian image, across the three modalities, images with the same ID as that pedestrian image are defined as positive sample images, and images with a different ID are defined as negative sample images;
Step 3: the image features are used to represent the pedestrian images in the RGB, NI and TI modalities, the distance between two images is the distance between their image features, and the distance function between images is defined as the Euclidean distance d(x, y) = ‖x − y‖₂;
Step 4: for a fixed anchor picture a, assumed to be an RGB-modality picture, first compute the distance d(a, p_RGB) between the anchor picture and each of its positive sample pictures p_RGB in the RGB modality, and the distance d(a, n_RGB) to each of its negative sample pictures n_RGB;
Step 5: compute the distance d(a, p_NI) between the anchor picture and each positive sample p_NI in the NI modality, and the distance d(a, n_NI) to each negative sample n_NI;
Step 6: compute the distance d(a, p_TI) between the anchor picture and each positive sample p_TI in the TI modality, and the distance d(a, n_TI) to each negative sample n_TI;
Step 7: define the cross-modal hard sample triplet loss.
When the anchor picture is in the RGB modality, its hard sample triplet loss L_T-RGB is built from the hardest positive and hardest negative distances computed in Steps 4–6 for the three modalities, combined with weights α, β and γ and a margin m, where m is a boundary-value parameter, α + β + γ = 1, and α, β, γ ≥ 0.
Similarly, when the anchor pictures are in the NI and TI modalities, the hard sample triplet losses L_T-NI and L_T-TI are obtained.
Therefore, our cross-modal hard sample triplet loss is:
L_cross-modal = L_T-RGB + L_T-NI + L_T-TI   (12)
Through this loss, positive sample pairs can be pulled closer together and negative sample pairs pushed farther apart across modalities.
Sixth: computing a multi-modal fusion global loss function
The cross-entropy loss generated by single-modality feature processing, the cosine-distance loss generated by the multi-modal fusion virtual branch, the KL divergence loss from multi-modal fusion feature processing, and the cross-modal hard sample triplet loss are added together as the final multi-modal global loss that participates in network training:
L_global = L_CE + L_vir + L_KL + L_cross-modal,
where L_CE is the total single-modality cross-entropy loss, L_vir is the virtual-branch cosine-distance loss, L_KL is the KL divergence loss of the fused prediction, and L_cross-modal is the cross-modal hard sample triplet loss.
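Putting the pieces together, the global loss is then simply the sum of the four terms. Whether the terms are re-weighted relative to each other is not specified in the text, so the unweighted sum below is an assumption.

```python
def multi_modal_global_loss(single_losses, virtual_loss, kl_loss, cross_modal_loss):
    """single_losses: iterable of the per-modality cross-entropy losses (RGB, NI, TI).
    Returns L_global = sum of single-modality CE + virtual-branch loss + KL loss + cross-modal triplet loss.
    After the virtual branch has been exited, virtual_loss can simply be passed in as 0."""
    return sum(single_losses) + virtual_loss + kl_loss + cross_modal_loss
```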
Inference process:
1) Remove the virtual branch and keep only the backbone network.
2) Load the pre-trained weights, then classify the pedestrian image or extract its image features.
In summary, the trained network model (with all trained layer parameters stored; at this point the model is fixed and parameters are no longer updated by back-propagation) is run on new data. The optimal model parameters obtained in the training phase are loaded, and the network model is put directly into use in the inference phase: the new data are passed through the network model in a single forward pass to obtain the result, without back-propagation or parameter updates.
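An inference-time sketch under the same assumed interfaces: the virtual branch contributes only a loss during training, so "removing" it here simply means not computing that loss; the file name and device are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def infer(model, rgb, ni, ti, weights_path="multimodal_reid.pth", device="cuda"):
    """Load the trained parameters and run one forward pass (no back-propagation, no parameter update)."""
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.to(device).eval()
    logits = model(rgb.to(device), ni.to(device), ti.to(device))   # fused ID prediction
    return logits.argmax(dim=1)                                     # predicted pedestrian IDs
```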
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the overall concept of the present invention, and these should also be considered to fall within the protection scope of the present invention.

Claims (8)

1. A novel multi-modal fusion pedestrian re-recognition algorithm, characterized in that it comprises:
acquiring pedestrian images in three modalities: RGB, NI and TI;
inputting the pedestrian images of the three modalities into a pre-trained multi-modal fusion pedestrian re-recognition network to obtain a prediction classification result;
wherein the multi-modal pedestrian re-recognition network is configured as follows: it comprises three branches that capture the features of the person image in each modality, yielding image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
the image features of the RGB, NI and TI modalities are horizontally split into p blocks, and after global average pooling (GAP) p partial column-vector features g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality;
the p partial column-vector features of each of the three modalities are then fed into classifiers composed of a fully connected (FC) layer and a softmax function, yielding the identity prediction vector of the input pedestrian image for each modality;
the p prediction vectors of each modality are first concatenated to generate a feature vector for that modality; the three per-modality feature vectors are then concatenated to obtain the fused feature vector X_{1×3Kp}, and this feature is passed through a classification layer to obtain the prediction classification result.
2. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 1, characterized in that the training process of the multi-modal pedestrian re-recognition network comprises the following steps:
S1, initialize the network layer weights, typically with random initialization;
S2, single-modality feature extraction: for each pedestrian image, extract features for the three modalities RGB, near infrared (NI) and thermal infrared (TI) separately; feed the RGB, NI and TI inputs through the forward propagation of each layer (convolution layer, normalization layer, average pooling layer, etc.) of a ResNet50 convolutional neural network containing the SCA attention module, obtaining image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
S3, single-modality image feature processing: first, for each modality, in order to obtain local information of the person image, a part-based scheme is adopted: each tensor χ is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; then each partial feature vector g_i of each modality is fed into a classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity prediction vector of the input pedestrian image for that modality; then the difference between each modality's pedestrian ID prediction vector and the real label is computed, and the sum of the cross-entropies of the p classification layers is used as the single-modality loss function to optimize the network;
S4, set up a virtual branch so that the embedding-layer features of the three modal images are learned jointly, realizing information fusion between the modalities and making the features of the three different modalities learn towards a common virtual mean vector, and exit the virtual branch when the single-modality cross-entropy classification loss is small enough; the difference between each modality's embedding-layer features and the virtual mean is computed, and the cosine distance is used as the loss function to optimize the network;
S5, concatenate the per-modality pedestrian ID prediction vectors obtained in S3 and propagate them forward through a classification layer to obtain the multi-modal fusion feature output value, i.e. the prediction classification result; the classification layer consists of a fully connected layer and a softmax classifier;
S6, compute the KL divergence loss between the multi-modal fusion feature output value and the target value, and the cross-modal hard sample triplet loss;
S7, obtain the final multi-modal fusion global loss: add the cross-entropy loss generated by single-modality feature processing, the cosine-distance loss generated by the multi-modal fusion virtual branch, the KL divergence loss from multi-modal fusion feature processing, and the cross-modal hard sample triplet loss to form the final multi-modal global loss that participates in network training;
S8, propagate the multi-modal fusion global loss backwards through the network, obtaining in turn the back-propagation error of each layer: the classification-layer classifier, the fully connected layer FC, the pooling layer GAP, and the ResNet50 structure with attention;
S9, adjust all weight coefficients in the network according to the back-propagation error of each layer, i.e. update the weights;
S10, randomly select new image data again, return to S2, and carry out a network forward pass to obtain the output value;
S11, iterate repeatedly, and end training when the error between the network output value and the target value (label) falls below a certain threshold or the number of iterations exceeds a certain limit;
S12, store the trained parameters of all network layers.
3. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that the ResNet50 convolutional neural network structure containing the attention module comprises five parts: the first part mainly performs convolution, regularization, activation-function and max-pooling computations on the input, while the second to fifth parts introduce residual blocks, i.e. shortcut connections are added to the network so that the original input information can be passed directly to later layers, and each residual block contains three convolution layers; the SCA attention module is added after the second and third parts of ResNet50, and, except for the last residual block of ResNet50, the network parameters of the three modalities are shared, thereby obtaining the image feature information χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively.
4. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that S4 specifically includes:
Step 1: average the multi-modal image features.
A feature column vector fusing the three modalities is obtained by taking the mean of the embedding-layer features of the modalities:
ḡ_i = (g_i^RGB + g_i^NI + g_i^TI) / 3,  i = 1, …, p.
Step 2: compute the loss between each modality's embedding-layer features and the fused feature vector.
The cosine distance is chosen as the loss function, and the sum of the cosine distances over all parts is used as the loss between a modality's features and the fused feature vector, i.e. for modality m ∈ {RGB, NI, TI}:
L_vir^m = Σ_{i=1}^{p} (1 − cos(g_i^m, ḡ_i)).
Step 3: exit the virtual branch.
Given a threshold α, when the classification-layer loss value of every modality falls below the threshold, the virtual branch is exited.
5. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that S6 specifically includes: computing the difference between the predicted classification result and the real label, where the KL divergence is used as the loss function to optimize the network:
L_KL = Σ_{j=1}^{K} y_j log(y_j / ŷ_j),
where y denotes the real label of the pedestrian (shared by the three modal images) and is a K-dimensional vector, y_j denotes the j-th value of y, and ŷ_j denotes the j-th value of the fused prediction vector;
the cross-modal hard sample triplet loss specifically includes:
Step 1: in each training batch, randomly draw P_1 pedestrians in the RGB modality with K_1 images per pedestrian, P_2 pedestrians in the NI modality with K_2 images per pedestrian, and P_3 pedestrians in the TI modality with K_3 images per pedestrian, giving P_1K_1 + P_2K_2 + P_3K_3 sample pictures in total;
Step 2: for a given pedestrian image, across the three modalities, images with the same ID as that pedestrian image are defined as positive sample images, and images with a different ID are defined as negative sample images;
Step 3: the image features are used to represent the pedestrian images in the RGB, NI and TI modalities, the distance between two images is the distance between their image features, and the distance function between images is defined as the Euclidean distance d(x, y) = ‖x − y‖₂;
Step 4: for a fixed anchor picture a, assumed to be an RGB-modality picture, first compute the distance d(a, p_RGB) between the anchor picture and each of its positive sample pictures p_RGB in the RGB modality, and the distance d(a, n_RGB) to each of its negative sample pictures n_RGB;
Step 5: compute the distance d(a, p_NI) between the anchor picture and each positive sample p_NI in the NI modality, and the distance d(a, n_NI) to each negative sample n_NI;
Step 6: compute the distance d(a, p_TI) between the anchor picture and each positive sample p_TI in the TI modality, and the distance d(a, n_TI) to each negative sample n_TI;
Step 7: define the cross-modal hard sample triplet loss.
When the anchor picture is in the RGB modality, its hard sample triplet loss L_T-RGB is built from the hardest positive and hardest negative distances computed in Steps 4–6 for the three modalities, combined with weights α, β and γ and a margin m, where m is a boundary-value parameter, α + β + γ = 1, and α, β, γ ≥ 0;
similarly, when the anchor pictures are in the NI and TI modalities, the hard sample triplet losses L_T-NI and L_T-TI are obtained;
therefore, the cross-modal hard sample triplet loss is:
L_cross-modal = L_T-RGB + L_T-NI + L_T-TI   (12)
Through this loss, positive sample pairs can be pulled closer together and negative sample pairs pushed farther apart across modalities.
6. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that the SCA attention module captures the spatial and channel feature information of the image mainly through pooling operations: the input image features first aggregate feature information through average pooling and max pooling respectively; the two pooled features are then concatenated, reduced in dimension by a convolution, and, after normalization and nonlinear activation, the spatial feature is generated; then a max-pooling operation is performed on each channel along the horizontal and vertical directions using H×1 and 1×W pooling kernels respectively, and the attention weights shared by all channels, after activation by a sigmoid function, are fused with the original input to obtain the final output feature.
7. An electronic device, characterized in that it comprises:
a memory for storing a program;
a processor for implementing the method of any one of claims 1-6 by executing a program stored by the memory.
8. A computer-readable storage medium characterized by: comprising a program executable by a processor to implement the method of any one of claims 1-6.
CN202210190938.9A 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm Pending CN114694089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210190938.9A CN114694089A (en) 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210190938.9A CN114694089A (en) 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm

Publications (1)

Publication Number Publication Date
CN114694089A true CN114694089A (en) 2022-07-01

Family

ID=82136592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210190938.9A Pending CN114694089A (en) 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm

Country Status (1)

Country Link
CN (1) CN114694089A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524542A (en) * 2023-05-08 2023-08-01 杭州像素元科技有限公司 Cross-modal pedestrian re-identification method and device based on fine granularity characteristics
CN116524542B (en) * 2023-05-08 2023-10-31 杭州像素元科技有限公司 Cross-modal pedestrian re-identification method and device based on fine granularity characteristics
CN116503914A (en) * 2023-06-27 2023-07-28 华东交通大学 Pedestrian re-recognition method, system, readable storage medium and computer equipment
CN116503914B (en) * 2023-06-27 2023-09-01 华东交通大学 Pedestrian re-recognition method, system, readable storage medium and computer equipment
CN117218453A (en) * 2023-11-06 2023-12-12 中国科学院大学 Incomplete multi-mode medical image learning method
CN117218453B (en) * 2023-11-06 2024-01-16 中国科学院大学 Incomplete multi-mode medical image learning method

Similar Documents

Publication Publication Date Title
Cheng et al. Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion
US11704907B2 (en) Depth-based object re-identification
Fu et al. Deep ordinal regression network for monocular depth estimation
CN109977757B (en) Multi-modal head posture estimation method based on mixed depth regression network
Miksik et al. Efficient temporal consistency for streaming video scene analysis
CN114694089A (en) Novel multi-mode fusion pedestrian re-recognition algorithm
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN112906545B (en) Real-time action recognition method and system for multi-person scene
CN107203745B (en) Cross-visual angle action identification method based on cross-domain learning
JP7228961B2 (en) Neural network learning device and its control method
CN112434654B (en) Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN116309725A (en) Multi-target tracking method based on multi-scale deformable attention mechanism
Wang et al. Pm-gans: Discriminative representation learning for action recognition using partial-modalities
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN112906493A (en) Cross-modal pedestrian re-identification method based on cross-correlation attention mechanism
Hu et al. LDF-Net: Learning a displacement field network for face recognition across pose
CN114743162A (en) Cross-modal pedestrian re-identification method based on generation of countermeasure network
CN114882537A (en) Finger new visual angle image generation method based on nerve radiation field
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
Pini et al. Learning to generate facial depth maps
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.
CN117133041A (en) Three-dimensional reconstruction network face recognition method, system, equipment and medium based on deep learning
CN113449550A (en) Human body weight recognition data processing method, human body weight recognition method and device
Zhao et al. Research on human behavior recognition in video based on 3DCCA
Cate et al. Deepface: Face generation using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination