CN114694089A - Novel multi-mode fusion pedestrian re-recognition algorithm - Google Patents

Novel multi-mode fusion pedestrian re-recognition algorithm

Info

Publication number
CN114694089A
CN114694089A (application CN202210190938.9A)
Authority
CN
China
Prior art keywords
pedestrian
mode
modal
image
modes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210190938.9A
Other languages
Chinese (zh)
Inventor
王晓嫚
崔前进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202210190938.9A
Publication of CN114694089A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a novel multi-modal fusion pedestrian re-recognition algorithm, which comprises: acquiring pedestrian images in three modalities (RGB, NI and TI); and inputting the pedestrian images of the three modalities into a pre-trained multi-modal fusion pedestrian re-recognition network to obtain a prediction classification result. The multi-modal fusion pedestrian re-recognition algorithm fuses the image feature information of the three modalities, and an attention module is introduced into the network branch of each modality, so that image feature information is better extracted, the different modalities learn towards a common direction, noise is effectively suppressed, the differences between the modalities are reduced, and information complementation between the modalities is realized. Through the cross-modal hard sample triplet loss, samples of different pedestrians can be pulled apart across modalities and samples of the same pedestrian can be pulled closer across modalities, effectively achieving cross-modal feature clustering. The method has broad application prospects in the field of pedestrian re-identification.

Description

Novel multi-mode fusion pedestrian re-recognition algorithm
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a novel multi-mode fusion pedestrian re-recognition algorithm.
Background
Pedestrian re-identification (Re-ID) is an important image recognition technology. It aims to retrieve a specific pedestrian from a massive image or video library across non-overlapping cameras, using the pedestrian's visual appearance and motion characteristics. The identities of pedestrians captured by different networked cameras are associated so that the motion trajectory of a specific pedestrian can be obtained in time. Pedestrian re-identification is an important foundation of intelligent video analysis and has gradually attracted close attention from researchers in the field of computer vision.
With the wide application of deep learning in computer vision, pedestrian re-identification based on deep learning has become the mainstream approach, and its performance far exceeds that of schemes based on traditional machine learning. Because pedestrian images captured by infrared cameras, depth cameras and the like are also very common in real scenes, cross-modal pedestrian re-identification has been proposed to retrieve images matching the same pedestrian from image libraries spanning multiple modalities. Effectively solving the cross-modal pedestrian re-identification problem is therefore of great significance for public safety, crime prevention, criminal investigation and related applications.
In recent years, a great deal of research work and many related frameworks for cross-modal pedestrian re-identification have emerged. However, there are still gaps in its practical application. The main difficulties and challenges at present are:
1) There are large differences between images captured in different modalities. For example, an RGB image has three channels containing visible-light color information (red, green and blue), whereas an infrared image has only one channel containing the intensity of near-infrared light; from the perspective of imaging principles, their wavelength ranges also differ; and the effects of sharpness and lighting conditions on the two types of images can be quite different.
2) Intra-modal differences also exist, as in conventional pedestrian re-identification. Because the monitoring environment changes, pedestrian videos and images are captured at different places and times, so the viewing angle, illumination, pose and so on vary, and the feature information of the same pedestrian can deviate greatly, which negatively affects effective identification.
Therefore, the key research questions in the field of cross-modal pedestrian re-identification are how to effectively reduce the differences between modal images and how to learn robust features shared between the modalities, so that information from the modalities can complement each other and network performance can be improved.
Disclosure of Invention
In order to solve the above problems, a novel multi-modal fusion pedestrian re-identification algorithm is provided.
The object of the invention is achieved in the following way:
a novel multi-mode fused pedestrian re-identification algorithm comprises,
acquiring pedestrian images of three modes including RGB, NI and TI;
inputting the pedestrian images of the three modes into a pre-trained multi-mode fusion pedestrian re-recognition network to obtain a prediction classification result;
wherein the multi-modal pedestrian re-recognition network is configured as follows: it comprises three branches that capture the features of the person image in each modality, yielding image features χ^RGB, χ^NI and χ^TI that represent the RGB, NI and TI modalities respectively;
the image features of the RGB, NI and TI modalities are horizontally split into p blocks, and after global average pooling (GAP) p partial column-vector features g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality;
the p partial column-vector features of each of the three modalities are then fed into classifiers composed of a fully connected (FC) layer and a softmax function, yielding the identity prediction vector of the input pedestrian image for each modality;
the p prediction vectors of each modality are first concatenated to generate a feature vector for that modality; the three per-modality feature vectors are then concatenated to obtain the fused feature vector X_{1×3Kp}, and this feature is passed through a classification layer to obtain the prediction classification result.
Further, the training process of the multi-modal pedestrian re-recognition network comprises the following steps:
S1, initialize the network layer weights, typically with random initialization;
S2, single-modality feature extraction: for each pedestrian image, extract features for the three modalities RGB, near infrared (NI) and thermal infrared (TI) separately; feed the RGB, NI and TI inputs through the forward propagation of each layer (convolution layer, normalization layer, average pooling layer, etc.) of a ResNet50 convolutional neural network containing the SCA attention module, obtaining image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
S3, single-modality image feature processing: first, for each modality, in order to obtain local information of the person image, a part-based scheme is adopted: each tensor χ is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; then each partial feature vector g_i of each modality is fed into a classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity prediction vector of the input pedestrian image for that modality; then the difference between each modality's pedestrian ID prediction vector and the real label is computed, and the sum of the cross-entropies of the p classification layers is used as the single-modality loss function to optimize the network;
S4, set up a virtual branch so that the embedding-layer features of the three modal images are learned jointly, realizing information fusion between the modalities and making the features of the three different modalities learn towards a common virtual mean vector, and exit the virtual branch when the single-modality cross-entropy classification loss is small enough; the difference between each modality's embedding-layer features and the virtual mean is computed, and the cosine distance is used as the loss function to optimize the network;
S5, concatenate the per-modality pedestrian ID prediction vectors obtained in S3 and propagate them forward through a classification layer to obtain the multi-modal fusion feature output value, i.e. the prediction classification result; the classification layer consists of a fully connected layer and a softmax classifier;
S6, compute the KL divergence loss between the multi-modal fusion feature output value and the target value, and the cross-modal hard sample triplet loss;
S7, obtain the final multi-modal fusion global loss: add the cross-entropy loss generated by single-modality feature processing, the cosine-distance loss generated by the multi-modal fusion virtual branch, the KL divergence loss from multi-modal fusion feature processing, and the cross-modal hard sample triplet loss to form the final multi-modal global loss that participates in network training;
S8, propagate the multi-modal fusion global loss backwards through the network, obtaining in turn the back-propagation error of each layer: the classification-layer classifier, the fully connected layer FC, the pooling layer GAP, and the ResNet50 structure with attention;
s9, adjusting all weight coefficients in the network according to the back propagation errors of each layer, namely updating the weight;
s10, selecting new image data again randomly, entering S2, and carrying out network forward propagation to obtain an output value;
S11, iterate repeatedly, and end training when the error between the network output value and the target value (label) falls below a certain threshold or the number of iterations exceeds a certain limit;
and S12, storing the trained network parameters of all layers.
Furthermore, the ResNet50 convolutional neural network containing the attention module comprises five parts: the first part mainly performs convolution, regularization, activation-function and max-pooling computations on the input, while the second to fifth parts introduce residual blocks, i.e. shortcut connections are added to the network so that the original input information can be passed directly to later layers, and each residual block contains three convolution layers; the SCA attention module is added after the second and third parts of ResNet50, and, except for the last residual block of ResNet50, the network parameters of the three modalities are shared, thereby obtaining the image feature information χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively.
further, the S3 specifically includes: step 1: averaging multi-modal image features
A feature column vector fusing the three modalities is obtained by taking the mean of the embedding-layer features of the modalities:
ḡ_i = (g_i^RGB + g_i^NI + g_i^TI) / 3,  i = 1, …, p.
Step 2: compute the loss between each modality's embedding-layer features and the fused feature vector.
The cosine distance is chosen as the loss function, and the sum of the cosine distances over all parts is used as the loss between a modality's features and the fused feature vector, i.e. for modality m ∈ {RGB, NI, TI}:
L_vir^m = Σ_{i=1}^{p} (1 − cos(g_i^m, ḡ_i)).
Step 3: exit the virtual branch.
Given a threshold α, when the classification-layer loss value of every modality falls below the threshold, the virtual branch is exited.
Further, the S6 specifically includes: calculating the difference between the predicted classification result and the real label, wherein the KL divergence is used as a loss function to optimize the network, and the formula is as follows:
L_KL = Σ_{j=1}^{K} y_j log(y_j / ŷ_j),
where y denotes the real label of the pedestrian (shared by the three modal images) and is a K-dimensional vector, y_j denotes the j-th value of y, and ŷ_j denotes the j-th value of the fused prediction vector;
the cross-modal hard sample triplet loss specifically includes:
Step 1: in each training batch, randomly draw P_1 pedestrians in the RGB modality with K_1 images per pedestrian, P_2 pedestrians in the NI modality with K_2 images per pedestrian, and P_3 pedestrians in the TI modality with K_3 images per pedestrian, giving P_1K_1 + P_2K_2 + P_3K_3 sample pictures in total;
Step 2: for a given pedestrian image, across the three modalities, images with the same ID as that pedestrian image are defined as positive sample images, and images with a different ID are defined as negative sample images;
Step 3: the image features are used to represent the pedestrian images in the RGB, NI and TI modalities, the distance between two images is the distance between their image features, and the distance function between images is defined as the Euclidean distance d(x, y) = ‖x − y‖₂;
Step 4: for a fixed anchor picture a, assumed to be an RGB-modality picture, first compute the distance d(a, p_RGB) between the anchor picture and each of its positive sample pictures p_RGB in the RGB modality, and the distance d(a, n_RGB) to each of its negative sample pictures n_RGB;
Step 5: compute the distance d(a, p_NI) between the anchor picture and each positive sample p_NI in the NI modality, and the distance d(a, n_NI) to each negative sample n_NI;
Step 6: compute the distance d(a, p_TI) between the anchor picture and each positive sample p_TI in the TI modality, and the distance d(a, n_TI) to each negative sample n_TI;
Step 7: define the cross-modal hard sample triplet loss.
When the anchor picture is in the RGB modality, its hard sample triplet loss L_T-RGB is built from the hardest positive and hardest negative distances computed in Steps 4–6 for the three modalities, combined with weights α, β and γ and a margin m, where m is a boundary-value parameter, α + β + γ = 1, and α, β, γ ≥ 0;
similarly, when the anchor pictures are in the NI and TI modalities, the hard sample triplet losses L_T-NI and L_T-TI are obtained;
therefore, the cross-modal hard sample triplet loss is:
L_cross-modal = L_T-RGB + L_T-NI + L_T-TI   (12)
Through this loss, positive sample pairs can be pulled closer together and negative sample pairs pushed farther apart across modalities.
Furthermore, the SCA attention module captures the spatial and channel feature information of the image mainly through pooling operations: the input image features first aggregate feature information through average pooling and max pooling respectively; the two pooled features are then concatenated, reduced in dimension by a convolution, and, after normalization and nonlinear activation, the spatial feature is generated; then a max-pooling operation is performed on each channel along the horizontal and vertical directions using H×1 and 1×W pooling kernels respectively, and the attention weights shared by all channels, after activation by a sigmoid function, are fused with the original input to obtain the final output feature.
An electronic device comprises:
a memory for storing a program;
a processor for implementing the above method by executing the program stored in the memory.
A computer-readable storage medium comprising a program executable by a processor to implement the above-described method.
The invention has the following beneficial effects: the multi-modal fusion pedestrian re-recognition algorithm fuses the image feature information of three modalities, and an attention module is introduced into the network branch of each modality, so that image feature information is better extracted, the different modalities learn towards a common direction, noise is effectively suppressed, the differences between the modalities are reduced, and information complementation between the modalities is realized. Through the cross-modal hard sample triplet loss, samples of different pedestrians can be pulled apart across modalities and samples of the same pedestrian can be pulled closer across modalities, effectively achieving cross-modal feature clustering. The method has broad application prospects in the field of pedestrian re-identification.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of the multi-modal fusion pedestrian re-recognition network of the present invention.
Fig. 3 is a diagram of a ResNet50 network architecture.
FIG. 4 is a diagram of an SCA attention module.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same technical meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
A novel multi-mode fused pedestrian re-recognition algorithm,
acquiring pedestrian images of three modes including RGB, Near Infrared (NI) and Thermal Infrared (TI);
inputting the pedestrian images of the three modes into a pre-trained multi-mode fusion pedestrian re-recognition network to obtain a prediction classification result;
wherein the multi-modal pedestrian re-recognition network is configured as follows: it comprises three branches that capture the features of the person image in each modality, yielding image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively; the image features of the RGB, NI and TI modalities are horizontally split into p blocks, and after global average pooling (GAP) p partial column-vector features g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; the p partial column-vector features of the three modalities are then fed into classifiers composed of a fully connected (FC) layer and a softmax function, yielding the identity (ID) prediction vector of the input pedestrian image for each modality; the p prediction vectors of each modality are first concatenated to generate a feature vector for that modality, the three per-modality feature vectors are then concatenated to obtain the fused feature vector X_{1×3Kp}, and this feature is passed through a classifier to obtain the prediction classification result.
As is well known, the multi-modal fusion pedestrian re-identification network must be trained until it converges after it is built, which yields the trained network weights. During inference, the trained weight coefficients are loaded in advance and used to perform the final classification of the input data.
As shown in fig. 1, the training process of the multi-modal pedestrian re-recognition network includes:
S1, initialize the network layer weights, typically with random initialization;
S2, single-modality feature extraction: for each pedestrian image, extract features for the three modalities RGB, near infrared (NI) and thermal infrared (TI) separately; feed the RGB, NI and TI inputs through the forward propagation of each layer (convolution layer, normalization layer, average pooling layer, etc.) of a ResNet50 convolutional neural network containing the SCA attention module, obtaining image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
S3, single-modality image feature processing: first, for each modality, in order to obtain local information of the person image, a part-based scheme is adopted: each tensor χ is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; then each partial feature vector g_i of each modality is fed into a classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity (ID) prediction vector of the input pedestrian image for that modality; then the difference between each modality's pedestrian ID prediction vector and the real label is computed, and the sum of the cross-entropies of the p classification layers is used as the single-modality loss function to optimize the network;
S4, set up a virtual branch so that the embedding-layer features of the three modal images are learned jointly, realizing information fusion between the modalities and making the features of the three different modalities learn towards a common virtual mean vector, and exit the virtual branch when the single-modality cross-entropy classification loss is small enough; the difference between each modality's embedding-layer features and the virtual mean is computed, and the cosine distance is used as the loss function to optimize the network;
S5, concatenate the per-modality pedestrian ID prediction vectors obtained in S3 and propagate them forward through the classification layer to obtain the multi-modal fusion feature output value, i.e. the prediction classification result;
S6, solving KL divergence loss between the multi-modal fusion feature output value and a target value and trans-modal hard sample triplet loss;
s7, obtaining the final multi-mode fusion global loss: adding a cross entropy loss function generated by single-mode feature processing, a Euclidean distance loss function generated by multi-mode fusion virtual branches, a KL divergence loss function related to multi-mode fusion feature processing and a trans-mode hard sample triplet loss function to form a final multi-mode global loss participating in network training;
S8, propagate the multi-modal fusion global loss backwards through the network, obtaining in turn the back-propagation error of each layer: the classification-layer classifier, the fully connected layer FC, the pooling layer GAP, and the layers of the ResNet50 structure with attention (convolution, pooling, normalization and activation layers);
S9, adjusting all weight coefficients in the network according to the back propagation errors of each layer, namely updating the weight;
S10, randomly select new image data again, return to S2, and carry out a network forward pass to obtain the output value;
S11, iterate repeatedly, and end training when the error between the network output value and the target value (label) falls below a certain threshold or the number of iterations exceeds a certain limit;
S12, store the trained parameters of all network layers. A condensed sketch of this training loop follows.
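The following PyTorch-style sketch condenses steps S1–S11 into one loop. It is only illustrative: `model`, `fused_global_loss`, `train_loader`, the optimizer choice and all hyper-parameter values are hypothetical placeholders, not identifiers or settings from the original disclosure.

```python
import torch

# Hypothetical training-loop sketch for steps S1-S11 (all names and values are illustrative only).
def train(model, train_loader, fused_global_loss, epochs=60, lr=3e-4,
          loss_threshold=1e-3, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # S1: weights were randomly initialised inside `model`
    for epoch in range(epochs):                                # S11: stop after a fixed iteration budget
        for rgb, ni, ti, labels in train_loader:               # one mini-batch of three-modality images
            rgb, ni, ti = rgb.to(device), ni.to(device), ti.to(device)
            labels = labels.to(device)
            outputs = model(rgb, ni, ti)                       # S2-S5: forward pass of the three branches + fusion
            loss = fused_global_loss(outputs, labels)          # S6-S7: multi-modal fusion global loss
            optimizer.zero_grad()
            loss.backward()                                    # S8: back-propagate the global loss
            optimizer.step()                                   # S9: update all weight coefficients
            if loss.item() < loss_threshold:                   # S11: early stop when the error is small enough
                return model
    return model                                               # S12: the caller saves model.state_dict()
```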
Specifically, as shown in fig. 2, the multi-modal fused pedestrian re-identification network provided by the present invention has the following structure:
First: single-modality feature extraction
For each pedestrian image, there are three modalities, RGB, Near Infrared (NI) and Thermal Infrared (TI).
First, feature representations of the respective modalities are extracted. To obtain a high-quality feature representation of a single modality, three branches are designed based on a convolutional neural network (the invention chooses ResNet50) to capture the features of the human image in each modality respectively.
The network structure of ResNet50 is shown in FIG. 3. The ResNet50 structure is divided into seven parts: the first part mainly performs convolution, regularization, activation-function and max-pooling computations on the input, while the second, third, fourth and fifth parts introduce residual blocks, i.e. shortcut connections are added to the network so that the original input information can be passed directly to later layers, and each residual block contains three convolution layers. After the convolution computation of the first five parts, a pooling layer converts the result into a feature vector, and finally the classifier operates on this feature vector and outputs the class probabilities.
To better capture image feature information, we propose a new spatial-channel attention module, SCA (Spatial-Channel Attention), whose structure is shown in fig. 4. For the input feature map F ∈ R^{H×W×C}, we first generate a spatial attention feature using the spatial relationships between features. Since pooling along the channel axis can highlight informative regions, we aggregate the channel information of the feature map with two pooling operations (average pooling and max pooling) to generate two two-dimensional feature maps for computing spatial attention. These are then concatenated, reduced in dimension by a convolution, and passed through normalization and a nonlinear activation function to produce the spatial attention feature F_s ∈ R^{H×W}, which encodes the locations to be emphasized or suppressed. The computation is:
F_s = σ(BN(f_{1×1×2}([AvgPool(F); MaxPool(F)])))
where [·; ·] denotes the concatenation operation, f_{1×1×2} is a convolution with a 1×1×2 kernel, BN is batch normalization (BatchNorm), and σ represents the nonlinear activation function.
Then, for channel attention, we use global max pooling to encode the spatial information globally. To enable the attention block to capture long-range spatial interactions with precise positional information, each channel is pooled along the horizontal and vertical coordinates using H×1 and 1×W pooling kernels respectively, generating a pair of direction-aware feature vectors: g_h, with one value per row obtained by pooling along the width, and g_w, with one value per column obtained by pooling along the height. These two transformations allow the attention block to capture long-range dependencies along one spatial direction while preserving precise positional information along the other, which helps the network locate the regions of interest more accurately. After sigmoid activation, g_h and g_w are fused with the original input as attention weights shared by all channels, so the final output of the SCA module is expressed as:
F'_c(i, j) = F_c(i, j) × δ(g_h(i)) × δ(g_w(j))
where δ is the sigmoid activation function, F_c(i, j) is the pixel value at row i, column j of channel c of the original feature map, and F'_c(i, j) is the corresponding pixel value of the output feature map.
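The following PyTorch sketch shows one way the SCA module described above could be realised. It is an interpretation of the text, not the patent's reference implementation: the original pooling equations appear only as images, so it is assumed here that g_h and g_w are obtained by max pooling over the width and height respectively and are shared across channels, that the nonlinearity is a ReLU, and that the spatial map F_s is applied multiplicatively before the directional weights.

```python
import torch
import torch.nn as nn

class SCA(nn.Module):
    """Spatial-Channel Attention sketch (an assumed reading of the SCA description)."""
    def __init__(self):
        super().__init__()
        # spatial branch: concat(avg-pool, max-pool over channels) -> 1x1 conv over 2 maps -> BN -> nonlinearity
        self.conv = nn.Conv2d(2, 1, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(1)
        self.act = nn.ReLU(inplace=True)   # nonlinearity assumed to be ReLU

    def forward(self, x):                        # x: (B, C, H, W)
        # --- spatial attention F_s ---
        avg = x.mean(dim=1, keepdim=True)         # average pooling along the channel axis -> (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)        # max pooling along the channel axis     -> (B, 1, H, W)
        f_s = self.act(self.bn(self.conv(torch.cat([avg, mx], dim=1))))   # (B, 1, H, W)
        x = x * f_s                               # emphasise / suppress spatial locations (assumed combination)
        # --- directional attention weights shared by all channels ---
        g_h, _ = x.max(dim=1, keepdim=True)       # collapse channels first (assumption)
        g_h, _ = g_h.max(dim=3, keepdim=True)     # max over the width  -> one value per row    (B, 1, H, 1)
        g_w, _ = x.max(dim=1, keepdim=True)
        g_w, _ = g_w.max(dim=2, keepdim=True)     # max over the height -> one value per column (B, 1, 1, W)
        # F'_c(i, j) = F_c(i, j) * sigmoid(g_h(i)) * sigmoid(g_w(j))
        return x * torch.sigmoid(g_h) * torch.sigmoid(g_w)
```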
The SCA attention module is added after the second and third parts of ResNet50 respectively, and the images of each modality are then fed into the first five parts of this attention-equipped ResNet50, with the subsequent average pooling (avg pool) and fully connected (fc) layers removed. To reduce the number of network parameters, the network parameters of the three modalities are shared except for the last residual block of ResNet50, resulting in the image feature information χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively.
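A minimal sketch of the three-branch feature extractor, assuming torchvision's ResNet-50 as the backbone and reusing the `SCA` class from the previous sketch. The exact insertion points (here: after `layer1` and `layer2`, read as the "second and third parts") and the sharing of everything except `layer4` follow the description, but the wiring details remain assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiModalBackbone(nn.Module):
    """Three-branch ResNet-50 feature extractor with SCA attention (illustrative sketch).
    Single-channel NI / TI images are assumed to have been replicated to 3 channels beforehand."""
    def __init__(self):
        super().__init__()
        base = resnet50()
        # shared stem + stages, with SCA after the second and third parts (layer1 / layer2)
        self.shared = nn.Sequential(
            base.conv1, base.bn1, base.relu, base.maxpool,   # part 1
            base.layer1, SCA(),                              # part 2 + attention
            base.layer2, SCA(),                              # part 3 + attention
            base.layer3,                                     # part 4
        )
        # the last residual block (part 5) is NOT shared: one copy per modality
        # (instantiating three backbones only to borrow layer4 is wasteful but keeps the sketch short)
        self.layer4 = nn.ModuleDict({m: resnet50().layer4 for m in ("rgb", "ni", "ti")})

    def forward(self, x_rgb, x_ni, x_ti):
        out = {}
        for name, x in (("rgb", x_rgb), ("ni", x_ni), ("ti", x_ti)):
            out[name] = self.layer4[name](self.shared(x))    # χ^RGB, χ^NI, χ^TI: (B, 2048, h, w)
        return out["rgb"], out["ni"], out["ti"]
```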
Second: single-modality feature processing
First, for each modality, in order to obtain local information of the person image, we adopt a part-based scheme: each tensor χ^RGB, χ^NI, χ^TI is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality.
Then each partial feature vector g_i of each modality is fed into its own classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity (ID) prediction vector of the input pedestrian image for that modality. Note that the classifiers of the different modalities do not share parameters.
The difference between the pedestrian ID prediction of each modality and the real label is then computed, and the sum of the cross-entropy losses of the p classification layers is used as the loss function of that single modality to optimize the network:
L_CE^m = − Σ_{i=1}^{p} Σ_{j=1}^{K} y_{i,j} log(ŷ_{i,j}^m),  m ∈ {RGB, NI, TI},
where K is the number of pedestrian identity (ID) classes, ŷ_{i,j}^m is the predicted probability that the i-th local part belongs to the j-th pedestrian ID, and y_{i,j} denotes the corresponding real label.
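The part-based processing of a single modality might look as follows in PyTorch. It is a sketch only: the number of parts, the feature dimension and the number of identities are assumed values, and the way the p stripes are produced (an adaptive pooling to p×1) is a simplification consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleModalityHead(nn.Module):
    """Split the feature map into p horizontal parts, GAP each part,
    and classify each part with its own FC + softmax classifier (sketch)."""
    def __init__(self, in_channels=2048, num_parts=6, num_ids=1000):   # assumed hyper-parameters
        super().__init__()
        self.num_parts = num_parts
        self.gap = nn.AdaptiveAvgPool2d((num_parts, 1))                 # p horizontal stripes -> p column vectors g_i
        self.classifiers = nn.ModuleList(
            [nn.Linear(in_channels, num_ids) for _ in range(num_parts)]
        )

    def forward(self, feat_map):                                        # feat_map: (B, C, H, W), e.g. χ^RGB
        parts = self.gap(feat_map).squeeze(-1)                          # (B, C, p)
        embeddings = [parts[:, :, i] for i in range(self.num_parts)]    # g_1 ... g_p, each (B, C)
        logits = [clf(g) for clf, g in zip(self.classifiers, embeddings)]
        return embeddings, logits

def single_modality_loss(logits, labels):
    """Sum of the cross-entropies of the p classification layers (softmax is folded into cross_entropy)."""
    return sum(F.cross_entropy(l, labels) for l in logits)
```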
Thirdly, the method comprises the following steps: setting up multi-modal fused virtual branches
The image information of the three modes is beneficial to pedestrian re-identification, but each mode contains more noise, the wavelengths, the definition, the illumination conditions and the like of the different modes have obvious differences, and in order to effectively inhibit the noise, reduce the difference between the modes and realize the complementation of the information between the modes, a virtual branch is created, so that the image characteristics of the three modes are learned towards a common direction, and the convergence of the model is accelerated.
A virtual branch is set up so that the embedding-layer features of the three modal images are learned jointly, implemented as follows:
Step 1: average the multi-modal image features.
A feature column vector fusing the three modalities is obtained by taking the mean of the embedding-layer features of the modalities:
ḡ_i = (g_i^RGB + g_i^NI + g_i^TI) / 3,  i = 1, …, p.
Step 2: compute the loss between each modality's embedding-layer features and the fused feature vector.
The cosine distance is chosen as the loss function, and the sum of the cosine distances over all parts is used as the loss between a modality's features and the fused feature vector, i.e. for modality m ∈ {RGB, NI, TI}:
L_vir^m = Σ_{i=1}^{p} (1 − cos(g_i^m, ḡ_i)).
Step 3: exit the virtual branch.
Given a threshold α, when the classification-layer loss value of every modality falls below the threshold, the virtual branch is exited.
Through the virtual branch, information fusion between the modalities is realized and the features of the three different modalities learn towards a common virtual mean vector, which effectively suppresses noise, reduces the inter-modality differences and accelerates convergence. When the classification loss is small enough, the virtual branch is exited, reducing the computational cost.
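A sketch of the virtual branch under the reading given above: the virtual target is the per-part mean of the three modalities' embedding features, the loss is the summed cosine distance, and the branch is dropped once every modality's classification loss falls below the threshold α. Because the original formulas appear only as images, this is an assumed instantiation, and the value of α shown is illustrative.

```python
import torch
import torch.nn.functional as F

def virtual_branch_loss(parts_rgb, parts_ni, parts_ti):
    """parts_*: lists of p tensors of shape (B, C) - the embedding-layer features g_i of each modality.
    Returns the summed cosine-distance loss of every modality to the virtual mean vector."""
    loss = parts_rgb[0].new_zeros(())
    for g_rgb, g_ni, g_ti in zip(parts_rgb, parts_ni, parts_ti):
        g_mean = (g_rgb + g_ni + g_ti) / 3.0                      # virtual mean vector for part i
        for g in (g_rgb, g_ni, g_ti):
            loss = loss + (1.0 - F.cosine_similarity(g, g_mean, dim=1)).mean()
    return loss

def should_exit_virtual_branch(cls_losses, alpha=0.1):
    """Exit once the classification loss of every modality is below the threshold alpha (value assumed)."""
    return all(l < alpha for l in cls_losses)
```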
Fourthly: and (3) multi-modal fusion feature processing:
in order to further fuse the features of each mode and enhance the generalization capability of the network, p features of each mode are connected to generate a feature vector under each mode
Figure BDA0003525038530000134
And
Figure BDA0003525038530000135
then, the three eigenvectors are connected to obtain a fused eigenvector X1×3KpThen, the feature is passed through a classification layer to obtain the prediction classification result
Figure BDA0003525038530000136
I.e. the ID prediction probability for re-recognition, which is finally generated by the classification layer classifierAnd (5) vector quantity. Such as: the vector is [0.01, 0.53, 0.24]It means that the pedestrian ID is 0 pedestrian with a probability of 0.01, 1 pedestrian with a probability of 0.53.
The difference between the predicted classification result and the true label is then calculated, where we use KL (Kullback-Leibler) divergence as a loss function to optimize the network, as follows:
Figure BDA0003525038530000141
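The fusion of the per-part ID prediction vectors and the KL-divergence loss could be sketched as follows. The layer sizes and the small-ε smoothing of the one-hot label (needed to keep log(y_j) finite) are assumptions of this sketch, since the original formula is reproduced only as an image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Concatenate the p softmax prediction vectors of the three modalities into X (size 3Kp)
    and classify the fused vector (illustrative sketch)."""
    def __init__(self, num_ids, num_parts):
        super().__init__()
        self.fc = nn.Linear(3 * num_ids * num_parts, num_ids)

    def forward(self, logits_rgb, logits_ni, logits_ti):          # each argument: list of p tensors (B, K)
        preds = [F.softmax(l, dim=1) for l in logits_rgb + logits_ni + logits_ti]
        fused = torch.cat(preds, dim=1)                           # X of size (B, 3Kp)
        return self.fc(fused)                                     # fused ID logits

def kl_fusion_loss(fused_logits, labels, eps=1e-6):
    """KL(y || ŷ) between the (smoothed one-hot) true-label distribution y and the fused prediction ŷ."""
    num_ids = fused_logits.size(1)
    y = F.one_hot(labels, num_ids).float().clamp(min=eps)
    y = y / y.sum(dim=1, keepdim=True)                            # renormalise after clamping
    log_pred = F.log_softmax(fused_logits, dim=1)
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_pred, y, reduction="batchmean")
```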
Fifth: computing the cross-modal hard sample triplet loss function
In order to cluster pedestrian image samples across modalities, so that samples of different pedestrians are pulled apart and samples of the same pedestrian are pulled closer across modalities, we design a novel cross-modal hard sample triplet loss (Triplet loss) function.
The input of the conventional triplet loss function is three pictures: a fixed anchor picture a (anchor), a positive sample picture p (positive) and a negative sample picture n (negative). Picture a and picture p form a positive sample pair, and picture a and picture n form a negative sample pair. The corresponding triplet loss function is expressed as:
L_t = (d_{a,p} − d_{a,n} + α)_+   (6)
where (z)_+ denotes max(z, 0), d_{a,p} is the distance between the positive sample pair, d_{a,n} is the distance between the negative sample pair, and α is a margin parameter set according to actual needs.
The triplet loss function pulls positive sample pairs closer and pushes negative sample pairs apart, so that image features with the same label are clustered in the feature space.
To enhance the generalization ability of the network and let it learn better features, for each fixed anchor picture a, the farthest positive sample picture and the closest negative sample picture within a training batch are selected to train the network; this is called the hard sample triplet loss. The formula is:
L_th = (max d_{a,p} − min d_{a,n} + α)_+   (7)
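Formula (7) corresponds to the usual batch-hard triplet loss; a compact PyTorch version is sketched below for reference (the margin value is an assumption).

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """L_th = (max d(a, p) - min d(a, n) + margin)_+ averaged over anchors.
    embeddings: (N, D) features of one batch; labels: (N,) pedestrian IDs."""
    dist = torch.cdist(embeddings, embeddings, p=2)                    # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)               # (N, N) boolean mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values           # farthest positive per anchor
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same_id, inf, dist).min(dim=1).values    # closest negative per anchor
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()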
Accordingly, we propose the cross-modal hard sample triplet loss, implemented as follows:
Step 1: in each training batch, randomly draw P_1 pedestrians in the RGB modality with K_1 images per pedestrian, P_2 pedestrians in the NI modality with K_2 images per pedestrian, and P_3 pedestrians in the TI modality with K_3 images per pedestrian, giving P_1K_1 + P_2K_2 + P_3K_3 sample pictures in total.
Step 2: for a given pedestrian image, across the three modalities, images with the same ID as that pedestrian image are defined as positive sample images, and images with a different ID are defined as negative sample images;
Step 3: the image features are used to represent the pedestrian images in the RGB, NI and TI modalities, the distance between two images is the distance between their image features, and the distance function between images is defined as the Euclidean distance d(x, y) = ‖x − y‖₂;
Step 4: for a fixed anchor picture a, assumed to be an RGB-modality picture, first compute the distance d(a, p_RGB) between the anchor picture and each of its positive sample pictures p_RGB in the RGB modality, and the distance d(a, n_RGB) to each of its negative sample pictures n_RGB;
Step 5: compute the distance d(a, p_NI) between the anchor picture and each positive sample p_NI in the NI modality, and the distance d(a, n_NI) to each negative sample n_NI;
Step 6: compute the distance d(a, p_TI) between the anchor picture and each positive sample p_TI in the TI modality, and the distance d(a, n_TI) to each negative sample n_TI;
Step 7: define the cross-modal hard sample triplet loss.
When the anchor picture is in the RGB modality, its hard sample triplet loss L_T-RGB is built from the hardest positive and hardest negative distances computed in Steps 4–6 for the three modalities, combined with weights α, β and γ and a margin m, where m is a boundary-value parameter, α + β + γ = 1, and α, β, γ ≥ 0.
Similarly, when the anchor pictures are in the NI and TI modalities, the hard sample triplet losses L_T-NI and L_T-TI are obtained.
Therefore, our cross-modal hard sample triplet loss is:
L_cross-modal = L_T-RGB + L_T-NI + L_T-TI   (12)
Through this loss, positive sample pairs can be pulled closer together and negative sample pairs pushed farther apart across modalities.
Sixth: computing a multi-modal fusion global loss function
The cross-entropy loss generated by single-modality feature processing, the cosine-distance loss generated by the multi-modal fusion virtual branch, the KL divergence loss from multi-modal fusion feature processing, and the cross-modal hard sample triplet loss are added together as the final multi-modal global loss that participates in network training:
L_global = L_CE + L_vir + L_KL + L_cross-modal,
where L_CE is the total single-modality cross-entropy loss, L_vir is the virtual-branch cosine-distance loss, L_KL is the KL divergence loss of the fused prediction, and L_cross-modal is the cross-modal hard sample triplet loss.
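Putting the pieces together, the global loss is then simply the sum of the four terms. Whether the terms are re-weighted relative to each other is not specified in the text, so the unweighted sum below is an assumption.

```python
def multi_modal_global_loss(single_losses, virtual_loss, kl_loss, cross_modal_loss):
    """single_losses: iterable of the per-modality cross-entropy losses (RGB, NI, TI).
    Returns L_global = sum of single-modality CE + virtual-branch loss + KL loss + cross-modal triplet loss.
    After the virtual branch has been exited, virtual_loss can simply be passed in as 0."""
    return sum(single_losses) + virtual_loss + kl_loss + cross_modal_loss
```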
Inference process:
1) Remove the virtual branch and keep only the backbone network.
2) Load the pre-trained weights, then classify the pedestrian image or extract its image features.
In summary, the trained network model (with all trained layer parameters stored; at this point the model is fixed and parameters are no longer updated by back-propagation) is run on new data. The optimal model parameters obtained in the training phase are loaded, and the network model is put directly into use in the inference phase: the new data are passed through the network model in a single forward pass to obtain the result, without back-propagation or parameter updates.
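An inference-time sketch under the same assumed interfaces: the virtual branch contributes only a loss during training, so "removing" it here simply means not computing that loss; the file name and device are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def infer(model, rgb, ni, ti, weights_path="multimodal_reid.pth", device="cuda"):
    """Load the trained parameters and run one forward pass (no back-propagation, no parameter update)."""
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.to(device).eval()
    logits = model(rgb.to(device), ni.to(device), ti.to(device))   # fused ID prediction
    return logits.argmax(dim=1)                                     # predicted pedestrian IDs
```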
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the overall concept of the present invention, and these should also be considered to fall within the protection scope of the present invention.

Claims (8)

1. A novel multi-modal fusion pedestrian re-recognition algorithm, characterized in that it comprises:
acquiring pedestrian images in three modalities: RGB, NI and TI;
inputting the pedestrian images of the three modalities into a pre-trained multi-modal fusion pedestrian re-recognition network to obtain a prediction classification result;
wherein the multi-modal pedestrian re-recognition network is configured as follows: it comprises three branches that capture the features of the person image in each modality, yielding image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
the image features of the RGB, NI and TI modalities are horizontally split into p blocks, and after global average pooling (GAP) p partial column-vector features g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality;
the p partial column-vector features of each of the three modalities are then fed into classifiers composed of a fully connected (FC) layer and a softmax function, yielding the identity prediction vector of the input pedestrian image for each modality;
the p prediction vectors of each modality are first concatenated to generate a feature vector for that modality; the three per-modality feature vectors are then concatenated to obtain the fused feature vector X_{1×3Kp}, and this feature is passed through a classification layer to obtain the prediction classification result.
2. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 1, characterized in that the training process of the multi-modal pedestrian re-recognition network comprises the following steps:
S1, initialize the network layer weights, typically with random initialization;
S2, single-modality feature extraction: for each pedestrian image, extract features for the three modalities RGB, near infrared (NI) and thermal infrared (TI) separately; feed the RGB, NI and TI inputs through the forward propagation of each layer (convolution layer, normalization layer, average pooling layer, etc.) of a ResNet50 convolutional neural network containing the SCA attention module, obtaining image features χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively;
S3, single-modality image feature processing: first, for each modality, in order to obtain local information of the person image, a part-based scheme is adopted: each tensor χ is horizontally split into p blocks, and after global average pooling (GAP) p partial column vectors g_i^RGB, g_i^NI and g_i^TI (i = 1, …, p) are obtained, i.e. the embedding-layer features of each modality; then each partial feature vector g_i of each modality is fed into a classifier composed of a fully connected (FC) layer and a softmax function, obtaining the identity prediction vector of the input pedestrian image for that modality; then the difference between each modality's pedestrian ID prediction vector and the real label is computed, and the sum of the cross-entropies of the p classification layers is used as the single-modality loss function to optimize the network;
S4, set up a virtual branch so that the embedding-layer features of the three modal images are learned jointly, realizing information fusion between the modalities and making the features of the three different modalities learn towards a common virtual mean vector, and exit the virtual branch when the single-modality cross-entropy classification loss is small enough; the difference between each modality's embedding-layer features and the virtual mean is computed, and the cosine distance is used as the loss function to optimize the network;
S5, concatenate the per-modality pedestrian ID prediction vectors obtained in S3 and propagate them forward through a classification layer to obtain the multi-modal fusion feature output value, i.e. the prediction classification result; the classification layer consists of a fully connected layer and a softmax classifier;
S6, compute the KL divergence loss between the multi-modal fusion feature output value and the target value, and the cross-modal hard sample triplet loss;
S7, obtain the final multi-modal fusion global loss: add the cross-entropy loss generated by single-modality feature processing, the cosine-distance loss generated by the multi-modal fusion virtual branch, the KL divergence loss from multi-modal fusion feature processing, and the cross-modal hard sample triplet loss to form the final multi-modal global loss that participates in network training;
S8, propagate the multi-modal fusion global loss backwards through the network, obtaining in turn the back-propagation error of each layer: the classification-layer classifier, the fully connected layer FC, the pooling layer GAP, and the ResNet50 structure with attention;
S9, adjust all weight coefficients in the network according to the back-propagation error of each layer, i.e. update the weights;
S10, randomly select new image data again, return to S2, and carry out a network forward pass to obtain the output value;
S11, iterate repeatedly, and end training when the error between the network output value and the target value (label) falls below a certain threshold or the number of iterations exceeds a certain limit;
S12, store the trained parameters of all network layers.
3. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that the ResNet50 convolutional neural network structure containing the attention module comprises five parts: the first part mainly performs convolution, regularization, activation-function and max-pooling computations on the input, while the second to fifth parts introduce residual blocks, i.e. shortcut connections are added to the network so that the original input information can be passed directly to later layers, and each residual block contains three convolution layers; the SCA attention module is added after the second and third parts of ResNet50, and, except for the last residual block of ResNet50, the network parameters of the three modalities are shared, thereby obtaining the image feature information χ^RGB, χ^NI and χ^TI representing the RGB, NI and TI modalities respectively.
4. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that S4 specifically includes:
Step 1: average the multi-modal image features.
A feature column vector fusing the three modalities is obtained by taking the mean of the embedding-layer features of the modalities:
ḡ_i = (g_i^RGB + g_i^NI + g_i^TI) / 3,  i = 1, …, p.
Step 2: compute the loss between each modality's embedding-layer features and the fused feature vector.
The cosine distance is chosen as the loss function, and the sum of the cosine distances over all parts is used as the loss between a modality's features and the fused feature vector, i.e. for modality m ∈ {RGB, NI, TI}:
L_vir^m = Σ_{i=1}^{p} (1 − cos(g_i^m, ḡ_i)).
Step 3: exit the virtual branch.
Given a threshold α, when the classification-layer loss value of every modality falls below the threshold, the virtual branch is exited.
5. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that S6 specifically includes: computing the difference between the predicted classification result and the real label, where the KL divergence is used as the loss function to optimize the network:
L_KL = Σ_{j=1}^{K} y_j log(y_j / ŷ_j),
where y denotes the real label of the pedestrian (shared by the three modal images) and is a K-dimensional vector, y_j denotes the j-th value of y, and ŷ_j denotes the j-th value of the fused prediction vector;
the cross-modal hard sample triplet loss specifically includes:
Step 1: in each training batch, randomly draw P_1 pedestrians in the RGB modality with K_1 images per pedestrian, P_2 pedestrians in the NI modality with K_2 images per pedestrian, and P_3 pedestrians in the TI modality with K_3 images per pedestrian, giving P_1K_1 + P_2K_2 + P_3K_3 sample pictures in total;
Step 2: for a given pedestrian image, across the three modalities, images with the same ID as that pedestrian image are defined as positive sample images, and images with a different ID are defined as negative sample images;
Step 3: the image features are used to represent the pedestrian images in the RGB, NI and TI modalities, the distance between two images is the distance between their image features, and the distance function between images is defined as the Euclidean distance d(x, y) = ‖x − y‖₂;
Step 4: for a fixed anchor picture a, assumed to be an RGB-modality picture, first compute the distance d(a, p_RGB) between the anchor picture and each of its positive sample pictures p_RGB in the RGB modality, and the distance d(a, n_RGB) to each of its negative sample pictures n_RGB;
Step 5: compute the distance d(a, p_NI) between the anchor picture and each positive sample p_NI in the NI modality, and the distance d(a, n_NI) to each negative sample n_NI;
Step 6: compute the distance d(a, p_TI) between the anchor picture and each positive sample p_TI in the TI modality, and the distance d(a, n_TI) to each negative sample n_TI;
Step 7: define the cross-modal hard sample triplet loss.
When the anchor picture is in the RGB modality, its hard sample triplet loss L_T-RGB is built from the hardest positive and hardest negative distances computed in Steps 4–6 for the three modalities, combined with weights α, β and γ and a margin m, where m is a boundary-value parameter, α + β + γ = 1, and α, β, γ ≥ 0;
similarly, when the anchor pictures are in the NI and TI modalities, the hard sample triplet losses L_T-NI and L_T-TI are obtained;
therefore, the cross-modal hard sample triplet loss is:
L_cross-modal = L_T-RGB + L_T-NI + L_T-TI   (12)
Through this loss, positive sample pairs can be pulled closer together and negative sample pairs pushed farther apart across modalities.
6. The novel multi-modal fusion pedestrian re-recognition algorithm of claim 2, characterized in that the SCA attention module captures the spatial and channel feature information of the image mainly through pooling operations: the input image features first aggregate feature information through average pooling and max pooling respectively; the two pooled features are then concatenated, reduced in dimension by a convolution, and, after normalization and nonlinear activation, the spatial feature is generated; then a max-pooling operation is performed on each channel along the horizontal and vertical directions using H×1 and 1×W pooling kernels respectively, and the attention weights shared by all channels, after activation by a sigmoid function, are fused with the original input to obtain the final output feature.
7. An electronic device, characterized in that it comprises:
a memory for storing a program;
a processor for implementing the method of any one of claims 1-6 by executing a program stored by the memory.
8. A computer-readable storage medium characterized by: comprising a program executable by a processor to implement the method of any one of claims 1-6.
CN202210190938.9A 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm Pending CN114694089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210190938.9A CN114694089A (en) 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210190938.9A CN114694089A (en) 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm

Publications (1)

Publication Number Publication Date
CN114694089A true CN114694089A (en) 2022-07-01

Family

ID=82136592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210190938.9A Pending CN114694089A (en) 2022-02-28 2022-02-28 Novel multi-mode fusion pedestrian re-recognition algorithm

Country Status (1)

Country Link
CN (1) CN114694089A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524542A (en) * 2023-05-08 2023-08-01 杭州像素元科技有限公司 Cross-modal pedestrian re-identification method and device based on fine granularity characteristics
CN116524542B (en) * 2023-05-08 2023-10-31 杭州像素元科技有限公司 Cross-modal pedestrian re-identification method and device based on fine granularity characteristics
CN116503914A (en) * 2023-06-27 2023-07-28 华东交通大学 Pedestrian re-recognition method, system, readable storage medium and computer equipment
CN116503914B (en) * 2023-06-27 2023-09-01 华东交通大学 Pedestrian re-recognition method, system, readable storage medium and computer equipment
CN117218453A (en) * 2023-11-06 2023-12-12 中国科学院大学 Incomplete multi-mode medical image learning method
CN117218453B (en) * 2023-11-06 2024-01-16 中国科学院大学 Incomplete multi-mode medical image learning method

Similar Documents

Publication Publication Date Title
Cheng et al. Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion
US11704907B2 (en) Depth-based object re-identification
Fu et al. Deep ordinal regression network for monocular depth estimation
CN109977757B (en) Multi-modal head posture estimation method based on mixed depth regression network
Miksik et al. Efficient temporal consistency for streaming video scene analysis
CN114694089A (en) Novel multi-mode fusion pedestrian re-recognition algorithm
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN112906545B (en) Real-time action recognition method and system for multi-person scene
CN107203745B (en) Cross-visual angle action identification method based on cross-domain learning
JP7228961B2 (en) Neural network learning device and its control method
CN112434654B (en) Cross-modal pedestrian re-identification method based on symmetric convolutional neural network
CN116309725A (en) Multi-target tracking method based on multi-scale deformable attention mechanism
Wang et al. Pm-gans: Discriminative representation learning for action recognition using partial-modalities
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN112906493A (en) Cross-modal pedestrian re-identification method based on cross-correlation attention mechanism
Hu et al. LDF-Net: Learning a displacement field network for face recognition across pose
CN114743162A (en) Cross-modal pedestrian re-identification method based on generation of countermeasure network
CN114882537A (en) Finger new visual angle image generation method based on nerve radiation field
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
Pini et al. Learning to generate facial depth maps
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.
CN117133041A (en) Three-dimensional reconstruction network face recognition method, system, equipment and medium based on deep learning
CN113449550A (en) Human body weight recognition data processing method, human body weight recognition method and device
Zhao et al. Research on human behavior recognition in video based on 3DCCA
Cate et al. Deepface: Face generation using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination