CN111368815B - Pedestrian re-identification method based on multi-component self-attention mechanism - Google Patents
- Publication number
- CN111368815B CN111368815B CN202010467045.5A CN202010467045A CN111368815B CN 111368815 B CN111368815 B CN 111368815B CN 202010467045 A CN202010467045 A CN 202010467045A CN 111368815 B CN111368815 B CN 111368815B
- Authority
- CN
- China
- Prior art keywords
- pcpa
- attention
- feature
- self
- pedestrian
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a pedestrian re-identification method based on a multi-component self-attention mechanism, which comprises: pre-training a deep convolutional neural network backbone model; branching the backbone model and constructing a multi-component self-attention network to obtain multi-component self-attention features; inputting the multi-component self-attention features into a classifier and jointly training to minimize cross-entropy loss and metric loss; and finally inputting test-set pictures into the trained model, fusing the output component features to obtain an overall feature, and ranking by the distances between feature vectors to realize pedestrian re-identification. Various challenges in the pedestrian re-identification problem are fully considered: the multi-component self-attention mechanism effectively expands the attention activation region and enriches the pedestrian features; the self-attention modules let the network attend more fully and finely to regions with distinguishing characteristics, and the spatial attention module and the channel attention module are fused into the network in residual form, making the network more robust, stable and easy to train.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and computer vision, and particularly relates to a pedestrian re-identification method based on a multi-component self-attention mechanism.
Background
With the acceleration of urbanization, public safety has become a focus of increasing attention. Monitoring cameras now widely cover important public areas such as university campuses, theme parks, hospitals and streets, creating good objective conditions for automatic monitoring with computer vision technology.
In recent years, pedestrian re-identification has received increasing attention as an important research direction in the field of video monitoring. Specifically, pedestrian re-identification refers to the technology of judging, by computer vision, whether a specific pedestrian is present in an image or video sequence under cross-camera and cross-scene conditions. As an important supplement to face recognition, it can recognize pedestrians by their clothing, posture, hairstyle and other cues, so that pedestrians whose faces cannot be clearly captured can be tracked continuously across cameras in actual monitoring scenes; this enhances the spatio-temporal continuity of the data, saves a large amount of manpower and material resources, and gives the technology important research significance.
In an open environment, because the monitoring scene is complex and changeable, the acquired pedestrian images often suffer from interference factors such as background noise, illumination change, posture change and severe occlusion; existing recognition models cannot attend well to regions with strong discriminability and high distinctiveness, the extracted features are not robust enough, and recognition performance is poor. Therefore, a pedestrian re-identification method capable of accurately extracting strongly discriminative and highly distinctive features is highly desirable.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on a multi-component self-attention mechanism, addressing the defects of existing methods. The method exploits the advantages of deep learning and extracts features through a deep residual neural network; it builds a recognition model based on a multi-component self-attention mechanism, extracting pedestrian features in a component-wise, multi-branch, highly fused manner, expanding the attention activation range so that discriminative regions are attended to more widely and sufficiently; it fuses spatial self-attention and channel self-attention, making the model attend more to spatially key regions with distinguishing characteristics and integrate and summarize channels containing similar semantic information, so that classification results are more distinctive; meanwhile, the spatial attention module and the channel attention module are fused into the network in residual form, making the network more robust, stable and easy to train. In conclusion, the invention improves pedestrian re-identification performance across cameras, with good robustness and general applicability.
The purpose of the invention is realized by the following technical scheme: a pedestrian re-identification method based on a multi-component self-attention mechanism comprises the following steps:
s1: pre-training a deep convolutional neural network backbone model B;
S2: split the backbone model B into B_common and B_branch, where B_branch corresponds to the last residual layer (layer4) of the backbone model B; deep-copy B_branch twice to obtain three branches B_branch1, B_branch2, B_branch3; construct a multi-component self-attention network ANet after the branches to obtain the multi-component self-attention feature F of the pedestrian;
S3: input the multi-component self-attention features into a classifier CLS, and jointly train to minimize the cross-entropy loss L_xent and the metric loss L_triplet;
S4: input the test-set pictures into the trained model, fuse the output component features to obtain an overall feature f, and rank by the distances between feature vectors to realize pedestrian re-identification.
Further, step S1 specifically includes: the deep convolutional neural network backbone model B adopts ResNet and is pre-trained on the ImageNet data set so that B obtains an initial value.
Further, the step S2 includes the following sub-steps:
S2.1: let the learnable parameters of B_common and B_branch1, B_branch2, B_branch3 be W_common and W_branch1, W_branch2, W_branch3, with W_branch1, W_branch2, W_branch3 initialized identically; a pedestrian image P passes through B_common and then through B_branch1, B_branch2, B_branch3 respectively, and the correspondingly extracted feature maps are F1 ∈ R^(C×H×W), F2 ∈ R^(C×H×W), F3 ∈ R^(C×H×W), where C is the number of channels, H the height and W the width of the feature maps; the calculation formula is:

F_i = B_branchi(B_common(P)), i = 1, 2, 3

wherein T, where it appears in the formulas below, denotes matrix transposition;
S2.2: after B_branch1, establish a branch: the local-component-based self-attention network PCPA; F1 is input to PCPA, which outputs the feature set F_pcpa; the PCPA network parameter is W_pcpa;
S2.3: after B_branch2, establish a branch: the global-component-based self-attention network PCA; F2 is input to PCA, which outputs the feature vector F_pca; the PCA network parameter is W_pca;
S2.4: after B_branch3, establish a branch: the global-component-based feature mapping network Global; F3 is input to Global, which outputs the feature vector F_g; the Global network parameter is W_global;
S2.5: F_pcpa, F_pca and F_g together constitute the multi-component self-attention feature set F.
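The branching scheme of S2 hinges on the deep copy: the three branches start from identical layer4 weights but are trained independently. A minimal sketch, with hypothetical parameter dictionaries standing in for the actual ResNet weights:

```python
import copy

import numpy as np

rng = np.random.default_rng(0)
B_common = {"conv1": rng.standard_normal((4, 4))}    # shared trunk parameters
B_branch1 = {"layer4": rng.standard_normal((4, 4))}  # last residual layer
B_branch2 = copy.deepcopy(B_branch1)                 # identical initialization
B_branch3 = copy.deepcopy(B_branch1)

# The copies agree at initialization but may diverge during training.
assert np.allclose(B_branch1["layer4"], B_branch2["layer4"])
B_branch2["layer4"] += 1.0  # stand-in for an independent gradient update
assert not np.allclose(B_branch1["layer4"], B_branch2["layer4"])
```

In a real implementation the copy would be applied to the layer4 module of a pretrained ResNet, so that B_branch1, B_branch2, B_branch3 share B_common but keep separate W_branch parameters.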
Further, the PCPA in step S2.2 comprises the following sub-steps:
S2.2.1: the input F1 ∈ R^(C×H×W) is split horizontally along its height into equal upper and lower halves, giving the feature maps F1_up (upper half) and F1_down (lower half);
S2.2.2: the input F1_up passes through three branches: a feature mapping module Identity, a spatial self-attention module PAtt and a channel self-attention module CAtt; the correspondingly extracted feature maps are F1_up_identity, F1_up_patt, F1_up_catt, obtained by the following sub-steps:
(a) the Identity mapping is
F1_up_identity = F1_up
(b) for the input F1_up, the spatial self-attention module PAtt specifically includes: for arbitrary feature vectors x_i, x_j ∈ F1_up, 1 ≤ i ≤ N, 1 ≤ j ≤ N, where N is the number of spatial positions of F1_up, self-attention relation modeling is carried out on the spatial scale to obtain a relation matrix D ∈ R^(N×N); for each value D_k of D, with k = (i, j), the calculation formula is:

D_(i,j) = exp(x_i^T · x_j) / Σ_{j'=1}^{N} exp(x_i^T · x_j')

The relation matrix D is applied to F1_up and merged into the network in residual form to obtain the updated feature map F1_up_patt:

F1_up_patt = W_patt · (D ⊗ F1_up) + F1_up

where W_patt is the parameter of the PAtt module and ⊗ denotes matrix multiplication;
(c) for the input F1_up, the channel self-attention module CAtt specifically includes: for arbitrary feature vectors c_i, c_j ∈ F1_up, 1 ≤ i ≤ C, 1 ≤ j ≤ C, self-attention relation modeling is carried out on the channel scale to obtain a relation matrix E ∈ R^(C×C); for each value E_k of E, with k = (i, j), the calculation formula is:

E_(i,j) = exp(c_i^T · c_j) / Σ_{j'=1}^{C} exp(c_i^T · c_j')

The relation matrix E is applied to F1_up in residual form to obtain the updated feature map F1_up_catt:

F1_up_catt = W_catt · (E ⊗ F1_up) + F1_up

where W_catt is the parameter of the CAtt module;
S2.2.3: fuse F1_up_identity, F1_up_patt, F1_up_catt to obtain the output feature map F1_up_pcpa, calculated as:
F1_up_pcpa = F1_up_identity + F1_up_patt + F1_up_catt
S2.2.4: operate on F1_down in the same way as on F1_up to obtain the output feature map F1_down_pcpa through the PCPA, sharing the parameters W_patt and W_catt with the F1_up branch;
S2.2.5: perform global average pooling on F1_up_pcpa and F1_down_pcpa to obtain the feature vector set:
F_pcpa = {F_up_pcpa, F_down_pcpa}
where F_up_pcpa = AvgP(F1_up_pcpa) and F_down_pcpa = AvgP(F1_down_pcpa); AvgP(·) denotes the global average pooling operation, calculated as:

AvgP(F1_up_pcpa)_c = (1 / (W1 × H1)) Σ_{w=1}^{W1} Σ_{h=1}^{H1} x_{c,w,h}

where 1 ≤ c ≤ C1, 1 ≤ w ≤ W1, 1 ≤ h ≤ H1; C1 is the number of channels, W1 the width and H1 the height of the feature map F1_up_pcpa, and x_{c,w,h} is an element of the three-dimensional feature map F1_up_pcpa; AvgP(F1_down_pcpa) is computed in the same way.
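The PAtt/CAtt residual-attention scheme above can be sketched in a few lines of numpy. Shapes, the softmax normalization of D and E, and the scalar stand-ins for the learned weights W_patt and W_catt are illustrative assumptions; the patent realizes these modules inside a ResNet branch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def patt(F, W_patt):
    """Spatial self-attention: F is (C, N) with N = H*W positions."""
    D = softmax(F.T @ F, axis=-1)   # (N, N) spatial relation matrix
    return W_patt * (F @ D.T) + F   # residual fusion

def catt(F, W_catt):
    """Channel self-attention: relation matrix over the C channels."""
    E = softmax(F @ F.T, axis=-1)   # (C, C) channel relation matrix
    return W_catt * (E @ F) + F     # residual fusion

C, H, W = 8, 4, 2
F1_up = np.random.default_rng(1).standard_normal((C, H * W))
# Identity + PAtt + CAtt, fused as in S2.2.3:
F1_up_pcpa = F1_up + patt(F1_up, 0.1) + catt(F1_up, 0.1)
assert F1_up_pcpa.shape == (C, H * W)
```

The residual form (`... + F`) is what keeps the module stable to train: with small attention weights the branch degenerates to the identity mapping.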
Further, the PCA in step S2.3 is specifically: the input F2 ∈ R^(C×H×W) passes through three branches: the feature mapping module Identity, the spatial self-attention module PAtt and the channel self-attention module CAtt, calculated in the same manner as steps S2.2.2-S2.2.3 of the PCPA, but without horizontal splitting of the feature map.
Further, Global in step S2.4 is specifically: perform global average pooling on the input F3 ∈ R^(C×H×W) to obtain the feature vector F_g, calculated as:
F_g = AvgP(F3)
where AvgP(·) denotes the global average pooling operation:

AvgP(F3)_c = (1 / (W3 × H3)) Σ_{w=1}^{W3} Σ_{h=1}^{H3} x_{c,w,h}

where 1 ≤ c ≤ C3, 1 ≤ w ≤ W3, 1 ≤ h ≤ H3; C3 is the number of channels, W3 the width and H3 the height of the feature map F3, and x_{c,w,h} is an element of the three-dimensional feature map F3.
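The AvgP(·) operation above reduces a (C, H, W) map to a C-dimensional vector by averaging over the spatial grid; a one-line numpy sketch with a toy map:

```python
import numpy as np

F3 = np.arange(24, dtype=float).reshape(2, 3, 4)  # C=2, H=3, W=4 toy feature map
F_g = F3.mean(axis=(1, 2))                        # AvgP over H and W
assert F_g.shape == (2,)
assert F_g[0] == 5.5  # mean of channel 0's values 0..11
```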
Further, the multi-component self-attention feature set in step S2.5 is F = {F_up_pcpa, F_down_pcpa, F_pca, F_g}.
Further, the step S3 includes the following sub-steps:
S3.1: for the input pedestrian images P = {p1, p2, p3, ..., pn} and the corresponding identity tags IDs = {q1, q2, q3, ..., qn}, where n is the number of samples of P, obtain through step S2 the multi-component self-attention feature F = {F_up_pcpa, F_down_pcpa, F_pca, F_g} corresponding to the pedestrian images P;
S3.2: the classifier CLS(·) denotes a fully connected layer; for p_i ∈ P, 1 ≤ i ≤ n, and its corresponding multi-component self-attention feature F_i = {F_i,up_pcpa, F_i,down_pcpa, F_i,pca, F_i,g}, let the weight matrix be W_cls = {W_cls1, W_cls2, W_cls3, W_cls4}, W_cls1, W_cls2, W_cls3, W_cls4 ∈ R^(K×Z), where K is the dimension of the input feature vectors F_i,up_pcpa, F_i,down_pcpa, F_i,pca, F_i,g and Z is the output dimension, i.e. the number of pedestrian identity tags; through CLS(·), the classification probability Pro is output:

Pro_i = softmax(W_cls^T · F_i)
S3.3: calculate the cross-entropy loss L_xent, with the calculation formula:

L_xent = -(1/n) Σ_{i=1}^{n} log Pro_i(q_i)

where Pro_i(q_i) is the predicted probability of the true identity tag q_i;
S3.4: for any input pedestrian sample p_i ∈ P = {p1, p2, p3, ..., pn} with identity tag q_i ∈ IDs = {q1, q2, q3, ..., qn}, take from P the negative sample p_j nearest to p_i and the positive sample p_k farthest from p_i; obtain through step S2 the multi-component self-attention features:
F_pi = {F_pi,up_pcpa, F_pi,down_pcpa, F_pi,pca, F_pi,g}
F_pj = {F_pj,up_pcpa, F_pj,down_pcpa, F_pj,pca, F_pj,g}
F_pk = {F_pk,up_pcpa, F_pk,down_pcpa, F_pk,pca, F_pk,g}
S3.5: calculate the metric loss L_triplet as follows: for F_up_pcpa, denote

sp_up_pcpa = ||F_pi,up_pcpa - F_pk,up_pcpa||_2, sn_up_pcpa = ||F_pi,up_pcpa - F_pj,up_pcpa||_2

requiring sn_up_pcpa - sp_up_pcpa ≥ m, where m is a margin value; the metric loss L_triple_up_pcpa,i,j,k for F_up_pcpa is then calculated as:

L_triple_up_pcpa,i,j,k = [sp_up_pcpa - sn_up_pcpa + m]_+

where [·]_+ is the hinge function max(·, 0); the metric losses of F_down_pcpa, F_pca and F_g are calculated in the same way, giving L_triplet as:

L_triplet = L_triple_up_pcpa + L_triple_down_pcpa + L_triple_pca + L_triple_g

summed over the sampled triplets;
S3.6: jointly train to minimize the cross-entropy loss L_xent and the metric loss L_triplet; the total loss is L = L_xent + λ·L_triplet, where λ is the balance parameter between L_xent and L_triplet.
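A hedged numpy sketch of the joint objective L = L_xent + λ·L_triplet for a single feature branch. The margin m = 1.2 and λ = 0.5 follow the embodiment values; the logits and feature vectors are toy stand-ins, not outputs of the patent's network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, label):
    """Negative log-probability of the true identity tag."""
    return -np.log(softmax(logits)[label])

def triplet_loss(fa, fp, fn, m=1.2):
    """Hinge triplet loss [sp - sn + m]_+ for one anchor/positive/negative."""
    sp = np.linalg.norm(fa - fp)  # distance to the farthest positive
    sn = np.linalg.norm(fa - fn)  # distance to the nearest negative
    return max(sp - sn + m, 0.0)

logits = np.array([2.0, 0.5, -1.0])  # toy classifier output, 3 identities
fa = np.zeros(4)                     # anchor feature
fp = np.full(4, 0.1)                 # positive feature (close)
fn = np.full(4, 2.0)                 # negative feature (far)
lam = 0.5
L = cross_entropy(logits, 0) + lam * triplet_loss(fa, fp, fn)
```

Here the triplet term is already zero because the negative is well beyond the margin, so the total loss reduces to the cross-entropy term.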
Further, the step S4 includes the following sub-steps:
S4.1: the query picture set A = {a1, a2, a3, ..., aM} and the to-be-selected picture set G = {g1, g2, g3, ..., gT} are input into the backbone model B and the multi-component self-attention network ANet respectively to obtain the corresponding feature sets:
F_A = {F_A,up_pcpa, F_A,down_pcpa, F_A,pca, F_A,g}
F_G = {F_G,up_pcpa, F_G,down_pcpa, F_G,pca, F_G,g}
S4.2: fuse the component features of F_A and F_G respectively by concatenating them along the feature dimension to obtain the overall features f_A and f_G;
S4.3: calculate the Euclidean distances between f_A and f_G, construct a distance matrix S ∈ R^(M×T), and sort by distance to obtain a retrieval candidate list.
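Steps S4.1-S4.3 can be sketched with numpy: a Euclidean distance matrix S ∈ R^(M×T) between fused query features f_A and gallery features f_G, then a ranked candidate list per query. The feature values here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
f_A = rng.standard_normal((3, 16))  # M=3 fused query features
f_G = rng.standard_normal((5, 16))  # T=5 fused to-be-selected features

# Pairwise Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
S = np.sqrt(np.maximum(
    (f_A ** 2).sum(1)[:, None] + (f_G ** 2).sum(1)[None, :] - 2 * f_A @ f_G.T,
    0.0))
ranking = np.argsort(S, axis=1)     # candidate list per query, nearest first
assert S.shape == (3, 5)
```

The `maximum(..., 0)` clamp guards against tiny negative values from floating-point cancellation before the square root.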
The invention has the following beneficial effects for cross-camera pedestrian picture retrieval and identification:
(1) the invention constructs an identification model based on a multi-component self-attention mechanism, extracts pedestrian features in a way of splitting, multi-branching and high fusion, enlarges the attention activation range, more widely and fully pays attention to key areas with discriminability, and improves the feature robustness.
(2) The invention integrates the self-mapping module, the space attention module and the channel attention module to extract the pedestrian characteristics, and more accurately focuses on the areas with high discriminability. The self-mapping module is helpful for the model to pay attention to local information and global information of the model, the space attention module enables the model to pay more attention to key areas with distinguishing characteristics, and the channel attention module enables the model to integrate and summarize channels containing similar semantic information, so that the classification result is more distinctive, and the accuracy of pedestrian re-identification is improved.
(3) According to the invention, through multi-branch combined training measurement loss and classification cross entropy loss, the characteristic discrimination and the discrimination of each branch are improved, so that the identification accuracy is improved.
Drawings
FIG. 1 is a network model structure diagram of a pedestrian re-identification method based on a multi-component self-attention mechanism disclosed in the invention;
fig. 2 is a schematic diagram of a multi-component self-attention network ANet disclosed in the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions and specific operation procedures in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, but the scope of the present invention is not limited to the embodiments described below.
As shown in fig. 1-2, an embodiment of the present invention discloses a pedestrian re-identification method based on a multi-component self-attention mechanism, which includes the following steps:
s1: and pre-training a deep convolutional neural network backbone model B.
In step S1, ResNet is used as the convolutional neural network backbone model, and pretraining is performed on the ImageNet large-scale data set, so that the backbone network B obtains an ideal initial value.
S2: split the backbone model B into B_common and B_branch, where B_branch corresponds to the last residual layer (layer4) of ResNet50; deep-copy B_branch twice to obtain three branches B_branch1, B_branch2, B_branch3, and construct the multi-component self-attention network ANet after the branches to obtain the multi-component self-attention feature F of the pedestrian.
The step S2 specifically includes:
S2.1: let the learnable parameters of B_common and B_branch1, B_branch2, B_branch3 be W_common and W_branch1, W_branch2, W_branch3, with W_branch1, W_branch2, W_branch3 initialized identically; the input pedestrian image P is a picture in RGB format resized to 384 × 128 × 3; P passes through B_common and then through B_branch1, B_branch2, B_branch3 respectively, and the correspondingly extracted feature maps are F1 ∈ R^(C×H×W), F2 ∈ R^(C×H×W), F3 ∈ R^(C×H×W), where C is the number of channels, H the height and W the width of the feature maps; the calculation formula is:

F_i = B_branchi(B_common(P)), i = 1, 2, 3

where T, where it appears in the formulas below, denotes matrix transposition.
S2.2: after B_branch1, establish a branch: the local-component-based self-attention network PCPA. For PCPA, the input is F1 and the output is the feature set F_pcpa, where the PCPA network parameter is W_pcpa.
In S2.2, the PCPA specifically includes the following steps:
S2.2.1: F1 ∈ R^(C×H×W) is split horizontally into equal upper and lower halves along its height, forming the feature maps F1_up (upper half) and F1_down (lower half).
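The horizontal split of S2.2.1 halves the (C, H, W) map along its height axis; a numpy sketch with a toy shape:

```python
import numpy as np

F1 = np.random.default_rng(3).standard_normal((8, 6, 4))  # C=8, H=6, W=4
F1_up, F1_down = np.split(F1, 2, axis=1)                  # halve along H
assert F1_up.shape == (8, 3, 4) and F1_down.shape == (8, 3, 4)
```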
S2.2.2: the input F1_up passes through three branches: the feature mapping module Identity, the spatial self-attention module PAtt and the channel self-attention module CAtt; the correspondingly extracted feature maps are F1_up_identity, F1_up_patt, F1_up_catt. Identity maps the input to itself, PAtt aims to place more attention on spatial regions with discriminability and distinctiveness, and CAtt aims to generalize the channel features containing similar semantic information:
(a) for the input F1_up, the Identity mapping is calculated as: F1_up_identity = F1_up;
(b) for arbitrary feature vectors x_i, x_j ∈ F1_up, 1 ≤ i ≤ N, 1 ≤ j ≤ N, where N is the number of spatial positions of F1_up, self-attention relation modeling is carried out on the spatial scale to obtain a relation matrix D ∈ R^(N×N). For each value D_k of D, with k = (i, j), the calculation formula is:

D_(i,j) = exp(x_i^T · x_j) / Σ_{j'=1}^{N} exp(x_i^T · x_j')

The relation matrix D is applied to F1_up to obtain the updated feature map F1_up_patt:

F1_up_patt = W_patt · (D ⊗ F1_up) + F1_up

where W_patt is the parameter of the PAtt module and ⊗ denotes matrix multiplication;
(c) for arbitrary feature vectors c_i, c_j ∈ F1_up, 1 ≤ i ≤ C, 1 ≤ j ≤ C, self-attention relation modeling is carried out on the channel scale to obtain a relation matrix E ∈ R^(C×C). For each value E_k of E, with k = (i, j), the calculation formula is:

E_(i,j) = exp(c_i^T · c_j) / Σ_{j'=1}^{C} exp(c_i^T · c_j')

The relation matrix E is applied to F1_up to obtain the updated feature map F1_up_catt:

F1_up_catt = W_catt · (E ⊗ F1_up) + F1_up

where W_catt is the parameter of the CAtt module.
S2.2.3: handle F1_up_identity,F1_up_patt,F1_up_cattFusing to obtain an output characteristic diagram F1_up_pcpaThe calculation method is as follows:
F1_up_pcpa=F1_up_identity+F1_up_patt+F1_up_catt。
s2.2.4: for theMode of operation and1_upobtaining an output characteristic diagram F passing through the PCPA1_down_pcpaAnd is with F1_upAnd sharing the parameters.
S2.2.5: perform global average pooling on F1_up_pcpa and F1_down_pcpa to obtain the feature vector set F_pcpa:
F_pcpa = {F_up_pcpa, F_down_pcpa},
F_up_pcpa = AvgP(F1_up_pcpa),
F_down_pcpa = AvgP(F1_down_pcpa)
where AvgP(·) denotes the global average pooling operation, calculated as:

AvgP(F1_up_pcpa)_c = (1 / (W1 × H1)) Σ_{w=1}^{W1} Σ_{h=1}^{H1} x_{c,w,h}

where x_{c,w,h} is an element of the three-dimensional feature map F1_up_pcpa, 1 ≤ c ≤ C1, 1 ≤ w ≤ W1, 1 ≤ h ≤ H1, and C1, W1, H1 are the number of channels, width and height of F1_up_pcpa.
S2.3: after B_branch2, establish a branch: the global-component-based self-attention network PCA. For PCA, the input is F2 and the output is the feature vector F_pca, where the PCA network parameter is W_pca.
In S2.3, for the input F2 ∈ R^(C×H×W), the PCA passes through three branches: the feature mapping module Identity, the spatial self-attention module PAtt and the channel self-attention module CAtt, calculated consistently with Identity, PAtt and CAtt in the PCPA, but the operations of steps S2.2.2-S2.2.3 are performed on F2 without horizontal splitting. In particular, although this module after B_branch2 is calculated in the same manner as the PCPA after B_branch1, it does not share parameters with it.
S2.4: after B_branch3, establish a branch: the global-component-based feature mapping network Global. For Global, the input is F3 and the output is the feature vector F_g, where the Global network parameter is W_global.
In S2.4, Global specifically performs global average pooling on the input F3 ∈ R^(C×H×W) to obtain the feature vector F_g, calculated as:
F_g = AvgP(F3)
where AvgP(·) denotes the global average pooling operation:

AvgP(F3)_c = (1 / (W3 × H3)) Σ_{w=1}^{W3} Σ_{h=1}^{H3} x_{c,w,h}

where x_{c,w,h} is an element of the three-dimensional feature map F3, 1 ≤ c ≤ C3, 1 ≤ w ≤ W3, 1 ≤ h ≤ H3, and C3, W3, H3 are the number of channels, width and height of F3.
S2.5: F_pcpa, F_pca and F_g together form the multi-component self-attention feature set F, specifically:
F = {F_up_pcpa, F_down_pcpa, F_pca, F_g}
S3: input the multi-component self-attention features into the classifier CLS, and jointly train to minimize the cross-entropy loss L_xent and the metric loss L_triplet.
S3.1: for the input pedestrian images P = {p1, p2, p3, ..., pn} and the corresponding identity tags IDs = {q1, q2, q3, ..., qn}, where n is the number of input pedestrian pictures, obtain the multi-component self-attention feature F = {F_up_pcpa, F_down_pcpa, F_pca, F_g} through S2.
S3.2: the classifier CLS(·) denotes a fully connected layer; for p_i ∈ P, 1 ≤ i ≤ n, and its corresponding multi-component self-attention feature F_i = {F_i,up_pcpa, F_i,down_pcpa, F_i,pca, F_i,g}, the input feature vectors F_i,up_pcpa, F_i,down_pcpa, F_i,pca, F_i,g all have dimension K; let the weight matrix be W_cls = {W_cls1, W_cls2, W_cls3, W_cls4}, W_cls1, W_cls2, W_cls3, W_cls4 ∈ R^(K×Z), where Z is the output dimension, i.e. the number of pedestrian identity tags. Through CLS(·), the classification probability Pro is output:

Pro_i = softmax(W_cls^T · F_i)

S3.3: calculate the cross-entropy loss L_xent, with the calculation formula:

L_xent = -(1/n) Σ_{i=1}^{n} log Pro_i(q_i)

where q_i ∈ IDs.
S3.4: for any input pedestrian sample p_i ∈ P = {p1, p2, p3, ..., pn} with identity tag q_i ∈ IDs = {q1, q2, q3, ..., qn}, take from P the negative sample p_j nearest to p_i and the positive sample p_k farthest from p_i. Obtain through S2 the multi-component self-attention features:
F_pi = {F_pi,up_pcpa, F_pi,down_pcpa, F_pi,pca, F_pi,g}
F_pj = {F_pj,up_pcpa, F_pj,down_pcpa, F_pj,pca, F_pj,g}
F_pk = {F_pk,up_pcpa, F_pk,down_pcpa, F_pk,pca, F_pk,g}.
S3.5: calculate the metric loss L_triplet as follows:
Taking F_up_pcpa as an example, denote

sp_up_pcpa = ||F_pi,up_pcpa - F_pk,up_pcpa||_2, sn_up_pcpa = ||F_pi,up_pcpa - F_pj,up_pcpa||_2

requiring sn_up_pcpa - sp_up_pcpa ≥ m, where m is a margin value taking the value 1.2 in this embodiment; then the metric loss L_triple_up_pcpa,i,j,k for F_up_pcpa is calculated as:

L_triple_up_pcpa,i,j,k = [sp_up_pcpa - sn_up_pcpa + m]_+

where q_i, q_j, q_k are the identity tags corresponding to the samples p_i, p_j, p_k, and [·]_+ is the hinge function max(·, 0); L_triplet is thus obtained as:

L_triplet = L_triple_up_pcpa + L_triple_down_pcpa + L_triple_pca + L_triple_g

summed over the sampled triplets.
S3.6: jointly train to minimize the cross-entropy loss L_xent and the metric loss L_triplet; the total loss is L = L_xent + λ·L_triplet, where λ is the balance parameter between L_xent and L_triplet. In this embodiment, λ is 0.5.
S4: input the test-set pictures into the trained model, fuse the output component features to obtain the overall feature f, and rank by the distances between feature vectors to realize pedestrian re-identification. The test-set pictures comprise a query picture set and a to-be-selected picture set, and the to-be-selected pictures containing the same pedestrian are found from the to-be-selected picture set according to the query picture set.
The step S4 specifically includes:
S4.1: for the query picture set A = {a1, a2, a3, ..., aM} and the to-be-selected picture set G = {g1, g2, g3, ..., gT}, where M is the number of query pictures and T the number of to-be-selected pictures, A and G are both RGB pictures resized to 384 × 128 × 3 and are input into the backbone model B and the multi-component self-attention network ANet respectively to obtain the corresponding feature sets:
F_A = {F_A,up_pcpa, F_A,down_pcpa, F_A,pca, F_A,g}
F_G = {F_G,up_pcpa, F_G,down_pcpa, F_G,pca, F_G,g}.
S4.2: fuse the component features of F_A and F_G respectively by concatenating them along the feature dimension to obtain the overall features f_A and f_G.
S4.3: calculate the Euclidean distances between f_A and f_G, construct a distance matrix S ∈ R^(M×T), sort the to-be-selected pictures by distance for each query picture, set the query number s, and take the s nearest to-be-selected pictures as the retrieval candidate list of the query picture; the accuracy of the result is evaluated with mAP and Rank@1.
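An illustrative Rank@1 check for the evaluation mentioned in S4.3, with hypothetical identity labels: a query counts as a hit if its nearest gallery picture shares its identity. mAP would additionally average precision over the full ranking.

```python
import numpy as np

S = np.array([[0.2, 0.9, 0.5],   # toy distance matrix, 2 queries x 3 gallery
              [0.8, 0.1, 0.4]])
q_ids = np.array([7, 9])         # query identities (hypothetical)
g_ids = np.array([7, 9, 7])      # gallery identities (hypothetical)

rank1_hits = g_ids[S.argmin(axis=1)] == q_ids  # nearest neighbour matches?
rank1 = rank1_hits.mean()
assert rank1 == 1.0  # both queries' nearest gallery pictures match
```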
Table 1 below shows the recognition accuracy obtained by the method of the above embodiment of the invention, with reference methods listed above it for comparison; it can be seen that the recognition performance of the embodiment is greatly improved.
Table 1: identification accuracy results
Method | mAP | Rank@1
SVD-Net | 62.1% | 82.3%
AACN | 66.87% | 85.9%
MGCAM | 74.3% | 83.8%
HA-CNN | 75.7% | 91.2%
PCB | 81.6% | 93.8%
Method of this embodiment | 88.5% | 95.4%
In summary, the embodiment of the invention discloses a pedestrian re-identification method based on a multi-component self-attention mechanism, which constructs an identification model based on the multi-component self-attention mechanism, extracts pedestrian features in a component-by-component, multi-branch and high-fusion manner, enlarges the attention activation range, pays attention to a key region with discriminability more widely and fully, and improves the feature robustness; the method integrates a self-mapping module, a space attention module and a channel attention module to extract pedestrian features, and focuses on the regions with high discriminability more accurately. The self-mapping module is helpful for the model to pay attention to local information and global information of the model, the space attention module enables the model to pay more attention to key areas with distinguishing characteristics, and the channel attention module enables the model to integrate and summarize channels containing similar semantic information, so that classification results are more distinctive; in addition, the method improves the characteristic discrimination and the discrimination of each branch through multi-branch combined training measurement loss and classification cross entropy loss, thereby improving the identification accuracy.
Claims (7)
1. A pedestrian re-identification method based on a multi-component self-attention mechanism, comprising the steps of:
s1: pre-training a deep convolutional neural network backbone model B;
S2: split the backbone model B into B_common and B_branch, where B_branch corresponds to the last residual layer (layer4) of the backbone model B; deep-copy B_branch twice to obtain three branches B_branch1, B_branch2, B_branch3; construct a multi-component self-attention network ANet after the branches to obtain the multi-component self-attention feature F of the pedestrian;
S3: input the multi-component self-attention features into a classifier CLS, and jointly train to minimize the cross-entropy loss L_xent and the metric loss L_triplet;
S4: input the test-set pictures into the trained model, fuse the output component features to obtain an overall feature f, and rank by the distances between feature vectors to realize pedestrian re-identification.
The step S2 includes the following sub-steps:
s2.1: let BcommonAnd Bbranch1,Bbranch2,Bbranch3Corresponding to a learning parameter of WcommonAnd Wbranch1,Wbranch2,Wbranch3,Wbranch1,Wbranch2,Wbranch3The initialization parameters are consistent; pedestrian image P passing through BcommonRespectively pass through Bbranch1,Bbranch2,Bbranch3Then, the corresponding extracted feature maps are respectively F1∈RC×H×W,F2∈RC×H×W,F3∈RC×H×WWherein C is the number of channels of the feature map, H is the height of the feature map, W is the width of the feature map, and the calculation formula is as follows:
wherein T represents a matrix transposition function;
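The common-trunk/three-branch construction of steps S2 and S2.1 can be sketched in Python: `copy.deepcopy` gives the three branches identical initial parameters that then evolve independently. The dict-of-lists `branch` below is a toy stand-in for the layer4 sub-network, not the patent's actual model.

```python
import copy

# Toy stand-in for the layer4 sub-network of the backbone: just a parameter dict.
branch = {"W": [0.1, 0.2, 0.3]}

# Deep-copy the branch twice so all three branches start from identical
# parameters (step S2.1) but can diverge independently during training.
branch1 = branch
branch2 = copy.deepcopy(branch)
branch3 = copy.deepcopy(branch)

assert branch1["W"] == branch2["W"] == branch3["W"]  # identical initialization

# Updating one branch leaves the others untouched.
branch2["W"][0] = 9.9
print(branch1["W"][0], branch2["W"][0], branch3["W"][0])  # 0.1 9.9 0.1
```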
S2.2: establishing a branch after B_branch1: a local-component-based self-attention network PCPA; F_1 is input to the PCPA, which outputs the feature set F_pcpa; the PCPA network parameters are W_pcpa;
S2.3: establishing a branch after B_branch2: a global-component-based self-attention network PCA; F_2 is input to the PCA, which outputs the feature vector F_pca; the PCA network parameters are W_pca;
S2.4: establishing a branch after B_branch3: a global-component-based feature mapping network Global; F_3 is input to Global, which outputs the feature vector F_g; the Global network parameters are W_global;
S2.5: F_pcpa, F_pca, and F_g collectively constitute the multi-component self-attention feature set F.
The PCPA in said step S2.2 comprises the following sub-steps:
S2.2.1: horizontally splitting the input feature map F_1 into an upper part F_1_up and a lower part F_1_down;
S2.2.2: the input F_1_up passes through three branches: a feature mapping module Identity, a spatial self-attention module PAtt, and a channel self-attention module CAtt, whose correspondingly extracted feature maps are F_1_up_identity, F_1_up_patt, and F_1_up_catt; this comprises the following sub-steps:
(a) the feature mapping module Identity keeps the input unchanged:
F_1_up_identity = F_1_up
(b) for the input F_1_up, the spatial self-attention module PAtt specifically comprises: for arbitrary feature vectors x_i, x_j ∈ F_1_up, 1 ≤ i ≤ N, 1 ≤ j ≤ N, where N is the number of spatial positions of F_1_up, self-attention relation modeling is carried out on the spatial scale to obtain a relation matrix D ∈ R^(N×N); each value D_(i,j) of D is calculated as:
D_(i,j) = exp(x_i^T x_j) / Σ_(j'=1)^(N) exp(x_i^T x_j')
the relation matrix D is applied to F_1_up and merged into the network in the form of a residual connection, yielding the updated feature map F_1_up_patt; the calculation formula is:
F_1_up_patt = F_1_up + D · F_1_up
(c) for the input F_1_up, the channel self-attention module CAtt specifically comprises: for arbitrary channel vectors c_i, c_j ∈ F_1_up, 1 ≤ i ≤ C, 1 ≤ j ≤ C, self-attention relation modeling is carried out on the channel scale to obtain a relation matrix E ∈ R^(C×C); each value E_(i,j) of E is calculated as:
E_(i,j) = exp(c_i^T c_j) / Σ_(j'=1)^(C) exp(c_i^T c_j')
the relation matrix E is applied to F_1_up, yielding the updated feature map F_1_up_catt; the calculation formula is:
F_1_up_catt = F_1_up + E · F_1_up
S2.2.3: fusing F_1_up_identity, F_1_up_patt, and F_1_up_catt to obtain the output feature map F_1_up_pcpa; the calculation is:
F_1_up_pcpa = F_1_up_identity + F_1_up_patt + F_1_up_catt
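A minimal NumPy sketch of the three PCPA branches (Identity, PAtt, CAtt) and their additive fusion. The dot-product softmax relation matrices and the residual connections follow standard self-attention practice and are assumptions where the patent's formula images are not reproduced; all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 3, 2                    # toy channels / height / width
N = H * W                            # number of spatial positions
F_up = rng.standard_normal((C, N))   # feature map flattened to C x N

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Spatial self-attention (PAtt): N x N relation matrix over positions.
D = softmax(F_up.T @ F_up, axis=1)   # rows sum to 1
F_patt = F_up + F_up @ D.T           # residual connection

# Channel self-attention (CAtt): C x C relation matrix over channels.
E = softmax(F_up @ F_up.T, axis=1)   # rows sum to 1
F_catt = F_up + E @ F_up             # residual connection

# Identity branch and additive fusion (step S2.2.3).
F_identity = F_up
F_pcpa = F_identity + F_patt + F_catt
print(F_pcpa.shape)                  # (4, 6)
```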
S2.2.4: F_1_down is processed in the same manner as F_1_up, obtaining the output feature map F_1_down_pcpa through the PCPA; this branch shares the parameters W_patt and W_catt with the F_1_up branch;
S2.2.5: performing global average pooling on F_1_up_pcpa and F_1_down_pcpa to obtain the feature vector set:
F_pcpa = {F_up_pcpa, F_down_pcpa}
where F_up_pcpa = AvgP(F_1_up_pcpa) and F_down_pcpa = AvgP(F_1_down_pcpa); AvgP(·) denotes the global average pooling operation, calculated per channel c as:
AvgP(F_1_up_pcpa)_c = (1 / (W_1 · H_1)) · Σ_(w=1)^(W_1) Σ_(h=1)^(H_1) x_(c,w,h)
where x_(c,w,h) is an element of the three-dimensional feature map F_1_up_pcpa, 1 ≤ c ≤ C_1, 1 ≤ w ≤ W_1, 1 ≤ h ≤ H_1; C_1 is the number of channels, W_1 the width, and H_1 the height of the feature map F_1_up_pcpa; AvgP(F_1_down_pcpa) is computed in the same way.
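The AvgP(·) operation of step S2.2.5 reduces each channel of a C×W×H map to the mean of its W·H spatial elements; a sketch with toy values:

```python
import numpy as np

# Global average pooling: each channel collapses to the mean of its
# W*H spatial elements, turning a C x W x H map into a length-C vector.
F = np.arange(24, dtype=float).reshape(2, 3, 4)   # C=2, W=3, H=4

def avg_pool(x):
    return x.mean(axis=(1, 2))    # reduce over the two spatial axes

v = avg_pool(F)
print(v.tolist())   # [5.5, 17.5]
```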
2. The pedestrian re-identification method based on the multi-component self-attention mechanism according to claim 1, wherein the step S1 is specifically: the deep convolutional neural network backbone model B adopts ResNet and is pre-trained on the ImageNet dataset so that B obtains initial values.
3. The pedestrian re-identification method based on the multi-component self-attention mechanism as claimed in claim 1, wherein the PCA in step S2.3 is specifically: the input F_2 ∈ R^(C×H×W) passes through three branches: the feature mapping module Identity, the spatial self-attention module PAtt, and the channel self-attention module CAtt, computed in the same manner as steps S2.2.2-S2.2.3 of the PCPA but without horizontal segmentation of the feature map.
4. The pedestrian re-identification method based on the multi-component self-attention mechanism according to claim 3, wherein Global in the step S2.4 is specifically: performing global average pooling on the input F_3 ∈ R^(C×H×W) to obtain the feature vector F_g; the calculation formula is:
F_g = AvgP(F_3)
where AvgP(·) denotes the global average pooling operation, calculated per channel c as:
AvgP(F_3)_c = (1 / (W_3 · H_3)) · Σ_(w=1)^(W_3) Σ_(h=1)^(H_3) x_(c,w,h)
where 1 ≤ c ≤ C_3, 1 ≤ w ≤ W_3, 1 ≤ h ≤ H_3; C_3 is the number of channels, W_3 the width, and H_3 the height of the feature map F_3; x_(c,w,h) is an element of the three-dimensional feature map F_3.
5. The method for pedestrian re-identification based on the multi-component self-attention mechanism according to claim 4, wherein the multi-component self-attention feature set in step S2.5 is F = {F_up_pcpa, F_down_pcpa, F_pca, F_g}.
6. The method for pedestrian re-identification based on the multi-component self-attention mechanism as claimed in claim 5, wherein the step S3 comprises the sub-steps of:
S3.1: for the input pedestrian images P = {p_1, p_2, p_3, ..., p_n} and the corresponding identity labels IDs = {q_1, q_2, q_3, ..., q_n}, n being the number of samples in P, obtaining through step S2 the multi-component self-attention feature F = {F_up_pcpa, F_down_pcpa, F_pca, F_g} corresponding to each pedestrian image;
S3.2: the classifier CLS(·) denotes a fully connected layer; for p_i ∈ P, 1 ≤ i ≤ n, with corresponding multi-component self-attention feature F_i = {F_i,up_pcpa, F_i,down_pcpa, F_i,pca, F_i,g}, let the weight matrices be W_cls = {W_cls1, W_cls2, W_cls3, W_cls4}, with W_cls1, W_cls2, W_cls3, W_cls4 ∈ R^(K×Z), where K is the dimension of the input feature vectors F_i,up_pcpa, F_i,down_pcpa, F_i,pca, F_i,g and Z is the output dimension, i.e., the number of pedestrian identity labels; through CLS(·), the output classification probability Pro is obtained per component, e.g. for the first component:
Pro_i,up_pcpa = softmax(W_cls1^T · F_i,up_pcpa)
and analogously for the other three components with W_cls2, W_cls3, W_cls4;
S3.3: calculating the cross-entropy loss L_xent as:
L_xent = -(1/n) · Σ_(i=1)^(n) Σ_(comp) log Pro_i,comp(q_i)
where the inner sum runs over the four component classifiers and Pro_i,comp(q_i) is the predicted probability of the true identity label q_i for sample p_i;
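A NumPy sketch of steps S3.2-S3.3 for a single component feature: a fully connected classifier followed by softmax and the cross-entropy of the true identity labels. The dimensions, random weights, and labels are illustrative, not the patent's values.

```python
import numpy as np

rng = np.random.default_rng(1)
K, Z, n = 8, 5, 3                     # feature dim, identities, batch size
Fi = rng.standard_normal((n, K))      # one component feature per sample
W_cls = rng.standard_normal((K, Z))   # fully connected classifier weights
labels = np.array([0, 3, 2])          # ground-truth identity indices q_i

logits = Fi @ W_cls
logits -= logits.max(axis=1, keepdims=True)                 # stability
Pro = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cross-entropy: mean negative log-probability of the true identity.
L_xent = -np.mean(np.log(Pro[np.arange(n), labels]))
print(L_xent > 0)   # True
```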
S3.4: for an arbitrary input pedestrian sample p_i ∈ P = {p_1, p_2, p_3, ..., p_n} with identity label q_i ∈ IDs = {q_1, q_2, q_3, ..., q_n}, selecting from P the negative sample p_j nearest to p_i and the positive sample p_k farthest from p_i, and obtaining their multi-component self-attention features via said step S2;
S3.5: calculating the metric loss L_triplet as follows: for F_up_pcpa, denote sp_up_pcpa = ||F_i,up_pcpa - F_k,up_pcpa||_2, the distance to the farthest positive sample, and sn_up_pcpa = ||F_i,up_pcpa - F_j,up_pcpa||_2, the distance to the nearest negative sample; let m be a boundary value (margin); the metric loss of F_up_pcpa is calculated as:
L_triple_up_pcpa,i,j,k = [sp_up_pcpa - sn_up_pcpa + m]_+
where [·]_+ denotes the hinge function max(·, 0); the metric losses of F_down_pcpa, F_pca, and F_g are calculated in the same way, and L_triplet is their sum over all sampled triplets (i, j, k):
L_triplet = Σ_(i,j,k) (L_triple_up_pcpa,i,j,k + L_triple_down_pcpa,i,j,k + L_triple_pca,i,j,k + L_triple_g,i,j,k)
S3.6: jointly training to minimize the cross-entropy loss L_xent and the metric loss L_triplet; the total loss is L = L_xent + λ·L_triplet, where λ is the balance parameter between L_xent and L_triplet.
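The hard-triplet metric loss of step S3.5 and the joint objective of step S3.6 can be sketched as follows; the hinge form [sp − sn + m]_+ with Euclidean distances is an assumption based on standard triplet-loss practice, and m, λ, and the sample points are illustrative values.

```python
import numpy as np

def triplet_loss(anchor, hardest_pos, hardest_neg, m=0.3):
    """Hinge triplet loss [sp - sn + m]_+ with Euclidean distances:
    sp = distance to the farthest positive, sn = to the nearest negative."""
    sp = np.linalg.norm(anchor - hardest_pos)
    sn = np.linalg.norm(anchor - hardest_neg)
    return max(sp - sn + m, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])    # positive feature close to the anchor
ng = np.array([2.0, 0.0])   # negative feature far from the anchor
L_trip = triplet_loss(a, p, ng)
print(L_trip)               # 0.0 : this triplet already satisfies the margin

# Joint objective of step S3.6 with a balance parameter lambda.
L_xent, lam = 1.2, 0.5
L_total = L_xent + lam * L_trip
print(L_total)              # 1.2
```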
7. The method for pedestrian re-identification based on the multi-component self-attention mechanism as claimed in claim 6, wherein the step S4 comprises the sub-steps of:
S4.1: the query picture set A = {a_1, a_2, a_3, ..., a_M} and the candidate picture set G = {g_1, g_2, g_3, ..., g_T} are each passed through the backbone model B and the multi-component self-attention network ANet to obtain the corresponding feature sets:
F_A = {F_A,up_pcpa, F_A,down_pcpa, F_A,pca, F_A,g}
F_G = {F_G,up_pcpa, F_G,down_pcpa, F_G,pca, F_G,g}
S4.2: fusing the component features of F_A and F_G respectively by concatenation along the feature dimension to obtain the overall features f_A and f_G;
S4.3: calculating the Euclidean distances between f_A and f_G to construct a distance matrix S ∈ R^(M×T), and sorting by distance to obtain the retrieval candidate list; where M is the number of pictures in the query picture set A and T is the number of pictures in the candidate picture set G.
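Steps S4.1-S4.3 amount to building a pairwise Euclidean distance matrix between fused query and gallery features and ranking the gallery per query; a sketch with illustrative dimensions and random stand-in features:

```python
import numpy as np

rng = np.random.default_rng(2)
M, T, dim = 2, 4, 6
fA = rng.standard_normal((M, dim))   # fused query features f_A
fG = rng.standard_normal((T, dim))   # fused gallery features f_G
fG[1] = fA[0]                        # plant an exact match for query 0

# Pairwise Euclidean distance matrix S of shape M x T.
S = np.linalg.norm(fA[:, None, :] - fG[None, :, :], axis=2)

# Ranking: for each query, gallery indices sorted by ascending distance.
ranking = np.argsort(S, axis=1)
print(ranking[0][0])                 # planted zero-distance match -> 1
```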
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010467045.5A CN111368815B (en) | 2020-05-28 | 2020-05-28 | Pedestrian re-identification method based on multi-component self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111368815A CN111368815A (en) | 2020-07-03 |
CN111368815B true CN111368815B (en) | 2020-09-04 |
Family
ID=71209699
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010467045.5A Active CN111368815B (en) | 2020-05-28 | 2020-05-28 | Pedestrian re-identification method based on multi-component self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368815B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860291A (en) * | 2020-07-16 | 2020-10-30 | 上海交通大学 | Multi-mode pedestrian identity recognition method and system based on pedestrian appearance and gait information |
CN111914107B (en) * | 2020-07-29 | 2022-06-14 | 厦门大学 | Instance retrieval method based on multi-channel attention area expansion |
CN111931624B (en) * | 2020-08-03 | 2023-02-07 | 重庆邮电大学 | Attention mechanism-based lightweight multi-branch pedestrian heavy identification method and system |
CN112163498B (en) * | 2020-09-23 | 2022-05-27 | 华中科技大学 | Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method |
CN112633089B (en) * | 2020-12-11 | 2024-01-09 | 深圳市爱培科技术股份有限公司 | Video pedestrian re-identification method, intelligent terminal and storage medium |
CN112766156B (en) * | 2021-01-19 | 2023-11-03 | 南京中兴力维软件有限公司 | Riding attribute identification method and device and storage medium |
CN113158739B (en) * | 2021-01-28 | 2024-01-05 | 中山大学 | Method for solving re-identification of replacement person by twin network based on attention mechanism |
CN112836637B (en) * | 2021-02-03 | 2022-06-14 | 江南大学 | Pedestrian re-identification method based on space reverse attention network |
CN113029327B (en) * | 2021-03-02 | 2023-04-18 | 招商局重庆公路工程检测中心有限公司 | Tunnel fan embedded foundation damage identification method based on metric attention convolutional neural network |
CN113095221B (en) * | 2021-04-13 | 2022-10-18 | 电子科技大学 | Cross-domain pedestrian re-identification method based on attribute feature and identity feature fusion |
CN113283320A (en) * | 2021-05-13 | 2021-08-20 | 桂林安维科技有限公司 | Pedestrian re-identification method based on channel feature aggregation |
CN113191338B (en) * | 2021-06-29 | 2021-09-17 | 苏州浪潮智能科技有限公司 | Pedestrian re-identification method, device and equipment and readable storage medium |
CN113705880A (en) * | 2021-08-25 | 2021-11-26 | 杭州远眺科技有限公司 | Traffic speed prediction method and device based on space-time attention diagram convolutional network |
CN113420742B (en) * | 2021-08-25 | 2022-01-11 | 山东交通学院 | Global attention network model for vehicle weight recognition |
CN113920470B (en) * | 2021-10-12 | 2023-01-31 | 中国电子科技集团公司第二十八研究所 | Pedestrian retrieval method based on self-attention mechanism |
CN113837208B (en) * | 2021-10-18 | 2024-01-23 | 北京远鉴信息技术有限公司 | Method and device for determining abnormal image, electronic equipment and storage medium |
CN114120069B (en) * | 2022-01-27 | 2022-04-12 | 四川博创汇前沿科技有限公司 | Lane line detection system, method and storage medium based on direction self-attention |
US11810366B1 (en) | 2022-09-22 | 2023-11-07 | Zhejiang Lab | Joint modeling method and apparatus for enhancing local features of pedestrians |
CN115240121B (en) * | 2022-09-22 | 2023-01-03 | 之江实验室 | Joint modeling method and device for enhancing local features of pedestrians |
CN115795394A (en) * | 2022-11-29 | 2023-03-14 | 哈尔滨工业大学(深圳) | Biological feature fusion identity recognition method for hierarchical multi-modal and advanced incremental learning |
CN116704453B (en) * | 2023-08-08 | 2023-11-28 | 山东交通学院 | Method for vehicle re-identification by adopting self-adaptive division and priori reinforcement part learning network |
CN117593514B (en) * | 2023-12-08 | 2024-05-24 | 耕宇牧星(北京)空间科技有限公司 | Image target detection method and system based on deep principal component analysis assistance |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110642A (en) * | 2019-04-29 | 2019-08-09 | 华南理工大学 | A kind of pedestrian's recognition methods again based on multichannel attention feature |
CN110188611A (en) * | 2019-04-26 | 2019-08-30 | 华中科技大学 | A kind of pedestrian recognition methods and system again introducing visual attention mechanism |
CN110688938A (en) * | 2019-09-25 | 2020-01-14 | 江苏省未来网络创新研究院 | Pedestrian re-identification method integrated with attention mechanism |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9928410B2 (en) * | 2014-11-24 | 2018-03-27 | Samsung Electronics Co., Ltd. | Method and apparatus for recognizing object, and method and apparatus for training recognizer |
CN110175527B (en) * | 2019-04-29 | 2022-03-25 | 北京百度网讯科技有限公司 | Pedestrian re-identification method and device, computer equipment and readable medium |
CN110610129A (en) * | 2019-08-05 | 2019-12-24 | 华中科技大学 | Deep learning face recognition system and method based on self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||