CN111723600A - Pedestrian re-recognition feature descriptor based on multi-task learning - Google Patents

Pedestrian re-recognition feature descriptor based on multi-task learning

Info

Publication number
CN111723600A
Authority
CN
China
Prior art keywords
features
network
layer
lomo
deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910205685.6A
Other languages
Chinese (zh)
Other versions
CN111723600B (en)
Inventor
何小海 (He Xiaohai)
刘康凝 (Liu Kangning)
熊淑华 (Xiong Shuhua)
Other inventors requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201910205685.6A priority Critical patent/CN111723600B/en
Publication of CN111723600A publication Critical patent/CN111723600A/en
Application granted granted Critical
Publication of CN111723600B publication Critical patent/CN111723600B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification feature descriptor based on multi-task learning. It adopts a twin network structure with paired inputs, sending Local Maximal Occurrence (LOMO) features and deep features together into the network and mapping them into a single feature space for training, forming a new network model, TDFN (Traditional and Deep Feature Fusion Network). Exploiting the self-learning property of neural networks, the loss functions of multiple tasks are combined to update the network, so that the deep features learn more detail information complementary to the hand-crafted local features, yielding new discriminative features. Experiments show that the mean average precision (mAP) and Rank-1 accuracy of the new features are superior to those of the global descriptor extracted directly from the twin network. The method is suitable for application systems in security and surveillance, such as video surveillance analysis and content-based image and video retrieval.

Description

Pedestrian re-recognition feature descriptor based on multi-task learning
Technical Field
The invention relates to the pedestrian re-identification problem in the field of intelligent video surveillance, and in particular to a pedestrian re-identification feature descriptor based on multi-task learning and a new network model, TDFN (Traditional and Deep Feature Fusion Network).
Background
Pedestrian re-identification (Re-ID) aims to match image frames containing the same pedestrian across cameras in surveillance video, and is a challenging topic in the field of intelligent surveillance analysis. It has attracted wide attention in industry and academia because of its important applications in security and surveillance, such as video surveillance analysis and content-based image and video retrieval. A re-identification model generally includes two parts, representation learning and metric learning. In a typical re-identification pipeline, each pedestrian picture is described by a single feature, and these features are then matched in the metric space of a specific task, in which feature vectors of the same pedestrian lie at a smaller distance than feature vectors of different pedestrians.
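As a minimal illustration of this matching stage (not part of the patent), the following Python sketch ranks a gallery of descriptors by Euclidean distance to a query descriptor; random vectors stand in for real features:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.random(1024)            # descriptor of the probe picture
gallery = rng.random((100, 1024))   # descriptors of 100 gallery pictures

# Rank gallery entries by Euclidean distance to the query: with a good
# descriptor, images of the same pedestrian should appear first.
dists = np.linalg.norm(gallery - query, axis=1)
ranking = np.argsort(dists)
print(ranking[:5])                  # indices of the 5 closest gallery images
```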
In real scenes, because of significant changes in viewpoint, illumination, background clutter and occlusion across different cameras, the same pedestrian often looks very different in non-overlapping camera views. Hand-crafted combinations of different visual characteristics can overcome such cross-view changes in the re-identification task and are sometimes more reliable. Among hand-crafted features, color and texture are the two most useful: for example, HSV and LAB color histograms measure the color information of an image, while LBP histograms and Gabor filters describe its texture. Although hand-crafted features have a certain distinctiveness, their performance is inferior to pedestrian features extracted by deep learning. In recent years, many algorithms learn features directly from the raw input picture through neural networks, and different networks have been studied for pedestrian re-identification: a twin (Siamese) network structure that jointly learns an identification loss and a verification loss, a triplet network that learns relative similarities among three images (an anchor image, a positive image and a negative image), and a quadruplet deep network that learns a margin-based hard-example mining strategy from four input images. These methods can effectively learn global pedestrian representations, but they ignore the very rich information around local body positions and may produce suboptimal results in some scenarios. The LOMO feature is a hand-crafted local feature composed of the color and texture histogram information of local blocks and contains rich detail information. On this basis, the LOMO features are complementary to the deep features learned by the twin network.
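For context, a heavily simplified sketch of such a hand-crafted local descriptor is given below: joint HSV color histograms over overlapping patches, max-pooled across all patches at the same height. The patch size, stride, and bin counts are illustrative assumptions; the actual LOMO descriptor additionally uses SILTP texture histograms and multiple scales.

```python
import cv2
import numpy as np

def simplified_lomo(image_bgr, patch=10, stride=5, bins=8):
    """Simplified LOMO-style descriptor: joint HSV histograms over local
    patches, max-pooled over all patches in the same horizontal stripe.
    Illustrative only; real LOMO also uses SILTP texture and several scales."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    h, w, _ = hsv.shape
    stripes = []
    for y in range(0, h - patch + 1, stride):
        hists = []
        for x in range(0, w - patch + 1, stride):
            p = hsv[y:y + patch, x:x + patch].reshape(-1, 3)
            # OpenCV hue spans 0-179, so the upper hue bins stay empty;
            # acceptable for a sketch.
            hist, _ = np.histogramdd(p, bins=(bins, bins, bins),
                                     range=((0, 256), (0, 256), (0, 256)))
            hists.append(hist.ravel())
        # Max pooling across patches at the same height keeps the strongest
        # local response, giving some robustness to viewpoint changes.
        stripes.append(np.max(hists, axis=0))
    return np.concatenate(stripes)
```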
Disclosure of Invention
The invention provides a pedestrian re-identification feature descriptor based on multi-task learning. It adopts a twin network structure with paired inputs, sending Local Maximal Occurrence (LOMO) features and deep features together into the network and mapping them into a single feature space for training, forming a new network model, TDFN (Traditional and Deep Feature Fusion Network). Exploiting the self-learning property of neural networks, the loss functions of multiple tasks are combined to update the network, so that the deep features learn more detail information complementary to the hand-crafted local features, yielding new discriminative features.
The invention achieves this purpose through the following technical scheme:
(1) The deep features and Local Maximal Occurrence (LOMO) features of the paired pictures are extracted, and the dimensionality of the LOMO features is reduced using a fully connected layer.
(2) The deep features and the dimension-reduced LOMO features are fed into a network and mapped into a single feature space for training.
(3) The network performs multi-task learning: it not only analyses the pedestrian similarity of the two pictures but also predicts the pedestrian identity in each picture.
(4) The loss functions of the multiple tasks are combined and, exploiting the self-learning of the neural network, the detail information in the LOMO features influences the learning of the deep features.
Drawings
FIG. 1 is a framework diagram of the pedestrian re-identification feature descriptor based on multi-task learning;
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
the TDFN model network structure is specifically as follows:
the model adopts a twin network structure, comprising two CNN models (resulting from the removal of the last layer of FC by the ResNet-50 network), and the two CNN models share weights. Two pictures are input, and two deep features are output by the two CNN models. In addition, LOMO characteristics of the two pictures are extracted and sent to a full connection layer to reduce dimensionality, so that the huge difference between the dimensionality of the two characteristics can be alleviated for fusion. And then, the deep features extracted by the twin network and the LOMO features subjected to dimensionality reduction are sent to a Merge1 layer and a Merge2 layer together for fusion of the two features, and then the deep features and the LOMO features are sent to an FC3 layer and an FC4 layer together for learning, so that two new features are obtained. The network has three tasks (two tasks for predicting the identity of the pedestrian and one task for acquiring the similarity of the pedestrian of two images), the loss functions generated by each task are weighted together to update the network, the self-learning characteristic of the neural network is utilized, the parameters of the image convolution kernel are optimized, the deep features are promoted to learn more detailed information which is complementary with the LOMO features, and therefore the new features with discriminative power are obtained.
The new fused features of the TDFN model are specifically as follows:
To obtain a better feature representation, a large number of images is required for model training. However, re-identification datasets do not contain that many images, so the invention extracts deep features with a twin network whose parameters are pre-trained on ImageNet. Although the features generated by the twin network learn an effective global pedestrian representation, they ignore the very rich information around local body positions, whereas the LOMO feature is a hand-crafted local feature composed of the color and texture histogram information of local blocks and contains rich detail information; the two kinds of features are therefore complementary. Accordingly, the LOMO features and deep features of the paired pictures are extracted and sent into the network for multi-task training; using the principle of back propagation, the network is updated with a weighted combination of the different task losses, so that the extraction of the deep features is regularized by the detail information in the local hand-crafted features and new discriminative features are obtained. Two pictures $p_i$ and $p_j$ are input, the deep features and LOMO features of the two pictures are acquired and sent together into the Merge1 layer and the Merge2 layer, and two new features $f_{new}^{1}$ and $f_{new}^{2}$ are formed through the fully connected layers FC3 and FC4. The inputs to the FC3 and FC4 layers are:

$x_1 = [LOMO_1, Deep\_Feature_1]$  (1)
$x_2 = [LOMO_2, Deep\_Feature_2]$  (2)

The outputs of the FC3 and FC4 layers are:

$f_{new}^{1} = h(W_1 x_1 + b_1)$  (3)
$f_{new}^{2} = h(W_2 x_2 + b_2)$  (4)

where $h(\cdot)$ is the activation function; the FC3 and FC4 layers use ReLU together with a dropout layer (dropout rate 0.5) to avoid learning redundant representations and prevent overfitting. According to the principle of back propagation, the weights and biases of the $z$-th layer after one iteration are:

$W_z \leftarrow W_z - \alpha \, \partial LOSS_{Multi} / \partial W_z$  (5)
$b_z \leftarrow b_z - \alpha \, \partial LOSS_{Multi} / \partial b_z$  (6)
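Equations (1)-(6) amount to a single fully connected layer followed by a plain SGD step; the toy numpy sketch below (all dimensions illustrative, with a placeholder gradient) makes the shapes concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
lomo = rng.random(512)                # dimension-reduced LOMO feature
deep = rng.random(2048)               # deep feature from the backbone
x1 = np.concatenate([lomo, deep])     # eq. (1): x1 = [LOMO1, Deep_Feature1]

W1 = rng.standard_normal((1024, x1.size)) * 0.01
b1 = np.zeros(1024)
relu = lambda v: np.maximum(v, 0.0)   # h(.) with ReLU activation
f_new1 = relu(W1 @ x1 + b1)           # eq. (3): f_new1 = h(W1 x1 + b1)

# Eqs. (5)/(6): vanilla SGD once the loss gradient is known; a random
# placeholder stands in for dLOSS_Multi/dW1 here.
alpha = 0.01
dW1 = rng.standard_normal(W1.shape)
W1 -= alpha * dW1
```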
the new characteristics of the joint multitask lost learning are as follows:
in the TDFN network, such a joint loss based on multi-task learning can better extract features by not only effectively extracting features for each image, but also comparing pairs of pictures through a deep network. The two new features of the full-connection layer are learned by joint loss of multi-task learning, and extraction of deep features is influenced by local blocks in manual features through back propagation, so that the two features are subjected to complementary learning. The model comprises three tasks, wherein the three tasks comprise a task for acquiring the similarity of the pedestrians and two tasks for predicting the identity of the pedestrians, and the specific process comprises the following steps:
Acquiring pedestrian similarity: the two pedestrian descriptors $f_{new}^{1}$ and $f_{new}^{2}$ of the FC3 and FC4 layers enter a Square layer, which computes their squared difference element by element:

$f_s = (f_{new}^{1} - f_{new}^{2})^2$  (7)

A convolutional layer then converts $f_s$ into a two-dimensional vector $\hat{q}$ representing the similarity of the two pictures:

$\hat{q} = \mathrm{sigmoid}(\theta_s \circ f_s)$  (8)

where $\theta_s$ denotes the parameters of the convolutional layer, $\circ$ denotes the convolution operation, and sigmoid is the activation function. Comparing the similarity score $\hat{q}$ with the true matching degree of the two pictures yields the verification loss, calculated as:

$LOSS_v = -\sum_{n=1}^{2} q_n \log \hat{q}_n$  (9)

where, when $p_i$ and $p_j$ are the same person, $q_1 = 1$ and $q_2 = 0$; otherwise $q_1 = 0$ and $q_2 = 1$.
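A minimal PyTorch rendering of this verification branch, assuming the convolutional score layer is replaced by an equivalent linear layer over the flat feature vector and a two-class cross-entropy in place of the sigmoid formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def verification_loss(f1, f2, same, verif_head):
    """Square layer plus two-way similarity score plus cross-entropy,
    following eqs. (7)-(9). `same` holds 1 for a matching pair, else 0."""
    fs = (f1 - f2) ** 2                   # eq. (7): element-wise squared diff
    logits = verif_head(fs)               # 2-dim similarity vector q_hat
    return F.cross_entropy(logits, same)  # eq. (9) as two-class cross-entropy

# Usage with illustrative shapes:
head = nn.Linear(1024, 2)
f1, f2 = torch.randn(4, 1024), torch.randn(4, 1024)
loss_v = verification_loss(f1, f2, torch.tensor([1, 0, 1, 0]), head)
```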
Predicting pedestrian identity: each pedestrian descriptor $f_{new}^{i}$ is fed into a convolutional layer and mapped into a one-dimensional vector of size $K$, where $K$ equals the number of pedestrian identities in the dataset. A Softmax layer then predicts the pedestrian identity; its output is:

$\hat{p}_i = \mathrm{softmax}(\theta_{id} \circ f_{new}^{i})$  (10)

where $\theta_{id}$ denotes the parameters of the convolutional layer, $\circ$ denotes the convolution operation, and $\hat{p}_i$ predicts the identity of the corresponding input picture. Comparing $\hat{p}_i$ with the true identity label of the corresponding picture yields the identification loss:

$LOSS_{id} = -\sum_{k=1}^{K} p_k \log \hat{p}_k$  (11)

where $p_k$ represents the identity label of the input picture: when the input picture has identity $t$, $p_t = 1$ and $p_k = 0$ for all other $k$.
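The identity branch is a K-way softmax classifier applied to each fused descriptor; a sketch under the same assumptions (a linear layer in place of the convolutional layer, illustrative dimensions, K = 751 as on Market1501):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 751                          # number of identities in the dataset
id_head = nn.Linear(1024, K)     # stands in for the conv layer of eq. (10)

def identification_loss(f_new, identity):
    """Eqs. (10)-(11): map the fused descriptor to K identity scores and
    apply softmax cross-entropy against the true identity label t."""
    logits = id_head(f_new)
    return F.cross_entropy(logits, identity)

loss_id = identification_loss(torch.randn(4, 1024), torch.tensor([3, 7, 0, 42]))
```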
Finally, the loss function of the network herein is defined as:

$LOSS_{Multi} = LOSS_v + LOSS_{id}$  (12)
The complementary learning of the deep features and the LOMO features is as follows:
During training, suppose the deep feature of a picture is $f$ and its LOMO feature is $\tilde{f}$, so that the input of the fully connected layer FC3 is $x = [\tilde{f}, f]$. Let $w_{ji}^{n}$ be the weight connecting node $j$ of layer $n$ with node $i$ of layer $n-1$. In forward propagation, node $j$ of layer $n$ outputs:

$a_j^n = h(z_j^n)$, where $z_j^n = \sum_i w_{ji}^{n} a_i^{n-1}$  (13)

During training, stochastic gradient descent uses the gradient $\delta_j^n$ generated by the multi-task weighted loss function $LOSS_{Multi}$ together with the learning rate $\alpha$ to update the weights $w_{ji}^{n}$ and optimize the network, where the gradient $\delta_j^n$ is calculated as:

$\delta_j^n = \partial LOSS_{Multi} / \partial z_j^n$  (14)

Suppose the output of node $q$ of the FC3 layer (the 6th layer) is $a_q^6$. Then, from equations (13), (14) and (5):

$\delta_q^6 = \partial LOSS_{Multi} / \partial a_q^6 \cdot h'(z_q^6)$  (15)
$\partial LOSS_{Multi} / \partial w_{qi}^{6} = \delta_q^6 \, a_i^5$  (16)
$w_{qi}^{6} \leftarrow w_{qi}^{6} - \alpha \, \delta_q^6 \, a_i^5$  (17)

Thus, in the TDFN network, $\tilde{f}$ influences $w_{qi}^{6}$ in two ways. First, the LOMO feature $\tilde{f}$ propagates forward through $a_i^5$ (the components of $x$), so the weights applied to the deep feature receive the influence of $\tilde{f}$ when the network is updated by back propagation. Second, the output gradient $\delta_q^6$ of the loss function $LOSS_{Multi}$ is also affected by $\tilde{f}$, which in turn affects the weights of the deep features. In this way, complementary learning of the two kinds of features is realized.
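The two influence paths can be observed directly with automatic differentiation: in the toy example below, changing only the LOMO half of the concatenated input changes the gradient that reaches the weight columns multiplying the deep half (sizes and the quadratic stand-in loss are illustrative):

```python
import torch

torch.manual_seed(0)
deep = torch.randn(8)                        # deep feature f (held fixed)
W = torch.randn(4, 12, requires_grad=True)   # fused-layer weights, eq. (13)

def grad_on_deep_weights(lomo):
    """Gradient of a stand-in loss w.r.t. the weight columns that multiply
    the deep feature, for a given LOMO input (cf. eqs. (13)-(17))."""
    x = torch.cat([lomo, deep])              # x = [f_tilde, f]
    loss = (W @ x).pow(2).sum()              # quadratic stand-in for LOSS_Multi
    grad, = torch.autograd.grad(loss, W)
    return grad[:, 4:]                       # columns acting on the deep part

g1 = grad_on_deep_weights(torch.zeros(4))
g2 = grad_on_deep_weights(torch.ones(4))
print(torch.allclose(g1, g2))                # False: the LOMO input alters
                                             # the update of the deep weights
```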
The invention uses two different metric learning methods to verify the proposed features on the Market1501 and DukeMTMC-ReID databases, and compares them with a baseline model and several mainstream algorithms. Evaluation is performed under the single-query setting using two indices: Rank-k accuracy (k = 1, 5, 10) and mean average precision (mAP). The experimental results are shown in Tables 1, 2 and 3:
Table 1: Comparison with the baseline model (table reproduced as an image in the original publication)
Table 2: Comparison with mainstream algorithms on the Market1501 dataset (table reproduced as an image in the original publication)
Table 3: Comparison with mainstream algorithms on the DukeMTMC-ReID dataset (table reproduced as an image in the original publication)
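For reference, simplified single-query versions of the two reported metrics can be computed per query as in the sketch below; the real evaluation protocol additionally discards gallery images of the same identity taken by the same camera, which is omitted here:

```python
import numpy as np

def rank_k_and_ap(sorted_gallery_labels, query_label, ks=(1, 5, 10)):
    """CMC Rank-k hits and average precision for one query, given gallery
    labels sorted by ascending descriptor distance (simplified protocol)."""
    hits = np.asarray(sorted_gallery_labels) == query_label
    rank_k = {k: bool(hits[:k].any()) for k in ks}
    precision_at = np.cumsum(hits) / (np.arange(hits.size) + 1)
    ap = (precision_at * hits).sum() / max(hits.sum(), 1)
    return rank_k, ap

# Example: the true matches sit at ranks 2 and 5 of the sorted gallery.
print(rank_k_and_ap([7, 3, 8, 1, 3, 9], query_label=3))
```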

Claims (5)

1. A pedestrian re-identification feature descriptor based on multi-task learning, characterized by comprising the following steps:
(1) extracting the Local Maximal Occurrence (LOMO) features and deep features of the paired pictures, and reducing the dimensionality of the LOMO features using a fully connected layer;
(2) sending the deep features and the dimension-reduced LOMO features into a network and mapping them into a single feature space for training, forming a new model, TDFN (Traditional and Deep Feature Fusion Network);
(3) the TDFN model uses a multi-task learning network that not only analyses the pedestrian similarity of the paired pictures but also predicts the pedestrian identity in each picture;
(4) the loss functions of the multiple tasks are combined to update the network and, exploiting the self-learning characteristic of the neural network, prompt the deep features to learn more detail information complementary to the LOMO features.
2. The method of claim 1, wherein in step (1) the deep features are extracted using a twin network, the twin network comprising two CNN models pre-trained on ImageNet, the backbone of each CNN model being a ResNet-50 network with its last FC layer removed; in addition, the LOMO features of the paired pictures are extracted and reduced in dimension using a fully connected layer, which mitigates the large difference between the dimensions of the two kinds of features for fusion.
3. The method according to claim 1, wherein in step (2) the deep features and the dimension-reduced LOMO features of the two pictures are combined in the Merge1 layer and the Merge2 layer, and two new features $f_{new}^{1}$ and $f_{new}^{2}$ are formed through the fully connected FC3 and FC4 layers; the two new features are then sent into the multi-task learning network for training. With two input pictures $p_i$ and $p_j$, the inputs to the FC3 and FC4 layers are:

$x_1 = [LOMO_1, Deep\_Feature_1]$  (1)
$x_2 = [LOMO_2, Deep\_Feature_2]$  (2)

and the outputs of the FC3 and FC4 layers are:

$f_{new}^{1} = h(W_1 x_1 + b_1)$  (3)
$f_{new}^{2} = h(W_2 x_2 + b_2)$  (4)
4. The method according to claim 1, wherein in step (3) features are not only extracted effectively from each image, but the paired images are also compared through the deep network; the model has three tasks, comprising one task of acquiring pedestrian similarity and two tasks of predicting pedestrian identity, the specific processes being as follows:
acquiring pedestrian similarity: the two pedestrian descriptors $f_{new}^{1}$ and $f_{new}^{2}$ of the FC3 and FC4 layers enter a Square layer, which computes their squared difference element by element:

$f_s = (f_{new}^{1} - f_{new}^{2})^2$  (5)

a convolutional layer then converts $f_s$ into a two-dimensional vector $\hat{q}$ representing the similarity of the two pictures; comparing the similarity score with the true matching degree of the two pictures yields the verification loss, calculated as:

$LOSS_v = -\sum_{n=1}^{2} q_n \log \hat{q}_n$  (6)

predicting pedestrian identity: each pedestrian descriptor $f_{new}^{i}$ is fed to a Softmax layer that predicts the pedestrian identity, with output:

$\hat{p}_i = \mathrm{softmax}(\theta_{id} \circ f_{new}^{i})$  (7)

comparing $\hat{p}_i$ with the true identity label of the corresponding picture yields the identification loss:

$LOSS_{id} = -\sum_{k=1}^{K} p_k \log \hat{p}_k$  (8)
5. The method of claim 1, wherein in step (4) the gradient of a node during deep feature learning is influenced by the LOMO features in two ways: first, information in the LOMO features propagates through the ReLU function of the fully connected layer, so that the extraction of the deep features adapts the convolution kernels to the LOMO features, realizing forward complementarity; second, the output gradient of the multi-task weighted loss function $LOSS_{Multi}$ is also affected by the LOMO features, where $LOSS_{Multi}$ is calculated as:

$LOSS_{Multi} = LOSS_v + LOSS_{id}$  (9)

On this basis, the deep features learn detail information complementary to the LOMO features when back propagation updates the network.
CN201910205685.6A 2019-03-18 2019-03-18 Pedestrian re-recognition feature descriptor based on multi-task learning Active CN111723600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910205685.6A CN111723600B (en) 2019-03-18 2019-03-18 Pedestrian re-recognition feature descriptor based on multi-task learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910205685.6A CN111723600B (en) 2019-03-18 2019-03-18 Pedestrian re-recognition feature descriptor based on multi-task learning

Publications (2)

Publication Number Publication Date
CN111723600A true CN111723600A (en) 2020-09-29
CN111723600B CN111723600B (en) 2022-07-05

Family

ID=72562847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910205685.6A Active CN111723600B (en) 2019-03-18 2019-03-18 Pedestrian re-recognition feature descriptor based on multi-task learning

Country Status (1)

Country Link
CN (1) CN111723600B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507835A (en) * 2020-12-01 2021-03-16 燕山大学 Method and system for analyzing multi-target object behaviors based on deep learning technology
CN113449777A (en) * 2021-06-08 2021-09-28 上海深至信息科技有限公司 Automatic thyroid nodule grading method and system
CN114581425A (en) * 2022-03-10 2022-06-03 四川大学 Myocardial segment defect image processing method based on deep neural network
CN114821354A (en) * 2022-04-19 2022-07-29 福州大学 Urban building change remote sensing detection method based on twin multitask network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709449A (en) * 2016-12-22 2017-05-24 深圳市深网视界科技有限公司 Pedestrian re-recognition method and system based on deep learning and reinforcement learning
US20180253596A1 (en) * 2017-03-06 2018-09-06 Conduent Business Services, Llc System and method for person re-identification using overhead view images
CN108596010A (en) * 2017-12-31 2018-09-28 厦门大学 The implementation method of pedestrian's weight identifying system
CN108960141A (en) * 2018-07-04 2018-12-07 国家新闻出版广电总局广播科学研究院 Pedestrian's recognition methods again based on enhanced depth convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709449A (en) * 2016-12-22 2017-05-24 深圳市深网视界科技有限公司 Pedestrian re-recognition method and system based on deep learning and reinforcement learning
US20180253596A1 (en) * 2017-03-06 2018-09-06 Conduent Business Services, Llc System and method for person re-identification using overhead view images
CN108596010A (en) * 2017-12-31 2018-09-28 厦门大学 The implementation method of pedestrian's weight identifying system
CN108960141A (en) * 2018-07-04 2018-12-07 国家新闻出版广电总局广播科学研究院 Pedestrian's recognition methods again based on enhanced depth convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENG WANG et al.: "Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-identification", Proceedings of the European Conference on Computer Vision (ECCV) *
IGOR BARROS BARBOSA et al.: "Looking beyond appearances: Synthetic training data for deep CNNs in re-identification", Computer Vision and Image Understanding *
SHANGXUAN WU et al.: "An enhanced deep feature representation for person re-identification", 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) *
QI Meibin et al.: "Person re-identification with multi-feature fusion and the alternating direction method of multipliers", Journal of Image and Graphics *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507835A (en) * 2020-12-01 2021-03-16 燕山大学 Method and system for analyzing multi-target object behaviors based on deep learning technology
CN113449777A (en) * 2021-06-08 2021-09-28 上海深至信息科技有限公司 Automatic thyroid nodule grading method and system
CN114581425A (en) * 2022-03-10 2022-06-03 四川大学 Myocardial segment defect image processing method based on deep neural network
CN114581425B (en) * 2022-03-10 2022-11-01 四川大学 Myocardial segment defect image processing method based on deep neural network
CN114821354A (en) * 2022-04-19 2022-07-29 福州大学 Urban building change remote sensing detection method based on twin multitask network
CN114821354B (en) * 2022-04-19 2024-06-07 福州大学 Urban building change remote sensing detection method based on twin multitasking network

Also Published As

Publication number Publication date
CN111723600B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN111723600B (en) Pedestrian re-recognition feature descriptor based on multi-task learning
CN107408211B (en) Method for re-identification of objects
WO2020228525A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
CN106919920B (en) Scene recognition method based on convolution characteristics and space vision bag-of-words model
Kagaya et al. Highly accurate food/non-food image classification based on a deep convolutional neural network
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109063649B (en) Pedestrian re-identification method based on twin pedestrian alignment residual error network
CN110457515B (en) Three-dimensional model retrieval method of multi-view neural network based on global feature capture aggregation
Xia et al. Loop closure detection for visual SLAM using PCANet features
CN110222718B (en) Image processing method and device
Haque et al. Two-handed bangla sign language recognition using principal component analysis (PCA) and KNN algorithm
CN110046544A (en) Digital gesture identification method based on convolutional neural networks
CN104281572A (en) Target matching method and system based on mutual information
CN112507778B (en) Loop detection method of improved bag-of-words model based on line characteristics
US11908222B1 (en) Occluded pedestrian re-identification method based on pose estimation and background suppression
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN115280373A (en) Managing occlusions in twin network tracking using structured dropping
CN113361549A (en) Model updating method and related device
CN112906520A (en) Gesture coding-based action recognition method and device
CN109255339A (en) Classification method based on adaptive depth forest body gait energy diagram
CN111291785A (en) Target detection method, device, equipment and storage medium
Ahmad et al. Embedded deep vision in smart cameras for multi-view objects representation and retrieval
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant