CN110728238A

CN110728238A - Personnel re-detection method of fusion type neural network

Info

Publication number: CN110728238A
Application number: CN201910970957.1A
Authority: CN
Inventors: 杨会成; 潘玥; 储慧敏
Original assignee: Anhui Polytechnic University
Current assignee: Anhui Polytechnic University
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-01-24

Abstract

The invention provides a personnel retest method of a fusion type neural network, which aims at the problems of small data set and insufficient training samples and proposes to enlarge the data set by utilizing an improved generation type countermeasure network (GAN); then, the problem of feature extraction is effectively solved by using a scheme of combining a color model HSV and a scale-invariant local ternary pattern with neural network feature fusion; furthermore, in the classification model, the similarity measure is calculated using cross entropy loss. In addition, the neural network selects a residual error network with high convergence rate and good extraction effect, and a Dropout layer is added before the final convolution layer, aiming at preventing overfitting. The experimental result shows that when the number of samples is small, the characteristic is effectively extracted through the characteristic fusion scheme, and the method has a certain application value.

Description

Personnel re-detection method of fusion type neural network

Technical Field

The invention relates to the technical field of personnel retest, in particular to a personnel retest method of a fusion type neural network.

Background

With the increasing attention on cross-border tracking technology, people re-detection technology is also a big research hotspot. In practice, the challenge of human re-detection techniques is further increased by the influence of many external factors. In recent years, public safety is more and more emphasized, and camera monitoring is visible everywhere on a road. In security and law enforcement related applications, the ability to identify pedestrians in surveillance cameras is urgently needed. In the traditional method, the operation of identifying the pedestrian is performed by related operators, and the operation generally has manual design characteristics of edges, gradients and the like, which is called manual screening; however, in the world of the internet, the data size is getting larger and larger, and the manual screening method cannot meet the needs of people, so that it is expected to match pedestrians automatically and accurately in different cameras, which is generally called as people re-detection or pedestrian re-identification. In early studies, due to immaturity of related technologies such as image processing, pattern recognition and the like, the development of human re-detection technology has not been advanced; in recent years, the appearance of high-definition cameras and the development of image processing technology enable people to rapidly advance the re-detection technology. However, in the road of development progress, people re-detection technology faces great challenges. Shooting differences exist among different camera devices; meanwhile, pedestrians have diverse changes in movement, and the appearance of pedestrians is easily influenced by clothes, body types, postures, real object shielding, illumination backgrounds and the like.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a personnel retest method of a fusion type neural network, which firstly aims at the problems of small data set and insufficient training samples, and proposes to enlarge the data set by using an improved generation type countermeasure network (GAN); then, the problem of feature extraction is effectively solved by using a scheme of combining a color model HSV and a scale-invariant local ternary pattern with neural network feature fusion; furthermore, in the classification model, the similarity measure is calculated using cross entropy loss. In addition, the neural network selects a residual error network with high convergence rate and good extraction effect, and a Dropout layer is added before the final convolution layer, aiming at preventing overfitting. The experimental result shows that when the number of samples is small, the characteristic is effectively extracted through the characteristic fusion scheme, and the method has a certain application value.

The invention provides a personnel re-detection method of a fusion type neural network, which comprises three parts, wherein the first part relates to a generative confrontation network, and a data set is expanded by utilizing the generative confrontation network GAN; the second part is a convolutional neural network, a generated image generated by using a GAN network is processed by using basic operations of convolution, pooling and function activation, an input image is extracted by combining additional manual features, and then a fusion layer is externally connected to fuse the convolutional neural network and the manually extracted features to obtain an image description feature with more complete representation; and the third part is to apply the minimized cross entropy to the classification model and perform precision sequencing on the human re-detection.

The generation type confrontation network (GAN) belongs to the unsupervised learning network, is rapidly applied to various computer vision fields, and has great success in deep learning and machine learning. The GAN includes a generative model G and a discriminative model D. The generator G learns how to generate an image from an input image, and the discriminator D learns whether the generated image is a real image or a "false image" generated by the generator G, both models learning in a mutual game, also called resist learning, so that the generated image gradually approximates the original image. To further improve the stability of GAN and the quality of generated pictures, two aspects can be considered: firstly, how to search a better model to train the GAN network; the second is how to improve the loss function in GAN. This results in improved network Deep Conditional General Adaptive Networks (DCGAN) and Least Square Generic Adaptive Networks (LSGAN), which use DCGAN in the present invention.

The further improvement lies in that: g and D in the GAN are two convolutional neural networks, the generator G is used for up-sampling by means of deconvolution, the discriminator D is used for replacing pooling by convolution added with stride, the generator G is provided with a Relu layer, and the discriminator D is provided with LeakyRelu. The first distinctive module is an improvement in detail compared to GAN, which replaces G and D in classical GAN with two Convolutional Neural Networks (CNN). However, instead of a direct substitution, upsampling is performed by deconvolution in generator G, posing is replaced by stride-added convolution in discriminator D, Relu layer is used in generator G, and leakyreu is used in discriminator D. First starting with a random vector of 100 dimensions and then scaling it up to 4 x 1024 with a linear function. To expand this tensor, a 4 step-2 micro-step convolution function is used, whose kernel size is 5 × 5. Each micro-stride convolution function is followed by a modified linear unit and a batch unit. In addition, there is an optional deconvolution layer with stride 1, whose kernel size is 5 × 5, and a Tanh activation function is added to fine-tune the result. Finally, a sample size of 64 × 64 × 3 can be generated. The resulting image may then be resized to 256 x 3 using a bilinear sampling method. The pictures are generated after DCGAN is trained on a face data set and a landscape data set, so that the picture set with very high quality can be generated under the condition of high reiteration times by using DCGAN, and the effect of expanding the data set can be achieved.

The training framework of DCGAN is used to train the images of the original data set, adjust all images to 128 x 48 in size before training, and flip them randomly to obtain more conformal data. The number of training iterations was set to 50, and 26000 images of 256 × 256 size were generated.

The feature extraction module is divided into two parts: traditional manual extraction of features and convolutional neural network extraction of features. The input image set is extracted using the conventional color descriptors HSV and the Scale Invariant Local Ternary Pattern (SILTP). SILTP is a well-known Local Binary Pattern (LBP) improvement operator, which has good invariance under monotonic gray scale transformation, but is not robust to image noise. SILTP improves LBP by introducing scale-invariant local comparison tolerances, achieving invariance to intensity scale variations and robustness to image noise.

The further improvement lies in that: in the first part, the image is preprocessed, the image is equivalently segmented, the image is divided into six strips in the horizontal direction, an image pyramid is constructed for each horizontal strip, and a single histogram is calculated, so that more complete multi-scale information is obtained. The resulting histogram achieves some invariance to viewpoint changes while capturing local regional features of the person. The specific operation is as follows: down-sampling of the original 128 x 48 image, with two 2 x 2 slipsAnd performing local average merging operation on the active cell window to describe local details of the person image, wherein the step length of the active cell window is 5 pixels, repeating the process, and finally obtaining two SILTP histograms and a combined HSV histogram of 8 multiplied by bin according to the calculated local maximum occurrence frequency, wherein each histogram box represents the occurrence probability of one mode in the active cell window. To account for viewpoint variations, all sub-windows at the same horizontal position are examined and the local occurrence of each mode in these sub-windows (i.e., the same histogram frame) is maximized. The resulting histogram achieves some invariance to viewpoint changes while capturing local regional features of the person. Using the obtained two single histograms as₁-norm is standardized, and the two are connected into a whole to obtain a fusion Feature, which is called as a joint descriptor FFD (fusion Feature description) with the size of 960. However, spatial information within the bands may also be lost, thereby lacking the necessary characterization, affecting the resolution.

The further improvement lies in that: and generating a picture, and extracting features by using a residual error network ResNet.

Most body parts can be found in the image, but there are some severe distortions and misalignments, so the last fully-connected layer in the network uses higher-level convolutional layers, which consist of a stack of filters of predefined size, which are convolved with the input of the layer. Parameter sharing used by the convolutional layer is more efficient (requiring less computation and memory storage) than dense matrix multiplication. Parameter sharing may also make convolutional layers equivalent to linear translation (i.e., any shift in the input will result in a similar shift in the output). In addition, the pooling layer also adopts corresponding nonlinear expression, which is helpful for the normalization of feature extraction.

The DCGAN generated image data set uses the neural network ResNet18 to extract features, and deeper networks are not selected, so that the defects of high training requirement, long time and the like caused by gradient explosion prevention and increase of the number of layers are overcome. In addition, the residual error network can well solve the problems of degradation and the like through a residual error function, network optimization is achieved, and the convergence is quicker.

All training images in the present invention are adjusted to 256 × 256 size before neural network training, then randomly flipped horizontally and cropped to 224 × 224 size. When the neural network is trained, the learning rate is given to be 0.05, then the learning rate is reduced to be 0.001 after 40 times of iterative training, and the iterative training is carried out on a training set for 40 times respectively. The training strategy applies a small batch of random Gradient descent (sgd) (stochastic Gradient present) to modify the parameters to achieve faster back propagation and faster convergence. And a Dropout layer is added before the last convolutional layer to prevent over-fitting. And (4) drawing a training and checking loss curve, and showing that the training effect is basically optimal after 30-40 iterations.

The further improvement lies in that: the operator of manual feature extraction is combined with the CNN extraction operator, and through a fusion layer, two feature extraction channels are complemented to obtain a feature of 2048 tensor to represent image input as follows: x ═ HS, RN _ Features]The output is calculated by the following formula:

where h (-) represents the activation function.

The further improvement lies in that: a downsampling layer and a ReLU layer are adopted, the downsampling rate is set to be 0.5, the neural network adopts a feedback propagation algorithm, and the iteration formula is as follows:

the aim is to effectively extract complete features on each image, and not to extract features for comparison by using a neural network. Therefore, instead of using the verification model, a classification model is chosen. Efficient characterization results in lower losses. Here, a softmax loss function is selected for application in the model. For a single input vector x and a single output node last layer, the penalty can be calculated by the following formula:

the last layer of the network is designed to minimize cross-entropy loss:

the invention has the beneficial effects that: firstly, expanding a VIPeR data set by adopting an improved GAN network to obtain 26000 256 × 256 image sets, which can well reduce overfitting conditions for feature extraction; thirdly, combining manual extraction and neural network extraction, performing feature extraction on the input data set and the generated data set, and obtaining more complete feature expression through a fusion layer; and finally, sorting the similarity of the designated image and the images in the test set by calculating cross entropy loss through a classification model and adopting Rank-k evaluation indexes. Compared with the result of the traditional algorithm, the performance of the obtained result is improved to a certain extent. A learning method that operates by multiple layers of linearity and nonlinearity simultaneously learned in an end-to-end manner using a deep neural network. In order to accurately extract semantic features with good robustness, parameters of the layers are learned through multiple iterations. To expand the data set, a large number of high quality clear pictures are generated using a generative confrontation network GAN. Therefore, under the condition of a large number of levels of feature extraction, the initial method is effectively improved, and the performance is improved to a certain extent.

Drawings

FIG. 1 is a diagram of the Re-ID framework of the present invention.

Fig. 2 is a diagram of a DCGAN network architecture of the present invention.

Fig. 3 is a diagram of a residual network setup unit of the present invention.

FIG. 4 is a residual network training accuracy and check loss curve of the present invention.

Detailed Description

For the purpose of enhancing understanding of the present invention, the present invention will be further described in detail with reference to the following examples, which are provided for illustration only and are not to be construed as limiting the scope of the present invention. As shown in fig. 1-4, the present embodiment provides a method for re-detecting persons by using a fusion type neural network, which is divided into three parts, wherein the first part relates to a generative confrontation network, and a data set is expanded by using the generative confrontation network GAN; the second part is a convolutional neural network, a generated image generated by using a GAN network is processed by using basic operations of convolution, pooling and function activation, an input image is extracted by combining additional manual features, and then a fusion layer is externally connected to fuse the convolutional neural network and the manually extracted features to obtain an image description feature with more complete representation; and the third part is to apply the minimized cross entropy to the classification model and perform precision sequencing on the human re-detection.

G and D in the GAN are two convolutional neural networks, the generator G is used for up-sampling by means of deconvolution, the discriminator D is used for replacing pooling by convolution added with stride, the generator G is provided with a Relu layer, and the discriminator D is provided with LeakyRelu.

In the first part, the image is preprocessed, equivalent segmentation is carried out on the image, the image is divided into six strips in the horizontal direction, an image pyramid is constructed for each horizontal strip, a single histogram is calculated, the obtained histogram realizes certain invariance of viewpoint change, and meanwhile, local region characteristics of people are captured.

And generating a picture, and extracting features by using a residual error network ResNet.

The operator of manual feature extraction is combined with the CNN extraction operator, and through a fusion layer, two feature extraction channels are complemented to obtain a feature of 2048 tensor to represent image input as follows: x ═ HS, RN _ Features]The output is calculated by the following formula:

where h (-) represents the activation function.

A downsampling layer and a ReLU layer are adopted, the downsampling rate is set to be 0.5, the neural network adopts a feedback propagation algorithm, and the iteration formula is as follows:

the aim is to effectively extract complete features on each image, and not to extract features for comparison by using a neural network. Therefore, instead of using the verification model, a classification model is chosen. Efficient feature representation will result in lowerIs lost. Here, a softmax loss function is selected for application in the model. For a single input vector x and a single output node last layer, the penalty can be calculated by the following formula:the last layer of the network is designed to minimize cross-entropy loss:

compared with the traditional personnel re-detection algorithm, the method is correspondingly optimized when the network framework is constructed (1) when the image sets are few, useful information description is lacked in the feature extraction module, so that the phenomena of overfitting and the like are caused, the DCGAN network is adopted to generate a large number of pictures, the input data set is expanded, and the situation can be effectively eliminated; (2) the traditional manual feature extraction and neural network combination is adopted in a feature extraction module, the color information and the texture information are combined, more complete feature representation can be extracted, and a ResNet network is adopted, and extra convolutional layers, down-sampling and other operations are added, so that a better training effect can be achieved; (3) dropout is introduced into the classification model through the cross entropy model, so that the overfitting phenomenon can be reduced. Therefore, the training precision on the VIPeR data set is higher than that of the traditional algorithm, and the specific conditions are as follows in table 1:

table 1 individual person search evaluation table

According to the table, the proposed method combining the manual extraction and the neural network extraction features has better detection effect than the traditional feature learning and metric learning. The accuracy of conventional single feature extraction cannot meet the requirements. Further development can be continued on shortening the training running time in the future, and the obtained fusion features can be reprocessed.

Claims

1. A personnel re-detection method of a fusion type neural network is characterized in that: the method is divided into three parts, wherein the first part relates to a generative confrontation network, and a data set is expanded by using the generative confrontation network GAN; the second part is a convolutional neural network, a generated image generated by using a GAN network is processed by using basic operations of convolution, pooling and function activation, an input image is extracted by combining additional manual features, and then a fusion layer is externally connected to fuse the convolutional neural network and the manually extracted features to obtain an image description feature with more complete representation; and the third part is to apply the minimized cross entropy to the classification model and perform precision sequencing on the human re-detection.

2. The method for retesting persons in a fused neural network as set forth in claim 1, wherein: g and D in the GAN are two convolutional neural networks, the generator G is used for up-sampling by means of deconvolution, the discriminator D is used for replacing pooling by convolution added with stride, the generator G is provided with a Relu layer, and the discriminator D is provided with LeakyRelu.

3. The method for retesting persons in a fused neural network as set forth in claim 1, wherein: in the first part, the image is preprocessed, equivalent segmentation is carried out on the image, the image is divided into six strips in the horizontal direction, an image pyramid is constructed for each horizontal strip, a single histogram is calculated, the obtained histogram realizes certain invariance of viewpoint change, and meanwhile, local region characteristics of people are captured.

4. The human retest method of the fused neural network of claim 3, wherein: and generating a picture, and extracting features by using a residual error network ResNet.

5. The method for retesting persons in a fused neural network as set forth in claim 1, wherein: the operator of manual feature extraction is combined with the CNN extraction operator, and two feature extraction channels are complemented through a fusion layer to obtain a feature of 2048 tensor to representThe image input is: x ═ HS, RN _ Features]The output is calculated by the following formula:

where h (-) represents the activation function.

6. The human retest method of the fused neural network of claim 5, wherein: a downsampling layer and a ReLU layer are adopted, the downsampling rate is set to be 0.5, the neural network adopts a feedback propagation algorithm, and the iteration formula is as follows: