CN113378729B - Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding - Google Patents

Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding

Info

Publication number
CN113378729B
Authority
CN
China
Prior art keywords
image
pedestrian
feature
images
resnet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110667913.9A
Other languages
Chinese (zh)
Other versions
CN113378729A (en)
Inventor
廖开阳
雷浩
郑元林
章明珠
范冰
黄港
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an University of Technology
Original Assignee
Xi'an University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an University of Technology
Priority to CN202110667913.9A
Publication of CN113378729A
Application granted
Publication of CN113378729B
Legal status: Active (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding, which comprises the following steps: an original pedestrian image is preprocessed by random erasing to obtain a pedestrian image, and baseline network optimization is performed on a ResNet-50 network model to extract deep convolution features; a salient human body image is extracted from the original pedestrian image; pose extraction is first performed on the salient human body image, and local semantic features are then extracted from the body part images; the deep convolution features and the local semantic features are weighted and fused, and distance measurement is performed on the weighted fused features to generate an initial measurement list; the images in the initial measurement list are re-ranked according to a re-ranking algorithm to obtain the correct matching ranking of the images, and pedestrian matching images are output to identify specific pedestrians. The accuracy of identification and localization can be greatly improved.

Description

Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding
Technical Field
The invention belongs to the technical field of image processing methods, and relates to a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding.
Background
In recent years, artificial intelligence has been an important focus of technological development and has taken the lead among the technologies we use; its application in intelligent monitoring has likewise become extremely important. With the expansion of cities, monitoring systems have become more widespread, with thousands of cameras distributed across streets. As the number of cameras grows, relying solely on manual monitoring is extremely expensive, and no human operator can watch so many feeds at once. The pedestrian re-identification technique has therefore attracted the attention of researchers: it helps people monitor, track and identify pedestrians. Since humans mainly receive and perceive information from the outside world through vision, and human vision can extract the desired information directly from cluttered images, researchers want cameras to capture objects in the environment as effectively and quickly as the human visual system does; this goal ultimately led to the current pedestrian re-identification technique. Pedestrian re-identification is widely used; for example, intelligent monitoring systems require it. The technique relies on the strong processing power of computers: a video monitoring system can automatically filter out useless information and actively identify human bodies, enabling comprehensive monitoring and a 24-hour system for early warning and post-warning evidence collection. Pedestrian traffic statistics also use this technique: it likewise exploits the computer's processing power to filter out unwanted information and to identify and count pedestrians automatically, while pedestrians appearing multiple times in different areas are not counted repeatedly, so that pedestrian traffic can be counted effectively and accurately.
One key factor affecting the accuracy of pedestrian re-identification is pedestrian misalignment; the mutual occlusion among body parts and the continuous variation of posture that it causes are a great challenge for pedestrian re-identification research. First, the posture of a pedestrian changes constantly during movement, and the pedestrian inevitably adopts various postures, which means that local changes of the body within the bounding box are unpredictable. For example, pedestrians may put their hands behind their backs or over their heads while moving, creating local occlusion from misalignment, which has a significant impact on the extracted features. Second, detection when pedestrians are irregularly aligned also affects the accuracy of pedestrian re-identification research. One method commonly used in the pedestrian re-identification field is to divide the bounding box into horizontal stripes; however, this method holds up only under slight vertical deviation. When the vertical deviation is severe, the stripes covering the body and head may be matched against background, causing false matches in the pedestrian re-identification task; the horizontal striping method is thus not ideal under severe misalignment. Moreover, as a pedestrian changes posture the background changes continuously, so the convolutional neural network may erroneously weight the background and degrade recognition accuracy. Therefore, how to overcome the misalignment and background variation caused by changes in pedestrian posture is the key to improving the accuracy of pedestrian re-identification.
Disclosure of Invention
The invention aims to provide a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding, which solves the problem in the prior art of low pedestrian re-identification accuracy caused by the misalignment and background variation that result from changes in pedestrian posture.
The technical scheme adopted by the invention is that the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding comprises the following steps:
step 1, preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, performing baseline network optimization on a ResNet-50 network model, and inputting the pedestrian image into the optimized ResNet-50 network model to obtain deep convolution features;
step 2, extracting features from the original pedestrian image as the input image to obtain a salient human body image;
step 3, firstly performing pose extraction on the salient human body image using a pose convolution machine to obtain body part images, and then inputting the body part images into a ResNet-50 network to extract local semantic features;
step 4, performing weighted fusion on the deep convolution features and the local semantic features to obtain weighted fused features, measuring the distances between the images in the image test library and the image query library and the fused features respectively, and generating an initial measurement list from the measured results;
and step 5, re-ranking the images in the initial measurement list according to a re-ranking algorithm to obtain the correct matching ranking of the images, and outputting pedestrian matching images to identify specific pedestrians.
The invention is also characterized in that:
The baseline network optimization of the ResNet-50 network model is performed as follows:
the loss function of the ResNet-50 network model is optimized by combining the Softmax loss and the Triplet loss, the optimized loss function being:
L = Σ_{i=1}^{m} L_i;
where m is the number of loss functions being combined, and the Triplet loss term is:
L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+;
where f_a is the feature vector of the anchor sample, f_p is the feature vector of the positive sample, f_n is the feature vector of the negative sample, D(·,·) is the distance between two feature vectors, a is the minimum margin between the anchor-positive and anchor-negative distances, and [·]+ means that when the value in [ ] is greater than zero it is taken as the loss, and when it is less than zero the loss is zero.
Step 2 specifically comprises the following steps:
Step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, taking the original pedestrian image as the input image, inputting it into the network structure, and outputting a feature map;
Step 2.2, deconvolving the feature map to the size of the input image, and adding a new convolution layer to generate a predicted saliency map;
Step 2.3, first applying a convolution layer with kernel size 1×1 in the network structure to the conv1-2 layer to generate a boundary prediction, then adding the boundary prediction to the predicted saliency map to obtain a refined boundary, and then applying a convolution layer to the refined result to obtain the salient human body image.
Step 3 specifically comprises the following steps:
Step 3.1, using the salient human body image as the input of a pose estimator to locate 14 joint points;
Step 3.2, grouping the 14 located human joints into 6 sub-areas, cropping, rotating and resizing the 6 sub-areas to fixed sizes and orientations, and combining them to form a stitched body part image;
Step 3.3, performing pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
Step 3.4, inputting the body part image into a ResNet-50 network for training, and extracting local semantic features.
The specific process of step 5 is as follows:
encoding the k-reciprocal nearest neighbors by weighting into a single vector to form the k-reciprocal feature, calculating the Jaccard distance between the pedestrian test image p and the image set using the k-reciprocal features, and finally weighting the original distance and the Jaccard distance between the pedestrian test image p and the image set to obtain the distance formula; and calculating the distances between the images in the initial measurement list and the fused features according to the distance formula, re-ranking to obtain the correct matching ranking of the images, and outputting the pedestrian matching images to identify the specific pedestrian.
The beneficial effects of the invention are as follows:
The invention discloses a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding. Deep global features and local semantic features are fused, distance measurement between different images is performed on the fused weighted features, and images of the same pedestrian are identified and retrieved; the method identifies and retrieves pedestrian images in an original image database to obtain images of a specific pedestrian, and is therefore well suited to a pose-embedding-based multi-scale convolution feature fusion pedestrian re-identification system. The performance of the baseline network is improved by random erasing and the triplet loss function, and the local features obtained by pose estimation are aggregated by feature weighting with the global features obtained by the baseline network, achieving global optimization; this facilitates target identification and localization, accelerates the algorithm, and improves the stability of the system. The method can greatly improve the accuracy of identification and localization, and can be used not only for target identification and retrieval on pedestrian images but also in other fields.
Drawings
FIG. 1 is a flow chart of a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding;
FIG. 2 is a graph of random erasure processing effect of a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding;
FIG. 3 is a schematic diagram of the triplet loss of the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding;
FIG. 4 is a pose embedding effect diagram of the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
A multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding is shown in fig. 1, and comprises the following steps:
Step 1, an image database is established; the database consists of pedestrian images collected manually and corrected by computer, 72000 images in total. The original pedestrian images are preprocessed by random erasing to obtain pedestrian images, baseline network optimization is performed on the ResNet-50 network model, and the pedestrian images are input into the optimized ResNet-50 network model to obtain deep convolution features;
Step 1.1, randomly erasing an original pedestrian image by adopting a random erasing enhancement processing method to obtain a pedestrian image;
Specifically, random erasing augmentation (Random Erasing Augmentation, REA) is an effective data enhancement method. It occludes the training images: a rectangular area with randomly generated position and size is placed in the image, part of the pedestrian image is occluded, and the pixel values in the occluded area are set to random values. In this way, overfitting is reduced and the convergence of the network model is improved, thereby improving the performance of the deep learning model.
During training of the network model, for the original training data set, assume that the probability of randomly erasing an image is P, so that the probability of leaving it unchanged is 1-P. In the random erasing process a rectangular area is generated with probability P to occlude the image, with the position and area of the occlusion chosen at random.
Assume that the size of the image to be randomly erased, i.e. the original pedestrian image, is:
S=W×H (1);
where W is the width of the pedestrian image and H is the height of the pedestrian image.
Assume that the area of the randomly erased rectangular region is S_e, constrained to lie within the range specified by a minimum value S_l and a maximum value S_h, and that the aspect ratio of the erased region is r_e. The height H_e and width W_e of the randomly erased rectangular region are then:
H_e = √(S_e × r_e) (2);
W_e = √(S_e / r_e) (3);
where S_e is the area of the erased rectangular frame, r_e is the aspect ratio of the erased rectangular frame, H_e is the height of the erased rectangular frame, and W_e is the width of the erased rectangular frame.
A point P = (x_e, y_e) is selected at random on the original pedestrian image. If formulas (4) and (5) are satisfied:
x_e + W_e ≤ W (4);
y_e + H_e ≤ H (5);
then the rectangular area to be erased in the original pedestrian image is (x_e, y_e, x_e + W_e, y_e + H_e), and each pixel in this rectangular area is assigned a random value, replacing the original content of the selected region. If the randomly selected point P = (x_e, y_e) does not satisfy formulas (4) and (5), the above procedure is repeated and a new point P = (x_e, y_e) is selected in the image until a suitable random point is found. Finally, the randomly erased original pedestrian image (i.e. the pedestrian image) is output, as shown in FIG. 2.
Step 1.2, optimizing the loss function of the ResNet-50 network model by combining the Softmax loss and the Triplet loss;
Specifically, in the field of pedestrian re-identification the triplet loss (Triplet loss) is also widely used, most often in combination with the Softmax loss in a network model. As shown in FIG. 3, when using the triplet loss function, three pictures are taken as inputs to the network: {x_a, x_p, x_n}, where x_a is the anchor sample (Anchor), randomly selected from the data set used to train the network model, x_p is a training sample whose pedestrian identity belongs to the same class as the anchor sample, i.e. the positive sample, and x_n is a training sample whose pedestrian identity belongs to a different class from the anchor sample, i.e. the negative sample. These training samples are input into identical network branches for feature extraction, as shown in FIG. 3; after learning with the Triplet loss, the distance between the anchor sample and the positive sample is minimized and the distance between the anchor sample and the negative sample is maximized. The formula for calculating the Triplet loss is:
L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+ (6);
where f_a is the feature vector of the anchor sample, f_p is the feature vector of the positive sample, f_n is the feature vector of the negative sample, D(·,·) is the distance between two feature vectors, a is the minimum margin between the anchor-positive and anchor-negative distances, and [·]+ means that when the value in [ ] is greater than zero it is taken as the loss, and when it is less than zero the loss is zero;
From the objective function it can be seen that when the distance between f_a and f_n is less than the distance between f_a and f_p plus a, the value in [ ] is greater than zero and there is a loss; when the distance between f_a and f_n is greater than or equal to the distance between f_a and f_p plus a, the loss is zero.
Through the triplet loss function, the network model shortens the distance between pedestrian images with the same label and lengthens the distance between pedestrian images with different labels, making the trained network model more discriminative.
The loss function of the ResNet-50 network model is optimized by combining the Softmax loss and the Triplet loss, the optimized loss function being:
L = Σ_{i=1}^{m} L_i (7);
where m is the number of loss functions being combined;
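A minimal PyTorch sketch of formulas (6) and (7) follows, assuming the Euclidean distance for D(·,·), a margin a = 0.3, and m = 2 loss terms (cross-entropy as the Softmax loss plus the triplet loss); the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.3):
    """Formula (6): L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+ over a batch.

    f_a, f_p, f_n: (batch, dim) feature vectors of anchor, positive and
    negative samples; margin = 0.3 is an assumed setting.
    """
    d_ap = F.pairwise_distance(f_a, f_p)   # anchor-positive distance
    d_an = F.pairwise_distance(f_a, f_n)   # anchor-negative distance
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

def combined_loss(logits, labels, f_a, f_p, f_n):
    """Formula (7) with m = 2: sum of the Softmax (cross-entropy) loss
    and the Triplet loss."""
    return F.cross_entropy(logits, labels) + triplet_loss(f_a, f_p, f_n)
```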
Step 1.3, inputting the pedestrian image into the optimized ResNet-50 network model to obtain the deep convolution features.
Step 2, extracting features from the original pedestrian image as the input image, and separating foreground from background to obtain a salient human body image;
Step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, taking the original pedestrian image as the input image, inputting it into the network structure, and outputting a feature map;
Specifically, because the VGG-16 model performs well in image classification and generalizes well, the saliency model is also built on VGG-16. Given an input image of size W×H, the size of the output map of the full network is [W/2^5, H/2^5]; that is, a network architecture built on the complete VGG-16 reduces the feature map by a factor of 32 relative to the input. In this embodiment the last pooling stage of VGG-16 is removed, which enlarges the output feature map and balances semantic context against image detail. The feature map output by the network structure of the present invention is therefore reduced by a factor of 16 compared with the input image.
Step 2.2, the integrated feature map already contains various saliency cues, so it can be used to predict the saliency map. Specifically, the feature map is deconvolved to the size of the input image, and a new convolution layer is added to generate a predicted saliency map;
Step 2.3, boundary refinement is added by introducing a short connection into the prediction, and foreground and background are further separated by boundary refinement; the bottom-level features are expected to help predict object boundaries, and they have the same spatial resolution as the input image. Specifically, a convolution layer with kernel size 1×1 in the network structure is applied to the conv1-2 layer to generate a boundary prediction, the boundary prediction is added to the predicted saliency map to obtain a refined boundary, and a convolution layer is then applied to the refined result to obtain the salient human body image.
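The saliency branch of step 2 could be assembled roughly as in the sketch below, assuming a torchvision VGG-16 backbone; the layer indices, channel widths, and the use of bilinear upsampling in place of a learned deconvolution are assumptions for illustration, not the claimed structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SaliencyNet(nn.Module):
    """Sketch of step 2: VGG-16 without its last pooling stage (stride 16
    instead of 32), upsampling back to input size, and a boundary branch
    built from a 1x1 convolution on conv1-2."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features    # torchvision >= 0.13 API
        self.conv1_2 = feats[:4]                # conv1-1, relu, conv1-2, relu
        self.backbone = feats[:-1]              # drop the last pooling stage
        self.score = nn.Conv2d(512, 1, kernel_size=1)    # saliency prediction
        self.boundary = nn.Conv2d(64, 1, kernel_size=1)  # 1x1 conv on conv1-2
        self.refine = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        H, W = x.shape[2:]
        sal = self.score(self.backbone(x))       # coarse map at 1/16 resolution
        sal = F.interpolate(sal, size=(H, W),    # stand-in for deconvolution
                            mode='bilinear', align_corners=False)
        edge = self.boundary(self.conv1_2(x))    # boundary prediction
        return torch.sigmoid(self.refine(sal + edge))  # refined saliency map
```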
Step 3, pose extraction is first performed on the salient human body image using a pose convolution machine to obtain body part images, and the body part images are then input into a ResNet-50 network to extract local semantic features. Specifically, pose extraction uses an off-the-shelf convolutional pose machine model, a sequential convolution structure that can detect 14 body joints, namely the head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles, as shown in FIG. 4.
Step 3.1, using the salient human body image as the input of the pose estimator to locate 14 joint points, the 14 joints being the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, left hip, left knee, left ankle, right hip, right knee and right ankle;
Step 3.2, grouping the 14 located human joints into 6 sub-areas (head, upper body, left arm, right arm, left leg and right leg) as human body parts, cropping, rotating and resizing the 6 sub-areas to fixed sizes and orientations, and combining them to form a stitched body part image; because the 6 parts of the human body differ in size, black areas inevitably appear in the stitched image;
Step 3.3, performing pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
Since black areas appear in the stitched body part image, the size of each body part must be pose-transformed to remove them; the part sizes are determined mainly by observation. For example, in this embodiment the width of an arm is observed to be about 20 pixels and the width of a leg about 30 pixels; decreasing these parameters loses information, while increasing them may introduce more background noise. System performance nevertheless remains stable as long as the parameter variation is small: when a part size varies within a small range, the discriminative information it contains changes little, so the network can still learn a discriminative embedding given the supervision signal.
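The part extraction of steps 3.1-3.2 can be sketched as follows. The joint-to-part grouping and the 15-pixel padding are assumptions for illustration (the embodiment only reports observed arm and leg widths of about 20 and 30 pixels); rotating and resizing each crop to a fixed size and stitching them are left to the image library of choice.

```python
import numpy as np

# Assumed grouping of the 14 joints (indexed in the order listed in
# step 3.1) into the 6 body sub-areas.
PARTS = {
    "head":       [0, 1],          # head, neck
    "upper_body": [2, 5, 8, 11],   # both shoulders and both hips
    "right_arm":  [2, 3, 4],       # right shoulder, elbow, wrist
    "left_arm":   [5, 6, 7],       # left shoulder, elbow, wrist
    "left_leg":   [8, 9, 10],      # left hip, knee, ankle
    "right_leg":  [11, 12, 13],    # right hip, knee, ankle
}

def crop_part(img, joints, idxs, pad=15):
    """Crop the padded bounding box around the given joints."""
    pts = joints[idxs]
    x0, y0 = np.maximum(pts.min(axis=0).astype(int) - pad, 0)
    x1, y1 = pts.max(axis=0).astype(int) + pad
    return img[y0:y1, x0:x1]

def split_into_parts(img, joints):
    """joints: (14, 2) array of (x, y) locations from the pose estimator.
    Returns the 6 part crops keyed by part name."""
    return {name: crop_part(img, joints, idxs) for name, idxs in PARTS.items()}
```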
Step 3.4, dividing the body part images into a test set and a training set, inputting them into a ResNet-50 network for training, and extracting local semantic features. The ResNet-50 network in this step does not share weights with the ResNet-50 network optimized in step 1; instead, a separate set of weights is trained to discriminate the local semantic images and extract the local semantic features.
Step 4, performing weighted fusion on the deep convolution features and the local semantic features to obtain weighted fused features, measuring the distances between the images in the image test library and the image query library and the fused features respectively, generating an initial measurement list ranking from the measured results, and returning the query scores. The feature-weighted aggregation is:
d = α·f_DEEP + (1-α)·f_SOD (8);
where the parameter 0 ≤ α ≤ 1 represents the relative weight of the deep global feature and the local semantic feature.
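Formula (8) is a convex combination of the two feature vectors; a one-line sketch is given below, where the default α = 0.5 is an assumed value (the description only constrains 0 ≤ α ≤ 1).

```python
import numpy as np

def fuse_features(f_deep, f_sod, alpha=0.5):
    """Formula (8): d = alpha * f_DEEP + (1 - alpha) * f_SOD."""
    return alpha * np.asarray(f_deep) + (1.0 - alpha) * np.asarray(f_sod)
```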
Step 5, re-ranking the images in the initial measurement list according to a re-ranking algorithm to obtain the correct matching ranking of the images, and outputting pedestrian matching images to identify specific pedestrians.
Specifically, given a pedestrian test image p and an image set G = {g_i | i = 1, 2, …, N}, the k-reciprocal nearest neighbors are encoded by weighting into a single vector to form the k-reciprocal feature; the Jaccard distance between the pedestrian test image p and the image set is then calculated using the k-reciprocal features; finally, the original distance and the Jaccard distance between the pedestrian test image p and the image set are weighted to obtain the final distance. The distances between the images in the initial measurement list and the fused features are calculated and sorted to obtain the correct matching ranking of the images, and the pedestrian matching images are output to identify the specific pedestrian.
Step 5.1, first, a pedestrian image p is given for testing and an image set G = {g_i | i = 1, 2, …, N} is given as the pedestrian image reference; the original distance between the pedestrian image p and a reference image g_i is measured by the Mahalanobis distance, as shown in formula (9):
d(p, g_i) = (x_p - x_gi)^T M (x_p - x_gi) (9);
where x_p is the appearance feature of the test image p, x_gi is the appearance feature of the reference image g_i, and M is a positive semi-definite matrix.
The list is initialized according to the original distances between the test image p and the reference images g_i, giving the ranked list:
L(p, G) = {g_1^0, g_2^0, …, g_N^0}, where d(p, g_i^0) < d(p, g_{i+1}^0) (10);
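Formulas (9) and (10) can be sketched as below; taking M as the identity matrix, which reduces the Mahalanobis distance to the squared Euclidean distance, is an assumption for illustration.

```python
import numpy as np

def mahalanobis(x_p, x_g, M):
    """Formula (9): d(p, g_i) = (x_p - x_gi)^T M (x_p - x_gi),
    with M positive semi-definite."""
    diff = x_p - x_g
    return float(diff @ M @ diff)

def initial_ranking(x_p, gallery, M=None):
    """Sort the gallery by original distance to the probe, giving the
    initial list L(p, G) of formula (10)."""
    M = np.eye(x_p.shape[0]) if M is None else M
    dists = np.array([mahalanobis(x_p, x_g, M) for x_g in gallery])
    order = np.argsort(dists)          # ascending distance: best match first
    return order, dists[order]
```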
Step 5.2, the purpose of the re-ranking strategy is to reorder the initial ranked list L(p, G) so that more correctly matched image samples are ranked at the top of the list, thereby improving the recognition accuracy of pedestrian re-identification.
The top k ranked samples in the initial ranked list are defined as the k-nearest neighbors (k-nn):
N(p, k) = {g_1^0, g_2^0, …, g_k^0}, |N(p, k)| = k (11);
The k-reciprocal nearest neighbors (k-rnn) are then expressed as:
R(p, k) = {g_i | (g_i ∈ N(p, k)) ∧ (p ∈ N(g_i, k))} (12);
However, due to influencing factors such as brightness variation, pose variation, viewing-angle variation and occlusion, a correctly matched sample may be excluded from the nearest neighbors. To solve this problem, each candidate nearest-neighbor set is converted into a more robust set:
R*(p, k) ← R(p, k) ∪ R(g_i, ½k), s.t. |R(p, k) ∩ R(g_i, ½k)| ≥ (2/3)|R(g_i, ½k)|, ∀ g_i ∈ R(p, k) (13);
that is, for each sample g_i in the original set R(p, k), its ½k-reciprocal nearest-neighbor set R(g_i, ½k) is found, and when the number of samples it shares with R(p, k) satisfies the above condition, its union with R(p, k) is taken, so that after expansion more positive samples are added to the set R(p, k);
Step 5.3, weights are reassigned according to the original distances between the query image and its nearest neighbors, and the k-reciprocal nearest-neighbor set of a sample image is encoded by a Gaussian kernel into an N-dimensional vector V_p = [V_{p,g_1}, V_{p,g_2}, …, V_{p,g_N}], defined as:
V_{p,g_i} = exp(-d(p, g_i)), if g_i ∈ R*(p, k); V_{p,g_i} = 0, otherwise (14);
Since nearer neighbors are assigned larger weights and farther neighbors smaller weights, the numbers of candidates in the intersection and union needed to calculate the Jaccard distance can be computed as:
|V_p ∩ V_{g_i}| = Σ_{j=1}^{N} min(V_{p,g_j}, V_{g_i,g_j}) (15);
|V_p ∪ V_{g_i}| = Σ_{j=1}^{N} max(V_{p,g_j}, V_{g_i,g_j}) (16);
The intersection uses the minimum operation to take the smaller value in each dimension of the two feature vectors as the degree to which g_i is contained by both, while the maximum operation of the union counts the total set of matching candidates in the two sets;
Step 5.4, the final Jaccard distance is expressed as:
d_J(p, g_i) = 1 - Σ_{j=1}^{N} min(V_{p,g_j}, V_{g_i,g_j}) / Σ_{j=1}^{N} max(V_{p,g_j}, V_{g_i,g_j}) (17);
The initial ranked list is modified by combining the original distance and the Jaccard distance, and the final distance is defined as:
d*(p, g_i) = (1-λ)·d_J(p, g_i) + λ·d(p, g_i) (18);
where λ is a weighting parameter balancing the two distances: when λ = 0 only the Jaccard distance is considered, when λ = 1 only the original distance is considered, and λ = 0.3 is set here;
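The whole of step 5 condenses into the sketch below, operating on a square matrix of original distances over the probe and gallery images together. It follows formulas (11)-(18); the ½k expansion neighbourhood and the 2/3 overlap threshold of formula (13) follow the standard k-reciprocal re-ranking scheme, and the parameter defaults and helper names are illustrative.

```python
import numpy as np

def k_reciprocal_rerank(dist, k=20, lam=0.3):
    """Re-rank an (n x n) matrix of original distances; O(n^3), intended
    for small galleries only."""
    n = dist.shape[0]
    order = np.argsort(dist, axis=1)

    def knn(i, kk):                 # N(i, kk), formula (11)
        return order[i, :kk]

    def reciprocal(i, kk):          # R(i, kk), formula (12)
        return np.array([j for j in knn(i, kk) if i in knn(j, kk)], dtype=int)

    V = np.zeros((n, n))
    for i in range(n):
        R = reciprocal(i, k)
        R_star = set(R.tolist())    # expanded set R*(i, k), formula (13)
        for j in R:
            Rj = reciprocal(j, k // 2)
            if len(np.intersect1d(R, Rj)) >= (2 / 3) * len(Rj):
                R_star |= set(Rj.tolist())
        idx = np.array(sorted(R_star), dtype=int)
        V[i, idx] = np.exp(-dist[i, idx])   # Gaussian-kernel encoding, formula (14)

    d_final = np.zeros_like(dist)
    for p in range(n):
        for g in range(n):
            inter = np.minimum(V[p], V[g]).sum()    # formula (15)
            union = np.maximum(V[p], V[g]).sum()    # formula (16)
            d_j = 1 - inter / union if union > 0 else 1.0   # Jaccard distance (17)
            d_final[p, g] = (1 - lam) * d_j + lam * dist[p, g]  # formula (18)
    return d_final
```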
Step 5.5, the distances between the images in the initial measurement list and the fused features are calculated using formula (18) and sorted to obtain the correct matching ranking of the images; the pedestrian matching images are output to identify the specific pedestrian, completing the identification.
Through the above approach, the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding is mainly used to retrieve and query corresponding pedestrian pictures from a large pedestrian image database; given one image, pictures of the same pedestrian in the image database can be found. By separating foreground and background, the influence of complex backgrounds is filtered out, and local pedestrian features are extracted under that influence by estimating human key points; the baseline network's input images are preprocessed by random erasing to strengthen the robustness of the network model, so that more robust global features are extracted. Finally, the similarity measurement between features is improved by deep weighted fusion of features at different scales and by the re-ranking method.

Claims (1)

1. The multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding is characterized by comprising the following steps of:
step 1, preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, performing baseline network optimization on a ResNet-50 network model, and inputting the pedestrian image into the optimized ResNet-50 network model to obtain deep convolution features;
step 2, extracting features from the original pedestrian image as the input image to obtain a salient human body image;
step 3, firstly performing pose extraction on the salient human body image using a pose convolution machine to obtain body part images, and then inputting the body part images into a ResNet-50 network to extract local semantic features;
step 4, performing weighted fusion on the deep convolution features and the local semantic features to obtain weighted fused features, measuring the distances between the images in the image test library and the image query library and the fused features respectively, and generating an initial measurement list from the measured distances;
step 5, reordering the images in the initial measurement list according to a reordering algorithm to obtain correct matching ranking of the images, and outputting pedestrian matching images to identify specific pedestrians;
the baseline network optimization of the ResNet-50 network model is performed as follows:
the loss function of the ResNet-50 network model is optimized by combining the Softmax loss and the Triplet loss, the optimized loss function being:
L = Σ_{i=1}^{m} L_i;
where m is the number of loss functions being combined, and the Triplet loss term is:
L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+;
where f_a is the feature vector of the anchor sample, f_p is the feature vector of the positive sample, f_n is the feature vector of the negative sample, D(·,·) is the distance between two feature vectors, a is the minimum margin between the anchor-positive and anchor-negative distances, and [·]+ means that when the value in [ ] is greater than zero it is taken as the loss, and when it is less than zero the loss is zero;
Step 2 specifically comprises the following steps:
Step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, taking the original pedestrian image as the input image, inputting it into the network structure, and outputting a feature map;
Step 2.2, deconvolving the feature map to the size of the input image, and adding a new convolution layer to generate a predicted saliency map;
Step 2.3, first applying a convolution layer with kernel size 1×1 in the network structure to the conv1-2 layer to generate a boundary prediction, then adding the boundary prediction to the predicted saliency map to obtain a refined boundary, and then applying a convolution layer to the refined result to obtain the salient human body image;
Step 3 specifically comprises the following steps:
Step 3.1, using the salient human body image as the input of a pose estimator to locate 14 joint points;
Step 3.2, grouping the 14 located human joints into 6 sub-areas, cropping, rotating and resizing the 6 sub-areas to fixed sizes and orientations, and combining them to form a stitched body part image;
Step 3.3, performing pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
Step 3.4, inputting the body part image into a ResNet-50 network for training, and extracting local semantic features;
encoding the k-reciprocal nearest neighbors by weighting into a single vector to form the k-reciprocal feature, calculating the Jaccard distance between the pedestrian test image p and the image set using the k-reciprocal features, and finally weighting the original distance and the Jaccard distance between the pedestrian test image p and the image set to obtain the distance formula; and calculating the distances between the images in the initial measurement list and the fused features according to the distance formula, re-ranking to obtain the correct matching ranking of the images, and outputting the pedestrian matching images to identify the specific pedestrian.
CN202110667913.9A 2021-06-16 2021-06-16 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding Active CN113378729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110667913.9A CN113378729B (en) 2021-06-16 2021-06-16 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding


Publications (2)

Publication Number Publication Date
CN113378729A CN113378729A (en) 2021-09-10
CN113378729B (en) 2024-05-10

Family

ID=77572789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110667913.9A Active CN113378729B (en) 2021-06-16 2021-06-16 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding

Country Status (1)

Country Link
CN (1) CN113378729B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement
CN109740541A (en) * 2019-01-04 2019-05-10 重庆大学 A kind of pedestrian weight identifying system and method
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic
CN110717411A (en) * 2019-09-23 2020-01-21 湖北工业大学 Pedestrian re-identification method based on deep layer feature fusion
CN111401113A (en) * 2019-01-02 2020-07-10 南京大学 Pedestrian re-identification method based on human body posture estimation
CN111709311A (en) * 2020-05-27 2020-09-25 西安理工大学 Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111783736A (en) * 2020-07-23 2020-10-16 上海高重信息科技有限公司 Pedestrian re-identification method, device and system based on human body semantic alignment
CN111860147A (en) * 2020-06-11 2020-10-30 北京市威富安防科技有限公司 Pedestrian re-identification model optimization processing method and device and computer equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Partial pedestrian re-identification based on pose-guided alignment network; Zheng Ye; Zhao Jieyu; Wang Chong; Zhang Yi; Computer Engineering; 2020-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN113378729A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
WO2019232894A1 (en) Complex scene-based human body key point detection system and method
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN111046856B (en) Parallel pose tracking and map creating method based on dynamic and static feature extraction
CN107133569A (en) The many granularity mask methods of monitor video based on extensive Multi-label learning
Zulkifley Two streams multiple-model object tracker for thermal infrared video
CN112329559A (en) Method for detecting homestead target based on deep convolutional neural network
CN111709317B (en) Pedestrian re-identification method based on multi-scale features under saliency model
CN109271927A (en) A kind of collaboration that space base is multi-platform monitoring method
CN110543817A (en) Pedestrian re-identification method based on posture guidance feature learning
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN111680560A (en) Pedestrian re-identification method based on space-time characteristics
Pang et al. Analysis of computer vision applied in martial arts
Zhang [Retracted] Sports Action Recognition Based on Particle Swarm Optimization Neural Networks
CN115830643B (en) Light pedestrian re-recognition method based on posture guiding alignment
CN113378729B (en) Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding
CN114973305B (en) Accurate human body analysis method for crowded people
CN114663835A (en) Pedestrian tracking method, system, equipment and storage medium
CN105809719A (en) Object tracking method based on pixel multi-coding-table matching
CN115880332A (en) Target tracking method for low-altitude aircraft visual angle
CN116912670A (en) Deep sea fish identification method based on improved YOLO model
CN114360058A (en) Cross-visual angle gait recognition method based on walking visual angle prediction
Tan A Method for Identifying Sports Behaviors in Sports Adversarial Project Training Based on Image Block Classification under the Internet of Things

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant