CN114708653A - Specified pedestrian action retrieval method based on pedestrian re-identification algorithm - Google Patents

Specified pedestrian action retrieval method based on pedestrian re-identification algorithm

Info

Publication number
CN114708653A
Authority
CN
China
Prior art keywords
pedestrian
action
backbone network
frame
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210291238.9A
Other languages
Chinese (zh)
Inventor
张伟 (Zhang Wei)
周鑫 (Zhou Xin)
陈云芳 (Chen Yunfang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210291238.9A priority Critical patent/CN114708653A/en
Publication of CN114708653A publication Critical patent/CN114708653A/en
Pending legal-status Critical Current


Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00  Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70  Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78  Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783  Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837  Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00  Pattern recognition
    • G06F18/20  Analysing
    • G06F18/24  Classification techniques
    • G06F18/241  Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/04  Architecture, e.g. interconnection topology
    • G06N3/045  Combinations of networks
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/08  Learning methods


Abstract

The invention discloses a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm. The method comprises the following steps: each frame of video data is input into a feature extraction backbone network to extract a frame-level backbone network feature map; the frame-level backbone network feature map is input into a pedestrian detection branch module, which processes it and outputs a final target detection bounding box for each pedestrian; the re-identification branch module processes the backbone network feature map and the final target detection bounding box of each pedestrian and outputs an action feature queue for each pedestrian; the action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension, and then performs action classification to obtain the final action retrieval result. By adding pedestrian re-identification features, the invention can continuously track the action recognition result of a specified target, thereby greatly improving the accuracy of pedestrian detection.

Description

Specified pedestrian action retrieval method based on pedestrian re-identification algorithm
Technical Field
The invention relates to the technical field of computer vision, in particular to a specified pedestrian action retrieval method based on a pedestrian re-recognition algorithm.
Background
With the rapid proliferation of video data, numerous computer vision tasks have been proposed to analyze it. Among these, human action recognition has significant value in many aspects of real life and is receiving increasing attention.
Current action recognition algorithms mainly use the motion information of the target to complete action classification, and this approach achieves good results in simple experimental scenes. However, video data in real life are often more complex: there are more people in the scene, and position changes and interactions frequently occur among pedestrians. In such cases these methods easily lose track of pedestrians, which affects the correct recognition of pedestrian actions. This requires further mining of the role that appearance information can play in action recognition.
In a sparse scene with few people and no occlusion, the appearance information only needs to be accurate enough for the pedestrian detection sub-algorithm to identify the target as a person; for example, people performing the same action should not be split into two classes because of differences in body shape or in the style and colour of their clothing. In a complex scene, however, the extracted appearance information needs to be rich enough to distinguish different pedestrians. For example, Yan Wenhao et al. proposed using face information as the main feature for measuring the similarity between different action performers, thereby reducing action misclassification caused by pedestrian id misclassification. However, face features are difficult to collect when pedestrians are not facing the camera, the light is dim, or the distance is long, so more universal pedestrian appearance features are needed.
On the other hand, in practical applications such as behavior analysis of customers in an unmanned store, search and rescue in security scenarios, and tracking of suspects, all actions of a specific pedestrian within a given time period usually need to be recognized so that useful information can be obtained after further analysis. For example, Ketan Kotecha et al. measure similarity between targets with a non-deep-learning method, cut the input video into independent segments according to different pedestrians, and finally treat action recognition on these segments as a classification task. However, the non-deep-learning method generalizes poorly and has difficulty identifying the same pedestrian over a longer time span, so a more effective similarity measurement method is required.
Therefore, action recognition algorithms in the prior art generally suffer from misclassification caused by the lack of appearance features, and the prior art cannot continuously track the action recognition result of a specified target.
Disclosure of Invention
In view of the above problems, the invention provides a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm.
To achieve the above object, the invention provides a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm, which comprises the following steps:
s1: respectively inputting each frame of video data acquired by a video acquisition device in real time into a feature extraction backbone network, processing each frame by the feature extraction backbone network, and extracting a backbone network feature map;
s2: inputting the backbone network characteristic diagram into a pedestrian detection branch module, processing the backbone network characteristic diagram by the pedestrian detection branch module, and outputting a final target detection bounding box of each pedestrian;
s3: inputting the backbone network feature map and the final target detection boundary box of each pedestrian into a re-identification branch module, wherein the re-identification branch module processes the backbone network feature map and the final target detection boundary box of each pedestrian and outputs an action feature queue of each pedestrian;
s4: the action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension, and then performs action classification to obtain the final action retrieval result.
Further, the specific process of step s1 includes:
sequentially inputting each frame of video data acquired by the video acquisition equipment into the feature extraction backbone network, and extracting, by the feature extraction backbone network, a backbone network feature map corresponding to each frame of image, denoted f,

f ∈ R^((W/D) × (H/D) × B)

wherein R represents the real number space, W and H represent the width and height of the input frame, D represents the spatial down-sampling rate (so that W/D and H/D are the width and height of the backbone network feature map f), and B represents the number of channels of the backbone network feature map f.
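For illustration only, a minimal PyTorch-style sketch of step s1 is given below; the concrete backbone architecture, the down-sampling rate D = 4, the channel number B = 256 and the example frame size are assumptions of this sketch and are not fixed by this description.

```python
# Minimal sketch of step s1 (assumed PyTorch; the backbone, D=4 and B=256 are illustrative choices).
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in feature extraction backbone: input frame (3, H, W) -> feature map (B, H/D, W/D)."""
    def __init__(self, out_channels: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=1, padding=1),
        )  # total stride D = 4

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.layers(frame)

backbone = ToyBackbone()
frame = torch.randn(1, 3, 608, 1088)   # one video frame (H=608, W=1088)
f = backbone(frame)                    # backbone network feature map f
print(f.shape)                         # torch.Size([1, 256, 152, 272]) = (B, H/D, W/D)
```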
Further, the pedestrian detection branching module includes: a bounding box center point prediction head sub-network, a bounding box size prediction head sub-network, and a center point offset prediction head sub-network;
the boundary box central point prediction head sub-network, the boundary box size prediction head sub-network and the central point offset prediction head sub-network are obtained through actual sample training respectively;
the specific process of the step s2 includes:
inputting the backbone network feature map f into the bounding box center point prediction head sub-network, which performs prediction on the backbone network feature map f and outputs a heatmap of all pedestrian center points,

L̂ ∈ [0, 1]^((W/D) × (H/D));

the heatmap L̂ of each pedestrian uses the focal loss function:

l_heatmap = -(1/N) · Σ_{x,y} [ (1 - L̂_{x,y})^a · log(L̂_{x,y})  if L_{x,y} = 1;  (1 - L_{x,y})^b · (L̂_{x,y})^a · log(1 - L̂_{x,y})  otherwise ]

wherein x and y respectively represent the abscissa and ordinate of each element in the output heatmap L̂, a and b represent hyper-parameters controlling the contribution weight of the center point, L̂_{x,y} indicates the predicted probability that a pedestrian target exists with coordinates (x, y) as its center point, and L_{x,y} represents the ground-truth probability that a pedestrian target exists with coordinates (x, y) as its center point;
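Before the remaining prediction heads are described, a minimal sketch of this center point focal loss is given below; the CenterNet-style piecewise form and the default values a = 2 and b = 4 are assumptions of the sketch rather than values fixed by this description.

```python
# Sketch of the center point focal loss (CenterNet-style form assumed; a=2, b=4 are illustrative).
import torch

def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor, a: float = 2.0, b: float = 4.0) -> torch.Tensor:
    """pred, gt: heatmaps of shape (H/D, W/D) with values in (0, 1); gt == 1 at true center points."""
    pos = gt.eq(1.0)                                   # positions of true pedestrian centers
    neg = ~pos
    eps = 1e-6
    pos_loss = ((1 - pred[pos]) ** a) * torch.log(pred[pos] + eps)
    neg_loss = ((1 - gt[neg]) ** b) * (pred[neg] ** a) * torch.log(1 - pred[neg] + eps)
    n = pos.sum().clamp(min=1)                         # number of pedestrian centers in the frame
    return -(pos_loss.sum() + neg_loss.sum()) / n

pred = torch.rand(152, 272) * 0.98 + 0.01              # predicted heatmap L_hat
gt = torch.zeros(152, 272); gt[40, 60] = 1.0           # ground-truth heatmap L with one center point
print(heatmap_focal_loss(pred, gt))
```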
inputting the backbone network feature map f into the bounding box size prediction head sub-network, which performs prediction on the backbone network feature map f and outputs the bounding box size of each pedestrian,

Ŝ ∈ R^((W/D) × (H/D) × 2);

the bounding box size Ŝ of each pedestrian uses the minimum absolute deviation loss function l1:

l_size = Σ_{i=1}^{N} | s_i − ŝ_i |

wherein i ∈ [1, N] indicates the pedestrian index, s_i represents the true value of the i-th pedestrian's bounding box size, ŝ_i represents the predicted value of the i-th pedestrian's bounding box size, N represents the number of pedestrians in the current frame, and l_size is the minimum absolute deviation loss function l1, the subscript size indicating that it is used here to constrain the bounding box size prediction;
inputting the backbone network feature map f into the center point offset prediction head sub-network, which performs prediction on the backbone network feature map f and outputs the offset of each pedestrian's bounding box center point in the length and width dimensions,

Ô ∈ R^((W/D) × (H/D) × 2);

the offset Ô of each pedestrian's bounding box center point in the length and width dimensions uses the minimum absolute deviation loss function l1:

l_off = Σ_{i=1}^{N} | o_i − ô_i |

wherein l_off is the minimum absolute deviation loss function l1, the subscript off indicating the head sub-network it belongs to, o_i represents the ground-truth quantization offset of the i-th pedestrian, and ô_i represents the predicted quantization offset of the i-th pedestrian;
combining the heatmap L̂, the bounding box size Ŝ and the center point offset Ô corresponding to each pedestrian into candidate target detection bounding boxes of each pedestrian, then using the NMS algorithm to de-duplicate the candidate target detection bounding boxes of each pedestrian and filtering out the bounding boxes whose confidence is lower than the threshold 0.8, so as to obtain the final target detection bounding box of each pedestrian.
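A hedged sketch of how the three head outputs might be decoded into candidate boxes and de-duplicated is shown below; the top-k peak extraction, the IoU threshold and the use of torchvision's nms are assumptions of this example, while the 0.8 confidence threshold follows the description above.

```python
# Sketch of combining the heatmap, size and offset heads into final boxes (top-k peak extraction assumed).
import torch
from torchvision.ops import nms

def decode_detections(heatmap, sizes, offsets, down_ratio=4, k=100, conf_thresh=0.8, iou_thresh=0.5):
    """heatmap: (H', W'); sizes, offsets: (2, H', W'). Returns final boxes (x1, y1, x2, y2) and scores."""
    scores, idx = heatmap.flatten().topk(k)             # candidate center points
    ys, xs = idx // heatmap.shape[1], idx % heatmap.shape[1]
    boxes = []
    for y, x in zip(ys, xs):
        ox, oy = offsets[0, y, x], offsets[1, y, x]      # quantization offset of the center point
        w, h = sizes[0, y, x], sizes[1, y, x]            # bounding box size
        cx, cy = (x + ox) * down_ratio, (y + oy) * down_ratio
        boxes.append(torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]))
    boxes = torch.stack(boxes) if boxes else torch.zeros((0, 4))
    keep = nms(boxes, scores, iou_thresh)                # de-duplicate candidate boxes
    keep = keep[scores[keep] >= conf_thresh]             # drop boxes below the 0.8 confidence threshold
    return boxes[keep], scores[keep]
```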
Further, the re-identification branch module comprises: a first preprocessing module, a first convolution layer, a global average pooling layer and a post-processing module, wherein the first convolution layer has 128 convolution kernels (output channels); the re-identification branch module is obtained through actual sample training;
the specific process of the step s3 includes:
the first preprocessing module cuts out, from the backbone network feature map f corresponding to each frame image and according to the final target detection bounding box of each pedestrian, a target feature map P_{F,j} of each pedestrian, wherein F represents the frame number and j represents the pedestrian index within the frame;

the first convolution layer performs feature extraction again on the target feature map P_{F,j} of each pedestrian to obtain a target feature map P'_{F,j} with 128 channels; the global average pooling layer adds up and averages all pixels of each channel of the 128-channel target feature map P'_{F,j} to obtain a spatial embedding feature E_{F,j} of length 128 for each pedestrian, E_{F,j} ∈ R^128, wherein E_{F,j} represents the spatial embedding feature corresponding to the j-th pedestrian of the frame image;

the post-processing module compares the spatial embedding feature E_{F,j} of each pedestrian one by one with the spatial embedding features E_{F-1,j} of the pedestrians in the previous frame, selects the target with the smallest metric distance as the matching target, and then stores the target feature map P_{F,j} of each pedestrian into the action feature queue Q_{P,id} corresponding to the pedestrian id of the matching target, wherein a subscript E denotes that the stored data are the spatial embedding features of each pedestrian, and the subscript P denotes that the stored data are the region images within the final target detection bounding box of each pedestrian.
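A condensed sketch of the re-identification branch follows: the per-pedestrian region is cropped from the feature map, projected to 128 channels, globally average-pooled into a 128-dimensional spatial embedding E_{F,j}, and matched to the previous frame by the smallest metric distance. Rounded-coordinate cropping, a 256-channel backbone output (as in the earlier sketch) and the Euclidean metric are assumptions of this example.

```python
# Sketch of the re-identification branch (rounded-coordinate cropping and Euclidean matching are assumed).
import torch
import torch.nn as nn

conv128 = nn.Conv2d(256, 128, kernel_size=1)            # first convolution layer: 128 output channels
gap = nn.AdaptiveAvgPool2d(1)                           # global average pooling layer

def embed_pedestrian(f: torch.Tensor, box, down_ratio: int = 4) -> torch.Tensor:
    """f: backbone feature map (1, 256, H', W'); box: (x1, y1, x2, y2) in input-image pixels."""
    x1, y1, x2, y2 = [int(round(v / down_ratio)) for v in box]
    p = f[:, :, y1:y2, x1:x2]                            # target feature map P_{F,j}
    p = conv128(p)                                       # P'_{F,j}, 128 channels
    return gap(p).flatten()                              # spatial embedding E_{F,j} in R^128

def match_to_previous(e: torch.Tensor, prev_embeddings: dict) -> int:
    """Return the pedestrian id from the previous frame whose embedding has the smallest distance to e."""
    return min(prev_embeddings, key=lambda pid: torch.dist(e, prev_embeddings[pid]).item())
```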
Further, the action classification module comprises: the second preprocessing module, a second convolution layer group and a plurality of full connection layers; the action classification module is obtained through actual sample training;
the specific process of the step s4 includes:
the second preprocessing module uniformly scales the action feature queue Q_{P,id} of length K to 288 × 288 and aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension; the result then passes through the second convolution layer group and the plurality of fully connected layers to output a feature vector A ∈ [0, 1]^num_action, i.e. the prediction vector of the category to which each pedestrian belongs, which is the final action retrieval result, wherein num_action represents the number of action types in the data set.
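A sketch of the action classification module is given below: the K entries of a pedestrian's action feature queue are resized to 288 × 288, concatenated along the channel dimension, and classified by a convolution layer group followed by fully connected layers. The queue length K = 8, the per-entry channel count, the layer widths and the sigmoid output are illustrative assumptions.

```python
# Sketch of the action classification module (K=8, channel counts and layer widths are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionClassifier(nn.Module):
    def __init__(self, k: int = 8, in_channels: int = 128, num_action: int = 10):
        super().__init__()
        self.conv_group = nn.Sequential(                       # second convolution layer group
            nn.Conv2d(k * in_channels, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(inplace=True),
                                nn.Linear(128, num_action))    # fully connected layers

    def forward(self, queue):
        """queue: list of K per-frame target feature maps, each of shape (in_channels, h, w)."""
        resized = [F.interpolate(q.unsqueeze(0), size=(288, 288), mode='bilinear',
                                 align_corners=False) for q in queue]
        x = torch.cat(resized, dim=1)                          # aggregate along the channel dimension
        return torch.sigmoid(self.fc(self.conv_group(x).flatten(1)))  # A in [0, 1]^num_action
```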
Compared with the prior art, the invention has the following beneficial technical effects:
the invention fuses the embedded characteristics of pedestrian re-recognition in the common action recognition algorithm, adopts a multi-task architecture capable of end-to-end training and optimization, and solves the defect that the common action recognition algorithm is easy to carry out error classification due to the lack of appearance characteristics. Furthermore, in order to meet the increasing requirements for fine tracking, retrieval and analysis of specified pedestrians in video data, the method can continuously track the action recognition result of the specified pedestrians after the pedestrians in the video are retrieved by using the images of the specified pedestrians, and the action recognition result is used for further high-level semantic reference and analysis; meanwhile, the adopted strategy of searching pedestrians first and then identifying actions can greatly reduce the required calculated amount.
Drawings
FIG. 1 is a flow diagram of a re-recognition algorithm based method for retrieving a specified pedestrian action according to one embodiment;
FIG. 2 is a flow diagram that illustrates one embodiment of specifying a single pedestrian for action retrieval;
FIG. 3 is a flow diagram that illustrates one embodiment for group action retrieval by selecting a portion of features in a query image;
fig. 4 is a schematic diagram of an attribute-based pedestrian re-identification model architecture in the group action retrieval method according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a flowchart illustrating a specified pedestrian motion retrieval method based on a re-recognition algorithm according to an embodiment.
In the embodiment, the specified pedestrian action retrieval method based on the pedestrian re-identification algorithm comprises the following steps:
s1: respectively inputting each frame of video data acquired by a video acquisition device in real time into a feature extraction backbone network, processing each frame by the feature extraction backbone network, and extracting a backbone network feature map;
s2: inputting the backbone network characteristic diagram into a pedestrian detection branch module, processing the backbone network characteristic diagram by the pedestrian detection branch module, and outputting a final target detection boundary frame of each pedestrian;
s3: inputting the backbone network feature map and the final target detection boundary box of each pedestrian into a re-identification branch module, wherein the re-identification branch module processes the backbone network feature map and the final target detection boundary box of each pedestrian and outputs an action feature queue of each pedestrian;
s4: the action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension, and then performs action classification to obtain the final action retrieval result.
In one embodiment, the specific process of step s1 includes:
sequentially inputting each frame of video data acquired by the video acquisition equipment into the feature extraction backbone network, and extracting, by the feature extraction backbone network, a backbone network feature map corresponding to each frame of image, denoted f,

f ∈ R^((W/D) × (H/D) × B)

wherein R represents the real number space, W and H represent the width and height of the input frame, D represents the spatial down-sampling rate (so that W/D and H/D are the width and height of the backbone network feature map f), and B represents the number of channels of the backbone network feature map f.
In one embodiment, the pedestrian detection branching module includes: a bounding box center point prediction head sub-network, a bounding box size prediction head sub-network, and a center point offset prediction head sub-network;
the boundary box central point prediction head sub-network, the boundary box size prediction head sub-network and the central point offset prediction head sub-network are obtained through actual sample training respectively;
the specific process of the step s2 includes:
inputting the backbone network feature map f into the bounding box center point prediction head sub-network, which performs prediction on the backbone network feature map f and outputs a heatmap of all pedestrian center points,

L̂ ∈ [0, 1]^((W/D) × (H/D));

the heatmap L̂ of each pedestrian uses the focal loss function:

l_heatmap = -(1/N) · Σ_{x,y} [ (1 - L̂_{x,y})^a · log(L̂_{x,y})  if L_{x,y} = 1;  (1 - L_{x,y})^b · (L̂_{x,y})^a · log(1 - L̂_{x,y})  otherwise ]

wherein x and y respectively represent the abscissa and ordinate of each element in the output heatmap L̂, a and b represent hyper-parameters controlling the contribution weight of the center point, L̂_{x,y} indicates the predicted probability that a pedestrian target exists with coordinates (x, y) as its center point, and L_{x,y} represents the ground-truth probability that a pedestrian target exists with coordinates (x, y) as its center point;
inputting the backbone network feature map f into the bounding box size prediction head sub-network, which performs prediction on the backbone network feature map f and outputs the bounding box size of each pedestrian,

Ŝ ∈ R^((W/D) × (H/D) × 2);

the bounding box size Ŝ of each pedestrian uses the minimum absolute deviation loss function l1:

l_size = Σ_{i=1}^{N} | s_i − ŝ_i |

wherein i ∈ [1, N] indicates the pedestrian index, s_i represents the true value of the i-th pedestrian's bounding box size, ŝ_i represents the predicted value of the i-th pedestrian's bounding box size, N represents the number of pedestrians in the current frame, and l_size is the minimum absolute deviation loss function l1, the subscript size indicating that it is used here to constrain the bounding box size prediction;
inputting the backbone network feature map f into the center point offset prediction head sub-network, which performs prediction on the backbone network feature map f and outputs the offset of each pedestrian's bounding box center point in the length and width dimensions,

Ô ∈ R^((W/D) × (H/D) × 2);

the offset Ô of each pedestrian's bounding box center point in the length and width dimensions uses the minimum absolute deviation loss function l1:

l_off = Σ_{i=1}^{N} | o_i − ô_i |

wherein l_off is the minimum absolute deviation loss function l1, the subscript off indicating the head sub-network it belongs to, o_i represents the ground-truth quantization offset of the i-th pedestrian, and ô_i represents the predicted quantization offset of the i-th pedestrian;
the thermodynamic diagrams corresponding to the pedestrians
Figure BDA0003560287200000073
Size of bounding box
Figure BDA0003560287200000074
And the offset of the center point of the bounding box in both the length and width dimensions
Figure BDA0003560287200000075
And combining the candidate target detection boundary frames of all the pedestrians, and then using an NMS algorithm to perform duplication elimination on the candidate target detection boundary frames of all the pedestrians and screen out the boundary frames with the confidence degrees lower than a threshold value of 0.8 to obtain the final target detection boundary frames of all the pedestrians.
In one embodiment, the re-identification branch module comprises: a first preprocessing module, a first convolution layer, a global average pooling layer and a post-processing module, wherein the first convolution layer has 128 convolution kernels (output channels); the re-identification branch module is obtained through actual sample training;
the specific process of the step s3 includes:
the first preprocessing module cuts out, from the backbone network feature map f corresponding to each frame image and according to the final target detection bounding box of each pedestrian, a target feature map P_{F,j} of each pedestrian, wherein F represents the frame number and j represents the pedestrian index within the frame;

the first convolution layer performs feature extraction again on the target feature map P_{F,j} of each pedestrian to obtain a target feature map P'_{F,j} with 128 channels; the global average pooling layer adds up and averages all pixels of each channel of the 128-channel target feature map P'_{F,j} to obtain a spatial embedding feature E_{F,j} of length 128 for each pedestrian, E_{F,j} ∈ R^128, wherein E_{F,j} represents the spatial embedding feature corresponding to the j-th pedestrian of the frame image;

the post-processing module compares the spatial embedding feature E_{F,j} of each pedestrian one by one with the spatial embedding features E_{F-1,j} of the pedestrians in the previous frame, selects the target with the smallest metric distance as the matching target, and then stores the target feature map P_{F,j} of each pedestrian into the action feature queue Q_{P,id} corresponding to the pedestrian id of the matching target, wherein a subscript E denotes that the stored data are the spatial embedding features of each pedestrian, and the subscript P denotes that the stored data are the region images within the final target detection bounding box of each pedestrian.
In one embodiment, the action classification module comprises: a second preprocessing module, a second convolution layer group and a plurality of fully connected layers; the action classification module is obtained through actual sample training;
the specific process of the step s4 includes:
the second preprocessing module uniformly scales the action feature queue Q_{P,id} of length K to 288 × 288 and aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension; the result then passes through the second convolution layer group and the plurality of fully connected layers to output a feature vector A ∈ [0, 1]^num_action, i.e. the prediction vector of the category to which each pedestrian belongs, which is the final action retrieval result, wherein num_action represents the number of action types in the data set.
As a practical application of the above specified pedestrian motion retrieval method based on the pedestrian re-recognition algorithm, the following describes an embodiment of the present invention with respect to an application scenario of analyzing a single pedestrian motion sequence. As shown in fig. 2, the step of designating a single pedestrian for action retrieval includes:
Extracting a frame-level backbone network feature map from the determined query picture, inputting it into the trained pedestrian re-identification branch module, extracting features through the convolution layer with 128 convolution kernels, and obtaining, through the global average pooling layer, a query embedding feature E_{Query,i} ∈ R^128, where i ∈ N_q indexes the multiple query pictures of the same pedestrian.
Specifically, all pedestrians in the video to be queried are detected by using the pedestrian detection branch module, the spatial embedding features of all pedestrians are extracted, and a candidate query library is provided so that a user can update the query spatial embedding features manually.
For each frame of the input video, a frame-level backbone network feature map f is extracted using the feature extraction backbone network, and the bounding box of each pedestrian is output through the pedestrian detection branch module. The region corresponding to each pedestrian is cut out of the original image according to the pedestrian's bounding box and is denoted I_{F,j}, where F denotes the frame number and j denotes the pedestrian index within the frame.
The target feature map is cut out of the backbone network feature map f according to the bounding box of each target and is denoted P_{F,j}, where F denotes the frame number and j denotes the target index within the frame; after P_{F,j} passes through the convolution layer with 128 convolution kernels for further feature extraction and then through the global average pooling layer, the spatial embedding feature E_{F,j} ∈ R^128 of the target is obtained.
I_{F,j} and E_{F,j} are stored in the candidate query library in the format (F, I_{F,j}, E_{F,j}); typically only the most recent 30 frames of data are stored. The I_{F,j} are visual RGB images used to directly present to the user images that may be the query target. When I_{F,j} also belongs to the query target and the user considers that adding this screenshot will improve the query effect, the E_{F,j} corresponding to I_{F,j} can be added to E_Query as a complementary query embedding feature.
The query embedding features E_Query are compared for similarity with the spatial embedding features E_{F,j} of all pedestrians in the frame; if the similarity is greater than the set threshold 0.9, the candidate target and the query picture are considered to belong to the same person and the process enters the next step; otherwise, the data corresponding to the candidate target are directly discarded.
The features of the candidate target in the new frame are stored in the action recognition feature queue Q_{P,id} of length K and undergo feature transformation; after passing through the classification network, a feature vector A ∈ [0, 1]^num_action is output, i.e. the prediction vector of the category to which the target belongs, and the category with the largest predicted value is taken as the action classification result, denoted cls_F.
The action classification result cls_F and the target detection bounding box B_{F,id} are combined into a single action retrieval result, and the action detection results of the queried target are recorded in time order. Each input video frame corresponds to one retrieval record in the format (F, cls_F, B_{F,id}); after the whole process ends, the retrieval records can be merged, with adjacent records having the same action classification result merged into one record in the format (N_last, cls_F, B_{F,id}), where N_last indicates the number of frames the action lasts.
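The merging of retrieval records described above can be sketched as follows; keeping the bounding box of the last frame of each run is an assumption of this example.

```python
# Sketch of merging adjacent retrieval records with the same action class (keeping the run's last box is assumed).
def merge_records(records):
    """records: list of (frame_no, action_cls, bbox) in time order -> list of (n_last, action_cls, bbox)."""
    merged = []
    for frame_no, action_cls, bbox in records:
        if merged and merged[-1][1] == action_cls:
            n_last, _, _ = merged[-1]
            merged[-1] = (n_last + 1, action_cls, bbox)   # extend the current run of this action
        else:
            merged.append((1, action_cls, bbox))          # start a new run
    return merged

records = [(1, "walk", (10, 20, 60, 150)), (2, "walk", (12, 20, 62, 150)), (3, "wave", (13, 21, 63, 151))]
print(merge_records(records))   # [(2, 'walk', (12, 20, 62, 150)), (1, 'wave', (13, 21, 63, 151))]
```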
Fig. 2 above is described with emphasis on specifying a single pedestrian for action retrieval, and as a correspondence to the method shown in fig. 2 above, an embodiment of the present invention is described below by selecting a part of features in a query image for group action retrieval. As shown in fig. 3, the specific steps are as follows:
and training the pedestrian re-identification model based on the attributes. The data set employs a Market-1501_ Attribute annotated with 27 attributes at the ID level, such as gender, bag, age, and so forth. Because only the ID level is labeled, only 27 full-link layers are added to the original pedestrian re-identification model, and two classifications are performed, so that the attribute value can be predicted, and the model architecture diagram is shown in fig. 4: the Market-1501_ Attribute data set contains 1501 objects, namely 1501 IDs. During training, firstly, an input picture is subjected to a feature extraction backbone network to extract a backbone network feature map at a frame level; then, the pedestrian re-identification task is divided into two subtasks to be processed. The first subtask, namely a pedestrian ID classification subtask, obtains a pedestrian ID classification vector with the length of 1501 after the space embedded features pass through a full connection layer and an activation function, and takes the category with the largest predicted value as a classification result to complete the pedestrian ID classification subtask. And for each attribute in the 27 attributes, obtaining a pedestrian attribute classification vector with the length of 2 by using space embedded features through a full connection layer and an activation function, wherein two predicted values in the vector respectively represent the possibility that a pedestrian has the attribute and the possibility that the pedestrian does not have the attribute, and the largest predicted value is taken as a classification result to finish the pedestrian attribute classification subtask.
A query picture is determined, and the pedestrian re-identification (Re-ID) model trained in the previous step outputs the spatial embedding feature E_{Query,i} ∈ R^128 of the query picture and its attribute values P_att. The required attributes are then selected from the prediction results: for example, the predicted attribute values may include young, female, short hair, black upper-body clothing, long sleeves, white lower-body clothing and short lower-body clothing, whereas the group the user needs to retrieve only requires the attributes female, black upper-body clothing and white lower-body clothing; the screened attribute values P′_att together with the query embedding features E_Query constitute the target group retrieval basis.
All pedestrians in the video to be queried are detected using the pedestrian detection branch module, the spatial embedding features E_{F,j} and attribute values P_{F,j} of all pedestrians are extracted, and the candidate query library and the target group retrieval basis are updated.
Candidate targets are screened using dual retrieval criteria to obtain a target group that meets the criteria. First, the query embedding features E_Query are compared for similarity with the spatial embedding features E_{F,j} of all pedestrians in the frame; if the similarity is greater than the set threshold 0.5, the candidate target is considered to broadly belong to the target group and the process enters the next step; otherwise, the data corresponding to the candidate target are directly discarded. Second, it is judged whether the attribute values P_{F,j} of the candidate target include the screened attribute values P′_att; if the condition is met, the process enters the next step; otherwise, the data corresponding to the candidate target are directly discarded.
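A small sketch of the dual screening criteria (embedding similarity above 0.5 and attribute containment) follows; cosine similarity, a set-based attribute check and the attribute names used are assumptions of this example.

```python
# Sketch of the dual screening criteria for group retrieval (cosine similarity and set containment assumed).
import torch
import torch.nn.functional as F

def passes_group_criteria(e_query: torch.Tensor, e_candidate: torch.Tensor,
                          required_attrs: set, candidate_attrs: set,
                          sim_thresh: float = 0.5) -> bool:
    """Return True when the candidate broadly matches the query embedding and has all required attributes."""
    sim = F.cosine_similarity(e_query, e_candidate, dim=0).item()   # first criterion: embedding similarity
    return sim > sim_thresh and required_attrs <= candidate_attrs   # second criterion: attribute containment

e_query, e_cand = torch.randn(128), torch.randn(128)
print(passes_group_criteria(e_query, e_cand,
                            required_attrs={"female", "black_top", "white_bottom"},
                            candidate_attrs={"female", "black_top", "white_bottom", "short_hair"}))
```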
The spatial embedding feature of each target in the group is compared, by similarity measurement, with the data at the tail of each spatial embedding feature queue; if the similarity is greater than the set threshold 0.9, the target is the same person and the process enters the next step; otherwise, the target is regarded as a new target that has entered the video and a new feature queue is started.
The target feature map of the candidate pedestrian in the new frame is stored in the action recognition feature queue Q_{P,id} of length K and undergoes feature transformation; after passing through the classification network, a feature vector A ∈ [0, 1]^num_action is output, i.e. the prediction vector of the category to which the target belongs, and the category with the largest predicted value is taken as the action classification result, denoted cls_F.
The action classification result cls_F of each person in the group and the corresponding target detection bounding box B_{F,id} are combined for statistical analysis by the user.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It should be noted that the terms "first \ second \ third" referred to in the embodiments of the present application merely distinguish similar objects, and do not represent a specific ordering for the objects, and it should be understood that "first \ second \ third" may exchange a specific order or sequence when allowed. It should be understood that "first \ second \ third" distinct objects may be interchanged under appropriate circumstances such that the embodiments of the application described herein may be implemented in an order other than those illustrated or described herein.
The terms "comprising" and "having" and any variations thereof in the embodiments of the present application are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, product, or device.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims (5)

1. The specified pedestrian action retrieval method based on the pedestrian re-identification algorithm is characterized by comprising the following steps of:
s1, respectively inputting each frame of video data acquired by the video acquisition equipment in real time into a feature extraction backbone network, wherein the feature extraction backbone network processes each frame and extracts a backbone network feature map;
s2, inputting the backbone network characteristic diagram into a pedestrian detection branch module, and processing the backbone network characteristic diagram by the pedestrian detection branch module and outputting a final target detection boundary box of each pedestrian;
s3, inputting the backbone network characteristic diagram and the final target detection boundary box of each pedestrian into a re-identification branch module, wherein the re-identification branch module processes the backbone network characteristic diagram and the final target detection boundary box of each pedestrian and outputs an action characteristic queue of each pedestrian;
s4, the action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension, and then performs action classification to obtain the final action retrieval result.
2. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 1, wherein the specific process of the step s1 includes:
sequentially inputting each frame of video data acquired by the video acquisition equipment into the feature extraction backbone network, and extracting, by the feature extraction backbone network, a backbone network feature map corresponding to each frame of image, denoted f,

f ∈ R^((W/D) × (H/D) × B)

wherein R represents the real number space, W and H represent the width and height of the input frame, D represents the spatial down-sampling rate (so that W/D and H/D are the width and height of the backbone network feature map f), and B represents the number of channels of the backbone network feature map f.
3. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 2,
the pedestrian detection branch module includes: a bounding box center point prediction head sub-network, a bounding box size prediction head sub-network, and a center point offset prediction head sub-network;
the boundary box central point prediction head sub-network, the boundary box size prediction head sub-network and the central point offset prediction head sub-network are obtained through actual sample training respectively;
the specific process of step s2 includes:
inputting the backbone network feature map f into the bounding box center point prediction head sub-network, which performs prediction on the backbone network feature map f and outputs a heatmap of all pedestrian center points,

L̂ ∈ [0, 1]^((W/D) × (H/D));

the heatmap L̂ of each pedestrian uses the focal loss function:

l_heatmap = -(1/N) · Σ_{x,y} [ (1 - L̂_{x,y})^a · log(L̂_{x,y})  if L_{x,y} = 1;  (1 - L_{x,y})^b · (L̂_{x,y})^a · log(1 - L̂_{x,y})  otherwise ]

wherein x and y respectively represent the abscissa and ordinate of each element in the output heatmap L̂, a and b represent hyper-parameters controlling the contribution weight of the center point, L̂_{x,y} indicates the predicted probability that a pedestrian target exists with coordinates (x, y) as its center point, and L_{x,y} represents the ground-truth probability that a pedestrian target exists with coordinates (x, y) as its center point;
inputting the backbone network feature map f into the bounding box size prediction head sub-network, which performs prediction on the backbone network feature map f and outputs the bounding box size of each pedestrian,

Ŝ ∈ R^((W/D) × (H/D) × 2);

the bounding box size Ŝ of each pedestrian uses the minimum absolute deviation loss function l1:

l_size = Σ_{i=1}^{N} | s_i − ŝ_i |

wherein i ∈ [1, N] indicates the pedestrian index, s_i represents the true value of the i-th pedestrian's bounding box size, ŝ_i represents the predicted value of the i-th pedestrian's bounding box size, N represents the number of pedestrians in the current frame, and l_size is the minimum absolute deviation loss function l1, the subscript size indicating that it is used here to constrain the bounding box size prediction;
inputting the backbone network feature map f into the center point offset prediction head sub-network, which performs prediction on the backbone network feature map f and outputs the offset of each pedestrian's bounding box center point in the length and width dimensions,

Ô ∈ R^((W/D) × (H/D) × 2);

the offset Ô of each pedestrian's bounding box center point in the length and width dimensions uses the minimum absolute deviation loss function l1:

l_off = Σ_{i=1}^{N} | o_i − ô_i |

wherein l_off is the minimum absolute deviation loss function l1, the subscript off indicating the head sub-network it belongs to, o_i represents the ground-truth quantization offset of the i-th pedestrian, and ô_i represents the predicted quantization offset of the i-th pedestrian;
the thermodynamic diagrams corresponding to the pedestrians
Figure FDA00035602871900000214
Size of bounding box
Figure FDA00035602871900000215
And the offset of the center point of the bounding box in both the length and width dimensions
Figure FDA00035602871900000216
And combining the candidate target detection boundary frames of all the pedestrians, and then using an NMS algorithm to perform duplication elimination on the candidate target detection boundary frames of all the pedestrians and screen out the boundary frames with the confidence degrees lower than a threshold value of 0.8 to obtain the final target detection boundary frames of all the pedestrians.
4. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 3,
the re-identification branch module comprises: a first preprocessing module, a first convolution layer, a global average pooling layer and a post-processing module, wherein the first convolution layer has 128 convolution kernels (output channels); the re-identification branch module is obtained through actual sample training;
the specific process of the step s3 includes:
the first preprocessing module cuts out, from the backbone network feature map f corresponding to each frame image and according to the final target detection bounding box of each pedestrian, a target feature map P_{F,j} of each pedestrian, wherein F represents the frame number and j represents the pedestrian index within the frame;

the first convolution layer performs feature extraction again on the target feature map P_{F,j} of each pedestrian to obtain a target feature map P'_{F,j} with 128 channels; the global average pooling layer adds up and averages all pixels of each channel of the 128-channel target feature map P'_{F,j} to obtain a spatial embedding feature E_{F,j} of length 128 for each pedestrian, E_{F,j} ∈ R^128, wherein E_{F,j} represents the spatial embedding feature corresponding to the j-th pedestrian of the frame image;

the post-processing module compares the spatial embedding feature E_{F,j} of each pedestrian one by one with the spatial embedding features E_{F-1,j} of the pedestrians in the previous frame, selects the target with the smallest metric distance as the matching target, and then stores the target feature map P_{F,j} of each pedestrian into the action feature queue Q_{P,id} corresponding to the pedestrian id of the matching target, wherein a subscript E denotes that the stored data are the spatial embedding features of each pedestrian, and the subscript P denotes that the stored data are the region images within the final target detection bounding box of each pedestrian.
5. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 4,
the action classification module comprises: the second preprocessing module, a second convolution layer group and a plurality of full connection layers; the action classification module is obtained through actual sample training;
the specific process of step s4 includes:
the second preprocessing module uniformly scales the action feature queue Q_{P,id} of length K to 288 × 288 and aggregates it along the channel dimension to extract the information of each pedestrian in the time dimension; the result then passes through the second convolution layer group and the plurality of fully connected layers to output a feature vector A ∈ [0, 1]^num_action, i.e. the prediction vector of the category to which each pedestrian belongs, which is the final action retrieval result, wherein num_action represents the number of action types in the data set.
CN202210291238.9A 2022-03-23 2022-03-23 Specified pedestrian action retrieval method based on pedestrian re-identification algorithm Pending CN114708653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210291238.9A CN114708653A (en) 2022-03-23 2022-03-23 Specified pedestrian action retrieval method based on pedestrian re-identification algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210291238.9A CN114708653A (en) 2022-03-23 2022-03-23 Specified pedestrian action retrieval method based on pedestrian re-identification algorithm

Publications (1)

Publication Number Publication Date
CN114708653A true CN114708653A (en) 2022-07-05

Family

ID=82167884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210291238.9A Pending CN114708653A (en) 2022-03-23 2022-03-23 Specified pedestrian action retrieval method based on pedestrian re-identification algorithm

Country Status (1)

Country Link
CN (1) CN114708653A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116094A (en) * 2022-07-08 2022-09-27 福州大学 Real scene pedestrian retrieval method based on sample enhancement and instance perception



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination