CN114708653A - Specified pedestrian action retrieval method based on pedestrian re-identification algorithm - Google Patents
- Publication number
- CN114708653A (application CN202210291238.9A)
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- action
- backbone network
- frame
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7837—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm, which comprises the following steps: each frame of video data is input into a feature extraction backbone network, which extracts a frame-level backbone network feature map; the feature map is input into a pedestrian detection branch module, which outputs a final target detection bounding box for each pedestrian; a re-identification branch module processes the backbone network feature map and the final target detection bounding box of each pedestrian and outputs an action feature queue for each pedestrian; an action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it in the channel dimension to extract each pedestrian's information in the time dimension, and then performs action classification to obtain the final action retrieval result. By fusing pedestrian re-identification features and continuously tracking the action recognition results of a specified target, the invention greatly improves the accuracy of pedestrian action retrieval.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm.
Background
With the rapid proliferation of video data, many computer vision tasks have been proposed to analyze it. Among them, human action recognition has significant value in many aspects of real life and is receiving increasing attention.
Current action recognition algorithms mainly use the motion information of the target to classify actions, and this approach achieves good results in simple experimental scenes. However, real-world video data is often far more complex: scenes are crowded, and pedestrians frequently change position and interact with one another. In such conditions, purely motion-based methods easily lose track of pedestrians, which harms the correct recognition of their actions. This motivates further exploitation of appearance information in action recognition.
In a sparse scene with few people and no occlusion, appearance information only needs to be accurate enough for the pedestrian detection sub-algorithm to identify a target as human; for example, people performing the same action should not be split into two classes because of differences in body shape or in clothing style and color. In a complex scene, however, the extracted appearance information must be rich enough to distinguish different pedestrians. For example, Yan Wenhao et al. proposed using face information as the main feature for measuring similarity between different action performers, reducing the action misclassification caused by pedestrian ID confusion. However, face features are difficult to capture when pedestrians face away from the camera, the lighting is dim, or the distance is long, so more universally applicable pedestrian appearance features are needed.
On the other hand, in practical applications such as behavior analysis of customers in an unmanned store, search and rescue in security scenes, and tracking of suspects, all actions of a specific pedestrian over a span of time generally need to be recognized so that useful information can be obtained after further analysis. For example, Ketan Kotecha measured similarity between targets with a non-deep-learning method, cut the input video into independent segments according to the different pedestrians, and finally treated action recognition on each segment as a classification task. However, non-deep-learning methods generalize poorly and struggle to identify the same pedestrian over longer time spans, so a more effective similarity measurement method is required.
Therefore, action recognition algorithms in the prior art generally suffer from misclassification caused by the lack of appearance features, and the prior art cannot continuously track the action recognition results of a specified target.
Disclosure of Invention
Aiming at the problems, the invention provides a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm.
To achieve this aim, the invention provides a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm, which comprises the following steps:
s1: input each frame of video data acquired in real time by a video acquisition device into a feature extraction backbone network, which processes each frame and extracts a backbone network feature map;
s2: input the backbone network feature map into a pedestrian detection branch module, which processes it and outputs a final target detection bounding box for each pedestrian;
s3: input the backbone network feature map and the final target detection bounding box of each pedestrian into a re-identification branch module, which processes them and outputs an action feature queue for each pedestrian;
s4: the action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it in the channel dimension to extract each pedestrian's information in the time dimension, and performs action classification to obtain the final action retrieval result.
Further, the specific process of step s1 includes:
sequentially inputting each frame of video data acquired by the video acquisition device into the feature extraction backbone network, which extracts a backbone network feature map corresponding to each frame image, marked as f, with f ∈ R^{(W/D) × (H/D) × B}, where R denotes the real number space, W and H denote the width and height of the input frame, D denotes the spatial down-sampling rate, and B denotes the number of channels of the backbone network feature map f.
Further, the pedestrian detection branch module includes: a bounding box center point prediction head sub-network, a bounding box size prediction head sub-network, and a center point offset prediction head sub-network;
the bounding box center point prediction head sub-network, the bounding box size prediction head sub-network and the center point offset prediction head sub-network are each obtained through training on actual samples;
the specific process of the step s2 includes:
inputting the backbone network feature map f into the bounding box center point prediction head sub-network, which predicts from f and outputs a heatmap of all pedestrians, L̂ ∈ [0,1]^{(W/D) × (H/D)}, trained with a penalty-reduced focal loss:

l_heat = −(1/N) Σ_{x,y} { (1 − L̂_{x,y})^α · log(L̂_{x,y}), if L_{x,y} = 1; (1 − L_{x,y})^β · (L̂_{x,y})^α · log(1 − L̂_{x,y}), otherwise }

where x and y denote the abscissa and ordinate of each element in the output heatmap L̂, α and β denote hyper-parameters controlling the contribution weight of each center point, L̂_{x,y} denotes the predicted probability that a pedestrian target is centered at coordinates (x, y), and L_{x,y} denotes the ground-truth probability that a pedestrian target is centered at coordinates (x, y);
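To illustrate how candidate center points can be read off such a heatmap, the following is a minimal pure-Python sketch, not the patent's implementation: it keeps cells that exceed a confidence threshold and are local maxima among their 8 neighbours, the usual decoding step for center-point heatmaps. The threshold value 0.5 is an assumption for illustration only.

```python
def heatmap_peaks(heat, thresh=0.5):
    """Return (x, y, score) for cells that are local maxima above thresh.

    `heat` is a 2D list heat[y][x] of predicted center-point probabilities;
    a cell is kept when it is >= all of its 8 neighbours (a simple stand-in
    for the max-pooling step used by center-point detectors).
    """
    H, W = len(heat), len(heat[0])
    peaks = []
    for y in range(H):
        for x in range(W):
            p = heat[y][x]
            if p < thresh:
                continue
            neighbours = [heat[ny][nx]
                          for ny in range(max(0, y - 1), min(H, y + 2))
                          for nx in range(max(0, x - 1), min(W, x + 2))
                          if (ny, nx) != (y, x)]
            if all(p >= q for q in neighbours):
                peaks.append((x, y, p))
    return peaks
```

Each surviving peak then indexes into the size and offset heads at the same (x, y) position.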
inputting the backbone network feature map f into the bounding box size prediction head sub-network, which predicts from f and outputs the bounding box size of each pedestrian, ŝ ∈ R^{(W/D) × (H/D) × 2};
the bounding box size of each pedestrian is constrained using the least absolute deviation (l1) loss function:

l_size = (1/N) Σ_{i=1}^{N} |ŝ_i − s_i|

where i ∈ [1, N] denotes the pedestrian index, s_i denotes the true value of the i-th pedestrian's bounding box size, ŝ_i denotes the predicted value of the i-th pedestrian's bounding box size, N denotes the number of pedestrians in the current frame, and l_size is the least absolute deviation loss l1, the subscript size indicating that it constrains the bounding box size prediction;
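The least absolute deviation loss above can be sketched in a few lines of Python; sizes are taken here as (width, height) pairs and the per-pedestrian absolute deviations are averaged over the N detections. This is a plain reading of the l1 formula, not the patent's code.

```python
def l1_size_loss(pred_sizes, true_sizes):
    """Least absolute deviation (l1) loss over the N pedestrians in a frame.

    Each size is a (width, height) pair; the absolute errors of both
    dimensions are summed per pedestrian and averaged over pedestrians.
    """
    n = len(true_sizes)
    return sum(abs(pw - tw) + abs(ph - th)
               for (pw, ph), (tw, th) in zip(pred_sizes, true_sizes)) / n
```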
inputting the backbone network feature map f into the center point offset prediction head sub-network, which predicts from f and outputs the offset of each pedestrian's bounding box center point in the length and width dimensions, ô ∈ R^{(W/D) × (H/D) × 2};
the center point offset of each pedestrian's bounding box is constrained using the least absolute deviation (l1) loss function:

l_off = (1/N) Σ_{i=1}^{N} |ô_i − o_i|

where l_off is the least absolute deviation loss l1, the subscript off indicating the sub-network it belongs to, o_i denotes the true quantization offset of the i-th pedestrian, and ô_i denotes the predicted quantization offset of the i-th pedestrian;
combining the heatmap L̂, the bounding box sizes ŝ and the center point offsets ô of the pedestrians into candidate target detection bounding boxes for all pedestrians, then de-duplicating the candidate boxes with the NMS (non-maximum suppression) algorithm and filtering out boxes whose confidence is below the threshold 0.8, to obtain the final target detection bounding box of each pedestrian.
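The de-duplication step can be sketched as standard greedy NMS. The 0.8 confidence threshold comes from the text; the IoU threshold of 0.5 is an assumption for illustration, since the patent does not state one.

```python
def iou(a, b):
    # boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(candidates, conf_thresh=0.8, iou_thresh=0.5):
    """candidates: list of (box, score); returns the de-duplicated boxes."""
    kept = []
    # drop low-confidence boxes first, then greedily keep highest-scoring ones
    pool = sorted((c for c in candidates if c[1] >= conf_thresh),
                  key=lambda c: c[1], reverse=True)
    for box, score in pool:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```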
Further, the re-identification branch module comprises: a first preprocessing module, a first convolution layer, a global average pooling layer and a post-processing module, wherein the number of convolution kernels of the first convolution layer is 128; the re-identification branch module is obtained through training on actual samples;
the specific process of the step s3 includes:
the first preprocessing module cuts the target feature map P_{F,j} of each pedestrian out of the backbone network feature map f corresponding to each frame image, according to the final target detection bounding box of each pedestrian, where F denotes the frame number and j denotes the pedestrian index within the frame;
the first convolution layer performs feature extraction again on the target feature map P_{F,j} of each pedestrian to obtain a target feature map P'_{F,j} with 128 channels; the global average pooling layer sums and averages all pixels of each channel of P'_{F,j}, yielding the spatial embedding feature E_{F,j} of each pedestrian with length 128, E_{F,j} ∈ R^128, where E_{F,j} denotes the spatial embedding feature corresponding to the j-th pedestrian in the frame image;
the post-processing module compares the spatial embedding feature E_{F,j} of each pedestrian one by one with the spatial embedding features E_{F−1,j} of the pedestrians in the previous frame, selects the target with the smallest metric distance as the matching target, and then stores the target feature map P_{F,j} of each pedestrian into the action feature queue Q_{P,id} corresponding to the pedestrian id of the matching target, where a subscript E would denote that the stored data are spatial embedding features of each pedestrian, and the subscript P denotes that the stored data are the region images within the final target detection bounding box of each pedestrian.
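The global average pooling step can be illustrated with a small pure-Python sketch, where channel-major nested lists stand in for the 128-channel feature map; the two-channel example below is illustrative, not the patent's data layout.

```python
def global_average_pool(feature_map):
    """feature_map[c][y][x] -> length-C embedding (C = 128 in the text).

    Averages all spatial positions of every channel, mirroring how the
    target feature map with 128 channels is reduced to a 128-d embedding.
    """
    emb = []
    for channel in feature_map:
        pixels = [v for row in channel for v in row]
        emb.append(sum(pixels) / len(pixels))
    return emb
```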
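The metric-distance matching against the previous frame can be sketched as a nearest-neighbour lookup over embeddings. Euclidean distance is assumed here purely for illustration, since the text does not name the metric.

```python
import math

def match_to_previous(curr, prev):
    """Match each current detection to the closest previous-frame identity.

    curr: {detection_index: embedding}, prev: {pedestrian_id: embedding}.
    Returns {detection_index: pedestrian_id}, each detection taking the
    identity whose embedding is at the smallest Euclidean distance.
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return {i: min(prev, key=lambda pid: dist(e, prev[pid]))
            for i, e in curr.items()}
```

A production tracker would also handle unmatched detections (new identities) and apply a distance cutoff; this sketch shows only the core assignment.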
Further, the action classification module comprises: a second preprocessing module, a second convolution layer group and a plurality of fully connected layers; the action classification module is obtained through training on actual samples;
the specific process of the step s4 includes:
the second preprocessing module uniformly scales the action feature queue Q_{P,id} of length K to 288 × 288 and aggregates it in the channel dimension to extract each pedestrian's information in the time dimension; a feature vector A is then output through the second convolution layer group and the several fully connected layers, A ∈ [0,1]^{num_action}, i.e., the prediction vector of the action category to which each pedestrian belongs, which is the final action retrieval result, where num_action denotes the number of action types in the data set.
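The fixed-length action feature queue Q_{P,id} behaves like a bounded FIFO: once K crops are stored, each new crop evicts the oldest. A `collections.deque` with `maxlen=K` gives exactly these semantics; K = 3 below is purely illustrative.

```python
from collections import deque

def make_action_queue(k):
    """Per-pedestrian queue Q_{P,id}: keeps only the k most recent crops."""
    return deque(maxlen=k)

queue = make_action_queue(3)
for frame_number in range(5):
    # each entry stands in for the cropped target feature map of one frame
    queue.append(f"crop_frame_{frame_number}")
# only the 3 most recent crops remain for the classifier to aggregate
```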
Compared with the prior art, the invention has the following beneficial technical effects:
the invention fuses the embedded characteristics of pedestrian re-recognition in the common action recognition algorithm, adopts a multi-task architecture capable of end-to-end training and optimization, and solves the defect that the common action recognition algorithm is easy to carry out error classification due to the lack of appearance characteristics. Furthermore, in order to meet the increasing requirements for fine tracking, retrieval and analysis of specified pedestrians in video data, the method can continuously track the action recognition result of the specified pedestrians after the pedestrians in the video are retrieved by using the images of the specified pedestrians, and the action recognition result is used for further high-level semantic reference and analysis; meanwhile, the adopted strategy of searching pedestrians first and then identifying actions can greatly reduce the required calculated amount.
Drawings
FIG. 1 is a flow diagram of a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm according to one embodiment;
FIG. 2 is a flow diagram that illustrates one embodiment of specifying a single pedestrian for action retrieval;
FIG. 3 is a flow diagram that illustrates one embodiment for group action retrieval by selecting a portion of features in a query image;
fig. 4 is a schematic diagram of an attribute-based pedestrian re-identification model architecture in the group action retrieval method according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a flowchart illustrating a specified pedestrian action retrieval method based on a pedestrian re-identification algorithm according to an embodiment.
In the embodiment, the specified pedestrian action retrieval method based on the pedestrian re-identification algorithm comprises the following steps:
s1: input each frame of video data acquired in real time by a video acquisition device into a feature extraction backbone network, which processes each frame and extracts a backbone network feature map;
s2: input the backbone network feature map into a pedestrian detection branch module, which processes it and outputs a final target detection bounding box for each pedestrian;
s3: input the backbone network feature map and the final target detection bounding box of each pedestrian into a re-identification branch module, which processes them and outputs an action feature queue for each pedestrian;
s4: the action classification module uniformly scales the action feature queue of each pedestrian to 288 × 288, aggregates it in the channel dimension to extract each pedestrian's information in the time dimension, and performs action classification to obtain the final action retrieval result.
In one embodiment, the specific process of step s1 includes:
sequentially inputting each frame of video data acquired by the video acquisition device into the feature extraction backbone network, which extracts a backbone network feature map corresponding to each frame image, marked as f, with f ∈ R^{(W/D) × (H/D) × B}, where R denotes the real number space, W and H denote the width and height of the input frame, D denotes the spatial down-sampling rate, and B denotes the number of channels of the backbone network feature map f.
In one embodiment, the pedestrian detection branch module includes: a bounding box center point prediction head sub-network, a bounding box size prediction head sub-network, and a center point offset prediction head sub-network;
the bounding box center point prediction head sub-network, the bounding box size prediction head sub-network and the center point offset prediction head sub-network are each obtained through training on actual samples;
the specific process of the step s2 includes:
inputting the backbone network feature map f into the bounding box center point prediction head sub-network, which predicts from f and outputs a heatmap of all pedestrians, L̂ ∈ [0,1]^{(W/D) × (H/D)}, trained with a penalty-reduced focal loss:

l_heat = −(1/N) Σ_{x,y} { (1 − L̂_{x,y})^α · log(L̂_{x,y}), if L_{x,y} = 1; (1 − L_{x,y})^β · (L̂_{x,y})^α · log(1 − L̂_{x,y}), otherwise }

where x and y denote the abscissa and ordinate of each element in the output heatmap L̂, α and β denote hyper-parameters controlling the contribution weight of each center point, L̂_{x,y} denotes the predicted probability that a pedestrian target is centered at coordinates (x, y), and L_{x,y} denotes the ground-truth probability that a pedestrian target is centered at coordinates (x, y);
inputting the backbone network feature map f into the bounding box size prediction head sub-network, which predicts from f and outputs the bounding box size of each pedestrian, ŝ ∈ R^{(W/D) × (H/D) × 2};
the bounding box size of each pedestrian is constrained using the least absolute deviation (l1) loss function:

l_size = (1/N) Σ_{i=1}^{N} |ŝ_i − s_i|

where i ∈ [1, N] denotes the pedestrian index, s_i denotes the true value of the i-th pedestrian's bounding box size, ŝ_i denotes the predicted value of the i-th pedestrian's bounding box size, N denotes the number of pedestrians in the current frame, and l_size is the least absolute deviation loss l1, the subscript size indicating that it constrains the bounding box size prediction;
inputting the backbone network feature map f into the center point offset prediction head sub-network, which predicts from f and outputs the offset of each pedestrian's bounding box center point in the length and width dimensions, ô ∈ R^{(W/D) × (H/D) × 2};
the center point offset of each pedestrian's bounding box is constrained using the least absolute deviation (l1) loss function:

l_off = (1/N) Σ_{i=1}^{N} |ô_i − o_i|

where l_off is the least absolute deviation loss l1, the subscript off indicating the sub-network it belongs to, o_i denotes the true quantization offset of the i-th pedestrian, and ô_i denotes the predicted quantization offset of the i-th pedestrian;
the thermodynamic diagrams corresponding to the pedestriansSize of bounding boxAnd the offset of the center point of the bounding box in both the length and width dimensionsAnd combining the candidate target detection boundary frames of all the pedestrians, and then using an NMS algorithm to perform duplication elimination on the candidate target detection boundary frames of all the pedestrians and screen out the boundary frames with the confidence degrees lower than a threshold value of 0.8 to obtain the final target detection boundary frames of all the pedestrians.
In one embodiment, the re-identification branch module comprises: a first preprocessing module, a first convolution layer, a global average pooling layer and a post-processing module, wherein the number of convolution kernels of the first convolution layer is 128; the re-identification branch module is obtained through training on actual samples;
the specific process of the step s3 includes:
the first preprocessing module cuts the target feature map P_{F,j} of each pedestrian out of the backbone network feature map f corresponding to each frame image, according to the final target detection bounding box of each pedestrian, where F denotes the frame number and j denotes the pedestrian index within the frame;
the first convolution layer performs feature extraction again on the target feature map P_{F,j} of each pedestrian to obtain a target feature map P'_{F,j} with 128 channels; the global average pooling layer sums and averages all pixels of each channel of P'_{F,j}, yielding the spatial embedding feature E_{F,j} of each pedestrian with length 128, E_{F,j} ∈ R^128, where E_{F,j} denotes the spatial embedding feature corresponding to the j-th pedestrian in the frame image;
the post-processing module compares the spatial embedding feature E_{F,j} of each pedestrian one by one with the spatial embedding features E_{F−1,j} of the pedestrians in the previous frame, selects the target with the smallest metric distance as the matching target, and then stores the target feature map P_{F,j} of each pedestrian into the action feature queue Q_{P,id} corresponding to the pedestrian id of the matching target, where a subscript E would denote that the stored data are spatial embedding features of each pedestrian, and the subscript P denotes that the stored data are the region images within the final target detection bounding box of each pedestrian.
In one embodiment, the action classification module comprises: a second preprocessing module, a second convolution layer group and a plurality of fully connected layers; the action classification module is obtained through training on actual samples;
the specific process of the step s4 includes:
the second preprocessing module queues the action characteristics Q with the length of KP,idScaling uniformly to 288 x 288 sizes and aggregating them in the channel dimension to extract information of the individual pedestrians in the time dimension; then the characteristic vector A is output through the second convolution layer group and a plurality of full connection layers, and the A belongs to [0,1 ]]num_actionI.e. the prediction vector for the category to which each pedestrian belongs, i.e. the final action search result, where num _ action represents the number of action types in the data set.
As a practical application of the above specified pedestrian motion retrieval method based on the pedestrian re-recognition algorithm, the following describes an embodiment of the present invention with respect to an application scenario of analyzing a single pedestrian motion sequence. As shown in fig. 2, the step of designating a single pedestrian for action retrieval includes:
A frame-level backbone network feature map is extracted from the determined query picture and input into the trained pedestrian re-identification branch module; features are extracted through a convolution layer with 128 convolution kernels and passed through a global average pooling layer to obtain the query embedding feature E_{Query,i} ∈ R^128, where i ∈ N_q indexes the multiple query pictures of the same pedestrian.
Specifically, all pedestrians in the video to be queried are detected with the pedestrian detection branch module, the spatial embedding features of all pedestrians are extracted, and a candidate query library is provided so that the user can manually update the query spatial embedding features.
For each frame of the input video, the feature extraction backbone network extracts a frame-level backbone network feature map f, and the pedestrian detection branch module outputs the bounding box of each pedestrian. The region corresponding to each pedestrian is cropped from the original image according to that pedestrian's bounding box and denoted I_{F,j}, where F denotes the frame number and j denotes the pedestrian index within the frame.
The target feature map is cropped from the backbone network feature map f according to each target's bounding box and denoted P_{F,j}, where F denotes the frame number and j denotes the target index within the frame; after a convolution layer with 128 convolution kernels further extracts features from P_{F,j}, a global average pooling layer yields the target's spatial embedding feature E_{F,j} ∈ R^128.
I_{F,j} and E_{F,j} are stored in the candidate query library in the format (F, I_{F,j}, E_{F,j}); typically only the most recent 30 frames of data are kept. I_{F,j} is a visual RGB image used to directly present the user with images that may be the query target. When I_{F,j} also belongs to the query target and the user judges that adding this screenshot would improve the query, the E_{F,j} corresponding to I_{F,j} can be added to E_{Query} as a supplementary query embedding feature.
The query embedding feature E_{Query} is compared for similarity with the spatial embedding features E_{F,j} of all pedestrians in the frame; if the similarity is greater than the set threshold 0.9, the candidate target and the query picture are judged to belong to the same person and processing proceeds to the next step; otherwise, the data corresponding to the candidate target are discarded.
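A sketch of this similarity test follows. The 0.9 threshold comes from the text; cosine similarity over the 128-dimensional embeddings is an assumption, since the text does not name the similarity measure.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_query_target(e_query, e_candidate, thresh=0.9):
    """Keep a candidate only when its similarity to the query exceeds thresh."""
    return cosine_similarity(e_query, e_candidate) > thresh
```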
The features of the new frame's candidate target are stored in the action recognition feature queue Q_{P,id} of length K, feature transformation is performed, and after the classification network a feature vector A ∈ [0,1]^{num_action} is output, i.e., the prediction vector of the target's action category; the class with the largest predicted value is taken as the action classification result, denoted cls_F.
The action classification result cls_F and the target detection bounding box B_{F,id} are combined into a single action retrieval result, and the action detection results of the queried target are recorded in time order. Each input video frame corresponds to one retrieval record in the format (F, cls_F, B_{F,id}). After the whole process is finished, the retrieval records can be merged: adjacent records with the same action classification result are merged into one record with the format (N_last, cls_F, B_{F,id}), where N_last denotes the number of frames the action lasted.
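The record-merging step can be sketched directly from the two formats: adjacent per-frame records (F, cls_F, B_{F,id}) with the same classification collapse into (N_last, cls_F, B_{F,id}). Keeping the last bounding box of each run is an assumption made for this sketch; the text does not say which box survives the merge.

```python
def merge_records(records):
    """Merge adjacent per-frame records sharing the same action class.

    records: list of (frame, cls, box) tuples in time order.
    Returns (n_last, cls, box) tuples, where n_last counts how many
    consecutive frames the action lasted.
    """
    merged = []
    for frame, cls, box in records:
        if merged and merged[-1][1] == cls:
            n_last, _, _ = merged[-1]
            merged[-1] = (n_last + 1, cls, box)  # extend the current run
        else:
            merged.append((1, cls, box))         # start a new run
    return merged
```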
Fig. 2 above focuses on action retrieval for a single specified pedestrian. As a counterpart to the method shown in fig. 2, the following describes an embodiment of the present invention that selects a subset of features in the query image for group action retrieval. As shown in fig. 3, the specific steps are as follows:
Train the attribute-based pedestrian re-identification model. The data set adopts Market-1501_Attribute, which is annotated at the ID level with 27 attributes such as gender, bag, and age. Because the annotation is only at the ID level, it suffices to add 27 fully-connected layers to the original pedestrian re-identification model and perform binary classification to predict the attribute values; the model architecture is shown in fig. 4. The Market-1501_Attribute data set contains 1501 objects, i.e., 1501 IDs. During training, an input picture first passes through the feature extraction backbone network to extract a frame-level backbone network feature map; the pedestrian re-identification task is then split into two subtasks. In the first subtask, pedestrian ID classification, the spatial embedding features pass through a fully-connected layer and an activation function to obtain a pedestrian ID classification vector of length 1501, and the class with the largest predicted value is taken as the classification result. In the second subtask, pedestrian attribute classification, for each of the 27 attributes the spatial embedding features pass through a fully-connected layer and an activation function to obtain a pedestrian attribute classification vector of length 2, whose two predicted values represent the likelihood that the pedestrian has and does not have the attribute, respectively; the larger predicted value is taken as the classification result.
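A minimal sketch of the two-subtask head described above, assuming a 128-dimensional shared embedding; the random weight matrices stand in for the trained fully-connected layers:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, NUM_IDS, NUM_ATTRS = 128, 1501, 27

# Illustrative random weights; in the method these come from training.
W_id = rng.standard_normal((EMB, NUM_IDS))
W_attr = rng.standard_normal((NUM_ATTRS, EMB, 2))  # one 2-way head per attribute

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(embedding):
    """ID classification plus 27 independent binary attribute classifications."""
    id_pred = int(np.argmax(softmax(embedding @ W_id)))
    attr_pred = [int(np.argmax(softmax(embedding @ W_attr[k])))
                 for k in range(NUM_ATTRS)]  # per attribute: index of larger score
    return id_pred, attr_pred
```
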
Determine the query picture and, through the pedestrian re-identification (Re-ID) model trained in the previous step, output its spatial embedding feature E_Query,i ∈ R^128 and attribute values P_att. Then select the required attributes from the prediction results. For example, the predicted attributes may include: young, female, short hair, black lower-body clothing, long sleeves, white upper-body clothing, and shorts, while the group the user needs to index only requires the attributes female, black lower-body clothing, and white upper-body clothing. The screened attribute values, together with the query embedding feature E_Query, constitute the target group search basis.
Detect all pedestrians in the video to be queried using the pedestrian detection branch module, extract the spatial embedding features E_F,j and attribute values P_F,j of all pedestrians, and update the candidate query library and the target group search basis.
Screen candidate targets using double search criteria to obtain a target group satisfying the criteria. First, compare the query embedding feature E_Query with the spatial embedding features E_F,j of all pedestrians in the frame for similarity; if the similarity is greater than the set threshold of 0.5, the candidate target is considered to broadly belong to the target group, and the method proceeds to the next step; otherwise, the data corresponding to the candidate target are discarded directly. Second, judge whether the attribute values P_F,j of the candidate target contain all of the screened attributes; if so, proceed to the next step; otherwise, the data corresponding to the candidate target are discarded directly.
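The double search criterion can be sketched as a single predicate, assuming cosine similarity and set containment over attribute names (both assumptions; the text specifies only a 0.5 threshold and an attribute-inclusion check):

```python
import numpy as np

def passes_double_criteria(e_query, e_cand, attrs_required, attrs_cand,
                           sim_threshold=0.5):
    """Candidate passes if (1) embedding similarity exceeds the looser group
    threshold of 0.5 and (2) its predicted attributes contain every
    user-selected attribute."""
    sim = float(np.dot(e_query, e_cand) /
                (np.linalg.norm(e_query) * np.linalg.norm(e_cand)))
    if sim <= sim_threshold:
        return False
    return set(attrs_required).issubset(set(attrs_cand))
```
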
Measure the similarity between the spatial embedding features of each target in the group and the data at the tail of each spatial embedding feature queue. If the similarity is greater than the set threshold of 0.9, the targets are the same person, and the method proceeds to the next step; otherwise, the candidate is regarded as a new target that has entered the video, and a new feature queue is started for it.
Store the target feature maps of the new frame of candidate pedestrians into an action recognition feature queue Q_P,id of length K, perform feature transformation, and pass the result through a classification network to output a feature vector A ∈ [0,1]^num_action, i.e., the prediction vector over the classes to which the target may belong; the class with the largest predicted value is taken as the action classification result, denoted cls_F.
Combine the action classification result cls_F of each person in the group with the corresponding target detection bounding box B_F,id for statistical analysis by the user.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It should be noted that the terms "first/second/third" in the embodiments of the present application merely distinguish similar objects and do not denote a specific ordering; where permitted, "first/second/third" objects may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described.
The terms "comprising" and "having" and any variations thereof in the embodiments of the present application are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, product, or device.
The above-mentioned embodiments express only several embodiments of the present application; their description is specific and detailed, but it is not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.
Claims (5)
1. The specified pedestrian action retrieval method based on the pedestrian re-identification algorithm is characterized by comprising the following steps of:
s1, respectively inputting each frame of video data acquired by the video acquisition equipment in real time into a feature extraction backbone network, wherein the feature extraction backbone network processes each frame and extracts a backbone network feature map;
s2, inputting the backbone network characteristic diagram into a pedestrian detection branch module, and processing the backbone network characteristic diagram by the pedestrian detection branch module and outputting a final target detection boundary box of each pedestrian;
s3, inputting the backbone network characteristic diagram and the final target detection boundary box of each pedestrian into a re-identification branch module, wherein the re-identification branch module processes the backbone network characteristic diagram and the final target detection boundary box of each pedestrian and outputs an action characteristic queue of each pedestrian;
s4, the action classification module scales each pedestrian's action feature queue to 288 × 288, aggregates the frames in the channel dimension to extract information about each pedestrian in the time dimension, and then performs action classification to obtain the final action retrieval result.
2. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 1, wherein the specific process of the step s1 includes:
sequentially inputting each frame of video data acquired by the video acquisition equipment into the feature extraction backbone network, which extracts a backbone network feature map corresponding to each frame of image, denoted f, f ∈ R^((W/D)×(H/D)×B), wherein R denotes a real number space, W denotes the width of the input frame, H denotes the height of the input frame, D denotes the spatial down-sampling rate, and B denotes the number of channels of the backbone network feature map f.
3. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 2,
the pedestrian detection branch module includes: a bounding box center point prediction head sub-network, a bounding box size prediction head sub-network, and a center point offset prediction head sub-network;
the boundary box central point prediction head sub-network, the boundary box size prediction head sub-network and the central point offset prediction head sub-network are obtained through actual sample training respectively;
the specific process of step s2 includes:
inputting the backbone network feature map f into the bounding box center point prediction head sub-network, which predicts from f and outputs a heatmap M̂ ∈ [0,1]^((W/D)×(H/D)) of all pedestrians, trained with a center point focal loss of the form

L_heatmap = −(1/N) Σ_{x,y} (1 − M̂_{x,y})^a · log(M̂_{x,y}), if L_{x,y} = 1; −(1/N) Σ_{x,y} (1 − L_{x,y})^b · (M̂_{x,y})^a · log(1 − M̂_{x,y}), otherwise;

wherein x and y respectively denote the abscissa and ordinate of each element in the output heatmap M̂, a and b denote hyper-parameters controlling the contribution weight of the center point, M̂_{x,y} denotes the predicted probability that a pedestrian target exists with coordinates (x, y) as its center point, L_{x,y} denotes the true probability that a pedestrian target exists with coordinates (x, y) as its center point, and N denotes the number of pedestrians in the current frame;
inputting the backbone network feature map f into the bounding box size prediction head sub-network, which predicts from f and outputs the bounding box size ŝ_i ∈ R^2 of each pedestrian;
the bounding box size of each pedestrian uses the minimum absolute deviation (l1) loss function:

l_size = Σ_{i=1}^{N} |s_i − ŝ_i|

wherein i ∈ [1, N] denotes the pedestrian index, s_i denotes the true value of the i-th pedestrian bounding box size, ŝ_i denotes the predicted value of the i-th pedestrian bounding box size, N denotes the number of pedestrians in the current frame, and l_size is the minimum absolute deviation loss function l1, the subscript size indicating that it constrains the prediction of the bounding box size;
inputting the backbone network feature map f into the center point offset prediction head sub-network, which predicts from f and outputs the offset ô_i ∈ R^2 of each pedestrian's bounding box center point in the length and width dimensions;
the offset of each pedestrian's bounding box center point in the length and width dimensions uses the minimum absolute deviation (l1) loss function:

l_off = Σ_{i=1}^{N} |o_i − ô_i|

wherein l_off is the minimum absolute deviation loss function l1, the subscript off indicating the sub-network it constrains, o_i denotes the true quantization offset of the i-th pedestrian, and ô_i denotes the predicted quantization offset of the i-th pedestrian;
combining the heatmap M̂, the bounding box sizes ŝ_i and the center point offsets ô_i in the length and width dimensions into candidate target detection bounding boxes of all pedestrians, then using the NMS algorithm to de-duplicate the candidate target detection bounding boxes of all pedestrians and screening out the bounding boxes with confidence lower than the threshold of 0.8, to obtain the final target detection bounding boxes of all pedestrians.
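The de-duplication step can be sketched with greedy NMS; the 0.8 confidence threshold comes from the claim, while the IoU threshold of 0.5 is an assumption:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5, conf_threshold=0.8):
    """Greedy NMS: drop boxes below the confidence threshold, then suppress
    boxes that overlap a higher-scoring kept box too much."""
    order = [int(i) for i in np.argsort(scores)[::-1]
             if scores[int(i)] >= conf_threshold]
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < iou_threshold for k in kept):
            kept.append(i)
    return kept
```
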
4. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 3,
the re-identification branch module comprises: a first preprocessing module, a first convolution layer, a global average pooling layer and a post-processing module, wherein the number of convolution kernels of the first convolution layer is 128; the re-identification branch module is obtained by training on actual samples;
the specific process of the step s3 includes:
the first preprocessing module cuts out the target feature map P_F,j of each pedestrian from the backbone network feature map f corresponding to each frame of image, according to the final target detection bounding box of each pedestrian, wherein F denotes the frame number and j denotes the pedestrian index within the frame;
the first convolution layer performs feature extraction again on the target feature map P_F,j of each pedestrian to obtain a target feature map P′_F,j with 128 channels; the global average pooling layer sums and averages all pixels of each channel of the 128-channel target feature map P′_F,j, obtaining the spatial embedding feature E_F,j of each pedestrian with length 128, E_F,j ∈ R^128, wherein E_F,j denotes the spatial embedding feature corresponding to the j-th pedestrian of the frame image;
the post-processing module compares the spatial embedding feature E_F,j of each pedestrian one by one with the spatial embedding features E_F−1,j of the pedestrians in the previous frame, selects the target with the smallest metric distance as the matching target, and then stores the target feature map P_F,j of each pedestrian into the action feature queue Q_P,id corresponding to the pedestrian id of the matching target, where the subscript E denotes that a queue's stored data are the spatial embedding features of each pedestrian and the subscript P denotes that a queue's stored data are the region images within the final target detection bounding box of each pedestrian.
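The post-processing matching and queue update can be sketched as follows, using Euclidean distance as the metric (the claim only says "metric distance") and an illustrative distance gate for deciding when a candidate is a new target:

```python
from collections import deque

import numpy as np

class ReIDTracker:
    """Minimal sketch of the re-identification post-processing: match each new
    embedding to the closest previous-frame embedding and push the target
    feature map into that pedestrian's queue Q_P,id."""

    def __init__(self, k=30, dist_threshold=6.0):
        self.k = k                            # action feature queue length K
        self.dist_threshold = dist_threshold  # illustrative "same person" gate
        self.prev = {}                        # id -> last embedding
        self.queues = {}                      # id -> deque of feature maps
        self.next_id = 0

    def update(self, embedding, feature_map):
        best_id, best_dist = None, float("inf")
        for pid, e in self.prev.items():
            d = float(np.linalg.norm(embedding - e))
            if d < best_dist:
                best_id, best_dist = pid, d
        if best_id is None or best_dist > self.dist_threshold:
            best_id = self.next_id            # new target entering the video
            self.next_id += 1
            self.queues[best_id] = deque(maxlen=self.k)
        self.prev[best_id] = embedding
        self.queues[best_id].append(feature_map)
        return best_id
```
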
5. The pedestrian re-identification algorithm-based specified pedestrian motion retrieval method according to claim 4,
the action classification module comprises: the second preprocessing module, a second convolution layer group and a plurality of full connection layers; the action classification module is obtained through actual sample training;
the specific process of step s4 includes:
the second preprocessing module uniformly scales the action feature queue Q_P,id of length K to 288 × 288 and aggregates it in the channel dimension to extract information about each pedestrian in the time dimension; the result then passes through the second convolution layer group and the plurality of fully-connected layers to output a feature vector A, A ∈ [0,1]^num_action, i.e., the prediction vector over the classes to which each pedestrian belongs, which is the final action retrieval result, wherein num_action denotes the number of action categories in the data set.
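The channel-dimension aggregation and classification can be sketched as follows, with a random linear layer standing in for the trained second convolution layer group and fully-connected layers (the K feature maps are assumed already resized to a common shape):

```python
import numpy as np

def classify_action(queue, num_action=5, rng=np.random.default_rng(0)):
    """Stack the K feature maps along the channel axis (time aggregation),
    pool spatially, and apply an illustrative random linear classifier plus
    softmax to obtain the vector A and the class cls_F."""
    stacked = np.concatenate(queue, axis=0)        # (K*C, H, W): channel aggregation
    x = stacked.mean(axis=(1, 2))                  # stand-in for conv group + pooling
    w = rng.standard_normal((x.size, num_action))  # random weights; trained in practice
    logits = x @ w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs            # cls_F and the vector A
```
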
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210291238.9A CN114708653A (en) | 2022-03-23 | 2022-03-23 | Specified pedestrian action retrieval method based on pedestrian re-identification algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210291238.9A CN114708653A (en) | 2022-03-23 | 2022-03-23 | Specified pedestrian action retrieval method based on pedestrian re-identification algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114708653A true CN114708653A (en) | 2022-07-05 |
Family
ID=82167884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210291238.9A Pending CN114708653A (en) | 2022-03-23 | 2022-03-23 | Specified pedestrian action retrieval method based on pedestrian re-identification algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114708653A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115116094A (en) * | 2022-07-08 | 2022-09-27 | 福州大学 | Real scene pedestrian retrieval method based on sample enhancement and instance perception |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112560999B (en) | Target detection model training method and device, electronic equipment and storage medium | |
CN108596277B (en) | Vehicle identity recognition method and device and storage medium | |
CN105469029B (en) | System and method for object re-identification | |
CN108520226B (en) | Pedestrian re-identification method based on body decomposition and significance detection | |
CN111783576B (en) | Pedestrian re-identification method based on improved YOLOv3 network and feature fusion | |
Parham et al. | Animal population censusing at scale with citizen science and photographic identification | |
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion | |
CN110807434A (en) | Pedestrian re-identification system and method based on combination of human body analysis and coarse and fine particle sizes | |
US20150110387A1 (en) | Method for binary classification of a query image | |
JP2012509522A (en) | Semantic classification for each event | |
CN111178251A (en) | Pedestrian attribute identification method and system, storage medium and terminal | |
CN110008899B (en) | Method for extracting and classifying candidate targets of visible light remote sensing image | |
CN108647703B (en) | Saliency-based classification image library type judgment method | |
CN113762326A (en) | Data identification method, device and equipment and readable storage medium | |
CN113283282A (en) | Weak supervision time sequence action detection method based on time domain semantic features | |
CN115439884A (en) | Pedestrian attribute identification method based on double-branch self-attention network | |
CN110688512A (en) | Pedestrian image search algorithm based on PTGAN region gap and depth neural network | |
CN110956157A (en) | Deep learning remote sensing image target detection method and device based on candidate frame selection | |
CN114708653A (en) | Specified pedestrian action retrieval method based on pedestrian re-identification algorithm | |
Park et al. | Intensity classification background model based on the tracing scheme for deep learning based CCTV pedestrian detection | |
Ahmed et al. | Semantic region of interest and species classification in the deep neural network feature domain | |
CN115050044B (en) | Cross-modal pedestrian re-identification method based on MLP-Mixer | |
CN112651996B (en) | Target detection tracking method, device, electronic equipment and storage medium | |
CN115393802A (en) | Railway scene unusual invasion target identification method based on small sample learning | |
Dutra et al. | Re-identifying people based on indexing structure and manifold appearance modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||