CN111598067A - Re-recognition training method, re-recognition method and storage device in video

Re-recognition training method, re-recognition method and storage device in video

Info

Publication number
CN111598067A
Authority
CN
China
Prior art keywords
animal
picture sequence
feature map
features
picture
Prior art date
Legal status
Granted
Application number
CN202010723115.9A
Other languages
Chinese (zh)
Other versions
CN111598067B (en)
Inventor
张迪
潘华东
罗时现
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202010723115.9A
Publication of CN111598067A
Application granted
Publication of CN111598067B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a re-recognition training method, a re-recognition method and a storage device in a video. The re-recognition training method comprises the following steps: detecting an animal picture sequence in a video by using an animal detection and animal tracking method; extracting time domain features and spatial features of the animal picture sequence, fusing them, and obtaining a feature map of the animal picture sequence; carrying out block processing on the feature map in different sizes along the horizontal dimension, and respectively calculating the losses between the local block feature maps, the global feature map and the real animal; and optimizing the losses during training until convergence to obtain an optimal animal re-identification result. In this way, the time domain features and spatial features of the animal picture sequence are fused, fine-grained learning is performed on different parts of the animal while global features are still learned, and the accuracy and robustness of animal re-identification are improved.

Description

Re-recognition training method, re-recognition method and storage device in video
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a video re-recognition training method, a video re-recognition method, and a storage device.
Background
Surveillance video is widely used in public places such as subways, airports and traffic roads to maintain public safety: pedestrians are detected in the video, and pedestrian re-identification technology is used to find suspects or lost children. Traditional pedestrian re-identification methods mostly retrieve with a single pedestrian picture; in surveillance video, however, a pedestrian's posture, occlusion and environmental background may vary from one moment to another, so retrieval results based on a single picture are not robust. Video-based pedestrian re-identification instead uses multiple pictures from a video sequence for recognition and achieves a better recognition effect.
Current approaches to video-based pedestrian re-identification take a sequence of pedestrian pictures from the surveillance video as input, extract spatio-temporal information with a convolutional neural network and a recurrent neural network, encode this information into a feature vector, and identify pedestrians by computing distances between the feature vectors of individual pedestrians. These methods usually focus only on the global features of the pedestrian sequence and ignore the salient features of key parts such as the face and trunk, so the accuracy of video-based pedestrian re-identification is not high enough.
Disclosure of Invention
The application provides a re-recognition training method, a re-recognition method and a storage device in a video, which can improve the accuracy and robustness of animal re-recognition.
In order to solve the technical problem, one technical solution adopted by the present application is to provide a re-recognition training method in video, which comprises the following steps:
detecting an animal picture sequence in a video by using an animal detection and animal tracking method;
extracting time domain features and space features of the animal picture sequence, fusing the time domain features and the space features and obtaining a feature map of the animal picture sequence;
carrying out block processing on the feature map in different sizes along the horizontal dimension, and respectively calculating the losses between the local block feature maps, the global feature map and the real animal;
and optimizing the losses during training until convergence to obtain an optimal animal re-identification result.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide a re-identification method in video, which comprises the following steps:
detecting a picture sequence of an animal to be detected in a video to be detected by using an animal detection and animal tracking method;
selecting a plurality of pictures to be detected from the picture sequence of the animal to be detected, and performing time domain feature and spatial feature fusion processing and blocking processing on the pictures to be detected to obtain a feature vector of the picture sequence of the animal to be detected;
and comparing the feature vector of the animal picture sequence to be detected with the feature vectors of the animal picture sequences in a preset search base, searching out the target picture with the highest similarity, and outputting a re-identification matching result.
In order to solve the above technical problem, yet another technical solution adopted by the present application is to provide a storage device storing a program file capable of implementing the above re-identification method.
The beneficial effect of this application is: in the above manner, the poor recognition result that a single picture may produce due to animal posture, environmental background and occlusion is avoided; the frames are associated with each other in time and the positions are associated with each other in space; and fine-grained learning is performed on different parts of the animal while the learning of global features is also taken into account, so that the accuracy and robustness of animal re-identification are improved.
Drawings
FIG. 1 is a schematic flow chart of a method for re-recognition training in video according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure of a convolutional neural network in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a non-local attention module in an embodiment of the invention;
FIG. 4 is a flowchart illustrating a video re-recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating the establishment of a default search base according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a video re-recognition training device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a video re-recognition apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a storage device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any indication of the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. All directional indications (such as up, down, left, right, front, and rear … …) in the embodiments of the present application are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indication is changed accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Fig. 1 is a schematic flow chart of a video re-recognition training method according to an embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the method comprises the steps of:
step S101: and detecting an animal picture sequence in the video by using an animal detection and animal tracking method.
In step S101, a surveillance video is first acquired, animal pictures are then extracted from the surveillance video by using an animal detection and animal tracking method, and the extracted animal pictures are assembled into an animal picture sequence. In this embodiment, "animal" refers to a moving object, including but not limited to pedestrians and vehicles.
Step S102: and extracting the time domain features and the space features of the animal picture sequence, fusing the time domain features and the space features and obtaining a feature map of the animal picture sequence.
In step S102, a plurality of pictures are selected from the animal picture sequence by random sampling, sorted in order of shooting time, and divided into a plurality of sub-picture sequences; one picture is randomly selected from each sub-picture sequence and subjected to scaling and random horizontal flipping in turn. The processed pictures are input into a convolutional neural network: a first convolution reduces the number of channels of the input features; the frame dimension of the input features is associated by matrix multiplication, and the width and height dimensions are associated likewise, giving output features in which the time domain features and spatial features are fused; the output features then pass through a second convolution and a third convolution in turn to extract the feature map of the animal picture sequence.
Specifically, taking a pedestrian as an example, a plurality of pictures are selected from the pedestrian picture sequence by random sampling, sorted in order of shooting time, and divided into four sub-picture sequences. During training, one picture is randomly selected from each sub-picture sequence, resized to (384, 128), and then randomly flipped horizontally.
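As a concrete illustration of this sampling and preprocessing step, the sketch below splits a tracked picture sequence into four sub-sequences, randomly picks one frame from each, resizes each frame to (384, 128) and applies random horizontal flipping. The function names and the use of torchvision transforms are assumptions made for illustration, not part of the original description.

```python
# Minimal sketch of the frame sampling and preprocessing described above.
# sample_training_frames and load_clip are illustrative names (assumptions).
import random
import torch
from PIL import Image
from torchvision import transforms

def sample_training_frames(frame_paths, num_chunks=4):
    """Sort a tracked picture sequence by shooting time, split it into
    num_chunks sub-sequences, and randomly pick one frame from each."""
    frame_paths = sorted(frame_paths)  # assumes file names encode shooting time
    chunk_size = max(1, len(frame_paths) // num_chunks)
    chunks = [frame_paths[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]
    return [random.choice(chunk) for chunk in chunks if chunk]

# Resize to (384, 128) and apply random horizontal flipping, as in the embodiment.
preprocess = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

def load_clip(frame_paths):
    frames = [preprocess(Image.open(p).convert("RGB")) for p in sample_training_frames(frame_paths)]
    return torch.stack(frames)  # shape: (T=4, 3, 384, 128)
```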
The convolutional neural network of the present embodiment employs a ResNet50 network into which a non-local attention module for fusing time domain features and spatial features is inserted. Further, as shown in fig. 2, this part of the network comprises a 1 × 1 convolution, a 3 × 3 convolution and another 1 × 1 convolution; the first 1 × 1 convolution reduces the amount of computation by reducing the number of channels of the input picture, and the non-local attention module is placed after this first 1 × 1 convolution.
Further, the non-local attention module performs temporal feature and spatial feature fusion according to the following formula,
$$ y_i = \frac{1}{C} \sum_{\forall j} f(x_i, x_j)\, g(x_j) $$
wherein x and y represent the input features and output features respectively, i is the coordinate of the current position, j ranges over the coordinates of all spatio-temporal positions, f computes the feature correlation between positions i and j, g is a linear embedding of the feature at position j, and C is a normalization coefficient.
The non-local attention module correlates the frame dimension of the input features and also correlates the width and height dimensions, which is computationally expensive; for pedestrian re-identification, where the input features come from four frames of pictures, the computation of a standard non-local attention module rises sharply. In this embodiment, the non-local attention module is simplified and the redundancy of the spatio-temporal features is compressed, which greatly reduces its computation.
Specifically, referring to fig. 2, the processed pictures are input into the convolutional neural network as the initial input and pass through a 1 × 1 convolution, giving a C/4 × H × W matrix, where C is the number of channels, H is the height of the feature map and W is its width. This C/4 × H × W matrix is fed into the non-local attention module as the input features and undergoes time domain and spatial feature fusion, yielding the output features; the output features then pass through a 3 × 3 convolution and a 1 × 1 convolution in turn to extract the feature map of the pedestrian picture sequence, and the output result is a C × H × W matrix.
Further, referring to fig. 3, the non-local attention module performs the following operations: the input features undergo a global pooling operation followed by a 1 × 1 × 1 convolution to obtain a matrix A of size 1 × C'; the input features undergo a maximum pooling operation followed by a 1 × 1 × 1 convolution to obtain a matrix B of size C' × THW/4, where T is the number of frames in the sequence, H is the height of the feature map, W is the width of the feature map and C is the number of channels; the input features undergo another maximum pooling operation followed by a 1 × 1 × 1 convolution to obtain a matrix D of size THW/4 × C'; matrix A is multiplied by matrix B to obtain a matrix C of size 1 × THW/4; matrix C is multiplied by matrix D and passed through a 1 × 1 × 1 convolution to obtain a matrix E of size 1 × C; and matrix E is added to the input features to obtain the output features.
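A hedged PyTorch sketch of this simplified non-local attention module is given below. The text fixes only the matrix shapes and the order of operations; the class name, the channel reduction to C' = C // 2, the use of softmax as the normalization, and the batch handling are assumptions made for illustration.

```python
# Hedged sketch of the simplified non-local attention module of FIG. 3.
import torch
import torch.nn as nn

class SimplifiedNonLocal(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        self.reduced = reduced or channels // 2                    # C' (assumed reduction ratio)
        self.theta = nn.Conv3d(channels, self.reduced, kernel_size=1)  # branch producing matrix A
        self.phi = nn.Conv3d(channels, self.reduced, kernel_size=1)    # branch producing matrix B
        self.g = nn.Conv3d(channels, self.reduced, kernel_size=1)      # branch producing matrix D
        self.out = nn.Conv3d(self.reduced, channels, kernel_size=1)    # restores C channels (matrix E)
        # Max pooling halves H and W, so THW positions become THW/4.
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))

    def forward(self, x):                                  # x: (N, C, T, H, W)
        n = x.shape[0]
        # Matrix A: global average pooling + 1x1x1 conv -> (N, 1, C')
        a = self.theta(x.mean(dim=(2, 3, 4), keepdim=True)).view(n, 1, self.reduced)
        # Matrix B: max pooling + 1x1x1 conv -> (N, C', THW/4)
        b = self.phi(self.pool(x)).view(n, self.reduced, -1)
        # Matrix D: another max pooling + 1x1x1 conv -> (N, THW/4, C')
        d = self.g(self.pool(x)).view(n, self.reduced, -1).transpose(1, 2)
        # Matrix C = A x B -> (N, 1, THW/4); softmax stands in for the normalization (assumption).
        attn = torch.softmax(torch.bmm(a, b), dim=-1)
        # Matrix E = (C x D) restored to C channels by a 1x1x1 conv -> (N, C, 1, 1, 1)
        e = self.out(torch.bmm(attn, d).view(n, self.reduced, 1, 1, 1))
        # Residual addition with the input features gives the fused output.
        return x + e

# Usage: out = SimplifiedNonLocal(channels=512)(torch.randn(2, 512, 4, 24, 8))
```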
Step S103: and carrying out blocking processing on the feature map in different sizes in the horizontal dimension, and respectively calculating the local blocking feature map and the loss between the global feature map and the real animal.
In step S103, the feature map is subjected to blocking processing of different sizes in the horizontal dimension; performing maximum pooling and convolution dimension reduction on the block processing result in sequence to obtain a local block feature map and a global feature map; and calculating the cross entropy loss of the local block feature map, and calculating the triple loss and the cross entropy loss between the global feature map and the real animal.
Specifically, the feature map is divided into three branches, and in the high dimension, the first branch is divided into one piece, the second branch is divided into one piece and two pieces, and the third branch is divided into one piece and three pieces. More specifically, the feature map of the first branch has a size of (12, 4, 2048), the feature map of the second branch has a size of (24, 8, 2048), and the feature map of the third branch has a size of (24, 8, 2048); performing pooling operation on the feature map of the first branch using maximal pooling with kernel (12, 4); for the feature map of the second branch, dividing the feature map into one block by using the maximum pooling of the kernel (24, 8), and dividing the feature map into two blocks by using the maximum pooling of the kernel (12, 8); and for the feature map of the third branch, dividing the feature map into one block by using the maximum pooling with the kernel of (24, 8), dividing the feature map into three blocks by using the maximum pooling with the kernel of (8, 8), and performing convolution dimensionality reduction on the 2048-dimensional feature map obtained by block processing to 256 dimensions, thereby reducing the calculation amount. And (3) the feature map with the blocking result being one block is a global feature map, otherwise, the feature map is a local blocking feature map, then the cross entropy loss of the local blocking feature map is calculated, and the triple loss and the cross entropy loss between the global feature map and the real animal are calculated.
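The three-branch partitioning and the associated losses can be sketched as follows. Only the branch feature-map sizes, the pooling kernels, the 2048-to-256 reduction and the choice of cross-entropy versus triplet loss come from the text; the shared reduction convolution, the shared classifier, the number of identities and the triplet margin are assumptions made for illustration.

```python
# Hedged sketch of the three-branch horizontal block processing of step S103.
import torch
import torch.nn as nn
import torch.nn.functional as F

def partition_branch(fmap, kernels, reduce_conv):
    """Max-pool a branch feature map with each kernel and reduce 2048-d to 256-d.
    The first kernel covers the whole map and yields the global feature; the
    remaining kernels yield the local stripe features."""
    vectors = []
    for k in kernels:
        pooled = F.max_pool2d(fmap, kernel_size=k)        # (N, 2048, stripes, 1)
        reduced = reduce_conv(pooled)                     # (N, 256, stripes, 1)
        vectors.extend(reduced.flatten(2).unbind(dim=2))  # one (N, 256) vector per stripe
    return vectors

reduce_conv = nn.Conv2d(2048, 256, kernel_size=1)   # shared here for brevity (assumption)
classifier = nn.Linear(256, 751)                    # number of identities is an assumption
ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.3)          # margin value is an assumption

# Branch feature maps as sized in the text: (12, 4, 2048) and (24, 8, 2048).
f1 = torch.randn(8, 2048, 12, 4)
f2 = torch.randn(8, 2048, 24, 8)
labels = torch.randint(0, 751, (8,))

v1 = partition_branch(f1, [(12, 4)], reduce_conv)           # 1 global vector
v2 = partition_branch(f2, [(24, 8), (12, 8)], reduce_conv)  # 1 global + 2 local vectors
v3 = partition_branch(f2, [(24, 8), (8, 8)], reduce_conv)   # 1 global + 3 local vectors

# Cross-entropy identity loss on every pooled vector (global and local stripes).
id_loss = sum(ce(classifier(v), labels) for v in v1 + v2 + v3)

# Triplet loss is applied only to the global vectors; selecting anchor/positive/
# negative samples by identity (batch mining) is omitted here for brevity.
def global_triplet_loss(global_vec, anchor_idx, pos_idx, neg_idx):
    return triplet(global_vec[anchor_idx], global_vec[pos_idx], global_vec[neg_idx])
```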
Step S104: and optimizing the loss for training until the training is converged to obtain an optimal animal re-identification result.
In step S104, the Adam optimization algorithm is used to optimize the loss for training until the training converges to obtain the optimal animal re-recognition result.
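A minimal sketch of this optimization step is shown below; only the use of the Adam optimizer comes from the text, while the stand-in model, the learning rate and the dummy data are assumptions for illustration.

```python
# Hedged sketch of step S104: optimizing the combined losses with Adam.
import torch
import torch.nn as nn

model = nn.Linear(256, 751)                                   # stand-in for the full re-identification network
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)   # learning rate is an assumption

for step in range(100):                                       # in practice: iterate until the losses converge
    feats = torch.randn(8, 256)                               # stand-in for fused sequence features
    labels = torch.randint(0, 751, (8,))
    loss = nn.functional.cross_entropy(model(feats), labels)  # stands in for the S103 losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```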
According to the video re-identification training method described above, multiple pictures from the animal's video sequence are used for recognition, which matches the actual surveillance scenario and avoids the poor recognition result that a single picture may produce due to animal posture, environmental background and occlusion. Fine-grained learning is performed on different parts of the animal body while global features are also learned, and end-to-end learning is realized. A non-local attention module is added to the convolutional neural network so that each input frame is associated with the others and each spatial position is associated with the other positions, thereby improving the accuracy and robustness of animal re-identification.
Fig. 4 is a flowchart illustrating a video re-recognition method according to an embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 4 if the results are substantially the same. As shown in fig. 4, the method includes the steps of:
step S401: and detecting the picture sequence of the animal to be detected in the video to be detected by using an animal detection and animal tracking method.
In step S401, a surveillance video is first acquired, animal pictures are then extracted from the surveillance video by using an animal detection and animal tracking method, and the extracted animal pictures are made into an animal picture sequence. The animal of the present embodiment includes, but is not limited to, a pedestrian.
Step S402: selecting a plurality of pictures to be detected from the picture sequence of the animal to be detected, and carrying out time domain feature and space feature fusion processing and blocking processing on the pictures to be detected to obtain a feature vector of the picture sequence of the animal to be detected.
In step S402, the step of selecting a plurality of pictures to be detected from the picture sequence of the animal to be detected specifically includes: selecting a plurality of pictures from the picture sequence of the animal to be detected by random sampling, sorting them in order of shooting time, dividing them into a plurality of sub-picture sequences, randomly selecting one picture from each sub-picture sequence, and applying scaling and random horizontal flipping in turn. The step of performing time domain and spatial feature fusion processing and block processing on the pictures to be detected to obtain the feature vector of the picture sequence of the animal to be detected includes: inputting the processed pictures into a convolutional neural network; performing a first convolution to obtain input feature maps with a reduced number of channels; associating the frame dimension of the input features by matrix multiplication and associating the width and height dimensions to obtain output feature maps in which time domain features and spatial features are fused; carrying out block processing on the feature maps in different sizes along the horizontal dimension; performing maximum pooling and convolutional dimensionality reduction on the block processing results in turn; and then performing a second convolution and a third convolution in turn to extract the feature vector of the animal picture sequence.
Specifically, in the block processing, the feature map is divided into three branches; along the height dimension, the first branch is kept as one block, the second branch is divided into one block and into two blocks, and the third branch is divided into one block and into three blocks. More specifically, the feature map of the first branch has a size of (12, 4, 2048), and the feature maps of the second and third branches each have a size of (24, 8, 2048); the feature map of the first branch is pooled with maximum pooling with kernel (12, 4); for the second branch, maximum pooling with kernel (24, 8) yields one block and maximum pooling with kernel (12, 8) yields two blocks; for the third branch, maximum pooling with kernel (24, 8) yields one block and maximum pooling with kernel (8, 8) yields three blocks; and the 2048-dimensional feature maps obtained by block processing are reduced to 256 dimensions by convolution, which reduces the amount of computation.
Step S403: and comparing the characteristic vector of the animal picture sequence to be detected with the characteristic vector of the animal picture sequence in a preset search base library, searching out a target picture with the highest similarity, and outputting a re-identification matching result.
In step S403, calculating an euclidean distance between the feature vector of the animal picture sequence to be detected and the feature vector of the animal picture sequence in the preset search base; and sequencing the Euclidean distances, and outputting an animal picture sequence of which the minimum Euclidean distance corresponds to a preset search base.
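The comparison step can be sketched as below: Euclidean distances between the query feature vector and the registered feature vectors are computed and sorted so that the minimum-distance entry ranks first. The function and variable names are illustrative assumptions.

```python
# Hedged sketch of step S403: ranking the preset search base by Euclidean distance.
import torch

def rank_gallery(query_vec, gallery_vecs, gallery_ids):
    """query_vec: (D,); gallery_vecs: (M, D); returns ids sorted by ascending distance."""
    dists = torch.cdist(query_vec.unsqueeze(0), gallery_vecs).squeeze(0)  # (M,) Euclidean distances
    order = torch.argsort(dists)
    return [gallery_ids[i] for i in order.tolist()], dists[order]

# Usage: the first entry of the returned ranking is the re-identification match.
ids, dists = rank_gallery(torch.randn(256), torch.randn(100, 256), list(range(100)))
print(ids[0], float(dists[0]))
```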
In this embodiment, the re-identification method further includes establishing the preset search base. As shown in fig. 5, the step of establishing the preset search base includes the following steps (a code sketch follows the list):
Step S501: collecting registered animals in the surveillance video by using an animal detection and animal tracking method, detecting and extracting registration pictures of each registered animal, and forming a registered animal picture sequence for each registered animal;
step S502: marking corresponding animal identity labels for each registered animal picture sequence;
step S503: inputting the registered animal picture sequence into a re-recognition training model to obtain a feature vector of the registered animal picture sequence;
step S504: and establishing a preset search base according to the characteristic vector of the registered animal picture sequence.
The re-identification method in video of the embodiment of the invention adopts the re-recognition training model obtained by the above re-recognition training method in video, which not only reduces the amount of computation in the re-recognition process but also improves the accuracy and robustness of animal re-identification.
Fig. 6 is a schematic structural diagram of a video re-recognition training device according to an embodiment of the present invention. As shown in fig. 6, the apparatus 60 includes a picture sequence acquiring module 61, a feature fusion and extraction module 62, a block processing module 63, and an optimization module 64.
The image sequence acquiring module 61 is configured to detect an animal image sequence in the video by using an animal detection and animal tracking method.
The feature fusion and extraction module 62 is coupled to the picture sequence acquisition module 61, and is configured to extract time-domain features and spatial features of the animal picture sequence, fuse the time-domain features and the spatial features, and obtain a feature map of the animal picture sequence.
The block processing module 63 is coupled to the feature fusion and extraction module 62, and is configured to perform block processing on the feature map in different sizes along the horizontal dimension, and to respectively calculate the losses between the local block feature maps, the global feature map and the real animal.
The optimization module 64 is coupled to the block processing module 63, and is configured to optimize the loss for training until the training converges to obtain an optimal animal re-recognition result.
Fig. 7 is a schematic structural diagram of a video re-recognition apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus 70 includes a picture sequence acquiring module 71, a feature extracting module 72, and a re-identifying module 73.
The image sequence acquiring module 71 is configured to detect an animal image sequence in a video to be detected by using an animal detection and animal tracking method.
The feature extraction module 72 is coupled to the image sequence acquisition module 71, and is configured to select multiple images to be detected from the image sequence of the animal to be detected, input the images to be detected into the re-recognition training model, and obtain feature vectors of the image sequence of the animal to be detected.
The re-recognition training model is obtained by adopting the above-mentioned re-recognition training method in the video, and for the sake of brevity, the re-recognition training method in the video is not repeated herein.
The re-recognition module 73 is coupled to the feature extraction module 72, and configured to compare the feature vector of the animal picture sequence to be detected with the feature vector of the animal picture sequence in the preset search base, search out a target picture with the highest similarity, and output a re-recognition matching result.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a storage device according to an embodiment of the invention. The storage device of the embodiment of the present invention stores a program file 81 capable of implementing all the methods described above, wherein the program file 81 may be stored in the storage device in the form of a software product, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage device includes: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices such as a computer, a server, a mobile phone or a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.

Claims (10)

1. A re-recognition training method in video is characterized by comprising the following steps:
detecting an animal picture sequence in a video by using an animal detection and animal tracking method;
extracting time domain features and space features of the animal picture sequence, fusing the time domain features and the space features and obtaining a feature map of the animal picture sequence;
carrying out block processing on the feature map in different sizes along the horizontal dimension, and respectively calculating the losses between the local block feature maps, the global feature map and the real animal;
and optimizing the losses during training until convergence to obtain an optimal animal re-identification result.
2. The re-recognition training method according to claim 1, wherein the step of carrying out block processing on the feature map in different sizes along the horizontal dimension and respectively calculating the losses between the local block feature maps, the global feature map and the real animal comprises:
carrying out block processing on the feature map in different sizes along the horizontal dimension;
performing maximum pooling and convolution dimension reduction on the block processing result in sequence to obtain the local block feature map and the global feature map;
and calculating the cross entropy loss of the local block feature map, and calculating the triple loss and the cross entropy loss between the global feature map and the real animal.
3. The re-recognition training method of claim 1, wherein the step of extracting the time-domain features and the spatial features of the animal picture sequence, fusing the time-domain features and the spatial features, and obtaining the feature map of the animal picture sequence comprises:
selecting a plurality of pictures from the animal picture sequence;
and inputting the picture into a convolutional neural network, extracting time domain features and space features of the picture, fusing the time domain features and the space features, and obtaining a feature map of the animal picture sequence.
4. The re-recognition training method of claim 3, wherein the step of inputting the picture into a convolutional neural network, extracting temporal features and spatial features of the picture, fusing the temporal features and the spatial features, and obtaining a feature map of the animal picture sequence comprises:
performing first convolution on the picture to obtain input features with the number of channels reduced;
correlating frame dimensions in the input features by using a matrix multiplication mode, correlating width dimensions and height dimensions in the input features, and obtaining output features after feature fusion;
and sequentially carrying out second convolution and third convolution on the output features to extract a feature map of the animal picture sequence.
5. The re-recognition training method of claim 3, wherein the step of selecting a plurality of pictures from the sequence of animal pictures comprises:
selecting a plurality of pictures from the animal picture sequence by adopting a random sampling mode;
and sequencing the plurality of pictures according to the sequence of the shooting time, dividing the pictures into a plurality of sub-picture sequences, randomly selecting one picture from each sub-picture sequence, and sequentially carrying out zooming processing and random horizontal turning processing.
6. A method for re-recognition in video, characterized by comprising the following steps:
detecting a picture sequence of an animal to be detected in a video to be detected by using an animal detection and animal tracking method;
selecting a plurality of pictures to be detected from the picture sequence of the animal to be detected, and performing time domain feature and spatial feature fusion processing and blocking processing on the pictures to be detected to obtain a feature vector of the picture sequence of the animal to be detected;
and comparing the characteristic vector of the animal picture sequence to be detected with the characteristic vector of the animal picture sequence in a preset search base library, searching out a target picture with the highest similarity, and outputting a re-identification matching result.
7. The re-recognition method of claim 6, wherein the step of comparing the feature vector of the image sequence of the animal to be tested with the feature vector of the image sequence of the animal in a preset search base, searching out the target image with the highest similarity, and outputting the re-recognition matching result comprises:
calculating the Euclidean distance between the characteristic vector of the animal picture sequence to be detected and the characteristic vector of the animal picture sequence of a preset search base;
and sorting the Euclidean distances, and outputting the animal picture sequence in the preset search base corresponding to the minimum Euclidean distance.
8. The re-recognition method of claim 6, further comprising:
and establishing a preset search base.
9. The re-recognition method of claim 8, wherein the step of establishing a pre-set search base comprises:
the method comprises the steps of collecting registered animals in a monitoring video by using an animal detection and animal tracking method, detecting and extracting a registration picture of each registered animal, and forming a section of registration animal picture sequence for each registered animal;
marking corresponding animal identity labels for each registered animal picture sequence;
inputting the registered animal picture sequence into a re-recognition training model to obtain a feature vector of the registered animal picture sequence;
and establishing the preset search base according to the characteristic vector of the registered animal picture sequence.
10. A storage device in which a program file capable of implementing the re-recognition method according to any one of claims 6 to 9 is stored.
CN202010723115.9A 2020-07-24 2020-07-24 Re-recognition training method, re-recognition method and storage device in video Active CN111598067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010723115.9A CN111598067B (en) 2020-07-24 2020-07-24 Re-recognition training method, re-recognition method and storage device in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010723115.9A CN111598067B (en) 2020-07-24 2020-07-24 Re-recognition training method, re-recognition method and storage device in video

Publications (2)

Publication Number Publication Date
CN111598067A true CN111598067A (en) 2020-08-28
CN111598067B CN111598067B (en) 2020-11-10

Family

ID=72191884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010723115.9A Active CN111598067B (en) 2020-07-24 2020-07-24 Re-recognition training method, re-recognition method and storage device in video

Country Status (1)

Country Link
CN (1) CN111598067B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015149030A (en) * 2014-02-07 2015-08-20 国立大学法人名古屋大学 Video content violence degree evaluation device, video content violence degree evaluation method, and video content violence degree evaluation program
CN108764096A (en) * 2018-05-21 2018-11-06 华中师范大学 A kind of pedestrian weight identifying system and method
CN110110601A (en) * 2019-04-04 2019-08-09 深圳久凌软件技术有限公司 Video pedestrian weight recognizer and device based on multi-space attention model

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220058396A1 (en) * 2019-11-19 2022-02-24 Tencent Technology (Shenzhen) Company Limited Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium
US11967152B2 (en) * 2019-11-19 2024-04-23 Tencent Technology (Shenzhen) Company Limited Video classification model construction method and apparatus, video classification method and apparatus, device, and medium
CN113177920A (en) * 2021-04-29 2021-07-27 宁波智能装备研究院有限公司 Target re-identification method and system of model biological tracking system
CN113177920B (en) * 2021-04-29 2022-08-09 宁波智能装备研究院有限公司 Target re-identification method and system of model biological tracking system
CN113221776A (en) * 2021-05-19 2021-08-06 彭东乔 Method for identifying general behaviors of ruminant based on artificial intelligence
CN113221776B (en) * 2021-05-19 2024-05-28 彭东乔 Method for identifying general behaviors of ruminants based on artificial intelligence
CN113177528A (en) * 2021-05-27 2021-07-27 南京昊烽信息科技有限公司 License plate recognition method and system based on multi-task learning strategy training network model
CN113177528B (en) * 2021-05-27 2024-05-03 南京昊烽信息科技有限公司 License plate recognition method and system based on multi-task learning strategy training network model

Also Published As

Publication number Publication date
CN111598067B (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111598067B (en) Re-recognition training method, re-recognition method and storage device in video
Qu et al. RGBD salient object detection via deep fusion
Zheng et al. Partial person re-identification
CN107624189B (en) Method and apparatus for generating a predictive model
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN108090435B (en) Parking available area identification method, system and medium
CN109325964B (en) Face tracking method and device and terminal
CN109635686B (en) Two-stage pedestrian searching method combining human face and appearance
CN111046752B (en) Indoor positioning method, computer equipment and storage medium
CN112541448B (en) Pedestrian re-identification method and device, electronic equipment and storage medium
CN111310728B (en) Pedestrian re-identification system based on monitoring camera and wireless positioning
Tian et al. Scene Text Detection in Video by Learning Locally and Globally.
CN110765903A (en) Pedestrian re-identification method and device and storage medium
CN110796074A (en) Pedestrian re-identification method based on space-time data fusion
Bedagkar-Gala et al. Gait-assisted person re-identification in wide area surveillance
Mansour et al. Video background subtraction using semi-supervised robust matrix completion
CN115049731B (en) Visual image construction and positioning method based on binocular camera
CN111444817B (en) Character image recognition method and device, electronic equipment and storage medium
CN108109164B (en) Information processing method and electronic equipment
CN113610967B (en) Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium
CN112949539A (en) Pedestrian re-identification interactive retrieval method and system based on camera position
CN111814624B (en) Gait recognition training method, gait recognition method and storage device for pedestrian in video
CN115018886B (en) Motion trajectory identification method, device, equipment and medium
CN113221922B (en) Image processing method and related device
CN115082854A (en) Pedestrian searching method oriented to security monitoring video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant