CN112668374A - Image processing method and device, re-recognition network training method and electronic equipment

Info

Publication number
CN112668374A
Authority
CN
China
Prior art keywords: image, feature, recognized, network, identified
Legal status: Pending (assumption, not a legal conclusion)
Application number
CN201910985132.7A
Other languages
Chinese (zh)
Inventor
张启坤
高岱恒
吴臻志
Current Assignee
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN201910985132.7A
Publication of CN112668374A

Landscapes

  • Image Analysis (AREA)

Abstract

Embodiments of the disclosure provide an image processing method, an image processing apparatus, a training method for a re-recognition network, and an electronic device. An image to be recognized that includes at least one object to be recognized is acquired, and feature extraction is performed on it to obtain a feature map that includes both global feature information and local feature information. Feature vector information of the at least one object to be recognized is then obtained from this feature map, and the target object in the image to be recognized is determined according to the feature vector information of each object to be recognized and the feature vector information of the target object. The object detection and object recognition operations therefore share the feature map of the image to be recognized, and because the feature map includes both global and local feature information, its feature representation capability is enhanced, which improves the accuracy of object detection and object recognition.

Description

Image processing method and device, re-recognition network training method and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, a training method for a re-recognition network, and an electronic device.
Background
The concept of pedestrian re-identification (ReID) was first proposed at the 2006 CVPR conference. It can be summarized as a secondary image matching technique for known pedestrians, i.e., a technique in which an image of a person is already available and that specific person must be found in an unknown number of images or video frames.
Traditional ReID algorithms extract low-level image features for global representation or local description through complicated and time-consuming manual feature engineering; their performance depends to a great extent on human experience and is generally poor. Pedestrian re-identification is currently usually realized with deep learning, and deep-learning-based ReID generally comprises two parts: first, pedestrian detection, in which a neural network such as DPM (Deformable Parts Model, a component-based object detection algorithm) detects each individual pedestrian in the image; second, re-identification, in which a convolutional neural network is trained on pedestrian images to recognize each pedestrian's identity. This kind of ReID, however, is strongly affected by the quality of the pedestrian detection results.
Disclosure of Invention
In view of this, the present disclosure provides an image processing method, an image processing apparatus, a training method of a re-recognition network, and an electronic device, so as to improve accuracy of object detection and object recognition.
In a first aspect, an embodiment of the present disclosure provides an image processing method, where the method includes:
acquiring an image to be recognized, wherein the image to be recognized comprises at least one object to be recognized;
extracting features of the image to be recognized to obtain a feature map of the image to be recognized, wherein the feature map of the image to be recognized comprises global feature information and local feature information;
acquiring feature vector information of the at least one object to be recognized according to the feature map of the image to be recognized;
and determining the target object in the image to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object.
Optionally, the performing feature extraction on the image to be recognized to obtain a feature map of the image to be recognized includes:
and inputting the image to be recognized into a feature extraction network of a re-recognition network for feature extraction to obtain a feature map of the image to be recognized, wherein the convolution layer of the feature extraction network adopts deformable convolution.
Optionally, the obtaining the feature vector information of the at least one object to be recognized according to the feature map of the image to be recognized includes:
detecting the feature map according to an object detection network of the re-recognition network to obtain at least one detection frame of an object to be recognized;
and performing feature extraction on the at least one detection frame according to an object recognition network of the re-recognition network to obtain the feature vector information of the at least one object to be recognized.
Optionally, the feature map includes at least one feature unit, and detecting the feature map according to the object detection network of the re-recognition network to obtain at least one detection frame of an object to be recognized includes:
performing feature decoding on each feature unit in the feature map to obtain a detection frame corresponding to each feature unit;
and performing de-duplication processing on the detection frames to obtain the at least one detection frame of an object to be recognized.
Optionally, determining the target object among the objects to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object includes:
respectively calculating the similarity between the target object and each object to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object;
and for each object to be recognized, determining the object to be recognized as the target object in response to the similarity between the target object and the object to be recognized being greater than or equal to a similarity threshold.
Optionally, the performing feature extraction on the image to be recognized to obtain a feature map of the image to be recognized includes:
and performing feature extraction on the image to be recognized to obtain feature maps of the image to be recognized at multiple sizes.
Optionally, the method further includes:
and inputting an image of the target object into a re-recognition network for processing to obtain the feature vector information of the target object.
In a second aspect, an embodiment of the present disclosure provides a training method for a re-recognition network, where the method includes:
acquiring a training set, wherein the training set comprises an image group of a plurality of objects, and the image group comprises a plurality of images of the corresponding objects at different angles;
training the re-recognition network according to the training set based on a loss function;
the re-recognition network comprises a feature extraction network, an object detection network and an object recognition network, and the convolution layer of the feature extraction network adopts deformable convolution.
Optionally, the images in the training set have position labeling information of a global region and a plurality of local regions of the images;
training the re-recognition network according to the training set based on the loss function comprises:
and training the re-recognition network according to the position labeling information of each image in the training set, based on a first loss function corresponding to the object detection network and a second loss function corresponding to the object recognition network.
In a third aspect, an embodiment of the present disclosure provides an image processing apparatus, including:
the image acquisition unit is used for acquiring an image to be recognized, wherein the image to be recognized comprises at least one object to be recognized;
the feature extraction unit is used for performing feature extraction on the image to be recognized to obtain a feature map of the image to be recognized, wherein the feature map of the image to be recognized comprises global feature information and local feature information;
the first information acquisition unit is used for acquiring the feature vector information of the at least one object to be recognized according to the feature map of the image to be recognized;
and the target object determination unit is used for determining the target object in the image to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object.
Optionally, the feature extraction unit includes:
and the first feature extraction subunit is used for inputting the image to be recognized into a feature extraction network of a re-recognition network for feature extraction to obtain a feature map of the image to be recognized, wherein the convolution layer of the feature extraction network adopts deformable convolution.
Optionally, the first information obtaining unit includes:
the detection subunit is used for detecting the feature map according to the object detection network of the re-recognition network to obtain at least one detection frame of an object to be recognized;
and the detection frame feature extraction subunit is used for performing feature extraction on the at least one detection frame according to the object recognition network of the re-recognition network to obtain the feature vector information of the at least one object to be recognized.
Optionally, the feature map includes at least one feature unit, where the detection subunit includes:
the feature decoding module is used for performing feature decoding on each feature unit in the feature map to obtain a detection frame corresponding to each feature unit;
and the de-duplication processing module is used for performing de-duplication processing on the detection frames to obtain the at least one detection frame of an object to be recognized.
Optionally, the target object determining unit includes:
the similarity calculation subunit is used for calculating the similarity between the target object and each object to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object;
and the similarity comparison subunit is used for determining, for each object to be recognized, that the object to be recognized is the target object in response to the similarity between the target object and the object to be recognized being greater than or equal to a similarity threshold.
Optionally, the feature extraction unit further includes:
and the second feature extraction subunit is used for performing feature extraction on the image to be recognized to obtain feature maps of the image to be recognized at multiple sizes.
Optionally, the apparatus further comprises:
and the second information acquisition unit is used for inputting an image of the target object into a re-recognition network for processing to obtain the feature vector information of the target object.
In a fourth aspect, the disclosed embodiments provide an electronic device comprising a memory and a processor, the memory being configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of the first aspect of the disclosed embodiments and/or the method of the second aspect of the disclosed embodiments.
In a fifth aspect, the embodiments of the present disclosure provide a computer-readable storage medium, on which a computer program is stored, the program being executed by a processor to implement the method of the first aspect of the embodiments of the present disclosure and/or the method of the second aspect of the embodiments of the present disclosure.
According to the embodiments of the disclosure, an image to be recognized that includes at least one object to be recognized is acquired; feature extraction is performed on the image to be recognized to obtain a feature map that includes global feature information and local feature information; feature vector information of the at least one object to be recognized is acquired according to the feature map of the image to be recognized; and the target object in the image to be recognized is determined according to the feature vector information of each object to be recognized and the feature vector information of the target object. The object detection and object recognition operations can therefore share the feature map of the image to be recognized, and making the feature map include both global and local feature information enhances its feature representation capability, thereby improving the accuracy of object detection and object recognition.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a traditional pedestrian re-identification method of the related art;
FIG. 2 is a schematic diagram of a deep-learning-based pedestrian re-identification method of the related art;
FIG. 3 is a schematic diagram of image comparison before and after alignment of a pedestrian of the related art;
FIG. 4 is a flow chart of an image processing method of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an end-to-end ReID network framework according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an end-to-end re-identification network of an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of multi-size feature extraction of the present embodiment;
FIG. 8 is a schematic diagram of an object recognition network according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a feature extraction network of an embodiment of the present disclosure;
FIG. 10 is a diagram of a standard convolution of the related art;
FIGS. 11-13 are schematic diagrams of a deformable convolution according to an embodiment of the present disclosure;
FIG. 14 is a flow chart of a re-recognition network training method according to an embodiment of the present disclosure;
FIG. 15 is a schematic illustration of a heat map distribution of an embodiment of the disclosure;
FIG. 16 is a schematic diagram of re-recognition network training according to an embodiment of the present disclosure;
FIG. 17 is a schematic loss function diagram of an object recognition network according to an embodiment of the present disclosure;
FIG. 18 is a schematic diagram of an image recognition process of an embodiment of the present disclosure;
FIG. 19 is a schematic diagram of an image processing apparatus according to an embodiment of the disclosure;
FIG. 20 is a schematic diagram of a training apparatus for re-identifying a network according to an embodiment of the present disclosure;
FIG. 21 is a schematic diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
The present disclosure is described below based on examples, but the present disclosure is not limited to only these examples. In the following detailed description of the present disclosure, some specific details are set forth in detail. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present disclosure.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.
Fig. 1 is a schematic diagram of a traditional pedestrian re-identification method of the related art. Traditional ReID methods emphasize the design of hand-crafted low-level feature extraction and perform feature measurement with traditional feature operators (such as SIFT, LBP and HOG). As shown in fig. 1, a traditional ReID method generally partitions the image into a number of image blocks according to a certain proportion, extracts color features or other classical hand-crafted features from each block, normalizes and combines the different features into a mixed feature representation, and then performs feature matching with a feature measurement method such as DTW (Dynamic Time Warping) or LDA (Linear Discriminant Analysis) to carry out pedestrian re-identification.
Because traditional ReID methods represent features based on image color, texture or other traditional image operators, they are very sensitive to illumination changes and image clarity, and their recognition accuracy is low when illumination conditions are poor or the image resolution is low.
Fig. 2 is a schematic diagram of a deep-learning-based pedestrian re-identification method of the related art. A deep-learning-based ReID network can extract not only the color and texture information of an image but also its high-level semantic, spatial and temporal information. As shown in fig. 2, deep-learning-based ReID is generally divided into two steps: first, a pedestrian detection network detects each individual person, and then an identification network determines the pedestrian's identity. Pedestrian detection often uses object detection networks with high detection accuracy or good speed performance, such as the classical DPM or SSD (Single Shot MultiBox Detector) networks. Deep networks such as a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network) are then used for feature extraction to complete pedestrian identity recognition.
Fig. 3 is a diagram illustrating pedestrian images before and after alignment in the related art. To prevent extraneous interference information around a single pedestrian within the detection frame from degrading recognition performance, pedestrian alignment (person alignment) has been proposed as a preprocessing step for a pedestrian re-recognition network, improving recognition accuracy and stability to a certain extent; images before and after pedestrian alignment are shown in fig. 3.
In deep-learning-based ReID networks, the performance of pedestrian re-identification depends heavily on the image content produced by pedestrian detection. Re-identifying the whole pedestrian image does not adequately account for the influence of posture changes on recognition and does not highlight the locally salient information of the human body image, so the pedestrian recognition accuracy of existing deep-learning-based ReID algorithms is low. The embodiments of the disclosure therefore provide an image processing method to improve the accuracy of object detection and object recognition.
In the following description, the present embodiment is described by taking the target object as a pedestrian as an example, and it should be understood that the target object of the embodiment of the present disclosure may also be any animal or other object that needs to be identified, and the embodiment of the present disclosure does not limit this.
Fig. 4 is a flowchart of an image processing method according to an embodiment of the present disclosure. As shown in fig. 4, the image processing method of the present embodiment includes the steps of:
step S100, acquiring an image to be recognized, wherein the image to be recognized comprises at least one object to be recognized. The image to be recognized can be any image or video frame, and the object to be recognized is the same type of object as the target object. For example, the image to be recognized may be an image photographed at an intersection, and the object to be recognized may be a pedestrian at the intersection and recorded by the photographed image.
Step S200, performing feature extraction on the image to be recognized to obtain a feature map of the image to be recognized. The feature map of the image to be recognized comprises global feature information and local feature information. In this embodiment, when feature extraction is performed on the image to be recognized, global features are extracted from the whole image and local features are also extracted from local areas in the image; the feature map of the image to be recognized integrates the global feature information and the local feature information, which improves the feature representation capability of the feature map and thereby the accuracy of object detection and object recognition. For example, for a single pedestrian image, features are extracted both from the whole image and from local regions (for example, the head region and the leg region of the pedestrian), and the global and local feature information of the pedestrian image is integrated into the corresponding feature map, strengthening its feature representation capability so that the pedestrian information in the image can be better determined. The feature map including global and local feature information can be obtained in various ways. For example, feature extraction may be performed with a trained feature extraction network: during training, each image in the training set carries position labels for its global region and its local regions, so that the trained feature extraction network, when applied to the image to be recognized, directly produces a feature map including global and local feature information. Alternatively, global feature extraction and local feature extraction may be performed separately on the image to be recognized, and the results combined into a feature map including global and local feature information. The present disclosure does not limit the manner in which such a feature map is extracted.
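As an illustration of the second manner described above (separate global and local extraction followed by combination), the following Python sketch pools a backbone feature map once globally and once per assumed body region and concatenates the results. The function name, the stripe-based region layout and the use of PyTorch are illustrative assumptions of this sketch, not the patented network.

```python
import torch

def fuse_global_local(feature_map, part_rows):
    """Combine global and local descriptors from one backbone feature map.

    feature_map: (C, H, W) tensor from any feature extraction network.
    part_rows:   list of (top, bottom) row ranges standing in for local
                 regions such as the head, upper body and legs (assumed).
    """
    # Global feature information: pool over the whole spatial extent.
    global_feat = feature_map.mean(dim=(1, 2))                  # (C,)
    # Local feature information: pool each horizontal stripe separately.
    local_feats = [feature_map[:, top:bot, :].mean(dim=(1, 2))  # (C,) each
                   for top, bot in part_rows]
    # Concatenate so downstream layers see both granularities at once.
    return torch.cat([global_feat] + local_feats)   # (C * (1 + len(part_rows)),)
```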
Step S300, acquiring the feature vector information of the at least one object to be recognized according to the feature map of the image to be recognized. In an optional implementation, the feature map of the image to be recognized is detected to obtain a detection frame for at least one object to be recognized, and feature extraction is performed on the features inside each detection frame to obtain the feature vector information of the at least one object to be recognized.
Step S400, determining the target object in the image to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object. In one optional implementation, feature extraction is performed in advance on a target image that includes the target object to obtain the feature vector information of the target object, which is stored in a corresponding database; when an object to be recognized is compared with the target object, the feature vector information of the target object is fetched from that database. In another optional implementation, the feature vector information of the target object is extracted from a target image that includes the target object at the moment the comparison is needed.
In an optional implementation, step S400 specifically includes: respectively calculating the similarity between the target object and each object to be recognized according to their feature vector information, comparing each similarity with a similarity threshold, and determining an object to be recognized as the target object in response to its similarity with the target object being greater than or equal to the similarity threshold. For example, suppose feature vector information is obtained for objects A, B and C to be recognized, the similarity between each of them and the target object is computed, and the three similarities are compared with the similarity threshold: if the similarity between object A and the target object is greater than the threshold while the similarities for objects B and C are smaller, object A may be the target object. The similarity can be computed as the cosine similarity or the Euclidean distance between the feature vector information of the object to be recognized and that of the target object.
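A minimal sketch of this comparison step, assuming PyTorch tensors and an illustrative threshold of 0.7 (the patent only requires some similarity threshold):

```python
import torch
import torch.nn.functional as F

def match_target(candidate_feats, target_feat, threshold=0.7):
    """Flag the objects to be recognized whose cosine similarity with the
    target object's feature vector reaches the similarity threshold.

    candidate_feats: (N, D) feature vectors of N objects to be recognized.
    target_feat:     (D,) feature vector of the target object.
    """
    sims = F.cosine_similarity(candidate_feats, target_feat.unsqueeze(0), dim=1)
    return sims >= threshold  # boolean mask; True marks a possible target object
```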
In this embodiment, an image to be recognized including at least one object to be recognized is obtained, feature extraction is performed on the image to be recognized, a feature map including global feature information and local feature information is obtained, feature vector information of the at least one object to be recognized is obtained according to the feature map of the image to be recognized, and a target object in the image to be recognized is determined according to the feature vector information of the object to be recognized and the feature vector information of the target object.
In an alternative implementation, the image processing method of this embodiment may be implemented based on an end-to-end ReID network, so that object detection and object recognition are performed on the same feature map, improving both the accuracy and the efficiency of recognition. Fig. 5 is a schematic diagram of an end-to-end ReID network framework according to an embodiment of the disclosure. As shown in fig. 5, a CNN performs feature extraction on the input image to obtain corresponding feature maps, the feature information of each candidate person is extracted from the feature maps and fed into a RoI Align module, the output of the RoI Align module is L2-normalized, and the normalized feature information is compared with the feature information of the target object; whether the input image contains the target object is then decided from the feature distance or similarity information. RoI Align aligns the extracted features with the region of interest of the image, so that this region becomes the focus of analysis and the image can be analyzed more accurately.
Optionally, taking a ResNet-50 structure as the base CNN, an image input to the network yields 1024-channel feature maps. A prediction network (Pedestrian Proposal Net) transforms the feature maps with a 512-channel 3×3 convolutional layer, a number of anchors are then placed at each feature map location and each anchor is classified as pedestrian or not; the prediction frames that survive non-maximum-suppression de-duplication are retained from the candidate frames of the anchors, and the image blocks under these prediction frames are sent to an identification network (Identification Net) to obtain the feature information of the object in each frame, on which the pedestrian detection and re-identification operations are based.
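A sketch of the prediction head just described, under stated assumptions: only the 1024-channel input and the 512-channel 3×3 convolution come from the text; the anchor count (9) and the per-anchor outputs (one pedestrian score plus four box offsets) are typical choices, not confirmed by the patent.

```python
import torch.nn as nn

class PedestrianProposalHead(nn.Module):
    """Anchor-based prediction over a 1024-channel backbone feature map."""

    def __init__(self, num_anchors=9):  # 9 anchors per location: an assumption
        super().__init__()
        self.conv = nn.Conv2d(1024, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors, kernel_size=1)      # pedestrian score
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # box offsets

    def forward(self, feature_map):
        h = self.conv(feature_map).relu()
        return self.cls(h), self.reg(h)  # per-anchor scores and candidate frames
```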
Fig. 6 is a schematic diagram of an end-to-end re-recognition network according to an embodiment of the disclosure. The re-recognition network of this embodiment is based on the end-to-end ReID network framework shown in fig. 5. As shown in fig. 6, the re-recognition network 6 of this embodiment includes a feature extraction network 61, an object detection network 62 and an object recognition network 63. The convolution layers of the feature extraction network 61 adopt deformable convolution, whose offsets are obtained through pre-training, so that the sampling positions of the convolution kernels at different locations can adapt to the image content and thus to changes in the shape, posture and size of different objects. A feature extraction network with deformable convolution therefore extracts features better under the geometric deformations of different objects in the image, which in turn improves the recognition accuracy of the re-recognition network.
The feature extraction network 61 is configured to obtain a feature map f1 of the image to be recognized, where f1 includes the global feature information and the local feature information of the image. In this embodiment, a RoI Align operation is performed on f1 by the feature extraction network 61 to obtain a feature map f2. The feature map f2 consists of a number of feature units f21, and each feature unit corresponds to at least one detection frame. Assuming f2 has size 13×13, it can be divided into 13×13 feature units f21 (grid cells); if the center coordinates of an object to be recognized fall within a certain feature unit f21, that object is predicted from this feature unit.
The object detection network 62 detects the feature map f2 and obtains at least one detection frame of an object to be recognized from the feature units f21. In an optional implementation, feature decoding is performed on each feature unit f21 of f2 to obtain the detection frames corresponding to each unit, and the detection frames are de-duplicated to obtain at least one detection frame f21' of an object to be recognized. Optionally, the de-duplication can be performed by non-maximum suppression. In this way the object detection network 62 detects the feature map f2 of the image to be recognized and obtains the position information of at least one detection frame f21' of an object to be recognized.
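The non-maximum-suppression de-duplication can be sketched as follows using torchvision's stock `nms` operator; the 0.5 IoU threshold is an assumed, typical value rather than one stated in the patent:

```python
import torch
from torchvision.ops import nms

def dedupe_detection_frames(boxes, scores, iou_thresh=0.5):
    """De-duplicate decoded detection frames by non-maximum suppression.

    boxes:  (N, 4) tensor in (x1, y1, x2, y2) form.
    scores: (N,) objectness score of each decoded frame.
    """
    keep = nms(boxes, scores, iou_thresh)  # indices of the surviving frames
    return boxes[keep], scores[keep]
```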
The object recognition network 63 is configured to perform feature extraction on the at least one detection frame f21' to obtain the feature vector information of the at least one object to be recognized, compute the cosine similarity between the feature vector information of the target object and that of each object to be recognized, and compare each resulting similarity with a similarity threshold in order to determine the target object among the objects to be recognized. In this embodiment, when the similarity between an object to be recognized and the target object is greater than or equal to the similarity threshold, that object is determined to be the target object. Optionally, the re-recognition network 6 may also be used to extract features from an image containing the target object, producing the target object's feature vector information, which is stored in the corresponding database.
It should be understood that if several objects to be recognized have similarities with the target object that are greater than or equal to the threshold, the object recognition network 63 may take the object with the maximum similarity as the target object, or it may flag all of them as target objects and leave the final determination to other means (e.g., subsequent manual inspection). Although cosine similarity is used above, other similarity measures, such as the Euclidean or Hamming distance between feature vectors, are equally applicable, and this embodiment is not limited in this respect.
In this embodiment, the object detection network 62 and the object recognition network 63 both operate on the feature map f2 of the image to be recognized. After the object detection network 62 determines the position information of each detection frame f21' within f2, the object recognition network 63 extracts the feature vector information of the image block corresponding to each detection frame f21' (i.e., the feature vector information of the object to be recognized); the end-to-end re-recognition network thus lets the detection and recognition networks share the feature map of the image to be recognized, improving the accuracy of object detection and object recognition. Optionally, while the object detection network scans the feature map, the feature vector information of the image block under each detection frame it marks can be extracted as soon as that frame is found, so that the recognition operation proceeds alongside detection. Detection and recognition are thereby carried out simultaneously, improving the efficiency of object re-identification.
In an alternative implementation, step S200 may include: performing feature extraction on the image to be recognized to obtain feature maps of the image to be recognized at multiple sizes.
Fig. 7 is a schematic diagram of the multi-size feature extraction of this embodiment. As shown in fig. 7, the image 71 to be recognized is input into the re-recognition network 6, and after the feature extraction network 61 applies convolution, sampling, residual processing, RoI Align and so on, feature maps of the image 71 at several sizes are output; in this embodiment three sizes are produced, namely Nx13x13x27, Nx26x26x27 and Nx52x52x27. It should be understood that this embodiment does not limit the number or sizes of the feature maps output by the feature extraction network. The three feature maps contain 13×13, 26×26 and 52×52 feature units respectively; assuming each feature unit corresponds to 3 detection frames, the object detection network 62 decodes the feature units into (13×13 + 26×26 + 52×52) × 3 = 10647 detection frames, and non-maximum-suppression de-duplication over these frames yields at least one detection frame of an object to be recognized. De-duplicating the detection frames obtained from feature maps of different sizes fuses the features of the different sizes, further strengthening the feature representation capability and thereby the accuracy of object detection and object recognition.
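The detection-frame count quoted above can be checked directly:

```python
# Three scales, three detection frames per feature unit.
cells = 13 * 13 + 26 * 26 + 52 * 52   # 3549 feature units in total
frames = cells * 3                    # candidate detection frames before NMS
assert frames == 10647
```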
Fig. 8 is a schematic diagram of an object recognition network according to an embodiment of the disclosure. As shown in fig. 8, the object recognition network 63 may include a global average pooling layer 81 (Global Average Pool), a fully connected layer 82 (FC), a normalization layer 83 (e.g., L2-norm normalization) and a recognition unit 84. The object recognition network 63 takes the image block corresponding to each detection frame, and after processing by the global average pooling layer 81, the fully connected layer 82 and the normalization layer 83, the feature vector information of each object to be recognized is obtained. The recognition unit 84 computes a distance (e.g., cosine similarity or Euclidean distance) between the feature vector information of the object to be recognized and that of the target object to obtain their similarity, and then decides whether the object to be recognized is the target object according to this similarity and the preset similarity threshold. Optionally, the object recognition network 63 obtains the feature vector information of the target object from the corresponding database.
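A sketch of this recognition branch in PyTorch, assuming 1024 input channels and a 256-dimensional embedding (neither size is fixed by the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectIdentityHead(nn.Module):
    """Global average pooling -> fully connected layer -> L2 normalization,
    mirroring layers 81-83 of fig. 8."""

    def __init__(self, in_channels=1024, embed_dim=256):  # sizes assumed
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling layer
        self.fc = nn.Linear(in_channels, embed_dim)

    def forward(self, roi_feat):              # (N, C, H, W) detection-frame features
        x = self.pool(roi_feat).flatten(1)    # (N, C)
        x = self.fc(x)                        # (N, embed_dim)
        return F.normalize(x, p=2, dim=1)     # unit length: cosine = dot product
```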
Fig. 9 is a schematic diagram of a feature extraction network according to an embodiment of the disclosure. In an alternative implementation, the feature extraction network 61 is derived from the darknet53 network structure: starting from the darknet53 backbone used in YOLOv3, its standard convolutional layers are replaced with deformable convolutional layers to obtain the feature extraction network 61 of this embodiment. As shown in fig. 9, the feature extraction network 61 includes a deformable convolution layer 91 and residual modules 92-96; the residual block 92', comprising two convolutional layers, shows the concrete structure of the residual module 92. In this embodiment the layer 91 uses deformable convolution, whose offsets are obtained through pre-training.
In theory, the more layers a neural network has, the stronger its translation and rotation invariance, a property that benefits the robustness of classification models. For the end-to-end re-recognition network of this embodiment, however, the recognition and localization of target objects require the network to perceive position information well, and excessive translational and rotational invariance impairs this: as the layers of the darknet53 structure deepen, its sensitivity to the position information of objects decreases and the accuracy of the detection boxes (Bounding Boxes) drops, so the object detection accuracy of the plain darknet53 structure is low. To eliminate or weaken the limitation that the regular kernels of standard convolution adapt poorly to geometric deformation, standard convolution is replaced by deformable convolution, strengthening the network's ability to model the deformations of objects in different postures and improving the accuracy of object detection and object recognition.
In the deformable convolution layer 91, an offset is added to the position of each sampling point of the convolution kernel, so that the kernel samples around its current position according to the learned offsets instead of being restricted to the regular sampling grid of a standard kernel; this gives a good representation of changes in object posture and improves the accuracy of object detection and object recognition.
Fig. 10 is a diagram of a standard convolution of the related art, and figs. 11-13 are schematic diagrams of a deformable convolution according to an embodiment of the present disclosure. As shown in figs. 10-13, the convolution kernel of this embodiment adds offsets to the standard kernel, and the differing offsets in figs. 11-13 illustrate the deformable kernel's ability to express features of images at different sizes, proportions and rotation angles. In this embodiment, the offsets of the deformable convolution in the feature extraction network 61 are learned during the training of the re-recognition network, so that their magnitude and position adjust dynamically to the image content to be recognized; in other words, the sampling positions of the deformable convolution kernels at different locations adapt to the image content and therefore to geometric deformations such as the shapes and sizes of different objects. The accuracy of object detection and object recognition can thus be further improved.
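torchvision ships a deformable convolution operator matching this description: a small standard convolution predicts two offsets (dy, dx) per kernel sampling point per output location, and these offsets are learned along with everything else. The channel sizes below are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """One deformable convolution layer with learned sampling offsets."""

    def __init__(self, in_ch=64, out_ch=64, k=3):  # channel sizes assumed
        super().__init__()
        # 2 offsets (dy, dx) per sampling point of the k x k kernel.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

    def forward(self, x):
        # The kernel samples near each position according to the offsets,
        # no longer restricted to the regular grid of a standard kernel.
        return self.deform(x, self.offset(x))
```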
In this embodiment, the image to be recognized, containing at least one object to be recognized, is input into the re-recognition network; the feature extraction network extracts a feature map that includes global and local feature information; the object detection network detects the feature map to obtain at least one detection frame of an object to be recognized; and the object recognition network obtains the feature vector information of each object to be recognized, from which, together with the feature vector information of the target object, the target object in the image is determined. The end-to-end re-recognition network thus lets the object detection network and the object recognition network share the feature map of the image to be recognized, improving the accuracy of object detection and object recognition. In addition, because the convolution layers of the feature extraction network use deformable convolution, the network adapts to geometric deformations such as the shapes and sizes of different objects, further improving accuracy.
In an optional implementation, the image processing method of this embodiment further includes: training the re-recognition network of this embodiment.
Fig. 14 is a flowchart of a re-recognition network training method according to an embodiment of the present disclosure, and as shown in fig. 14, the re-recognition network training method according to the embodiment includes the following steps:
step S141, a training set is acquired. Wherein the training set comprises image groups of a plurality of objects, each image group comprising images of a plurality of different poses or angles of the corresponding object, such as poses or angles of a pedestrian's front, back, left, right, walking, running, etc. Alternatively, a data set such as CUHK03 or PAP2.0 may be used as the training set of the re-recognition network of the present embodiment. In this embodiment, a plurality of images in the training set are first carded, M (M is greater than 1) images of K (K is greater than 1) objects are selected, and the re-recognition network is trained as a training batch. Wherein the M images of each object have different poses of the object (e.g., different poses when a pedestrian walks, etc.). Therefore, the capability of the re-recognition network of the embodiment in detecting and recognizing objects in different postures can be improved.
Step S142, training the re-recognition network according to the training set based on a loss function. The re-recognition network comprises a feature extraction network, an object detection network and an object recognition network, and the convolution layers of the feature extraction network adopt deformable convolution so as to adapt to geometric deformations of different objects such as their shapes and sizes. The feature extraction network performs feature extraction on the image to be recognized to obtain a feature map that includes global and local feature information; the object detection network detects the feature map to obtain at least one detection frame of an object to be recognized; and the object recognition network obtains the feature vector information of each object to be recognized and determines the target object in the image to be recognized based on the feature vector information of the objects to be recognized and that of the target object.
In an optional implementation manner, the images in the training set of this embodiment have position labeling information of a global region and a plurality of local regions of the image, so as to obtain a feature map including global feature information and local feature information of the image, thereby improving feature expression capability of the feature map of the image.
Fig. 15 is a schematic illustration of a heat map distribution of an embodiment of the disclosure. As shown in fig. 15, in the heat map distribution of a pedestrian, regions such as the head and the upper body carry the key information. Therefore, in an optional implementation, the pedestrian images in the training set are divided into three local areas, namely a head area, an upper body area and a lower body area; the global area of each pedestrian image and these three local areas are labeled with position information, and the labeled images are input into the re-recognition network to train it.
Fig. 16 is a schematic diagram of re-recognition network training according to an embodiment of the disclosure. As shown in fig. 16, a number of images of a pedestrian in different postures are input to the re-recognition network, each carrying position labeling information for the global region, the head region, the upper body region and the lower body region. The feature extraction network of the re-recognition network extracts global and local-region features from these images and applies convolution, sampling, residual connections, feature map concatenation, RoI Align and so on to obtain feature maps of the pedestrian at three sizes, each of which therefore includes the pedestrian's global and local feature information. The object detection network and the object recognition network of the re-recognition network are then trained on the three-size feature maps of the plurality of objects.
The re-recognition network of this embodiment extracts features from the global and local areas of a plurality of objects in different postures, obtains feature maps of several sizes for these objects, and is trained on those feature maps, which improves its object recognition accuracy. It should be understood that although this embodiment is illustrated with three sizes, it is not limited thereto.
In an alternative implementation, step S142 may include:
and training the re-recognition network according to the position labeling information of each image in the training set, based on a first loss function corresponding to the object detection network and a second loss function corresponding to the object recognition network. The object detection network's extraction of the position information of the detection frames is governed by the first loss function, and the object recognition network's extraction of identity features is governed by the second loss function.
In this embodiment, the re-recognition network is trained according to the position labeling information of each image in the training set, so that the feature map output by the feature extraction network includes global and local feature information; based on the first loss function of the object detection network and the second loss function of the object recognition network, the parameters of the feature extraction network, the offsets of the deformable convolution, the parameters of the object detection network and the parameters of the object recognition network are adjusted according to the outputs produced during training.
In an alternative implementation, the first loss function may be the YOLO loss function, i.e., the loss applied in the YOLO network, which comprises four parts: (1) a loss on the center coordinates of the predicted detection frame; (2) a loss on the width and height of the detection frame; (3) a loss on the predicted category; (4) a loss on the confidence of whether the prediction contains an object to be recognized.
Fig. 17 is a schematic loss function diagram of an object recognition network according to an embodiment of the disclosure. In an alternative implementation, the second loss function may be a ternary loss function (triplet loss). As shown in fig. 17, the input data includes anchor samples 171, positive samples 172 and negative samples 173. Initially the anchor sample 171 is far from the positive sample 172 (the feature vector of the anchor sample 171 is at a larger angle to that of the positive sample 172), while the anchor sample 171 is close to the negative sample 173 (the feature vector of the anchor sample 171 is at a relatively smaller angle to that of the negative sample 173). After training, as shown in fig. 17, the anchor sample 171 moves closer to the positive sample 172 and its feature vector moves farther from the negative sample 173. The trained object recognition network can therefore determine whether an object to be recognized is the target object from the distance between their feature vectors.
Optionally, the ternary loss function takes the following form (reconstructed here from the surrounding definitions as the batch-hard triplet loss over the K objects and M images per object described above):

$$L = \sum_{i=1}^{K}\sum_{a=1}^{M}\Big[\, m + \max_{p=1,\dots,M} D\big(\theta(x_a^i),\theta(x_p^i)\big) - \min_{\substack{j=1,\dots,K,\; n=1,\dots,M \\ j \neq i}} D\big(\theta(x_a^i),\theta(x_n^j)\big) \Big]_+$$

where $x_a^i$ denotes the a-th image of the i-th person, $\theta$ is the mapping learned by the neural network from an image to its feature vector, $D(a, b)$ is the Euclidean distance between two feature vectors a and b, and m is the minimum margin between the maximum feature distance and the minimum feature distance; optionally, m is 0.5.
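For reference, PyTorch's built-in triplet loss uses the same Euclidean distance D and margin m; a minimal sketch with the m = 0.5 mentioned above and an assumed 256-dimensional embedding:

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.5)  # Euclidean distance by default (p=2)

anchor   = torch.randn(8, 256)  # feature vectors theta(x_a^i) of anchor samples
positive = torch.randn(8, 256)  # same identity as the anchor
negative = torch.randn(8, 256)  # different identity
loss = triplet(anchor, positive, negative)  # pulls positives in, pushes negatives away
```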
In this embodiment, the re-recognition network is constructed on an end-to-end ReID network in which the convolution layers of the feature extraction network use deformable convolution. The network is trained on images of a plurality of objects in different postures or from different angles, all carrying position labeling information, and during training the parameters of the feature extraction network, the offsets of the deformable convolution, the parameters of the object detection network and the parameters of the object recognition network are adjusted based on the first and second loss functions. The trained re-recognition network can therefore adjust dynamically to the image content to be recognized, adapting to geometric deformations such as the shapes and sizes of different objects, and the feature map output by the feature extraction network includes global and local feature information, strengthening its feature representation capability. Training the re-recognition network by the method of this embodiment thus improves the accuracy of object detection and object recognition.
Fig. 18 is a schematic diagram of an image recognition process according to an embodiment of the disclosure. As shown in fig. 18, the image to be recognized fig1 is input into the re-recognition network 18. The feature extraction network 181 performs feature extraction on the image to obtain three feature maps of different sizes, each including global and local feature information and each consisting of at least one feature unit. The object detection network 182 performs feature decoding on each feature unit of each feature map to obtain the corresponding detection frames, and applies non-maximum-suppression de-duplication to obtain the detection frames f1-f5 of the objects to be recognized. The object recognition network 183 performs feature extraction on the image blocks corresponding to the detection frames f1-f5 to obtain the feature vector of each object to be recognized, fetches the feature vector of the target object from the feature vector database D, computes the cosine similarity between each object's feature vector and the target's, and compares each similarity with the similarity threshold to determine the target object in the image. In this example the similarity between the object corresponding to detection frame f1 and the target object is greater than or equal to the threshold, so the label tag1 corresponding to f1 is output as a mark in the image fig1.
In the present embodiment, the target object image fig2 is input into the re-recognition network 18 to obtain the feature vector of the target object, which is then stored in the feature vector database D. In this way, when the re-recognition network 18 recognizes the target object, its feature vector can be read directly from the database D.
It should be understood that the pedestrian re-identification process shown in fig. 18 is merely exemplary and is intended to present the object detection and recognition process of this embodiment simply and clearly; it does not correspond one-to-one to the actual processing of the re-recognition network in this embodiment.
In this embodiment, the re-recognition network is an end-to-end ReID network whose feature extraction layers use deformable convolution, trained as described above on multiple position-labeled images of multiple objects in different postures or at different angles, with the parameters of the feature extraction network, the deformable-convolution offsets, and the parameters of the object detection and object recognition networks adjusted under the first and second loss functions. The trained network therefore adapts dynamically to the content of the image to be recognized, accommodates geometric variations such as object shape and size, and outputs feature maps carrying both global and local feature information with enhanced expression capability. Performing object detection and object recognition on an image through the re-recognition network of this embodiment accordingly improves the accuracy of both.
Fig. 19 is a schematic diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 19, the image processing apparatus 19 of the present embodiment includes an image acquisition unit 191, a feature extraction unit 192, a first information acquisition unit 193, and a target object determination unit 194.
The image acquiring unit 191 is configured to acquire an image to be recognized, where the image to be recognized includes at least one object to be recognized.
The feature extraction unit 192 is configured to perform feature extraction on the image to be recognized to obtain a feature map of the image to be recognized, where the feature map includes global feature information and local feature information. Optionally, the feature extraction unit 192 includes a first feature extraction subunit 1921, configured to input the image to be recognized into the feature extraction network of the re-recognition network for feature extraction to obtain the feature map, where the convolution layers of the feature extraction network use deformable convolution. Optionally, the feature extraction unit 192 further includes a second feature extraction subunit 1922, configured to perform feature extraction on the image to be recognized to obtain feature maps of multiple sizes.
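The second feature extraction subunit corresponds to the multi-scale extraction shown in fig. 18. A sketch of a backbone returning feature maps at three sizes follows; the stage structure and channel counts are assumptions, not the patent's architecture:

```python
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    """Illustrative backbone producing three feature maps of different
    sizes, each downsampled by a further factor of two."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, ch, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(2 * ch, 4 * ch, 3, 2, 1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)   # largest feature map
        f2 = self.stage2(f1)  # medium feature map
        f3 = self.stage3(f2)  # smallest feature map
        return f1, f2, f3
```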
The first information obtaining unit 193 is configured to obtain the feature vector information of the at least one object to be identified according to the feature map of the image to be recognized. Optionally, the first information obtaining unit 193 includes a detection subunit 1931 and a detection frame feature extraction subunit 1932. The detection subunit 1931 is configured to detect the feature map according to the object detection network of the re-recognition network to obtain at least one detection frame of an object to be identified. The detection frame feature extraction subunit 1932 is configured to perform feature extraction on the at least one detection frame according to the object recognition network of the re-recognition network to obtain the feature vector information of the at least one object to be identified.
Optionally, the feature map includes at least one feature unit, and the detection subunit 1931 includes a feature decoding module 1931a and a deduplication processing module 1931b. The feature decoding module 1931a is configured to perform feature decoding on each feature unit in the feature map to obtain a detection frame corresponding to each feature unit. The deduplication processing module 1931b is configured to perform deduplication processing on the detection frames to obtain at least one detection frame of an object to be identified.
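The deduplication step is described only as non-maximum suppression; a standard IoU-based NMS sketch follows as one plausible realization (the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions):

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, each as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep
```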
The target object determining unit 194 is configured to determine the target object in the image to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object. Optionally, the target object determining unit 194 includes a similarity calculation subunit 1941 and a similarity comparison subunit 1942. The similarity calculation subunit 1941 is configured to calculate the similarity between the target object and each object to be identified according to the feature vector information of each object to be identified and the feature vector information of the target object. The similarity comparison subunit 1942 is configured to, for each object to be identified, determine that object to be the target object in response to the similarity between the target object and that object being greater than or equal to a similarity threshold.
Optionally, the image processing apparatus 19 of this embodiment further includes a second information acquiring unit 195, configured to input the target object image into the re-recognition network for processing to obtain the feature vector information of the target object.
According to the embodiments of the present disclosure, an image to be recognized that includes at least one object to be recognized is acquired; feature extraction is performed on it to obtain a feature map including global feature information and local feature information; feature vector information of the at least one object to be recognized is obtained from that feature map; and the target object in the image to be recognized is determined from the feature vector information of each object to be recognized and that of the target object. Object detection and object recognition thus share the feature map of the image to be recognized, and because the feature map carries both global and local feature information, its feature expression capability is enhanced and the accuracy of object detection and object recognition is improved.
Fig. 20 is a schematic diagram of a training apparatus for a re-recognition network according to an embodiment of the present disclosure. In this embodiment, the re-recognition network includes a feature extraction network, an object detection network, and an object recognition network, and the convolution layers of the feature extraction network use deformable convolution. As shown in fig. 20, the training device 20 of the re-recognition network of this embodiment includes a training set acquisition unit 201 and a training unit 202. The training set acquisition unit 201 is configured to acquire a training set that includes an image group for each of multiple objects, each image group containing images of the corresponding object at multiple different angles. The training unit 202 is configured to train the re-recognition network according to the training set based on a loss function.
In an alternative implementation, the images in the training set have position labeling information of a global region and a plurality of local regions of the images. The training unit 202 is further configured to train the re-recognition network according to the position labeling information of each image in the training set based on a first loss function corresponding to the object detection network and a second loss function corresponding to the object recognition network.
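A sketch of one training step combining the two loss terms might look as follows; the model output structure, the loss functions, and the weighting factor alpha are hypothetical, since the patent specifies only that both losses drive the parameter updates:

```python
import torch

def train_step(model, optimizer, images, box_targets, id_targets,
               detection_loss_fn, reid_loss_fn, alpha: float = 1.0):
    """One optimization step over the end-to-end ReID network: the first
    (detection) loss and second (recognition) loss are summed so that the
    feature extraction network, the deformable-convolution offsets, the
    detection head, and the recognition head are updated together."""
    optimizer.zero_grad()
    boxes, embeddings = model(images)
    loss = detection_loss_fn(boxes, box_targets) \
         + alpha * reid_loss_fn(embeddings, id_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```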
As with the training method described above, the training device of this embodiment trains an end-to-end ReID network whose feature extraction layers use deformable convolution, adjusting the parameters of the feature extraction network, the deformable-convolution offsets, and the parameters of the object detection and object recognition networks under the first and second loss functions. The trained network adapts dynamically to the content of the image to be recognized, handles geometric variations such as object shape and size, and outputs feature maps containing both global and local feature information with enhanced expression capability, thereby improving the accuracy of object detection and object recognition.
Fig. 21 is a schematic diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 21, the electronic device 21 includes at least one processor 211; a memory 212 communicatively coupled to the processor 211; and a communication component 213 communicatively coupled to the scanning device, the communication component 213 receiving and transmitting data under the control of the processor 211. The memory 212 stores instructions executable by the at least one processor 211, and the instructions are executed by the at least one processor 211 to implement the image processing method and/or the training method of any of the above embodiments. The processor 211 is a CPU or an acceleration processor (e.g., a GPU).
Specifically, the electronic device 21 includes one or more processors 211 and a memory 212; fig. 21 takes a single processor 211 as an example. The processor 211 is configured to execute at least one step of the image processing method and/or the training method of this embodiment. The processor 211 and the memory 212 may be connected by a bus or in other ways; fig. 21 illustrates a bus connection. The memory 212, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. By running the non-volatile software programs, instructions, and modules stored in the memory 212, the processor 211 executes the various functional applications and data processing of the device, that is, implements the image processing method and/or the training method of the embodiments of the present disclosure.
The memory 212 may include a program storage area and a data storage area: the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store a list of options and the like. Further, the memory 212 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 212 may optionally include memory located remotely from the processor 211, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 212 stores one or more units, which when executed by the processor 211, perform the image processing method and/or the training method of any of the above-described method embodiments.
As described above, the electronic device of this embodiment acquires an image to be recognized that includes at least one object to be recognized, extracts a feature map containing global and local feature information, obtains the feature vector information of each object to be recognized from that feature map, and determines the target object from the feature vector information of the objects to be recognized and of the target object. Object detection and object recognition thus share the feature map, whose combined global and local feature information enhances feature expression capability and improves the accuracy of both operations.
Another embodiment of the present disclosure relates to a non-transitory storage medium storing a computer-readable program, the program causing a computer to perform some or all of the steps of the above-described method embodiments.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above product can execute the method provided by the embodiments of the present disclosure and has the functional modules and beneficial effects corresponding to that method; for technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present disclosure.
The above description covers only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its protection scope.

Claims (12)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be recognized, wherein the image to be recognized comprises at least one object to be recognized;
extracting features of the image to be recognized to obtain a feature map of the image to be recognized, wherein the feature map of the image to be recognized comprises global feature information and local feature information;
acquiring feature vector information of the at least one object to be identified according to the feature map of the image to be identified;
and determining the target object in the image to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object.
2. The method according to claim 1, wherein the extracting the features of the image to be recognized to obtain the feature map of the image to be recognized comprises:
and inputting the image to be identified into a feature extraction network of a re-identification network for feature extraction to obtain a feature map of the image to be identified, wherein the convolution layers of the feature extraction network adopt deformable convolution.
3. The method according to claim 1 or 2, wherein obtaining the feature vector information of the at least one object to be recognized according to the feature map of the image to be recognized comprises:
detecting the feature map according to an object detection network of the re-identification network to obtain at least one object detection frame to be identified;
and performing feature extraction on the at least one object detection frame to be identified according to the object recognition network of the re-identification network to obtain the feature vector information of the at least one object to be identified.
4. The method according to claim 3, wherein the feature map comprises at least one feature unit, and wherein detecting the feature map according to the object detection network of the re-identification network to obtain at least one object detection frame to be identified comprises:
performing feature decoding on each feature unit in the feature map to obtain a detection frame corresponding to each feature unit;
and performing de-duplication processing on each detection frame to obtain at least one object detection frame to be identified.
5. The method according to any one of claims 1 to 4, wherein determining the target object of the objects to be identified according to the feature vector information of each of the objects to be identified and the feature vector information of the target object comprises:
respectively calculating the similarity between the target object and each object to be identified according to the feature vector information of each object to be identified and the feature vector information of the target object;
for each object to be identified, determining the object to be identified as the target object in response to the similarity between the target object and the object to be identified being greater than or equal to a similarity threshold.
6. The method according to any one of claims 1 to 5, wherein the feature extraction of the image to be recognized to obtain the feature map of the image to be recognized comprises:
and performing feature extraction on the image to be recognized to obtain feature maps of a plurality of sizes of the image to be recognized.
7. The method according to any one of claims 1-6, further comprising:
and inputting the target object image into a re-identification network for processing to obtain the feature vector information of the target object.
8. A training method for re-identifying a network, the method comprising:
acquiring a training set, wherein the training set comprises an image group of a plurality of objects, and the image group comprises a plurality of images of the corresponding objects at different angles;
training the re-recognition network according to the training set based on a loss function;
the re-identification network comprises a feature extraction network, an object detection network and an object recognition network, and the convolution layers of the feature extraction network adopt deformable convolution.
9. The method of claim 8, wherein the images in the training set have position labeling information for a global region and a plurality of local regions of the images;
training the re-recognition network according to the training set based on the loss function comprises:
and training the re-recognition network according to the position marking information of each image in the training set based on a first loss function corresponding to the object detection network and a second loss function corresponding to the object recognition network.
10. An image processing apparatus, characterized in that the apparatus comprises:
an image acquisition unit, configured to acquire an image to be recognized, wherein the image to be recognized comprises at least one object to be recognized;
a feature extraction unit, configured to perform feature extraction on the image to be recognized to obtain a feature map of the image to be recognized, wherein the feature map of the image to be recognized comprises global feature information and local feature information;
a first information acquisition unit, configured to acquire feature vector information of the at least one object to be identified according to the feature map of the image to be recognized;
and a target object determination unit, configured to determine the target object in the image to be recognized according to the feature vector information of each object to be recognized and the feature vector information of the target object.
11. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any of claims 1-9.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-9.