CN116168418A - Multi-mode target perception and re-identification method for image - Google Patents

Multi-mode target perception and re-identification method for image

Info

Publication number
CN116168418A
Authority
CN
China
Prior art keywords
image data
model
modal
cross
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310043891.8A
Other languages
Chinese (zh)
Inventor
金一
亓佳
梁腾飞
王旭
李浥东
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202310043891.8A priority Critical patent/CN116168418A/en
Publication of CN116168418A publication Critical patent/CN116168418A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal target perception and re-identification method for images. The method comprises the following steps: preprocessing cross-modal image data to obtain a block vector sequence; learning the modality information of the cross-modal image data through modality embeddings (ME); fusing the block data, position information and modality information of the cross-modal image data to obtain serialized image data; inputting the serialized image data into a ViT model, which outputs feature information of the cross-modal image data; calculating a modality-aware enhancement loss value and adjusting the model parameters by back propagation according to the loss value to obtain a trained target re-identification model; and performing cross-modal target re-identification on a pedestrian image to be identified with the trained model. The method embeds learnable modality embeddings into the network to encode the modality information directly, which effectively alleviates the gap between heterogeneous images and thus achieves target perception and re-identification for cross-modal images.

Description

Multi-mode target perception and re-identification method for image
Technical Field
The invention relates to the technical field of target re-recognition, in particular to a multi-mode target sensing and re-recognition method for images.
Background
With the continuous increase in the number of surveillance cameras and rising public safety requirements, target re-identification has attracted great interest from industry and has great research significance. With the development of deep learning, the corresponding methods have achieved good performance. However, these target re-identification methods can only be applied under ideal illumination conditions and cannot handle images captured in the weak-illumination environments of real scenes. To solve this problem, a large number of infrared cameras have been put into use in video surveillance systems and have great application value. Researchers have therefore begun to focus on the cross-modal target re-identification problem.
Visible and infrared images are generated by cameras capturing light in different wavelength ranges: the visible image consists of three channels (red, green and blue) containing color information, while the infrared image contains only one channel of infrared radiation, so the two are different in nature. Reducing the modal difference of the same target object or the same identity is therefore crucial to solving the cross-modal target re-identification task. The target re-identification methods in the prior art can be roughly divided into two directions: methods based on modality conversion and methods based on representation and metric learning. Modality-conversion-based approaches attempt to convert images of one modality into the other modality to eliminate the modality difference, learning a modality conversion mapping with a generative adversarial network. However, since the mapping process is not one-to-one, the generation process may produce images with inconsistent colors, and there is no reliable mapping relationship to support the generative model. Therefore, researchers focus on the structural design of CNN (Convolutional Neural Network) models, extracting modality-shared features through representation learning and metric learning to reduce the variability between modalities. Based on a dual-stream framework, the corresponding methods extract the shared characteristics of different modalities with shallow layers whose weights are not shared and learn discriminative features with deep layers whose weights are shared, but this learning strategy cannot fully perceive and deeply mine the built-in modality characteristics and cannot learn a good modality-invariant feature representation.
In contrast to CNNs, the Transformer model can obtain a global receptive field and complete spatial features with its self-attention modules. The present invention is therefore directed to a new Transformer-based system and method for multi-modal target perception and re-identification that can capture modality features by learning feature vectors and generate more effective matching vectors based thereon.
Cross-modal target re-identification is a challenging task in computer vision whose goal is to match the same target object or pedestrian across images in the visible and infrared modalities. A cross-modal target re-identification method in the prior art is the DFLN-ViT scheme shown in fig. 1. This scheme adopts a dual-stream network to extract features; a Transformer block equipped with a spatial feature perception module and a channel feature enhancement module is inserted after the different convolution blocks, and position dependence is mined to obtain refined features. In the backbone network, by means of skip connections from the first stage to the last stage, a robust feature representation can be formed by combining high-level and low-level information. The output I of each channel feature enhancement module is input to the next stage of the network, and the output S of each spatial feature perception module is combined with I by element-wise addition in the last stage to obtain the final feature representation. The spatial feature perception module and the channel feature enhancement module both introduce a Transformer structure and capture the long-range dependence of spatial positions and channels between features. Finally the network is trained with the designed ternary assisted heterogeneous center loss (THC loss) and classification loss (ID loss).
The input of the spatial feature perception module is the feature extracted in the convolution stage; in order to feed the feature map into the Transformer and obtain refined spatial features, the information of adjacent pixels is combined through a convolution operation. For integration in the final stage, the output features of all spatial feature perception modules have the same size as the final-stage output features. The Transformer is adept at modeling dependencies and can capture correlations between channels. The channel feature enhancement module employs a method similar to an attention mechanism to generate an attention weight for each channel. In order to obtain a sequence input, global average pooling (GAP) is adopted for patch encoding, and the attention weights obtained through the Transformer are multiplied by the original feature map to obtain the final output feature.
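For reference, the general idea of such a channel feature enhancement module can be summarized as: global average pooling turns each channel into a token, a Transformer layer models the dependencies between the channel tokens, and the resulting weights rescale the original feature map. The following PyTorch sketch illustrates only this general idea; the layer sizes, the sigmoid gating and the use of nn.TransformerEncoderLayer are assumptions for illustration and do not reproduce the published DFLN-ViT implementation.

import torch
import torch.nn as nn

class ChannelFeatureEnhancement(nn.Module):
    """Generic channel-attention block: GAP -> Transformer over channel tokens -> reweight."""
    def __init__(self, channels: int, dim: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(1, dim)            # embed each pooled channel value as a token
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.proj_out = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.mean(dim=(2, 3)).unsqueeze(-1)   # (B, C, 1): global average pooling per channel
        tokens = self.proj_in(tokens)               # (B, C, dim): one token per channel
        tokens = self.encoder(tokens)               # model long-range dependencies between channels
        weights = torch.sigmoid(self.proj_out(tokens)).view(b, c, 1, 1)
        return x * weights                          # rescale the original feature map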
Drawbacks of the prior-art DFLN-ViT scheme described above include: the scheme learns modality-invariant features in an implicit manner and omits the direct mining and utilization of modality information, so it cannot learn well-discriminating modality-invariant features.
The existing losses act directly on the extracted features; the scheme does not consider effective mining and reasonable utilization of the modality information, which limits the model performance.
The DFLN-ViT model uses a multi-layer Transformer structure for multi-layer fusion of semantic information and is computationally more complex.
Disclosure of Invention
The embodiment of the invention provides a multi-mode target sensing and re-identification method for an image, which is used for effectively sensing and re-identifying targets of a cross-mode image.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A multi-mode target perception and re-identification method for an image comprises the following steps:
preprocessing, blocking and vectorizing cross-modal image data to obtain a blocking vector sequence, and adding category information and position information for each blocking vector;
learning the modality information of the cross-modal image data through modality embeddings (ME), and fusing the block data, the position information and the modality information of the cross-modal image data together in an embedding-superposition manner to obtain serialized image data;
inputting the serialized image data to a ViT model, the ViT model outputting characteristic information of the cross-modality image data;
calculating a modal perception enhancement loss value according to the characteristic information of the cross-modal image data, carrying out back propagation according to the modal perception enhancement loss value, and adjusting parameters of the ViT model to obtain a trained target re-identification model;
and performing cross-mode target re-recognition on the pedestrian image to be recognized by using the trained target re-recognition model, and outputting a target re-recognition result of the pedestrian image.
Preferably, preprocessing, blocking and vectorizing the cross-modal image data to obtain a blocking vector sequence, and adding category information and position information for each blocking vector, including:
loading the image data of the training set and the test set in the cross-modal pedestrian data set into a graphics processor, the graphics processor performing a standardization operation on the images of the training set and the test set, scaling the pixel value range of the images to between 0 and 1, cropping the images according to the set size, and performing random horizontal flipping, random cropping and random erasing data enhancement operations on the images;
dividing the image data of each batch into a sequence of overlapping small blocks according to the set batch size, vectorizing the small blocks through a flatten operation, linearly mapping the vectorized small blocks with a linear transformation matrix to obtain a block vector sequence, and adding category information and position information for each block vector.
Preferably, learning the modality information of the cross-modal image data through the ME and fusing the block data, the position information and the modality information of the cross-modal image data together in an embedding-superposition manner to obtain the serialized image data includes:
designing a ViT model comprising modality embeddings ME, inputting the blocking vectors of the cross-modal image data together with their category information and position information into the ViT model, the ViT model learning the modality information of the cross-modal image data through the ME, wherein the modality information comprises a visible-light RGB modality or an infrared modality and is used for perceiving and encoding the different types of information;
and fusing the blocking vector of the image, and the position information, the category information and the modal information of the blocking vector together in an embedded superposition mode to obtain the serialized image data.
Preferably, the inputting the serialized image data into a ViT model, the ViT model outputting the characteristic information of the cross-modality image data includes:
the serialized image data is input to a ViT model, and the ViT model performs feature extraction on the serialized image data by using a multi-layer self-attention module, and outputs a feature vector of each image in the serialized image data.
Preferably, the calculating a modal sense enhancement loss value according to the characteristic information of the cross-modal image data includes:
calculating a classification loss ID loss and a weighted regularization triplet loss WRT loss according to the feature vector of each image in the serialized image data;
calculating the modality-aware enhancement (MAE) loss L_MAE from the feature vector of each image in the serialized image data according to formulas (1)-(5), where

L_MAE = L_MAC + L_MAID (5)

f_k^m denotes the feature extracted from the k-th image of modality m; y denotes the identity label; φ denotes the mapping that mines the knowledge embedded in the modality embedding; e_m denotes the modality embedding; c_q denotes the central feature vector of the q-th identity, i.e. the average of the image features after modality removal; ŷ denotes the predicted label; L_MAID computes the cross entropy between the prediction and the target; and L_MAC denotes the modality-aware center loss;

selecting K images of Q identities in one batch, and selecting a fully connected layer as the mapping φ for mining the knowledge embedded in the modality embedding;
and carrying out weighted fusion on the calculated classification loss, the weighted regularized triplet loss and the modal perception enhancement loss according to the set super parameter lambda to obtain a final loss value.
Preferably, the back-propagating according to the modality-aware enhancement loss value and adjusting parameters of the ViT model to obtain a trained target re-recognition model includes:
calculating the gradient of the current parameters of the ViT model from the final loss value using the objective function, calculating the first-order momentum and second-order momentum from the historical gradients, calculating the descent gradient at the current moment and updating accordingly, updating the parameter values in the ViT model with an optimizer using the calculated gradients, repeating this process until the number of training rounds set by the hyper-parameters of the iterative optimization process is reached, stopping the iterative optimization of the ViT model parameters once the set number of training rounds is reached, evaluating the ViT model with evaluation indices, and obtaining the trained target re-identification model after the evaluation is qualified.
Preferably, the cross-modal target re-recognition is performed on the pedestrian image to be recognized by using the trained target re-recognition model, and the target re-recognition result of the pedestrian image is output, including:
inputting the pedestrian image to be identified into the trained target re-identification model, performing cross-mode target re-identification on the pedestrian image to be identified by the trained target re-identification model, outputting a target re-identification result of the pedestrian image, wherein the target re-identification result comprises a classification result with highest possibility that the pedestrian is identified, and outputting a label and a probability value of the classification result with highest possibility.
As can be seen from the technical solution provided by the above embodiments of the present invention, the method of the embodiments uses a Transformer architecture and exploits the Transformer's strength in capturing global context information to enhance the perception of modality information; learnable modality embeddings (ME) are introduced into the network to encode the modality information directly, which can effectively be used to alleviate the gap between heterogeneous images; and the MAE loss function forces the ME to capture more useful features of each modality and adjusts the distribution of the extracted embeddings.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a prior art DFLN-ViT process;
FIG. 2 is a schematic diagram of an implementation of a method for multi-modal target perception and re-recognition of an image according to an embodiment of the present invention;
FIG. 3 is a process flow diagram of a multi-modal target awareness and re-recognition method for an image according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of input data of a ViT model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of modal awareness enhancement loss according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of facilitating an understanding of the embodiments of the invention, further explanation is given below with reference to several specific embodiments illustrated in the accompanying drawings; the drawings should in no way be taken to limit the embodiments of the invention.
The invention focuses on solving the core challenge of cross-modal target re-identification, namely the huge modality difference between visible-light and infrared images, and provides a system and method for multi-modal target perception and re-identification. In order to capture the inherent characteristics of each modality, new ME (Modality Embeddings) are designed to encode the modality information directly. In order to strengthen the constraint on the learnable ME and optimize the distribution of the extracted features, an MAE (Modality-Aware Enhancement) loss is further designed: by subtracting the learned modality-specific features obtained from the ME, the ME is forced to capture more useful features of each modality and the distribution of the extracted features is adjusted, which overcomes the defects of the existing losses. The network provided by the invention can be jointly optimized in an end-to-end manner and generates more effective and discriminative features.
An implementation schematic diagram of the multi-modal target perception and re-identification method for an image provided by an embodiment of the present invention is shown in fig. 2 and includes: an image preprocessing stage, a data serialization input stage, a feature extraction stage, a loss calculation stage, a model iterative optimization stage, a model test and evaluation stage, and the like. In the model training stage, pedestrian image data of the training set are first input and pass through several image preprocessing operations, including standardization of the image data, resizing, random horizontal flipping, random cropping and random erasing; the image data are then propagated forward through the designed network model to obtain the classification results of the images, the loss is calculated, back propagation is carried out using the loss, the model weights are updated, and this process is repeated until the set number of iteration rounds is reached.
In the test stage, the image data of the test set are loaded, the neural network layers of the classification part are removed from the trained model, the test sample features are obtained directly, the feature similarities are calculated and compared, and the retrieval process is completed; the evaluation indices are then calculated to judge the performance of the model. If the expected requirement is not met, the training stage is entered again for further adjustment and training; if the expected performance is met, the model weights are saved, completing the flow of the whole technical solution and obtaining the final solution.
Cross-modal image data: the image data have two different modalities, visible-light RGB and infrared; each image belongs to either the visible-light RGB modality or the infrared modality, and each image also corresponds to an annotated target category. The cross-modal image data include training images (train), query images (query) and a gallery (gallery). The training images are used to train the feature extraction capability of the target re-identification model, while the query images and the gallery are used to verify the performance of the target re-identification model.
A ViT (Vision Transformer) model is used, with weights pre-trained on the ImageNet dataset. The sizes of the visible and infrared images are adjusted to 3 × 256 × 128 (C × H × W), and the single channel of the infrared image is repeated three times so that it contains three channels. The input image is split into patches of size 16 × 16, with the step size S set to 8. An AdamW optimizer is used. The base learning rate is initialized to 0.001, and the learning rate of all pre-trained layers is 0.1 times the base learning rate. The hyper-parameters of the model algorithm include: the crop size of the images, the batch size during training, the number of iteration rounds, the learning rate, the step size S of the image patching, and the balance coefficient λ of the modality-aware enhancement loss.
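A minimal configuration sketch collecting the implementation details listed above; only the image size, patch size, stride, optimizer and base learning rate come from the text, while the remaining values (batch size, number of epochs, λ) are placeholders.

# Hypothetical configuration object; only values stated in the text are authoritative.
config = dict(
    img_size=(256, 128),      # images resized to 3 x 256 x 128 (C x H x W)
    patch_size=16,            # input image split into 16 x 16 patches
    stride=8,                 # overlapping patches, step size S = 8
    base_lr=1e-3,             # base learning rate for AdamW
    pretrained_lr_factor=0.1, # pre-trained layers use 0.1 x base learning rate
    batch_size=64,            # placeholder: not specified in the text
    epochs=120,               # placeholder: not specified in the text
    lambda_mae=0.5,           # placeholder balance coefficient for the MAE loss
)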
The specific processing flow of the multi-mode target perception and re-identification method for the image provided by the embodiment of the invention is shown in fig. 3, and comprises the following processing steps:
step S10: image preprocessing stage
Loading the image data (comprising a training set and a test set) of the cross-modal pedestrian data set into the GPU (Graphics Processing Unit) video memory;
carrying out a standardization operation on the images of the training set and the test set, scaling the pixel value range to between 0 and 1, cropping according to the set size, and appropriately using data enhancement operations such as random horizontal flipping, random cropping and random erasing;
and forming the data into a batch form according to the set batch size, and inputting the batch form into a post-model algorithm.
Step S20: data serialization input stage
Dividing the preprocessed image into a sequence of small blocks (patches), the overlapping patches being divided according to the step size;
vectorizing each small block (patch) through a flatten operation, and linearly mapping the vectorized blocks with a linear transformation matrix to obtain a block vector sequence;
category information and position information are added for each block vector (see the sketch below).
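The overlapping patch split, flattening and linear mapping of Step S20 can be implemented with nn.Unfold followed by a linear projection (an equivalent strided convolution would also work). A minimal sketch using the 16×16 patch size and stride 8 given in the implementation details; the embedding dimension of 768 is an assumption.

import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Split an image into overlapping patches, flatten them, and project to embed_dim."""
    def __init__(self, in_chans=3, patch_size=16, stride=8, embed_dim=768):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch_size, stride=stride)
        self.proj = nn.Linear(in_chans * patch_size * patch_size, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        patches = self.unfold(x)          # (B, C*P*P, N): overlapping patches, flattened
        patches = patches.transpose(1, 2) # (B, N, C*P*P): one vector per patch
        return self.proj(patches)         # (B, N, embed_dim): block vector sequence

x = torch.randn(2, 3, 256, 128)
tokens = OverlappingPatchEmbed()(x)       # (2, 465, 768) for a 256x128 input with P=16, S=8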
The Transformer model originated as an NLP (Natural Language Processing) model and uses the self-attention mechanism to obtain global information. The ViT (Vision Transformer) model applies the Transformer to visual classification tasks.
Fig. 4 is a schematic structural diagram of the input data of the ViT model according to an embodiment of the present invention. In the ViT model of the embodiment of the invention, a modality encoding is designed and an ME is introduced to learn the information of each modality. The different types of embedding are fused together in an additive manner; the position embedding differs between image patches, while the modality embedding serves as a learnable parameter that changes between image modalities, perceiving and encoding the different types of information. The modality information includes visible light and infrared.
The modality information can be naturally integrated into the Transformer framework in the same way as a position encoding. As shown in fig. 4, the input data of the ViT model consist of three parts: the block vectors, the position information and the category information of the image. Corresponding to the visible and infrared modalities, the invention defines two learnable embeddings, respectively, which are used to learn the information of each modality; this helps the subsequent learning of modality-invariant features. All image patches of an image in a given modality share the same embedding. The block vectors, the position information and the modality information of the images are fused together in an embedding-superposition manner to obtain the serialized image data, where the position information differs between image patches and the modality information changes between image modalities so as to perceive and encode the different types of information.
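A minimal sketch of this additive fusion: every patch token of an image receives the same learnable modality embedding (index 0 for visible RGB, index 1 for infrared) on top of the position embedding, and a class token is prepended. The embedding dimension and the handling of the class token are assumptions.

import torch
import torch.nn as nn

class SequenceBuilder(nn.Module):
    """Fuse patch, position, class and modality embeddings by element-wise addition."""
    def __init__(self, num_patches: int, embed_dim: int = 768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Two learnable modality embeddings: index 0 = visible (RGB), index 1 = infrared.
        self.modality_embed = nn.Parameter(torch.zeros(2, embed_dim))

    def forward(self, patch_tokens: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)               # (B, 1, D)
        tokens = torch.cat([cls, patch_tokens], dim=1)       # (B, N+1, D)
        me = self.modality_embed[modality].unsqueeze(1)      # (B, 1, D), shared by all patches
        return tokens + self.pos_embed + me                  # additive fusion of the embeddings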
Step S30: feature extraction stage
In the feature extraction stage, a ViT model is used as the backbone extractor. By using multi-layer self-attention modules, the ViT model can perceive global features more effectively than CNN-based methods. The serialized image data are input into the ViT model for feature extraction;
the feature vector of each image is obtained from the output data of the ViT model at the position corresponding to the category (class) vector. These feature vectors pass through a batch normalization operation and a fully connected (FC) layer in sequence, after which multiple losses, such as the MAE loss and the classification loss, are calculated; together they constrain the distribution of the extracted vectors.
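A minimal sketch of the feature extraction head described above: a stack of self-attention layers stands in for the ViT backbone, the output at the class-token position is taken as the image feature, and batch normalization plus a fully connected layer produce the classification logits. The depth, number of heads and number of identity classes are placeholders.

import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """ViT-style encoder followed by the BN + FC head used for loss computation."""
    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_classes=395):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # stand-in for ViT blocks
        self.bnneck = nn.BatchNorm1d(embed_dim)   # batch normalization of the class-token feature
        self.classifier = nn.Linear(embed_dim, num_classes, bias=False)  # FC layer for ID loss

    def forward(self, tokens: torch.Tensor):
        out = self.encoder(tokens)          # multi-layer self-attention over the token sequence
        feat = out[:, 0]                    # feature vector at the class-token position
        feat_bn = self.bnneck(feat)         # normalized feature for the classification branch
        logits = self.classifier(feat_bn)
        return feat, logits                 # feat for metric losses, logits for ID loss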
Step S40: loss calculation stage
Calculating a classification loss (ID loss);
the re-identification training process is treated as an image classification problem, i.e. each different label is a different class, and the classification loss is calculated by cross entropy using the label of the input image and the predicted probability of it being recognized as each class.
Calculating weighted regularized triplet loss (Weighted Regularization Triplet (WRT) loss);
the re-identification training process is treated as a retrieval ranking problem: the distance between a positive pair should be smaller than that between a negative pair by a predefined margin. A triplet contains a positive sample, a negative sample and an anchor; the weighted regularized triplet loss uses a softmax-based weighting strategy that gives every positive and negative sample a weight. The weighted regularized triplet loss is as follows (a code sketch is given after the formula):
L_wrt(i) = log( 1 + exp( Σ_{j∈P_i} w_ij^p · d_ij^p − Σ_{k∈N_i} w_ik^n · d_ik^n ) )

w_ij^p = exp(d_ij^p) / Σ_{j'∈P_i} exp(d_ij'^p),  w_ik^n = exp(−d_ik^n) / Σ_{k'∈N_i} exp(−d_ik'^n)

where i, j, k index the triplets in each training batch; for an anchor i, P_i is the corresponding positive set and N_i the negative set, and d_ij^p and d_ik^n denote the pairwise distances of the positive and negative sample pairs, respectively.
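A PyTorch sketch of this weighted regularized triplet loss under the above definitions, assuming Euclidean pairwise distances within the batch; the masking constants and the use of softplus are implementation choices, not taken from the patent text.

import torch
import torch.nn.functional as F

def weighted_regularized_triplet(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Softmax-weighted triplet loss: every positive and negative pair gets a weight."""
    dist = torch.cdist(feats, feats)                        # pairwise distances within the batch
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask = same & ~torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    neg_mask = ~same

    # Softmax weighting: hard positives (large distance) and hard negatives (small distance)
    # receive larger weights.
    w_pos = torch.where(pos_mask, dist, torch.full_like(dist, -1e9)).softmax(dim=1)
    w_neg = torch.where(neg_mask, -dist, torch.full_like(dist, -1e9)).softmax(dim=1)

    pos_term = (w_pos * dist * pos_mask.float()).sum(dim=1)
    neg_term = (w_neg * dist * neg_mask.float()).sum(dim=1)
    return F.softplus(pos_term - neg_term).mean()           # log(1 + exp(x))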
Calculating the modality-aware enhancement loss (MAE loss) according to formulas (1)-(5);
and calculating the total loss value: the three losses are weighted and fused using the set hyper-parameter λ to obtain the final loss value.
Step S50: model iterative optimization stage
The code implementation is based on the PyTorch deep learning framework; back propagation is carried out from the finally calculated loss value, and the gradient values of the parameters in the target re-identification model are calculated automatically;
updating the parameter values in the target re-identification model using an optimizer (e.g. the Adam optimizer of PyTorch) with the gradients calculated in the previous steps;
and when iterative optimization is carried out, calculating the gradient of the objective function relative to the current parameter, calculating the first-order momentum and the second-order momentum according to the historical gradient, calculating the descending gradient at the current moment, and updating according to the descending gradient.
All the preceding steps are repeated until the target re-identification model reaches the number of rounds set by the hyper-parameters; once the set number of rounds is reached, the training of the parameters in the target re-identification model is stopped and a trained target re-identification model that meets the performance evaluation standard is obtained.
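The iterative optimization of Step S50, including the weighted fusion of the three losses with the balance coefficient λ from Step S40, reduces to a standard PyTorch training loop. In the sketch below, model, train_loader, id_loss, wrt_loss, mae_loss and config are assumed to be defined elsewhere.

import torch

# Assumed to be defined elsewhere: model, train_loader, id_loss, wrt_loss, mae_loss, config.
optimizer = torch.optim.AdamW(model.parameters(), lr=config["base_lr"])

for epoch in range(config["epochs"]):                     # repeat until the set number of rounds
    for images, labels, modality in train_loader:
        feat, logits = model(images, modality)            # forward pass through the ViT model
        loss = (id_loss(logits, labels)
                + wrt_loss(feat, labels)
                + config["lambda_mae"] * mae_loss(feat, labels, modality))  # weighted fusion
        optimizer.zero_grad()
        loss.backward()                                   # back propagation from the final loss
        optimizer.step()                                  # update with first/second-order momentum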
Step S60: test evaluation stage
Reading the pedestrian images of the test set, loading them into the GPU video memory, and performing the same standardization operation as in the training stage (note that data enhancement operations such as random horizontal flipping are not required during testing);
and (3) adopting CMC (Cumulative Matching Characteristics, cumulative matching curve) and mAP (Mean Average Precsion, average precision mean) evaluation indexes commonly used for pedestrian re-identification, and primarily evaluating the quality of the target re-identification model by evaluating the calculated index values.
If the evaluation result does not meet the requirement, the parameters of the target re-identification model need to be adjusted and the process returns to the first execution step to train the target re-identification model again; if the evaluation result meets the requirement, the parameters of the target re-identification model can be saved, and the trained target re-identification model is obtained. The trained target re-identification model can be used as a solution for the visible-infrared cross-modal pedestrian re-identification task.
And then, performing cross-mode target re-recognition on the pedestrian image to be recognized by using the trained target re-recognition model, outputting a target re-recognition result of the pedestrian image, wherein the target re-recognition result comprises a classification result with highest possibility that the pedestrian is recognized, and outputting a label and a probability value of the classification result.
Fig. 5 is a schematic diagram of the modality-aware enhancement loss provided by an embodiment of the present invention. The modality-aware enhancement loss contains two parts, the modality-aware center loss (formula 1) and the modality-aware ID loss (formula 3), and aims to shorten the intra-class distance and enlarge the inter-class distance. The modality-aware center loss focuses on reducing the gap between different modalities under the same identity and reducing the intra-class feature distance by using the knowledge learned from the ME. First, the central feature vector (formula 2) of each identity in a batch is calculated; the modality-specific information is removed to filter out the modality-invariant features, giving the average of the image features after modality removal. The cosine distance D is used to calculate the distance between an extracted image feature and its central feature vector. Through the constraint of the modality-aware center loss, more compact cross-modal features can be extracted for each identity. The modality-aware ID loss (formula 3) aims to learn distinguishing features between different identities and to expand the distance between image features of different identities based on the information learned from the ME; it also involves the modality-removal process and classifies the input images of different identities by calculating a cross-entropy loss. Under the constraint of the modality-aware ID loss, the features extracted by the ViT model have stronger recognition capability and can achieve more accurate matching.
The total loss is L_MAE = L_MAC + L_MAID (5). In formulas (1)-(5), f_k^m denotes the feature extracted from the k-th image of modality m; y denotes the identity label; φ denotes the mapping that mines the knowledge embedded in the modality embedding; e_m denotes the modality embedding; c_q denotes the central feature vector of the q-th identity, i.e. the average of the image features after modality removal; ŷ denotes the predicted label; L_MAID computes the cross entropy between the prediction and the target; and L_MAC denotes the modality-aware center loss.
K images of Q identities are selected in one batch, and a fully connected layer is selected as the mapping φ for mining the knowledge embedded in the modality embedding. The modality-aware enhancement loss forces the ME to mine more useful modality-specific features through the modality-removal process; this ME-based loss function adjusts the distribution of the feature vectors and generates more discriminative features for image retrieval that are not affected by the huge modality difference.
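The exact forms of formulas (1)-(4) are not reproduced here, so the sketch below follows only the textual description: the mapped modality embedding φ(e_m) is subtracted from each image feature, the modality-aware center loss is the cosine distance to the per-identity center of the modality-removed features, and the modality-aware ID loss is a cross entropy on the modality-removed features. The mapping φ is a fully connected layer as stated; the normalization and the classifier used for L_MAID are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareEnhancementLoss(nn.Module):
    """Sketch of L_MAE = L_MAC + L_MAID based on the textual description only."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.phi = nn.Linear(embed_dim, embed_dim)           # mapping that mines ME knowledge
        self.classifier = nn.Linear(embed_dim, num_classes)  # for the modality-aware ID loss

    def forward(self, feats, labels, modality, modality_embed):
        # Modality removal: subtract the mapped modality embedding from each image feature.
        removed = feats - self.phi(modality_embed[modality])        # (B, D)

        # Modality-aware center loss: cosine distance to the per-identity center
        # of the modality-removed features.
        l_mac = feats.new_zeros(())
        for q in labels.unique():
            idx = labels == q
            center = removed[idx].mean(dim=0, keepdim=True)         # central feature vector c_q
            l_mac = l_mac + (1 - F.cosine_similarity(removed[idx], center)).sum()
        l_mac = l_mac / feats.size(0)

        # Modality-aware ID loss: cross entropy on the modality-removed features.
        l_maid = F.cross_entropy(self.classifier(removed), labels)
        return l_mac + l_maid                                       # L_MAE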
In summary, the method of the embodiment of the invention can better perceive the modality information by using the modality encoding and learn a better modality-invariant feature representation.
The modality-aware enhancement loss function better adjusts the feature distribution, yielding compact intra-class distances and larger inter-class distances and enhancing the learning capability of the modality embedding.
The method of the embodiment of the invention uses the Transformer structure and exploits its advantage in capturing global context information to enhance the perception of modality information; learnable modality embeddings (ME) are introduced into the network and directly encode the modality information, which can effectively be used to alleviate the gap between heterogeneous images; and a new MAE loss function is designed that forces the ME to capture more useful features of each modality and adjusts the distribution of the extracted embeddings.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part. The apparatus and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (7)

1. The multi-mode target perception and re-identification method for the image is characterized by comprising the following steps of:
preprocessing, blocking and vectorizing cross-modal image data to obtain a blocking vector sequence, and adding category information and position information for each blocking vector;
learning the modality information of the cross-modal image data through modality embeddings (ME), and fusing the block data, the position information and the modality information of the cross-modal image data together in an embedding-superposition manner to obtain serialized image data;
inputting the serialized image data to a ViT model, the ViT model outputting characteristic information of the cross-modality image data;
calculating a modal perception enhancement loss value according to the characteristic information of the cross-modal image data, carrying out back propagation according to the modal perception enhancement loss value, and adjusting parameters of the ViT model to obtain a trained target re-identification model;
and performing cross-mode target re-recognition on the pedestrian image to be recognized by using the trained target re-recognition model, and outputting a target re-recognition result of the pedestrian image.
2. The method of claim 1, wherein preprocessing, blocking and vectorizing the cross-modality image data to obtain a sequence of block vectors, adding category information and location information to each block vector, comprising:
loading the image data of the training set and the test set in the cross-modal pedestrian data set into a graphics processor, the graphics processor performing a standardization operation on the images of the training set and the test set, scaling the pixel value range of the images to between 0 and 1, cropping the images according to the set size, and performing random horizontal flipping, random cropping and random erasing data enhancement operations on the images;
dividing the image data of each batch into a sequence of overlapping small blocks according to the set batch size, vectorizing the small blocks through a flatten operation, linearly mapping the vectorized small blocks with a linear transformation matrix to obtain a block vector sequence, and adding category information and position information for each block vector.
3. The method according to claim 2, wherein learning the modality information of the cross-modal image data through the ME and fusing the block data, the position information and the modality information of the cross-modal image data together in an embedding-superposition manner to obtain the serialized image data includes:
designing a ViT model comprising modality embeddings ME, inputting the blocking vectors of the cross-modal image data together with their category information and position information into the ViT model, the ViT model learning the modality information of the cross-modal image data through the ME, wherein the modality information comprises a visible-light RGB modality or an infrared modality and is used for perceiving and encoding the different types of information;
and fusing the blocking vector of the image, and the position information, the category information and the modal information of the blocking vector together in an embedded superposition mode to obtain the serialized image data.
4. A method according to claim 3, wherein said inputting said serialized image data into a ViT model, said ViT model outputting characteristic information of said cross-modality image data, comprises:
the serialized image data is input to a ViT model, and the ViT model performs feature extraction on the serialized image data by using a multi-layer self-attention module, and outputs a feature vector of each image in the serialized image data.
5. The method of claim 4, wherein said calculating a modal sense enhancement loss value from the characteristic information of the cross-modal image data comprises:
calculating a classification loss ID loss and a weighted regularization triplet loss WRT loss according to the feature vector of each image in the serialized image data;
calculating the modality-aware enhancement (MAE) loss L_MAE from the feature vector of each image in the serialized image data according to formulas (1)-(5), where

L_MAE = L_MAC + L_MAID (5)

f_k^m denotes the feature extracted from the k-th image of modality m; y denotes the identity label; φ denotes the mapping that mines the knowledge embedded in the modality embedding; e_m denotes the modality embedding; c_q denotes the central feature vector of the q-th identity, i.e. the average of the image features after modality removal; ŷ denotes the predicted label; L_MAID computes the cross entropy between the prediction and the target; and L_MAC denotes the modality-aware center loss;

selecting K images of Q identities in one batch, and selecting a fully connected layer as the mapping φ for mining the knowledge embedded in the modality embedding;
and carrying out weighted fusion on the calculated classification loss, the weighted regularized triplet loss and the modal perception enhancement loss according to the set super parameter lambda to obtain a final loss value.
6. The method of claim 5, wherein the back-propagating according to the modal sense enhancement loss value adjusts parameters of the ViT model to obtain a trained target re-recognition model, comprising:
calculating the gradient of the current parameters of the ViT model from the final loss value using the objective function, calculating the first-order momentum and second-order momentum from the historical gradients, calculating the descent gradient at the current moment and updating accordingly, updating the parameter values in the ViT model with an optimizer using the calculated gradients, repeating this process until the number of training rounds set by the hyper-parameters of the iterative optimization process is reached, stopping the iterative optimization of the ViT model parameters once the set number of training rounds is reached, evaluating the ViT model with evaluation indices, and obtaining the trained target re-identification model after the evaluation is qualified.
7. The method of claim 6, wherein the performing cross-modal target re-recognition on the pedestrian image to be recognized using the trained target re-recognition model, outputting a target re-recognition result of the pedestrian image, comprises:
inputting the pedestrian image to be identified into the trained target re-identification model, performing cross-mode target re-identification on the pedestrian image to be identified by the trained target re-identification model, outputting a target re-identification result of the pedestrian image, wherein the target re-identification result comprises a classification result with highest possibility that the pedestrian is identified, and outputting a label and a probability value of the classification result with highest possibility.
CN202310043891.8A 2023-01-29 2023-01-29 Multi-mode target perception and re-identification method for image Pending CN116168418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310043891.8A CN116168418A (en) 2023-01-29 2023-01-29 Multi-mode target perception and re-identification method for image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310043891.8A CN116168418A (en) 2023-01-29 2023-01-29 Multi-mode target perception and re-identification method for image

Publications (1)

Publication Number Publication Date
CN116168418A true CN116168418A (en) 2023-05-26

Family

ID=86410754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310043891.8A Pending CN116168418A (en) 2023-01-29 2023-01-29 Multi-mode target perception and re-identification method for image

Country Status (1)

Country Link
CN (1) CN116168418A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524282A (en) * 2023-06-26 2023-08-01 贵州大学 Discrete similarity matching classification method based on feature vectors
CN116524282B (en) * 2023-06-26 2023-09-05 贵州大学 Discrete similarity matching classification method based on feature vectors

Similar Documents

Publication Publication Date Title
CN110738146B (en) Target re-recognition neural network and construction method and application thereof
CN108052911B (en) Deep learning-based multi-mode remote sensing image high-level feature fusion classification method
Zhou et al. Salient object detection in stereoscopic 3D images using a deep convolutional residual autoencoder
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN109670576B (en) Multi-scale visual attention image description method
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN113870160B (en) Point cloud data processing method based on transformer neural network
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN116012722A (en) Remote sensing image scene classification method
CN116824625A (en) Target re-identification method based on generation type multi-mode image fusion
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116434033A (en) Cross-modal contrast learning method and system for RGB-D image dense prediction task
CN117011883A (en) Pedestrian re-recognition method based on pyramid convolution and transducer double branches
CN115830449A (en) Remote sensing target detection method with explicit contour guidance and spatial variation context enhancement
CN116168418A (en) Multi-mode target perception and re-identification method for image
CN117690178B (en) Face image recognition method and system based on computer vision
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN116740078A (en) Image segmentation processing method, device, equipment and medium
CN113887470B (en) High-resolution remote sensing image ground object extraction method based on multitask attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination