CN117079310A - Pedestrian re-identification method based on image-text multi-mode fusion - Google Patents

Pedestrian re-identification method based on image-text multi-mode fusion

Info

Publication number
CN117079310A
Authority
CN
China
Prior art keywords
pedestrian
text
image
mode
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311052722.7A
Other languages
Chinese (zh)
Inventor
颜成钢
游泽洪
江涛
孙垚棋
朱尊杰
高宇涵
王鸿奎
赵治栋
殷海兵
王帅
张继勇
李宗鹏
丁贵广
付莹
郭雨晨
赵思成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311052722.7A priority Critical patent/CN117079310A/en
Publication of CN117079310A publication Critical patent/CN117079310A/en
Pending legal-status Critical Current

Classifications

    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06F 40/126 Character encoding
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0499 Feedforward networks
    • G06N 3/08 Learning methods
    • G06V 10/40 Extraction of image or video features
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image-text multi-modal pedestrian re-identification method. A Vision Transformer model is used to extract features from both images and texts, and a multi-modal feature fusion network is constructed to align and fuse the features of the two modalities, which addresses the difficulty of fusing features of different modalities. Finally, a loss function is computed over the fused feature vector and the feature vector of the pedestrian to be retrieved, which realizes image-text multi-modal pedestrian re-identification. By using the features of two different modalities, image and text, for visual pedestrian re-identification, the method can effectively exploit the information provided by the two modalities and alleviate the difficulty caused by variable text descriptions.

Description

Pedestrian re-identification method based on image-text multi-mode fusion
Technical Field
The application relates to the technical fields of artificial intelligence and computer vision, and in particular to a pedestrian re-identification algorithm based on image-text multi-modal fusion.
Background
Pedestrian re-identification (Re-ID), also known as person re-identification, is an important image recognition technology and is now regarded as a key sub-problem of image retrieval. It uses computer vision algorithms to match pedestrian images or videos across devices: given a query image, the same pedestrian is searched for in the image libraries of different monitoring devices. Pedestrian re-identification is an important foundation of intelligent video analysis and has recently attracted close attention from researchers in the computer vision field.
With the wide application of deep learning in computer vision, deep-learning-based pedestrian re-identification has become the mainstream approach and far surpasses traditional machine-learning-based schemes in performance. In real scenes, besides the person images captured by various cameras, the natural-language statements of witnesses are also an important clue, so matching natural language with a target pedestrian is very meaningful work. Cross-modal image-text pedestrian re-identification is therefore proposed to retrieve images of the same pedestrian from a multimodal image library using linguistic text information. It effectively addresses the image-text cross-modal pedestrian re-identification problem and is of great significance for public safety, crime prevention, criminal investigation, and so on.
In recent years, a great deal of research work and related frameworks for cross-modal pedestrian re-identification have emerged. However, there is still a gap between cross-modal pedestrian re-identification and its application in real scenes. The main difficulties and challenges are:
(1) Images and texts belong to two different feature modalities. Existing neural network models usually use only one kind of feature to retrieve pedestrians, and it is difficult to fuse the features of the two modalities so that they jointly contribute to pedestrian matching.
(2) From the perspective of video investigation applications, two points about text-to-image re-identification are worth noting. First, witness statements about a criminal suspect are always incomplete, since the suspect may not be noticed until the crime occurs. Second, these statements are sometimes incorrect; for example, a witness may misjudge the color of a garment under certain lighting conditions. Therefore, although text-to-image re-ID has been less studied, it deserves further development due to its uniqueness.
Therefore, how to effectively utilize the features provided by two different modalities and how to handle the difficulty caused by variable text descriptions are the key issues in image-text multi-modal pedestrian re-identification research.
Disclosure of Invention
To solve the above problems, the application provides an image-text multi-modal fusion pedestrian re-identification method for re-identifying pedestrian targets using images and the related semantic information.
The object of the application is achieved in the following way:
a pedestrian re-identification method based on image-text multi-mode fusion comprises the following steps:
A pedestrian dataset containing pedestrian images and at least two different texts corresponding to each pedestrian is acquired.
The pedestrian images and the corresponding text descriptions are input into a pre-trained multi-modal image-text pedestrian re-identification network to obtain a predicted classification result.
The multi-modal image-text pedestrian re-identification network is configured to contain two branches that capture person features in the image and text modalities respectively, yielding person features representing the image modality and the text modality.
A pedestrian re-identification method based on image-text multi-mode fusion comprises the following steps:
Step S1: preprocess a labeled image-text dataset to obtain an image-text pedestrian re-identification dataset, and divide it into a training set, a validation set, and a test set;
Step S2: construct the feature extraction network of the multi-modal image-text pedestrian re-identification model, pre-train the feature extraction network, and save the pre-trained weights of the feature extraction network;
Step S3: construct the multi-modal feature fusion network and determine the total loss function of the multi-modal image-text pedestrian re-identification model;
Step S4: train the multi-modal image-text pedestrian re-identification model with the pre-trained weights of the feature extraction network and the total loss function until the objective function converges;
Step S5: perform pedestrian re-identification on the target-domain image and text data to be examined with the trained multi-modal pedestrian re-identification model, and output the recognition result for the corresponding target domain.
Further, a labeled image-text dataset containing pedestrian images and at least two different texts corresponding to each pedestrian is used. The pedestrian text descriptions are preprocessed, including normalization of character encodings, unification of English upper and lower case, and text cleaning that removes non-text content by Unicode code. The preprocessed text data then needs to be converted into a form that can be input into a Transformer. For a sentence of text description, the natural language is converted into integer features (one word may become several integer tokens). Each word is mapped to an integer according to a mapping table composed of 256 ASCII code mappings and common BPE character combinations (the mapping table is a list of character combinations whose order indicates their frequency), and the sentence is then encoded according to this mapping table.
Further, the feature extraction network comprises an Image sub-network and a Text sub-network, which process the image features and text features of pedestrians respectively; each sub-network takes as input the pedestrian information of its corresponding modality, image or text.
Further, in the feature extraction network, the backbone of the Image sub-network is a Vision Transformer and the Text sub-network is a Transformer; both sub-networks contain Transformer layers, and each Transformer layer uses a multi-head attention network with position embeddings.
Further, after the image and text are converted into patch/token sequences, a class token and a Position Embedding are added to each sequence so that the correlations among patches are preserved when they enter the Transformer encoding layers.
Further, the multi-modal feature fusion network consists of several groups of feature fusion modules with the following structure: the feature vectors obtained from the two sub-networks of the feature extraction network are input into the feature fusion module simultaneously; a self-attention operation and a cross-attention operation are first applied to obtain a preliminarily fused feature vector, which is then fed into a feed-forward network consisting of two linear layers and a LayerNorm layer. The input of each subsequent feature fusion module is the output of the previous one, and after six consecutive feature fusion modules the final multi-modal feature vector fusing the pedestrian image and text is obtained.
Further, the total loss function includes a contrastive loss between image-text pairs, a pedestrian ID classification loss, and a triplet loss.
The total loss value of the multi-modal image-text pedestrian re-identification model is determined from the contrastive loss of the image-text pairs of pedestrian feature vectors, the pedestrian ID loss, and the triplet loss of pedestrian re-identification.
The total loss value of the multi-modal image-text pedestrian re-identification model is determined by the following formula:
L = L_con + λ1·L_id + λ2·L_Triplet
wherein L is the total loss value of the multi-modal image-text pedestrian re-identification model, L_con is the image-text contrastive loss, L_id is the pedestrian category loss, L_Triplet is the triplet loss value, and λ1 and λ2 are coefficients.
The image-text contrastive loss is L_con = (L_i + L_t)/2, where L_i = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic·log(p_ic) and L_t takes the same form for the text branch. N is the total number of samples, M is the total number of pedestrian categories of the input training images, L_i is the cross-entropy loss for the image, and L_t is the cross-entropy loss for the text. y_ic is an indicator function taking the value 0 or 1: it is 1 if the true category of sample i equals c and 0 otherwise. p_ic is the predicted probability that sample i belongs to category c.
The pedestrian category loss L_id treats different pictures of the same pedestrian as the same category and is essentially a cross-entropy loss over pedestrian identities.
The triplet loss is L_Triplet = max(d(x_a, x_p) − d(x_a, x_n) + α, 0) for a triplet {x_a, x_p, x_n}, where x_a, x_p and x_n denote the anchor image, the positive sample and the negative sample respectively, d(·,·) is the distance between feature vectors, and α is the margin.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the above image-text multi-modal fusion pedestrian re-identification method when executing the program.
The application has the following beneficial effects:
The application uses the features of two different modalities, image and text, for visual pedestrian re-identification, and is one of the few methods that perform pedestrian re-identification with multi-modal features. The currently popular Vision Transformer model is adopted to extract features from images and texts, and a multi-modal feature fusion network is constructed to align and fuse the features of different modalities, thereby solving the problem that features of different modalities are difficult to fuse. Finally, a loss function is computed over the fused feature vector and the feature vector of the pedestrian to be retrieved, which realizes image-text multi-modal pedestrian re-identification. Training the network model with the proposed method enables image-text multi-modal pedestrian re-identification; it can effectively exploit the features provided by the two modalities and alleviate the difficulty caused by variable text descriptions.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present application;
FIG. 2 is a schematic diagram of an overall architecture for cross-modal pedestrian re-recognition in an embodiment of the application;
FIG. 3 is a schematic diagram of a feature extraction network according to an embodiment of the application;
fig. 4 is a schematic diagram of a multi-modal feature fusion network according to an embodiment of the present application.
Detailed Description
To make the technical means, features, objects, and effects of the application easy to understand, the image-text multi-modal fusion pedestrian re-identification method of the application is described in detail below with reference to embodiments and drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
FIG. 1 is a flow chart of the image-text multi-modal pedestrian re-identification method in an embodiment of the present application, and FIG. 2 is a schematic diagram of the overall architecture of cross-modal pedestrian re-identification in an embodiment of the present application.
As shown in FIG. 1 and FIG. 2, the image-text multi-modal fusion pedestrian re-identification method can be executed on a server and comprises the following steps:
step S10, obtaining image information containing pedestrians and corresponding text information describing the pedestrians to form a tagged image-text data set, wherein the tagged image-text data set comprises at least two different texts corresponding to the pedestrians and images of the pedestrians. (the tagged teletext data sets may also employ existing data sets)
The labeled image-text dataset is preprocessed to obtain the image-text pedestrian re-identification dataset. Preprocessing of the pedestrian text descriptions includes normalization of character encodings, unification of English upper and lower case, and text cleaning that removes non-text content by Unicode code. The preprocessed text data then needs to be converted into a form that can be input into a Transformer. For a sentence of text description, the natural language is converted into integer features (one word may become several integer tokens). Each word is mapped to an integer according to a mapping table composed of 256 ASCII code mappings and common BPE character combinations (the mapping table is a list of character combinations whose order indicates their frequency), and the sentence is then encoded according to this mapping table.
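For illustration only, the following sketch shows how such cleaning and greedy longest-match integer encoding could look; the helper names, the sample merge list, and the byte-level base vocabulary are assumptions standing in for the 256 ASCII mappings and BPE table described above, not anything defined by the patent.

    import re
    import unicodedata

    def clean_text(text: str) -> str:
        """Normalize Unicode, lower-case English letters, and strip non-text content."""
        text = unicodedata.normalize("NFKC", text)
        text = text.lower()
        # keep letters, digits, spaces and basic punctuation; drop everything else
        return re.sub(r"[^a-z0-9\s.,;:'\-]", " ", text).strip()

    def build_vocab(merge_list):
        """Vocabulary = 256 single-byte entries + frequency-ordered character combinations (BPE merges)."""
        vocab = {bytes([i]).decode("latin-1"): i for i in range(256)}
        for rank, piece in enumerate(merge_list):
            vocab[piece] = 256 + rank
        return vocab

    def encode(text: str, vocab) -> list:
        """Greedy longest-match encoding of a cleaned sentence into integer tokens."""
        tokens, i = [], 0
        max_len = max(len(p) for p in vocab)
        while i < len(text):
            for l in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + l]
                if piece in vocab:
                    tokens.append(vocab[piece])
                    i += l
                    break
        return tokens

    # usage with a hypothetical merge list; a real system would load a trained BPE table
    vocab = build_vocab(["the ", "man ", "wear", "ing ", "black ", "jacket"])
    print(encode(clean_text("The man is wearing a black jacket."), vocab))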
In this embodiment, the image information containing pedestrians may be pedestrian videos or pictures captured by surveillance cameras, in various forms such as RGB images or infrared images. The text information about a pedestrian consists of two sentences that describe the pedestrian differently, and text cleaning is applied to these sentences. In one embodiment, each pedestrian in the dataset has 6 images sharing the same text information.
Step S20: input the image information containing pedestrians into the feature extraction network to obtain pedestrian feature vectors, and input the pedestrian text information into the feature extraction network to obtain pedestrian feature vectors. The feature extraction network comprises two sub-networks, each corresponding to one of the two input modalities, image and text, containing pedestrian information. Each sub-network can produce a feature vector through an average pooling operation. In this embodiment, referring to FIG. 2, the feature extraction network includes an Image sub-network and a Text sub-network: the Image sub-network takes a pedestrian image as input and extracts the pedestrian feature vector from the image, and the Text sub-network takes a textual description of the pedestrian as input and extracts the pedestrian feature vector from the text. Both sub-networks are standard neural networks; the backbone of the Image sub-network is a Vision Transformer and the Text sub-network is a Transformer. FIG. 3 is a schematic diagram of the feature extraction network according to an embodiment of the application.
The image features and text features of pedestrians are processed by the feature extraction network, which contains two sub-networks, each taking as input the pedestrian information of one of the two modalities, image and text. The currently popular deep learning model, the Vision Transformer, is adopted as the Image sub-network. The Vision Transformer requires its input to be a sequence of vectors, i.e., a two-dimensional matrix, whereas image data is a three-dimensional matrix, which the Transformer cannot consume directly; therefore the image data must first be transformed by an Embedding layer. The Embedding layer is a convolutional layer with an N×N kernel. Passing an H×W×C image through the Embedding layer yields (H/N)×(W/N) patches. Each patch is still a three-dimensional tensor, so a matrix transformation is needed: the height and width dimensions of the patches are flattened to obtain a two-dimensional matrix, which serves as the input of the Transformer encoding layers.
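For illustration, a minimal PyTorch sketch of such a convolutional Embedding layer, under the assumptions that the kernel size N equals the stride and that the embedding dimension is 768; the class name and defaults are illustrative, not taken from the patent.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Convert an H x W x C image into a sequence of patch embeddings."""
        def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
            super().__init__()
            # an N x N convolution with stride N cuts the image into (H/N) x (W/N) patches
            self.proj = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                      # x: (B, C, H, W)
            x = self.proj(x)                       # (B, embed_dim, H/N, W/N)
            x = x.flatten(2)                       # flatten height and width -> (B, embed_dim, H*W/N^2)
            return x.transpose(1, 2)               # (B, num_patches, embed_dim)

    # usage: a 416 x 416 RGB image batch becomes 26 x 26 = 676 patch tokens
    tokens = PatchEmbedding()(torch.randn(2, 3, 416, 416))
    print(tokens.shape)                            # torch.Size([2, 676, 768])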
After the image and text are converted into patch/token sequences, a class token and a Position Embedding are added to each sequence so that the correlations among patches are preserved when they enter the Transformer encoding layers.
In the feature extraction network, the image and text extraction structures both consist of Transformer layers, and each Transformer layer uses a multi-head attention network with position embeddings.
In the training stage of the neural network, the feature extraction network obtains its pre-trained weights through the following pre-training task:
Step S201: construct the feature extraction network model. In this embodiment, the feature extraction network is formed by an Image sub-network and a Text sub-network connected in parallel.
Step S202: randomly sample 60% of the image-text pedestrian re-identification dataset as the data for the pre-training task. Crop and resize the sampled pedestrian images to a uniform size of 416×416; for the sampled pedestrian text data, replace the words describing clothing and colors in the text descriptions with token characters. After this processing, the pre-training dataset is obtained;
the image text data set comprises a plurality of groups of pedestrian images and two different English descriptive texts corresponding to the pedestrian images, wherein the images and the texts of the pedestrians are provided with ID labels of the pedestrians.
Step S203: input the pre-training dataset into the neural network for training, where the pedestrian image information goes into the Image sub-network and the pedestrian text information goes into the Text sub-network. In the pre-training stage, each step feeds a group of image-text pairs with the same pedestrian ID label from the image-text pedestrian re-identification training set into the corresponding sub-networks; after many such iterations, when the loss of the feature extraction network has stabilized and no longer decreases, the network parameters at that moment constitute the pre-trained feature extraction network. The network weights of the feature extraction network at this point are saved locally and denoted Weight1.
In the pre-training task, only the contrastive loss of image-text pairs and the pedestrian ID loss are used as the loss function of the feature extraction network. These two losses are explained in detail in step S302.
in this embodiment, inputting the pre-training data set into the feature extraction network for pre-training includes:
Input the image and text information of each modality in the pre-training dataset into the corresponding sub-network to obtain at least two pedestrian feature vectors; the information of the different modalities describes the same pedestrian, i.e., every group of multi-modal inputs fed into the neural network carries the same pedestrian ID label.
For image data, the format is a three-dimensional matrix [H, W, C]. A convolutional layer is used to flatten it into a two-dimensional matrix, and the input image I is encoded as {V_cls, V_1, V_2, ..., V_n}, where V_cls is the embedding vector of the [CLS] token. The resulting embedding sequence is fed into a six-layer Transformer to obtain the pedestrian image feature vector. The Text encoder of the Text sub-network likewise uses a six-layer Transformer model to obtain the pedestrian text feature vector.
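For illustration, a minimal sketch of this pre-training loop, assuming PyTorch; the encoder modules, the two loss callables, and the file name Weight1.pth are placeholders standing in for the networks and losses described above rather than anything defined by the patent.

    import torch

    def pretrain(image_encoder, text_encoder, loader, con_loss_fn, id_loss_fn,
                 epochs=10, lr=1e-4, out_path="Weight1.pth"):
        """Pre-train the two sub-networks with the image-text contrastive loss and the pedestrian ID loss."""
        params = list(image_encoder.parameters()) + list(text_encoder.parameters())
        optimizer = torch.optim.AdamW(params, lr=lr)
        for _ in range(epochs):
            for images, token_ids, pid_labels in loader:      # image-text pairs sharing one pedestrian ID
                img_feat = image_encoder(images)              # pedestrian features from the Image sub-network
                txt_feat = text_encoder(token_ids)            # pedestrian features from the Text sub-network
                loss = con_loss_fn(img_feat, txt_feat, pid_labels) + id_loss_fn(img_feat, txt_feat, pid_labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # save the pre-trained feature-extraction weights (Weight1)
        torch.save({"image": image_encoder.state_dict(), "text": text_encoder.state_dict()}, out_path)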
Step S30: construct the multi-modal feature fusion network, which is composed of several feature fusion modules. After the network structure is built, determine the total loss function of the multi-modal image-text pedestrian re-identification model from the image feature vectors, the text feature vectors, and the fused multi-modal feature vector of the pedestrian image and text by computing the mathematical relations among these vectors;
step S301, after the pre-training Weight1 of the feature extraction network is obtained, a multi-mode feature fusion network is constructed. The multi-mode feature fusion network is composed of a plurality of groups of feature fusion modules, and the feature fusion modules adopt the following structures: the feature vectors obtained from the two sub-neural networks of the feature extraction network are simultaneously input into the feature fusion module, self-attention operation and cross-attention operation are firstly carried out to obtain the feature vector which is preliminarily fused, and then the feature vector which is preliminarily fused is input into a feedforward neural network which consists of two linear layers and a LayerNorm layer in the embodiment. The input of the subsequent feature fusion module is the output of the previous feature fusion module, and the final multi-mode feature vector fused with the pedestrian image text is obtained after the six groups of continuous feature fusion modules are processed in total. Fig. 4 is a schematic diagram of a multi-modal feature fusion network according to an embodiment of the present application.
Step S302: determine the total loss value of the multi-modal image-text pedestrian re-identification model based on the contrastive loss of the image-text pairs of pedestrian feature vectors, the pedestrian ID loss, and the triplet loss of pedestrian re-identification.
In this embodiment, determining the loss value of the neural network from these three losses includes:
determining the total loss value of the multi-mode image-text pedestrian re-identification model through the following formula:
L=Lcon+λ1*L_id+λ2*L Triplet
wherein L is the total loss value of the multi-mode image-text pedestrian re-recognition model, lcon is the contrast loss of image text, L_id is the category loss of pedestrians, L Triplet For the triplet loss value, λ1 and λ2 are coefficients. The range of values of λ1 and λ2 in this embodiment may be 0.1 to 1.
In this embodiment, the optimal values of λ1 and λ2 can be determined by the following experiment:
First, fix the remaining parameters of the neural network. Randomly select several values in the range 0.1 to 1 as candidate values of λ1, and likewise select the same number of candidate values of λ2 in the range 0.1 to 1. Alternatively, the candidate values of λ1 and λ2 may be taken at equal intervals in the range 0.1 to 1, again with the same number of candidates for each.
Then let λ1 take each of its candidate values in turn and λ2 take each of its candidate values in turn, and record the mAP of cross-modal pedestrian re-identification for each combination of candidate values; the candidate values of λ1 and λ2 with the highest mAP are the optimal values of λ1 and λ2 for the current neural network parameters.
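For illustration, a small sketch of this selection procedure, read here as a full grid over the candidate pairs; the evaluate_map callable (train or evaluate with the given λ1 and λ2 and return the validation mAP) and the candidate grid are placeholders, not defined by the patent.

    def grid_search_lambdas(evaluate_map, candidates1, candidates2):
        """Try each (lambda1, lambda2) candidate pair and keep the one with the highest mAP."""
        best_lam1, best_lam2, best_map = None, None, -1.0
        for lam1 in candidates1:
            for lam2 in candidates2:
                score = evaluate_map(lam1, lam2)   # mAP of cross-modal pedestrian re-identification
                if score > best_map:
                    best_lam1, best_lam2, best_map = lam1, lam2, score
        return best_lam1, best_lam2, best_map

    # equally spaced candidates in [0.1, 1], the range given in this embodiment
    cands = [round(0.1 * k, 1) for k in range(1, 11)]
    # best_lam1, best_lam2, best_map = grid_search_lambdas(my_evaluate_map, cands, cands)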
In the present embodiment, the image-text contrastive loss is L_con = (L_i + L_t)/2, where L_i = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic·log(p_ic) and L_t takes the same form for the text branch. N is the total number of samples, M is the total number of pedestrian categories of the input training images, L_i is the cross-entropy loss for the image, and L_t is the cross-entropy loss for the text. y_ic is an indicator function taking the value 0 or 1: it is 1 if the true category of sample i equals c and 0 otherwise. p_ic is the predicted probability that sample i belongs to category c.
The pedestrian category loss L_id treats different pictures of the same pedestrian as the same category and is essentially a cross-entropy loss over pedestrian identities.
The triplet loss is L_Triplet = max(d(x_a, x_p) − d(x_a, x_n) + α, 0) for a triplet {x_a, x_p, x_n}, where x_a, x_p and x_n denote the anchor image, the positive sample and the negative sample respectively, d(·,·) is the distance between feature vectors, and α is the margin.
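For illustration, a minimal PyTorch sketch of combining the three terms above, assuming classification logits for the contrastive and ID terms, Euclidean distance in the triplet term, and a margin of 0.3; these assumptions and the function names are illustrative, not prescribed by the patent.

    import torch
    import torch.nn.functional as F

    def total_loss(img_logits, txt_logits, id_logits, labels,
                   anchor, positive, negative, lam1=0.5, lam2=0.5, margin=0.3):
        """L = L_con + lambda1 * L_id + lambda2 * L_Triplet, following the formula above."""
        # contrastive term: mean of the image and text cross-entropy losses over pedestrian categories
        l_con = 0.5 * (F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels))
        # pedestrian ID loss: cross-entropy over identities on the fused multi-modal feature
        l_id = F.cross_entropy(id_logits, labels)
        # triplet loss with Euclidean distance and a margin (margin value is an assumption)
        d_ap = F.pairwise_distance(anchor, positive)
        d_an = F.pairwise_distance(anchor, negative)
        l_triplet = F.relu(d_ap - d_an + margin).mean()
        return l_con + lam1 * l_id + lam2 * l_triplet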
Step S40: input the image-text pedestrian re-identification dataset obtained in step S10 into the feature extraction network for feature extraction, where the feature extraction network loads the network weights Weight1; the pedestrian image information is fed into the Image sub-network and the pedestrian text information into the Text sub-network, yielding the image and text feature vectors respectively.
Input the obtained pedestrian image feature vector and pedestrian text feature vector into the multi-modal feature fusion network constructed in step S301 for fusion, splice them into one feature vector, namely the final multi-modal feature vector fusing the pedestrian image and text, and compute its loss value. In this embodiment, the pedestrian feature vectors output by the sub-networks are concatenated into one long feature vector (i.e., the fused multi-modal feature vector of the pedestrian image and text) in a preset order, with the pedestrian image feature vector first and the pedestrian text feature vector second.
Repeatedly input the information of each modality in the image-text pedestrian re-identification dataset into the corresponding sub-networks of the feature extraction network, and adjust the network parameters of the multi-modal image-text pedestrian re-identification model based on its total loss value until the loss of the model stabilizes and no longer decreases. The network weights at this point are saved as Weight2;
and S50, performing pedestrian re-recognition on the target domain image and text data to be detected by using the trained multi-mode pedestrian re-recognition model, and outputting a recognition result of the corresponding target domain.
Input the data to be identified into the trained multi-modal pedestrian re-identification model to obtain the fused multi-modal feature vector of the pedestrian image and text, and use it to re-identify the pedestrian in the image information. In implementation, re-identification is performed by measuring the similarity between the pedestrian's multi-modal feature vector and the target pedestrian labels.
The process of re-identifying the target pedestrian from the fused multi-modal feature vector is as follows: compute the cosine similarity scores between the multi-modal feature vector and the candidate target pedestrians, and the target pedestrian with the highest score is the prediction result of the multi-modal image-text pedestrian re-identification model.
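For illustration, a small sketch of this retrieval step, assuming a gallery matrix of fused multi-modal feature vectors with one known pedestrian label per row; the variable and function names are illustrative.

    import torch
    import torch.nn.functional as F

    def retrieve_identity(query_feat, gallery_feats, gallery_labels):
        """Return the pedestrian label whose gallery feature has the highest cosine similarity to the query."""
        query = F.normalize(query_feat.unsqueeze(0), dim=1)       # (1, D)
        gallery = F.normalize(gallery_feats, dim=1)               # (G, D)
        scores = query @ gallery.t()                              # cosine similarities, shape (1, G)
        best = scores.argmax(dim=1).item()
        return gallery_labels[best], scores[0, best].item()

    # usage with a toy gallery of 5 identities and 256-dimensional fused features
    label, score = retrieve_identity(torch.randn(256), torch.randn(5, 256), ["p1", "p2", "p3", "p4", "p5"])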
Those skilled in the art will appreciate that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, magnetic disks, optical disks, hard disks, and so on; the above functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of a device, and all or part of the above functions are realized when the program in the memory is executed by a processor. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and the functions of the above embodiments can be realized by downloading or copying the program into the memory of a local device, or by updating the system version of the local device, and executing the program in the memory with a processor.
The foregoing description of the application has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the application pertains, based on the idea of the application.

Claims (10)

1. A pedestrian re-identification method based on image-text multi-modal fusion, characterized by comprising the following steps:
acquiring a pedestrian dataset containing pedestrian images and at least two different texts corresponding to each pedestrian;
inputting the pedestrian images and the corresponding text descriptions into a pre-trained multi-modal image-text pedestrian re-identification network to obtain a predicted classification result;
the multi-modal image-text pedestrian re-identification network being configured to contain two branches that capture person features in the image and text modalities respectively, yielding person features representing the image modality and the text modality.
2. The pedestrian re-identification method based on image-text multi-modal fusion according to claim 1, characterized by comprising the following steps:
step S1, preprocessing a labeled image-text dataset to obtain an image-text pedestrian re-identification dataset, and dividing it into a training set, a validation set, and a test set;
step S2, constructing the feature extraction network of the multi-modal image-text pedestrian re-identification model, pre-training the feature extraction network, and saving the pre-trained weights of the feature extraction network;
step S3, constructing the multi-modal feature fusion network and determining the total loss function of the multi-modal image-text pedestrian re-identification model;
step S4, training the multi-modal image-text pedestrian re-identification model with the pre-trained weights of the feature extraction network and the total loss function until the objective function converges;
step S5, performing pedestrian re-identification on the target-domain image and text data to be examined with the trained multi-modal pedestrian re-identification model, and outputting the recognition result for the corresponding target domain.
3. The pedestrian re-identification method based on image-text multi-modal fusion according to claim 2, characterized in that a labeled image-text dataset containing pedestrian images and at least two different texts corresponding to each pedestrian is used; preprocessing of the pedestrian text descriptions includes normalization of character encodings, unification of English upper and lower case, and text cleaning that removes non-text content by Unicode code; the preprocessed text data is converted into a form that can be input into a Transformer; for a sentence of text description, the natural language is converted into integer features; each word is mapped to an integer according to a mapping table composed of 256 ASCII code mappings and common BPE character combinations, and the sentence is then encoded according to the mapping table.
4. The pedestrian re-identification method based on image-text multi-modal fusion according to claim 2, characterized in that the feature extraction network comprises an Image sub-network and a Text sub-network which process the image features and text features of pedestrians respectively, and each sub-network takes as input the pedestrian information of its corresponding modality, image or text.
5. The pedestrian re-identification method based on image-text multi-modal fusion according to claim 4, characterized in that in the feature extraction network, the backbone of the Image sub-network is a Vision Transformer and the Text sub-network is a Transformer; both sub-networks comprise Transformer layers, and each Transformer layer uses a multi-head attention network with position embeddings.
6. The pedestrian re-identification method based on image-text multi-modal fusion according to claim 5, characterized in that, after the image and text are converted into patch/token sequences, a class token and a Position Embedding are added to each sequence so that the correlations among patches are preserved when they enter the Transformer encoding layers.
7. The pedestrian re-identification method based on image-text multi-modal fusion according to claim 1, characterized in that the multi-modal feature fusion network is composed of several groups of feature fusion modules with the following structure: the feature vectors obtained from the two sub-networks of the feature extraction network are input into the feature fusion module simultaneously; a self-attention operation and a cross-attention operation are first applied to obtain a preliminarily fused feature vector, which is then fed into a feed-forward network consisting of two linear layers and a LayerNorm layer; the input of each subsequent feature fusion module is the output of the previous one, and after six consecutive feature fusion modules the final multi-modal feature vector fusing the pedestrian image and text is obtained.
8. The pedestrian re-identification method according to claim 2, characterized in that the total loss function includes a contrastive loss between image-text pairs, a pedestrian ID classification loss, and a triplet loss.
9. The pedestrian re-identification method according to claim 8, characterized in that the total loss value of the multi-modal image-text pedestrian re-identification model is determined based on the contrastive loss of the image-text pairs of pedestrian feature vectors, the pedestrian ID loss, and the triplet loss of pedestrian re-identification;
the total loss value of the multi-modal image-text pedestrian re-identification model is determined by the following formula:
L = L_con + λ1·L_id + λ2·L_Triplet
wherein L is the total loss value of the multi-modal image-text pedestrian re-identification model, L_con is the image-text contrastive loss, L_id is the pedestrian category loss, L_Triplet is the triplet loss value, and λ1 and λ2 are coefficients;
the image-text contrastive loss is L_con = (L_i + L_t)/2, where L_i = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic·log(p_ic) and L_t takes the same form for the text; N is the total number of samples, M is the total number of pedestrian categories of the input training images, L_i is the cross-entropy loss of the image, and L_t is the cross-entropy loss of the text; y_ic is an indicator function taking the value 0 or 1, being 1 if the true category of sample i equals c and 0 otherwise; p_ic is the predicted probability that sample i belongs to category c;
the pedestrian category loss L_id treats different pictures of the same pedestrian as the same category and is essentially a cross-entropy loss;
the triplet loss is L_Triplet = max(d(x_a, x_p) − d(x_a, x_n) + α, 0) for a triplet {x_a, x_p, x_n}, where x_a, x_p and x_n denote the anchor image, the positive sample and the negative sample respectively, d(·,·) is the distance between feature vectors, and α is the margin.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor implements the image-text multi-modal fusion pedestrian re-identification method according to any one of claims 1 to 9 when executing the program.
CN202311052722.7A 2023-08-21 2023-08-21 Pedestrian re-identification method based on image-text multi-mode fusion Pending CN117079310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311052722.7A CN117079310A (en) 2023-08-21 2023-08-21 Pedestrian re-identification method based on image-text multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311052722.7A CN117079310A (en) 2023-08-21 2023-08-21 Pedestrian re-identification method based on image-text multi-mode fusion

Publications (1)

Publication Number Publication Date
CN117079310A true CN117079310A (en) 2023-11-17

Family

ID=88707563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311052722.7A Pending CN117079310A (en) 2023-08-21 2023-08-21 Pedestrian re-identification method based on image-text multi-mode fusion

Country Status (1)

Country Link
CN (1) CN117079310A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133231A (en) * 2024-05-10 2024-06-04 成都梵辰科技有限公司 Multi-mode data processing method and processing system


Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Shi et al. Can a machine generate humanlike language descriptions for a remote sensing image?
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
EP3399460A1 (en) Captioning a region of an image
US20170011279A1 (en) Latent embeddings for word images and their semantics
Jain et al. Unconstrained scene text and video text recognition for arabic script
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN113593661B (en) Clinical term standardization method, device, electronic equipment and storage medium
CN111475622A (en) Text classification method, device, terminal and storage medium
Gordo et al. LEWIS: latent embeddings for word images and their semantics
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114282013A (en) Data processing method, device and storage medium
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN114691864A (en) Text classification model training method and device and text classification method and device
CN117079310A (en) Pedestrian re-identification method based on image-text multi-mode fusion
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113076905B (en) Emotion recognition method based on context interaction relation
CN117453859A (en) Agricultural pest and disease damage image-text retrieval method, system and electronic equipment
Li et al. Review network for scene text recognition
CN114925198B (en) Knowledge-driven text classification method integrating character information
CN113516118B (en) Multi-mode cultural resource processing method for joint embedding of images and texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination