CN114547249A - Vehicle retrieval method based on natural language and visual features - Google Patents

Vehicle retrieval method based on natural language and visual features

Info

Publication number
CN114547249A
Authority
CN
China
Prior art keywords
vehicle
natural language
feature
features
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210173817.3A
Other languages
Chinese (zh)
Inventor
高文飞
王瑞雪
王磊
王辉
郭丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Rongling Technology Development Co ltd
Original Assignee
Jinan Rongling Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Rongling Technology Development Co ltd filed Critical Jinan Rongling Technology Development Co ltd
Priority to CN202210173817.3A priority Critical patent/CN114547249A/en
Publication of CN114547249A publication Critical patent/CN114547249A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle retrieval method based on natural language and visual features, which comprises the following steps: S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos with a detection model to build the data set; S2, training a vehicle re-identification model using a multi-task learning framework as the basic model; S3, obtaining a feature extractor; and S4, constructing a multi-modal vehicle track retrieval system based on natural language and visual features, and retrieving vehicle tracks. With the vehicle retrieval method based on natural language and visual features, vehicles matching a semantic description can be found conveniently through natural language; compared with prior-art vehicle retrieval systems based on vision alone, the method is more flexible and lowers the retrieval threshold, while the vehicle re-identification model extracts highly discriminative visual features of vehicles and enriches the fine-grained information of the features.

Description

Vehicle retrieval method based on natural language and visual features
Technical Field
The invention relates to the technical field of intelligent traffic, in particular to a vehicle retrieval method based on natural language and visual features.
Background
Target tracking is one of the popular fields of computer vision research; it refers to using artificial intelligence techniques to automatically follow a designated target across consecutive video frames. As a basic technology, target tracking is widely applied in many fields, such as autonomous driving, smart cities and intelligent monitoring.
The vehicle retrieval method based on natural language and visual features plays an important role in target tracking for smart city traffic. In the vehicle retrieval task based on natural language and visual features, given a natural language description, the corresponding vehicle track segments must be retrieved from a library of video segments; for example, for "a red SUV turns right at an intersection", the corresponding vehicle track segments need to be retrieved and recalled. In the prior art, however, cross-modal vehicle retrieval based on natural language and visual features either uses overly simple visual features — for example, features from ImageNet pre-training, which differ greatly from in-domain vehicles, so that highly discriminative features cannot be extracted — or retrieves based on the visual modality alone, which lacks flexibility and raises the retrieval threshold; the features used for cross-modal vehicle retrieval are too simple to describe vehicles at a fine-grained level. Therefore, we improve on this and propose a vehicle retrieval method based on natural language and visual features.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
The invention relates to a vehicle retrieval method based on natural language and visual features, which comprises the following steps:
S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos using a detection model to construct the data set;
S2, training a vehicle re-identification model using a multi-task learning framework as the basic model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and normalization, and then constructing batch training data; specifically, P classes are sampled without replacement from the library, with K pictures per class, and used as a batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to convert the feature map into a one-dimensional vector, defined as feature F_1; this vector is then used to calculate the metric learning loss;
S2-3, passing feature F_1 through a batch normalization layer to obtain feature F_2, which is then used to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation; after multiple iterations the network acquires the ability to distinguish different vehicles, and the trained network parameters are then saved;
S3, obtaining a feature extractor: the head of the Re-ID model, namely the classification layer (everything after the BN layer), is removed, and the post-BN feature F_2 is used as the feature representation of the vehicle, yielding a vehicle feature extractor;
S4, constructing a multi-modal vehicle track retrieval system based on natural language and visual features and retrieving vehicle tracks, specifically comprising the following steps:
S4-1, visual feature extraction: frames are extracted from each video and the main body of the vehicle is cropped from each frame; the vehicle feature extractor from S3 then extracts features from each frame picture to form the feature vector V, and finally a GRU model mines temporal information for fusion, obtaining the visual feature f_v;
S4-2, natural language feature extraction: N pieces of natural language are input; for each piece, word vector features S are extracted using a GloVe model pre-trained on large-scale corpus data, and a GRU model fuses the word vectors to obtain the natural language feature f_s;
S4-3, contrastive learning: the obtained visual feature f_v and natural language feature f_s are used to calculate a contrastive loss in a high-dimensional space; the matching degree between the natural language and each vehicle track video, namely the cosine similarity, is calculated, the vehicle tracks are ranked by matching degree, and the several tracks with the highest similarity are returned, thereby retrieving vehicle tracks through natural language.
As a preferred technical solution of the present invention, in S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
As a preferred technical solution of the present invention, in S2-2, the metric learning loss is computed as a triplet loss, defined as follows:
$$L_t = \max\left(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \alpha,\ 0\right)$$
in the formula: l istRepresenting triple losses, f (—) representing a mapping function of the network, i.e. a function that transforms the picture into a one-dimensional vector, xa,xp,xnThe partial table represents anchor images, normal images and reverse images of triples, the triples are obtained in a difficult sampling mode, specifically, for a group of batch data, each picture is circularly used as an anchor image, then the same kind of picture with the farthest distance is found as a normal image, and the different kind of picture with the closest distance is found as a reverse image, so that a triplet is constructed.
As a preferred embodiment of the present invention, in S2-3, a classification loss is calculated, whose labels are the previously assigned ID labels; the loss function used is the cross-entropy loss:
$$L_s = -\sum_{i=1}^{N} y_i \log(p_i)$$
in the formula: l issRepresenting class learning penalty, i.e. cross-entropy penalty, yiIs an indicator variable, y if the ith class matches the target classiNot 1 but 0, piIs the predicted likelihood that the picture belongs to the i-th class.
As a preferred embodiment of the present invention, in S4-1, the feature vector is

$$V = \{c_1, c_2, \ldots, c_{T_v}\}, \qquad c_t \in \mathbb{R}^{2048}$$

where T_v is the number of frames in a video segment, c_t is the feature representation of the t-th frame, and 2048 is the feature dimension. The T_v features are then fused by mining temporal information with a GRU,

$$h_t = \mathrm{GRU}(c_t, h_{t-1}), \quad t = 1, \ldots, T_v,$$

yielding the fused feature h_{T_v}. Finally, the feature is mapped to a high-dimensional space by a fully connected layer and batch-normalized to obtain the final visual feature

$$f_v = \mathrm{BN}(W_\alpha h_{T_v} + b_\alpha)$$

where W_α and b_α denote the weight and bias of the fully connected layer.
As a preferred technical solution of the invention, in S4-2, the word vector features are

$$S = \{w_1, w_2, \ldots, w_{T_s}\}$$

where T_s is the number of words in the natural language and w_t is the t-th word vector. Another GRU module then fuses the word vector features,

$$g_t = \mathrm{GRU}(w_t, g_{t-1}), \quad t = 1, \ldots, T_s,$$

and finally the fused feature g_{T_s} is passed through a fully connected layer and batch normalization to obtain the final natural language feature

$$f_s = \mathrm{BN}(W_\gamma g_{T_s} + b_\gamma)$$

where W_γ and b_γ denote the weight and bias of the fully connected layer.
As a preferred embodiment of the present invention, in S4-3, the contrastive loss L is defined as

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(m - d,\ 0)^2 \right]$$

where N denotes the number of sample pairs, d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$, y indicates whether the two features match (y = 1 when the natural language feature and the visual feature match, and y = 0 when they do not), and m is a preset threshold.
The invention has the following beneficial effects:
With the vehicle retrieval method based on natural language and visual features, vehicles matching a semantic description can be found conveniently through natural language; compared with conventional vision-only vehicle retrieval systems, the method is more flexible and lowers the retrieval threshold, while the vehicle re-identification model extracts highly discriminative visual features of vehicles and enriches the fine-grained information of the features.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
FIG. 1 is a schematic view of a vehicle re-identification model of the present invention;
FIG. 2 is a schematic diagram of a vehicle trajectory retrieval system of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
The invention relates to a vehicle retrieval method based on natural language and visual features, which comprises the following steps:
S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos using a detection model to construct the data set;
S2, as shown in FIG. 1, training a vehicle re-identification model using a multi-task learning framework as the basic model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and normalization, and then constructing batch training data; specifically, P classes are sampled without replacement from the library, with K pictures per class, and used as a batch of training data (a code sketch of this sampling follows S2-4 below);
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to convert the feature map into a one-dimensional vector, defined as feature F_1; this vector is then used to calculate the metric learning loss;
S2-3, passing feature F_1 through a batch normalization layer to obtain feature F_2, which is then used to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation; after multiple iterations the network acquires the ability to distinguish different vehicles, and the trained network parameters are then saved;
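The following minimal PyTorch sketch illustrates how steps S2-1 to S2-4 fit together: P×K batch sampling, a residual network with generalized-mean pooling producing F_1, a batch normalization neck producing F_2, and the ID classifier. It is an illustrative assumption rather than the patent's reference implementation; all module and parameter names (GeM, ReIDNet, pk_batch, P=16, K=4) are hypothetical.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GeM(nn.Module):
    """Generalized-mean pooling: p = 1 gives average pooling, p -> inf approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                        # x: (B, C, H, W) feature map
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)  # (B, C)

class ReIDNet(nn.Module):
    """Residual network -> GeM pooling -> F1; BN neck -> F2 -> ID classifier."""
    def __init__(self, num_ids):                 # num_ids = N, the number of ID labels
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.conv = nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc
        self.pool = GeM()
        self.bn = nn.BatchNorm1d(2048)
        self.classifier = nn.Linear(2048, num_ids, bias=False)

    def forward(self, x):
        f1 = self.pool(self.conv(x))             # F1: one-dimensional vector (S2-2)
        f2 = self.bn(f1)                         # F2: post-BN feature (S2-3)
        return f1, f2, self.classifier(f2)       # logits for the classification loss

def pk_batch(ids_to_paths, P=16, K=4):
    """S2-1: draw P classes without replacement, K pictures each
    (assumes every class holds at least K pictures)."""
    classes = random.sample(list(ids_to_paths), P)
    return [(path, c) for c in classes
            for path in random.sample(ids_to_paths[c], K)]
```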
S3, obtaining a feature extractor: the head of the Re-ID model, namely the classification layer (everything after the BN layer), is removed, and the post-BN feature F_2 is used as the feature representation of the vehicle, yielding a vehicle feature extractor; the extractor converts a picture of a vehicle into a feature vector that provides a complete description of the vehicle containing high-level semantic information, because the feature extractor has been fully trained using the ID labels;
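Continuing the sketch above, step S3 amounts to ignoring the classification head and reading out the post-BN feature F_2; the L2 normalization shown is an added assumption that is convenient for the later cosine-similarity ranking.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_feature(model, images):              # model: a trained ReIDNet (see above)
    """S3: the classification layer is ignored; the post-BN feature F2
    of shape (B, 2048) is returned as the vehicle representation."""
    model.eval()
    _, f2, _ = model(images)
    return F.normalize(f2, dim=1)                # assumed L2 normalization
```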
S4, as shown in FIG. 2, constructing a multi-modal vehicle track retrieval system based on natural language and visual features to retrieve vehicle tracks, specifically comprising the following steps:
S4-1, visual feature extraction: frames are extracted from each video and the main body of the vehicle is cropped from each frame; the vehicle feature extractor from S3 then extracts features from each frame picture to form the feature vector V, and finally a GRU model mines temporal information for fusion, obtaining the visual feature f_v;
S4-2, natural language feature extraction: N pieces of natural language are input; for each piece, word vector features S are extracted using a GloVe model pre-trained on large-scale corpus data, and a GRU model fuses the word vectors to obtain the natural language feature f_s;
S4-3, contrastive learning: the obtained visual feature f_v and natural language feature f_s are used to calculate a contrastive loss in a high-dimensional space, pulling matched natural language and visual features closer together while pushing unmatched ones farther apart; the matching degree between the natural language and each vehicle track video, namely the cosine similarity, is calculated, the vehicle tracks are ranked by matching degree, and the several tracks with the highest similarity are returned, thereby retrieving vehicle tracks through natural language.
In S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
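A minimal sketch of this grouping rule (function and variable names are hypothetical): detections sharing a license plate number form one class and receive consecutive ID labels.

```python
from collections import defaultdict

def build_reid_dataset(detections):
    """detections: iterable of (picture_path, plate_number) pairs produced by
    the detection model. Same plate number -> same class; IDs run 0 .. N-1."""
    by_plate = defaultdict(list)
    for path, plate in detections:
        by_plate[plate].append(path)
    plate_to_id = {plate: i for i, plate in enumerate(sorted(by_plate))}
    samples = [(path, plate_to_id[plate])
               for plate, paths in by_plate.items() for path in paths]
    return samples, len(plate_to_id)             # len(plate_to_id) is N
```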
Wherein, in S2-2, the metric learning loss is computed as a triplet loss, as follows:
$$L_t = \max\left(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \alpha,\ 0\right)$$
in the formula: l istRepresenting triple losses, f (—) representing a mapping function of the network, i.e. a function that transforms the picture into a one-dimensional vector, xa,xp,xnThe partial table represents anchor images, normal images and reverse images of triples, the triples are obtained in a difficult sampling mode, specifically, for a group of batch data, each picture is circularly used as an anchor image, then the same kind of picture with the farthest distance is found as a normal image, and the different kind of picture with the closest distance is found as a reverse image, so that a triplet is constructed.
Wherein, in S2-3, a classification loss is calculated, whose labels are the previously assigned ID labels; the loss function used is the cross-entropy loss:
$$L_s = -\sum_{i=1}^{N} y_i \log(p_i)$$
in the formula: l issRepresenting class learning penalty, i.e. cross-entropy penalty, yiIs an indicator variable if the ith category and purposeStandard class matching rule yiNot 1 but 0, piThe predicted possibility that the picture belongs to the ith class can be used, so that the problems of large intra-class difference and small inter-class difference in vehicle re-identification can be solved through a multi-task learning mechanism of metric learning and classification learning.
Wherein, in S4-1, the feature vector is

$$V = \{c_1, c_2, \ldots, c_{T_v}\}, \qquad c_t \in \mathbb{R}^{2048}$$

where T_v is the number of frames in a video segment, c_t is the feature representation of the t-th frame, and 2048 is the feature dimension. The T_v features are then fused by mining temporal information with a GRU,

$$h_t = \mathrm{GRU}(c_t, h_{t-1}), \quad t = 1, \ldots, T_v,$$

yielding the fused feature h_{T_v}. Finally, the feature is mapped to a high-dimensional space by a fully connected layer and batch-normalized to obtain the final visual feature

$$f_v = \mathrm{BN}(W_\alpha h_{T_v} + b_\alpha)$$

where W_α and b_α denote the weight and bias of the fully connected layer.
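A sketch of this visual branch under the stated dimensions; the hidden and embedding sizes (1024 and 256) are assumptions not given in the text.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """S4-1: fuse per-frame 2048-d Re-ID features with a GRU, then FC + BN -> f_v."""
    def __init__(self, feat_dim=2048, hidden=1024, embed=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed)        # W_alpha, b_alpha
        self.bn = nn.BatchNorm1d(embed)

    def forward(self, frames):                    # frames: (B, T_v, 2048), i.e. V
        _, h = self.gru(frames)                   # h: (1, B, hidden), final state h_{T_v}
        return self.bn(self.fc(h.squeeze(0)))     # visual feature f_v: (B, embed)
```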
Wherein, in S4-2, the word vector features are

$$S = \{w_1, w_2, \ldots, w_{T_s}\}$$

where T_s is the number of words in the natural language and w_t is the t-th word vector. Another GRU module then fuses the word vector features,

$$g_t = \mathrm{GRU}(w_t, g_{t-1}), \quad t = 1, \ldots, T_s,$$

and finally the fused feature g_{T_s} is passed through a fully connected layer and batch normalization to obtain the final natural language feature

$$f_s = \mathrm{BN}(W_\gamma g_{T_s} + b_\gamma)$$

where W_γ and b_γ denote the weight and bias of the fully connected layer.
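The natural language branch mirrors the visual one; a sketch assuming 300-dimensional GloVe word vectors (looking the vectors up from a pretrained GloVe table is outside this snippet):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """S4-2: fuse GloVe word vectors with a second GRU, then FC + BN -> f_s."""
    def __init__(self, word_dim=300, hidden=1024, embed=256):
        super().__init__()
        self.gru = nn.GRU(word_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed)        # W_gamma, b_gamma
        self.bn = nn.BatchNorm1d(embed)

    def forward(self, word_vecs):                 # word_vecs: (B, T_s, 300), i.e. S
        _, g = self.gru(word_vecs)                # final state g_{T_s}
        return self.bn(self.fc(g.squeeze(0)))     # natural language feature f_s
```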
Wherein, in S4-3, the contrastive loss L is defined as

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(m - d,\ 0)^2 \right]$$

where N denotes the number of sample pairs, d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$, y indicates whether the two features match (y = 1 when the natural language feature and the visual feature match, and y = 0 when they do not), and m is a preset threshold.
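A direct transcription of this loss as a sketch; m = 1.0 is an assumed default for the preset threshold.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_s, f_v, y, m=1.0):
    """f_s, f_v: (N, D) paired features; y: (N,) with 1 for matched pairs, 0 otherwise."""
    d = torch.norm(f_s - f_v, dim=1)              # Euclidean distance ||f_s - f_v||_2
    per_pair = y * d.pow(2) + (1 - y) * F.relu(m - d).pow(2)
    return per_pair.mean() / 2                    # equals 1/(2N) * sum over pairs
```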
After model training is completed, the natural language part of the model extracts features from all natural language descriptions in the Query library, converting them into feature representations, while the visual part of the model extracts features from the vehicle track videos in the Gallery set. For a given piece of natural language, the matching degree with every vehicle track in the Gallery, namely the cosine similarity, is calculated; the vehicle tracks are ranked by matching degree and the several tracks with the highest similarity are returned, thereby completing vehicle track retrieval through natural language.
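A sketch of this retrieval phase (function names hypothetical): Gallery track features are extracted once, and each Query sentence is ranked against them by cosine similarity.

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, gallery_feats, top_k=10):
    """query_feat: (D,) natural language feature f_s; gallery_feats: (M, D) stacked
    vehicle track features f_v. Returns indices of the top_k most similar tracks."""
    q = F.normalize(query_feat, dim=0)
    g = F.normalize(gallery_feats, dim=1)
    sims = g @ q                                  # cosine similarity with every track
    return sims.topk(top_k).indices.tolist()
```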
This vehicle retrieval mode is more flexible and has a lower retrieval threshold, since a natural language description suffices; meanwhile, the vehicle re-identification model extracts highly discriminative visual features of vehicles and enriches the fine-grained information of the features.
Example 2
The performance of the vehicle retrieval method based on natural language and visual features on the CityFlow-NL data set is shown in the following table:

| Method | MRR | Recall@5 | Recall@10 |
| Baseline method | 0.0269 | 0.0264 | 0.0491 |
| Method of the invention (ImageNet features) | 0.1091 | 0.1669 | 0.3178 |
| Method of the invention (Re-ID features) | 0.1613 | 0.2585 | 0.3925 |
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A vehicle retrieval method based on natural language and visual features is characterized by comprising the following steps:
S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos using a detection model to construct the data set;
S2, training a vehicle re-identification model using a multi-task learning framework as the basic model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and normalization, and then constructing batch training data; specifically, P classes are sampled without replacement from the library, with K pictures per class, and used as a batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to convert the feature map into a one-dimensional vector, defined as feature F_1; this vector is then used to calculate the metric learning loss;
S2-3, passing feature F_1 through a batch normalization layer to obtain feature F_2, which is then used to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation; after multiple iterations the network acquires the ability to distinguish different vehicles, and the trained network parameters are then saved;
S3, obtaining a feature extractor: the head of the Re-ID model, namely the classification layer (everything after the BN layer), is removed, and the post-BN feature F_2 is used as the feature representation of the vehicle, yielding a vehicle feature extractor;
S4, constructing a multi-modal vehicle track retrieval system based on natural language and visual features and retrieving vehicle tracks, specifically comprising the following steps:
S4-1, visual feature extraction: frames are extracted from each video and the main body of the vehicle is cropped from each frame; the vehicle feature extractor from S3 then extracts features from each frame picture to form the feature vector V, and finally a GRU model mines temporal information for fusion, obtaining the visual feature f_v;
S4-2, natural language feature extraction: N pieces of natural language are input; for each piece, word vector features S are extracted using a GloVe model pre-trained on large-scale corpus data, and a GRU model fuses the word vectors to obtain the natural language feature f_s;
S4-3, contrastive learning: the obtained visual feature f_v and natural language feature f_s are used to calculate a contrastive loss in a high-dimensional space; the matching degree between the natural language and each vehicle track video, namely the cosine similarity, is calculated, the vehicle tracks are ranked by matching degree, and the several tracks with the highest similarity are returned, thereby retrieving vehicle tracks through natural language.
2. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
3. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S2-2, the metric learning loss is computed as a triplet loss, as follows:
$$L_t = \max\left(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \alpha,\ 0\right)$$
in the formula: l istRepresenting triple losses, f (—) representing a mapping function of the network, i.e. a function that transforms the picture into a one-dimensional vector, xa,xp,xnThe partial tables represent anchor images, positive examples images and negative examples images of triples, the triples are obtained in a difficult sampling mode, specifically, for a group of batch data, each picture is circularly used as an anchor image, and then the same image with the farthest distance is foundAnd constructing a triple by using the class picture as a normal example image and using the different class pictures with the nearest distance as a reverse example image.
4. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S2-3, a classification loss is calculated, whose labels are the previously assigned ID labels; the loss function used is the cross-entropy loss:
$$L_s = -\sum_{i=1}^{N} y_i \log(p_i)$$
in the formula: l issRepresenting class learning penalty, i.e. cross-entropy penalty, yiIs an indicator variable, y if the ith class matches the target classiNot 1 but 0, piIs the predicted likelihood that the picture belongs to the i-th class.
5. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-1, the feature vector is

$$V = \{c_1, c_2, \ldots, c_{T_v}\}, \qquad c_t \in \mathbb{R}^{2048}$$

where T_v is the number of frames in a video segment, c_t is the feature representation of the t-th frame, and 2048 is the feature dimension; the T_v features are then fused by mining temporal information with a GRU,

$$h_t = \mathrm{GRU}(c_t, h_{t-1}), \quad t = 1, \ldots, T_v,$$

yielding the fused feature h_{T_v}; finally, the feature is mapped to a high-dimensional space by a fully connected layer and batch-normalized to obtain the final visual feature

$$f_v = \mathrm{BN}(W_\alpha h_{T_v} + b_\alpha)$$

where W_α and b_α denote the weight and bias of the fully connected layer.
6. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-2, the word vector features are

$$S = \{w_1, w_2, \ldots, w_{T_s}\}$$

where T_s is the number of words in the natural language and w_t is the t-th word vector; another GRU module then fuses the word vector features,

$$g_t = \mathrm{GRU}(w_t, g_{t-1}), \quad t = 1, \ldots, T_s,$$

and finally the fused feature g_{T_s} is passed through a fully connected layer and batch normalization to obtain the final natural language feature

$$f_s = \mathrm{BN}(W_\gamma g_{T_s} + b_\gamma)$$

where W_γ and b_γ denote the weight and bias of the fully connected layer.
7. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-3, the contrastive loss L is defined as

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(m - d,\ 0)^2 \right]$$

where N denotes the number of sample pairs, d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$, y indicates whether the two features match (y = 1 when the natural language feature and the visual feature match, and y = 0 when they do not), and m is a preset threshold.
CN202210173817.3A 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features Pending CN114547249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210173817.3A CN114547249A (en) 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210173817.3A CN114547249A (en) 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features

Publications (1)

Publication Number Publication Date
CN114547249A true CN114547249A (en) 2022-05-27

Family

ID=81678470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210173817.3A Pending CN114547249A (en) 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features

Country Status (1)

Country Link
CN (1) CN114547249A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841596A (en) * 2022-12-16 2023-03-24 华院计算技术(上海)股份有限公司 Multi-label image classification method and training method and device of multi-label image classification model
CN115880661A (en) * 2023-02-01 2023-03-31 天翼云科技有限公司 Vehicle matching method and device, electronic equipment and storage medium
CN117171382A (en) * 2023-07-28 2023-12-05 宁波善德电子集团有限公司 Vehicle video retrieval method based on comprehensive features and natural language
CN117630344A (en) * 2024-01-25 2024-03-01 西南科技大学 Method for detecting slump range of concrete on line in real time

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647700A (en) * 2018-04-14 2018-10-12 华中科技大学 Multitask vehicle part identification model based on deep learning, method and system
CN109871449A (en) * 2019-03-18 2019-06-11 北京邮电大学 A kind of zero sample learning method end to end based on semantic description
CN110073371A (en) * 2017-05-05 2019-07-30 辉达公司 For to reduce the loss scaling that precision carries out deep neural network training
KR102095685B1 (en) * 2019-12-02 2020-04-01 주식회사 넥스파시스템 vehicle detection method and device
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification
CN111931902A (en) * 2020-07-03 2020-11-13 江苏大学 Countermeasure network generation model and vehicle track prediction method using the same
WO2022001489A1 (en) * 2020-06-28 2022-01-06 北京交通大学 Unsupervised domain adaptation target re-identification method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110073371A (en) * 2017-05-05 2019-07-30 辉达公司 For to reduce the loss scaling that precision carries out deep neural network training
CN108647700A (en) * 2018-04-14 2018-10-12 华中科技大学 Multitask vehicle part identification model based on deep learning, method and system
CN109871449A (en) * 2019-03-18 2019-06-11 北京邮电大学 A kind of zero sample learning method end to end based on semantic description
KR102095685B1 (en) * 2019-12-02 2020-04-01 주식회사 넥스파시스템 vehicle detection method and device
WO2022001489A1 (en) * 2020-06-28 2022-01-06 北京交通大学 Unsupervised domain adaptation target re-identification method
CN111931902A (en) * 2020-07-03 2020-11-13 江苏大学 Countermeasure network generation model and vehicle track prediction method using the same
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Chongyi (王崇屹): "Research and Implementation of Vehicle Re-identification *** Based on Multi-task Learning" (基于多任务学习的车辆重识别***研究与实现), China Masters' Theses Full-text Database, Engineering Science and Technology II, 31 January 2020 (2020-01-31), pages 034-1266 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841596A (en) * 2022-12-16 2023-03-24 华院计算技术(上海)股份有限公司 Multi-label image classification method and training method and device of multi-label image classification model
CN115841596B (en) * 2022-12-16 2023-09-15 华院计算技术(上海)股份有限公司 Multi-label image classification method and training method and device for model thereof
CN115880661A (en) * 2023-02-01 2023-03-31 天翼云科技有限公司 Vehicle matching method and device, electronic equipment and storage medium
CN117171382A (en) * 2023-07-28 2023-12-05 宁波善德电子集团有限公司 Vehicle video retrieval method based on comprehensive features and natural language
CN117171382B (en) * 2023-07-28 2024-05-03 宁波善德电子集团有限公司 Vehicle video retrieval method based on comprehensive features and natural language
CN117630344A (en) * 2024-01-25 2024-03-01 西南科技大学 Method for detecting slump range of concrete on line in real time
CN117630344B (en) * 2024-01-25 2024-04-05 西南科技大学 Method for detecting slump range of concrete on line in real time

Similar Documents

Publication Publication Date Title
Zou et al. Object detection in 20 years: A survey
Cao et al. Cross-modal hamming hashing
US11263753B2 (en) Method for training a convolutional neural network for image recognition using image-conditioned masked language modeling
Hausler et al. Multi-process fusion: Visual place recognition using multiple image processing methods
CN114547249A (en) Vehicle retrieval method based on natural language and visual features
Yu et al. Unsupervised random forest indexing for fast action search
An et al. Fast and incremental loop closure detection with deep features and proximity graphs
Wang et al. Progressive local filter pruning for image retrieval acceleration
Wang et al. Video event detection using motion relativity and feature selection
CN110196918B (en) Unsupervised deep hashing method based on target detection
Plummer et al. Revisiting image-language networks for open-ended phrase detection
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN113642482B (en) Video character relation analysis method based on video space-time context
CN112580362A (en) Visual behavior recognition method and system based on text semantic supervision and computer readable medium
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
Zhan et al. A method of hierarchical image retrieval for real-time photogrammetry based on multiple features
Zhang et al. Appearance-based loop closure detection via locality-driven accurate motion field learning
CN114882351B (en) Multi-target detection and tracking method based on improved YOLO-V5s
Ning et al. Deep Spatial/temporal-level feature engineering for Tennis-based action recognition
Ma et al. Loop closure detection via locality preserving matching with global consensus
Tsintotas et al. The revisiting problem in simultaneous localization and mapping
Zhou et al. Retrieval and localization with observation constraints
CN112084353A (en) Bag-of-words model method for rapid landmark-convolution feature matching
Chen et al. DVHN: A Deep Hashing Framework for Large-scale Vehicle Re-identification
Chen et al. Fine aligned discriminative hashing for remote sensing image-audio retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination