CN114547249A - Vehicle retrieval method based on natural language and visual features
- Publication number
- CN114547249A
- Authority
- CN
- China
- Prior art keywords
- vehicle
- natural language
- feature
- features
- visual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7837—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a vehicle retrieval method based on natural language and visual features, comprising the following steps: S1, constructing a vehicle re-identification data set by acquiring videos from different cameras and detecting vehicle pictures in the videos with a detection model; S2, training a vehicle re-identification model using a multi-task learning framework as the base model; S3, obtaining a feature extractor; and S4, constructing a multi-modal vehicle trajectory retrieval system based on natural language and visual features and retrieving vehicle trajectories. With this method, vehicles matching the semantics of a description can be found conveniently through natural language. Compared with prior-art vehicle retrieval systems based on vision only, the method is more flexible and lowers the retrieval threshold. In addition, highly discriminative visual features of vehicles are extracted by the vehicle re-identification model, which enriches the fine-grained information carried by the features.
Description
Technical Field
The invention relates to the technical field of intelligent traffic, in particular to a vehicle retrieval method based on natural language and visual features.
Background
Target tracking is one of the popular fields of computer vision research. It refers to using artificial intelligence technology to automatically track a specified target across consecutive video frames. As a basic technology, target tracking is widely applied in scenarios such as autonomous driving, smart cities and intelligent monitoring.
Vehicle retrieval based on natural language and visual features plays an important role in target tracking for smart city traffic. In this task, a natural language description is given, for example "a red SUV turns right at an intersection", and the corresponding vehicle trajectory segments must be retrieved from a library of video segments. In the prior art, however, cross-modal vehicle retrieval based on natural language and visual features relies on relatively simple visual features, for example features from ImageNet pre-training, which differ considerably from in-domain vehicle data, so highly discriminative features cannot be extracted and vehicles cannot be described at a fine-grained level. Retrieval based on the visual modality alone, on the other hand, lacks flexibility and has a higher retrieval threshold. We therefore propose an improved vehicle retrieval method based on natural language and visual features.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
The invention relates to a vehicle retrieval method based on natural language and visual features, which comprises the following steps:
S1, constructing a vehicle re-identification data set: acquiring videos from different cameras, and then detecting vehicle pictures in the videos with a detection model to construct the data set;
S2, training a vehicle re-identification model using a multi-task learning framework as the base model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and standardization, and then constructing batch training data; specifically, P classes are sampled from the library without replacement, with K pictures per class, and these pictures are used as one batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to the feature map to convert it into a one-dimensional vector, defining this feature as $F_1$, and using this one-dimensional vector to calculate the metric learning loss;
S2-3, passing the feature $F_1$ through a batch normalization (BN) layer to obtain a feature $F_2$, and then using this feature to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation, so that after multiple iterations the network is able to distinguish different vehicles, and then saving the trained network parameters;
S3, obtaining a feature extractor: removing the head of the Re-ID model, namely the classification layer, that is, all parts after the BN layer, and using the feature obtained after the BN layer, namely the feature $F_2$, as the feature representation of the vehicle, thereby obtaining the vehicle feature extractor;
S4, constructing a multi-modal vehicle trajectory retrieval system based on natural language and visual features, and retrieving vehicle trajectories, specifically comprising the following steps:
S4-1, visual feature extraction: extracting frames from each video, cropping the main body of the vehicle from each frame, then extracting features from each frame picture with the vehicle feature extractor of S3 and converting them into a feature vector V, and finally mining temporal information with a GRU model to fuse the features and obtain the visual feature $f_v$;
S4-2, natural language feature extraction: inputting N natural language descriptions, extracting word vector features S for each description with a GloVe model pre-trained on large-scale corpus data, and fusing the word vector features with a GRU model to obtain the natural language feature $f_s$;
S4-3, contrastive learning: using the obtained visual feature $f_v$ and natural language feature $f_s$ to calculate the contrastive loss in a high-dimensional space; calculating the matching degree, namely the cosine similarity, between the natural language and the vehicle trajectory video; then ranking the vehicle trajectories by matching degree and returning the several trajectories with the highest similarity, so that vehicle trajectories are retrieved through natural language.
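The batch construction of step S2-1 (P classes, K pictures per class, drawn without replacement) can be sketched as follows. This is a minimal Python illustration rather than the patent's implementation: the function name `sample_pk_batch`, the `labels` list and the rule of skipping identities with fewer than K pictures are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_pk_batch(labels, P, K, rng=None):
    """Return picture indices for one batch: P identities, K pictures each.

    labels: list of ID labels, one per picture in the library.
    Identities and pictures are drawn without replacement. Identities with
    fewer than K pictures are skipped (an assumption of this sketch).
    """
    rng = rng or random.Random()
    by_id = defaultdict(list)
    for idx, label in enumerate(labels):
        by_id[label].append(idx)
    eligible = [label for label, idxs in by_id.items() if len(idxs) >= K]
    if len(eligible) < P:
        raise ValueError("not enough identities with at least K pictures")
    batch = []
    for label in rng.sample(eligible, P):
        batch.extend(rng.sample(by_id[label], K))
    return batch


# Toy library: 4 vehicle IDs with 3 pictures each.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
batch = sample_pk_batch(labels, P=2, K=2, rng=random.Random(0))
batch_ids = [labels[i] for i in batch]
```

Every identity in such a batch has at least one other picture of the same class, which is what makes hard positive mining for the metric learning loss possible.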
As a preferred technical solution of the present invention, in S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
As a preferred technical solution of the present invention, in S2-2, the metric learning loss is calculated using the triplet loss $L_t$. Here $L_t$ denotes the triplet loss; $f(\cdot)$ denotes the mapping function of the network, i.e. the function that converts a picture into a one-dimensional vector; and $x_a$, $x_p$, $x_n$ denote the anchor image, the positive image and the negative image of a triplet, respectively. The triplets are obtained by hard sample mining: for a batch of data, each picture is taken in turn as the anchor image, the same-class picture farthest from it is taken as the positive image, and the different-class picture closest to it is taken as the negative image, thereby constructing a triplet.
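A common form of the triplet loss, assumed here because the text does not name a margin term, is $L_t = \max(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \text{margin},\ 0)$. The sketch below applies it with the hard sample mining described above. The function names, the value `margin=0.3` and the averaging over anchors are illustrative assumptions, and the features stand in for the vectors $f(x)$ produced by the network.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Average triplet loss with hard mining over one batch.

    For each anchor, the farthest same-class feature is the positive and the
    closest different-class feature is the negative. The margin value is an
    assumed hyperparameter, not one given in the text.
    """
    losses = []
    for a, (fa, la) in enumerate(zip(features, labels)):
        pos = [euclidean(fa, f)
               for i, (f, l) in enumerate(zip(features, labels))
               if l == la and i != a]
        neg = [euclidean(fa, f) for f, l in zip(features, labels) if l != la]
        if not pos or not neg:
            continue  # anchor has no valid positive or negative in this batch
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# Well separated classes give zero loss; interleaved classes do not.
separated = batch_hard_triplet_loss([[0, 0], [1, 0], [5, 0], [6, 0]], [0, 0, 1, 1])
interleaved = batch_hard_triplet_loss([[0, 0], [4, 0], [1, 0], [5, 0]], [0, 0, 1, 1])
```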
As a preferred embodiment of the present invention, in S2-3, the classification loss is calculated, where the labels for the classification loss are the previously assigned ID labels and the loss function used is the cross-entropy loss: $L_s = -\sum_{i=1}^{N} y_i \log(p_i)$. In the formula, $L_s$ denotes the classification learning loss, i.e. the cross-entropy loss; $y_i$ is an indicator variable, with $y_i = 1$ if the i-th class matches the target class and $y_i = 0$ otherwise; and $p_i$ is the predicted probability that the picture belongs to the i-th class.
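As a small worked illustration of the cross-entropy loss, the sketch below turns classifier scores into probabilities $p_i$ with a softmax and sums $-y_i \log(p_i)$ over the classes. The use of a softmax and the function names are assumptions, since the text only says that $p_i$ is a predicted probability.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy loss for one picture whose true ID label is `target`."""
    probs = softmax(logits)
    y = [1 if i == target else 0 for i in range(len(logits))]
    # Only the term with y_i = 1 contributes to the sum.
    return -sum(math.log(p) for yi, p in zip(y, probs) if yi)


uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)   # equals log(3)
confident_loss = cross_entropy([0.0, 5.0, 0.0], target=1)
wrong_loss = cross_entropy([5.0, 0.0, 0.0], target=1)
```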
As a preferred embodiment of the present invention, in S4-1, the feature vector is $V = \{c_t\}_{t=1}^{T_v} \in \mathbb{R}^{T_v \times 2048}$, where $T_v$ is the number of frames in a video segment, $c_t$ is the feature representation of the t-th frame, and 2048 is the dimension of the feature. The $T_v$ features are then fused by a GRU that mines the temporal information, giving the fused feature. Finally, the fused feature is mapped to a high-dimensional space by a fully connected layer and passed through batch normalization to obtain the final visual feature $f_v$, where $W_\alpha$ and $b_\alpha$ denote the weight and bias of the fully connected layer.
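The temporal fusion step can be illustrated with a small GRU written in plain Python. This is a sketch under several assumptions: it uses the standard GRU update equations without bias terms, random untrained weights, toy dimensions instead of 2048, and takes the last hidden state as the fused feature, none of which is specified in the text. The fully connected layer and batch normalization that follow are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state n."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}

    def step(self, x, h):
        Wz, Wr, Wn = (matvec(self.W[g], x) for g in "zrn")
        Uz, Ur, Un = (matvec(self.U[g], h) for g in "zrn")
        z = [sigmoid(a + b) for a, b in zip(Wz, Uz)]
        r = [sigmoid(a + b) for a, b in zip(Wr, Ur)]
        n = [math.tanh(a + ri * b) for a, ri, b in zip(Wn, r, Un)]
        # New state mixes the candidate n with the previous state h.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def fuse_sequence(cell, sequence):
    """Run the GRU over per-frame features and return the last hidden state."""
    h = [0.0] * cell.hid_dim
    for x in sequence:
        h = cell.step(x, h)
    return h


rng = random.Random(0)
cell = GRUCell(in_dim=4, hid_dim=3, rng=rng)
frames = [[rng.random() for _ in range(4)] for _ in range(5)]  # T_v = 5 frames
fused = fuse_sequence(cell, frames)
```

The same structure applies to the natural language branch, with word vectors in place of frame features.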
As a preferred technical solution of the present invention, in S4-2, the word vector features are $S = \{w_t\}_{t=1}^{T_s}$, where $T_s$ is the number of words in the natural language description and $w_t$ is the t-th word vector. The word vector features are then fused by another GRU module. Finally, the fused feature is passed through a fully connected layer and a batch normalization layer to obtain the final natural language feature $f_s$, where $W_\gamma$ and $b_\gamma$ denote the weight and bias of the fully connected layer.
As a preferred embodiment of the present invention, in S4-3, the contrastive loss is defined as $L = \frac{1}{2N}\sum_{n=1}^{N}\left[y d^2 + (1-y)\max(m-d,0)^2\right]$, where N denotes the number of sample pairs; d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$; y indicates whether the two features match, with y = 1 when the natural language feature and the visual feature match and y = 0 when they do not; and m is a preset threshold.
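The contrastive loss can be checked on toy data with the sketch below. It uses the per-pair term $y d^2 + (1-y)\max(m-d,0)^2$ from claim 7 and assumes the usual $1/(2N)$ normalization over the N pairs. The function name, the tuple layout of `pairs` and the value `m=1.0` are illustrative assumptions.

```python
import math


def contrastive_loss(pairs, m=1.0):
    """Contrastive loss over (f_s, f_v, y) pairs.

    f_s: natural language feature, f_v: visual feature,
    y: 1 if the description matches the trajectory, else 0.
    m: preset threshold (margin); 1.0 is an assumed value.
    """
    total = 0.0
    for f_s, f_v, y in pairs:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_s, f_v)))
        total += y * d ** 2 + (1 - y) * max(m - d, 0.0) ** 2
    return total / (2 * len(pairs))


# Matched pair already close, unmatched pair already far apart: no penalty.
good = contrastive_loss([([0, 0], [0, 0], 1), ([0, 0], [3, 4], 0)])
# Matched pair far apart, unmatched pair identical: both terms are penalized.
bad = contrastive_loss([([0, 0], [3, 4], 1), ([0, 0], [0, 0], 0)])
```

Matched pairs are penalized by their squared distance, while unmatched pairs are penalized only when they are closer than the threshold m.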
The invention has the beneficial effects that:
With the vehicle retrieval method based on natural language and visual features of the invention, vehicles matching the semantics of a description can be found conveniently through natural language. Compared with conventional vision-based vehicle retrieval systems, the method is more flexible and lowers the retrieval threshold. In addition, highly discriminative visual features of vehicles are extracted by the vehicle re-identification model, which enriches the fine-grained information of the features.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
FIG. 1 is a schematic view of a vehicle re-identification model of the present invention;
FIG. 2 is a schematic diagram of a vehicle trajectory retrieval system of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
The invention relates to a vehicle retrieval method based on natural language and visual features, which comprises the following steps:
S1, constructing a vehicle re-identification data set: acquiring videos from different cameras, and then detecting vehicle pictures in the videos with a detection model to construct the data set;
S2, as shown in FIG. 1, training a vehicle re-identification model using a multi-task learning framework as the base model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and standardization, and then constructing batch training data; specifically, P classes are sampled from the library without replacement, with K pictures per class, and these pictures are used as one batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to the feature map to convert it into a one-dimensional vector, defining this feature as $F_1$, and using this one-dimensional vector to calculate the metric learning loss;
S2-3, passing the feature $F_1$ through a batch normalization (BN) layer to obtain a feature $F_2$, and then using this feature to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation, so that after multiple iterations the network is able to distinguish different vehicles, and then saving the trained network parameters;
S3, obtaining a feature extractor: removing the head of the Re-ID model, namely the classification layer, that is, all parts after the BN layer, and using the feature obtained after the BN layer, namely the feature $F_2$, as the feature representation of the vehicle, thereby obtaining the vehicle feature extractor, whose function is to convert a vehicle picture into a feature vector; because the feature extractor has been fully trained with ID labels, this feature vector is a complete description of a vehicle and contains high-level semantic information;
S4, as shown in FIG. 2, constructing a multi-modal vehicle trajectory retrieval system based on natural language and visual features to retrieve vehicle trajectories, specifically comprising the following steps:
S4-1, visual feature extraction: extracting frames from each video, cropping the main body of the vehicle from each frame, then extracting features from each frame picture with the vehicle feature extractor of S3 and converting them into a feature vector V, and finally mining temporal information with a GRU model to fuse the features and obtain the visual feature $f_v$;
S4-2, natural language feature extraction: inputting N natural language descriptions, extracting word vector features S for each description with a GloVe model pre-trained on large-scale corpus data, and fusing the word vector features with a GRU model to obtain the natural language feature $f_s$;
S4-3, contrastive learning: using the obtained visual feature $f_v$ and natural language feature $f_s$ to calculate the contrastive loss in a high-dimensional space, pulling matched natural language and visual features closer together while pushing unmatched natural language and visual features farther apart; calculating the matching degree, namely the cosine similarity, between the natural language and the vehicle trajectory video; then ranking the vehicle trajectories by matching degree and returning the several trajectories with the highest similarity, so that vehicle trajectories are retrieved through natural language.
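The generalized average pooling used in step S2-2 is commonly implemented as generalized mean (GeM) pooling, which computes $(\frac{1}{HW}\sum x^p)^{1/p}$ per channel. The sketch below assumes that formulation. The exponent `p=3.0` and the clamp `eps` are conventional defaults rather than values from the text, and the feature map is represented as nested Python lists for illustration.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean pooling over a C x H x W feature map.

    Returns one value per channel. p = 1 gives average pooling and large p
    approaches max pooling. Values are clamped to eps so that the power is
    well defined (activations are assumed to be non-negative).
    """
    pooled = []
    for channel in feature_map:
        values = [max(v, eps) for row in channel for v in row]
        mean_pow = sum(v ** p for v in values) / len(values)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


constant = gem_pool([[[1.0, 1.0], [1.0, 1.0]]])   # constant map stays 1.0
average = gem_pool([[[1.0, 3.0]]], p=1.0)         # ordinary mean: 2.0
near_max = gem_pool([[[1.0, 3.0]]], p=100.0)      # close to the maximum 3.0
```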
In S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
In S2-2, the metric learning loss is calculated using the triplet loss $L_t$. Here $L_t$ denotes the triplet loss; $f(\cdot)$ denotes the mapping function of the network, i.e. the function that converts a picture into a one-dimensional vector; and $x_a$, $x_p$, $x_n$ denote the anchor image, the positive image and the negative image of a triplet, respectively. The triplets are obtained by hard sample mining: for a batch of data, each picture is taken in turn as the anchor image, the same-class picture farthest from it is taken as the positive image, and the different-class picture closest to it is taken as the negative image, thereby constructing a triplet.
In S2-3, the classification loss is calculated, where the labels for the classification loss are the previously assigned ID labels and the loss function used is the cross-entropy loss: $L_s = -\sum_{i=1}^{N} y_i \log(p_i)$. In the formula, $L_s$ denotes the classification learning loss, i.e. the cross-entropy loss; $y_i$ is an indicator variable, with $y_i = 1$ if the i-th class matches the target class and $y_i = 0$ otherwise; and $p_i$ is the predicted probability that the picture belongs to the i-th class. In this way, the multi-task learning mechanism combining metric learning and classification learning addresses the problem of large intra-class differences and small inter-class differences in vehicle re-identification.
In S4-1, the feature vector is $V = \{c_t\}_{t=1}^{T_v} \in \mathbb{R}^{T_v \times 2048}$, where $T_v$ is the number of frames in a video segment, $c_t$ is the feature representation of the t-th frame, and 2048 is the dimension of the feature. The $T_v$ features are then fused by a GRU that mines the temporal information, giving the fused feature. Finally, the fused feature is mapped to a high-dimensional space by a fully connected layer and passed through batch normalization to obtain the final visual feature $f_v$, where $W_\alpha$ and $b_\alpha$ denote the weight and bias of the fully connected layer.
In S4-2, the word vector features are $S = \{w_t\}_{t=1}^{T_s}$, where $T_s$ is the number of words in the natural language description and $w_t$ is the t-th word vector. The word vector features are then fused by another GRU module. Finally, the fused feature is passed through a fully connected layer and a batch normalization layer to obtain the final natural language feature $f_s$, where $W_\gamma$ and $b_\gamma$ denote the weight and bias of the fully connected layer.
In S4-3, the contrastive loss is defined as $L = \frac{1}{2N}\sum_{n=1}^{N}\left[y d^2 + (1-y)\max(m-d,0)^2\right]$, where N denotes the number of sample pairs; d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$; y indicates whether the two features match, with y = 1 when the natural language feature and the visual feature match and y = 0 when they do not; and m is a preset threshold.
After model training is completed, the natural language branch of the model is used to extract features from all natural language descriptions in the Query library, converting them into feature representations, and the visual branch of the model is used to extract features from the vehicle trajectory videos in the Gallery library. For each natural language description, its matching degree, namely the cosine similarity, with every vehicle trajectory in the Gallery library is calculated; the vehicle trajectories are ranked by matching degree, and the several trajectories with the highest similarity are returned, thereby completing retrieval of vehicle trajectories through natural language.
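The ranking step described above can be sketched as follows, assuming the features have already been extracted. The function names, the dictionary layout of `gallery`, the track identifiers and the toy two-dimensional features are hypothetical and serve only to illustrate cosine-similarity ranking.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_feature, gallery, top_k=5):
    """Rank gallery trajectories by similarity to one query description.

    gallery: dict mapping a trajectory ID to its visual feature f_v.
    Returns the top_k trajectory IDs, most similar first.
    """
    scored = [(cosine_similarity(query_feature, feat), track_id)
              for track_id, feat in gallery.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [track_id for _, track_id in scored[:top_k]]


gallery = {"track_a": [1.0, 0.0], "track_b": [0.0, 1.0], "track_c": [1.0, 1.0]}
top2 = retrieve([1.0, 0.1], gallery, top_k=2)
```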
The vehicle retrieval mode is more flexible and the required retrieval threshold is lower, since queries are expressed in natural language; at the same time, highly discriminative visual features of vehicles are extracted by the vehicle re-identification model, enriching the fine-grained information of the features.
Example 2
The performance of the vehicle retrieval method based on natural language and visual features on the CityFlow-NL data set is shown in the following table:
Method | MRR | Recall@5 | Recall@10 |
---|---|---|---|
Baseline method | 0.0269 | 0.0264 | 0.0491 |
Method of the invention (ImageNet features) | 0.1091 | 0.1669 | 0.3178 |
Method of the invention (Re-ID features) | 0.1613 | 0.2585 | 0.3925 |
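For readers unfamiliar with the metrics in the table, the sketch below shows how MRR (mean reciprocal rank) and Recall@K are typically computed. It assumes each query has exactly one correct trajectory, and the toy rankings are made up for illustration and are unrelated to the reported numbers.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the correct trajectory over all queries.

    rankings: one ranked list of trajectory IDs per query.
    targets: the correct trajectory ID for each query.
    A query whose target is missing from its ranking contributes 0.
    """
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose correct trajectory is in the top k."""
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if target in ranked[:k])
    return hits / len(targets)


rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, targets)   # (1 + 1/2 + 1/3) / 3
r1 = recall_at_k(rankings, targets, k=1)
r2 = recall_at_k(rankings, targets, k=2)
```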
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A vehicle retrieval method based on natural language and visual features, characterized by comprising the following steps:
S1, constructing a vehicle re-identification data set: acquiring videos from different cameras, and then detecting vehicle pictures in the videos with a detection model to construct the data set;
S2, training a vehicle re-identification model using a multi-task learning framework as the base model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and standardization, and then constructing batch training data; specifically, P classes are sampled from the library without replacement, with K pictures per class, and these pictures are used as one batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to the feature map to convert it into a one-dimensional vector, defining this feature as $F_1$, and using this one-dimensional vector to calculate the metric learning loss;
S2-3, passing the feature $F_1$ through a batch normalization (BN) layer to obtain a feature $F_2$, and then using this feature to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation, so that after multiple iterations the network is able to distinguish different vehicles, and then saving the trained network parameters;
S3, obtaining a feature extractor: removing the head of the Re-ID model, namely the classification layer, that is, all parts after the BN layer, and using the feature obtained after the BN layer, namely the feature $F_2$, as the feature representation of the vehicle, thereby obtaining the vehicle feature extractor;
S4, constructing a multi-modal vehicle trajectory retrieval system based on natural language and visual features, and retrieving vehicle trajectories, specifically comprising the following steps:
S4-1, visual feature extraction: extracting frames from each video, cropping the main body of the vehicle from each frame, then extracting features from each frame picture with the vehicle feature extractor of S3 and converting them into a feature vector V, and finally mining temporal information with a GRU model to fuse the features and obtain the visual feature $f_v$;
S4-2, natural language feature extraction: inputting N natural language descriptions, extracting word vector features S for each description with a GloVe model pre-trained on large-scale corpus data, and fusing the word vector features with a GRU model to obtain the natural language feature $f_s$;
S4-3, contrastive learning: using the obtained visual feature $f_v$ and natural language feature $f_s$ to calculate the contrastive loss in a high-dimensional space; calculating the matching degree, namely the cosine similarity, between the natural language and the vehicle trajectory video; then ranking the vehicle trajectories by matching degree and returning the several trajectories with the highest similarity, so that vehicle trajectories are retrieved through natural language.
2. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
3. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S2-2, the metric learning loss is calculated using the triplet loss $L_t$, where $L_t$ denotes the triplet loss; $f(\cdot)$ denotes the mapping function of the network, i.e. the function that converts a picture into a one-dimensional vector; and $x_a$, $x_p$, $x_n$ denote the anchor image, the positive image and the negative image of a triplet, respectively; the triplets are obtained by hard sample mining: for a batch of data, each picture is taken in turn as the anchor image, the same-class picture farthest from it is taken as the positive image, and the different-class picture closest to it is taken as the negative image, thereby constructing a triplet.
4. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S2-3, the classification loss is calculated, the labels for the classification loss are the previously assigned ID labels, and the loss function used is the cross-entropy loss: $L_s = -\sum_{i=1}^{N} y_i \log(p_i)$, where $L_s$ denotes the classification learning loss, i.e. the cross-entropy loss; $y_i$ is an indicator variable, with $y_i = 1$ if the i-th class matches the target class and $y_i = 0$ otherwise; and $p_i$ is the predicted probability that the picture belongs to the i-th class.
5. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-1, the feature vector is $V = \{c_t\}_{t=1}^{T_v} \in \mathbb{R}^{T_v \times 2048}$, where $T_v$ is the number of frames in a video segment, $c_t$ is the feature representation of the t-th frame, and 2048 is the dimension of the feature; the $T_v$ features are then fused by a GRU that mines the temporal information, giving the fused feature; finally, the fused feature is mapped to a high-dimensional space by a fully connected layer and passed through batch normalization to obtain the final visual feature $f_v$, where $W_\alpha$ and $b_\alpha$ denote the weight and bias of the fully connected layer.
6. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-2, the word vector features are $\{w_t\}_{t=1}^{T_s}$, where $T_s$ is the number of words in the natural language description and $w_t$ is the $t$-th word vector; the word vector features are then fused by another GRU module, and the fused feature is finally passed through a fully connected layer and a batch normalization layer to obtain the final natural language feature $f_s$, where $W_\gamma$ and $b_\gamma$ denote the weight and bias of the fully connected layer.
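Claims 5 and 6 describe the same pipeline for the two modalities: a GRU fuses a sequence, then a fully connected layer and batch normalization produce the final feature. The NumPy sketch below illustrates that flow under several assumptions not stated in the claims: the fused feature is taken to be the last GRU hidden state, the hidden size is 64, the output size is 256, word vectors have 300 dimensions, weights are random, and batch normalization uses batch statistics without a learned scale or shift. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(rng, d_in, d_hidden):
    """Random parameters (W, U, b) for the update, reset and candidate gates."""
    params = []
    for _ in range(3):
        params += [rng.normal(0, 0.05, (d_hidden, d_in)),
                   rng.normal(0, 0.05, (d_hidden, d_hidden)),
                   np.zeros(d_hidden)]
    return params

def gru_fuse(sequence, params):
    """Run a single-layer GRU over a (T, D_in) sequence; return the last hidden state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(bz.shape[0])
    for x in sequence:
        z = sigmoid(Wz @ x + Uz @ h + bz)             # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)             # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
        h = (1.0 - z) * h + z * h_cand
    return h

def project(fused_batch, W, b, eps=1e-5):
    """Fully connected layer followed by batch normalization over the batch axis."""
    out = fused_batch @ W.T + b
    return (out - out.mean(axis=0)) / np.sqrt(out.var(axis=0) + eps)

rng = np.random.default_rng(0)
# Four video segments with T_v frames of 2048-d features, and four queries
# with T_s word vectors (300-d is an assumed word-vector size).
clips = [rng.normal(size=(t, 2048)) for t in (5, 8, 6, 7)]
queries = [rng.normal(size=(t, 300)) for t in (9, 12, 7, 10)]
gru_v, gru_s = init_gru(rng, 2048, 64), init_gru(rng, 300, 64)
W_alpha, b_alpha = rng.normal(0, 0.05, (256, 64)), np.zeros(256)
W_gamma, b_gamma = rng.normal(0, 0.05, (256, 64)), np.zeros(256)

f_v = project(np.stack([gru_fuse(c, gru_v) for c in clips]), W_alpha, b_alpha)
f_s = project(np.stack([gru_fuse(q, gru_s) for q in queries]), W_gamma, b_gamma)
```

Both branches end in vectors of the same dimension, which is what allows a distance between $f_s$ and $f_v$ to be computed afterwards.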
7. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-3, a contrastive loss $L$ is defined over $N$ sample pairs with the per-pair term $y d^2 + (1-y)\max(m-d, 0)^2$, where $N$ represents the number of sample pairs; $d$ represents the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$; $y$ indicates whether the two features match ($y = 1$ when the natural language feature and the visual feature match, $y = 0$ when they do not); and $m$ is a preset threshold.
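The contrastive loss in claim 7 can be sketched in NumPy as follows. The per-pair term comes from the claim; the $1/(2N)$ normalization is the common form of this loss and is an assumption here, because the prefactor is not legible in the text. The function name and the default threshold are likewise illustrative.

```python
import numpy as np

def contrastive_loss(f_s, f_v, y, m=1.0):
    """Contrastive loss between language features f_s and visual features f_v.

    f_s, f_v: (N, D) arrays, one row per sample pair.
    y:        (N,) array, 1 for a matching pair and 0 for a non-matching pair.
    m:        preset threshold (margin); 1.0 is an arbitrary default.
    """
    f_s = np.asarray(f_s, dtype=np.float64)
    f_v = np.asarray(f_v, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # d = ||f_s - f_v||_2 for every pair.
    d = np.linalg.norm(f_s - f_v, axis=1)
    # Matching pairs are pulled together; non-matching pairs are pushed
    # apart until their distance reaches the threshold m.
    per_pair = y * d ** 2 + (1.0 - y) * np.maximum(m - d, 0.0) ** 2
    # Assumed normalization: 1 / (2N).
    return float(per_pair.sum() / (2 * len(y)))
```

A non-matching pair whose distance already exceeds `m` contributes nothing, so training effort concentrates on the pairs that are still confusable.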
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210173817.3A CN114547249A (en) | 2022-02-24 | 2022-02-24 | Vehicle retrieval method based on natural language and visual features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210173817.3A CN114547249A (en) | 2022-02-24 | 2022-02-24 | Vehicle retrieval method based on natural language and visual features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114547249A (en) | 2022-05-27 |
Family
ID=81678470
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210173817.3A Pending CN114547249A (en) | 2022-02-24 | 2022-02-24 | Vehicle retrieval method based on natural language and visual features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114547249A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115841596A (en) * | 2022-12-16 | 2023-03-24 | 华院计算技术(上海)股份有限公司 | Multi-label image classification method and training method and device of multi-label image classification model |
CN115880661A (en) * | 2023-02-01 | 2023-03-31 | 天翼云科技有限公司 | Vehicle matching method and device, electronic equipment and storage medium |
CN117171382A (en) * | 2023-07-28 | 2023-12-05 | 宁波善德电子集团有限公司 | Vehicle video retrieval method based on comprehensive features and natural language |
CN117630344A (en) * | 2024-01-25 | 2024-03-01 | 西南科技大学 | Method for detecting slump range of concrete on line in real time |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647700A (en) * | 2018-04-14 | 2018-10-12 | 华中科技大学 | Multitask vehicle part identification model based on deep learning, method and system |
CN109871449A (en) * | 2019-03-18 | 2019-06-11 | 北京邮电大学 | End-to-end zero-shot learning method based on semantic description |
CN110073371A (en) * | 2017-05-05 | 2019-07-30 | 辉达公司 | Loss scaling for deep neural network training with reduced precision |
KR102095685B1 (en) * | 2019-12-02 | 2020-04-01 | 주식회사 넥스파시스템 | vehicle detection method and device |
CN111914664A (en) * | 2020-07-06 | 2020-11-10 | 同济大学 | Vehicle multi-target detection and track tracking method based on re-identification |
CN111931902A (en) * | 2020-07-03 | 2020-11-13 | 江苏大学 | Generative adversarial network model and vehicle trajectory prediction method using the same |
WO2022001489A1 (en) * | 2020-06-28 | 2022-01-06 | 北京交通大学 | Unsupervised domain adaptation target re-identification method |
2022-02-24: application CN202210173817.3A filed in CN (publication CN114547249A), active, status Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110073371A (en) * | 2017-05-05 | 2019-07-30 | 辉达公司 | Loss scaling for deep neural network training with reduced precision |
CN108647700A (en) * | 2018-04-14 | 2018-10-12 | 华中科技大学 | Multitask vehicle part identification model based on deep learning, method and system |
CN109871449A (en) * | 2019-03-18 | 2019-06-11 | 北京邮电大学 | End-to-end zero-shot learning method based on semantic description |
KR102095685B1 (en) * | 2019-12-02 | 2020-04-01 | 주식회사 넥스파시스템 | vehicle detection method and device |
WO2022001489A1 (en) * | 2020-06-28 | 2022-01-06 | 北京交通大学 | Unsupervised domain adaptation target re-identification method |
CN111931902A (en) * | 2020-07-03 | 2020-11-13 | 江苏大学 | Generative adversarial network model and vehicle trajectory prediction method using the same |
CN111914664A (en) * | 2020-07-06 | 2020-11-10 | 同济大学 | Vehicle multi-target detection and track tracking method based on re-identification |
Non-Patent Citations (1)
Title |
---|
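Wang Chongyi (王崇屹): "Research and Implementation of Vehicle Re-identification *** Based on Multi-task Learning", China Excellent Master's Theses Full-text Database, Engineering Science and Technology II, 31 January 2020 (2020-01-31), pages 034 - 1266 * |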
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115841596A (en) * | 2022-12-16 | 2023-03-24 | 华院计算技术(上海)股份有限公司 | Multi-label image classification method and training method and device of multi-label image classification model |
CN115841596B (en) * | 2022-12-16 | 2023-09-15 | 华院计算技术(上海)股份有限公司 | Multi-label image classification method and training method and device for model thereof |
CN115880661A (en) * | 2023-02-01 | 2023-03-31 | 天翼云科技有限公司 | Vehicle matching method and device, electronic equipment and storage medium |
CN117171382A (en) * | 2023-07-28 | 2023-12-05 | 宁波善德电子集团有限公司 | Vehicle video retrieval method based on comprehensive features and natural language |
CN117171382B (en) * | 2023-07-28 | 2024-05-03 | 宁波善德电子集团有限公司 | Vehicle video retrieval method based on comprehensive features and natural language |
CN117630344A (en) * | 2024-01-25 | 2024-03-01 | 西南科技大学 | Method for detecting slump range of concrete on line in real time |
CN117630344B (en) * | 2024-01-25 | 2024-04-05 | 西南科技大学 | Method for detecting slump range of concrete on line in real time |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zou et al. | Object detection in 20 years: A survey | |
Cao et al. | Cross-modal hamming hashing | |
US11263753B2 (en) | Method for training a convolutional neural network for image recognition using image-conditioned masked language modeling | |
Hausler et al. | Multi-process fusion: Visual place recognition using multiple image processing methods | |
CN114547249A (en) | Vehicle retrieval method based on natural language and visual features | |
Yu et al. | Unsupervised random forest indexing for fast action search | |
An et al. | Fast and incremental loop closure detection with deep features and proximity graphs | |
Wang et al. | Progressive local filter pruning for image retrieval acceleration | |
Wang et al. | Video event detection using motion relativity and feature selection | |
CN110196918B (en) | Unsupervised deep hashing method based on target detection | |
Plummer et al. | Revisiting image-language networks for open-ended phrase detection | |
CN114358188A (en) | Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment | |
CN113642482B (en) | Video character relation analysis method based on video space-time context | |
CN112580362A (en) | Visual behavior recognition method and system based on text semantic supervision and computer readable medium | |
CN112597324A (en) | Image hash index construction method, system and equipment based on correlation filtering | |
Zhan et al. | A method of hierarchical image retrieval for real-time photogrammetry based on multiple features | |
Zhang et al. | Appearance-based loop closure detection via locality-driven accurate motion field learning | |
CN114882351B (en) | Multi-target detection and tracking method based on improved YOLO-V5s | |
Ning et al. | Deep Spatial/temporal-level feature engineering for Tennis-based action recognition | |
Ma et al. | Loop closure detection via locality preserving matching with global consensus | |
Tsintotas et al. | The revisiting problem in simultaneous localization and mapping | |
Zhou et al. | Retrieval and localization with observation constraints | |
CN112084353A (en) | Bag-of-words model method for rapid landmark-convolution feature matching | |
Chen et al. | DVHN: A Deep Hashing Framework for Large-scale Vehicle Re-identification | |
Chen et al. | Fine aligned discriminative hashing for remote sensing image-audio retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||