CN114547249A - Vehicle retrieval method based on natural language and visual features - Google Patents

Vehicle retrieval method based on natural language and visual features

Info

Publication number
CN114547249A
Authority
CN
China
Prior art keywords
vehicle
natural language
feature
features
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210173817.3A
Other languages
Chinese (zh)
Inventor
高文飞
王瑞雪
王磊
王辉
郭丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Rongling Technology Development Co ltd
Original Assignee
Jinan Rongling Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Rongling Technology Development Co ltd filed Critical Jinan Rongling Technology Development Co ltd
Priority to CN202210173817.3A priority Critical patent/CN114547249A/en
Publication of CN114547249A publication Critical patent/CN114547249A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle retrieval method based on natural language and visual features, which comprises the following steps: S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos with a detection model to build the data set; S2, training a vehicle re-identification model using a multi-task learning framework as the basic model; S3, obtaining a feature extractor; and S4, constructing a multi-modal vehicle track retrieval system based on natural language and visual features, and retrieving vehicle tracks. With the vehicle retrieval method based on natural language and visual features, vehicles matching a semantic description can be found conveniently through natural language; compared with prior-art vehicle retrieval systems based on vision alone, the method is more flexible and lowers the retrieval threshold, while the vehicle re-identification model extracts highly discriminative visual features of vehicles and enriches the fine-grained information of the features.

Description

Vehicle retrieval method based on natural language and visual features
Technical Field
The invention relates to the technical field of intelligent traffic, in particular to a vehicle retrieval method based on natural language and visual features.
Background
Target tracking is one of the popular fields of computer vision research; it refers to using artificial intelligence techniques to automatically follow a designated target across consecutive video frames. As a basic technology, target tracking is widely applied in many fields, such as autonomous driving, smart cities and intelligent monitoring.
The vehicle retrieval method based on natural language and visual features plays an important role in target tracking for smart city traffic. In the vehicle retrieval task based on natural language and visual features, given a natural language description, the corresponding vehicle track segments must be retrieved from a library of video segments; for example, for "a red SUV turns right at an intersection", the corresponding vehicle track segments need to be retrieved and recalled. In the prior art, however, cross-modal vehicle retrieval based on natural language and visual features either uses overly simple visual features — for example, features from ImageNet pre-training, which differ greatly from in-domain vehicles, so that highly discriminative features cannot be extracted — or retrieves based on the visual modality alone, which lacks flexibility and raises the retrieval threshold; the features used for cross-modal vehicle retrieval are too simple to describe vehicles at a fine-grained level. Therefore, we improve on this and propose a vehicle retrieval method based on natural language and visual features.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
The invention relates to a vehicle retrieval method based on natural language and visual features, which comprises the following steps:
S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos using a detection model to construct the data set;
S2, training a vehicle re-identification model using a multi-task learning framework as the basic model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and normalization, and then constructing batch training data; specifically, P classes are sampled without replacement from the library, with K pictures per class, and used as a batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to convert the feature map into a one-dimensional vector, defined as feature F_1; this vector is then used to calculate the metric learning loss;
S2-3, passing feature F_1 through a batch normalization layer to obtain feature F_2, which is then used to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation; after multiple iterations the network acquires the ability to distinguish different vehicles, and the trained network parameters are then saved;
S3, obtaining a feature extractor: the head of the Re-ID model, namely the classification layer (everything after the BN layer), is removed, and the post-BN feature F_2 is used as the feature representation of the vehicle, yielding a vehicle feature extractor;
S4, constructing a multi-modal vehicle track retrieval system based on natural language and visual features and retrieving vehicle tracks, specifically comprising the following steps:
S4-1, visual feature extraction: frames are extracted from each video and the main body of the vehicle is cropped from each frame; the vehicle feature extractor from S3 then extracts features from each frame picture to form the feature vector V, and finally a GRU model mines temporal information for fusion, obtaining the visual feature f_v;
S4-2, natural language feature extraction: N pieces of natural language are input; for each piece, word vector features S are extracted using a GloVe model pre-trained on large-scale corpus data, and a GRU model fuses the word vectors to obtain the natural language feature f_s;
S4-3, contrastive learning: the obtained visual feature f_v and natural language feature f_s are used to calculate a contrastive loss in a high-dimensional space; the matching degree between the natural language and each vehicle track video, namely the cosine similarity, is calculated, the vehicle tracks are ranked by matching degree, and the several tracks with the highest similarity are returned, thereby retrieving vehicle tracks through natural language.
As a preferred technical solution of the present invention, in S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
As a preferred technical solution of the present invention, in S2-2, the metric learning loss is computed as a triplet loss, defined as follows:
$$L_t = \max\left(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \alpha,\ 0\right)$$
in the formula: l istRepresenting triple losses, f (—) representing a mapping function of the network, i.e. a function that transforms the picture into a one-dimensional vector, xa,xp,xnThe partial table represents anchor images, normal images and reverse images of triples, the triples are obtained in a difficult sampling mode, specifically, for a group of batch data, each picture is circularly used as an anchor image, then the same kind of picture with the farthest distance is found as a normal image, and the different kind of picture with the closest distance is found as a reverse image, so that a triplet is constructed.
As a preferred embodiment of the present invention, in S2-3, a classification loss is calculated, whose labels are the previously assigned ID labels; the loss function used is the cross-entropy loss:
$$L_s = -\sum_{i=1}^{N} y_i \log(p_i)$$
in the formula: l issRepresenting class learning penalty, i.e. cross-entropy penalty, yiIs an indicator variable, y if the ith class matches the target classiNot 1 but 0, piIs the predicted likelihood that the picture belongs to the i-th class.
As a preferred embodiment of the present invention, in S4-1, the feature vector is

$$V = \{c_1, c_2, \ldots, c_{T_v}\}, \qquad c_t \in \mathbb{R}^{2048}$$

where T_v is the number of frames in a video segment, c_t is the feature representation of the t-th frame, and 2048 is the feature dimension. The T_v features are then fused by mining temporal information with a GRU,

$$h_t = \mathrm{GRU}(c_t, h_{t-1}), \quad t = 1, \ldots, T_v,$$

yielding the fused feature h_{T_v}. Finally, the feature is mapped to a high-dimensional space by a fully connected layer and batch-normalized to obtain the final visual feature

$$f_v = \mathrm{BN}(W_\alpha h_{T_v} + b_\alpha)$$

where W_α and b_α denote the weight and bias of the fully connected layer.
As a preferred technical solution of the invention, in S4-2, the word vector features are

$$S = \{w_1, w_2, \ldots, w_{T_s}\}$$

where T_s is the number of words in the natural language and w_t is the t-th word vector. Another GRU module then fuses the word vector features,

$$g_t = \mathrm{GRU}(w_t, g_{t-1}), \quad t = 1, \ldots, T_s,$$

and finally the fused feature g_{T_s} is passed through a fully connected layer and batch normalization to obtain the final natural language feature

$$f_s = \mathrm{BN}(W_\gamma g_{T_s} + b_\gamma)$$

where W_γ and b_γ denote the weight and bias of the fully connected layer.
As a preferred embodiment of the present invention, in S4-3, the contrastive loss L is defined as

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(m - d,\ 0)^2 \right]$$

where N denotes the number of sample pairs, d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$, y indicates whether the two features match (y = 1 when the natural language feature and the visual feature match, and y = 0 when they do not), and m is a preset threshold.
The invention has the following beneficial effects:
With the vehicle retrieval method based on natural language and visual features, vehicles matching a semantic description can be found conveniently through natural language; compared with conventional vision-only vehicle retrieval systems, the method is more flexible and lowers the retrieval threshold, while the vehicle re-identification model extracts highly discriminative visual features of vehicles and enriches the fine-grained information of the features.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
FIG. 1 is a schematic view of a vehicle re-identification model of the present invention;
FIG. 2 is a schematic diagram of a vehicle trajectory retrieval system of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
The invention relates to a vehicle retrieval method based on natural language and visual features, which comprises the following steps:
S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos using a detection model to construct the data set;
S2, as shown in FIG. 1, training a vehicle re-identification model using a multi-task learning framework as the basic model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and normalization, and then constructing batch training data; specifically, P classes are sampled without replacement from the library, with K pictures per class, and used as a batch of training data (a code sketch of this sampling follows S2-4 below);
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to convert the feature map into a one-dimensional vector, defined as feature F_1; this vector is then used to calculate the metric learning loss;
S2-3, passing feature F_1 through a batch normalization layer to obtain feature F_2, which is then used to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation; after multiple iterations the network acquires the ability to distinguish different vehicles, and the trained network parameters are then saved;
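The following minimal PyTorch sketch illustrates how steps S2-1 to S2-4 fit together: P×K batch sampling, a residual network with generalized-mean pooling producing F_1, a batch normalization neck producing F_2, and the ID classifier. It is an illustrative assumption rather than the patent's reference implementation; all module and parameter names (GeM, ReIDNet, pk_batch, P=16, K=4) are hypothetical.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GeM(nn.Module):
    """Generalized-mean pooling: p = 1 gives average pooling, p -> inf approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                        # x: (B, C, H, W) feature map
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)  # (B, C)

class ReIDNet(nn.Module):
    """Residual network -> GeM pooling -> F1; BN neck -> F2 -> ID classifier."""
    def __init__(self, num_ids):                 # num_ids = N, the number of ID labels
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.conv = nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc
        self.pool = GeM()
        self.bn = nn.BatchNorm1d(2048)
        self.classifier = nn.Linear(2048, num_ids, bias=False)

    def forward(self, x):
        f1 = self.pool(self.conv(x))             # F1: one-dimensional vector (S2-2)
        f2 = self.bn(f1)                         # F2: post-BN feature (S2-3)
        return f1, f2, self.classifier(f2)       # logits for the classification loss

def pk_batch(ids_to_paths, P=16, K=4):
    """S2-1: draw P classes without replacement, K pictures each
    (assumes every class holds at least K pictures)."""
    classes = random.sample(list(ids_to_paths), P)
    return [(path, c) for c in classes
            for path in random.sample(ids_to_paths[c], K)]
```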
S3, obtaining a feature extractor: the head of the Re-ID model, namely the classification layer (everything after the BN layer), is removed, and the post-BN feature F_2 is used as the feature representation of the vehicle, yielding a vehicle feature extractor; the extractor converts a picture of a vehicle into a feature vector that provides a complete description of the vehicle containing high-level semantic information, because the feature extractor has been fully trained using the ID labels;
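Continuing the sketch above, step S3 amounts to ignoring the classification head and reading out the post-BN feature F_2; the L2 normalization shown is an added assumption that is convenient for the later cosine-similarity ranking.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_feature(model, images):              # model: a trained ReIDNet (see above)
    """S3: the classification layer is ignored; the post-BN feature F2
    of shape (B, 2048) is returned as the vehicle representation."""
    model.eval()
    _, f2, _ = model(images)
    return F.normalize(f2, dim=1)                # assumed L2 normalization
```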
S4, as shown in FIG. 2, constructing a multi-modal vehicle track retrieval system based on natural language and visual features to retrieve vehicle tracks, specifically comprising the following steps:
S4-1, visual feature extraction: frames are extracted from each video and the main body of the vehicle is cropped from each frame; the vehicle feature extractor from S3 then extracts features from each frame picture to form the feature vector V, and finally a GRU model mines temporal information for fusion, obtaining the visual feature f_v;
S4-2, natural language feature extraction: N pieces of natural language are input; for each piece, word vector features S are extracted using a GloVe model pre-trained on large-scale corpus data, and a GRU model fuses the word vectors to obtain the natural language feature f_s;
S4-3, contrastive learning: the obtained visual feature f_v and natural language feature f_s are used to calculate a contrastive loss in a high-dimensional space, pulling matched natural language and visual features closer together while pushing unmatched ones farther apart; the matching degree between the natural language and each vehicle track video, namely the cosine similarity, is calculated, the vehicle tracks are ranked by matching degree, and the several tracks with the highest similarity are returned, thereby retrieving vehicle tracks through natural language.
In S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
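A minimal sketch of this grouping rule (function and variable names are hypothetical): detections sharing a license plate number form one class and receive consecutive ID labels.

```python
from collections import defaultdict

def build_reid_dataset(detections):
    """detections: iterable of (picture_path, plate_number) pairs produced by
    the detection model. Same plate number -> same class; IDs run 0 .. N-1."""
    by_plate = defaultdict(list)
    for path, plate in detections:
        by_plate[plate].append(path)
    plate_to_id = {plate: i for i, plate in enumerate(sorted(by_plate))}
    samples = [(path, plate_to_id[plate])
               for plate, paths in by_plate.items() for path in paths]
    return samples, len(plate_to_id)             # len(plate_to_id) is N
```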
Wherein, in S2-2, the metric learning loss is computed as a triplet loss, as follows:
$$L_t = \max\left(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \alpha,\ 0\right)$$
in the formula: l istRepresenting triple losses, f (—) representing a mapping function of the network, i.e. a function that transforms the picture into a one-dimensional vector, xa,xp,xnThe partial table represents anchor images, normal images and reverse images of triples, the triples are obtained in a difficult sampling mode, specifically, for a group of batch data, each picture is circularly used as an anchor image, then the same kind of picture with the farthest distance is found as a normal image, and the different kind of picture with the closest distance is found as a reverse image, so that a triplet is constructed.
Wherein, in S2-3, a classification loss is calculated, whose labels are the previously assigned ID labels; the loss function used is the cross-entropy loss:
$$L_s = -\sum_{i=1}^{N} y_i \log(p_i)$$
in the formula: l issRepresenting class learning penalty, i.e. cross-entropy penalty, yiIs an indicator variable if the ith category and purposeStandard class matching rule yiNot 1 but 0, piThe predicted possibility that the picture belongs to the ith class can be used, so that the problems of large intra-class difference and small inter-class difference in vehicle re-identification can be solved through a multi-task learning mechanism of metric learning and classification learning.
Wherein, in S4-1, the feature vector is

$$V = \{c_1, c_2, \ldots, c_{T_v}\}, \qquad c_t \in \mathbb{R}^{2048}$$

where T_v is the number of frames in a video segment, c_t is the feature representation of the t-th frame, and 2048 is the feature dimension. The T_v features are then fused by mining temporal information with a GRU,

$$h_t = \mathrm{GRU}(c_t, h_{t-1}), \quad t = 1, \ldots, T_v,$$

yielding the fused feature h_{T_v}. Finally, the feature is mapped to a high-dimensional space by a fully connected layer and batch-normalized to obtain the final visual feature

$$f_v = \mathrm{BN}(W_\alpha h_{T_v} + b_\alpha)$$

where W_α and b_α denote the weight and bias of the fully connected layer.
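A sketch of this visual branch under the stated dimensions; the hidden and embedding sizes (1024 and 256) are assumptions not given in the text.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """S4-1: fuse per-frame 2048-d Re-ID features with a GRU, then FC + BN -> f_v."""
    def __init__(self, feat_dim=2048, hidden=1024, embed=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed)        # W_alpha, b_alpha
        self.bn = nn.BatchNorm1d(embed)

    def forward(self, frames):                    # frames: (B, T_v, 2048), i.e. V
        _, h = self.gru(frames)                   # h: (1, B, hidden), final state h_{T_v}
        return self.bn(self.fc(h.squeeze(0)))     # visual feature f_v: (B, embed)
```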
Wherein, in S4-2, the word vector features are

$$S = \{w_1, w_2, \ldots, w_{T_s}\}$$

where T_s is the number of words in the natural language and w_t is the t-th word vector. Another GRU module then fuses the word vector features,

$$g_t = \mathrm{GRU}(w_t, g_{t-1}), \quad t = 1, \ldots, T_s,$$

and finally the fused feature g_{T_s} is passed through a fully connected layer and batch normalization to obtain the final natural language feature

$$f_s = \mathrm{BN}(W_\gamma g_{T_s} + b_\gamma)$$

where W_γ and b_γ denote the weight and bias of the fully connected layer.
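The natural language branch mirrors the visual one; a sketch assuming 300-dimensional GloVe word vectors (looking the vectors up from a pretrained GloVe table is outside this snippet):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """S4-2: fuse GloVe word vectors with a second GRU, then FC + BN -> f_s."""
    def __init__(self, word_dim=300, hidden=1024, embed=256):
        super().__init__()
        self.gru = nn.GRU(word_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, embed)        # W_gamma, b_gamma
        self.bn = nn.BatchNorm1d(embed)

    def forward(self, word_vecs):                 # word_vecs: (B, T_s, 300), i.e. S
        _, g = self.gru(word_vecs)                # final state g_{T_s}
        return self.bn(self.fc(g.squeeze(0)))     # natural language feature f_s
```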
Wherein, in S4-3, the contrastive loss L is defined as

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(m - d,\ 0)^2 \right]$$

where N denotes the number of sample pairs, d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$, y indicates whether the two features match (y = 1 when the natural language feature and the visual feature match, and y = 0 when they do not), and m is a preset threshold.
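A direct transcription of this loss as a sketch; m = 1.0 is an assumed default for the preset threshold.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_s, f_v, y, m=1.0):
    """f_s, f_v: (N, D) paired features; y: (N,) with 1 for matched pairs, 0 otherwise."""
    d = torch.norm(f_s - f_v, dim=1)              # Euclidean distance ||f_s - f_v||_2
    per_pair = y * d.pow(2) + (1 - y) * F.relu(m - d).pow(2)
    return per_pair.mean() / 2                    # equals 1/(2N) * sum over pairs
```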
After model training is completed, the natural language part of the model extracts features from all natural language descriptions in the Query library, converting them into feature representations, while the visual part of the model extracts features from the vehicle track videos in the Gallery set. For a given piece of natural language, the matching degree with every vehicle track in the Gallery, namely the cosine similarity, is calculated; the vehicle tracks are ranked by matching degree and the several tracks with the highest similarity are returned, thereby completing vehicle track retrieval through natural language.
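A sketch of this retrieval phase (function names hypothetical): Gallery track features are extracted once, and each Query sentence is ranked against them by cosine similarity.

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, gallery_feats, top_k=10):
    """query_feat: (D,) natural language feature f_s; gallery_feats: (M, D) stacked
    vehicle track features f_v. Returns indices of the top_k most similar tracks."""
    q = F.normalize(query_feat, dim=0)
    g = F.normalize(gallery_feats, dim=1)
    sims = g @ q                                  # cosine similarity with every track
    return sims.topk(top_k).indices.tolist()
```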
This vehicle retrieval mode is more flexible and has a lower retrieval threshold, since a natural language description suffices; meanwhile, the vehicle re-identification model extracts highly discriminative visual features of vehicles and enriches the fine-grained information of the features.
Example 2
The performance of the vehicle retrieval method based on natural language and visual features on the CityFlow-NL data set is shown in the following table:

| Method | MRR | Recall@5 | Recall@10 |
| Baseline method | 0.0269 | 0.0264 | 0.0491 |
| Method of the invention (ImageNet features) | 0.1091 | 0.1669 | 0.3178 |
| Method of the invention (Re-ID features) | 0.1613 | 0.2585 | 0.3925 |
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A vehicle retrieval method based on natural language and visual features is characterized by comprising the following steps:
S1, constructing a vehicle re-identification data set: videos are acquired from different cameras, and vehicle pictures are then detected from the videos using a detection model to construct the data set;
S2, training a vehicle re-identification model using a multi-task learning framework as the basic model, specifically comprising the following steps:
S2-1, performing data preprocessing on the vehicle pictures, including random erasing, random cropping and normalization, and then constructing batch training data; specifically, P classes are sampled without replacement from the library, with K pictures per class, and used as a batch of training data;
S2-2, feeding the batch data into a residual network, obtaining a feature map through convolution operations, then applying generalized mean pooling to convert the feature map into a one-dimensional vector, defined as feature F_1; this vector is then used to calculate the metric learning loss;
S2-3, passing feature F_1 through a batch normalization layer to obtain feature F_2, which is then used to calculate the classification loss;
S2-4, optimizing the network parameters through back propagation; after multiple iterations the network acquires the ability to distinguish different vehicles, and the trained network parameters are then saved;
S3, obtaining a feature extractor: the head of the Re-ID model, namely the classification layer (everything after the BN layer), is removed, and the post-BN feature F_2 is used as the feature representation of the vehicle, yielding a vehicle feature extractor;
S4, constructing a multi-modal vehicle track retrieval system based on natural language and visual features and retrieving vehicle tracks, specifically comprising the following steps:
S4-1, visual feature extraction: frames are extracted from each video and the main body of the vehicle is cropped from each frame; the vehicle feature extractor from S3 then extracts features from each frame picture to form the feature vector V, and finally a GRU model mines temporal information for fusion, obtaining the visual feature f_v;
S4-2, natural language feature extraction: N pieces of natural language are input; for each piece, word vector features S are extracted using a GloVe model pre-trained on large-scale corpus data, and a GRU model fuses the word vectors to obtain the natural language feature f_s;
S4-3, contrastive learning: the obtained visual feature f_v and natural language feature f_s are used to calculate a contrastive loss in a high-dimensional space; the matching degree between the natural language and each vehicle track video, namely the cosine similarity, is calculated, the vehicle tracks are ranked by matching degree, and the several tracks with the highest similarity are returned, thereby retrieving vehicle tracks through natural language.
2. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S1, the data set is constructed as follows: pictures with the same license plate number are regarded as one class, ID labels are assigned to the classes in sequence, and the number of IDs is defined as N.
3. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S2-2, the metric learning loss is computed as a triplet loss, as follows:
$$L_t = \max\left(\|f(x_a) - f(x_p)\|_2 - \|f(x_a) - f(x_n)\|_2 + \alpha,\ 0\right)$$
in the formula: l istRepresenting triple losses, f (—) representing a mapping function of the network, i.e. a function that transforms the picture into a one-dimensional vector, xa,xp,xnThe partial tables represent anchor images, positive examples images and negative examples images of triples, the triples are obtained in a difficult sampling mode, specifically, for a group of batch data, each picture is circularly used as an anchor image, and then the same image with the farthest distance is foundAnd constructing a triple by using the class picture as a normal example image and using the different class pictures with the nearest distance as a reverse example image.
4. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S2-3, a classification loss is calculated, whose labels are the previously assigned ID labels; the loss function used is the cross-entropy loss:
$$L_s = -\sum_{i=1}^{N} y_i \log(p_i)$$
in the formula: l issRepresenting class learning penalty, i.e. cross-entropy penalty, yiIs an indicator variable, y if the ith class matches the target classiNot 1 but 0, piIs the predicted likelihood that the picture belongs to the i-th class.
5. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-1, the feature vector is

$$V = \{c_1, c_2, \ldots, c_{T_v}\}, \qquad c_t \in \mathbb{R}^{2048}$$

where T_v is the number of frames in a video segment, c_t is the feature representation of the t-th frame, and 2048 is the feature dimension; the T_v features are then fused by mining temporal information with a GRU,

$$h_t = \mathrm{GRU}(c_t, h_{t-1}), \quad t = 1, \ldots, T_v,$$

yielding the fused feature h_{T_v}; finally, the feature is mapped to a high-dimensional space by a fully connected layer and batch-normalized to obtain the final visual feature

$$f_v = \mathrm{BN}(W_\alpha h_{T_v} + b_\alpha)$$

where W_α and b_α denote the weight and bias of the fully connected layer.
6. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-2, the word vector features are

$$S = \{w_1, w_2, \ldots, w_{T_s}\}$$

where T_s is the number of words in the natural language and w_t is the t-th word vector; another GRU module then fuses the word vector features,

$$g_t = \mathrm{GRU}(w_t, g_{t-1}), \quad t = 1, \ldots, T_s,$$

and finally the fused feature g_{T_s} is passed through a fully connected layer and batch normalization to obtain the final natural language feature

$$f_s = \mathrm{BN}(W_\gamma g_{T_s} + b_\gamma)$$

where W_γ and b_γ denote the weight and bias of the fully connected layer.
7. The vehicle retrieval method based on natural language and visual features of claim 1, wherein in S4-3, the contrastive loss L is defined as

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(m - d,\ 0)^2 \right]$$

where N denotes the number of sample pairs, d denotes the Euclidean distance between the two features, i.e. $d = \|f_s - f_v\|_2$, y indicates whether the two features match (y = 1 when the natural language feature and the visual feature match, and y = 0 when they do not), and m is a preset threshold.
CN202210173817.3A 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features Pending CN114547249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210173817.3A CN114547249A (en) 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210173817.3A CN114547249A (en) 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features

Publications (1)

Publication Number Publication Date
CN114547249A true CN114547249A (en) 2022-05-27

Family

ID=81678470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210173817.3A Pending CN114547249A (en) 2022-02-24 2022-02-24 Vehicle retrieval method based on natural language and visual features

Country Status (1)

Country Link
CN (1) CN114547249A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841596A (en) * 2022-12-16 2023-03-24 华院计算技术(上海)股份有限公司 Multi-label image classification method and training method and device of multi-label image classification model
CN115880661A (en) * 2023-02-01 2023-03-31 天翼云科技有限公司 Vehicle matching method and device, electronic equipment and storage medium
CN117171382A (en) * 2023-07-28 2023-12-05 宁波善德电子集团有限公司 Vehicle video retrieval method based on comprehensive features and natural language
CN117630344A (en) * 2024-01-25 2024-03-01 西南科技大学 Method for detecting slump range of concrete on line in real time

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647700A (en) * 2018-04-14 2018-10-12 华中科技大学 Multitask vehicle part identification model based on deep learning, method and system
CN109871449A (en) * 2019-03-18 2019-06-11 北京邮电大学 A kind of zero sample learning method end to end based on semantic description
CN110073371A (en) * 2017-05-05 2019-07-30 辉达公司 For to reduce the loss scaling that precision carries out deep neural network training
KR102095685B1 (en) * 2019-12-02 2020-04-01 주식회사 넥스파시스템 vehicle detection method and device
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification
CN111931902A (en) * 2020-07-03 2020-11-13 江苏大学 Countermeasure network generation model and vehicle track prediction method using the same
WO2022001489A1 (en) * 2020-06-28 2022-01-06 北京交通大学 Unsupervised domain adaptation target re-identification method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110073371A (en) * 2017-05-05 2019-07-30 辉达公司 For to reduce the loss scaling that precision carries out deep neural network training
CN108647700A (en) * 2018-04-14 2018-10-12 华中科技大学 Multitask vehicle part identification model based on deep learning, method and system
CN109871449A (en) * 2019-03-18 2019-06-11 北京邮电大学 A kind of zero sample learning method end to end based on semantic description
KR102095685B1 (en) * 2019-12-02 2020-04-01 주식회사 넥스파시스템 vehicle detection method and device
WO2022001489A1 (en) * 2020-06-28 2022-01-06 北京交通大学 Unsupervised domain adaptation target re-identification method
CN111931902A (en) * 2020-07-03 2020-11-13 江苏大学 Countermeasure network generation model and vehicle track prediction method using the same
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Chongyi (王崇屹): "Research and Implementation of Vehicle Re-identification *** Based on Multi-task Learning" (基于多任务学习的车辆重识别***研究与实现), China Masters' Theses Full-text Database, Engineering Science and Technology II, 31 January 2020 (2020-01-31), pages 034-1266 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115841596A (en) * 2022-12-16 2023-03-24 华院计算技术(上海)股份有限公司 Multi-label image classification method and training method and device of multi-label image classification model
CN115841596B (en) * 2022-12-16 2023-09-15 华院计算技术(上海)股份有限公司 Multi-label image classification method and training method and device for model thereof
CN115880661A (en) * 2023-02-01 2023-03-31 天翼云科技有限公司 Vehicle matching method and device, electronic equipment and storage medium
CN117171382A (en) * 2023-07-28 2023-12-05 宁波善德电子集团有限公司 Vehicle video retrieval method based on comprehensive features and natural language
CN117171382B (en) * 2023-07-28 2024-05-03 宁波善德电子集团有限公司 Vehicle video retrieval method based on comprehensive features and natural language
CN117630344A (en) * 2024-01-25 2024-03-01 西南科技大学 Method for detecting slump range of concrete on line in real time
CN117630344B (en) * 2024-01-25 2024-04-05 西南科技大学 Method for detecting slump range of concrete on line in real time

Similar Documents

Publication Publication Date Title
Zou et al. Object detection in 20 years: A survey
Cao et al. Cross-modal hamming hashing
US11263753B2 (en) Method for training a convolutional neural network for image recognition using image-conditioned masked language modeling
Hausler et al. Multi-process fusion: Visual place recognition using multiple image processing methods
CN114547249A (en) Vehicle retrieval method based on natural language and visual features
Yu et al. Unsupervised random forest indexing for fast action search
An et al. Fast and incremental loop closure detection with deep features and proximity graphs
Wang et al. Progressive local filter pruning for image retrieval acceleration
Wang et al. Video event detection using motion relativity and feature selection
CN110196918B (en) Unsupervised deep hashing method based on target detection
Plummer et al. Revisiting image-language networks for open-ended phrase detection
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN113642482B (en) Video character relation analysis method based on video space-time context
CN112580362A (en) Visual behavior recognition method and system based on text semantic supervision and computer readable medium
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
Zhan et al. A method of hierarchical image retrieval for real-time photogrammetry based on multiple features
Zhang et al. Appearance-based loop closure detection via locality-driven accurate motion field learning
CN114882351B (en) Multi-target detection and tracking method based on improved YOLO-V5s
Ning et al. Deep Spatial/temporal-level feature engineering for Tennis-based action recognition
Ma et al. Loop closure detection via locality preserving matching with global consensus
Tsintotas et al. The revisiting problem in simultaneous localization and mapping
Zhou et al. Retrieval and localization with observation constraints
CN112084353A (en) Bag-of-words model method for rapid landmark-convolution feature matching
Chen et al. DVHN: A Deep Hashing Framework for Large-scale Vehicle Re-identification
Chen et al. Fine aligned discriminative hashing for remote sensing image-audio retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination