CN115099358A - Open world target detection training method based on dictionary creation and field self-adaptation - Google Patents

Open world target detection training method based on dictionary creation and field self-adaptation

Info

Publication number
CN115099358A
Authority
CN
China
Prior art keywords
training
text
target detection
field
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210811954.5A
Other languages
Chinese (zh)
Inventor
杨阳
马泽宇
***
徐行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210811954.5A priority Critical patent/CN115099358A/en
Publication of CN115099358A publication Critical patent/CN115099358A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an open world target detection training method based on dictionary creation and field self-adaptation, relates to the technical field of communication software, and solves the problem that existing open world target detection is limited to a single scene. First, a picture description data set and a multi-modal feature extraction network are introduced, the text-modality and visual-modality features they output are aligned, and a multi-modal Transformer network is introduced to perform image-text matching learning and text mask learning. Then, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred into a target detection model, and pictures from two domain data sets are input: the source-domain pictures participate in target detection training, while the target-domain pictures participate only in global domain adaptive training. During training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes to perform target detection training.

Description

Open world target detection training method based on dictionary creation and field self-adaptation
Technical Field
The invention relates to the technical field of communication, and in particular to an open world target detection training method based on dictionary creation and field self-adaptation.
Background
The main purpose of target detection is to detect and locate multiple specific targets in a picture. The core problems of target detection are locating and classifying the content to be detected, so the shape, size and position of targets appearing in the picture need to be determined under varying conditions of the detected object such as illumination and shading, while ensuring high accuracy and short detection time. Open world target detection refers to methods that can identify new classes in real, complex scenes.
Traditional target detection methods are mainly limited to data sets with fixed classes in fixed scenes: a trained classifier can only identify the labelled classes and cannot efficiently identify both known and unknown classes in non-fixed scenes, while labelling all information for every scene is unrealistic. Traditional new-class detection methods only learn the implicit relations between classes and largely ignore the intrinsic relation between text features and visual features, so recognition accuracy is low. Meanwhile, due to the lack of cross-scene data sets, a model trained on a fixed data set is difficult to generalize to severe weather, which further reduces its ability to detect new classes in severe weather.
Existing open world target detection is limited to a single scene, such as indoors or normal weather. In the real open world, however, scenes are complex and include various kinds of severe weather; methods trained in normal weather are difficult to generalize to severe weather, resulting in low recognition accuracy. Although related literature discloses that target detection models based on domain adaptation can address this problem, such as Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3339–3348, these models do not have the capability of recognizing new classes in the open world, because a purely domain-adaptive method only generalizes global scene features and does not add category features.
Disclosure of Invention
The invention aims to solve the problem that open world target detection in the prior art is limited to a single scene, and to this end provides an open world target detection training method based on dictionary creation and field self-adaptation.
The invention specifically adopts the following technical scheme for realizing the purpose:
the open world target detection method based on dictionary creation and field self-adaptation comprises the following steps:
Step one, acquiring image sample data and picture description data corresponding to the image sample data;
step two, constructing a regional visual feature extraction model, a visual mapping text layer, a BERT word vector extraction model and a multi-modal Transformer network model;
Step three, in the pre-training stage, inputting the image sample data obtained in step one into the regional visual feature extraction model, whose output serves as the input of the visual mapping text layer; inputting the picture description data obtained in step one into the BERT word vector extraction model; after the output of the BERT word vector extraction model and the output of the visual mapping text layer are aligned in features, they are input into the multi-modal Transformer network model to perform image-text matching learning and text mask learning;
Step four, in the training stage, transferring the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage to the visual feature extraction module, and inputting pictures from two domain data sets, wherein the source-domain pictures participate in target detection training and the target-domain pictures participate only in global domain adaptive training; during training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes, and target detection training is carried out.
Step five, during testing, different fixed character features are adopted to replace the classification head of the detection model.
The technical principle is as follows: in the pre-training stage, a picture description data set is adopted, and the picture information and the word descriptions are respectively input into the visual feature extraction model and the text feature extraction model to obtain regional visual features and single-word features. A visual mapping text layer is placed after the visual features to guarantee that the output visual features and the text features have consistent dimensions. Finally, the distance between the two modal features is calculated, and the similarity of the multi-modal features is further improved by reducing a specific loss function value. During pre-training, the model mainly learns the visual mapping text layer and the visual feature extraction model, and the construction of the dictionary is realized by combining the two.
In the training stage, Faster R-CNN is introduced as the target detection model. Because detection is affected by environmental factors and most existing target detection data sets are collected in normal weather, a domain adaptation method is introduced so that domain-invariant features of objects in normal weather and severe weather are learned during training. Meanwhile, a zero-sample target recognition method is introduced in the detection stage: fixed character features substitute for the classification head of the detection model, the visual mapping text layer trained in the pre-training stage is introduced, and finally the target detection model is trained.
In the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred to the target detection model. Pictures from two domain data sets are input: the source-domain data set carries target annotation information, while the target-domain data set carries only domain label information. Only the source-domain pictures participate in target detection training, and the target-domain pictures participate only in the global domain adaptive training, whose main purpose is to extract the domain-invariant features of the source domain and the target domain.
In the testing stage, different fixed character features are used to replace the classification head of the detection model, so as to identify different types of targets in the open world.
Further, in the pre-training process, the relationship between the BERT word vector and the visual features output by the visual mapping text layer is measured by dot product distance, and the formula is defined as follows:
$$\big\langle e^I_i,\, e^L_j\big\rangle = (e^I_i)^{\top} e^L_j,\qquad \langle I, L\rangle_G = \frac{1}{n_I\, n_L}\sum_{i=1}^{n_I}\sum_{j=1}^{n_L}\big\langle e^I_i,\, e^L_j\big\rangle$$

wherein $e^I_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $e^L_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle e^I_i, e^L_j\rangle$ is the distance measure between the feature $e^I_i$ and the feature $e^L_j$.
Further, in the pre-training process, the feature alignment mainly includes two parts, namely text image alignment and image text alignment, and the specific loss function is as follows:
$$\mathcal{L}_{align}(I,L) = -\log\frac{\exp\langle I,L\rangle_G}{\sum_{L'\in B_L}\exp\langle I,L'\rangle_G} - \log\frac{\exp\langle I,L\rangle_G}{\sum_{I'\in B_I}\exp\langle I',L\rangle_G}$$

where $I$ is the image input, $L$ is the text input, $\exp\langle I,L\rangle_G$ is the global image-text matching metric, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences; $\exp\langle I,L'\rangle_G$ is the matching degree of the image with each text sequence $L'$ in the batch, and $\exp\langle I',L\rangle_G$ is the matching degree of the text with each image sequence $I'$ in the batch.
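By way of illustration, the alignment step can be sketched in a few lines of PyTorch; the function names, the batch layout, and the use of a simple mean to aggregate the per-pair dot products into $\langle I,L\rangle_G$ are illustrative assumptions rather than part of the claimed method:

```python
import torch
import torch.nn.functional as F

def global_match(region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """Global matching score <I, L>_G between n_I region features and n_L word features."""
    # region_feats: (n_I, d), word_feats: (n_L, d); mean of all pairwise dot products
    return (region_feats @ word_feats.t()).mean()

def alignment_loss(batch_regions, batch_words):
    """Symmetric contrastive loss: each image against all texts in the batch, and vice versa."""
    b = len(batch_regions)
    scores = torch.stack([
        torch.stack([global_match(batch_regions[i], batch_words[j]) for j in range(b)])
        for i in range(b)
    ])                                   # scores[i, j] = <I_i, L_j>_G
    targets = torch.arange(b)            # matching image-caption pairs lie on the diagonal
    # image-to-text and text-to-image cross-entropy over the batch
    return F.cross_entropy(scores, targets) + F.cross_entropy(scores.t(), targets)
```

Minimizing this loss pulls matching image-caption pairs together and pushes non-matching pairs in the batch apart, which is the feature alignment referred to above.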
Further, in the pre-training process, the formulas that enable the model to run in a self-supervised manner through image-text matching and text mask learning are defined as follows:

$$\mathcal{L}_{MLM} = \mathbb{E}_{(w,I)\sim D}\big[-\log P_\theta\big(w_m \mid w_{\setminus m},\, I\big)\big]$$

$$\mathcal{L}_{ITM} = \mathbb{E}_{(w,I)\sim D}\big[-\log S_\theta(w, I)\big]$$

wherein $w_m$ is the masked text block, $w_{\setminus m}$ denotes the remaining unmasked words, $\mathbb{E}_{(w,I)\sim D}$ is the expectation over the image-text pairs of the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability under the model parameters $\theta$, and $S_\theta$ is the image-text matching score generating function.
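The two self-supervised objectives can be sketched as follows; the interface of multimodal_transformer (returning per-token logits together with an image-text match logit) is an assumed placeholder for the multi-modal Transformer network model, not its actual signature:

```python
import torch.nn.functional as F

MASK_TOKEN_ID = 103  # [MASK] id in the standard BERT vocabulary

def pretrain_losses(multimodal_transformer, region_feats, token_ids, mask_positions, matched):
    # token_ids: (n_L,) word-piece ids of the caption; mask_positions: indices of the masked words
    masked_ids = token_ids.clone()
    masked_ids[mask_positions] = MASK_TOKEN_ID
    token_logits, match_logit = multimodal_transformer(region_feats, masked_ids)
    # text mask learning: recover the masked words from the remaining text and the image
    l_mlm = F.cross_entropy(token_logits[mask_positions], token_ids[mask_positions])
    # image-text matching: binary decision on whether the caption belongs to the image
    l_itm = F.binary_cross_entropy_with_logits(match_logit, matched.float())
    return l_mlm + l_itm
```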
Further, in the global domain adaptive training in the training stage, the distance between the two domains is shortened to extract the domain invariant features under different domains, and the formula is defined as follows:
$$\mathcal{L}_{DA} = -\sum_{i}\Big[(1-D_i)\log\big(1-\hat{D}_i\big) + D_i\log \hat{D}_i\Big]$$

wherein $D_i = 0$ indicates that the feature comes from the source domain, $D_i = 1$ indicates that the feature comes from the target domain, and $\hat{D}_i$ is the domain prediction result for the feature; the whole loss function is a two-class cross-entropy loss function.
Further, in the training phase, the class to which the word vector with the smallest distance belongs is selected as the classification of the feature, and the formula is defined as follows:
$$p\big(c \mid e^I_i\big) = \frac{\exp\big\langle e^I_i,\, e^L_c\big\rangle}{\exp\big\langle e^I_i,\, e_B\big\rangle + \sum_{c'}\exp\big\langle e^I_i,\, e^L_{c'}\big\rangle}$$

wherein $e^I_i$ is the image feature, $e^L_c$ is the BERT word vector of class $c$, and $e_B$ represents the all-zero background embedding; $\langle e^I_i, e^L_c\rangle$ is the distance measure between the image feature $e^I_i$ and the text feature $e^L_c$, $\langle e^I_i, e_B\rangle$ is the distance measure between the image feature $e^I_i$ and the background feature $e_B$, and $\langle e^I_i, e^L_{c'}\rangle$ are the distance measures between the image feature $e^I_i$ and the different text features $e^L_{c'}$.
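In code, this classification step amounts to using the fixed class word vectors, together with the all-zero background embedding, as the weight matrix of the detection head; the sketch below uses illustrative names:

```python
import torch

def classify_regions(region_feats: torch.Tensor, class_word_vectors: torch.Tensor) -> torch.Tensor:
    # region_feats: (n_regions, d) features after the visual mapping text layer
    # class_word_vectors: (n_classes, d) fixed BERT embeddings of the known class names
    background = torch.zeros(1, class_word_vectors.size(1))        # all-zero background embedding e_B
    weights = torch.cat([class_word_vectors, background], dim=0)   # (n_classes + 1, d)
    scores = region_feats @ weights.t()                            # dot-product matching scores
    probs = scores.softmax(dim=1)                                  # p(c | e^I_i) as defined above
    return probs.argmax(dim=1)                                     # index n_classes means "background"
```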
Further, the detection tasks of different classes are completed by replacing the classification heads with BERT word vectors of different classes.
An open world target detection training device based on dictionary creation and field self-adaptation comprises a pre-training module, a training module and a test module;
the pre-training module is used for introducing the picture description data set and the multi-modal feature extraction network, aligning the text-modality and visual-modality features that they output, and introducing the multi-modal Transformer network to perform image-text matching learning and text mask learning, so that the whole pre-training model runs in a self-supervised manner;
the training module is used for transferring, in the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage to the target detection model, and inputting pictures from two domain data sets, wherein the source-domain pictures participate in target detection training and the target-domain pictures participate only in global domain adaptive training; during training, the classifier weights of the detection head are replaced with fixed word vectors of known classes to carry out target detection training;
and the test module is used for replacing the detection model classification head with different fixed character characteristics.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
The invention has the following beneficial effects:
(1) the open world target detection method based on dictionary creation and field self-adaptation can identify new classes in severe weather; it effectively combines domain adaptation with zero-sample recognition, greatly improves the accuracy of zero-sample recognition on a severe-weather data set, and surpasses most domain-adaptation-based methods on the known categories;
(2) the open world target detection method based on dictionary creation and field self-adaptation constructs an open world target detection model based on domain adaptation that can detect both known and unknown classes in severe weather, thereby solving the problem that single zero-sample detection generalizes poorly across weather differences as well as the problem that a domain adaptation method cannot detect new classes when facing knowledge differences.
Drawings
FIG. 1 is a diagram of an embodiment of an open world target detection pre-training model;
FIG. 2 is a diagram of an embodiment open world target detection training model.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Example 1
Referring to fig. 1 to 2, the embodiment provides an open world target detection training method based on dictionary creation and field self-adaptation, which solves the problem that existing open world target detection is limited to a single scene. The method can detect both known and unknown classes in severe weather, thereby solving the problem that single zero-sample detection generalizes poorly across weather differences as well as the problem that a domain adaptation method cannot detect new classes when facing knowledge differences.
(1) Pre-training phase
In order to establish an explicit relation between category texts and visual features, a picture description data set and a multi-modal feature extraction network are introduced, and the text-modality and visual-modality features they output are aligned. A multi-modal Transformer network is also introduced to perform image-text matching learning and text mask learning; its role is to make the whole pre-training model run in a self-supervised manner. Throughout the pre-training stage, the visual feature extraction network and the visual mapping text layer are the components mainly being learned, as shown in fig. 1.
During the whole pre-training process, we measure the relationship between the BERT word vector and the visual features output by the visual mapping text layer through the dot product distance, and the formula is defined as follows:
$$\big\langle e^I_i,\, e^L_j\big\rangle = (e^I_i)^{\top} e^L_j,\qquad \langle I, L\rangle_G = \frac{1}{n_I\, n_L}\sum_{i=1}^{n_I}\sum_{j=1}^{n_L}\big\langle e^I_i,\, e^L_j\big\rangle$$

wherein $e^I_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $e^L_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle e^I_i, e^L_j\rangle$ is the distance measure between the feature $e^I_i$ and the feature $e^L_j$. The specific loss function is as follows:

$$\mathcal{L}_{align}(I,L) = -\log\frac{\exp\langle I,L\rangle_G}{\sum_{L'\in B_L}\exp\langle I,L'\rangle_G} - \log\frac{\exp\langle I,L\rangle_G}{\sum_{I'\in B_I}\exp\langle I',L\rangle_G}$$

where $I$ is the image input, $L$ is the text input, $\exp\langle I,L\rangle_G$ is the global image-text matching metric, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences; $\exp\langle I,L'\rangle_G$ is the matching degree of the image with each text sequence $L'$ in the batch, and $\exp\langle I',L\rangle_G$ is the matching degree of the text with each image sequence $I'$ in the batch.
The main purpose of image-text matching and text mask learning is to make the model run in a self-supervised manner; the formulas are defined as follows:

$$\mathcal{L}_{MLM} = \mathbb{E}_{(w,I)\sim D}\big[-\log P_\theta\big(w_m \mid w_{\setminus m},\, I\big)\big]$$

$$\mathcal{L}_{ITM} = \mathbb{E}_{(w,I)\sim D}\big[-\log S_\theta(w, I)\big]$$

wherein $w_m$ is the masked text block, $w_{\setminus m}$ denotes the remaining unmasked words, $\mathbb{E}_{(w,I)\sim D}$ is the expectation over the image-text pairs of the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability under the model parameters $\theta$, and $S_\theta$ is the image-text matching score generating function.
(2) Training phase
In the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred to the target detection model. Pictures from two domain data sets are input in this stage: the source-domain data set carries target annotation information, while the target-domain data set carries only domain label information. Only the source-domain pictures participate in target detection training, and the target-domain pictures participate only in the global domain adaptive training. During training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes, and target detection training is carried out; the whole network model is shown in fig. 2.
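A simplified sketch of one training iteration under these settings is given below; the detector and domain_head interfaces, the loss weight lam, and the way the global image feature is returned are illustrative assumptions rather than the exact implementation:

```python
import torch

def train_step(detector, domain_head, optimizer, source_batch, target_batch,
               class_word_vectors, lam=0.1):
    src_images, src_boxes, src_labels = source_batch   # labelled source-domain (normal weather) data
    tgt_images = target_batch                          # target-domain (severe weather) images, no box labels

    # source domain: full Faster R-CNN detection losses, classifier weights fixed to BERT word vectors
    det_losses, src_feat = detector(src_images, src_boxes, src_labels,
                                    class_embeddings=class_word_vectors)

    # target domain: forward pass only, to obtain the global image-level features
    _, tgt_feat = detector(tgt_images, class_embeddings=class_word_vectors)

    # global domain-adaptive loss: source features labelled 0, target features labelled 1
    domain_loss = domain_head(src_feat, torch.zeros(len(src_feat))) + \
                  domain_head(tgt_feat, torch.ones(len(tgt_feat)))

    loss = sum(det_losses.values()) + lam * domain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```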
The global domain adaptive training mainly extracts domain-invariant features under different domains by shortening the distance between the two domains, and the formula is defined as follows:

$$\mathcal{L}_{DA} = -\sum_{i}\Big[(1-D_i)\log\big(1-\hat{D}_i\big) + D_i\log \hat{D}_i\Big]$$

wherein $D_i = 0$ indicates that the feature comes from the source domain, $D_i = 1$ indicates that the feature comes from the target domain, and $\hat{D}_i$ is the domain prediction result for the feature; the whole loss function is a two-class cross-entropy loss function.
In the final classification process, the distances between the output feature and the BERT word vectors of different words are compared, and the class to which the word vector with the smallest distance belongs is selected as the classification of the feature; the formula is defined as follows:

$$p\big(c \mid e^I_i\big) = \frac{\exp\big\langle e^I_i,\, e^L_c\big\rangle}{\exp\big\langle e^I_i,\, e_B\big\rangle + \sum_{c'}\exp\big\langle e^I_i,\, e^L_{c'}\big\rangle}$$

wherein $e^I_i$ is the image feature, $e^L_c$ is the BERT word vector of class $c$, and $e_B$ represents the all-zero background embedding; $\langle e^I_i, e^L_c\rangle$ is the distance measure between the image feature $e^I_i$ and the text feature $e^L_c$, $\langle e^I_i, e_B\rangle$ is the distance measure between the image feature $e^I_i$ and the background feature $e_B$, and $\langle e^I_i, e^L_{c'}\rangle$ are the distance measures between the image feature $e^I_i$ and the different text features $e^L_{c'}$.
(3) Testing phase
The detection tasks of different categories can be completed simply by replacing the classification head with the BERT word vectors of the corresponding categories; the rest of the test process is unchanged.
Experimental tests and results
The mAP@50 index is adopted to evaluate the model. mAP@50 measures the proportion of correctly predicted targets whose coordinates overlap the annotated coordinates by more than 50%, and it is currently the mainstream evaluation metric for target detection. In the pre-training stage, we use a picture description data set (the COCO Caption data set) to establish an explicit mapping relationship between visual features and text features. In the training stage, the autonomous-driving Cityscapes and FoggyCityscapes data sets are introduced; the specific categories are Car, Person, Rider, Motor, Train, Truck, Bus, and Bike. Three groups of new classes are defined: Rider and Motor, Train and Truck, Bus and Bike; in each case the remaining categories are the old classes. Several experimental settings were tested on the Cityscapes and FoggyCityscapes data sets, and the analysis of the results shows that the method is superior to current mainstream methods in detecting both new and old classes in severe weather. In Table 1, on the FoggyCityscapes data set with Rider and Motor as the new classes, the mAP@50 values measured by the method for the new and old classes are 21.97 and 0.51 higher, respectively, than those of the latest methods, and the other unknown-class settings also show improvements of varying degrees over the latest methods. In Table 2, under the different new-class settings, the mAP@50 of the old classes is higher than that of the latest methods by 3.05, 1.62 and 0.66 respectively, which demonstrates the effectiveness of our method.
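For reference, the matching criterion behind mAP@50 can be stated directly: a prediction counts as a true positive only if its box overlaps a ground-truth box of the same class with an intersection-over-union greater than 0.5. A small illustration (names are illustrative) follows:

```python
def iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_cls, gt_box, gt_cls, thresh=0.5):
    return pred_cls == gt_cls and iou(pred_box, gt_box) > thresh
```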
Meanwhile, the inventor compares the technical scheme of the present application with the technical schemes disclosed in the current related literature; the specific experimental results are shown in Tables 1 and 2.
The relevant documents are as follows:
[1] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. 2018. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 384–400.
[2] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3339–3348.
[3] Jinhong Deng, Wen Li, Yuhua Chen, and Lixin Duan. 2021. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4091–4101.
[4] Zhenwei He and Lei Zhang. 2019. Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6668–6677.
[5] Congcong Li, Dawei Du, Libo Zhang, Longyin Wen, Tiejian Luo, Yanjun Wu, and Pengfei Zhu. 2020. Spatial attention pyramid network for unsupervised domain adaptation. In European Conference on Computer Vision. Springer, 481–497.
[6] Shafin Rahman, Salman Khan, and Nick Barnes. 2020. Improved visual-semantic alignment for zero-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11932–11939.
[7] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. 2019. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6956–6965.
[8] Zhiqiang Shen, Harsh Maheshwari, Weichen Yao, and Marios Savvides. 2019. SCL: Towards accurate domain adaptive object detection via gradient detach based stacked complementary losses. arXiv preprint arXiv:1911.02559 (2019).
[9] Vibashan VS, Vikram Gupta, Poojan Oza, Vishwanath A. Sindagi, and Vishal M. Patel. 2021. MeGA-CDA: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4516–4526.
[10] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. 2021. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14393–14402.
Table 1. New-class and old-class detection results on the FoggyCityscapes data set under different experimental settings (the table is presented as an image in the original publication and is not reproduced here).
Table 2. Old-class detection results on the FoggyCityscapes data set under different experimental settings (the table is presented as an image in the original publication and is not reproduced here).
Example 2
Referring to fig. 1 to 2, the present embodiment provides an open world target detection device based on dictionary creation and field self-adaptation, which solves the problem that current open world target detection is limited to a single scene. The device can detect both known and unknown classes in severe weather, thereby solving the problem that single zero-sample detection generalizes poorly across weather differences as well as the problem that a domain adaptation method cannot detect new classes when facing knowledge differences. The detection device specifically comprises a pre-training module, a training module and a test module;
pre-training module
In order to establish an explicit relation between category texts and visual features, a picture description data set and a multi-modal feature extraction network are introduced, and the text-modality and visual-modality features they output are aligned. A multi-modal Transformer network is also introduced to perform image-text matching learning and text mask learning; its role is to make the whole pre-training model run in a self-supervised manner. Throughout the pre-training stage, the visual feature extraction network and the visual mapping text layer are the components mainly being learned, as shown in fig. 1.
During the whole pre-training process, we measure the relationship between the BERT word vector and the visual features output by the visual mapping text layer through the dot-product distance $\langle\cdot,\cdot\rangle$, and the formula is defined as follows:

$$\big\langle e^I_i,\, e^L_j\big\rangle = (e^I_i)^{\top} e^L_j,\qquad \langle I, L\rangle_G = \frac{1}{n_I\, n_L}\sum_{i=1}^{n_I}\sum_{j=1}^{n_L}\big\langle e^I_i,\, e^L_j\big\rangle$$

wherein $e^I_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $e^L_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle e^I_i, e^L_j\rangle$ is the distance measure between the feature $e^I_i$ and the feature $e^L_j$. The specific loss function is as follows:

$$\mathcal{L}_{align}(I,L) = -\log\frac{\exp\langle I,L\rangle_G}{\sum_{L'\in B_L}\exp\langle I,L'\rangle_G} - \log\frac{\exp\langle I,L\rangle_G}{\sum_{I'\in B_I}\exp\langle I',L\rangle_G}$$

where $I$ is the image input, $L$ is the text input, $\exp\langle I,L\rangle_G$ is the global image-text matching metric, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences; $\exp\langle I,L'\rangle_G$ is the matching degree of the image with each text sequence $L'$ in the batch, and $\exp\langle I',L\rangle_G$ is the matching degree of the text with each image sequence $I'$ in the batch.
The main purpose of image-text matching and text mask learning is to make the model run in a self-supervised manner; the formulas are defined as follows:

$$\mathcal{L}_{MLM} = \mathbb{E}_{(w,I)\sim D}\big[-\log P_\theta\big(w_m \mid w_{\setminus m},\, I\big)\big]$$

$$\mathcal{L}_{ITM} = \mathbb{E}_{(w,I)\sim D}\big[-\log S_\theta(w, I)\big]$$

wherein $w_m$ is the masked text block, $w_{\setminus m}$ denotes the remaining unmasked words, $\mathbb{E}_{(w,I)\sim D}$ is the expectation over the image-text pairs of the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability under the model parameters $\theta$, and $S_\theta$ is the image-text matching score generating function.
Training module
In the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred to the target detection model. Pictures from two domain data sets are input in this stage: the source-domain data set carries target annotation information, while the target-domain data set carries only domain label information. Only the source-domain pictures participate in target detection training, and the target-domain pictures participate only in the global domain adaptive training. During training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes, and target detection training is carried out; the whole network model is shown in fig. 2.
The global domain self-adaptive formula mainly extracts domain invariant features under different domains by shortening the distance between the two domains, and the formula is defined as follows:
$$\mathcal{L}_{DA} = -\sum_{i}\Big[(1-D_i)\log\big(1-\hat{D}_i\big) + D_i\log \hat{D}_i\Big]$$

wherein $D_i = 0$ indicates that the feature comes from the source domain, $D_i = 1$ indicates that the feature comes from the target domain, and $\hat{D}_i$ is the domain prediction result for the feature; the whole loss function is a two-class cross-entropy loss function.
In the final classification process, the distances between the output feature and the BERT word vectors of different words are compared, and the class to which the word vector with the smallest distance belongs is selected as the classification of the feature; the formula is defined as follows:

$$p\big(c \mid e^I_i\big) = \frac{\exp\big\langle e^I_i,\, e^L_c\big\rangle}{\exp\big\langle e^I_i,\, e_B\big\rangle + \sum_{c'}\exp\big\langle e^I_i,\, e^L_{c'}\big\rangle}$$

wherein $e^I_i$ is the image feature, $e^L_c$ is the BERT word vector of class $c$, and $e_B$ represents the all-zero background embedding; $\langle e^I_i, e^L_c\rangle$ is the distance measure between the image feature $e^I_i$ and the text feature $e^L_c$, $\langle e^I_i, e_B\rangle$ is the distance measure between the image feature $e^I_i$ and the background feature $e_B$, and $\langle e^I_i, e^L_{c'}\rangle$ are the distance measures between the image feature $e^I_i$ and the different text features $e^L_{c'}$.
Test module
The detection tasks of different classes can be completed simply by replacing the classification head with the BERT word vectors of the corresponding classes; the rest of the test process is unchanged.
Example 3
The embodiment also provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the open world object detection method based on dictionary creation and domain adaptation.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the computer device. Of course, the memory may also include both internal and external storage devices of the computer device. In this embodiment, the memory is used to store the operating system and various types of application software installed in the computer device, such as the program code for running the open world target detection training method based on dictionary creation and field self-adaptation. Further, the memory may be used to temporarily store various types of data that have been output or are to be output.
The processor may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or to process data, for example to execute the program code of the open world target detection training method based on dictionary creation and field self-adaptation.
Example 4
The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the steps of the above open world target detection method based on dictionary creation and field self-adaptation.
Wherein the computer-readable storage medium stores an interface display program executable by at least one processor, to cause the at least one processor to perform the steps of the open world target detection method based on dictionary creation and field self-adaptation.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

Claims (10)

1. The open world target detection training method based on dictionary creation and field self-adaptation is characterized by comprising the following steps of:
acquiring image sample data and picture description data corresponding to the image sample data;
step two, constructing a regional visual feature extraction model, a visual mapping text layer, a BERT word vector extraction model and a multi-modal Transformer network model;
inputting the image sample data in the step one into a regional visual feature extraction model in a pre-training stage, wherein the output of the regional visual feature extraction model is used as the input of a visual mapping text layer; inputting the picture description data in the first step into a BERT word vector extraction model; after the output of the BERT word vector extraction model and the output of the visual mapping text layer are aligned in features, they are input into a multi-modal Transformer network model for image-text matching learning and text mask learning;
and in the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred to a visual feature extraction module, pictures from two field data sets are input, wherein the source field picture participates in target detection training, the target field picture only participates in global field adaptive training, the classifier weight of the detection head is replaced by a fixed word vector of a known class in the training process, and target detection training is carried out.
2. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the pre-training process, the relationship between the BERT word vector and the visual features output by the visual mapping text layer is measured by dot product distance, and the formula is defined as follows:
$$\big\langle e^I_i,\, e^L_j\big\rangle = (e^I_i)^{\top} e^L_j,\qquad \langle I, L\rangle_G = \frac{1}{n_I\, n_L}\sum_{i=1}^{n_I}\sum_{j=1}^{n_L}\big\langle e^I_i,\, e^L_j\big\rangle$$

wherein $e^I_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $e^L_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle e^I_i, e^L_j\rangle$ is the distance measure between the feature $e^I_i$ and the feature $e^L_j$.
3. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the pre-training process, the feature alignment mainly comprises two parts of text image alignment and image text alignment, and the specific loss function is as follows:
$$\mathcal{L}_{align}(I,L) = -\log\frac{\exp\langle I,L\rangle_G}{\sum_{L'\in B_L}\exp\langle I,L'\rangle_G} - \log\frac{\exp\langle I,L\rangle_G}{\sum_{I'\in B_I}\exp\langle I',L\rangle_G}$$

where $I$ is the image input, $L$ is the text input, $\exp\langle I,L\rangle_G$ is the global image-text matching metric, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences; $\exp\langle I,L'\rangle_G$ is the matching degree of the image with each text sequence $L'$ in the batch, and $\exp\langle I',L\rangle_G$ is the matching degree of the text with each image sequence $I'$ in the batch.
4. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the pre-training process, the formulas for the model to run in a self-supervised manner by image-text matching and text mask learning are defined as follows:

$$\mathcal{L}_{MLM} = \mathbb{E}_{(w,I)\sim D}\big[-\log P_\theta\big(w_m \mid w_{\setminus m},\, I\big)\big]$$

$$\mathcal{L}_{ITM} = \mathbb{E}_{(w,I)\sim D}\big[-\log S_\theta(w, I)\big]$$

wherein $w_m$ is the masked text block, $w_{\setminus m}$ denotes the remaining unmasked words, $\mathbb{E}_{(w,I)\sim D}$ is the expectation over the image-text pairs of the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability under the model parameters $\theta$, and $S_\theta$ is the image-text matching score generating function.
5. The open world target detection training method based on dictionary creation and domain adaptation as claimed in claim 1, wherein in the global domain adaptation training of the training stage, the distance between two domains is reduced to extract domain invariant features under different domains, and the formula is defined as follows:
$$\mathcal{L}_{DA} = -\sum_{i}\Big[(1-D_i)\log\big(1-\hat{D}_i\big) + D_i\log \hat{D}_i\Big]$$

wherein $D_i = 0$ indicates that the feature comes from the source domain, $D_i = 1$ indicates that the feature comes from the target domain, and $\hat{D}_i$ is the domain prediction result for the feature; the whole loss function is a two-class cross-entropy loss function.
6. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the training phase, the class to which the word vector with the smallest distance belongs is selected as the classification of the feature, and the formula is defined as follows:
$$p\big(c \mid e^I_i\big) = \frac{\exp\big\langle e^I_i,\, e^L_c\big\rangle}{\exp\big\langle e^I_i,\, e_B\big\rangle + \sum_{c'}\exp\big\langle e^I_i,\, e^L_{c'}\big\rangle}$$

wherein $e^I_i$ is the image feature, $e^L_c$ is the BERT word vector of class $c$, and $e_B$ represents the all-zero background embedding; $\langle e^I_i, e^L_c\rangle$ is the distance measure between the image feature $e^I_i$ and the text feature $e^L_c$, $\langle e^I_i, e_B\rangle$ is the distance measure between the image feature $e^I_i$ and the background feature $e_B$, and $\langle e^I_i, e^L_{c'}\rangle$ are the distance measures between the image feature $e^I_i$ and the different text features $e^L_{c'}$.
7. The open-world target detection training method based on dictionary creation and domain adaptation as claimed in claim 1, wherein different classes of detection tasks are performed by replacing classification heads with different classes of BERT word vectors.
8. An open world target detection training device based on dictionary creation and field self-adaptation is characterized by comprising a pre-training module and a training module;
the pre-training module is used for introducing the picture description data set and the multi-modal feature extraction network, aligning the text mode and the visual mode features output by the picture description data set and the multi-modal feature extraction network, introducing the multi-modal Transformer network to perform text matching learning and text mask learning of the image text, and enabling the whole pre-training model to run in a self-supervision mode;
and the training module is used for transferring the parameters of the regional visual feature extraction model and the visual mapping text layer which are learned in the pre-training stage to the target detection model in the training stage, inputting pictures from two field data sets, wherein the source field pictures participate in the target detection training, the target field features only participate in the global field self-adaptive training, and the classifier weight of the detection head is replaced by a fixed word vector of a known class in the training process to perform the target detection training.
9. A computer arrangement comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
CN202210811954.5A 2022-07-11 2022-07-11 Open world target detection training method based on dictionary creation and field self-adaptation Pending CN115099358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210811954.5A CN115099358A (en) 2022-07-11 2022-07-11 Open world target detection training method based on dictionary creation and field self-adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210811954.5A CN115099358A (en) 2022-07-11 2022-07-11 Open world target detection training method based on dictionary creation and field self-adaptation

Publications (1)

Publication Number Publication Date
CN115099358A true CN115099358A (en) 2022-09-23

Family

ID=83297737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210811954.5A Pending CN115099358A (en) 2022-07-11 2022-07-11 Open world target detection training method based on dictionary creation and field self-adaptation

Country Status (1)

Country Link
CN (1) CN115099358A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893876A (en) * 2024-01-08 2024-04-16 中国科学院自动化研究所 Zero sample training method and device, storage medium and electronic equipment
CN117576982A (en) * 2024-01-16 2024-02-20 青岛培诺教育科技股份有限公司 Spoken language training method and device based on ChatGPT, electronic equipment and medium
CN117576982B (en) * 2024-01-16 2024-04-02 青岛培诺教育科技股份有限公司 Spoken language training method and device based on ChatGPT, electronic equipment and medium
CN117852624A (en) * 2024-03-08 2024-04-09 腾讯科技(深圳)有限公司 Training method, prediction method, device and equipment of time sequence signal prediction model

Similar Documents

Publication Publication Date Title
CN115099358A (en) Open world target detection training method based on dictionary creation and field self-adaptation
Zuo et al. Natural scene text recognition based on encoder-decoder framework
CN110704633A (en) Named entity recognition method and device, computer equipment and storage medium
CN111079785A (en) Image identification method and device and terminal equipment
CN108959474B (en) Entity relation extraction method
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
US20220301334A1 (en) Table generating method and apparatus, electronic device, storage medium and product
US12051256B2 (en) Entry detection and recognition for custom forms
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN114724145A (en) Character image recognition method, device, equipment and medium
CN117197904A (en) Training method of human face living body detection model, human face living body detection method and human face living body detection device
EP3913533A2 (en) Method and apparatus of processing image device and medium
CN114495113A (en) Text classification method and training method and device of text classification model
CN114511857A (en) OCR recognition result processing method, device, equipment and storage medium
CN106709490B (en) Character recognition method and device
WO2022126917A1 (en) Deep learning-based face image evaluation method and apparatus, device, and medium
CN117173154A (en) Online image detection system and method for glass bottle
Duan et al. Attention enhanced ConvNet-RNN for Chinese vehicle license plate recognition
CN113111833B (en) Safety detection method and device of artificial intelligence system and terminal equipment
CN117421244B (en) Multi-source cross-project software defect prediction method, device and storage medium
CN113139187B (en) Method and device for generating and detecting pre-training language model
CN116052220B (en) Pedestrian re-identification method, device, equipment and medium
CN116012656B (en) Sample image generation method and image processing model training method and device
CN114005005B (en) Double-batch standardized zero-instance image classification method
Wang et al. Scene text identification by leveraging mid-level patches and context information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination