CN115359492A - Text image matching model training method, picture labeling method, device and equipment - Google Patents

Text image matching model training method, picture labeling method, device and equipment

Info

Publication number
CN115359492A
Authority
CN
China
Prior art keywords
text
features
picture
feature
sample
Prior art date
Legal status
Pending
Application number
CN202211065029.9A
Other languages
Chinese (zh)
Inventor
刘世超
乔秋飞
Current Assignee
Shanghai Yuer Network Technology Co ltd
Original Assignee
Shanghai Yuer Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Yuer Network Technology Co ltd filed Critical Shanghai Yuer Network Technology Co ltd
Priority to CN202211065029.9A
Publication of CN115359492A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/146 - Aligning or centring of the image pick-up or image-field
    • G06V30/147 - Determination of region of interest
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/18 - Extraction of features or characteristics of the image
    • G06V30/1801 - Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/19007 - Matching; Proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a text image matching model training method and a picture labeling method. The method comprises the following steps: acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of the text image matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, the abstract text features being generated based on the labeling information of the sample text and the sample picture; performing contrastive learning on the pairs of global features and abstract text features, local features and original text features, region-of-interest features and original text features, and region-of-interest features and abstract text features to generate the respective loss terms; calculating a Hungarian loss based on the loss terms; and training the text image matching model according to the Hungarian loss. With this method, pictures can be labeled automatically.

Description

Text image matching model training method, picture labeling method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text image matching model training method, a picture labeling method, an apparatus, a computer device, a storage medium, and a computer program product.
Background
Across the basic tasks of computer vision, such as image classification, object detection and semantic segmentation, research and application at the data level have long followed a learning paradigm built on accurately labeled picture data sets, and visual detection tasks based on this paradigm achieve good results. However, this approach is still restricted to a limited-label learning mechanism in the computer vision field and requires a high manual labeling cost.
Moreover, as deep learning matures, researchers pursue ever stronger model learning and generalization ability, and application requirements on labor cost, task learning cycle and deployment efficiency keep rising; the current learning paradigm undoubtedly constrains the development of these tasks.
Disclosure of Invention
Based on this, in order to solve the above technical problems, it is necessary to provide a text image matching model training method, a picture labeling method, an apparatus, a computer device, a storage medium, and a computer program product that can automatically establish a matching relationship between a picture and a text.
A method of text image matching model training, the method comprising:
acquiring a sample picture, a sample text and the labeling information of the sample text and the sample picture;
extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the sample text and the labeling information of the sample picture;
performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item;
calculating a Hungarian loss based on the first, second, third, and fourth loss terms;
and training the text image matching model according to the Hungarian loss.
In one embodiment, the process of extracting the global features of the sample picture comprises:
cutting the sample picture according to a first cutting proportion to obtain a global picture;
carrying out feature extraction on the global picture to obtain global features;
the process of extracting the local features of the sample picture comprises the following steps:
cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion;
performing feature extraction on the local picture to obtain local features;
the extraction process of the interesting region features of the sample picture comprises the following steps:
identifying an interested area of the sample picture, and acquiring area position information of the interested area;
and carrying out image coding on the region of interest to obtain image features, and obtaining the region-of-interest features according to the image features and the region position information.
In one embodiment, the extracting process of the original text features of the sample text comprises:
extracting text features of the sample text as original text features;
the extraction process of the abstract text features of the sample text comprises the following steps:
filtering the sample text according to the sample text and the labeling information of the sample picture;
and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the generating a first loss term based on the comparison learning of the global features and the abstract text features comprises:
calculating a first similarity of the global feature and the abstract text feature and a second similarity of the abstract text feature and the local feature;
generating a first loss term based on the first similarity and the second similarity;
the generating a second loss term based on the local feature and the original text feature through contrast learning comprises:
calculating a third similarity of the local feature and the original text feature, and a fourth similarity of the original text feature and the local feature;
generating a second loss term based on the third similarity and the fourth similarity;
the generating a third loss term based on the region-of-interest feature and the original text feature comprises:
calculating a fifth similarity of the region of interest feature and the original text feature, and a sixth similarity of the original text feature and the region of interest feature;
generating a third loss term based on the fifth similarity and the sixth similarity;
the generating a fourth loss item through comparison learning based on the region-of-interest features and the abstract text features comprises:
calculating a seventh similarity of the region-of-interest feature and the abstract text feature and an eighth similarity of the abstract text feature and the region-of-interest feature;
generating a fourth loss term based on the seventh similarity and the eighth similarity.
A picture labeling method comprises the following steps:
receiving a picture to be processed and a text to be processed;
and inputting the picture to be processed and the text to be processed into a text image matching model obtained by training in any one of the above embodiments to obtain a label text at a corresponding position of the picture to be processed.
A text image matching model training apparatus, the apparatus comprising:
the sample acquisition module is used for acquiring a sample picture, a sample text and the labeling information of the sample text and the sample picture;
the image feature extraction module is used for extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
the text feature extraction module is used for extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the sample text and the labeling information of the sample picture;
the loss item generation module is used for performing contrast learning on the basis of the global feature and the abstract text feature to generate a first loss item, performing contrast learning on the basis of the local feature and the original text feature to generate a second loss item, performing contrast learning on the basis of the region-of-interest feature and the original text feature to generate a third loss item, and performing contrast learning on the basis of the region-of-interest feature and the abstract text feature to generate a fourth loss item;
a Hungarian loss calculation module for calculating Hungarian losses based on the first, second, third and fourth loss terms;
and the training module is used for training the text image matching model according to the Hungarian loss.
A picture annotation device, said picture annotation device comprising:
the receiving module is used for receiving the picture to be processed and the text to be processed;
and the labeling module is used for inputting the picture to be processed and the text to be processed into the text image matching model obtained by training in any one of the embodiments to obtain a labeled text at the corresponding position of the picture to be processed.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described in any one of the above embodiments when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as set forth in any one of the above embodiments.
A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method as described in any one of the above embodiments.
The text image matching model training method, the picture labeling method, the apparatus, the computer device, the storage medium and the computer program product align images and texts more accurately in a hierarchical form: input pyramids of different semantic levels are constructed on the two sides of a dual-stream network for text and image, visual modeling is applied to the global features, local features and region-of-interest features, and language modeling is applied to the original text features and abstract text features; contrastive learning is then performed based on the global features and the abstract text features to generate a first loss term, based on the local features and the original text features to generate a second loss term, based on the region-of-interest features and the original text features to generate a third loss term, and based on the region-of-interest features and the abstract text features to generate a fourth loss term; a Hungarian loss is calculated based on the first, second, third and fourth loss terms, so that the matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a method for training a text-image matching model may be implemented;
FIG. 2 is a schematic flow chart diagram of a method for training a text-image matching model in one embodiment;
FIG. 3 is a schematic illustration of contrastive learning in one embodiment;
FIG. 4 is a flowchart illustrating a method for annotating pictures in an embodiment;
FIG. 5 is a schematic diagram of the structure of a model in one embodiment;
FIG. 6 is a block diagram showing the structure of a training apparatus for a text image matching model according to an embodiment;
FIG. 7 is a block diagram showing the construction of an apparatus for annotating pictures in one embodiment;
FIG. 8 is a diagram of an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text picture matching model training method and the picture labeling method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be placed on the cloud or other network server.
The server 104 obtains a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on the labeled information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item; calculating Hungarian losses based on the first loss term, the second loss term, the third loss term and the fourth loss term; and training the text picture matching model according to Hungarian loss. After the model is obtained through training, the model can be installed at the terminal subsequently, so that image marking can be conveniently carried out.
The text picture matching model training method aligns pictures and texts more accurately in a hierarchical form: input pyramids of different semantic levels are constructed on the two sides of a dual-stream network for text and picture, visual modeling is applied to the global features, local features and region-of-interest features, and language modeling is applied to the original text features and abstract text features; contrastive learning is then performed based on the global features and the abstract text features to generate a first loss term, based on the local features and the original text features to generate a second loss term, based on the region-of-interest features and the original text features to generate a third loss term, and based on the region-of-interest features and the abstract text features to generate a fourth loss term; a Hungarian loss is calculated based on the first, second, third and fourth loss terms, so that the matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 2, a text image matching model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s202: and acquiring the sample picture, the sample text and the labeling information of the sample text and the sample picture.
Specifically, the sample picture and the sample text are corresponding samples, where the user labels the sample picture with the sample text in advance to obtain the labeling information; for example, a region of the sample picture, or the whole picture, is labeled with a word from the sample text, thereby establishing an association between a local word or a global word in the sample text and the corresponding local region or the whole of the sample picture.
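To make the labeling information concrete, the following is a minimal sketch of what one training sample might look like; the data structure, field names and values are illustrative assumptions rather than the patent's actual format.

```python
# Hypothetical sample structure: a picture, its caption text, and labeling information that
# associates words of the text with regions of the picture (or with the whole picture).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RegionAnnotation:
    words: List[str]                         # words taken from the sample text, e.g. ["red", "car"]
    box: Tuple[float, float, float, float]   # normalized (x1, y1, x2, y2); (0, 0, 1, 1) = whole picture

@dataclass
class TextImageSample:
    picture_path: str                        # path to the sample picture
    text: str                                # the sample text (caption / description)
    annotations: List[RegionAnnotation] = field(default_factory=list)

sample = TextImageSample(
    picture_path="images/0001.jpg",
    text="a red car parked next to a tree",
    annotations=[
        RegionAnnotation(words=["red", "car"], box=(0.10, 0.35, 0.55, 0.90)),
        RegionAnnotation(words=["tree"], box=(0.60, 0.05, 0.95, 0.85)),
    ],
)
```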
S204: and extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model.
Specifically, the global feature refers to a feature of the whole sample picture, the local feature refers to a feature of a region obtained by cropping the sample picture, and the region-of-interest feature refers to a feature of a salient object region in the sample picture.
S206: and extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on the labeled information of the sample text and the sample picture.
Specifically, the original text features are text features extracted from the sample text before it is abridged, and the abstract text features are text features extracted after redundant content has been removed. In this embodiment, the abstract text features are generated based on the sample text and the labeling information of the sample picture; for example, the labeled part of the sample text is used as the abstract text, and the abstract text features are then extracted from it.
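As an illustration only, one simple way to derive the abstract text from the labeling information is to keep just the annotated words; the patent does not specify its exact filtering rule, so the function below (reusing the hypothetical sample structure sketched earlier) is an assumption.

```python
# Assumed filtering rule: retain only the words of the sample text that appear in the
# labeling information, so the abstract text keeps the annotated content and drops detail.
def build_abstract_text(text, annotations):
    annotated = {w.lower() for ann in annotations for w in ann.words}
    kept = [w for w in text.split() if w.lower() in annotated]
    return " ".join(kept) if kept else text   # fall back to the original text if nothing matches

abstract_text = build_abstract_text(sample.text, sample.annotations)
# e.g. "red car tree"; the text feature extractor then encodes this filtered string
# to obtain the abstract text features.
```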
S208: performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item.
Specifically, in conjunction with fig. 3, which is a schematic illustration of contrastive learning in one embodiment: semantic mismatches between the visual and linguistic modalities typically exist in text-picture pairs, e.g., title redundancy, picture redundancy, and missing relationships between target instances. The traditional method directly treats any unpaired sample as a negative sample without considering such correlations, which can cause the model to overfit. Therefore, this embodiment aligns pictures and texts more accurately in a hierarchical form, constructs input pyramids of different semantic levels on the two sides of a dual-stream network for text and picture, and performs visual modeling on the global picture, the local picture regions and the salient instances in the picture, and language modeling on the original text and the text summary.
Specifically, for hierarchical internal semantic alignment, since the global regions of the picture and the text abstract both contain global semantic information, and the local regions and the original text contain semantic information of finer granularity, they are regarded as two pairs of positive samples, that is, the global features and the abstract text features are used for performing contrast learning to generate a first loss term, and the local features and the original text features are used for performing contrast learning to generate a second loss term.
For cross-hierarchy relationship alignment, in order to avoid that the modeling of the target relationship by the visual encoder is submerged by scene semantic modeling, the embodiment aligns the relationship between target instances with language elements, that is, performs contrast learning based on the region-of-interest feature and the original text feature to generate a third loss term, and performs contrast learning based on the region-of-interest feature and the abstract text feature to generate a fourth loss term.
S210: calculating Hungarian losses based on the first, second, third, and fourth loss terms.
S212: and training the text picture matching model according to Hungarian loss.
Specifically, to deal with the compatibility problem between image/text pairs, the loss terms of the negative samples, i.e. the unpaired samples, are softened during contrastive learning; this relaxes the strict loss constraint and thereby reduces the negative effect of local similarities between unpaired samples.
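The patent does not spell out how the negative-sample terms are softened; one common realization, shown here purely as a hedged sketch, is label smoothing on the contrastive target distribution so that unpaired samples receive a small, non-zero target weight.

```python
# Sketch of softened contrastive targets via label smoothing (an assumption about the
# exact softening rule). Hard targets would be the identity matrix; smoothing shifts a
# little probability mass onto the unpaired (negative) samples.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(logits: torch.Tensor, smoothing: float = 0.1) -> torch.Tensor:
    # logits: (N, N) matrix of similarities between paired batches, already divided by tau
    n = logits.size(0)
    targets = torch.eye(n, device=logits.device) * (1.0 - smoothing) + smoothing / n
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```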
For the N picture-text pairs in a batch, $\{(I_i, T_i)\}_{i=1}^{N}$, where $i$ denotes the $i$-th pair, normalized embedding vectors of the same dimension are obtained by a dual-stream encoder. The picture encoder generates, from the global cropped picture G, the local cropped picture L and the ROI feature sequence respectively, the global features $v^g_i$, the local features $v^l_i$ and the region-of-interest features $v^r_i$. The text encoder extracts, from the original text T and the text abstract $T_S$ respectively, the original text features $l^t_i$ and the text abstract features $l^s_i$. Then, four supervisory signals $L_{GS}$, $L_{LT}$, $L_{RS}$ and $L_{RT}$ are constructed from these vector sets for intra-batch contrastive learning, aiming at alignment between the visual and linguistic representations at different semantic levels.
In one embodiment, the generating of the first loss term based on the global feature and the abstract text feature by the comparative learning includes: calculating a first similarity of the global features and the abstract text features and a second similarity of the abstract text features and the local features; generating a first loss term based on the first similarity and the second similarity; and performing contrast learning based on the local features and the original text features to generate a second loss term, wherein the second loss term comprises: calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; the step of performing contrast learning to generate a third loss term based on the region-of-interest feature and the original text feature comprises: calculating fifth similarity of the interesting region features and the original text features and sixth similarity of the original text features and the interesting region features; generating a third loss term based on the fifth similarity and the sixth similarity; performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss term, including: calculating a seventh similarity between the region-of-interest feature and the abstract text feature and an eighth similarity between the abstract text feature and the region-of-interest feature; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
For ease of understanding, take the first loss term $L_{GS}$, defined on the pair of global features $v^g$ and abstract text features $l^s$, as an example. For the $i$-th pair, the normalized picture-to-language similarity and language-to-picture similarity can be calculated as:

$$p^{g2s}_i = \frac{\exp\!\big(\mathrm{sim}(v^g_i, l^s_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(v^g_i, l^s_j)/\tau\big)}, \qquad p^{s2g}_i = \frac{\exp\!\big(\mathrm{sim}(l^s_i, v^g_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(l^s_i, v^g_j)/\tau\big)}$$

where $p^{g2s}_i$ is the first similarity, computed from the global feature to the abstract text feature, $p^{s2g}_i$ is the second similarity, computed in the reverse direction, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function and $\tau$ is a constant (temperature) term. In this way similarities are distinguished and matching between patches is realized; the loss function adopts the Hungarian loss to obtain a more accurate matching result:

$$\mathcal{L}_{\mathrm{Hungarian}}(y,\hat{y}) = \sum_{i=1}^{N}\Big[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i\neq\varnothing\}}\,\mathcal{L}_{box}\big(b_i,\hat{b}_{\hat{\sigma}(i)}\big)\Big]$$

where $b_i$ is a vector defining the coordinates of the centre of the ground-truth box and its height and width relative to the image size, $c_i$ is the target class label (which cannot be the empty set), $\hat{p}_{\hat{\sigma}(i)}(c_i)$ is the predicted probability of class $c_i$ for the prediction box $\hat{b}_{\hat{\sigma}(i)}$ assigned by the optimal matching $\hat{\sigma}$, and $\mathcal{L}_{box}$ scores the bounding box.
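A hedged sketch of one such supervisory signal is given below; the symmetric combination of the two directions and the function names are assumptions, but the normalized-similarity computation follows the formulas above.

```python
# Sketch of one contrastive supervisory signal, e.g. L_GS between global picture embeddings
# and abstract text embeddings; L_LT, L_RS and L_RT would be built the same way on the
# (local, original-text), (ROI, abstract-text) and (ROI, original-text) pairs.
import torch
import torch.nn.functional as F

def contrastive_signal(v: torch.Tensor, l: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # v: (N, d) picture-side embeddings, l: (N, d) text-side embeddings
    v = F.normalize(v, dim=-1)
    l = F.normalize(l, dim=-1)
    sim = v @ l.t() / tau                          # sim(v_i, l_j) / tau for every pair in the batch
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2l = F.cross_entropy(sim, targets)       # picture-to-language direction
    loss_l2v = F.cross_entropy(sim.t(), targets)   # language-to-picture direction
    return 0.5 * (loss_v2l + loss_l2v)
```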
In this embodiment, for semantically aligned picture patches, their similarity is taken into account and a Hungarian matching loss is added, so that matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
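Purely as an illustration of the matching step, the sketch below uses SciPy's linear_sum_assignment to perform Hungarian (optimal bipartite) matching on a patch-to-text similarity matrix; the cost definition is an assumption, since the patent only states that Hungarian matching based on patch similarity is used.

```python
# Hedged sketch: Hungarian matching between picture patch embeddings and text element
# embeddings, with cost = negative cosine similarity (assumed). The matched pairs can then
# be scored by a DETR-style Hungarian loss as in the formula above.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_match(patch_feats: torch.Tensor, text_feats: torch.Tensor):
    # patch_feats: (P, d) patch embeddings; text_feats: (K, d) text element embeddings
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    cost = (-sim).detach().cpu().numpy()           # higher similarity -> lower matching cost
    rows, cols = linear_sum_assignment(cost)       # optimal one-to-one assignment
    pairs = list(zip(rows.tolist(), cols.tolist()))
    matched_sim = sim[torch.as_tensor(rows), torch.as_tensor(cols)]
    return pairs, matched_sim
```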
The text picture matching model training method aligns pictures and texts more accurately in a hierarchical form: input pyramids of different semantic levels are constructed on the two sides of a dual-stream network for text and picture, visual modeling is applied to the global features, local features and region-of-interest features, and language modeling is applied to the original text features and abstract text features; contrastive learning is then performed based on the global features and the abstract text features to generate a first loss term, based on the local features and the original text features to generate a second loss term, based on the region-of-interest features and the original text features to generate a third loss term, and based on the region-of-interest features and the abstract text features to generate a fourth loss term; a Hungarian loss is calculated based on the first, second, third and fourth loss terms, so that the matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
In one embodiment, the process of extracting the global features of the sample picture includes: cutting the sample picture according to a first cutting proportion to obtain a global picture; and carrying out feature extraction on the global picture to obtain the global features. The process of extracting the local features of the sample picture includes: cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion; and performing feature extraction on the local picture to obtain the local features. The process of extracting the region-of-interest features of the sample picture includes: identifying a region of interest of the sample picture, and acquiring the region position information of the region of interest; and carrying out picture coding on the region of interest to obtain picture features, and obtaining the region-of-interest features according to the picture features plus the region position information.
In one embodiment, the process of extracting the original text features of the sample text comprises the following steps: extracting text features of the sample text as original text features; the extraction process of the abstract text features of the sample text comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
For intra-hierarchy semantic alignment, the global picture and the text summary are regarded as one pair of positive samples, since both contain global semantic information, while the local regions and the original text, which contain finer-grained semantic information, form another positive pair.
Specifically, the global view G is first obtained by random cropping, where the cropping scale is set to [0.9, 1.0]; in other words, the cropping ratio ranges from 0.9 to 1.0. If random cropping is performed at a ratio of 0.93, i.e. 93%, then a 93% crop of the picture is taken at random, so most of the effective area is still retained and almost all of the information in the original picture is kept. This operation is mainly used for data augmentation, making the form of the training data richer and yielding a model with better generalization. The text abstract T_S is a compressed version of the original text T, with redundant and overly detailed information removed. G and T_S both capture global information and can be used as a pair of positive samples. Through contrastive learning, the projection embeddings v_g of G and l_s of T_S are pulled closer together, where v (vision) refers to the picture and l (language) refers to the text.
For contrastive learning on fine-grained local information: the global view G and the text abstract T_S are relatively coarse, so fine-grained information is largely discarded by them. In this embodiment, however, picture sub-regions are intended to be alignable with particular descriptions in the title. To this end, this embodiment introduces fine-grained local contrast. The random cropping scale for generating the local view L is set to [0.6, 1.0], so that L focuses on a sub-region of picture I. The original text T contains many detailed descriptions and is therefore more appropriately regarded as a positive sample of L. The projection embeddings v_l of L and l_t of T are then also pulled together by the contrastive loss. Here, again, v (vision) refers to the picture and l (language) refers to the text.
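A minimal sketch of the two cropping operations, using torchvision, is shown below; only the scale ranges come from the text, while the output size of 224 and the remaining transforms are assumptions.

```python
# Global view G: random crop with scale in [0.9, 1.0]; local view L: scale in [0.6, 1.0].
from torchvision import transforms

global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # keeps most of the original picture
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),   # focuses on a sub-region of picture I
    transforms.ToTensor(),
])
# G is paired with the text abstract T_S, and L with the original text T, as positive samples.
```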
For cross-level relationship alignment, to avoid the visual encoder's modeling of target relationships being overwhelmed by scene semantic modeling, this embodiment aligns the relationships between target instances with language elements. To further improve alignment accuracy, a sequence of ROI features of the salient objects in the picture is introduced here to provide more supervision. Specifically, given a picture I with M salient objects, the visual semantics of each object region are extracted using a pre-trained object detector and defined as [o'_m, z_m], where m denotes the m-th object, o'_m is a 2048-dimensional feature vector, and z_m is a 4-dimensional normalized position vector representing the coordinates of the upper-left and lower-right corners.
By concatenating o'_m and z_m, a 2048-dimensional position-sensitive ROI feature vector O_m can be obtained, and these vectors form the ROI feature sequence. To enhance the ability of the text encoder to model conceptual relationships while avoiding impairing the reasoning ability of the visual encoder, (v_r, l_s) and (v_r, l_t) are used as another two positive pairs, and the distances between v_r and l_s and between v_r and l_t are minimized. This training process is referred to herein as cross-hierarchy relationship alignment, because the instance-level input used by the visual modality is very fine-grained, while the input used by the linguistic modality is a complete sentence, i.e., the text abstract and the original text.
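The sketch below illustrates one way to build the position-sensitive ROI feature sequence; the linear projection back to 2048 dimensions after concatenating the 2048-d visual vector with the 4-d position vector is an assumption made so that O_m has the dimensionality stated above.

```python
# Hedged sketch: concatenate the detector feature o'_m with the normalized box z_m and project
# the result to a 2048-d position-sensitive ROI feature O_m; stacking over m gives the sequence.
import torch
import torch.nn as nn

class ROIFeatureBuilder(nn.Module):
    def __init__(self, visual_dim: int = 2048, pos_dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(visual_dim + pos_dim, visual_dim)

    def forward(self, obj_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # obj_feats: (M, 2048) features o'_m; boxes: (M, 4) normalized corner coordinates z_m
        return self.proj(torch.cat([obj_feats, boxes], dim=-1))   # (M, 2048) ROI sequence O_m
```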
Finally, the patch tokens are projected to a higher dimension and reshaped by a linear projection layer. Local information is then captured using a 3 × 3 depth-wise convolution. The features are then mapped back to the token sequence and re-projected to the original dimension. The cls token is left unchanged in this process and is concatenated with the locally enhanced patch tokens to generate the final output.
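The following is a hedged sketch of such a locally enhanced block; the hidden dimension, the assumption of a square patch grid, and the module name are illustrative choices, not details given in the text.

```python
# Sketch: project patch tokens up, reshape to a 2-D grid, apply a 3x3 depth-wise convolution to
# capture local information, project back, and concatenate the untouched cls token.
import torch
import torch.nn as nn

class LocallyEnhancedFFN(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 3072):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + H*W, dim), with the cls token first
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        b, n, _ = patches.shape
        h = w = int(n ** 0.5)                              # assumes a square patch grid
        x = self.up(patches).transpose(1, 2).reshape(b, -1, h, w)
        x = self.dwconv(x)                                 # 3x3 depth-wise convolution
        x = x.reshape(b, -1, n).transpose(1, 2)
        patches = self.down(x)
        return torch.cat([cls_tok, patches], dim=1)        # cls token unchanged in the process
```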
In this embodiment, Hungarian matching based on patch similarity is introduced into the way text-picture data are used: picture patch similarity is exploited to the greatest extent and a better matching effect is obtained. This addresses the shortcoming of the traditional technique, which performs task learning directly on text-picture data and ignores the text information redundancy, the picture information redundancy, and the interrelations among picture target instances and among different target instances present in this type of training data.
In an embodiment, as shown in fig. 4, a method for annotating a picture is provided, which is described by taking the method as an example for being applied to the server or the terminal in fig. 1, and includes the following steps:
s402: and receiving the picture to be processed and the text to be processed.
S404: and inputting the picture to be processed and the text to be processed into a text picture matching model obtained by training in any one of the previous input embodiments to obtain a labeled text at a corresponding position of the picture to be processed.
In particular, in connection with FIG. 5, for a given text-picture pair, the goal is for the text-picture matching model to learn a common visual representation. The textual description contains rich semantic information about the target scene in the corresponding picture, such as target object class, color, spatial layout and action state, and this rich information is of significant value for downstream visual tasks such as image classification or target detection.
To this end, the input text is first encoded to obtain a semantic representation corresponding to the picture, the picture encoder performs visual feature extraction on the picture, and fusion learning is then performed on the semantic representation output by the text extractor and the extracted picture features. The picture feature vectors and the text feature vectors are then concatenated in the shared space to obtain a complete, self-contained sequence containing both the picture features and the text features. The goal is to learn a scene descriptor of the picture content that predicts and outputs a visual content representation of the input picture, thereby completing the annotation of the picture.
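As a rough illustration of the annotation step only, the sketch below concatenates picture and text features in the shared space and labels each region with its most similar text span; the encoder interfaces and the nearest-neighbour labeling rule are assumptions, not the patent's exact design.

```python
# Hedged sketch of picture annotation with a trained text-picture matching model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def annotate_picture(model, picture, text_spans):
    region_feats = model.encode_picture(picture)              # (R, d) region features (assumed API)
    text_feats = model.encode_text(text_spans)                # (K, d) one embedding per text span
    fused = torch.cat([region_feats, text_feats], dim=0)      # complete sequence in the shared space
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    best = sim.argmax(dim=-1)                                 # most similar text span per region
    labels = [(region_idx, text_spans[j]) for region_idx, j in enumerate(best.tolist())]
    return labels, fused
```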
In this embodiment, semantic alignment of text and picture makes the mass production of high-quality training data feasible and efficient without relying heavily on manual labeling, thereby shortening the task development cycle for algorithm engineers and improving the efficiency of business applications.
It should be understood that, although the steps in the flowcharts of the embodiments are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of performing these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a text picture matching model training device and a picture labeling device for realizing the text picture matching model training method and the picture labeling method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so the specific limitations in one or more embodiments of the text image matching model training device and the image labeling device provided below can be referred to the limitations of the text image matching model training method and the image labeling method in the foregoing, and are not described herein again.
In one embodiment, as shown in fig. 6, there is provided a text image matching model training apparatus, including: a sample acquisition module 601, an image feature extraction module 602, a text feature extraction module 603, a loss item generation module 604, a Hungarian loss calculation module 605 and a training module 606, wherein:
the sample obtaining module 601 is configured to obtain a sample picture, a sample text, and labeling information of the sample text and the sample picture.
And the picture feature extraction module 602 is configured to extract global features, local features, and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model.
The text feature extraction module 603 is configured to extract, by using a text feature extractor of the text-picture matching model, an original text feature and a digest text feature of the sample text, where the digest text feature is generated based on the label information of the sample text and the sample picture.
The loss term generation module 604 is configured to perform contrast learning based on the global feature and the abstract text feature to generate a first loss term, perform contrast learning based on the local feature and the original text feature to generate a second loss term, perform contrast learning based on the region-of-interest feature and the original text feature to generate a third loss term, and perform contrast learning based on the region-of-interest feature and the abstract text feature to generate a fourth loss term.
A hungarian loss calculation module 605 for calculating the hungarian loss based on the first, second, third and fourth loss terms.
And a training module 606 for training the text picture matching model according to Hungarian loss.
In one embodiment, the image feature extraction module 602 is further configured to crop the sample picture according to a first cropping ratio to obtain a global picture; perform feature extraction on the global picture to obtain global features; crop the sample picture according to a second cropping ratio to obtain a local picture, wherein the second cropping ratio is smaller than the first cropping ratio; perform feature extraction on the local picture to obtain local features; identify a region of interest of the sample picture, and acquire the region position information of the region of interest; and encode the region of interest to obtain picture features, and obtain the region-of-interest features according to the picture features plus the region position information.
In one embodiment, the text feature extraction module 603 is further configured to extract text features of the sample text as original text features; filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the above-mentioned loss term generating module 604 is further configured to calculate a first similarity between the global feature and the abstract text feature, and a second similarity between the abstract text feature and the local feature; generating a first loss term based on the first similarity and the second similarity; calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; calculating fifth similarity of the region-of-interest features and the original text features and sixth similarity of the original text features and the region-of-interest features; generating a third loss term based on the fifth similarity and the sixth similarity; calculating a seventh similarity of the interesting region features and the abstract text features and an eighth similarity of the abstract text features and the interesting region features; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
In one embodiment, as shown in fig. 7, there is provided a picture annotation device, including: a receiving module 701 and an annotating module 702, wherein:
the receiving module 701 is configured to receive a to-be-processed picture and a to-be-processed text.
And the labeling module 702 is configured to input the picture to be processed and the text to be processed into the text-picture matching model obtained through training in any one of the above embodiments, so as to obtain a labeled text at a corresponding position of the picture to be processed.
All modules in the text image matching model training device and the image labeling device can be completely or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a text picture matching model training method and a picture labeling method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of: acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on the labeled information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item; calculating Hungarian losses based on the first loss term, the second loss term, the third loss term and the fourth loss term; and training the text picture matching model according to Hungarian loss.
In one embodiment, the process of extracting global features of a sample picture involved in the execution of the computer program by the processor comprises: cutting the sample picture according to a first cutting proportion to obtain a global picture; carrying out feature extraction on the global picture to obtain global features; the process of extracting the local features of the sample picture involved in the execution of the computer program by the processor comprises the following steps: cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion; performing feature extraction on the local picture to obtain local features; the process of extracting the interesting region characteristic of the sample picture involved in the processor executing the computer program comprises the following steps: identifying an interested area of the sample picture, and acquiring area position information of the interested area; and carrying out picture coding on the region of interest to obtain picture characteristics, and obtaining the region of interest characteristics according to the picture sum region position information.
In one embodiment, the process of extracting original text features of sample text involved in the execution of the computer program by the processor comprises: extracting text features of the sample text as original text features; the extraction process of the abstract text characteristics of the sample text involved in the execution of the computer program by the processor comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the comparison learning based on the global features and the abstract text features, which is implemented when the processor executes the computer program, generates the first loss item, and includes: calculating a first similarity of the global features and the abstract text features and a second similarity of the abstract text features and the local features; generating a first loss term based on the first similarity and the second similarity; the processor, implemented when executing the computer program, generates a second loss term based on the local feature and the original text feature through comparative learning, and comprises: calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; the generating of the third loss term based on the region-of-interest feature and the raw text feature through the comparative learning implemented when the processor executes the computer program includes: calculating fifth similarity of the region-of-interest features and the original text features and sixth similarity of the original text features and the region-of-interest features; generating a third loss term based on the fifth similarity and the sixth similarity; the fourth loss item is generated by the processor based on the region-of-interest feature and the abstract text feature through comparison learning when the processor executes the computer program, and the fourth loss item comprises the following steps: calculating a seventh similarity between the region-of-interest feature and the abstract text feature and an eighth similarity between the abstract text feature and the region-of-interest feature; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
In one embodiment, a computer device is provided, comprising a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of: receiving a picture to be processed and a text to be processed; and inputting the picture to be processed and the text to be processed into the text picture matching model obtained by training in any one of the embodiments to obtain the labeled text at the corresponding position of the picture to be processed.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on labeling information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item; calculating Hungarian losses based on the first loss term, the second loss term, the third loss term and the fourth loss term; and training the text picture matching model according to Hungarian loss.
In one embodiment, the process of extracting global features of a sample picture involved when the computer program is executed by a processor comprises: cutting the sample picture according to a first cutting proportion to obtain a global picture; carrying out feature extraction on the global picture to obtain global features; the extraction process of the local features of the sample picture involved in the execution of the computer program by the processor comprises the following steps: cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion; performing feature extraction on the local picture to obtain local features; the process of extracting the characteristics of the interested region of the sample picture includes the following steps: identifying an interested area of the sample picture, and acquiring area position information of the interested area; and carrying out picture coding on the region of interest to obtain picture characteristics, and obtaining the region of interest characteristics according to the picture plus region position information.
In one embodiment, the process of extracting original text features of a sample text involved in the execution of the computer program by the processor comprises: extracting text features of the sample text as original text features; the extraction process of the abstract text features of the sample text involved in the execution of the computer program by the processor comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the generation of the first loss term based on the global feature and the digest-text feature comparative learning implemented by the computer program when the computer program is executed by the processor includes: calculating a first similarity of the global features and the abstract text features and a second similarity of the abstract text features and the local features; generating a first loss term based on the first similarity and the second similarity; the computer program, when executed by the processor, implements a comparative learning based on the local features and the original text features to generate a second loss term, comprising: calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; the third loss term generated by the computer program based on the region-of-interest feature and the original text feature through comparison learning when the computer program is executed by the processor comprises the following steps: calculating fifth similarity of the region-of-interest features and the original text features and sixth similarity of the original text features and the region-of-interest features; generating a third loss term based on the fifth similarity and the sixth similarity; the fourth loss term is generated based on the region-of-interest feature and the abstract text feature through comparison learning, and when the computer program is executed by the processor, the fourth loss term comprises: calculating a seventh similarity between the region-of-interest feature and the abstract text feature and an eighth similarity between the abstract text feature and the region-of-interest feature; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented: receiving a picture to be processed and a text to be processed; and inputting the picture to be processed and the text to be processed into the text image matching model trained in any one of the above embodiments, to obtain the labeled text at the corresponding position of the picture to be processed.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following steps: acquiring a sample picture, a sample text, and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of the text image matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the labeling information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss term, performing contrast learning based on the local features and the original text features to generate a second loss term, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss term, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss term; calculating a Hungarian loss based on the first, second, third and fourth loss terms; and training the text image matching model according to the Hungarian loss.
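The excerpt states only that a Hungarian loss is calculated from the four loss terms; it does not spell out the combination. The sketch below is one speculative reading: each term is taken as the (N, N) matrix of per-pair costs between the N pictures and N texts of a batch, the four matrices are summed with assumed equal weights, and the Hungarian algorithm selects the minimum-cost picture-to-text assignment, whose total cost serves as the loss.

```python
# Highly speculative sketch: the combination rule, the equal weighting, and the
# batch-level matching are all assumptions, not taken from this excerpt.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_loss(term1: np.ndarray, term2: np.ndarray,
                   term3: np.ndarray, term4: np.ndarray) -> float:
    """Each term_k is an (N, N) cost matrix with pictures as rows and texts as columns."""
    cost = term1 + term2 + term3 + term4          # equal weights are an assumption
    rows, cols = linear_sum_assignment(cost)      # Hungarian / Kuhn-Munkres matching
    return float(cost[rows, cols].sum())
```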
In one embodiment, the process of extracting the global features of the sample picture, involved when the computer program is executed by the processor, comprises: cropping the sample picture according to a first cropping ratio to obtain a global picture; and performing feature extraction on the global picture to obtain the global features. The process of extracting the local features of the sample picture comprises: cropping the sample picture according to a second cropping ratio to obtain a local picture, wherein the second cropping ratio is smaller than the first cropping ratio; and performing feature extraction on the local picture to obtain the local features. The process of extracting the region-of-interest features of the sample picture comprises: identifying a region of interest of the sample picture and acquiring region position information of the region of interest; and performing picture coding on the region of interest to obtain picture features, and obtaining the region-of-interest features from the picture features and the region position information.
In one embodiment, the process of extracting original text features of a sample text involved in the execution of the computer program by the processor comprises: extracting text features of the sample text as original text features; the extraction process of the abstract text features of the sample text involved in the execution of the computer program by the processor comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the generating of the first loss term through contrast learning based on the global features and the abstract text features, implemented when the computer program is executed by the processor, comprises: calculating a first similarity of the global features to the abstract text features and a second similarity of the abstract text features to the global features; and generating the first loss term based on the first similarity and the second similarity. The generating of the second loss term through contrast learning based on the local features and the original text features comprises: calculating a third similarity of the local features to the original text features and a fourth similarity of the original text features to the local features; and generating the second loss term based on the third similarity and the fourth similarity. The generating of the third loss term through contrast learning based on the region-of-interest features and the original text features comprises: calculating a fifth similarity of the region-of-interest features to the original text features and a sixth similarity of the original text features to the region-of-interest features; and generating the third loss term based on the fifth similarity and the sixth similarity. The generating of the fourth loss term through contrast learning based on the region-of-interest features and the abstract text features comprises: calculating a seventh similarity of the region-of-interest features to the abstract text features and an eighth similarity of the abstract text features to the region-of-interest features; and generating the fourth loss term based on the seventh similarity and the eighth similarity.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following steps: receiving a picture to be processed and a text to be processed; and inputting the picture to be processed and the text to be processed into the text image matching model trained in any one of the above embodiments, to obtain the labeled text at the corresponding position of the picture to be processed.
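As an editorial illustration of this labeling step, the sketch below scores candidate texts against detected picture regions with a trained matching model and returns the best-matching text per region. The `extract_regions` and `encode_texts` helpers and the use of a candidate-text list are assumptions; the excerpt only states that the picture and the text are input to the trained model.

```python
# Illustrative inference sketch; the model interface below is assumed, not given here.
import torch

@torch.no_grad()
def label_picture(model, picture: torch.Tensor, candidate_texts: list):
    """Return, for each detected region of interest, the best-matching candidate text."""
    regions, region_feats = model.extract_regions(picture)   # assumed helper
    text_feats = model.encode_texts(candidate_texts)         # assumed helper
    scores = region_feats @ text_feats.t()                   # region-text similarity
    best = scores.argmax(dim=-1)
    return [(region, candidate_texts[i]) for region, i in zip(regions, best.tolist())]
```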
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of the present disclosure.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that several variations and improvements may be made by those of ordinary skill in the art without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A text image matching model training method is characterized by comprising the following steps:
acquiring a sample picture, a sample text and the labeling information of the sample text and the sample picture;
extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the labeling information of the sample text and the sample picture;
performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item;
calculating a Hungarian loss based on the first, second, third, and fourth loss terms;
and training the text image matching model according to the Hungarian loss.
2. The method according to claim 1, wherein the process of extracting the global features of the sample picture comprises:
cropping the sample picture according to a first cropping ratio to obtain a global picture;
performing feature extraction on the global picture to obtain the global features;
the process of extracting the local features of the sample picture comprises:
cropping the sample picture according to a second cropping ratio to obtain a local picture, wherein the second cropping ratio is smaller than the first cropping ratio;
performing feature extraction on the local picture to obtain the local features;
the process of extracting the region-of-interest features of the sample picture comprises:
identifying a region of interest of the sample picture, and acquiring region position information of the region of interest;
and performing picture coding on the region of interest to obtain picture features, and obtaining the region-of-interest features from the picture features and the region position information.
3. The method of claim 1, wherein the extracting of the original text features of the sample text comprises:
extracting text features of the sample text as original text features;
the extraction process of the abstract text features of the sample text comprises the following steps:
filtering the sample text according to the labeling information of the sample text and the sample picture;
and extracting text features of the filtered sample text as abstract text features.
4. The method of claim 1, wherein the generating a first loss term based on the global feature and the abstract text feature through contrast learning comprises:
calculating a first similarity of the global feature and the abstract text feature, and a second similarity of the abstract text feature and the global feature;
generating a first loss term based on the first similarity and the second similarity;
the generating a second loss term based on the local feature and the original text feature through contrast learning comprises:
calculating a third similarity of the local feature and the original text feature, and a fourth similarity of the original text feature and the local feature;
generating a second loss term based on the third similarity and the fourth similarity;
the generating a third loss term based on the region-of-interest feature and the original text feature through contrast learning comprises:
calculating a fifth similarity of the region-of-interest feature and the original text feature, and a sixth similarity of the original text feature and the region-of-interest feature;
generating a third loss term based on the fifth similarity and the sixth similarity;
the generating a fourth loss term based on the region-of-interest feature and the abstract text feature through contrast learning includes:
calculating a seventh similarity of the region-of-interest features and the abstract text features and an eighth similarity of the abstract text features and the region-of-interest features;
generating a fourth loss term based on the seventh similarity and the eighth similarity.
5. A picture labeling method is characterized by comprising the following steps:
receiving a picture to be processed and a text to be processed;
inputting the picture to be processed and the text to be processed into a text image matching model obtained by training according to any one of claims 1 to 4, and obtaining a labeled text at a corresponding position of the picture to be processed.
6. An apparatus for training a text image matching model, the apparatus comprising:
a sample acquisition module, used for acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture;
an image feature extraction module, used for extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
a text feature extraction module, used for extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the labeling information of the sample text and the sample picture;
a loss term generation module, used for performing contrast learning based on the global features and the abstract text features to generate a first loss term, performing contrast learning based on the local features and the original text features to generate a second loss term, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss term, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss term;
a Hungarian loss calculation module, used for calculating a Hungarian loss based on the first, second, third and fourth loss terms;
and a training module, used for training the text image matching model according to the Hungarian loss.
7. A picture labeling apparatus, comprising:
the receiving module is used for receiving the picture to be processed and the text to be processed;
and the labeling module is used for inputting the picture to be processed and the text to be processed into the text image matching model obtained by training with the apparatus of claim 6, so as to obtain the labeled text at the corresponding position of the picture to be processed.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 4 or 5.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4 or 5.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 4 or 5 when executed by a processor.
CN202211065029.9A 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment Pending CN115359492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211065029.9A CN115359492A (en) 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211065029.9A CN115359492A (en) 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment

Publications (1)

Publication Number Publication Date
CN115359492A true CN115359492A (en) 2022-11-18

Family

ID=84003887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211065029.9A Pending CN115359492A (en) 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment

Country Status (1)

Country Link
CN (1) CN115359492A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035427A (en) * 2024-04-15 2024-05-14 之江实验室 Method and device for enhancing multi-mode image-text retrieval through 3D contrast learning
CN118097670A (en) * 2024-04-28 2024-05-28 绵阳师范学院 Text image processing method and system based on multi-mode and SAM technology fusion
CN118097670B (en) * 2024-04-28 2024-06-28 绵阳师范学院 Text image processing method and system based on multi-mode and SAM technology fusion

Similar Documents

Publication Publication Date Title
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
Jing et al. Self-supervised visual feature learning with deep neural networks: A survey
Jeong et al. Deep joint spatiotemporal network (DJSTN) for efficient facial expression recognition
US11776267B2 (en) Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
Abbas et al. A comprehensive review of recent advances on deep vision systems
CN109117777B (en) Method and device for generating information
Feris et al. Large-scale vehicle detection, indexing, and search in urban surveillance videos
CN109711463B (en) Attention-based important object detection method
US8170280B2 (en) Integrated systems and methods for video-based object modeling, recognition, and tracking
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112100438A (en) Label extraction method and device and computer readable storage medium
CN113378710A (en) Layout analysis method and device for image file, computer equipment and storage medium
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN114299321A (en) Video classification method, device, equipment and readable storage medium
Elharrouss et al. FSC-set: counting, localization of football supporters crowd in the stadiums
CN115359492A (en) Text image matching model training method, picture labeling method, device and equipment
Zhu et al. Spatial-temporal knowledge integration: Robust self-supervised facial landmark tracking
Bin et al. Combining multi-representation for multimedia event detection using co-training
CN117635275A (en) Intelligent electronic commerce operation commodity management platform and method based on big data
Lei et al. Recent advances in multi-modal 3D scene understanding: A comprehensive survey and evaluation
Jiang et al. Video searching and fingerprint detection by using the image query and PlaceNet-based shot boundary detection method
Yang et al. SAL‐Net: Self‐Supervised Attribute Learning for Object Recognition and Segmentation
Liu et al. A framework for short video recognition based on motion estimation and feature curves on SPD manifolds
Tan et al. 3D detection transformer: Set prediction of objects using point clouds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination