CN115359492A - Text image matching model training method, picture labeling method, device and equipment - Google Patents

Text image matching model training method, picture labeling method, device and equipment

Info

Publication number
CN115359492A
Authority
CN
China
Prior art keywords
text
features
picture
feature
sample
Prior art date
Legal status
Pending
Application number
CN202211065029.9A
Other languages
Chinese (zh)
Inventor
刘世超
乔秋飞
Current Assignee
Shanghai Yuer Network Technology Co ltd
Original Assignee
Shanghai Yuer Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Yuer Network Technology Co ltd filed Critical Shanghai Yuer Network Technology Co ltd
Priority to CN202211065029.9A
Publication of CN115359492A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/146 - Aligning or centring of the image pick-up or image-field
    • G06V30/147 - Determination of region of interest
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/18 - Extraction of features or characteristics of the image
    • G06V30/1801 - Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/19007 - Matching; Proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a text image matching model training method and a picture labeling method. The method comprises the following steps: acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of the text image matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, the abstract text features being generated based on the labeling information of the sample text and the sample picture; performing contrastive learning on the pairs of global features and abstract text features, local features and original text features, region-of-interest features and original text features, and region-of-interest features and abstract text features to generate the respective loss terms; calculating a Hungarian loss based on the loss terms; and training the text image matching model according to the Hungarian loss. With this method, pictures can be labeled automatically.

Description

Text image matching model training method, picture labeling method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text image matching model training method, a picture labeling method, an apparatus, a computer device, a storage medium, and a computer program product.
Background
Across the basic tasks of computer vision, such as image classification, object detection and semantic segmentation, research and application at the data level have long followed a learning paradigm built on accurately labeled picture data sets, and visual detection tasks based on this paradigm achieve good results. However, this approach is still restricted to a limited-label learning mechanism in the computer vision field and requires a high manual labeling cost.
Moreover, as deep learning matures, researchers pursue ever stronger model learning and generalization ability, and application requirements on labor cost, task learning cycle and deployment efficiency keep rising; the current learning paradigm undoubtedly constrains the development of these tasks.
Disclosure of Invention
Based on this, in order to solve the above technical problems, it is necessary to provide a text image matching model training method, a picture labeling method, an apparatus, a computer device, a storage medium, and a computer program product that can automatically establish a matching relationship between a picture and a text.
A method of text image matching model training, the method comprising:
acquiring a sample picture, a sample text and the labeling information of the sample text and the sample picture;
extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the sample text and the labeling information of the sample picture;
performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item;
calculating a Hungarian loss based on the first, second, third, and fourth loss terms;
and training the text image matching model according to the Hungarian loss.
In one embodiment, the process of extracting the global features of the sample picture comprises:
cutting the sample picture according to a first cutting proportion to obtain a global picture;
carrying out feature extraction on the global picture to obtain global features;
the process of extracting the local features of the sample picture comprises the following steps:
cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion;
performing feature extraction on the local picture to obtain local features;
the extraction process of the interesting region features of the sample picture comprises the following steps:
identifying an interested area of the sample picture, and acquiring area position information of the interested area;
and carrying out image coding on the region of interest to obtain image features, and obtaining the region-of-interest features according to the image features and the region position information.
In one embodiment, the extracting process of the original text features of the sample text comprises:
extracting text features of the sample text as original text features;
the extraction process of the abstract text features of the sample text comprises the following steps:
filtering the sample text according to the sample text and the labeling information of the sample picture;
and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the generating a first loss term based on the comparison learning of the global features and the abstract text features comprises:
calculating a first similarity of the global feature and the abstract text feature and a second similarity of the abstract text feature and the local feature;
generating a first loss term based on the first similarity and the second similarity;
the generating a second loss term based on the local feature and the original text feature through contrast learning comprises:
calculating a third similarity of the local feature and the original text feature, and a fourth similarity of the original text feature and the local feature;
generating a second loss term based on the third similarity and the fourth similarity;
the generating a third loss term based on the region-of-interest feature and the original text feature comprises:
calculating a fifth similarity of the region of interest feature and the original text feature, and a sixth similarity of the original text feature and the region of interest feature;
generating a third loss term based on the fifth similarity and the sixth similarity;
the generating a fourth loss item through comparison learning based on the region-of-interest features and the abstract text features comprises:
calculating a seventh similarity of the region-of-interest feature and the abstract text feature and an eighth similarity of the abstract text feature and the region-of-interest feature;
generating a fourth loss term based on the seventh similarity and the eighth similarity.
A picture labeling method comprises the following steps:
receiving a picture to be processed and a text to be processed;
and inputting the picture to be processed and the text to be processed into a text image matching model obtained by training in any one of the above embodiments to obtain a label text at a corresponding position of the picture to be processed.
A text image matching model training apparatus, the apparatus comprising:
the sample acquisition module is used for acquiring a sample picture, a sample text and the labeling information of the sample text and the sample picture;
the image feature extraction module is used for extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
the text feature extraction module is used for extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the sample text and the labeling information of the sample picture;
the loss item generation module is used for performing contrast learning on the basis of the global feature and the abstract text feature to generate a first loss item, performing contrast learning on the basis of the local feature and the original text feature to generate a second loss item, performing contrast learning on the basis of the region-of-interest feature and the original text feature to generate a third loss item, and performing contrast learning on the basis of the region-of-interest feature and the abstract text feature to generate a fourth loss item;
a Hungarian loss calculation module for calculating Hungarian losses based on the first, second, third and fourth loss terms;
and the training module is used for training the text image matching model according to the Hungarian loss.
A picture annotation device, said picture annotation device comprising:
the receiving module is used for receiving the picture to be processed and the text to be processed;
and the labeling module is used for inputting the picture to be processed and the text to be processed into the text image matching model obtained by training in any one of the embodiments to obtain a labeled text at the corresponding position of the picture to be processed.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described in any one of the above embodiments when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as set forth in any one of the above embodiments.
A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method as described in any one of the above embodiments.
The text image matching model training method, the picture labeling method, the apparatus, the computer device, the storage medium and the computer program product align images and texts more accurately in a hierarchical form: input pyramids of different semantic levels are constructed on the two sides of a dual-stream network for text and image, visual modeling is applied to the global features, local features and region-of-interest features, and language modeling is applied to the original text features and abstract text features; contrastive learning is then performed based on the global features and the abstract text features to generate a first loss term, based on the local features and the original text features to generate a second loss term, based on the region-of-interest features and the original text features to generate a third loss term, and based on the region-of-interest features and the abstract text features to generate a fourth loss term; a Hungarian loss is calculated based on the first, second, third and fourth loss terms, so that the matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a method for training a text-image matching model may be implemented;
FIG. 2 is a schematic flow chart diagram of a method for training a text-image matching model in one embodiment;
FIG. 3 is a schematic illustration of contrastive learning in one embodiment;
FIG. 4 is a flowchart illustrating a method for annotating pictures in an embodiment;
FIG. 5 is a schematic diagram of the structure of a model in one embodiment;
FIG. 6 is a block diagram showing the structure of a training apparatus for a text image matching model according to an embodiment;
FIG. 7 is a block diagram showing the construction of an apparatus for annotating pictures in one embodiment;
FIG. 8 is a diagram of an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text picture matching model training method and the picture labeling method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be placed on the cloud or other network server.
The server 104 obtains a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on the labeled information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item; calculating Hungarian losses based on the first loss term, the second loss term, the third loss term and the fourth loss term; and training the text picture matching model according to Hungarian loss. After the model is obtained through training, the model can be installed at the terminal subsequently, so that image marking can be conveniently carried out.
The text picture matching model training method aligns pictures and texts more accurately in a hierarchical form: input pyramids of different semantic levels are constructed on the two sides of a dual-stream network for text and picture, visual modeling is applied to the global features, local features and region-of-interest features, and language modeling is applied to the original text features and abstract text features; contrastive learning is then performed based on the global features and the abstract text features to generate a first loss term, based on the local features and the original text features to generate a second loss term, based on the region-of-interest features and the original text features to generate a third loss term, and based on the region-of-interest features and the abstract text features to generate a fourth loss term; a Hungarian loss is calculated based on the first, second, third and fourth loss terms, so that the matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 2, a text image matching model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s202: and acquiring the sample picture, the sample text and the labeling information of the sample text and the sample picture.
Specifically, the sample picture and the sample text are corresponding samples, where the user labels the sample picture with the sample text in advance to obtain the labeling information; for example, a region of the sample picture, or the whole picture, is labeled with a word from the sample text, thereby establishing an association between a local word or a global word in the sample text and the corresponding local region or the whole of the sample picture.
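To make the labeling information concrete, the following is a minimal sketch of what one training sample might look like; the data structure, field names and values are illustrative assumptions rather than the patent's actual format.

```python
# Hypothetical sample structure: a picture, its caption text, and labeling information that
# associates words of the text with regions of the picture (or with the whole picture).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RegionAnnotation:
    words: List[str]                         # words taken from the sample text, e.g. ["red", "car"]
    box: Tuple[float, float, float, float]   # normalized (x1, y1, x2, y2); (0, 0, 1, 1) = whole picture

@dataclass
class TextImageSample:
    picture_path: str                        # path to the sample picture
    text: str                                # the sample text (caption / description)
    annotations: List[RegionAnnotation] = field(default_factory=list)

sample = TextImageSample(
    picture_path="images/0001.jpg",
    text="a red car parked next to a tree",
    annotations=[
        RegionAnnotation(words=["red", "car"], box=(0.10, 0.35, 0.55, 0.90)),
        RegionAnnotation(words=["tree"], box=(0.60, 0.05, 0.95, 0.85)),
    ],
)
```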
S204: and extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model.
Specifically, the global feature refers to a feature of the whole sample picture, the local feature refers to a feature of a region obtained by cropping the sample picture, and the region-of-interest feature refers to a feature of a salient object region in the sample picture.
S206: and extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on the labeled information of the sample text and the sample picture.
Specifically, the original text features are text features extracted from the sample text before it is abridged, and the abstract text features are text features extracted after redundant content has been removed. In this embodiment, the abstract text features are generated based on the sample text and the labeling information of the sample picture; for example, the labeled part of the sample text is used as the abstract text, and the abstract text features are then extracted from it.
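As an illustration only, one simple way to derive the abstract text from the labeling information is to keep just the annotated words; the patent does not specify its exact filtering rule, so the function below (reusing the hypothetical sample structure sketched earlier) is an assumption.

```python
# Assumed filtering rule: retain only the words of the sample text that appear in the
# labeling information, so the abstract text keeps the annotated content and drops detail.
def build_abstract_text(text, annotations):
    annotated = {w.lower() for ann in annotations for w in ann.words}
    kept = [w for w in text.split() if w.lower() in annotated]
    return " ".join(kept) if kept else text   # fall back to the original text if nothing matches

abstract_text = build_abstract_text(sample.text, sample.annotations)
# e.g. "red car tree"; the text feature extractor then encodes this filtered string
# to obtain the abstract text features.
```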
S208: performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item.
Specifically, in conjunction with fig. 3, which is a schematic illustration of contrastive learning in one embodiment: semantic mismatches between the visual and linguistic modalities typically exist in text-picture pairs, e.g., title redundancy, picture redundancy, and missing relationships between target instances. The traditional method directly treats any unpaired sample as a negative sample without considering such correlations, which can cause the model to overfit. Therefore, this embodiment aligns pictures and texts more accurately in a hierarchical form, constructs input pyramids of different semantic levels on the two sides of a dual-stream network for text and picture, and performs visual modeling on the global picture, the local picture regions and the salient instances in the picture, and language modeling on the original text and the text summary.
Specifically, for hierarchical internal semantic alignment, since the global regions of the picture and the text abstract both contain global semantic information, and the local regions and the original text contain semantic information of finer granularity, they are regarded as two pairs of positive samples, that is, the global features and the abstract text features are used for performing contrast learning to generate a first loss term, and the local features and the original text features are used for performing contrast learning to generate a second loss term.
For cross-hierarchy relationship alignment, in order to avoid that the modeling of the target relationship by the visual encoder is submerged by scene semantic modeling, the embodiment aligns the relationship between target instances with language elements, that is, performs contrast learning based on the region-of-interest feature and the original text feature to generate a third loss term, and performs contrast learning based on the region-of-interest feature and the abstract text feature to generate a fourth loss term.
S210: calculating Hungarian losses based on the first, second, third, and fourth loss terms.
S212: and training the text picture matching model according to Hungarian loss.
Specifically, to deal with the compatibility problem between image/text pairs, the loss terms of the negative samples, i.e. the unpaired samples, are softened during contrastive learning; this relaxes the strict loss constraint and thereby reduces the negative effect of local similarities between unpaired samples.
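The patent does not spell out how the negative-sample terms are softened; one common realization, shown here purely as a hedged sketch, is label smoothing on the contrastive target distribution so that unpaired samples receive a small, non-zero target weight.

```python
# Sketch of softened contrastive targets via label smoothing (an assumption about the
# exact softening rule). Hard targets would be the identity matrix; smoothing shifts a
# little probability mass onto the unpaired (negative) samples.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(logits: torch.Tensor, smoothing: float = 0.1) -> torch.Tensor:
    # logits: (N, N) matrix of similarities between paired batches, already divided by tau
    n = logits.size(0)
    targets = torch.eye(n, device=logits.device) * (1.0 - smoothing) + smoothing / n
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```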
For the N picture-text pairs in a batch, $\{(I_i, T_i)\}_{i=1}^{N}$, where $i$ denotes the $i$-th pair, normalized embedding vectors of the same dimension are obtained by a dual-stream encoder. The picture encoder generates, from the global cropped picture G, the local cropped picture L and the ROI feature sequence respectively, the global features $v^g_i$, the local features $v^l_i$ and the region-of-interest features $v^r_i$. The text encoder extracts, from the original text T and the text abstract $T_S$ respectively, the original text features $l^t_i$ and the text abstract features $l^s_i$. Then, four supervisory signals $L_{GS}$, $L_{LT}$, $L_{RS}$ and $L_{RT}$ are constructed from these vector sets for intra-batch contrastive learning, aiming at alignment between the visual and linguistic representations at different semantic levels.
In one embodiment, the generating of the first loss term based on the global feature and the abstract text feature by the comparative learning includes: calculating a first similarity of the global features and the abstract text features and a second similarity of the abstract text features and the local features; generating a first loss term based on the first similarity and the second similarity; and performing contrast learning based on the local features and the original text features to generate a second loss term, wherein the second loss term comprises: calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; the step of performing contrast learning to generate a third loss term based on the region-of-interest feature and the original text feature comprises: calculating fifth similarity of the interesting region features and the original text features and sixth similarity of the original text features and the interesting region features; generating a third loss term based on the fifth similarity and the sixth similarity; performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss term, including: calculating a seventh similarity between the region-of-interest feature and the abstract text feature and an eighth similarity between the abstract text feature and the region-of-interest feature; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
For ease of understanding, take the first loss term $L_{GS}$, defined on the pair of global features $v^g$ and abstract text features $l^s$, as an example. For the $i$-th pair, the normalized picture-to-language similarity and language-to-picture similarity can be calculated as:

$$p^{g2s}_i = \frac{\exp\!\big(\mathrm{sim}(v^g_i, l^s_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(v^g_i, l^s_j)/\tau\big)}, \qquad p^{s2g}_i = \frac{\exp\!\big(\mathrm{sim}(l^s_i, v^g_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(l^s_i, v^g_j)/\tau\big)}$$

where $p^{g2s}_i$ is the first similarity, computed from the global feature to the abstract text feature, $p^{s2g}_i$ is the second similarity, computed in the reverse direction, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function and $\tau$ is a constant (temperature) term. In this way similarities are distinguished and matching between patches is realized; the loss function adopts the Hungarian loss to obtain a more accurate matching result:

$$\mathcal{L}_{\mathrm{Hungarian}}(y,\hat{y}) = \sum_{i=1}^{N}\Big[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i\neq\varnothing\}}\,\mathcal{L}_{box}\big(b_i,\hat{b}_{\hat{\sigma}(i)}\big)\Big]$$

where $b_i$ is a vector defining the coordinates of the centre of the ground-truth box and its height and width relative to the image size, $c_i$ is the target class label (which cannot be the empty set), $\hat{p}_{\hat{\sigma}(i)}(c_i)$ is the predicted probability of class $c_i$ for the prediction box $\hat{b}_{\hat{\sigma}(i)}$ assigned by the optimal matching $\hat{\sigma}$, and $\mathcal{L}_{box}$ scores the bounding box.
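A hedged sketch of one such supervisory signal is given below; the symmetric combination of the two directions and the function names are assumptions, but the normalized-similarity computation follows the formulas above.

```python
# Sketch of one contrastive supervisory signal, e.g. L_GS between global picture embeddings
# and abstract text embeddings; L_LT, L_RS and L_RT would be built the same way on the
# (local, original-text), (ROI, abstract-text) and (ROI, original-text) pairs.
import torch
import torch.nn.functional as F

def contrastive_signal(v: torch.Tensor, l: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # v: (N, d) picture-side embeddings, l: (N, d) text-side embeddings
    v = F.normalize(v, dim=-1)
    l = F.normalize(l, dim=-1)
    sim = v @ l.t() / tau                          # sim(v_i, l_j) / tau for every pair in the batch
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2l = F.cross_entropy(sim, targets)       # picture-to-language direction
    loss_l2v = F.cross_entropy(sim.t(), targets)   # language-to-picture direction
    return 0.5 * (loss_v2l + loss_l2v)
```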
In this embodiment, for semantically aligned picture patches, their similarity is taken into account and a Hungarian matching loss is added, so that matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
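Purely as an illustration of the matching step, the sketch below uses SciPy's linear_sum_assignment to perform Hungarian (optimal bipartite) matching on a patch-to-text similarity matrix; the cost definition is an assumption, since the patent only states that Hungarian matching based on patch similarity is used.

```python
# Hedged sketch: Hungarian matching between picture patch embeddings and text element
# embeddings, with cost = negative cosine similarity (assumed). The matched pairs can then
# be scored by a DETR-style Hungarian loss as in the formula above.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_match(patch_feats: torch.Tensor, text_feats: torch.Tensor):
    # patch_feats: (P, d) patch embeddings; text_feats: (K, d) text element embeddings
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    cost = (-sim).detach().cpu().numpy()           # higher similarity -> lower matching cost
    rows, cols = linear_sum_assignment(cost)       # optimal one-to-one assignment
    pairs = list(zip(rows.tolist(), cols.tolist()))
    matched_sim = sim[torch.as_tensor(rows), torch.as_tensor(cols)]
    return pairs, matched_sim
```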
The text picture matching model training method aligns pictures and texts more accurately in a hierarchical form: input pyramids of different semantic levels are constructed on the two sides of a dual-stream network for text and picture, visual modeling is applied to the global features, local features and region-of-interest features, and language modeling is applied to the original text features and abstract text features; contrastive learning is then performed based on the global features and the abstract text features to generate a first loss term, based on the local features and the original text features to generate a second loss term, based on the region-of-interest features and the original text features to generate a third loss term, and based on the region-of-interest features and the abstract text features to generate a fourth loss term; a Hungarian loss is calculated based on the first, second, third and fourth loss terms, so that the matching is more accurate and the potential of text-picture cross-modal visual representation learning is better exploited.
In one embodiment, the process of extracting the global features of the sample picture includes: cutting the sample picture according to a first cutting proportion to obtain a global picture; and carrying out feature extraction on the global picture to obtain the global features. The process of extracting the local features of the sample picture includes: cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion; and performing feature extraction on the local picture to obtain the local features. The process of extracting the region-of-interest features of the sample picture includes: identifying a region of interest of the sample picture, and acquiring the region position information of the region of interest; and carrying out picture coding on the region of interest to obtain picture features, and obtaining the region-of-interest features according to the picture features plus the region position information.
In one embodiment, the process of extracting the original text features of the sample text comprises the following steps: extracting text features of the sample text as original text features; the extraction process of the abstract text features of the sample text comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
For intra-hierarchy semantic alignment, the global picture and the text summary are regarded as one pair of positive samples, since both contain global semantic information, while the local regions and the original text, which contain finer-grained semantic information, form another positive pair.
Specifically, the global view G is first obtained by random cropping, where the cropping scale is set to [0.9, 1.0]; in other words, the cropping ratio ranges from 0.9 to 1.0. If random cropping is performed at a ratio of 0.93, i.e. 93%, then a 93% crop of the picture is taken at random, so most of the effective area is still retained and almost all of the information in the original picture is kept. This operation is mainly used for data augmentation, making the form of the training data richer and yielding a model with better generalization. The text abstract T_S is a compressed version of the original text T, with redundant and overly detailed information removed. G and T_S both capture global information and can be used as a pair of positive samples. Through contrastive learning, the projection embeddings v_g of G and l_s of T_S are pulled closer together, where v (vision) refers to the picture and l (language) refers to the text.
For contrastive learning on fine-grained local information: the global view G and the text abstract T_S are relatively coarse, so fine-grained information is largely discarded by them. In this embodiment, however, picture sub-regions are intended to be alignable with particular descriptions in the title. To this end, this embodiment introduces fine-grained local contrast. The random cropping scale for generating the local view L is set to [0.6, 1.0], so that L focuses on a sub-region of picture I. The original text T contains many detailed descriptions and is therefore more appropriately regarded as a positive sample of L. The projection embeddings v_l of L and l_t of T are then also pulled together by the contrastive loss. Here, again, v (vision) refers to the picture and l (language) refers to the text.
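A minimal sketch of the two cropping operations, using torchvision, is shown below; only the scale ranges come from the text, while the output size of 224 and the remaining transforms are assumptions.

```python
# Global view G: random crop with scale in [0.9, 1.0]; local view L: scale in [0.6, 1.0].
from torchvision import transforms

global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # keeps most of the original picture
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),   # focuses on a sub-region of picture I
    transforms.ToTensor(),
])
# G is paired with the text abstract T_S, and L with the original text T, as positive samples.
```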
For cross-level relationship alignment, to avoid the visual encoder's modeling of target relationships being overwhelmed by scene semantic modeling, this embodiment aligns the relationships between target instances with language elements. To further improve alignment accuracy, a sequence of ROI features of the salient objects in the picture is introduced here to provide more supervision. Specifically, given a picture I with M salient objects, the visual semantics of each object region are extracted using a pre-trained object detector and defined as [o'_m, z_m], where m denotes the m-th object, o'_m is a 2048-dimensional feature vector, and z_m is a 4-dimensional normalized position vector representing the coordinates of the upper-left and lower-right corners.
By concatenating o'_m and z_m, a 2048-dimensional position-sensitive ROI feature vector O_m can be obtained, and these vectors form the ROI feature sequence. To enhance the ability of the text encoder to model conceptual relationships while avoiding impairing the reasoning ability of the visual encoder, (v_r, l_s) and (v_r, l_t) are used as another two positive pairs, and the distances between v_r and l_s and between v_r and l_t are minimized. This training process is referred to herein as cross-hierarchy relationship alignment, because the instance-level input used by the visual modality is very fine-grained, while the input used by the linguistic modality is a complete sentence, i.e., the text abstract and the original text.
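The sketch below illustrates one way to build the position-sensitive ROI feature sequence; the linear projection back to 2048 dimensions after concatenating the 2048-d visual vector with the 4-d position vector is an assumption made so that O_m has the dimensionality stated above.

```python
# Hedged sketch: concatenate the detector feature o'_m with the normalized box z_m and project
# the result to a 2048-d position-sensitive ROI feature O_m; stacking over m gives the sequence.
import torch
import torch.nn as nn

class ROIFeatureBuilder(nn.Module):
    def __init__(self, visual_dim: int = 2048, pos_dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(visual_dim + pos_dim, visual_dim)

    def forward(self, obj_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # obj_feats: (M, 2048) features o'_m; boxes: (M, 4) normalized corner coordinates z_m
        return self.proj(torch.cat([obj_feats, boxes], dim=-1))   # (M, 2048) ROI sequence O_m
```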
Finally, the patch tokens are projected to a higher dimension and reshaped by a linear projection layer. Local information is then captured using a 3 × 3 depth-wise convolution. The features are then mapped back to the token sequence and re-projected to the original dimension. The cls token is left unchanged in this process and is concatenated with the locally enhanced patch tokens to generate the final output.
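The following is a hedged sketch of such a locally enhanced block; the hidden dimension, the assumption of a square patch grid, and the module name are illustrative choices, not details given in the text.

```python
# Sketch: project patch tokens up, reshape to a 2-D grid, apply a 3x3 depth-wise convolution to
# capture local information, project back, and concatenate the untouched cls token.
import torch
import torch.nn as nn

class LocallyEnhancedFFN(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 3072):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + H*W, dim), with the cls token first
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        b, n, _ = patches.shape
        h = w = int(n ** 0.5)                              # assumes a square patch grid
        x = self.up(patches).transpose(1, 2).reshape(b, -1, h, w)
        x = self.dwconv(x)                                 # 3x3 depth-wise convolution
        x = x.reshape(b, -1, n).transpose(1, 2)
        patches = self.down(x)
        return torch.cat([cls_tok, patches], dim=1)        # cls token unchanged in the process
```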
In this embodiment, Hungarian matching based on patch similarity is introduced into the way text-picture data are used: picture patch similarity is exploited to the greatest extent and a better matching effect is obtained. This addresses the shortcoming of the traditional technique, which performs task learning directly on text-picture data and ignores the text information redundancy, the picture information redundancy, and the interrelations among picture target instances and among different target instances present in this type of training data.
In an embodiment, as shown in fig. 4, a method for annotating a picture is provided, which is described by taking the method as an example for being applied to the server or the terminal in fig. 1, and includes the following steps:
s402: and receiving the picture to be processed and the text to be processed.
S404: and inputting the picture to be processed and the text to be processed into a text picture matching model obtained by training in any one of the previous input embodiments to obtain a labeled text at a corresponding position of the picture to be processed.
In particular, in connection with FIG. 5, for a given text-picture pair, the goal is for the text-picture matching model to learn a common visual representation. The textual description contains rich semantic information about the target scene in the corresponding picture, such as target object class, color, spatial layout and action state, and this rich information is of significant value for downstream visual tasks such as image classification or target detection.
To this end, the input text is first encoded to obtain a semantic representation corresponding to the picture, the picture encoder performs visual feature extraction on the picture, and fusion learning is then performed on the semantic representation output by the text extractor and the extracted picture features. The picture feature vectors and the text feature vectors are then concatenated in the shared space to obtain a complete, self-contained sequence containing both the picture features and the text features. The goal is to learn a scene descriptor of the picture content that predicts and outputs a visual content representation of the input picture, thereby completing the annotation of the picture.
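As a rough illustration of the annotation step only, the sketch below concatenates picture and text features in the shared space and labels each region with its most similar text span; the encoder interfaces and the nearest-neighbour labeling rule are assumptions, not the patent's exact design.

```python
# Hedged sketch of picture annotation with a trained text-picture matching model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def annotate_picture(model, picture, text_spans):
    region_feats = model.encode_picture(picture)              # (R, d) region features (assumed API)
    text_feats = model.encode_text(text_spans)                # (K, d) one embedding per text span
    fused = torch.cat([region_feats, text_feats], dim=0)      # complete sequence in the shared space
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    best = sim.argmax(dim=-1)                                 # most similar text span per region
    labels = [(region_idx, text_spans[j]) for region_idx, j in enumerate(best.tolist())]
    return labels, fused
```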
In this embodiment, semantic alignment of text and picture makes the mass production of high-quality training data feasible and efficient without relying heavily on manual labeling, thereby shortening the task development cycle for algorithm engineers and improving the efficiency of business applications.
It should be understood that, although the steps in the flowcharts of the embodiments are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of performing these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a text picture matching model training device and a picture labeling device for realizing the text picture matching model training method and the picture labeling method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so the specific limitations in one or more embodiments of the text image matching model training device and the image labeling device provided below can be referred to the limitations of the text image matching model training method and the image labeling method in the foregoing, and are not described herein again.
In one embodiment, as shown in fig. 6, there is provided a text image matching model training apparatus, including: a sample acquisition module 601, an image feature extraction module 602, a text feature extraction module 603, a loss item generation module 604, a Hungarian loss calculation module 605 and a training module 606, wherein:
the sample obtaining module 601 is configured to obtain a sample picture, a sample text, and labeling information of the sample text and the sample picture.
And the picture feature extraction module 602 is configured to extract global features, local features, and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model.
The text feature extraction module 603 is configured to extract, by using a text feature extractor of the text-picture matching model, an original text feature and a digest text feature of the sample text, where the digest text feature is generated based on the label information of the sample text and the sample picture.
The loss term generation module 604 is configured to perform contrast learning based on the global feature and the abstract text feature to generate a first loss term, perform contrast learning based on the local feature and the original text feature to generate a second loss term, perform contrast learning based on the region-of-interest feature and the original text feature to generate a third loss term, and perform contrast learning based on the region-of-interest feature and the abstract text feature to generate a fourth loss term.
A hungarian loss calculation module 605 for calculating the hungarian loss based on the first, second, third and fourth loss terms.
And a training module 606 for training the text picture matching model according to Hungarian loss.
In one embodiment, the image feature extraction module 602 is further configured to crop the sample picture according to a first cropping ratio to obtain a global picture; perform feature extraction on the global picture to obtain global features; crop the sample picture according to a second cropping ratio to obtain a local picture, wherein the second cropping ratio is smaller than the first cropping ratio; perform feature extraction on the local picture to obtain local features; identify a region of interest of the sample picture, and acquire the region position information of the region of interest; and encode the region of interest to obtain picture features, and obtain the region-of-interest features according to the picture features plus the region position information.
In one embodiment, the text feature extraction module 603 is further configured to extract text features of the sample text as original text features; filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the above-mentioned loss term generating module 604 is further configured to calculate a first similarity between the global feature and the abstract text feature, and a second similarity between the abstract text feature and the local feature; generating a first loss term based on the first similarity and the second similarity; calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; calculating fifth similarity of the region-of-interest features and the original text features and sixth similarity of the original text features and the region-of-interest features; generating a third loss term based on the fifth similarity and the sixth similarity; calculating a seventh similarity of the interesting region features and the abstract text features and an eighth similarity of the abstract text features and the interesting region features; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
In one embodiment, as shown in fig. 7, there is provided a picture annotation device, including: a receiving module 701 and an annotating module 702, wherein:
the receiving module 701 is configured to receive a to-be-processed picture and a to-be-processed text.
And the labeling module 702 is configured to input the picture to be processed and the text to be processed into the text-picture matching model obtained through training in any one of the above embodiments, so as to obtain a labeled text at a corresponding position of the picture to be processed.
All modules in the text image matching model training device and the image labeling device can be completely or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a text picture matching model training method and a picture labeling method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of: acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on the labeled information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item; calculating Hungarian losses based on the first loss term, the second loss term, the third loss term and the fourth loss term; and training the text picture matching model according to Hungarian loss.
In one embodiment, the process of extracting global features of a sample picture involved in the execution of the computer program by the processor comprises: cutting the sample picture according to a first cutting proportion to obtain a global picture; carrying out feature extraction on the global picture to obtain global features; the process of extracting the local features of the sample picture involved in the execution of the computer program by the processor comprises the following steps: cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion; performing feature extraction on the local picture to obtain local features; the process of extracting the interesting region characteristic of the sample picture involved in the processor executing the computer program comprises the following steps: identifying an interested area of the sample picture, and acquiring area position information of the interested area; and carrying out picture coding on the region of interest to obtain picture characteristics, and obtaining the region of interest characteristics according to the picture sum region position information.
In one embodiment, the process of extracting original text features of sample text involved in the execution of the computer program by the processor comprises: extracting text features of the sample text as original text features; the extraction process of the abstract text characteristics of the sample text involved in the execution of the computer program by the processor comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the comparison learning based on the global features and the abstract text features, which is implemented when the processor executes the computer program, generates the first loss item, and includes: calculating a first similarity of the global features and the abstract text features and a second similarity of the abstract text features and the local features; generating a first loss term based on the first similarity and the second similarity; the processor, implemented when executing the computer program, generates a second loss term based on the local feature and the original text feature through comparative learning, and comprises: calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; the generating of the third loss term based on the region-of-interest feature and the raw text feature through the comparative learning implemented when the processor executes the computer program includes: calculating fifth similarity of the region-of-interest features and the original text features and sixth similarity of the original text features and the region-of-interest features; generating a third loss term based on the fifth similarity and the sixth similarity; the fourth loss item is generated by the processor based on the region-of-interest feature and the abstract text feature through comparison learning when the processor executes the computer program, and the fourth loss item comprises the following steps: calculating a seventh similarity between the region-of-interest feature and the abstract text feature and an eighth similarity between the abstract text feature and the region-of-interest feature; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
In one embodiment, a computer device is provided, comprising a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of: receiving a picture to be processed and a text to be processed; and inputting the picture to be processed and the text to be processed into the text picture matching model obtained by training in any one of the embodiments to obtain the labeled text at the corresponding position of the picture to be processed.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through a picture feature extractor of the text picture matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text picture matching model, wherein the abstract text features are generated based on labeling information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item; calculating Hungarian losses based on the first loss term, the second loss term, the third loss term and the fourth loss term; and training the text picture matching model according to Hungarian loss.
In one embodiment, the process of extracting global features of a sample picture involved when the computer program is executed by a processor comprises: cutting the sample picture according to a first cutting proportion to obtain a global picture; carrying out feature extraction on the global picture to obtain global features; the extraction process of the local features of the sample picture involved in the execution of the computer program by the processor comprises the following steps: cutting the sample picture according to a second cutting proportion to obtain a local picture, wherein the second cutting proportion is smaller than the first cutting proportion; performing feature extraction on the local picture to obtain local features; the process of extracting the characteristics of the interested region of the sample picture includes the following steps: identifying an interested area of the sample picture, and acquiring area position information of the interested area; and carrying out picture coding on the region of interest to obtain picture characteristics, and obtaining the region of interest characteristics according to the picture plus region position information.
In one embodiment, the process of extracting original text features of a sample text involved in the execution of the computer program by the processor comprises: extracting text features of the sample text as original text features; the extraction process of the abstract text features of the sample text involved in the execution of the computer program by the processor comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the generation of the first loss term based on the global feature and the digest-text feature comparative learning implemented by the computer program when the computer program is executed by the processor includes: calculating a first similarity of the global features and the abstract text features and a second similarity of the abstract text features and the local features; generating a first loss term based on the first similarity and the second similarity; the computer program, when executed by the processor, implements a comparative learning based on the local features and the original text features to generate a second loss term, comprising: calculating a third similarity of the local features and the original text features and a fourth similarity of the original text features and the local features; generating a second loss term based on the third similarity and the fourth similarity; the third loss term generated by the computer program based on the region-of-interest feature and the original text feature through comparison learning when the computer program is executed by the processor comprises the following steps: calculating fifth similarity of the region-of-interest features and the original text features and sixth similarity of the original text features and the region-of-interest features; generating a third loss term based on the fifth similarity and the sixth similarity; the fourth loss term is generated based on the region-of-interest feature and the abstract text feature through comparison learning, and when the computer program is executed by the processor, the fourth loss term comprises: calculating a seventh similarity between the region-of-interest feature and the abstract text feature and an eighth similarity between the abstract text feature and the region-of-interest feature; a fourth loss term is generated based on the seventh similarity and the eighth similarity.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented: receiving a picture to be processed and a text to be processed; and inputting the picture to be processed and the text to be processed into the text image matching model trained in any one of the above embodiments, to obtain the labeled text at the corresponding position of the picture to be processed.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following steps: acquiring a sample picture, a sample text, and labeling information of the sample text and the sample picture; extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of the text image matching model; extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the labeling information of the sample text and the sample picture; performing contrast learning based on the global features and the abstract text features to generate a first loss term, performing contrast learning based on the local features and the original text features to generate a second loss term, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss term, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss term; calculating a Hungarian loss based on the first, second, third and fourth loss terms; and training the text image matching model according to the Hungarian loss.
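The excerpt states only that a Hungarian loss is calculated from the four loss terms; it does not spell out the combination. The sketch below is one speculative reading: each term is taken as the (N, N) matrix of per-pair costs between the N pictures and N texts of a batch, the four matrices are summed with assumed equal weights, and the Hungarian algorithm selects the minimum-cost picture-to-text assignment, whose total cost serves as the loss.

```python
# Highly speculative sketch: the combination rule, the equal weighting, and the
# batch-level matching are all assumptions, not taken from this excerpt.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_loss(term1: np.ndarray, term2: np.ndarray,
                   term3: np.ndarray, term4: np.ndarray) -> float:
    """Each term_k is an (N, N) cost matrix with pictures as rows and texts as columns."""
    cost = term1 + term2 + term3 + term4          # equal weights are an assumption
    rows, cols = linear_sum_assignment(cost)      # Hungarian / Kuhn-Munkres matching
    return float(cost[rows, cols].sum())
```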
In one embodiment, the process of extracting the global features of the sample picture, involved when the computer program is executed by the processor, comprises: cropping the sample picture according to a first cropping ratio to obtain a global picture; and performing feature extraction on the global picture to obtain the global features. The process of extracting the local features of the sample picture comprises: cropping the sample picture according to a second cropping ratio to obtain a local picture, wherein the second cropping ratio is smaller than the first cropping ratio; and performing feature extraction on the local picture to obtain the local features. The process of extracting the region-of-interest features of the sample picture comprises: identifying a region of interest of the sample picture and acquiring region position information of the region of interest; and performing picture coding on the region of interest to obtain picture features, and obtaining the region-of-interest features from the picture features and the region position information.
In one embodiment, the process of extracting original text features of a sample text involved in the execution of the computer program by the processor comprises: extracting text features of the sample text as original text features; the extraction process of the abstract text features of the sample text involved in the execution of the computer program by the processor comprises the following steps: filtering the sample text according to the labeling information of the sample text and the sample picture; and extracting text features of the filtered sample text as abstract text features.
In one embodiment, the generating of the first loss term through contrast learning based on the global features and the abstract text features, implemented when the computer program is executed by the processor, comprises: calculating a first similarity of the global features to the abstract text features and a second similarity of the abstract text features to the global features; and generating the first loss term based on the first similarity and the second similarity. The generating of the second loss term through contrast learning based on the local features and the original text features comprises: calculating a third similarity of the local features to the original text features and a fourth similarity of the original text features to the local features; and generating the second loss term based on the third similarity and the fourth similarity. The generating of the third loss term through contrast learning based on the region-of-interest features and the original text features comprises: calculating a fifth similarity of the region-of-interest features to the original text features and a sixth similarity of the original text features to the region-of-interest features; and generating the third loss term based on the fifth similarity and the sixth similarity. The generating of the fourth loss term through contrast learning based on the region-of-interest features and the abstract text features comprises: calculating a seventh similarity of the region-of-interest features to the abstract text features and an eighth similarity of the abstract text features to the region-of-interest features; and generating the fourth loss term based on the seventh similarity and the eighth similarity.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following steps: receiving a picture to be processed and a text to be processed; and inputting the picture to be processed and the text to be processed into the text image matching model trained in any one of the above embodiments, to obtain the labeled text at the corresponding position of the picture to be processed.
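As an editorial illustration of this labeling step, the sketch below scores candidate texts against detected picture regions with a trained matching model and returns the best-matching text per region. The `extract_regions` and `encode_texts` helpers and the use of a candidate-text list are assumptions; the excerpt only states that the picture and the text are input to the trained model.

```python
# Illustrative inference sketch; the model interface below is assumed, not given here.
import torch

@torch.no_grad()
def label_picture(model, picture: torch.Tensor, candidate_texts: list):
    """Return, for each detected region of interest, the best-matching candidate text."""
    regions, region_feats = model.extract_regions(picture)   # assumed helper
    text_feats = model.encode_texts(candidate_texts)         # assumed helper
    scores = region_feats @ text_feats.t()                   # region-text similarity
    best = scores.argmax(dim=-1)
    return [(region, candidate_texts[i]) for region, i in zip(regions, best.tolist())]
```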
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of the present disclosure.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that several variations and improvements may be made by those of ordinary skill in the art without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A text image matching model training method is characterized by comprising the following steps:
acquiring a sample picture, a sample text and the labeling information of the sample text and the sample picture;
extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the labeling information of the sample text and the sample picture;
performing contrast learning based on the global features and the abstract text features to generate a first loss item, performing contrast learning based on the local features and the original text features to generate a second loss item, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss item, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss item;
calculating a Hungarian loss based on the first, second, third, and fourth loss terms;
and training the text image matching model according to the Hungarian loss.
2. The method according to claim 1, wherein the process of extracting the global features of the sample picture comprises:
cropping the sample picture according to a first cropping ratio to obtain a global picture;
performing feature extraction on the global picture to obtain the global features;
the process of extracting the local features of the sample picture comprises:
cropping the sample picture according to a second cropping ratio to obtain a local picture, wherein the second cropping ratio is smaller than the first cropping ratio;
performing feature extraction on the local picture to obtain the local features;
the process of extracting the region-of-interest features of the sample picture comprises:
identifying a region of interest of the sample picture, and acquiring region position information of the region of interest;
and performing picture coding on the region of interest to obtain picture features, and obtaining the region-of-interest features from the picture features and the region position information.
3. The method of claim 1, wherein the extracting of the original text features of the sample text comprises:
extracting text features of the sample text as original text features;
the extraction process of the abstract text features of the sample text comprises the following steps:
filtering the sample text according to the labeling information of the sample text and the sample picture;
and extracting text features of the filtered sample text as abstract text features.
4. The method of claim 1, wherein the generating a first loss term based on the global feature and the abstract text feature through contrast learning comprises:
calculating a first similarity of the global feature and the abstract text feature, and a second similarity of the abstract text feature and the global feature;
generating a first loss term based on the first similarity and the second similarity;
the generating a second loss term based on the local feature and the original text feature through contrast learning comprises:
calculating a third similarity of the local feature and the original text feature, and a fourth similarity of the original text feature and the local feature;
generating a second loss term based on the third similarity and the fourth similarity;
the generating a third loss term based on the region-of-interest feature and the original text feature through contrast learning comprises:
calculating a fifth similarity of the region-of-interest feature and the original text feature, and a sixth similarity of the original text feature and the region-of-interest feature;
generating a third loss term based on the fifth similarity and the sixth similarity;
the generating a fourth loss term based on the region-of-interest feature and the abstract text feature through contrast learning includes:
calculating a seventh similarity of the region-of-interest features and the abstract text features and an eighth similarity of the abstract text features and the region-of-interest features;
generating a fourth loss term based on the seventh similarity and the eighth similarity.
5. A picture labeling method is characterized by comprising the following steps:
receiving a picture to be processed and a text to be processed;
inputting the picture to be processed and the text to be processed into a text image matching model obtained by training according to any one of claims 1 to 4, and obtaining a labeled text at a corresponding position of the picture to be processed.
6. An apparatus for training a text image matching model, the apparatus comprising:
a sample acquisition module, used for acquiring a sample picture, a sample text and labeling information of the sample text and the sample picture;
an image feature extraction module, used for extracting global features, local features and region-of-interest features of the sample picture through an image feature extractor of a text image matching model;
a text feature extraction module, used for extracting original text features and abstract text features of the sample text through a text feature extractor of the text image matching model, wherein the abstract text features are generated based on the labeling information of the sample text and the sample picture;
a loss term generation module, used for performing contrast learning based on the global features and the abstract text features to generate a first loss term, performing contrast learning based on the local features and the original text features to generate a second loss term, performing contrast learning based on the region-of-interest features and the original text features to generate a third loss term, and performing contrast learning based on the region-of-interest features and the abstract text features to generate a fourth loss term;
a Hungarian loss calculation module, used for calculating a Hungarian loss based on the first, second, third and fourth loss terms;
and a training module, used for training the text image matching model according to the Hungarian loss.
7. A picture labeling apparatus, comprising:
the receiving module is used for receiving the picture to be processed and the text to be processed;
and the labeling module is used for inputting the picture to be processed and the text to be processed into the text image matching model obtained by training with the apparatus of claim 6, so as to obtain the labeled text at the corresponding position of the picture to be processed.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 4 or 5.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4 or 5.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 4 or 5 when executed by a processor.
CN202211065029.9A 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment Pending CN115359492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211065029.9A CN115359492A (en) 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211065029.9A CN115359492A (en) 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment

Publications (1)

Publication Number Publication Date
CN115359492A true CN115359492A (en) 2022-11-18

Family

ID=84003887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211065029.9A Pending CN115359492A (en) 2022-09-01 2022-09-01 Text image matching model training method, picture labeling method, device and equipment

Country Status (1)

Country Link
CN (1) CN115359492A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035427A (en) * 2024-04-15 2024-05-14 之江实验室 Method and device for enhancing multi-mode image-text retrieval through 3D contrast learning
CN118097670A (en) * 2024-04-28 2024-05-28 绵阳师范学院 Text image processing method and system based on multi-mode and SAM technology fusion
CN118097670B (en) * 2024-04-28 2024-06-28 绵阳师范学院 Text image processing method and system based on multi-mode and SAM technology fusion

Similar Documents

Publication Publication Date Title
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
Jing et al. Self-supervised visual feature learning with deep neural networks: A survey
Jeong et al. Deep joint spatiotemporal network (DJSTN) for efficient facial expression recognition
US11776267B2 (en) Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
Abbas et al. A comprehensive review of recent advances on deep vision systems
CN109117777B (en) Method and device for generating information
Feris et al. Large-scale vehicle detection, indexing, and search in urban surveillance videos
CN109711463B (en) Attention-based important object detection method
US8170280B2 (en) Integrated systems and methods for video-based object modeling, recognition, and tracking
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112100438A (en) Label extraction method and device and computer readable storage medium
CN113378710A (en) Layout analysis method and device for image file, computer equipment and storage medium
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN114299321A (en) Video classification method, device, equipment and readable storage medium
Elharrouss et al. FSC-set: counting, localization of football supporters crowd in the stadiums
CN115359492A (en) Text image matching model training method, picture labeling method, device and equipment
Zhu et al. Spatial-temporal knowledge integration: Robust self-supervised facial landmark tracking
Bin et al. Combining multi-representation for multimedia event detection using co-training
CN117635275A (en) Intelligent electronic commerce operation commodity management platform and method based on big data
Lei et al. Recent advances in multi-modal 3D scene understanding: A comprehensive survey and evaluation
Jiang et al. Video searching and fingerprint detection by using the image query and PlaceNet-based shot boundary detection method
Yang et al. SAL‐Net: Self‐Supervised Attribute Learning for Object Recognition and Segmentation
Liu et al. A framework for short video recognition based on motion estimation and feature curves on SPD manifolds
Tan et al. 3D detection transformer: Set prediction of objects using point clouds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination